LINUX // MM // VAS // PTE // BUDDY // SLAB // OOM // NUMA // 2026

Linux Memory Management

VAS · Page Tables · MMU · Allocators // x86_64 // Kernel 6.x

The Two Memories virtual · physical

Every running process believes it owns the entire machine. It can use the same virtual addresses as any other process, reference the same symbols, and hold the same stack-pointer values without conflict. The trick that makes this work is virtual memory: the process sees a flat private address space, and the kernel plus the CPU's Memory Management Unit lazily map regions of that virtual space onto the physical DRAM that actually exists.

Three things fall out of this for free.

Isolation
memory protection
Process A cannot see B's memory because A's page tables don't include B's mappings. A bug in one process can't trash another.
Larger than RAM
128 TiB user space
The virtual space is decoupled from physical capacity. You can address far more memory than you have, and the kernel pages in only what's touched.
Lazy Backing
demand paging
Pages get backed by physical frames only on first access. malloc(1 GiB) costs almost nothing until you touch it.

Two shapes to keep in your head: page (virtual, 4 KiB by default on x86_64) and page frame (physical, the same 4 KiB chunk of DRAM). Translation happens at page granularity. Anything finer is the allocator's problem.

[Figure: the virtual address spaces of process A (stack, heap, .data, .text) and process B (stack, .text) pass through the page tables (MMU + kernel) onto 4 KiB frames of physical RAM. A's and B's .text share one frame; the other frames hold A's stack, heap and .data, B's stack, free pages, the page cache, kernel slabs, the kernel image and the direct map.]
FIG_01 // VIRT-TO-PHYS INDIRECTION

The x86_64 Virtual Address Space canonical · 48 bits

x86_64 hardware uses 48-bit virtual addresses with 4-level paging (5-level paging extends this to 57 bits on CPUs that support it, but 4-level is still the common case). The high 16 bits must be a sign extension of bit 47; such addresses are called canonical, and anything else faults. This carves the 64-bit space into two halves with a giant unmappable hole in the middle.

Page size
4 KiB (0x1000)
Virt bits used
48 / 57
User max
0x0000.7FFF.FFFF.FFFF
Kernel min
0xFFFF.8000.0000.0000
User space
128 TiB
Kernel space
128 TiB

Layout of a typical user process

[Figure: user process layout, from high addresses to low addresses, with permissions.]
0xFFFF.FFFF.FFFF.FFFF to 0xFFFF.8000.0000.0000: KERNEL SPACE (unmapped from user CR3 entries)
0xFFFF.7FFF.FFFF.FFFF to 0x0000.8000.0000.0000: NON-CANONICAL gap (64 - 48 bit), faults
0x0000.7FFF.FFFF.FFFF: STACK (rw-), grows down; argv / envp / auxv / locals / saved RBP / RIP; limited by RLIMIT_STACK
UNMAPPED GAP / GUARD
MMAP REGION (perms vary), grows down; shared libraries (libc, libm, ld.so), large malloc() chunks (> M_MMAP_THRESHOLD), file mappings, shared memory, anonymous; ASLR randomizes base on each exec
VARIABLE GAP
HEAP (rw-), grows up; brk()-extended region, malloc small allocations; program break = end of heap
.BSS (rw-): zero-init globals
.DATA (rw-): init globals
.RODATA (r--): const + strings
.TEXT (r-x): executable code, 0x0000.0000.0040.0000 (PIE will randomize)
0x0000.0000.0000.0000: NULL PAGE, unmapped, SIGSEGV
FIG_02 // x86_64 USER VAS / 4-LEVEL PAGING

What lives in each segment

Segment | Source | Perms | Backed by | Notes
.text | ELF LOAD | r-x | file (private) | shared between procs of same binary
.rodata | ELF LOAD | r-- | file | string literals, const tables
.data | ELF LOAD | rw- | file (COW) | initialized globals/statics
.bss | kernel | rw- | anonymous (zero) | no file bytes; zeroed on first touch
heap | brk/sbrk | rw- | anonymous | extended in pages, never released piecemeal
mmap | mmap() | varies | file or anon | each mapping can be unmapped individually via munmap()
stack | kernel (auto) | rw- | anonymous | grows on fault, capped by RLIMIT_STACK
vDSO/vvar | kernel | r-x / r-- | shared kernel | fast-path syscalls (gettimeofday, etc.)
ASLR // address space layout randomization
With PIE binaries (the default on most modern distros), the kernel randomizes the base of the executable, the mmap region, the stack top, and the heap start on every execve(). Two runs of the same binary will not share virtual addresses. They do still share physical memory for the file-backed .text pages, because both runs map the same page-cache pages of the underlying ELF.
Inspect your own address space SHELL
# raw VMA list of the current shell
cat /proc/self/maps

# any process by pid (readable, with sizes and perms)
pmap -X $PID

# just the named regions
grep -E '\[(heap|stack|vdso|vvar)\]' /proc/$PID/maps

# is ASLR on? 0=off  1=conservative  2=full
cat /proc/sys/kernel/randomize_va_space

# run the same binary twice and diff the layout
diff <(./myapp & sleep 0.1; cat /proc/$!/maps; kill $!) \
     <(./myapp & sleep 0.1; cat /proc/$!/maps; kill $!)

# max stack size for this shell
ulimit -s

Walking /proc/self/maps VMA list

The kernel exposes the live VMA list (Virtual Memory Areas — the kernel's internal representation of contiguous mapped regions) through procfs. Each line is one VMA. Compile and run this:

C // dump_maps.c
#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

char  initialized_global[] = "hello";     /* .data   */
char  zero_global[4096];                       /* .bss    */
const char *ro_string = "read me only";       /* .rodata */

int main(int argc, char **argv) {
    int   on_stack = 42;
    char *on_heap  = malloc(128);
    char *big_heap = malloc(8 * 1024 * 1024);    /* > mmap threshold */

    printf("main()         = %p   .text\n",        (void*)main);
    printf("ro_string      = %p   .rodata\n",      (void*)ro_string);
    printf("init_global    = %p   .data\n",        (void*)initialized_global);
    printf("zero_global    = %p   .bss\n",         (void*)zero_global);
    printf("on_heap (sml)  = %p   heap (brk)\n",   (void*)on_heap);
    printf("big_heap (lrg) = %p   mmap region\n",  (void*)big_heap);
    printf("on_stack       = %p   stack\n",        (void*)&on_stack);

    FILE *m = fopen("/proc/self/maps", "r");
    if (!m) { perror("fopen"); return 1; }
    char line[512];
    while (fgets(line, sizeof line, m)) fputs(line, stdout);
    fclose(m);
    return 0;
}

Sample output (truncated, addresses shown in low form for clarity):

stdout
main()         = 0x55b2c4a01169   .text
ro_string      = 0x55b2c4a02008   .rodata
init_global    = 0x55b2c4a04010   .data
zero_global    = 0x55b2c4a04020   .bss
on_heap (sml)  = 0x55b2c5e7a2a0   heap (brk)
big_heap (lrg) = 0x7f3e9b800010   mmap region
on_stack       = 0x7ffd4f3a8b8c   stack

55b2c4a00000-55b2c4a01000 r--p  00000000  fd:00  123  /home/u/dump_maps   # ELF header
55b2c4a01000-55b2c4a02000 r-xp  00001000  fd:00  123  /home/u/dump_maps   # .text
55b2c4a02000-55b2c4a03000 r--p  00002000  fd:00  123  /home/u/dump_maps   # .rodata
55b2c4a03000-55b2c4a04000 rw-p  00003000  fd:00  123  /home/u/dump_maps   # .data
55b2c4a04000-55b2c4a05000 rw-p  00000000  00:00  0    # .bss (anon)
55b2c5e7a000-55b2c5e9b000 rw-p  00000000  00:00  0    [heap]
7f3e9b7ff000-7f3e9c000000 rw-p  00000000  00:00  0    # 8 MiB anon mmap
7f3e9c1a0000-7f3e9c1c8000 r--p  00000000  fd:00  456  /usr/lib64/libc.so.6
7f3e9c1c8000-7f3e9c34f000 r-xp  00028000  fd:00  456  /usr/lib64/libc.so.6
7ffd4f389000-7ffd4f3aa000 rw-p  00000000  00:00  0    [stack]
7ffd4f3d4000-7ffd4f3d8000 r--p  00000000  00:00  0    [vvar]
7ffd4f3d8000-7ffd4f3da000 r-xp  00000000  00:00  0    [vdso]

Decoding a maps line

Each row is start-end perms offset dev inode pathname. Permissions are rwxp or rwxs (private vs shared). A private writable file mapping starts as a copy-on-write of the file pages — the moment you write, the kernel hands you a private anon page and forgets about the file for that location.

Want richer info? /proc/[pid]/smaps breaks down each VMA's RSS, PSS, swap, anon vs file pages, hugepage backing, locked status.

Inspect smaps and per-VMA detail SHELL
# first VMA's full breakdown — Size, Rss, Pss, Swap, Anon, etc.
awk '/^[0-9a-f]+-[0-9a-f]+/{n++} n==2{exit} {print}' /proc/$PID/smaps

# single-shot summary across all mappings
cat /proc/$PID/smaps_rollup

# biggest mappings by RSS
pmap -x $PID | sort -k3 -n -r | head

# every shared library actually loaded
awk '/\.so/{print $6}' /proc/$PID/maps | sort -u

# trace ld.so as it pulls libraries in
LD_DEBUG=files ./myapp 2>&1 | grep 'calling init'

Page Tables / The 4-Level Walk 9 + 9 + 9 + 9 + 12

A virtual address is not a number that gets added to a base register. It's a structured key into a multi-level radix tree. On x86_64 with 4-level paging, a 48-bit virtual address breaks into five fields:

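[Figure: a 64-bit virtual address split into fields, with the high 16 bits sign-extended.]
bits 63:48  sign extension
bits 47:39  PGD index (9 bits)
bits 38:30  PUD index (9 bits)
bits 29:21  PMD index (9 bits)
bits 20:12  PTE index (9 bits)
bits 11:0   page offset (12 bits)
CR3 holds the physical pointer to the PGD. The PGD, PUD, PMD and PTE tables hold 512 entries each, and the final PTE yields the 4 KiB frame to which the offset is added. Each entry is 8 bytes, each table is 4 KiB, and a full walk is 4 loads.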
FIG_03 // 4-LEVEL PAGE TABLE WALK

Each level is a 4 KiB table of 512 64-bit entries. The CPU walks the tree on every translation that misses the TLB.

STEP 1
Read CR3
phys ptr to PGD
STEP 2
Index PGD
bits 47:39
STEP 3
Index PUD
bits 38:30
STEP 4
Index PMD
bits 29:21
STEP 5
Index PTE
bits 20:12
RESULT
PFN + offset
phys address

Anatomy of a PTE (x86_64)

Bit | Name | Meaning
0 | P | present; if 0, fault on access (other bits may still be meaningful to the kernel)
1 | R/W | writable when 1, read-only when 0
2 | U/S | user accessible when 1, kernel-only when 0
3 | PWT | write-through caching
4 | PCD | cache disable
5 | A | accessed; set by hardware on read/write, cleared by kernel for LRU
6 | D | dirty; set by hardware on write, used for writeback
7 | PS | page size; if 1 at PMD level, this is a 2 MiB hugepage; at PUD, a 1 GiB page
8 | G | global; not flushed from TLB on CR3 reload (kernel mappings)
12-51 | PFN | physical page frame number, shifted left 12 bits gives physical address
63 | NX | no-execute; faults on instruction fetch when set
Why 9 + 9 + 9 + 9 + 12?
A 4 KiB table holds 4096 / 8 = 512 entries, and 512 = 2^9, so each level consumes 9 bits of address. Four levels give 36 bits of indexing, plus 12 bits of in-page offset = 48 bits total. That is where the 256 TiB virtual address space comes from: 2^48 bytes, split into 2^47 = 128 TiB for each half.

Hugepages skip levels

If the PMD entry has bit 7 (PS) set, the walk stops there: bits 20:0 of the virtual address become a 21-bit offset into a 2 MiB hugepage. Same idea one level up gives 1 GiB pages. Hugepages dramatically reduce TLB pressure for workloads with large working sets — one TLB entry covers 512x or 262144x more memory.

The MMU and TLB translation cache

The Memory Management Unit is hardware. Every load and store goes through it. The page-table walk above costs up to four dependent memory loads, which is painfully slow at modern clock rates when they miss the caches, so the MMU caches recent translations in the Translation Lookaside Buffer.

L1 TLB size
~64 entries
L2 TLB size
~1500 entries
TLB miss cost
50-200 cycles
TLB hit cost
~1 cycle
Walk on miss
4 cache-line loads
Flush on
CR3 write

Context switches and TLB shootdowns

Loading CR3 with a new process's PGD physical address invalidates every non-global TLB entry. Kernel mappings have the G bit set so they survive context switches.

If the kernel changes a page-table entry on one CPU, it needs to invalidate the cached translation on every CPU that might have it. This is a TLB shootdown, implemented via inter-processor interrupts. Shootdowns are expensive, and workloads that munmap or mprotect aggressively pay for them visibly.

PCID / ASID // process-context identifiers
Modern x86_64 CPUs support PCIDs, which tag TLB entries with the address space they belong to. Linux enables this when available, allowing the kernel to switch CR3 without flushing the TLB for short-lived switches (KPTI being a notable consumer).
Measure TLB pressure SHELL
# TLB miss rates while running a binary
perf stat -e dTLB-load-misses,dTLB-loads,iTLB-load-misses,iTLB-loads ./myapp

# expressed as miss ratio (perf writes its CSV to stderr; the app's stdout is discarded)
perf stat -x, -e dTLB-loads,dTLB-load-misses ./myapp 2>&1 >/dev/null \
  | awk -F, 'NR==1{loads=$1} NR==2{print "dTLB miss ratio:", $1/loads}'

# is THP coalescing pages?
grep -E 'AnonHugePages|ShmemHugePages' /proc/meminfo
cat /sys/kernel/mm/transparent_hugepage/enabled

# hugepage pools (2 MiB / 1 GiB)
ls /sys/kernel/mm/hugepages

# is the CPU using PCIDs?
grep -m1 -o 'pcid' /proc/cpuinfo

Page Faults minor · major · invalid

A page fault is the CPU's way of telling the kernel "I tried to translate an address and the PTE didn't satisfy the access." It's not always an error — faults are how Linux implements demand paging, copy-on-write, lazy stack growth, and swap-in.

Minor
microseconds
PTE empty but VMA exists. Kernel allocates a frame (or finds a cached one), wires up the PTE, retries the instruction. Includes first-touch of .bss, fresh anon mmap, stack growth.
Major
milliseconds
Page lives on disk — either swapped out or never read in from a file mapping. Kernel issues a block I/O, sleeps the process, retries on completion.
Invalid
SIGSEGV / SIGBUS
No VMA covers the address, or the access violates VMA permissions (write to r--, exec from rw-). Kernel sends a signal. Process death by default.

Watch faults happen

C // demand_paging.c
#define _GNU_SOURCE
#include <stdio.h>
#include <sys/resource.h>
#include <sys/mman.h>

static void show(const char *tag) {
    struct rusage r;
    getrusage(RUSAGE_SELF, &r);
    printf("%-14s minflt=%-8ld majflt=%-4ld maxrss=%ld KiB\n",
           tag, r.ru_minflt, r.ru_majflt, r.ru_maxrss);
}

int main(void) {
    size_t N = 256 * 1024 * 1024;     /* 256 MiB */
    show("baseline");

    char *p = mmap(NULL, N, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED) { perror("mmap"); return 1; }
    show("after mmap");                /* still ~0 RSS, no faults */

    for (size_t i = 0; i < N; i += 4096)
        p[i] = 1;                       /* one fault per page */
    show("after touch");
    return 0;
}
stdout
baseline       minflt=92       majflt=0    maxrss=2944 KiB
after mmap     minflt=92       majflt=0    maxrss=2944 KiB
after touch    minflt=65628    majflt=0    maxrss=265728 KiB

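The mmap itself is virtually free. The 256 MiB only gets paid for when you touch it: one minor fault per 4 KiB page (65536 of them), assuming transparent hugepages are not backing the region; with THP active the count can be far lower. No major faults, because nothing comes from disk. An anonymous page that is first read maps the kernel's shared zero page and is COW'd on the first write; a page whose first access is a write, as here, gets a freshly zeroed frame directly.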

Watch faults on a live process SHELL
# instantaneous fault counts for a pid (cols 10-13 of /proc/[pid]/stat)
awk '{print "minflt="$10, "cminflt="$11, "majflt="$12, "cmajflt="$13}' /proc/$PID/stat

# cleaner version
ps -o pid,comm,min_flt,maj_flt -p $PID

# top processes by major faults right now
ps -eo pid,comm,maj_flt --sort=-maj_flt | head

# system-wide rate, per second
sar -B 1

# raw kernel counters (cumulative since boot)
grep -E '^pgfault|^pgmajfault|^pgreuse' /proc/vmstat

# trace mmap / brk / mprotect calls during a run
strace -e mmap,mprotect,brk,munmap ./myapp 2>&1 | head -40

# live fault tracing with bcc / bpftrace
sudo bpftrace -e 'tracepoint:exceptions:page_fault_user { @[comm] = count(); }'

brk() vs mmap() how kernels hand you memory

Two syscalls do almost all the work of creating new memory for a user process. Everything else (malloc, calloc, new, the garbage collector in your runtime) is a library on top of these.

brk() — the program break

Each process has a single contiguous heap whose end is the program break. brk(addr) sets it; sbrk(delta) is the SysV-style relative variant. The heap can only grow or shrink contiguously — you cannot poke a hole in the middle. This is why classic free lists in malloc rarely actually return memory to the OS: you can't free the middle of the heap, only the tail.

C // brk_demo.c
#include <stdio.h>
#include <unistd.h>

int main(void) {
    void *initial = sbrk(0);                /* current break */
    printf("initial brk = %p\n", initial);

    void *grown = sbrk(4096);                  /* extend by 1 page */
    printf("grown brk   = %p (got back %p)\n", sbrk(0), grown);

    char *p = (char*)grown;
    p[0] = 'X';                              /* triggers minor fault */
    sbrk(-4096);                              /* return the page */
    return 0;
}

mmap() — arbitrary mappings, anywhere

mmap creates a fresh VMA at any free address in the mmap region (the kernel picks unless you pass MAP_FIXED). Each call yields a region you can independently munmap. This is what glibc's malloc uses for any allocation above M_MMAP_THRESHOLD (default 128 KiB, but it's adaptive — grows up to 32 MiB based on observed usage).

Feature | brk / sbrk | mmap
VMA shape | single, contiguous, grows/shrinks at end | any number, anywhere in mmap region
Per-call cost | cheap syscall | cheap syscall + VMA insertion
Returnable | only the tail | per-mapping via munmap
Best for | many small short-lived allocs | large, long-lived, or shareable allocs
File-backed? | no | yes

malloc() Internals ptmalloc2 // glibc

Glibc's malloc is ptmalloc2, a heap allocator derived from Doug Lea's dlmalloc. It carves brk/mmap regions into chunks, organizes free chunks into bins, and serves user requests with metadata-prefixed pointers.

Chunk layout

malloc chunk
+----------------------+ <- chunk start
| prev_size            |   only valid if prev is free
+----------------------+
| size | A | M | P     |   bottom 3 bits = flags
+----------------------+ <- pointer returned to user
| user data ...        |
|                      |
| (when freed: fd, bk, |
|  fd_nextsize, bk_..) |
+----------------------+
| chunk N+1 prev_size  |
+----------------------+

flags:  P = PREV_INUSE   (1 if prev chunk is in use)
        M = IS_MMAPPED   (1 if chunk came from mmap)
        A = NON_MAIN_ARENA

The user pointer is just past the size header. free(p) reads the 8-byte size field stored immediately before p (at byte offset -8) to find the size and flags. This is why writing past the end of a buffer corrupts the next chunk's header: the classic heap overflow.

Bins — where free chunks live

Bin type | Sizes | Structure | Notes
tcache | per-thread, chunks ≤ 0x410 bytes by default | singly-linked, 7 entries each, 64 sizes | fastest path, no locks, glibc 2.26+
fastbins | chunks up to 128 bytes by default on 64-bit (tunable up to 160) | LIFO singly-linked | no coalescing on free
unsorted bin | any | doubly-linked | recently freed; first stop on alloc
smallbins | 32 ... 1008 bytes (64-bit) | 62 bins, FIFO doubly-linked | exact-size match per bin
largebins | 1024+ bytes | 63 bins, sorted by size | best-fit search within bin
top chunk | tail of arena | single chunk | extended via brk/mmap when too small

Arenas — per-thread heaps

The main arena sits on the brk heap. Additional threads get their own arenas (mmap'd 64 MiB regions on x86_64 by default), reducing lock contention. Each arena has its own bin structure. The number caps at 8 * ncpu for 64-bit systems — beyond that, threads share.

1
tcache
O(1), no lock
2
fastbin
O(1) per size
3
unsorted
cleanup pass
4
small/large bins
scan + coalesce
5
top chunk
grow heap
6
mmap
if > threshold

Watch the threshold flip

C // mmap_threshold.c
#include <stdio.h>
#include <stdlib.h>

int main(void) {
    size_t sizes[] = { 64, 4096, 100000, 200000, 8*1024*1024 };
    for (int i = 0; i < 5; i++) {
        char *p = malloc(sizes[i]);
        printf("malloc(%-8zu) = %p\n", sizes[i], p);
    }
    return 0;
}
stdout
malloc(64      ) = 0x55b8c7a8b2a0   <- main arena (brk heap)
malloc(4096    ) = 0x55b8c7a8b2f0   <- main arena (brk heap)
malloc(100000  ) = 0x55b8c7a8c310   <- main arena (brk heap)
malloc(200000  ) = 0x7f9a5c800010   <- mmap region!
malloc(8388608 ) = 0x7f9a5c000010   <- mmap region (separate)

The address jump from 0x55b8... (main heap) to 0x7f9a... (mmap region) is the threshold flip. When you free() the mmap'd ones, glibc actually calls munmap, so that memory genuinely goes back to the kernel. free() on heap chunks just puts them back into a bin for reuse.

Probe glibc's heap and arenas SHELL
# dump arena stats from a running process (malloc_stats prints to the target's own stderr, not to gdb)
gdb -p $PID -batch -ex 'call (void)malloc_stats()' -ex detach

# total heap (brk) size and the data segment
grep -E 'VmData|VmStk|VmExe|VmLib' /proc/$PID/status

# tweak the mmap threshold so even small allocs go to mmap
MALLOC_MMAP_THRESHOLD_=$((1<<14)) ./myapp   # 16 KiB

# kill tcache to see classic ptmalloc behavior
GLIBC_TUNABLES=glibc.malloc.tcache_count=0 ./myapp

# summarize allocation sizes at exit (libmemusage path varies by distro)
LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libmemusage.so ./myapp
# trace every malloc / free / realloc call
ltrace -e malloc+free+realloc ./myapp 2>&1 | head

# or use a real heap profiler
heaptrack ./myapp && heaptrack_print heaptrack.myapp.*.gz | head -30

mmap() Deep Dive flags · advice

Four flag-pair combinations dominate everything you'll see in /proc/[pid]/maps:

file + private
MAP_PRIVATE | fd
Load file pages, COW on write. The classic libraries / config / data file mapping. Writes go to anonymous COW frames, never the file.
file + shared
MAP_SHARED | fd
Changes write back to the file via the page cache. Used for shared databases, scratch files, IPC via memory-mapped files.
anon + private
MAP_PRIVATE | MAP_ANONYMOUS
Fresh zero-filled pages. The malloc>128k path. Big buffers, scratch space, anything the program owns alone.
anon + shared
MAP_SHARED | MAP_ANONYMOUS
Shared between fork'd children. Classic IPC mechanism without needing a backing file. Survives fork(), dies with the last mapper.

Anonymous private mapping (the malloc workhorse)

C // anon_private.c
#include <stdio.h>
#include <sys/mman.h>

int main(void) {
    size_t N = 2 * 1024 * 1024;            /* 2 MiB */
    char *m = mmap(NULL, N,
                   PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS,
                   -1, 0);
    if (m == MAP_FAILED) { perror("mmap"); return 1; }

    m[0] = 'A';                              /* zero-page COW -> first frame */
    munmap(m, N);
    return 0;
}

File-backed shared mapping

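C // mmap_file.c
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main(int argc, char **argv) {
    if (argc != 2) { fprintf(stderr, "usage: %s FILE\n", argv[0]); return 1; }

    int fd = open(argv[1], O_RDWR);
    if (fd < 0) { perror("open"); return 1; }

    struct stat st;
    if (fstat(fd, &st) < 0 || st.st_size < 16) {   /* need 16 bytes to overwrite */
        fprintf(stderr, "fstat failed or file shorter than 16 bytes\n");
        close(fd);
        return 1;
    }

    char *m = mmap(NULL, st.st_size,
                   PROT_READ | PROT_WRITE,
                   MAP_SHARED, fd, 0);
    if (m == MAP_FAILED) { perror("mmap"); close(fd); return 1; }

    /* edits hit the page cache; kernel writes back asynchronously */
    memset(m, 'X', 16);
    msync(m, st.st_size, MS_SYNC);          /* force flush for durability */

    munmap(m, st.st_size);
    close(fd);
    return 0;
}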

This skips the explicit read/write dance. Reads materialize as page faults that pull from the page cache (or disk on miss). Writes mark pages dirty; the kernel writes them back via writeback threads or on msync.

Useful flags

Flag | What it does
MAP_POPULATE | pre-faults all pages (no demand paging); useful when you know you'll touch everything
MAP_LOCKED | like mlock: pages won't be swapped out (needs RLIMIT_MEMLOCK)
MAP_HUGETLB | allocate from the hugepage pool (2 MiB or 1 GiB pages)
MAP_FIXED_NOREPLACE | try a specific address but fail rather than displace existing mappings (use this, not raw MAP_FIXED)
MAP_NORESERVE | don't reserve swap up front; allows huge sparse mappings; risk: a later write can fail with SIGSEGV (see mmap(2)) if no memory is available
MAP_STACK | hint that this mapping will be used as a stack

madvise() — tell the kernel what you'll do

C // madvise hints
madvise(p, n, MADV_SEQUENTIAL);   /* expect linear access; aggressive readahead */
madvise(p, n, MADV_RANDOM);       /* random access; disable readahead */
madvise(p, n, MADV_WILLNEED);     /* prefetch into page cache */
madvise(p, n, MADV_DONTNEED);     /* drop pages now (anon: zeros on next read!) */
madvise(p, n, MADV_FREE);         /* lazy reclaim: pages may persist or be zeroed */
madvise(p, n, MADV_HUGEPAGE);     /* opt this region into THP coalescing */
madvise(p, n, MADV_NOHUGEPAGE);   /* and out */

fork() and Copy-on-Write page-table magic

fork() doesn't actually copy the parent's memory. It copies the page tables, marks every writable PTE of a private mapping as read-only in both parent and child, and increments a refcount on each backing page. The first write by either process triggers a fault; the kernel allocates a new frame, copies the contents, points the writer's PTE at the new frame, and restores RW.

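[Figure: three stages. (1) Before fork: the parent PTE (rw) points at a frame with refcount 1. (2) Right after fork(): the parent and child PTEs are both read-only and marked COW, pointing at the same frame with refcount 2. (3) After the parent writes: the parent PTE (rw) points at a new frame, and the child keeps the original frame, whose refcount is back to 1.]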
FIG_06 // FORK + COW LIFECYCLE

Demonstrating COW

C // cow_demo.c
#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/wait.h>
#include <sys/resource.h>
#include <unistd.h>

int main(void) {
    size_t N = 64 * 1024 * 1024;
    char *buf = malloc(N);
    if (!buf) { perror("malloc"); return 1; }
    memset(buf, 'A', N);                  /* parent allocates + faults all pages */

    struct rusage r;
    getrusage(RUSAGE_SELF, &r);
    printf("parent pre-fork  minflt=%ld\n", r.ru_minflt);
    fflush(stdout);                       /* keep buffered output out of the child */

    pid_t pid = fork();
    if (pid == 0) {
        /* child: read-only access touches nothing physically */
        volatile size_t sum = 0;          /* volatile so the read loop is kept */
        for (size_t i = 0; i < N; i += 4096) sum += buf[i];
        getrusage(RUSAGE_SELF, &r);
        printf("child  read-only  minflt=%ld\n", r.ru_minflt);

        /* now write half: triggers COW for half the pages */
        for (size_t i = 0; i < N/2; i += 4096) buf[i] = 'B';
        getrusage(RUSAGE_SELF, &r);
        printf("child  half-write minflt=%ld\n", r.ru_minflt);
        fflush(stdout);                   /* _exit() skips stdio flushing */
        _exit(0);
    }
    waitpid(pid, NULL, 0);
    return 0;
}
stdout
parent pre-fork  minflt=16472
child  read-only  minflt=132          # only kernel/copy bookkeeping
child  half-write minflt=8324         # ~16384/2 = 8192 COW faults + noise

Read-only access in the child does not allocate a single new physical frame. The write phase triggers exactly one minor fault per touched 4 KiB page, and only the touched pages get duplicated. This is what makes fork() + exec() cheap and what made shared read-mostly server architectures viable for decades.

RSS lies after fork // know your acronyms
Both processes show the same RSS in top immediately after fork(). They aren't actually using 2x the memory; the pages are shared. Look at PSS (Proportional Set Size) in /proc/[pid]/smaps for a fairer accounting that divides shared pages across the procs that map them.

Kernel Allocators buddy · slab · vmalloc

The kernel itself needs to allocate memory: page tables, inode caches, network buffers, scheduler runqueues, your VMA structures. It has its own zoo of allocators because the constraints (interrupt context, atomic vs sleeping, physically contiguous, NUMA-local) are different from user space.

Buddy allocator — the foundation

The buddy system is the bottom of the kernel allocator stack. It manages physical page frames in power-of-two-sized blocks called orders, from order-0 (1 page = 4 KiB) up to order-10 (1024 pages = 4 MiB).

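[Figure: buddy allocator, power-of-two blocks. An order-4 block (16 pages = 64 KiB) splits into order-3 buddies (8 pages / 32 KiB), then order 2 (4 pages / 16 KiB), order 1 (2 pages) and order 0 (1 page). Blocks split on allocation, coalesce with their buddy on free, and are physically contiguous.]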
FIG_07 // BUDDY ALLOCATOR ORDERS

Need 12 KiB? Round up to 4 pages = order 2. The allocator finds a free order-2 block. If none exist, it splits an order-3 into two order-2 buddies, hands one out, queues the other. On free, if the buddy is also free, they recombine. This is the source of memory fragmentation — a system can have plenty of free order-0 pages and still fail an order-5 allocation if no contiguous run exists.

See /proc/buddyinfo for the live state per zone:

$ cat /proc/buddyinfo
Node 0, zone      DMA      1      0      0      1      2  ...
Node 0, zone    DMA32  10234   8501   6442   3201    998  ...
Node 0, zone   Normal  88412  41203   9871    412     22  ...

Slab / SLUB — cache of fixed-size objects

Most kernel allocations aren't full pages; they're small structs (192 bytes for a dentry, a couple of hundred for a vm_area_struct, a few KiB for a task_struct). The slab allocator sits on top of buddy: it asks for pages in bulk, then carves them into fixed-size object slots, keeping per-CPU and per-NUMA-node caches to avoid lock contention.

$ sudo head /proc/slabinfo
slabinfo - version: 2.1
# name              active_objs num_objs objsize objperslab pagesperslab
dentry              412091      412290      192        21            1
inode_cache         206301      206301      584        14            2
kmalloc-8k             280         280     8192         4            8
kmalloc-1k            3104        3104     1024        16            4
task_struct           1242        1242     5824         5            8
mm_struct              312         312     1024        16            4
vm_area_struct        9412        9412      216        18            1

kmalloc / kvmalloc / vmalloc

API | Returns | Used for
kmalloc(n, GFP_*) | physically contiguous | small kernel allocs (DMA buffers, structs); backed by SLUB
vmalloc(n) | virtually contiguous, physically scattered | large kernel buffers where physical contiguity isn't needed
kvmalloc(n, GFP_*) | tries kmalloc, falls back to vmalloc | generic large alloc, "I don't care which"
alloc_pages(GFP_*, order) | raw page frames | page-table builders, network ring buffers

GFP flags — get-free-pages context

Flag | Means
GFP_KERNEL | can sleep, can do I/O, can reclaim; the default for process context
GFP_ATOMIC | cannot sleep (interrupt context); fail fast if no free page available
GFP_NOIO | can sleep but not start I/O (avoid recursion in I/O paths)
GFP_DMA | must come from low-memory zone usable by legacy DMA controllers
__GFP_ZERO | zero the returned page
__GFP_NOWARN | don't dump a stacktrace if alloc fails

Page cache

Every byte you read from a regular file through the normal (non-O_DIRECT) path lives, at least transiently, in the page cache: the kernel's RAM-backed cache of file pages indexed by (inode, offset). It's not a separate pool; it's just pages from the buddy allocator that happen to be reclaimable. free's "available" number includes most of it.

This is also why cat largefile > /dev/null warms the cache, why running a benchmark twice usually gets faster the second time, and why sync; echo 3 > /proc/sys/vm/drop_caches is the standard way to empty it before a measurement (drop_caches discards only clean pages, hence the sync first).

Memory Pressure, Swap, OOM reclaim · watermarks

When demand for pages exceeds supply, Linux reclaims. Reclaim is handled by per-NUMA-node kswapd threads (background, woken at the low watermark) and by direct reclaim (foreground, performed by the allocating task itself when free memory falls too low for kswapd to keep up).

What gets reclaimed

Reclaim uses active/inactive LRU lists (kept per node and per cgroup on current kernels rather than per zone), biased by the PTE accessed bit and adjusted by vm.swappiness (0-200, default 60). Lower values discourage anon swap-out in favor of reclaiming file cache.

Watermarks

Each zone has three watermarks: min, low, high. Below min, only allocations allowed to dip into reserves (such as atomic kernel allocations) succeed. Dropping below low wakes kswapd, which then keeps reclaiming until free pages are back above high. See /proc/zoneinfo.

The OOM killer

If reclaim can't free enough memory, the OOM killer picks a victim and sends SIGKILL. Each process's score is exposed in /proc/[pid]/oom_score; it is derived mainly from the process's memory footprint (RSS plus swap and page tables) and is tunable via /proc/[pid]/oom_score_adj (-1000 to +1000; -1000 = immune).

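C // oom_protect.c
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void) {
    /* lowering the value below its current setting needs CAP_SYS_RESOURCE */
    int fd = open("/proc/self/oom_score_adj", O_WRONLY);
    if (fd < 0) { perror("open"); return 1; }
    if (write(fd, "-500", 4) != 4)           /* less likely to be killed */
        perror("write");
    close(fd);
    /* ... do important work ... */
    return 0;
}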

/proc/meminfo — what the system thinks it has

/proc/meminfo
MemTotal:       32896132 kB    # all RAM the kernel can see
MemFree:         1204896 kB    # truly unused
MemAvailable:   18472304 kB    # estimate of what allocs *could* get
Buffers:          412004 kB    # block-device cache
Cached:         15823008 kB    # page cache (file pages)
SwapCached:        12492 kB    # pages once in swap, now back in memory
Active:         12409832 kB    # recently used (anon + file)
Inactive:        4988124 kB    # candidates for reclaim
AnonPages:       9912004 kB    # process anonymous (heap, stack, anon mmap)
Mapped:          1284820 kB    # mapped from files (libs, mmap'd files)
Slab:             982408 kB    # kernel slab caches
SReclaimable:     682112 kB    # slabs the kernel can drop
SUnreclaim:       300296 kB    # slabs that must stay
KernelStack:       28192 kB    # per-task kernel stacks
PageTables:       142884 kB    # the actual page-table pages!
SwapTotal:       8388604 kB
SwapFree:        8261940 kB
Dirty:              2884 kB    # waiting to be written back
Writeback:             0 kB    # currently being written

Note PageTables: that's the kernel's accounting of RAM consumed by the PGDs/PUDs/PMDs/PTEs of every process on the system. A process with very sparse mappings can spend non-trivial RAM just on the page tables.

cgroup memory controller

cgroups v2 lets you cap memory per group (typically per container or per service). Hitting memory.max triggers reclaim within the cgroup; failing reclaim invokes the cgroup OOM killer (only kills processes in the cgroup).

SHELL // cgroups v2
# limit a service to 512 MiB (run as root; the memory controller must be
# enabled in the parent's cgroup.subtree_control)
mkdir /sys/fs/cgroup/myservice
echo 536870912 > /sys/fs/cgroup/myservice/memory.max
echo $$        > /sys/fs/cgroup/myservice/cgroup.procs
# now this shell + descendants are capped
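
Whether the cap is actually being hit shows up in the group's memory.events file, which holds one "key value" pair per line (max counts hits on memory.max, oom_kill counts cgroup OOM kills). A minimal sketch of reading one counter follows; both function names are hypothetical, and the path in the usage note reuses the myservice example.

```c
#include <stdio.h>
#include <string.h>

/*
 * Return the value stored under `key` in flat-keyed text ("key value\n"
 * per line, as in memory.events), or -1 if the key is missing.
 */
long flat_keyed_value(const char *text, const char *key)
{
    size_t klen = strlen(key);
    const char *line = text;

    while (line && *line) {
        /* require a space after the key so "oom" never matches "oom_kill" */
        if (strncmp(line, key, klen) == 0 && line[klen] == ' ') {
            long v;
            return sscanf(line + klen + 1, "%ld", &v) == 1 ? v : -1;
        }
        line = strchr(line, '\n');
        if (line)
            line++;
    }
    return -1;
}

/* Read a small text file into buf (always NUL-terminated); 0 on success. */
int read_small_file(const char *path, char *buf, size_t cap)
{
    FILE *f = fopen(path, "r");
    if (!f || cap == 0) {
        if (f)
            fclose(f);
        return -1;
    }
    size_t n = fread(buf, 1, cap - 1, f);
    buf[n] = '\0';
    fclose(f);
    return 0;
}
```

A caller would read /sys/fs/cgroup/myservice/memory.events into a buffer with read_small_file and then ask for flat_keyed_value(buf, "oom_kill"); a value that grows between two reads means the cgroup OOM killer ran in between.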

Inspect pressure, swap, OOM SHELL
# PSI: how much time we're stalled on memory
cat /proc/pressure/memory
# some avg10=0.00 ... full avg10=0.00 ...   "full" non-zero = real pain

# actively swapping right now? si/so columns are kB/s
vmstat 1 5

# current watermarks per zone
grep -E '^Node|^ +(min|low|high) +[0-9]' /proc/zoneinfo | head -20

# did the OOM killer run since boot?
dmesg -T | grep -i 'killed process\|out of memory'
journalctl -k -b | grep -i 'invoked oom'

# who would the OOM killer pick right now?
for p in /proc/[0-9]*; do
  s=$(cat $p/oom_score 2>/dev/null) || continue
  echo "$s $(cat $p/comm 2>/dev/null) $(basename $p)"
done | sort -n | tail -10

# cgroup-scoped pressure and OOM events
cat /sys/fs/cgroup/myservice/memory.pressure
cat /sys/fs/cgroup/myservice/memory.events
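
For monitoring from code rather than the shell, the PSI text is easy to parse. The sketch below pulls the avg10 figure out of the "some" or "full" line; psi_avg10 is a hypothetical helper, and the line layout it expects is the one shown in the comment above (kind, then avg10=, avg60=, avg300=, total=).

```c
#include <stdio.h>
#include <string.h>

/*
 * Extract avg10 (percentage of the last 10 seconds spent stalled) from
 * PSI text such as the contents of /proc/pressure/memory.
 * `kind` is "some" or "full". Returns -1.0 if the line is not found.
 */
double psi_avg10(const char *text, const char *kind)
{
    size_t klen = strlen(kind);
    const char *line = text;

    while (line && *line) {
        if (strncmp(line, kind, klen) == 0 && line[klen] == ' ') {
            double v;
            return sscanf(line + klen, " avg10=%lf", &v) == 1 ? v : -1.0;
        }
        line = strchr(line, '\n');
        if (line)
            line++;
    }
    return -1.0;
}
```

The same function works on a cgroup's memory.pressure file, since it uses the same layout. An alert on psi_avg10(text, "full") above some small threshold is one way to catch sustained stalls; the right threshold is workload-specific.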

NUMA // When Memory Has Geography multi-socket

Multi-socket servers don't have one big memory pool. Each socket has its own attached DRAM (the local node); accessing another socket's memory traverses the inter-socket link (UPI on Intel, Infinity Fabric on AMD). The latency penalty is real — typically 1.5x to 2x for remote access.

Linux models this as NUMA nodes. numactl --hardware shows the topology:

$ numactl --hardware
available: 2 nodes (0-1)
node 0 cpus: 0 1 2 3 4 5 6 7
node 0 size: 16345 MB
node 1 cpus: 8 9 10 11 12 13 14 15
node 1 size: 16384 MB
node distances:
node     0    1
  0:    10   21
  1:    21   10

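The distance matrix uses relative units normalized so that local access is 10; 21 here nominally means a remote access costs ~2.1x the local cost. The values come from firmware tables, so treat them as a rough ranking rather than a measurement.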
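
The arithmetic behind that reading of the matrix can be sketched as follows. Both functions are hypothetical helpers, and they take the distance values at face value as latency ratios, which is an assumption rather than something the hardware guarantees.

```c
#define LOCAL_DISTANCE 10   /* matrix value for a node's own memory */

/* Relative cost of reaching memory at `distance`, with local = 1.0. */
double relative_cost(int distance)
{
    return (double)distance / LOCAL_DISTANCE;
}

/*
 * Average relative cost seen from one node when its pages are spread
 * evenly over all `n` nodes; `row` is that node's row of the matrix.
 */
double interleaved_cost(const int *row, int n)
{
    double sum = 0.0;

    for (int i = 0; i < n; i++)
        sum += relative_cost(row[i]);
    return n > 0 ? sum / n : 0.0;
}
```

For the two-node machine above, row 0 is {10, 21}: a remote access costs 2.1 and evenly spread pages average (1.0 + 2.1) / 2 = 1.55 times the local cost.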

Allocation policies

default
first-touch local
Allocate on the node where the touching thread runs. Good default. Can go wrong if one thread initializes a buffer and threads on another node consume it.
--membind
strict
Only alloc from listed nodes; fail if none have room. Forces locality at cost of flexibility.
--preferred
soft preference
Try one node, fall back to others when it is full. Useful when you want a process kept near one node without risking allocation failure.
--interleave
round-robin
Striped across nodes. Good for big shared buffers where bandwidth matters more than latency.
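C // numa_aware.c // -lnuma
#include <numa.h>
#include <stdio.h>
#include <string.h>

int main(void) {
    if (numa_available() < 0) return 1;          /* no NUMA support */

    int nnodes = numa_max_node() + 1;
    printf("%d NUMA nodes\n", nnodes);

    /* pin this thread to node 0's CPUs before touching the memory */
    if (numa_run_on_node(0) != 0) { perror("numa_run_on_node"); return 1; }

    /* allocate 1 GiB on node 0 (libnuma prefers the node by default;
       call numa_set_bind_policy(1) first for strict binding) */
    size_t len = 1UL << 30;
    void *p = numa_alloc_onnode(len, 0);
    if (p == NULL) { perror("numa_alloc_onnode"); return 1; }

    /* first touch faults the pages in; later accesses hit local DRAM */
    memset(p, 0, len);

    numa_free(p, len);
    return 0;
}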

For the common case (single-process service, multiple threads), Linux's default first-touch policy plus automatic NUMA balancing usually does the right thing: the kernel migrates pages toward the cores that touch them most. For high-end workloads (databases, in-memory analytics), explicit pinning with numactl or libnuma can be the difference between roughly 60% and 100% of memory bandwidth.

Inspect per-process NUMA placement SHELL
# topology + node memory totals
numactl --hardware

# RSS broken down by node, per process
numastat -p $PID

# every mapping with its current node distribution
cat /proc/$PID/numa_maps | head

# node-level allocator stats: hits, misses, foreign
numastat

# pin a workload manually
numactl --cpunodebind=0 --membind=0 ./myapp

# count loads served from remote DRAM (Intel event name; varies by CPU)
perf stat -e mem_load_l3_miss_retired.remote_dram ./myapp

# is autonuma migrating pages around?
grep numa_ /proc/vmstat

Tooling // Where to Poke procfs · perf · bcc

Per-process

Path / tool                  What you get
/proc/[pid]/maps             VMA list, perms, file backing
/proc/[pid]/smaps            per-VMA RSS / PSS / swap / hugepages / locked
/proc/[pid]/status           VmSize / VmRSS / VmHWM / VmData / VmStk / VmExe
/proc/[pid]/statm            compact, in pages: size, resident, shared, text, lib, data, dt (lib and dt are always 0)
/proc/[pid]/pagemap          virt-to-PFN lookup, swap status, present bit (PFNs visible only with CAP_SYS_ADMIN)
/proc/[pid]/oom_score_adj    tune OOM-kill priority
pmap -x [pid]                readable VMA summary

System-wide

Path / tool                        What you get
/proc/meminfo                      top-level memory accounting
/proc/buddyinfo                    free pages per order per zone
/proc/zoneinfo                     watermarks, pages, vmstats per zone
/proc/slabinfo                     per-cache slab usage (root)
/proc/vmstat                       counters: pgfault, pgmajfault, pgscan, pswpin, etc.
free -h                            quick total / used / free / buff/cache / available
vmstat 1                           live si/so (swap I/O), bi/bo, free, buff, cache
sar -B 1                           paging stats: pgpgin, pgpgout, fault, majflt, pgfree, pgscan
numastat -m                        per-node memory breakdown
slabtop                            live slab cache top-N
perf mem record                    load/store address sampling, NUMA hit/miss
bcc tools (memleak, slabratetop)   eBPF-based dynamic tracing

Quick recipes

SHELL
# What's allocating in the kernel right now?
sudo slabtop -o | head -20

# What's the actual physical frame backing a virtual address?
# (root only; uses /proc/[pid]/pagemap)
sudo ./pagemap-dump $(pidof myapp) 0x7f3e9c1a0000

# Force-flush page cache for benchmarking:
sync && echo 3 | sudo tee /proc/sys/vm/drop_caches

# Watch a process's RSS climb:
while true; do grep VmRSS /proc/$PID/status; sleep 1; done

# Major-fault hot processes:
ps -eo pid,comm,maj_flt,min_flt --sort=-maj_flt | head

# Are we swapping right now?
vmstat 1 5   # si/so columns; non-zero = active swap I/O

# Is THP coalescing pages?
grep AnonHugePages /proc/meminfo
cat /sys/kernel/mm/transparent_hugepage/enabled
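
The pagemap recipe above calls ./pagemap-dump, which is not a standard tool. A minimal sketch of the decoding such a helper might do is shown here, assuming the documented pagemap layout: one 64-bit entry per virtual page, bit 63 = present, bit 62 = swapped, bits 0-54 = PFN when present. The struct and function names are hypothetical.

```c
#include <stdint.h>

#define PM_PRESENT  (1ULL << 63)         /* page is resident in RAM       */
#define PM_SWAPPED  (1ULL << 62)         /* page is in swap               */
#define PM_PFN_MASK ((1ULL << 55) - 1)   /* bits 0-54: PFN when present   */

struct pagemap_info {
    int present;
    int swapped;
    uint64_t pfn;   /* 0 if not present, or if read without CAP_SYS_ADMIN */
};

/* Byte offset of the 8-byte entry for `vaddr` in /proc/[pid]/pagemap. */
uint64_t pagemap_offset(uint64_t vaddr, uint64_t page_size)
{
    return (vaddr / page_size) * sizeof(uint64_t);
}

/* Split one raw pagemap entry into its flags and frame number. */
struct pagemap_info decode_pagemap(uint64_t entry)
{
    struct pagemap_info info;

    info.present = (entry & PM_PRESENT) != 0;
    info.swapped = (entry & PM_SWAPPED) != 0;
    info.pfn = info.present ? (entry & PM_PFN_MASK) : 0;
    return info;
}

/* Physical address = PFN * page size + offset inside the page. */
uint64_t phys_addr(uint64_t pfn, uint64_t vaddr, uint64_t page_size)
{
    return pfn * page_size + (vaddr % page_size);
}
```

A full helper would open /proc/[pid]/pagemap, pread 8 bytes at pagemap_offset(vaddr, page_size), and pass the result to decode_pagemap. Without CAP_SYS_ADMIN the kernel reports the PFN field as zero, so a zero PFN on a present page usually means missing privileges rather than frame 0.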

Further reading worth your time