Linux Memory Management

The Two Memories virtual · physical

Every running process believes it owns the entire machine. It can be laid out over the same address ranges as every other process, reference the same symbol addresses, and use the same stack-pointer values. The trick that makes this work is virtual memory: the process sees a flat private address space, and the kernel plus the CPU's Memory Management Unit lazily map regions of that virtual space onto the physical DRAM that actually exists.

Three things fall out of this for free.

Isolation

memory protection

Process A cannot see B's memory because A's page tables don't include B's mappings. A bug in one process can't trash another.

Larger than RAM

128 TiB user space

The virtual space is decoupled from physical capacity. You can address far more memory than you have, and the kernel pages in only what's touched.

Lazy Backing

demand paging

Pages get backed by physical frames only on first access. malloc(1 GiB) costs almost nothing until you touch it.

Two shapes to keep in your head: page (virtual, 4 KiB by default on x86_64) and page frame (physical, the same 4 KiB chunk of DRAM). Translation happens at page granularity. Anything finer is the allocator's problem.

FIG_01 // VIRT-TO-PHYS INDIRECTION

The x86_64 Virtual Address Space canonical · 48 bits

x86_64 hardware uses 48-bit virtual addresses (5-level paging extends this to 57, though 4-level is still what most systems actually run). The high 16 bits must be a sign-extension of bit 47; addresses that satisfy this are called canonical. Anything else faults. This carves the 64-bit space into two halves with a giant unmappable hole in the middle.

Page size: 4 KiB (0x1000)
Virt bits used: 48 / 57
User max: 0x00007FFF.FFFF.FFFF
Kernel min: 0xFFFF8000.0000.0000
User space: 128 TiB
Kernel space: 128 TiB

Layout of a typical user process

FIG_02 // x86_64 USER VAS / 4-LEVEL PAGING

What lives in each segment

Segment	Source	Perms	Backed by	Notes
.text	ELF LOAD	r-x	file (private)	shared between procs of same binary
.rodata	ELF LOAD	r--	file	string literals, const tables
.data	ELF LOAD	rw-	file (COW)	initialized globals/statics
.bss	kernel	rw-	anonymous (zero)	no file bytes; zeroed on first touch
heap	brk/sbrk	rw-	anonymous	extended in pages, never released piecemeal
mmap	mmap()	varies	file or anon	each mapping can be released individually via munmap()
stack	kernel (auto)	rw-	anonymous	grows on fault, capped by RLIMIT_STACK
vDSO/vvar	kernel	r-x / r--	shared kernel	fast-path syscalls (gettimeofday, etc.)

ASLR // address space layout randomization With PIE binaries (default on most modern distros), the kernel randomizes the base of the executable, the mmap region, the stack top, and the heap start on every execve(). Two runs of the same binary won't share virtual addresses for any of these regions. What they do share is physical memory: the file-backed .text pages of the underlying ELF live once in the page cache and are mapped into every process that runs the binary.

Inspect your own address space SHELL

# raw VMA list of the current shell
cat /proc/self/maps

# any process by pid (readable, with sizes and perms)
pmap -X $PID

# just the named regions
grep -E '\[(heap|stack|vdso|vvar)\]' /proc/$PID/maps

# is ASLR on? 0=off  1=conservative  2=full
cat /proc/sys/kernel/randomize_va_space

# run the same binary twice and diff the layout
diff <(./myapp & sleep 0.1; cat /proc/$!/maps; kill $!) \
     <(./myapp & sleep 0.1; cat /proc/$!/maps; kill $!)

# max stack size for this shell
ulimit -s

Walking /proc/self/maps VMA list

The kernel exposes the live VMA list (Virtual Memory Areas — the kernel's internal representation of contiguous mapped regions) through procfs. Each line is one VMA. Compile and run this:

C // dump_maps.c
#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

char  initialized_global[] = "hello";     /* .data   */
char  zero_global[4096];                       /* .bss    */
const char *ro_string = "read me only";       /* .rodata */

int main(int argc, char **argv) {
    int   on_stack = 42;
    char *on_heap  = malloc(128);
    char *big_heap = malloc(8 * 1024 * 1024);    /* > mmap threshold */

    printf("main()         = %p   .text\n",        (void*)main);
    printf("ro_string      = %p   .rodata\n",      (void*)ro_string);
    printf("init_global    = %p   .data\n",        (void*)initialized_global);
    printf("zero_global    = %p   .bss\n",         (void*)zero_global);
    printf("on_heap (sml)  = %p   heap (brk)\n",   (void*)on_heap);
    printf("big_heap (lrg) = %p   mmap region\n",  (void*)big_heap);
    printf("on_stack       = %p   stack\n",        (void*)&on_stack);

    FILE *m = fopen("/proc/self/maps", "r");
    if (!m) { perror("fopen"); return 1; }     /* procfs not mounted? */
    char line[512];
    while (fgets(line, sizeof line, m)) fputs(line, stdout);
    fclose(m);
    return 0;
}

Sample output (truncated, addresses shown in low form for clarity):

stdout
main()         = 0x55b2c4a01169   .text
ro_string      = 0x55b2c4a02008   .rodata
init_global    = 0x55b2c4a04010   .data
zero_global    = 0x55b2c4a04020   .bss
on_heap (sml)  = 0x55b2c5e7a2a0   heap (brk)
big_heap (lrg) = 0x7f3e9b800010   mmap region
on_stack       = 0x7ffd4f3a8b8c   stack

55b2c4a00000-55b2c4a01000 r--p  00000000  fd:00  123  /home/u/dump_maps   # ELF header
55b2c4a01000-55b2c4a02000 r-xp  00001000  fd:00  123  /home/u/dump_maps   # .text
55b2c4a02000-55b2c4a03000 r--p  00002000  fd:00  123  /home/u/dump_maps   # .rodata
55b2c4a03000-55b2c4a04000 rw-p  00003000  fd:00  123  /home/u/dump_maps   # .data
55b2c4a04000-55b2c4a05000 rw-p  00000000  00:00  0    # .bss (anon)
55b2c5e7a000-55b2c5e9b000 rw-p  00000000  00:00  0    [heap]
7f3e9b7ff000-7f3e9c000000 rw-p  00000000  00:00  0    # 8 MiB anon mmap
7f3e9c1a0000-7f3e9c1c8000 r--p  00000000  fd:00  456  /usr/lib64/libc.so.6
7f3e9c1c8000-7f3e9c34f000 r-xp  00028000  fd:00  456  /usr/lib64/libc.so.6
7ffd4f389000-7ffd4f3aa000 rw-p  00000000  00:00  0    [stack]
7ffd4f3d4000-7ffd4f3d8000 r--p  00000000  00:00  0    [vvar]
7ffd4f3d8000-7ffd4f3da000 r-xp  00000000  00:00  0    [vdso]

Decoding a maps line

Each row is start-end perms offset dev inode pathname. Permissions are rwxp or rwxs (private vs shared). A private writable file mapping starts as a copy-on-write of the file pages — the moment you write, the kernel hands you a private anon page and forgets about the file for that location.

Want richer info? /proc/[pid]/smaps breaks down each VMA's RSS, PSS, swap, anon vs file pages, hugepage backing, locked status.

Inspect smaps and per-VMA detail SHELL

# first VMA's full breakdown — Size, Rss, Pss, Swap, Anon, etc.
awk '/^[0-9a-f]+-[0-9a-f]+/{n++} n==2{exit} {print}' /proc/$PID/smaps

# single-shot summary across all mappings
cat /proc/$PID/smaps_rollup

# biggest mappings by RSS
pmap -x $PID | sort -k3 -n -r | head

# every shared library actually loaded
awk '/\.so/{print $6}' /proc/$PID/maps | sort -u

# trace ld.so as it pulls libraries in
LD_DEBUG=files ./myapp 2>&1 | grep 'calling init'

Page Tables / The 4-Level Walk 9 + 9 + 9 + 9 + 12

A virtual address is not a number that gets added to a base register. It's a structured key into a multi-level radix tree. On x86_64 with 4-level paging, a 48-bit virtual address breaks into five fields:

FIG_03 // 4-LEVEL PAGE TABLE WALK

Each level is a 4 KiB table of 512 64-bit entries. The CPU walks the tree on every translation that misses the TLB.

STEP 1: Read CR3 (phys ptr to PGD)
STEP 2: Index PGD (bits 47:39)
STEP 3: Index PUD (bits 38:30)
STEP 4: Index PMD (bits 29:21)
STEP 5: Index PTE (bits 20:12)
RESULT: PFN + offset (phys address)

Anatomy of a PTE (x86_64)

Bit	Name	Meaning
0	P	present — if 0, fault on access (other bits may still be meaningful to the kernel)
1	R/W	writable when 1, read-only when 0
2	U/S	user accessible when 1, kernel-only when 0
3	PWT	write-through caching
4	PCD	cache disable
5	A	accessed — set by hardware on read/write, cleared by kernel for LRU
6	D	dirty — set by hardware on write, used for writeback
7	PS	page size — if 1 at PMD level, this is a 2 MiB hugepage; at PUD, a 1 GiB page
8	G	global — not flushed from TLB on CR3 reload (kernel mappings)
12-51	PFN	physical page frame number, shifted left 12 bits gives physical address
63	NX	no-execute — faults on instruction fetch when set

Why 9 + 9 + 9 + 9 + 12? A 4 KiB table holds 4096 / 8 = 512 entries, and 512 = 2⁹. So each level consumes 9 bits of address. Four levels give 36 bits of indexing, plus 12 bits of in-page offset = 48 bits total. That's where the 256 TiB of virtual address space comes from: 2⁴⁸ bytes in all, split into 2⁴⁷ = 128 TiB for each half.

Hugepages skip levels

If the PMD entry has bit 7 (PS) set, the walk stops there: bits 20:0 of the virtual address become a 21-bit offset into a 2 MiB hugepage. Same idea one level up gives 1 GiB pages. Hugepages dramatically reduce TLB pressure for workloads with large working sets — one TLB entry covers 512x or 262144x more memory.

The MMU and TLB translation cache

The Memory Management Unit is hardware. Every load and store goes through it. The page-table walk above costs up to four dependent memory loads, one per level, and any of them can miss the caches and go all the way to DRAM. That is far too slow to pay on every access, so the MMU caches recent translations in the Translation Lookaside Buffer.

L1 TLB size: ~64 entries
L2 TLB size: ~1500 entries
TLB miss cost: 50-200 cycles
TLB hit cost: ~1 cycle
Walk on miss: 4 cache-line loads
Flush on: CR3 write

Context switches and TLB shootdowns

Loading CR3 with a new process's PGD physical address invalidates every non-global TLB entry. Kernel mappings have the G bit set so they survive context switches.

If the kernel changes a page-table entry on one CPU, it needs to invalidate the cached translation on every CPU that might have it. This is a TLB shootdown, implemented via inter-processor interrupts. Shootdowns are expensive. Workloads that munmap or mprotect aggressively pay for them visibly.

PCID / ASID // process-context identifiers Modern x86_64 CPUs support PCIDs, which tag TLB entries with the address space they belong to. Linux enables this when available, allowing the kernel to switch CR3 without flushing the TLB for short-lived switches (KPTI being a notable consumer).

Measure TLB pressure SHELL

# TLB miss rates while running a binary
perf stat -e dTLB-load-misses,dTLB-loads,iTLB-load-misses,iTLB-loads ./myapp

# expressed as miss ratio
perf stat -x, -e dTLB-loads,dTLB-load-misses ./myapp 2>&1 \
  | awk -F, 'NR==1{loads=$1} NR==2{print "dTLB miss ratio:", $1/loads}'

# is THP coalescing pages?
grep -E 'AnonHugePages|ShmemHugePages' /proc/meminfo
cat /sys/kernel/mm/transparent_hugepage/enabled

# hugepage pools (2 MiB / 1 GiB)
ls /sys/kernel/mm/hugepages

# is the CPU using PCIDs?
grep -m1 -o 'pcid' /proc/cpuinfo

Page Faults minor · major · invalid

A page fault is the CPU's way of telling the kernel "I tried to translate an address and the PTE didn't satisfy the access." It's not always an error — faults are how Linux implements demand paging, copy-on-write, lazy stack growth, and swap-in.

Minor

microseconds

PTE empty but VMA exists. Kernel allocates a frame (or finds a cached one), wires up the PTE, retries the instruction. Includes first-touch of .bss, fresh anon mmap, stack growth.

Major

milliseconds

Page lives on disk — either swapped out or never read in from a file mapping. Kernel issues a block I/O, sleeps the process, retries on completion.

Invalid

SIGSEGV / SIGBUS

No VMA covers the address, or the access violates VMA permissions (write to r--, exec from rw-). Kernel sends a signal. Process death by default.

Watch faults happen

C // demand_paging.c
#define _GNU_SOURCE
#include <stdio.h>
#include <sys/resource.h>
#include <sys/mman.h>

static void show(const char *tag) {
    struct rusage r;
    getrusage(RUSAGE_SELF, &r);
    printf("%-14s minflt=%-8ld majflt=%-4ld maxrss=%ld KiB\n",
           tag, r.ru_minflt, r.ru_majflt, r.ru_maxrss);
}

int main(void) {
    size_t N = 256 * 1024 * 1024;     /* 256 MiB */
    show("baseline");

    char *p = mmap(NULL, N, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    show("after mmap");                /* still ~0 RSS, no faults */

    for (size_t i = 0; i < N; i += 4096)
        p[i] = 1;                       /* one fault per page */
    show("after touch");
    return 0;
}

stdout
baseline       minflt=92       majflt=0    maxrss=2944 KiB
after mmap     minflt=92       majflt=0    maxrss=2944 KiB
after touch    minflt=65628    majflt=0    maxrss=265728 KiB

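The mmap itself is virtually free. The 256 MiB only gets paid for when you touch it: one minor fault per 4 KiB page (65536 of them; with transparent hugepages active for the region the count can be far lower, because a single fault can map 2 MiB). No major faults, because nothing comes from disk. An anonymous page that is first read maps the kernel's shared zero page and is COW'd on its first write; a page that is first written, as here, gets a fresh zeroed frame directly.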

Watch faults on a live process SHELL

# instantaneous fault counts for a pid (cols 10-13 of /proc/[pid]/stat)
awk '{print "minflt="$10, "cminflt="$11, "majflt="$12, "cmajflt="$13}' /proc/$PID/stat

# cleaner version
ps -o pid,comm,min_flt,maj_flt -p $PID

# top processes by major faults right now
ps -eo pid,comm,maj_flt --sort=-maj_flt | head

# system-wide rate, per second
sar -B 1

# raw kernel counters (cumulative since boot)
grep -E '^pgfault|^pgmajfault|^pgreuse' /proc/vmstat

# trace mmap / brk / mprotect calls during a run
strace -e mmap,mprotect,brk,munmap ./myapp 2>&1 | head -40

# live fault tracing with bcc / bpftrace
sudo bpftrace -e 'tracepoint:exceptions:page_fault_user { @[comm] = count(); }'

brk() vs mmap() how kernels hand you memory

Two syscalls actually create new VMAs in user space. Everything else — malloc, calloc, new, the garbage collector in your runtime — is a library on top of these.

brk() — the program break

Each process has a single contiguous heap whose end is the program break. brk(addr) sets it; sbrk(delta) is the SysV-style relative variant. The heap can only grow or shrink contiguously — you cannot poke a hole in the middle. This is why classic free lists in malloc rarely actually return memory to the OS: you can't free the middle of the heap, only the tail.

C // brk_demo.c
#include <stdio.h>
#include <unistd.h>

int main(void) {
    void *initial = sbrk(0);                /* current break */
    printf("initial brk = %p\n", initial);

    void *grown = sbrk(4096);                  /* extend by 1 page */
    printf("grown brk   = %p (got back %p)\n", sbrk(0), grown);

    char *p = (char*)grown;
    p[0] = 'X';                              /* triggers minor fault */
    sbrk(-4096);                              /* return the page */
    return 0;
}

mmap() — arbitrary mappings, anywhere

mmap creates a fresh VMA at any free address in the mmap region (the kernel picks unless you pass MAP_FIXED). Each call yields a region you can independently munmap. This is what glibc's malloc uses for any allocation above M_MMAP_THRESHOLD (default 128 KiB, but it's adaptive — grows up to 32 MiB based on observed usage).

Feature	brk / sbrk	mmap
VMA shape	single, contiguous, grows/shrinks at end	any number, anywhere in mmap region
Per-call cost	cheap syscall	cheap syscall + VMA insertion
Returnable	only the tail	per-mapping via munmap
Best for	many small short-lived allocs	large, long-lived, or shareable allocs
File-backed?	no	yes

malloc() Internals ptmalloc2 // glibc

Glibc's malloc is ptmalloc2, a heap allocator derived from Doug Lea's dlmalloc. It carves brk/mmap regions into chunks, organizes free chunks into bins, and serves user requests with metadata-prefixed pointers.

Chunk layout

malloc chunk
+----------------------+ <- chunk start
| prev_size            |   only valid if prev is free
+----------------------+
| size | A | M | P     |   bottom 3 bits = flags
+----------------------+ <- pointer returned to user
| user data ...        |
|                      |
| (when freed: fd, bk, |
|  fd_nextsize, bk_..) |
+----------------------+
| chunk N+1 prev_size  |
+----------------------+

flags:  P = PREV_INUSE   (1 if prev chunk is in use)
        M = IS_MMAPPED   (1 if chunk came from mmap)
        A = NON_MAIN_ARENA

The user pointer is just past the size header. free(p) reads the size_t stored 8 bytes before p (on 64-bit) to find the chunk's size and flags. This is why writing past the end of a buffer corrupts the next chunk's header: the classic heap overflow.

Bins — where free chunks live

Bin type	Sizes	Structure	Notes
tcache	per-thread, chunks ≤ 0x410 (requests up to 1032 bytes) by default	singly-linked, 7 entries each, 64 sizes	fastest path, no locks, glibc 2.26+
fastbins	chunks of 32-128 bytes by default on 64-bit (M_MXFAST can raise the cap to 160)	LIFO singly-linked	no coalescing on free
unsorted bin	any	doubly-linked	recently freed; first stop on alloc
smallbins	32 ... 1008 bytes (64-bit)	62 bins, FIFO doubly-linked	exact-size match per bin
largebins	1024+ bytes	63 bins, sorted by size	best-fit search within bin
top chunk	tail of arena	single chunk	extended via brk/mmap when too small

Arenas — per-thread heaps

The main arena sits on the brk heap. Additional threads get their own arenas (mmap'd 64 MiB regions on x86_64 by default), reducing lock contention. Each arena has its own bin structure. The number caps at 8 * ncpu for 64-bit systems — beyond that, threads share.

tcache (O(1), no lock) → fastbin (O(1) per size) → unsorted (cleanup pass) → small/large bins (scan + coalesce) → top chunk (grow heap) → mmap (if > threshold)

Watch the threshold flip

C // mmap_threshold.c
#include <stdio.h>
#include <stdlib.h>

int main(void) {
    size_t sizes[] = { 64, 4096, 100000, 200000, 8*1024*1024 };
    for (int i = 0; i < 5; i++) {
        char *p = malloc(sizes[i]);
        printf("malloc(%-8zu) = %p\n", sizes[i], p);
    }
    return 0;
}

stdout
malloc(64      ) = 0x55b8c7a8b2a0   <- main arena (brk heap)
malloc(4096    ) = 0x55b8c7a8b2f0   <- main arena (brk heap)
malloc(100000  ) = 0x55b8c7a8c310   <- main arena (brk heap)
malloc(200000  ) = 0x7f9a5c800010   <- mmap region!
malloc(8388608 ) = 0x7f9a5c000010   <- mmap region (separate)

The address jump from 0x55b8... (main heap) to 0x7f9a... (mmap region) is the threshold flip. When you free() the mmap'd ones, glibc actually calls munmap, so that memory genuinely goes back to the kernel. free() on brk-heap chunks just puts them back in the bins for reuse.

Probe glibc's heap and arenas SHELL

# dump arena stats from a running process (glibc malloc_stats via gdb)
gdb -p $PID -batch -ex 'call (void)malloc_stats()' -ex detach 2>&1 | tail -20

# total heap (brk) size and the data segment
grep -E 'VmData|VmStk|VmExe|VmLib' /proc/$PID/status

# tweak the mmap threshold so even small allocs go to mmap
MALLOC_MMAP_THRESHOLD_=$((1<<14)) ./myapp   # 16 KiB

# kill tcache to see classic ptmalloc behavior
GLIBC_TUNABLES=glibc.malloc.tcache_count=0 ./myapp

# trace every malloc / free / realloc
LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libmemusage.so ./myapp
ltrace -e malloc+free+realloc ./myapp 2>&1 | head

# or use a real heap profiler
heaptrack ./myapp && heaptrack_print heaptrack.myapp.*.gz | head -30

mmap() Deep Dive flags · advice

Four flag-pair combinations dominate everything you'll see in /proc/[pid]/maps:

file + private

MAP_PRIVATE | fd

Load file pages, COW on write. The classic libraries / config / data file mapping. Writes go to anonymous COW frames, never the file.

file + shared

MAP_SHARED | fd

Changes write back to the file via the page cache. Used for shared databases, scratch files, IPC via memory-mapped files.

anon + private

MAP_PRIVATE | MAP_ANONYMOUS

Fresh zero-filled pages. The path malloc takes for requests above the mmap threshold (128 KiB by default). Big buffers, scratch space, anything the program owns alone.

anon + shared

MAP_SHARED | MAP_ANONYMOUS

Shared between fork'd children. Classic IPC mechanism without needing a backing file. Survives fork(), dies with the last mapper.

Anonymous private mapping (the malloc workhorse)

C // anon_private.c
#include <stdio.h>
#include <sys/mman.h>

int main(void) {
    size_t N = 2 * 1024 * 1024;            /* 2 MiB */
    char *m = mmap(NULL, N,
                   PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS,
                   -1, 0);
    if (m == MAP_FAILED) { perror("mmap"); return 1; }

    m[0] = 'A';                              /* zero-page COW -> first frame */
    munmap(m, N);
    return 0;
}

File-backed shared mapping

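C // mmap_file.c
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main(int argc, char **argv) {
    if (argc < 2) { fprintf(stderr, "usage: %s FILE\n", argv[0]); return 1; }

    int fd = open(argv[1], O_RDWR);
    if (fd < 0) { perror("open"); return 1; }

    struct stat st;
    if (fstat(fd, &st) < 0) { perror("fstat"); return 1; }
    /* writing past EOF through a mapping raises SIGBUS, so insist on 16 bytes */
    if (st.st_size < 16) { fprintf(stderr, "file must be >= 16 bytes\n"); return 1; }

    char *m = mmap(NULL, st.st_size,
                   PROT_READ | PROT_WRITE,
                   MAP_SHARED, fd, 0);
    if (m == MAP_FAILED) { perror("mmap"); return 1; }

    /* edits hit the page cache; kernel writes back asynchronously */
    memset(m, 'X', 16);
    if (msync(m, st.st_size, MS_SYNC) < 0)  /* force flush for durability */
        perror("msync");

    munmap(m, st.st_size);
    close(fd);
    return 0;
}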

This skips the explicit read/write dance. Reads materialize as page faults that pull from the page cache (or disk on miss). Writes mark pages dirty; the kernel writes them back via writeback threads or on msync.

Useful flags

Flag	What it does
MAP_POPULATE	pre-faults all pages (no demand paging) — useful when you know you'll touch everything
MAP_LOCKED	like mlock: pages won't be swapped out (needs RLIMIT_MEMLOCK)
MAP_HUGETLB	allocate from the hugepage pool (2 MiB or 1 GiB pages)
MAP_FIXED_NOREPLACE	try a specific address but fail rather than displace existing mappings (use this, not raw MAP_FIXED)
MAP_NORESERVE	don't pre-reserve swap; allows huge sparse mappings; risk: a later write can fail at fault time (mmap(2) documents SIGSEGV) if no memory is available
MAP_STACK	hint that this mapping will be used as a stack

madvise() — tell the kernel what you'll do

C // madvise hints
madvise(p, n, MADV_SEQUENTIAL);   /* expect linear access; aggressive readahead */
madvise(p, n, MADV_RANDOM);       /* random access; disable readahead */
madvise(p, n, MADV_WILLNEED);     /* prefetch into page cache */
madvise(p, n, MADV_DONTNEED);     /* drop pages now (anon: zeros on next read!) */
madvise(p, n, MADV_FREE);         /* lazy reclaim: pages may persist or be zeroed */
madvise(p, n, MADV_HUGEPAGE);     /* opt this region into THP coalescing */
madvise(p, n, MADV_NOHUGEPAGE);   /* and out */

fork() and Copy-on-Write page-table magic

fork() doesn't actually copy the parent's memory. It copies the page tables, marks every writable PTE in both parent and child as read-only, and increments a refcount on each backing page. The first write by either process triggers a fault; the kernel allocates a new frame, copies the contents, points the writer's PTE at the new frame, and restores RW.

FIG_06 // FORK + COW LIFECYCLE

Demonstrating COW

C // cow_demo.c
#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>                       /* malloc */
#include <string.h>
#include <sys/wait.h>
#include <sys/resource.h>
#include <unistd.h>

int main(void) {
    size_t N = 64 * 1024 * 1024;
    char *buf = malloc(N);
    memset(buf, 'A', N);                  /* parent allocates + faults all pages */

    struct rusage r;
    getrusage(RUSAGE_SELF, &r);
    printf("parent pre-fork  minflt=%ld\n", r.ru_minflt);

    pid_t pid = fork();
    if (pid == 0) {
        /* child: read-only access touches nothing physically */
        size_t sum = 0;
        for (size_t i = 0; i < N; i += 4096) sum += buf[i];
        getrusage(RUSAGE_SELF, &r);
        printf("child  read-only  minflt=%ld\n", r.ru_minflt);

        /* now write half: triggers COW for half the pages */
        for (size_t i = 0; i < N/2; i += 4096) buf[i] = 'B';
        getrusage(RUSAGE_SELF, &r);
        printf("child  half-write minflt=%ld\n", r.ru_minflt);
        _exit(0);
    }
    waitpid(pid, NULL, 0);
    return 0;
}

stdout
parent pre-fork  minflt=16472
child  read-only  minflt=132          # only kernel/copy bookkeeping
child  half-write minflt=8324         # ~16384/2 = 8192 COW faults + noise

Read-only access in the child does not allocate a single new physical frame. The write phase triggers exactly one minor fault per touched 4 KiB page, and only the touched pages get duplicated. This is what makes fork() + exec() cheap and what made shared read-mostly server architectures viable for decades.

RSS lies after fork // know your acronyms Both processes show the same RSS in top immediately after fork(). They aren't actually using 2x the memory — the pages are shared. Look at PSS (Proportional Set Size) in /proc/[pid]/smaps for a fairer accounting that divides shared pages across the procs that map them.

Kernel Allocators buddy · slab · vmalloc

The kernel itself needs to allocate memory: page tables, inode caches, network buffers, scheduler runqueues, your VMA structures. It has its own zoo of allocators because the constraints (interrupt context, atomic vs sleeping, physically contiguous, NUMA-local) are different from user space.

Buddy allocator — the foundation

The buddy system is the bottom of the kernel allocator stack. It manages physical page frames in power-of-two-sized blocks called orders, from order-0 (1 page = 4 KiB) up to order-10 (1024 pages = 4 MiB).

FIG_07 // BUDDY ALLOCATOR ORDERS

Need 12 KiB? Round up to 4 pages = order 2. The allocator finds a free order-2 block. If none exist, it splits an order-3 into two order-2 buddies, hands one out, queues the other. On free, if the buddy is also free, they recombine. This is the source of memory fragmentation — a system can have plenty of free order-0 pages and still fail an order-5 allocation if no contiguous run exists.

See /proc/buddyinfo for the live state per zone:

$ cat /proc/buddyinfo
Node 0, zone      DMA      1      0      0      1      2  ...
Node 0, zone    DMA32  10234   8501   6442   3201    998  ...
Node 0, zone   Normal  88412  41203   9871    412     22  ...

Slab / SLUB — cache of fixed-size objects

Most kernel allocations aren't full pages; they're small structs (192 bytes for a dentry, 216 for a vm_area_struct in the listing below, and so on). The slab allocator sits on top of buddy: it asks for pages in bulk, then carves them into fixed-size object slots, keeping per-CPU and per-NUMA-node caches to avoid lock contention.

$ sudo head /proc/slabinfo
slabinfo - version: 2.1
# name              active_objs num_objs objsize objperslab pagesperslab
dentry              412091      412290      192        21            1
inode_cache         206301      206301      584        14            2
kmalloc-8k             280         280     8192         4            8
kmalloc-1k            3104        3104     1024        16            4
task_struct           1242        1242     5824         5            8
mm_struct              312         312     1024        16            4
vm_area_struct        9412        9412      216        18            1

kmalloc / kvmalloc / vmalloc

API	Returns	Used for
kmalloc(n, GFP_*)	physically contiguous	small kernel allocs (DMA buffers, structs); backed by SLUB
vmalloc(n)	virtually contiguous, physically scattered	large kernel buffers where physical contiguity isn't needed
kvmalloc(n)	tries kmalloc, falls back to vmalloc	generic large alloc, "I don't care which"
alloc_pages(gfp, order)	raw page frames	page-table builders, network ring buffers

GFP flags — get-free-pages context

Flag	Means
GFP_KERNEL	can sleep, can do I/O, can reclaim — the default for process context
GFP_ATOMIC	cannot sleep (interrupt context); fail fast if no free page available
GFP_NOIO	can sleep but not start I/O (avoid recursion in I/O paths)
GFP_DMA	must come from low-memory zone usable by legacy DMA controllers
__GFP_ZERO	zero the returned page
__GFP_NOWARN	don't dump a stacktrace if alloc fails

Page cache

Every byte you read from a regular file lives, transiently, in the page cache — the kernel's RAM-backed cache of file pages indexed by (inode, offset). It's not a separate pool; it's just pages from the buddy allocator that happen to be reclaimable. free's "available" number includes most of it.

This is also why cat largefile > /dev/null warms the cache, why running a benchmark twice usually gets faster the second time, and why sync; echo 3 > /proc/sys/vm/drop_caches (as root) is the standard way to empty it before a measurement. The sync matters because drop_caches only discards clean pages.

Memory Pressure, Swap, OOM reclaim · watermarks

When demand for pages exceeds supply, Linux reclaims. Reclaim is handled by per-NUMA-node kswapd threads (background, low watermark) and direct reclaim (foreground, allocation-time, when watermarks crash).

What gets reclaimed

Clean file-backed pages — just drop them. Source is on disk.
Dirty file-backed pages — write back, then drop.
Anonymous pages — write to swap, then drop. If no swap, can't reclaim — this is when OOM looms.

Reclaim uses active/inactive LRU lists (per zone historically, per NUMA node since kernel 4.8), biased by the PTE accessed bit and adjusted by vm.swappiness (0-200, default 60). Lower values discourage anon swap-out in favor of reclaiming file cache.

Watermarks

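Each zone has three watermarks: min, low, high. Once free pages fall below min, only allocations allowed to dip into the reserves (atomic kernel contexts, for example) succeed without reclaiming first. Dropping below low wakes kswapd, which keeps reclaiming until free pages climb back above high. See /proc/zoneinfo.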

The OOM killer

If reclaim can't free enough memory, the OOM killer picks a victim and sends SIGKILL. Each process's score is exposed in /proc/[pid]/oom_score; it is derived largely from RSS and tunable via /proc/[pid]/oom_score_adj (-1000 to +1000; -1000 = immune).

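C // oom_protect.c
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void) {
    int fd = open("/proc/self/oom_score_adj", O_WRONLY);
    if (fd < 0) { perror("open"); return 1; }
    /* lowering the value generally requires CAP_SYS_RESOURCE */
    if (write(fd, "-500", 4) != 4)            /* less likely to be killed */
        perror("write");
    close(fd);
    /* ... do important work ... */
    return 0;
}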

/proc/meminfo — what the system thinks it has

/proc/meminfo
MemTotal:       32896132 kB    # all RAM the kernel can see
MemFree:         1204896 kB    # truly unused
MemAvailable:   18472304 kB    # estimate of what allocs *could* get
Buffers:          412004 kB    # block-device cache
Cached:         15823008 kB    # page cache (file pages)
SwapCached:        12492 kB    # pages once in swap, now back in memory
Active:         12409832 kB    # recently used (anon + file)
Inactive:        4988124 kB    # candidates for reclaim
AnonPages:       9912004 kB    # process anonymous (heap, stack, anon mmap)
Mapped:          1284820 kB    # mapped from files (libs, mmap'd files)
Slab:             982408 kB    # kernel slab caches
SReclaimable:     682112 kB    # slabs the kernel can drop
SUnreclaim:       300296 kB    # slabs that must stay
KernelStack:       28192 kB    # per-task kernel stacks
PageTables:       142884 kB    # the actual page-table pages!
SwapTotal:       8388604 kB
SwapFree:        8261940 kB
Dirty:              2884 kB    # waiting to be written back
Writeback:             0 kB    # currently being written

Note PageTables — that's the kernel's accounting of RAM consumed by your process's PGDs/PUDs/PMDs/PTEs. A process with very sparse mappings can spend non-trivial RAM just on the page tables.

cgroup memory controller

cgroups v2 lets you cap memory per group (typically per container or per service). Hitting memory.max triggers reclaim within the cgroup; failing reclaim invokes the cgroup OOM killer (only kills processes in the cgroup).

SHELL // cgroups v2
# limit a service to 512 MiB
mkdir /sys/fs/cgroup/myservice
echo 536870912 > /sys/fs/cgroup/myservice/memory.max
echo $$        > /sys/fs/cgroup/myservice/cgroup.procs
# now this shell + descendants are capped

Inspect pressure, swap, OOM SHELL

# PSI: how much time we're stalled on memory
cat /proc/pressure/memory
# some avg10=0.00 ... full avg10=0.00 ...   "full" non-zero = real pain

# actively swapping right now? si/so columns are kB/s
vmstat 1 5

# current watermarks per zone
grep -E 'Node|min|low|high' /proc/zoneinfo | head -20

# did the OOM killer run since boot?
dmesg -T | grep -i 'killed process\|out of memory'
journalctl -k -b | grep -i 'invoked oom'

# who would the OOM killer pick right now?
for p in /proc/[0-9]*; do
  s=$(cat $p/oom_score 2>/dev/null) || continue
  echo "$s $(cat $p/comm 2>/dev/null) $(basename $p)"
done | sort -n | tail -10

# cgroup-scoped pressure and OOM events
cat /sys/fs/cgroup/myservice/memory.pressure
cat /sys/fs/cgroup/myservice/memory.events

NUMA // When Memory Has Geography multi-socket

Multi-socket servers don't have one big memory pool. Each socket has its own attached DRAM (the local node); accessing another socket's memory traverses the inter-socket link (UPI on Intel, Infinity Fabric on AMD). The latency penalty is real — typically 1.5x to 2x for remote access.

Linux models this as NUMA nodes. numactl --hardware shows the topology:

$ numactl --hardware
available: 2 nodes (0-1)
node 0 cpus: 0 1 2 3 4 5 6 7
node 0 size: 16345 MB
node 1 cpus: 8 9 10 11 12 13 14 15
node 1 size: 16384 MB
node distances:
node     0    1
  0:    10   21
  1:    21   10

The distance matrix is relative, not absolute: 10 is local by convention, and 21 here means the firmware rates a remote access at roughly 2.1x the local cost.
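The same numbers are exposed per node in sysfs, so the ratio can be computed directly. A minimal sketch, assuming the usual layout of one distance file per node under /sys/devices/system/node; the function names are made up for this example.

```shell
# distance_ratios "10 21": print each distance divided by the local value of 10.
distance_ratios() {
    echo "$1" | awk '{
        for (i = 1; i <= NF; i++)
            printf "%s%.1f", (i > 1 ? " " : ""), $i / 10
        print ""
    }'
}

# numa_ratios: one line per node, e.g. "node0: 1.0 2.1".
numa_ratios() {
    for f in /sys/devices/system/node/node[0-9]*/distance; do
        [ -r "$f" ] || continue
        n=${f%/distance}   # strip the trailing /distance
        n=${n##*/}         # keep only the nodeN directory name
        echo "$n: $(distance_ratios "$(cat "$f")")"
    done
}
```

Treat the result as a hint about topology rather than a measured latency; the values come from firmware tables.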

Allocation policies

default

first-touch local

Allocate on the node where the touching thread runs. Good default. Can go wrong if one thread initializes the memory and threads on another node consume it.

--membind

strict

Only alloc from listed nodes; fail if none have room. Forces locality at cost of flexibility.

--preferred

soft preference

Try one node, fall back to others. Most useful when you want a process to favor one node without risking allocation failures.

--interleave

round-robin

Striped across nodes. Good for big shared buffers where bandwidth matters more than latency.
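On the command line, each policy maps to one numactl flag. The sketch below spells out that mapping; numa_policy_args is a hypothetical helper, the flags themselves are real numactl options, and the default node list of 0 is only a placeholder.

```shell
# numa_policy_args POLICY [NODES]: print the numactl flag for a policy name.
# "default" prints nothing because first-touch needs no flag.
numa_policy_args() {
    case "$1" in
        default)    echo "" ;;
        membind)    echo "--membind=${2:-0}" ;;
        preferred)  echo "--preferred=${2:-0}" ;;
        interleave) echo "--interleave=${2:-all}" ;;
        *)          echo "unknown policy: $1" >&2; return 1 ;;
    esac
}
```

Usage would look like numactl $(numa_policy_args interleave) ./myapp, which stripes the process's allocations across all nodes.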

C // numa_aware.c // -lnuma
#include <numa.h>
#include <stdio.h>
#include <string.h>

int main(void) {
    if (numa_available() < 0) return 1;   /* no NUMA support */

    int nnodes = numa_max_node() + 1;
    printf("%d NUMA nodes\n", nnodes);

    /* pin this thread to node 0 before touching any memory */
    if (numa_run_on_node(0) < 0) return 1;

    /* ask for 1 GiB on node 0; libnuma falls back to other
       nodes unless numa_set_strict(1) is set */
    size_t len = 1UL << 30;
    void *p = numa_alloc_onnode(len, 0);
    if (p == NULL) return 1;

    /* first touch faults the pages in from local DRAM */
    memset(p, 0, len);

    numa_free(p, len);
    return 0;
}

For the common case (single-process service, multiple threads), Linux's default first-touch policy plus automatic NUMA balancing (autonuma) usually does the right thing: it migrates pages toward the cores that touch them most. For high-end workloads (databases, in-memory analytics), explicit pinning with numactl or libnuma can be the difference between 60% and 100% of memory bandwidth.

Inspect per-process NUMA placement SHELL

# topology + node memory totals
numactl --hardware

# RSS broken down by node, per process
numastat -p $PID

# every mapping with its current node distribution
cat /proc/$PID/numa_maps | head

# node-level allocator stats: numa_hit, numa_miss, numa_foreign
numastat

# per-node meminfo-style breakdown
numastat -m

# pin a workload manually
numactl --cpunodebind=0 --membind=0 ./myapp

# how many loads were served from remote DRAM (a count, not cycles;
# Intel-specific event name, check `perf list` on your CPU)
perf stat -e mem_load_l3_miss_retired.remote_dram ./myapp

# is autonuma migrating pages around?
grep numa_ /proc/vmstat
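The counters in /proc/vmstat are cumulative since boot, so a single grep only tells you whether migration has ever happened. To see whether it is happening now, sample twice and take the difference. A rough sketch, assuming a kernel built with NUMA balancing so that numa_pages_migrated is present; both function names are invented for this example.

```shell
# vmstat_counter NAME [FILE]: print one counter from a /proc/vmstat-style file.
vmstat_counter() {
    awk -v name="$1" '$1 == name { print $2 }' "${2:-/proc/vmstat}"
}

# numa_migration_rate [SECONDS]: pages migrated per second over the window
# (default 5 s). Missing counters are treated as 0.
numa_migration_rate() {
    secs=${1:-5}
    a=$(vmstat_counter numa_pages_migrated); a=${a:-0}
    sleep "$secs"
    b=$(vmstat_counter numa_pages_migrated); b=${b:-0}
    echo $(( (b - a) / secs ))
}
```

Whether balancing is switched on at all can be read from /proc/sys/kernel/numa_balancing (0 means off).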

Tooling // Where to Poke procfs · perf · bcc

Per-process

Path / tool	What you get
/proc/[pid]/maps	VMA list, perms, file backing
/proc/[pid]/smaps	per-VMA RSS / PSS / swap / hugepages / locked
/proc/[pid]/status	VmSize / VmRSS / VmHWM / VmData / VmStk / VmExe
/proc/[pid]/statm	compact, in pages: size, resident, shared, text, lib, data, dt (lib and dt are always 0)
/proc/[pid]/pagemap	virt-to-PFN lookup, swap status, present bit (PFNs need root)
/proc/[pid]/oom_score_adj	tune OOM-kill priority
pmap -x [pid]	readable VMA summary

System-wide

Path / tool	What you get
/proc/meminfo	top-level memory accounting
/proc/buddyinfo	free pages per order per zone
/proc/zoneinfo	watermarks, pages, vmstats per zone
/proc/slabinfo	per-cache slab usage (root)
/proc/vmstat	counters: pgfault, pgmajfault, pgscan, pswpin, etc.
free -h	quick total / used / free / buff/cache / available
vmstat 1	live si/so (swap I/O), bi/bo, free, buff, cache
sar -B 1	paging stats: pgpgin, pgpgout, fault, majflt, pgfree, pgscan
numastat -m	per-node memory breakdown
slabtop	live slab cache top-N
perf mem record	load/store address sampling, NUMA hit/miss
bcc tools (memleak, slabratetop)	eBPF-based dynamic tracing

Quick recipes

SHELL
# What's allocating in the kernel right now?
sudo slabtop -o | head -20

# What's the actual physical frame backing a virtual address?
# (root only; uses /proc/[pid]/pagemap)
sudo ./pagemap-dump $(pidof myapp) 0x7f3e9c1a0000

# Force-flush page cache for benchmarking:
sync && echo 3 | sudo tee /proc/sys/vm/drop_caches

# Watch a process's RSS climb:
while true; do grep VmRSS /proc/$PID/status; sleep 1; done

# Major-fault hot processes:
ps -eo pid,comm,maj_flt,min_flt --sort=-maj_flt | head

# Are we swapping right now?
vmstat 1 5   # si/so columns; non-zero = active swap I/O

# Is THP coalescing pages?
grep AnonHugePages /proc/meminfo
cat /sys/kernel/mm/transparent_hugepage/enabled
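The pagemap-dump binary in the recipes is not a standard tool. As a sketch of what such a helper might do, the functions below read one 64-bit entry from /proc/[pid]/pagemap and decode the documented bits: bit 63 is present, bit 62 is swapped, and bits 0-54 hold the PFN. The function names are made up, and the code assumes a little-endian machine and an od that supports -tx8.

```shell
# pagemap_entry PID VADDR: print the raw entry for the page containing VADDR
# as 16 hex digits. Run as root, otherwise the kernel zeroes the PFN field.
pagemap_entry() {
    page_size=$(getconf PAGESIZE)
    index=$(( $2 / page_size ))          # one 8-byte entry per virtual page
    dd if="/proc/$1/pagemap" bs=8 skip="$index" count=1 2>/dev/null |
        od -An -tx8 | tr -d ' \n'
}

# decode_pagemap_entry HEX: decode an entry given as 16 hex digits.
# The value is split in two so the arithmetic never overflows a signed 64-bit int.
decode_pagemap_entry() {
    top=$(( 0x$(printf '%s' "$1" | cut -c1-2) ))    # bits 56-63
    low=$(( 0x$(printf '%s' "$1" | cut -c3-16) ))   # bits 0-55
    printf 'present=%d swapped=%d pfn=0x%x\n' \
        $(( (top >> 7) & 1 )) \
        $(( (top >> 6) & 1 )) \
        $(( low & ((1 << 55) - 1) ))
}
```

From a root shell, decode_pagemap_entry "$(pagemap_entry "$(pidof myapp)" 0x7f3e9c1a0000)" prints the flags and frame number for that address. The pfn field is only meaningful when present=1.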