Every running process believes it owns the entire machine. It can be loaded at the same addresses as any other process, reference the same symbols, and use the same stack-pointer values. The trick that makes this work is virtual memory: the process sees a flat private address space, and the kernel plus the CPU's Memory Management Unit lazily map regions of that virtual space onto the physical DRAM that actually exists.
Three things fall out of this for free.
Two shapes to keep in your head: page (virtual, 4 KiB by default on x86_64) and page frame (physical, the same 4 KiB chunk of DRAM). Translation happens at page granularity. Anything finer is the allocator's problem.
x86_64 hardware uses 48-bit virtual addresses (5-level paging extends this to 57, but most kernels still ship 4-level). The high 16 bits must be a sign-extension of bit 47 — addresses must be canonical. Anything else faults. This carves the 64-bit space into two halves with a giant unmappable hole in the middle.
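The canonical-address rule can be sketched as a small check. This is an illustration only, assuming 4-level paging (48-bit addresses); the helper names `is_canonical48` and `in_upper_half` are made up for this example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Canonical under 4-level paging: bits 63..47 are all 0 or all 1,
 * i.e. the high 16 bits are a sign-extension of bit 47. */
static inline bool is_canonical48(uint64_t va)
{
    uint64_t top = va >> 47;            /* the 17 bits 63..47 */
    return top == 0 || top == 0x1ffff;
}

/* True for the upper half, false for the lower half.
 * Only meaningful for canonical input. */
static inline bool in_upper_half(uint64_t va)
{
    assert(is_canonical48(va));
    return (va >> 63) != 0;
}
```

Any address for which `is_canonical48` returns false falls in the unmappable hole between the two halves.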
| Segment | Source | Perms | Backed by | Notes |
|---|---|---|---|---|
| .text | ELF LOAD | r-x | file (private) | shared between procs of same binary |
| .rodata | ELF LOAD | r-- | file | string literals, const tables |
| .data | ELF LOAD | rw- | file (COW) | initialized globals/statics |
| .bss | kernel | rw- | anonymous (zero) | no file bytes; zeroed on first touch |
| heap | brk/sbrk | rw- | anonymous | extended in pages, never released piecemeal |
| mmap | mmap() | varies | file or anon | each mapping can be unmapped individually via munmap() |
| stack | kernel (auto) | rw- | anonymous | grows on fault, capped by RLIMIT_STACK |
| vDSO/vvar | kernel | r-x / r-- | shared kernel | fast-path syscalls (gettimeofday, etc.) |
execve(). Two runs of the same binary won't share virtual addresses for anything. What they do share is the physical pages backing .text, which both processes map from the same page-cache copy of the underlying ELF.
```sh
# raw VMA list of the cat process itself (use /proc/$$/maps for the current shell)
cat /proc/self/maps

# any process by pid (readable, with sizes and perms)
pmap -X $PID

# just the named regions
grep -E '\[(heap|stack|vdso|vvar)\]' /proc/$PID/maps

# is ASLR on? 0=off 1=conservative 2=full
cat /proc/sys/kernel/randomize_va_space

# run the same binary twice and diff the layout
diff <(./myapp & sleep 0.1; cat /proc/$!/maps; kill $!) \
     <(./myapp & sleep 0.1; cat /proc/$!/maps; kill $!)

# max stack size for this shell
ulimit -s
```
The kernel exposes the live VMA list (Virtual Memory Areas — the kernel's internal representation of contiguous mapped regions) through procfs. Each line is one VMA. Compile and run this:
```c
// dump_maps.c
#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

char initialized_global[] = "hello";      /* .data */
char zero_global[4096];                   /* .bss */
const char *ro_string = "read me only";   /* the literal lives in .rodata */

int main(int argc, char **argv) {
    int on_stack = 42;
    char *on_heap = malloc(128);
    char *big_heap = malloc(8 * 1024 * 1024);   /* > mmap threshold */

    printf("main()         = %p  .text\n", (void*)main);
    printf("ro_string      = %p  .rodata\n", (void*)ro_string);
    printf("init_global    = %p  .data\n", (void*)initialized_global);
    printf("zero_global    = %p  .bss\n", (void*)zero_global);
    printf("on_heap (sml)  = %p  heap (brk)\n", (void*)on_heap);
    printf("big_heap (lrg) = %p  mmap region\n", (void*)big_heap);
    printf("on_stack       = %p  stack\n", (void*)&on_stack);

    /* dump this process's own VMA list */
    FILE *m = fopen("/proc/self/maps", "r");
    if (!m) { perror("fopen"); return 1; }
    char line[512];
    while (fgets(line, sizeof line, m)) fputs(line, stdout);
    fclose(m);
    return 0;
}
```
Sample output (truncated, addresses shown in low form for clarity):
```text
main()         = 0x55b2c4a01169  .text
ro_string      = 0x55b2c4a02008  .rodata
init_global    = 0x55b2c4a04010  .data
zero_global    = 0x55b2c4a04020  .bss
on_heap (sml)  = 0x55b2c5e7a2a0  heap (brk)
big_heap (lrg) = 0x7f3e9b800010  mmap region
on_stack       = 0x7ffd4f3a8b8c  stack
55b2c4a00000-55b2c4a01000 r--p 00000000 fd:00 123  /home/u/dump_maps    # ELF header
55b2c4a01000-55b2c4a02000 r-xp 00001000 fd:00 123  /home/u/dump_maps    # .text
55b2c4a02000-55b2c4a03000 r--p 00002000 fd:00 123  /home/u/dump_maps    # .rodata
55b2c4a03000-55b2c4a04000 rw-p 00003000 fd:00 123  /home/u/dump_maps    # .data
55b2c4a04000-55b2c4a05000 rw-p 00000000 00:00 0                         # .bss (anon)
55b2c5e7a000-55b2c5e9b000 rw-p 00000000 00:00 0    [heap]
7f3e9b7ff000-7f3e9c000000 rw-p 00000000 00:00 0                         # 8 MiB anon mmap
7f3e9c1a0000-7f3e9c1c8000 r--p 00000000 fd:00 456  /usr/lib64/libc.so.6
7f3e9c1c8000-7f3e9c34f000 r-xp 00028000 fd:00 456  /usr/lib64/libc.so.6
7ffd4f389000-7ffd4f3aa000 rw-p 00000000 00:00 0    [stack]
7ffd4f3d4000-7ffd4f3d8000 r--p 00000000 00:00 0    [vvar]
7ffd4f3d8000-7ffd4f3da000 r-xp 00000000 00:00 0    [vdso]
```
Each row is start-end perms offset dev inode pathname. Permissions are rwxp or rwxs (private vs shared). A private writable file mapping starts as a copy-on-write of the file pages — the moment you write, the kernel hands you a private anon page and forgets about the file for that location.
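The row format above maps directly onto a single `sscanf` call. The following is a minimal sketch, not a robust parser: the struct and function names are hypothetical, and it assumes an LP64 Linux target and pathnames without spaces (a path containing spaces would be cut at the first space).

```c
#include <assert.h>
#include <stdio.h>

struct vma_line {
    unsigned long start, end, offset, inode;
    unsigned int dev_major, dev_minor;
    char perms[5];        /* e.g. "r-xp" */
    char path[256];       /* empty string for anonymous mappings */
};

/* Parse one row of /proc/[pid]/maps.
 * Returns 1 on success, 0 if the line does not look like a maps row. */
static inline int parse_maps_line(const char *line, struct vma_line *v)
{
    assert(line && v);
    v->path[0] = '\0';
    int n = sscanf(line, "%lx-%lx %4s %lx %x:%x %lu %255s",
                   &v->start, &v->end, v->perms, &v->offset,
                   &v->dev_major, &v->dev_minor, &v->inode, v->path);
    return n >= 7;        /* the pathname (8th field) is optional */
}
```

Feeding it each line read with `fgets` from `/proc/self/maps` yields the mapping size as `end - start` and lets you filter on `perms` without string splitting.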
Want richer info? /proc/[pid]/smaps breaks down each VMA's RSS, PSS, swap, anon vs file pages, hugepage backing, locked status.
```sh
# first VMA's full breakdown: Size, Rss, Pss, Swap, Anon, etc.
awk '/^[0-9a-f]+-[0-9a-f]+/{n++} n==2{exit} {print}' /proc/$PID/smaps

# single-shot summary across all mappings
cat /proc/$PID/smaps_rollup

# biggest mappings by RSS
pmap -x $PID | sort -k3 -n -r | head

# every shared library actually loaded
awk '/\.so/{print $6}' /proc/$PID/maps | sort -u

# trace ld.so as it pulls libraries in
LD_DEBUG=files ./myapp 2>&1 | grep 'calling init'
```
A virtual address is not a number that gets added to a base register. It's a structured key into a multi-level radix tree. On x86_64 with 4-level paging, a 48-bit virtual address breaks into five fields:
Each level is a 4 KiB table of 512 64-bit entries. The CPU walks the tree on every translation that misses the TLB.
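The five fields are four 9-bit table indices (512 entries per level) followed by a 12-bit page offset. A minimal sketch of the split and its inverse, assuming 4-level paging with 4 KiB pages; `struct va_parts`, `split_va`, and `join_va` are names invented for this example, with the field names following the Linux PGD/PUD/PMD/PTE terminology.

```c
#include <assert.h>
#include <stdint.h>

struct va_parts { unsigned pgd, pud, pmd, pte, offset; };

/* Split a 48-bit virtual address into its page-walk indices. */
static inline struct va_parts split_va(uint64_t va)
{
    struct va_parts p;
    p.pgd    = (va >> 39) & 0x1ff;   /* bits 47..39 */
    p.pud    = (va >> 30) & 0x1ff;   /* bits 38..30 */
    p.pmd    = (va >> 21) & 0x1ff;   /* bits 29..21 */
    p.pte    = (va >> 12) & 0x1ff;   /* bits 20..12 */
    p.offset = va & 0xfff;           /* bits 11..0  */
    return p;
}

/* Rebuild the canonical address from its parts. */
static inline uint64_t join_va(struct va_parts p)
{
    assert(p.pgd < 512 && p.pud < 512 && p.pmd < 512 && p.pte < 512);
    assert(p.offset < 4096);
    uint64_t va = ((uint64_t)p.pgd << 39) | ((uint64_t)p.pud << 30) |
                  ((uint64_t)p.pmd << 21) | ((uint64_t)p.pte << 12) | p.offset;
    if (va & (1ULL << 47))           /* sign-extend bit 47 into bits 63..48 */
        va |= 0xffffULL << 48;
    return va;
}
```

Two addresses that share the same `pgd`, `pud`, and `pmd` values are translated through the same last-level page table.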
| Bit | Name | Meaning |
|---|---|---|
| 0 | P | present — if 0, fault on access (other bits may still be meaningful to the kernel) |
| 1 | R/W | writable when 1, read-only when 0 |
| 2 | U/S | user accessible when 1, kernel-only when 0 |
| 3 | PWT | write-through caching |
| 4 | PCD | cache disable |
| 5 | A | accessed — set by hardware on read/write, cleared by kernel for LRU |
| 6 | D | dirty — set by hardware on write, used for writeback |
| 7 | PS | page size — if 1 at PMD level, this is a 2 MiB hugepage; at PUD, a 1 GiB page |
| 8 | G | global — not flushed from TLB on CR3 reload (kernel mappings) |
| 12-51 | PFN | physical page frame number, shifted left 12 bits gives physical address |
| 63 | NX | no-execute — faults on instruction fetch when set |
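The bit layout in the table can be decoded with a few masks. This is a sketch for reading the table, not kernel code: the `PTE_*` macro names and helper functions are hypothetical (the kernel's own names differ), and it assumes the 4 KiB-page entry format shown above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define PTE_P        (1ULL << 0)
#define PTE_RW       (1ULL << 1)
#define PTE_US       (1ULL << 2)
#define PTE_A        (1ULL << 5)
#define PTE_D        (1ULL << 6)
#define PTE_NX       (1ULL << 63)
#define PTE_PFN_MASK 0x000ffffffffff000ULL   /* bits 12..51 */

/* Physical address of the frame: the PFN field already sits at bit 12. */
static inline uint64_t pte_phys_addr(uint64_t pte)
{
    assert(pte & PTE_P);             /* the PFN is only meaningful if present */
    return pte & PTE_PFN_MASK;
}

/* Can user-mode code write through this entry? */
static inline bool pte_user_writable(uint64_t pte)
{
    return (pte & PTE_P) && (pte & PTE_RW) && (pte & PTE_US);
}

/* Can instructions be fetched through this entry? */
static inline bool pte_executable(uint64_t pte)
{
    return (pte & PTE_P) && !(pte & PTE_NX);
}
```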
If the PMD entry has bit 7 (PS) set, the walk stops there: bits 20:0 of the virtual address become a 21-bit offset into a 2 MiB hugepage. Same idea one level up gives 1 GiB pages. Hugepages dramatically reduce TLB pressure for workloads with large working sets — one TLB entry covers 512x or 262144x more memory.
The Memory Management Unit is hardware. Every load and store goes through it. The page-table walk above costs up to four serial memory loads, which is obscenely slow at modern clock rates, so the MMU caches recent translations in the Translation Lookaside Buffer.
Loading CR3 with a new process's PGD physical address invalidates every non-global TLB entry (unless PCIDs are in use, in which case entries are tagged per address space and can survive the switch). Kernel mappings have the G bit set so they survive context switches.
When the kernel changes a page-table entry on one CPU, it needs to invalidate the cached translation on every CPU that might have it. This is a TLB shootdown, implemented via inter-processor interrupts. Shootdowns are expensive, and workloads that munmap or mprotect aggressively pay for them visibly.
```sh
# TLB miss rates while running a binary
perf stat -e dTLB-load-misses,dTLB-loads,iTLB-load-misses,iTLB-loads ./myapp

# expressed as miss ratio
perf stat -x, -e dTLB-loads,dTLB-load-misses ./myapp 2>&1 \
  | awk -F, 'NR==1{loads=$1} NR==2{print "dTLB miss ratio:", $1/loads}'

# is THP coalescing pages?
grep -E 'AnonHugePages|ShmemHugePages' /proc/meminfo
cat /sys/kernel/mm/transparent_hugepage/enabled

# hugepage pools (2 MiB / 1 GiB)
ls /sys/kernel/mm/hugepages

# is the CPU using PCIDs?
grep -m1 -o 'pcid' /proc/cpuinfo
```
A page fault is the CPU's way of telling the kernel "I tried to translate an address and the PTE didn't satisfy the access." It's not always an error — faults are how Linux implements demand paging, copy-on-write, lazy stack growth, and swap-in.
```c
// demand_paging.c
#define _GNU_SOURCE
#include <stdio.h>
#include <sys/resource.h>
#include <sys/mman.h>

static void show(const char *tag) {
    struct rusage r;
    getrusage(RUSAGE_SELF, &r);
    printf("%-14s minflt=%-8ld majflt=%-4ld maxrss=%ld KiB\n",
           tag, r.ru_minflt, r.ru_majflt, r.ru_maxrss);
}

int main(void) {
    size_t N = 256 * 1024 * 1024;   /* 256 MiB */
    show("baseline");
    char *p = mmap(NULL, N, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED) { perror("mmap"); return 1; }
    show("after mmap");             /* still ~0 RSS, no faults */
    for (size_t i = 0; i < N; i += 4096)
        p[i] = 1;                   /* one fault per page */
    show("after touch");
    return 0;
}
```
```text
baseline       minflt=92       majflt=0    maxrss=2944 KiB
after mmap     minflt=92       majflt=0    maxrss=2944 KiB
after touch    minflt=65628    majflt=0    maxrss=265728 KiB
```
The mmap itself is virtually free. The 256 MiB only gets paid for when you touch it — one minor fault per 4 KiB page (65536 of them). No major faults, because nothing comes from disk. Anonymous pages are filled from the kernel's zero page on first read, then COW'd on first write.
```sh
# instantaneous fault counts for a pid (cols 10-13 of /proc/[pid]/stat)
awk '{print "minflt="$10, "cminflt="$11, "majflt="$12, "cmajflt="$13}' /proc/$PID/stat

# cleaner version
ps -o pid,comm,min_flt,maj_flt -p $PID

# top processes by major faults right now
ps -eo pid,comm,maj_flt --sort=-maj_flt | head

# system-wide rate, per second
sar -B 1

# raw kernel counters (cumulative since boot)
grep -E '^pgfault|^pgmajfault|^pgreuse' /proc/vmstat

# trace mmap / brk / mprotect calls during a run
strace -e mmap,mprotect,brk,munmap ./myapp 2>&1 | head -40

# live fault tracing with bcc / bpftrace
sudo bpftrace -e 'tracepoint:exceptions:page_fault_user { @[comm] = count(); }'
```
Two syscalls actually create new VMAs in user space. Everything else — malloc, calloc, new, the garbage collector in your runtime — is a library on top of these.
Each process has a single contiguous heap whose end is the program break. brk(addr) sets it; sbrk(delta) is the SysV-style relative variant. The heap can only grow or shrink contiguously — you cannot poke a hole in the middle. This is why classic free lists in malloc rarely actually return memory to the OS: you can't free the middle of the heap, only the tail.
```c
// brk_demo.c
#include <stdio.h>
#include <unistd.h>

int main(void) {
    void *initial = sbrk(0);            /* current break */
    printf("initial brk = %p\n", initial);

    void *grown = sbrk(4096);           /* extend by 1 page; returns the OLD break */
    if (grown == (void *)-1) { perror("sbrk"); return 1; }
    printf("grown brk   = %p (got back %p)\n", sbrk(0), grown);

    char *p = (char*)grown;
    p[0] = 'X';                         /* triggers minor fault */

    sbrk(-4096);                        /* return the page */
    return 0;
}
```
mmap creates a fresh VMA at any free address in the mmap region (the kernel picks unless you pass MAP_FIXED). Each call yields a region you can independently munmap. This is what glibc's malloc uses for any allocation above M_MMAP_THRESHOLD (default 128 KiB, but it's adaptive — grows up to 32 MiB based on observed usage).
| Feature | brk / sbrk | mmap |
|---|---|---|
| VMA shape | single, contiguous, grows/shrinks at end | any number, anywhere in mmap region |
| Per-call cost | cheap syscall | cheap syscall + VMA insertion |
| Returnable | only the tail | per-mapping via munmap |
| Best for | many small short-lived allocs | large, long-lived, or shareable allocs |
| File-backed? | no | yes |
Glibc's malloc is ptmalloc2, a heap allocator derived from Doug Lea's dlmalloc. It carves brk/mmap regions into chunks, organizes free chunks into bins, and serves user requests with metadata-prefixed pointers.
```text
malloc chunk

+------------------------+ <- chunk start
| prev_size              |    only valid if prev is free
+------------------------+
| size        | A | M | P |   bottom 3 bits = flags
+------------------------+ <- pointer returned to user
| user data ...          |
|                        |
| (when freed: fd, bk,   |
|  fd_nextsize, bk_..)   |
+------------------------+
| chunk N+1 prev_size    |
+------------------------+

flags: P = PREV_INUSE      (1 if prev chunk is in use)
       M = IS_MMAPPED      (1 if chunk came from mmap)
       A = NON_MAIN_ARENA
```
The user pointer is just past the size header. free(p) reads the size_t stored immediately before p (at p - 8 on 64-bit) to find the size and flags. This is why writing past the end of a buffer corrupts the next chunk's header: the classic heap overflow.
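The header arithmetic can be sketched without touching allocator internals. The constants below are assumptions for 64-bit glibc (8-byte size field, 16-byte alignment, 32-byte minimum chunk); the function names are made up for this example and are not glibc API.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed 64-bit glibc layout; other platforms and allocators differ. */
enum { SIZE_SZ = 8, ALIGN_MASK = 15, MIN_CHUNK = 32 };
#define CHUNK_FLAG_BITS ((size_t)7)      /* A | M | P live in the low 3 bits */

/* Chunk size that backs a request of `req` bytes:
 * add the size header, round up to 16, enforce the minimum. */
static inline size_t chunk_size_for(size_t req)
{
    assert(req < ((size_t)1 << 47));     /* keep the sketch free of overflow */
    size_t sz = (req + SIZE_SZ + ALIGN_MASK) & ~(size_t)ALIGN_MASK;
    return sz < MIN_CHUNK ? MIN_CHUNK : sz;
}

/* Strip the flag bits from a raw size field. */
static inline size_t header_to_size(size_t size_field)
{
    return size_field & ~CHUNK_FLAG_BITS;
}

/* P flag: is the previous chunk in use? */
static inline int header_prev_inuse(size_t size_field)
{
    return (int)(size_field & 1);
}
```

Under these assumptions a 64-byte request occupies an 80-byte (0x50) chunk, so consecutive small allocations from the same arena tend to sit 0x50 apart.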
| Bin type | Sizes | Structure | Notes |
|---|---|---|---|
| tcache | per-thread, requests ≤ 1032 bytes (chunks ≤ 0x410) by default on 64-bit | singly-linked, 7 entries each, 64 sizes | fastest path, no locks, glibc 2.26+ |
| fastbins | chunks of 32-128 bytes by default on 64-bit (up to 160 via M_MXFAST) | LIFO singly-linked | no coalescing on free |
| unsorted bin | any | doubly-linked | recently freed; first stop on alloc |
| smallbins | 32 ... 1008 bytes (64-bit) | 62 bins, FIFO doubly-linked | exact-size match per bin |
| largebins | 1024+ bytes | 63 bins, sorted by size | best-fit search within bin |
| top chunk | tail of arena | single chunk | extended via brk/mmap when too small |
The main arena sits on the brk heap. Additional threads get their own arenas (mmap'd 64 MiB regions on x86_64 by default), reducing lock contention. Each arena has its own bin structure. The number caps at 8 * ncpu for 64-bit systems — beyond that, threads share.
```c
// mmap_threshold.c
#include <stdio.h>
#include <stdlib.h>

int main(void) {
    size_t sizes[] = { 64, 4096, 100000, 200000, 8*1024*1024 };
    for (int i = 0; i < 5; i++) {
        char *p = malloc(sizes[i]);
        printf("malloc(%-8zu) = %p\n", sizes[i], p);
    }
    return 0;
}
```
```text
malloc(64      ) = 0x55b8c7a8b2a0   <- main arena (brk heap)
malloc(4096    ) = 0x55b8c7a8b2f0   <- main arena (brk heap)
malloc(100000  ) = 0x55b8c7a8c310   <- main arena (brk heap)
malloc(200000  ) = 0x7f9a5c800010   <- mmap region!
malloc(8388608 ) = 0x7f9a5c000010   <- mmap region (separate)
```
The address jump from 0x55b8... (main heap) to 0x7f9a... (mmap region) is the threshold flip. When you free() the mmap'd ones, glibc actually calls munmap, so that memory genuinely goes back to the kernel. free() on heap chunks just puts them back into the allocator's bins for reuse.
```sh
# dump arena stats from a running process (glibc malloc_stats via gdb)
gdb -p $PID -batch -ex 'call (void)malloc_stats()' -ex detach 2>&1 | tail -20

# total heap (brk) size and the data segment
grep -E 'VmData|VmStk|VmExe|VmLib' /proc/$PID/status

# tweak the mmap threshold so even small allocs go to mmap
MALLOC_MMAP_THRESHOLD_=$((1<<14)) ./myapp   # 16 KiB

# kill tcache to see classic ptmalloc behavior
GLIBC_TUNABLES=glibc.malloc.tcache_count=0 ./myapp

# trace every malloc / free / realloc
LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libmemusage.so ./myapp
ltrace -e malloc+free+realloc ./myapp 2>&1 | head

# or use a real heap profiler
heaptrack ./myapp && heaptrack_print heaptrack.myapp.*.gz | head -30
```
Four flag-pair combinations dominate everything you'll see in /proc/[pid]/maps:
```c
// anon_private.c
#include <stdio.h>
#include <sys/mman.h>

int main(void) {
    size_t N = 2 * 1024 * 1024;   /* 2 MiB */
    char *m = mmap(NULL, N, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (m == MAP_FAILED) { perror("mmap"); return 1; }
    m[0] = 'A';                   /* zero-page COW -> first frame */
    munmap(m, N);
    return 0;
}
```
```c
// mmap_file.c
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main(int argc, char **argv) {
    if (argc < 2) { fprintf(stderr, "usage: %s FILE\n", argv[0]); return 1; }
    int fd = open(argv[1], O_RDWR);
    if (fd < 0) { perror("open"); return 1; }
    struct stat st;
    if (fstat(fd, &st) < 0 || st.st_size < 16) {   /* need at least 16 bytes to edit */
        fprintf(stderr, "fstat failed or file too small\n");
        return 1;
    }
    char *m = mmap(NULL, st.st_size, PROT_READ | PROT_WRITE,
                   MAP_SHARED, fd, 0);
    if (m == MAP_FAILED) { perror("mmap"); return 1; }
    /* edits hit the page cache; kernel writes back asynchronously */
    memset(m, 'X', 16);
    msync(m, st.st_size, MS_SYNC);   /* force flush for durability */
    munmap(m, st.st_size);
    close(fd);
    return 0;
}
```
This skips the explicit read/write dance. Reads materialize as page faults that pull from the page cache (or disk on miss). Writes mark pages dirty; the kernel writes them back via writeback threads or on msync.
| Flag | What it does |
|---|---|
| MAP_POPULATE | pre-faults all pages (no demand paging) — useful when you know you'll touch everything |
| MAP_LOCKED | like mlock: pages won't be swapped out (needs RLIMIT_MEMLOCK) |
| MAP_HUGETLB | allocate from the hugepage pool (2 MiB or 1 GiB pages) |
| MAP_FIXED_NOREPLACE | try a specific address but fail rather than displace existing mappings (use this, not raw MAP_FIXED) |
| MAP_NORESERVE | don't pre-reserve swap; allows huge sparse mappings; risk: SIGSEGV on a write if no physical memory is available |
| MAP_STACK | hint that this mapping will be used as a stack |
```c
// madvise hints
madvise(p, n, MADV_SEQUENTIAL);  /* expect linear access; aggressive readahead */
madvise(p, n, MADV_RANDOM);      /* random access; disable readahead */
madvise(p, n, MADV_WILLNEED);    /* prefetch into page cache */
madvise(p, n, MADV_DONTNEED);    /* drop pages now (anon: zeros on next read!) */
madvise(p, n, MADV_FREE);        /* lazy reclaim: pages may persist or be zeroed */
madvise(p, n, MADV_HUGEPAGE);    /* opt this region into THP coalescing */
madvise(p, n, MADV_NOHUGEPAGE);  /* and out */
```
fork() doesn't actually copy the parent's memory. It copies the page tables, marks every writable PTE in both parent and child as read-only, and increments a refcount on each backing page. The first write by either process triggers a fault; the kernel allocates a new frame, copies the contents, points the writer's PTE at the new frame, and restores RW.
```c
// cow_demo.c
#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>      /* malloc */
#include <string.h>
#include <sys/wait.h>
#include <sys/resource.h>
#include <unistd.h>

int main(void) {
    size_t N = 64 * 1024 * 1024;
    char *buf = malloc(N);
    if (!buf) { perror("malloc"); return 1; }
    memset(buf, 'A', N);   /* parent allocates + faults all pages */

    struct rusage r;
    getrusage(RUSAGE_SELF, &r);
    printf("parent pre-fork    minflt=%ld\n", r.ru_minflt);

    pid_t pid = fork();
    if (pid == 0) {
        /* child: read-only access touches nothing physically */
        volatile size_t sum = 0;   /* volatile so the read loop is not optimized away */
        for (size_t i = 0; i < N; i += 4096) sum += buf[i];
        getrusage(RUSAGE_SELF, &r);
        printf("child read-only    minflt=%ld\n", r.ru_minflt);

        /* now write half: triggers COW for half the pages */
        for (size_t i = 0; i < N/2; i += 4096) buf[i] = 'B';
        getrusage(RUSAGE_SELF, &r);
        printf("child half-write   minflt=%ld\n", r.ru_minflt);
        _exit(0);
    }
    waitpid(pid, NULL, 0);
    return 0;
}
```
```text
parent pre-fork    minflt=16472
child read-only    minflt=132     # only kernel/copy bookkeeping
child half-write   minflt=8324    # ~16384/2 = 8192 COW faults + noise
```
Read-only access in the child does not allocate a single new physical frame. The write phase triggers exactly one minor fault per touched 4 KiB page, and only the touched pages get duplicated. This is what makes fork() + exec() cheap and what made shared read-mostly server architectures viable for decades.
top immediately after fork(). They aren't actually using 2x the memory — the pages are shared. Look at PSS (Proportional Set Size) in /proc/[pid]/smaps for a fairer accounting that divides shared pages across the procs that map them.
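The PSS accounting can be written out as arithmetic. A minimal sketch, assuming you already know how many processes map each shared region; the struct and function names are hypothetical, and the kernel does this per page rather than per region.

```c
#include <assert.h>

struct shared_region {
    unsigned long kib;      /* resident size of the shared region */
    unsigned mappers;       /* number of processes mapping it */
};

/* PSS for one process: private memory counts in full,
 * each shared region is split evenly among its mappers. */
static inline unsigned long pss_kib(unsigned long private_kib,
                                    const struct shared_region *regs, int n)
{
    unsigned long pss = private_kib;
    for (int i = 0; i < n; i++) {
        assert(regs[i].mappers >= 1);
        pss += regs[i].kib / regs[i].mappers;
    }
    return pss;
}
```

For a parent and child sharing a 64 MiB buffer right after fork(), each process reports the full 64 MiB in RSS, while this calculation attributes 32 MiB of it to each.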
The kernel itself needs to allocate memory: page tables, inode caches, network buffers, scheduler runqueues, your VMA structures. It has its own zoo of allocators because the constraints (interrupt context, atomic vs sleeping, physically contiguous, NUMA-local) are different from user space.
The buddy system is the bottom of the kernel allocator stack. It manages physical page frames in power-of-two-sized blocks called orders, from order-0 (1 page = 4 KiB) up to order-10 (1024 pages = 4 MiB).
Need 12 KiB? Round up to 4 pages = order 2. The allocator finds a free order-2 block. If none exist, it splits an order-3 into two order-2 buddies, hands one out, queues the other. On free, if the buddy is also free, they recombine. This is the source of memory fragmentation — a system can have plenty of free order-0 pages and still fail an order-5 allocation if no contiguous run exists.
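The rounding and the buddy relationship are both simple bit arithmetic. The sketch below is a model, not kernel code: it assumes 4 KiB pages and a maximum order of 10 as described above, and `order_for_bytes` and `buddy_pfn` are illustrative names.

```c
#include <assert.h>
#include <stddef.h>

#define PAGE_SIZE_BYTES 4096UL
#define MAX_BUDDY_ORDER 10

/* Smallest order whose block (4 KiB << order) can hold `bytes`;
 * returns -1 if the request exceeds the largest block. */
static inline int order_for_bytes(size_t bytes)
{
    assert(bytes > 0);
    for (int order = 0; order <= MAX_BUDDY_ORDER; order++)
        if ((PAGE_SIZE_BYTES << order) >= bytes)
            return order;
    return -1;
}

/* A block and its buddy differ only in bit `order` of the page frame number,
 * so flipping that bit finds the block to merge with on free. */
static inline unsigned long buddy_pfn(unsigned long pfn, int order)
{
    assert(order >= 0 && order <= MAX_BUDDY_ORDER);
    assert((pfn & ((1UL << order) - 1)) == 0);   /* block must be order-aligned */
    return pfn ^ (1UL << order);
}
```

The XOR also explains fragmentation: two adjacent free blocks can only merge if they are buddies, that is, if together they form an aligned block of the next order.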
See /proc/buddyinfo for the live state per zone:
```text
$ cat /proc/buddyinfo
Node 0, zone      DMA      1      0      0      1      2 ...
Node 0, zone    DMA32  10234   8501   6442   3201    998 ...
Node 0, zone   Normal  88412  41203   9871    412     22 ...
```
Most kernel allocations aren't full pages; they're small structs (192 bytes for a dentry, 216 for a vm_area_struct, and so on). The slab allocator sits on top of buddy: it asks for pages in bulk, then carves them into fixed-size object slots, keeping per-CPU and per-NUMA-node caches to avoid lock contention.
```text
$ sudo head /proc/slabinfo
slabinfo - version: 2.1
# name            active_objs  num_objs  objsize  objperslab  pagesperslab
dentry                 412091    412290      192          21             1
inode_cache            206301    206301      584          14             2
kmalloc-8k                280       280     8192           4             8
kmalloc-1k               3104      3104     1024          16             4
task_struct              1242      1242     5824           5             8
mm_struct                 312       312     1024          16             4
vm_area_struct           9412      9412      216          18             1
```
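The objperslab column in the listing above follows from the other two columns. A simplified model, assuming 4 KiB pages and ignoring any per-slab metadata or alignment padding the real allocator may add; the function names are made up for this example.

```c
#include <assert.h>

/* How many fixed-size objects fit in one slab of `pagesperslab` pages. */
static inline unsigned objs_per_slab(unsigned objsize, unsigned pagesperslab)
{
    assert(objsize > 0 && pagesperslab > 0);
    return (pagesperslab * 4096u) / objsize;
}

/* Bytes left over at the end of each slab (internal fragmentation). */
static inline unsigned slab_waste_bytes(unsigned objsize, unsigned pagesperslab)
{
    return pagesperslab * 4096u - objs_per_slab(objsize, pagesperslab) * objsize;
}
```

This is why large objects such as task_struct use multi-page slabs: with a single 4 KiB page, a 5824-byte object would not fit at all.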
| API | Returns | Used for |
|---|---|---|
| kmalloc(n, GFP_*) | physically contiguous | small kernel allocs (DMA buffers, structs); backed by SLUB |
| vmalloc(n) | virtually contiguous, physically scattered | large kernel buffers where physical contiguity isn't needed |
| kvmalloc(n) | tries kmalloc, falls back to vmalloc | generic large alloc, "I don't care which" |
| alloc_pages(order) | raw page frames | page-table builders, network ring buffers |
| Flag | Means |
|---|---|
| GFP_KERNEL | can sleep, can do I/O, can reclaim — the default for process context |
| GFP_ATOMIC | cannot sleep (interrupt context); fail fast if no free page available |
| GFP_NOIO | can sleep but not start I/O (avoid recursion in I/O paths) |
| GFP_DMA | must come from low-memory zone usable by legacy DMA controllers |
| __GFP_ZERO | zero the returned page |
| __GFP_NOWARN | don't dump a stacktrace if alloc fails |
Every byte you read from a regular file lives, transiently, in the page cache — the kernel's RAM-backed cache of file pages indexed by (inode, offset). It's not a separate pool; it's just pages from the buddy allocator that happen to be reclaimable. free's "available" number includes most of it.
This is also why cat largefile > /dev/null warms the cache, why running a benchmark twice usually gets faster the second time, and why echo 3 > /proc/sys/vm/drop_caches is the standard way to invalidate it before measurement.
When demand for pages exceeds supply, Linux reclaims. Reclaim is handled by per-NUMA-node kswapd threads (background, low watermark) and direct reclaim (foreground, allocation-time, when watermarks crash).
Reclaim uses active/inactive LRU lists (per node on current kernels, per zone before 4.8), biased by the PTE accessed bit and adjusted by vm.swappiness (0-200, default 60). Lower values discourage anon swap-out in favor of reclaiming file cache.
Each zone has three watermarks: min, low, high. Below min, only allocations allowed to dip into reserves (such as atomic kernel allocations) succeed. Dropping below low wakes kswapd, which keeps reclaiming until free pages are back above high. See /proc/zoneinfo.
If reclaim can't free enough memory, the OOM killer picks a victim and sends SIGKILL. The score is exposed in /proc/[pid]/oom_score, derived mostly from RSS (plus swap and page-table usage), and tunable via /proc/[pid]/oom_score_adj (-1000 to +1000; -1000 = immune).
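The watermark rules can be modelled as a classification of a zone's free-page count. This is a toy model under the description above, not the kernel's logic (which also accounts for allocation order, reserves, and per-allocation flags); the enum and function names are hypothetical.

```c
#include <assert.h>

enum zone_pressure {
    ZONE_ABOVE_HIGH,        /* healthy: kswapd can sleep */
    ZONE_BETWEEN_LOW_HIGH,  /* kswapd is not woken, but keeps going if already running */
    ZONE_BELOW_LOW,         /* kswapd is woken for background reclaim */
    ZONE_BELOW_MIN          /* only reserve-eligible allocations proceed */
};

static inline enum zone_pressure classify_zone(unsigned long free_pages,
                                               unsigned long min,
                                               unsigned long low,
                                               unsigned long high)
{
    assert(min <= low && low <= high);
    if (free_pages < min)
        return ZONE_BELOW_MIN;
    if (free_pages < low)
        return ZONE_BELOW_LOW;
    if (free_pages < high)
        return ZONE_BETWEEN_LOW_HIGH;
    return ZONE_ABOVE_HIGH;
}
```

The gap between low and high gives kswapd hysteresis: it starts at low but does not stop until high, so it is not woken again on every small allocation.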
```c
// oom_protect.c
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void) {
    int fd = open("/proc/self/oom_score_adj", O_WRONLY);
    if (fd < 0) { perror("open"); return 1; }
    /* less likely to be killed; lowering the value needs CAP_SYS_RESOURCE */
    if (write(fd, "-500", 4) != 4) perror("write");
    close(fd);
    /* ... do important work ... */
    return 0;
}
```
```text
/proc/meminfo
MemTotal:       32896132 kB   # all RAM the kernel can see
MemFree:         1204896 kB   # truly unused
MemAvailable:   18472304 kB   # estimate of what allocs *could* get
Buffers:          412004 kB   # block-device cache
Cached:         15823008 kB   # page cache (file pages)
SwapCached:        12492 kB   # pages once in swap, now back in memory
Active:         12409832 kB   # recently used (anon + file)
Inactive:        4988124 kB   # candidates for reclaim
AnonPages:       9912004 kB   # process anonymous (heap, stack, anon mmap)
Mapped:          1284820 kB   # mapped from files (libs, mmap'd files)
Slab:             982408 kB   # kernel slab caches
SReclaimable:     682112 kB   # slabs the kernel can drop
SUnreclaim:       300296 kB   # slabs that must stay
KernelStack:       28192 kB   # per-task kernel stacks
PageTables:       142884 kB   # the actual page-table pages!
SwapTotal:       8388604 kB
SwapFree:        8261940 kB
Dirty:              2884 kB   # waiting to be written back
Writeback:             0 kB   # currently being written
```
Note PageTables: that's the kernel's accounting of RAM consumed by page-table pages (PGDs/PUDs/PMDs/PTEs) across all processes. A process with very sparse mappings can spend non-trivial RAM just on the page tables.
cgroups v2 lets you cap memory per group (typically per container or per service). Hitting memory.max triggers reclaim within the cgroup; failing reclaim invokes the cgroup OOM killer (only kills processes in the cgroup).
```sh
# cgroups v2: limit a service to 512 MiB
mkdir /sys/fs/cgroup/myservice
echo 536870912 > /sys/fs/cgroup/myservice/memory.max
echo $$ > /sys/fs/cgroup/myservice/cgroup.procs
# now this shell + descendants are capped
```
```sh
# PSI: how much time we're stalled on memory
cat /proc/pressure/memory
# some avg10=0.00 ... full avg10=0.00 ...   "full" non-zero = real pain

# actively swapping right now? si/so columns are kB/s
vmstat 1 5

# current watermarks per zone
grep -E 'Node|min|low|high' /proc/zoneinfo | head -20

# did the OOM killer run since boot?
dmesg -T | grep -i 'killed process\|out of memory'
journalctl -k -b | grep -i 'invoked oom'

# who would the OOM killer pick right now?
for p in /proc/[0-9]*; do
  s=$(cat $p/oom_score 2>/dev/null) || continue
  echo "$s $(cat $p/comm 2>/dev/null) $(basename $p)"
done | sort -n | tail -10

# cgroup-scoped pressure and OOM events
cat /sys/fs/cgroup/myservice/memory.pressure
cat /sys/fs/cgroup/myservice/memory.events
```
Multi-socket servers don't have one big memory pool. Each socket has its own attached DRAM (the local node); accessing another socket's memory traverses the inter-socket link (UPI on Intel, Infinity Fabric on AMD). The latency penalty is real — typically 1.5x to 2x for remote access.
Linux models this as NUMA nodes. numactl --hardware shows the topology:
```text
$ numactl --hardware
available: 2 nodes (0-1)
node 0 cpus: 0 1 2 3 4 5 6 7
node 0 size: 16345 MB
node 1 cpus: 8 9 10 11 12 13 14 15
node 1 size: 16384 MB
node distances:
node   0   1
  0:  10  21
  1:  21  10
```
The distance matrix is in arbitrary units; 10 is local, 21 here means a remote access costs ~2.1x the local cost.
```c
// numa_aware.c (link with -lnuma)
#include <numa.h>
#include <numaif.h>
#include <stdio.h>

int main(void) {
    if (numa_available() < 0) return 1;

    int nnodes = numa_max_node() + 1;
    printf("%d NUMA nodes\n", nnodes);

    /* allocate 1 GiB strictly on node 0 */
    void *p = numa_alloc_onnode(1L << 30, 0);
    if (!p) { perror("numa_alloc_onnode"); return 1; }

    /* pin this thread to node 0 too */
    numa_run_on_node(0);

    /* now touching p is fast (local DRAM) */
    numa_free(p, 1L << 30);
    return 0;
}
```
For the common case (single-process service, multiple threads), Linux's default first-touch + autonuma balancing usually does the right thing — it migrates pages toward the cores that touch them most. For high-end workloads (databases, in-memory analytics), explicit pinning with numactl or libnuma is the difference between 60% and 100% of memory bandwidth.
```sh
# topology + node memory totals
numactl --hardware

# RSS broken down by node, per process
numastat -p $PID

# every mapping with its current node distribution
cat /proc/$PID/numa_maps | head

# node-level allocator stats: hits, misses, foreign
numastat -m

# pin a workload manually
numactl --cpunodebind=0 --membind=0 ./myapp

# count loads served from remote DRAM (Intel event name; availability varies by CPU)
perf stat -e mem_load_l3_miss_retired.remote_dram ./myapp

# is autonuma migrating pages around?
grep numa_ /proc/vmstat
```
| Path / tool | What you get |
|---|---|
| /proc/[pid]/maps | VMA list, perms, file backing |
| /proc/[pid]/smaps | per-VMA RSS / PSS / swap / hugepages / locked |
| /proc/[pid]/status | VmSize / VmRSS / VmHWM / VmData / VmStk / VmExe |
| /proc/[pid]/statm | compact, in pages: size, resident, shared, text, lib, data, dt (lib and dt are always 0 on modern kernels) |
| /proc/[pid]/pagemap | virt-to-PFN lookup, swap status, present bit (root only) |
| /proc/[pid]/oom_score_adj | tune OOM-kill priority |
| pmap -x [pid] | readable VMA summary |
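Reading /proc/[pid]/pagemap comes down to one seek and a few masks: the file holds one 64-bit entry per virtual page. The sketch below covers only the offset calculation and the most commonly used bits (present, swapped, PFN); the helper names are invented for this example, and on current kernels the PFN field reads as zero without root privileges.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Byte offset in /proc/[pid]/pagemap of the entry describing `vaddr`. */
static inline uint64_t pagemap_offset(uint64_t vaddr, uint64_t page_size)
{
    assert(page_size != 0);
    return (vaddr / page_size) * sizeof(uint64_t);
}

/* Bit 63: page is present in RAM. */
static inline bool pm_present(uint64_t entry) { return (entry >> 63) & 1; }

/* Bit 62: page is in swap. */
static inline bool pm_swapped(uint64_t entry) { return (entry >> 62) & 1; }

/* Bits 0..54: page frame number, valid only when present and not swapped. */
static inline uint64_t pm_pfn(uint64_t entry)
{
    return entry & ((1ULL << 55) - 1);
}
```

A reader would `pread` eight bytes at `pagemap_offset(addr, 4096)` and, if `pm_present` is true, multiply `pm_pfn` by the page size to get the physical address.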
| Path / tool | What you get |
|---|---|
| /proc/meminfo | top-level memory accounting |
| /proc/buddyinfo | free pages per order per zone |
| /proc/zoneinfo | watermarks, pages, vmstats per zone |
| /proc/slabinfo | per-cache slab usage (root) |
| /proc/vmstat | counters: pgfault, pgmajfault, pgscan, pswpin, etc. |
| free -h | quick total / used / free / buff/cache / available |
| vmstat 1 | live si/so (swap I/O), bi/bo, free, buff, cache |
| sar -B 1 | paging stats: pgpgin, pgpgout, fault, majflt, pgfree, pgscan |
| numastat -m | per-node memory breakdown |
| slabtop | live slab cache top-N |
| perf mem record | load/store address sampling, NUMA hit/miss |
| bcc tools (memleak, slabratetop) | eBPF-based dynamic tracing |
```sh
# What's allocating in the kernel right now?
sudo slabtop -o | head -20

# What's the actual physical frame backing a virtual address?
# (root only; uses /proc/[pid]/pagemap)
sudo ./pagemap-dump $(pidof myapp) 0x7f3e9c1a0000

# Force-flush page cache for benchmarking:
sync && echo 3 | sudo tee /proc/sys/vm/drop_caches

# Watch a process's RSS climb:
while true; do grep VmRSS /proc/$PID/status; sleep 1; done

# Major-fault hot processes:
ps -eo pid,comm,maj_flt,min_flt --sort=-maj_flt | head

# Are we swapping right now?
vmstat 1 5   # si/so columns; non-zero = active swap I/O

# Is THP coalescing pages?
grep AnonHugePages /proc/meminfo
cat /sys/kernel/mm/transparent_hugepage/enabled
```
mm/page_alloc.c (buddy), mm/slub.c, mm/memory.c (fault handling), mm/mmap.c, arch/x86/mm/fault.c.
malloc/malloc.c (glibc): ptmalloc2 source, well-commented for what it is.