Reference Brief / Linux Kernel Namespaces

The kernel primitives behind every container

kernel 4.6+ / util-linux / OCI runtime spec

Eight namespace types partition kernel state, accessed through three syscalls (clone, unshare, setns). Most types require CAP_SYS_ADMIN to create; the user namespace is the exception, which is why rootless containers are built on it. A container is a process tree placed in its own set of namespaces, plus cgroups, plus an image. This brief covers the types, the syscalls, the inode model, and the ops paths.

The eight namespace types

mnt

filesystem mounts

Independent mount table; bind mounts, tmpfs, pivot_root, propagation modes.

uts

hostname

Hostname and NIS domain. Smallest, easiest to play with.

ipc

SysV IPC + mqueues

Message queues, semaphores, shared memory segments. Legacy but live.

pid

process IDs

Independent PID space; first child becomes PID 1 with init semantics.

net

network stack

Interfaces, routes, iptables/nftables, sockets, sysctls, conntrack.

user

UIDs + capabilities

UID/GID maps and per-namespace capability sets. Foundation of rootless.

cgroup

cgroup root view

Virtualises /proc/self/cgroup so the container sees its cgroup as /.

time

monotonic + boot clocks

Per-namespace offsets for CLOCK_MONOTONIC and CLOCK_BOOTTIME. CRIU uses it.

Runtime models, in one paragraph each

runc / Docker / containerd: clone(2) with the CLONE_NEW* flags the container config asks for, capabilities trimmed, seccomp applied, pivot_root into a layered overlayfs rootfs, network plugged in by libnetwork (Docker) or CNI (containerd under Kubernetes).

Podman rootless: the user namespace is created first using your subuid range from /etc/subuid; all other namespaces are nested inside it so kernel capability checks pass against the namespaced credentials. No setuid binary on the hot path.

Kubernetes pod: the pause container holds the net and ipc namespaces (and, under containerd and CRI-O, uts); app containers join via setns(2). Mount and PID are per-container by default; PID can be shared with shareProcessNamespace: true.

From scratch (this repo): unshare --pid --uts --mount --ipc --net --cgroup --fork, mount /proc /sys /dev inside the rootfs, pivot_root. See demos/07-mini-container.sh. The rest of any runtime is image layout and policy.

Inode model

A namespace is a kernel object identified by an inode number. Two processes share a namespace iff their /proc/<pid>/ns/<type> symlinks resolve to the same inode. That is the entire membership model.

Two processes, eight namespaces each. Sharing is per-type; it is set at clone time and can change later through unshare or setns.
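
The inode comparison above can be sketched in a few lines of Python. This is a minimal illustration, not code from the repo: the helper names `parse_ns_link` and `shared_types` are made up here, and the container inode numbers are hypothetical sample data.

```python
import re

# Matches the target of a /proc/<pid>/ns/<type> symlink, e.g. "pid:[4026531836]".
_NS_LINK = re.compile(r"^(?P<type>[a-z_]+):\[(?P<inode>\d+)\]$")


def parse_ns_link(target: str) -> tuple[str, int]:
    """Split a namespace symlink target into (type, inode)."""
    match = _NS_LINK.match(target.strip())
    if match is None:
        raise ValueError(f"not a namespace link: {target!r}")
    return match["type"], int(match["inode"])


def shared_types(links_a: dict[str, str], links_b: dict[str, str]) -> set[str]:
    """Return the namespace types whose inodes match in both processes."""
    shared = set()
    for ns_type in links_a.keys() & links_b.keys():
        if parse_ns_link(links_a[ns_type]) == parse_ns_link(links_b[ns_type]):
            shared.add(ns_type)
    return shared


# Hypothetical readlink output: a host process and a containerised one
# that kept the host's uts namespace.
host = {"net": "net:[4026531840]", "pid": "pid:[4026531836]", "uts": "uts:[4026531838]"}
container = {"net": "net:[4026532201]", "pid": "pid:[4026532199]", "uts": "uts:[4026531838]"}

print(sorted(shared_types(host, container)))  # prints ['uts']
```

On a live system the dictionaries would come from `os.readlink()` on each entry of `/proc/<pid>/ns/`.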

Syscalls

| syscall | signature | purpose |
| --- | --- | --- |
| `clone` | `clone(fn, stack, flags\|CLONE_NEW*, ...)` | fork a child directly into new namespaces. What runtimes use. |
| `unshare` | `unshare(flags)` | move the calling process into new namespaces in place (for `pid`, only its future children move). What the CLI uses. |
| `setns` | `setns(fd, nstype)` | join an existing namespace through an fd opened on `/proc/<pid>/ns/<type>`. What `nsenter` uses. |
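
As a rough sketch of the setns path, assuming Python 3.12 or newer (which exposes `os.setns`) and enough privilege over the target namespace: open the namespace file, hand the fd to the kernel, close the fd. The helper names `ns_path` and `join_namespace` are illustrative only.

```python
import os


def ns_path(pid: int, ns_type: str) -> str:
    """Path of the namespace file that setns(2) expects an fd for."""
    return f"/proc/{pid}/ns/{ns_type}"


def join_namespace(pid: int, ns_type: str) -> None:
    """Move the calling thread into one of another process's namespaces.

    Assumes Python 3.12+ (os.setns) and, in practice, CAP_SYS_ADMIN
    in the user namespace that owns the target namespace.
    """
    fd = os.open(ns_path(pid, ns_type), os.O_RDONLY)
    try:
        # nstype 0 means "accept whatever namespace type this fd refers to".
        os.setns(fd, 0)
    finally:
        # The fd is only needed for the call; membership persists without it.
        os.close(fd)
```

Calling `join_namespace(<pid>, "net")` is roughly what `nsenter --target <pid> --net` does before it execs the command.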

Type matrix

| type | CLONE flag | unshare option | since kernel | isolates |
| --- | --- | --- | --- | --- |
| `mnt` | `CLONE_NEWNS` | `-m` | 2.4.19 (2002) | mount table |
| `uts` | `CLONE_NEWUTS` | `-u` | 2.6.19 (2006) | hostname, NIS domain |
| `ipc` | `CLONE_NEWIPC` | `-i` | 2.6.19 (2006) | SysV IPC, POSIX mqueues |
| `pid` | `CLONE_NEWPID` | `-p` | 2.6.24 (2008) | process IDs |
| `net` | `CLONE_NEWNET` | `-n` | 2.6.29 (2009) | full network stack |
| `user` | `CLONE_NEWUSER` | `-U` | 3.8 (2013) | UIDs, GIDs, capabilities |
| `cgroup` | `CLONE_NEWCGROUP` | `-C` | 4.6 (2016) | cgroup root path |
| `time` | `CLONE_NEWTIME` | `-T` | 5.6 (2020) | monotonic + boot clocks |
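
The matrix translates directly into a lookup table. The sketch below pairs each type with its flag value from `<linux/sched.h>` and its unshare(1) short option; the function names `clone_flags` and `unshare_argv` are hypothetical helpers, not part of any library.

```python
# Flag values as defined in <linux/sched.h>; short options are those of unshare(1).
NAMESPACES = {
    "mnt": (0x00020000, "-m"),     # CLONE_NEWNS
    "uts": (0x04000000, "-u"),     # CLONE_NEWUTS
    "ipc": (0x08000000, "-i"),     # CLONE_NEWIPC
    "pid": (0x20000000, "-p"),     # CLONE_NEWPID
    "net": (0x40000000, "-n"),     # CLONE_NEWNET
    "user": (0x10000000, "-U"),    # CLONE_NEWUSER
    "cgroup": (0x02000000, "-C"),  # CLONE_NEWCGROUP
    "time": (0x00000080, "-T"),    # CLONE_NEWTIME
}


def clone_flags(types: list[str]) -> int:
    """OR together the CLONE_NEW* bits for the requested namespace types."""
    mask = 0
    for ns_type in types:
        mask |= NAMESPACES[ns_type][0]
    return mask


def unshare_argv(types: list[str], command: list[str]) -> list[str]:
    """Build an unshare(1) command line for the requested namespace types."""
    argv = ["unshare"] + [NAMESPACES[ns_type][1] for ns_type in types]
    if "pid" in types:
        # A new PID namespace only applies to children, so fork one.
        argv.append("--fork")
    return argv + command


print(hex(clone_flags(["uts", "net"])))      # prints 0x44000000
print(unshare_argv(["pid", "mnt"], ["bash"]))
```

The same mask is what a runtime would pass in the flags argument of clone(2) or unshare(2).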

PID namespace anatomy

Inside a new PID namespace, the first forked child becomes PID 1 with init semantics: it must reap orphans, and if it dies the kernel sends SIGKILL to every other process in the namespace. This is why tini, dumb-init, and Docker --init exist.

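A process has one PID per containing namespace. The NSpid: field in /proc/<pid>/status lists them outermost first: the PID as seen from the namespace of the procfs being read, then the PID in each successively nested namespace.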
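
A small sketch of reading that field, assuming the status text has already been read from `/proc/<pid>/status`. The function name `nspid` and the sample PIDs are illustrative.

```python
def nspid(status_text: str) -> list[int]:
    """Return the PIDs from the NSpid: line of a /proc/<pid>/status dump.

    Per proc(5), the leftmost value is the PID in the namespace of the
    procfs instance being read; later values belong to nested namespaces.
    """
    for line in status_text.splitlines():
        if line.startswith("NSpid:"):
            return [int(field) for field in line.split()[1:]]
    return []  # kernels older than 4.1 do not expose NSpid


# Hypothetical status excerpt for a container's init, read from the host.
sample = "Name:\tsleep\nPid:\t48211\nNSpid:\t48211\t1\n"

print(nspid(sample))  # prints [48211, 1]
```

Here the process is PID 48211 on the host and PID 1 inside its own PID namespace.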

Network namespace anatomy

A new net namespace starts with a single lo interface, in the DOWN state. Connectivity is built by hand: a veth pair is a virtual ethernet cable; one end stays on the host, the other is moved into the namespace by ip link set ... netns .... Bridges, NAT, and routing finish the job.

The veth pair is one virtual cable with one end in each namespace. The bridge is just an in-kernel switch.
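
The by-hand wiring can be written down as an ordered list of iproute2 commands. The sketch below only builds the command lines, it does not run them; the interface names, the addresses, and the assumption that the namespace was created with `ip netns add` are all illustrative choices.

```python
import shlex


def veth_plan(netns: str,
              host_if: str = "veth-host",
              ns_if: str = "veth-ns",
              host_cidr: str = "10.200.0.1/24",
              ns_cidr: str = "10.200.0.2/24") -> list[list[str]]:
    """Return the iproute2 commands that connect a named netns to the host."""
    in_ns = ["ip", "netns", "exec", netns]
    return [
        # Create both ends of the virtual cable on the host.
        ["ip", "link", "add", host_if, "type", "veth", "peer", "name", ns_if],
        # Move one end into the namespace.
        ["ip", "link", "set", ns_if, "netns", netns],
        # Address and raise the host end.
        ["ip", "addr", "add", host_cidr, "dev", host_if],
        ["ip", "link", "set", host_if, "up"],
        # Inside the namespace: lo starts DOWN, so raise it, then the veth end.
        in_ns + ["ip", "link", "set", "lo", "up"],
        in_ns + ["ip", "addr", "add", ns_cidr, "dev", ns_if],
        in_ns + ["ip", "link", "set", ns_if, "up"],
    ]


for command in veth_plan("demo"):
    print(shlex.join(command))
```

Running the printed commands as root gives point-to-point connectivity; bridging, NAT, and default routes are left out of this sketch.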

User namespace and the UID map

The user namespace is the only one an unprivileged user can create directly. Inside the namespace, the kernel checks capabilities against the namespaced credentials, so the process can mount, change hostname, and create other namespaces. Seen from the host, the process still runs under its real UID and gains no extra power over host-owned objects.

Each uid_map line has three fields: first inside UID, first outside UID, range length. A first line of 0 1000 1 means inside UID 0 maps to outside UID 1000, for a range of one ID.
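
The translation rule is simple range arithmetic. A minimal sketch, with hypothetical helper names and a sample map that adds a typical subuid range as a second line:

```python
from typing import Optional


def parse_id_map(text: str) -> list[tuple[int, int, int]]:
    """Parse uid_map/gid_map content into (inside, outside, count) tuples."""
    entries = []
    for line in text.splitlines():
        if line.strip():
            inside, outside, count = (int(field) for field in line.split())
            entries.append((inside, outside, count))
    return entries


def to_outside(inside_id: int, entries: list[tuple[int, int, int]]) -> Optional[int]:
    """Translate an ID seen inside the namespace to the ID the host sees."""
    for inside, outside, count in entries:
        if inside <= inside_id < inside + count:
            return outside + (inside_id - inside)
    return None  # unmapped IDs show up as the overflow UID (usually 65534)


# Sample map: root inside is UID 1000 outside; IDs 1..65536 use a subuid range.
sample_map = parse_id_map("0 1000 1\n1 100000 65536\n")

print(to_outside(0, sample_map))  # prints 1000
print(to_outside(5, sample_map))  # prints 100004
```

The same parser works for gid_map, which uses the identical three-column format.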

/proc layout

# every process exposes its namespace identity here
$ ls -la /proc/$$/ns/
cgroup -> cgroup:[4026531835]
ipc -> ipc:[4026531839]
mnt -> mnt:[4026531841]
net -> net:[4026531840]
pid -> pid:[4026531836]
pid_for_children
time -> time:[4026531834]
time_for_children
user -> user:[4026531837]
uts -> uts:[4026531838]

# two processes in the same namespace iff inodes match
$ readlink /proc/1/ns/pid /proc/$$/ns/pid

# pid_for_children is what new children will inherit; it differs from pid
# after unshare(CLONE_NEWPID) until the next fork.

Persistence

A namespace lives as long as something references it: a process inside it, an open fd to its /proc/<pid>/ns/<type>, or a bind mount of that file onto a stable path. ip netns add foo uses the bind-mount trick under /var/run/netns/foo. nsm create uses a long-lived holder process. Both work, both leak if you forget to clean them up.

| method | used by | holds via | cleanup |
| --- | --- | --- | --- |
| bind mount | `ip netns`, CRIU | `mount --bind /proc/<pid>/ns/X /var/run/...` | `umount` |
| holder process | nsm, runc, podman | long-lived `sleep infinity` or container init | kill the holder |
| open fd | library code | process keeps an fd to the ns file | close the fd |
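
The open-fd row is the easiest one to sketch from library code. The helper names below are illustrative, and the example assumes a Linux /proc; it does nothing beyond opening the namespace file and reading its inode.

```python
import os


def hold_namespace(pid: object, ns_type: str) -> int:
    """Open a namespace file and return the fd.

    As long as this fd stays open, the namespace stays alive even if
    every process inside it exits.
    """
    return os.open(f"/proc/{pid}/ns/{ns_type}", os.O_RDONLY)


def namespace_inode(fd: int) -> int:
    """Inode number identifying the namespace behind an open fd."""
    return os.fstat(fd).st_ino


def release_namespace(fd: int) -> None:
    """Drop the reference; the kernel frees the namespace if it was the last."""
    os.close(fd)
```

A caller would keep the fd from `hold_namespace(<pid>, "net")` for as long as the namespace is needed and pass it to setns(2) to re-enter later. Forgetting `release_namespace` is exactly the leak described above.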

Building a container, phase by phase

Prepare a rootfs. debootstrap, docker export, or anything that drops a Linux userland into a directory. (demo 07, step 0)
Create a cgroup and set limits. cgroups v2: write to memory.max, cpu.max, pids.max. (demo 07, step 1)
unshare into new namespaces. --pid --uts --mount --ipc --net --cgroup --fork --map-root-user covers every type except time; --map-root-user adds the user namespace. (demo 07, step 3)
Mount /proc /sys /dev inside the rootfs. Without a fresh /proc, ps still shows the host's processes. (inside the unshared shell)
pivot_root into the rootfs. Real containers use this, not chroot. After umount -l /.old_root the host fs is no longer reachable. (demo 07)
Configure network. Create a veth pair, move one end in, assign IPs, add NAT on the host. (demo 03)
Drop capabilities, apply seccomp, exec the entrypoint. Educational demos skip this; production runtimes do not. (runc, not bash)
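
The cgroup phase comes down to writing three small files. The sketch below is an assumption-laden illustration, not the demo script: `write_limits` is a made-up helper, and it writes into a temporary directory so it can run without root. On a real host the directory would be one created under `/sys/fs/cgroup`.

```python
import tempfile
from pathlib import Path


def write_limits(cgroup_dir: str, memory_bytes: int, cpu_fraction: float,
                 max_pids: int, period_us: int = 100_000) -> dict[str, str]:
    """Write cgroup v2 limit files and return what was written."""
    # cpu.max is "<quota> <period>" in microseconds: 0.5 CPU = 50000 of 100000.
    quota_us = int(cpu_fraction * period_us)
    settings = {
        "memory.max": str(memory_bytes),
        "cpu.max": f"{quota_us} {period_us}",
        "pids.max": str(max_pids),
    }
    for name, value in settings.items():
        (Path(cgroup_dir) / name).write_text(value + "\n")
    return settings


# Stand-in directory; a real cgroup directory already contains these files.
with tempfile.TemporaryDirectory() as fake_cgroup:
    written = write_limits(fake_cgroup, 256 * 1024 * 1024, 0.5, 64)
    print(written["cpu.max"])  # prints 50000 100000
```

After the limits are set, writing the container's PID into `cgroup.procs` in the same directory places it under them.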

Operations and debugging

Find a container's PID: docker inspect -f '{{.State.Pid}}' <name>, then everything is at /proc/<pid>/ns/. The same /proc/<pid>/ns/ layout applies once you have the PID from podman, crictl, or Kubernetes tooling.
Enter without the daemon: nsenter --target <pid> --all bash works even when dockerd is wedged. The kernel doesn't care which userspace asked.
Network debugging: nsenter --target <pid> --net bash then tcpdump, ss, iptables-save, ip route all run in the container's network view.
Compare to host: compare readlink /proc/1/ns/X with readlink /proc/<pid>/ns/X; equal inodes mean the container is sharing that subsystem with the host.
List system-wide: lsns, optionally -t net to filter, shows each namespace's process count and lowest PID.
Watch for change: nsm monitor diff-walks /proc/*/ns/* once per second and prints additions and removals. Useful to catch a runtime spawning short-lived helper namespaces.
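
A diff-walk of this kind can be approximated in a few lines. This is a hypothetical sketch of the idea, not nsm's actual code: `snapshot` reads every readable `/proc/<pid>/ns/` directory, and `diff` compares the sets of namespaces seen in two snapshots.

```python
import os


def snapshot(proc_root: str = "/proc") -> dict[int, dict[str, str]]:
    """Map pid -> {ns file name: link target} for every readable process."""
    result = {}
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue
        ns_dir = os.path.join(proc_root, entry, "ns")
        try:
            result[int(entry)] = {
                name: os.readlink(os.path.join(ns_dir, name))
                for name in os.listdir(ns_dir)
            }
        except OSError:
            continue  # process exited mid-walk, or permission denied
    return result


def namespaces(snap: dict[int, dict[str, str]]) -> set[str]:
    """All distinct namespace identities ("net:[...]") in a snapshot."""
    return {link for links in snap.values() for link in links.values()}


def diff(old: dict[int, dict[str, str]],
         new: dict[int, dict[str, str]]) -> tuple[list[str], list[str]]:
    """Return (added, removed) namespace identities between two snapshots."""
    before, after = namespaces(old), namespaces(new)
    return sorted(after - before), sorted(before - after)
```

A monitor loop would call `snapshot()` once per second and print the result of `diff(previous, current)`. Namespaces kept alive only by a bind mount or an open fd do not show up in this walk.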

Pitfalls

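PID 1 dies, everything dies. If the container entrypoint is a shell that execs your app, the app becomes PID 1 and signal handling changes: the kernel does not apply default signal actions to a namespace's PID 1, so a SIGTERM with no handler installed is ignored. Use tini or --init.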
Mount propagation surprises. A new mount namespace inherits propagation from the parent. If the parent is shared, mounts inside the namespace can leak to the host. Containers usually mount --make-rprivate / first.
User namespaces interact with everything. Files written from inside as "root" are owned on the host by whatever UID the map assigns: your own UID for a 0 1000 1 style mapping, an ID from your subuid range for other in-container users. chown across the namespace boundary is gated by the map.
cgroup v1 vs v2. The cgroup namespace is much more useful under v2, where the entire cgroup hierarchy is one tree. Under v1, hiding the path matters less because each controller already had its own hierarchy.
net namespace teardown is not free. Closing a netns blocks until kernel cleanup completes. On busy hosts this can stall ip netns del for seconds.
Time namespace is opt-in. Tools that read /proc/uptime or CLOCK_BOOTTIME see different values; tools that use CLOCK_REALTIME do not. The wall clock is shared.
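
For the mount propagation pitfall, the propagation type of each mount can be read from the optional fields of `/proc/self/mountinfo`. A minimal sketch, with a made-up function name and made-up sample lines that follow the field layout documented in proc(5):

```python
def propagation(mountinfo_line: str) -> str:
    """Classify one /proc/<pid>/mountinfo line by propagation type."""
    fields = mountinfo_line.split()
    # Six fixed fields come first; optional fields run up to the "-" separator.
    optional = fields[6:fields.index("-", 6)]
    tags = {field.split(":")[0] for field in optional}
    if "shared" in tags:
        return "shared"  # mount events propagate to peers (may also be a slave)
    if "master" in tags:
        return "slave"   # receives events from a master, sends none
    if "unbindable" in tags:
        return "unbindable"
    return "private"


# Hypothetical sample lines.
lines = [
    "22 1 8:1 / / rw,relatime shared:1 - ext4 /dev/sda1 rw",
    "36 35 98:0 /mnt1 /mnt2 rw,noatime master:1 - ext3 /dev/root rw,errors=continue",
    "40 22 0:35 / /tmp rw - tmpfs tmpfs rw",
]

for line in lines:
    print(propagation(line))
```

A root mount that still reports shared inside a fresh mount namespace is the case where mounts can leak back to the host.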

References

man 7 namespaces: the definitive reference, kept current by Michael Kerrisk

man 7 user_namespaces: UID mapping rules and capability semantics

man 7 cgroups: the other half of containers

man 2 unshare / clone / setns: the syscalls themselves

man 8 ip-netns: network namespace management via iproute2

LWN namespaces series: the long-form deep dive, in seven parts

OCI runtime-spec: what runc/crun/youki actually implement on top of these primitives

deep-dive (index): long-form companion, a sectioned walk through the eight types, with diagrams

one-page cheatsheet: dense reference, screenshot-friendly

namespaces-fun on GitHub: demo scripts, the nsm CLI, cheatsheet, deep-dive markdown