Reference Brief / Linux Kernel Namespaces

The kernel primitives behind every container

kernel 4.6+ · util-linux · OCI runtime spec

Eight namespace types partition kernel state, accessed through three syscalls (clone, unshare, setns). Most types require CAP_SYS_ADMIN to create; the user namespace is the exception, which is why rootless containers are built on it. A container is a process tree placed in its own set of namespaces, plus cgroups, plus an image. This brief covers the types, the syscalls, the inode model, and the ops paths.

The eight namespace types

mnt (filesystem mounts): Independent mount table; bind mounts, tmpfs, pivot_root, propagation modes.
uts (hostname): Hostname and NIS domain. Smallest, easiest to play with.
ipc (SysV IPC + mqueues): Message queues, semaphores, shared memory segments. Legacy but live.
pid (process IDs): Independent PID space; first child becomes PID 1 with init semantics.
net (network stack): Interfaces, routes, iptables/nftables, sockets, sysctls, conntrack.
user (UIDs + capabilities): UID/GID maps and per-namespace capability sets. Foundation of rootless.
cgroup (cgroup root view): Virtualises /proc/self/cgroup so the container sees its cgroup as /.
time (monotonic + boot clocks): Per-namespace offsets for CLOCK_MONOTONIC and CLOCK_BOOTTIME. CRIU uses it.

Runtime models, in one paragraph each

runc / Docker / containerd: clone(2) with the CLONE_NEW* flags the runtime config asks for, capabilities trimmed, seccomp applied, pivot_root into a layered overlayfs rootfs, network plugged in by libnetwork (Docker) or CNI (containerd under Kubernetes).

Podman rootless: the user namespace is created first using your subordinate ID ranges from /etc/subuid and /etc/subgid; all other namespaces are created inside it, so kernel capability checks pass against the namespaced credentials. The only privileged helpers involved are newuidmap and newgidmap, which write the multi-range ID maps.

Kubernetes pod: the pause container holds the net, ipc, and uts namespaces; app containers join via setns(2). Mount and PID are per-container by default; PID can be shared with shareProcessNamespace: true.

From scratch (this repo): unshare --pid --uts --mount --ipc --net --cgroup --fork, mount /proc /sys /dev inside the rootfs, pivot_root. See demos/07-mini-container.sh. The rest of any runtime is image layout and policy.

Inode model

A namespace is a kernel object identified by an inode number. Two processes share a namespace iff their /proc/<pid>/ns/<type> symlinks resolve to the same inode. That is the entire ownership model.

Figure: /proc/<pid>/ns/ links for two processes (inode numbers abbreviated).

type     PID 1 (host init)    PID 9241 (container)   relation
mnt      mnt:[40...41]        mnt:[40...91]          independent
net      net:[40...40]        net:[40...92]          independent
pid      pid:[40...36]        pid:[40...93]          independent
uts      uts:[40...38]        uts:[40...94]          independent
user     user:[40...37]       user:[40...37]         shared (same inode)
ipc      ipc:[40...39]        ipc:[40...95]          independent
cgroup   cgroup:[40...35]     cgroup:[40...96]       independent
time     time:[40...34]       time:[40...34]         shared (same inode)
Two processes, eight namespaces each. Sharing is per-type: it is set at clone time and can change later through unshare or setns.
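The inode comparison can be sketched in a few lines of Python. This is a minimal illustration, not the repo's nsm code: parse_ns_link, read_namespaces, and shared_types are hypothetical helper names, and reading another process's links generally needs the same UID or root.

```python
import os
import re

NS_LINK = re.compile(r"^(\w+):\[(\d+)\]$")


def parse_ns_link(target):
    """Split a link target such as 'net:[4026531840]' into (type, inode)."""
    match = NS_LINK.match(target)
    if match is None:
        raise ValueError(f"not a namespace link: {target!r}")
    return match.group(1), int(match.group(2))


def read_namespaces(pid):
    """Return {type: inode} for one process by reading /proc/<pid>/ns/."""
    ns_dir = f"/proc/{pid}/ns"
    result = {}
    for name in os.listdir(ns_dir):
        if name.endswith("_for_children"):
            continue  # these describe future children, not the process itself
        _, inode = parse_ns_link(os.readlink(os.path.join(ns_dir, name)))
        result[name] = inode
    return result


def shared_types(a, b):
    """Namespace types whose inodes match in both {type: inode} maps."""
    return sorted(t for t in a.keys() & b.keys() if a[t] == b[t])
```

Calling shared_types(read_namespaces(1), read_namespaces(pid)) lists the subsystems a container still shares with the host.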

Syscalls

syscall   signature                                 purpose
clone     clone(fn, stack, flags|CLONE_NEW*, ...)   Fork a child directly into new namespaces. What runtimes use.
unshare   unshare(flags)                            Move the calling process into new namespaces in place (for pid and time, only its future children move). What the CLI uses.
setns     setns(fd, nstype)                         Join an existing namespace through an fd opened on /proc/<pid>/ns/<type>. What nsenter uses.
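As a sketch of how the flags reach the kernel, the following Python calls unshare(2) through ctypes. The CLONE_NEW* values are the constants from <linux/sched.h>; flags_for and unshare are hypothetical helper names. The call usually fails with EPERM unless the process has CAP_SYS_ADMIN or the flag set includes the user namespace.

```python
import ctypes
import os

# Flag values as defined in <linux/sched.h>.
CLONE_FLAGS = {
    "time":   0x00000080,
    "mnt":    0x00020000,
    "cgroup": 0x02000000,
    "uts":    0x04000000,
    "ipc":    0x08000000,
    "user":   0x10000000,
    "pid":    0x20000000,
    "net":    0x40000000,
}


def flags_for(types):
    """OR together the CLONE_NEW* flags for the given namespace type names."""
    flags = 0
    for ns_type in types:
        flags |= CLONE_FLAGS[ns_type]
    return flags


def unshare(types):
    """Call unshare(2) on the current process; raise OSError on failure."""
    libc = ctypes.CDLL(None, use_errno=True)
    libc.unshare.argtypes = [ctypes.c_int]
    if libc.unshare(flags_for(types)) != 0:
        err = ctypes.get_errno()
        raise OSError(err, os.strerror(err))
```

On hosts that permit unprivileged user namespaces, unshare(["user", "uts"]) is the smallest experiment that needs no root; many distributions restrict this, so treat success as an assumption.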

Type matrix

type     CLONE flag        unshare   kernel          isolates
mnt      CLONE_NEWNS       -m        2.4.19 (2002)   mount table
uts      CLONE_NEWUTS      -u        2.6.19 (2006)   hostname, NIS domain
ipc      CLONE_NEWIPC      -i        2.6.19 (2006)   SysV IPC, POSIX mqueues
pid      CLONE_NEWPID      -p        2.6.24 (2008)   process IDs
net      CLONE_NEWNET      -n        2.6.29 (2009)   full network stack
user     CLONE_NEWUSER     -U        3.8 (2013)      UIDs, GIDs, capabilities
cgroup   CLONE_NEWCGROUP   -C        4.6 (2016)      cgroup root path
time     CLONE_NEWTIME     -T        5.6 (2020)      monotonic + boot clocks
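The kernel column can be turned into a quick capability check. The sketch below hard-codes the versions from the matrix; parse_release and supported_types are hypothetical names. It only compares upstream version numbers, so a distribution kernel that backports or disables a namespace type will not be reflected.

```python
import re

# Minimum upstream kernel version per namespace type, from the matrix above.
MIN_KERNEL = {
    "mnt": (2, 4, 19),
    "uts": (2, 6, 19),
    "ipc": (2, 6, 19),
    "pid": (2, 6, 24),
    "net": (2, 6, 29),
    "user": (3, 8, 0),
    "cgroup": (4, 6, 0),
    "time": (5, 6, 0),
}


def parse_release(release):
    """Turn a release string such as '5.15.0-91-generic' into (5, 15, 0)."""
    match = re.match(r"(\d+)\.(\d+)(?:\.(\d+))?", release)
    if match is None:
        raise ValueError(f"unrecognised kernel release: {release!r}")
    return tuple(int(part or 0) for part in match.groups())


def supported_types(release):
    """Namespace types whose minimum version is met by this release."""
    version = parse_release(release)
    return [t for t, minimum in MIN_KERNEL.items() if version >= minimum]
```

On a live system the release string can come from os.uname().release.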

PID namespace anatomy

Inside a new PID namespace, the first forked child becomes PID 1 with init semantics: it must reap orphans, and if it dies the kernel sends SIGKILL to every other process in the namespace. This is why tini, dumb-init, and Docker --init exist.

Figure: the same processes seen from the host PID namespace and from the container PID namespace.

host PID   container PID   process
1          -               systemd
842        -               dockerd
901        -               containerd-shim
9241       1               bash (container init)
9243       4               python (worker)
9248       7               redis (sidecar)

cat /proc/9241/status -> NSpid: 9241 1
If the container's PID 1 dies, the kernel SIGKILLs the rest. Host PIDs are still visible from outside.
A process has one PID per containing namespace. NSpid: in /proc/<pid>/status lists them outermost first: the PID in the reader's namespace, then the PID in each nested namespace in turn.
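Reading the NSpid: field is a one-line parse. A minimal sketch, with parse_nspid and read_nspid as hypothetical helper names; the first entry is the PID as seen from the procfs being read and the last is the PID in the innermost namespace.

```python
def parse_nspid(status_text):
    """Extract the NSpid list from the text of /proc/<pid>/status."""
    for line in status_text.splitlines():
        if line.startswith("NSpid:"):
            return [int(field) for field in line.split()[1:]]
    return []  # field absent (very old kernel or unexpected input)


def read_nspid(pid):
    """Read and parse NSpid for a live process."""
    with open(f"/proc/{pid}/status") as status:
        return parse_nspid(status.read())
```

For the container init in the example, read_nspid(9241) on the host would yield a two-element list: the host PID followed by 1.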

Network namespace anatomy

A new net namespace starts with a single lo interface, in the DOWN state. Connectivity is built by hand: a veth pair is a virtual ethernet cable; one end stays on the host, the other is moved into the namespace by ip link set ... netns .... Bridges, NAT, and routing finish the job.

Figure: host and container network namespaces joined by a veth pair.

host network namespace
  eth0      192.0.2.10
  docker0   bridge, 10.200.1.1/24
  veth-h    host end of the veth pair, attached to docker0
  iptables -t nat -A POSTROUTING -s 10.200.1.0/24 -j MASQUERADE
  echo 1 > /proc/sys/net/ipv4/ip_forward

container netns
  veth-c    10.200.1.2/24
  lo        127.0.0.1
  default route via 10.200.1.1

veth-h and veth-c are the veth pair: one virtual cable, one end in each namespace. The bridge is just an in-kernel switch.
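The wiring steps can be written down as an ordered command plan. The sketch below only builds argv lists and runs nothing; veth_plan is a hypothetical helper, the interface names and addresses are the ones from the figure, and it assumes the bridge already exists and that the commands will later run as root.

```python
def veth_plan(pid, bridge="docker0", host_if="veth-h", ctr_if="veth-c",
              ctr_addr="10.200.1.2/24", gateway="10.200.1.1",
              subnet="10.200.1.0/24"):
    """Return the commands, in order, that connect <pid>'s netns to the bridge."""
    inside = ["nsenter", "--target", str(pid), "--net"]
    return [
        # Create the cable and push one end into the container's namespace.
        ["ip", "link", "add", host_if, "type", "veth", "peer", "name", ctr_if],
        ["ip", "link", "set", ctr_if, "netns", str(pid)],
        # Host side: plug into the bridge and bring the link up.
        ["ip", "link", "set", host_if, "master", bridge],
        ["ip", "link", "set", host_if, "up"],
        # Container side: loopback, address, link, default route.
        inside + ["ip", "link", "set", "lo", "up"],
        inside + ["ip", "addr", "add", ctr_addr, "dev", ctr_if],
        inside + ["ip", "link", "set", ctr_if, "up"],
        inside + ["ip", "route", "add", "default", "via", gateway],
        # Host side: forwarding and NAT for outbound traffic.
        ["sysctl", "-w", "net.ipv4.ip_forward=1"],
        ["iptables", "-t", "nat", "-A", "POSTROUTING",
         "-s", subnet, "-j", "MASQUERADE"],
    ]
```

Each entry can be passed to subprocess.run(argv, check=True); keeping the plan as data makes it easy to print and review before anything touches the host.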

User namespace and the UID map

The user namespace is the only one an unprivileged user can create directly. Inside it, the kernel checks capabilities against the namespaced credentials, so the process can create other namespaces and, within those, mount filesystems and change the hostname. Seen from the host, the process still runs as its real UID and has no extra power over host-owned objects.

Figure: UID mapping across the user namespace boundary.

host (outside)                            inside namespace                             mapped via
uid 1000 (your shell)                     uid 0 (root), caps: CAP_SYS_ADMIN, ...       /proc/<pid>/uid_map
uid 100000..165535 (/etc/subuid range)    uid 1..65535 (unprivileged users)            subuid range
First line of uid_map: 0 1000 1 means inside-uid 0 maps to outside-uid 1000 for one id.
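The map is simple enough to evaluate by hand or in code. A minimal sketch with hypothetical helpers parse_id_map and to_outside; each line is read as inside-start, outside-start, count. The second map line used in the usage note (1 100000 65536) is an assumed typical rootless entry, not taken from a real system.

```python
def parse_id_map(text):
    """Parse uid_map/gid_map text into (inside, outside, count) tuples."""
    entries = []
    for line in text.splitlines():
        if line.strip():
            inside, outside, count = (int(field) for field in line.split())
            entries.append((inside, outside, count))
    return entries


def to_outside(entries, inside_id):
    """Translate an ID inside the namespace to the host ID, or None if unmapped."""
    for inside, outside, count in entries:
        if inside <= inside_id < inside + count:
            return outside + (inside_id - inside)
    return None
```

With the map "0 1000 1" followed by "1 100000 65536", inside-uid 0 resolves to 1000 and inside-uid 1 to 100000; an ID outside every range has no host identity at all.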

/proc layout

    # every process exposes its namespace identity here
    $ ls -la /proc/$$/ns/
    cgroup -> cgroup:[4026531835]
    ipc -> ipc:[4026531839]
    mnt -> mnt:[4026531841]
    net -> net:[4026531840]
    pid -> pid:[4026531836]
    pid_for_children
    time -> time:[4026531834]
    time_for_children
    user -> user:[4026531837]
    uts -> uts:[4026531838]

    # two processes are in the same namespace iff the inodes match
    $ readlink /proc/1/ns/pid /proc/$$/ns/pid

    # pid_for_children is what new children will inherit; after
    # unshare(CLONE_NEWPID) it no longer matches pid, and it reads as empty
    # until the first child has been forked into the new namespace.

Persistence

A namespace lives as long as something references it: a process inside it, an open fd to its /proc/<pid>/ns/<type>, or a bind mount of that file onto a stable path. ip netns add foo uses the bind-mount trick under /var/run/netns/foo. nsm create uses a long-lived holder process. Both work, both leak if you forget to clean them up.

method           used by             holds via                                      cleanup
bind mount       ip netns, CRIU      mount --bind /proc/<pid>/ns/X /var/run/...     umount
holder process   nsm, runc, podman   long-lived sleep infinity or container init    kill the holder
open fd          library code        process keeps an fd to the ns file             close the fd
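The open-fd method is the easiest to show in code. A minimal sketch, with ns_path, hold_namespace, ns_inode, and release_namespace as hypothetical helper names; it assumes a Linux /proc and enough permission to open the target process's namespace file.

```python
import os


def ns_path(pid, ns_type):
    """Path of the namespace file for a process, e.g. /proc/9241/ns/net."""
    return f"/proc/{pid}/ns/{ns_type}"


def hold_namespace(pid, ns_type):
    """Open the namespace file; the namespace stays alive while the fd is open."""
    return os.open(ns_path(pid, ns_type), os.O_RDONLY)


def ns_inode(fd):
    """Inode number of the namespace behind an fd from hold_namespace()."""
    return os.fstat(fd).st_ino


def release_namespace(fd):
    """Drop the reference; the kernel frees the namespace once nothing else holds it."""
    os.close(fd)
```

The fd returned by hold_namespace is also what setns(2) expects, so the same handle serves for keeping a namespace alive and for joining it later.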

Building a container, phase by phase

  1. Prepare a rootfs. debootstrap, docker export, or anything that drops a Linux userland into a directory. (demo 07 step 0)
  2. Create a cgroup and set limits. cgroups v2: write to memory.max, cpu.max, pids.max. (demo 07 step 1)
  3. unshare into the namespace types you need. --pid --uts --mount --ipc --net --cgroup --fork --map-root-user, where --map-root-user also creates the user namespace. (demo 07 step 3)
  4. Mount /proc /sys /dev inside the rootfs. Without a fresh /proc, ps still reads the host's process list. (inside the unshared shell)
  5. pivot_root into the rootfs. Real containers use this, not chroot. After umount -l /.old_root the host filesystem is no longer reachable. (demo 07)
  6. Configure network. Create a veth pair, move one end in, assign IPs, add NAT on the host. (demo 03)
  7. Drop capabilities, apply seccomp, exec the entrypoint. Educational demos skip this; production runtimes do not. (runc, not bash)
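Step 2 is plain file I/O, which makes it easy to sketch. The helper below is hypothetical (write_limits is not part of the demo scripts); it assumes a cgroup v2 directory that already exists, for example one created with mkdir under /sys/fs/cgroup, and that the parent has the memory, cpu, and pids controllers enabled in cgroup.subtree_control.

```python
import os


def write_limits(cgroup_dir, memory_max=None, cpu_quota_us=None,
                 cpu_period_us=100000, pids_max=None):
    """Write cgroup v2 limit files; return the values that were written."""
    values = {}
    if memory_max is not None:
        values["memory.max"] = str(memory_max)          # bytes
    if cpu_quota_us is not None:
        # cpu.max takes "<quota> <period>", both in microseconds.
        values["cpu.max"] = f"{cpu_quota_us} {cpu_period_us}"
    if pids_max is not None:
        values["pids.max"] = str(pids_max)
    for name, value in values.items():
        with open(os.path.join(cgroup_dir, name), "w") as limit_file:
            limit_file.write(value + "\n")
    return values
```

A call such as write_limits(path, memory_max=268435456, cpu_quota_us=50000, pids_max=64) caps the group at 256 MiB, half a CPU, and 64 tasks; a process joins the group when its PID is written to cgroup.procs in the same directory.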

Operations and debugging

  • Find a container's PID: docker inspect -f '{{.State.Pid}}' <name>, then everything is at /proc/<pid>/ns/. Same for podman, crictl, k8s.
  • Enter without the daemon: nsenter --target <pid> --all bash works even when dockerd is wedged. The kernel doesn't care which userspace asked.
  • Network debugging: nsenter --target <pid> --net bash then tcpdump, ss, iptables-save, ip route all run in the container's network view.
  • Compare to host: read both readlink /proc/1/ns/X and readlink /proc/<pid>/ns/X; equal inodes mean the container is sharing that subsystem with the host.
  • List system-wide: lsns, optionally -t net to filter, gives you holder-process and member counts.
  • Watch for change: nsm monitor diff-walks /proc/*/ns/* once per second and prints additions and removals. Useful to catch a runtime spawning short-lived helper namespaces.
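The diff-walk idea fits in a few lines. This is an illustrative sketch, not the nsm implementation: snapshot and diff are hypothetical names, and an unprivileged caller only sees the namespaces of processes it is allowed to inspect.

```python
import glob
import os


def snapshot():
    """Set of 'type:[inode]' strings referenced by any readable process."""
    seen = set()
    for link in glob.glob("/proc/[0-9]*/ns/*"):
        try:
            target = os.readlink(link)
        except OSError:
            continue  # process exited mid-walk, or permission denied
        if target:
            seen.add(target)
    return seen


def diff(old, new):
    """Return (added, removed) namespaces between two snapshots."""
    return sorted(new - old), sorted(old - new)
```

Calling snapshot() in a loop with time.sleep(1) and printing diff(previous, current) reproduces the once-per-second polling described above; namespaces that live for less than the interval can still be missed.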

Pitfalls

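  • PID 1 dies, everything dies. If the container entrypoint is a shell that execs your app, the app becomes PID 1 and signal handling changes: the kernel does not apply default signal actions to a namespace's PID 1, so SIGTERM is ignored unless the app installs a handler. Use tini or --init.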
  • Mount propagation surprises. A new mount namespace inherits propagation from the parent. If the parent is shared, mounts inside the namespace can leak to the host. Containers usually mount --make-rprivate / first.
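  • User namespaces interact with everything. Files written from inside as "root" are owned on the host by whatever outside UID the map assigns to inside-uid 0 (your own UID with a 0 1000 1 map); files written by other inside UIDs land in your subuid range. chown across the namespace boundary is gated by the map.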
  • cgroupv1 vs v2. The cgroup namespace is much more useful under v2, where the entire cgroup hierarchy is one tree. Under v1, hide-the-path matters less because controllers were already separate.
  • net namespace teardown is not free. Closing a netns blocks until kernel cleanup completes. On busy hosts this can stall ip netns del for seconds.
  • Time namespace is opt-in. Tools that read /proc/uptime or CLOCK_BOOTTIME see different values; tools that use CLOCK_REALTIME do not. The wall clock is shared.
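The mount propagation pitfall can be checked before it bites by reading /proc/self/mountinfo. A minimal sketch with a hypothetical helper named propagation; it relies on the documented mountinfo layout, where optional fields such as shared:N or master:N sit between the mount options and the "-" separator, and a mount with none of them is private.

```python
def propagation(mountinfo_line):
    """Return (mount_point, tags) for one line of /proc/<pid>/mountinfo."""
    fields = mountinfo_line.split()
    separator = fields.index("-")       # marks the end of the optional fields
    mount_point = fields[4]
    optional = fields[6:separator]      # e.g. ['shared:1'] or ['master:1']
    return mount_point, (optional or ["private"])


def shared_mounts(mountinfo_text):
    """Mount points whose events propagate to peers (tagged shared:N)."""
    result = []
    for line in mountinfo_text.splitlines():
        if line.strip():
            mount_point, tags = propagation(line)
            if any(tag.startswith("shared:") for tag in tags):
                result.append(mount_point)
    return result
```

If shared_mounts(open("/proc/self/mountinfo").read()) is non-empty inside a fresh mount namespace, new mounts under those paths can still reach the host.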

References

[man] man 7 namespaces: the definitive reference, kept current by Michael Kerrisk
[man] man 7 user_namespaces: UID mapping rules and capability semantics
[man] man 7 cgroups: the other half of containers
[man] man 2 unshare / clone / setns: the syscalls themselves
[man] man 8 ip-netns: network namespace management via iproute2
[lwn] LWN namespaces series: the long-form deep dive, in seven parts
[spec] OCI runtime-spec: what runc/crun/youki actually implement on top of these primitives
[tour] deep-dive (index): long-form companion: sectioned walk through the eight types, with diagrams
[cheat] one-page cheatsheet: dense reference, screenshot-friendly
[repo] namespaces-fun on GitHub: demo scripts, the nsm CLI, cheatsheet, deep-dive markdown