Reference Brief / Linux Kernel Namespaces
The kernel primitives behind every container
kernel 4.6+ · util-linux · OCI runtime spec
Eight namespace types partition kernel state, accessed through three syscalls (clone, unshare, setns). Most types require CAP_SYS_ADMIN to create; the user namespace is the exception, which is why rootless containers are built on it. A container is one process per namespace plus cgroups plus an image. This brief covers the types, the syscalls, the inode model, and the ops paths.
The eight namespace types
cgroup: virtualizes /proc/self/cgroup so the container sees its cgroup as /.
time: offsets CLOCK_MONOTONIC and CLOCK_BOOTTIME. CRIU uses it.
Runtime models, in one paragraph each
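runc / Docker / containerd: clone(2) with the CLONE_NEW* flags the runtime config requests (user and time namespaces are not created by default), capabilities trimmed, seccomp applied, pivot_root into a layered overlayfs rootfs, network plugged in by the engine (libnetwork for Docker, CNI for containerd under Kubernetes).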
Podman rootless: the user namespace is created first using your subuid range from /etc/subuid; all other namespaces are nested inside it so kernel capability checks pass against the namespaced credentials. No setuid binary on the hot path.
Kubernetes pod: the pause container holds the net, ipc, and uts namespaces; app containers join via setns(2). Mount and PID are per-container by default; PID can be shared with shareProcessNamespace: true.
From scratch (this repo): unshare --pid --uts --mount --ipc --net --cgroup --fork, mount /proc /sys /dev inside the rootfs, pivot_root. See demos/07-mini-container.sh. The rest of any runtime is image layout and policy.
Inode model
A namespace is a kernel object identified by an inode number. Two processes share a namespace iff their /proc/<pid>/ns/<type> symlinks resolve to the same inode. That is the entire ownership model.
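The comparison described above can be sketched in a few lines of Python. This is a minimal illustration, not code from this repo: the function names are made up for the example, and reading another user's /proc/<pid>/ns/ entries normally needs root or ptrace access.

```python
import os
import re

# Link targets under /proc/<pid>/ns/ look like "net:[4026531992]".
_NS_LINK = re.compile(r"^(?P<type>[a-z_]+):\[(?P<inode>\d+)\]$")


def parse_ns_link(target):
    """Split a link target such as 'net:[4026531992]' into ('net', 4026531992)."""
    match = _NS_LINK.match(target)
    if match is None:
        raise ValueError(f"not a namespace link: {target!r}")
    return match.group("type"), int(match.group("inode"))


def ns_inode(pid, ns_type):
    """Return the inode number behind /proc/<pid>/ns/<type> (Linux only)."""
    return parse_ns_link(os.readlink(f"/proc/{pid}/ns/{ns_type}"))[1]


def share_namespace(pid_a, pid_b, ns_type):
    """True when both processes are members of the same namespace of this type."""
    return ns_inode(pid_a, ns_type) == ns_inode(pid_b, ns_type)
```

For example, share_namespace(1, container_pid, "net") answers whether a container runs in the host network namespace, where container_pid is whatever PID the runtime reports.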
Syscalls
| syscall | signature | purpose |
|---|---|---|
| clone | clone(fn, stack, flags \| CLONE_NEW*, ...) | fork a child directly into new namespaces. What runtimes use. |
| unshare | unshare(flags) | move the calling process into new namespaces in place (for a PID namespace, only its future children move). What the CLI uses. |
| setns | setns(fd, nstype) | join an existing namespace through an fd opened on /proc/<pid>/ns/<type>. What nsenter uses. |
Type matrix
| type | CLONE flag | unshare | kernel | isolates |
|---|---|---|---|---|
| mnt | CLONE_NEWNS | -m | 2.4.19 (2002) | mount table |
| uts | CLONE_NEWUTS | -u | 2.6.19 (2006) | hostname, NIS domain |
| ipc | CLONE_NEWIPC | -i | 2.6.19 (2006) | SysV IPC, POSIX mqueues |
| pid | CLONE_NEWPID | -p | 2.6.24 (2008) | process IDs |
| net | CLONE_NEWNET | -n | 2.6.29 (2009) | full network stack |
| user | CLONE_NEWUSER | -U | 3.8 (2013) | UIDs, GIDs, capabilities |
| cgroup | CLONE_NEWCGROUP | -C | 4.6 (2016) | cgroup root path |
| time | CLONE_NEWTIME | -T | 5.6 (2020) | monotonic + boot clocks |
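To make the link between the unshare short options and the syscall flags concrete, here is a small sketch that turns a string of option letters into the bitmask passed to clone(2) or unshare(2). The numeric values are the ones defined in <linux/sched.h>; the table and function names are illustrative only.

```python
# unshare(1) short option -> (flag name, value from <linux/sched.h>)
CLONE_FLAGS = {
    "m": ("CLONE_NEWNS", 0x00020000),
    "u": ("CLONE_NEWUTS", 0x04000000),
    "i": ("CLONE_NEWIPC", 0x08000000),
    "p": ("CLONE_NEWPID", 0x20000000),
    "n": ("CLONE_NEWNET", 0x40000000),
    "U": ("CLONE_NEWUSER", 0x10000000),
    "C": ("CLONE_NEWCGROUP", 0x02000000),
    "T": ("CLONE_NEWTIME", 0x00000080),
}


def flags_for(short_options):
    """OR together the CLONE_NEW* flags for letters such as 'pn' (-p -n)."""
    mask = 0
    for letter in short_options:
        _name, value = CLONE_FLAGS[letter]  # KeyError on an unknown letter
        mask |= value
    return mask
```

flags_for("pn") is the mask for a new PID plus network namespace; an unknown letter raises KeyError rather than being ignored, so a typo cannot silently drop an isolation layer.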
PID namespace anatomy
Inside a new PID namespace, the first forked child becomes PID 1 with init semantics: it must reap orphans, and if it dies the kernel sends SIGKILL to every other process in the namespace. This is why tini, dumb-init, and Docker --init exist.
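The reaping duty is small enough to sketch. The function below shows the core of what an init shim has to do on every SIGCHLD: collect each exited child so none lingers as a zombie. It is a simplified illustration, not the implementation of tini or dumb-init, and it leaves out signal forwarding entirely.

```python
import os


def reap_exited_children():
    """Collect every already-exited child without blocking.

    Returns a dict mapping child PID to exit code (negative for a signal).
    """
    reaped = {}
    while True:
        try:
            pid, status = os.waitpid(-1, os.WNOHANG)
        except ChildProcessError:
            break  # this process has no children at all
        if pid == 0:
            break  # children exist, but none has exited yet
        reaped[pid] = os.waitstatus_to_exitcode(status)
    return reaped
```

A real init would call this from a SIGCHLD handler or a wait loop, and would exit with the main child's status once that child is reaped.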
The NSpid: field in /proc/<pid>/status lists a process's PID in each nested namespace, from the outermost to the innermost.
Network namespace anatomy
A new net namespace starts with a single lo interface, in the DOWN state. Connectivity is built by hand: a veth pair is a virtual ethernet cable; one end stays on the host, the other is moved into the namespace by ip link set ... netns .... Bridges, NAT, and routing finish the job.
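As a sketch of the by-hand wiring, the function below builds the iproute2 commands for one veth pair. It only returns argument lists and does not run anything. It assumes a named namespace that already exists (for example one created with ip netns add); the interface names and addresses are placeholders chosen by the caller.

```python
def veth_commands(netns, host_if, peer_if, host_cidr, peer_cidr):
    """Return the ip(8) invocations that connect a named netns to the host."""
    inside = ["ip", "netns", "exec", netns]
    return [
        # Create the cable: two linked interfaces, both on the host for now.
        ["ip", "link", "add", host_if, "type", "veth", "peer", "name", peer_if],
        # Move one end into the namespace.
        ["ip", "link", "set", peer_if, "netns", netns],
        # Address and enable the host end.
        ["ip", "addr", "add", host_cidr, "dev", host_if],
        ["ip", "link", "set", host_if, "up"],
        # Inside the namespace: loopback starts DOWN, so bring it up too.
        inside + ["ip", "link", "set", "lo", "up"],
        inside + ["ip", "addr", "add", peer_cidr, "dev", peer_if],
        inside + ["ip", "link", "set", peer_if, "up"],
    ]
```

Running each list through subprocess.run(argv, check=True) as root applies the plan. Bridging, NAT, and routes are deliberately left out.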
User namespace and the UID map
The user namespace is the only one an unprivileged user can create directly. Inside the namespace, the kernel checks capabilities against the namespaced credentials, so the process can mount, change hostname, and create other namespaces. From the host the process is still its real UID and has no extra power on host-owned objects.
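The translation between inside and outside IDs is driven by /proc/<pid>/uid_map, where each line holds three numbers: first inside ID, first outside ID, and range length. The sketch below parses that format and maps an inside ID to its outside ID. The function names are illustrative, and returning None for an unmapped ID is a choice made for this example.

```python
def parse_id_map(text):
    """Parse uid_map/gid_map content into (inside, outside, count) tuples."""
    entries = []
    for line in text.splitlines():
        if not line.strip():
            continue
        inside, outside, count = (int(field) for field in line.split())
        entries.append((inside, outside, count))
    return entries


def to_outside(inside_id, entries):
    """Translate an ID seen inside the namespace to the ID seen outside it."""
    for inside, outside, count in entries:
        if inside <= inside_id < inside + count:
            return outside + (inside_id - inside)
    return None  # not covered by any range
```

With a typical rootless map of "0 1000 1" followed by "1 100000 65536", inside-root is UID 1000 on the host and inside-UID 1 is host UID 100000.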
uid_map: 0 1000 1 means inside-uid 0 maps to outside-uid 1000 for one id.
/proc layout
Persistence
A namespace lives as long as something references it: a process inside it, an open fd to its /proc/<pid>/ns/<type>, or a bind mount of that file onto a stable path. ip netns add foo uses the bind-mount trick under /var/run/netns/foo. nsm create uses a long-lived holder process. Both work, both leak if you forget to clean them up.
| method | used by | holds via | cleanup |
|---|---|---|---|
| bind mount | ip netns, CRIU | mount --bind /proc/<pid>/ns/X /var/run/... | umount |
| holder process | nsm, runc, podman | long-lived sleep infinity or container init | kill the holder |
| open fd | library code | process keeps an fd to the ns file | close the fd |
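The open-fd row is the easiest one to show in code. A minimal sketch, assuming a Linux /proc and permission to read the target's ns directory; the helper names are made up for this example.

```python
import os


def pin_namespace(pid, ns_type):
    """Open /proc/<pid>/ns/<type>; the namespace stays alive while the fd is open."""
    return os.open(f"/proc/{pid}/ns/{ns_type}", os.O_RDONLY)


def pinned_inode(fd):
    """Inode number of the namespace a pinned fd refers to."""
    return os.fstat(fd).st_ino
```

The fd keeps the namespace referenced even after every member process has exited, and os.close(fd) drops that reference. On Python 3.12 or newer the same fd can be handed to os.setns to join the namespace.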
Building a container, phase by phase
- Prepare a rootfs. debootstrap, docker export, or anything that drops a Linux userland into a directory.
- Create a cgroup and set limits. cgroups v2: write to memory.max, cpu.max, pids.max.
- unshare into all namespace types. --pid --uts --mount --ipc --net --cgroup --fork --map-root-user.
- Mount /proc /sys /dev inside the rootfs. Without /proc, ps sees the host.
- pivot_root into the rootfs. Real containers use this, not chroot. After umount -l /.old_root the host fs is gone.
- Configure network. Create a veth pair, move one end in, assign IPs, add NAT on the host.
- Drop capabilities, apply seccomp, exec the entrypoint. Educational demos skip this; production runtimes do not.
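The cgroup phase from the list above comes down to writing three small text files. The sketch below shows the value formats cgroups v2 expects. It assumes the cgroup directory already exists and that the memory, cpu, and pids controllers are enabled for it in the parent's cgroup.subtree_control; the function names and the example numbers are illustrative.

```python
import os


def cgroup_v2_limits(memory_bytes, cpu_quota_us, cpu_period_us, max_pids):
    """Build the file contents for the three limits named in the phase list."""
    return {
        "memory.max": str(memory_bytes),
        # cpu.max takes "<quota> <period>", both in microseconds.
        "cpu.max": f"{cpu_quota_us} {cpu_period_us}",
        "pids.max": str(max_pids),
    }


def write_limits(cgroup_dir, limits):
    """Write each limit into its control file under cgroup_dir."""
    for filename, value in limits.items():
        with open(os.path.join(cgroup_dir, filename), "w") as handle:
            handle.write(value + "\n")
```

For instance, cgroup_v2_limits(256 * 1024 * 1024, 50000, 100000, 64) describes 256 MiB of memory, half a CPU, and at most 64 processes. Writing the container's PID into cgroup.procs in the same directory then places it under those limits.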
Operations and debugging
- Find a container's PID: docker inspect -f '{{.State.Pid}}' <name>, then everything is at /proc/<pid>/ns/. Same for podman, crictl, k8s.
- Enter without the daemon: nsenter --target <pid> --all bash works even when dockerd is wedged. The kernel doesn't care which userspace asked.
- Network debugging: nsenter --target <pid> --net bash, then tcpdump, ss, iptables-save, ip route all run in the container's network view.
- Compare to host: run both readlink /proc/1/ns/X and readlink /proc/<pid>/ns/X; equal inodes mean the container is sharing that subsystem with the host.
- List system-wide: lsns, optionally -t net to filter, gives you holder-process and member counts.
- Watch for change: nsm monitor diff-walks /proc/*/ns/* once per second and prints additions and removals. Useful to catch a runtime spawning short-lived helper namespaces.
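The diff-walk idea can be sketched independently of any tool. The code below is an illustration of the technique only and makes no claim about how nsm monitor is implemented: it collects every namespace link visible under /proc and compares two such snapshots.

```python
import glob
import os


def snapshot(proc_root="/proc"):
    """Return the set of namespace link targets, e.g. {'net:[4026531992]', ...}."""
    seen = set()
    for path in glob.glob(os.path.join(proc_root, "[0-9]*", "ns", "*")):
        try:
            seen.add(os.readlink(path))
        except OSError:
            continue  # process exited mid-walk, or we may not read it
    return seen


def diff(old, new):
    """Return (added, removed) namespace identifiers between two snapshots."""
    return sorted(new - old), sorted(old - new)
```

Calling snapshot() in a loop with a one-second sleep and printing diff(previous, current) gives a poor man's monitor. Namespaces that appear and vanish between two walks are missed, which is the inherent limit of polling.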
Pitfalls
- PID 1 dies, everything dies. If the container entrypoint is a shell that execs your app, the app becomes PID 1 and signal handling changes. Use tini or --init.
- Mount propagation surprises. A new mount namespace inherits propagation from the parent. If the parent is shared, mounts inside the namespace can leak to the host. Containers usually mount --make-rprivate / first.
- User namespaces interact with everything. Files written from inside as "root" are owned by your subuid range on the host. chown across the namespace boundary is gated by the map.
- cgroup v1 vs v2. The cgroup namespace is much more useful under v2, where the entire cgroup hierarchy is one tree. Under v1, hide-the-path matters less because controllers were already separate.
- net namespace teardown is not free. Closing a netns blocks until kernel cleanup completes. On busy hosts this can stall ip netns del for seconds.
- Time namespace is opt-in. Tools that read /proc/uptime or CLOCK_BOOTTIME see different values; tools that use CLOCK_REALTIME do not. The wall clock is shared.
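The mount propagation pitfall can be checked before it bites. Each line of /proc/self/mountinfo carries optional fields between the mount options and a lone "-" separator, and a mount in a shared peer group has a shared:N tag there. The sketch below lists such mount points; the function name is illustrative.

```python
def shared_mount_points(mountinfo_text):
    """Return mount points whose mountinfo line carries a 'shared:N' tag."""
    shared = []
    for line in mountinfo_text.splitlines():
        fields = line.split()
        if "-" not in fields:
            continue  # blank or malformed line
        separator = fields.index("-")
        # Fields 0-5 are fixed; optional tags sit between them and the "-".
        optional = fields[6:separator]
        if any(tag.startswith("shared:") for tag in optional):
            shared.append(fields[4])  # field 5 of the format is the mount point
    return shared
```

Feeding it the contents of /proc/self/mountinfo right after unshare shows which mounts would still propagate events to the host; an empty list is what mount --make-rprivate / is meant to produce.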