Linux Namespaces

Kernel Primitives Behind Every Container
demos & nsm CLI on github.com/hed0rah/namespaces-fun // one-page cheatsheet // whitepaper

Beyond the Container

A namespace is a kernel mechanism that gives a set of processes its own private view of some global resource. Eight resources can be partitioned: hostnames, process IDs, mount points, network stacks, IPC objects, user IDs, cgroup paths, and monotonic clocks. A container is a process started inside fresh copies of most or all of them, plus cgroups for limits, plus a filesystem image. Strip those away and you are left with these primitives. They are syscalls. You can call them from bash.

This page walks the eight types from the simplest (UTS) to the most security-relevant (user), then assembles a working container from scratch using only unshare and pivot_root. The same scripts ship in demos/ in the repo.

How to read this: Each section is independent. If you have ten minutes, read Types, /proc, and Container. If you are debugging a runtime issue, jump to Ops. If you want the dense reference instead of the tour, the whitepaper covers the same ground in three columns with hover-context cards.

Quick Start

Clone the repo and start with the demos. Each script holds the namespace open for 60 seconds so you can poke at it from another terminal. Ctrl+C tears it down early.

install
$ git clone https://github.com/hed0rah/namespaces-fun
$ cd namespaces-fun
$ chmod +x demos/*.sh nsm/nsm

# gentlest intro: hostname isolation
$ sudo ./demos/01-uts-hostname.sh

# the Docker way: veth pair, two namespaces, one cable
$ sudo ./demos/03-net-namespace.sh

# no sudo: rootless containers basis
$ ./demos/05-user-namespace.sh

# a real container in pure bash
$ sudo ./demos/07-mini-container.sh

# install the nsm CLI
$ sudo cp nsm/nsm /usr/local/bin/nsm
$ nsm list                   # all namespaces on the system
$ sudo nsm diff 1 $$         # compare your shell to init

The Eight Types

Each type isolates one specific kernel resource. They were added one at a time across nearly two decades. Runtimes such as runc, podman, and crun create some or all of the eight on every container start; which ones depends on configuration.

type    CLONE flag       unshare  kernel         isolates
mnt     CLONE_NEWNS      -m       2.4.19 (2002)  mount table
uts     CLONE_NEWUTS     -u       2.6.19 (2006)  hostname, NIS domain
ipc     CLONE_NEWIPC     -i       2.6.19 (2006)  SysV IPC, POSIX mqueues
pid     CLONE_NEWPID     -p       2.6.24 (2008)  process IDs
net     CLONE_NEWNET     -n       2.6.29 (2009)  full network stack
user    CLONE_NEWUSER    -U       3.8 (2013)     UIDs, GIDs, capabilities
cgroup  CLONE_NEWCGROUP  -C       4.6 (2016)     cgroup root path
time    CLONE_NEWTIME    -T       5.6 (2020)     monotonic + boot clocks

What is not isolated: The wall clock (CLOCK_REALTIME) is global; the time namespace only virtualises monotonic and boot clocks. Disk I/O, kernel modules, most sysctls outside net.*, and the kernel itself are shared. Resource quantities (CPU, memory, PIDs) are bounded by cgroups, not namespaces.
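
Which of the eight types a given kernel actually exposes can be read straight from /proc. A minimal sketch, assuming a Linux /proc is mounted; the function name ns_types is made up for this example and is not part of the repo:

```shell
#!/bin/sh
# ns_types (hypothetical helper): print one namespace type per line,
# as exposed by the running kernel under /proc/self/ns.
# The *_for_children entries are skipped: they are not separate types.
ns_types() {
    for entry in /proc/self/ns/*; do
        [ -L "$entry" ] || continue      # no /proc, or the glob did not match
        name=${entry##*/}
        case $name in
            *_for_children) continue ;;
        esac
        printf '%s\n' "$name"
    done
}

ns_types
```

On a 5.6 or newer kernel this should print eight names; an older kernel will simply lack the time entry.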

The Inode Model

A namespace is a kernel object identified by an inode number. Every process exposes its membership through a directory of magic symlinks under /proc/<pid>/ns/. Two processes share a namespace if and only if their symlinks resolve to the same inode. That is the entire ownership model.

/proc/<pid>/ns/ on a typical host
$ ls -la /proc/$$/ns/
cgroup           -> cgroup:[4026531835]
ipc              -> ipc:[4026531839]
mnt              -> mnt:[4026531841]
net              -> net:[4026531840]
pid              -> pid:[4026531836]
pid_for_children -> pid:[4026531836]
time             -> time:[4026531834]
time_for_children -> time:[4026531834]
user             -> user:[4026531837]
uts              -> uts:[4026531838]

The bracketed integer is the inode. Compare two processes by reading both symlinks. pid_for_children and time_for_children are the namespaces that a future fork() will land in. They differ from pid and time once the process has called unshare(CLONE_NEWPID) or unshare(CLONE_NEWTIME): the caller itself stays in its original namespace and only children created afterwards start in the new one.

# two processes: PID 1 (init) vs PID 9241 (a container)

PID 1 (init)                       PID 9241 (container)
|-- mnt:[4026531841]   host        |-- mnt:[4026532091]   isolated
|-- net:[4026531840]   host        |-- net:[4026532092]   isolated
|-- pid:[4026531836]   host        |-- pid:[4026532093]   isolated
|-- uts:[4026531838]   host        |-- uts:[4026532094]   isolated
|-- ipc:[4026531839]   host        |-- ipc:[4026532095]   isolated
|-- cgr:[4026531835]   host        |-- cgr:[4026532096]   isolated
|-- user:[4026531837]  SHARED      |-- user:[4026531837]  SHARED
`-- time:[4026531834]  SHARED      `-- time:[4026531834]  SHARED

This container shares the user and time namespaces with the host (typical for runc without rootless mode) but has its own copy of everything else. Sharing is per-type and decided at clone() time; afterwards a process can change its membership only by calling unshare() or setns().
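
The comparison above can be reproduced for any two PIDs with nothing but readlink. This is a minimal sketch, not the repo's nsm diff; ns_compare is a name invented here, and reading another user's /proc/<pid>/ns usually needs root:

```shell
#!/bin/sh
# ns_compare PID_A PID_B (hypothetical helper): for each namespace type,
# report whether the two processes point at the same inode.
# Entries that cannot be read are reported as "unknown".
ns_compare() {
    for t in mnt net pid uts ipc cgroup user time; do
        a=$(readlink "/proc/$1/ns/$t" 2>/dev/null)
        b=$(readlink "/proc/$2/ns/$t" 2>/dev/null)
        if [ -z "$a" ] || [ -z "$b" ]; then
            state=unknown
        elif [ "$a" = "$b" ]; then
            state=SHARED
        else
            state=isolated
        fi
        printf '%-7s %-22s %-22s %s\n' "$t" "${a:--}" "${b:--}" "$state"
    done
}

# a process compared with itself shares everything
ns_compare "$$" "$$"
```

Run it as sudo with PID 1 and a container's PID to get the two-column picture shown above.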

Persistence: A namespace lives as long as something references it: a process inside it, an open file descriptor to its ns file, or a bind mount of that file onto a stable path. ip netns add uses the bind-mount trick at /var/run/netns/<name>. nsm create uses a long-lived holder process.
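
To see the first kind of reference, the processes inside a namespace, it is enough to scan /proc for matching symlinks. A minimal sketch under that assumption; ns_holders is a hypothetical name, and it does not find open file descriptors or bind mounts:

```shell
#!/bin/sh
# ns_holders TYPE ID (hypothetical helper): print the PIDs whose
# /proc/<pid>/ns/TYPE link equals ID, for example "net:[4026531840]".
# Processes we are not allowed to inspect are skipped silently,
# so run it as root for a complete answer.
ns_holders() {
    for dir in /proc/[0-9]*; do
        link=$(readlink "$dir/ns/$1" 2>/dev/null) || continue
        if [ "$link" = "$2" ]; then
            printf '%s\n' "${dir#/proc/}"
        fi
    done
    return 0
}

# every process that shares this shell's network namespace
ns_holders net "$(readlink "/proc/$$/ns/net")"
```

An empty result for a namespace that still shows up in lsns suggests it is being held by a file descriptor or a bind mount instead.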

Three Syscalls

Everything userspace does to namespaces routes through three syscalls. Container runtimes call them via libc; the CLI tools wrap them.

syscall  signature                                used by            purpose
clone    clone(fn, stack, flags|CLONE_NEW*, ...)  runc, crun, youki  fork a child directly into new namespaces. No race window.
unshare  unshare(flags)                           unshare(1) CLI     move the calling process into new namespaces in place.
setns    setns(fd, nstype)                        nsenter(1)         join an existing namespace by fd to /proc/<pid>/ns/<type>.

The CLI wrappers unshare(1) and nsenter(1) ship with util-linux. The kernel does not require a daemon to be alive: if dockerd is wedged, you can still nsenter --target $(pidof your-app) --all bash to get inside.

UTS // Hostname

The simplest namespace. Isolates only the hostname and NIS domain name. Use it as your introduction to the API; almost nothing can go wrong.

demo: 01-uts-hostname.sh
# terminal 1
$ sudo unshare --uts bash
# hostname namespace-land
# hostname
namespace-land

# terminal 2 (host)
$ hostname
your-real-hostname

Under the hood: unshare(CLONE_NEWUTS) tells the kernel to copy the calling process's UTS struct. The caller and its future children see the copy; every other process keeps the original. sethostname(2) writes only to the copy. When the last process in the namespace exits, the copy is freed.

PID // Where did my processes go

A new PID namespace gives the child its own PID number space. The first process inside is PID 1, with the kernel's init semantics: it must reap orphans, and if it dies the kernel sends SIGKILL to every other process in the namespace. This is why containers ship tini, dumb-init, or Docker's --init.

demo: 02-pid-namespace.sh
$ sudo unshare --pid --fork --mount-proc bash
# echo $$
1
# ps aux
USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
root         1  0.0  0.0  10300  3204 pts/0    S    14:22   0:00 bash
root         8  0.0  0.0  11176  2940 pts/0    R+   14:22   0:00 ps aux
The --fork is required: unshare(CLONE_NEWPID) does not move the caller into the new PID namespace; only the caller's children are born there. unshare --fork tells the CLI to fork before exec'ing the target program. Forget it and you get the strange behaviour of being "in" a namespace where $$ still refers to your old PID.

A process in a new PID namespace has two PIDs: its host PID (visible from outside) and its in-namespace PID (visible from inside). Both are recorded in /proc/<host-pid>/status on the line NSpid:, listed outermost first, so the last entry is the PID the process sees for itself.

two-PID identity
# on the host
$ ps -ef | grep sleep
root     12384 12381  0 14:22 pts/0    00:00:00 sleep 60

$ grep NSpid /proc/12384/status
NSpid:    12384    1

# inside the namespace
# echo $$
1
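
Pulling the two identities out of the NSpid: line is a one-line awk job. A minimal sketch; nspid_list and innermost_pid are names made up for this example, and the NSpid: field is assumed to be present (kernel 4.1 or later):

```shell
#!/bin/sh
# nspid_list FILE (hypothetical helper): print the PIDs from the NSpid:
# line of a /proc/<pid>/status file, one per line, outermost namespace first.
nspid_list() {
    awk '$1 == "NSpid:" { for (i = 2; i <= NF; i++) print $i }' "$1"
}

# innermost_pid FILE (hypothetical helper): the PID the process sees
# for itself, i.e. the last entry on the NSpid: line.
innermost_pid() {
    nspid_list "$1" | tail -n 1
}

# for a process that is not in a nested PID namespace both values are equal
nspid_list "/proc/$$/status"
```

For the sleep process above, innermost_pid /proc/12384/status would print 1 when run on the host.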

Network // Build a stack from scratch

A new net namespace starts with exactly one interface: lo, in the DOWN state. No routes, no neighbours, no iptables rules. You build connectivity by hand. A veth pair is a virtual ethernet cable: two devices, one end stays on the host, the other is moved into the namespace by ip link set ... netns ....

   host network namespace              container netns
+--------------------------+       +---------------------+
|                          |       |                     |
|  eth0                    |       |   +-- veth-c        |
|                          |       |   |   10.200.1.2    |
|  docker0  10.200.1.1     |       |   |                 |
|                          |       |   |   lo (UP)       |
|  veth-h ----+            |       |   |                 |
|             |            |       |   |                 |
+-------------|------------+       +---|-----------------+
              |                        |
              +------ veth pair -------+

# default route inside the namespace points at 10.200.1.1
# host runs iptables NAT to forward to the world

demo: 03-net-namespace.sh (abbreviated)
$ sudo ip netns add demo
$ sudo ip link add veth-h type veth peer name veth-c
$ sudo ip link set veth-c netns demo
$ sudo ip addr add 10.200.1.1/24 dev veth-h
$ sudo ip link set veth-h up
$ sudo ip netns exec demo ip addr add 10.200.1.2/24 dev veth-c
$ sudo ip netns exec demo ip link set veth-c up
$ sudo ip netns exec demo ip link set lo up
$ sudo ping -c1 10.200.1.2
PING 10.200.1.2 56(84) bytes of data.
64 bytes from 10.200.1.2: icmp_seq=1 ttl=64 time=0.045 ms

Reach the internet: It takes three ingredients. Enable IP forwarding on the host (sysctl net.ipv4.ip_forward=1), add a NAT rule (iptables -t nat -A POSTROUTING -s 10.200.1.0/24 -j MASQUERADE), and set a default route inside the namespace (ip route add default via 10.200.1.1). That is essentially Docker's bridge networking, with DNS forwarding added on top.
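
The three ingredients fit in one small function. The sketch below is deliberately a dry run: it prints the commands and only executes them when APPLY=1 is set. The names run, netns_uplink, and APPLY are assumptions of this example, not part of the repo's demo:

```shell
#!/bin/sh
# run (hypothetical helper): print the command by default,
# execute it only when APPLY=1 (which needs root).
run() {
    if [ "${APPLY:-0}" = 1 ]; then
        "$@"
    else
        printf '%s\n' "$*"
    fi
}

# netns_uplink NETNS SUBNET GATEWAY (hypothetical helper)
netns_uplink() {
    run sysctl -w net.ipv4.ip_forward=1                        # 1. forwarding
    run iptables -t nat -A POSTROUTING -s "$2" -j MASQUERADE   # 2. NAT
    run ip netns exec "$1" ip route add default via "$3"       # 3. default route
}

netns_uplink demo 10.200.1.0/24 10.200.1.1
```

Printing first makes it easy to review what will change on the host; the MASQUERADE rule in particular persists until it is deleted with the matching iptables -D.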

Mount // Your filesystem is a lie

A mount namespace gives the child a copy of the parent's mount table. New mounts and unmounts are local to that copy. Combined with pivot_root, this is how a container gets its own root filesystem.

demo: 04-mount-namespace.sh
$ sudo unshare --mount bash
# mkdir -p /tmp/secret
# mount -t tmpfs tmpfs /tmp/secret
# echo "in-ns" > /tmp/secret/file
# ls /tmp/secret
file

# in another terminal, on the host
$ ls /tmp/secret
# empty - the tmpfs only exists inside the namespace

Mount propagation

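Each mount has a propagation type that controls whether mount and unmount events under it flow to peer mounts in other namespaces. The four modes:

  shared      events propagate in both directions between peers
  private     no events propagate in or out
  slave       events propagate in from the master, but not back out
  unbindable  private, and additionally cannot be used as the source of a bind mount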

Look at the current state with findmnt -o TARGET,PROPAGATION. Containers usually run mount --make-rprivate / first, so subsequent mounts inside the namespace cannot leak out. Skipping this is a classic source of "I unmounted it inside the container and it disappeared from the host" bugs.
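
The same information findmnt prints is available in the optional fields of /proc/self/mountinfo, which helps on minimal images that lack findmnt. A minimal sketch; propagation_of is a hypothetical name, and the mapping assumes the documented tags shared:N, master:N, and unbindable:

```shell
#!/bin/sh
# propagation_of FILE (hypothetical helper): read mountinfo-formatted lines
# and print "<mount point> <propagation>" for each.
# Optional fields sit between field 6 and the "-" separator:
#   shared:N -> shared, master:N -> slave, unbindable; none at all -> private.
propagation_of() {
    awk '
        function add(list, item) { return (list == "") ? item : (list "," item) }
        {
            mode = ""
            for (i = 7; i <= NF && $i != "-"; i++) {
                if ($i ~ /^shared:/)         mode = add(mode, "shared")
                else if ($i ~ /^master:/)    mode = add(mode, "slave")
                else if ($i == "unbindable") mode = add(mode, "unbindable")
            }
            print $5, (mode == "" ? "private" : mode)
        }
    ' "$1"
}

propagation_of /proc/self/mountinfo
```

A mount that is both a slave of one peer group and shared with another shows up as shared,slave.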

pivot_root, not chroot

Real containers swap their root filesystem with pivot_root(2), then unmount the old root. This denies the container any path back to the host filesystem. chroot can be escaped from with CAP_SYS_CHROOT; pivot_root followed by umount -l /old cannot.

User // Fake root, real power

The user namespace maps UIDs and GIDs from outside to inside, and gives the calling process a full capability set against the namespaced credentials. It is the only namespace that can be created without root, and the foundation of rootless containers (podman, buildah).

demo: 05-user-namespace.sh - no sudo
$ id
uid=1000(user) gid=1000(user)

$ unshare --user --map-root-user bash
# id
uid=0(root) gid=0(root)

# cat /proc/self/uid_map
         0       1000          1

# cat /etc/shadow
cat: /etc/shadow: Permission denied

The map line 0 1000 1 reads: inside-uid 0 maps to outside-uid 1000, for a range of one. Inside the namespace you are "root" with the full capability set against namespaced objects. Host files owned by UIDs outside the map show up as the overflow uid nobody, and your namespaced capabilities do not apply to them, so the ordinary permission checks for your real UID still decide access. That is why /etc/shadow stays unreadable.

Distros may block this: Ubuntu 24.04+ ships an AppArmor restriction (kernel.apparmor_restrict_unprivileged_userns=1) that confines unprivileged unshare(1) to a profile lacking CAP_SYS_ADMIN, breaking the demo with "write failed /proc/self/uid_map". Toggle the sysctl to 0 for a one-off, or ship an AppArmor profile (what podman does). The kernel-level toggle is kernel.unprivileged_userns_clone on Debian-derived kernels.

Map ranges larger than 1 require newuidmap and newgidmap setuid helpers, configured via /etc/subuid and /etc/subgid. This is the part that lets a podman container see uid 0..65535 inside while only owning uid 100000..165535 on the host.
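
Reading a map is plain arithmetic: find the line whose inside range contains the UID, then add the distance from the range start to the outside start. A minimal sketch; map_uid is a hypothetical helper, and "outside" means the parent user namespace, which is the host when there is no nesting:

```shell
#!/bin/sh
# map_uid INSIDE_UID [MAP_FILE] (hypothetical helper): translate a UID as
# seen inside a user namespace to the UID seen outside, using uid_map-style
# lines "<inside-start> <outside-start> <count>".
# Prints nothing and returns 1 if the UID is not covered by any line.
map_uid() {
    awk -v id="$1" '
        id >= $1 && id < $1 + $3 { print $2 + (id - $1); found = 1; exit }
        END { exit !found }
    ' "${2:-/proc/self/uid_map}"
}

# with no second argument it reads the caller's own map
map_uid 0
```

With the single-line map 0 100000 65536, map_uid 1000 prints 101000, and any UID of 65536 or above is unmapped.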

IPC // Can you hear me now

Isolates System V IPC objects (message queues, semaphores, shared memory segments) and POSIX message queues. Two namespaces cannot see each other's queues by ID. Without this, containers could communicate through legacy IPC mechanisms behind their operators' backs.

demo: 06-ipc-namespace.sh
# on the host
$ sudo ipcmk -Q
Message queue id: 3

$ sudo unshare --ipc bash
# ipcs -q
# the host's queue id 3 is not visible

# ipcmk -Q
Message queue id: 0  # numbering starts fresh

Kubernetes pods share the IPC namespace across containers in the pod by default. PID can be shared explicitly with shareProcessNamespace: true. That is by design: pod members are meant to be tightly coupled.

Cgroup // Hide the path

The cgroup namespace virtualises /proc/self/cgroup so the container sees its cgroup as the root /. It does not change enforcement: limits are still set from the outside. It changes what the container sees about its own placement.

demo: 09-cgroup-namespace.sh
# host view, before entering the cgroup namespace
$ cat /proc/self/cgroup
0::/user.slice/user-1000.slice/session-7.scope

$ sudo unshare --cgroup bash
# cat /proc/self/cgroup
0::/  # the container thinks it is at the root of the cgroup tree

The benefit shows up under cgroups v2 where the entire hierarchy is one tree. Under v1 the controllers were already separate, so hiding the path mattered less.
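
Under cgroups v2 the path a process sees is the single 0:: line of /proc/self/cgroup. A minimal sketch for extracting it; cgroup_v2_path is a name invented here, and the /sys/fs/cgroup mount point in the usage line is the conventional location, assumed rather than detected:

```shell
#!/bin/sh
# cgroup_v2_path [FILE] (hypothetical helper): print the path from the
# cgroup v2 entry ("0::<path>") of a /proc/<pid>/cgroup-style file.
# v1 entries ("<id>:<controllers>:<path>") are ignored.
cgroup_v2_path() {
    sed -n 's/^0:://p' "${1:-/proc/self/cgroup}"
}

# directory holding this shell's control files, as visible from here
echo "/sys/fs/cgroup$(cgroup_v2_path)"
```

Inside a fresh cgroup namespace the function prints /, even though the host still sees the full path.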

Setting limits

cgroups and namespaces are independent kernel features. You set limits by writing to control files under /sys/fs/cgroup/<your-cgroup>/:

$ sudo mkdir /sys/fs/cgroup/mybox
$ echo 67108864     | sudo tee /sys/fs/cgroup/mybox/memory.max  # 64 MiB
$ echo "50000 100000" | sudo tee /sys/fs/cgroup/mybox/cpu.max     # 50% of one core
$ echo 20           | sudo tee /sys/fs/cgroup/mybox/pids.max    # max 20 procs
$ echo $$           | sudo tee /sys/fs/cgroup/mybox/cgroup.procs # put us in
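
The numbers written above are easier to get right when computed rather than typed. A minimal sketch of the arithmetic; mib_to_bytes and cpu_max are hypothetical helper names, and the 100000 microsecond period is simply the value used in the example above:

```shell
#!/bin/sh
# mib_to_bytes N (hypothetical helper): value for memory.max
mib_to_bytes() {
    echo $(( $1 * 1024 * 1024 ))
}

# cpu_max PERCENT [PERIOD_US] (hypothetical helper): value for cpu.max,
# "<quota> <period>" in microseconds. 100 means one full core, 200 two.
cpu_max() {
    period=${2:-100000}
    echo "$(( $1 * period / 100 )) $period"
}

mib_to_bytes 64     # 67108864
cpu_max 50          # 50000 100000
cpu_max 200         # 200000 100000
```

The outputs can be piped to sudo tee exactly as in the commands above.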

Time // Per-namespace clocks

The newest type. Per-namespace offsets for CLOCK_MONOTONIC and CLOCK_BOOTTIME. CRIU uses it to keep monotonic time consistent across checkpoint and restore. Wall-clock time (CLOCK_REALTIME) is global and not virtualised.

offsets via /proc/<pid>/timens_offsets
# offsets can only be written before the first process enters the namespace,
# so let unshare(1) set them on the way in
$ sudo unshare --time --boottime 86400 --fork bash
# uptime
  14:22:03 up 1 day, 14 min,  0 users,  load average: 0.00, 0.00, 0.00

The host's actual uptime is unaffected. Tools that read CLOCK_BOOTTIME or /proc/uptime see the offset; CLOCK_MONOTONIC has its own, separately set offset; date and anything else based on the wall clock are unchanged.
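
The offsets in effect can be read back from the same file. A minimal sketch, assuming the file lists one clock per line as "<clock> <seconds> <nanoseconds>" with the clock names monotonic and boottime; timens_offset is a hypothetical helper name:

```shell
#!/bin/sh
# timens_offset CLOCK [FILE] (hypothetical helper): print the offset in
# whole seconds for "monotonic" or "boottime" from a timens_offsets-style file.
timens_offset() {
    awk -v clock="$1" '$1 == clock { print $2 }' "${2:-/proc/self/timens_offsets}"
}

# kernels older than 5.6 have no such file, so check before reading
if [ -r /proc/self/timens_offsets ]; then
    timens_offset boottime
fi
```

On the host this prints 0; inside the namespace from the demo it should print 86400.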

Container // From scratch in bash

Combine all of the above and you have a container. The script demos/07-mini-container.sh ships in the repo and does this in seven steps. The whole runtime is unshare, mount, and pivot_root.

  1. Prepare a rootfs. debootstrap, docker export, or copy a static busybox with applet symlinks. The demo will install busybox automatically if your rootfs is empty.
  2. Create a cgroup, set limits. Under cgroups v2: write to memory.max, cpu.max, pids.max. Place the container's holder process into cgroup.procs before exec'ing unshare so descendants inherit membership.
  3. unshare into six of the eight namespace types. unshare --pid --uts --mount --ipc --net --cgroup --fork (user and time stay shared with the host). With --fork, the new shell is PID 1 in the new namespace.
  4. Make the mount namespace private. mount --make-rprivate /. Otherwise mounts inside leak back to the host.
  5. Bind-mount the rootfs onto itself. pivot_root requires new_root to be a mount point distinct from the current /. A self-bind is the cheapest way to satisfy this.
  6. Mount /proc, /sys, /dev inside the rootfs. Without these the container's tools cannot see processes, devices, or kernel objects.
  7. pivot_root and exec the entrypoint. pivot_root . .old_root, umount -l /.old_root, then exec /bin/sh. Production runtimes also drop capabilities and apply seccomp here. The demos do not.

Run it: sudo ./demos/07-mini-container.sh drops you into a fresh shell where ps shows two processes, hostname says tiny-container, and mount | head shows only what you mounted. exit tears the whole thing down. That is a container.
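
Steps 3 to 7 above can be sketched in about twenty lines. This is an illustration written for this page, not the repo's 07-mini-container.sh: the function names are made up, the cgroup setup of step 2 is left out, and the rootfs is assumed to contain /bin/sh and /bin/umount (a busybox rootfs does).

```shell
#!/bin/sh
# container_init_script (hypothetical helper): print the script that runs
# as PID 1 inside the new namespaces. It takes the rootfs path as $1.
container_init_script() {
    cat <<'EOF'
set -e
rootfs=$1
mount --make-rprivate /                       # step 4: nothing leaks to the host
mount --bind "$rootfs" "$rootfs"              # step 5: make rootfs a mount point
mkdir -p "$rootfs/proc" "$rootfs/sys" "$rootfs/dev"
mount -t proc  proc  "$rootfs/proc"           # step 6
mount -t sysfs sysfs "$rootfs/sys"
mount --rbind  /dev  "$rootfs/dev"
hostname tiny-container
cd "$rootfs"
mkdir -p .old_root
pivot_root . .old_root                        # step 7: swap the root
umount -l /.old_root                          # drop the path back to the host
exec /bin/sh
EOF
}

# mini_container ROOTFS (hypothetical helper): step 3, then hand over to the
# script above. Needs root. Nothing runs until this function is called.
mini_container() {
    unshare --pid --uts --mount --ipc --net --cgroup --fork \
        sh -c "$(container_init_script)" sh "$1"
}
```

Calling sudo sh -c '. ./sketch.sh; mini_container /path/to/rootfs' (the file name is a placeholder) would start the shell. Keeping the inner script in its own function makes it possible to review or syntax-check it with sh -n before anything touches the mount table.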

Ops // Debugging live containers

Every container is a process. Find its PID and you have everything.

Find a container's PID

$ docker inspect -f '{{.State.Pid}}' mycontainer   # docker
$ podman inspect -f '{{.State.Pid}}' mycontainer   # podman
$ crictl inspect <id> | jq '.info.pid'          # cri-o / containerd via crictl

$ PID=$(docker inspect -f '{{.State.Pid}}' mycontainer)
$ ls -la /proc/$PID/ns/                              # the eight namespaces

Enter, even when the daemon is wedged

$ sudo nsenter --target $PID --all bash               # into all namespaces
$ sudo nsenter --target $PID --net bash               # just the network
$ sudo nsenter --target $PID --mount --uts bash       # a subset

The kernel does not require dockerd or kubelet to be alive. Any tool in any namespace works: tcpdump inside --net, strace inside --pid, ls /proc/<in-ns-pid> inside --pid --mount.

Compare namespaces

# equal inodes = same namespace
$ readlink /proc/1/ns/net
net:[4026531840]
$ readlink /proc/$PID/ns/net
net:[4026532092]   # different - this container has its own net stack

# or use nsm in this repo
$ nsm diff 1 $PID

List system-wide

$ sudo lsns                       # all namespaces, all types
$ sudo lsns -t net                # just network namespaces
$ sudo lsns -t pid -o NS,PID,COMMAND   # custom columns
$ nsm list                        # grouped, with nsm-managed flagged
$ nsm monitor                     # watch for new/destroyed namespaces

Pitfalls: PID 1 dying kills everything in the namespace. Mount propagation defaults can leak. Files written "as root" inside a user namespace are owned on the host by the mapped UID (for example one from your subuid range), not by root. Net namespace teardown blocks until kernel cleanup completes; ip netns del can stall on busy hosts. The time namespace covers the monotonic and boot clocks only; the wall clock is global.

Further Reading

source                             what
man 7 namespaces                   the definitive reference, kept current by Michael Kerrisk
man 7 user_namespaces              UID mapping rules and capability semantics
man 7 cgroups                      the other half of containers
man 2 unshare                      the syscall behind unshare(1) and nsm create
man 8 ip-netns                     network namespace management via iproute2
LWN namespaces series              the long-form deep dive in seven parts
OCI runtime-spec                   what runc, crun, and youki actually implement on top
whitepaper.html                    dense reference companion, with margin-card hover context
cheatsheet.html                    one-page screenshot-friendly cheatsheet of everything in this repo
github.com/hed0rah/namespaces-fun  demo scripts, the nsm CLI, cheatsheet, deep-dive markdown