A namespace is a kernel mechanism that gives a set of processes its own private view of some global resource. Eight resources can be partitioned: hostnames, process IDs, mount points, network stacks, IPC objects, user IDs, cgroup paths, and monotonic clocks. A container is a process started inside a fresh copy of every one of them, plus cgroups for limits, plus a filesystem image. Strip those away and you are left with these primitives. They are syscalls. You can call them from bash.
This page walks the eight types from the simplest (UTS) to the most security-relevant (user), then assembles a working container from scratch using only unshare and pivot_root. The same scripts ship in demos/ in the repo.
Clone the repo and start with the demos. Each script holds the namespace open for 60 seconds so you can poke at it from another terminal. Ctrl+C tears it down early.
install

```
$ git clone https://github.com/hed0rah/namespaces-fun
$ cd namespaces-fun
$ chmod +x demos/*.sh nsm/nsm

# gentlest intro: hostname isolation
$ sudo ./demos/01-uts-hostname.sh

# the Docker way: veth pair, two namespaces, one cable
$ sudo ./demos/03-net-namespace.sh

# no sudo: the basis of rootless containers
$ ./demos/05-user-namespace.sh

# a real container in pure bash
$ sudo ./demos/07-mini-container.sh

# install the nsm CLI
$ sudo cp nsm/nsm /usr/local/bin/nsm
$ nsm list             # all namespaces on the system
$ sudo nsm diff 1 $$   # compare your shell to init
```
Each type isolates one specific kernel resource. They were added one at a time across two decades. All eight together is what runc, podman, and crun create on every container start.
| type | CLONE flag | unshare | kernel | isolates |
|---|---|---|---|---|
| mnt | CLONE_NEWNS | -m | 2.4.19 (2002) | mount table |
| uts | CLONE_NEWUTS | -u | 2.6.19 (2006) | hostname, NIS domain |
| ipc | CLONE_NEWIPC | -i | 2.6.19 (2006) | SysV IPC, POSIX mqueues |
| pid | CLONE_NEWPID | -p | 2.6.24 (2008) | process IDs |
| net | CLONE_NEWNET | -n | 2.6.29 (2009) | full network stack |
| user | CLONE_NEWUSER | -U | 3.8 (2013) | UIDs, GIDs, capabilities |
| cgroup | CLONE_NEWCGROUP | -C | 4.6 (2016) | cgroup root path |
| time | CLONE_NEWTIME | -T | 5.6 (2020) | monotonic + boot clocks |
Wall-clock time (CLOCK_REALTIME) is global; the time namespace virtualises only the monotonic and boot clocks. Disk I/O, kernel modules, sysctls outside net.*, and the kernel itself are shared. Resource quantities (CPU, memory, PIDs) are bounded by cgroups, not namespaces.
A namespace is a kernel object identified by an inode number. Every process exposes its membership through a directory of magic symlinks under /proc/<pid>/ns/. Two processes share a namespace if and only if their symlinks resolve to the same inode. That is the entire ownership model.
/proc/<pid>/ns/ on a typical host

```
$ ls -la /proc/$$/ns/
cgroup -> cgroup:[4026531835]
ipc    -> ipc:[4026531839]
mnt    -> mnt:[4026531841]
net    -> net:[4026531840]
pid    -> pid:[4026531836]
pid_for_children  -> pid:[4026531836]
time   -> time:[4026531834]
time_for_children -> time:[4026531834]
user   -> user:[4026531837]
uts    -> uts:[4026531838]
```
The bracketed integer is the inode. Compare two processes by reading both symlinks. pid_for_children and time_for_children are the namespaces that a future fork() will land in; they differ from the current pid and time only in the brief window between unshare(CLONE_NEWPID) and the next fork.
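The inode check needs no privileges. A minimal sketch in plain POSIX shell (Linux /proc assumed): a child shell inherits every namespace from its parent, so the two symlinks below resolve to the same inode.

```shell
# Two processes share a namespace iff their /proc/<pid>/ns/<type>
# symlinks resolve to the same inode. A child sh inherits every
# namespace from its parent, so these must match.
parent=$(readlink /proc/$$/ns/uts)
child=$(sh -c 'readlink /proc/$$/ns/uts')   # $$ expands in the child
if [ "$parent" = "$child" ]; then
    echo "same uts namespace: $parent"
fi
```

Swap `uts` for `net` or `pid` and compare against another PID you own to audit any process the same way.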
A container typically shares the user and time namespaces with the host (the default for runc without rootless mode) but has its own copy of everything else. Sharing is per-type, decided at clone() time, and not reversible without a re-clone or setns().
A namespace stays alive as long as the kernel holds a reference to it: a member process, an open file descriptor on its ns file, or a bind mount of that file onto a stable path. ip netns add uses the bind-mount trick at /var/run/netns/<name>. nsm create uses a long-lived holder process.
Everything userspace does to namespaces routes through three syscalls. Container runtimes call them via libc; the CLI tools wrap them.
| syscall | signature | used by | purpose |
|---|---|---|---|
| clone | clone(fn, stack, flags \| CLONE_NEW*, ...) | runc, crun, youki | fork a child directly into new namespaces. No race window. |
| unshare | unshare(flags) | unshare(1) CLI | move the calling process into new namespaces in place. |
| setns | setns(fd, nstype) | nsenter(1) | join an existing namespace by fd to /proc/<pid>/ns/<type>. |
The CLI wrappers unshare(1) and nsenter(1) ship with util-linux. The kernel does not require a daemon to be alive: if dockerd is wedged, you can still nsenter --target $(pidof your-app) --all bash to get inside.
The simplest namespace. Isolates only the hostname and NIS domain name. Use it as your introduction to the API; almost nothing can go wrong.
demo: 01-uts-hostname.sh

```
# terminal 1
$ sudo unshare --uts bash
# hostname namespace-land
# hostname
namespace-land

# terminal 2 (host)
$ hostname
your-real-hostname
```
Under the hood: unshare(CLONE_NEWUTS) tells the kernel to copy the calling process's UTS struct. The child sees the copy; the parent keeps the original. sethostname(2) writes only to the copy. When the last process in the namespace exits, the copy is freed.
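You can see where those reads land without creating any namespace: hostname(1) and uname -n both go through the caller's UTS struct, so within a single namespace they always agree.

```shell
# Both paths read the UTS struct of the calling process's namespace,
# so inside a single namespace they always agree.
a=$(cat /proc/sys/kernel/hostname)
b=$(uname -n)
[ "$a" = "$b" ] && echo "one UTS struct, two readers: $a"
```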
A new PID namespace gives the child its own PID number space. The first process inside is PID 1, with the kernel's init semantics: it must reap orphans, and if it dies the kernel sends SIGKILL to every other process in the namespace. This is why containers ship tini, dumb-init, or Docker's --init.
demo: 02-pid-namespace.sh

```
$ sudo unshare --pid --fork --mount-proc bash
# echo $$
1
# ps aux
USER  PID %CPU %MEM   VSZ  RSS TTY   STAT START TIME COMMAND
root    1  0.0  0.0 10300 3204 pts/0 S    14:22 0:00 bash
root    8  0.0  0.0 11176 2940 pts/0 R+   14:22 0:00 ps aux
```
unshare(CLONE_NEWPID) does not move the caller into the new PID namespace. The caller's children are born there. unshare --fork tells the CLI to fork before exec'ing the target program. Forget it and you get the strange behaviour of being "in" a namespace where $$ still refers to your old PID.
A process in a new PID namespace has two PIDs: its host PID (visible from outside) and its in-namespace PID (visible from inside). Both are recorded in /proc/<host-pid>/status on the NSpid: line, listed outermost first.
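The NSpid line is readable for any of your own processes even outside a container. With no extra PID namespace it carries a single entry, and the last entry is always the PID the process sees as its own:

```shell
# NSpid holds one PID per nested PID namespace, outermost first.
# The final entry is the process's own view of itself, i.e. $$ here.
grep '^NSpid' /proc/$$/status
```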
two-PID identity

```
# on the host
$ ps -ef | grep sleep
root 12384 12381 0 14:22 pts/0 00:00:00 sleep 60
$ grep NSpid /proc/12384/status
NSpid:  12384   1

# inside the namespace
# echo $$
1
```
A new net namespace starts with exactly one interface: lo, in the DOWN state. No routes, no neighbours, no iptables rules. You build connectivity by hand. A veth pair is a virtual ethernet cable: two devices, one end stays on the host, the other is moved into the namespace by ip link set ... netns ....
demo: 03-net-namespace.sh (abbreviated)

```
$ sudo ip netns add demo
$ sudo ip link add veth-h type veth peer name veth-c
$ sudo ip link set veth-c netns demo
$ sudo ip addr add 10.200.1.1/24 dev veth-h
$ sudo ip link set veth-h up
$ sudo ip netns exec demo ip addr add 10.200.1.2/24 dev veth-c
$ sudo ip netns exec demo ip link set veth-c up
$ sudo ip netns exec demo ip link set lo up
$ sudo ping -c1 10.200.1.2
PING 10.200.1.2 56(84) bytes of data.
64 bytes from 10.200.1.2: icmp_seq=1 ttl=64 time=0.045 ms
```
To reach beyond the host, enable forwarding (sysctl net.ipv4.ip_forward=1), add a NAT rule (iptables -t nat -A POSTROUTING -s 10.200.1.0/24 -j MASQUERADE), and set a default route inside the namespace (ip route add default via 10.200.1.1). That is precisely Docker's bridge networking, plus DNS forwarding.
A mount namespace gives the child a copy of the parent's mount table. New mounts and unmounts are local to that copy. Combined with pivot_root, this is how a container gets its own root filesystem.
demo: 04-mount-namespace.sh

```
$ sudo unshare --mount bash
# mount -t tmpfs tmpfs /tmp/secret
# echo "in-ns" > /tmp/secret/file
# ls /tmp/secret
file

# in another terminal, on the host
$ ls /tmp/secret
# empty - the tmpfs only exists inside the namespace
```
Each mount has a propagation type that controls whether changes in this namespace flow to peer mounts in other namespaces. The four modes: shared (events propagate to and from peers in the group), private (no propagation either way), slave (events flow in from the master peer group but not back out), and unbindable (private, and additionally unusable as the source of a bind mount).
Look at the current state with findmnt -o TARGET,PROPAGATION. Containers usually run mount --make-rprivate / first, so subsequent mounts inside the namespace cannot leak out. Skipping this is a classic source of "I unmounted it inside the container and it disappeared from the host" bugs.
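If findmnt is unavailable, the same information is in /proc/self/mountinfo: the optional-fields column carries shared:N for a shared peer group, master:N for a slave, and nothing for a private mount. A rough reader, assuming only the documented mountinfo format:

```shell
# Fields 7+ of each mountinfo line are optional tags, terminated by "-".
# Print each mount point (field 5) followed by its propagation tags.
awk '{ tags = "";
       for (i = 7; $i != "-"; i++) tags = tags " " $i;
       print $5 tags }' /proc/self/mountinfo | head -5
```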
Real containers swap their root filesystem with pivot_root(2), then unmount the old root. This denies the container any path back to the host filesystem. chroot can be escaped from with CAP_SYS_CHROOT; pivot_root followed by umount -l /old cannot.
The user namespace maps UIDs and GIDs from outside to inside, and gives the calling process a full capability set against the namespaced credentials. It is the only namespace that can be created without root, and the foundation of rootless containers (podman, buildah).
demo: 05-user-namespace.sh - no sudo

```
$ id
uid=1000(user) gid=1000(user)
$ unshare --user --map-root-user bash
# id
uid=0(root) gid=0(root)
# cat /proc/self/uid_map
         0       1000          1
# cat /etc/shadow
cat: /etc/shadow: Permission denied
```
The map line 0 1000 1 reads: inside-uid 0 maps to outside-uid 1000, for a range of one. Inside the namespace you are "root" with the full capability set against namespaced objects. Files on the host that you do not own are inaccessible because their owner UID falls outside the map (it shows up as the overflow uid nobody).
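Each uid_map row is three numbers: inside-start, outside-start, length. A short loop makes the translation explicit; it prints whatever map the current process has, one range per line:

```shell
# Row format: inside-start outside-start length.
# An inside uid U maps to host uid outside-start + (U - inside-start),
# provided U - inside-start < length.
while read inside outside length; do
    echo "inside $inside..$((inside + length - 1)) -> host $outside..$((outside + length - 1))"
done < /proc/self/uid_map
```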
Recent Ubuntu releases ship an AppArmor restriction (kernel.apparmor_restrict_unprivileged_userns=1) that confines unprivileged unshare(1) to a profile lacking CAP_SYS_ADMIN, breaking the demo with "write failed /proc/self/uid_map". Toggle the sysctl to 0 for a one-off, or ship an AppArmor profile (what podman does). The kernel-level toggle is kernel.unprivileged_userns_clone on Debian-derived kernels.
Map ranges larger than 1 require newuidmap and newgidmap setuid helpers, configured via /etc/subuid and /etc/subgid. This is the part that lets a podman container see uid 0..65535 inside while only owning uid 100000..165535 on the host.
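The translation itself is plain arithmetic. With a hypothetical subuid map "0 100000 65536" (the numbers here are illustrative, not read from any real /etc/subuid):

```shell
# Hypothetical rootless map: inside uid 0 -> host uid 100000, 65536 ids.
inside_start=0; outside_start=100000; length=65536
inside_uid=33   # e.g. www-data inside the container
if [ $((inside_uid - inside_start)) -lt "$length" ]; then
    echo "host uid: $((outside_start + inside_uid - inside_start))"
fi
# prints: host uid: 100033
```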
Isolates System V IPC objects (message queues, semaphores, shared memory segments) and POSIX message queues. Two namespaces cannot see each other's queues by ID. Without this, containers could communicate through legacy IPC mechanisms behind their operators' backs.
demo: 06-ipc-namespace.sh

```
# on the host
$ sudo ipcmk -Q
Message queue id: 3

$ sudo unshare --ipc bash
# ipcs -q            # the host's queue id 3 is not visible
# ipcmk -Q
Message queue id: 0  # numbering starts fresh
```
Kubernetes pods share the IPC namespace across containers in the pod by default. PID can be shared explicitly with shareProcessNamespace: true. That is by design: pod members are meant to be tightly coupled.
The cgroup namespace virtualises /proc/self/cgroup so the container sees its cgroup as the root /. It does not change enforcement: limits are still set from the outside. It changes what the container sees about its own placement.
demo: 09-cgroup-namespace.sh

```
# host view, before entering the cgroup namespace
$ cat /proc/self/cgroup
0::/user.slice/user-1000.slice/session-7.scope

$ sudo unshare --cgroup bash
# cat /proc/self/cgroup
0::/    # the container thinks it is at the root of the cgroup tree
```
The benefit shows up under cgroups v2 where the entire hierarchy is one tree. Under v1 the controllers were already separate, so hiding the path mattered less.
cgroups and namespaces are independent kernel features. You set limits by writing to control files under /sys/fs/cgroup/<your-cgroup>/:
```
$ sudo mkdir /sys/fs/cgroup/mybox
$ echo 67108864 | sudo tee /sys/fs/cgroup/mybox/memory.max      # 64 MB
$ echo "50000 100000" | sudo tee /sys/fs/cgroup/mybox/cpu.max   # 50% of one core
$ echo 20 | sudo tee /sys/fs/cgroup/mybox/pids.max              # max 20 procs
$ echo $$ | sudo tee /sys/fs/cgroup/mybox/cgroup.procs          # put us in
```
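The magic numbers are worth decoding: memory.max takes bytes, and cpu.max takes "quota period" in microseconds, so 50000 out of every 100000 µs is half of one core.

```shell
# memory.max is plain bytes.
echo $((64 * 1024 * 1024))          # prints 67108864, i.e. 64 MB

# cpu.max is "quota period" in microseconds; percent of one core:
quota=50000; period=100000
echo "$((100 * quota / period))%"   # prints 50%
```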
The newest type. Per-namespace offsets for CLOCK_MONOTONIC and CLOCK_BOOTTIME. CRIU uses it to keep monotonic time consistent across checkpoint and restore. Wall-clock time (CLOCK_REALTIME) is global and not virtualised.
offsets via /proc/<pid>/timens_offsets (the offsets file is writable only before any process has entered the namespace, so do not --fork here: the shell stays outside and only its children enter)

```
$ sudo unshare --time bash
# echo "boottime 86400 0" > /proc/$$/timens_offsets
# uptime      # runs as a child, inside the new time namespace
 14:22:03 up 1 day, 14 min,  0 users,  load average: 0.00, 0.00, 0.00
```
The host's actual uptime is unaffected. Tools reading CLOCK_MONOTONIC or /proc/uptime see the offset; tools reading date or wall-clock time do not.
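The offset is pure addition, so you can predict the namespaced reading from the host side: what /proc/uptime would show inside is the host value plus the boottime offset (86400 s in the demo above).

```shell
# Inside the time namespace, /proc/uptime = host uptime + boottime offset.
offset=86400   # the one-day offset written to timens_offsets
host_up=$(awk '{ print int($1) }' /proc/uptime)
echo "namespaced uptime would be $((host_up + offset)) seconds"
```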
Combine all of the above and you have a container. The script demos/07-mini-container.sh ships in the repo and does this in seven steps. The whole runtime is unshare, mount, and pivot_root.
The script does, in order:

1. Build a rootfs: debootstrap, docker export, or copy a static busybox with applet symlinks. The demo will install busybox automatically if your rootfs is empty.
2. Create a cgroup and set memory.max, cpu.max, pids.max. Place the container's holder process into cgroup.procs before exec'ing unshare so descendants inherit membership.
3. unshare --pid --uts --mount --ipc --net --cgroup --fork. With --fork, the new shell is PID 1 in the new namespaces.
4. Set the hostname; the change is confined to the new UTS namespace.
5. mount --make-rprivate /. Otherwise mounts inside leak back to the host.
6. Bind-mount the rootfs onto itself: pivot_root requires new_root to be a mount point distinct from the current /. A self-bind is the cheapest way to satisfy this.
7. pivot_root . .old_root, umount -l /.old_root, then exec /bin/sh. Production runtimes also drop capabilities and apply seccomp here. The demos do not.

Running sudo ./demos/07-mini-container.sh drops you into a fresh shell where ps shows two processes, hostname says tiny-container, and mount | head shows only what you mounted. exit tears the whole thing down. That is a container.
Every container is a process. Find its PID and you have everything.
```
$ docker inspect -f '{{.State.Pid}}' mycontainer   # docker
$ podman inspect -f '{{.State.Pid}}' mycontainer   # podman
$ crictl inspect <id> | jq '.info.pid'             # cri-o / containerd via crictl

$ PID=$(docker inspect -f '{{.State.Pid}}' mycontainer)
$ ls -la /proc/$PID/ns/                            # the eight namespaces
```
```
$ sudo nsenter --target $PID --all bash            # into all namespaces
$ sudo nsenter --target $PID --net bash            # just the network
$ sudo nsenter --target $PID --mount --uts bash    # a subset
```
The kernel does not require dockerd or kubelet to be alive. Any tool in any namespace works: tcpdump inside --net, strace inside --pid, ls /proc/<in-ns-pid> inside --pid --mount.
```
# equal inodes = same namespace
$ readlink /proc/1/ns/net
net:[4026531840]
$ readlink /proc/$PID/ns/net
net:[4026532092]    # different - this container has its own net stack

# or use nsm in this repo
$ nsm diff 1 $PID
```
```
$ sudo lsns                           # all namespaces, all types
$ sudo lsns -t net                    # just network namespaces
$ sudo lsns -t pid -o NS,PID,COMMAND  # custom columns
$ nsm list                            # grouped, with nsm-managed flagged
$ nsm monitor                         # watch for new/destroyed namespaces
```
Two gotchas to remember: ip netns del can stall on busy hosts, and the time namespace covers only the monotonic and boot clocks; the wall clock stays global.
| source | what |
|---|---|
| man 7 namespaces | the definitive reference, kept current by Michael Kerrisk |
| man 7 user_namespaces | UID mapping rules and capability semantics |
| man 7 cgroups | the other half of containers |
| man 2 unshare | the syscall behind unshare(1) and nsm create |
| man 8 ip-netns | network namespace management via iproute2 |
| LWN namespaces series | the long-form deep dive in seven parts |
| OCI runtime-spec | what runc, crun, and youki actually implement on top |
| whitepaper.html | dense reference companion, with margin-card hover context |
| cheatsheet.html | one-page screenshot-friendly cheatsheet of everything in this repo |
| github.com/hed0rah/namespaces-fun | demo scripts, the nsm CLI, cheatsheet, deep-dive markdown |