Namespaces partition kernel resources so that one set of processes sees one set of resources while another set sees a different one. They are the isolation layer behind every container runtime.
Three syscalls power the entire namespace subsystem.
| Syscall | Signature | Purpose |
|---|---|---|
| clone() | clone(fn, stack, flags, arg) | Create child process in new namespace(s). Flags are OR'd CLONE_NEW* constants. |
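| unshare() | unshare(flags) | Move the calling process into new namespace(s). No fork needed, except for CLONE_NEWPID, where only children created afterwards enter the new PID namespace. |
| setns() | setns(fd, nstype) | Join an existing namespace via a file descriptor opened on /proc/<pid>/ns/<type>. This is what nsenter uses. |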
```c
// Fragment: needs #define _GNU_SOURCE before <sched.h>, plus <signal.h> and <fcntl.h>.
// child_fn, stack, STACK_SIZE and arg are defined elsewhere.

// Create child in new PID + UTS namespace
int flags = CLONE_NEWPID | CLONE_NEWUTS | SIGCHLD;
pid_t pid = clone(child_fn, stack + STACK_SIZE, flags, arg);

// Move current process into a new network namespace
unshare(CLONE_NEWNET);

// Join an existing namespace
int fd = open("/proc/12345/ns/net", O_RDONLY);
setns(fd, CLONE_NEWNET);
```
nsenter enters the existing namespaces of a running process. It uses setns() under the hood.
enter by pid

```bash
# Enter all namespaces of PID 12345
$ sudo nsenter --target 12345 --all bash

# Enter just the network namespace
$ sudo nsenter --target 12345 --net bash

# Enter a Docker container (bypass docker exec)
$ PID=$(docker inspect -f '{{.State.Pid}}' mycontainer)
$ sudo nsenter --target "$PID" --all bash
# Useful when dockerd is stuck but you need to get into a container
```
A new net namespace starts with only a loopback interface, and even that is down. You build the rest of the network from scratch using veth pairs.
veth pair

```bash
# Assumes the namespace already exists (sudo ip netns add myns)

# Create a virtual ethernet pair (two ends of a cable)
$ sudo ip link add veth-host type veth peer name veth-ns

# Move one end into the namespace
$ sudo ip link set veth-ns netns myns

# Assign IPs
$ sudo ip addr add 10.200.1.1/24 dev veth-host
$ sudo ip netns exec myns ip addr add 10.200.1.2/24 dev veth-ns

# Bring up
$ sudo ip link set veth-host up
$ sudo ip netns exec myns ip link set veth-ns up
$ sudo ip netns exec myns ip link set lo up

# Test
$ sudo ip netns exec myns ping 10.200.1.1
```
internet access

```bash
# Enable IP forwarding
$ sudo sysctl -w net.ipv4.ip_forward=1

# NAT the namespace traffic
$ sudo iptables -t nat -A POSTROUTING -s 10.200.1.0/24 -j MASQUERADE

# Default route inside the namespace
$ sudo ip netns exec myns ip route add default via 10.200.1.1

# Now the namespace can reach the internet
$ sudo ip netns exec myns curl ifconfig.me
```
ip netns

```bash
$ sudo ip netns add myns            # create
$ ip netns list                     # list
$ sudo ip netns exec myns bash      # enter
$ sudo ip netns exec myns ip a      # run command inside
$ sudo ip netns del myns            # delete

# ip netns creates bind mounts at /var/run/netns/<name>
# This keeps the namespace alive even with no processes in it
```
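The veth and `ip netns` steps above can be wrapped into a pair of setup and teardown helpers. This is a minimal sketch, not a hardened script: the function names `run`, `netns_up`, and `netns_down`, the `DRY_RUN` variable, and the `<name>-host` / `<name>-ns` interface naming are all assumptions made for illustration. With `DRY_RUN=1` the helpers only print the commands they would run.

```bash
#!/usr/bin/env bash
# Hypothetical helpers; all names and the DRY_RUN switch are illustrative.

# run CMD...: execute a command, or only print it when DRY_RUN is set.
run() {
  if [ -n "${DRY_RUN:-}" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

# netns_up NAME PREFIX: create namespace NAME and connect it to the host
# over a veth pair, using PREFIX.1 (host end) and PREFIX.2 (namespace end).
netns_up() {
  local ns="$1" prefix="$2"
  # Interface names are limited to 15 characters, so keep NAME short.
  local host_if="${ns}-host" ns_if="${ns}-ns"
  run ip netns add "$ns"
  run ip link add "$host_if" type veth peer name "$ns_if"
  run ip link set "$ns_if" netns "$ns"
  run ip addr add "${prefix}.1/24" dev "$host_if"
  run ip netns exec "$ns" ip addr add "${prefix}.2/24" dev "$ns_if"
  run ip link set "$host_if" up
  run ip netns exec "$ns" ip link set "$ns_if" up
  run ip netns exec "$ns" ip link set lo up
}

# netns_down NAME: delete the named namespace. Once nothing else holds the
# namespace open, the veth end inside it is destroyed, and its peer with it.
netns_down() {
  run ip netns del "$1"
}
```

Running `DRY_RUN=1 netns_up myns 10.200.1` prints the same sequence as the walkthrough, with different interface names. Without `DRY_RUN`, the helpers need to run as root.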
Every process exposes its namespace memberships as symlinks under /proc/<pid>/ns/. Same inode number = same namespace.
inspection

```bash
# Your own namespaces
$ ls -la /proc/$$/ns/

# Compare two processes (reading PID 1's links needs root)
$ sudo readlink /proc/1/ns/pid
pid:[4026531836]
$ readlink /proc/$$/ns/pid
pid:[4026531836]   # same inode = same namespace

# List all namespaces on the system
$ lsns

# Filter by type
$ lsns -t net

# A process in a nested PID namespace has one PID per level
$ grep NSpid /proc/<pid>/status
NSpid:  12345  1   # host PID 12345, container PID 1
```
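The two checks above (comparing namespace links and reading `NSpid`) can be turned into small reusable functions. The sketch below is illustrative only: `same_ns`, `ns_pid_chain`, and the `PROC_ROOT` override are hypothetical names, and `PROC_ROOT` exists purely so the functions can be pointed at a fixture directory instead of the real /proc.

```bash
#!/usr/bin/env bash
# Hypothetical helpers. PROC_ROOT is an illustrative override that
# defaults to the real /proc.

# same_ns PID_A PID_B TYPE
# Exit 0 if both processes share the namespace of TYPE (pid, net, mnt, ...),
# 1 if they differ, 2 if either link cannot be read.
same_ns() {
  local root="${PROC_ROOT:-/proc}" a b
  a=$(readlink "$root/$1/ns/$3" 2>/dev/null) || return 2
  b=$(readlink "$root/$2/ns/$3" 2>/dev/null) || return 2
  [ "$a" = "$b" ]
}

# ns_pid_chain PID
# Print the PID of the process in each nested PID namespace, outermost first,
# one per line, taken from the NSpid field of its status file.
ns_pid_chain() {
  local root="${PROC_ROOT:-/proc}"
  awk '$1 == "NSpid:" { for (i = 2; i <= NF; i++) print $i }' "$root/$1/status"
}
```

For example, `same_ns 1 $$ net && echo "host network"` reports whether the current shell shares PID 1's network namespace. Reading another user's namespace links usually requires root.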
The user namespace is the most security-relevant namespace. It maps UIDs and GIDs between namespaces, so root inside the container can be an unprivileged user on the host.
uid_map

```text
# /proc/<pid>/uid_map format:
# <id_inside> <id_outside> <range>
0 1000 1
# UID 0 inside = UID 1000 outside
# "root" in the container is
# your unprivileged user on host
```
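To make the three-column format concrete, here is a small sketch that translates an inside UID to its outside UID by walking the map. The function name `map_uid` is hypothetical. It reads /proc/self/uid_map by default and accepts any file in the same format as an optional second argument.

```bash
#!/usr/bin/env bash
# map_uid INSIDE_UID [MAP_FILE]   (hypothetical helper name)
# Prints the outside UID that INSIDE_UID maps to, or exits non-zero
# when the UID falls outside every mapped range.
map_uid() {
  local uid="$1" map="${2:-/proc/self/uid_map}"
  awk -v uid="$uid" '
    # Columns: first inside ID, first outside ID, range length.
    uid >= $1 && uid < $1 + $3 { print $2 + (uid - $1); found = 1; exit }
    END { exit !found }
  ' "$map"
}
```

With the single-line map shown above (`0 1000 1`), `map_uid 0` prints `1000` and `map_uid 1` fails, because only one UID is mapped.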
rootless

```bash
# No sudo needed
$ unshare --user --map-root-user bash

# Inside:
# whoami
root
# id
uid=0(root) gid=0(root)
# But on the host, it's still you
```
A container is just namespaces + chroot/pivot_root + cgroups. The script below builds the first two by hand; it sets no cgroup limits.
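```bash
#!/usr/bin/env bash
# Minimal container in ~20 lines
ROOTFS="/tmp/container-root"
# bin holds busybox, proc is the /proc mount point, .old receives the old root
mkdir -p "$ROOTFS"/{bin,proc,.old}

# Install a minimal userland (busybox must be statically linked)
cp /path/to/busybox "$ROOTFS/bin/busybox"
for cmd in sh ls ps mount umount cat echo; do
  ln -sf busybox "$ROOTFS/bin/$cmd"
done

# Launch in new PID, UTS, mount, IPC and net namespaces
sudo unshare --pid --uts --mount --ipc --net --fork \
  /bin/bash -c "
    hostname container
    # pivot_root requires the new root to be a mount point
    mount --bind $ROOTFS $ROOTFS
    mount -t proc proc $ROOTFS/proc
    cd $ROOTFS
    pivot_root . .old
    umount -l /.old
    exec /bin/sh
  "
```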
Simplified flow of docker run. Steps 4a-4b are what the demo script above does by hand; runc implements them largely in Go rather than bash.
docker run flow

```text
1. docker CLI --> dockerd (REST API call)
2. dockerd --> containerd (gRPC: create container)
3. containerd --> runc (OCI runtime)
4. runc:
   a. clone(CLONE_NEWPID | CLONE_NEWUTS | CLONE_NEWNS | ...)
   b. In child:
      - Set up cgroups
      - Mount /proc, /sys, /dev
      - Set up rootfs (overlay mounts)
      - pivot_root
      - Set hostname
      - Configure network (libnetwork in Docker; CNI plugins under Kubernetes)
      - Drop capabilities
      - Set seccomp filters
      - Set AppArmor/SELinux labels
      - execve(container entrypoint)
5. Container process running, fully isolated
```
Mounts can propagate between namespaces. This is often the source of subtle container bugs.
| Mode | Behavior | Use Case |
|---|---|---|
| private | Changes stay in this namespace only | Default for containers |
| shared | Changes propagate to all peer mounts | Host mounts visible in containers |
| slave | Receives propagation but doesn't send | One-way host-to-container sync |
| unbindable | Can't be bind-mounted at all | Prevent mount namespace escapes |
check propagation

```bash
$ findmnt -o TARGET,PROPAGATION

# Make a mount private (no propagation)
$ sudo mount --make-private /mnt/data

# Make a mount shared (propagate everywhere)
$ sudo mount --make-shared /mnt/data

# Recursively make everything private (container startup)
$ sudo mount --make-rprivate /
```
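The same information is available without findmnt: the optional-fields column of /proc/self/mountinfo carries `shared:N`, `master:N`, and `unbindable` tags, and a mount with none of them is private. The sketch below classifies one mount point that way. The function name `propagation_of` and its output labels are assumptions for illustration, not findmnt's exact output format.

```bash
#!/usr/bin/env bash
# propagation_of MOUNTPOINT [MOUNTINFO_FILE]   (hypothetical helper name)
# Prints private, shared, slave, "shared,slave" or unbindable for MOUNTPOINT,
# or exits non-zero if MOUNTPOINT is not listed.
propagation_of() {
  local target="$1" info="${2:-/proc/self/mountinfo}"
  awk -v target="$target" '
    $5 == target {
      # Optional fields start at column 7 and end at the "-" separator.
      shared = slave = unbind = 0
      for (i = 7; i <= NF && $i != "-"; i++) {
        if ($i ~ /^shared:/)          shared = 1
        else if ($i ~ /^master:/)     slave = 1
        else if ($i == "unbindable")  unbind = 1
      }
      if (unbind)                mode = "unbindable"
      else if (shared && slave)  mode = "shared,slave"
      else if (shared)           mode = "shared"
      else if (slave)            mode = "slave"
      else                       mode = "private"
    }
    # If several mounts are stacked on MOUNTPOINT, the last one listed wins.
    END { if (mode == "") exit 1; print mode }
  ' "$info"
}
```

Calling `propagation_of /` on a typical systemd host is likely to print `shared`, which is why container runtimes remount with `--make-rprivate` before pivoting.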
no network

```bash
# Run a process with zero network access
$ sudo unshare --net -- bash -c '
    curl google.com   # fails: the only interface is a loopback that is down
    ping 8.8.8.8      # fails: no route
  '
# Useful for sandboxing builds, running untrusted code, etc.
```
hostname

```bash
$ sudo unshare --uts -- bash -c '
    hostname devbox
    exec my-app
  '
# my-app sees hostname "devbox"
# Host hostname is untouched
```
rootless

```bash
# No sudo needed
$ unshare --user --map-root-user --pid --fork --mount-proc bash

# You're now "root" with isolated PIDs
# whoami
root
# ps aux
PID USER COMMAND
1   root bash
```
docker namespaces

```bash
$ docker run -d --name test alpine sleep 3600
$ PID=$(docker inspect -f '{{.State.Pid}}' test)

# Compare container vs host namespaces
# (run the loop as root: reading these links for other users' processes needs privilege)
$ for ns in /proc/$PID/ns/*; do
    type=$(basename "$ns")
    cnt=$(readlink "$ns")
    host=$(readlink "/proc/1/ns/$type")
    [[ "$cnt" != "$host" ]] && echo "ISOLATED: $type"
  done
ISOLATED: cgroup
ISOLATED: ipc
ISOLATED: mnt
ISOLATED: net
ISOLATED: pid
ISOLATED: uts
# Kernels 4.12+ also expose pid_for_children, which will differ as well
```
rescue

```bash
# Enter a container when docker exec hangs or dockerd is dead
# The glob must match exactly one container; otherwise find the PID in ps
$ PID=$(cat /run/docker/containerd/*/init.pid)
$ sudo nsenter --target "$PID" --all bash
# You're inside the container, full shell
# Works even if docker daemon is completely dead
```
| Path | Contents |
|---|---|
| /proc/<pid>/ns/* | Namespace symlinks |
| /proc/<pid>/uid_map | UID mapping |
| /proc/<pid>/gid_map | GID mapping |
| /proc/<pid>/status | NSpid, NStgid, etc. |
| /proc/<pid>/cgroup | Cgroup membership |
| /var/run/netns/ | Named net namespaces |

| Tool | Purpose |
|---|---|
| unshare | Create new namespaces |
| nsenter | Enter existing namespaces |
| lsns | List namespaces |
| ip netns | Manage net namespaces |
| findmnt | Mount propagation info |
| pstree | PID namespace trees |
```bash
$ man 7 namespaces        # overview
$ man 7 user_namespaces   # UID mapping details
$ man 7 cgroups           # resource limits (the other half of containers)
$ man 2 unshare           # syscall
$ man 2 clone             # syscall
$ man 2 setns             # syscall
```