Linux Namespaces

$ unshare --pid --uts --net --mount --fork --mount-proc bash

The 8 Namespace Types

Namespaces partition kernel resources so that one set of processes sees one set of resources while another set sees a different one. They are the isolation layer behind virtually every container runtime.

Mount
CLONE_NEWNS
unshare -m
Filesystem mount points. Each namespace gets its own mount table. This is how containers get their own rootfs.
Linux 2.4.19 (2002)

UTS
CLONE_NEWUTS
unshare -u
Hostname and NIS domain name. The simplest namespace type. Change hostname without affecting the host.
Linux 2.6.19 (2006)

IPC
CLONE_NEWIPC
unshare -i
System V IPC objects and POSIX message queues. Prevents cross-container IPC snooping.
Linux 2.6.19 (2006)

PID
CLONE_NEWPID
unshare -p
Process ID number space. The first process in the namespace becomes PID 1. Processes outside are invisible from inside.
Linux 2.6.24 (2008)

Network
CLONE_NEWNET
unshare -n
Entire network stack: interfaces, routes, iptables, sockets. Starts with only a loopback (and it's DOWN).
Linux 2.6.29 (2009)

User
CLONE_NEWUSER
unshare -U
UIDs, GIDs, and capabilities. Be root inside, an unprivileged user outside. Enables rootless containers.
Linux 3.8 (2013)

Cgroup
CLONE_NEWCGROUP
unshare -C
Cgroup root directory. Virtualizes /proc/self/cgroup so the container sees itself at the cgroup root.
Linux 4.6 (2016)

Time
CLONE_NEWTIME
unshare -T
CLOCK_MONOTONIC and CLOCK_BOOTTIME offsets. Make a process think the system booted at a different time.
Linux 5.6 (2020)
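
Each type the running kernel supports shows up as an entry under /proc/self/ns/, so support can be checked without trying to create anything. A minimal sketch; the function name ns_supported and its optional directory argument are illustrative, not part of any existing tool.

```bash
#!/usr/bin/env bash
# ns_supported TYPE [NS_DIR]
# Succeeds if the kernel exposes namespace TYPE (mnt, uts, ipc, pid, net,
# user, cgroup, time). NS_DIR defaults to /proc/self/ns; the override is a
# hypothetical convenience for testing.
ns_supported() {
    local type=$1 dir=${2:-/proc/self/ns}
    [ -e "$dir/$type" ] || [ -L "$dir/$type" ]
}

for t in mnt uts ipc pid net user cgroup time; do
    if ns_supported "$t"; then
        echo "$t: supported"
    else
        echo "$t: not supported"
    fi
done
```

On kernels older than 5.6 the time line should report "not supported".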

Syscalls

Three syscalls power the entire namespace subsystem.

Syscall     Signature                      Purpose
clone()     clone(fn, stack, flags, arg)   Create a child process in new namespace(s). Flags are OR'd CLONE_NEW* constants.
unshare()   unshare(flags)                 Move the calling process into new namespace(s). No fork needed, except that a new PID namespace only takes effect for children created afterwards.
setns()     setns(fd, nstype)              Join an existing namespace via an fd opened on /proc/<pid>/ns/<type>. This is what nsenter uses.
C
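#define _GNU_SOURCE      // required for clone(), unshare(), setns()
#include <fcntl.h>
#include <sched.h>
#include <signal.h>
#include <unistd.h>

// Create child in new PID + UTS namespace.
// child_fn, stack, STACK_SIZE, and arg are defined elsewhere; the stack
// grows down, so clone() is given the top of the buffer.
int flags = CLONE_NEWPID | CLONE_NEWUTS | SIGCHLD;
pid_t pid = clone(child_fn, stack + STACK_SIZE, flags, arg);

// Move current process into a new network namespace
unshare(CLONE_NEWNET);

// Join an existing namespace
int fd = open("/proc/12345/ns/net", O_RDONLY);
setns(fd, CLONE_NEWNET);
close(fd);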
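
To make "OR'd CLONE_NEW* constants" concrete, the sketch below combines the flag values in bash. The numeric values are the ones defined in <linux/sched.h>; the helper name clone_flags is made up for illustration, and bash 4+ is assumed for associative arrays.

```bash
#!/usr/bin/env bash
# CLONE_NEW* values as defined in <linux/sched.h>
declare -A CLONE_NEW=(
    [time]=0x00000080  [mnt]=0x00020000  [cgroup]=0x02000000
    [uts]=0x04000000   [ipc]=0x08000000  [user]=0x10000000
    [pid]=0x20000000   [net]=0x40000000
)

# clone_flags TYPE...
# Prints the bitwise OR of the CLONE_NEW* constants for the given
# namespace types (hypothetical helper, for illustration only).
clone_flags() {
    local flags=0 t
    for t in "$@"; do
        [[ -n ${CLONE_NEW[$t]:-} ]] || { echo "unknown type: $t" >&2; return 1; }
        flags=$(( flags | ${CLONE_NEW[$t]} ))
    done
    printf '0x%08x\n' "$flags"
}

clone_flags pid uts    # prints 0x24000000 (the C example's flags, minus SIGCHLD)
```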

unshare command

Create new namespace(s) and run a program inside them.

basics
# New UTS namespace, change hostname
$ sudo unshare --uts bash -c 'hostname mybox; bash'

# New PID namespace (must fork + remount /proc)
$ sudo unshare --pid --fork --mount-proc bash

# New network namespace (blank stack, only lo)
$ sudo unshare --net bash

# Rootless user namespace (no sudo!)
$ unshare --user --map-root-user bash

# Most namespaces at once (basically a container)
$ sudo unshare --pid --uts --mount --ipc --net --fork --mount-proc bash
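Why --fork with --pid? unshare(CLONE_NEWPID) does not move the caller; only its children enter the new PID namespace. Without --fork, unshare execs bash directly, so bash stays in the old PID namespace, its first child becomes PID 1 of the new one, and once that child exits every later fork fails with "Cannot allocate memory". --mount-proc remounts /proc so ps/top see the new PID space.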
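
The effect of --fork is easy to observe by asking the shell for its own PID. This is a sketch that assumes unprivileged user namespaces are enabled (so no sudo is needed); the function name pid_seen_by_child is made up for the example.

```bash
#!/usr/bin/env bash
# pid_seen_by_child [--fork]
# Prints the PID a shell reports for itself when started through
# unshare --pid, with or without --fork. Prints "unavailable" if the
# namespaces cannot be created (old kernel, seccomp, sysctl restrictions).
pid_seen_by_child() {
    unshare --user --map-root-user --pid "$@" bash -c 'echo $$' 2>/dev/null \
        || echo unavailable
}

echo "with --fork:    $(pid_seen_by_child --fork)"
echo "without --fork: $(pid_seen_by_child)"
```

With --fork the shell should report 1. Without it the shell is the same process that called unshare(), so it reports an ordinary PID from the old namespace.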

nsenter command

Enter existing namespaces of a running process. Uses setns() under the hood.

enter by pid
# Enter all namespaces of PID 12345
$ sudo nsenter --target 12345 --all bash

# Enter just the network namespace
$ sudo nsenter --target 12345 --net bash

# Enter a Docker container (bypass docker exec)
$ PID=$(docker inspect -f '{{.State.Pid}}' mycontainer)
$ sudo nsenter --target $PID --all bash

# Useful when dockerd is stuck but you need into a container

Network Namespaces

A new net namespace starts with only a loopback interface, and that is down. You build the rest of the network from scratch, typically with veth pairs.
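
The starting state can be inspected directly. The sketch below assumes unprivileged user namespaces and the ip tool are available; fresh_netns_loopback is an illustrative name.

```bash
#!/usr/bin/env bash
# fresh_netns_loopback
# Shows the loopback interface as seen from a brand-new network namespace.
# Uses a user namespace so no sudo is needed; prints "unavailable" if
# unprivileged namespaces or the ip tool are missing.
fresh_netns_loopback() {
    unshare --user --map-root-user --net ip -o link show lo 2>/dev/null \
        || echo unavailable
}

fresh_netns_loopback
```

On a typical system the printed line should include "state DOWN", which is why the veth walkthrough brings lo up explicitly.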

veth pair
# Create the namespace first (see ip netns below)
# ip netns add myns

# Create a virtual ethernet pair (two ends of a cable)
# ip link add veth-host type veth peer name veth-ns

# Move one end into the namespace
# ip link set veth-ns netns myns

# Assign IPs
# ip addr add 10.200.1.1/24 dev veth-host
# ip netns exec myns ip addr add 10.200.1.2/24 dev veth-ns

# Bring up
# ip link set veth-host up
# ip netns exec myns ip link set veth-ns up
# ip netns exec myns ip link set lo up

# Test
# ip netns exec myns ping 10.200.1.1
internet access
# Enable IP forwarding
# echo 1 > /proc/sys/net/ipv4/ip_forward

# NAT the namespace traffic
# iptables -t nat -A POSTROUTING -s 10.200.1.0/24 -j MASQUERADE

# Default route inside the namespace
# ip netns exec myns ip route add default via 10.200.1.1

# Now the namespace can reach the internet
# ip netns exec myns curl ifconfig.me
ip netns
$ ip netns add myns          # create
$ ip netns list              # list
$ ip netns exec myns bash    # enter
$ ip netns exec myns ip a    # run command inside
$ ip netns del myns          # delete

# ip netns creates bind mounts at /var/run/netns/<name>
# This keeps the namespace alive even with no processes in it

Docker's Bridge Networking

Container A ---veth---+
                      |
                      docker0 bridge --- eth0 --- internet
                      |                  (+ iptables NAT)
Container B ---veth---+

/proc Namespace Interface

Every process exposes its namespace memberships as symlinks under /proc/<pid>/ns/. Same inode number = same namespace.

/proc/<pid>/ns/
+-- cgroup              cgroup:[4026531835]
+-- ipc                 ipc:[4026531839]
+-- mnt                 mnt:[4026531841]
+-- net                 net:[4026531840]
+-- pid                 pid:[4026531836]
+-- pid_for_children
+-- time                time:[4026531834]
+-- time_for_children
+-- user                user:[4026531837]
+-- uts                 uts:[4026531838]
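
Because the symlink target encodes the namespace type and inode, comparing two processes is a string comparison. A minimal sketch: same_ns and the PROC_ROOT override are hypothetical names introduced here, the latter only so the function can be exercised against a fake directory tree.

```bash
#!/usr/bin/env bash
# same_ns PID_A PID_B TYPE
# Succeeds if both processes are in the same namespace of the given TYPE.
# Returns 2 if either link cannot be read (reading another user's
# /proc/<pid>/ns entries usually requires root).
same_ns() {
    local root=${PROC_ROOT:-/proc} a b
    a=$(readlink "$root/$1/ns/$3") || return 2
    b=$(readlink "$root/$2/ns/$3") || return 2
    [ "$a" = "$b" ]
}

if same_ns self "$$" net; then
    echo "same network namespace"
else
    echo "different network namespace (or unreadable)"
fi
```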
inspection
# Your own namespaces
$ ls -la /proc/$$/ns/

# Compare two processes
$ sudo readlink /proc/1/ns/pid   # PID 1's links are readable only by root
pid:[4026531836]
$ readlink /proc/$$/ns/pid
pid:[4026531836]          # same inode = same namespace

# List all namespaces on the system
$ lsns

# Filter by type
$ lsns -t net

# A process in a nested PID namespace has one PID per nesting level
$ grep NSpid /proc/<pid>/status
NSpid:  12345   1       # host PID 12345, container PID 1
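
The NSpid line lists PIDs from the outermost namespace to the innermost, so the last field is the PID the process sees for itself. A small sketch of pulling those values out; nspids and innermost_pid are illustrative names.

```bash
#!/usr/bin/env bash
# nspids [STATUS_FILE]
# Prints the PIDs from the NSpid line, outermost namespace first.
# STATUS_FILE defaults to the current shell's /proc/<pid>/status.
nspids() {
    awk '/^NSpid:/ { for (i = 2; i <= NF; i++) print $i }' "${1:-/proc/$$/status}"
}

# innermost_pid [STATUS_FILE]
# Prints the PID as seen inside the process's own PID namespace.
innermost_pid() {
    nspids "$@" | tail -n 1
}
```

For the sample line above, innermost_pid would print 1 and nspids would print 12345 followed by 1.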

User Namespaces + UID Maps

The most security-relevant namespace. Maps UIDs between namespaces. Root inside the container is an unprivileged user on the host.

uid_map
# /proc/<pid>/uid_map format:
# <id_inside> <id_outside> <range>

         0       1000          1

# UID 0 inside = UID 1000 outside
# "root" in the container is
# your unprivileged user on host
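
Translating a UID through the map is a range lookup: find the line whose inside range contains the UID, then add the offset to that line's outside start. A sketch under the three-column format shown above; map_uid is a made-up helper name.

```bash
#!/usr/bin/env bash
# map_uid UID [MAP_FILE]
# Translates a UID inside the namespace to the UID outside, using lines of
# the form "<id_inside> <id_outside> <range>". Prints nothing and fails if
# the UID is unmapped. MAP_FILE defaults to /proc/self/uid_map.
map_uid() {
    awk -v uid="$1" '
        uid >= $1 && uid < $1 + $3 { print $2 + (uid - $1); found = 1; exit }
        END { exit !found }
    ' "${2:-/proc/self/uid_map}"
}
```

With the single-line map above, map_uid 0 gives 1000 and map_uid 1 fails, because only one UID is mapped. The same lookup works for gid_map.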
rootless
# No sudo needed
$ unshare --user --map-root-user bash

# Inside:
# whoami
root
# id
uid=0(root) gid=0(root)

# But on the host, it's still you
Why this matters: Without user namespaces, containers need real root to mount filesystems, change hostnames, create network interfaces, and change UIDs. With them, the kernel checks capabilities relative to the user namespace that owns each resource, so an unprivileged user can do all of this inside namespaces they created. This is what podman's rootless mode is built on.

Container from Scratch

A container is just namespaces + chroot/pivot_root + cgroups. Here's the process.

Step 1: clone() with NEWPID | NEWUTS | NEWNS | NEWNET
Step 2: Mount rootfs (overlay / bind mount)
Step 3: pivot_root (swap / with new rootfs)
Step 4: Mount /proc (mount -t proc proc /proc)
Step 5: Set hostname (UTS namespace)
Step 6: exec /bin/sh
bash
#!/usr/bin/env bash
# Minimal container in ~20 lines

ROOTFS="/tmp/container-root"
mkdir -p "$ROOTFS"/{bin,proc,.old}

# Install a minimal userland (a statically linked busybox)
cp /path/to/busybox "$ROOTFS/bin/busybox"
for cmd in sh ls ps mount umount cat echo; do
    ln -sf busybox "$ROOTFS/bin/$cmd"
done

# Launch in new PID, UTS, mount, IPC, and network namespaces
sudo unshare --pid --uts --mount --ipc --net --fork \
    /bin/bash -c "
        hostname container
        # pivot_root requires the new root to be a mount point
        mount --bind $ROOTFS $ROOTFS
        mount -t proc proc $ROOTFS/proc
        cd $ROOTFS
        pivot_root . .old
        # umount now resolves to the busybox applet in the new root
        umount -l /.old
        exec /bin/sh
    "
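The PID 1 problem: In a PID namespace, if PID 1 dies, the kernel kills every other process in that namespace. PID 1 also has to reap orphaned zombies, and signals it has not installed a handler for are ignored. This is why containers benefit from a real init process; Docker can inject one with --init (tini). Having a PID 1 is a kernel requirement; running a dedicated init as PID 1 is optional but recommended.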

How Docker Actually Does It

Simplified flow of docker run. Steps 4a-4b are what the demo scripts above do, just in Go instead of bash.

docker run flow
1.  docker CLI --> dockerd (REST API call)
2.  dockerd --> containerd (gRPC: create container)
3.  containerd --> runc (OCI runtime)
4.  runc:
    a. clone(CLONE_NEWPID | CLONE_NEWUTS | CLONE_NEWNS | ...)
    b. In child:
       - Set up cgroups
       - Mount /proc, /sys, /dev
       - Set up rootfs (overlay mounts)
       - pivot_root
       - Set hostname
       - Configure network (Docker: libnetwork, driven by dockerd; Kubernetes runtimes: CNI plugins)
       - Drop capabilities
       - Set seccomp filters
       - Set AppArmor/SELinux labels
       - execve(container entrypoint)
5.  Container process running, fully isolated

Mount Propagation

Mounts can propagate between namespaces. This is often the source of subtle container bugs.

Mode         Behavior                                    Use Case
private      Changes stay in this namespace only         Default for containers
shared       Changes propagate to all peer mounts        Host mounts visible in containers
slave        Receives propagation but doesn't send       One-way host-to-container sync
unbindable   Private, and can't be bind-mounted at all   Stop recursive bind mounts from multiplying
check propagation
$ findmnt -o TARGET,PROPAGATION

# Make a mount private (no propagation)
# mount --make-private /mnt/data

# Make a mount shared (propagate everywhere)
# mount --make-shared /mnt/data

# Recursively make everything private (container startup)
# mount --make-rprivate /
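
The propagation mode that findmnt reports comes from the optional fields of /proc/self/mountinfo (column 7 up to the "-" separator): shared:N means shared, master:N means slave, unbindable is literal, and no tag means private. A sketch of decoding that by hand; the function name propagation is illustrative.

```bash
#!/usr/bin/env bash
# propagation [MOUNTINFO_FILE]
# Prints "<mount point> <mode>" per mount by decoding the optional fields
# of a mountinfo file. MOUNTINFO_FILE defaults to /proc/self/mountinfo.
propagation() {
    awk '{
        mode = ""
        # Optional fields start at column 7 and end at the "-" separator
        for (i = 7; i <= NF && $i != "-"; i++) {
            if ($i ~ /^shared:/)         tag = "shared"
            else if ($i ~ /^master:/)    tag = "slave"
            else if ($i == "unbindable") tag = "unbindable"
            else continue
            mode = (mode == "" ? tag : mode "," tag)
        }
        print $5, (mode == "" ? "private" : mode)
    }' "${1:-/proc/self/mountinfo}"
}
```

A mount can carry both shared:N and master:N, in which case this prints "shared,slave".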

Recipes

no network
# Run a process with zero network access
$ sudo unshare --net -- bash -c '
    curl google.com   # fails: only a down loopback, nothing to connect through
    ping 8.8.8.8      # fails: no route
'

# Useful for sandboxing builds, running untrusted code, etc.
hostname
$ sudo unshare --uts -- bash -c '
    hostname devbox
    exec my-app
'

# my-app sees hostname "devbox"
# Host hostname is untouched
rootless
# No sudo needed
$ unshare --user --map-root-user --pid --fork --mount-proc bash

# You're now "root" with isolated PIDs
# whoami
root
# ps aux
  PID  USER  COMMAND
    1  root  bash
docker namespaces
$ docker run -d --name test alpine sleep 3600
$ PID=$(docker inspect -f '{{.State.Pid}}' test)

# Compare container vs host namespaces
$ for ns in /proc/$PID/ns/*; do
    type=$(basename "$ns")
    cnt=$(sudo readlink "$ns")                 # root-owned process: needs sudo
    host=$(sudo readlink "/proc/1/ns/$type")
    [[ "$cnt" != "$host" ]] && echo "ISOLATED: $type"
done
ISOLATED: cgroup
ISOLATED: ipc
ISOLATED: mnt
ISOLATED: net
ISOLATED: pid
ISOLATED: pid_for_children
ISOLATED: uts
rescue
# Enter a container when docker exec hangs or dockerd is dead
$ PID=$(cat /run/docker/containerd/*/init.pid)  # path varies by Docker version and matches one file per container; or find the PID in ps
$ sudo nsenter --target $PID --all bash

# You're inside the container, full shell
# Works even if docker daemon is completely dead

Quick Reference

Key Files

/proc/<pid>/ns/*       Namespace symlinks
/proc/<pid>/uid_map    UID mapping
/proc/<pid>/gid_map    GID mapping
/proc/<pid>/status     NSpid, NStgid, etc.
/proc/<pid>/cgroup     Cgroup membership
/var/run/netns/        Named net namespaces

Key Tools

unshare     Create new namespaces
nsenter     Enter existing namespaces
lsns        List namespaces
ip netns    Manage net namespaces
findmnt     Mount propagation info
pstree      PID namespace trees

Man Pages

$ man 7 namespaces          # overview
$ man 7 user_namespaces     # UID mapping details
$ man 7 cgroups             # resource limits (the other half of containers)
$ man 2 unshare             # syscall
$ man 2 clone               # syscall
$ man 2 setns               # syscall