Linux Namespaces

The 8 Namespace Types

Namespaces partition kernel resources so one set of processes sees one set of resources, another sees a different set. They are the isolation layer behind every container runtime.

Mount

CLONE_NEWNS

unshare -m

Filesystem mount points. Each namespace gets its own mount table. This is how containers get their own rootfs.

Linux 2.4.19 (2002)

UTS

CLONE_NEWUTS

unshare -u

Hostname and NIS domain name. The simplest namespace type. Change hostname without affecting the host.

Linux 2.6.19 (2006)

IPC

CLONE_NEWIPC

unshare -i

System V IPC objects and POSIX message queues. Prevents cross-container IPC snooping.

Linux 2.6.19 (2006)

PID

CLONE_NEWPID

unshare -p

Process ID number space. First forked child becomes PID 1. Processes outside are invisible from inside.

Linux 2.6.24 (2008)

Network

CLONE_NEWNET

unshare -n

Entire network stack: interfaces, routes, iptables, sockets. Starts with only a loopback (and it's DOWN).

Linux 2.6.29 (2009)

User

CLONE_NEWUSER

unshare -U

UIDs, GIDs, and capabilities. Be root inside while staying an unprivileged user outside. Enables rootless containers.

Linux 3.8 (2013)

Cgroup

CLONE_NEWCGROUP

unshare -C

Cgroup root directory. Virtualizes /proc/self/cgroup so the container sees itself at the cgroup root.

Linux 4.6 (2016)

Time

CLONE_NEWTIME

unshare -T

CLOCK_MONOTONIC and CLOCK_BOOTTIME offsets. Make a process think the system booted at a different time.

Linux 5.6 (2020)

Syscalls

Three syscalls power the entire namespace subsystem.

Syscall	Signature	Purpose
clone()	clone(fn, stack, flags, arg)	Create child process in new namespace(s). Flags are OR'd CLONE_NEW* constants.
unshare()	unshare(flags)	Move the calling process into new namespace(s). No fork needed. Exception: CLONE_NEWPID and CLONE_NEWTIME only take effect for children created afterwards.
setns()	setns(fd, nstype)	Join an existing namespace via its /proc fd. This is what `nsenter` uses.

C
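// Fragments: need #define _GNU_SOURCE plus <sched.h>, <signal.h>, <fcntl.h>, <unistd.h>.
// Check every return value in real code.

// Create child in new PID + UTS namespace
int flags = CLONE_NEWPID | CLONE_NEWUTS | SIGCHLD;
pid_t pid = clone(child_fn, stack + STACK_SIZE, flags, arg);  // stack grows down: pass its top

// Move current process into a new network namespace
unshare(CLONE_NEWNET);

// Join an existing namespace
int fd = open("/proc/12345/ns/net", O_RDONLY);
setns(fd, CLONE_NEWNET);
close(fd);  // the fd is not needed once the namespace is joined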

unshare command

Create new namespace(s) and run a program inside them.

basics
# New UTS namespace, change hostname
$ sudo unshare --uts bash -c 'hostname mybox; bash'

# New PID namespace (must fork + remount /proc)
$ sudo unshare --pid --fork --mount-proc bash

# New network namespace (blank stack, only lo)
$ sudo unshare --net bash

# Rootless user namespace (no sudo!)
$ unshare --user --map-root-user bash

# PID, UTS, mount, IPC, and network namespaces together (basically a container)
$ sudo unshare --pid --uts --mount --ipc --net --fork --mount-proc bash

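Why --fork with --pid? unshare(CLONE_NEWPID) does not move the caller; only its children land in the new PID namespace. Without --fork, unshare execs the program directly, so the program stays in the old PID namespace. Its first child becomes PID 1 of the new namespace, and once that child exits, every later fork fails with "Cannot allocate memory". --mount-proc remounts /proc so ps/top see the new PID space.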
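To see the --fork difference directly, here is a small sketch. It assumes util-linux unshare is installed and unprivileged user namespaces are enabled (some distros turn them off); the function names are made up for illustration.

```shell
# Hypothetical helper names. Assumes unprivileged user namespaces are enabled.

# With --fork: unshare forks first, so sh is the first process in the new
# PID namespace and sees itself as PID 1.
pid_with_fork() {
    unshare --user --map-root-user --pid --fork sh -c 'echo $$'
}

# Without --fork: unshare execs sh directly, so sh stays in the old PID
# namespace and prints an ordinary host PID.
pid_without_fork() {
    unshare --user --map-root-user --pid sh -c 'echo $$'
}
```

pid_with_fork should print 1; pid_without_fork prints whatever PID the host assigned.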

nsenter command

Enter existing namespaces of a running process. Uses setns() under the hood.

enter by pid
# Enter all namespaces of PID 12345
$ sudo nsenter --target 12345 --all bash

# Enter just the network namespace
$ sudo nsenter --target 12345 --net bash

# Enter a Docker container (bypass docker exec)
$ PID=$(docker inspect -f '{{.State.Pid}}' mycontainer)
$ sudo nsenter --target $PID --all bash

# Useful when dockerd is stuck but you need into a container

Network Namespaces

A new net namespace starts with nothing but a down loopback. You build connectivity from scratch, typically with veth pairs.

veth pair
# Create a virtual ethernet pair (two ends of a cable)
# ip link add veth-host type veth peer name veth-ns

# Move one end into the namespace (it must exist first: ip netns add myns)
# ip link set veth-ns netns myns

# Assign IPs
# ip addr add 10.200.1.1/24 dev veth-host
# ip netns exec myns ip addr add 10.200.1.2/24 dev veth-ns

# Bring up
# ip link set veth-host up
# ip netns exec myns ip link set veth-ns up
# ip netns exec myns ip link set lo up

# Test
# ip netns exec myns ping 10.200.1.1

internet access
# Enable IP forwarding
# echo 1 > /proc/sys/net/ipv4/ip_forward

# NAT the namespace traffic
# iptables -t nat -A POSTROUTING -s 10.200.1.0/24 -j MASQUERADE

# Default route inside the namespace
# ip netns exec myns ip route add default via 10.200.1.1

# Now the namespace can reach the internet
# ip netns exec myns curl ifconfig.me

ip netns
$ sudo ip netns add myns          # create
$ ip netns list                   # list
$ sudo ip netns exec myns bash    # enter
$ sudo ip netns exec myns ip a    # run command inside
$ sudo ip netns del myns          # delete

# ip netns creates bind mounts at /var/run/netns/<name>
# This keeps the namespace alive even with no processes in it
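The veth steps above can be wrapped into one reusable function. This is a sketch, not a hardened tool: the helper names (run, setup_veth, teardown_veth) and the DRY_RUN switch are inventions for this example, and the interface names are the same veth-host/veth-ns used above.

```shell
# Print the command when DRY_RUN=1, otherwise execute it (needs root).
run() {
    if [ "${DRY_RUN:-0}" = 1 ]; then
        echo "$*"
    else
        "$@"
    fi
}

# setup_veth <netns> <host_ip/prefix> <ns_ip/prefix>
setup_veth() {
    run ip netns add "$1"
    run ip link add veth-host type veth peer name veth-ns
    run ip link set veth-ns netns "$1"
    run ip addr add "$2" dev veth-host
    run ip netns exec "$1" ip addr add "$3" dev veth-ns
    run ip link set veth-host up
    run ip netns exec "$1" ip link set veth-ns up
    run ip netns exec "$1" ip link set lo up
}

# teardown_veth <netns>
# Deleting the namespace destroys veth-ns, and the kernel removes its peer with it.
teardown_veth() {
    run ip netns del "$1"
}
```

Set DRY_RUN=1 and call setup_veth myns 10.200.1.1/24 10.200.1.2/24 to preview the eight commands before running them for real as root.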

Docker's Bridge Networking

Container A ---veth---+
                      |
                 docker0 bridge --- eth0 --- internet
                      |           (+ iptables NAT)
Container B ---veth---+

/proc Namespace Interface

Every process exposes its namespace memberships as symlinks under /proc/<pid>/ns/. Same inode number = same namespace.

/proc/<pid>/ns/
+-- cgroup              cgroup:[4026531835]
+-- ipc                 ipc:[4026531839]
+-- mnt                 mnt:[4026531841]
+-- net                 net:[4026531840]
+-- pid                 pid:[4026531836]
+-- pid_for_children
+-- time                time:[4026531834]
+-- time_for_children
+-- user                user:[4026531837]
+-- uts                 uts:[4026531838]

inspection
# Your own namespaces
$ ls -la /proc/$$/ns/

# Compare two processes
$ readlink /proc/1/ns/pid
pid:[4026531836]
$ readlink /proc/$$/ns/pid
pid:[4026531836]          # same inode = same namespace

# List all namespaces on the system
$ lsns

# Filter by type
$ lsns -t net

# A process has one PID per PID namespace level it is visible in
$ grep NSpid /proc/<pid>/status
NSpid:  12345   1       # host PID 12345, container PID 1
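The checks above are easy to script. A minimal sketch follows; the function names are hypothetical, and same_ns only works for processes whose ns links you are allowed to read (your own, or any process when run as root).

```shell
# ns_inode <link-target>: extract the inode from text like "pid:[4026531836]"
ns_inode() {
    printf '%s\n' "$1" | sed -n 's/^[a-z_]*:\[\([0-9]*\)]$/\1/p'
}

# same_ns <pid1> <pid2> <type>: succeed if both processes share that namespace.
# Returns 2 if either link cannot be read.
same_ns() {
    a=$(readlink "/proc/$1/ns/$3") || return 2
    b=$(readlink "/proc/$2/ns/$3") || return 2
    [ "$(ns_inode "$a")" = "$(ns_inode "$b")" ]
}

# ns_pids <status-file>: print the PID at each namespace level, outermost first
ns_pids() {
    awk '/^NSpid:/ { for (i = 2; i <= NF; i++) print $i }' "$1"
}
```

For example, same_ns 1 $$ net && echo shared tells you whether your shell shares the network namespace with PID 1, and ns_pids /proc/<pid>/status lists the host PID followed by the PID seen inside each nested namespace.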

User Namespaces + UID Maps

The most security-relevant namespace. Maps UIDs between namespaces. Root inside the container is an unprivileged user on the host.

uid_map
# /proc/<pid>/uid_map format:
# <id_inside> <id_outside> <range>

         0       1000          1

# UID 0 inside = UID 1000 outside
# "root" in the container is
# your unprivileged user on host

rootless
# No sudo needed
$ unshare --user --map-root-user bash

# Inside:
# whoami
root
# id
uid=0(root) gid=0(root)

# But on the host, it's still you

Why this matters: without user namespaces, containers need real root to mount filesystems, change hostnames, create network interfaces, and change UIDs. With them, the kernel checks capabilities against the user namespace that owns each resource, so an unprivileged user can do all of this inside namespaces they created. This is what podman's rootless mode is built on.
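The uid_map lookup is simple range arithmetic: find the line whose range contains the inside ID, then add the offset. A sketch of that translation, with a made-up function name, reading uid_map content on stdin (the same logic applies to gid_map):

```shell
# uid_outside <inside-uid>: translate using uid_map lines read from stdin.
# Each line is "<id_inside> <id_outside> <range>". Exits 1 if the ID is unmapped.
uid_outside() {
    awk -v id="$1" '
        id >= $1 && id < $1 + $3 {
            printf "%.0f\n", $2 + (id - $1)
            found = 1
            exit
        }
        END { if (!found) exit 1 }
    '
}
```

Usage: uid_outside 0 < /proc/<pid>/uid_map. With the single mapping shown earlier (0 1000 1), inside UID 0 resolves to 1000 and every other UID is unmapped. Keep in mind the kernel reports the outside IDs relative to the user namespace of whoever reads the file.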

Container from Scratch

A container is just namespaces + chroot/pivot_root + cgroups. Here's the process.

Step 1: clone() (NEWPID | NEWUTS | NEWNS | NEWNET)
Step 2: Mount rootfs (overlay / bind mount)
Step 3: pivot_root (swap / with new rootfs)
Step 4: Mount /proc (mount -t proc proc /proc)
Step 5: Set hostname (UTS namespace)
Step 6: exec (/bin/sh)

bash
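#!/usr/bin/env bash
# Minimal container in ~25 lines
set -e

ROOTFS="/tmp/container-root"
# bin for the userland, proc for the new /proc, .old as pivot_root's put_old
mkdir -p "$ROOTFS"/{bin,proc,.old}

# Install a minimal userland (busybox must be statically linked)
cp /path/to/busybox "$ROOTFS/bin/busybox"
for cmd in sh ls ps mount umount cat echo; do
    ln -sf busybox "$ROOTFS/bin/$cmd"
done

# Launch in new PID, UTS, mount, IPC, and network namespaces.
# pivot_root needs the new root to be a mount point, hence the bind mount.
# umount runs after the pivot, so it must exist inside the new root.
sudo unshare --pid --uts --mount --ipc --net --fork \
    /bin/bash -c "
        hostname container
        mount --bind $ROOTFS $ROOTFS
        mount -t proc proc $ROOTFS/proc
        cd $ROOTFS
        pivot_root . .old
        umount -l /.old
        exec /bin/sh
    "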

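The PID 1 problem: in a PID namespace, when PID 1 exits the kernel sends SIGKILL to every other process in that namespace. PID 1 also has to reap orphaned zombies, and signals it has no handler for are ignored rather than given their default action. That is why containers benefit from a real init process. docker run --init adds one (tini); the flag is optional, but whatever runs as PID 1 inherits these duties.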

How Docker Actually Does It

Simplified flow of docker run. Steps 4a-4b are what the demo script above does, just in Go instead of bash.

docker run flow
1.  docker CLI --> dockerd (REST API call)
2.  dockerd --> containerd (gRPC: create container)
3.  containerd --> runc (OCI runtime)
4.  runc:
    a. clone(CLONE_NEWPID | CLONE_NEWUTS | CLONE_NEWNS | ...)
    b. In child:
       - Set up cgroups
       - Mount /proc, /sys, /dev
       - Set up rootfs (overlay mounts)
       - pivot_root
       - Set hostname
       - Attach network (Docker uses libnetwork; CNI plugins are the Kubernetes route)
       - Drop capabilities
       - Set seccomp filters
       - Set AppArmor/SELinux labels
       - execve(container entrypoint)
5.  Container process running, fully isolated

Mount Propagation

Mounts can propagate between namespaces. This is often the source of subtle container bugs.

Mode	Behavior	Use Case
private	Changes stay in this namespace only	Default for containers
shared	Changes propagate to all peer mounts	Host mounts visible in containers
slave	Receives propagation but doesn't send	One-way host-to-container sync
unbindable	Private, and can't be bind-mounted at all	Avoid mount explosion from recursive bind mounts

check propagation
$ findmnt -o TARGET,PROPAGATION

# Make a mount private (no propagation)
# mount --make-private /mnt/data

# Make a mount shared (propagate everywhere)
# mount --make-shared /mnt/data

# Recursively make everything private (container startup)
# mount --make-rprivate /
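The same propagation state is visible without findmnt, in the optional fields of /proc/<pid>/mountinfo: shared:N, master:N (a slave), unbindable, and none of these for a private mount. A rough sketch of decoding it, with a hypothetical function name:

```shell
# propagation_of: read mountinfo lines on stdin, print "<mount point> <propagation>".
# Optional fields start at field 7 and end at the "-" separator.
propagation_of() {
    awk '{
        prop = ""
        for (i = 7; i <= NF && $i != "-"; i++) {
            if ($i ~ /^shared:/)         prop = prop (prop ? "," : "") "shared"
            else if ($i ~ /^master:/)    prop = prop (prop ? "," : "") "slave"
            else if ($i == "unbindable") prop = prop (prop ? "," : "") "unbindable"
        }
        if (prop == "") prop = "private"
        print $5, prop
    }'
}
```

Usage: propagation_of < /proc/self/mountinfo. A mount can be both shared and slave at once, which this prints as shared,slave.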

Recipes

no network
# Run a process with zero network access
$ sudo unshare --net -- bash -c '
    curl google.com   # fails: can't resolve or connect (only a down loopback)
    ping 8.8.8.8      # fails: Network is unreachable
'

# Useful for sandboxing builds, running untrusted code, etc.

hostname
$ sudo unshare --uts -- bash -c '
    hostname devbox
    exec my-app
'

# my-app sees hostname "devbox"
# Host hostname is untouched

rootless
# No sudo needed
$ unshare --user --map-root-user --pid --fork --mount-proc bash

# You're now "root" with isolated PIDs
# whoami
root
# ps aux
  PID  USER  COMMAND
    1  root  bash

docker namespaces
$ docker run -d --name test alpine sleep 3600
$ PID=$(docker inspect -f '{{.State.Pid}}' test)

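# Compare container vs host namespaces (run from a root shell; another
# user's ns links are not readable otherwise)
$ for ns in /proc/$PID/ns/*; do
    type=$(basename "$ns")
    cnt=$(readlink "$ns")
    host=$(readlink "/proc/1/ns/$type")
    [[ "$cnt" != "$host" ]] && echo "ISOLATED: $type"
done
ISOLATED: cgroup
ISOLATED: ipc
ISOLATED: mnt
ISOLATED: net
ISOLATED: pid
ISOLATED: pid_for_children
ISOLATED: uts

# cgroup shows up only when the container has a private cgroup namespace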

rescue
# Enter a container when docker exec hangs or dockerd is dead
$ PID=$(cat /run/docker/containerd/*/init.pid)  # path varies by version and the glob can match several containers; or find the PID in ps
$ sudo nsenter --target $PID --all bash

# You're inside the container, full shell
# Works even if docker daemon is completely dead

Quick Reference

Key Files

`/proc/<pid>/ns/*`	Namespace symlinks
`/proc/<pid>/uid_map`	UID mapping
`/proc/<pid>/gid_map`	GID mapping
`/proc/<pid>/status`	NSpid, NStgid, etc.
`/proc/<pid>/cgroup`	Cgroup membership
`/var/run/netns/`	Named net namespaces

Key Tools

`unshare`	Create new namespaces
`nsenter`	Enter existing namespaces
`lsns`	List namespaces
`ip netns`	Manage net namespaces
`findmnt`	Mount propagation info
`pstree`	PID namespace trees

Man Pages

$ man 7 namespaces          # overview
$ man 7 user_namespaces     # UID mapping details
$ man 7 cgroups             # resource limits (the other half of containers)
$ man 2 unshare             # syscall
$ man 2 clone               # syscall
$ man 2 setns               # syscall