eBPF

Extended Berkeley Packet Filter // Kernel-level programmability // Linux 3.15+

What eBPF Actually Is

eBPF is a sandboxed virtual machine inside the Linux kernel. It lets you run custom bytecode at kernel hook points without modifying kernel source or loading kernel modules. Programs are verified at load time for safety (no infinite loops, no invalid memory access), then JIT-compiled to native machine code.

The "extended" part matters: classic BPF (cBPF, 1992) was a simple packet filter with 2 registers and 32-bit ops. eBPF (2014+) is a general-purpose in-kernel execution engine with 11 registers, 64-bit ops, maps (key-value stores), helper functions, and tail calls.

Programs attach to hook points: kprobes (any kernel function), uprobes (any userspace function), tracepoints (stable kernel instrumentation), XDP (network driver ingress), tc (traffic control), cgroup hooks, LSM hooks, and more. Each hook type defines what context the program receives and what it can do.

Why this matters: Before eBPF, if you wanted custom kernel instrumentation you had two options: write a kernel module (dangerous, version-specific, can crash the system) or modify the kernel source and recompile. eBPF gives you the same power with safety guarantees enforced by the verifier.

Kernel Execution Pipeline

USERSPACE
   BCC Python/Go         bpftrace            libbpf CO-RE
   (C source code)       (bpftrace script)   (pre-compiled .o)
        |                     |                    |
        v                     v                    v
   LLVM/Clang compiles to BPF bytecode (or loaded directly)
                              |
                              v
  -------------------- bpf() syscall --------------------
KERNEL
                              v
   VERIFIER     -- rejects unsafe programs (bounds checks, no loops*,
                   stack depth <= 512 bytes, no null derefs)
                              v
   JIT COMPILER -- bytecode -> native (x86_64, arm64, etc.)
                              |
                              v
   ATTACH to hook point (kprobe, tracepoint, XDP, etc.)
                              |
                              v
   BPF MAPS     <-- shared state between kernel prog & userspace
                    (hash, array, ringbuf, perf_event, LRU, stack_trace, ...)
                              |
                              v
   Userspace reads maps / perf buffer / ring buffer for output

* bounded loops allowed since kernel 5.3 (verifier proves termination)
Step 1: Write C -- BPF program source
Step 2: Compile -- LLVM -> BPF bytecode
Step 3: bpf() syscall -- load into kernel
Step 4: Verify -- safety proof
Step 5: JIT -- native machine code
Step 6: Attach -- kprobe / XDP / etc.

Core Concepts

BPF Verifier

Every BPF program passes through the verifier before execution. It walks all possible code paths and proves:

1. No unreachable instructions
2. No out-of-bounds memory access
3. All branches terminate (DAG check)
4. Stack usage <= 512 bytes
5. Only allowed helper functions called
6. R0 contains a valid return value before exit
7. Map pointers null-checked before deref

The verifier is your biggest enemy when writing BPF C. The error "R1 invalid mem access 'scalar'" means you tried to dereference a raw pointer without bpf_probe_read(). Every struct field read from a kernel pointer in kprobe context needs an explicit probe read.

BPF Maps

Maps are the shared data structures between kernel-side BPF programs and userspace. They persist across BPF program invocations and are the primary mechanism for collecting data.

BPF_MAP_TYPE_HASH: general k/v store
BPF_MAP_TYPE_ARRAY: fixed-size indexed
BPF_MAP_TYPE_PERF_EVENT_ARRAY: per-cpu event stream
BPF_MAP_TYPE_RINGBUF: shared ring buffer (5.8+)
BPF_MAP_TYPE_LRU_HASH: auto-evicting cache
BPF_MAP_TYPE_STACK_TRACE: call stack capture
BPF_MAP_TYPE_PERCPU_HASH: lock-free per-cpu

In BCC Python: BPF_HASH(counts) declares a hash map. BPF_PERF_OUTPUT(events) declares a perf buffer. Userspace polls with b.perf_buffer_poll().
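The kernel-side idiom these maps support is always the same: look up, null-check, update. A minimal Python model of the helper semantics (illustrative only -- BpfHashModel and count_event are made-up names, not BCC API):

```python
class BpfHashModel:
    """Toy stand-in for a BPF_HASH map's helper semantics."""
    def __init__(self):
        self._m = {}

    def lookup(self, key):
        # bpf_map_lookup_elem: returns a value pointer, or NULL if absent
        return self._m.get(key)

    def update(self, key, val):
        # bpf_map_update_elem with BPF_ANY: insert or overwrite
        self._m[key] = val

def count_event(counts, pid):
    # The canonical counting idiom; the verifier rejects programs
    # that skip the None (NULL) check before using the value.
    val = counts.lookup(pid)
    counts.update(pid, 1 if val is None else val + 1)

counts = BpfHashModel()
for _ in range(3):
    count_event(counts, 4242)
print(counts.lookup(4242))   # -> 3
```

In real BPF C the lookup returns a pointer and the increment happens in place; the null check is mandatory either way.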

Where BPF Programs Attach

kprobe / kretprobe: any kernel function entry/return
uprobe / uretprobe: any userspace function entry/return
tracepoint: stable kernel instrumentation points
raw_tracepoint: lower overhead, raw args
fentry / fexit: BTF-based, lower overhead than kprobes (5.5+)
perf_event: PMC sampling, CPU profiling

kprobe vs tracepoint: kprobes can hook ANY kernel function, but the interface can change between kernel versions. Tracepoints are stable across versions but only exist where kernel devs placed them. Use tracepoints when available, kprobes when you need to reach deeper.

XDP: driver-level packet processing (fastest)
tc (clsact): traffic control ingress/egress
sk_filter: socket-level packet filtering
sk_msg / sk_skb: sockmap redirect (service mesh)
lwt: lightweight tunnel encap/decap
cgroup/sock*: per-cgroup network policy

XDP performance: XDP runs before the kernel allocates an skb (socket buffer), so it can process packets at line rate -- Cloudflare uses it to mitigate multi-Tbps DDoS attacks. Actions: XDP_PASS, XDP_DROP, XDP_TX (bounce back out the same NIC), XDP_REDIRECT.

LSM: security module hooks (5.7+)
seccomp: syscall filtering
cgroup/device: device access control
cgroup/sysctl: sysctl interposition

LSM + eBPF: LSM hooks let BPF enforce security policy at the same kernel points as SELinux/AppArmor -- but dynamically, without rebooting or recompiling policy. Cilium uses this for Kubernetes network policy enforcement at the kernel level.

BCC (BPF Compiler Collection)

BCC provides a Python (and Lua) frontend for writing BPF programs. C code is embedded as a string, compiled at runtime via LLVM, and loaded into the kernel. This is the fastest path from zero to working BPF instrumentation but has trade-offs.

Advantages

rapid prototyping: write, run, iterate
rich Python API: map access, formatting
100+ built-in tools: production-ready
auto struct offsets: reads kernel headers

Trade-offs

requires LLVM + headers: on every target
compile at runtime: slow startup on ARM
not portable: tied to kernel version
heavy deps: ~300MB installed

Anatomy of a BCC Program

BCC Python
#!/usr/bin/env python3
from bcc import BPF

# ---- KERNEL SIDE (C) ----
prog = """
struct event_t {
    u32 pid;
    char comm[16];
    char fname[64];
    u64 size;
};

BPF_PERF_OUTPUT(events);              // declare perf buffer

int trace_read_entry(struct pt_regs *ctx,
                     struct file *file,
                     char __user *buf,
                     size_t count) {
    struct event_t evt = {};
    evt.pid = bpf_get_current_pid_tgid() >> 32;
    bpf_get_current_comm(&evt.comm, sizeof(evt.comm));
    bpf_probe_read_kernel_str(&evt.fname, sizeof(evt.fname),
                               file->f_path.dentry->d_name.name);
    evt.size = count;
    events.perf_submit(ctx, &evt, sizeof(evt));
    return 0;
}
"""

# ---- USERSPACE SIDE (Python) ----
b = BPF(text=prog)                           # compile + load
b.attach_kprobe(event="vfs_read",
                fn_name="trace_read_entry") # attach to hook

def handle_event(cpu, data, size):           # callback for each event
    evt = b["events"].event(data)
    print(f"{evt.pid}  {evt.comm.decode()}  "
          f"{evt.fname.decode()}  {evt.size}")   # ctypes char[] -> bytes

b["events"].open_perf_buffer(handle_event)
while True:
    b.perf_buffer_poll()                     # blocks, calls handle_event

BCC's rewriter gotcha: The C string is compiled by LLVM into BPF bytecode at runtime. BCC's rewriter automatically converts struct member access into bpf_probe_read() calls -- but it misses some cases (like ntohs(sk->field)), which is why you hit verifier errors and need explicit bpf_probe_read_kernel().

Reference

bpf_get_current_pid_tgid(): pid in upper 32 bits, tid in lower 32
bpf_get_current_comm(buf, sz): process name (16 chars max)
bpf_ktime_get_ns(): monotonic nanoseconds
bpf_probe_read_kernel(dst, sz, src): safe kernel mem read
bpf_probe_read_user(dst, sz, src): safe userspace mem read
bpf_probe_read_kernel_str(): read kernel string safely
bpf_map_lookup_elem(map, &key): read from map (returns ptr or NULL)
bpf_map_update_elem(map, &key, &val, flags): write to map
bpf_map_delete_elem(map, &key): remove from map
bpf_perf_event_output(ctx, map, flags, data, sz): send to perf buffer
bpf_trace_printk(fmt, ...): debug only (slow, limited)
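The >> 32 shift you see in every BCC example comes from how bpf_get_current_pid_tgid() packs two IDs into one u64. A quick sketch of the unpacking in plain Python (split_pid_tgid is an illustrative name, not a BPF helper):

```python
def split_pid_tgid(pid_tgid):
    # Upper 32 bits: tgid (what userspace tools call the PID).
    # Lower 32 bits: kernel pid (what userspace calls the TID).
    return pid_tgid >> 32, pid_tgid & 0xFFFFFFFF

# A thread with TID 5678 inside process 1234 would see:
packed = (1234 << 32) | 5678
assert split_pid_tgid(packed) == (1234, 5678)
```

This is why evt.pid = bpf_get_current_pid_tgid() >> 32 in the BCC program above reports the process, not the thread.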

C-side macros (in the embedded C string):

BPF_HASH(name, key_t, val_t): hash map
BPF_ARRAY(name, val_t, max): array map
BPF_PERF_OUTPUT(name): perf buffer
BPF_PERCPU_ARRAY(name): per-cpu array

Python-side API:

b.attach_kprobe(event, fn_name): hook kernel func
b.attach_kretprobe(event, fn): hook return
b.attach_uprobe(name, sym, fn): hook userspace func
b.attach_tracepoint(tp, fn): stable kernel hook
b["map"].open_perf_buffer(cb): register event callback
b.perf_buffer_poll(): block + dispatch events
b["map"][key]: read map entry directly
Common kprobe targets:

File I/O
vfs_read / vfs_write: all file reads/writes
vfs_open: file open
do_sys_openat2: openat syscall impl

Networking
tcp_v4_connect: outbound TCP
tcp_finish_connect: handshake done
tcp_sendmsg / tcp_recvmsg: TCP data transfer
tcp_close: teardown
udp_sendmsg: UDP (DNS etc.)

Process
__arm64_sys_execve: exec (arm64)
__x64_sys_execve: exec (x86_64)
do_exit: process exit
wake_up_new_task: fork/clone

BPF Verifier Errors & Fixes

"R1 invalid mem access 'scalar'": dereferencing a kernel pointer directly. Fix: use bpf_probe_read_kernel().
"BPF stack limit exceeded" (512 bytes): struct too large for the BPF stack. Fix: use a BPF_PERCPU_ARRAY as scratch space.
"back-edge from insn X to Y": unbounded loop detected (pre-5.3 kernels). Fix: #pragma unroll, or bounded for-loops on 5.3+.
"invalid indirect read from stack": reading uninitialized stack memory. Fix: zero-init structs: struct foo bar = {};
"cannot pass map_value to helper": map value pointer used where a scalar is expected. Fix: dereference into a local var first.
"btf_vmlinux is malformed": kernel BTF data missing or broken. Fix: install linux-headers or build the kernel with CONFIG_DEBUG_INFO_BTF=y.

Toolchains Compared

BCC
(Python / Lua frontends, runtime compile)

Inline C compiled at load time. Best for prototyping and ad-hoc investigation. Requires LLVM + kernel headers on target.

startup: slow (LLVM compile)
portability: tied to kernel version
prototyping: fastest iteration
libbpf + CO-RE
(C / Go / Rust bindings, pre-compiled)

Compile Once, Run Everywhere. BPF programs compiled ahead of time with clang. BTF provides struct layout info at runtime. Production standard.

startup: instant (pre-compiled)
portability: cross-kernel (BTF)
prototyping: slower iteration
bpftrace
(awk-like DSL, one-liners)

High-level tracing language inspired by DTrace/awk. One-liners for quick investigation. Compiles to BPF under the hood.

startup: moderate
portability: needs BTF or headers
prototyping: one-liner speed

Cilium eBPF Go Library

cilium/ebpf is the Go library for loading and managing pre-compiled BPF programs. It's the production approach: write BPF C, compile with clang to .o, then load/attach/read from Go. No LLVM on the target machine. CO-RE + BTF handles cross-kernel portability.

Workflow:
1. Write the BPF C program (probe.bpf.c)
2. Compile: clang -g -O2 -target bpf -c probe.bpf.c -o probe.bpf.o (the -g is what emits the BTF that CO-RE needs)
3. Generate Go bindings: bpf2go or manual
4. Load from Go: ebpf.LoadCollection() / link.Kprobe()
5. Read maps: map.Lookup() / map.Iterate()

This is what Cilium itself uses for Kubernetes network policy, service mesh, and observability (Hubble). Also used by Cloudflare, Meta, and Netflix for production eBPF tooling.


bpftrace One-Liners

bpftrace
# count syscalls by process
$ bpftrace -e 'tracepoint:raw_syscalls:sys_enter { @[comm] = count(); }'

# trace file opens with path
$ bpftrace -e 'tracepoint:syscalls:sys_enter_openat
  { printf("%s %s\n", comm, str(args.filename)); }'

# histogram of read() sizes
$ bpftrace -e 'tracepoint:syscalls:sys_exit_read /args.ret > 0/
  { @bytes = hist(args.ret); }'

# TCP connect with dest IP
$ bpftrace -e 'kprobe:tcp_v4_connect { printf("%s -> %s\n", comm,
    ntop(((struct sock *)arg0)->__sk_common.skc_daddr)); }'

# slow syscalls (>1ms)
$ bpftrace -e 'tracepoint:raw_syscalls:sys_enter { @start[tid] = nsecs; }
  tracepoint:raw_syscalls:sys_exit /@start[tid]/
  { $d = nsecs - @start[tid];
    if ($d > 1000000) { printf("%s %dms\n", comm, $d/1000000); }
    delete(@start[tid]); }'
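bpftrace's hist() buckets values by power of two. A sketch of that bucketing in plain Python (log2_bucket is an illustrative name, not bpftrace internals):

```python
from collections import Counter

def log2_bucket(n):
    # hist(): values in [2^k, 2^(k+1)) share one bucket; k = floor(log2 n)
    return n.bit_length() - 1 if n > 0 else -1

# e.g. read() return sizes collected by the histogram one-liner above
sizes = [1, 3, 4, 4, 100, 5000]
buckets = Counter(log2_bucket(s) for s in sizes)
print(dict(buckets))   # -> {0: 1, 1: 1, 2: 2, 6: 1, 12: 1}
```

Log buckets are what make in-kernel aggregation cheap: one shift-based index per event instead of storing every sample.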

eBPF Evolution

1992: BPF introduced. Steven McCanne and Van Jacobson. Simple packet filter in BSD. 2 registers, 32-bit.
2014: eBPF merged (Linux 3.15-3.18). Alexei Starovoitov. 11 registers, 64-bit, maps, verifier, JIT. Initial use: networking.
2015: kprobes + tracing support. BPF programs can attach to kernel functions. BCC project starts at PLUMgrid.
2016: Cilium founded / XDP merged (4.8). Thomas Graf. eBPF-based K8s networking. XDP: BPF at the network driver level.
2018: BTF (BPF Type Format). Struct layout metadata enables CO-RE: compile once, run on different kernel versions.
2019: bpftrace 0.9 / bounded loops (5.3). High-level tracing goes mainstream. Verifier gains bounded-loop support.
2020: LSM hooks (5.7) / fentry (5.5). BPF can enforce security policy. Low-overhead function entry/exit tracing.
2021: eBPF Foundation / Windows eBPF. Linux Foundation governance. Cross-platform eBPF runtime explorations.