eBPF (extended Berkeley Packet Filter) is the most significant change to how Linux systems are observed and controlled in two decades. It allows user-defined programs to run inside the kernel, attached to any of thousands of probe points — network stack, scheduler, file I/O, system calls — with overhead measured in nanoseconds, not microseconds. Cilium, Pixie, Parca, Falco, Datadog's Agent, and dozens of other production tools run on eBPF.
The marketing pitch writes itself. The engineering reality is harder. eBPF programs run in a constrained execution environment with a strict verifier, limited stack space, no dynamic memory allocation, and a type system enforced at load time. Writing correct, safe, and performant eBPF for production observability requires understanding these constraints deeply — not just calling bpftrace one-liners.
The eBPF Execution Model
An eBPF program is a sequence of 64-bit instructions executed by the kernel's virtual machine. The ISA (instruction set architecture) is RISC-style with 11 64-bit registers (r0-r10) and a 512-byte stack. Program size is bounded indirectly: since Linux 5.2 the verifier will process up to 1 million instructions across all analyzed execution paths for privileged loaders, so the practical limit is verification complexity rather than raw instruction count.
Before execution, every eBPF program is passed through the kernel verifier — a static analysis pass that rejects programs that could cause crashes, infinite loops, or invalid memory accesses. The verifier performs abstract interpretation: it simulates all possible execution paths of the program, tracking the type and value range of every register at every instruction.
What the Verifier Checks
- No unbounded loops: Every loop must have an iteration bound the verifier can prove at load time. Before Linux 5.3 loops had to be fully unrolled; since 5.3, bounded loops are allowed when the verifier can prove termination
- No invalid pointer dereferences: Every memory access must be preceded by a bounds check that the verifier can trace. Reading a struct field requires proving the pointer is non-null and the offset is within bounds
- No use of uninitialized data: Registers that haven't been written are marked 'not initialized.' Reading an uninitialized register is a verifier rejection
- No kernel memory leaks: File descriptors and references to kernel objects must be properly released
- No stack overflow: The 512-byte stack limit is fixed and enforced. Large local variables must use BPF maps instead
The verifier runs in O(n^2) in the worst case and can reject programs that are correct simply because the verification complexity exceeds internal limits. This is a real production concern: complex observability programs with many branches and map accesses can hit verifier limits even when semantically correct.
BPF Maps: State Between Kernel and Userspace
eBPF programs are stateless by nature — they execute in response to a probe firing and return. State is maintained through BPF maps: key-value data structures accessible from both the eBPF program (in kernel context) and userspace. Choosing the right map type for each use case is a critical performance decision.
Map Types and Their Performance Characteristics
- BPF_MAP_TYPE_HASH: A hash table with O(1) average lookup. Uses a spin lock per bucket, so concurrent writes from different CPUs can cause contention. Capacity is fixed by max_entries at creation time and backed by memlock-accounted memory. Best for per-connection or per-process tracking
- BPF_MAP_TYPE_PERCPU_HASH: A per-CPU hash table that eliminates locking. Each CPU maintains its own map; userspace aggregates. Best for high-frequency counter increments (network packet counting, syscall frequencies)
- BPF_MAP_TYPE_ARRAY: A fixed-size array with O(1) lookup by integer index. Pre-allocated, no dynamic resizing. Zero-cost iterations since all entries exist. Best for small, fixed-size configuration data or indexed counters
- BPF_MAP_TYPE_RINGBUF: A lock-free ring buffer for high-throughput event streaming from kernel to userspace. Supports variable-size records. Orders of magnitude faster than BPF_MAP_TYPE_PERF_EVENT_ARRAY for bulk event streaming. Introduced in Linux 5.8
- BPF_MAP_TYPE_LRU_HASH: A hash map with automatic LRU eviction for bounded-memory tracking of unbounded event streams (e.g., active TCP connections). The LRU eviction uses a per-CPU LRU list to minimize contention
- BPF_MAP_TYPE_STACK_TRACE: Specialized map for storing kernel or user stack traces. Each entry is an array of instruction pointers. Integrates with the kernel's stack walking machinery (frame pointer or ORC unwinder)
A common performance mistake is using BPF_MAP_TYPE_PERF_EVENT_ARRAY for high-frequency event streaming. Perf buffers are per-CPU, so memory is overcommitted across CPUs, events from different CPUs arrive unordered, and each submission copies the full record into the buffer. BPF_MAP_TYPE_RINGBUF is the correct choice for modern kernels (5.8+): a single buffer shared across CPUs, a reserve/commit API that avoids the extra copy, and epoll-based draining — making it 2-4x more CPU efficient at high event rates.
BTF: BPF Type Format and CO-RE
One of the most significant engineering challenges in eBPF has been portability across kernel versions. Kernel data structures change between versions — struct offsets shift, fields are added or removed. An eBPF program that reads task_struct->mm->pgd at a fixed byte offset will produce garbage or crash on a kernel where that offset changed.
BTF (BPF Type Format) and CO-RE (Compile Once, Run Everywhere) solve this. BTF is a compact type information format embedded in the kernel. CO-RE is a technique in libbpf that uses BTF to relocate field accesses at load time, adjusting byte offsets based on the actual structure layout of the running kernel.
How CO-RE Works in Practice
When you write BPF_CORE_READ(task, mm->pgd) in your eBPF C program, the compiler emits relocation records. At load time, libbpf consults the kernel's BTF to find the actual offset of mm within task_struct and pgd within mm_struct on this specific kernel version, and patches the compiled bytecode before loading it. The result is a single compiled eBPF object that runs correctly across kernel versions, provided BTF data is available — built into the kernel since 5.2 when CONFIG_DEBUG_INFO_BTF is set, or supplied externally (e.g., from BTFHub) for older kernels.
- vmlinux.h: Auto-generated header from kernel BTF containing all kernel struct definitions. Eliminates the need to build eBPF programs against kernel headers
- BPF_CORE_READ(): The primary macro for safe, portable structure field access with CO-RE relocation
- bpf_core_field_exists(): Conditional compilation based on whether a struct field exists in the running kernel — essential for code that handles kernel API changes
- BPF skeleton: libbpf auto-generates a C skeleton for each eBPF object, providing a typed API for loading, attaching, and interacting with the program and its maps
Probe Types: Choosing the Right Attachment Point
eBPF programs attach to probe points — specific locations in kernel or userspace code where the program fires. Choosing the right probe type for an observability goal is critical for correctness, performance, and stability.
Kernel Probe Types
- kprobe/kretprobe: Dynamic probes attached to any kernel function entry (kprobe) or exit (kretprobe). Highly flexible but unstable — function names and signatures can change between kernel versions. Not suitable as a stable observability API
- tracepoint: Stable, explicitly defined probe points in the kernel. The kernel maintainers commit to not breaking tracepoint interfaces. Always prefer tracepoints over kprobes when a tracepoint exists for your use case. Key tracepoints: sched:sched_switch, net:netif_receive_skb, syscalls:sys_enter_*
- fentry/fexit: eBPF-native alternatives to kprobe/kretprobe, available since Linux 5.5. Attach via BTF rather than symbol names, providing type-checked access to function arguments. 2-3x lower overhead than kprobes
- perf_event: Attach to hardware performance counters (CPU cycles, cache misses, branch mispredictions). The foundation for low-overhead continuous profiling
- uprobe/uretprobe: Dynamic probes in userspace processes. Used for language runtime instrumentation (JVM, CPython, Go runtime) without requiring language-specific agents
XDP: The Networking Fast Path
XDP (eXpress Data Path) is an eBPF hook attached at the earliest point in the network receive path — before skb allocation, before the kernel's networking stack. XDP programs make a decision (pass, drop, redirect, or transmit) for every incoming packet with overhead of 50-200 nanoseconds per packet, comparable to raw DPDK performance but without leaving the kernel.
Cilium uses XDP for load balancing and DDoS mitigation — dropping attack traffic before it consumes any significant kernel resources. Cloudflare uses XDP to mitigate volumetric DDoS attacks at rates of tens of millions of packets per second on commodity hardware.
Writing a Production eBPF Observability Program
Let's walk through the engineering decisions for a realistic production use case: latency distribution tracking for all outbound TCP connections, broken down by destination IP, with p50/p95/p99 reporting from userspace.
Architecture Decisions
- Probe attachment: Use the sock:inet_sock_set_state tracepoint (Linux 4.16+), which fires on every TCP state transition — including connection initiation (transition to TCP_SYN_SENT) and termination (transition to TCP_CLOSE). Tracepoints are stable; kprobes on tcp_v4_connect()/tcp_close() would be fragile across kernel versions
- Timestamp storage: On the transition to TCP_SYN_SENT, store the timestamp (bpf_ktime_get_ns()) in a hash map keyed by socket pointer. A per-CPU map is tempting but wrong here: the close can fire on a different CPU than the connect, so the lookup would miss
- Latency recording: On the transition to TCP_CLOSE, compute the duration and record it in a histogram map keyed by destination IP prefix (/24). Use a BPF_MAP_TYPE_ARRAY with fixed histogram buckets (0-1ms, 1-5ms, 5-10ms, ..., >1s)
- Userspace reading: Aggregate histogram values in userspace every 5 seconds. Compute percentiles from the histogram buckets in O(n_buckets) time
- Memory bounds: Use BPF_MAP_TYPE_LRU_HASH bounded at 65536 entries for in-flight connections, so entries for sockets that never reach a clean close are evicted rather than leaked
The verifier will reject naive implementations of this. A common pitfall: the value returned by a map lookup is typed map_value_or_null, and it must be null-checked with an explicit branch the verifier can trace. Failing to add that check before dereferencing will cause the program to be rejected with an error like 'invalid mem access map_value_or_null.'
The Overhead Reality
eBPF's overhead is genuinely low, but not zero. Understanding the cost profile is essential for production deployment.
- Tracepoint probe firing: ~50-100 ns per event when the eBPF program is attached. Tracepoints have no overhead when no program is attached
- kprobe probe firing: ~100-300 ns per event due to int3 trap mechanism. fentry probes reduce this to ~30-80 ns via trampolining
- Hash map lookup: ~50-100 ns for BPF_MAP_TYPE_HASH. Per-CPU variant eliminates the spinlock, reducing to ~20-40 ns for non-contested access
- Stack trace capture: ~500-2000 ns depending on stack depth and unwinding method. Frame pointer unwinding is 3-5x faster than DWARF unwinding
- Ring buffer write: ~30-50 ns for writing a variable-size event to BPF_MAP_TYPE_RINGBUF
For a system handling 100k TCP connections per second, the per-connection eBPF overhead is approximately 200 ns * 2 (connect + close) * 100k = 40ms of CPU time per second — about 4% of a single CPU core. This is modest for almost any production workload, which is why eBPF's value proposition is so compelling.
eBPF has made the tradeoff between observability depth and production overhead essentially disappear. The question is no longer 'can we afford to observe this?' but 'do we have the engineering capacity to write the eBPF program correctly?' The bottleneck shifted from performance to correctness.
Operational Considerations
- Kernel version requirements: Feature availability is staggered: fentry/fexit needs Linux 5.5, ringbuf needs 5.8, and CO-RE wants kernel BTF (5.2 with CONFIG_DEBUG_INFO_BTF, or externally supplied BTF). On 4.x kernels (common in RHEL 7 / CentOS 7), capabilities are significantly reduced
- CAP_BPF vs. CAP_SYS_ADMIN: Linux 5.8 introduced the more granular CAP_BPF capability, allowing eBPF program loading without full CAP_SYS_ADMIN. This is the correct production deployment model — not running observability agents as root
- Program pinning: eBPF programs and maps are reference-counted. Pinning them to the BPF virtual filesystem (/sys/fs/bpf/) keeps them alive across agent restarts, enabling zero-downtime agent updates
- Tail calls: Programs exceeding the verifier's complexity limit can chain via tail calls — one eBPF program jumps to another, and each program in the chain is verified independently. Chains are capped at 32 consecutive tail calls, which still allows far greater total complexity at the cost of a small per-jump overhead
Build Zero-Overhead Infrastructure Observability with Accelar
Accelar builds production-grade eBPF observability systems — from custom kernel probes and network telemetry to continuous profiling pipelines and security detection engines. If you need deep visibility into your infrastructure without the overhead of traditional agents, we have the kernel engineering expertise to build it. Let's talk.
