Database performance is one of those topics where intuitions built from high-level abstractions consistently lead engineers astray. 'Why is this query slow?' often has an answer that traces all the way down to write amplification in an LSM tree, a row lock escalation in MVCC, or a branch misprediction in the query executor. The engineers who can reason at this level aren't just faster debuggers — they make fundamentally different architectural decisions.
This post covers three of the most important internal subsystems in modern databases: LSM trees (the storage engine used by RocksDB, Cassandra, LevelDB, ScyllaDB, and TiKV), MVCC (the concurrency control mechanism in PostgreSQL, MySQL InnoDB, CockroachDB, and every serious OLTP database), and vectorized query execution (the engine design used by DuckDB, ClickHouse, DataFusion, and Velox). Each of these is a rabbit hole deep enough for a book. This post gives you the engineering intuitions you need to make informed decisions.
LSM Trees: Write-Optimized Storage
The Log-Structured Merge-Tree (LSM Tree), introduced by O'Neil et al. in 1996, is the storage engine behind the write-heavy world: time-series databases, event stores, distributed KV stores, and analytics ingestion pipelines. The core insight is that sequential writes to disk are dramatically faster than random writes, and that this difference (often 100x on HDDs, 10x on SSDs) is worth paying with increased read complexity.
The LSM Write Path
Every write in an LSM tree goes through three stages. First, it's appended to a write-ahead log (WAL) for durability — this is sequential I/O, fast on any storage medium. Then it's inserted into an in-memory buffer called the MemTable, typically implemented as a skip list or red-black tree for O(log n) point lookups. When the MemTable reaches a size threshold (typically 64MB in RocksDB), it's flushed to disk as an SSTable (Sorted String Table).
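The three stages above can be mocked in a few lines of Python — a sketch only, with illustrative names and a tiny flush threshold (RocksDB's default is 64MB, not 4 entries; a real MemTable is a skip list, not a dict):

```python
class LSM:
    """Toy LSM write path: WAL append, MemTable insert, flush at a threshold."""
    MEMTABLE_LIMIT = 4  # illustrative; real engines flush at ~64MB

    def __init__(self):
        self.wal = []        # stand-in for the append-only log file
        self.memtable = {}   # real engines use a skip list or RB-tree
        self.sstables = []   # newest-first list of immutable sorted runs

    def put(self, key, value):
        self.wal.append((key, value))   # 1. sequential append for durability
        self.memtable[key] = value      # 2. in-memory insert
        if len(self.memtable) >= self.MEMTABLE_LIMIT:
            self.flush()

    def flush(self):
        # 3. persist the MemTable as an immutable sorted run (SSTable)
        self.sstables.insert(0, sorted(self.memtable.items()))
        self.memtable = {}
        self.wal = []                   # entries are now durable in the run
```

Note that the flush writes a fully sorted file in one sequential pass — the property the whole design is built around.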
An SSTable is an immutable, sorted file of key-value pairs. Once written, it is never modified in place. Updates and deletes are handled by writing new records: a delete writes a 'tombstone' marker, and an update writes a new version. Reads must merge these overlapping records to find the most recent value — the 'merge' in Log-Structured Merge-Tree.
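A minimal read path over that layout, assuming a newest-first list of sorted runs and an illustrative `TOMBSTONE` sentinel (real engines binary-search index blocks rather than whole files):

```python
import bisect

TOMBSTONE = object()  # sentinel marking a deleted key (illustrative)

def get(key, memtable, sstables):
    """Check newest data first (MemTable), then each immutable sorted run
    from newest to oldest; the first version found wins."""
    if key in memtable:
        v = memtable[key]
        return None if v is TOMBSTONE else v
    for run in sstables:                 # runs ordered newest-first
        keys = [k for k, _ in run]
        i = bisect.bisect_left(keys, key)
        if i < len(keys) and keys[i] == key:
            v = run[i][1]
            return None if v is TOMBSTONE else v
    return None                          # key was never written
```

The newest-to-oldest scan is the "merge" in action: an old value for a key is shadowed by any newer version or tombstone above it.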
Compaction: The Hidden Cost
Without compaction, the number of SSTable files grows without bound, degrading read performance. Compaction is a background process that merges multiple SSTables into fewer, larger files, discarding obsolete versions of keys and tombstones. This is where LSM trees hide their costs.
- Write amplification: Each byte written by the application may be re-written to disk 10-30x through multiple compaction rounds. For RocksDB with leveled compaction on a 1TB dataset, sustained write throughput of 100MB/s application writes may generate 1-3 GB/s of actual disk writes
- Read amplification: A point lookup must check the MemTable, then an SSTable at each level, with bloom filters skipping most of the disk reads. With 7 levels, the worst case — every filter reporting a possible hit — touches 7 disk files, though at a 1% false positive rate per filter the expected number is far lower
- Space amplification: Before compaction, multiple versions of the same key exist. Space amplification can be 1.5-2x — a 1TB dataset occupies 1.5-2TB on disk at peak
- Compaction strategies: Leveled compaction (RocksDB default) minimizes read amplification at the cost of higher write amplification. Tiered/size-tiered compaction (Cassandra default) minimizes write amplification at the cost of read amplification and space usage
The specific compaction strategy choice is one of the most consequential configuration decisions for an LSM-backed database. Workloads with high write-to-read ratios and large datasets should prefer tiered compaction. Workloads with mixed read-write patterns should prefer leveled compaction despite the higher write amplification.
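What one compaction round does — merge overlapping runs, keep only the newest version of each key, and drop tombstones when writing the bottom level — can be sketched as follows (the newest-first run ordering and None-as-tombstone convention are assumptions of this sketch):

```python
def compact(runs, drop_tombstones=True):
    """Merge sorted runs (ordered newest-first) into one sorted run,
    keeping only the newest version of each key. Tombstones (None) are
    dropped only when compacting into the bottom level, since deeper
    runs might still hold an older version of the key."""
    merged = {}
    for run in reversed(runs):           # apply oldest run first ...
        for key, value in run:
            merged[key] = value          # ... so newer versions overwrite
    return sorted((k, v) for k, v in merged.items()
                  if not (drop_tombstones and v is None))
```

The write amplification figures above come from repeating exactly this rewrite: every surviving key-value pair is read and written again each time its run participates in a merge.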
Bloom Filters: Making LSM Reads Tractable
Without bloom filters, a point lookup in an LSM tree with L levels may have to read an SSTable at each of the L levels. Bloom filters are probabilistic data structures that answer 'is key K definitely NOT in this SSTable?' with zero false negatives and a configurable false positive rate. With a bloom filter per SSTable and a 1% FPR, roughly 99% of irrelevant SSTable files can be skipped without any disk I/O.
A bloom filter for 1 million keys with 1% FPR requires approximately 9.6 bits per key = 1.2MB — tiny relative to the SSTable size. RocksDB keeps bloom filters in block cache in memory, meaning point lookups after the first access cost only a CPU memory access per level, not a disk seek.
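The sizing numbers above fall out of the standard bloom filter formulas; a quick check, not tied to any particular engine's API:

```python
import math

def bloom_sizing(n_keys, fpr):
    """Standard bloom filter sizing: total bits m = -n*ln(p)/(ln 2)^2,
    optimal hash-function count k = (m/n)*ln 2."""
    bits_per_key = -math.log(fpr) / (math.log(2) ** 2)
    total_mb = n_keys * bits_per_key / 8 / 1e6   # bits -> megabytes
    k_hashes = round(bits_per_key * math.log(2))
    return bits_per_key, total_mb, k_hashes

bpk, mb, k = bloom_sizing(1_000_000, 0.01)
# ~9.6 bits/key, ~1.2 MB for 1M keys, 7 hash functions
```

Halving the false positive rate costs only ~1.44 extra bits per key, which is why engines routinely run filters at 1% or below.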
B-Tree vs. LSM: When to Choose Which
B-trees (used by PostgreSQL, MySQL InnoDB, SQLite, Oracle) update data in place. A write finds the page containing the key and modifies it directly. This gives much lower write amplification for point updates (typically 1-2x) and excellent point-read performance, since a lookup touches O(log n) pages in a single structure with no multi-file merge.
- Choose B-trees for: OLTP workloads with mixed reads and writes, update-heavy workloads, workloads requiring strong point-query performance, databases where write amplification threatens SSD endurance
- Choose LSM trees for: Write-heavy workloads (>70% writes), append-only or time-series data, workloads where p99 write latency consistency matters more than read performance, distributed systems where anti-entropy and compaction can run asynchronously
- The crossover: At sustained write rates above ~50MB/s per node with LSM, compaction I/O begins competing with application I/O. At this point, storage hardware (NVMe vs. SATA) and compaction strategy tuning become critical
MVCC: Concurrency Without Locking
Multi-Version Concurrency Control (MVCC) is the concurrency control mechanism that enables high-throughput concurrent reads without blocking writes. Instead of using read locks, MVCC maintains multiple versions of each row. Readers see a consistent snapshot of the database as it existed at transaction start, regardless of concurrent modifications.
How PostgreSQL MVCC Works
Every row in PostgreSQL has two hidden system columns: xmin (the transaction ID that inserted this version) and xmax (the transaction ID that deleted or updated this version, or 0 if current). A row version is visible to a transaction T if xmin committed before T's snapshot was taken, and xmax is either 0, aborted, or belongs to a transaction that had not yet committed when T's snapshot was taken.
When a row is updated, PostgreSQL does not modify it in place. It writes a new row version with the updated values and marks the old version's xmax with the updating transaction's ID. The old version remains in the table until VACUUM removes it. This is why PostgreSQL tables grow after heavy UPDATE workloads and require regular VACUUM to reclaim space — a direct consequence of the MVCC model.
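A toy version of that update path, with RowVersion standing in for a heap tuple and its hidden columns (illustrative names and structure, not PostgreSQL internals — real tuples also carry hint bits, command IDs, and a ctid chain):

```python
from dataclasses import dataclass

@dataclass
class RowVersion:
    xmin: int    # XID that created this version
    xmax: int    # XID that deleted/superseded it; 0 = current
    data: dict

def mvcc_update(heap, row_idx, new_data, xid):
    """UPDATE without in-place modification: stamp the old version's xmax
    and append a new version. `heap` is an illustrative list of versions."""
    old = heap[row_idx]
    old.xmax = xid                                   # old version superseded
    heap.append(RowVersion(xmin=xid, xmax=0, data=new_data))
    return len(heap) - 1                             # old version lingers until VACUUM
```

After the update, both versions coexist on disk; which one a given transaction sees depends entirely on its snapshot.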
Transaction Isolation Levels in MVCC
- Read Committed (PostgreSQL default): Each statement within a transaction sees the most recently committed data at the time the statement starts. A transaction may see different data in consecutive SELECT statements if another transaction commits in between
- Repeatable Read: The transaction sees a consistent snapshot taken at its start. No phantom reads for rows matching a query, though range-based phantoms are still possible in some MVCC implementations
- Serializable (PostgreSQL SSI): Serializable Snapshot Isolation, introduced in PostgreSQL 9.1. Uses predicate locking to detect and abort transactions that would violate serializability. Provides the gold standard of isolation with overhead of ~30-50% vs. Repeatable Read in write-heavy workloads
- The snapshot: Each transaction gets a snapshot of the transaction ID space: the set of XIDs that were in-progress at snapshot time are invisible; all completed XIDs before snapshot time are visible. This is implemented as a sorted array of in-progress XIDs, typically fitting in a few hundred bytes
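The snapshot-based visibility rule can be sketched as a small function — deliberately simplified, omitting subtransactions, hint bits, and a transaction's view of its own uncommitted changes:

```python
from collections import namedtuple

Row = namedtuple("Row", ["xmin", "xmax"])  # stand-in for a heap tuple header

def visible(row, snapshot_next_xid, in_progress, committed):
    """Can this snapshot see this row version? `snapshot_next_xid` is the
    next unassigned XID at snapshot time; `in_progress` is the set of XIDs
    running then; `committed` is the set of XIDs known committed."""
    def snapshot_sees(xid):
        return (xid < snapshot_next_xid
                and xid not in in_progress
                and xid in committed)
    if not snapshot_sees(row.xmin):
        return False                    # creating transaction not visible
    if row.xmax != 0 and snapshot_sees(row.xmax):
        return False                    # deleted by a visible transaction
    return True
```

Note the asymmetry: a row deleted by a still-in-progress transaction remains visible, which is exactly why long-running transactions pin old versions.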
VACUUM: The MVCC Tax
Dead row versions (old MVCC versions no longer visible to any transaction) accumulate in PostgreSQL tables. VACUUM reclaims this space by scanning tables and marking dead versions as reclaimable. Autovacuum runs this automatically, but its interaction with long-running transactions is a major production concern.
A transaction that has been open for hours holds a snapshot from when it started. VACUUM cannot remove any row version that might be visible to this snapshot. A single long-running analytics query against an OLTP PostgreSQL database can cause table bloat at rates of gigabytes per hour on write-heavy tables. This is the primary reason 'don't run long transactions on OLTP databases' is a hard engineering rule, not a preference.
Transaction ID Wraparound: The PostgreSQL Apocalypse
PostgreSQL uses 32-bit transaction IDs. At ~2 billion transactions, the XID space wraps around. Unless old rows are frozen via VACUUM (FREEZE), pre-wraparound data appears to be 'in the future' and becomes invisible — and modern PostgreSQL refuses new write transactions before letting that happen, which is an outage either way. This is the xid wraparound problem — it has taken down production databases that ran for years without proper vacuuming. Monitor wraparound distance with age(datfrozenxid) per database and age(relfrozenxid) per table, and watch pg_stat_user_tables.n_dead_tup and pg_stat_user_tables.last_autovacuum for bloat.
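The wraparound hazard comes from how XIDs are compared modulo 2^32. PostgreSQL's TransactionIdPrecedes check is, roughly, a signed 32-bit difference — sketched here in Python:

```python
def xid_precedes(a, b):
    """Is XID a 'in the past' relative to XID b, modulo 2^32? True when
    the signed 32-bit difference (a - b) is negative, so each XID can see
    at most ~2 billion XIDs behind it."""
    diff = (a - b) & 0xFFFFFFFF
    return diff >= 0x80000000   # top bit set = negative as signed int32

# As the counter wraps, genuinely old XIDs re-enter the comparison window
# and would look like the future; VACUUM FREEZE marks old rows frozen
# (visible to everyone) before that can happen.
```

The third case in the test below shows the wrap: an XID just below 2^32 correctly compares as preceding a small post-wrap XID — which is exactly why unfrozen rows from ~2 billion transactions ago become "future" rows.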
Vectorized Query Execution
The Volcano/Iterator execution model, dominant in databases since the 1990s, processes one row at a time. Each operator (scan, filter, join, aggregate) calls next() on its child operator to get one row, processes it, and passes it up. This model is elegant and composable but has a fundamental problem in modern hardware: it's terrible for CPUs.
Modern CPUs achieve peak performance through vectorized SIMD instructions (processing 4/8/16 values simultaneously), deep pipelines (executing many instructions in parallel via out-of-order execution), and large branch predictors. The row-at-a-time model defeats all of these: function call overhead per row dominates, branch prediction fails because control flow varies per row, and SIMD is impossible with single-element operations.
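A tiny Volcano-style pipeline makes the per-row cost concrete: every single row is pulled through a chain of next() calls (operator names here are illustrative; in C++ each call is also a virtual dispatch):

```python
class Scan:
    """Leaf operator: yields one row per next() call."""
    def __init__(self, rows):
        self.it = iter(rows)
    def next(self):
        return next(self.it, None)   # None signals end-of-stream

class Filter:
    """Pulls rows from its child until one passes the predicate."""
    def __init__(self, child, predicate):
        self.child, self.predicate = child, predicate
    def next(self):
        while (row := self.child.next()) is not None:
            if self.predicate(row):
                return row
        return None

class Sum:
    """Root operator: drains the pipeline one row at a time."""
    def __init__(self, child):
        self.child = child
    def run(self):
        total = 0
        while (row := self.child.next()) is not None:  # one call chain per row
            total += row
        return total
```

For 100 million rows, that inner while loop means 100 million traversals of the whole operator chain — the overhead vectorization exists to amortize.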
Vector-at-a-Time Execution
Vectorized execution, introduced by MonetDB/X100 (Boncz et al., 2005) and now used by DuckDB, ClickHouse, Velox, and Apache Arrow's DataFusion, processes batches of 1024-8192 rows at a time. Each operator receives a 'vector' (a column-oriented array of values) and produces another vector. The tight inner loop over a vector of the same type enables SIMD, reduces per-row function call overhead to near-zero, and allows the compiler to generate loop-optimized code.
- Flat column arrays: Values for a single column are stored in contiguous memory arrays, enabling cache-friendly sequential access and SIMD operations
- Selection vectors: Instead of copying filtered rows to a new array, a selection vector (array of indices) marks which rows in the current batch pass the filter. Subsequent operators only process selected rows
- Validity bitmaps: NULL handling uses a separate bitmap array rather than a per-value nullable flag, enabling NULL-ignorant fast paths for non-null columns
- Cache-conscious vector size: The vector size is chosen so a batch of column vectors fits in L1/L2 cache during operator processing; DuckDB, for example, fixes it at 2048 values per vector for this reason
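Selection vectors and validity bitmaps can be sketched in a few lines (pure-Python illustration: the "bitmap" is a plain list of bools, and `array` stands in for a flat column buffer):

```python
from array import array

def filter_to_selection(col, threshold):
    """Filter operator: instead of copying survivors into a new array,
    emit a selection vector — the indices of rows that pass."""
    return array("I", (i for i, v in enumerate(col) if v > threshold))

def sum_selected(col, sel, validity=None):
    """Downstream aggregate: iterate only selected indices. A validity
    bitmap masks NULLs; when absent, take the NULL-free fast path."""
    if validity is None:                       # NULL-ignorant fast path
        return sum(col[i] for i in sel)
    return sum(col[i] for i in sel if validity[i])
```

In a real engine both loops run over contiguous typed buffers, so the compiler can unroll and SIMD-vectorize them; the structure, though, is exactly this.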
Practical Speedups from Vectorization
The speedup from vectorized execution over row-at-a-time is real and significant. A simple filter + aggregate query over 100 million rows: Volcano model takes ~5 seconds. Vectorized model with SIMD takes ~300ms. The speedup comes from three sources: reduced function call overhead (100M next() calls eliminated), better branch prediction (the filter is applied to a whole vector before branching), and SIMD parallelism (AVX-512 can evaluate 8 double comparisons simultaneously).
Push vs. Pull Execution Models
Volcano uses a pull model: operators pull data from their children. Modern vectorized engines increasingly use a push model: operators push data to their parents (or directly to the consumer). The push model enables better code generation — the compiler sees the entire pipeline and can eliminate virtual function calls, inline operators, and optimize the full pipeline as a single unit.
HyPer (now Tableau's Hyper engine) pairs push-based execution with whole-query code generation: the compiler applies constant folding, dead code elimination, and register allocation across the entire query pipeline, and the resulting compiled query code can be an order of magnitude faster than tuple-at-a-time interpretation for CPU-bound queries. DuckDB and Velox are also push-based but interpret vectorized operators rather than compiling them, trading some peak speed for zero compilation latency and simpler portability.
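A push-based pipeline in miniature: operators are consume callbacks that push batches to their downstream, so the engine runs one loop per batch rather than one next() call per row (a sketch, not any engine's API — compiling engines would fuse these closures into one generated loop):

```python
def aggregate_sum(out):
    """Terminal operator: consumes pushed batches, emits its result on finish."""
    total = 0.0
    def consume(batch):
        nonlocal total
        total += sum(batch)
    def finish():
        out.append(total)
    return consume, finish

def filter_op(predicate, downstream):
    """Mid-pipeline operator: filters a batch and pushes survivors onward."""
    def consume(batch):
        downstream([v for v in batch if predicate(v)])
    return consume
```

Because the source drives the loop and each operator directly calls the next, there is no per-row control-flow handoff — the property that makes whole-pipeline inlining and code generation natural in this model.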
Bringing It Together: Engine Selection for Real Workloads
- High-throughput OLTP with point reads/writes: PostgreSQL or MySQL InnoDB (B-tree + MVCC). Consider CockroachDB or YugabyteDB for distributed OLTP at the cost of higher latency
- Write-heavy time-series or KV workloads: RocksDB, ScyllaDB, or Cassandra (LSM). Tune compaction strategy to workload write-to-read ratio
- OLAP / analytics over large datasets: DuckDB for single-node columnar analytics; ClickHouse or Apache Doris for distributed analytics. Both use vectorized execution over columnar storage
- Hybrid HTAP: TiDB (TiKV for OLTP via RocksDB, TiFlash for OLAP via columnar vectorized engine). The two engines share data through Raft replication with zero-ETL integration
- Embedded analytics (application-side): DuckDB is the correct choice. It reads Parquet/Arrow natively, uses vectorized execution, and embeds in-process with no server overhead
The engineers who designed today's fastest databases didn't just choose better algorithms — they reasoned about hardware: cache hierarchies, SIMD width, branch prediction, NVMe latency characteristics. Modern database performance is a hardware-software co-design problem, and understanding the internals is how you know when to optimize and when to switch engines entirely.
Optimize Your Data Infrastructure with Accelar
Accelar designs and tunes data infrastructure for demanding workloads — from RocksDB compaction tuning and PostgreSQL VACUUM management to columnar analytics pipelines and HTAP architectures. If your database is a performance bottleneck, we have the internals expertise to fix it. Let's audit your stack.
