Database Engineering · 15 min read

LSM Trees, MVCC, and Vectorized Execution: The Internals That Determine Your Database Performance

The difference between a query taking 10ms and 10 seconds often comes down to three subsystems: the storage engine, the concurrency control mechanism, and the query execution engine. Here's how the state of the art actually works.

Database performance is one of those topics where intuitions built from high-level abstractions consistently lead engineers astray. 'Why is this query slow?' often has an answer that traces all the way down to write amplification in an LSM tree, a row lock escalation in MVCC, or a branch misprediction in the query executor. The engineers who can reason at this level aren't just faster debuggers — they make fundamentally different architectural decisions.

This post covers three of the most important internal subsystems in modern databases: LSM trees (the storage engine used by RocksDB, Cassandra, LevelDB, ScyllaDB, and TiKV), MVCC (the concurrency control mechanism in PostgreSQL, MySQL InnoDB, CockroachDB, and every serious OLTP database), and vectorized query execution (the engine design used by DuckDB, ClickHouse, DataFusion, and Velox). Each of these is a rabbit hole deep enough for a book. This post gives you the engineering intuitions you need to make informed decisions.

LSM Trees: Write-Optimized Storage

The Log-Structured Merge-Tree (LSM Tree), introduced by O'Neil et al. in 1996, is the storage engine behind the write-heavy world: time-series databases, event stores, distributed KV stores, and analytics ingestion pipelines. The core insight is that sequential writes to disk are dramatically faster than random writes, and that this difference (often 100x on HDDs, 10x on SSDs) is worth paying with increased read complexity.

The LSM Write Path

Every write in an LSM tree goes through three stages. First, it's appended to a write-ahead log (WAL) for durability — this is sequential I/O, fast on any storage medium. Then it's inserted into an in-memory buffer called the MemTable, typically implemented as a skip list or red-black tree for O(log n) point lookups. When the MemTable reaches a size threshold (typically 64MB in RocksDB), it's flushed to disk as an SSTable (Sorted String Table).
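The three stages can be sketched in a few lines of Python. This is an illustrative toy, not RocksDB's actual API: the WAL is a list standing in for an append-only file, the MemTable is a plain dict rather than a skip list, and the flush threshold is an entry count rather than 64MB.

```python
import json

class LSMTree:
    """Toy LSM write path: WAL append, MemTable insert, flush to SSTable."""

    def __init__(self, memtable_limit=4):
        self.wal = []             # stand-in for the on-disk write-ahead log
        self.memtable = {}        # real engines use a skip list for O(log n)
        self.sstables = []        # newest last; each is an immutable sorted list
        self.memtable_limit = memtable_limit

    def put(self, key, value):
        self.wal.append(json.dumps({"k": key, "v": value}))  # 1. durability (sequential I/O)
        self.memtable[key] = value                           # 2. in-memory insert
        if len(self.memtable) >= self.memtable_limit:        # 3. flush on threshold
            self.flush()

    def flush(self):
        sstable = sorted(self.memtable.items())  # SSTable: sorted and immutable
        self.sstables.append(sstable)
        self.memtable = {}
        self.wal = []  # flushed data is durable in the SSTable; WAL can be truncated

tree = LSMTree()
for i in range(5):
    tree.put(f"key{i}", i)
```

After five puts with a threshold of four, the first four keys have been flushed to one SSTable and the fifth sits in the fresh MemTable.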

An SSTable is an immutable, sorted file of key-value pairs. Once written, it is never modified in place. Updates and deletes are handled by writing new records: a delete writes a 'tombstone' marker, and an update writes a new version. Reads must merge these overlapping records to find the most recent value — the 'merge' in Log-Structured Merge-Tree.
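A point lookup therefore checks components from newest to oldest and stops at the first version it finds. The sketch below is illustrative (real engines binary-search file blocks rather than probing dicts), but it shows how a tombstone in a newer component hides older versions:

```python
# Toy LSM point lookup: MemTable first, then SSTables newest-to-oldest.
TOMBSTONE = object()

def lsm_get(key, memtable, sstables):
    if key in memtable:
        v = memtable[key]
        return None if v is TOMBSTONE else v
    for sstable in reversed(sstables):      # newest SSTable first
        if key in sstable:
            v = sstable[key]
            return None if v is TOMBSTONE else v
    return None                             # key never written

# An update and a delete are both just new records in a newer component:
older = {"a": 1, "b": 2}
newer = {"a": 10, "b": TOMBSTONE}           # "a" updated, "b" deleted
memtable = {}
```

The read returns 10 for "a" (the newer version wins) and nothing for "b" (the tombstone shadows the older value).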

Compaction: The Hidden Cost

Without compaction, the number of SSTable files grows without bound, degrading read performance. Compaction is a background process that merges multiple SSTables into fewer, larger files, discarding obsolete versions of keys and tombstones. This is where LSM trees hide their costs.

The choice of compaction strategy is one of the most consequential configuration decisions for an LSM-backed database. Tiered (size-tiered) compaction merges SSTables of similar size, minimizing write amplification at the cost of more overlapping files to check on reads. Leveled compaction keeps SSTables non-overlapping within each level, paying higher write amplification for lower read and space amplification. Workloads with high write-to-read ratios and large datasets should prefer tiered compaction. Workloads with mixed read-write patterns should prefer leveled compaction despite the higher write amplification.
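The core of any compaction job is the same: merge several sorted runs into one, keep only the newest version of each key, and drop tombstones when it is safe to do so. A minimal sketch (illustrative names; it merges all runs at once, which is what makes dropping tombstones safe here):

```python
TOMBSTONE = "<deleted>"

def compact(sstables):
    """Merge sorted runs into one, keeping only the newest version of each
    key. `sstables` is ordered oldest to newest; each run is a sorted list
    of (key, value) pairs. Tombstones are dropped because we merge ALL runs;
    a partial compaction must keep them to shadow older levels."""
    merged = {}
    for run in sstables:                  # oldest first, so newer overwrites older
        for key, value in run:
            merged[key] = value
    return sorted((k, v) for k, v in merged.items() if v != TOMBSTONE)

old = [("a", 1), ("b", 2), ("c", 3)]
new = [("a", 10), ("b", TOMBSTONE)]
```

Compacting these two runs yields a single run containing the updated "a" and the untouched "c"; the deleted "b" and the obsolete version of "a" are gone, which is exactly the space reclamation compaction exists for.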

Bloom Filters: Making LSM Reads Tractable

Without bloom filters, a point lookup in an LSM tree with L levels requires reading at least L SSTable files. Bloom filters are probabilistic data structures that answer 'is key K definitely NOT in this SSTable?' with zero false negatives and configurable false positive rates. With a bloom filter per SSTable and a 1% FPR, 99% of SSTable files can be skipped without disk I/O.

A bloom filter for 1 million keys with 1% FPR requires approximately 9.6 bits per key = 1.2MB — tiny relative to the SSTable size. RocksDB keeps bloom filters in block cache in memory, meaning point lookups after the first access cost only a CPU memory access per level, not a disk seek.
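The sizing above follows the standard formulas: for n keys and target false positive rate p, the bit count is m = -n·ln(p)/(ln 2)² and the hash count is k = (m/n)·ln 2. A self-contained sketch (hashing via salted BLAKE2 is an illustrative choice, not what RocksDB does):

```python
import math, hashlib

class BloomFilter:
    def __init__(self, n, fpr):
        self.m = math.ceil(-n * math.log(fpr) / math.log(2) ** 2)  # total bits
        self.k = max(1, round(self.m / n * math.log(2)))           # hash functions
        self.bits = bytearray((self.m + 7) // 8)

    def _positions(self, key):
        for i in range(self.k):  # k independent hashes via per-hash salts
            h = hashlib.blake2b(key.encode(), salt=i.to_bytes(8, "little"))
            yield int.from_bytes(h.digest()[:8], "little") % self.m

    def add(self, key):
        for p in self._positions(key):
            self.bits[p // 8] |= 1 << (p % 8)

    def might_contain(self, key):
        # False means definitely absent; True means "probably present".
        return all(self.bits[p // 8] & (1 << (p % 8)) for p in self._positions(key))

bf = BloomFilter(n=1_000_000, fpr=0.01)
bf.add("user:42")
print(bf.m / 1_000_000)  # ~9.59 bits per key, matching the figure above
```

For 1% FPR the formula gives ~9.59 bits/key and k = 7 hash functions, consistent with the 1.2MB figure quoted above.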

B-Tree vs. LSM: When to Choose Which

B-trees (used by PostgreSQL, MySQL InnoDB, SQLite, Oracle) update data in place. A write finds the page containing the key and modifies it directly. This gives much lower write amplification for point updates (typically 1-2x) and excellent read performance since data layout reflects access patterns. The trade-off is the mirror image of the LSM tree's: B-trees pay random I/O on every write and suffer under sustained insert-heavy loads, while LSM trees convert random writes into sequential ones at the cost of read amplification and background compaction. Choose B-trees for read-heavy and mixed OLTP workloads; choose LSM trees when sustained write throughput dominates.

MVCC: Concurrency Without Locking

Multi-Version Concurrency Control (MVCC) is the concurrency control mechanism that enables high-throughput concurrent reads without blocking writes. Instead of using read locks, MVCC maintains multiple versions of each row. Readers see a consistent snapshot of the database as it existed at transaction start, regardless of concurrent modifications.

How PostgreSQL MVCC Works

Every row in PostgreSQL has two hidden system columns: xmin (the transaction ID that inserted this version) and xmax (the transaction ID that deleted or updated this version, or 0 if current). A row version is visible to a transaction T if xmin committed before T started and xmax is either 0 or started after T (or was aborted).

When a row is updated, PostgreSQL does not modify it in place. It writes a new row version with the updated values and marks the old version's xmax with the updating transaction's ID. The old version remains in the table until VACUUM removes it. This is why PostgreSQL tables grow after heavy UPDATE workloads and require regular VACUUM to reclaim space — a direct consequence of the MVCC model.
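The visibility rule and the update-as-new-version behavior can be sketched together. This is a simplification: a snapshot here is just the set of transactions that had committed when the reader started, and xmax handling ignores subtleties like in-progress and aborted transactions.

```python
def is_visible(row, committed_at_snapshot):
    """row = (xmin, xmax); xmax == 0 means this version was never
    deleted or superseded. A version is visible if its inserter had
    committed at snapshot time and its deleter (if any) had not."""
    xmin, xmax = row
    if xmin not in committed_at_snapshot:
        return False                          # inserter not yet committed
    if xmax == 0:
        return True                           # current version
    return xmax not in committed_at_snapshot  # deleter committed after our snapshot

# Transaction 100 inserted the row; transaction 200 later UPDATEd it,
# producing a new version and stamping the old version's xmax:
old_version = (100, 200)
new_version = (200, 0)

# A reader whose snapshot predates txn 200's commit sees the OLD version:
early_snapshot = {100}
# A reader who started after txn 200 committed sees the NEW version:
late_snapshot = {100, 200}
```

Both versions coexist in the table until VACUUM removes the old one, which is exactly why UPDATE-heavy tables bloat.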

Transaction Isolation Levels in MVCC

In an MVCC system, isolation level is largely a question of how long a snapshot lives. In PostgreSQL, Read Committed takes a fresh snapshot at the start of every statement, so each statement sees the latest committed data. Repeatable Read takes one snapshot at the transaction's first statement and holds it for the entire transaction. Serializable uses the same per-transaction snapshot but adds conflict tracking (Serializable Snapshot Isolation) to detect and abort transactions whose interleaving could produce anomalies that snapshots alone would permit.

VACUUM: The MVCC Tax

Dead row versions (old MVCC versions no longer visible to any transaction) accumulate in PostgreSQL tables. VACUUM reclaims this space by scanning tables and marking dead versions as reclaimable. Autovacuum runs this automatically, but its interaction with long-running transactions is a major production concern.

A transaction that has been open for hours holds a snapshot from when it started. VACUUM cannot remove any row version that might be visible to this snapshot. A single long-running analytics query against an OLTP PostgreSQL database can cause table bloat at rates of gigabytes per hour on write-heavy tables. This is the primary reason 'don't run long transactions on OLTP databases' is a hard engineering rule, not a preference.

Transaction ID Wraparound: The PostgreSQL Apocalypse

PostgreSQL uses 32-bit transaction IDs. At ~2 billion transactions, the XID space wraps around. Unless old rows are frozen in time (VACUUM with freezing, or an explicit VACUUM FREEZE), data written by sufficiently old transactions appears to be 'in the future' and becomes invisible. This is the xid wraparound problem, and it has taken down production databases that ran for years without proper vacuuming. Monitor age(datfrozenxid) for every database, along with pg_stat_user_tables.n_dead_tup and pg_stat_user_tables.last_autovacuum, diligently.
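The wraparound failure mode follows directly from how PostgreSQL compares XIDs: modulo 2³², with "A precedes B" defined by whether the signed 32-bit difference A − B is negative. A sketch of that comparison (modeled on PostgreSQL's TransactionIdPrecedes, simplified):

```python
def xid_precedes(a, b):
    """True if xid a is 'older than' xid b under modulo-2^32 arithmetic."""
    diff = (a - b) & 0xFFFFFFFF
    if diff >= 0x80000000:
        diff -= 0x100000000    # reinterpret as signed 32-bit
    return diff < 0

# Normal case: 100 is older than 200.
# Wraparound case: once the current xid is more than 2^31 ahead, an
# unfrozen old xid no longer compares as "in the past" -- it flips to
# looking like the future, and its rows vanish from view.
old_xid = 100
current_xid = 100 + 2**31 + 1
```

Here xid 100 compares as newer than the current transaction, which is precisely why its rows become invisible; freezing rewrites such tuples so they count as "older than everything."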

Vectorized Query Execution

The Volcano/Iterator execution model, dominant in databases since the 1990s, processes one row at a time. Each operator (scan, filter, join, aggregate) calls next() on its child operator to get one row, processes it, and passes it up. This model is elegant and composable but has a fundamental problem in modern hardware: it's terrible for CPUs.

Modern CPUs achieve peak performance through vectorized SIMD instructions (processing 4/8/16 values simultaneously), deep pipelines (executing many instructions in parallel via out-of-order execution), and large branch predictors. The row-at-a-time model defeats all of these: function call overhead per row dominates, branch prediction fails because control flow varies per row, and SIMD is impossible with single-element operations.

Vector-at-a-Time Execution

Vectorized execution, introduced by MonetDB/X100 (Boncz et al., 2005) and now used by DuckDB, ClickHouse, Velox, and Apache Arrow's DataFusion, processes batches of 1024-8192 rows at a time. Each operator receives a 'vector' (a column-oriented array of values) and produces another vector. The tight inner loop over a vector of the same type enables SIMD, reduces per-row function call overhead to near-zero, and allows the compiler to generate loop-optimized code.
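The structural difference between the two models is easy to show even in pure Python, where the win comes from amortizing per-row call overhead over a batch (real engines layer SIMD on top of the same loop shape). Both functions below compute a filter + sum over the same data; the names are illustrative:

```python
VECTOR_SIZE = 1024  # batch size; engines typically use 1024-8192

def volcano_filter_sum(rows, predicate):
    """Row-at-a-time: conceptually one next() call per row."""
    total = 0
    for row in rows:                     # per-row dispatch overhead
        if predicate(row):               # per-row branch
            total += row
    return total

def vectorized_filter_sum(rows, predicate):
    """Vector-at-a-time: one operator invocation per batch of rows."""
    total = 0
    for i in range(0, len(rows), VECTOR_SIZE):
        vector = rows[i:i + VECTOR_SIZE]                 # one call per batch
        selected = [v for v in vector if predicate(v)]   # tight same-type loop
        total += sum(selected)
    return total

data = list(range(10_000))
pred = lambda v: v % 2 == 0
```

Both return the same result; the vectorized version makes ~10 operator invocations instead of 10,000, and its tight inner loop is the shape a compiler can auto-vectorize.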

Practical Speedups from Vectorization

The speedup from vectorized execution over row-at-a-time is real and significant. As a representative example, consider a simple filter + aggregate query over 100 million rows: a Volcano-model executor takes ~5 seconds, while a vectorized executor with SIMD takes ~300ms. The speedup comes from three sources: reduced function call overhead (100M next() calls eliminated), better branch prediction (the filter is applied to a whole vector before branching), and SIMD parallelism (AVX-512 can evaluate 8 double comparisons simultaneously).

Push vs. Pull Execution Models

Volcano uses a pull model: operators pull data from their children. Modern vectorized engines increasingly use a push model: operators push data to their parents (or directly to the consumer). The push model enables better code generation — the compiler sees the entire pipeline and can eliminate virtual function calls, inline operators, and optimize the full pipeline as a single unit.
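In a push model, each operator is just a function that hands its output batch to its consumer, so the whole pipeline becomes one call chain a compiler can inline. A minimal sketch (factory functions and names are illustrative, not any engine's API):

```python
def make_filter(predicate, downstream):
    """Filter operator: pushes surviving rows to the next operator."""
    def push(batch):
        downstream([v for v in batch if predicate(v)])
    return push

def make_sum_sink(result):
    """Terminal operator: accumulates into a mutable result cell."""
    def push(batch):
        result[0] += sum(batch)
    return push

# Wire the pipeline back-to-front: scan -> filter(>10) -> sum.
result = [0]
pipeline = make_filter(lambda v: v > 10, make_sum_sink(result))

# The scan drives execution by pushing batches down the pipeline:
for batch in ([5, 12, 7], [20, 1, 30]):
    pipeline(batch)
```

Note the inversion of control relative to Volcano: data flows top-down through plain function calls rather than being pulled up through virtual next() methods, which is what gives a compiler the whole pipeline to optimize.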

HyPer (now Tableau's Hyper engine) pushes this furthest with whole-query compilation: each query pipeline is compiled to machine code, letting the compiler apply constant folding, dead code elimination, and register allocation across the entire query. DuckDB and Velox take the other path, pairing push-based execution with vectorized interpretation and avoiding a JIT entirely. For CPU-bound queries, either approach can be 10-50x faster than tuple-at-a-time interpretation, and published comparisons find the two land in roughly the same performance class.

Bringing It Together: Engine Selection for Real Workloads

The engineers who designed today's fastest databases didn't just choose better algorithms — they reasoned about hardware: cache hierarchies, SIMD width, branch prediction, NVMe latency characteristics. Modern database performance is a hardware-software co-design problem, and understanding the internals is how you know when to optimize and when to switch engines entirely.

Optimize Your Data Infrastructure with Accelar

Accelar designs and tunes data infrastructure for demanding workloads — from RocksDB compaction tuning and PostgreSQL VACUUM management to columnar analytics pipelines and HTAP architectures. If your database is a performance bottleneck, we have the internals expertise to fix it. Let's audit your stack.