RocksDB Overview

RocksDB is a high-performance, embeddable key-value store designed for fast storage, large datasets, and efficient range queries. It is widely used in distributed systems for both OLTP and analytical workloads.

1. Column Family Support

  • RocksDB supports multiple Column Families (CFs) for logical separation of data.
  • Each CF has its own memtables and SST files; all CFs in a DB instance share a single WAL.
  • Recommendation: Minimize CFs where possible; each CF adds memory and compaction overhead.
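The separation described above can be illustrated with a toy model (not the RocksDB API; `ToyDB` and its members are hypothetical names): one shared write-ahead log, one memtable per column family.

```python
from collections import defaultdict

class ToyDB:
    """Toy model of column families: one shared WAL, per-CF memtables."""

    def __init__(self):
        self.wal = []                       # single write-ahead log shared by all CFs
        self.memtables = defaultdict(dict)  # one in-memory buffer per column family

    def put(self, cf, key, value):
        self.wal.append((cf, key, value))   # log first, for durability
        self.memtables[cf][key] = value     # then apply to that CF's own memtable

    def get(self, cf, key):
        return self.memtables[cf].get(key)

db = ToyDB()
db.put("default", b"k1", b"v1")
db.put("metrics", b"k1", b"v2")   # same key in a different CF: no collision
```

The per-CF memtable is also why each additional CF costs memory and compaction work, which motivates the recommendation to keep the CF count small.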

2. Handling Large Data Volumes

  • RocksDB is optimized for terabytes of data.
  • Data is organized in an LSM-tree structure:
      • Memtables buffer writes in memory.
      • SSTables store immutable, sorted files on disk.
      • Compactions merge SST files to maintain read efficiency.
  • Efficient for workloads with heavy writes and sequential or prefix-based scans.
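The memtable/SST/compaction flow above can be sketched as a toy LSM tree (a minimal model for illustration, not RocksDB internals; all names here are hypothetical):

```python
class ToyLSM:
    """Minimal LSM sketch: memtable -> immutable sorted SSTs -> merge compaction."""

    def __init__(self, memtable_limit=2):
        self.memtable = {}
        self.ssts = []              # newest first; each SST is a sorted list of (k, v)
        self.memtable_limit = memtable_limit

    def put(self, key, value):
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            self.flush()

    def flush(self):
        # memtable becomes an immutable, sorted on-disk file (here: a sorted list)
        self.ssts.insert(0, sorted(self.memtable.items()))
        self.memtable = {}

    def get(self, key):
        if key in self.memtable:
            return self.memtable[key]
        for sst in self.ssts:       # search newest to oldest; newest value wins
            for k, v in sst:
                if k == key:
                    return v
        return None

    def compact(self):
        merged = {}
        for sst in reversed(self.ssts):  # apply oldest first so newer values overwrite
            merged.update(sst)
        self.ssts = [sorted(merged.items())]

lsm = ToyLSM(memtable_limit=2)
lsm.put("a", 1)
lsm.put("b", 2)   # hits the limit: memtable flushes to an SST
lsm.put("a", 9)   # newer value shadows the one already in the SST
lsm.put("c", 3)   # second flush: two SSTs now exist
lsm.compact()     # merge into one SST; newest values win
```

Compaction here is why write amplification exists at all: the same key-value pair is rewritten each time the SSTs it lives in are merged.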

3. Snapshot and Backup Strategies

  • Snapshot-based backup
      • Lightweight, point-in-time view of the DB.
      • Used for replication and catching up follower nodes.
      • Minimal I/O overhead.
  • File-based backup (Checkpoint / BackupEngine)
      • Durable copy of SST and WAL files.
      • Useful for disaster recovery or migration.
  • Recommendation: Use snapshots for replication and file-based backups for full persistence. Minimize CFs to reduce backup complexity.
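The "lightweight point-in-time view" idea can be modeled with sequence numbers: a snapshot is just a pinned sequence number, and reads through it ignore newer versions. This is a toy MVCC sketch (hypothetical names, not the RocksDB API):

```python
class ToyVersionedKV:
    """Toy MVCC store: a snapshot pins a sequence number; reads honor it."""

    def __init__(self):
        self.seq = 0
        self.versions = {}   # key -> list of (seq, value), ascending by seq

    def put(self, key, value):
        self.seq += 1
        self.versions.setdefault(key, []).append((self.seq, value))

    def snapshot(self):
        return self.seq      # a snapshot is just the current sequence number

    def get(self, key, snapshot=None):
        upper = self.seq if snapshot is None else snapshot
        # walk versions newest-first, return the first one visible at `upper`
        for seq, value in reversed(self.versions.get(key, [])):
            if seq <= upper:
                return value
        return None

kv = ToyVersionedKV()
kv.put("k", "old")
snap = kv.snapshot()
kv.put("k", "new")
```

Because the snapshot pins a number rather than copying data, taking one is nearly free; the cost is that compaction must retain versions still visible to live snapshots.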

4. Key and Prefix Design

  • Fixed-length prefix keys improve range scan performance.
  • Design keys so frequently scanned ranges share a common prefix.
  • Example:
      • Prefix: RegionID + EntityType
      • Suffix: Timestamp or unique ID
  • Avoid variable-length prefixes in hot paths, as they reduce prefix indexing efficiency.
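The RegionID + EntityType + Timestamp layout above can be sketched with fixed-width, big-endian encoding, so that byte order matches numeric order and a prefix scan is a contiguous range. The function names and field widths are illustrative choices, not part of any RocksDB API:

```python
import struct
from bisect import bisect_left

def make_key(region_id, entity_type, timestamp):
    """Fixed-length prefix (region + entity type), ordered suffix (timestamp)."""
    # big-endian ensures lexicographic byte order matches numeric order
    return struct.pack(">HHQ", region_id, entity_type, timestamp)

def make_prefix(region_id, entity_type):
    return struct.pack(">HH", region_id, entity_type)

def prefix_scan(sorted_keys, pfx):
    """Return all keys sharing the given fixed-length prefix."""
    start = bisect_left(sorted_keys, pfx)   # first key >= the prefix bytes
    out = []
    for k in sorted_keys[start:]:
        if not k.startswith(pfx):
            break                            # left the prefix range: stop early
        out.append(k)
    return out

keys = sorted([
    make_key(1, 7, 1000), make_key(1, 7, 2000),
    make_key(1, 8, 1500), make_key(2, 7, 1000),
])
hits = prefix_scan(keys, make_prefix(1, 7))  # only the (1, 7, *) keys
```

With variable-length prefixes this early-exit property is lost, which is exactly why they hurt on hot scan paths.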

5. Typical Performance Metrics (SSD-based)

These numbers are empirical references from medium-to-large deployments, with RocksDB tuned for batch writes and prefix scans:

| Metric | Typical Value (SSD/NVMe) | Notes |
| --- | --- | --- |
| Write throughput (random writes) | 50k–200k ops/sec per DB instance (16–32 MB memtable) | Depends on write batch size |
| Write amplification | 2–5x | With tuned compaction and CF count ~1–2 |
| Read throughput (point lookup) | 100k–500k ops/sec | Using 8–16 GB block cache |
| Read throughput (prefix scan) | 200–800 MB/sec | With fixed prefix keys |
| Compaction I/O | 100–400 MB/sec | Tuned via level0_file_num_compaction_trigger and max_bytes_for_level_base |
| Latency (write) | ~0.5–2 ms | Depends on WAL sync policy |
| Latency (read, cached) | <0.1 ms | Served from block cache |
| SST file size | 64 MB (default) | Tunable via target_file_size_base |
| Max DB size | ~10–100 TB+ | Depends on hardware and LSM tuning |

Notes on tuning:

  • SSD/NVMe storage is essential for predictable write performance.
  • Prefix-based fixed keys allow memtable prefix bloom and reduce disk seeks.
  • Backup and snapshot strategy affect I/O; prefer incremental snapshots for high-frequency backups.
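As a reference point, the tuning knobs named in the table and notes above can be collected into one place. The option names are real RocksDB parameters; the values are illustrative starting points for the workload described here, not universal recommendations:

```python
# Illustrative tuning sketch: real RocksDB option names, assumed example values.
tuning = {
    "write_buffer_size": 32 * 1024 * 1024,           # 16-32 MB memtable, per the table
    "target_file_size_base": 64 * 1024 * 1024,       # 64 MB SST files (RocksDB default)
    "max_bytes_for_level_base": 256 * 1024 * 1024,   # L1 size budget before deeper levels grow
    "level0_file_num_compaction_trigger": 4,         # L0 file count that starts compaction
    "memtable_prefix_bloom_size_ratio": 0.1,         # enables memtable prefix bloom
                                                     # (requires a prefix extractor)
}
```

How these map onto an `Options` object depends on the language binding in use; consult the binding's documentation for the exact setter names.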