RocksDB Overview
RocksDB is a high-performance, embeddable key-value store designed for fast storage devices, large datasets, and efficient range queries. It is widely used in distributed systems for both OLTP and analytical workloads.
1. Column Family Support
- RocksDB supports multiple Column Families (CFs) for logical separation of data.
- Each CF has its own memtable and SST files; all CFs in a DB share a single WAL.
- Recommendation: Minimize CFs where possible; each CF adds memory and compaction overhead.
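The per-CF layout above can be sketched in pure Python as a toy model (no RocksDB dependency; class and method names here are illustrative, not RocksDB's API):

```python
# Toy model of column families: each CF gets its own memtable,
# but all CFs append to one shared write-ahead log.
class ToyDB:
    def __init__(self, column_families):
        self.wal = []                                        # single shared WAL
        self.memtables = {cf: {} for cf in column_families}  # one memtable per CF

    def put(self, cf, key, value):
        self.wal.append((cf, key, value))   # log first for durability
        self.memtables[cf][key] = value     # then apply to the CF's memtable

    def get(self, cf, key):
        return self.memtables[cf].get(key)

db = ToyDB(["default", "metadata"])
db.put("default", b"user:1", b"alice")
db.put("metadata", b"user:1", b"created=2024")
# The same key lives independently in each CF, while both
# writes land in the one shared WAL.
```

This is why extra CFs are not free: every CF adds a memtable (memory) and its own SST set (compaction work), while the WAL stays shared.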
2. Handling Large Data Volumes
- RocksDB is optimized for datasets in the terabyte range.
- Data is organized in an LSM-tree structure:
  - Memtables buffer writes in memory.
  - SSTables are immutable, sorted files on disk.
  - Compactions merge SST files to maintain read efficiency.
- Efficient for workloads with heavy writes and sequential or prefix-based scans.
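The write path above can be illustrated with a minimal LSM sketch (a toy model, not RocksDB's implementation): writes buffer in a memtable, full memtables flush to immutable sorted runs, and compaction merges runs so reads touch fewer files.

```python
# Minimal LSM-tree sketch: memtable -> flush -> sorted runs -> compaction.
class ToyLSM:
    def __init__(self, memtable_limit=2):
        self.memtable = {}
        self.ssts = []                    # list of sorted runs, newest first
        self.memtable_limit = memtable_limit

    def put(self, key, value):
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            self.flush()

    def flush(self):
        # "SSTs" are immutable and sorted by key.
        self.ssts.insert(0, sorted(self.memtable.items()))
        self.memtable = {}

    def get(self, key):
        if key in self.memtable:
            return self.memtable[key]
        for run in self.ssts:             # newest run wins
            for k, v in run:
                if k == key:
                    return v
        return None

    def compact(self):
        # Merge all runs into one; newer values shadow older ones.
        merged = {}
        for run in reversed(self.ssts):   # apply oldest first
            merged.update(dict(run))
        self.ssts = [sorted(merged.items())]

lsm = ToyLSM()
lsm.put("a", 1); lsm.put("b", 2)   # second put triggers a flush
lsm.put("a", 3); lsm.put("c", 4)   # second flush; old "a" is now shadowed
lsm.compact()                      # a single sorted run remains
```

Real compaction is incremental and level-based rather than a full merge, but the shadowing rule (newest version wins) is the same.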
3. Snapshot and Backup Strategies
- Snapshot-based backup
  - Lightweight, point-in-time view of the DB.
  - Used for replication and catching up follower nodes.
  - Minimal I/O overhead.
- File-based backup (Checkpoint / BackupEngine)
  - Durable copy of SST and WAL files.
  - Useful for disaster recovery or migration.
- Recommendation: Use snapshots for replication and file-based backup for full persistence. Minimize CFs to reduce backup complexity.
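Why snapshots are nearly free can be sketched with sequence numbers, the mechanism RocksDB snapshots are built on (a toy model; names are illustrative): a snapshot is just a remembered sequence number, and reads through it only see versions written at or before that point.

```python
# Sketch of point-in-time snapshots via sequence numbers.
class VersionedStore:
    def __init__(self):
        self.seq = 0
        self.versions = {}                # key -> [(seq, value), ...]

    def put(self, key, value):
        self.seq += 1
        self.versions.setdefault(key, []).append((self.seq, value))

    def snapshot(self):
        return self.seq                   # no data is copied

    def get(self, key, snapshot=None):
        upper = self.seq if snapshot is None else snapshot
        # Newest version visible at the snapshot wins.
        for seq, value in reversed(self.versions.get(key, [])):
            if seq <= upper:
                return value
        return None

store = VersionedStore()
store.put("k", "v1")
snap = store.snapshot()                   # lightweight point-in-time view
store.put("k", "v2")                      # invisible through `snap`
```

A file-based backup, by contrast, physically copies SST and WAL files, which is why it is durable but I/O-heavy.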
4. Key and Prefix Design
- Fixed-length prefix keys improve range scan performance.
- Design keys so frequently scanned ranges share a common prefix.
- Example:
  - Prefix: RegionID + EntityType
  - Suffix: Timestamp or unique ID
- Avoid variable-length prefixes in hot paths, as they reduce prefix indexing efficiency.
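The RegionID + EntityType + Timestamp layout above can be sketched as a fixed-width binary encoding (field widths here are illustrative assumptions): because the prefix has a fixed length, all rows for one (region, entity type) pair sort contiguously and a prefix scan is a seek plus a sequential read.

```python
import bisect

def make_key(region_id, entity_type, ts):
    # 4-byte region + 2-byte entity type (fixed-length prefix),
    # then an 8-byte timestamp suffix. Big-endian keeps byte
    # order identical to numeric order.
    return (region_id.to_bytes(4, "big")
            + entity_type.to_bytes(2, "big")
            + ts.to_bytes(8, "big"))

def prefix_scan(sorted_keys, region_id, entity_type):
    prefix = region_id.to_bytes(4, "big") + entity_type.to_bytes(2, "big")
    start = bisect.bisect_left(sorted_keys, prefix)   # seek to the range start
    out = []
    for key in sorted_keys[start:]:
        if not key.startswith(prefix):
            break                         # left the prefix range: stop scanning
        out.append(key)
    return out

keys = sorted([
    make_key(1, 7, 100), make_key(1, 7, 200),
    make_key(1, 8, 100), make_key(2, 7, 100),
])
hits = prefix_scan(keys, 1, 7)            # only the (region=1, type=7) rows
```

With variable-length prefixes the range boundaries are no longer byte-aligned, which is what defeats prefix bloom filters and makes hot-path scans slower.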
5. Typical Performance Metrics (SSD-based)
These numbers are empirical references from medium-to-large deployments, with RocksDB tuned for batch writes and prefix scans:
| Metric | Typical Value (SSD/NVMe) | Notes |
|---|---|---|
| Write throughput (random writes) | 50k–200k ops/sec per DB instance (16–32 MB memtable) | Depends on write batch size |
| Write amplification | 2–5x | With tuned compaction and CF count ~1–2 |
| Read throughput (point lookup) | 100k–500k ops/sec | Using 8–16 GB block cache |
| Read throughput (prefix scan) | 200–800 MB/sec | With fixed prefix keys |
| Compaction I/O | 100–400 MB/sec | Tuned via level0_file_num_compaction_trigger and max_bytes_for_level_base |
| Latency (write) | ~0.5–2 ms | Depends on WAL sync policy |
| Latency (read, cached) | <0.1 ms | Cached in block cache |
| SST file size | 64 MB (default) | Tunable via target_file_size_base |
| Max DB size | ~10–100 TB+ | Depends on hardware and LSM tuning |
Notes on tuning:
- SSD/NVMe storage is essential for predictable write performance.
- Prefix-based fixed keys allow memtable prefix bloom and reduce disk seeks.
- Backup and snapshot strategy affect I/O; prefer incremental snapshots for high-frequency backups.
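As a rough illustration of the 2–5x write-amplification figure in the table, write amplification is total bytes physically written divided by bytes of user data; the numbers below are hypothetical, assuming a single flush-and-compact cycle:

```python
# Hypothetical back-of-envelope estimate of write amplification.
user_bytes = 100 * 2**20            # 100 MB of application writes

wal_bytes = user_bytes              # every write also hits the WAL
flush_bytes = user_bytes            # memtable flushed to L0
compaction_bytes = 2 * user_bytes   # assumed rewrite during L0 -> L1 compaction

total_written = wal_bytes + flush_bytes + compaction_bytes
write_amp = total_written / user_bytes   # 4.0, inside the 2-5x range above
```

Deeper LSM trees add a rewrite per level crossed, which is why compaction tuning (level sizes, trigger thresholds) directly controls this figure.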