DiskANN (Disk-based Approximate Nearest Neighbor) Overview

Purpose & Use Cases

DiskANN is a CPU-only, disk-optimized vector similarity search library designed for ultra-large-scale (billion- to trillion-scale) high-dimensional vector datasets that cannot fit in memory. Its core advantage is minimal RAM usage (roughly 0.5–5% of total vector size) via disk-backed graph indexing; it targets read-heavy, static or low-churn workloads rather than real-time updates or latency-critical in-memory search. It is suitable for:

  • Approximate nearest neighbor (ANN) search over 1B+ high-dimensional vectors (100–10000 dimensions).
  • Scenarios with strict RAM constraints (e.g., trillion-scale embedding libraries that cannot fit in memory).
  • Offline batch indexing and online disk-backed query (no need to load full datasets into RAM).

Typical applications include:

  • Web-scale semantic search (text/image embedding retrieval with petabyte-scale datasets).
  • Enterprise recommendation systems (cold user/item embedding libraries with 1B+ entries).
  • Scientific computing (high-dimensional astronomical/genomic data similarity search).

Algorithms Supported

DiskANN exclusively implements disk-aware graph-based indexes optimized for CPU/disk I/O, with no hybrid or accelerated variants:

| Algorithm | Key Characteristics | Use Case |
| --- | --- | --- |
| StaticDiskANN | Disk-backed Vamana-style graph index for static datasets; minimal RAM usage, read-optimized. | 1B+ static vectors (no updates) – DiskANN's core scenario. |
| DynamicDiskANN | Incremental disk-backed index supporting low-frequency batch insertion/deletion. | Large-scale datasets with hourly/daily batch updates. |
| PQ-DiskANN | Product Quantization + DiskANN; further reduces disk/RAM usage at a minor recall loss. | Trillion-scale vectors with extreme memory constraints. |
| Brute-force | Exact disk-based search (baseline); no approximation – slow but high recall. | Small disk-resident datasets (≤100M vectors). |
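To make the brute-force baseline above concrete, here is a minimal plain-Python sketch of exact k-NN by linear scan. All names are illustrative, not DiskANN's API; in a real disk-based setting the vectors would be streamed from disk in chunks rather than held in a list:

```python
import heapq

def l2_sq(a, b):
    # Squared Euclidean (L2) distance between two equal-length vectors.
    return sum((x - y) ** 2 for x, y in zip(a, b))

def brute_force_knn(vectors, query, k=2):
    # Exact k-NN by scanning every vector once: O(N*d), slow but 100% recall.
    heap = []  # max-heap (negated distances) holding the k best candidates so far
    for vid, vec in enumerate(vectors):
        d = l2_sq(vec, query)
        if len(heap) < k:
            heapq.heappush(heap, (-d, vid))
        elif d < -heap[0][0]:
            heapq.heapreplace(heap, (-d, vid))
    return sorted((-nd, vid) for nd, vid in heap)  # [(distance, id), ...] ascending

data = [[0.0, 0.0], [1.0, 0.0], [5.0, 5.0], [0.1, 0.1]]
print(brute_force_knn(data, [0.0, 0.0], k=2))  # ids 0 and 3 are nearest
```

This is the recall reference that approximate indexes are measured against; its cost is one full pass over the dataset per query, which is why it is only viable for the smaller disk-resident datasets noted in the table.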

Core Technical Specifications

1. Supported Metric Spaces

DiskANN supports a narrow set of metric spaces for numerical vectors (no native non-metric support):

| Metric Type | Full Name | Supported Indexes | Use Case |
| --- | --- | --- | --- |
| L2 | Euclidean Distance | All indexes | General numerical vectors (image/video embeddings). |
| Cosine | Cosine Similarity | All indexes (via pre-normalization to unit L2 norm) | Text embeddings (BERT/LLM embeddings). |
| Inner Product (IP) | Dot Product | All indexes (via normalization) | Normalized vector similarity (equivalent to cosine). |

Note: Other distance functions (Jaccard, Hamming, L1, L∞) are not supported, and custom preprocessing to emulate them is unmaintained and not production-grade.
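The pre-normalization trick referenced in the table above rests on a small identity: after scaling vectors to unit L2 norm, the inner product of two vectors equals the cosine similarity of the originals, and L2 distance becomes a monotone transform of it, so an L2-only index ranks identically. A plain-Python sketch (independent of DiskANN's API):

```python
import math

def l2_normalize(v):
    # Scale a vector to unit L2 norm; do this before indexing when
    # cosine/IP semantics are wanted on an L2 index.
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cosine(a, b):
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

a, b = [3.0, 4.0], [1.0, 2.0]
ua, ub = l2_normalize(a), l2_normalize(b)

# On unit vectors, inner product == cosine similarity of the originals:
assert abs(dot(ua, ub) - cosine(a, b)) < 1e-12

# And squared L2 distance is a monotone transform: ||ua - ub||^2 = 2 - 2*dot(ua, ub)
d2 = sum((x - y) ** 2 for x, y in zip(ua, ub))
assert abs(d2 - (2 - 2 * dot(ua, ub))) < 1e-12
```

Because ranking by `2 - 2*dot` and ranking by `-dot` are equivalent, one L2 index serves all three metric rows once the data is normalized.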

2. Supported Data Types

DiskANN is optimized for 32-bit floating-point vectors, with limited support for low-precision integer types via quantization:

| Data Type | Precision | C++ Type | Python Binding Mapping | Support Status | Use Case |
| --- | --- | --- | --- | --- | --- |
| Float32 | 32-bit | `float` | `numpy.float32` | Fully supported (optimal performance) | Default – DiskANN's core optimized type. |
| Float64 (Double) | 64-bit | `double` | `numpy.float64` | Partial (CPU-only, 2–3x slower) | High-precision scientific computing only. |
| Int8/UInt8 | 8-bit | `int8_t`/`uint8_t` | `numpy.int8`/`numpy.uint8` | Partial (via scalar quantization) | Edge/disk-constrained environments (4x memory reduction). |
| Float16 (FP16) | 16-bit | – | `numpy.float16` | Not supported | Must be converted to Float32 manually, which forfeits FP16's memory/performance benefits. |
| Binary | Bit-level | `uint8_t` (packed) | `numpy.uint8` (bit-packed) | Not supported | No native handling for Hamming distance. |
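The "via scalar quantization" path for Int8 in the table can be sketched as symmetric scalar quantization: store one byte per dimension instead of four (the 4x reduction), at the cost of bounded reconstruction error. This is an illustrative sketch in plain Python, not DiskANN's quantizer:

```python
def quantize_int8(v):
    # Symmetric scalar quantization: map floats in [-max|v|, +max|v|] to int8.
    # Stores 1 byte/dim instead of 4 (float32): the "4x memory reduction".
    scale = max(abs(x) for x in v) / 127.0 or 1.0  # avoid div-by-zero on all-zero input
    q = [round(x / scale) for x in v]
    return q, scale

def dequantize(q, scale):
    # Approximate reconstruction; error per dimension is at most scale/2.
    return [x * scale for x in q]

v = [0.5, -1.0, 0.25, 0.0]
q, s = quantize_int8(v)
v2 = dequantize(q, s)
assert all(-128 <= x <= 127 for x in q)
assert max(abs(a - b) for a, b in zip(v, v2)) <= s / 2 + 1e-12
```

The per-vector `scale` must be stored alongside the codes; the recall impact depends on how much of each vector's range the values actually use.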

3. Dynamic Data Operations (Insert/Delete/Modify)

DiskANN’s dynamic capabilities are batch-oriented and non-real-time, with no support for high-frequency updates:

| Operation | Support Level | Constraints |
| --- | --- | --- |
| Incremental Insertion | Partial (batch-only, DynamicDiskANN) | Batch insertion only (10k+ vectors per batch, ~100ms/batch); single-vector insertion latency ~10ms (too slow for real-time); max sustainable insert rate ~100 vectors/s (CPU-only). |
| Real-Time Deletion | Minimal (soft delete only) | Soft delete (mark IDs invalid, filter post-query); physical deletion requires a partial index rebuild (hourly/daily batch); the index tolerates up to ~30% soft-deleted vectors before a rebuild is needed. |
| Vector Modification | Indirect (delete + reinsert) | Inherits deletion/insertion latency (batch-only, no real-time support). |
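The soft-delete pattern in the table (mark IDs invalid, filter post-query) can be sketched as follows. The search function here is a stand-in exact scan, and over-fetching compensates for candidates lost to tombstones; none of these names come from DiskANN's API:

```python
def search(index, query, k):
    # Stand-in for an ANN query: exact top-k by squared L2 over a dict of vectors.
    scored = sorted(
        (sum((a - b) ** 2 for a, b in zip(vec, query)), vid)
        for vid, vec in index.items()
    )
    return [vid for _, vid in scored[:k]]

def search_with_soft_deletes(index, deleted, query, k, overfetch=2):
    # Soft delete: fetch extra candidates, then drop tombstoned IDs post-query.
    # A real index would trigger a rebuild once tombstones exceed ~30% of entries.
    candidates = search(index, query, k * overfetch)
    return [vid for vid in candidates if vid not in deleted][:k]

index = {0: [0.0, 0.0], 1: [1.0, 0.0], 2: [0.1, 0.0], 3: [5.0, 5.0]}
deleted = {2}  # tombstoned, but still physically present in the index
print(search_with_soft_deletes(index, deleted, [0.0, 0.0], k=2))  # [0, 1]
```

The over-fetch factor is the usual tuning knob: too low and heavily deleted regions return fewer than `k` results; too high and query latency grows for no benefit.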

Characteristics

| Feature | Description |
| --- | --- |
| Incremental Updates | Batch-only insertion (DynamicDiskANN); soft deletion only; no real-time single-vector support. |
| Query Speed | CPU/disk-backed: 3–15ms/query (1B 768-dim vectors) – 2–4x faster than disk-based FAISS IVF-PQ. |
| Index Type | Disk-backed Vamana-style graph index (static/dynamic); quantization-enhanced variants (PQ-DiskANN). |
| Scalability | Handles trillion-scale vectors (disk-backed); RAM usage = 0.5–5% of total vector size (core advantage). |
| Language Bindings | C++ (full feature set); Python (core features only – no full DynamicDiskANN support). |
| Non-Metric Support | No native support (requires custom preprocessing). |
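The scalability row can be made concrete with back-of-envelope arithmetic. The 2% RAM fraction below is just one point inside the 0.5–5% range quoted above, and the function name is illustrative:

```python
def footprint_gb(num_vectors, dim, bytes_per_dim=4, ram_fraction=0.02):
    # Raw vector payload on disk, plus an in-memory working set assumed to be
    # a small fraction (0.5-5%) of it; 2% is picked here as a midpoint example.
    disk = num_vectors * dim * bytes_per_dim / 1e9   # GB on disk
    return disk, disk * ram_fraction                 # (disk GB, RAM GB)

disk_gb, ram_gb = footprint_gb(1_000_000_000, 768)
print(f"1B x 768-dim float32: {disk_gb:.0f} GB disk, ~{ram_gb:.0f} GB RAM")
```

So a billion 768-dim float32 vectors occupy roughly 3 TB on disk but need only tens of GB of RAM under this model, which is the gap that makes disk-backed indexing worthwhile; the graph structure and any PQ codes add disk overhead on top of the raw payload.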

Notes

  • DiskANN’s core value is ultra-low RAM usage for billion/trillion-scale static vectors – it is not optimized for speed (FAISS HNSW/GPU is faster for in-memory datasets <1B vectors).
  • DynamicDiskANN is designed for low-frequency batch updates (hourly/daily) – avoid using it for real-time use cases (e.g., sub-second user behavior embedding insertion).
  • Limitations: No distributed multi-node support (single-node disk-backed only); Python bindings lack full DynamicDiskANN functionality; no non-metric space support.
  • Best Practices: Use StaticDiskANN for 1B+ static vectors (low RAM footprint); pair with in-memory FAISS HNSW for hot data (<100M vectors) to balance scale and query speed.
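The hot/cold pairing suggested in the best practices can be sketched as a two-tier search: query a small in-memory index for hot data and the disk-backed index for cold data, then merge by distance. Everything here is an illustrative stand-in (plain-Python exact scans in place of FAISS HNSW and DiskANN):

```python
import heapq

def knn(vectors, query, k):
    # Exact top-k by squared L2; stands in for either tier's ANN query.
    return heapq.nsmallest(
        k, ((sum((a - b) ** 2 for a, b in zip(v, query)), vid)
            for vid, v in vectors.items()))

def tiered_search(hot, cold, query, k):
    # Query both tiers, then merge by distance. `hot` models an in-memory
    # index over recent data; `cold` a disk-backed index over the long tail.
    merged = knn(hot, query, k) + knn(cold, query, k)
    merged.sort()
    return [vid for _, vid in merged[:k]]

hot = {"h0": [0.0, 0.0], "h1": [2.0, 2.0]}
cold = {"c0": [0.1, 0.0], "c1": [9.0, 9.0]}
print(tiered_search(hot, cold, [0.0, 0.0], k=2))  # ['h0', 'c0']
```

Fetching `k` from each tier before merging guarantees the global top-k is present in the candidate pool as long as both tiers use the same distance; the main design decision left open is how vectors migrate from the hot tier into the cold index during batch rebuilds.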