Skip to main content

Posting List Implementations (C++ / Industrial)

Overview

A posting list (or inverted list) is a fundamental data structure in large‑scale retrieval and search systems. It stores the list of document IDs (and optionally positions) where a given term appears. An efficient posting list engine must balance:

  • memory footprint
  • scan and random access performance
  • intersection / skip efficiency
  • compression / unpacking cost

In practical systems, posting lists are implemented as either uncompressed arrays or compressed integer lists with optional block/skip indexing.


1) Uncompressed Posting List

Overview

An uncompressed posting list is a sorted array of uint32_t document IDs (or positions). It is conceptually simple and provides fast access without decode overhead.

Characteristics

  • Space: largest (no compression)
  • Scan performance: best for sequential / random access
  • Skip support: trivial
  • Serialization: direct binary dump

Typical Use Cases

  • Short posting lists
  • Memory‑rich systems
  • Scenarios where CPU decode cost must be minimal

Simple C++ Sketch

std::vector<uint32_t> postlist = {3, 7, 15, 29, 102};

// Sequential scan
for (uint32_t doc : postlist) {
process(doc);
}

// Random access
uint32_t doc = postlist[2];

2) Block / Skip Structures

Overview

Block or skip indexing partitions a posting list into fixed or variable sized blocks, and stores skip pointers (or summary metadata) to accelerate traversal and intersection. Many modern retrieval engines combine block indexing with compressed posting formats for optimal performance.

Characteristics

  • Skip support: allows jumping past irrelevant blocks
  • Intersection speed: improved for long lists
  • Extra metadata: cost proportional to block count

Typical Use Cases

  • Long posting lists
  • High term intersection frequencies
  • Real‑time query processing

3) Delta + Compressed Integer Lists

Principle

Delta encoding exploits the fact that sorted document IDs often have small differences between adjacent values. For a posting list:

original: [d1, d2, d3, ...]
delta: [d1, d2 - d1, d3 - d2, ...]

Delta values tend to be small and more compressible. A second stage integer codec then compresses these deltas.

Common Industrial Codecs

Industry uses several established integer compression codecs:

Codec CategoryExample Algorithms
Variable‑byteVByte
Bit‑packingPFor (Frame of Reference)
SIMD‑friendlySIMD‑FOR / SIMD‑PFor
Simple integerSimple9 / Simple16
Legacy gammaElias Gamma / Delta / Rice

Example Libraries

These are real, industrially used libraries:

FastPForLib

  • GitHub: https://github.com/lemire/FastPFor

  • Implements:

  • Frame of Reference (FOR)

  • Patched Frame of Reference (PFor)

  • SIMD‑PFor / SIMD‑FOR variants

  • Variable‑byte, Simple9, Simple16, etc.

  • Widely used in inverted index systems for posting list compression.

Encoding / Decoding Pattern (Pseudo)

#include "common/headers.hpp"
#include "FastPFor/headers.hpp"

// assume posting list in 'postings'
auto deltas = delta_encode(postings);

// choose a codec, e.g., PFor
FastPForLib::PFor<32> codec;

// encode into compressed buffer
std::vector<uint32_t> compressed(deltas.size());
size_t outSize = 0;
codec.encodeArray(deltas.data(), deltas.size(), compressed.data(), outSize);

// decode back
std::vector<uint32_t> decoded(deltas.size());
size_t decodedSize = 0;
codec.decodeArray(compressed.data(), outSize, decoded.data(), decodedSize);

// restore original
auto restored = delta_decode(decoded);

Note: actual API names vary by FastPForLib version; above is schematic.


4) Simple9 / Simple16 / Variable Byte

These codecs compress integer lists without SIMD optimization but remain common in older systems:

  • Variable Byte (VByte): easy decode, moderate compression
  • Simple9 / Simple16: better compression, slightly higher decode cost
  • Elias / Gamma / Delta / Rice: good theoretical compression, weaker random access

Typical Uses

  • Legacy systems
  • Compatibility scenarios
  • Lightweight compression where SIMD is unnecessary

5) Emerging SIMD‑Optimized Schemes

Newer schemes focus on balancing compression with SIMD decode throughput:

  • Bit‑packing with SIMD instructions
  • Block‑oriented ANS / Bulk coding
  • Byte/bit shuffle + entropy coding (zstd‐style)

While research demonstrates high throughput, adoption in large retrieval stacks is still catching up compared to established codecs like SIMD‑PFor / FastPFor.


Industrial Comparison

SchemeSpace EfficiencyDecode SpeedRandom AccessSkip SupportSIMD Friendly
Uncompressed✔️✔️✔️
Block / Skip Structure⬆️✔️✔️✔️optional
PFor / PFORDelta✔️✔️✔️⚠️⬆️ via blocks✔️
Simple9 / Simple16⬆️✔️⚠️⬆️
Variable Byte✔️✔️⚠️limited
New SIMD Schemes✔️✔️✔️✔️✔️⚠️plugin✔️

Typical Scenarios

Uncompressed

  • Very short lists
  • Memory available
  • Lowest CPU cost

Compressed (PFor / PFORDelta)

  • Long posting lists (millions of entries)
  • Disk / memory bandwidth constrained
  • Intersection and batch processing dominant

Block / Skip Structure

  • Frequent skipping required
  • Multi‐term intersection heavy workloads
  • Real‐time low latency retrieval

Implementation Notes

Practical industrial systems often combine:

Block Index + Delta Encoding + SIMD‑friendly Codec

to achieve a balance of compact storage, fast sequential scan, and efficient intersection.

Posting lists are typically part of a larger inverted index which may include:

  • Position lists (for phrase queries)
  • Payloads (term frequency, field boosts)
  • Bitmap filters (for quick disqualification)

Engineering considerations include:

  • Memory pool / mmap support
  • Batch decode interfaces
  • Cache‑friendly layout
  • Concurrency support (read parallelism)

Posting List Encoding Example (FastPForLib)

#include "FastPFor/headers.hpp"
#include <vector>

// delta_encode / delta_decode implemented separately

std::vector<uint32_t> postings = {10, 15, 102, 205, 2200};

// compute deltas
auto deltas = delta_encode(postings);

// choose PFor codec from FastPForLib
FastPForLib::PFor<32> codec;

// encode
std::vector<uint32_t> compressed(deltas.size());
size_t outSize = 0;
codec.encodeArray(deltas.data(), deltas.size(), compressed.data(), outSize);

// decode
std::vector<uint32_t> recovered(deltas.size());
size_t recSize = 0;
codec.decodeArray(compressed.data(), outSize, recovered.data(), recSize);

// restore postings
auto restored = delta_decode(recovered);