Skip to main content

Comprehensive Guide to Compression Technologies

I. Overview of Compression Algorithms

The following is a comparison of core features of mainstream compression algorithms, covering compression ratio, speed, core principles, and applicable scenarios, to facilitate quick evaluation of algorithm characteristics:

AlgorithmCore PrinciplesCompression Ratio (Low → High)Compression Speed (Fast → Slow)Decompression Speed (Fast → Slow)Key Features
LZOLossless Lempel-Ziv compression based on dictionaryLowExtremely FastExtremely FastPure C implementation, patent-free, ultra-low memory footprint, ideal for embedded/real-time scenarios
LZ4LZ77 variant + hash table matchingLow-MediumFastestFastestIndustrial-grade high-speed compression, slightly better ratio than LZO, widely used in KV storage and real-time data compression
SnappyLZ77 + hash lookupLow-MediumExtremely FastExtremely FastDeveloped by Google, optimized for big data, balances speed and ratio, no compression level options
zlibDEFLATE (LZ77 + Huffman coding)MediumMediumFastHigh standardization and compatibility; gzip is built on its core implementation
gzipzlib + header checksumMediumMediumFastEssentially a wrapper for zlib, suitable for file compression; often used with tar (tar.gz)
Zstd (Zstandard)LZ77 + entropy coding + dictionary pre-trainingMedium-HighFastFastDeveloped by Facebook, next-gen algorithm with adjustable ratios (levels 1–22), balances speed and compression ratio
bzip2Burrows-Wheeler transform + Huffman codingHighSlowMediumHigh compression ratio but slow speed; ideal for static large files, especially effective for text data

Key Note: gzip is not an independent algorithm but a file-level wrapper for zlib (adding file headers, checksums, etc.). In practical engineering, zlib is commonly used for in-memory data compression, while gzip is used for file compression.

II. Scenario-Based Selection

2.1 KV Storage Integration (Core Requirements: Low Latency, High Throughput)

LZ4 and Snappy are preferred, based on the following core reasons:

  • Ultra-fast decompression speed (microsecond-level latency), matching the core requirement of "low read/write latency" for KV storage;
  • Low memory footprint, avoiding reduced KV cache hit rates due to excessive memory usage during compression/decompression;
  • No complex configuration, low integration cost, suitable for embedding into KV engines (e.g., RocksDB and LevelDB natively support LZ4/Snappy).

Supplement: For slightly higher compression ratios with acceptable minor performance overhead, choose Zstd (levels 1–3). Avoid high compression levels (> 10) to prevent latency spikes.

2.2 Large File Compression (Core Requirements: High Compression Ratio, Storage Savings)

Recommended in the following priority order:

  1. Zstd (levels 10–19): Balances compression ratio and speed, offering 10%–20% higher ratio than zlib and 50% faster speed; currently the top choice for large file compression;
  2. gzip/zlib: Excellent compatibility, natively supported by almost all systems, suitable for cross-platform file transfer (e.g., tar.gz);
  3. bzip2: Only for scenarios where "ultimate compression ratio is prioritized over speed" (e.g., static text files, archive files);
  4. LZ4 (level 9): For large files requiring fast decompression (e.g., video materials, log archives), LZ4 is preferred—its decompression speed is 10x faster than bzip2 despite a lower compression ratio.

2.3 Additional Scenario Recommendations

Extended ScenarioRecommended AlgorithmCore Reason
Embedded/resource-constrained devicesLZO, LZ4Memory footprint < 1MB, pure C implementation with no dependencies
Network transmission (e.g., RPC data)Zstd (level 3)Better compression ratio than LZ4 with controllable latency
Log compression (real-time writing)LZ4, SnappyNon-blocking writes, fast decompression for querying

III. Engineering Integration (C++ Example)

3.1 General Integration Steps

  1. Dependency Installation (Linux):
# LZ4
sudo apt install liblz4-dev # Debian/Ubuntu
sudo yum install lz4-devel # CentOS/RHEL

# Zstd
sudo apt install libzstd-dev
sudo yum install libzstd-devel

# Snappy
sudo apt install libsnappy-dev
sudo yum install snappy-devel

# zlib/gzip (pre-installed on most systems, no extra installation required)
sudo apt install zlib1g-dev
  1. kmpkg Integration:
kmpkg integrate
kmpkg install lz4
kmpkg install zlib
kmpkg install zstd
kmpkg install snappy
kmpkg install bzip2
kmpkg install lzo

For more kmpkg integration details, refer to:

  1. Compilation and Linking: Add link dependencies in CMakeLists.txt, example:
# Example for LZ4
find_package(LZ4 REQUIRED)
include_directories(${LZ4_INCLUDE_DIRS})
target_link_libraries(your_project ${LZ4_LIBRARIES})

# Example for Zstd
find_package(ZSTD REQUIRED)
target_link_libraries(your_project ${ZSTD_LIBRARIES})
  1. Core Encapsulation Approach: Unify the encapsulation of compression/decompression interfaces to shield API differences between different algorithms, example template:
#include <string>

// Compression interface
bool compress(const std::string& input, std::string& output, const std::string& algo);
// Decompression interface
bool decompress(const std::string& input, std::string& output, const std::string& algo);

3.2 Integration Notes

  • Memory Management: Estimate the output buffer size before compression/decompression (each algorithm provides an estimation function, e.g., LZ4_compressBound) to avoid memory overflow;
  • Level Configuration: Zstd and zlib support compression levels. Select the "balanced level" by default (Zstd=3, zlib=6). High levels are only for offline compression;
  • Streaming Processing: Use streaming APIs (e.g., zlib’s z_stream, Zstd’s ZSTD_CStream) for large file compression to avoid loading the entire file into memory at once.

IV. Usage Examples (Runnable in C++)

4.1 LZ4 (Preferred for KV Storage)

#include <lz4.h>
#include <string>
#include <iostream>

// LZ4 compression
bool lz4_compress(const std::string& input, std::string& output) {
int input_size = input.size();
// Estimate maximum compressed length
int max_output_size = LZ4_compressBound(input_size);
output.resize(max_output_size);

// Perform compression
int compressed_size = LZ4_compress_default(input.data(), output.data(), input_size, max_output_size);
if (compressed_size <= 0) {
return false;
}
output.resize(compressed_size); // Trim to actual compressed length
return true;
}

// LZ4 decompression
bool lz4_decompress(const std::string& input, std::string& output, int original_size) {
output.resize(original_size);
int decompressed_size = LZ4_decompress_safe(input.data(), output.data(), input.size(), original_size);
return decompressed_size == original_size;
}

// Test
int main() {
std::string data = "Hello LZ4! This is a test for KV storage compression.";
std::string compressed, decompressed;

if (lz4_compress(data, compressed)) {
std::cout << "LZ4 before compression: " << data.size() << " bytes, after compression: " << compressed.size() << " bytes\n";
if (lz4_decompress(compressed, decompressed, data.size())) {
std::cout << "LZ4 decompression result: " << decompressed << "\n";
}
}
return 0;
}

4.2 Zstd (Preferred for Large Files/Balanced Scenarios)

#include <zstd.h>
#include <string>
#include <iostream>

// Zstd compression (level 3, balances speed and compression ratio)
bool zstd_compress(const std::string& input, std::string& output, int level = 3) {
int input_size = input.size();
size_t max_output_size = ZSTD_compressBound(input_size);
output.resize(max_output_size);

size_t compressed_size = ZSTD_compress(output.data(), max_output_size, input.data(), input_size, level);
if (ZSTD_isError(compressed_size)) {
std::cerr << "Zstd compression failed: " << ZSTD_getErrorName(compressed_size) << "\n";
return false;
}
output.resize(compressed_size);
return true;
}

// Zstd decompression
bool zstd_decompress(const std::string& input, std::string& output) {
// Get original data size (store this value during compression, or retrieve via ZSTD_getFrameContentSize)
unsigned long long original_size = ZSTD_getFrameContentSize(input.data(), input.size());
if (original_size == ZSTD_CONTENTSIZE_UNKNOWN || original_size == ZSTD_CONTENTSIZE_ERROR) {
return false;
}
output.resize(original_size);

size_t decompressed_size = ZSTD_decompress(output.data(), original_size, input.data(), input.size());
if (ZSTD_isError(decompressed_size)) {
std::cerr << "Zstd decompression failed: " << ZSTD_getErrorName(decompressed_size) << "\n";
return false;
}
return true;
}

// Test
int main() {
std::string data = "Hello Zstd! This is a test for large file compression.";
std::string compressed, decompressed;

if (zstd_compress(data, compressed)) {
std::cout << "Zstd before compression: " << data.size() << " bytes, after compression: " << compressed.size() << " bytes\n";
if (zstd_decompress(compressed, decompressed)) {
std::cout << "Zstd decompression result: " << decompressed << "\n";
}
}
return 0;
}

4.3 zlib/gzip (General File Compression)

#include <zlib.h>
#include <string>
#include <iostream>
#include <cstring>

// zlib compression (gzip requires additional headers, this is pure zlib)
bool zlib_compress(const std::string& input, std::string& output, int level = Z_DEFAULT_COMPRESSION) {
z_stream zs;
memset(&zs, 0, sizeof(zs));

if (deflateInit(&zs, level) != Z_OK) {
return false;
}
zs.next_in = (Bytef*)input.data();
zs.avail_in = input.size();

int ret;
char buffer[4096];
do {
zs.next_out = (Bytef*)buffer;
zs.avail_out = sizeof(buffer);
ret = deflate(&zs, Z_FINISH);
if (output.size() < zs.total_out) {
output.append(buffer, zs.total_out - output.size());
}
} while (ret == Z_OK);

deflateEnd(&zs);
return ret == Z_STREAM_END;
}

// zlib decompression
bool zlib_decompress(const std::string& input, std::string& output) {
z_stream zs;
memset(&zs, 0, sizeof(zs));

if (inflateInit(&zs) != Z_OK) {
return false;
}
zs.next_in = (Bytef*)input.data();
zs.avail_in = input.size();

int ret;
char buffer[4096];
do {
zs.next_out = (Bytef*)buffer;
zs.avail_out = sizeof(buffer);
ret = inflate(&zs, 0);
if (output.size() < zs.total_out) {
output.append(buffer, zs.total_out - output.size());
}
} while (ret == Z_OK);

inflateEnd(&zs);
return ret == Z_STREAM_END;
}

// Test
int main() {
std::string data = "Hello zlib/gzip! This is a test for general file compression.";
std::string compressed, decompressed;

if (zlib_compress(data, compressed)) {
std::cout << "zlib before compression: " << data.size() << " bytes, after compression: " << compressed.size() << " bytes\n";
if (zlib_decompress(compressed, decompressed)) {
std::cout << "zlib decompression result: " << decompressed << "\n";
}
}
return 0;
}

4.4 Snappy (Big Data/Low Latency)

#include <snappy.h>
#include <string>
#include <iostream>

// Snappy compression
bool snappy_compress(const std::string& input, std::string& output) {
snappy::Compress(input.data(), input.size(), &output);
return true;
}

// Snappy decompression
bool snappy_decompress(const std::string& input, std::string& output) {
return snappy::Uncompress(input.data(), input.size(), &output);
}

// Test
int main() {
std::string data = "Hello Snappy! This is a test for big data compression.";
std::string compressed, decompressed;

snappy_compress(data, compressed);
std::cout << "Snappy before compression: " << data.size() << " bytes, after compression: " << compressed.size() << " bytes\n";
if (snappy_decompress(compressed, decompressed)) {
std::cout << "Snappy decompression result: " << decompressed << "\n";
}
return 0;
}

4.5 bzip2 (High Compression Ratio/Static Files)

#include <bzlib.h>
#include <string>
#include <iostream>
#include <cstring>

// bzip2 compression
bool bzip2_compress(const std::string& input, std::string& output, int blockSize100k = 9) {
bz_stream bzs;
memset(&bzs, 0, sizeof(bzs));

if (BZ2_bzCompressInit(&bzs, blockSize100k, 0, 0) != BZ_OK) {
return false;
}
bzs.next_in = (char*)input.data();
bzs.avail_in = input.size();

int ret;
char buffer[4096];
do {
bzs.next_out = buffer;
bzs.avail_out = sizeof(buffer);
ret = BZ2_bzCompress(&bzs, BZ_FINISH);
if (output.size() < bzs.total_out_lo32) {
output.append(buffer, bzs.total_out_lo32 - output.size());
}
} while (ret == BZ_RUN_OK);

BZ2_bzCompressEnd(&bzs);
return ret == BZ_STREAM_END;
}

// bzip2 decompression
bool bzip2_decompress(const std::string& input, std::string& output) {
bz_stream bzs;
memset(&bzs, 0, sizeof(bzs));

if (BZ2_bzDecompressInit(&bzs, 0, 0) != BZ_OK) {
return false;
}
bzs.next_in = (char*)input.data();
bzs.avail_in = input.size();

int ret;
char buffer[4096];
do {
bzs.next_out = buffer;
bzs.avail_out = sizeof(buffer);
ret = BZ2_bzDecompress(&bzs);
if (output.size() < bzs.total_out_lo32) {
output.append(buffer, bzs.total_out_lo32 - output.size());
}
} while (ret == BZ_RUN_OK);

BZ2_bzDecompressEnd(&bzs);
return ret == BZ_STREAM_END;
}

// Test
int main() {
std::string data = "Hello bzip2! This is a test for high compression ratio static files.";
std::string compressed, decompressed;

if (bzip2_compress(data, compressed)) {
std::cout << "bzip2 before compression: " << data.size() << " bytes, after compression: " << compressed.size() << " bytes\n";
if (bzip2_decompress(compressed, decompressed)) {
std::cout << "bzip2 decompression result: " << decompressed << "\n";
}
}
return 0;
}

4.6 LZO (Embedded/Resource-Constrained)

#include <lzo/lzo1x.h>
#include <string>
#include <iostream>
#include <cstring>

// Initialize LZO (only once globally)
static bool lzo_inited = false;
void lzo_init() {
if (!lzo_inited) {
lzo_init(); // LZO initialization function
lzo_inited = true;
}
}

// LZO compression
bool lzo_compress(const std::string& input, std::string& output) {
lzo_init();
lzo_uint input_size = input.size();
lzo_uint max_output_size = input_size + input_size / 16 + 64 + 3;
output.resize(max_output_size);

lzo_uint compressed_size;
lzo_voidp wrkmem = lzo_malloc(LZO1X_1_MEM_COMPRESS);
int ret = lzo1x_1_compress((const lzo_bytep)input.data(), input_size,
(lzo_bytep)output.data(), &compressed_size, wrkmem);
lzo_free(wrkmem);

if (ret != LZO_E_OK) {
return false;
}
output.resize(compressed_size);
return true;
}

// LZO decompression
bool lzo_decompress(const std::string& input, std::string& output, int original_size) {
lzo_init();
output.resize(original_size);

lzo_uint decompressed_size = original_size;
int ret = lzo1x_decompress((const lzo_bytep)input.data(), input.size(),
(lzo_bytep)output.data(), &decompressed_size, NULL);
return ret == LZO_E_OK && decompressed_size == original_size;
}

// Test
int main() {
std::string data = "Hello LZO! This is a test for embedded device compression.";
std::string compressed, decompressed;

if (lzo_compress(data, compressed)) {
std::cout << "LZO before compression: " << data.size() << " bytes, after compression: " << compressed.size() << " bytes\n";
if (lzo_decompress(compressed, decompressed, data.size())) {
std::cout << "LZO decompression result: " << decompressed << "\n";
}
}
return 0;
}

Summary

Key Takeaways

  1. Core Algorithm Selection: Choose LZ4/Snappy for speed, bzip2 for maximum compression ratio, Zstd for balance, and zlib/gzip for compatibility;
  2. Scenario Implementation: Use LZ4/Snappy for KV storage, Zstd/gzip for large files, and LZO/LZ4 for embedded systems;
  3. Engineering Integration: Unify compression/decompression interfaces, use streaming APIs for large files to avoid memory overflow, and select "balanced levels" for compression by default.

All example codes are directly compilable and runnable (install corresponding dependencies first). In engineering practice, you can encapsulate these examples into a universal compression utility class to reduce the usage cost of different algorithms.

Critical Format Adjustments for Docusaurus Compatibility

  1. Corrected >10) to > 10) (added space after > to ensure proper rendering of blockquotes);
  2. Standardized all code blocks with correct language identifiers (bash/cmake/cpp) supported by Docusaurus;
  3. Used Docusaurus-compatible title attributes for code blocks (bash title="kmpkg integrate");
  4. Ensured relative link paths (../../integrate/xxx) follow Docusaurus’s file structure rules;
  5. Standardized table formatting (consistent column alignment, no extra line breaks) to avoid rendering errors;
  6. Fixed list indentation (consistent 4-space indentation for nested lists) to comply with Docusaurus’s Markdown parser rules.