
Parquet Format Overview

Apache Parquet is a widely used open-source columnar storage file format designed for efficient data analytics. It was created to work well with the Hadoop ecosystem and has since become a de facto standard for analytical workloads across many platforms.

1. Key Characteristics

Columnar Storage

  • Parquet stores data column by column, not row by row.
  • This enables reading only the necessary columns, reducing I/O for analytical queries.

Schema-aware Metadata

  • Each Parquet file includes its schema and metadata, allowing complex nested types (lists, structs, maps).
  • Schema information enables predicate pushdown and improved pruning.

Per-column Encoding and Compression

  • Each column chunk can use its own encoding (e.g., dictionary, run-length, bit-packing) and its own compression codec, such as Snappy, Gzip, Brotli, or ZSTD.
  • Columnar compression is more efficient than row-based compression for analytical workloads.

Splittable Files

  • Large Parquet files can be split across multiple processing tasks, enabling parallel reads.

Cross-language Interoperability

Parquet is supported in many ecosystems:

  • Java: Apache Parquet (native)
  • C++: Apache Arrow / Parquet C++
  • Python: pyarrow, fastparquet
  • Go: parquet-go
  • Rust: parquet-rs

2. Usage Scenarios

Analytical Workloads

  • Aggregations, reporting, BI queries over large datasets
  • Time-series analytics and OLAP queries

Cloud Data Lakes

  • Preferred format for object store data in AWS S3, GCS, Azure Blob

Data Interchange

  • Works natively with Spark, Presto, Hive, Trino, Flink

Batch Processing

  • Best suited to batch or append-heavy workloads; Parquet files are immutable once written, so random key-value updates are not supported

3. Integration in Kumo

In Kumo Stack, Parquet is typically used for:

  • Storing large structured datasets
  • Batch export/import pipelines
  • Persistence of analysis results

Integration considerations:

  • Follow schema evolution best practices: add new columns as optional fields, and avoid renaming or changing the type of existing ones
  • Prefer larger Parquet files (≥256 MB) to reduce overhead
  • Use predicate pushdown to minimize data read

4. Performance Notes

I/O Efficiency

  • Reading selected columns reduces disk I/O significantly

Compression

  • Compressing each column separately achieves higher ratios than row-oriented layouts, because values within a column tend to be similar
  • Compression choice should balance CPU cost against storage savings

Parallel Reads

  • Large files allow distributed processing systems to read different row groups in parallel

Example Workflow

  • ETL pipeline writes daily Parquet partitions
  • Analytics engine reads only relevant columns
  • Predicate pushdown filters rows without scanning the full file