Data Formats Overview

This document provides a comprehensive overview of common data formats used in large-scale analytics and storage systems, their typical use cases, key libraries, and implementation differences.

Format Summary

| Format / Type | Purpose / Role | Typical Use Cases | Key Libraries / SDKs |
| --- | --- | --- | --- |
| Parquet | Columnar storage format optimized for analytics | Big data analytics, ETL pipelines, OLAP queries | Apache Parquet (C++/Java/Python), Arrow |
| Arrow | In-memory columnar format for fast computation | In-memory analytics, zero-copy interchange between engines | Apache Arrow (C++/Python/Java/Go) |
| Avro | Row-based storage format with schema evolution support | Log storage, streaming pipelines, serialization | Apache Avro (C++/Java/Python) |
| HDF5 | Hierarchical file format for storing large numerical datasets | Scientific computing, large multidimensional arrays | HDF5 (C/C++/Python), h5py |
| NPY | Simple binary format for NumPy arrays | Machine learning experiments, numerical data storage | NumPy (Python) |
| OCR | Specialized format for scanned documents and text extraction | Document processing pipelines, OCR-based indexing | Tesseract, OpenCV, custom SDKs |
| Substrait | Relational algebra query plan serialization | Query plan interchange across engines, query federation | Substrait (Protobuf-based), Arrow integration |

Implementation Differences and Performance Notes

Parquet

  • Columnar storage allows efficient compression and predicate pushdown.
  • Implementations:
      • Apache Parquet C++ (Arrow integration) – high-performance, zero-copy reads; widely used in analytics engines such as Spark and Dask.
      • Apache Parquet Java – reference implementation in the Hadoop/Spark ecosystem; JVM overhead can affect throughput.
      • Third-party libraries – e.g., fastparquet (Python), with different codec defaults and performance characteristics.
  • Performance considerations:
      • Encoding/compression codecs (Snappy, ZSTD, GZIP) significantly impact I/O and CPU usage.
      • Arrow integration enables zero-copy memory access, reducing CPU load in analytics pipelines.
      • Multi-threaded read/decompression is well supported in the C++/Arrow implementation, less so in the default Java implementation.
  • Use case: optimized for batch analytics; library choice can significantly affect read throughput and CPU consumption.

Arrow

  • Focused on in-memory analytics, often as an intermediate format for data interchange.
  • C++ core is fastest; Python/Java bindings may add overhead.

Avro

  • Row-based, ideal for write-heavy streaming workloads.
  • Java implementation most mature; C++/Python bindings exist but may lack some schema evolution features.
  • Performance differences are primarily in serialization/deserialization speed.

HDF5

  • Hierarchical storage with rich metadata, suitable for large numerical datasets.
  • C library core; Python h5py is a wrapper.
  • Performance depends heavily on chunking and I/O patterns.
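The chunking point can be illustrated with h5py (dataset name, shape, and the 100×100 chunk size are illustrative choices, not recommendations):

```python
import numpy as np
import h5py

with h5py.File("data.h5", "w") as f:
    # Chunk shape should match the expected access pattern;
    # mismatched chunks force reading far more data than requested.
    dset = f.create_dataset(
        "matrix", shape=(1000, 1000), dtype="f8",
        chunks=(100, 100), compression="gzip",
    )
    dset[0:100, 0:100] = np.ones((100, 100))

with h5py.File("data.h5", "r") as f:
    # This slice aligns with one chunk, so only that chunk is read.
    block = f["matrix"][0:100, 0:100]
```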

NPY

  • Simple, fast binary storage for NumPy arrays.
  • No built-in compression (the related .npz container offers zipped storage); best suited to small-to-medium arrays.
  • Mainly used for machine learning experiments, not production batch storage.
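A short sketch of the save/load cycle (file names are illustrative):

```python
import numpy as np

arr = np.arange(12, dtype=np.float32).reshape(3, 4)

# .npy: raw binary plus a small header, no compression.
np.save("weights.npy", arr)
loaded = np.load("weights.npy")

# If compression is needed, .npz wraps arrays in a zip container.
np.savez_compressed("bundle.npz", weights=arr)
```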

OCR

  • Application-specific; extraction accuracy, speed, and output format vary by library.
  • No standardized storage; usage depends on document processing pipelines.

Substrait

  • Protocol-level format for query plan serialization, not for raw data storage.
  • Implementation differences are minor; the focus is on compatibility and expressiveness.

Note: Due to ecosystem constraints, the library choices for each format are relatively fixed. If additional requirements arise, a custom implementation can be built to suit specific needs.