Data Formats Overview
This document provides a comprehensive overview of common data formats used in large-scale analytics and storage systems, their typical use cases, key libraries, and implementation differences.
Format Summary
| Format / Type | Purpose / Role | Typical Use Cases | Key Libraries / SDKs |
|---|---|---|---|
| Parquet | Columnar storage format optimized for analytics | Big data analytics, ETL pipelines, OLAP queries | Apache Parquet (C++/Java/Python), Arrow |
| Arrow | In-memory columnar format for fast computation | In-memory analytics, zero-copy interchange between engines | Apache Arrow (C++/Python/Java/Go) |
| Avro | Row-based storage format, schema evolution support | Log storage, streaming pipelines, serialization | Apache Avro (C++/Java/Python) |
| HDF5 | Hierarchical file format for storing large numerical datasets | Scientific computing, large multidimensional arrays | HDF5 (C/C++/Python), h5py |
| NPY | Simple binary format for NumPy arrays | Machine learning experiments, numerical data storage | NumPy (Python) |
| OCR output | Extracted text/layout from scanned documents; no single standard format | Document processing pipelines, OCR-based indexing | Tesseract, OpenCV, custom SDKs |
| Substrait | Relational algebra query plan serialization | Query plan interchange across engines, query federation | Substrait (Protobuf-based), Arrow integration |
Implementation Differences and Performance Notes
Parquet
- Columnar storage allows efficient compression and predicate pushdown.
- Implementations:
  - Apache Parquet C++ (Arrow integration): high performance, zero-copy reads; widely used by analytics tools such as pandas and Dask.
  - Apache Parquet Java: the reference implementation in the Hadoop/Spark ecosystem; JVM overhead can affect throughput.
  - Third-party libraries, e.g., fastparquet (Python), with different codec defaults and performance characteristics.
- Performance considerations:
  - Encoding and compression codecs (Snappy, ZSTD, GZIP) significantly affect I/O and CPU usage.
  - Arrow integration enables zero-copy memory access, reducing CPU load in analytics pipelines.
  - Multi-threaded reads and decompression are well supported in the C++/Arrow implementation, less so in the default Java implementation.
- Use case: Optimized for batch analytics; library choice can significantly affect read throughput and CPU consumption.
Arrow
- Focused on in-memory analytics, often as an intermediate format for data interchange.
- C++ core is fastest; Python/Java bindings may add overhead.
Avro
- Row-based, ideal for write-heavy streaming workloads.
- The Java implementation is the most mature; C++ and Python implementations exist but may lag on some schema evolution features.
- Performance differences are primarily in serialization/deserialization speed.
HDF5
- Hierarchical storage with rich metadata, suitable for large numerical datasets.
- C library core; the Python `h5py` package is a wrapper around it.
- Performance depends heavily on chunking and I/O patterns.
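The chunking point can be sketched with h5py (an assumption that it is installed; dataset names and chunk shapes are illustrative). Chunks are the unit of I/O and compression, so they should align with how the data will be read:

```python
import os
import tempfile

import h5py
import numpy as np

data = np.arange(10000, dtype=np.float64).reshape(100, 100)
path = os.path.join(tempfile.gettempdir(), "demo.h5")

with h5py.File(path, "w") as f:
    # Chunks of whole rows (10, 100) suit row-wise reads; gzip compression
    # is applied per chunk.
    dset = f.create_dataset("grid", data=data, chunks=(10, 100),
                            compression="gzip")
    dset.attrs["units"] = "meters"  # metadata lives alongside the data

with h5py.File(path, "r") as f:
    row = f["grid"][0]               # touches only chunks covering row 0
    units = f["grid"].attrs["units"]
print(row[:3], units)
```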
NPY
- Simple, fast binary storage for NumPy arrays.
- No compression (the related NPZ container wraps NPY files in a zip archive); best suited to small and medium arrays.
- Mainly used for machine learning experiments, not production batch storage.
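A minimal NPY round trip with NumPy; an in-memory buffer stands in for a file here:

```python
import io

import numpy as np

arr = np.random.default_rng(0).normal(size=(4, 3))

buf = io.BytesIO()
np.save(buf, arr)          # small header + raw array bytes, no compression
buf.seek(0)
restored = np.load(buf)

print(np.array_equal(arr, restored))  # True
```

For multiple arrays or when compression matters, `np.savez_compressed` produces an NPZ archive instead.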
OCR
- Application-specific; extraction accuracy, speed, and output format vary by library.
- No standardized storage; usage depends on document processing pipelines.
Substrait
- Protocol-level format for query plan serialization, not for raw data storage.
- Implementation differences are minor; the focus is on cross-engine compatibility and plan expressiveness.
Note: Due to ecosystem constraints, the library choices for each format are relatively fixed. If additional requirements arise, a custom implementation can be built to suit specific needs.