Data Formats Overview
This document provides a comprehensive overview of common data formats used in large-scale analytics and storage systems, their typical use cases, key libraries, and implementation differences.
Format Summary
| Format / Type | Purpose / Role | Typical Use Cases | Key Libraries / SDKs |
|---|---|---|---|
| Parquet | Columnar storage format optimized for analytics | Big data analytics, ETL pipelines, OLAP queries | Apache Parquet (C++/Java/Python), Arrow |
| Arrow | In-memory columnar format for fast computation | In-memory analytics, zero-copy interchange between engines | Apache Arrow (C++/Python/Java/Go) |
| Avro | Row-based storage format, schema evolution support | Log storage, streaming pipelines, serialization | Apache Avro (C++/Java/Python) |
| HDF5 | Hierarchical file format for storing large numerical datasets | Scientific computing, large multidimensional arrays | HDF5 (C/C++/Python), h5py |
| NPY | Simple binary format for NumPy arrays | Machine learning experiments, numerical data storage | NumPy (Python) |
| OCR output | Extracted text/layout from scanned documents; no single standard format | Document processing pipelines, OCR-based indexing | Tesseract, OpenCV, custom SDKs |
| Substrait | Relational algebra query plan serialization | Query plan interchange across engines, query federation | Substrait (Protobuf-based), Arrow integration |
Implementation Differences and Performance Notes
Parquet
- Columnar storage allows efficient compression and predicate pushdown.
- Implementations:
  - Apache Parquet C++ (Arrow integration): high performance, zero-copy reads; widely used by analytics tools such as pandas and Dask.
  - Apache Parquet Java: the reference implementation in the Hadoop/Spark ecosystem; JVM overhead can affect throughput.
  - Third-party libraries, e.g., fastparquet (Python), with different codec defaults and performance characteristics.
- Performance considerations:
  - Encoding and compression codecs (Snappy, ZSTD, GZIP) significantly affect I/O and CPU usage.
  - Arrow integration enables zero-copy memory access, reducing CPU load in analytics pipelines.
  - Multi-threaded reads and decompression are well supported in the C++/Arrow implementation, less so in the default Java implementation.
- Use case: Optimized for batch analytics; library choice can significantly affect read throughput and CPU consumption.
Arrow
- Focused on in-memory analytics, often as an intermediate format for data interchange.
- C++ core is fastest; Python/Java bindings may add overhead.
Avro
- Row-based, ideal for write-heavy streaming workloads.
- The Java implementation is the most mature; C++ and Python implementations exist but may lag on some schema evolution features.
- Performance differences are primarily in serialization/deserialization speed.
HDF5
- Hierarchical storage with rich metadata, suitable for large numerical datasets.
- C library core; the Python `h5py` package is a wrapper around it.
- Performance depends heavily on chunking and I/O patterns.
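The chunking point can be sketched with h5py (an assumption that it is installed; dataset names and chunk shapes are illustrative). Chunks are the unit of I/O and compression, so they should align with how the data will be read:

```python
import os
import tempfile

import h5py
import numpy as np

data = np.arange(10000, dtype=np.float64).reshape(100, 100)
path = os.path.join(tempfile.gettempdir(), "demo.h5")

with h5py.File(path, "w") as f:
    # Chunks of whole rows (10, 100) suit row-wise reads; gzip compression
    # is applied per chunk.
    dset = f.create_dataset("grid", data=data, chunks=(10, 100),
                            compression="gzip")
    dset.attrs["units"] = "meters"  # metadata lives alongside the data

with h5py.File(path, "r") as f:
    row = f["grid"][0]               # touches only chunks covering row 0
    units = f["grid"].attrs["units"]
print(row[:3], units)
```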
NPY
- Simple, fast binary storage for NumPy arrays.
- No compression (the related NPZ container wraps NPY files in a zip archive); best suited to small and medium arrays.
- Mainly used for machine learning experiments, not production batch storage.
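A minimal NPY round trip with NumPy; an in-memory buffer stands in for a file here:

```python
import io

import numpy as np

arr = np.random.default_rng(0).normal(size=(4, 3))

buf = io.BytesIO()
np.save(buf, arr)          # small header + raw array bytes, no compression
buf.seek(0)
restored = np.load(buf)

print(np.array_equal(arr, restored))  # True
```

For multiple arrays or when compression matters, `np.savez_compressed` produces an NPZ archive instead.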
OCR
- Application-specific; extraction accuracy, speed, and output format vary by library.
- No standardized storage; usage depends on document processing pipelines.
Substrait
- Protocol-level format for query plan serialization, not for raw data storage.
- Implementation differences are minor; the focus is on cross-engine compatibility and plan expressiveness.
Note: Due to ecosystem constraints, the library choices for each format are relatively fixed. If additional requirements arise, a custom implementation can be built to suit specific needs.