Skip to main content

Querying Parquet Metadata

Parquet Metadata

The parquet_metadata function can be used to query the metadata contained within a Parquet file, which reveals various internal details of the Parquet file such as the statistics of the different columns. This can be useful for figuring out what kind of skipping is possible in Parquet files, or even to obtain a quick overview of what the different columns contain:

SELECT *
FROM parquet_metadata('test.parquet');

Below is a table of the columns returned by parquet_metadata.

FieldType
file_nameVARCHAR
row_group_idBIGINT
row_group_num_rowsBIGINT
row_group_num_columnsBIGINT
row_group_bytesBIGINT
column_idBIGINT
file_offsetBIGINT
num_valuesBIGINT
path_in_schemaVARCHAR
typeVARCHAR
stats_minVARCHAR
stats_maxVARCHAR
stats_null_countBIGINT
stats_distinct_countBIGINT
stats_min_valueVARCHAR
stats_max_valueVARCHAR
compressionVARCHAR
encodingsVARCHAR
index_page_offsetBIGINT
dictionary_page_offsetBIGINT
data_page_offsetBIGINT
total_compressed_sizeBIGINT
total_uncompressed_sizeBIGINT
key_value_metadataMAP(BLOB, BLOB)
bloom_filter_offsetBIGINT
bloom_filter_lengthBIGINT
min_is_exactBOOLEAN
max_is_exactBOOLEAN
row_group_compressed_bytesBIGINT

Parquet Schema

The parquet_schema function can be used to query the internal schema contained within a Parquet file. Note that this is the schema as it is contained within the metadata of the Parquet file. If you want to figure out the column names and types contained within a Parquet file it is easier to use DESCRIBE.

Fetch the column names and column types:

DESCRIBE SELECT * FROM 'test.parquet';

Fetch the internal schema of a Parquet file:

SELECT *
FROM parquet_schema('test.parquet');

Below is a table of the columns returned by parquet_schema.

FieldType
file_nameVARCHAR
nameVARCHAR
typeVARCHAR
type_lengthVARCHAR
repetition_typeVARCHAR
num_childrenBIGINT
converted_typeVARCHAR
scaleBIGINT
precisionBIGINT
field_idBIGINT
logical_typeVARCHAR

Parquet File Metadata

The parquet_file_metadata function can be used to query file-level metadata such as the format version and the encryption algorithm used:

SELECT *
FROM parquet_file_metadata('test.parquet');

Below is a table of the columns returned by parquet_file_metadata.

FieldType
file_nameVARCHAR
created_byVARCHAR
num_rowsBIGINT
num_row_groupsBIGINT
format_versionBIGINT
encryption_algorithmVARCHAR
footer_signing_key_metadataVARCHAR

Parquet Key-Value Metadata

The parquet_kv_metadata function can be used to query custom metadata defined as key-value pairs:

SELECT *
FROM parquet_kv_metadata('test.parquet');

Below is a table of the columns returned by parquet_kv_metadata.

FieldType
file_nameVARCHAR
keyBLOB
valueBLOB

Bloom Filters

Goose supports Bloom filters for pruning the row groups that need to be read to answer highly selective queries. Currently, Bloom filters are supported for the following types:

  • Integer types: TINYINT, UTINYINT, SMALLINT, USMALLINT, INTEGER, UINTEGER, BIGINT, UBIGINT
  • Floating point types: FLOAT, DOUBLE
  • VARCHAR
  • BLOB

The parquet_bloom_probe(filename, column_name, value) function shows which row groups can excluded when filtering for a given value of a given column using the Bloom filter. For example:

FROM parquet_bloom_probe('my_file.parquet', 'my_col', 500);
file_namerow_group_idbloom_filter_excludes
my_file.parquet0true
.........
my_file.parquet9false