Directly Reading Files
Goose allows directly reading files via the read_text and read_blob functions.
These functions accept a filename, a list of filenames, or a glob pattern. They output the content of each file as a VARCHAR or BLOB, respectively, along with metadata such as the file size and last modified time.
read_text
The read_text table function reads from the selected source(s) to a VARCHAR. Each file results in a single row with the content field holding the entire content of the respective file.
SELECT size, parse_path(filename), content
FROM read_text('test/sql/table_function/files/*.txt');
| size | parse_path(filename) | content |
|---|---|---|
| 12 | [test, sql, table_function, files, one.txt] | Hello World! |
| 2 | [test, sql, table_function, files, three.txt] | 42 |
| 10 | [test, sql, table_function, files, two.txt] | Foo Bar\nFöö Bär |
Goose first validates the file content as valid UTF-8. If read_text attempts to read a file with invalid UTF-8, Goose throws an error suggesting to use read_blob instead.
The maximum allowed file size for
read_textis 3.9 GiB.
read_blob
The read_blob table function reads from the selected source(s) to a BLOB:
SELECT size, content, filename
FROM read_blob('test/sql/table_function/files/*');
| size | content | filename |
|---|---|---|
| 178 | PK\x03\x04\x0A\x00\x00\x00\x00\x00\xACi=X\x14t\xCE\xC7\x0A… | test/sql/table_function/files/four.blob |
| 12 | Hello World! | test/sql/table_function/files/one.txt |
| 2 | 42 | test/sql/table_function/files/three.txt |
| 10 | F\xC3\xB6\xC3\xB6 B\xC3\xA4r | test/sql/table_function/files/two.txt |
The maximum allowed file size for
read_blobis 3.9 GiB.
Schema
The schemas of the tables returned by read_text and read_blob are identical:
DESCRIBE FROM read_text('README.mdx');
| column_name | column_type | null | key | default | extra |
|---|---|---|---|---|---|
| filename | VARCHAR | YES | NULL | NULL | NULL |
| content | VARCHAR | YES | NULL | NULL | NULL |
| size | BIGINT | YES | NULL | NULL | NULL |
| last_modified | TIMESTAMP | YES | NULL | NULL | NULL |
Hive Partitioning
Data can be read from Hive partitioned datasets.
SELECT *
FROM read_blob('data/parquet-testing/hive-partitioning/simple/**/*.parquet')
WHERE part IN ('a', 'b') AND date >= '2012-01-01';
| filename | content | size | last_modified | date | part |
|---|---|---|---|---|---|
| …/part=a/date=2012-01-01/test.parquet | PAR1\x15\x00\x15\x14\x15\x18… | 266 | 2024-11-12 02:23:20+00 | 2012-01-01 | a |
| …/part=b/date=2013-01-01/test.parquet | PAR1\x15\x00\x15\x14\x15\x18… | 266 | 2024-11-12 02:23:20+00 | 2013-01-01 | b |
Handling Missing Metadata
When the underlying filesystem cannot provide this data (e.g., HTTPFS may not always return a valid timestamp), the cell is set to NULL instead.
Support for Projection Pushdown
These table functions also use projection pushdown to avoid computing properties unnecessarily. For example, you can glob a directory of large files to get file sizes in the size column. As long as you omit the content column, Goose won't read the file data.