
Loading JSON

The Goose JSON reader can automatically infer which configuration flags to use by analyzing the JSON file. This works correctly in most situations and should be the first option attempted. In the rare cases where the JSON reader cannot determine the correct configuration, the reader can be configured manually to parse the JSON file correctly.

The read_json Function

The read_json function is the simplest method of loading JSON files: it automatically attempts to determine the correct configuration of the JSON reader, and it deduces the types of the columns. In the following example, we use the todos.json file:

SELECT *
FROM read_json('todos.json')
LIMIT 5;
| userId | id | title | completed |
|--------|----|-------|-----------|
| 1 | 1 | delectus aut autem | false |
| 1 | 2 | quis ut nam facilis et officia qui | false |
| 1 | 3 | fugiat veniam minus | false |
| 1 | 4 | et porro tempora | true |
| 1 | 5 | laboriosam mollitia et enim quasi adipisci quia provident illum | false |

We can use read_json to create a persistent table as well:

CREATE TABLE todos AS
SELECT *
FROM read_json('todos.json');
DESCRIBE todos;
| column_name | column_type | null | key | default | extra |
|-------------|-------------|------|-----|---------|-------|
| userId | UBIGINT | YES | NULL | NULL | NULL |
| id | UBIGINT | YES | NULL | NULL | NULL |
| title | VARCHAR | YES | NULL | NULL | NULL |
| completed | BOOLEAN | YES | NULL | NULL | NULL |

If we specify types for a subset of columns, read_json excludes columns that we don't specify:

SELECT *
FROM read_json(
    'todos.json',
    columns = {userId: 'UBIGINT', completed: 'BOOLEAN'}
)
LIMIT 5;

Note that only the userId and completed columns are shown:

| userId | completed |
|--------|-----------|
| 1 | false |
| 1 | false |
| 1 | false |
| 1 | true |
| 1 | false |

Multiple files can be read at once by providing a glob or a list of files. Refer to the multiple files section for more information.
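For example, assuming additional files such as todos2.json exist (the file names here are illustrative), both forms read all matching files into a single result:

```sql
-- read all files matching a glob pattern
SELECT * FROM read_json('todos*.json');

-- read an explicit list of files
SELECT * FROM read_json(['todos.json', 'todos2.json']);
```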

Functions for Reading JSON Objects

The following table functions are used to read JSON:

| Function | Description |
|----------|-------------|
| read_json_objects(filename) | Read a JSON object from filename, where filename can also be a list of files or a glob pattern. |
| read_ndjson_objects(filename) | Alias for read_json_objects with the parameter format set to newline_delimited. |
| read_json_objects_auto(filename) | Alias for read_json_objects with the parameter format set to auto. |

Parameters

These functions have the following parameters:

| Name | Description | Type | Default |
|------|-------------|------|---------|
| compression | The compression type for the file. By default this will be detected automatically from the file extension (e.g., t.json.gz will use gzip, t.json will use none). Options are none, gzip, zstd, and auto_detect. | VARCHAR | auto_detect |
| filename | Whether or not an extra filename column should be included in the result. Since Goose v1.3.0, the filename column is added automatically as a virtual column, and this option is kept only for compatibility reasons. | BOOL | false |
| format | Can be one of auto, unstructured, newline_delimited, and array. | VARCHAR | array |
| hive_partitioning | Whether or not to interpret the path as a Hive partitioned path. | BOOL | (auto-detected) |
| ignore_errors | Whether to ignore parse errors (only possible when format is newline_delimited). | BOOL | false |
| maximum_sample_files | The maximum number of JSON files sampled for auto-detection. | BIGINT | 32 |
| maximum_object_size | The maximum size of a JSON object (in bytes). | UINTEGER | 16777216 |
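For instance, a gzip-compressed NDJSON file can be read by combining the format and compression parameters (the file name is illustrative; the compression would normally also be inferred from the .gz extension):

```sql
FROM read_json_objects(
    'birds-nd.json.gz',
    format = 'newline_delimited',
    compression = 'gzip'
);
```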

The format parameter specifies how to read the JSON from a file. With unstructured, the top-level JSON is read, e.g., for birds.json:

{
  "duck": 42
}
{
  "goose": [1, 2, 3]
}
FROM read_json_objects('birds.json', format = 'unstructured');

will result in two objects being read:

┌──────────────────────────────┐
│ json │
│ json │
├──────────────────────────────┤
│ {\n "duck": 42\n} │
│ {\n "goose": [1, 2, 3]\n} │
└──────────────────────────────┘

With newline_delimited, NDJSON is read, where each JSON is separated by a newline (\n), e.g., for birds-nd.json:

{"duck": 42}
{"goose": [1, 2, 3]}
FROM read_json_objects('birds-nd.json', format = 'newline_delimited');

will also result in two objects being read:

┌──────────────────────┐
│ json │
│ json │
├──────────────────────┤
│ {"duck": 42} │
│ {"goose": [1, 2, 3]} │
└──────────────────────┘

With array, each array element is read, e.g., for birds-array.json:

[
  {
    "duck": 42
  },
  {
    "goose": [1, 2, 3]
  }
]
FROM read_json_objects('birds-array.json', format = 'array');

will again result in two objects being read:

┌──────────────────────────────────────┐
│ json │
│ json │
├──────────────────────────────────────┤
│ {\n "duck": 42\n } │
│ {\n "goose": [1, 2, 3]\n } │
└──────────────────────────────────────┘

Functions for Reading JSON as a Table

Goose also supports reading JSON as a table, using the following functions:

| Function | Description |
|----------|-------------|
| read_json(filename) | Read JSON from filename, where filename can also be a list of files or a glob pattern. |
| read_json_auto(filename) | Alias for read_json. |
| read_ndjson(filename) | Alias for read_json with parameter format set to newline_delimited. |
| read_ndjson_auto(filename) | Alias for read_json with parameter format set to newline_delimited. |

Parameters

Besides maximum_object_size, format, ignore_errors, and compression, these functions accept the following additional parameters:

| Name | Description | Type | Default |
|------|-------------|------|---------|
| auto_detect | Whether to automatically detect the key names and the data types of the values | BOOL | true |
| columns | A struct that specifies the key names and value types contained within the JSON file (e.g., {key1: 'INTEGER', key2: 'VARCHAR'}). If auto_detect is enabled, these will be inferred | STRUCT | (empty) |
| dateformat | Specifies the date format to use when parsing dates. See Date Format | VARCHAR | iso |
| maximum_depth | Maximum nesting depth to which the automatic schema detection detects types. Set to -1 to fully detect nested JSON types | BIGINT | -1 |
| records | Can be one of auto, true, false | VARCHAR | auto |
| sample_size | The number of sample objects to use for automatic JSON type detection. Set to -1 to scan the entire input file | UBIGINT | 20480 |
| timestampformat | Specifies the timestamp format to use when parsing timestamps. See Date Format | VARCHAR | iso |
| union_by_name | Whether the schemas of multiple JSON files should be unified | BOOL | false |
| map_inference_threshold | Controls the threshold for MAP inference: if JSON schema auto-detection would infer a STRUCT type for a field that has more than this many subfields, it infers a MAP type instead. Set to -1 to disable MAP inference | BIGINT | 200 |
| field_appearance_threshold | The JSON reader divides the number of appearances of each JSON field by the auto-detection sample size. If the average over the fields of an object is less than this threshold, a MAP type with the merged field types as its value type is used instead of a STRUCT | DOUBLE | 0.1 |
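As a sketch of how these parameters combine, the following reads a file containing non-ISO timestamps and scans the whole file for type detection (the file name and format string are illustrative):

```sql
SELECT *
FROM read_json(
    'events.json',
    timestampformat = '%d/%m/%Y %H:%M:%S',
    sample_size = -1
);
```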

Note that Goose can convert JSON arrays directly to its internal LIST type, and missing keys become NULL:

SELECT *
FROM read_json(
    ['birds1.json', 'birds2.json'],
    columns = {duck: 'INTEGER', goose: 'INTEGER[]', swan: 'DOUBLE'}
);
| duck | goose | swan |
|------|-------|------|
| 42 | [1, 2, 3] | NULL |
| 43 | [4, 5, 6] | 3.3 |

Goose can automatically detect the types like so:

SELECT goose, duck FROM read_json('*.json.gz');
SELECT goose, duck FROM '*.json.gz'; -- equivalent

Goose can read (and auto-detect) a variety of formats, specified with the format parameter. A JSON file that contains an array, e.g.:

[
  {
    "duck": 42,
    "goose": 4.2
  },
  {
    "duck": 43,
    "goose": 4.3
  }
]

can be queried in exactly the same way as a JSON file that contains unstructured JSON, e.g.:

{
  "duck": 42,
  "goose": 4.2
}
{
  "duck": 43,
  "goose": 4.3
}

Both can be read into the following table:

SELECT *
FROM read_json('birds.json');

| duck | goose |
|------|-------|
| 42 | 4.2 |
| 43 | 4.3 |

If your JSON file does not contain “records”, i.e., it contains JSON values other than objects, Goose can still read it. This is controlled with the records parameter, which specifies whether the JSON contains records that should be unpacked into individual columns; Goose also attempts to auto-detect this. For example, take the following file, birds-records.json:

{"duck": 42, "goose": [1, 2, 3]}
{"duck": 43, "goose": [4, 5, 6]}
SELECT *
FROM read_json('birds-records.json');

The query results in two columns:

| duck | goose |
|------|-------|
| 42 | [1, 2, 3] |
| 43 | [4, 5, 6] |

You can read the same file with records set to false to get a single column, which is a STRUCT containing the data:

SELECT *
FROM read_json('birds-records.json', records = false);

| json |
|------|
| {'duck': 42, 'goose': [1, 2, 3]} |
| {'duck': 43, 'goose': [4, 5, 6]} |

For additional examples reading more complex data, please see the “Shredding Deeply Nested JSON, One Vector at a Time” blog post.

Loading with the COPY Statement Using FORMAT json

When the json extension is installed, FORMAT json is supported for COPY FROM, IMPORT DATABASE, as well as COPY TO and EXPORT DATABASE. See the COPY statement and the IMPORT / EXPORT clauses.

By default, COPY expects newline-delimited JSON. If you prefer copying data to/from a JSON array, you can specify ARRAY true, e.g.,

COPY (SELECT * FROM range(5) r(i))
TO 'numbers.json' (ARRAY true);

will create the following file:

[
{"i":0},
{"i":1},
{"i":2},
{"i":3},
{"i":4}
]

This can be read back into Goose as follows:

CREATE TABLE numbers (i BIGINT);
COPY numbers FROM 'numbers.json' (ARRAY true);

The format can be detected automatically like so:

CREATE TABLE numbers (i BIGINT);
COPY numbers FROM 'numbers.json' (AUTO_DETECT true);

We can also create a table from the auto-detected schema:

CREATE TABLE numbers AS
FROM 'numbers.json';

Parameters

| Name | Description | Type | Default |
|------|-------------|------|---------|
| auto_detect | Whether to automatically detect the key names and the data types of the values | BOOL | false |
| columns | A struct that specifies the key names and value types contained within the JSON file (e.g., {key1: 'INTEGER', key2: 'VARCHAR'}). If auto_detect is enabled, these will be inferred | STRUCT | (empty) |
| compression | The compression type for the file. By default this will be detected automatically from the file extension (e.g., t.json.gz will use gzip, t.json will use none). Options are uncompressed, gzip, zstd, and auto_detect | VARCHAR | auto_detect |
| convert_strings_to_integers | Whether strings representing integer values should be converted to a numerical type | BOOL | false |
| dateformat | Specifies the date format to use when parsing dates. See Date Format | VARCHAR | iso |
| filename | Whether or not an extra filename column should be included in the result | BOOL | false |
| format | Can be one of auto, unstructured, newline_delimited, array | VARCHAR | array |
| hive_partitioning | Whether or not to interpret the path as a Hive partitioned path | BOOL | false |
| ignore_errors | Whether to ignore parse errors (only possible when format is newline_delimited) | BOOL | false |
| maximum_depth | Maximum nesting depth to which the automatic schema detection detects types. Set to -1 to fully detect nested JSON types | BIGINT | -1 |
| maximum_object_size | The maximum size of a JSON object (in bytes) | UINTEGER | 16777216 |
| records | Can be one of auto, true, false | VARCHAR | auto |
| sample_size | The number of sample objects to use for automatic JSON type detection. Set to -1 to scan the entire input file | UBIGINT | 20480 |
| timestampformat | Specifies the timestamp format to use when parsing timestamps. See Date Format | VARCHAR | iso |
| union_by_name | Whether the schemas of multiple JSON files should be unified | BOOL | false |
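For example, the dateformat parameter from the table above can be applied when writing (the file name and format string are illustrative):

```sql
COPY (SELECT DATE '2030-01-02' AS d)
TO 'dates.json' (FORMAT json, DATEFORMAT '%d/%m/%Y');
```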