跳到主要内容

加载 JSON

Goose JSON 读取器可通过分析 JSON 文件自动推断应使用的配置参数。大多数情况下这都能正确工作,并应作为首选方案。少数情况下若无法正确推断,也可以手动配置 JSON 读取器来正确解析文件。

read_json 函数

read_json 是加载 JSON 文件最简单的方法:它会自动尝试推断 JSON 读取器的正确配置,也会自动推断列类型。 下面示例使用 todos.json 文件:

SELECT *
FROM read_json('todos.json')
LIMIT 5;
userIdidtitlecompleted
11delectus aut autemfalse
12quis ut nam facilis et officia quifalse
13fugiat veniam minusfalse
14et porro temporatrue
15laboriosam mollitia et enim quasi adipisci quia provident illumfalse

我们也可以使用 read_json 创建持久化表:

CREATE TABLE todos AS
SELECT *
FROM read_json('todos.json');
DESCRIBE todos;
column_namecolumn_typenullkeydefaultextra
userIdUBIGINTYESNULLNULLNULL
idUBIGINTYESNULLNULLNULL
titleVARCHARYESNULLNULLNULL
completedBOOLEANYESNULLNULLNULL

如果仅为部分列指定类型,read_json 会排除未指定的列:

SELECT *
FROM read_json(
'todos.json',
columns = {userId: 'UBIGINT', completed: 'BOOLEAN'}
)
LIMIT 5;

注意结果中只包含 userIdcompleted 两列:

userIdcompleted
1false
1false
1false
1true
1false

通过传入 glob 或文件列表,可以一次读取多个文件。详见多文件章节

读取 JSON 对象的函数

以下表函数用于读取 JSON:

函数说明
read_json_objects(filename)Read a JSON object from filename, where filename can also be a list of files or a glob pattern.
read_ndjson_objects(filename)Alias for read_json_objects with the parameter format set to newline_delimited.
read_json_objects_auto(filename)Alias for read_json_objects with the parameter format set to auto .

参数

这些函数支持以下参数:

名称说明类型默认值
compressionThe compression type for the file. By default this will be detected automatically from the file extension (e.g., t.json.gz will use gzip, t.json will use none). Options are none, gzip, zstd and auto_detect.VARCHARauto_detect
filenameWhether or not an extra filename column should be included in the result. Since Goose v1.3.0, the filename column is added automatically as a virtual column and this option is only kept for compatibility reasons.BOOLfalse
formatCan be one of auto, unstructured, newline_delimited and array.VARCHARarray
hive_partitioningWhether or not to interpret the path as a Hive partitioned path.BOOL(auto-detected)
ignore_errorsWhether to ignore parse errors (only possible when format is newline_delimited).BOOLfalse
maximum_sample_filesThe maximum number of JSON files sampled for auto-detection.BIGINT32
maximum_object_sizeThe maximum size of a JSON object (in bytes).UINTEGER16777216

format 参数用于指定如何从文件读取 JSON。 当使用 unstructured 时,会读取顶层 JSON。以 birds.json 为例:

{
"duck": 42
}
{
"goose": [1, 2, 3]
}
FROM read_json_objects('birds.json', format = 'unstructured');

将会读取出两个对象:

┌──────────────────────────────┐
│ json │
│ json │
├──────────────────────────────┤
│ {\n "duck": 42\n} │
│ {\n "goose": [1, 2, 3]\n} │
└──────────────────────────────┘

当使用 newline_delimited 时,会读取 NDJSON(每个 JSON 以换行 \n 分隔),例如 birds-nd.json

{"duck": 42}
{"goose": [1, 2, 3]}
FROM read_json_objects('birds-nd.json', format = 'newline_delimited');

同样会读取出两个对象:

┌──────────────────────┐
│ json │
│ json │
├──────────────────────┤
│ {"duck": 42} │
│ {"goose": [1, 2, 3]} │
└──────────────────────┘

当使用 array 时,会读取数组中的每个元素,例如 birds-array.json

[
{
"duck": 42
},
{
"goose": [1, 2, 3]
}
]
FROM read_json_objects('birds-array.json', format = 'array');

同样会读取出两个对象:

┌──────────────────────────────────────┐
│ json │
│ json │
├──────────────────────────────────────┤
│ {\n "duck": 42\n } │
│ {\n "goose": [1, 2, 3]\n } │
└──────────────────────────────────────┘

以表形式读取 JSON 的函数

Goose 也支持将 JSON 作为表读取,可使用以下函数:

函数说明
read_json(filename)Read JSON from filename, where filename can also be a list of files, or a glob pattern.
read_json_auto(filename)Alias for read_json.
read_ndjson(filename)Alias for read_json with parameter format set to newline_delimited.
read_ndjson_auto(filename)Alias for read_json with parameter format set to newline_delimited.

参数

maximum_object_sizeformatignore_errorscompression 外,这些函数还有以下附加参数:

名称说明类型默认值
auto_detectWhether to auto-detect the names of the keys and data types of the values automaticallyBOOLtrue
columnsA struct that specifies the key names and value types contained within the JSON file (e.g., {key1: 'INTEGER', key2: 'VARCHAR'}). If auto_detect is enabled these will be inferredSTRUCT(empty)
dateformatSpecifies the date format to use when parsing dates. See Date FormatVARCHARiso
maximum_depthMaximum nesting depth to which the automatic schema detection detects types. Set to -1 to fully detect nested JSON typesBIGINT-1
recordsCan be one of auto, true, falseVARCHARauto
sample_sizeOption to define number of sample objects for automatic JSON type detection. Set to -1 to scan the entire input fileUBIGINT20480
timestampformatSpecifies the date format to use when parsing timestamps. See Date FormatVARCHARiso
union_by_nameWhether the schema's of multiple JSON files should be unifiedBOOLfalse
map_inference_thresholdControls the threshold for number of columns whose schema will be auto-detected; if JSON schema auto-detection would infer a STRUCT type for a field that has more than this threshold number of subfields, it infers a MAP type instead. Set to -1 to disable MAP inference.BIGINT200
field_appearance_thresholdThe JSON reader divides the number of appearances of each JSON field by the auto-detection sample size. If the average over the fields of an object is less than this threshold, it will default to using a MAP type with value type of merged field types.DOUBLE0.1

注意,Goose 可将 JSON 数组直接转换为内部 LIST 类型,缺失键会填充为 NULL

SELECT *
FROM read_json(
['birds1.json', 'birds2.json'],
columns = {duck: 'INTEGER', goose: 'INTEGER[]', swan: 'DOUBLE'}
);
duckgooseswan
42[1, 2, 3]NULL
43[4, 5, 6]3.3

Goose 可按如下方式自动检测类型:

SELECT goose, duck FROM read_json('*.json.gz');
SELECT goose, duck FROM '*.json.gz'; -- equivalent

Goose 可通过 format 参数读取(并自动检测)多种格式。 例如,查询一个包含 array 的 JSON 文件:

[
{
"duck": 42,
"goose": 4.2
},
{
"duck": 43,
"goose": 4.3
}
]

它与包含 unstructured JSON 的文件查询方式完全相同,例如:

{
"duck": 42,
"goose": 4.2
}
{
"duck": 43,
"goose": 4.3
}

两者都可按下述方式读成表:

SELECT
FROM read_json('birds.json');
duckgoose
424.2
434.3

如果你的 JSON 文件不包含 “records”(即顶层不是对象记录),Goose 依然可以读取。 这由 records 参数控制。 records 用于指定 JSON 是否包含应展开为独立列的记录。 Goose 也会尝试自动检测该项。 例如,考虑 birds-records.json

{"duck": 42, "goose": [1, 2, 3]}
{"duck": 43, "goose": [4, 5, 6]}
SELECT *
FROM read_json('birds-records.json');

该查询会得到两列:

duckgoose
42[1,2,3]
43[4,5,6]

你也可以将 records 设为 false 读取同一文件,此时会得到单列 STRUCT(包含原始数据):

json
{'duck': 42, 'goose': [1,2,3]}
{'duck': 43, 'goose': [4,5,6]}

更多复杂数据读取示例请参阅博文 “Shredding Deeply Nested JSON, One Vector at a Time”

使用 COPYFORMAT json 加载

安装 json 扩展后,FORMAT json 可用于 COPY FROMIMPORT DATABASE,以及 COPY TOEXPORT DATABASE。详见 COPY 语句IMPORT / EXPORT 子句

默认情况下,COPY 期望输入为按行分隔 JSON。若希望读写 JSON 数组,可指定 ARRAY true,例如:

COPY (SELECT * FROM range(5) r(i))
TO 'numbers.json' (ARRAY true);

将生成如下文件:

[
{"i":0},
{"i":1},
{"i":2},
{"i":3},
{"i":4}
]

可按如下方式再读回 Goose:

CREATE TABLE numbers (i BIGINT);
COPY numbers FROM 'numbers.json' (ARRAY true);

也可以这样自动检测格式:

CREATE TABLE numbers (i BIGINT);
COPY numbers FROM 'numbers.json' (AUTO_DETECT true);

还可以基于自动检测出的 schema 直接建表:

CREATE TABLE numbers AS
FROM 'numbers.json';

参数

名称说明类型默认值
auto_detectWhether to auto-detect the names of the keys and data types of the values automaticallyBOOLfalse
columnsA struct that specifies the key names and value types contained within the JSON file (e.g., {key1: 'INTEGER', key2: 'VARCHAR'}). If auto_detect is enabled these will be inferredSTRUCT(empty)
compressionThe compression type for the file. By default this will be detected automatically from the file extension (e.g., t.json.gz will use gzip, t.json will use none). Options are uncompressed, gzip, zstd and auto_detect.VARCHARauto_detect
convert_strings_to_integersWhether strings representing integer values should be converted to a numerical type.BOOLfalse
dateformatSpecifies the date format to use when parsing dates. See Date FormatVARCHARiso
filenameWhether or not an extra filename column should be included in the result.BOOLfalse
formatCan be one of auto, unstructured, newline_delimited, arrayVARCHARarray
hive_partitioningWhether or not to interpret the path as a Hive partitioned path.BOOLfalse
ignore_errorsWhether to ignore parse errors (only possible when format is newline_delimited)BOOLfalse
maximum_depthMaximum nesting depth to which the automatic schema detection detects types. Set to -1 to fully detect nested JSON typesBIGINT-1
maximum_object_sizeThe maximum size of a JSON object (in bytes)UINTEGER16777216
recordsCan be one of auto, true, falseVARCHARrecords
sample_sizeOption to define number of sample objects for automatic JSON type detection. Set to -1 to scan the entire input fileUBIGINT20480
timestampformatSpecifies the date format to use when parsing timestamps. See Date FormatVARCHARiso
union_by_nameWhether the schema's of multiple JSON files should be unified.BOOLfalse