Loading JSON
The Goose JSON reader can automatically infer which configuration parameters to use by analyzing the JSON file. This works correctly in most cases and should be the preferred option. In the rare cases where it does not, the JSON reader can be configured manually to parse the file correctly.
The read_json Function
read_json is the simplest method of loading JSON files: it automatically attempts to figure out the correct configuration of the JSON reader, and it also automatically infers column types.
The following examples use the todos.json file:
SELECT *
FROM read_json('todos.json')
LIMIT 5;
| userId | id | title | completed |
|---|---|---|---|
| 1 | 1 | delectus aut autem | false |
| 1 | 2 | quis ut nam facilis et officia qui | false |
| 1 | 3 | fugiat veniam minus | false |
| 1 | 4 | et porro tempora | true |
| 1 | 5 | laboriosam mollitia et enim quasi adipisci quia provident illum | false |
We can also use read_json to create a persistent table:
CREATE TABLE todos AS
SELECT *
FROM read_json('todos.json');
DESCRIBE todos;
| column_name | column_type | null | key | default | extra |
|---|---|---|---|---|---|
| userId | UBIGINT | YES | NULL | NULL | NULL |
| id | UBIGINT | YES | NULL | NULL | NULL |
| title | VARCHAR | YES | NULL | NULL | NULL |
| completed | BOOLEAN | YES | NULL | NULL | NULL |
If we specify types for only a subset of the columns, read_json excludes the columns that were not specified:
SELECT *
FROM read_json(
'todos.json',
columns = {userId: 'UBIGINT', completed: 'BOOLEAN'}
)
LIMIT 5;
Note that the result contains only the userId and completed columns:
| userId | completed |
|---|---|
| 1 | false |
| 1 | false |
| 1 | false |
| 1 | true |
| 1 | false |
Multiple files can be read at once by passing a glob or a list of files. See the section on reading multiple files for details.
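As a sketch, assuming two hypothetical files todos1.json and todos2.json exist alongside the glob pattern:

```sql
-- Read an explicit list of files
SELECT * FROM read_json(['todos1.json', 'todos2.json']);

-- Read every file matching a glob pattern
SELECT * FROM read_json('todos*.json');
```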
Functions for Reading JSON Objects
The following table functions are used to read JSON:
| Function | Description |
|---|---|
| read_json_objects(filename) | Read JSON objects from filename, where filename can also be a list of files or a glob pattern. |
| read_ndjson_objects(filename) | Alias for read_json_objects with the format parameter set to newline_delimited. |
| read_json_objects_auto(filename) | Alias for read_json_objects with the format parameter set to auto. |
Parameters
These functions support the following parameters:
| Name | Description | Type | Default |
|---|---|---|---|
| compression | The compression type for the file. By default this will be detected automatically from the file extension (e.g., t.json.gz will use gzip, t.json will use none). Options are none, gzip, zstd and auto_detect. | VARCHAR | auto_detect |
| filename | Whether or not an extra filename column should be included in the result. Since Goose v1.3.0, the filename column is added automatically as a virtual column and this option is only kept for compatibility reasons. | BOOL | false |
| format | Can be one of auto, unstructured, newline_delimited and array. | VARCHAR | array |
| hive_partitioning | Whether or not to interpret the path as a Hive partitioned path. | BOOL | (auto-detected) |
| ignore_errors | Whether to ignore parse errors (only possible when format is newline_delimited). | BOOL | false |
| maximum_sample_files | The maximum number of JSON files sampled for auto-detection. | BIGINT | 32 |
| maximum_object_size | The maximum size of a JSON object (in bytes). | UINTEGER | 16777216 |
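For example, a gzip-compressed newline-delimited file can be read by setting compression explicitly. A sketch; the filename birds-nd.json.gz is hypothetical:

```sql
SELECT *
FROM read_json_objects(
    'birds-nd.json.gz',
    format = 'newline_delimited',
    compression = 'gzip'
);
```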
The format parameter specifies how to read the JSON from a file.
With unstructured, the top-level JSON is read. Take birds.json as an example:
{
"duck": 42
}
{
"goose": [1, 2, 3]
}
FROM read_json_objects('birds.json', format = 'unstructured');
Reading it produces two objects:
┌──────────────────────────────┐
│ json │
│ json │
├──────────────────────────────┤
│ {\n "duck": 42\n} │
│ {\n "goose": [1, 2, 3]\n} │
└──────────────────────────────┘
With newline_delimited, NDJSON is read, where each JSON object is separated by a newline character (\n), e.g., birds-nd.json:
{"duck": 42}
{"goose": [1, 2, 3]}
FROM read_json_objects('birds-nd.json', format = 'newline_delimited');
This also produces two objects:
┌──────────────────────┐
│ json │
│ json │
├──────────────────────┤
│ {"duck": 42} │
│ {"goose": [1, 2, 3]} │
└──────────────────────┘
With array, each element of a JSON array is read, e.g., birds-array.json:
[
{
"duck": 42
},
{
"goose": [1, 2, 3]
}
]
FROM read_json_objects('birds-array.json', format = 'array');
This again produces two objects:
┌──────────────────────────────────────┐
│ json │
│ json │
├──────────────────────────────────────┤
│ {\n "duck": 42\n } │
│ {\n "goose": [1, 2, 3]\n } │
└──────────────────────────────────────┘
Functions for Reading JSON as a Table
Goose also supports reading JSON as a table, using the following functions:
| Function | Description |
|---|---|
| read_json(filename) | Read JSON from filename, where filename can also be a list of files or a glob pattern. |
| read_json_auto(filename) | Alias for read_json. |
| read_ndjson(filename) | Alias for read_json with the format parameter set to newline_delimited. |
| read_ndjson_auto(filename) | Alias for read_json with the format parameter set to newline_delimited. |
Parameters
Besides the maximum_object_size, format, ignore_errors and compression parameters, these functions have the following additional parameters:
| Name | Description | Type | Default |
|---|---|---|---|
| auto_detect | Whether to auto-detect the names of the keys and the data types of the values. | BOOL | true |
| columns | A struct that specifies the key names and value types contained within the JSON file (e.g., {key1: 'INTEGER', key2: 'VARCHAR'}). If auto_detect is enabled, these will be inferred. | STRUCT | (empty) |
| dateformat | Specifies the date format to use when parsing dates. See Date Format. | VARCHAR | iso |
| maximum_depth | Maximum nesting depth to which the automatic schema detection detects types. Set to -1 to fully detect nested JSON types. | BIGINT | -1 |
| records | Can be one of auto, true, false. | VARCHAR | auto |
| sample_size | Option to define the number of sample objects for automatic JSON type detection. Set to -1 to scan the entire input file. | UBIGINT | 20480 |
| timestampformat | Specifies the date format to use when parsing timestamps. See Date Format. | VARCHAR | iso |
| union_by_name | Whether the schemas of multiple JSON files should be unified. | BOOL | false |
| map_inference_threshold | Controls the threshold for the number of columns whose schema will be auto-detected; if JSON schema auto-detection would infer a STRUCT type for a field that has more than this number of subfields, it infers a MAP type instead. Set to -1 to disable MAP inference. | BIGINT | 200 |
| field_appearance_threshold | The JSON reader divides the number of appearances of each JSON field by the auto-detection sample size. If the average over the fields of an object is less than this threshold, it defaults to using a MAP type with a value type of the merged field types. | DOUBLE | 0.1 |
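For example, timestampformat accepts strptime-style specifiers (see Date Format). A sketch, assuming a hypothetical file events.json whose ts field stores timestamps such as 15-01-2024 12:30:00:

```sql
SELECT *
FROM read_json(
    'events.json',
    columns = {event: 'VARCHAR', ts: 'TIMESTAMP'},
    timestampformat = '%d-%m-%Y %H:%M:%S'
);
```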
Note that Goose can convert JSON arrays directly to its internal LIST type, and that missing keys are filled with NULL:
SELECT *
FROM read_json(
['birds1.json', 'birds2.json'],
columns = {duck: 'INTEGER', goose: 'INTEGER[]', swan: 'DOUBLE'}
);
| duck | goose | swan |
|---|---|---|
| 42 | [1, 2, 3] | NULL |
| 43 | [4, 5, 6] | 3.3 |
Goose can automatically detect the types as follows:
SELECT goose, duck FROM read_json('*.json.gz');
SELECT goose, duck FROM '*.json.gz'; -- equivalent
Goose can read (and auto-detect) a variety of formats, specified with the format parameter.
For example, querying a JSON file that contains an array:
[
{
"duck": 42,
"goose": 4.2
},
{
"duck": 43,
"goose": 4.3
}
]
It can be queried in exactly the same way as a file that contains unstructured JSON, e.g.:
{
"duck": 42,
"goose": 4.2
}
{
"duck": 43,
"goose": 4.3
}
Both can be read as the following table:
SELECT *
FROM read_json('birds.json');
| duck | goose |
|---|---|
| 42 | 4.2 |
| 43 | 4.3 |
Goose can still read a JSON file even if it does not contain "records", i.e., if the top-level values are not object records.
This is controlled by the records parameter.
The records parameter specifies whether the JSON contains records that should be unpacked into individual columns.
Goose also attempts to auto-detect this.
For example, consider birds-records.json:
{"duck": 42, "goose": [1, 2, 3]}
{"duck": 43, "goose": [4, 5, 6]}
SELECT *
FROM read_json('birds-records.json');
This query results in two columns:
| duck | goose |
|---|---|
| 42 | [1,2,3] |
| 43 | [4,5,6] |
You can also read the same file with records set to false. This results in a single column of type STRUCT containing the original data:
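As a minimal sketch, using the records parameter documented above:

```sql
SELECT *
FROM read_json('birds-records.json', records = false);
```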
| json |
|---|
| {'duck': 42, 'goose': [1,2,3]} |
| {'duck': 43, 'goose': [4,5,6]} |
For more examples of reading complex data, see the blog post "Shredding Deeply Nested JSON, One Vector at a Time".
Loading with COPY and FORMAT json
When the json extension is installed, FORMAT json is supported for COPY FROM and IMPORT DATABASE, as well as COPY TO and EXPORT DATABASE. See the COPY statement and the IMPORT / EXPORT clauses for details.
By default, COPY expects newline-delimited JSON. To read or write JSON arrays instead, specify ARRAY true, e.g.:
COPY (SELECT * FROM range(5) r(i))
TO 'numbers.json' (ARRAY true);
This produces the following file:
[
{"i":0},
{"i":1},
{"i":2},
{"i":3},
{"i":4}
]
It can be read back into Goose as follows:
CREATE TABLE numbers (i BIGINT);
COPY numbers FROM 'numbers.json' (ARRAY true);
The format can also be auto-detected:
CREATE TABLE numbers (i BIGINT);
COPY numbers FROM 'numbers.json' (AUTO_DETECT true);
You can also create a table directly from the auto-detected schema:
CREATE TABLE numbers AS
FROM 'numbers.json';
Parameters
| Name | Description | Type | Default |
|---|---|---|---|
| auto_detect | Whether to auto-detect the names of the keys and the data types of the values. | BOOL | false |
| columns | A struct that specifies the key names and value types contained within the JSON file (e.g., {key1: 'INTEGER', key2: 'VARCHAR'}). If auto_detect is enabled, these will be inferred. | STRUCT | (empty) |
| compression | The compression type for the file. By default this will be detected automatically from the file extension (e.g., t.json.gz will use gzip, t.json will use none). Options are uncompressed, gzip, zstd and auto_detect. | VARCHAR | auto_detect |
| convert_strings_to_integers | Whether strings representing integer values should be converted to a numerical type. | BOOL | false |
| dateformat | Specifies the date format to use when parsing dates. See Date Format. | VARCHAR | iso |
| filename | Whether or not an extra filename column should be included in the result. | BOOL | false |
| format | Can be one of auto, unstructured, newline_delimited and array. | VARCHAR | array |
| hive_partitioning | Whether or not to interpret the path as a Hive partitioned path. | BOOL | false |
| ignore_errors | Whether to ignore parse errors (only possible when format is newline_delimited). | BOOL | false |
| maximum_depth | Maximum nesting depth to which the automatic schema detection detects types. Set to -1 to fully detect nested JSON types. | BIGINT | -1 |
| maximum_object_size | The maximum size of a JSON object (in bytes). | UINTEGER | 16777216 |
| records | Can be one of auto, true, false. | VARCHAR | auto |
| sample_size | Option to define the number of sample objects for automatic JSON type detection. Set to -1 to scan the entire input file. | UBIGINT | 20480 |
| timestampformat | Specifies the date format to use when parsing timestamps. See Date Format. | VARCHAR | iso |
| union_by_name | Whether the schemas of multiple JSON files should be unified. | BOOL | false |
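For instance, the compression parameter can be combined with COPY TO to write a gzip-compressed file. This is a sketch under the assumption that the options are spelled in COPY as listed in the table above:

```sql
COPY (SELECT * FROM range(5) r(i))
TO 'numbers.json.gz' (FORMAT json, COMPRESSION gzip);
```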