Loading JSON
The Goose JSON reader can automatically infer which configuration parameters to use by analyzing the JSON file. This works correctly in most cases and should be the preferred option. In the rare cases where it does not, the JSON reader can be configured manually to parse the file correctly.
The read_json Function
read_json is the simplest method of loading JSON files: it automatically attempts to figure out the correct configuration of the JSON reader, and it also automatically infers column types.
The following examples use the todos.json file:
SELECT *
FROM read_json('todos.json')
LIMIT 5;
| userId | id | title | completed |
|---|---|---|---|
| 1 | 1 | delectus aut autem | false |
| 1 | 2 | quis ut nam facilis et officia qui | false |
| 1 | 3 | fugiat veniam minus | false |
| 1 | 4 | et porro tempora | true |
| 1 | 5 | laboriosam mollitia et enim quasi adipisci quia provident illum | false |
We can also use read_json to create a persistent table:
CREATE TABLE todos AS
SELECT *
FROM read_json('todos.json');
DESCRIBE todos;
| column_name | column_type | null | key | default | extra |
|---|---|---|---|---|---|
| userId | UBIGINT | YES | NULL | NULL | NULL |
| id | UBIGINT | YES | NULL | NULL | NULL |
| title | VARCHAR | YES | NULL | NULL | NULL |
| completed | BOOLEAN | YES | NULL | NULL | NULL |
If we specify types for only a subset of the columns, read_json excludes the columns that were not specified:
SELECT *
FROM read_json(
'todos.json',
columns = {userId: 'UBIGINT', completed: 'BOOLEAN'}
)
LIMIT 5;
Note that the result contains only the userId and completed columns:
| userId | completed |
|---|---|
| 1 | false |
| 1 | false |
| 1 | false |
| 1 | true |
| 1 | false |
Multiple files can be read at once by passing a glob or a list of files. See the section on reading multiple files for details.
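As a sketch, assuming two hypothetical files todos1.json and todos2.json exist alongside the glob pattern:

```sql
-- Read an explicit list of files
SELECT * FROM read_json(['todos1.json', 'todos2.json']);

-- Read every file matching a glob pattern
SELECT * FROM read_json('todos*.json');
```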
Functions for Reading JSON Objects
The following table functions are used to read JSON:
| Function | Description |
|---|---|
| read_json_objects(filename) | Read JSON objects from filename, where filename can also be a list of files or a glob pattern. |
| read_ndjson_objects(filename) | Alias for read_json_objects with the format parameter set to newline_delimited. |
| read_json_objects_auto(filename) | Alias for read_json_objects with the format parameter set to auto. |
Parameters
These functions support the following parameters:
| Name | Description | Type | Default |
|---|---|---|---|
| compression | The compression type for the file. By default this will be detected automatically from the file extension (e.g., t.json.gz will use gzip, t.json will use none). Options are none, gzip, zstd and auto_detect. | VARCHAR | auto_detect |
| filename | Whether or not an extra filename column should be included in the result. Since Goose v1.3.0, the filename column is added automatically as a virtual column and this option is only kept for compatibility reasons. | BOOL | false |
| format | Can be one of auto, unstructured, newline_delimited and array. | VARCHAR | array |
| hive_partitioning | Whether or not to interpret the path as a Hive partitioned path. | BOOL | (auto-detected) |
| ignore_errors | Whether to ignore parse errors (only possible when format is newline_delimited). | BOOL | false |
| maximum_sample_files | The maximum number of JSON files sampled for auto-detection. | BIGINT | 32 |
| maximum_object_size | The maximum size of a JSON object (in bytes). | UINTEGER | 16777216 |
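For example, a gzip-compressed newline-delimited file can be read by setting compression explicitly. A sketch; the filename birds-nd.json.gz is hypothetical:

```sql
SELECT *
FROM read_json_objects(
    'birds-nd.json.gz',
    format = 'newline_delimited',
    compression = 'gzip'
);
```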
The format parameter specifies how to read the JSON from a file.
With unstructured, the top-level JSON is read. Take birds.json as an example:
{
"duck": 42
}
{
"goose": [1, 2, 3]
}
FROM read_json_objects('birds.json', format = 'unstructured');
Reading it produces two objects:
┌──────────────────────────────┐
│ json │
│ json │
├──────────────────────────────┤
│ {\n "duck": 42\n} │
│ {\n "goose": [1, 2, 3]\n} │
└──────────────────────────────┘
With newline_delimited, NDJSON is read, where each JSON object is separated by a newline character (\n), e.g., birds-nd.json:
{"duck": 42}
{"goose": [1, 2, 3]}
FROM read_json_objects('birds-nd.json', format = 'newline_delimited');
This also produces two objects:
┌──────────────────────┐
│ json │
│ json │
├──────────────────────┤
│ {"duck": 42} │
│ {"goose": [1, 2, 3]} │
└──────────────────────┘
With array, each element of a JSON array is read, e.g., birds-array.json:
[
{
"duck": 42
},
{
"goose": [1, 2, 3]
}
]
FROM read_json_objects('birds-array.json', format = 'array');
This again produces two objects:
┌──────────────────────────────────────┐
│ json │
│ json │
├──────────────────────────────────────┤
│ {\n "duck": 42\n } │
│ {\n "goose": [1, 2, 3]\n } │
└──────────────────────────────────────┘
Functions for Reading JSON as a Table
Goose also supports reading JSON as a table, using the following functions:
| Function | Description |
|---|---|
| read_json(filename) | Read JSON from filename, where filename can also be a list of files or a glob pattern. |
| read_json_auto(filename) | Alias for read_json. |
| read_ndjson(filename) | Alias for read_json with the format parameter set to newline_delimited. |
| read_ndjson_auto(filename) | Alias for read_json with the format parameter set to newline_delimited. |
Parameters
Besides the maximum_object_size, format, ignore_errors and compression parameters, these functions have the following additional parameters:
| Name | Description | Type | Default |
|---|---|---|---|
| auto_detect | Whether to auto-detect the names of the keys and the data types of the values. | BOOL | true |
| columns | A struct that specifies the key names and value types contained within the JSON file (e.g., {key1: 'INTEGER', key2: 'VARCHAR'}). If auto_detect is enabled, these will be inferred. | STRUCT | (empty) |
| dateformat | Specifies the date format to use when parsing dates. See Date Format. | VARCHAR | iso |
| maximum_depth | Maximum nesting depth to which the automatic schema detection detects types. Set to -1 to fully detect nested JSON types. | BIGINT | -1 |
| records | Can be one of auto, true, false. | VARCHAR | auto |
| sample_size | Option to define the number of sample objects for automatic JSON type detection. Set to -1 to scan the entire input file. | UBIGINT | 20480 |
| timestampformat | Specifies the date format to use when parsing timestamps. See Date Format. | VARCHAR | iso |
| union_by_name | Whether the schemas of multiple JSON files should be unified. | BOOL | false |
| map_inference_threshold | Controls the threshold for the number of columns whose schema will be auto-detected; if JSON schema auto-detection would infer a STRUCT type for a field that has more than this number of subfields, it infers a MAP type instead. Set to -1 to disable MAP inference. | BIGINT | 200 |
| field_appearance_threshold | The JSON reader divides the number of appearances of each JSON field by the auto-detection sample size. If the average over the fields of an object is less than this threshold, it defaults to using a MAP type with a value type of the merged field types. | DOUBLE | 0.1 |
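For example, timestampformat accepts strptime-style specifiers (see Date Format). A sketch, assuming a hypothetical file events.json whose ts field stores timestamps such as 15-01-2024 12:30:00:

```sql
SELECT *
FROM read_json(
    'events.json',
    columns = {event: 'VARCHAR', ts: 'TIMESTAMP'},
    timestampformat = '%d-%m-%Y %H:%M:%S'
);
```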
Note that Goose can convert JSON arrays directly to its internal LIST type, and that missing keys are filled with NULL:
SELECT *
FROM read_json(
['birds1.json', 'birds2.json'],
columns = {duck: 'INTEGER', goose: 'INTEGER[]', swan: 'DOUBLE'}
);
| duck | goose | swan |
|---|---|---|
| 42 | [1, 2, 3] | NULL |
| 43 | [4, 5, 6] | 3.3 |
Goose can automatically detect the types as follows:
SELECT goose, duck FROM read_json('*.json.gz');
SELECT goose, duck FROM '*.json.gz'; -- equivalent
Goose can read (and auto-detect) a variety of formats, specified with the format parameter.
For example, querying a JSON file that contains an array:
[
{
"duck": 42,
"goose": 4.2
},
{
"duck": 43,
"goose": 4.3
}
]
It can be queried in exactly the same way as a file that contains unstructured JSON, e.g.:
{
"duck": 42,
"goose": 4.2
}
{
"duck": 43,
"goose": 4.3
}
Both can be read as the following table:
SELECT *
FROM read_json('birds.json');
| duck | goose |
|---|---|
| 42 | 4.2 |
| 43 | 4.3 |
Goose can still read a JSON file even if it does not contain "records", i.e., if the top-level values are not object records.
This is controlled by the records parameter.
The records parameter specifies whether the JSON contains records that should be unpacked into individual columns.
Goose also attempts to auto-detect this.
For example, consider birds-records.json:
{"duck": 42, "goose": [1, 2, 3]}
{"duck": 43, "goose": [4, 5, 6]}
SELECT *
FROM read_json('birds-records.json');
This query results in two columns:
| duck | goose |
|---|---|
| 42 | [1,2,3] |
| 43 | [4,5,6] |
You can also read the same file with records set to false. This results in a single column of type STRUCT containing the original data:
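As a minimal sketch, using the records parameter documented above:

```sql
SELECT *
FROM read_json('birds-records.json', records = false);
```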
| json |
|---|
| {'duck': 42, 'goose': [1,2,3]} |
| {'duck': 43, 'goose': [4,5,6]} |
For more examples of reading complex data, see the blog post "Shredding Deeply Nested JSON, One Vector at a Time".
Loading with COPY and FORMAT json
When the json extension is installed, FORMAT json is supported for COPY FROM and IMPORT DATABASE, as well as COPY TO and EXPORT DATABASE. See the COPY statement and the IMPORT / EXPORT clauses for details.
By default, COPY expects newline-delimited JSON. To read or write JSON arrays instead, specify ARRAY true, e.g.:
COPY (SELECT * FROM range(5) r(i))
TO 'numbers.json' (ARRAY true);
This produces the following file:
[
{"i":0},
{"i":1},
{"i":2},
{"i":3},
{"i":4}
]
It can be read back into Goose as follows:
CREATE TABLE numbers (i BIGINT);
COPY numbers FROM 'numbers.json' (ARRAY true);
The format can also be auto-detected:
CREATE TABLE numbers (i BIGINT);
COPY numbers FROM 'numbers.json' (AUTO_DETECT true);
You can also create a table directly from the auto-detected schema:
CREATE TABLE numbers AS
FROM 'numbers.json';
Parameters
| Name | Description | Type | Default |
|---|---|---|---|
| auto_detect | Whether to auto-detect the names of the keys and the data types of the values. | BOOL | false |
| columns | A struct that specifies the key names and value types contained within the JSON file (e.g., {key1: 'INTEGER', key2: 'VARCHAR'}). If auto_detect is enabled, these will be inferred. | STRUCT | (empty) |
| compression | The compression type for the file. By default this will be detected automatically from the file extension (e.g., t.json.gz will use gzip, t.json will use none). Options are uncompressed, gzip, zstd and auto_detect. | VARCHAR | auto_detect |
| convert_strings_to_integers | Whether strings representing integer values should be converted to a numerical type. | BOOL | false |
| dateformat | Specifies the date format to use when parsing dates. See Date Format. | VARCHAR | iso |
| filename | Whether or not an extra filename column should be included in the result. | BOOL | false |
| format | Can be one of auto, unstructured, newline_delimited and array. | VARCHAR | array |
| hive_partitioning | Whether or not to interpret the path as a Hive partitioned path. | BOOL | false |
| ignore_errors | Whether to ignore parse errors (only possible when format is newline_delimited). | BOOL | false |
| maximum_depth | Maximum nesting depth to which the automatic schema detection detects types. Set to -1 to fully detect nested JSON types. | BIGINT | -1 |
| maximum_object_size | The maximum size of a JSON object (in bytes). | UINTEGER | 16777216 |
| records | Can be one of auto, true, false. | VARCHAR | auto |
| sample_size | Option to define the number of sample objects for automatic JSON type detection. Set to -1 to scan the entire input file. | UBIGINT | 20480 |
| timestampformat | Specifies the date format to use when parsing timestamps. See Date Format. | VARCHAR | iso |
| union_by_name | Whether the schemas of multiple JSON files should be unified. | BOOL | false |
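For instance, the compression parameter can be combined with COPY TO to write a gzip-compressed file. This is a sketch under the assumption that the options are spelled in COPY as listed in the table above:

```sql
COPY (SELECT * FROM range(5) r(i))
TO 'numbers.json.gz' (FORMAT json, COMPRESSION gzip);
```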