CSV 导入

示例

以下示例使用 flights.csv 文件。

从磁盘读取 CSV 文件，自动推断参数：

SELECT * FROM 'flights.csv';

使用 read_csv 并传入自定义选项：

SELECT *
FROM read_csv('flights.csv',
    delim = '|',
    header = true,
    columns = {
        'FlightDate': 'DATE',
        'UniqueCarrier': 'VARCHAR',
        'OriginCityName': 'VARCHAR',
        'DestCityName': 'VARCHAR'
    });

从 stdin 读取 CSV，自动推断参数：

cat flights.csv | goose -c "SELECT * FROM read_csv('/dev/stdin')"

将 CSV 文件读入数据表：

CREATE TABLE ontime (
    FlightDate DATE,
    UniqueCarrier VARCHAR,
    OriginCityName VARCHAR,
    DestCityName VARCHAR
);
COPY ontime FROM 'flights.csv';

或者使用 CREATE TABLE ... AS SELECT 语句，无需手动指定 schema 即可建表：

CREATE TABLE ontime AS
    SELECT * FROM 'flights.csv';

我们也可以用 FROM-first 语法省略 SELECT *。

CREATE TABLE ontime AS
    FROM 'flights.csv';

CSV 加载

CSV 加载（即将 CSV 文件导入数据库）是非常常见但又出奇棘手的任务。CSV 表面看似简单，实际却常有不一致之处，导致加载困难。CSV 文件形式多样、常有损坏、且通常没有 schema。CSV 读取器需要应对所有这些情况。

Goose CSV 读取器可通过 CSV sniffer 分析 CSV 文件并自动推断配置参数。大多数情况下这都能正确工作，应该优先尝试。少数情况下若无法正确推断，也可以手动设置读取器参数以正确解析 CSV。详见自动检测页面。

参数

下面是可传给 read_csv 函数的参数。在语义适用的场景下，这些参数也可用于 COPY 语句。

Name	Description	Type	Default
`all_varchar`	Skip type detection and assume all columns are of type `VARCHAR`. This option is only supported by the `read_csv` function.	`BOOL`	`false`
`allow_quoted_nulls`	Allow the conversion of quoted values to `NULL` values	`BOOL`	`true`
`auto_detect`	Auto detect CSV parameters.	`BOOL`	`true`
`auto_type_candidates`	Types that the sniffer uses when detecting column types. The `VARCHAR` type is always included as a fallback option. See example.	`TYPE[]`	default types
`buffer_size`	Size of the buffers used to read files, in bytes. Must be large enough to hold four lines and can significantly impact performance.	`BIGINT`	`16 * max_line_size`
`columns`	Column names and types, as a struct (e.g., `{'col1': 'INTEGER', 'col2': 'VARCHAR'}`). Using this option disables auto detection of the schema.	`STRUCT`	(empty)
`comment`	Character used to initiate comments. Lines starting with a comment character (optionally preceded by space characters) are completely ignored; other lines containing a comment character are parsed only up to that point.	`VARCHAR`	(empty)
`compression`	Method used to compress CSV files. By default this is detected automatically from the file extension (e.g., `t.csv.gz` will use gzip, `t.csv` will use `none`). Options are `none`, `gzip`, `zstd`.	`VARCHAR`	`auto`
`dateformat`	Date format used when parsing and writing dates.	`VARCHAR`	(empty)
`date_format`	Alias for `dateformat`; only available in the `COPY` statement.	`VARCHAR`	(empty)
`decimal_separator`	Decimal separator for numbers.	`VARCHAR`	`.`
`delim`	Delimiter character used to separate columns within each line, e.g., `,` `;` `\t`. The delimiter character can be up to 4 bytes, e.g., 🦆. Alias for `sep`.	`VARCHAR`	`,`
`delimiter`	Alias for `delim`; only available in the `COPY` statement.	`VARCHAR`	`,`
`escape`	String used to escape the `quote` character within quoted values.	`VARCHAR`	`"`
`encoding`	Encoding used by the CSV file. Options are `utf-8`, `utf-16`, `latin-1`. Not available in the `COPY` statement (which always uses `utf-8`).	`VARCHAR`	`utf-8`
`filename`	Add path of the containing file to each row, as a string column named `filename`. Relative or absolute paths are returned depending on the path or glob pattern provided to `read_csv`, not just filenames. Since Goose v1.3.0, the `filename` column is added automatically as a virtual column and this option is only kept for compatibility reasons.	`BOOL`	`false`
`force_not_null`	Do not match values in the specified columns against the `NULL` string. In the default case where the `NULL` string is empty, this means that empty values are read as zero-length strings instead of `NULL`s.	`VARCHAR[]`	`[]`
`header`	First line of each file contains the column names.	`BOOL`	`false`
`hive_partitioning`	Interpret the path as a Hive partitioned path.	`BOOL`	(auto-detected)
`ignore_errors`	Ignore any parsing errors encountered.	`BOOL`	`false`
`max_line_size` or `maximum_line_size`. Not available in the `COPY` statement.	Maximum line size, in bytes.	`BIGINT`	2000000
`names` or `column_names`	Column names, as a list. See example.	`VARCHAR[]`	(empty)
`new_line`	New line character(s). Options are `'\r'`,`'\n'`, or `'\r\n'`. The CSV parser only distinguishes between single-character and double-character line delimiters. Therefore, it does not differentiate between `'\r'` and `'\n'`.	`VARCHAR`	(empty)
`normalize_names`	Normalize column names. This removes any non-alphanumeric characters from them. Column names that are reserved SQL keywords are prefixed with an underscore character (`_`).	`BOOL`	`false`
`null_padding`	Pad the remaining columns on the right with `NULL` values when a line lacks columns.	`BOOL`	`false`
`nullstr` or `null`	Strings that represent a `NULL` value.	`VARCHAR` or `VARCHAR[]`	(empty)
`parallel`	Use the parallel CSV reader.	`BOOL`	`true`
`quote`	String used to quote values.	`VARCHAR`	`"`
`rejects_scan`	Name of the temporary table where information on faulty scans is stored.	`VARCHAR`	`reject_scans`
`rejects_table`	Name of the temporary table where information on faulty lines is stored.	`VARCHAR`	`reject_errors`
`rejects_limit`	Upper limit on the number of faulty lines per file that are recorded in the rejects table. Setting this to `0` means that no limit is applied.	`BIGINT`	`0`
`sample_size`	Number of sample lines for auto detection of parameters.	`BIGINT`	20480
`sep`	Delimiter character used to separate columns within each line, e.g., `,` `;` `\t`. The delimiter character can be up to 4 bytes, e.g., 🦆. Alias for `delim`.	`VARCHAR`	`,`
`skip`	Number of lines to skip at the start of each file.	`BIGINT`	0
`store_rejects`	Skip any lines with errors and store them in the rejects table.	`BOOL`	`false`
`strict_mode`	Enforces the strictness level of the CSV Reader. When set to `true`, the parser will throw an error upon encountering any issues. When set to `false`, the parser will attempt to read structurally incorrect files. It is important to note that reading structurally incorrect files can cause ambiguity; therefore, this option should be used with caution.	`BOOL`	`true`
`thousands`	Character used to identify thousands separators in numeric values. It must be a single character and different from the `decimal_separator` option.	`VARCHAR`	(empty)
`timestampformat`	Timestamp format used when parsing and writing timestamps.	`VARCHAR`	(empty)
`timestamp_format`	Alias for `timestampformat`; only available in the `COPY` statement.	`VARCHAR`	(empty)
`types` or `dtypes` or `column_types`	Column types, as either a list (by position) or a struct (by name). See example.	`VARCHAR[]` or `STRUCT`	(empty)
`union_by_name`	Align columns from different files by column name instead of position. Using this option increases memory consumption.	`BOOL`	`false`

Tip Goose's CSV reader supports UTF-8 (default), UTF-16 and Latin-1 encodings. For other encodings, you can either use the encodings extension or convert them e.g. using the iconv command-line tool:
iconv -f ISO-8859-2 -t UTF-8 input.csv > input-utf-8.csv

`auto_type_candidates` 详情

auto_type_candidates 允许你指定 CSV 读取器在列类型检测时要考虑的数据类型。示例：

SELECT * FROM read_csv('csv_file.csv', auto_type_candidates = ['BIGINT', 'DATE']);

auto_type_candidates 的默认值为 ['NULL', 'BOOLEAN', 'BIGINT', 'DOUBLE', 'TIME', 'DATE', 'TIMESTAMP', 'VARCHAR']。

CSV 函数

read_csv 会使用 CSV sniffer 自动推断 CSV 读取器配置，也会自动推断列类型。若 CSV 有表头，则使用表头中的列名；否则列名会被命名为 column0, column1, column2, ...。以下是 flights.csv 示例：

SELECT * FROM read_csv('flights.csv');

FlightDate	UniqueCarrier	OriginCityName	DestCityName
1988-01-01	AA	New York, NY	Los Angeles, CA
1988-01-02	AA	New York, NY	Los Angeles, CA
1988-01-03	AA	New York, NY	Los Angeles, CA

路径可以是相对路径（相对于当前工作目录）或绝对路径。

我们也可以使用 read_csv 创建持久化表：

CREATE TABLE ontime AS
    SELECT * FROM read_csv('flights.csv');
DESCRIBE ontime;

column_name	column_type	null	key	default	extra
FlightDate	DATE	YES	NULL	NULL	NULL
UniqueCarrier	VARCHAR	YES	NULL	NULL	NULL
OriginCityName	VARCHAR	YES	NULL	NULL	NULL
DestCityName	VARCHAR	YES	NULL	NULL	NULL

SELECT * FROM read_csv('flights.csv', sample_size = 20_000);

若显式设置 delim / sep、quote、escape 或 header，则可跳过对应参数的自动检测：

SELECT * FROM read_csv('flights.csv', header = true);

通过传入 glob 或文件列表，可以一次读取多个文件。更多信息请参阅多文件章节。

使用 `COPY` 语句加载

可以使用 COPY 语句将 CSV 数据加载到表中。该语句语法与 PostgreSQL 相同。使用 COPY 前，需要先创建具有正确 schema 的表（列顺序与 CSV 一致，类型也要与 CSV 数据匹配）。COPY 会自动检测 CSV 配置参数。

CREATE TABLE ontime (
    flightdate DATE,
    uniquecarrier VARCHAR,
    origincityname VARCHAR,
    destcityname VARCHAR
);
COPY ontime FROM 'flights.csv';
SELECT * FROM ontime;

flightdate	uniquecarrier	origincityname	destcityname
1988-01-01	AA	New York, NY	Los Angeles, CA
1988-01-02	AA	New York, NY	Los Angeles, CA
1988-01-03	AA	New York, NY	Los Angeles, CA

若要手动指定 CSV 格式，可通过 COPY 的配置选项实现。

CREATE TABLE ontime (flightdate DATE, uniquecarrier VARCHAR, origincityname VARCHAR, destcityname VARCHAR);
COPY ontime FROM 'flights.csv' (DELIMITER '|', HEADER);
SELECT * FROM ontime;

读取有错误的 CSV 文件

Goose 支持读取存在错误的 CSV 文件。详见读取有错误的 CSV 文件页面。

顺序保持

CSV 读取器遵循 preserve_insertion_order 配置项来保持插入顺序。当其为 true（默认）时，CSV 读取器返回结果集中的行顺序与文件中对应行顺序一致。当其为 false 时，不保证顺序保持。

写出 CSV 文件

Goose 可通过 COPY ... TO 语句写出 CSV 文件。

示例​

CSV 加载​

参数​

auto_type_candidates 详情​

CSV 函数​

使用 COPY 语句加载​

读取有错误的 CSV 文件​

顺序保持​

写出 CSV 文件​

示例