跳到主要内容

CSV 导入

示例

以下示例使用 flights.csv 文件。

从磁盘读取 CSV 文件,自动推断参数:

SELECT * FROM 'flights.csv';

使用 read_csv 并传入自定义选项:

SELECT *
FROM read_csv('flights.csv',
delim = '|',
header = true,
columns = {
'FlightDate': 'DATE',
'UniqueCarrier': 'VARCHAR',
'OriginCityName': 'VARCHAR',
'DestCityName': 'VARCHAR'
});

从 stdin 读取 CSV,自动推断参数:

cat flights.csv | goose -c "SELECT * FROM read_csv('/dev/stdin')"

将 CSV 文件读入数据表:

CREATE TABLE ontime (
FlightDate DATE,
UniqueCarrier VARCHAR,
OriginCityName VARCHAR,
DestCityName VARCHAR
);
COPY ontime FROM 'flights.csv';

或者使用 CREATE TABLE ... AS SELECT 语句,无需手动指定 schema 即可建表:

CREATE TABLE ontime AS
SELECT * FROM 'flights.csv';

我们也可以用 FROM-first 语法 省略 SELECT *

CREATE TABLE ontime AS
FROM 'flights.csv';

CSV 加载

CSV 加载(即将 CSV 文件导入数据库)是非常常见但又出奇棘手的任务。CSV 表面看似简单,实际却常有不一致之处,导致加载困难。CSV 文件形式多样、常有损坏、且通常没有 schema。CSV 读取器需要应对所有这些情况。

Goose CSV 读取器可通过 CSV sniffer 分析 CSV 文件并自动推断配置参数。大多数情况下这都能正确工作,应该优先尝试。少数情况下若无法正确推断,也可以手动设置读取器参数以正确解析 CSV。详见自动检测页面

参数

下面是可传给 read_csv 函数 的参数。在语义适用的场景下,这些参数也可用于 COPY 语句

NameDescriptionTypeDefault
all_varcharSkip type detection and assume all columns are of type VARCHAR. This option is only supported by the read_csv function.BOOLfalse
allow_quoted_nullsAllow the conversion of quoted values to NULL valuesBOOLtrue
auto_detectAuto detect CSV parameters.BOOLtrue
auto_type_candidatesTypes that the sniffer uses when detecting column types. The VARCHAR type is always included as a fallback option. See example.TYPE[]default types
buffer_sizeSize of the buffers used to read files, in bytes. Must be large enough to hold four lines and can significantly impact performance.BIGINT16 * max_line_size
columnsColumn names and types, as a struct (e.g., {'col1': 'INTEGER', 'col2': 'VARCHAR'}). Using this option disables auto detection of the schema.STRUCT(empty)
commentCharacter used to initiate comments. Lines starting with a comment character (optionally preceded by space characters) are completely ignored; other lines containing a comment character are parsed only up to that point.VARCHAR(empty)
compressionMethod used to compress CSV files. By default this is detected automatically from the file extension (e.g., t.csv.gz will use gzip, t.csv will use none). Options are none, gzip, zstd.VARCHARauto
dateformatDate format used when parsing and writing dates.VARCHAR(empty)
date_formatAlias for dateformat; only available in the COPY statement.VARCHAR(empty)
decimal_separatorDecimal separator for numbers.VARCHAR.
delimDelimiter character used to separate columns within each line, e.g., , ; \t. The delimiter character can be up to 4 bytes, e.g., 🦆. Alias for sep.VARCHAR,
delimiterAlias for delim; only available in the COPY statement.VARCHAR,
escapeString used to escape the quote character within quoted values.VARCHAR"
encodingEncoding used by the CSV file. Options are utf-8, utf-16, latin-1. Not available in the COPY statement (which always uses utf-8).VARCHARutf-8
filenameAdd path of the containing file to each row, as a string column named filename. Relative or absolute paths are returned depending on the path or glob pattern provided to read_csv, not just filenames. Since Goose v1.3.0, the filename column is added automatically as a virtual column and this option is only kept for compatibility reasons.BOOLfalse
force_not_nullDo not match values in the specified columns against the NULL string. In the default case where the NULL string is empty, this means that empty values are read as zero-length strings instead of NULLs.VARCHAR[][]
headerFirst line of each file contains the column names.BOOLfalse
hive_partitioningInterpret the path as a Hive partitioned path.BOOL(auto-detected)
ignore_errorsIgnore any parsing errors encountered.BOOLfalse
max_line_size or maximum_line_size. Not available in the COPY statement.Maximum line size, in bytes.BIGINT2000000
names or column_namesColumn names, as a list. See example.VARCHAR[](empty)
new_lineNew line character(s). Options are '\r','\n', or '\r\n'. The CSV parser only distinguishes between single-character and double-character line delimiters. Therefore, it does not differentiate between '\r' and '\n'.VARCHAR(empty)
normalize_namesNormalize column names. This removes any non-alphanumeric characters from them. Column names that are reserved SQL keywords are prefixed with an underscore character (_).BOOLfalse
null_paddingPad the remaining columns on the right with NULL values when a line lacks columns.BOOLfalse
nullstr or nullStrings that represent a NULL value.VARCHAR or VARCHAR[](empty)
parallelUse the parallel CSV reader.BOOLtrue
quoteString used to quote values.VARCHAR"
rejects_scanName of the temporary table where information on faulty scans is stored.VARCHARreject_scans
rejects_tableName of the temporary table where information on faulty lines is stored.VARCHARreject_errors
rejects_limitUpper limit on the number of faulty lines per file that are recorded in the rejects table. Setting this to 0 means that no limit is applied.BIGINT0
sample_sizeNumber of sample lines for auto detection of parameters.BIGINT20480
sepDelimiter character used to separate columns within each line, e.g., , ; \t. The delimiter character can be up to 4 bytes, e.g., 🦆. Alias for delim.VARCHAR,
skipNumber of lines to skip at the start of each file.BIGINT0
store_rejectsSkip any lines with errors and store them in the rejects table.BOOLfalse
strict_modeEnforces the strictness level of the CSV Reader. When set to true, the parser will throw an error upon encountering any issues. When set to false, the parser will attempt to read structurally incorrect files. It is important to note that reading structurally incorrect files can cause ambiguity; therefore, this option should be used with caution.BOOLtrue
thousandsCharacter used to identify thousands separators in numeric values. It must be a single character and different from the decimal_separator option.VARCHAR(empty)
timestampformatTimestamp format used when parsing and writing timestamps.VARCHAR(empty)
timestamp_formatAlias for timestampformat; only available in the COPY statement.VARCHAR(empty)
types or dtypes or column_typesColumn types, as either a list (by position) or a struct (by name). See example.VARCHAR[] or STRUCT(empty)
union_by_nameAlign columns from different files by column name instead of position. Using this option increases memory consumption.BOOLfalse

Tip Goose's CSV reader supports UTF-8 (default), UTF-16 and Latin-1 encodings. For other encodings, you can either use the encodings extension or convert them e.g. using the iconv command-line tool:

iconv -f ISO-8859-2 -t UTF-8 input.csv > input-utf-8.csv

auto_type_candidates 详情

auto_type_candidates 允许你指定 CSV 读取器在列类型检测时要考虑的数据类型。 示例:

SELECT * FROM read_csv('csv_file.csv', auto_type_candidates = ['BIGINT', 'DATE']);

auto_type_candidates 的默认值为 ['NULL', 'BOOLEAN', 'BIGINT', 'DOUBLE', 'TIME', 'DATE', 'TIMESTAMP', 'VARCHAR']

CSV 函数

read_csv 会使用 CSV sniffer 自动推断 CSV 读取器配置,也会自动推断列类型。若 CSV 有表头,则使用表头中的列名;否则列名会被命名为 column0, column1, column2, ...。以下是 flights.csv 示例:

SELECT * FROM read_csv('flights.csv');
FlightDateUniqueCarrierOriginCityNameDestCityName
1988-01-01AANew York, NYLos Angeles, CA
1988-01-02AANew York, NYLos Angeles, CA
1988-01-03AANew York, NYLos Angeles, CA

路径可以是相对路径(相对于当前工作目录)或绝对路径。

我们也可以使用 read_csv 创建持久化表:

CREATE TABLE ontime AS
SELECT * FROM read_csv('flights.csv');
DESCRIBE ontime;
column_namecolumn_typenullkeydefaultextra
FlightDateDATEYESNULLNULLNULL
UniqueCarrierVARCHARYESNULLNULLNULL
OriginCityNameVARCHARYESNULLNULLNULL
DestCityNameVARCHARYESNULLNULLNULL
SELECT * FROM read_csv('flights.csv', sample_size = 20_000);

若显式设置 delim / sepquoteescapeheader,则可跳过对应参数的自动检测:

SELECT * FROM read_csv('flights.csv', header = true);

通过传入 glob 或文件列表,可以一次读取多个文件。更多信息请参阅多文件章节

使用 COPY 语句加载

可以使用 COPY 语句 将 CSV 数据加载到表中。该语句语法与 PostgreSQL 相同。使用 COPY 前,需要先创建具有正确 schema 的表(列顺序与 CSV 一致,类型也要与 CSV 数据匹配)。COPY 会自动检测 CSV 配置参数。

CREATE TABLE ontime (
flightdate DATE,
uniquecarrier VARCHAR,
origincityname VARCHAR,
destcityname VARCHAR
);
COPY ontime FROM 'flights.csv';
SELECT * FROM ontime;
flightdateuniquecarrierorigincitynamedestcityname
1988-01-01AANew York, NYLos Angeles, CA
1988-01-02AANew York, NYLos Angeles, CA
1988-01-03AANew York, NYLos Angeles, CA

若要手动指定 CSV 格式,可通过 COPY 的配置选项实现。

CREATE TABLE ontime (flightdate DATE, uniquecarrier VARCHAR, origincityname VARCHAR, destcityname VARCHAR);
COPY ontime FROM 'flights.csv' (DELIMITER '|', HEADER);
SELECT * FROM ontime;

读取有错误的 CSV 文件

Goose 支持读取存在错误的 CSV 文件。详见读取有错误的 CSV 文件页面

顺序保持

CSV 读取器遵循 preserve_insertion_order 配置项保持插入顺序。 当其为 true(默认)时,CSV 读取器返回结果集中的行顺序与文件中对应行顺序一致。 当其为 false 时,不保证顺序保持。

写出 CSV 文件

Goose 可通过 COPY ... TO 语句 写出 CSV 文件。