CSV 导入
示例
以下示例使用 flights.csv 文件。
从磁盘读取 CSV 文件,自动推断参数:
SELECT * FROM 'flights.csv';
使用 read_csv 并传入自定义选项:
SELECT *
FROM read_csv('flights.csv',
delim = '|',
header = true,
columns = {
'FlightDate': 'DATE',
'UniqueCarrier': 'VARCHAR',
'OriginCityName': 'VARCHAR',
'DestCityName': 'VARCHAR'
});
从 stdin 读取 CSV,自动推断参数:
cat flights.csv | goose -c "SELECT * FROM read_csv('/dev/stdin')"
将 CSV 文件读入数据表:
CREATE TABLE ontime (
FlightDate DATE,
UniqueCarrier VARCHAR,
OriginCityName VARCHAR,
DestCityName VARCHAR
);
COPY ontime FROM 'flights.csv';
或者使用 CREATE TABLE ... AS SELECT 语句,无需手动指定 schema 即可建表:
CREATE TABLE ontime AS
SELECT * FROM 'flights.csv';
我们也可以用 FROM-first 语法 省略 SELECT *。
CREATE TABLE ontime AS
FROM 'flights.csv';
CSV 加载
CSV 加载(即将 CSV 文件导入数据库)是非常常见但又出奇棘手的任务。CSV 表面看似简单,实际却常有不一致之处,导致加载困难。CSV 文件形式多样、常有损坏、且通常没有 schema。CSV 读取器需要应对所有这些情况。
Goose CSV 读取器可通过 CSV sniffer 分析 CSV 文件并自动推断配置参数。大多数情况下这都能正确工作,应该优先尝试。少数情况下若无法正确推断,也可以手动设置读取器参数以正确解析 CSV。详见自动检测页面。
参数
下面是可传给 read_csv 函数 的参数。在语义适用的场景下,这些参数也可用于 COPY 语句。
| Name | Description | Type | Default |
|---|---|---|---|
all_varchar | Skip type detection and assume all columns are of type VARCHAR. This option is only supported by the read_csv function. | BOOL | false |
allow_quoted_nulls | Allow the conversion of quoted values to NULL values | BOOL | true |
auto_detect | Auto detect CSV parameters. | BOOL | true |
auto_type_candidates | Types that the sniffer uses when detecting column types. The VARCHAR type is always included as a fallback option. See example. | TYPE[] | default types |
buffer_size | Size of the buffers used to read files, in bytes. Must be large enough to hold four lines and can significantly impact performance. | BIGINT | 16 * max_line_size |
columns | Column names and types, as a struct (e.g., {'col1': 'INTEGER', 'col2': 'VARCHAR'}). Using this option disables auto detection of the schema. | STRUCT | (empty) |
comment | Character used to initiate comments. Lines starting with a comment character (optionally preceded by space characters) are completely ignored; other lines containing a comment character are parsed only up to that point. | VARCHAR | (empty) |
compression | Method used to compress CSV files. By default this is detected automatically from the file extension (e.g., t.csv.gz will use gzip, t.csv will use none). Options are none, gzip, zstd. | VARCHAR | auto |
dateformat | Date format used when parsing and writing dates. | VARCHAR | (empty) |
date_format | Alias for dateformat; only available in the COPY statement. | VARCHAR | (empty) |
decimal_separator | Decimal separator for numbers. | VARCHAR | . |
delim | Delimiter character used to separate columns within each line, e.g., , ; \t. The delimiter character can be up to 4 bytes, e.g., 🦆. Alias for sep. | VARCHAR | , |
delimiter | Alias for delim; only available in the COPY statement. | VARCHAR | , |
escape | String used to escape the quote character within quoted values. | VARCHAR | " |
encoding | Encoding used by the CSV file. Options are utf-8, utf-16, latin-1. Not available in the COPY statement (which always uses utf-8). | VARCHAR | utf-8 |
filename | Add path of the containing file to each row, as a string column named filename. Relative or absolute paths are returned depending on the path or glob pattern provided to read_csv, not just filenames. Since Goose v1.3.0, the filename column is added automatically as a virtual column and this option is only kept for compatibility reasons. | BOOL | false |
force_not_null | Do not match values in the specified columns against the NULL string. In the default case where the NULL string is empty, this means that empty values are read as zero-length strings instead of NULLs. | VARCHAR[] | [] |
header | First line of each file contains the column names. | BOOL | false |
hive_partitioning | Interpret the path as a Hive partitioned path. | BOOL | (auto-detected) |
ignore_errors | Ignore any parsing errors encountered. | BOOL | false |
max_line_size or maximum_line_size. Not available in the COPY statement. | Maximum line size, in bytes. | BIGINT | 2000000 |
names or column_names | Column names, as a list. See example. | VARCHAR[] | (empty) |
new_line | New line character(s). Options are '\r','\n', or '\r\n'. The CSV parser only distinguishes between single-character and double-character line delimiters. Therefore, it does not differentiate between '\r' and '\n'. | VARCHAR | (empty) |
normalize_names | Normalize column names. This removes any non-alphanumeric characters from them. Column names that are reserved SQL keywords are prefixed with an underscore character (_). | BOOL | false |
null_padding | Pad the remaining columns on the right with NULL values when a line lacks columns. | BOOL | false |
nullstr or null | Strings that represent a NULL value. | VARCHAR or VARCHAR[] | (empty) |
parallel | Use the parallel CSV reader. | BOOL | true |
quote | String used to quote values. | VARCHAR | " |
rejects_scan | Name of the temporary table where information on faulty scans is stored. | VARCHAR | reject_scans |
rejects_table | Name of the temporary table where information on faulty lines is stored. | VARCHAR | reject_errors |
rejects_limit | Upper limit on the number of faulty lines per file that are recorded in the rejects table. Setting this to 0 means that no limit is applied. | BIGINT | 0 |
sample_size | Number of sample lines for auto detection of parameters. | BIGINT | 20480 |
sep | Delimiter character used to separate columns within each line, e.g., , ; \t. The delimiter character can be up to 4 bytes, e.g., 🦆. Alias for delim. | VARCHAR | , |
skip | Number of lines to skip at the start of each file. | BIGINT | 0 |
store_rejects | Skip any lines with errors and store them in the rejects table. | BOOL | false |
strict_mode | Enforces the strictness level of the CSV Reader. When set to true, the parser will throw an error upon encountering any issues. When set to false, the parser will attempt to read structurally incorrect files. It is important to note that reading structurally incorrect files can cause ambiguity; therefore, this option should be used with caution. | BOOL | true |
thousands | Character used to identify thousands separators in numeric values. It must be a single character and different from the decimal_separator option. | VARCHAR | (empty) |
timestampformat | Timestamp format used when parsing and writing timestamps. | VARCHAR | (empty) |
timestamp_format | Alias for timestampformat; only available in the COPY statement. | VARCHAR | (empty) |
types or dtypes or column_types | Column types, as either a list (by position) or a struct (by name). See example. | VARCHAR[] or STRUCT | (empty) |
union_by_name | Align columns from different files by column name instead of position. Using this option increases memory consumption. | BOOL | false |
Tip Goose's CSV reader supports
UTF-8(default),UTF-16andLatin-1encodings. For other encodings, you can either use theencodingsextension or convert them e.g. using theiconvcommand-line tool:iconv -f ISO-8859-2 -t UTF-8 input.csv > input-utf-8.csv
auto_type_candidates 详情
auto_type_candidates 允许你指定 CSV 读取器在列类型检测时要考虑的数据类型。
示例:
SELECT * FROM read_csv('csv_file.csv', auto_type_candidates = ['BIGINT', 'DATE']);
auto_type_candidates 的默认值为 ['NULL', 'BOOLEAN', 'BIGINT', 'DOUBLE', 'TIME', 'DATE', 'TIMESTAMP', 'VARCHAR']。
CSV 函数
read_csv 会使用 CSV sniffer 自动推断 CSV 读取器配置,也会自动推断列类型。若 CSV 有表头,则使用表头中的列名;否则列名会被命名为 column0, column1, column2, ...。以下是 flights.csv 示例:
SELECT * FROM read_csv('flights.csv');
| FlightDate | UniqueCarrier | OriginCityName | DestCityName |
|---|---|---|---|
| 1988-01-01 | AA | New York, NY | Los Angeles, CA |
| 1988-01-02 | AA | New York, NY | Los Angeles, CA |
| 1988-01-03 | AA | New York, NY | Los Angeles, CA |
路径可以是相对路径(相对于当前工作目录)或绝对路径。
我们也可以使用 read_csv 创建持久化表:
CREATE TABLE ontime AS
SELECT * FROM read_csv('flights.csv');
DESCRIBE ontime;
| column_name | column_type | null | key | default | extra |
|---|---|---|---|---|---|
| FlightDate | DATE | YES | NULL | NULL | NULL |
| UniqueCarrier | VARCHAR | YES | NULL | NULL | NULL |
| OriginCityName | VARCHAR | YES | NULL | NULL | NULL |
| DestCityName | VARCHAR | YES | NULL | NULL | NULL |
SELECT * FROM read_csv('flights.csv', sample_size = 20_000);
若显式设置 delim / sep、quote、escape 或 header,则可跳过对应参数的自动检测:
SELECT * FROM read_csv('flights.csv', header = true);
通过传入 glob 或文件列表,可以一次读取多个文件。更多信息请参阅多文件章节。
使用 COPY 语句加载
可以使用 COPY 语句 将 CSV 数据加载到表中。该语句语法与 PostgreSQL 相同。使用 COPY 前,需要先创建具有正确 schema 的表(列顺序与 CSV 一致,类型也要与 CSV 数据匹配)。COPY 会自动检测 CSV 配置参数。
CREATE TABLE ontime (
flightdate DATE,
uniquecarrier VARCHAR,
origincityname VARCHAR,
destcityname VARCHAR
);
COPY ontime FROM 'flights.csv';
SELECT * FROM ontime;
| flightdate | uniquecarrier | origincityname | destcityname |
|---|---|---|---|
| 1988-01-01 | AA | New York, NY | Los Angeles, CA |
| 1988-01-02 | AA | New York, NY | Los Angeles, CA |
| 1988-01-03 | AA | New York, NY | Los Angeles, CA |
若要手动指定 CSV 格式,可通过 COPY 的配置选项实现。
CREATE TABLE ontime (flightdate DATE, uniquecarrier VARCHAR, origincityname VARCHAR, destcityname VARCHAR);
COPY ontime FROM 'flights.csv' (DELIMITER '|', HEADER);
SELECT * FROM ontime;
读取有错误的 CSV 文件
Goose 支持读取存在错误的 CSV 文件。详见读取有错误的 CSV 文件页面。
顺序保持
CSV 读取器遵循 preserve_insertion_order 配置项来保持插入顺序。
当其为 true(默认)时,CSV 读取器返回结果集中的行顺序与文件中对应行顺序一致。
当其为 false 时,不保证顺序保持。
写出 CSV 文件
Goose 可通过 COPY ... TO 语句 写出 CSV 文件。