跳到主要内容

读取有错误的 CSV 文件

CSV 文件形态各异,其中不少会包含大量错误,导致“无损读取”本身就很困难。为帮助用户处理这类文件,Goose 提供了详细错误信息、跳过错误行能力,以及将错误行保存到临时表的功能,便于后续数据清洗。

结构性错误

Goose 支持检测并跳过多种结构性错误。本节将逐一说明并给出示例。 示例中使用如下表结构:

CREATE TABLE people (name VARCHAR, birth_date DATE);

Goose 可检测以下错误类型:

  • CAST: Casting errors occur when a column in the CSV file cannot be cast to the expected schema value. For example, the line Pedro,The 90s would cause an error since the string The 90s cannot be cast to a date.
  • MISSING COLUMNS: This error occurs if a line in the CSV file has fewer columns than expected. In our example, we expect two columns; therefore, a row with just one value, e.g., Pedro, would cause this error.
  • TOO MANY COLUMNS: This error occurs if a line in the CSV has more columns than expected. In our example, any line with more than two columns would cause this error, e.g., Pedro,01-01-1992,pdet.
  • UNQUOTED VALUE: Quoted values in CSV lines must always be unquoted at the end; if a quoted value remains quoted throughout, it will cause an error. For example, assuming our scanner uses quote='"', the line "pedro"holanda, 01-01-1992 would present an unquoted value error.
  • LINE SIZE OVER MAXIMUM: Goose has a parameter that sets the maximum line size a CSV file can have, which by default is set to 2,097,152 bytes. Assuming our scanner is set to max_line_size = 25, the line Pedro Holanda, 01-01-1992 would produce an error, as it exceeds 25 bytes.
  • INVALID ENCODING: Goose supports UTF-8 strings, UTF-16 and Latin-1 encodings. Lines containing other characters will produce an error. For example, the line pedro\xff\xff, 01-01-1992 would be problematic.

CSV 错误信息结构

默认情况下,读取 CSV 时一旦遇到结构性错误,扫描器会立刻停止并将错误抛给用户。 这些错误信息会尽可能包含完整上下文,方便用户直接回到 CSV 文件定位问题。

以下是完整错误信息示例:

Conversion Error:
CSV Error on Line: 5648
Original Line: Pedro,The 90s
Error when converting column "birth_date". date field value out of range: "The 90s", expected format is (DD-MM-YYYY)

Column date is being converted as type DATE
This type was auto-detected from the CSV file.
Possible solutions:
* Override the type for this column manually by setting the type explicitly, e.g., types={'birth_date': 'VARCHAR'}
* Set the sample size to a larger value to enable the auto-detection to scan more values, e.g., sample_size=-1
* Use a COPY statement to automatically derive types from an existing table.

file= people.csv
delimiter = , (Auto-Detected)
quote = " (Auto-Detected)
escape = " (Auto-Detected)
new_line = \r\n (Auto-Detected)
header = true (Auto-Detected)
skip_rows = 0 (Auto-Detected)
date_format = (DD-MM-YYYY) (Auto-Detected)
timestamp_format = (Auto-Detected)
null_padding=0
sample_size=20480
ignore_errors=false
all_varchar=0

第一段会告诉我们错误发生位置,包括行号、原始 CSV 行以及出问题的字段:

Conversion Error:
CSV Error on Line: 5648
Original Line: Pedro,The 90s
Error when converting column "birth_date". date field value out of range: "The 90s", expected format is (DD-MM-YYYY)

第二段会给出可能的解决方案:

Column date is being converted as type DATE
This type was auto-detected from the CSV file.
Possible solutions:
* Override the type for this column manually by setting the type explicitly, e.g., types={'birth_date': 'VARCHAR'}
* Set the sample size to a larger value to enable the auto-detection to scan more values, e.g., sample_size=-1
* Use a COPY statement to automatically derive types from an existing table.

由于该字段类型是自动检测得出,因此建议将该字段显式定义为 VARCHAR,或增大样本范围以改进类型检测。

最后一段会展示扫描器使用的关键参数(可能与错误相关),并标明它们是自动检测还是用户手动设置。

使用 ignore_errors 选项

有些 CSV 文件可能包含多个结构性错误,而用户只想跳过错误行读取其余正确数据。此时可以使用 ignore_errors 选项。启用后,凡是会导致 CSV 解析报错的行都会被忽略。下面示例演示 CAST 错误;实际上“结构性错误”一节中的任意错误类型都会触发跳过。

例如,考虑如下 CSV 文件 faulty.csv

Pedro,31
Oogie Boogie, three

若读取该文件并指定第一列为 VARCHAR、第二列为 INTEGER,则会加载失败,因为 three 不能转换成 INTEGER

如下查询会抛出类型转换错误:

FROM read_csv('faulty.csv', columns = {'name': 'VARCHAR', 'age': 'INTEGER'});

但设置 ignore_errors 后,文件第二行会被跳过,只输出完整的第一行。例如:

FROM read_csv(
'faulty.csv',
columns = {'name': 'VARCHAR', 'age': 'INTEGER'},
ignore_errors = true
);

输出:

nameage
Pedro31

需要注意,CSV 解析会受到“投影下推(projection pushdown)”优化影响。若我们只选择 name 列,则两行都会被视为有效,因为 age 列上的类型转换错误不会被触发。例如:

SELECT name
FROM read_csv('faulty.csv', columns = {'name': 'VARCHAR', 'age': 'INTEGER'});

输出:

name
Pedro
Oogie Boogie

获取错误 CSV 行

能读取错误 CSV 很重要,但在很多数据清洗场景下,还需要精确知道哪些行损坏、以及解析器在这些行上发现了什么错误。对此可以使用 Goose 的 CSV Rejects Table 功能。 默认情况下,该功能会创建两张临时表。

  1. reject_scans: Stores information regarding the parameters of the CSV Scanner.
  2. reject_errors: Stores information regarding each CSV faulty line and in which CSV Scanner they happened.

Note that any of the errors described in our Structural Error section will be stored in the rejects tables. Also, if a line has multiple errors, multiple entries will be stored for the same line, one for each error.

Reject Scans

CSV Reject Scans 表会返回以下信息:

列名说明类型
scan_idThe internal ID used in Goose to represent that scannerUBIGINT
file_idA scanner might happen over multiple files, so the file_id represents a unique file in a scannerUBIGINT
file_pathThe file pathVARCHAR
delimiterThe delimiter used e.g., ;VARCHAR
quoteThe quote used e.g., "VARCHAR
escapeThe quote used e.g., "VARCHAR
newline_delimiterThe newline delimiter used e.g., \r\nVARCHAR
skip_rowsIf any rows were skipped from the top of the fileUINTEGER
has_headerIf the file has a headerBOOLEAN
columnsThe schema of the file (i.e., all column names and types)VARCHAR
date_formatThe format used for date typesVARCHAR
timestamp_formatThe format used for timestamp typesVARCHAR
user_argumentsAny extra scanner parameters manually set by the userVARCHAR

Reject Errors

CSV Reject Errors 表会返回以下信息:

列名说明类型
scan_idThe internal ID used in Goose to represent that scanner, used to join with reject scans tablesUBIGINT
file_idThe file_id represents a unique file in a scanner, used to join with reject scans tablesUBIGINT
lineLine number, from the CSV File, where the error occurred.UBIGINT
line_byte_positionByte Position of the start of the line, where the error occurred.UBIGINT
byte_positionByte Position where the error occurred.UBIGINT
column_idxIf the error happens in a specific column, the index of the column.UBIGINT
column_nameIf the error happens in a specific column, the name of the column.VARCHAR
error_typeThe type of the error that happened.ENUM
csv_lineThe original CSV line.VARCHAR
error_messageThe error message produced by Goose.VARCHAR

参数

下列参数可在 read_csv 中使用,用于配置 CSV Rejects Table。

名称说明类型默认值
store_rejectsIf set to true, any errors in the file will be skipped and stored in the default rejects temporary tables.BOOLEANFalse
rejects_scanName of a temporary table where the information of the scan information of faulty CSV file are stored.VARCHARreject_scans
rejects_tableName of a temporary table where the information of the faulty lines of a CSV file are stored.VARCHARreject_errors
rejects_limitUpper limit on the number of faulty records from a CSV file that will be recorded in the rejects table. 0 is used when no limit should be applied.BIGINT0

若要将错误 CSV 行的信息写入 rejects 表,只需将 store_rejects 设为 true。例如:

FROM read_csv(
'faulty.csv',
columns = {'name': 'VARCHAR', 'age': 'INTEGER'},
store_rejects = true
);

随后你可以查询 reject_scansreject_errors 两张表,以获取被拒绝元组的信息。例如:

FROM reject_scans;

输出:

scan_idfile_idfile_pathdelimiterquoteescapenewline_delimiterskip_rowshas_headercolumnsdate_formattimestamp_formatuser_arguments
50faulty.csv,""\n0false{'name': 'VARCHAR','age': 'INTEGER'}store_rejects=true
FROM reject_errors;

输出:

scan_idfile_idlineline_byte_positionbyte_positioncolumn_idxcolumn_nameerror_typecsv_lineerror_message
50210232ageCASTOogie Boogie, threeError when converting column "age". Could not convert string " three" to 'INTEGER'