读取有错误的 CSV 文件
CSV 文件形态各异,其中不少会包含大量错误,导致“无损读取”本身就很困难。为帮助用户处理这类文件,Goose 提供了详细错误信息、跳过错误行能力,以及将错误行保存到临时表的功能,便于后续数据清洗。
结构性错误
Goose 支持检测并跳过多种结构性错误。本节将逐一说明并给出示例。 示例中使用如下表结构:
CREATE TABLE people (name VARCHAR, birth_date DATE);
Goose 可检测以下错误类型:
CAST: Casting errors occur when a column in the CSV file cannot be cast to the expected schema value. For example, the linePedro,The 90swould cause an error since the stringThe 90scannot be cast to a date.MISSING COLUMNS: This error occurs if a line in the CSV file has fewer columns than expected. In our example, we expect two columns; therefore, a row with just one value, e.g.,Pedro, would cause this error.TOO MANY COLUMNS: This error occurs if a line in the CSV has more columns than expected. In our example, any line with more than two columns would cause this error, e.g.,Pedro,01-01-1992,pdet.UNQUOTED VALUE: Quoted values in CSV lines must always be unquoted at the end; if a quoted value remains quoted throughout, it will cause an error. For example, assuming our scanner usesquote='"', the line"pedro"holanda, 01-01-1992would present an unquoted value error.LINE SIZE OVER MAXIMUM: Goose has a parameter that sets the maximum line size a CSV file can have, which by default is set to 2,097,152 bytes. Assuming our scanner is set tomax_line_size = 25, the linePedro Holanda, 01-01-1992would produce an error, as it exceeds 25 bytes.INVALID ENCODING: Goose supports UTF-8 strings, UTF-16 and Latin-1 encodings. Lines containing other characters will produce an error. For example, the linepedro\xff\xff, 01-01-1992would be problematic.
CSV 错误信息结构
默认情况下,读取 CSV 时一旦遇到结构性错误,扫描器会立刻停止并将错误抛给用户。 这些错误信息会尽可能包含完整上下文,方便用户直接回到 CSV 文件定位问题。
以下是完整错误信息示例:
Conversion Error:
CSV Error on Line: 5648
Original Line: Pedro,The 90s
Error when converting column "birth_date". date field value out of range: "The 90s", expected format is (DD-MM-YYYY)
Column date is being converted as type DATE
This type was auto-detected from the CSV file.
Possible solutions:
* Override the type for this column manually by setting the type explicitly, e.g., types={'birth_date': 'VARCHAR'}
* Set the sample size to a larger value to enable the auto-detection to scan more values, e.g., sample_size=-1
* Use a COPY statement to automatically derive types from an existing table.
file= people.csv
delimiter = , (Auto-Detected)
quote = " (Auto-Detected)
escape = " (Auto-Detected)
new_line = \r\n (Auto-Detected)
header = true (Auto-Detected)
skip_rows = 0 (Auto-Detected)
date_format = (DD-MM-YYYY) (Auto-Detected)
timestamp_format = (Auto-Detected)
null_padding=0
sample_size=20480
ignore_errors=false
all_varchar=0
第一段会告诉我们错误发生位置,包括行号、原始 CSV 行以及出问题的字段:
Conversion Error:
CSV Error on Line: 5648
Original Line: Pedro,The 90s
Error when converting column "birth_date". date field value out of range: "The 90s", expected format is (DD-MM-YYYY)
第二段会给出可能的解决方案:
Column date is being converted as type DATE
This type was auto-detected from the CSV file.
Possible solutions:
* Override the type for this column manually by setting the type explicitly, e.g., types={'birth_date': 'VARCHAR'}
* Set the sample size to a larger value to enable the auto-detection to scan more values, e.g., sample_size=-1
* Use a COPY statement to automatically derive types from an existing table.
由于该字段类型是自动检测得出,因此建议将该字段显式定义为 VARCHAR,或增大样本范围以改进类型检测。
最后一段会展示扫描器使用的关键参数(可能与错误相关),并标明它们是自动检测还是用户手动设置。
使用 ignore_errors 选项
有些 CSV 文件可能包含多个结构性错误,而用户只想跳过错误行读取其余正确数据。此时可以使用 ignore_errors 选项。启用后,凡是会导致 CSV 解析报错的行都会被忽略。下面示例演示 CAST 错误;实际上“结构性错误”一节中的任意错误类型都会触发跳过。
例如,考虑如下 CSV 文件 faulty.csv:
Pedro,31
Oogie Boogie, three
若读取该文件并指定第一列为 VARCHAR、第二列为 INTEGER,则会加载失败,因为 three 不能转换成 INTEGER。
如下查询会抛出类型转换错误:
FROM read_csv('faulty.csv', columns = {'name': 'VARCHAR', 'age': 'INTEGER'});
但设置 ignore_errors 后,文件第二行会被跳过,只输出完整的第一行。例如:
FROM read_csv(
'faulty.csv',
columns = {'name': 'VARCHAR', 'age': 'INTEGER'},
ignore_errors = true
);
输出:
| name | age |
|---|---|
| Pedro | 31 |
需要注意,CSV 解析会受到“投影下推(projection pushdown)”优化影响。若我们只选择 name 列,则两行都会被视为有效,因为 age 列上的类型转换错误不会被触发。例如:
SELECT name
FROM read_csv('faulty.csv', columns = {'name': 'VARCHAR', 'age': 'INTEGER'});
输出:
| name |
|---|
| Pedro |
| Oogie Boogie |
获取错误 CSV 行
能读取错误 CSV 很重要,但在很多数据清洗场景下,还需要精确知道哪些行损坏、以及解析器在这些行上发现了什么错误。对此可以使用 Goose 的 CSV Rejects Table 功能。 默认情况下,该功能会创建两张临时表。
reject_scans: Stores information regarding the parameters of the CSV Scanner.reject_errors: Stores information regarding each CSV faulty line and in which CSV Scanner they happened.
Note that any of the errors described in our Structural Error section will be stored in the rejects tables. Also, if a line has multiple errors, multiple entries will be stored for the same line, one for each error.
Reject Scans
CSV Reject Scans 表会返回以下信息:
| 列名 | 说明 | 类型 |
|---|---|---|
scan_id | The internal ID used in Goose to represent that scanner | UBIGINT |
file_id | A scanner might happen over multiple files, so the file_id represents a unique file in a scanner | UBIGINT |
file_path | The file path | VARCHAR |
delimiter | The delimiter used e.g., ; | VARCHAR |
quote | The quote used e.g., " | VARCHAR |
escape | The quote used e.g., " | VARCHAR |
newline_delimiter | The newline delimiter used e.g., \r\n | VARCHAR |
skip_rows | If any rows were skipped from the top of the file | UINTEGER |
has_header | If the file has a header | BOOLEAN |
columns | The schema of the file (i.e., all column names and types) | VARCHAR |
date_format | The format used for date types | VARCHAR |
timestamp_format | The format used for timestamp types | VARCHAR |
user_arguments | Any extra scanner parameters manually set by the user | VARCHAR |
Reject Errors
CSV Reject Errors 表会返回以下信息:
| 列名 | 说明 | 类型 |
|---|---|---|
scan_id | The internal ID used in Goose to represent that scanner, used to join with reject scans tables | UBIGINT |
file_id | The file_id represents a unique file in a scanner, used to join with reject scans tables | UBIGINT |
line | Line number, from the CSV File, where the error occurred. | UBIGINT |
line_byte_position | Byte Position of the start of the line, where the error occurred. | UBIGINT |
byte_position | Byte Position where the error occurred. | UBIGINT |
column_idx | If the error happens in a specific column, the index of the column. | UBIGINT |
column_name | If the error happens in a specific column, the name of the column. | VARCHAR |
error_type | The type of the error that happened. | ENUM |
csv_line | The original CSV line. | VARCHAR |
error_message | The error message produced by Goose. | VARCHAR |
参数
下列参数可在 read_csv 中使用,用于配置 CSV Rejects Table。
| 名称 | 说明 | 类型 | 默认值 |
|---|---|---|---|
store_rejects | If set to true, any errors in the file will be skipped and stored in the default rejects temporary tables. | BOOLEAN | False |
rejects_scan | Name of a temporary table where the information of the scan information of faulty CSV file are stored. | VARCHAR | reject_scans |
rejects_table | Name of a temporary table where the information of the faulty lines of a CSV file are stored. | VARCHAR | reject_errors |
rejects_limit | Upper limit on the number of faulty records from a CSV file that will be recorded in the rejects table. 0 is used when no limit should be applied. | BIGINT | 0 |
若要将错误 CSV 行的信息写入 rejects 表,只需将 store_rejects 设为 true。例如:
FROM read_csv(
'faulty.csv',
columns = {'name': 'VARCHAR', 'age': 'INTEGER'},
store_rejects = true
);
随后你可以查询 reject_scans 和 reject_errors 两张表,以获取被拒绝元组的信息。例如:
FROM reject_scans;
输出:
| scan_id | file_id | file_path | delimiter | quote | escape | newline_delimiter | skip_rows | has_header | columns | date_format | timestamp_format | user_arguments |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 5 | 0 | faulty.csv | , | " | " | \n | 0 | false | {'name': 'VARCHAR','age': 'INTEGER'} | store_rejects=true |
FROM reject_errors;
输出:
| scan_id | file_id | line | line_byte_position | byte_position | column_idx | column_name | error_type | csv_line | error_message |
|---|---|---|---|---|---|---|---|---|---|
| 5 | 0 | 2 | 10 | 23 | 2 | age | CAST | Oogie Boogie, three | Error when converting column "age". Could not convert string " three" to 'INTEGER' |