Skip to main content

CSV Import

Examples

The following examples use the flights.csv file.

Read a CSV file from disk, auto-infer options:

SELECT * FROM 'flights.csv';

Use the read_csv function with custom options:

SELECT *
FROM read_csv('flights.csv',
delim = '|',
header = true,
columns = {
'FlightDate': 'DATE',
'UniqueCarrier': 'VARCHAR',
'OriginCityName': 'VARCHAR',
'DestCityName': 'VARCHAR'
});

Read a CSV from stdin, auto-infer options:

cat flights.csv | goose -c "SELECT * FROM read_csv('/dev/stdin')"

Read a CSV file into a table:

CREATE TABLE ontime (
FlightDate DATE,
UniqueCarrier VARCHAR,
OriginCityName VARCHAR,
DestCityName VARCHAR
);
COPY ontime FROM 'flights.csv';

Alternatively, create a table without specifying the schema manually using a CREATE TABLE ... AS SELECT statement:

CREATE TABLE ontime AS
SELECT * FROM 'flights.csv';

We can use the FROM-first syntax to omit SELECT *.

CREATE TABLE ontime AS
FROM 'flights.csv';

CSV Loading

CSV loading, i.e., importing CSV files to the database, is a very common, and yet surprisingly tricky, task. While CSVs seem simple on the surface, there are a lot of inconsistencies found within CSV files that can make loading them a challenge. CSV files come in many different varieties, are often corrupt, and do not have a schema. The CSV reader needs to cope with all of these different situations.

The Goose CSV reader can automatically infer which configuration flags to use by analyzing the CSV file using the CSV sniffer. This will work correctly in most situations, and should be the first option attempted. In rare situations where the CSV reader cannot figure out the correct configuration it is possible to manually configure the CSV reader to correctly parse the CSV file. See the auto detection page for more information.

Parameters

Below are parameters that can be passed to the read_csv function. Where meaningfully applicable, these parameters can also be passed to the COPY statement.

NameDescriptionTypeDefault
all_varcharSkip type detection and assume all columns are of type VARCHAR. This option is only supported by the read_csv function.BOOLfalse
allow_quoted_nullsAllow the conversion of quoted values to NULL valuesBOOLtrue
auto_detectAuto detect CSV parameters.BOOLtrue
auto_type_candidatesTypes that the sniffer uses when detecting column types. The VARCHAR type is always included as a fallback option. See example.TYPE[]default types
buffer_sizeSize of the buffers used to read files, in bytes. Must be large enough to hold four lines and can significantly impact performance.BIGINT16 * max_line_size
columnsColumn names and types, as a struct (e.g., {'col1': 'INTEGER', 'col2': 'VARCHAR'}). Using this option disables auto detection of the schema.STRUCT(empty)
commentCharacter used to initiate comments. Lines starting with a comment character (optionally preceded by space characters) are completely ignored; other lines containing a comment character are parsed only up to that point.VARCHAR(empty)
compressionMethod used to compress CSV files. By default this is detected automatically from the file extension (e.g., t.csv.gz will use gzip, t.csv will use none). Options are none, gzip, zstd.VARCHARauto
dateformatDate format used when parsing and writing dates.VARCHAR(empty)
date_formatAlias for dateformat; only available in the COPY statement.VARCHAR(empty)
decimal_separatorDecimal separator for numbers.VARCHAR.
delimDelimiter character used to separate columns within each line, e.g., , ; \t. The delimiter character can be up to 4 bytes, e.g., 🦆. Alias for sep.VARCHAR,
delimiterAlias for delim; only available in the COPY statement.VARCHAR,
escapeString used to escape the quote character within quoted values.VARCHAR"
encodingEncoding used by the CSV file. Options are utf-8, utf-16, latin-1. Not available in the COPY statement (which always uses utf-8).VARCHARutf-8
filenameAdd path of the containing file to each row, as a string column named filename. Relative or absolute paths are returned depending on the path or glob pattern provided to read_csv, not just filenames. Since Goose v1.3.0, the filename column is added automatically as a virtual column and this option is only kept for compatibility reasons.BOOLfalse
force_not_nullDo not match values in the specified columns against the NULL string. In the default case where the NULL string is empty, this means that empty values are read as zero-length strings instead of NULLs.VARCHAR[][]
headerFirst line of each file contains the column names.BOOLfalse
hive_partitioningInterpret the path as a Hive partitioned path.BOOL(auto-detected)
ignore_errorsIgnore any parsing errors encountered.BOOLfalse
max_line_size or maximum_line_size. Not available in the COPY statement.Maximum line size, in bytes.BIGINT2000000
names or column_namesColumn names, as a list. See example.VARCHAR[](empty)
new_lineNew line character(s). Options are '\r','\n', or '\r\n'. The CSV parser only distinguishes between single-character and double-character line delimiters. Therefore, it does not differentiate between '\r' and '\n'.VARCHAR(empty)
normalize_namesNormalize column names. This removes any non-alphanumeric characters from them. Column names that are reserved SQL keywords are prefixed with an underscore character (_).BOOLfalse
null_paddingPad the remaining columns on the right with NULL values when a line lacks columns.BOOLfalse
nullstr or nullStrings that represent a NULL value.VARCHAR or VARCHAR[](empty)
parallelUse the parallel CSV reader.BOOLtrue
quoteString used to quote values.VARCHAR"
rejects_scanName of the temporary table where information on faulty scans is stored.VARCHARreject_scans
rejects_tableName of the temporary table where information on faulty lines is stored.VARCHARreject_errors
rejects_limitUpper limit on the number of faulty lines per file that are recorded in the rejects table. Setting this to 0 means that no limit is applied.BIGINT0
sample_sizeNumber of sample lines for auto detection of parameters.BIGINT20480
sepDelimiter character used to separate columns within each line, e.g., , ; \t. The delimiter character can be up to 4 bytes, e.g., 🦆. Alias for delim.VARCHAR,
skipNumber of lines to skip at the start of each file.BIGINT0
store_rejectsSkip any lines with errors and store them in the rejects table.BOOLfalse
strict_modeEnforces the strictness level of the CSV Reader. When set to true, the parser will throw an error upon encountering any issues. When set to false, the parser will attempt to read structurally incorrect files. It is important to note that reading structurally incorrect files can cause ambiguity; therefore, this option should be used with caution.BOOLtrue
thousandsCharacter used to identify thousands separators in numeric values. It must be a single character and different from the decimal_separator option.VARCHAR(empty)
timestampformatTimestamp format used when parsing and writing timestamps.VARCHAR(empty)
timestamp_formatAlias for timestampformat; only available in the COPY statement.VARCHAR(empty)
types or dtypes or column_typesColumn types, as either a list (by position) or a struct (by name). See example.VARCHAR[] or STRUCT(empty)
union_by_nameAlign columns from different files by column name instead of position. Using this option increases memory consumption.BOOLfalse

Tip Goose's CSV reader supports UTF-8 (default), UTF-16 and Latin-1 encodings. For other encodings, you can either use the encodings extension or convert them e.g. using the iconv command-line tool:

iconv -f ISO-8859-2 -t UTF-8 input.csv > input-utf-8.csv

auto_type_candidates Details

The auto_type_candidates option lets you specify the data types that should be considered by the CSV reader for column data type detection. Usage example:

SELECT * FROM read_csv('csv_file.csv', auto_type_candidates = ['BIGINT', 'DATE']);

The default value for the auto_type_candidates option is ['NULL', 'BOOLEAN', 'BIGINT', 'DOUBLE', 'TIME', 'DATE', 'TIMESTAMP', 'VARCHAR'].

CSV Functions

The read_csv automatically attempts to figure out the correct configuration of the CSV reader using the CSV sniffer. It also automatically deduces types of columns. If the CSV file has a header, it will use the names found in that header to name the columns. Otherwise, the columns will be named column0, column1, column2, .... An example with the flights.csv file:

SELECT * FROM read_csv('flights.csv');
FlightDateUniqueCarrierOriginCityNameDestCityName
1988-01-01AANew York, NYLos Angeles, CA
1988-01-02AANew York, NYLos Angeles, CA
1988-01-03AANew York, NYLos Angeles, CA

The path can either be a relative path (relative to the current working directory) or an absolute path.

We can use read_csv to create a persistent table as well:

CREATE TABLE ontime AS
SELECT * FROM read_csv('flights.csv');
DESCRIBE ontime;
column_namecolumn_typenullkeydefaultextra
FlightDateDATEYESNULLNULLNULL
UniqueCarrierVARCHARYESNULLNULLNULL
OriginCityNameVARCHARYESNULLNULLNULL
DestCityNameVARCHARYESNULLNULLNULL
SELECT * FROM read_csv('flights.csv', sample_size = 20_000);

If we set delim / sep, quote, escape, or header explicitly, we can bypass the automatic detection of this particular parameter:

SELECT * FROM read_csv('flights.csv', header = true);

Multiple files can be read at once by providing a glob or a list of files. Refer to the multiple files section for more information.

Writing Using the COPY Statement

The COPY statement can be used to load data from a CSV file into a table. This statement has the same syntax as the one used in PostgreSQL. To load the data using the COPY statement, we must first create a table with the correct schema (which matches the order of the columns in the CSV file and uses types that fit the values in the CSV file). COPY detects the CSV's configuration options automatically.

CREATE TABLE ontime (
flightdate DATE,
uniquecarrier VARCHAR,
origincityname VARCHAR,
destcityname VARCHAR
);
COPY ontime FROM 'flights.csv';
SELECT * FROM ontime;
flightdateuniquecarrierorigincitynamedestcityname
1988-01-01AANew York, NYLos Angeles, CA
1988-01-02AANew York, NYLos Angeles, CA
1988-01-03AANew York, NYLos Angeles, CA

If we want to manually specify the CSV format, we can do so using the configuration options of COPY.

CREATE TABLE ontime (flightdate DATE, uniquecarrier VARCHAR, origincityname VARCHAR, destcityname VARCHAR);
COPY ontime FROM 'flights.csv' (DELIMITER '|', HEADER);
SELECT * FROM ontime;

Reading Faulty CSV Files

Goose supports reading erroneous CSV files. For details, see the Reading Faulty CSV Files page.

Order Preservation

The CSV reader respects the preserve_insertion_order configuration option to preserve insertion order. When true (the default), the order of the rows in the result set returned by the CSV reader is the same as the order of the corresponding lines read from the file(s). When false, there is no guarantee that the order is preserved.

Writing CSV Files

Goose can write CSV files using the COPY ... TO statement.