Python API

The latest stable version of the Goose Python client is {{ site.current_goose_version }}.

Installation

The Goose Python API can be installed using pip: pip install goose. Please see the installation page for details. It is also possible to install Goose using conda: conda install python-goose -c conda-forge.

Python version: Goose requires Python 3.9 or newer.

Basic API Usage

The most straight-forward manner of running SQL queries using Goose is using the goose.sql command.

import goose

goose.sql("SELECT 42").show()

This will run queries using an in-memory database that is stored globally inside the Python module. The result of the query is returned as a Relation. A relation is a symbolic representation of the query. The query is not executed until the result is fetched or requested to be printed to the screen.

Relations can be referenced in subsequent queries by storing them inside variables, and using them as tables. This way queries can be constructed incrementally.

import goose

r1 = goose.sql("SELECT 42 AS i")
goose.sql("SELECT i * 2 AS k FROM r1").show()

Data Input

Goose can ingest data from a wide variety of formats – both on-disk and in-memory. See the data ingestion page for more information.

import goose

goose.read_csv("example.csv")                # read a CSV file into a Relation
goose.read_parquet("example.parquet")        # read a Parquet file into a Relation
goose.read_json("example.json")              # read a JSON file into a Relation

goose.sql("SELECT * FROM 'example.csv'")     # directly query a CSV file
goose.sql("SELECT * FROM 'example.parquet'") # directly query a Parquet file
goose.sql("SELECT * FROM 'example.json'")    # directly query a JSON file

DataFrames

Goose can directly query Pandas DataFrames, Polars DataFrames and Arrow tables. Note that these are read-only, i.e., editing these tables via INSERT or UPDATE statements is not possible.

Pandas

To directly query a Pandas DataFrame, run:

import goose
import pandas as pd

pandas_df = pd.DataFrame({"a": [42]})
goose.sql("SELECT * FROM pandas_df")

┌───────┐
│   a   │
│ int64 │
├───────┤
│    42 │
└───────┘

Polars

To directly query a Polars DataFrame, run:

import goose
import polars as pl

polars_df = pl.DataFrame({"a": [42]})
goose.sql("SELECT * FROM polars_df")

┌───────┐
│   a   │
│ int64 │
├───────┤
│    42 │
└───────┘

PyArrow

To directly query a PyArrow table, run:

import goose
import pyarrow as pa

arrow_table = pa.Table.from_pydict({"a": [42]})
goose.sql("SELECT * FROM arrow_table")

┌───────┐
│   a   │
│ int64 │
├───────┤
│    42 │
└───────┘

Result Conversion

Goose supports converting query results efficiently to a variety of formats. See the result conversion page for more information.

import goose

goose.sql("SELECT 42").fetchall()   # Python objects
goose.sql("SELECT 42").df()         # Pandas DataFrame
goose.sql("SELECT 42").pl()         # Polars DataFrame
goose.sql("SELECT 42").arrow()      # Arrow Table
goose.sql("SELECT 42").fetchnumpy() # NumPy Arrays

Writing Data to Disk

Goose supports writing Relation objects directly to disk in a variety of formats. The COPY statement can be used to write data to disk using SQL as an alternative.

import goose

goose.sql("SELECT 42").write_parquet("out.parquet") # Write to a Parquet file
goose.sql("SELECT 42").write_csv("out.csv")         # Write to a CSV file
goose.sql("COPY (SELECT 42) TO 'out.parquet'")      # Copy to a Parquet file

Connection Options

Applications can open a new Goose connection via the goose.connect() method.

Using an In-Memory Database

When using Goose through goose.sql(), it operates on an in-memory database, i.e., no tables are persisted on disk. Invoking the goose.connect() method without arguments returns a connection, which also uses an in-memory database:

import goose

con = goose.connect()
con.sql("SELECT 42 AS x").show()

Persistent Storage

The goose.connect(dbname) creates a connection to a persistent database. Any data written to that connection will be persisted, and can be reloaded by reconnecting to the same file, both from Python and from other Goose clients.

import goose

# create a connection to a file called 'file.db'
con = goose.connect("file.db")
# create a table and load data into it
con.sql("CREATE TABLE test (i INTEGER)")
con.sql("INSERT INTO test VALUES (42)")
# query the table
con.table("test").show()
# explicitly close the connection
con.close()
# Note: connections also closed implicitly when they go out of scope

You can also use a context manager to ensure that the connection is closed:

import goose

with goose.connect("file.db") as con:
    con.sql("CREATE TABLE test (i INTEGER)")
    con.sql("INSERT INTO test VALUES (42)")
    con.table("test").show()
    # the context manager closes the connection automatically

Configuration

The goose.connect() accepts a config dictionary, where configuration options can be specified. For example:

import goose

con = goose.connect(config = {'threads': 1})

To specify the storage version, pass the storage_compatibility_version option:

import goose

con = goose.connect(config = {'storage_compatibility_version': 'latest'})

Connection Object and Module

The connection object and the goose module can be used interchangeably – they support the same methods. The only difference is that when using the goose module a global in-memory database is used.

If you are developing a package designed for others to use, and use Goose in the package, it is recommended that you create connection objects instead of using the methods on the goose module. That is because the goose module uses a shared global database – which can cause hard to debug issues if used from within multiple different packages.

Using Connections in Parallel Python Programs

Thread Safety of `goose.sql()` and the Global Connection

goose.sql() and goose.connect(':default:') use a shared global in-memory connection. This connection is not thread-safe, and running queries on it from multiple threads can cause issues. To run Goose in parallel, each thread must have its own connection:

def good_use():
    con = goose.connect()
    # uses new connection
    con.sql("SELECT 1").fetchall()

Conversely, the following could cause concurrency issues because they rely on a global connection:

def bad_use():
    con = goose.connect(':default:')
    # uses global connection
    return con.sql("SELECT 1").fetchall()

Or:

def also_bad():
    return goose.sql("SELECT 1").fetchall()
    # uses global connection 

Avoid using goose.sql() or sharing a single connection across threads.

About `cursor()`

A GoosePyConnection.cursor() method creates another handle on the same connection. It does not open a new connection. Therefore, all cursors created from one connection cannot run queries at the same time.

Community Extensions

To load community extensions, use the repository="community" argument with the install_extension method.

For example, install and load the h3 community extension as follows:

import goose

con = goose.connect()
con.install_extension("h3", repository="community")
con.load_extension("h3")

Unsigned Extensions

To load unsigned extensions, use:

con = goose.connect(config={"allow_unsigned_extensions": "true"})

Warning Only load unsigned extensions from sources you trust. Avoid loading unsigned extensions over HTTP. Consult the Securing Goose page for guidelines on how set up Goose in a secure manner.

Installation​

Basic API Usage​

Data Input​

DataFrames​

Pandas​

Polars​

PyArrow​

Result Conversion​

Writing Data to Disk​

Connection Options​

Using an In-Memory Database​

Persistent Storage​

Configuration​

Connection Object and Module​

Using Connections in Parallel Python Programs​

Thread Safety of goose.sql() and the Global Connection​

About cursor()​

Community Extensions​

Unsigned Extensions​