Python API
The latest stable version of the Goose Python client is
{{ site.current_goose_version }}.
Installation
The Goose Python API can be installed using pip: pip install goose. Please see the installation page for details. It is also possible to install Goose using conda: conda install python-goose -c conda-forge.
Python version: Goose requires Python 3.9 or newer.
Basic API Usage
The most straight-forward manner of running SQL queries using Goose is using the goose.sql command.
import goose
goose.sql("SELECT 42").show()
This will run queries using an in-memory database that is stored globally inside the Python module. The result of the query is returned as a Relation. A relation is a symbolic representation of the query. The query is not executed until the result is fetched or requested to be printed to the screen.
Relations can be referenced in subsequent queries by storing them inside variables, and using them as tables. This way queries can be constructed incrementally.
import goose
r1 = goose.sql("SELECT 42 AS i")
goose.sql("SELECT i * 2 AS k FROM r1").show()
Data Input
Goose can ingest data from a wide variety of formats – both on-disk and in-memory. See the data ingestion page for more information.
import goose
goose.read_csv("example.csv") # read a CSV file into a Relation
goose.read_parquet("example.parquet") # read a Parquet file into a Relation
goose.read_json("example.json") # read a JSON file into a Relation
goose.sql("SELECT * FROM 'example.csv'") # directly query a CSV file
goose.sql("SELECT * FROM 'example.parquet'") # directly query a Parquet file
goose.sql("SELECT * FROM 'example.json'") # directly query a JSON file
DataFrames
Goose can directly query Pandas DataFrames, Polars DataFrames and Arrow tables.
Note that these are read-only, i.e., editing these tables via INSERT or UPDATE statements is not possible.
Pandas
To directly query a Pandas DataFrame, run:
import goose
import pandas as pd
pandas_df = pd.DataFrame({"a": [42]})
goose.sql("SELECT * FROM pandas_df")
┌───────┐
│ a │
│ int64 │
├───────┤
│ 42 │
└───────┘
Polars
To directly query a Polars DataFrame, run:
import goose
import polars as pl
polars_df = pl.DataFrame({"a": [42]})
goose.sql("SELECT * FROM polars_df")
┌───────┐
│ a │
│ int64 │
├───────┤
│ 42 │
└───────┘
PyArrow
To directly query a PyArrow table, run:
import goose
import pyarrow as pa
arrow_table = pa.Table.from_pydict({"a": [42]})
goose.sql("SELECT * FROM arrow_table")
┌───────┐
│ a │
│ int64 │
├───────┤
│ 42 │
└───────┘
Result Conversion
Goose supports converting query results efficiently to a variety of formats. See the result conversion page for more information.
import goose
goose.sql("SELECT 42").fetchall() # Python objects
goose.sql("SELECT 42").df() # Pandas DataFrame
goose.sql("SELECT 42").pl() # Polars DataFrame
goose.sql("SELECT 42").arrow() # Arrow Table
goose.sql("SELECT 42").fetchnumpy() # NumPy Arrays
Writing Data to Disk
Goose supports writing Relation objects directly to disk in a variety of formats. The COPY statement can be used to write data to disk using SQL as an alternative.
import goose
goose.sql("SELECT 42").write_parquet("out.parquet") # Write to a Parquet file
goose.sql("SELECT 42").write_csv("out.csv") # Write to a CSV file
goose.sql("COPY (SELECT 42) TO 'out.parquet'") # Copy to a Parquet file
Connection Options
Applications can open a new Goose connection via the goose.connect() method.
Using an In-Memory Database
When using Goose through goose.sql(), it operates on an in-memory database, i.e., no tables are persisted on disk.
Invoking the goose.connect() method without arguments returns a connection, which also uses an in-memory database:
import goose
con = goose.connect()
con.sql("SELECT 42 AS x").show()
Persistent Storage
The goose.connect(dbname) creates a connection to a persistent database.
Any data written to that connection will be persisted, and can be reloaded by reconnecting to the same file, both from Python and from other Goose clients.
import goose
# create a connection to a file called 'file.db'
con = goose.connect("file.db")
# create a table and load data into it
con.sql("CREATE TABLE test (i INTEGER)")
con.sql("INSERT INTO test VALUES (42)")
# query the table
con.table("test").show()
# explicitly close the connection
con.close()
# Note: connections also closed implicitly when they go out of scope
You can also use a context manager to ensure that the connection is closed:
import goose
with goose.connect("file.db") as con:
con.sql("CREATE TABLE test (i INTEGER)")
con.sql("INSERT INTO test VALUES (42)")
con.table("test").show()
# the context manager closes the connection automatically
Configuration
The goose.connect() accepts a config dictionary, where configuration options can be specified. For example:
import goose
con = goose.connect(config = {'threads': 1})
To specify the storage version, pass the storage_compatibility_version option:
import goose
con = goose.connect(config = {'storage_compatibility_version': 'latest'})
Connection Object and Module
The connection object and the goose module can be used interchangeably – they support the same methods. The only difference is that when using the goose module a global in-memory database is used.
If you are developing a package designed for others to use, and use Goose in the package, it is recommended that you create connection objects instead of using the methods on the
goosemodule. That is because thegoosemodule uses a shared global database – which can cause hard to debug issues if used from within multiple different packages.
Using Connections in Parallel Python Programs
Thread Safety of goose.sql() and the Global Connection
goose.sql() and goose.connect(':default:') use a shared global in-memory connection. This connection is not thread-safe, and running queries on it from multiple threads can cause issues. To run Goose in parallel, each thread must have its own connection:
def good_use():
con = goose.connect()
# uses new connection
con.sql("SELECT 1").fetchall()
Conversely, the following could cause concurrency issues because they rely on a global connection:
def bad_use():
con = goose.connect(':default:')
# uses global connection
return con.sql("SELECT 1").fetchall()
Or:
def also_bad():
return goose.sql("SELECT 1").fetchall()
# uses global connection
Avoid using goose.sql() or sharing a single connection across threads.
About cursor()
A GoosePyConnection.cursor() method creates another handle on the same connection. It does not open a new connection. Therefore, all cursors created from one connection cannot run queries at the same time.
Community Extensions
To load community extensions, use the repository="community" argument with the install_extension method.
For example, install and load the h3 community extension as follows:
import goose
con = goose.connect()
con.install_extension("h3", repository="community")
con.load_extension("h3")
Unsigned Extensions
To load unsigned extensions, use:
con = goose.connect(config={"allow_unsigned_extensions": "true"})
Warning Only load unsigned extensions from sources you trust. Avoid loading unsigned extensions over HTTP. Consult the Securing Goose page for guidelines on how set up Goose in a secure manner.