跳到主要内容

Python API

Goose Python 客户端当前最新稳定版本为 {{ site.current_goose_version }}.

安装

Goose Python API 可通过 pip: pip install goose. 详细信息请参阅安装页面。 也可以通过 conda: conda install python-goose -c conda-forge.

Python version: Goose 需要 Python 3.9 或更高版本。

基础 API 用法

使用 Goose 运行 SQL 查询最直接的方式是使用 goose.sql 命令。

import goose

goose.sql("SELECT 42").show()

这会在 Python 模块内部使用全局存储的内存数据库来运行查询。 查询结果会以 Relation 形式返回。 Relation 是查询的符号化表示。 在拉取结果或请求打印到屏幕之前,查询不会执行。

可以将 Relation 存入变量并在后续查询中当作表来引用。 这样可以增量式构建查询。

import goose

r1 = goose.sql("SELECT 42 AS i")
goose.sql("SELECT i * 2 AS k FROM r1").show()

数据输入

Goose 可从多种格式摄取数据,既支持磁盘文件,也支持内存对象。 更多信息请参阅data ingestion 页面

import goose

goose.read_csv("example.csv") # read a CSV file into a Relation
goose.read_parquet("example.parquet") # read a Parquet file into a Relation
goose.read_json("example.json") # read a JSON file into a Relation

goose.sql("SELECT * FROM 'example.csv'") # directly query a CSV file
goose.sql("SELECT * FROM 'example.parquet'") # directly query a Parquet file
goose.sql("SELECT * FROM 'example.json'") # directly query a JSON file

DataFrame

Goose 可直接查询 Pandas DataFrame、Polars DataFrame 和 Arrow 表。 请注意这些对象是只读的,即无法通过 INSERTUPDATE statements 修改这些表。

Pandas

要直接查询 Pandas DataFrame,请运行:

import goose
import pandas as pd

pandas_df = pd.DataFrame({"a": [42]})
goose.sql("SELECT * FROM pandas_df")
┌───────┐
│ a │
│ int64 │
├───────┤
│ 42 │
└───────┘

Polars

要直接查询 Polars DataFrame,请运行:

import goose
import polars as pl

polars_df = pl.DataFrame({"a": [42]})
goose.sql("SELECT * FROM polars_df")
┌───────┐
│ a │
│ int64 │
├───────┤
│ 42 │
└───────┘

PyArrow

要直接查询 PyArrow 表,请运行:

import goose
import pyarrow as pa

arrow_table = pa.Table.from_pydict({"a": [42]})
goose.sql("SELECT * FROM arrow_table")
┌───────┐
│ a │
│ int64 │
├───────┤
│ 42 │
└───────┘

结果转换

Goose 支持将查询结果高效转换为多种格式。 更多信息请参阅result conversion 页面

import goose

goose.sql("SELECT 42").fetchall() # Python objects
goose.sql("SELECT 42").df() # Pandas DataFrame
goose.sql("SELECT 42").pl() # Polars DataFrame
goose.sql("SELECT 42").arrow() # Arrow Table
goose.sql("SELECT 42").fetchnumpy() # NumPy Arrays

将数据写入磁盘

Goose 支持将 Relation 对象直接以多种格式写入磁盘。 也可以使用 SQL 的 COPY statement 作为写盘替代方案。

import goose

goose.sql("SELECT 42").write_parquet("out.parquet") # Write to a Parquet file
goose.sql("SELECT 42").write_csv("out.csv") # Write to a CSV file
goose.sql("COPY (SELECT 42) TO 'out.parquet'") # Copy to a Parquet file

连接选项

应用可通过 goose.connect() 方法打开新的 Goose 连接。

使用内存数据库

通过 goose.sql() 使用 Goose 时,运行在内存数据库上,即不会将表持久化到磁盘。 无参调用 goose.connect() 也会返回使用内存数据库的连接:

import goose

con = goose.connect()
con.sql("SELECT 42 AS x").show()

持久化存储

The goose.connect(dbname) creates a connection to a persistent database. Any data written to that connection will be persisted, and can be reloaded by reconnecting to the same file, both from Python and from other Goose clients.

import goose

# create a connection to a file called 'file.db'
con = goose.connect("file.db")
# create a table and load data into it
con.sql("CREATE TABLE test (i INTEGER)")
con.sql("INSERT INTO test VALUES (42)")
# query the table
con.table("test").show()
# explicitly close the connection
con.close()
# Note: connections also closed implicitly when they go out of scope

你也可以使用上下文管理器来确保连接被关闭:

import goose

with goose.connect("file.db") as con:
con.sql("CREATE TABLE test (i INTEGER)")
con.sql("INSERT INTO test VALUES (42)")
con.table("test").show()
# the context manager closes the connection automatically

配置

The goose.connect() accepts a config dictionary, where configuration options can be specified. For example:

import goose

con = goose.connect(config = {'threads': 1})

若要指定存储版本,请传入 storage_compatibility_version 选项:

import goose

con = goose.connect(config = {'storage_compatibility_version': 'latest'})

连接对象与模块

The connection object and the goose module can be used interchangeably – they support the same methods. The only difference is that when using the goose module a global in-memory database is used.

If you are developing a package designed for others to use, and use Goose in the package, it is recommended that you create connection objects instead of using the methods on the goose module. That is because the goose module uses a shared global database – which can cause hard to debug issues if used from within multiple different packages.

在并行 Python 程序中使用连接

goose.sql() 与全局连接的线程安全性

goose.sql() and goose.connect(':default:') use a shared global in-memory connection. This connection is not thread-safe, and running queries on it from multiple threads can cause issues. To run Goose in parallel, each thread must have its own connection:

def good_use():
con = goose.connect()
# uses new connection
con.sql("SELECT 1").fetchall()

Conversely, the following could cause concurrency issues because they rely on a global connection:

def bad_use():
con = goose.connect(':default:')
# uses global connection
return con.sql("SELECT 1").fetchall()

Or:

def also_bad():
return goose.sql("SELECT 1").fetchall()
# uses global connection

请避免使用 goose.sql(),或在多个线程间共享同一个连接。

关于 cursor()

A GoosePyConnection.cursor() method creates another handle on the same connection. It does not open a new connection. Therefore, all cursors created from one connection cannot run queries at the same time.

社区扩展

To load community extensions, use the repository="community" argument with the install_extension method.

For example, install and load the h3 community extension as follows:

import goose

con = goose.connect()
con.install_extension("h3", repository="community")
con.load_extension("h3")

未签名扩展

To load unsigned extensions, use:

con = goose.connect(config={"allow_unsigned_extensions": "true"})

Warning Only load unsigned extensions from sources you trust. Avoid loading unsigned extensions over HTTP. Consult the Securing Goose page for guidelines on how set up Goose in a secure manner.