Python API
Goose Python 客户端当前最新稳定版本为
{{ site.current_goose_version }}.
安装
Goose Python API 可通过 pip: pip install goose. 详细信息请参阅安装页面。 也可以通过 conda: conda install python-goose -c conda-forge.
Python version: Goose 需要 Python 3.9 或更高版本。
基础 API 用法
使用 Goose 运行 SQL 查询最直接的方式是使用 goose.sql 命令。
import goose
goose.sql("SELECT 42").show()
这会在 Python 模块内部使用全局存储的内存数据库来运行查询。 查询结果会以 Relation 形式返回。 Relation 是查询的符号化表示。 在拉取结果或请求打印到屏幕之前,查询不会执行。
可以将 Relation 存入变量并在后续查询中当作表来引用。 这样可以增量式构建查询。
import goose
r1 = goose.sql("SELECT 42 AS i")
goose.sql("SELECT i * 2 AS k FROM r1").show()
数据输入
Goose 可从多种格式摄取数据,既支持磁盘文件,也支持内存对象。 更多信息请参阅data ingestion 页面。
import goose
goose.read_csv("example.csv") # read a CSV file into a Relation
goose.read_parquet("example.parquet") # read a Parquet file into a Relation
goose.read_json("example.json") # read a JSON file into a Relation
goose.sql("SELECT * FROM 'example.csv'") # directly query a CSV file
goose.sql("SELECT * FROM 'example.parquet'") # directly query a Parquet file
goose.sql("SELECT * FROM 'example.json'") # directly query a JSON file
DataFrame
Goose 可直接查询 Pandas DataFrame、Polars DataFrame 和 Arrow 表。
请注意这些对象是只读的,即无法通过 INSERT 或 UPDATE statements 修改这些表。
Pandas
要直接查询 Pandas DataFrame,请运行:
import goose
import pandas as pd
pandas_df = pd.DataFrame({"a": [42]})
goose.sql("SELECT * FROM pandas_df")
┌───────┐
│ a │
│ int64 │
├───────┤
│ 42 │
└───────┘
Polars
要直接查询 Polars DataFrame,请运行:
import goose
import polars as pl
polars_df = pl.DataFrame({"a": [42]})
goose.sql("SELECT * FROM polars_df")
┌───────┐
│ a │
│ int64 │
├───────┤
│ 42 │
└───────┘
PyArrow
要直接查询 PyArrow 表,请运行:
import goose
import pyarrow as pa
arrow_table = pa.Table.from_pydict({"a": [42]})
goose.sql("SELECT * FROM arrow_table")
┌───────┐
│ a │
│ int64 │
├───────┤
│ 42 │
└───────┘
结果转换
Goose 支持将查询结果高效转换为多种格式。 更多信息请参阅result conversion 页面。
import goose
goose.sql("SELECT 42").fetchall() # Python objects
goose.sql("SELECT 42").df() # Pandas DataFrame
goose.sql("SELECT 42").pl() # Polars DataFrame
goose.sql("SELECT 42").arrow() # Arrow Table
goose.sql("SELECT 42").fetchnumpy() # NumPy Arrays
将数据写入磁盘
Goose 支持将 Relation 对象直接以多种格式写入磁盘。 也可以使用 SQL 的 COPY statement 作为写盘替代方案。
import goose
goose.sql("SELECT 42").write_parquet("out.parquet") # Write to a Parquet file
goose.sql("SELECT 42").write_csv("out.csv") # Write to a CSV file
goose.sql("COPY (SELECT 42) TO 'out.parquet'") # Copy to a Parquet file
连接选项
应用可通过 goose.connect() 方法打开新的 Goose 连接。
使用内存数据库
通过 goose.sql() 使用 Goose 时,运行在内存数据库上,即不会将表持久化到磁盘。
无参调用 goose.connect() 也会返回使用内存数据库的连接:
import goose
con = goose.connect()
con.sql("SELECT 42 AS x").show()
持久化存储
The goose.connect(dbname) creates a connection to a persistent database.
Any data written to that connection will be persisted, and can be reloaded by reconnecting to the same file, both from Python and from other Goose clients.
import goose
# create a connection to a file called 'file.db'
con = goose.connect("file.db")
# create a table and load data into it
con.sql("CREATE TABLE test (i INTEGER)")
con.sql("INSERT INTO test VALUES (42)")
# query the table
con.table("test").show()
# explicitly close the connection
con.close()
# Note: connections also closed implicitly when they go out of scope
你也可以使用上下文管理器来确保连接被关闭:
import goose
with goose.connect("file.db") as con:
con.sql("CREATE TABLE test (i INTEGER)")
con.sql("INSERT INTO test VALUES (42)")
con.table("test").show()
# the context manager closes the connection automatically
配置
The goose.connect() accepts a config dictionary, where configuration options can be specified. For example:
import goose
con = goose.connect(config = {'threads': 1})
若要指定存储版本,请传入 storage_compatibility_version 选项:
import goose
con = goose.connect(config = {'storage_compatibility_version': 'latest'})
连接对象与模块
The connection object and the goose module can be used interchangeably – they support the same methods. The only difference is that when using the goose module a global in-memory database is used.
If you are developing a package designed for others to use, and use Goose in the package, it is recommended that you create connection objects instead of using the methods on the
goosemodule. That is because thegoosemodule uses a shared global database – which can cause hard to debug issues if used from within multiple different packages.
在并行 Python 程序中使用连接
goose.sql() 与全局连接的线程安全性
goose.sql() and goose.connect(':default:') use a shared global in-memory connection. This connection is not thread-safe, and running queries on it from multiple threads can cause issues. To run Goose in parallel, each thread must have its own connection:
def good_use():
con = goose.connect()
# uses new connection
con.sql("SELECT 1").fetchall()
Conversely, the following could cause concurrency issues because they rely on a global connection:
def bad_use():
con = goose.connect(':default:')
# uses global connection
return con.sql("SELECT 1").fetchall()
Or:
def also_bad():
return goose.sql("SELECT 1").fetchall()
# uses global connection
请避免使用 goose.sql(),或在多个线程间共享同一个连接。
关于 cursor()
A GoosePyConnection.cursor() method creates another handle on the same connection. It does not open a new connection. Therefore, all cursors created from one connection cannot run queries at the same time.
社区扩展
To load community extensions, use the repository="community" argument with the install_extension method.
For example, install and load the h3 community extension as follows:
import goose
con = goose.connect()
con.install_extension("h3", repository="community")
con.load_extension("h3")
未签名扩展
To load unsigned extensions, use:
con = goose.connect(config={"allow_unsigned_extensions": "true"})
Warning Only load unsigned extensions from sources you trust. Avoid loading unsigned extensions over HTTP. Consult the Securing Goose page for guidelines on how set up Goose in a secure manner.