Spark API
The Goose Spark API implements the PySpark API, allowing you to use the familiar Spark API to interact with Goose. All statements are translated to Goose's internal relational plans and executed by Goose's query engine.
Warning: The Goose Spark API is currently experimental and some features are still missing. We are very interested in feedback. Please report any functionality that you are missing, either through Discord or on GitHub.
Example
```python
from goose.experimental.spark.sql import SparkSession as session
from goose.experimental.spark.sql.functions import lit, col
import pandas as pd

spark = session.builder.getOrCreate()

pandas_df = pd.DataFrame({
    'age': [34, 45, 23, 56],
    'name': ['Joan', 'Peter', 'John', 'Bob']
})

df = spark.createDataFrame(pandas_df)
df = df.withColumn('location', lit('Seattle'))

res = df.select(
    col('age'),
    col('location')
).collect()

print(res)
```
```text
[
    Row(age=34, location='Seattle'),
    Row(age=45, location='Seattle'),
    Row(age=23, location='Seattle'),
    Row(age=56, location='Seattle')
]
```
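For readers coming from pandas, the same pipeline can be reproduced with plain pandas operations. This sketch only illustrates the semantics of `withColumn` and `select`; it does not go through Goose:

```python
import pandas as pd

# Same input as the Spark example above
pandas_df = pd.DataFrame({
    'age': [34, 45, 23, 56],
    'name': ['Joan', 'Peter', 'John', 'Bob']
})

# withColumn('location', lit('Seattle')): add a constant column
df = pandas_df.assign(location='Seattle')

# select(col('age'), col('location')): project two columns
res = df[['age', 'location']]
print(res.to_dict('records'))
```

Each dictionary in the printed list corresponds to one `Row` in the Spark output above.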
Contribution Guidelines
Contributions to the experimental Spark API are welcome. When making a contribution, please follow these guidelines:
- Instead of using temporary files, use our pytest testing framework.
- When adding new functions, ensure that method signatures comply with those in the PySpark API.
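As a sketch of the in-memory style the first guideline asks for, a pytest-style test can build its input as a pandas DataFrame and compare results against an expected list of records, with no temporary files involved. The test name is an illustrative assumption, and the `projected` stand-in below models what the Goose pipeline would return:

```python
import pandas as pd

def test_constant_column_in_memory():
    # Build the input in memory instead of writing a temporary file.
    pandas_df = pd.DataFrame({'age': [34, 45], 'name': ['Joan', 'Peter']})

    # Stand-in for the Goose pipeline (assumption: with Goose installed,
    # this would be spark.createDataFrame(pandas_df)
    #     .withColumn('location', lit('Seattle'))
    #     .select(col('age'), col('location')).collect()).
    projected = pandas_df.assign(location='Seattle')[['age', 'location']]

    expected = [
        {'age': 34, 'location': 'Seattle'},
        {'age': 45, 'location': 'Seattle'},
    ]
    assert projected.to_dict('records') == expected

test_constant_column_in_memory()
```

Keeping fixtures in memory this way makes tests faster and avoids cleanup of temporary files.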