Spark API
The Goose Spark API implements the PySpark API, allowing you to use the familiar Spark API to interact with Goose. All statements are translated to Goose's internal relational plans and executed by Goose's query engine.
Warning: The Goose Spark API is currently experimental and some features are still missing. We are very interested in feedback. Please report any functionality that you are missing, either through Discord or on GitHub.
Example
```python
from goose.experimental.spark.sql import SparkSession as session
from goose.experimental.spark.sql.functions import lit, col
import pandas as pd

spark = session.builder.getOrCreate()

pandas_df = pd.DataFrame({
    'age': [34, 45, 23, 56],
    'name': ['Joan', 'Peter', 'John', 'Bob']
})

df = spark.createDataFrame(pandas_df)
df = df.withColumn('location', lit('Seattle'))

res = df.select(
    col('age'),
    col('location')
).collect()

print(res)
```
```text
[
    Row(age=34, location='Seattle'),
    Row(age=45, location='Seattle'),
    Row(age=23, location='Seattle'),
    Row(age=56, location='Seattle')
]
```
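For readers coming from pandas, the same pipeline can be reproduced with plain pandas operations. This sketch only illustrates the semantics of `withColumn` and `select`; it does not go through Goose:

```python
import pandas as pd

# Same input as the Spark example above
pandas_df = pd.DataFrame({
    'age': [34, 45, 23, 56],
    'name': ['Joan', 'Peter', 'John', 'Bob']
})

# withColumn('location', lit('Seattle')): add a constant column
df = pandas_df.assign(location='Seattle')

# select(col('age'), col('location')): project two columns
res = df[['age', 'location']]
print(res.to_dict('records'))
```

Each dictionary in the printed list corresponds to one `Row` in the Spark output above.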
Contribution Guidelines
Contributions to the experimental Spark API are welcome. When making a contribution, please follow these guidelines:
- Instead of using temporary files, use our pytest testing framework.
- When adding new functions, ensure that method signatures comply with those in the PySpark API.
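As a sketch of the in-memory style the first guideline asks for, a pytest-style test can build its input as a pandas DataFrame and compare results against an expected list of records, with no temporary files involved. The test name is an illustrative assumption, and the `projected` stand-in below models what the Goose pipeline would return:

```python
import pandas as pd

def test_constant_column_in_memory():
    # Build the input in memory instead of writing a temporary file.
    pandas_df = pd.DataFrame({'age': [34, 45], 'name': ['Joan', 'Peter']})

    # Stand-in for the Goose pipeline (assumption: with Goose installed,
    # this would be spark.createDataFrame(pandas_df)
    #     .withColumn('location', lit('Seattle'))
    #     .select(col('age'), col('location')).collect()).
    projected = pandas_df.assign(location='Seattle')[['age', 'location']]

    expected = [
        {'age': 34, 'location': 'Seattle'},
        {'age': 45, 'location': 'Seattle'},
    ]
    assert projected.to_dict('records') == expected

test_constant_column_in_memory()
```

Keeping fixtures in memory this way makes tests faster and avoids cleanup of temporary files.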