S3 Express One
In late 2023, AWS announced S3 Express One Zone, a high-speed variant of traditional S3 buckets.
Goose can read S3 Express One buckets using the httpfs extension.
Credentials and Configuration
The configuration of S3 Express One buckets is similar to regular S3 buckets with one exception: you must specify the endpoint according to the following pattern:
s3express-⟨availability_zone⟩.⟨region⟩.amazonaws.com
where ⟨availability_zone⟩ is the zone ID (e.g., use1-az5), which can be obtained from the S3 Express One bucket's configuration page, and ⟨region⟩ is the AWS region (e.g., us-east-1).
For example, to allow Goose to use an S3 Express One bucket, configure the Secrets manager as follows:
CREATE SECRET (
TYPE s3,
KEY_ID '⟨AKIAIOSFODNN7EXAMPLE⟩',
SECRET '⟨wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY⟩',
REGION '⟨us-east-1⟩',
ENDPOINT 's3express-⟨use1-az5⟩.⟨us-east-1⟩.amazonaws.com'
);
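The ENDPOINT value in the statement above follows the pattern given earlier, so it can be assembled from its two parts. The following is a minimal shell sketch; `s3express_endpoint` is a hypothetical helper name introduced here for illustration and is not part of Goose or the AWS CLI.

```shell
# s3express_endpoint: hypothetical helper (not part of any tool).
# Builds the endpoint host name from a zone ID and an AWS region,
# following the pattern s3express-<zone_id>.<region>.amazonaws.com.
s3express_endpoint() {
  local zone_id="$1" region="$2"
  printf 's3express-%s.%s.amazonaws.com\n' "$zone_id" "$region"
}

s3express_endpoint use1-az5 us-east-1
# prints: s3express-use1-az5.us-east-1.amazonaws.com
```

The printed value can then be pasted into the ENDPOINT field of the secret.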
Instance Location
For best performance, ensure the EC2 instance is in the same availability zone as the S3 Express One bucket you are querying.
To determine the mapping between zone names and zone IDs, use the aws ec2 describe-availability-zones command.
- Zone name to zone ID mapping:
aws ec2 describe-availability-zones --output json \
| jq -r '.AvailabilityZones[] | select(.ZoneName == "us-east-1f") | .ZoneId'
use1-az5
- Zone ID to zone name mapping:
aws ec2 describe-availability-zones --output json \
| jq -r '.AvailabilityZones[] | select(.ZoneId == "use1-az5") | .ZoneName'
us-east-1f
Querying
You can query the S3 Express One bucket like any other S3 bucket:
SELECT *
FROM 's3://express-bucket-name--use1-az5--x-s3/my-file.parquet';
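The bucket name in the query above embeds the zone ID (use1-az5). Assuming the bucket follows the `<base-name>--<zone-id>--x-s3` naming format, the zone ID can be read back from the name and compared with the zone ID of the EC2 instance, which is useful for the co-location advice in the Instance Location section. This is a sketch only: both function names are hypothetical, and the instance's zone ID is passed in by hand rather than looked up.

```shell
# zone_id_from_bucket: hypothetical helper.
# Extracts the zone ID from a name of the form <base-name>--<zone-id>--x-s3.
# Returns a non-zero status if the name does not match that shape.
zone_id_from_bucket() {
  case "$1" in
    *--*--x-s3) ;;
    *) return 1 ;;
  esac
  local name="${1%--x-s3}"      # drop the "--x-s3" suffix
  printf '%s\n' "${name##*--}"  # keep what follows the last "--"
}

# same_zone: hypothetical helper.
# Succeeds when the bucket's zone ID equals the given zone ID.
same_zone() {
  [ "$(zone_id_from_bucket "$1")" = "$2" ]
}

zone_id_from_bucket express-bucket-name--use1-az5--x-s3
# prints: use1-az5

if same_zone express-bucket-name--use1-az5--x-s3 use1-az5; then
  echo "bucket and instance share a zone"
fi
```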
Performance
The following experiments were run on a c7gd.12xlarge instance using the LDBC SF300 Comments creationDate Parquet file (also used in the microbenchmarks of the performance guide).
| Experiment | File size | Runtime |
|---|---|---|
| Loading only from Parquet | 4.1 GB | 3.5 s |
| Creating local table from Parquet | 4.1 GB | 5.1 s |
The “loading only” variant runs the load as part of an EXPLAIN ANALYZE statement to measure the runtime without creating a local table, while the “creating local table” variant uses CREATE TABLE ... AS SELECT to create a persistent table on the local disk.
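To put the two runtimes in perspective, they can be converted to an effective throughput by dividing the file size by the runtime. The figures below are derived from the table, not measured separately, and `throughput` is a hypothetical helper name.

```shell
# throughput: hypothetical helper.
# Prints size (GB) divided by runtime (s), rounded to two decimals, in GB/s.
throughput() {
  awk -v size="$1" -v secs="$2" 'BEGIN { printf "%.2f\n", size / secs }'
}

throughput 4.1 3.5   # loading only from Parquet: prints 1.17
throughput 4.1 5.1   # creating local table from Parquet: prints 0.80
```

By this rough measure, reading alone proceeds at about 1.17 GB/s, and writing the local table lowers the effective rate to about 0.80 GB/s.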