Custom BigQuery partitioning and clustering
BigQuery partitioning and clustering are powerful optimization techniques that significantly improve query performance and reduce costs by organizing your data more efficiently on disk.
Ascend's BigQuery integration allows you to implement these features directly in your Component definitions using YAML, SQL, or Python Read Components or Transforms, giving you precise control over your data storage layout and query efficiency.
Supported operations
Ascend supports three BigQuery partition types: time-unit partitioning with DAY, HOUR, MONTH, or YEAR granularity; ingestion-time partitioning using BigQuery's internal _PARTITIONTIME column; and integer range partitioning with configurable start, end, and interval values.
For clustering, you can select up to four columns with moderate cardinality to optimize queries that filter, join, or aggregate on them. List the most commonly filtered columns first, since column order matters for query optimization.
Refer to the BigQuery partitioned tables documentation and the BigQuery clustered tables documentation for more details.
Custom BigQuery partitioning and clustering can be applied to all Simple and Incremental Components in Ascend. For Smart Components, specifying the cluster_by or partition_by configuration options overrides the default Smart strategies, providing full control when needed.
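Integer range partitioning takes explicit start, end, and interval values. As a hypothetical sketch, assuming the same dbt-style config() syntax used in the SQL examples later on this page (the customer_id and region columns, and the data_type key, are illustrative assumptions, not confirmed Ascend syntax):

```sql
{{
    config(
        materialized="table",
        -- Hypothetical: buckets rows by customer_id into ranges of
        -- 1000 ids each, from 0 up to (but not including) 100000.
        partition_by={
            "field": "customer_id",
            "data_type": "int64",
            "range": {"start": 0, "end": 100000, "interval": 1000}
        },
        cluster_by=["region"]
    )
}}

SELECT *
FROM {{ ref("my_table") }}
```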
YAML examples
These examples show how to apply custom partitioning and clustering strategies to YAML Read Components.
First, let's look at how to use custom partitioning with a GCS file source:
component:
  read:
    connection: gcs_lake
    gcs:
      path: some_directory/file.parquet/year=
      include:
        - glob: "*/month=*day=*/*.parquet"
For reading specific partition IDs from BigQuery tables:
component:
  read:
    connection: read_bigquery
    bigquery:
      table:
        dataset: my_dataset
        name: my_table
      partitioning:
        time:
          include:
            - eq: 20230831
            - eq: 20230826
To read a date range of partitions (exclusive date conditions):
component:
  read:
    connection: read_bigquery
    bigquery:
      table:
        dataset: my_dataset
        name: my_table
      partitioning:
        time:
          include:
            - before: 2023-02-01 # all partition ids before this date, not including this date
              after: 2023-01-01 # all partition ids after this date, not including this date
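The before/after pair selects partition ids strictly between the two dates, excluding both endpoints. A plain-Python illustration of that filtering semantics (not Ascend code; the helper name is invented):

```python
from datetime import date

def in_exclusive_range(partition_id, after, before):
    """Check whether a daily partition id (e.g. "20230115") falls
    strictly between two dates, excluding both endpoints."""
    d = date(int(partition_id[:4]), int(partition_id[4:6]), int(partition_id[6:8]))
    return after < d < before
```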
For inclusive date conditions:
component:
  read:
    connection: read_bigquery
    bigquery:
      table:
        dataset: my_dataset
        name: my_table
      partitioning:
        time:
          include:
            - on_or_after: 2023-01-01 # all partition ids including and after this date
            - on_or_before: 2023-01-01 # all partition ids including and before this date
SQL examples
These examples demonstrate how to implement partitioning and clustering in SQL Transforms across different Component types:
For a Simple table materialization:
{{
    config(
        materialized="table",
        partition_by={"field": "timestamp", "granularity": "DAY"},
        cluster_by=["route_id", "timestamp"]
    )
}}

SELECT *
FROM {{ ref("my_table") }}
For an Incremental load with merge strategy:
{{
    config(
        materialized="incremental",
        incremental_strategy="merge",
        unique_key="id",
        partition_by={"field": "timestamp", "granularity": "DAY"},
        cluster_by=["route_id", "timestamp"]
    )
}}

SELECT *
FROM {{ ref("my_table") }}
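With the merge strategy, incoming rows are upserted against the existing table on unique_key: matching rows are updated and new rows are inserted. A minimal plain-Python sketch of that semantics (illustrative only, not how Ascend or BigQuery executes the merge):

```python
def merge_on_unique_key(existing, increment, unique_key="id"):
    # Index the current rows by their unique key.
    merged = {row[unique_key]: row for row in existing}
    # Incoming rows overwrite matching keys and add new ones.
    for row in increment:
        merged[row[unique_key]] = row
    return list(merged.values())
```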
A Smart Component that doesn't override the default strategies:
SELECT *
FROM {{ ref("my_table", reshape="map") }}
A Smart Component with custom clustering:
{{ config(cluster_by=["timestamp"]) }}
SELECT *
FROM {{ ref("my_table", reshape="map") }}
A Smart Component with both custom clustering and partitioning:
{{
config(
cluster_by=["timestamp", "route_id"],
partition_by={"field": "timestamp", "granularity": "DAY"}
)
}}
SELECT *
FROM {{ ref("my_table", reshape="map") }}
Python example
This example shows a basic Python Transform that applies clustering to the cab_rides dataset by cab_type:
from ascend.resources import ref, transform


@transform(
    inputs=[
        ref("cab_rides"),
    ],
    event_time="pickup_datetime",
    cluster_by=["cab_type"],
)
def cab_rides_simple_python(cab_rides, context):
    return cab_rides
This example demonstrates a more advanced Python Transform that includes data reshaping with time-based granularity while also clustering by cab_type:
from ascend.resources import ref, transform


@transform(
    inputs=[
        ref("cab_rides", reshape={"time": {"column": "pickup_datetime", "granularity": "day"}}),
    ],
    event_time="pickup_datetime",
    cluster_by=["cab_type"],
)
def cab_rides_smart_reshape_python(cab_rides, context):
    return cab_rides
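Conceptually, the day-granularity reshape buckets rows by the day of pickup_datetime. A plain-Python illustration of that truncation (illustrative only, not Ascend internals):

```python
from datetime import datetime

def truncate_to_day(ts: datetime) -> datetime:
    # Day granularity maps every timestamp to the start of its day,
    # so all rows from the same day land in the same bucket.
    return ts.replace(hour=0, minute=0, second=0, microsecond=0)
```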