Custom BigQuery partitioning and clustering
BigQuery partitioning and clustering are powerful optimization techniques that significantly improve query performance and reduce costs by organizing your data more efficiently on disk.
Ascend's BigQuery integration allows you to implement these features directly in your Component definitions using YAML, SQL, or Python Read Components or Transforms, giving you precise control over your data storage layout and query efficiency.
Supported operations
Ascend supports three BigQuery partition types: time-unit partitioning with DAY, HOUR, MONTH, or YEAR granularity; ingestion-time partitioning using BigQuery's internal _PARTITIONTIME column; and integer range partitioning with configurable start, end, and interval values.
For clustering, you can select up to four columns with moderate cardinality to optimize queries that filter, join, or aggregate on them. List the most commonly filtered columns first, since column order matters for query optimization.
Refer to the BigQuery partitioned tables documentation and the BigQuery clustered tables documentation for more details.
Custom BigQuery partitioning and clustering can be applied to all Simple and Incremental Components in Ascend. For Smart Components, specifying the cluster_by or partition_by configuration options overrides the default Smart strategies, providing full control when needed.
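Integer range partitioning takes explicit start, end, and interval values. As a hypothetical sketch, assuming the same dbt-style config() syntax used in the SQL examples later on this page (the customer_id and region columns, and the data_type key, are illustrative assumptions, not confirmed Ascend syntax):

```sql
{{
    config(
        materialized="table",
        -- Hypothetical: buckets rows by customer_id into ranges of
        -- 1000 ids each, from 0 up to (but not including) 100000.
        partition_by={
            "field": "customer_id",
            "data_type": "int64",
            "range": {"start": 0, "end": 100000, "interval": 1000}
        },
        cluster_by=["region"]
    )
}}

SELECT *
FROM {{ ref("my_table") }}
```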
YAML examples
These examples show how to apply custom partitioning and clustering strategies to YAML Read Components.
First, let's look at how to use custom partitioning with a GCS file source:
component:
  read:
    connection: gcs_lake
    gcs:
      path: some_directory/file.parquet/year=
      include:
        - glob: "*/month=*day=*/*.parquet"
For reading specific partition IDs from BigQuery tables:
component:
  read:
    connection: read_bigquery
    bigquery:
      table:
        dataset: my_dataset
        name: my_table
      partitioning:
        time:
          include:
            - eq: 20230831
            - eq: 20230826
To read a date range of partitions (exclusive date conditions):
component:
  read:
    connection: read_bigquery
    bigquery:
      table:
        dataset: my_dataset
        name: my_table
      partitioning:
        time:
          include:
            - before: 2023-02-01 # all partition ids before this date, not including this date
              after: 2023-01-01 # all partition ids after this date, not including this date
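The before/after pair selects partition ids strictly between the two dates, excluding both endpoints. A plain-Python illustration of that filtering semantics (not Ascend code; the helper name is invented):

```python
from datetime import date

def in_exclusive_range(partition_id, after, before):
    """Check whether a daily partition id (e.g. "20230115") falls
    strictly between two dates, excluding both endpoints."""
    d = date(int(partition_id[:4]), int(partition_id[4:6]), int(partition_id[6:8]))
    return after < d < before
```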
For inclusive date conditions:
component:
  read:
    connection: read_bigquery
    bigquery:
      table:
        dataset: my_dataset
        name: my_table
      partitioning:
        time:
          include:
            - on_or_after: 2023-01-01 # all partition ids including and after this date
            - on_or_before: 2023-01-01 # all partition ids including and before this date
SQL examples
These examples demonstrate how to implement partitioning and clustering in SQL Transforms across different Component types:
For a Simple table materialization:
{{
    config(
        materialized="table",
        partition_by={"field": "timestamp", "granularity": "DAY"},
        cluster_by=["route_id", "timestamp"]
    )
}}

SELECT *
FROM {{ ref("my_table") }}
For an Incremental load with merge strategy:
{{
    config(
        materialized="incremental",
        incremental_strategy="merge",
        unique_key="id",
        partition_by={"field": "timestamp", "granularity": "DAY"},
        cluster_by=["route_id", "timestamp"]
    )
}}

SELECT *
FROM {{ ref("my_table") }}
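With the merge strategy, incoming rows are upserted against the existing table on unique_key: matching rows are updated and new rows are inserted. A minimal plain-Python sketch of that semantics (illustrative only, not how Ascend or BigQuery executes the merge):

```python
def merge_on_unique_key(existing, increment, unique_key="id"):
    # Index the current rows by their unique key.
    merged = {row[unique_key]: row for row in existing}
    # Incoming rows overwrite matching keys and add new ones.
    for row in increment:
        merged[row[unique_key]] = row
    return list(merged.values())
```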
A Smart Component that doesn't override the default strategies:
SELECT *
FROM {{ ref("my_table", reshape="map") }}
A Smart Component with custom clustering:
{{ config(cluster_by=["timestamp"]) }}
SELECT *
FROM {{ ref("my_table", reshape="map") }}
A Smart Component with both custom clustering and partitioning:
{{
config(
cluster_by=["timestamp", "route_id"],
partition_by={"field": "timestamp", "granularity": "DAY"}
)
}}
SELECT *
FROM {{ ref("my_table", reshape="map") }}
Python example
This example shows a basic Python Transform that applies clustering to the cab_rides dataset by cab_type:
from ascend.resources import ref, transform


@transform(
    inputs=[
        ref("cab_rides"),
    ],
    event_time="pickup_datetime",
    cluster_by=["cab_type"],
)
def cab_rides_simple_python(cab_rides, context):
    return cab_rides
This example demonstrates a more advanced Python Transform that includes data reshaping with time-based granularity while also clustering by cab_type:
from ascend.resources import ref, transform


@transform(
    inputs=[
        ref("cab_rides", reshape={"time": {"column": "pickup_datetime", "granularity": "day"}}),
    ],
    event_time="pickup_datetime",
    cluster_by=["cab_type"],
)
def cab_rides_smart_reshape_python(cab_rides, context):
    return cab_rides
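Conceptually, the day-granularity reshape buckets rows by the day of pickup_datetime. A plain-Python illustration of that truncation (illustrative only, not Ascend internals):

```python
from datetime import datetime

def truncate_to_day(ts: datetime) -> datetime:
    # Day granularity maps every timestamp to the start of its day,
    # so all rows from the same day land in the same bucket.
    return ts.replace(hour=0, minute=0, second=0, microsecond=0)
```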