Schema change

In this tutorial, you'll learn how to implement schema change strategies in your Ascend pipelines to avoid schema mismatch errors and make your pipelines robust and resilient.

What you'll learn

This tutorial covers:

How to identify and handle schema changes in your data pipelines
Different schema change strategies and when to use each one
How to implement schema change handling in Read and Transform Components
Best practices for managing schema evolution in production environments

Why is schema change handling important?

Data schemas evolve over time as business requirements change. Without proper schema change handling, your data pipelines can break when:

New columns are added to source tables
Column data types change
Columns are renamed or removed
Table structures are reorganized

Ascend's schema change strategies allow your pipelines to adapt to these changes automatically, reducing maintenance overhead and ensuring continuous data flow.
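To make the kinds of change listed above concrete, the following sketch compares two schemas and classifies the differences. It is illustrative only and not part of Ascend: the diff_schemas function and the plain {column: type} dictionaries are assumptions made for this example.

```python
def diff_schemas(source: dict[str, str], target: dict[str, str]) -> dict[str, list[str]]:
    """Classify differences between two {column_name: type_name} mappings.

    Hypothetical helper for illustration; not an Ascend API.
    """
    # Columns that exist upstream but not yet in the target table.
    added = [name for name in source if name not in target]
    # Columns that the target still has but the source no longer provides.
    removed = [name for name in target if name not in source]
    # Columns present on both sides whose declared type differs.
    retyped = [name for name in source if name in target and source[name] != target[name]]
    return {"added": added, "removed": removed, "retyped": retyped}


source = {"id": "int64", "amount": "float64", "region": "string"}
target = {"id": "int64", "amount": "int64", "legacy_code": "string"}

changes = diff_schemas(source, target)
print(changes)
# {'added': ['region'], 'removed': ['legacy_code'], 'retyped': ['amount']}
```

A renamed column shows up in a name-based comparison like this one as one removed column plus one added column, which is why renames are often the most disruptive kind of change.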

Full refresh

Run Components and Flows in full-refresh mode to resolve schema mismatch errors and completely refresh the Data Plane schema for the affected Components. A full refresh rebuilds the entire dataset from scratch, ensuring the schema is up to date.

To perform a full refresh on an entire Flow:

1. Navigate to the Build Info panel and click Run Flow.
2. In the Run Flow dialog, scroll down to the Advanced Actions section and check the Full Refresh box.
3. Click Run to execute the entire Flow in full-refresh mode.

For targeted schema updates, you can perform a full refresh on an individual Component, or on a selected subset of Components, from the Component context menu.

Schema change strategies

Ascend Components provide four options for handling schema changes through the on_schema_change parameter: sync_all_columns, append_new_columns, ignore, and fail.

sync_all_columns

The "sync_all_columns" strategy synchronizes the target table's columns to exactly match the source schema. When using this strategy:

Columns present in the source but missing in the target are added
Columns in the target but not in the source are removed
Data types of existing columns are updated to match the source
Original column order is preserved for existing columns, with new columns added at the end

This strategy ensures the target always fully reflects the current structure of the source, at the cost of dropping target columns that no longer exist in the source.
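The rules above can be modelled in a few lines of Python. This is a simplified sketch of the resulting target schema only, assuming schemas are plain {column: type} dictionaries; the function below is hypothetical and does not show how Ascend applies the change in the Data Plane.

```python
def sync_all_columns(source: dict[str, str], target: dict[str, str]) -> dict[str, str]:
    """Return the target schema after a sync, per the rules described above.

    Hypothetical model for illustration; not Ascend's implementation.
    """
    # Keep surviving target columns in their original order, taking the source type.
    synced = {name: source[name] for name in target if name in source}
    # Append columns that are new in the source, in source order.
    for name, dtype in source.items():
        if name not in synced:
            synced[name] = dtype
    return synced


source = {"region": "string", "id": "int64", "amount": "float64"}
target = {"id": "int64", "amount": "int64", "legacy_code": "string"}

synced = sync_all_columns(source, target)
print(synced)
# {'id': 'int64', 'amount': 'float64', 'region': 'string'}
```

In this example legacy_code is dropped, amount takes the source type, and region is appended after the existing columns.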

append_new_columns

The "append_new_columns" strategy adds new columns from the source to the target table without modifying existing columns. When using this strategy:

New columns detected in the source data are added to the target
Existing columns in the target remain unchanged, even if they no longer exist in the source
Data type changes to existing columns are not propagated

This approach preserves all current data and structure, simply extending the schema to accommodate new fields as they appear over time.
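As a rough illustration, the resulting target schema can be modelled as follows. The sketch assumes schemas are plain {column: type} dictionaries, and the function is a hypothetical stand-in rather than Ascend's implementation.

```python
def append_new_columns(source: dict[str, str], target: dict[str, str]) -> dict[str, str]:
    """Return the target schema extended with columns that are new in the source.

    Hypothetical model for illustration; not Ascend's implementation.
    """
    # Start from the target as-is: no column is dropped and no type is changed.
    extended = dict(target)
    # Add only the columns the target has never seen, at the end.
    for name, dtype in source.items():
        if name not in extended:
            extended[name] = dtype
    return extended


source = {"id": "int64", "amount": "float64", "region": "string"}
target = {"id": "int64", "amount": "int64", "legacy_code": "string"}

extended = append_new_columns(source, target)
print(extended)
# {'id': 'int64', 'amount': 'int64', 'legacy_code': 'string', 'region': 'string'}
```

Here legacy_code survives although the source no longer provides it, and amount keeps its existing target type.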

ignore

The "ignore" strategy maintains the existing target schema without any modifications:

No columns are added, removed, or modified regardless of source schema changes
The system will attempt to write data using the existing schema
If incompatible data is encountered (e.g., missing required columns), an error will occur
This is useful when you want to strictly control schema through other means
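The sketch below models this behavior on a list of row dictionaries. It rests on two assumptions that the text above does not confirm: columns unknown to the target are silently dropped, and every target column is treated as required. The function name is hypothetical.

```python
def write_with_fixed_schema(rows: list[dict], target_columns: list[str]) -> list[dict]:
    """Project each row onto a fixed list of target columns.

    Hypothetical model of the "ignore" strategy; assumes extra source
    columns are dropped and every target column is required.
    """
    written = []
    for row in rows:
        missing = [name for name in target_columns if name not in row]
        if missing:
            # Incompatible data: the existing schema cannot be satisfied.
            raise ValueError(f"Row is missing required columns: {missing}")
        # Keep only the columns the target already knows about.
        written.append({name: row[name] for name in target_columns})
    return written


target_columns = ["id", "amount"]
rows = [{"id": 1, "amount": 9.5, "region": "EU"}]

written = write_with_fixed_schema(rows, target_columns)
print(written)
# [{'id': 1, 'amount': 9.5}]
```

Passing a row such as {"id": 2}, which lacks the amount column, raises ValueError in this model.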

fail

The "fail" strategy provides the strictest approach to schema management:

Any schema mismatch between source and target immediately raises an error
No automatic schema modifications are permitted
Forces manual intervention for any schema changes
Useful when you want complete control over schema evolution and data migrations
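A minimal sketch of such a strict check follows, assuming schemas are plain {column: type} dictionaries. The SchemaMismatchError class and the require_matching_schema function are hypothetical names, and this model compares column names and types only, not column order.

```python
class SchemaMismatchError(Exception):
    """Raised when source and target schemas differ (hypothetical name)."""


def require_matching_schema(source: dict[str, str], target: dict[str, str]) -> None:
    """Raise if any column is missing on either side or has a different type."""
    mismatched = sorted(
        name
        for name in source.keys() | target.keys()
        if source.get(name) != target.get(name)
    )
    if mismatched:
        raise SchemaMismatchError(f"Schema mismatch on columns: {mismatched}")


target = {"id": "int64", "amount": "float64"}
source = {"id": "int64", "amount": "float64", "region": "string"}

require_matching_schema(target, target)  # identical schemas pass silently

try:
    require_matching_schema(source, target)
except SchemaMismatchError as exc:
    message = str(exc)
    print(message)
    # Schema mismatch on columns: ['region']
```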

You can specify your chosen schema change strategy in the on_schema_change parameter available in Read and Transform Components.

Examples

This section demonstrates how to specify a schema change strategy in different types of Components in Ascend. By configuring the on_schema_change parameter correctly, you can ensure your data pipelines remain resilient when source schemas evolve.

Incremental Python Read

In this example, an Incremental Read Component specifies the "sync_all_columns" schema change strategy in the @read decorator:

read_metabook.py
import polars as pl
import pyarrow as pa
from ascend.application.context import ComponentExecutionContext
from ascend.common.events import log
from ascend.resources import read


@read(
    strategy="incremental",
    incremental_strategy="merge",
    unique_key="id",
    on_schema_change="sync_all_columns",
)
def read_metabook(context: ComponentExecutionContext) -> pa.Table:
    df = pl.read_parquet("gs://ascend-io-gcs-public/ottos-expeditions/lakev0/generated/events/metabook.parquet/year=*/month=*/day=*/*.parquet")
    current_data = context.current_data()
    if current_data is not None:
        current_data = current_data.to_polars()
        max_ts = current_data["timestamp"].max()
        log(f"Reading data after {max_ts}")
        df = df.filter(df["timestamp"] > max_ts)
    else:
        log("No current data found, reading all data")

    log(f"Returning {df.height} rows")
    return df.to_arrow()

Local files Read Component

This example demonstrates how to configure a local file system Read Component with the ignore schema change strategy:

read_local_files_ignore.yaml
component:
  read:
    connection: local_fs
    local_file:
      path: rc_schema_evolution/
      include:
      - regex: .*\.csv
      parser:
        csv:
          has_header: true
    strategy:
      partitioned:
        enable_substitution_by_partition_name: false
      on_schema_change: ignore

Incremental MySQL Read Component

This example shows an incremental MySQL Read Component configured with the sync_all_columns strategy and merge-based incremental loading:

incremental_mysql_sync.yaml
component:
  skip: false
  read:
    connection: mysql_db
    mysql:
      table:
        name: mysql_incremental_schema_evolution_test
    strategy:
      on_schema_change: sync_all_columns
      replication:
        incremental:
          column_name: updated_at
      incremental:
        merge:
          unique_key: id
          deletion_column: deleted_at
  data_plane:
    databricks:
      table_properties:
        delta.columnMapping.mode: name

Incremental Python Transform

The following Python Transform Component uses the on_schema_change parameter to specify how to handle schema changes when running incrementally:

incremental_transform.py
from ascend.resources import ref, transform


@transform(
    inputs=[ref("schema_shifting_data")],
    materialized="incremental",
    incremental_strategy="merge",
    unique_key="key",
    merge_update_columns=["string", "ts"],
    on_schema_change="sync_all_columns",
)
def incremental_transform_schema_evol_python_sync(schema_shifting_data, context):
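    def _n(x):
        # Resolve a column name: keep it as-is when the input has a lowercase
        # "string" column; otherwise fall back to the uppercase name.
        return x if "string" in schema_shifting_data else x.upper()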

    output = schema_shifting_data
    if context.is_incremental:
        current_data = context.current_data()
        output = output[output[_n("ts")] > current_data[_n("ts")].max()]

    return output

Incremental SQL Transform

This SQL Transform example shows how to configure schema change handling in SQL using the config block. It uses sync_all_columns; the other accepted values are append_new_columns, ignore, and fail:

{{
  config(
    materialized="incremental",
    incremental_strategy="merge",
    unique_key="key",
    merge_update_columns=["string", "ts"],
    on_schema_change="sync_all_columns",
  )
}}
SELECT * FROM {{ ref("schema_shifting_data") }}

{% if is_incremental() %}
  WHERE ts > (SELECT ts FROM {{ this }} ORDER BY ts DESC LIMIT 1)
{% endif %}

Summary

Choose sync_all_columns when you want to ensure target tables always reflect current source structure
Use append_new_columns when you want to preserve historical columns while adding new ones
Select ignore when you want to maintain existing schema without modifications
Use fail when you need strict control over schema changes
Consider full-refresh when making significant schema changes to ensure data consistency
Test schema change strategies in development environments before applying to production
Document your schema change strategy choices for each Component to aid in maintenance
