
Transform

Transforms are the core data processing components in Ascend. They apply your business logic to data: cleansing, shaping, aggregating, and preparing it for analytics or operational use. You can create Transforms in either SQL or Python, choosing whichever fits the task.

Key Features of Transforms

  1. Flexibility: Supports both SQL and Python, letting you choose the language that best suits your data transformation needs.
  2. Integration with Flows: Integrates seamlessly into Ascend Flows, enabling the automated orchestration of complex data pipelines.
  3. Efficiency: Provides features like partitioning and incremental processing to optimize resource usage and processing time.
  4. Testing and Validation: Includes built-in support for testing, ensuring data quality and integrity throughout the transformation process.

How Transforms Work

Transforms take data from one or more input components, such as Read Components or other Transforms, apply specified transformations, and produce output for other components in the pipeline, including Write Components. Transforms can perform a wide range of operations, from simple data cleaning to complex aggregations and calculations.
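
For example, a SQL Transform can combine multiple inputs. The following is a minimal sketch, assuming hypothetical upstream components named users and events:

user-event-counts.sql.jinja
-- A sketch only: "users" and "events" are hypothetical upstream components.
SELECT
    u.user_id,
    COUNT(e.event_id) AS event_count
FROM {{ ref("users") }} AS u
JOIN {{ ref("events") }} AS e
    ON u.user_id = e.user_id
GROUP BY u.user_id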

Ascend supports both SQL and Python Transforms. You can choose the language based on the transformation logic's complexity and your familiarity with the languages. Consider the following when selecting between SQL and Python Transforms:

  • SQL Transforms: Best for set-based operations and working directly with relational data. Ideal for tasks like aggregating, filtering, and joining data.
  • Python Transforms: More flexible and suitable for complex logic that SQL cannot easily express, such as text processing, custom calculations, or calling external services.

Ascend infers the transform configuration from the file suffix. For example, .sql.jinja files are treated as SQL Transforms, and .py files are treated as Python Transforms.

Types of Transforms in Ascend

Info: See the Transform Reference for more information on the syntax and capabilities of all transform types.

Ascend categorizes Transforms based on the language used (SQL or Python) and the operational strategy (basic, incremental, or smart):

  • Basic Transforms: The simplest form of Transform component. These perform a full refresh of the output table on each run; a minimal sketch follows this list.
  • Incremental Transforms: Process transformations incrementally, focusing on new or changed data since the last run. This approach reduces processing time and resource usage, making it ideal for growing datasets.
  • Smart Transforms: Designed for large datasets, these process data in chunks based on a partition key, improving efficiency and scalability. You can maintain the partitions upstream in your Data Flow or repartition the data, enabling fine-grained control over data processing.
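
As a point of reference, a basic SQL Transform needs no special configuration. A minimal sketch, assuming the default behavior when no materialization is configured and an upstream component named users:

basic-transform.sql.jinja
-- A sketch of a basic Transform: the output table is fully rebuilt on each run.
SELECT
    user_id,
    address
FROM {{ ref("users") }}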

Incremental Transforms

Incremental transforms are components that incrementally transform upstream data. The upstream data typically has a column that changes monotonically over time, such as an updated_at, modified_at, or last_modified timestamp column. During each run of an incremental transform, data that has already been processed is filtered out, and only new data is processed through the transform logic. This approach is often used to optimize performance by reducing the volume of data processed on each run, especially in data stores that support indexing or partitioning by a column.

Incremental transforms, whether in SQL or Python, belong to a broader class of materialized components that includes partitioned transforms, read connectors, and write connectors. These are called materialized components because the Ascend platform manages reading data from input components and writing to output tables, while user code focuses strictly on business logic.

What distinguishes incremental transforms from other types of materialized components, such as partitioned components, is that they leverage the ordered nature of the source data and only store metadata relevant to the column that defines the order of records.

Structure of an Incremental Transform

An incremental transform is defined by setting materialized="incremental".

You can express the transformation logic in SQL or Python. The Ascend framework wraps this logic in an upsert or merge query to ensure that newly transformed records are written to the output table. The specifics of this upsert or merge logic depend on the configured incremental_strategy. For example, the merge strategy requires that updated columns are reflected in the output table.
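
Conceptually, the generated query behaves like a SQL MERGE keyed on unique_key. The following hand-written sketch illustrates the idea, assuming a unique_key of user_id and updated columns address and ts; it is not the exact query Ascend generates, and the table names are placeholders:

-- Illustration only: output_table and new_records are placeholder names.
MERGE INTO output_table AS target
USING new_records AS source
    ON target.user_id = source.user_id
WHEN MATCHED THEN
    UPDATE SET address = source.address, ts = source.ts
WHEN NOT MATCHED THEN
    INSERT (user_id, address, ts)
    VALUES (source.user_id, source.address, source.ts);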

The is_incremental flag is used within the transform logic to determine when to execute backfill logic, which is typically executed the first time a transform is run or when full-refresh is set to true.

The output table includes an additional metadata column called _ascend_batch_id, which is a unique identifier that tracks the run during which each output record was inserted or updated.
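
You can use this column to audit which run produced each record. For example (the table name my_transform is a placeholder):

-- Hypothetical audit query: count the records written by each run.
SELECT
    _ascend_batch_id,
    COUNT(*) AS records_written
FROM my_transform
GROUP BY _ascend_batch_id;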

Configuration Options

  • materialized = incremental: Specifies that this is an incremental transform.
  • incremental_strategy: Defines how newly processed records are written to the output table. The only currently supported strategy is merge.
  • unique_key: A comma-separated list of columns that together uniquely identify each record. This parameter is required by the merge strategy (and by the delete-insert strategy, where supported).
  • merge_update_columns and merge_exclude_columns: Specify which columns are updated in the output table for changed rows.
  • incremental_predicates: Specific to the merge strategy. These additional predicates are included in the join expression (expressed in SQL syntax) to limit which records in the output table are merged. A combined configuration sketch follows this list.
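
As a combined illustration, the sketch below shows how these options might appear together in a SQL Transform config. The predicate string is an assumption for illustration; consult the Transform Reference for the exact incremental_predicates syntax:

incremental-config-example.sql.jinja
{#- Sketch only: the predicate format below is an assumption, not confirmed syntax. -#}
{{
    config(
        materialized="incremental",
        incremental_strategy="merge",
        unique_key="user_id",
        merge_update_columns=["address", "ts"],
        incremental_predicates=["ts > current_date - 7"],
    )
}}

SELECT * FROM {{ ref("users") }}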

Examples

Incremental SQL Transform
incremental-transform.sql.jinja
{{
    config(
        materialized="incremental",
        incremental_strategy="merge",
        unique_key="user_id",
        merge_exclude_columns=["name", "created_at"],
    )
}}

SELECT * FROM {{ ref("users") }}

{% if is_incremental() %}
WHERE ts > (SELECT ts FROM {{ this }} ORDER BY ts DESC LIMIT 1)
{% endif %}

In this example, the incremental merge strategy reads from a table called users, appends newly inserted rows, and merges updated rows into the output table. For updated rows, all columns except name and created_at are updated in the output table.

The is_incremental() block filters out data that has already been processed. This is achieved by selecting the latest timestamp from {{ this }}, which is a reference to the output table. If is_incremental() is true, the output table already exists and is not empty.

Incremental Python Transform
incremental-transform.py
from ascend.resources import Asset, ref, transform

@transform(
    inputs=[ref("users")],
    materialized="incremental",
    incremental_strategy="merge",
    unique_key="user_id",
    merge_update_columns=["address", "ts"],
)
def incremental_transform_python(source_data, context) -> Asset:
    output = source_data
    if context.is_incremental:
        # Keep only rows newer than the latest timestamp
        # already present in the output table.
        current_data = context.current_data()
        output = output[output["ts"] > current_data["ts"].max()]

    return output

The Python example is similar to the SQL transform above. Here, the source data is provided as an Ibis table expression. The is_incremental flag in the context indicates whether the transform is incremental, and the output table (if it exists) can be referenced as an Ibis table expression using context.current_data(). In this example, for record updates, only the address and ts columns are updated.

Smart Transforms

Smart Transforms are designed for large datasets, processing data in chunks based on a partition key. They utilize additional metadata to track which records have been processed, and process only new or changed partitions on each run. They are also code and data aware: if the data changes, only the affected partitions are reprocessed; if the code changes, all partitions are reprocessed.

Smart Transforms are similar to Partitioned Reads, but with the following differences:

  • Smart Transforms are used for transforming data, while Partitioned Reads are used for reading data.
  • Smart Transforms are not only data aware but also code aware: a change to the code reprocesses all partitions, while a change to the data reprocesses only the affected partitions.

For every run of a Smart Transform, all partitions are eligible for processing. The transform logic is responsible for determining which partitions have changed and processing only those partitions. The key benefits of this approach are:

  • Guaranteed Consistency: Because Smart Transforms are fully data aware, any data change, including late-arriving data, triggers a re-run that processes only the changed partitions.
  • Reduced Data Volume: By only processing changed partitions, Smart Transforms can significantly reduce the amount of data processed, improving performance and reducing costs.
  • Faster Data Updates: Smart Transforms can quickly update data in Ascend by only processing changed partitions, without needing to reprocess the entire dataset.
  • Increased Parallelism: Smart Transforms can be parallelized, allowing for more efficient use of resources and faster data processing.

Smart Transforms are configured by setting materialized="smart", or by setting reshape in the ref call. This provides Ascend with the information needed to determine which partitions have changed.

Examples

Smart SQL Transform
smart-transform.sql.jinja
SELECT
    *
FROM
    {{
        ref(
            "users",
            reshape={"time": {"column": "event_ts", "granularity": "day"}}
        )
    }}
Smart Python Transform
smart-transform.py
from ascend.resources import Asset, ref, transform

@transform(
    inputs=[
        ref(
            "users",
            reshape={"time": {"column": "event_ts", "granularity": "day"}},
        )
    ],
    materialized="smart",
)
def smart_transform_python(source_data, context) -> Asset:
    # Pass-through body: the reshape declared on the ref drives
    # partition-by-partition (daily, by event_ts) processing.
    return source_data

Best Practices for Using Transforms

  • Choosing the Right Type: Select the Transform type based on data volume, transformation complexity, and performance needs.
  • Optimization: Use partitioning and incremental processing to optimize performance, especially for large datasets.
  • Testing: Include Tests within your Transforms to ensure data integrity and the accuracy of the transformation logic.
  • Monitoring and Tuning: Monitor Transform performance and resource usage, tuning as needed to maintain optimal pipeline performance.

Conclusion

Transforms are crucial in Ascend's data engineering platform, providing tools for effective data transformation. By understanding the different types of Transforms and following best practices, you can process data efficiently, ensuring it meets analytical or operational requirements. With support for SQL and Python, Ascend offers a robust and flexible environment for data transformation, catering to various use cases and complexity levels.