Version: 3.0.0

Create a Smart PySpark Transform

This guide shows you how to build a Smart PySpark Transform, which uses intelligent partitioning to process only the relevant subsets of data. This can significantly improve performance for queries that filter on partitioned fields.

PySpark is Apache Spark's Python API that enables distributed data processing with Python, allowing you to work with large datasets across a cluster.

Databricks only

PySpark is only available on Ascend Instances running on Databricks. Check out our Quickstart to set up a Databricks Instance.

Prerequisites

Create a Transform

You can create a Transform in two ways: through the form UI or directly in the Files panel.

  1. Double-click the Flow where you want to add your Transform
  2. Right-click on an existing component (typically a Read component or another Transform) that will provide input data
  3. Select Create Downstream → Transform from the context menu
  4. Complete the form with these details:
    • Select your Flow
    • Enter a descriptive name for your Transform (e.g., sales_aggregation)
    • Choose the appropriate file type for your Transform logic

Create your Smart PySpark Transform

Follow these steps to create a Smart PySpark Transform:

  1. Import required packages:

    • Ascend resources (pyspark, ref)
    • PySpark objects (DataFrame, SparkSession)
  2. Apply the @pyspark() decorator with Smart settings:

    • Use reshape="map" to follow the partitioning logic of the input data.
    • Use event_time for time-series processing, such as smart backfills or time-range runs.
    • Include cluster_by to optimize table setup, ideally aligning it with the partitioning strategy when columns match.
  3. Define your transform function:

    • Implement your data processing logic
    • Return a DataFrame with the processed data
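
To make the event_time setting more concrete, here is a minimal plain-Python sketch of what a time-range run does conceptually: it keeps only the records whose event-time value falls inside the requested window. This is an illustration under stated assumptions, not Ascend's implementation. The helper name rows_in_window, the half-open window convention, and the sample rows are all hypothetical; only the column names pickup_datetime and cab_type come from this guide.

```python
from datetime import datetime


def rows_in_window(rows, event_time, start, end):
    """Keep rows whose event-time value lies in the half-open window [start, end).

    Hypothetical helper for illustration only; it is not part of the Ascend API.
    """
    return [row for row in rows if start <= row[event_time] < end]


# Made-up sample data using the column names from this guide.
rides = [
    {"cab_type": "yellow", "pickup_datetime": datetime(2024, 1, 1, 8, 30)},
    {"cab_type": "green", "pickup_datetime": datetime(2024, 1, 2, 9, 0)},
    {"cab_type": "yellow", "pickup_datetime": datetime(2024, 1, 3, 0, 0)},
]

# A "backfill" of a single day touches only the rows inside that day.
backfill = rows_in_window(
    rides,
    "pickup_datetime",
    start=datetime(2024, 1, 2),
    end=datetime(2024, 1, 3),
)
print([row["cab_type"] for row in backfill])  # ['green']
```

The half-open window is a design choice in this sketch: a record stamped exactly at the end boundary belongs to the next window, so adjacent windows never process the same record twice.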
What makes a Transform "Smart"

Adding the reshape="map" parameter to your input ref is what turns a regular PySpark Transform into a Smart Transform. It helps Ascend track which partitions (based on your cluster_by fields) contain which data.
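
The idea of partition tracking can be sketched in plain Python. The example below is a simplified conceptual model, not Ascend's actual mechanism: it groups rows by a key, fingerprints each group, and re-runs the transform only for groups whose contents changed since the previous run. The names partition_by, fingerprint, smart_map, and add_total, the cache layout, and the sample rows are all hypothetical.

```python
import hashlib
from collections import defaultdict


def partition_by(rows, key):
    """Group rows into partitions keyed by the value of `key`."""
    parts = defaultdict(list)
    for row in rows:
        parts[row[key]].append(row)
    return dict(parts)


def fingerprint(rows):
    """Order-insensitive digest of a partition's contents."""
    digest = hashlib.sha256()
    for line in sorted(repr(sorted(row.items())) for row in rows):
        digest.update(line.encode("utf-8"))
    return digest.hexdigest()


def smart_map(rows, key, transform, cache):
    """Apply `transform` per partition, reusing cached output for unchanged partitions.

    Returns (output_rows, recomputed_keys). `cache` maps a partition key to a
    (fingerprint, transformed_rows) pair and is updated in place.
    """
    output, recomputed = [], []
    for part_key, part_rows in sorted(partition_by(rows, key).items()):
        fp = fingerprint(part_rows)
        cached = cache.get(part_key)
        if cached is None or cached[0] != fp:
            cache[part_key] = (fp, transform(part_rows))
            recomputed.append(part_key)
        output.extend(cache[part_key][1])
    return output, recomputed


def add_total(rows):
    """Example per-partition logic: add a `total` column."""
    return [{**row, "total": row["fare"] + row["tip"]} for row in rows]


# Made-up sample data.
rides = [
    {"cab_type": "yellow", "fare": 10.0, "tip": 2.0},
    {"cab_type": "green", "fare": 8.0, "tip": 1.0},
]
cache = {}
_, first_run = smart_map(rides, "cab_type", add_total, cache)

# New data arrives for one partition only.
rides.append({"cab_type": "green", "fare": 5.0, "tip": 0.5})
output, second_run = smart_map(rides, "cab_type", add_total, cache)

print(first_run)   # ['green', 'yellow']
print(second_run)  # ['green']
```

On the second run only the green partition is recomputed, while the yellow partition's earlier output is reused. This one-to-one correspondence between input and output partitions is the intuition behind a "map" reshape in this sketch.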

Example

Here's an example of a Smart PySpark Transform:

smart.py
from pyspark.sql import DataFrame, SparkSession

from ascend.resources import pyspark, ref


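@pyspark(
    inputs=[
        # reshape="map" follows the partitioning logic of the input data
        ref("cab_rides", reshape="map"),
    ],
    # Column used for time-series processing (smart backfills, time-range runs)
    event_time="pickup_datetime",
    cluster_by=["cab_type"],
)
def cab_rides_smart_map_pyspark(spark: SparkSession, cab_rides: DataFrame, context):
    # Pass-through: replace this with your own processing logic and return a DataFrame
    return cab_rides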

Check out our PySpark Transform reference guide for complete parameter options, advanced configurations, and additional examples.

🎉 Congratulations! You've successfully created a Smart PySpark Transform in Ascend.