PySpark Transform

PySpark Transforms execute PySpark code to transform data.

PySparkTransform

info

PySparkTransform is defined beneath the following ancestor nodes in the YAML structure:

Component
TransformComponent

Below are the properties for the PySparkTransform. Each property links to the specific details section further down in this page.

Property	Default	Type	Required	Description
dependencies		array[None]	No	List of dependencies that must complete before this Component runs.
event_time		string	No	Timestamp column in the Component output used to represent Event time.
microbatch		boolean	No	Whether to process data in microbatches.
batch_size		string	No	Size/time granularity of the microbatch to process.
lookback	1	integer	No	Number of time intervals prior to the current interval (and inclusive of current interval) to process in time-series processing mode.
begin		string	No	'Beginning of time' for this Component. If provided, time intervals before this time will be skipped in a time-series run.
inputs		array[None]	No	List of input components to use as Transform data sources.
strategy		Any of: PartitionedStrategy IncrementalStrategy string ("view", "table")	No	Transform strategy: either incremental, partitioned, or view/table.
pyspark			No	PySpark function to execute for data transformation.

Property Details

Component

A Component is a fundamental building block of a data Flow. Supported Component types include: Read, Transform, Task, Test, and more.

Property	Default	Type	Required	Description
component		One of: CustomPythonReadComponent ApplicationComponent AliasedTableComponent ExternalTableComponent	Yes	Component configuration options.

TransformComponent

Component that executes SQL or Python code to transform data.

Property	Type	Required	Description
data_plane	One of: SnowflakeDataPlane BigQueryDataPlane DuckdbDataPlane DatabricksDataPlane	No	Data Plane-specific configuration options for Components.
skip	boolean	No	Boolean flag indicating whether to skip processing for the Component or not.
retry_strategy		No	Retry strategy configuration options for the Component if any exceptions are encountered.
description	string	No	Brief description of what the model does.
metadata		No	Meta information of a resource. In most cases it doesn't affect the system behavior but may be helpful to analyze project resources.
name	string	Yes	The name of the model
flow_name	string	No	Name of the Flow that the Component belongs to.
data_maintenance		No	The data maintenance configuration options for the Component.
tests		No	Defines tests to run on this Component's data.
transform	One of: SqlTransform PythonTransform SnowparkTransform PySparkTransform	Yes	Transform that executes SQL or Python code for data transformation.

PythonTransformComponent

Python transform function.

Property	Default	Type	Required	Description
entrypoint		string	Yes	Entry point for the Python Transform function.
source		string	Yes	Source file for the Python Transform function.

IncrementalStrategy

Incremental Processing Strategy.

Property	Default	Type	Required	Description
incremental		Any of: append MergeStrategy	Yes	Incremental processing strategy.
on_schema_change		string ("ignore", "fail", "append_new_columns", "sync_all_columns", "smart")	No	Policy to apply when schema changes are detected. Defaults to 'fail' if not provided.

PartitionedStrategy

Partitioned Ingest Strategy. The user is expected to provide 2 functions, a list function that lists partitions in the source, and a read function that reads a partition from the source.

Property	Default	Type	Required	Description
partitioned			No	Options for partitioning data.
on_schema_change		string ("ignore", "fail", "append_new_columns", "sync_all_columns", "smart")	No	Policy to apply when schema changes are detected. Defaults to 'fail' if not provided.

PartitionedOptions

Options related to partition optimization - in particular, the policy that determines which partitions to ingest.

Property	Default	Type	Required	Description
enable_substitution_by_partition_name		boolean	Yes	Enable substitution by partition name.
output_type	table	string ("table", "view")	No	Output type for partitioned data. Must be either 'table' or 'view'. This strategy only applies to Transforms.

SCDType2Strategy

The SCD Type 2 strategy allows users to track changes to records over time, by tracking the start and end times for each version of a record. A brief overview of the strategy can be found at https://en.wikipedia.org/wiki/Slowly_changing_dimension#Type_2:_add_new_row.

Property	Default	Type	Required	Description
scd_type_2			No	Options for SCD Type 2 strategy.

InputComponent

Specification for input Components defining how partitioning behaviors should be handled. This metadata is required when a Component serves as an input to other Components within a Flow. The reshape parameter controls how input data is partitioned and processed. It accepts either full for full reduction operations or map for partition-wise operations.

Property	Type	Required	Description
flow	string	Yes	Name of the parent Flow that the input Component belongs to.
name	string	Yes	Name of the input Component.
alias	string	No	Alias to use for the input Component.
partition_spec	Any of: string ("full_reduction", "map")	No	Internal specification for how Component input data should be partitioned before processing. This field is populated based on the user-facing `reshape` parameter in ref() calls, which accepts `full` (for full reduction operations) or `map` (for partition-wise operations). Input partitioning is applied before the Component's logic is executed.
where	string	No	Optional filter condition to apply to the input Component's data.
partition_binding	Any of: string	No	Optional partition binding specification to apply to the Component on a per-output-partition basis against other inputs' partitions.

MergeStrategy

Strategy that involves merging new data with existing data by updating existing records that match the unique key.

Property	Default	Type	Required	Description
merge			No	Options for merge strategy.

KeyOptions

Column options needed for merge and SCD Type 2 strategies, such as unique key and deletion column name.

Property	Type	Required	Description
unique_key	string	Yes	Column or comma-separated set of columns used as a unique identifier for records, aiding in the merge process.
deletion_column	string	No	Column name used in the upstream source for soft-deleting records. Used when replicating data from a source that supports soft-deletion. If provided, the merge strategy will be able to detect deletions and mark them as deleted in the destination. If not provided, the merge strategy will not be able to detect deletions.
merge_update_columns	Any of: string array[string]	No	List of columns to include when updating values in merge. These columns are mutually exclusive with respect to the columns in `merge_exclude_columns`.
merge_exclude_columns	Any of: string array[string]	No	List of columns to exclude when updating values in merge. These columns are mutually exclusive with respect to the columns in `merge_update_columns`.
incremental_predicates	Any of: string array[string]	No	List of conditions to filter incremental data.

PartitionBinding

Property	Default	Type	Required	Description
logical_operator	logical_operator	string ("AND", "OR")	No	TLogical operator to use to combine the partition binding predicates provided
predicates	predicates	array[string]	No	List of partition binding predicates to apply to the input Component's data

RepartitionSpec

Specification for repartitioning operations on input Component's data

Property	Default	Type	Required	Description
repartition			No	Options for repartitioning the input Component's data.

RepartitionOptions

Options for repartitioning the input Component's data.

Property	Default	Type	Required	Description
partition_by		string	Yes	Column to partition by.
granularity		string	Yes	Granularity to use for the partitioning.