Incremental Strategy

Incremental processing strategy.

IncrementalStrategy is defined beneath ancestor nodes in the YAML structure; the components and strategy options that reference it are documented under Property Details below.

Below are the properties for IncrementalStrategy. Each property links to its details section further down this page.
Property | Default | Type | Required | Description |
---|---|---|---|---|
incremental | | Any of: append, MergeStrategy | Yes | Incremental processing strategy. |
on_schema_change | fail | string ("ignore", "fail", "append_new_columns", "sync_all_columns") | No | Policy to apply when schema changes are detected. Defaults to 'fail' if not provided. |
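For illustration, here is a minimal sketch of an IncrementalStrategy in YAML, assuming it appears as the `strategy` value of a read or transform component as documented below; the column name is hypothetical:

```yaml
# Sketch: incremental processing with a merge strategy.
strategy:
  incremental:
    merge:
      unique_key: order_id              # hypothetical unique-key column
  on_schema_change: append_new_columns  # add new source columns instead of failing
```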
Property Details
Component
A component is a fundamental building block of a data flow. Supported component types include read, transform, task, test, and more.
Property | Default | Type | Required | Description |
---|---|---|---|---|
component | | One of: CustomPythonReadComponent, ApplicationComponent, AliasedTableComponent, ExternalTableComponent | Yes | Configuration options for the component. |
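For orientation, a component might be declared like this in YAML; the layout is an assumption based on the ancestor structure, and `raw_orders` is a hypothetical name:

```yaml
# Sketch: the component wrapper holds one of the supported component types.
component:
  name: raw_orders
  custom_python_read:
    # ...options documented under CustomPythonReadOptions below...
```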
CustomPythonReadComponent
A component that reads data using user-defined custom Python code.
Property | Default | Type | Required | Description |
---|---|---|---|---|
data_plane | | One of: SnowflakeDataPlane, BigQueryDataPlane, DatabricksDataPlane | No | Data Plane-specific configuration options for a component. |
skip | | boolean | No | A boolean flag indicating whether to skip processing for the component. |
retry_strategy | | | No | The retry strategy configuration options for the component if any exceptions are encountered. |
description | | string | No | A brief description of what the model does. |
metadata | | | No | Meta information of a resource. In most cases it doesn't affect system behavior, but it may be helpful when analyzing project resources. |
name | | string | Yes | The name of the model. |
flow_name | | string | No | The name of the flow that the component belongs to. |
data_maintenance | | | No | The data maintenance configuration options for the component. |
tests | | | No | Defines tests to run on the data of this component. |
custom_python_read | | CustomPythonReadOptions | Yes | Configuration options for the Custom Python Read component (see below). |
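A hedged sketch of a CustomPythonReadComponent using only the properties in the table above; the component and flow names are hypothetical:

```yaml
# Sketch: a custom Python read component with common component-level fields.
component:
  name: raw_orders
  description: Reads raw order data from an external API.
  flow_name: orders_flow
  skip: false
  custom_python_read:
    # ...see CustomPythonReadOptions below...
```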
CustomPythonReadOptions
Configuration options for the Custom Python Read component.
Property | Default | Type | Required | Description |
---|---|---|---|---|
dependencies | | array | No | List of dependencies that must complete before this component runs. |
event_time | | string | No | Timestamp column in the component output used to represent event time. |
strategy | full | Any of: full, IncrementalStrategy, PartitionedStrategy | No | Ingest strategy. |
python | | Any of: | Yes | Python code to execute for ingesting data. |
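A sketch of the read options with an incremental ingest strategy. The accepted shapes of `python` are not fully listed above, so the inline-string form shown here is an assumption, as are the column names:

```yaml
custom_python_read:
  event_time: updated_at       # event-time column in the output
  strategy:
    incremental:
      merge:
        unique_key: id         # hypothetical unique key
    on_schema_change: ignore
  python: |
    # placeholder for user-defined ingestion code; exact signature not specified here
```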
TransformComponent
A component that executes SQL or Python code to transform data.
Property | Default | Type | Required | Description |
---|---|---|---|---|
data_plane | | One of: SnowflakeDataPlane, BigQueryDataPlane, DatabricksDataPlane | No | Data Plane-specific configuration options for a component. |
skip | | boolean | No | A boolean flag indicating whether to skip processing for the component. |
retry_strategy | | | No | The retry strategy configuration options for the component if any exceptions are encountered. |
description | | string | No | A brief description of what the model does. |
metadata | | | No | Meta information of a resource. In most cases it doesn't affect system behavior, but it may be helpful when analyzing project resources. |
name | | string | Yes | The name of the model. |
flow_name | | string | No | The name of the flow that the component belongs to. |
data_maintenance | | | No | The data maintenance configuration options for the component. |
tests | | | No | Defines tests to run on the data of this component. |
transform | | One of: SqlTransform, PythonTransform, SnowparkTransform, PySparkTransform | Yes | The transform to execute: SQL or Python code that transforms data. |
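For orientation, a transform component sketch, assuming the transform object is keyed by its code type as in the sub-tables below; all names are hypothetical:

```yaml
component:
  name: orders_clean
  transform:
    inputs:
      - raw_orders        # hypothetical upstream component
    strategy: view        # string form of the strategy
    sql: SELECT * FROM raw_orders
```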
PySparkTransform
PySpark transforms execute PySpark code to transform data.
Property | Default | Type | Required | Description |
---|---|---|---|---|
dependencies | | array | No | List of dependencies that must complete before this component runs. |
event_time | | string | No | Timestamp column in the component output used to represent event time. |
microbatch | | boolean | No | Whether to process data in microbatches. |
batch_size | | string | No | The size/time granularity of the microbatch to process. |
lookback | 1 | integer | No | The number of time intervals prior to the current interval (and inclusive of it) to process in time-series processing mode. |
begin | | string | No | The 'beginning of time' for this component. If provided, time intervals before this time will be skipped in a time-series run. |
inputs | | array | No | List of input components to use as data sources for the transform. |
strategy | | Any of: PartitionedStrategy, IncrementalStrategy, string ("view", "table") | No | Transform strategy: incremental, partitioned, or view/table. |
pyspark | | | No | PySpark transform function to execute for transforming the data. |
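A hedged sketch of a PySpark transform using the microbatch fields above; the `batch_size` value and the shape of `pyspark` are assumptions, as are all names:

```yaml
transform:
  inputs:
    - events              # hypothetical upstream component
  event_time: event_ts
  microbatch: true
  batch_size: hour        # assumed granularity value; accepted values not listed here
  lookback: 2             # process the current interval plus one prior interval
  begin: "2024-01-01"
  pyspark: |
    # placeholder for the PySpark transform function; exact signature not specified here
```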
PythonTransform
Python transforms execute Python code to transform data.
Property | Default | Type | Required | Description |
---|---|---|---|---|
dependencies | | array | No | List of dependencies that must complete before this component runs. |
event_time | | string | No | Timestamp column in the component output used to represent event time. |
microbatch | | boolean | No | Whether to process data in microbatches. |
batch_size | | string | No | The size/time granularity of the microbatch to process. |
lookback | 1 | integer | No | The number of time intervals prior to the current interval (and inclusive of it) to process in time-series processing mode. |
begin | | string | No | The 'beginning of time' for this component. If provided, time intervals before this time will be skipped in a time-series run. |
inputs | | array | No | List of input components to use as data sources for the transform. |
strategy | | Any of: PartitionedStrategy, IncrementalStrategy, string ("view", "table") | No | Transform strategy: incremental, partitioned, or view/table. |
python | | | No | Python transform function to execute for transforming the data. |
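A sketch of a Python transform combined with the IncrementalStrategy documented above; column and component names are hypothetical:

```yaml
transform:
  inputs:
    - raw_orders
  strategy:
    incremental:
      merge:
        unique_key: order_id
    on_schema_change: sync_all_columns   # keep destination columns in sync with source
  python: |
    # placeholder for the Python transform function
```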
SnowparkTransform
Snowpark transforms execute Python code to transform data within the Snowflake platform.
Property | Default | Type | Required | Description |
---|---|---|---|---|
dependencies | | array | No | List of dependencies that must complete before this component runs. |
event_time | | string | No | Timestamp column in the component output used to represent event time. |
microbatch | | boolean | No | Whether to process data in microbatches. |
batch_size | | string | No | The size/time granularity of the microbatch to process. |
lookback | 1 | integer | No | The number of time intervals prior to the current interval (and inclusive of it) to process in time-series processing mode. |
begin | | string | No | The 'beginning of time' for this component. If provided, time intervals before this time will be skipped in a time-series run. |
inputs | | array | No | List of input components to use as data sources for the transform. |
strategy | | Any of: PartitionedStrategy, IncrementalStrategy, string ("view", "table") | No | Transform strategy: incremental, partitioned, or view/table. |
snowpark | | | No | Snowpark transform function to execute for transforming the data. |
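A minimal Snowpark sketch using the string form of the strategy; names are hypothetical:

```yaml
transform:
  inputs:
    - raw_customers
  strategy: table          # materialize the result as a table
  snowpark: |
    # placeholder for the Snowpark transform function
```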
SqlTransform
SQL transforms execute SQL queries to transform data.
Property | Default | Type | Required | Description |
---|---|---|---|---|
dependencies | | array | No | List of dependencies that must complete before this component runs. |
event_time | | string | No | Timestamp column in the component output used to represent event time. |
microbatch | | boolean | No | Whether to process data in microbatches. |
batch_size | | string | No | The size/time granularity of the microbatch to process. |
lookback | 1 | integer | No | The number of time intervals prior to the current interval (and inclusive of it) to process in time-series processing mode. |
begin | | string | No | The 'beginning of time' for this component. If provided, time intervals before this time will be skipped in a time-series run. |
inputs | | array | No | List of input components to use as data sources for the transform. |
strategy | | Any of: PartitionedStrategy, IncrementalStrategy, string ("view", "table") | No | Transform strategy: incremental, partitioned, or view/table. |
sql | | string | No | SQL query to execute for transforming the data. |
dialect | spark | | No | SQL dialect to use for the query. Set to 'None' for the data plane's default dialect, or 'spark' for Spark SQL. |
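A sketch of a SQL transform with append-only incremental processing; table and column names are hypothetical:

```yaml
transform:
  inputs:
    - raw_orders
  strategy:
    incremental: append    # append new rows without merging
    on_schema_change: fail
  dialect: spark
  sql: |
    SELECT order_id, amount, updated_at
    FROM raw_orders
```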
SCDType2Strategy
The SCD Type 2 strategy lets users track changes to records over time by recording the start and end times of each version of a record. A brief overview of the strategy is available at https://en.wikipedia.org/wiki/Slowly_changing_dimension#Type_2:_add_new_row.
Property | Default | Type | Required | Description |
---|---|---|---|---|
scd_type_2 | | KeyOptions | No | Options for SCD Type 2 strategy (see KeyOptions below). |
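The source table leaves the type of `scd_type_2` unstated; assuming it accepts the KeyOptions documented below, a standalone sketch with hypothetical column names might look like:

```yaml
scd_type_2:
  unique_key: customer_id       # hypothetical natural key
  deletion_column: is_deleted   # soft-delete flag in the source
```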
MergeStrategy
A strategy that involves merging new data with existing data by updating existing records that match the unique key.
Property | Default | Type | Required | Description |
---|---|---|---|---|
merge | | KeyOptions | No | Options for merge strategy (see KeyOptions below). |
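A sketch showing how a MergeStrategy slots into the incremental field of an IncrementalStrategy; the key column is hypothetical:

```yaml
incremental:
  merge:
    unique_key: order_id    # records matching this key are updated in place
```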
KeyOptions
Column options needed for merge and SCD Type 2 strategies, such as unique key and deletion column name.
Property | Default | Type | Required | Description |
---|---|---|---|---|
unique_key | | string | Yes | Column or comma-separated set of columns used as a unique identifier for records, aiding in the merge process. |
deletion_column | | string | No | Column name used in the upstream source for soft-deleting records. Used when replicating data from a source that supports soft-deletion. If provided, the merge strategy can detect deletions and mark them as deleted in the destination; if not provided, deletions cannot be detected. |
merge_update_columns | | Any of: string, array[string] | No | List of columns to include when updating values in merge. Mutually exclusive with merge_exclude_columns. |
merge_exclude_columns | | Any of: string, array[string] | No | List of columns to exclude when updating values in merge. Mutually exclusive with merge_update_columns. |
incremental_predicates | | Any of: string, array[string] | No | List of conditions to filter incremental data. |
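A fuller KeyOptions sketch under a merge strategy; the predicate syntax and all names are assumptions:

```yaml
merge:
  unique_key: order_id, order_line_id   # comma-separated composite key
  deletion_column: _deleted             # soft-delete flag from the source
  merge_update_columns:                 # use either this or merge_exclude_columns, not both
    - status
    - amount
  incremental_predicates:
    - "target.updated_at < source.updated_at"   # hypothetical predicate syntax
```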