Merge Strategy
A strategy that merges new data with existing data by updating existing records that match the unique key.
MergeStrategy is defined beneath the following ancestor nodes in the YAML structure:
- Component
  - CustomPythonReadComponent
    - CustomPythonReadOptions
  - ReadComponent
    - BigQueryReadComponent
    - DatabricksReadComponent
    - MSSQLReadComponent
    - MySQLReadComponent
    - OracleReadComponent
    - PostgresReadComponent
    - SnowflakeReadComponent
      - IncrementalReadStrategy
  - TransformComponent
    - PySparkTransform
    - PythonTransform
    - SnowparkTransform
    - SqlTransform
      - IncrementalStrategy
  - WriteComponent
    - BigQueryWriteComponent
    - MySQLWriteComponent
    - OracleWriteComponent
    - SnowflakeWriteComponent
      - IncrementalWriteStrategyWithSchemaChange
Below are the properties for MergeStrategy. Each property links to its details section further down on this page.
Property | Default | Type | Required | Description |
---|---|---|---|---|
merge | | | No | Options for merge strategy. |
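For orientation, here is a minimal sketch of what a MergeStrategy might look like in YAML, assuming the merge property takes the KeyOptions fields documented at the bottom of this page and nesting it under an incremental strategy as described in the IncrementalStrategy section; order_id is a hypothetical key column:

```yaml
strategy:
  incremental:
    merge:
      unique_key: order_id   # hypothetical unique-key column (see KeyOptions)
```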
Property Details
Component
A component is a fundamental building block of a data flow. Supported component types include read, transform, task, test, and more.
Property | Default | Type | Required | Description |
---|---|---|---|---|
component | | One of: CustomPythonReadComponent, ApplicationComponent, AliasedTableComponent, ExternalTableComponent | Yes | Configuration options for the component. |
CustomPythonReadComponent
A component that reads data using user-defined custom Python code.
Property | Default | Type | Required | Description |
---|---|---|---|---|
data_plane | | One of: SnowflakeDataPlane, BigQueryDataPlane, DatabricksDataPlane | No | Data Plane-specific configuration options for a component. |
skip | | boolean | No | A boolean flag indicating whether to skip processing for the component. |
retry_strategy | | | No | The retry strategy configuration options for the component if any exceptions are encountered. |
description | | string | No | A brief description of what the model does. |
metadata | | | No | Meta information of a resource. In most cases it doesn't affect system behavior, but it may be helpful when analyzing project resources. |
name | | string | Yes | The name of the model. |
flow_name | | string | No | The name of the flow that the component belongs to. |
data_maintenance | | | No | The data maintenance configuration options for the component. |
tests | | | No | Defines tests to run on the data of this component. |
custom_python_read | | | Yes | Configuration options for the Custom Python Read component. |
CustomPythonReadOptions
Configuration options for the Custom Python Read component.
Property | Default | Type | Required | Description |
---|---|---|---|---|
dependencies | | array[None] | No | List of dependencies that must complete before this component runs. |
event_time | | string | No | Timestamp column in the component output used to represent event time. |
strategy | full | Any of: full, IncrementalStrategy, PartitionedStrategy | No | Ingest strategy. |
python | | Any of: | Yes | Python code to execute for ingesting data. |
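As an illustrative sketch (names are hypothetical, and the exact shape of the python value is truncated in this reference), a custom Python read with an incremental merge strategy might look like:

```yaml
component:
  name: orders_ingest              # hypothetical component name
  custom_python_read:
    event_time: updated_at         # hypothetical event-time column
    strategy:
      incremental:
        merge:
          unique_key: order_id     # hypothetical unique key
    python:
      # user-defined ingestion code is referenced here; its exact shape
      # ("Any of: ...") is not documented in this section
```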
ReadComponent
A component that reads data from a data system.
Property | Default | Type | Required | Description |
---|---|---|---|---|
data_plane | | One of: SnowflakeDataPlane, BigQueryDataPlane, DatabricksDataPlane | No | Data Plane-specific configuration options for a component. |
skip | | boolean | No | A boolean flag indicating whether to skip processing for the component. |
retry_strategy | | | No | The retry strategy configuration options for the component if any exceptions are encountered. |
description | | string | No | A brief description of what the model does. |
metadata | | | No | Meta information of a resource. In most cases it doesn't affect system behavior, but it may be helpful when analyzing project resources. |
name | | string | Yes | The name of the model. |
flow_name | | string | No | The name of the flow that the component belongs to. |
data_maintenance | | | No | The data maintenance configuration options for the component. |
tests | | | No | Defines tests to run on the data of this component. |
read | | One of: GenericFileReadComponent, LocalFileReadComponent, SFTPReadComponent, S3ReadComponent, GcsReadComponent, AbfsReadComponent, HttpReadComponent, MSSQLReadComponent, MySQLReadComponent, OracleReadComponent, PostgresReadComponent, SnowflakeReadComponent, BigQueryReadComponent, DatabricksReadComponent | Yes | The read component that reads data from a data system. |
BigQueryReadComponent
A component that reads data from a BigQuery table.
Property | Default | Type | Required | Description |
---|---|---|---|---|
dependencies | | array[None] | No | List of dependencies that must complete before this component runs. |
event_time | | string | No | Timestamp column in the component output used to represent event time. |
connection | | string | No | The name of the connection to use for reading data. |
columns | | array[None] | No | A list specifying the columns to read from the source and transformations to make during read. |
normalize | | boolean | No | A boolean flag indicating if the output column names should be normalized to a standard naming convention after reading. |
preserve_case | | boolean | No | A boolean flag indicating if the case of the column names should be preserved after reading. |
uppercase | | boolean | No | A boolean flag indicating if the column names should be transformed to uppercase after reading. |
strategy | | Any of: full, IncrementalReadStrategy, PartitionedStrategy | No | Ingest strategy options. |
read_options | | | No | Options for reading from the database or warehouse. |
bigquery | | Any of: | Yes | BigQuery read options. |
DatabricksReadComponent
A component that reads data from a Databricks table.
Property | Default | Type | Required | Description |
---|---|---|---|---|
dependencies | | array[None] | No | List of dependencies that must complete before this component runs. |
event_time | | string | No | Timestamp column in the component output used to represent event time. |
connection | | string | No | The name of the connection to use for reading data. |
columns | | array[None] | No | A list specifying the columns to read from the source and transformations to make during read. |
normalize | | boolean | No | A boolean flag indicating if the output column names should be normalized to a standard naming convention after reading. |
preserve_case | | boolean | No | A boolean flag indicating if the case of the column names should be preserved after reading. |
uppercase | | boolean | No | A boolean flag indicating if the column names should be transformed to uppercase after reading. |
strategy | | Any of: full, IncrementalReadStrategy, PartitionedStrategy | No | Ingest strategy options. |
read_options | | | No | Options for reading from the database or warehouse. |
databricks | | Any of: | Yes | Databricks read options. |
MSSQLReadComponent
A component that reads data from a Microsoft SQL Server database; it can ingest a single table or query, or multiple tables or queries.
Property | Default | Type | Required | Description |
---|---|---|---|---|
dependencies | | array[None] | No | List of dependencies that must complete before this component runs. |
event_time | | string | No | Timestamp column in the component output used to represent event time. |
connection | | string | No | The name of the connection to use for reading data. |
columns | | array[None] | No | A list specifying the columns to read from the source and transformations to make during read. |
normalize | | boolean | No | A boolean flag indicating if the output column names should be normalized to a standard naming convention after reading. |
preserve_case | | boolean | No | A boolean flag indicating if the case of the column names should be preserved after reading. |
uppercase | | boolean | No | A boolean flag indicating if the column names should be transformed to uppercase after reading. |
strategy | | Any of: full, IncrementalReadStrategy, PartitionedStrategy | No | Ingest strategy options. |
read_options | | | No | Options for reading from the database or warehouse. |
mssql | | Any of: | Yes | MSSQL read options. |
MySQLReadComponent
A component that reads data from a MySQL database; it can ingest a single table or query, or multiple tables or queries.
Property | Default | Type | Required | Description |
---|---|---|---|---|
dependencies | | array[None] | No | List of dependencies that must complete before this component runs. |
event_time | | string | No | Timestamp column in the component output used to represent event time. |
connection | | string | No | The name of the connection to use for reading data. |
columns | | array[None] | No | A list specifying the columns to read from the source and transformations to make during read. |
normalize | | boolean | No | A boolean flag indicating if the output column names should be normalized to a standard naming convention after reading. |
preserve_case | | boolean | No | A boolean flag indicating if the case of the column names should be preserved after reading. |
uppercase | | boolean | No | A boolean flag indicating if the column names should be transformed to uppercase after reading. |
strategy | | Any of: full, IncrementalReadStrategy, PartitionedStrategy | No | Ingest strategy options. |
read_options | | | No | Options for reading from the database or warehouse. |
use_duckdb | | boolean | No | Use the DuckDB extension for reading data, which is faster but may have memory limitations with very large tables. Defaults to false. |
mysql | | Any of: | Yes | MySQL read options. |
use_checksum | | boolean | No | Use a table checksum to detect data changes. If false or unset, a full re-read is performed on every run for full-sync. |
OracleReadComponent
A component that reads data from an Oracle table.
Property | Default | Type | Required | Description |
---|---|---|---|---|
dependencies | | array[None] | No | List of dependencies that must complete before this component runs. |
event_time | | string | No | Timestamp column in the component output used to represent event time. |
connection | | string | No | The name of the connection to use for reading data. |
columns | | array[None] | No | A list specifying the columns to read from the source and transformations to make during read. |
normalize | | boolean | No | A boolean flag indicating if the output column names should be normalized to a standard naming convention after reading. |
preserve_case | | boolean | No | A boolean flag indicating if the case of the column names should be preserved after reading. |
uppercase | | boolean | No | A boolean flag indicating if the column names should be transformed to uppercase after reading. |
strategy | | Any of: full, IncrementalReadStrategy, PartitionedStrategy | No | Ingest strategy options. |
read_options | | | No | Options for reading from the database or warehouse. |
oracle | Oracle | Any of: | No | Oracle read options. |
PostgresReadComponent
A component that reads data from a PostgreSQL table.
Property | Default | Type | Required | Description |
---|---|---|---|---|
dependencies | | array[None] | No | List of dependencies that must complete before this component runs. |
event_time | | string | No | Timestamp column in the component output used to represent event time. |
connection | | string | No | The name of the connection to use for reading data. |
columns | | array[None] | No | A list specifying the columns to read from the source and transformations to make during read. |
normalize | | boolean | No | A boolean flag indicating if the output column names should be normalized to a standard naming convention after reading. |
preserve_case | | boolean | No | A boolean flag indicating if the case of the column names should be preserved after reading. |
uppercase | | boolean | No | A boolean flag indicating if the column names should be transformed to uppercase after reading. |
strategy | | Any of: full, IncrementalReadStrategy, PartitionedStrategy | No | Ingest strategy options. |
read_options | | | No | Options for reading from the database or warehouse. |
use_duckdb | | boolean | No | Use the DuckDB extension for reading data, which is faster but may have memory limitations with very large tables. Defaults to false. |
postgres | Postgres | Any of: | No | Postgres read options. |
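As a representative example of the database read components above, a hedged sketch of a Postgres read using an incremental merge strategy (connection and column names are hypothetical, and the postgres sub-options are documented elsewhere):

```yaml
component:
  name: customers_raw              # hypothetical component name
  read:
    connection: pg_main            # hypothetical connection name
    event_time: updated_at         # hypothetical timestamp column
    normalize: true
    strategy:
      incremental:
        merge:
          unique_key: customer_id  # hypothetical unique key
    postgres:
      # Postgres-specific read options (e.g. table or query selection)
      # are not documented in this section
```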
SnowflakeReadComponent
A component that reads data from a Snowflake table.
Property | Default | Type | Required | Description |
---|---|---|---|---|
dependencies | | array[None] | No | List of dependencies that must complete before this component runs. |
event_time | | string | No | Timestamp column in the component output used to represent event time. |
connection | | string | No | The name of the connection to use for reading data. |
columns | | array[None] | No | A list specifying the columns to read from the source and transformations to make during read. |
normalize | | boolean | No | A boolean flag indicating if the output column names should be normalized to a standard naming convention after reading. |
preserve_case | | boolean | No | A boolean flag indicating if the case of the column names should be preserved after reading. |
uppercase | | boolean | No | A boolean flag indicating if the column names should be transformed to uppercase after reading. |
strategy | | Any of: full, IncrementalReadStrategy, PartitionedStrategy | No | Ingest strategy options. |
read_options | | | No | Options for reading from the database or warehouse. |
snowflake | | Any of: | Yes | Snowflake read options. |
IncrementalReadStrategy
Incremental read strategy for database read components. It combines a replication strategy, which defines how new data is read from the source, with an incremental strategy, which defines how that new data is materialized in the output.
Property | Default | Type | Required | Description |
---|---|---|---|---|
replication | | One of: cdc, incremental | No | Replication strategy to use for data synchronization. |
incremental | | Any of: append, MergeStrategy | Yes | Incremental processing strategy. |
on_schema_change | | string ("ignore", "fail", "append_new_columns", "sync_all_columns") | No | Policy to apply when schema changes are detected. Defaults to 'fail' if not provided. |
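Combining the two halves, a sketch of an IncrementalReadStrategy supplied as a database read component's strategy; the choice of cdc replication and the key column are assumptions for illustration:

```yaml
strategy:
  replication: cdc                      # how new data is read from the source
  incremental:
    merge:
      unique_key: id                    # hypothetical unique key
  on_schema_change: append_new_columns  # add new source columns instead of failing
```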
TransformComponent
A component that executes SQL or Python code to transform data.
Property | Default | Type | Required | Description |
---|---|---|---|---|
data_plane | | One of: SnowflakeDataPlane, BigQueryDataPlane, DatabricksDataPlane | No | Data Plane-specific configuration options for a component. |
skip | | boolean | No | A boolean flag indicating whether to skip processing for the component. |
retry_strategy | | | No | The retry strategy configuration options for the component if any exceptions are encountered. |
description | | string | No | A brief description of what the model does. |
metadata | | | No | Meta information of a resource. In most cases it doesn't affect system behavior, but it may be helpful when analyzing project resources. |
name | | string | Yes | The name of the model. |
flow_name | | string | No | The name of the flow that the component belongs to. |
data_maintenance | | | No | The data maintenance configuration options for the component. |
tests | | | No | Defines tests to run on the data of this component. |
transform | | One of: SqlTransform, PythonTransform, SnowparkTransform, PySparkTransform | Yes | The transform component that executes SQL or Python code to transform data. |
PySparkTransform
PySpark transforms execute PySpark code to transform data.
Property | Default | Type | Required | Description |
---|---|---|---|---|
dependencies | | array[None] | No | List of dependencies that must complete before this component runs. |
event_time | | string | No | Timestamp column in the component output used to represent event time. |
microbatch | | boolean | No | Whether to process data in microbatches. |
batch_size | | string | No | The size/time granularity of the microbatch to process. |
lookback | 1 | integer | No | The number of time intervals prior to the current interval (and inclusive of current interval) to process in time-series processing mode. |
begin | | string | No | The 'beginning of time' for this component. If provided, time intervals before this time will be skipped in a time-series run. |
inputs | | array[None] | No | List of input components to use as data sources for the transform. |
strategy | | Any of: PartitionedStrategy, IncrementalStrategy, string ("view", "table") | No | Transform strategy: incremental, partitioned, or view/table. |
pyspark | | | No | PySpark transform function to execute for transforming the data. |
PythonTransform
Python transforms execute Python code to transform data.
Property | Default | Type | Required | Description |
---|---|---|---|---|
dependencies | | array[None] | No | List of dependencies that must complete before this component runs. |
event_time | | string | No | Timestamp column in the component output used to represent event time. |
microbatch | | boolean | No | Whether to process data in microbatches. |
batch_size | | string | No | The size/time granularity of the microbatch to process. |
lookback | 1 | integer | No | The number of time intervals prior to the current interval (and inclusive of current interval) to process in time-series processing mode. |
begin | | string | No | The 'beginning of time' for this component. If provided, time intervals before this time will be skipped in a time-series run. |
inputs | | array[None] | No | List of input components to use as data sources for the transform. |
strategy | | Any of: PartitionedStrategy, IncrementalStrategy, string ("view", "table") | No | Transform strategy: incremental, partitioned, or view/table. |
python | | | No | Python transform function to execute for transforming the data. |
SnowparkTransform
Snowpark transforms execute Python code to transform data within the Snowflake platform.
Property | Default | Type | Required | Description |
---|---|---|---|---|
dependencies | | array[None] | No | List of dependencies that must complete before this component runs. |
event_time | | string | No | Timestamp column in the component output used to represent event time. |
microbatch | | boolean | No | Whether to process data in microbatches. |
batch_size | | string | No | The size/time granularity of the microbatch to process. |
lookback | 1 | integer | No | The number of time intervals prior to the current interval (and inclusive of current interval) to process in time-series processing mode. |
begin | | string | No | The 'beginning of time' for this component. If provided, time intervals before this time will be skipped in a time-series run. |
inputs | | array[None] | No | List of input components to use as data sources for the transform. |
strategy | | Any of: PartitionedStrategy, IncrementalStrategy, string ("view", "table") | No | Transform strategy: incremental, partitioned, or view/table. |
snowpark | | | No | Snowpark transform function to execute for transforming the data. |
SqlTransform
SQL transforms execute SQL queries to transform data.
Property | Default | Type | Required | Description |
---|---|---|---|---|
dependencies | | array[None] | No | List of dependencies that must complete before this component runs. |
event_time | | string | No | Timestamp column in the component output used to represent event time. |
microbatch | | boolean | No | Whether to process data in microbatches. |
batch_size | | string | No | The size/time granularity of the microbatch to process. |
lookback | 1 | integer | No | The number of time intervals prior to the current interval (and inclusive of current interval) to process in time-series processing mode. |
begin | | string | No | The 'beginning of time' for this component. If provided, time intervals before this time will be skipped in a time-series run. |
inputs | | array[None] | No | List of input components to use as data sources for the transform. |
strategy | | Any of: PartitionedStrategy, IncrementalStrategy, string ("view", "table") | No | Transform strategy: incremental, partitioned, or view/table. |
sql | | string | No | SQL query to execute for transforming the data. |
dialect | spark | | No | SQL dialect to use for the query. Set to 'None' for the data plane's default dialect, or 'spark' for Spark SQL. |
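A hedged sketch of a SQL transform materialized incrementally with a merge (component, input, and column names are hypothetical, and the shape of inputs entries is assumed to be component names):

```yaml
component:
  name: orders_enriched            # hypothetical component name
  transform:
    inputs:
      - orders_raw                 # hypothetical upstream component
    strategy:
      incremental:
        merge:
          unique_key: order_id     # hypothetical unique key
      on_schema_change: ignore
    sql: |
      SELECT order_id, amount, updated_at
      FROM orders_raw
    dialect: spark
```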
IncrementalStrategy
Incremental Processing Strategy.
Property | Default | Type | Required | Description |
---|---|---|---|---|
incremental | | Any of: append, MergeStrategy | Yes | Incremental processing strategy. |
on_schema_change | | string ("ignore", "fail", "append_new_columns", "sync_all_columns") | No | Policy to apply when schema changes are detected. Defaults to 'fail' if not provided. |
WriteComponent
A component that writes data to a data system.
Property | Default | Type | Required | Description |
---|---|---|---|---|
skip | | boolean | No | A boolean flag indicating whether to skip processing for the component. |
retry_strategy | | | No | The retry strategy configuration options for the component if any exceptions are encountered. |
description | | string | No | A brief description of what the model does. |
metadata | | | No | Meta information of a resource. In most cases it doesn't affect system behavior, but it may be helpful when analyzing project resources. |
name | | string | Yes | The name of the model. |
flow_name | | string | No | The name of the flow that the component belongs to. |
write | | One of: BigQueryWriteComponent, SnowflakeWriteComponent, S3WriteComponent, SFTPWriteComponent, GcsWriteComponent, AbfsWriteComponent, MySQLWriteComponent, OracleWriteComponent | Yes | The write component that writes data to a data system. |
BigQueryWriteComponent
A component that writes data to a BigQuery table.
Property | Default | Type | Required | Description |
---|---|---|---|---|
dependencies | | array[None] | No | List of dependencies that must complete before this component runs. |
connection | | string | Yes | The name of the connection to use for writing data. |
input | | | Yes | Input component name. |
normalize | | boolean | No | A boolean flag indicating if the output column names should be normalized to a standard naming convention when writing. |
preserve_case | | boolean | No | A boolean flag indicating if the case of the column names should be preserved when writing. |
uppercase | | boolean | No | A boolean flag indicating if the column names should be transformed to uppercase when writing. |
strategy | full: mode: drop_and_recreate | Any of: snapshot | No | Options for the write strategy. |
pre_sql | | Any of: string, array[string] | No | SQL statements to execute before the main write operation. Can be a single SQL statement as a string or multiple statements as a list of strings. |
post_sql | | Any of: string, array[string] | No | SQL statements to execute after the main write operation. Can be a single SQL statement as a string or multiple statements as a list of strings. |
bigquery | | | Yes | |
MySQLWriteComponent
A component that writes data to a MySQL table.
Property | Default | Type | Required | Description |
---|---|---|---|---|
dependencies | | array[None] | No | List of dependencies that must complete before this component runs. |
connection | | string | Yes | The name of the connection to use for writing data. |
input | | | Yes | Input component name. |
normalize | | boolean | No | A boolean flag indicating if the output column names should be normalized to a standard naming convention when writing. |
preserve_case | | boolean | No | A boolean flag indicating if the case of the column names should be preserved when writing. |
uppercase | | boolean | No | A boolean flag indicating if the column names should be transformed to uppercase when writing. |
strategy | full: mode: drop_and_recreate | Any of: snapshot | No | Options for the write strategy. |
pre_sql | | Any of: string, array[string] | No | SQL statements to execute before the main write operation. Can be a single SQL statement as a string or multiple statements as a list of strings. |
post_sql | | Any of: string, array[string] | No | SQL statements to execute after the main write operation. Can be a single SQL statement as a string or multiple statements as a list of strings. |
mysql | | | Yes | |
OracleWriteComponent
A component that writes data to an Oracle table.
Property | Default | Type | Required | Description |
---|---|---|---|---|
dependencies | | array[None] | No | List of dependencies that must complete before this component runs. |
connection | | string | Yes | The name of the connection to use for writing data. |
input | | | Yes | Input component name. |
normalize | | boolean | No | A boolean flag indicating if the output column names should be normalized to a standard naming convention when writing. |
preserve_case | | boolean | No | A boolean flag indicating if the case of the column names should be preserved when writing. |
uppercase | | boolean | No | A boolean flag indicating if the column names should be transformed to uppercase when writing. |
strategy | full: mode: drop_and_recreate | Any of: snapshot | No | Options for the write strategy. |
pre_sql | | Any of: string, array[string] | No | SQL statements to execute before the main write operation. Can be a single SQL statement as a string or multiple statements as a list of strings. |
post_sql | | Any of: string, array[string] | No | SQL statements to execute after the main write operation. Can be a single SQL statement as a string or multiple statements as a list of strings. |
oracle | | | Yes | |
SnowflakeWriteComponent
A component that writes data to a Snowflake table.
Property | Default | Type | Required | Description |
---|---|---|---|---|
dependencies | | array[None] | No | List of dependencies that must complete before this component runs. |
connection | | string | Yes | The name of the connection to use for writing data. |
input | | | Yes | Input component name. |
normalize | | boolean | No | A boolean flag indicating if the output column names should be normalized to a standard naming convention when writing. |
preserve_case | | boolean | No | A boolean flag indicating if the case of the column names should be preserved when writing. |
uppercase | | boolean | No | A boolean flag indicating if the column names should be transformed to uppercase when writing. |
strategy | full: mode: drop_and_recreate | Any of: snapshot | No | Options for the write strategy. |
pre_sql | | Any of: string, array[string] | No | SQL statements to execute before the main write operation. Can be a single SQL statement as a string or multiple statements as a list of strings. |
post_sql | | Any of: string, array[string] | No | SQL statements to execute after the main write operation. Can be a single SQL statement as a string or multiple statements as a list of strings. |
snowflake | | | Yes | |
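As a representative example of the write components above, a sketch of a Snowflake write (connection, input, and SQL statements are hypothetical; an incremental strategy such as the one sketched in the next section can be supplied under strategy):

```yaml
component:
  name: publish_orders             # hypothetical component name
  write:
    connection: sf_main            # hypothetical connection name
    input: orders_enriched         # hypothetical upstream component
    normalize: true
    pre_sql: "TRUNCATE TABLE analytics.orders_stage"          # hypothetical
    post_sql:
      - "GRANT SELECT ON analytics.orders TO ROLE reporting"  # hypothetical
    snowflake:
      # Snowflake-specific write options are not documented in this section
```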
IncrementalWriteStrategyWithSchemaChange
Incremental write strategy that defines how new data is merged with existing data, along with policy for handling schema changes.
Property | Default | Type | Required | Description |
---|---|---|---|---|
incremental | | Any of: append, MergeStrategy | Yes | Options to use when incrementally writing data to a Write component. |
incremental_column | | string | Yes | Name of the column to use for tracking incremental updates to the data. |
on_schema_change | | string ("ignore", "fail", "append_new_columns", "sync_all_columns") | No | Policy to apply when schema changes are detected. Defaults to 'fail' if not provided. |
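A sketch of this write strategy with hypothetical column names; here on_schema_change opts into propagating all schema changes rather than failing:

```yaml
strategy:
  incremental:
    merge:
      unique_key: id               # hypothetical unique key
  incremental_column: updated_at   # hypothetical column tracking incremental updates
  on_schema_change: sync_all_columns
```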
KeyOptions
Column options needed for merge and SCD Type 2 strategies, such as unique key and deletion column name.
Property | Default | Type | Required | Description |
---|---|---|---|---|
unique_key | | string | Yes | Column or comma-separated set of columns used as a unique identifier for records, aiding in the merge process. |
deletion_column | | string | No | Column name used in the upstream source for soft-deleting records; used when replicating data from a source that supports soft deletion. If provided, the merge strategy can detect deletions and mark the records as deleted in the destination; if not provided, deletions cannot be detected. |
merge_update_columns | | Any of: string, array[string] | No | List of columns to include when updating values in merge. Mutually exclusive with the columns in merge_exclude_columns. |
merge_exclude_columns | | Any of: string, array[string] | No | List of columns to exclude when updating values in merge. Mutually exclusive with the columns in merge_update_columns. |
incremental_predicates | | Any of: string, array[string] | No | List of conditions to filter incremental data. |
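Finally, a sketch of KeyOptions exercising the optional fields; all column names and the predicate expression are hypothetical, and the predicate syntax will depend on the target warehouse:

```yaml
merge:
  unique_key: "customer_id, region"        # comma-separated composite key
  deletion_column: is_deleted              # soft-delete flag in the source
  merge_exclude_columns:
    - created_at                           # never overwritten during merge updates
  incremental_predicates:
    - "target.updated_at >= current_date - 7"   # hypothetical filter condition
```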