Merge Strategy
A strategy that merges new data with existing data by updating existing records that match the unique key.
MergeStrategy
MergeStrategy is defined beneath the following ancestor nodes in the YAML structure:
- Component
- CustomPythonReadComponent
- CustomPythonReadOptions
- ReadComponent
- BigQueryReadComponent
- DatabricksReadComponent
- MSSQLReadComponent
- MySQLReadComponent
- OracleReadComponent
- PostgresReadComponent
- SnowflakeReadComponent
- IncrementalReadStrategy
- TransformComponent
- PySparkTransform
- PythonTransform
- SnowparkTransform
- SqlTransform
- IncrementalStrategy
- WriteComponent
- BigQueryWriteComponent
- MySQLWriteComponent
- OracleWriteComponent
- PostgresWriteComponent
- SnowflakeWriteComponent
- IncrementalWriteStrategyWithSchemaChange
Below are the properties for MergeStrategy. Each property links to the specific details section further down in this page.
Property | Default | Type | Required | Description |
---|---|---|---|---|
merge | | | No | Options for merge strategy. |
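Taken together with the ancestor list above, a merge strategy is typically nested under a Component's incremental strategy. The sketch below is illustrative only; the `unique_key` option name and the table names are assumptions, since the merge options themselves are documented in a separate section:

```yaml
# Hypothetical sketch: a merge incremental strategy on a SQL Transform.
# `unique_key` is an assumed option name, not confirmed by this page.
transform:
  strategy:
    incremental:
      merge:
        unique_key: order_id   # rows matching this key are updated in place
    on_schema_change: append_new_columns
  sql: SELECT * FROM raw_orders
```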
Property Details
Component
A Component is a fundamental building block of a data Flow. Supported Component types include: Read, Transform, Task, Test, and more.
Property | Default | Type | Required | Description |
---|---|---|---|---|
component | | One of: CustomPythonReadComponent ApplicationComponent AliasedTableComponent ExternalTableComponent FivetranComponent | Yes | Component configuration options. |
CustomPythonReadComponent
Component that reads data using user-defined custom Python code.
Property | Default | Type | Required | Description |
---|---|---|---|---|
data_plane | | One of: SnowflakeDataPlane BigQueryDataPlane DuckdbDataPlane DatabricksDataPlane | No | Data Plane-specific configuration options for Components. |
skip | | boolean | No | Boolean flag indicating whether to skip processing for the Component or not. |
retry_strategy | | | No | Retry strategy configuration options for the Component if any exceptions are encountered. |
description | | string | No | Brief description of what the model does. |
metadata | | | No | Meta information of a resource. In most cases it doesn't affect the system behavior but may be helpful to analyze project resources. |
name | | string | Yes | The name of the model. |
flow_name | | string | No | Name of the Flow that the Component belongs to. |
data_maintenance | | | No | The data maintenance configuration options for the Component. |
tests | | | No | Defines tests to run on this Component's data. |
custom_python_read | | | Yes | Configuration options for the Custom Python Read Component. |
CustomPythonReadOptions
Configuration options for the Custom Python Read Component.
Property | Default | Type | Required | Description |
---|---|---|---|---|
dependencies | | array[None] | No | List of dependencies that must complete before this Component runs. |
event_time | | string | No | Timestamp column in the Component output used to represent Event time. |
strategy | full | Any of: full IncrementalStrategy PartitionedStrategy | No | Ingest strategy. |
python | | Any of: | Yes | Python code to execute for ingesting data. |
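As a hedged illustration of the options above (all names here are invented, and the exact shape of the `python` value is not shown on this page):

```yaml
# Illustrative sketch of a Custom Python Read Component.
component:
  name: custom_api_read          # required Component name (invented)
  description: Reads records from an internal API via custom Python code
  custom_python_read:
    event_time: updated_at       # optional event-time column in the output
    strategy: full               # default ingest strategy
    python:
      # user-defined ingestion code goes here;
      # its exact structure is not documented on this page
```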
ReadComponent
Component that reads data from a system.
Property | Default | Type | Required | Description |
---|---|---|---|---|
data_plane | | One of: SnowflakeDataPlane BigQueryDataPlane DuckdbDataPlane DatabricksDataPlane | No | Data Plane-specific configuration options for Components. |
skip | | boolean | No | Boolean flag indicating whether to skip processing for the Component or not. |
retry_strategy | | | No | Retry strategy configuration options for the Component if any exceptions are encountered. |
description | | string | No | Brief description of what the model does. |
metadata | | | No | Meta information of a resource. In most cases it doesn't affect the system behavior but may be helpful to analyze project resources. |
name | | string | Yes | The name of the model. |
flow_name | | string | No | Name of the Flow that the Component belongs to. |
data_maintenance | | | No | The data maintenance configuration options for the Component. |
tests | | | No | Defines tests to run on this Component's data. |
read | | One of: GenericFileReadComponent LocalFileReadComponent SFTPReadComponent S3ReadComponent GcsReadComponent AbfsReadComponent HttpReadComponent MSSQLReadComponent MySQLReadComponent OracleReadComponent PostgresReadComponent SnowflakeReadComponent BigQueryReadComponent DatabricksReadComponent | Yes | Read component that reads data from a system. |
BigQueryReadComponent
Component that reads data from a BigQuery table.
Property | Default | Type | Required | Description |
---|---|---|---|---|
dependencies | | array[None] | No | List of dependencies that must complete before this Component runs. |
event_time | | string | No | Timestamp column in the Component output used to represent Event time. |
connection | | string | No | Name of the Connection to use for reading data. |
columns | | array[None] | No | List specifying the columns to read from the source and transformations to make during read. |
normalize | | boolean | No | Boolean flag indicating whether the output column names should be normalized to a standard naming convention after reading. |
preserve_case | | boolean | No | Boolean flag indicating whether the case of the column names should be preserved after reading. |
uppercase | | boolean | No | Boolean flag indicating whether the column names should be transformed to uppercase after reading. |
strategy | | Any of: full IncrementalReadStrategy PartitionedStrategy | No | Ingest strategy options. |
read_options | | | No | Options for reading from the database or warehouse. |
bigquery | | Any of: | Yes | BigQuery read options. |
DatabricksReadComponent
Component that reads data from a Databricks table.
Property | Default | Type | Required | Description |
---|---|---|---|---|
dependencies | | array[None] | No | List of dependencies that must complete before this Component runs. |
event_time | | string | No | Timestamp column in the Component output used to represent Event time. |
connection | | string | No | Name of the Connection to use for reading data. |
columns | | array[None] | No | List specifying the columns to read from the source and transformations to make during read. |
normalize | | boolean | No | Boolean flag indicating whether the output column names should be normalized to a standard naming convention after reading. |
preserve_case | | boolean | No | Boolean flag indicating whether the case of the column names should be preserved after reading. |
uppercase | | boolean | No | Boolean flag indicating whether the column names should be transformed to uppercase after reading. |
strategy | | Any of: full IncrementalReadStrategy PartitionedStrategy | No | Ingest strategy options. |
read_options | | | No | Options for reading from the database or warehouse. |
databricks | | Any of: | Yes | Databricks read options. |
MSSQLReadComponent
Component that reads data from a Microsoft SQL Server database. Options include ingesting a single table or query, or multiple tables or queries.
Property | Default | Type | Required | Description |
---|---|---|---|---|
dependencies | | array[None] | No | List of dependencies that must complete before this Component runs. |
event_time | | string | No | Timestamp column in the Component output used to represent Event time. |
connection | | string | No | Name of the Connection to use for reading data. |
columns | | array[None] | No | List specifying the columns to read from the source and transformations to make during read. |
normalize | | boolean | No | Boolean flag indicating whether the output column names should be normalized to a standard naming convention after reading. |
preserve_case | | boolean | No | Boolean flag indicating whether the case of the column names should be preserved after reading. |
uppercase | | boolean | No | Boolean flag indicating whether the column names should be transformed to uppercase after reading. |
strategy | | Any of: full IncrementalReadStrategy PartitionedStrategy | No | Ingest strategy options. |
read_options | | | No | Options for reading from the database or warehouse. |
mssql | | Any of: | Yes | MSSQL read options. |
MySQLReadComponent
Component that reads data from a MySQL database. Options include ingesting a single table or query, or multiple tables or queries.
Property | Default | Type | Required | Description |
---|---|---|---|---|
dependencies | | array[None] | No | List of dependencies that must complete before this Component runs. |
event_time | | string | No | Timestamp column in the Component output used to represent Event time. |
connection | | string | No | Name of the Connection to use for reading data. |
columns | | array[None] | No | List specifying the columns to read from the source and transformations to make during read. |
normalize | | boolean | No | Boolean flag indicating whether the output column names should be normalized to a standard naming convention after reading. |
preserve_case | | boolean | No | Boolean flag indicating whether the case of the column names should be preserved after reading. |
uppercase | | boolean | No | Boolean flag indicating whether the column names should be transformed to uppercase after reading. |
strategy | | Any of: full IncrementalReadStrategy PartitionedStrategy | No | Ingest strategy options. |
read_options | | | No | Options for reading from the database or warehouse. |
use_duckdb | False | boolean | No | Use DuckDB extension for reading data, which is faster but may have memory limitations with very large tables. |
mysql | | Any of: | Yes | MySQL read options. |
use_checksum | | boolean | No | Use table checksum to detect data changes. If false or unset, a full re-read is performed on every full-sync run. |
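A hedged sketch of a MySQL Read Component using these options; the Connection and table names are invented, and the exact nesting of the `read` and `mysql` blocks is inferred from the ancestor list at the top of this page:

```yaml
# Illustrative sketch only; field nesting is an assumption.
component:
  name: mysql_orders
  read:
    connection: my_mysql_conn    # invented Connection name
    normalize: true
    use_duckdb: false            # default; the DuckDB path is faster but memory-bound
    use_checksum: true           # skip full re-reads when the table checksum is unchanged
    strategy:
      incremental: append        # or a MergeStrategy block
    mysql:
      # table / query selection options go here (documented elsewhere)
```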
OracleReadComponent
Component that reads data from an Oracle table.
Property | Default | Type | Required | Description |
---|---|---|---|---|
dependencies | | array[None] | No | List of dependencies that must complete before this Component runs. |
event_time | | string | No | Timestamp column in the Component output used to represent Event time. |
connection | | string | No | Name of the Connection to use for reading data. |
columns | | array[None] | No | List specifying the columns to read from the source and transformations to make during read. |
normalize | | boolean | No | Boolean flag indicating whether the output column names should be normalized to a standard naming convention after reading. |
preserve_case | | boolean | No | Boolean flag indicating whether the case of the column names should be preserved after reading. |
uppercase | | boolean | No | Boolean flag indicating whether the column names should be transformed to uppercase after reading. |
strategy | | Any of: full IncrementalReadStrategy PartitionedStrategy | No | Ingest strategy options. |
read_options | | | No | Options for reading from the database or warehouse. |
oracle | Oracle | Any of: | No | Oracle read options. |
PostgresReadComponent
Component that reads data from a Postgresql table.
Property | Default | Type | Required | Description |
---|---|---|---|---|
dependencies | | array[None] | No | List of dependencies that must complete before this Component runs. |
event_time | | string | No | Timestamp column in the Component output used to represent Event time. |
connection | | string | No | Name of the Connection to use for reading data. |
columns | | array[None] | No | List specifying the columns to read from the source and transformations to make during read. |
normalize | | boolean | No | Boolean flag indicating whether the output column names should be normalized to a standard naming convention after reading. |
preserve_case | | boolean | No | Boolean flag indicating whether the case of the column names should be preserved after reading. |
uppercase | | boolean | No | Boolean flag indicating whether the column names should be transformed to uppercase after reading. |
strategy | | Any of: full IncrementalReadStrategy PartitionedStrategy | No | Ingest strategy options. |
read_options | | | No | Options for reading from the database or warehouse. |
use_duckdb | False | boolean | No | Use DuckDB extension for reading data, which is faster but may have memory limitations with very large tables. |
postgres | Postgres | Any of: | No | Postgres read options. |
SnowflakeReadComponent
Component that reads data from a Snowflake table.
Property | Default | Type | Required | Description |
---|---|---|---|---|
dependencies | | array[None] | No | List of dependencies that must complete before this Component runs. |
event_time | | string | No | Timestamp column in the Component output used to represent Event time. |
connection | | string | No | Name of the Connection to use for reading data. |
columns | | array[None] | No | List specifying the columns to read from the source and transformations to make during read. |
normalize | | boolean | No | Boolean flag indicating whether the output column names should be normalized to a standard naming convention after reading. |
preserve_case | | boolean | No | Boolean flag indicating whether the case of the column names should be preserved after reading. |
uppercase | | boolean | No | Boolean flag indicating whether the column names should be transformed to uppercase after reading. |
strategy | | Any of: full IncrementalReadStrategy PartitionedStrategy | No | Ingest strategy options. |
read_options | | | No | Options for reading from the database or warehouse. |
snowflake | | Any of: | Yes | Snowflake read options. |
IncrementalReadStrategy
Incremental read strategy for database Read Components. This combines the replication strategy, which defines how new data is read from the source, with the incremental strategy, which defines how that new data is materialized in the output.
Property | Default | Type | Required | Description |
---|---|---|---|---|
replication | | One of: cdc incremental | No | Replication strategy to use for data synchronization. |
incremental | | Any of: append MergeStrategy | Yes | Incremental processing strategy. |
on_schema_change | | string ("ignore", "fail", "append_new_columns", "sync_all_columns") | No | Policy to apply when schema changes are detected. Defaults to 'fail' if not provided. |
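For example, CDC replication paired with a merge materialization might look like the following sketch (the merge options are elided, and the field nesting is inferred from this table):

```yaml
# Illustrative sketch of an IncrementalReadStrategy.
strategy:
  replication: cdc                 # how new data is read from the source
  incremental:
    merge: {}                      # how new data is materialized (options elided)
  on_schema_change: sync_all_columns
```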
TransformComponent
Component that executes SQL or Python code to transform data.
Property | Default | Type | Required | Description |
---|---|---|---|---|
data_plane | | One of: SnowflakeDataPlane BigQueryDataPlane DuckdbDataPlane DatabricksDataPlane | No | Data Plane-specific configuration options for Components. |
skip | | boolean | No | Boolean flag indicating whether to skip processing for the Component or not. |
retry_strategy | | | No | Retry strategy configuration options for the Component if any exceptions are encountered. |
description | | string | No | Brief description of what the model does. |
metadata | | | No | Meta information of a resource. In most cases it doesn't affect the system behavior but may be helpful to analyze project resources. |
name | | string | Yes | The name of the model. |
flow_name | | string | No | Name of the Flow that the Component belongs to. |
data_maintenance | | | No | The data maintenance configuration options for the Component. |
tests | | | No | Defines tests to run on this Component's data. |
transform | | One of: SqlTransform PythonTransform SnowparkTransform PySparkTransform | Yes | Transform that executes SQL or Python code for data transformation. |
PySparkTransform
PySpark Transforms execute PySpark code to transform data.
Property | Default | Type | Required | Description |
---|---|---|---|---|
dependencies | | array[None] | No | List of dependencies that must complete before this Component runs. |
event_time | | string | No | Timestamp column in the Component output used to represent Event time. |
microbatch | | boolean | No | Whether to process data in microbatches. |
batch_size | | string | No | Size/time granularity of the microbatch to process. |
lookback | 1 | integer | No | Number of time intervals prior to the current interval (and inclusive of current interval) to process in time-series processing mode. |
begin | | string | No | 'Beginning of time' for this Component. If provided, time intervals before this time will be skipped in a time-series run. |
inputs | | array[None] | No | List of input components to use as Transform data sources. |
strategy | | Any of: PartitionedStrategy IncrementalStrategy string ("view", "table") | No | Transform strategy: either incremental, partitioned, or view/table. |
pyspark | | | No | PySpark function to execute for data transformation. |
PythonTransform
Python Transforms execute Python code to transform data.
Property | Default | Type | Required | Description |
---|---|---|---|---|
dependencies | | array[None] | No | List of dependencies that must complete before this Component runs. |
event_time | | string | No | Timestamp column in the Component output used to represent Event time. |
microbatch | | boolean | No | Whether to process data in microbatches. |
batch_size | | string | No | Size/time granularity of the microbatch to process. |
lookback | 1 | integer | No | Number of time intervals prior to the current interval (and inclusive of current interval) to process in time-series processing mode. |
begin | | string | No | 'Beginning of time' for this Component. If provided, time intervals before this time will be skipped in a time-series run. |
inputs | | array[None] | No | List of input components to use as Transform data sources. |
strategy | | Any of: PartitionedStrategy IncrementalStrategy string ("view", "table") | No | Transform strategy: either incremental, partitioned, or view/table. |
python | | | No | Python function to execute for data transformation. |
SnowparkTransform
Snowpark Transforms execute Python code to transform data within the Snowflake platform.
Property | Default | Type | Required | Description |
---|---|---|---|---|
dependencies | | array[None] | No | List of dependencies that must complete before this Component runs. |
event_time | | string | No | Timestamp column in the Component output used to represent Event time. |
microbatch | | boolean | No | Whether to process data in microbatches. |
batch_size | | string | No | Size/time granularity of the microbatch to process. |
lookback | 1 | integer | No | Number of time intervals prior to the current interval (and inclusive of current interval) to process in time-series processing mode. |
begin | | string | No | 'Beginning of time' for this Component. If provided, time intervals before this time will be skipped in a time-series run. |
inputs | | array[None] | No | List of input components to use as Transform data sources. |
strategy | | Any of: PartitionedStrategy IncrementalStrategy string ("view", "table") | No | Transform strategy: either incremental, partitioned, or view/table. |
snowpark | | | No | Snowpark function to execute for data transformation. |
SqlTransform
SQL Transforms execute SQL queries to transform data.
Property | Default | Type | Required | Description |
---|---|---|---|---|
dependencies | | array[None] | No | List of dependencies that must complete before this Component runs. |
event_time | | string | No | Timestamp column in the Component output used to represent Event time. |
microbatch | | boolean | No | Whether to process data in microbatches. |
batch_size | | string | No | Size/time granularity of the microbatch to process. |
lookback | 1 | integer | No | Number of time intervals prior to the current interval (and inclusive of current interval) to process in time-series processing mode. |
begin | | string | No | 'Beginning of time' for this Component. If provided, time intervals before this time will be skipped in a time-series run. |
inputs | | array[None] | No | List of input components to use as Transform data sources. |
strategy | | Any of: PartitionedStrategy IncrementalStrategy string ("view", "table") | No | Transform strategy: either incremental, partitioned, or view/table. |
sql | | string | No | SQL query to execute for data transformation. |
dialect | spark | | No | SQL dialect to use for the query. Set to 'None' for the Data Plane's default dialect, or 'spark' for Spark SQL. |
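A minimal sketch of a SQL Transform using these properties; the input and column names are invented, and the shape of the `inputs` entries is an assumption:

```yaml
# Illustrative sketch of a SqlTransform.
transform:
  inputs:
    - stg_events                  # assumed input-reference shape
  strategy: table                 # or view, IncrementalStrategy, PartitionedStrategy
  sql: |
    SELECT user_id, COUNT(*) AS event_count
    FROM stg_events
    GROUP BY user_id
  dialect: spark                  # the default dialect
```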
IncrementalStrategy
Incremental Processing Strategy.
Property | Default | Type | Required | Description |
---|---|---|---|---|
incremental | | Any of: append MergeStrategy | Yes | Incremental processing strategy. |
on_schema_change | | string ("ignore", "fail", "append_new_columns", "sync_all_columns") | No | Policy to apply when schema changes are detected. Defaults to 'fail' if not provided. |
WriteComponent
Component that writes data to a system.
Property | Default | Type | Required | Description |
---|---|---|---|---|
skip | | boolean | No | Boolean flag indicating whether to skip processing for the Component or not. |
retry_strategy | | | No | Retry strategy configuration options for the Component if any exceptions are encountered. |
description | | string | No | Brief description of what the model does. |
metadata | | | No | Meta information of a resource. In most cases it doesn't affect the system behavior but may be helpful to analyze project resources. |
name | | string | Yes | The name of the model. |
flow_name | | string | No | Name of the Flow that the Component belongs to. |
write | | One of: BigQueryWriteComponent SnowflakeWriteComponent S3WriteComponent SFTPWriteComponent GcsWriteComponent AbfsWriteComponent MySQLWriteComponent OracleWriteComponent PostgresWriteComponent | Yes | Write component that writes data to a system. |
BigQueryWriteComponent
Component that writes data to a BigQuery table.
Property | Default | Type | Required | Description |
---|---|---|---|---|
dependencies | | array[None] | No | List of dependencies that must complete before this Component runs. |
connection | | string | Yes | Name of the Connection to use for writing data. |
input | | | Yes | Input component name. |
normalize | | boolean | No | Boolean flag indicating if the output column names should be normalized to a standard naming convention when writing. |
preserve_case | | boolean | No | Boolean flag indicating if the case of the column names should be preserved when writing. |
uppercase | | boolean | No | Boolean flag indicating if the column names should be transformed to uppercase when writing. |
strategy | full: mode: drop_and_recreate | Any of: snapshot FullWriteStrategy IncrementalWriteStrategyWithSchemaChange PartitionedWriteStrategyWithSchemaChange | No | Write strategy options. |
pre_sql | | Any of: string array[string] | No | SQL statements to execute before the main write operation. Can be a single SQL statement string or multiple statements as a list of strings. |
post_sql | | Any of: string array[string] | No | SQL statements to execute after the main write operation. Can be a single SQL statement string or multiple statements as a list of strings. |
bigquery | | | Yes | BigQuery write options. |
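An illustrative sketch of a BigQuery Write Component; the Connection, input, and SQL statement here are invented, and the contents of the `bigquery` block are not documented on this page:

```yaml
# Illustrative sketch only; names and statements are assumptions.
component:
  name: orders_to_bigquery
  write:
    connection: my_bigquery_conn   # invented Connection name
    input: orders_transform        # upstream Component whose output is written
    strategy:
      full:
        mode: drop_and_recreate    # matches the documented default
    post_sql: DELETE FROM staging.orders_tmp   # invented example statement, runs after the write
    bigquery:
      # BigQuery-specific write options go here (not documented on this page)
```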
MySQLWriteComponent
Component that writes data to a MySQL table.
Property | Default | Type | Required | Description |
---|---|---|---|---|
dependencies | | array[None] | No | List of dependencies that must complete before this Component runs. |
connection | | string | Yes | Name of the Connection to use for writing data. |
input | | | Yes | Input component name. |
normalize | | boolean | No | Boolean flag indicating if the output column names should be normalized to a standard naming convention when writing. |
preserve_case | | boolean | No | Boolean flag indicating if the case of the column names should be preserved when writing. |
uppercase | | boolean | No | Boolean flag indicating if the column names should be transformed to uppercase when writing. |
strategy | full: |