Partitioned Strategy
Partitioned Ingest Strategy. The user is expected to provide 2 functions, a list function that lists partitions in the source, and a read function that reads a partition from the source.
PartitionedStrategy
PartitionedStrategy
is defined beneath the following ancestor nodes in the YAML structure:
- Component
- CustomPythonReadComponent
- CustomPythonReadOptions
- ReadComponent
- AbfsReadComponent
- BigQueryReadComponent
- DatabricksReadComponent
- GcsReadComponent
- GenericFileReadComponent
- LocalFileReadComponent
- MSSQLReadComponent
- MySQLReadComponent
- OracleReadComponent
- PostgresReadComponent
- S3ReadComponent
- SFTPReadComponent
- SnowflakeReadComponent
- TransformComponent
- PySparkTransform
- PythonTransform
- SnowparkTransform
- SqlTransform
Below are the properties for the PartitionedStrategy
. Each property links to the specific details section further down in this page.
Property | Default | Type | Required | Description |
---|---|---|---|---|
partitioned | No | Options for partitioning data. | ||
on_schema_change | string ("ignore", "fail", "append_new_columns", "sync_all_columns") | No | Policy to apply when schema changes are detected. Defaults to 'fail' if not provided. |
Property Details
Component
A component is a fundamental building block of a data flow. Types of components that are supported include: read, transform, task, test, and more.
Property | Default | Type | Required | Description |
---|---|---|---|---|
component | One of: CustomPythonReadComponent ApplicationComponent AliasedTableComponent ExternalTableComponent | Yes | Configuration options for the component. |
CustomPythonReadComponent
A component that reads data using user-defined custom Python code.
Property | Default | Type | Required | Description |
---|---|---|---|---|
data_plane | One of: SnowflakeDataPlane BigQueryDataPlane DatabricksDataPlane | No | Data Plane-specific configuration options for a component. | |
skip | boolean | No | A boolean flag indicating whether to skip processing for the component or not. | |
retry_strategy | No | The retry strategy configuration options for the component if any exceptions are encountered. | ||
description | string | No | A brief description of what the model does. | |
metadata | No | Meta information of a resource. In most cases it doesn't affect the system behavior but may be helpful to analyze project resources. | ||
name | string | Yes | The name of the model | |
flow_name | string | No | The name of the flow that the component belongs to. | |
data_maintenance | No | The data maintenance configuration options for the component. | ||
tests | No | Defines tests to run on the data of this component. | ||
custom_python_read | Yes |
CustomPythonReadOptions
Configuration options for the Custom Python Read component.
Property | Default | Type | Required | Description |
---|---|---|---|---|
dependencies | array[None] | No | List of dependencies that must complete before this component runs. | |
event_time | string | No | Timestamp column in the component output used to represent event time. | |
strategy | full | Any of: full IncrementalStrategy PartitionedStrategy | No | Ingest strategy. |
python | Any of: | Yes | Python code to execute for ingesting data. |
ReadComponent
A component that reads data from a data system.
Property | Default | Type | Required | Description |
---|---|---|---|---|
data_plane | One of: SnowflakeDataPlane BigQueryDataPlane DatabricksDataPlane | No | Data Plane-specific configuration options for a component. | |
skip | boolean | No | A boolean flag indicating whether to skip processing for the component or not. | |
retry_strategy | No | The retry strategy configuration options for the component if any exceptions are encountered. | ||
description | string | No | A brief description of what the model does. | |
metadata | No | Meta information of a resource. In most cases it doesn't affect the system behavior but may be helpful to analyze project resources. | ||
name | string | Yes | The name of the model | |
flow_name | string | No | The name of the flow that the component belongs to. | |
data_maintenance | No | The data maintenance configuration options for the component. | ||
tests | No | Defines tests to run on the data of this component. | ||
read | One of: GenericFileReadComponent LocalFileReadComponent SFTPReadComponent S3ReadComponent GcsReadComponent AbfsReadComponent HttpReadComponent MSSQLReadComponent MySQLReadComponent OracleReadComponent PostgresReadComponent SnowflakeReadComponent BigQueryReadComponent DatabricksReadComponent | Yes | The read component that reads data from a data system. |
AbfsReadComponent
Component for reading files from an Azure Blob Storage container.
Property | Default | Type | Required | Description |
---|---|---|---|---|
dependencies | array[None] | No | List of dependencies that must complete before this component runs. | |
event_time | string | No | Timestamp column in the component output used to represent event time. | |
connection | string | No | The name of the connection to use for reading data. | |
columns | array[None] | No | A list specifying the columns to read from the source and transformations to make during read. | |
normalize | boolean | No | A boolean flag indicating if the output column names should be normalized to a standard naming convention after reading. | |
preserve_case | boolean | No | A boolean flag indicating if the case of the column names should be preserved after reading. | |
uppercase | boolean | No | A boolean flag indicating if the column names should be transformed to uppercase after reading. | |
strategy | PartitionedStrategy | No | Ingest strategy when reading files. | |
abfs | Yes | Options for reading files from an Azure Blob Storage container. |
BigQueryReadComponent
A component that reads data from a BigQuery table.
Property | Default | Type | Required | Description |
---|---|---|---|---|
dependencies | array[None] | No | List of dependencies that must complete before this component runs. | |
event_time | string | No | Timestamp column in the component output used to represent event time. | |
connection | string | No | The name of the connection to use for reading data. | |
columns | array[None] | No | A list specifying the columns to read from the source and transformations to make during read. | |
normalize | boolean | No | A boolean flag indicating if the output column names should be normalized to a standard naming convention after reading. | |
preserve_case | boolean | No | A boolean flag indicating if the case of the column names should be preserved after reading. | |
uppercase | boolean | No | A boolean flag indicating if the column names should be transformed to uppercase after reading. | |
strategy | Any of: full IncrementalReadStrategy PartitionedStrategy | No | Ingest strategy options. | |
read_options | No | Options for reading from the database or warehouse. | ||
bigquery | Any of: | Yes |
DatabricksReadComponent
A component that reads data from a Databricks table.
Property | Default | Type | Required | Description |
---|---|---|---|---|
dependencies | array[None] | No | List of dependencies that must complete before this component runs. | |
event_time | string | No | Timestamp column in the component output used to represent event time. | |
connection | string | No | The name of the connection to use for reading data. | |
columns | array[None] | No | A list specifying the columns to read from the source and transformations to make during read. | |
normalize | boolean | No | A boolean flag indicating if the output column names should be normalized to a standard naming convention after reading. | |
preserve_case | boolean | No | A boolean flag indicating if the case of the column names should be preserved after reading. | |
uppercase | boolean | No | A boolean flag indicating if the column names should be transformed to uppercase after reading. | |
strategy | Any of: full IncrementalReadStrategy PartitionedStrategy | No | Ingest strategy options. | |
read_options | No | Options for reading from the database or warehouse. | ||
databricks | Any of: | Yes |
GcsReadComponent
Component for reading files from a Google Cloud Storage bucket.
Property | Default | Type | Required | Description |
---|---|---|---|---|
dependencies | array[None] | No | List of dependencies that must complete before this component runs. | |
event_time | string | No | Timestamp column in the component output used to represent event time. | |
connection | string | No | The name of the connection to use for reading data. | |
columns | array[None] | No | A list specifying the columns to read from the source and transformations to make during read. | |
normalize | boolean | No | A boolean flag indicating if the output column names should be normalized to a standard naming convention after reading. | |
preserve_case | boolean | No | A boolean flag indicating if the case of the column names should be preserved after reading. | |
uppercase | boolean | No | A boolean flag indicating if the column names should be transformed to uppercase after reading. | |
strategy | PartitionedStrategy | No | Ingest strategy when reading files. | |
gcs | Yes | Options for reading files from a Google Cloud Storage bucket. |
GenericFileReadComponent
Component for reading files from a filesystem.
Property | Default | Type | Required | Description |
---|---|---|---|---|
dependencies | array[None] | No | List of dependencies that must complete before this component runs. | |
event_time | string | No | Timestamp column in the component output used to represent event time. | |
connection | string | No | The name of the connection to use for reading data. | |
columns | array[None] | No | A list specifying the columns to read from the source and transformations to make during read. | |
normalize | boolean | No | A boolean flag indicating if the output column names should be normalized to a standard naming convention after reading. | |
preserve_case | boolean | No | A boolean flag indicating if the case of the column names should be preserved after reading. | |
uppercase | boolean | No | A boolean flag indicating if the column names should be transformed to uppercase after reading. | |
strategy | PartitionedStrategy | No | Ingest strategy when reading files. | |
generic_file | Yes | Options for reading files from a filesystem. |
LocalFileReadComponent
Component for reading files from the local filesystem.
Property | Default | Type | Required | Description |
---|---|---|---|---|
dependencies | array[None] | No | List of dependencies that must complete before this component runs. | |
event_time | string | No | Timestamp column in the component output used to represent event time. | |
connection | string | No | The name of the connection to use for reading data. | |
columns | array[None] | No | A list specifying the columns to read from the source and transformations to make during read. | |
normalize | boolean | No | A boolean flag indicating if the output column names should be normalized to a standard naming convention after reading. | |
preserve_case | boolean | No | A boolean flag indicating if the case of the column names should be preserved after reading. | |
uppercase | boolean | No | A boolean flag indicating if the column names should be transformed to uppercase after reading. | |
strategy | PartitionedStrategy | No | Ingest strategy when reading files. | |
local_file | Yes | Options for reading files from the local filesystem. |
MSSQLReadComponent
A component that reads data from a MSSQL Server database, options include ingesting a single table / query, or multiple tables / queries.
Property | Default | Type | Required | Description |
---|---|---|---|---|
dependencies | array[None] | No | List of dependencies that must complete before this component runs. | |
event_time | string | No | Timestamp column in the component output used to represent event time. | |
connection | string | No | The name of the connection to use for reading data. | |
columns | array[None] | No | A list specifying the columns to read from the source and transformations to make during read. | |
normalize | boolean | No | A boolean flag indicating if the output column names should be normalized to a standard naming convention after reading. | |
preserve_case | boolean | No | A boolean flag indicating if the case of the column names should be preserved after reading. | |
uppercase | boolean | No | A boolean flag indicating if the column names should be transformed to uppercase after reading. | |
strategy | Any of: full IncrementalReadStrategy PartitionedStrategy | No | Ingest strategy options. | |
read_options | No | Options for reading from the database or warehouse. | ||
mssql | Any of: | Yes |
MySQLReadComponent
A component that reads data from a MySQL database, options include ingesting a single table / query, or multiple tables / queries.
Property | Default | Type | Required | Description |
---|---|---|---|---|
dependencies | array[None] | No | List of dependencies that must complete before this component runs. | |
event_time | string | No | Timestamp column in the component output used to represent event time. | |
connection | string | No | The name of the connection to use for reading data. | |
columns | array[None] | No | A list specifying the columns to read from the source and transformations to make during read. | |
normalize | boolean | No | A boolean flag indicating if the output column names should be normalized to a standard naming convention after reading. | |
preserve_case | boolean | No | A boolean flag indicating if the case of the column names should be preserved after reading. | |
uppercase | boolean | No | A boolean flag indicating if the column names should be transformed to uppercase after reading. | |
strategy | Any of: full IncrementalReadStrategy PartitionedStrategy | No | Ingest strategy options. | |
read_options | No | Options for reading from the database or warehouse. | ||
use_duckdb | boolean | No | Use DuckDB extension for reading data, which is faster but may have memory limitations with very large tables. Defaults to False | |
mysql | Any of: | Yes | MySQL read options. | |
use_checksum | boolean | No | Use table checksum to detect data changes. if false or unset, will do full re-read for every run for full-sync. |
OracleReadComponent
A component that reads data from an Oracle table.
Property | Default | Type | Required | Description |
---|---|---|---|---|
dependencies | array[None] | No | List of dependencies that must complete before this component runs. | |
event_time | string | No | Timestamp column in the component output used to represent event time. | |
connection | string | No | The name of the connection to use for reading data. | |
columns | array[None] | No | A list specifying the columns to read from the source and transformations to make during read. | |
normalize | boolean | No | A boolean flag indicating if the output column names should be normalized to a standard naming convention after reading. | |
preserve_case | boolean | No | A boolean flag indicating if the case of the column names should be preserved after reading. | |
uppercase | boolean | No | A boolean flag indicating if the column names should be transformed to uppercase after reading. | |
strategy | Any of: full IncrementalReadStrategy PartitionedStrategy | No | Ingest strategy options. | |
read_options | No | Options for reading from the database or warehouse. | ||
oracle | Oracle | Any of: | No | Oracle read options. |
PostgresReadComponent
A component that reads data from a Postgresql table.
Property | Default | Type | Required | Description |
---|---|---|---|---|
dependencies | array[None] | No | List of dependencies that must complete before this component runs. | |
event_time | string | No | Timestamp column in the component output used to represent event time. | |
connection | string | No | The name of the connection to use for reading data. | |
columns | array[None] | No | A list specifying the columns to read from the source and transformations to make during read. | |
normalize | boolean | No | A boolean flag indicating if the output column names should be normalized to a standard naming convention after reading. | |
preserve_case | boolean | No | A boolean flag indicating if the case of the column names should be preserved after reading. | |
uppercase | boolean | No | A boolean flag indicating if the column names should be transformed to uppercase after reading. | |
strategy | Any of: full IncrementalReadStrategy PartitionedStrategy | No | Ingest strategy options. | |
read_options | No | Options for reading from the database or warehouse. | ||
use_duckdb | boolean | No | Use DuckDB extension for reading data, which is faster but may have memory limitations with very large tables. Defaults to False | |
postgres | Postgres | Any of: | No | Postgres read options. |
S3ReadComponent
Component for reading files from an S3 bucket.
Property | Default | Type | Required | Description |
---|---|---|---|---|
dependencies | array[None] | No | List of dependencies that must complete before this component runs. | |
event_time | string | No | Timestamp column in the component output used to represent event time. | |
connection | string | No | The name of the connection to use for reading data. | |
columns | array[None] | No | A list specifying the columns to read from the source and transformations to make during read. | |
normalize | boolean | No | A boolean flag indicating if the output column names should be normalized to a standard naming convention after reading. | |
preserve_case | boolean | No | A boolean flag indicating if the case of the column names should be preserved after reading. | |
uppercase | boolean | No | A boolean flag indicating if the column names should be transformed to uppercase after reading. | |
strategy | PartitionedStrategy | No | Ingest strategy when reading files. | |
s3 | Yes | Options for reading files from an S3 bucket. |
SFTPReadComponent
Component for reading files from an SFTP server.
Property | Default | Type | Required | Description |
---|---|---|---|---|
dependencies | array[None] | No | List of dependencies that must complete before this component runs. | |
event_time | string | No | Timestamp column in the component output used to represent event time. | |
connection | string | No | The name of the connection to use for reading data. | |
columns | array[None] | No | A list specifying the columns to read from the source and transformations to make during read. | |
normalize | boolean | No | A boolean flag indicating if the output column names should be normalized to a standard naming convention after reading. | |
preserve_case | boolean | No | A boolean flag indicating if the case of the column names should be preserved after reading. | |
uppercase | boolean | No | A boolean flag indicating if the column names should be transformed to uppercase after reading. | |
strategy | PartitionedStrategy | No | Ingest strategy when reading files. | |
sftp | Yes | Options for reading files from an SFTP server. |
SnowflakeReadComponent
A component that reads data from a Snowflake table.
Property | Default | Type | Required | Description |
---|---|---|---|---|
dependencies | array[None] | No | List of dependencies that must complete before this component runs. | |
event_time | string | No | Timestamp column in the component output used to represent event time. | |
connection | string | No | The name of the connection to use for reading data. | |
columns | array[None] | No | A list specifying the columns to read from the source and transformations to make during read. | |
normalize | boolean | No | A boolean flag indicating if the output column names should be normalized to a standard naming convention after reading. | |
preserve_case | boolean | No | A boolean flag indicating if the case of the column names should be preserved after reading. | |
uppercase | boolean | No | A boolean flag indicating if the column names should be transformed to uppercase after reading. | |
strategy | Any of: full IncrementalReadStrategy PartitionedStrategy | No | Ingest strategy options. | |
read_options | No | Options for reading from the database or warehouse. | ||
snowflake | Any of: | Yes |
TransformComponent
A component that executes SQL or Python code to transform data.
Property | Default | Type | Required | Description |
---|---|---|---|---|
data_plane | One of: SnowflakeDataPlane BigQueryDataPlane DatabricksDataPlane | No | Data Plane-specific configuration options for a component. | |
skip | boolean | No | A boolean flag indicating whether to skip processing for the component or not. | |
retry_strategy | No | The retry strategy configuration options for the component if any exceptions are encountered. | ||
description | string | No | A brief description of what the model does. | |
metadata | No | Meta information of a resource. In most cases it doesn't affect the system behavior but may be helpful to analyze project resources. | ||
name | string | Yes | The name of the model | |
flow_name | string | No | The name of the flow that the component belongs to. | |
data_maintenance | No | The data maintenance configuration options for the component. | ||
tests | No | Defines tests to run on the data of this component. | ||
transform | One of: SqlTransform PythonTransform SnowparkTransform PySparkTransform | Yes | The transform component that executes SQL or Python code to transform data. |
PySparkTransform
PySpark transforms execute PySpark code to transform data.
Property | Default | Type | Required | Description |
---|---|---|---|---|
dependencies | array[None] | No | List of dependencies that must complete before this component runs. | |
event_time | string | No | Timestamp column in the component output used to represent event time. | |
microbatch | boolean | No | Whether to process data in microbatches. | |
batch_size | string | No | The size/time granularity of the microbatch to process. | |
lookback | 1 | integer | No | The number of time intervals prior to the current interval (and inclusive of current interval) to process in time-series processing mode. |
begin | string | No | The 'beginning of time' for this component. If provided, time intervals before this time will be skipped in a time-series run. | |
inputs | array[None] | No | List of input components to use as data sources for the transform. | |
strategy | Any of: PartitionedStrategy IncrementalStrategy string ("view", "table") | No | Transform strategy - incremental, partitioned, or view/table. | |
pyspark | No | PySpark transform function to execute for transforming the data. |
PythonTransform
Python transforms execute Python code to transform data.
Property | Default | Type | Required | Description |
---|---|---|---|---|
dependencies | array[None] | No | List of dependencies that must complete before this component runs. | |
event_time | string | No | Timestamp column in the component output used to represent event time. | |
microbatch | boolean | No | Whether to process data in microbatches. | |
batch_size | string | No | The size/time granularity of the microbatch to process. | |
lookback | 1 | integer | No | The number of time intervals prior to the current interval (and inclusive of current interval) to process in time-series processing mode. |
begin | string | No | The 'beginning of time' for this component. If provided, time intervals before this time will be skipped in a time-series run. | |
inputs | array[None] | No | List of input components to use as data sources for the transform. | |
strategy | Any of: PartitionedStrategy IncrementalStrategy string ("view", "table") | No | Transform strategy - incremental, partitioned, or view/table. | |
python | No | Python transform function to execute for transforming the data. |
SnowparkTransform
Snowpark transforms execute Python code to transform data within the Snowflake platform.
Property | Default | Type | Required | Description |
---|---|---|---|---|
dependencies | array[None] | No | List of dependencies that must complete before this component runs. | |
event_time | string | No | Timestamp column in the component output used to represent event time. | |
microbatch | boolean | No | Whether to process data in microbatches. | |
batch_size | string | No | The size/time granularity of the microbatch to process. | |
lookback | 1 | integer | No | The number of time intervals prior to the current interval (and inclusive of current interval) to process in time-series processing mode. |
begin | string | No | The 'beginning of time' for this component. If provided, time intervals before this time will be skipped in a time-series run. | |
inputs | array[None] | No | List of input components to use as data sources for the transform. | |
strategy | Any of: PartitionedStrategy IncrementalStrategy string ("view", "table") | No | Transform strategy - incremental, partitioned, or view/table. | |
snowpark | No | Snowpark transform function to execute for transforming the data. |
SqlTransform
SQL transforms execute SQL queries to transform data.
Property | Default | Type | Required | Description |
---|---|---|---|---|
dependencies | array[None] | No | List of dependencies that must complete before this component runs. | |
event_time | string | No | Timestamp column in the component output used to represent event time. | |
microbatch | boolean | No | Whether to process data in microbatches. | |
batch_size | string | No | The size/time granularity of the microbatch to process. | |
lookback | 1 | integer | No | The number of time intervals prior to the current interval (and inclusive of current interval) to process in time-series processing mode. |
begin | string | No | The 'beginning of time' for this component. If provided, time intervals before this time will be skipped in a time-series run. | |
inputs | array[None] | No | List of input components to use as data sources for the transform. | |
strategy | Any of: PartitionedStrategy IncrementalStrategy string ("view", "table") | No | Transform strategy - incremental, partitioned, or view/table. | |
sql | string | No | SQL query to execute for transforming the data. | |
dialect | spark | No | SQL dialect to use for the query. Set to 'None' for the data plane's default dialect, or 'spark' for Spark SQL. |
PartitionedOptions
Options related to partition optimization - in particular, the policy that determines which partitions to ingest.
Property | Default | Type | Required | Description |
---|---|---|---|---|
enable_substitution_by_partition_name | boolean | Yes | Enable substitution by partition name. | |
output_type | table | string ("table", "view") | No | Output type for partitioned data. Must be either 'table' or 'view'. This strategy applies only to Transforms. |