Partitioned Strategy
Partitioned ingest strategy. The user is expected to provide two functions: a list function that lists the partitions in the source, and a read function that reads a single partition from the source.
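A minimal sketch of what these two functions might look like is shown below. The function names, signatures, and return types are illustrative assumptions only; the exact interface expected by the platform is not defined on this page.

```python
import pandas as pd

# Hypothetical list function: returns an identifier for each partition in the
# source (here, one identifier per day).
def list_partitions() -> list[str]:
    return ["2024-01-01", "2024-01-02", "2024-01-03"]

# Hypothetical read function: returns the data for a single partition, given
# its identifier. A pandas DataFrame is assumed as the return type here.
def read_partition(partition: str) -> pd.DataFrame:
    return pd.DataFrame({"partition_date": [partition], "value": [42]})
```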
PartitionedStrategy
PartitionedStrategy is defined beneath the following ancestor nodes in the YAML structure:
- Component
- CustomPythonReadComponent
- CustomPythonReadOptions
- ReadComponent
- AbfsReadComponent
- BigQueryReadComponent
- DatabricksReadComponent
- GcsReadComponent
- GenericFileReadComponent
- LocalFileReadComponent
- MSSQLReadComponent
- MySQLReadComponent
- OracleReadComponent
- PostgresReadComponent
- S3ReadComponent
- SFTPReadComponent
- SnowflakeReadComponent
- TransformComponent
- PySparkTransform
- PythonTransform
- SnowparkTransform
- SqlTransform
The table below lists the properties of PartitionedStrategy. Each property links to its details section further down on this page.
| Property | Default | Type | Required | Description |
|---|---|---|---|---|
| partitioned | | PartitionedOptions | No | Options for partitioning data. |
| on_schema_change | fail | string ("ignore", "fail", "append_new_columns", "sync_all_columns", "smart") | No | Policy to apply when schema changes are detected. Defaults to 'fail' if not provided. |
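As a quick orientation before the detail sections, the snippet below sketches how these two properties might appear in YAML. It assumes that `partitioned` accepts the PartitionedOptions documented at the end of this page and that the strategy variant is selected by the keys present; values are illustrative.

```yaml
strategy:
  partitioned:
    enable_substitution_by_partition_name: true
  on_schema_change: append_new_columns   # defaults to 'fail' if omitted
```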
Property Details
Component
A Component is a fundamental building block of a data Flow. Supported Component types include: Read, Transform, Task, Test, and more.
| Property | Default | Type | Required | Description |
|---|---|---|---|---|
| component | | One of: CustomPythonReadComponent ApplicationComponent AliasedTableComponent ExternalTableComponent | Yes | Component configuration options. |
CustomPythonReadComponent
Component that reads data using user-defined custom Python code.
| Property | Default | Type | Required | Description |
|---|---|---|---|---|
| data_plane | | One of: SnowflakeDataPlane BigQueryDataPlane DuckdbDataPlane DatabricksDataPlane | No | Data Plane-specific configuration options for Components. |
| skip | | boolean | No | Boolean flag indicating whether to skip processing for the Component or not. |
| retry_strategy | | | No | Retry strategy configuration options for the Component if any exceptions are encountered. |
| description | | string | No | Brief description of what the model does. |
| metadata | | | No | Meta information of a resource. In most cases it doesn't affect the system behavior but may be helpful to analyze project resources. |
| name | | string | Yes | The name of the model |
| flow_name | | string | No | Name of the Flow that the Component belongs to. |
| data_maintenance | | | No | The data maintenance configuration options for the Component. |
| tests | | | No | Defines tests to run on this Component's data. |
| custom_python_read | | | Yes | Configuration options for the Custom Python Read Component. |
CustomPythonReadOptions
Configuration options for the Custom Python Read Component.
| Property | Default | Type | Required | Description |
|---|---|---|---|---|
| dependencies | | array[None] | No | List of dependencies that must complete before this Component runs. |
| event_time | | string | No | Timestamp column in the Component output used to represent Event time. |
| strategy | full | Any of: full IncrementalStrategy PartitionedStrategy | No | Ingest strategy. |
| python | | Any of: | Yes | Python code to execute for ingesting data. |
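A hypothetical custom Python read wired to a partitioned strategy is sketched below. Only properties listed in the tables above are used; the component name is a placeholder, and the exact form of the `python` field (inline code versus a reference to user code) is an assumption.

```yaml
component:
  name: orders_partitioned_read
  custom_python_read:
    event_time: updated_at
    strategy:
      partitioned:
        enable_substitution_by_partition_name: true
      on_schema_change: append_new_columns
    python: ...   # the list/read functions sketched in the introduction
```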
ReadComponent
Component that reads data from a system.
| Property | Default | Type | Required | Description |
|---|---|---|---|---|
| data_plane | | One of: SnowflakeDataPlane BigQueryDataPlane DuckdbDataPlane DatabricksDataPlane | No | Data Plane-specific configuration options for Components. |
| skip | | boolean | No | Boolean flag indicating whether to skip processing for the Component or not. |
| retry_strategy | | | No | Retry strategy configuration options for the Component if any exceptions are encountered. |
| description | | string | No | Brief description of what the model does. |
| metadata | | | No | Meta information of a resource. In most cases it doesn't affect the system behavior but may be helpful to analyze project resources. |
| name | | string | Yes | The name of the model |
| flow_name | | string | No | Name of the Flow that the Component belongs to. |
| data_maintenance | | | No | The data maintenance configuration options for the Component. |
| tests | | | No | Defines tests to run on this Component's data. |
| read | | One of: GenericFileReadComponent LocalFileReadComponent SFTPReadComponent S3ReadComponent GcsReadComponent AbfsReadComponent HttpReadComponent MSSQLReadComponent MySQLReadComponent OracleReadComponent PostgresReadComponent SnowflakeReadComponent BigQueryReadComponent DatabricksReadComponent | Yes | Read component that reads data from a system. |
AbfsReadComponent
Component for reading files from an Azure Blob Storage container.
| Property | Default | Type | Required | Description |
|---|---|---|---|---|
| dependencies | | array[None] | No | List of dependencies that must complete before this Component runs. |
| event_time | | string | No | Timestamp column in the Component output used to represent Event time. |
| connection | | string | No | Name of the Connection to use for reading data. |
| columns | | array[None] | No | List specifying the columns to read from the source and transformations to make during read. |
| normalize | | boolean | No | Boolean flag indicating whether the output column names should be normalized to a standard naming convention after reading. |
| preserve_case | | boolean | No | Boolean flag indicating whether the case of the column names should be preserved after reading. |
| uppercase | | boolean | No | Boolean flag indicating whether the column names should be transformed to uppercase after reading. |
| strategy | | PartitionedStrategy | No | Ingest strategy when reading files. |
| abfs | | | Yes | Options for reading files from an Azure Blob Storage container. |
BigQueryReadComponent
Component that reads data from a BigQuery table.
| Property | Default | Type | Required | Description |
|---|---|---|---|---|
| dependencies | | array[None] | No | List of dependencies that must complete before this Component runs. |
| event_time | | string | No | Timestamp column in the Component output used to represent Event time. |
| connection | | string | No | Name of the Connection to use for reading data. |
| columns | | array[None] | No | List specifying the columns to read from the source and transformations to make during read. |
| normalize | | boolean | No | Boolean flag indicating whether the output column names should be normalized to a standard naming convention after reading. |
| preserve_case | | boolean | No | Boolean flag indicating whether the case of the column names should be preserved after reading. |
| uppercase | | boolean | No | Boolean flag indicating whether the column names should be transformed to uppercase after reading. |
| strategy | | Any of: full IncrementalReadStrategy PartitionedStrategy | No | Ingest strategy options. |
| read_options | | | No | Options for reading from the database or warehouse. |
| bigquery | | Any of: | Yes | BigQuery read options. |
DatabricksReadComponent
Component that reads data from a Databricks table.
| Property | Default | Type | Required | Description |
|---|---|---|---|---|
| dependencies | | array[None] | No | List of dependencies that must complete before this Component runs. |
| event_time | | string | No | Timestamp column in the Component output used to represent Event time. |
| connection | | string | No | Name of the Connection to use for reading data. |
| columns | | array[None] | No | List specifying the columns to read from the source and transformations to make during read. |
| normalize | | boolean | No | Boolean flag indicating whether the output column names should be normalized to a standard naming convention after reading. |
| preserve_case | | boolean | No | Boolean flag indicating whether the case of the column names should be preserved after reading. |
| uppercase | | boolean | No | Boolean flag indicating whether the column names should be transformed to uppercase after reading. |
| strategy | | Any of: full IncrementalReadStrategy PartitionedStrategy | No | Ingest strategy options. |
| read_options | | | No | Options for reading from the database or warehouse. |
| databricks | | Any of: | Yes | Databricks read options. |
GcsReadComponent
Component for reading files from a Google Cloud Storage bucket.
| Property | Default | Type | Required | Description |
|---|---|---|---|---|
| dependencies | | array[None] | No | List of dependencies that must complete before this Component runs. |
| event_time | | string | No | Timestamp column in the Component output used to represent Event time. |
| connection | | string | No | Name of the Connection to use for reading data. |
| columns | | array[None] | No | List specifying the columns to read from the source and transformations to make during read. |
| normalize | | boolean | No | Boolean flag indicating whether the output column names should be normalized to a standard naming convention after reading. |
| preserve_case | | boolean | No | Boolean flag indicating whether the case of the column names should be preserved after reading. |
| uppercase | | boolean | No | Boolean flag indicating whether the column names should be transformed to uppercase after reading. |
| strategy | | PartitionedStrategy | No | Ingest strategy when reading files. |
| gcs | | | Yes | Options for reading files from a Google Cloud Storage bucket. |
GenericFileReadComponent
Component for reading files from a filesystem.
| Property | Default | Type | Required | Description |
|---|---|---|---|---|
| dependencies | | array[None] | No | List of dependencies that must complete before this Component runs. |
| event_time | | string | No | Timestamp column in the Component output used to represent Event time. |
| connection | | string | No | Name of the Connection to use for reading data. |
| columns | | array[None] | No | List specifying the columns to read from the source and transformations to make during read. |
| normalize | | boolean | No | Boolean flag indicating whether the output column names should be normalized to a standard naming convention after reading. |
| preserve_case | | boolean | No | Boolean flag indicating whether the case of the column names should be preserved after reading. |
| uppercase | | boolean | No | Boolean flag indicating whether the column names should be transformed to uppercase after reading. |
| strategy | | PartitionedStrategy | No | Ingest strategy when reading files. |
| generic_file | | | Yes | Options for reading files from a filesystem. |
LocalFileReadComponent
Component for reading files from the local filesystem.
| Property | Default | Type | Required | Description |
|---|---|---|---|---|
| dependencies | | array[None] | No | List of dependencies that must complete before this Component runs. |
| event_time | | string | No | Timestamp column in the Component output used to represent Event time. |
| connection | | string | No | Name of the Connection to use for reading data. |
| columns | | array[None] | No | List specifying the columns to read from the source and transformations to make during read. |
| normalize | | boolean | No | Boolean flag indicating whether the output column names should be normalized to a standard naming convention after reading. |
| preserve_case | | boolean | No | Boolean flag indicating whether the case of the column names should be preserved after reading. |
| uppercase | | boolean | No | Boolean flag indicating whether the column names should be transformed to uppercase after reading. |
| strategy | | PartitionedStrategy | No | Ingest strategy when reading files. |
| local_file | | | Yes | Options for reading files from the local filesystem. |
MSSQLReadComponent
A component that reads data from a Microsoft SQL Server (MSSQL) database. Options include ingesting a single table or query, or multiple tables or queries.
| Property | Default | Type | Required | Description |
|---|---|---|---|---|
| dependencies | | array[None] | No | List of dependencies that must complete before this Component runs. |
| event_time | | string | No | Timestamp column in the Component output used to represent Event time. |
| connection | | string | No | Name of the Connection to use for reading data. |
| columns | | array[None] | No | List specifying the columns to read from the source and transformations to make during read. |
| normalize | | boolean | No | Boolean flag indicating whether the output column names should be normalized to a standard naming convention after reading. |
| preserve_case | | boolean | No | Boolean flag indicating whether the case of the column names should be preserved after reading. |
| uppercase | | boolean | No | Boolean flag indicating whether the column names should be transformed to uppercase after reading. |
| strategy | | Any of: full IncrementalReadStrategy PartitionedStrategy | No | Ingest strategy options. |
| read_options | | | No | Options for reading from the database or warehouse. |
| mssql | | Any of: | Yes | MSSQL read options. |
MySQLReadComponent
Component that reads data from a MySQL database. Options include ingesting a single table or query, or multiple tables or queries.
| Property | Default | Type | Required | Description |
|---|---|---|---|---|
| dependencies | | array[None] | No | List of dependencies that must complete before this Component runs. |
| event_time | | string | No | Timestamp column in the Component output used to represent Event time. |
| connection | | string | No | Name of the Connection to use for reading data. |
| columns | | array[None] | No | List specifying the columns to read from the source and transformations to make during read. |
| normalize | | boolean | No | Boolean flag indicating whether the output column names should be normalized to a standard naming convention after reading. |
| preserve_case | | boolean | No | Boolean flag indicating whether the case of the column names should be preserved after reading. |
| uppercase | | boolean | No | Boolean flag indicating whether the column names should be transformed to uppercase after reading. |
| strategy | | Any of: full IncrementalReadStrategy PartitionedStrategy | No | Ingest strategy options. |
| read_options | | | No | Options for reading from the database or warehouse. |
| use_duckdb | | boolean | No | Use DuckDB extension for reading data, which is faster but may have memory limitations with very large tables. Defaults to False. |
| mysql | | Any of: | Yes | MySQL read options. |
| use_checksum | | boolean | No | Use table checksum to detect data changes. If false or unset, will do full re-read for every run for full-sync. |
OracleReadComponent
Component that reads data from an Oracle table.
| Property | Default | Type | Required | Description |
|---|---|---|---|---|
| dependencies | | array[None] | No | List of dependencies that must complete before this Component runs. |
| event_time | | string | No | Timestamp column in the Component output used to represent Event time. |
| connection | | string | No | Name of the Connection to use for reading data. |
| columns | | array[None] | No | List specifying the columns to read from the source and transformations to make during read. |
| normalize | | boolean | No | Boolean flag indicating whether the output column names should be normalized to a standard naming convention after reading. |
| preserve_case | | boolean | No | Boolean flag indicating whether the case of the column names should be preserved after reading. |
| uppercase | | boolean | No | Boolean flag indicating whether the column names should be transformed to uppercase after reading. |
| strategy | | Any of: full IncrementalReadStrategy PartitionedStrategy | No | Ingest strategy options. |
| read_options | | | No | Options for reading from the database or warehouse. |
| oracle | Oracle | Any of: | No | Oracle read options. |
PostgresReadComponent
Component that reads data from a PostgreSQL table.
| Property | Default | Type | Required | Description |
|---|---|---|---|---|
| dependencies | | array[None] | No | List of dependencies that must complete before this Component runs. |
| event_time | | string | No | Timestamp column in the Component output used to represent Event time. |
| connection | | string | No | Name of the Connection to use for reading data. |
| columns | | array[None] | No | List specifying the columns to read from the source and transformations to make during read. |
| normalize | | boolean | No | Boolean flag indicating whether the output column names should be normalized to a standard naming convention after reading. |
| preserve_case | | boolean | No | Boolean flag indicating whether the case of the column names should be preserved after reading. |
| uppercase | | boolean | No | Boolean flag indicating whether the column names should be transformed to uppercase after reading. |
| strategy | | Any of: full IncrementalReadStrategy PartitionedStrategy | No | Ingest strategy options. |
| read_options | | | No | Options for reading from the database or warehouse. |
| use_duckdb | | boolean | No | Use DuckDB extension for reading data, which is faster but may have memory limitations with very large tables. Defaults to False. |
| postgres | Postgres | Any of: | No | PostgreSQL read options. |
| arrays_as_json | | boolean | No | Ingest PostgreSQL arrays as JSON. |
S3ReadComponent
Component for reading files from an S3 bucket.
| Property | Default | Type | Required | Description |
|---|---|---|---|---|
| dependencies | | array[None] | No | List of dependencies that must complete before this Component runs. |
| event_time | | string | No | Timestamp column in the Component output used to represent Event time. |
| connection | | string | No | Name of the Connection to use for reading data. |
| columns | | array[None] | No | List specifying the columns to read from the source and transformations to make during read. |
| normalize | | boolean | No | Boolean flag indicating whether the output column names should be normalized to a standard naming convention after reading. |
| preserve_case | | boolean | No | Boolean flag indicating whether the case of the column names should be preserved after reading. |
| uppercase | | boolean | No | Boolean flag indicating whether the column names should be transformed to uppercase after reading. |
| strategy | | PartitionedStrategy | No | Ingest strategy when reading files. |
| s3 | | | Yes | Options for reading files from an S3 bucket. |
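For file-based readers such as S3, the strategy slot takes PartitionedStrategy directly. The sketch below is illustrative only: the connection name is a placeholder, and the `s3` block's own options are not documented on this page, so they are elided.

```yaml
read:
  connection: my_s3_connection
  strategy:
    partitioned:
      enable_substitution_by_partition_name: true
    on_schema_change: sync_all_columns
  s3: ...   # bucket and path options, documented elsewhere
```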
SFTPReadComponent
Component for reading files from an SFTP server.
| Property | Default | Type | Required | Description |
|---|---|---|---|---|
| dependencies | | array[None] | No | List of dependencies that must complete before this Component runs. |
| event_time | | string | No | Timestamp column in the Component output used to represent Event time. |
| connection | | string | No | Name of the Connection to use for reading data. |
| columns | | array[None] | No | List specifying the columns to read from the source and transformations to make during read. |
| normalize | | boolean | No | Boolean flag indicating whether the output column names should be normalized to a standard naming convention after reading. |
| preserve_case | | boolean | No | Boolean flag indicating whether the case of the column names should be preserved after reading. |
| uppercase | | boolean | No | Boolean flag indicating whether the column names should be transformed to uppercase after reading. |
| strategy | | PartitionedStrategy | No | Ingest strategy when reading files. |
| sftp | | | Yes | Options for reading files from an SFTP server. |
SnowflakeReadComponent
Component that reads data from a Snowflake table.
| Property | Default | Type | Required | Description |
|---|---|---|---|---|
| dependencies | | array[None] | No | List of dependencies that must complete before this Component runs. |
| event_time | | string | No | Timestamp column in the Component output used to represent Event time. |
| connection | | string | No | Name of the Connection to use for reading data. |
| columns | | array[None] | No | List specifying the columns to read from the source and transformations to make during read. |
| normalize | | boolean | No | Boolean flag indicating whether the output column names should be normalized to a standard naming convention after reading. |
| preserve_case | | boolean | No | Boolean flag indicating whether the case of the column names should be preserved after reading. |
| uppercase | | boolean | No | Boolean flag indicating whether the column names should be transformed to uppercase after reading. |
| strategy | | Any of: full IncrementalReadStrategy PartitionedStrategy | No | Ingest strategy options. |
| read_options | | | No | Options for reading from the database or warehouse. |
| snowflake | | Any of: | Yes | Snowflake read options. |
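For warehouse readers such as Snowflake, `strategy` accepts the string `full`, an incremental strategy, or a partitioned strategy. The sketch below shows the full and partitioned forms side by side; connection names are placeholders, the `snowflake` block is elided, and the assumption is that the strategy variant is selected by the keys present.

```yaml
# Simplest form: full re-read on every run.
read:
  connection: my_snowflake_connection
  strategy: full
  snowflake: ...

---
# Partitioned form.
read:
  connection: my_snowflake_connection
  strategy:
    partitioned:
      enable_substitution_by_partition_name: true
  snowflake: ...
```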
TransformComponent
Component that executes SQL or Python code to transform data.
| Property | Default | Type | Required | Description |
|---|---|---|---|---|
| data_plane | | One of: SnowflakeDataPlane BigQueryDataPlane DuckdbDataPlane DatabricksDataPlane | No | Data Plane-specific configuration options for Components. |
| skip | | boolean | No | Boolean flag indicating whether to skip processing for the Component or not. |
| retry_strategy | | | No | Retry strategy configuration options for the Component if any exceptions are encountered. |
| description | | string | No | Brief description of what the model does. |
| metadata | | | No | Meta information of a resource. In most cases it doesn't affect the system behavior but may be helpful to analyze project resources. |
| name | | string | Yes | The name of the model |
| flow_name | | string | No | Name of the Flow that the Component belongs to. |
| data_maintenance | | | No | The data maintenance configuration options for the Component. |
| tests | | | No | Defines tests to run on this Component's data. |
| transform | | One of: SqlTransform PythonTransform SnowparkTransform PySparkTransform | Yes | Transform that executes SQL or Python code for data transformation. |
PySparkTransform
PySpark Transforms execute PySpark code to transform data.
| Property | Default | Type | Required | Description |
|---|---|---|---|---|
| dependencies | | array[None] | No | List of dependencies that must complete before this Component runs. |
| event_time | | string | No | Timestamp column in the Component output used to represent Event time. |
| microbatch | | boolean | No | Whether to process data in microbatches. |
| batch_size | | string | No | Size/time granularity of the microbatch to process. |
| lookback | 1 | integer | No | Number of time intervals prior to the current interval (and inclusive of current interval) to process in time-series processing mode. |
| begin | | string | No | 'Beginning of time' for this Component. If provided, time intervals before this time will be skipped in a time-series run. |
| inputs | | array[None] | No | List of input components to use as Transform data sources. |
| strategy | | Any of: PartitionedStrategy IncrementalStrategy string ("view", "table") | No | Transform strategy: either incremental, partitioned, or view/table. |
| pyspark | | | No | PySpark function to execute for data transformation. |
PythonTransform
Python Transform.
| Property | Default | Type | Required | Description |
|---|---|---|---|---|
| dependencies | | array[None] | No | List of dependencies that must complete before this Component runs. |
| event_time | | string | No | Timestamp column in the Component output used to represent Event time. |
| microbatch | | boolean | No | Whether to process data in microbatches. |
| batch_size | | string | No | Size/time granularity of the microbatch to process. |
| lookback | 1 | integer | No | Number of time intervals prior to the current interval (and inclusive of current interval) to process in time-series processing mode. |
| begin | | string | No | 'Beginning of time' for this Component. If provided, time intervals before this time will be skipped in a time-series run. |
| inputs | | array[None] | No | List of input components to use as Transform data sources. |
| strategy | | Any of: PartitionedStrategy IncrementalStrategy string ("view", "table") | No | Transform strategy: either incremental, partitioned, or view/table. |
| python | | | No | Python function to execute for data transformation. |
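Transforms can also run partitioned. The sketch below shows a hypothetical Python transform using a partitioned strategy: the input name is a placeholder, the Python function itself is elided, and `output_type` comes from PartitionedOptions (documented at the end of this page), where it applies to Transforms only.

```yaml
transform:
  inputs:
    - orders_partitioned_read   # placeholder upstream Component
  strategy:
    partitioned:
      enable_substitution_by_partition_name: true
      output_type: table
  python: ...   # transformation function, defined elsewhere
```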
SnowparkTransform
Snowpark Transforms execute Python code to transform data within the Snowflake platform.
| Property | Default | Type | Required | Description |
|---|---|---|---|---|
| dependencies | | array[None] | No | List of dependencies that must complete before this Component runs. |
| event_time | | string | No | Timestamp column in the Component output used to represent Event time. |
| microbatch | | boolean | No | Whether to process data in microbatches. |
| batch_size | | string | No | Size/time granularity of the microbatch to process. |
| lookback | 1 | integer | No | Number of time intervals prior to the current interval (and inclusive of current interval) to process in time-series processing mode. |
| begin | | string | No | 'Beginning of time' for this Component. If provided, time intervals before this time will be skipped in a time-series run. |
| inputs | | array[None] | No | List of input components to use as Transform data sources. |
| strategy | | Any of: PartitionedStrategy IncrementalStrategy string ("view", "table") | No | Transform strategy: either incremental, partitioned, or view/table. |
| snowpark | | | No | Snowpark function to execute for data transformation. |
SqlTransform
SQL Transform.
| Property | Default | Type | Required | Description |
|---|---|---|---|---|
| dependencies | | array[None] | No | List of dependencies that must complete before this Component runs. |
| event_time | | string | No | Timestamp column in the Component output used to represent Event time. |
| microbatch | | boolean | No | Whether to process data in microbatches. |
| batch_size | | string | No | Size/time granularity of the microbatch to process. |
| lookback | 1 | integer | No | Number of time intervals prior to the current interval (and inclusive of current interval) to process in time-series processing mode. |
| begin | | string | No | 'Beginning of time' for this Component. If provided, time intervals before this time will be skipped in a time-series run. |
| inputs | | array[None] | No | List of input components to use as Transform data sources. |
| strategy | | Any of: PartitionedStrategy IncrementalStrategy string ("view", "table") | No | Transform strategy: either incremental, partitioned, or view/table. |
| sql | | string | No | SQL query to execute for data transformation. |
| dialect | spark | | No | SQL dialect to use for the query. Set to 'None' for the Data Plane's default dialect, or 'spark' for Spark SQL. |
PartitionedOptions
Options related to partition optimization, in particular the policy that determines which partitions to ingest.
| Property | Default | Type | Required | Description |
|---|---|---|---|---|
| enable_substitution_by_partition_name | | boolean | Yes | Enable substitution by partition name. |
| output_type | table | string ("table", "view") | No | Output type for partitioned data. Must be either 'table' or 'view'. This strategy only applies to Transforms. |
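Putting the two options together, a partitioned options block might look like the following sketch; values are illustrative, and `output_type` is only honored for Transform strategies, per the table above.

```yaml
partitioned:
  enable_substitution_by_partition_name: true   # required
  output_type: view                             # 'table' (default) or 'view'; Transforms only
```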