Project
A Project is a group of related Connections, Flows/Components, Profiles, Vaults, Automations and other code/configuration artifacts. Project files define the mapping of filesystem paths to different kinds of artifacts that the platform can access when running Flows for the Project.
Examples
- simple_project_name_description.yaml
- custom_project_with_folders.yaml
- project_with_version_and_folders.yaml
- project_with_defaults.yaml

```yaml
# simple_project_name_description.yaml
project:
  name: SimpleProject
  description: A simple project with only a name and description.
```

```yaml
# custom_project_with_folders.yaml
project:
  name: CustomProject
  description: A project with custom source and test folders.
  sources:
    - custom_src/
  tests:
    - custom_tests/
```

```yaml
# project_with_version_and_folders.yaml
project:
  name: MyProject
  description: A project with specific version and multiple connection and flow folders.
  version: "1.0.0"
  connections:
    - connections/folder1/
    - connections/folder2/
  flows:
    - flows/folder1/
    - flows/folder2/
```
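The fourth example file listed above, project_with_defaults.yaml, has no body on this page. The sketch below is an assumption of what it might contain, based on the defaults, ConfigFilter, and ComponentSpec schemas documented later; all resource names are hypothetical.

```yaml
# project_with_defaults.yaml -- illustrative sketch, not from the source
project:
  name: ProjectWithDefaults
  description: A project that applies default configs to matching resources.
  defaults:
    - kind: Component          # ConfigFilter: target Components
      name: stg_orders         # hypothetical Component name
      flow_name: nightly_load  # hypothetical Flow name
      spec:                    # ComponentSpec applied at runtime
        skip: true
```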
Project
Below are the properties for the Project. Each property links to its details section further down the page.
| Property | Default | Type | Required | Description |
|---|---|---|---|---|
| project | | ProjectOptions | Yes | Project options. |
Property Details
ProjectOptions
Options that can be specified for a Project.
| Property | Default | Type | Required | Description |
|---|---|---|---|---|
| pip_packages | | array[string] | No | Python PIP packages to install. |
| parameters | | object | No | Dictionary of parameters to use for the resource. |
| defaults | | array[ConfigFilter] | No | List of default configs with filters that can be applied to a resource config. |
| description | | string | No | Brief description of what the Project does. |
| metadata | | ResourceMetadata | No | Meta information of a resource. In most cases it doesn't affect the system behavior but may be helpful to analyze Project resources. |
| name | | string | Yes | The name of the Project. |
| version | | string | No | Project version. |
| connections | ['connections/'] | array[string] | No | List of Connection definition folders used in the Project. |
| flows | ['flows/'] | array[string] | No | List of Flow definition folders used in the Project. |
| profiles | ['profiles/'] | array[string] | No | List of Profile definition folders used in the Project. |
| sources | ['src/'] | array[string] | No | List of source definition folders used in the Project. |
| tests | ['tests/'] | array[string] | No | List of test definition folders used in the Project. |
| vaults | ['vaults/'] | array[string] | No | List of Vault definition folders used in the Project. |
| actions | ['actions/'] | array[string] | No | List of Action definition folders used in the Project. |
| automations | ['automations/'] | array[string] | No | List of Automation definition folders used in the Project. |
| sensors | ['sensors/'] | array[string] | No | List of Sensor definition folders used in the Project. |
| ssh_tunnels | ['ssh_tunnels/'] | array[string] | No | List of SSH tunnel definition folders used in the Project. |
| applications | ['applications/'] | array[string] | No | List of Application definition folders used in the Project. |
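As a minimal sketch, pip_packages and parameters can be combined with the folder mappings like so; the package names, parameter keys, and values below are assumptions, not from the source.

```yaml
# Illustrative only: packages and parameters are hypothetical
project:
  name: ParameterizedProject
  description: A project with extra PIP packages and shared parameters.
  pip_packages:
    - requests
    - pandas
  parameters:
    target_schema: analytics
  flows:
    - flows/
```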
ConfigFilter
Filter used to target configuration settings to a specific Flow and/or Component.
| Property | Default | Type | Required | Description |
|---|---|---|---|---|
| kind | | string ("Flow", "Component") | Yes | Resource kind to target with this configuration. |
| name | | Any of: string, array[string], array[RegexFilter] | Yes | Name of the resource to target with this configuration. |
| flow_name | | Any of: string, array[string], array[RegexFilter] | No | Name of the Flow to target with this configuration. |
| spec | | Any of: FlowSpec, ComponentSpec | No | Dictionary of parameters to use for the resource. |
ComponentSpec
Specification for configuration applied to a component at runtime based on the config filter.
| Property | Default | Type | Required | Description |
|---|---|---|---|---|
| skip | | boolean | No | Boolean flag indicating whether to skip processing for the Component. |
| retry_strategy | | RetryStrategy | No | Retry strategy configuration options for the Component if any exceptions are encountered. |
| data_maintenance | | DataMaintenance | No | The data maintenance configuration options for the Component. |
| data_plane | | One of: SnowflakeDataPlane, BigQueryDataPlane, DuckdbDataPlane, DatabricksDataPlane | No | Data Plane-specific configuration options for Components. |
FlowSpec
Specification for configuration applied to a Flow at runtime based on the config filter.
| Property | Default | Type | Required | Description |
|---|---|---|---|---|
| data_plane | | DataPlane | No | The Data Plane that will be used for the Flow at runtime. |
| runner | | RunnerConfig | No | Runner configuration. |
| component_concurrency | | integer | No | Maximum number of concurrent Components to run within this Flow. |
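Assuming FlowSpec values are supplied through a defaults entry's spec, as the ConfigFilter section above suggests, a Flow-level default might look like this sketch with hypothetical names:

```yaml
# Illustrative FlowSpec inside a project defaults entry
defaults:
  - kind: Flow
    name: nightly_load           # hypothetical Flow name
    spec:
      component_concurrency: 4   # at most 4 Components run concurrently
      runner:
        size: Medium             # one of the standard RuntimeSize tiers
```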
DataPlane
The external warehouse where data is persisted throughout the Flow runs, and where primary computation on the data itself occurs.
| Property | Default | Type | Required | Description |
|---|---|---|---|---|
| connection_name | | string | No | |
| metadata_storage_location_prefix | | string | No | Prefix to prepend to the names of metadata tables created for this Flow. The prefix may include database/project/etc. and schema/dataset/etc. where applicable. If not provided, metadata tables are stored alongside the output data tables per the Data Plane's Connection configuration. |
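A sketch of the generic DataPlane settings on a Flow; the Connection name and prefix format below are assumptions.

```yaml
# Illustrative DataPlane selection for a Flow
spec:
  data_plane:
    connection_name: warehouse_prod   # hypothetical Connection name
    metadata_storage_location_prefix: ANALYTICS.METADATA  # database/schema prefix; exact format depends on the Data Plane
```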
RegexFilter
A filter used to target resources based on a regex pattern.
| Property | Default | Type | Required | Description |
|---|---|---|---|---|
| regex | | string | Yes | The regex to filter the resources. |
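The schema source is ambiguous about where RegexFilter plugs in; assuming it can stand in for a literal name in a ConfigFilter's name list, pattern-based targeting might look like:

```yaml
# Assumption: a RegexFilter object in place of a literal resource name
defaults:
  - kind: Component
    name:
      - regex: ".*_staging"   # matches Component names ending in _staging
    spec:
      skip: true
```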
RunnerConfig
Configuration for the Flow runner.
| Property | Default | Type | Required | Description |
|---|---|---|---|---|
| size | | Any of: RuntimeSize, CustomRuntimeSize | No | Runtime size configuration. Can be: (1) a tier name string (X-Small, Small, Medium, Large, X-Large), or (2) a CustomRuntimeSize object with tier-based or fully custom resources. |
CustomRuntimeSize
Runtime size configuration with flexible resource specification. Two modes are supported:
1. Tier-based: specify a tier with optional resource overrides.
2. Fully custom: specify CPU directly with optional memory/disk.
Either 'tier' or 'cpu' must be provided (or both).
| Property | Default | Type | Required | Description |
|---|---|---|---|---|
| tier | | RuntimeSize | No | Base size tier (X-Small, Small, Medium, Large, X-Large). Required unless 'cpu' is specified. |
| cpu | | string | No | CPU allocation in whole cores (e.g., '1', '4', '8'). Required unless 'tier' is specified. |
| memory | | string | No | Memory allocation. Use 'high' for tier-based doubling, or specify an exact value with a unit suffix (e.g., '32Gi', '4G', '512Mi'). |
| disk | | string | No | Disk allocation with a unit suffix (e.g., '100Gi', '1Ti', '500G'). |
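The two modes might be written as follows; the values are illustrative.

```yaml
# Mode 1: tier-based with a memory override
runner:
  size:
    tier: Large
    memory: high     # doubles the tier's default memory
---
# Mode 2: fully custom resources, no tier
runner:
  size:
    cpu: "4"
    memory: 32Gi
    disk: 100Gi
```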
RuntimeSize
Enumeration of standard runtime size tiers. Each tier corresponds to specific resource allocations (CPU, memory, disk).
No properties defined.
BigQueryDataPlane
| Property | Default | Type | Required | Description |
|---|---|---|---|---|
| bigquery | | BigQueryDataPlaneOptions | Yes | BigQuery configuration options. |
BigQueryDataPlaneOptions
| Property | Default | Type | Required | Description |
|---|---|---|---|---|
| partition_by | | Any of: BigQueryRangePartitioning, BigQueryTimePartitioning | No | Partition By clause for the table. |
| cluster_by | | array[string] | No | Clustering keys to be added to the table. |
BigQueryRangePartitioning
| Property | Default | Type | Required | Description |
|---|---|---|---|---|
| field | | string | Yes | Field to partition by. |
| range | | RangeOptions | Yes | Range partitioning options. |
BigQueryTimePartitioning
| Property | Default | Type | Required | Description |
|---|---|---|---|---|
| field | | string | Yes | Field to partition by. |
| granularity | | string ("DAY", "HOUR", "MONTH", "YEAR") | Yes | Granularity of the time partitioning. |
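Putting the BigQuery options together, a time-partitioned, clustered table might be configured like this sketch; the column names are hypothetical, and the nesting under spec.data_plane follows the ComponentSpec schema above.

```yaml
# Illustrative BigQuery Data Plane options on a Component
spec:
  data_plane:
    bigquery:
      partition_by:            # BigQueryTimePartitioning variant
        field: event_date      # hypothetical column
        granularity: DAY
      cluster_by:
        - customer_id          # hypothetical clustering key
```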
DatabricksDataPlane
| Property | Default | Type | Required | Description |
|---|---|---|---|---|
| databricks | `{cluster_by: null, pyspark_job_cluster_id: null, table_properties: null}` | DatabricksDataPlaneOptions | No | Databricks configuration options. |
DatabricksDataPlaneOptions
| Property | Default | Type | Required | Description |
|---|---|---|---|---|
| table_properties | | object with property values of type string | No | Table properties to include when creating the data table. This setting is equivalent to the CREATE TABLE ... TBLPROPERTIES clause. Please refer to the Databricks documentation at https://docs.databricks.com/aws/en/delta/table-properties for available properties depending on your Data Plane. |
| pyspark_job_cluster_id | | string | No | ID of the compute cluster to use for PySpark jobs. |
| cluster_by | | array[string] | No | Clustering keys to be added to the table. |
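A sketch of the Databricks options; the table property shown is a standard Delta property, and the cluster ID is hypothetical.

```yaml
# Illustrative Databricks Data Plane options
spec:
  data_plane:
    databricks:
      table_properties:
        delta.autoOptimize.optimizeWrite: "true"
      pyspark_job_cluster_id: "0123-456789-abcdef"   # hypothetical cluster ID
      cluster_by:
        - customer_id
```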
DuckdbDataPlane
| Property | Default | Type | Required | Description |
|---|---|---|---|---|
| duckdb | `{ducklake_data_table_compaction: {small_file_count_threshold: 50, small_file_ratio_threshold: 0.25, small_file_record_count_limit: 100000}, ducklake_metadata_table_compaction: {small_file_count_threshold: 10, small_file_ratio_threshold: null, small_file_record_count_limit: 10}}` | DuckDbDataPlaneOptions | No | DuckDB configuration options. |
DuckDbDataPlaneOptions
| Property | Default | Type | Required | Description |
|---|---|---|---|---|
| ducklake_metadata_table_compaction | `{small_file_count_threshold: 10, small_file_ratio_threshold: null, small_file_record_count_limit: 10}` | DuckLakeTableCompactionSettings | No | Settings for compacting metadata tables. If present, metadata table compaction is enabled. |
| ducklake_data_table_compaction | `{small_file_count_threshold: 50, small_file_ratio_threshold: 0.25, small_file_record_count_limit: 100000}` | DuckLakeTableCompactionSettings | No | Settings for compacting data tables. If present, data table compaction is enabled. |
DuckLakeTableCompactionSettings
Settings for DuckLake table compaction.
| Property | Default | Type | Required | Description |
|---|---|---|---|---|
| small_file_record_count_limit | 10 | integer | No | Files with fewer records than this limit are considered 'small'. |
| small_file_count_threshold | 10 | integer | No | Run manual table compaction if the number of files with fewer than small_file_record_count_limit records exceeds this threshold. |
| small_file_ratio_threshold | | number | No | Fraction (0.0-1.0) of small files relative to total files. If set, both the absolute count AND the ratio must pass for compaction to be triggered. If None, only the absolute count check is performed. |
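A sketch that mirrors the documented defaults, enabling compaction for both table kinds:

```yaml
# Illustrative DuckDB Data Plane compaction settings (values match the documented defaults)
spec:
  data_plane:
    duckdb:
      ducklake_data_table_compaction:
        small_file_count_threshold: 50
        small_file_ratio_threshold: 0.25
        small_file_record_count_limit: 100000
      ducklake_metadata_table_compaction:
        small_file_count_threshold: 10
        small_file_record_count_limit: 10
```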
FabricDataPlane
| Property | Default | Type | Required | Description |
|---|---|---|---|---|
| fabric | `{spark_session_config: null}` | FabricDataPlaneOptions | No | Fabric configuration options. |
FabricDataPlaneOptions
| Property | Default | Type | Required | Description |
|---|---|---|---|---|
| spark_session_config | | LivySparkSessionConfig | No | Spark session configuration. |
RangeOptions
| Property | Default | Type | Required | Description |
|---|---|---|---|---|
| start | | integer | Yes | Start of the range partitioning. |
| end | | integer | Yes | End of the range partitioning. |
| interval | | integer | Yes | Interval of the range partitioning. |
SnowflakeDataPlane
| Property | Default | Type | Required | Description |
|---|---|---|---|---|
| snowflake | | SnowflakeDataPlaneOptions | Yes | Snowflake configuration options. |
SnowflakeDataPlaneOptions
| Property | Default | Type | Required | Description |
|---|---|---|---|---|
| cluster_by | | array[string] | No | Clustering keys to be added to the table. |
SynapseDataPlane
| Property | Default | Type | Required | Description |
|---|---|---|---|---|
| synapse | `{spark_session_config: null}` | SynapseDataPlaneOptions | No | Synapse configuration options. |
SynapseDataPlaneOptions
| Property | Default | Type | Required | Description |
|---|---|---|---|---|
| spark_session_config | | LivySparkSessionConfig | No | Spark session configuration. |
LivySparkSessionConfig
| Property | Default | Type | Required | Description |
|---|---|---|---|---|
| pool | | string | No | Pool to use for the Spark session. |
| driver_memory | | string | No | Memory to use for the Spark driver. |
| driver_cores | | integer | No | Number of cores to use for the Spark driver. |
| executor_memory | | string | No | Memory to use for each Spark executor. |
| executor_cores | | integer | No | Number of cores to use for each executor. |
| num_executors | | integer | No | Number of executors to use for the Spark session. |
| session_key_override | | string | No | Key to use for the Spark session. |
| max_concurrent_sessions | | integer | No | Maximum number of concurrent sessions allowed for this configuration. |
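Assuming the same spec.data_plane nesting as the other Data Planes, a Synapse (or Fabric) Spark session might be tuned like this sketch with hypothetical values:

```yaml
# Illustrative Livy Spark session configuration for a Synapse Data Plane
spec:
  data_plane:
    synapse:
      spark_session_config:
        pool: small-pool       # hypothetical pool name
        driver_memory: 8g
        driver_cores: 2
        executor_memory: 8g
        executor_cores: 4
        num_executors: 2
```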
DataMaintenance
Data maintenance configuration options for Components.
| Property | Default | Type | Required | Description |
|---|---|---|---|---|
| enabled | | boolean | No | Boolean flag indicating whether data maintenance is enabled for the Component. |
| manual_table_compaction | | boolean | No | Boolean flag indicating whether manual table compaction is enabled for the Component. This is currently only relevant for DuckLake Data Planes. |
| manual_table_compaction_record_count_threshold | 10 | integer | No | Files with fewer than this number of records are considered when determining whether to perform manual table compaction. This is currently only relevant for DuckLake Data Planes. |
| manual_table_compaction_file_count_threshold | 10 | integer | No | Run manual table compaction if the number of files with fewer than manual_table_compaction_record_count_threshold records exceeds this threshold. This is currently only relevant for DuckLake Data Planes. |
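A sketch of data maintenance settings on a ComponentSpec; the thresholds mirror the documented defaults.

```yaml
# Illustrative DataMaintenance settings for a DuckLake-backed Component
spec:
  data_maintenance:
    enabled: true
    manual_table_compaction: true
    manual_table_compaction_record_count_threshold: 10
    manual_table_compaction_file_count_threshold: 10
```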
ResourceMetadata
Meta information of a resource. In most cases, it doesn't affect the system behavior but may be helpful to analyze Project resources.
| Property | Default | Type | Required | Description |
|---|---|---|---|---|
| source | | ResourceLocation | No | The origin or source information for the resource. |
| source_event_uuid | | string | No | Event UUID associated with the creation of this resource. |
ResourceLocation
The origin or source information for the resource.
| Property | Default | Type | Required | Description |
|---|---|---|---|---|
| path | | string | Yes | Path within repository files where the resource is defined. |
| first_line_number | | integer | No | First line number within the path file where the resource is defined. |
RetryStrategy
Retry strategy configuration for Component operations. This configuration leverages the tenacity library to implement robust retry mechanisms, and the configuration options map directly to tenacity's retry parameters; details on the tenacity library can be found at https://tenacity.readthedocs.io/en/latest/api.html#retry-main-api. The current implementation includes:
- stop_after_attempt: maximum number of retry attempts.
- stop_after_delay: give up on retries one attempt before the delay would be exceeded.
At least one of the two parameters must be supplied. Additional retry parameters will be added as needed to support more complex use cases.
| Property | Default | Type | Required | Description |
|---|---|---|---|---|
| stop_after_attempt | | integer | No | Number of retry attempts before giving up. If set to None, it will not stop after any number of attempts. |
| stop_after_delay | | integer | No | Maximum time (in seconds) to spend on retries before giving up. If set to None, it will not stop after any time delay. |
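A sketch combining both stop conditions, so retries end after 5 attempts or 300 seconds, whichever comes first:

```yaml
# Illustrative RetryStrategy on a ComponentSpec
spec:
  retry_strategy:
    stop_after_attempt: 5
    stop_after_delay: 300    # seconds
```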