DuckLake Table Compaction Settings
Settings for DuckLake table compaction.
DuckLakeTableCompactionSettings
DuckLakeTableCompactionSettings is defined beneath the following ancestor nodes in the YAML structure:
- BackfillRun
- BackfillRunOptions
- Component
- CustomPythonReadComponent
- ExternalTableComponent
- Flow
- FlowOptions
- FlowRun
- FlowRunBaseOptions
- FlowRunOptions
- Profile
- ProfileOptions
- Project
- ProjectOptions
- ConfigFilter
- ComponentSpec
- ReadComponent
- TaskComponent
- TransformComponent
- DuckdbDataPlane
- DuckDbDataPlaneOptions
Below are the properties for the DuckLakeTableCompactionSettings. Each property links to the specific details section further down in this page.
| Property | Default | Type | Required | Description |
|---|---|---|---|---|
| small_file_record_count_limit | 10 | integer | No | Files with fewer records than this limit are considered 'small'. |
| small_file_count_threshold | 10 | integer | No | Run manual table compaction if the number of files with fewer than small_file_record_count_limit records exceeds this threshold. |
| small_file_ratio_threshold | | number | No | Ratio (0.0-1.0) of small files to total files. If set, both the absolute count check and the ratio check must pass for compaction to be triggered. If unset (null), only the absolute count check is performed. |
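To illustrate how the three properties interact, the fragment below sets all of them and walks through two made-up tables in the comments. The values match the defaults this page lists for `ducklake_data_table_compaction`. The file counts are hypothetical, and this page does not state whether each comparison is strict or inclusive at the boundary, so the examples stay clear of the boundary values.

```yaml
# Sketch only: values mirror the ducklake_data_table_compaction defaults on this page.
small_file_record_count_limit: 100000  # a file with fewer than 100000 records counts as "small"
small_file_count_threshold: 50         # count check: the number of small files must exceed 50
small_file_ratio_threshold: 0.25       # ratio check: small files / total files, as a fraction

# Hypothetical table A: 200 files in total, 60 of them small.
#   count check: 60 exceeds 50             -> passes
#   ratio check: 60 / 200 = 0.30 vs 0.25   -> passes
#   Both checks pass, so compaction would be triggered.
#
# Hypothetical table B: 1000 files in total, 60 of them small.
#   count check: 60 exceeds 50             -> passes
#   ratio check: 60 / 1000 = 0.06 vs 0.25  -> fails
#   Compaction would not be triggered, because both checks must pass.
```

If `small_file_ratio_threshold` is left out, only the count check applies, so both hypothetical tables would be compacted.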
Property Details
BackfillRun
Defines the parameters for a backfill run.
| Property | Default | Type | Required | Description |
|---|---|---|---|---|
| backfill_run | | | Yes | Backfill run options. |
BackfillRunOptions
Options for a backfill run.
| Property | Default | Type | Required | Description |
|---|---|---|---|---|
| description | | string | No | Brief description of what the model does. |
| metadata | | | No | Meta information of a resource. In most cases it doesn't affect the system behavior but may be helpful to analyze project resources. |
| name | | string | Yes | The name of the model. |
| flow_name | | string | Yes | Name of the Flow that is to be backfilled. |
| start_time | | string | Yes | Start time of the time range to be backfilled. |
| end_time | | string | Yes | End time of the time range to be backfilled. |
| granularity | | string ("day", "week", "month") | Yes | Time granularity to use for backfill. Must be one of: 'day', 'week', 'month'. The backfill runner divides the date range into Flow runs of this granularity and launches these Flow runs. |
| max_concurrent_flow_runs | 1 | integer | No | Maximum number of concurrent Flow runs used for backfill. This is used to limit the number of Flow runners (and hence cluster resources) that are launched simultaneously. |
| backfill_order | | string ("forward_chronological", "reverse_chronological") | No | Order to use for backfilling - either forward or reverse chronological order. |
| flow_run_options | | | No | Additional options for each Flow run launched during the backfill. |
| run_final_sync | | boolean | No | Boolean flag indicating whether to run a final sync after concurrent backfill Flow runs. This final sync is a single Flow run that is executed without any time parameters, and is meant to sync the data to the latest state and capture any missing time intervals. |
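The properties above could be combined into a backfill definition along the following lines. This is a sketch rather than a verified configuration: the resource and Flow names are placeholders, and the timestamp format is an assumption, since this page only states that `start_time` and `end_time` are strings.

```yaml
# Hypothetical backfill definition; names and timestamps are placeholders.
backfill_run:
  name: orders_backfill                 # hypothetical resource name
  flow_name: orders_flow                # hypothetical Flow to backfill
  start_time: "2024-01-01T00:00:00Z"    # timestamp format is an assumption
  end_time: "2024-04-01T00:00:00Z"      # timestamp format is an assumption
  granularity: week                     # one of: day, week, month
  max_concurrent_flow_runs: 2           # default is 1
  backfill_order: forward_chronological
  run_final_sync: true                  # one extra Flow run without time parameters
```

With `granularity: week`, the backfill runner would split the date range into week-sized Flow runs and launch at most two of them at a time.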
Component
A Component is a fundamental building block of a data Flow. Supported Component types include: Read, Transform, Task, Test, and more.
| Property | Default | Type | Required | Description |
|---|---|---|---|---|
| component | | One of: CustomPythonReadComponent, ApplicationComponent, AliasedTableComponent, ExternalTableComponent | Yes | Component configuration options. |
CustomPythonReadComponent
Component that reads data using user-defined custom Python code.
| Property | Default | Type | Required | Description |
|---|---|---|---|---|
| data_plane | | One of: SnowflakeDataPlane, BigQueryDataPlane, DuckdbDataPlane, DatabricksDataPlane | No | Data Plane-specific configuration options for Components. |
| skip | | boolean | No | Boolean flag indicating whether to skip processing for the Component or not. |
| retry_strategy | | | No | Retry strategy configuration options for the Component if any exceptions are encountered. |
| data_maintenance | | | No | The data maintenance configuration options for the Component. |
| description | | string | No | Brief description of what the model does. |
| metadata | | | No | Meta information of a resource. In most cases it doesn't affect the system behavior but may be helpful to analyze project resources. |
| name | | string | Yes | The name of the model. |
| flow_name | | string | No | Name of the Flow that the Component belongs to. |
| tests | | | No | Defines tests to run on this Component's data. |
| custom_python_read | | | Yes | |
ExternalTableComponent
Component that constructs and updates an External Table. Currently supported for Snowflake only.
| Property | Default | Type | Required | Description |
|---|---|---|---|---|
| data_plane | | One of: SnowflakeDataPlane, BigQueryDataPlane, DuckdbDataPlane, DatabricksDataPlane | No | Data Plane-specific configuration options for Components. |
| skip | | boolean | No | Boolean flag indicating whether to skip processing for the Component or not. |
| retry_strategy | | | No | Retry strategy configuration options for the Component if any exceptions are encountered. |
| data_maintenance | | | No | The data maintenance configuration options for the Component. |
| description | | string | No | Brief description of what the model does. |
| metadata | | | No | Meta information of a resource. In most cases it doesn't affect the system behavior but may be helpful to analyze project resources. |
| name | | string | Yes | The name of the model. |
| flow_name | | string | No | Name of the Flow that the Component belongs to. |
| tests | | | No | Defines tests to run on this Component's data. |
| external_table | | Any of: | Yes | Configuration options for the External Table Component. |
Flow
A Flow is the primary unit of execution in Ascend and contains a collection of Components assembled into a directed acyclic graph (DAG).
| Property | Default | Type | Required | Description |
|---|---|---|---|---|
| flow | | | Yes | |
FlowOptions
Defines the options for a Flow
| Property | Default | Type | Required | Description |
|---|---|---|---|---|
| parameters | | object with property values of type None | No | Dictionary of parameters to use for resource. |
| defaults | | array[None] | No | List of default configs with filters that can be applied to a resource config. |
| description | | string | No | Brief description of what the model does. |
| metadata | | | No | Meta information of a resource. In most cases it doesn't affect the system behavior but may be helpful to analyze project resources. |
| name | | string | Yes | The name of the model. |
| data_plane | | | No | Data plane to use for the flow. |
| version | | string | No | Flow version. |
| bootstrap | | string | No | Bootstrap command to run within the Docker container. |
| runner | | RunnerConfig | No | Runner configuration. |
| component_concurrency | | integer | No | Maximum number of concurrent Components to run within this Flow. |
FlowRun
Defines the run-specific parameters for a Flow. One Flow can have multiple Flow runs.
| Property | Default | Type | Required | Description |
|---|---|---|---|---|
| flow_run | | | Yes | |
FlowRunBaseOptions
Base options for a Flow Run
| Property | Default | Type | Required | Description |
|---|---|---|---|---|
| parameters | | object with property values of type None | No | Dictionary of parameters to use for resource. |
| defaults | | array[None] | No | List of default configs with filters that can be applied to a resource config. |
| description | | string | No | Brief description of what the model does. |
| metadata | | | No | Meta information of a resource. In most cases it doesn't affect the system behavior but may be helpful to analyze project resources. |
| run_tests | True | boolean | No | Boolean flag indicating whether to run tests after processing data. |
| store_test_results | | boolean | No | Boolean flag indicating whether to store test results. |
| components | | array[string] | No | List of Component names to run. |
| component_categories | | array[string] | No | List of Component categories to run. |
| halt_flow_on_error | | boolean | No | Boolean flag indicating whether to halt the Flow on error. |
| disable_optimizers | | boolean | No | Boolean flag indicating whether to disable optimizers. |
| disable_incremental_metadata_collection | | boolean | No | Boolean flag indicating whether to disable collection of Incremental Read and Transform Component metadata. |
| full_refresh | False | boolean | No | Boolean flag indicating whether to perform a full refresh of each Component. ⚠ If true, will drop all internal data and metadata tables/views and re-compute them from scratch. |
| update_materialization_type | False | boolean | No | Boolean flag indicating whether to update Component materialization types (e.g., changing types between 'simple', 'view', 'incremental', and 'smart'). ⚠ If materialization type changes are detected, existing data and metadata tables/views will be dropped and re-computed from scratch. Otherwise, existing data and metadata tables/views will be preserved and type changes will result in an error. |
| backfill_missing_statistics | True | boolean | No | Boolean flag indicating whether to backfill block statistics for existing data blocks that don't have statistics yet. If true (default), statistics will be computed and stored for data blocks that don't have them yet. |
| runner_overrides | | RunnerConfig | No | Override runner configuration for this specific flow run. If not specified, inherits from the flow's runner configuration, or the deployment/workspace defaults. |
FlowRunOptions
Options for a Flow Run
| Property | Default | Type | Required | Description |
|---|---|---|---|---|
| parameters | | object with property values of type None | No | Dictionary of parameters to use for resource. |
| defaults | | array[None] | No | List of default configs with filters that can be applied to a resource config. |
| description | | string | No | Brief description of what the model does. |
| metadata | | | No | Meta information of a Flow run. In most cases, it doesn't affect the system behavior but may be helpful to analyze project resources. |
| run_tests | True | boolean | No | Boolean flag indicating whether to run tests after processing data. |
| store_test_results | | boolean | No | Boolean flag indicating whether to store test results. |
| components | | array[string] | No | List of Component names to run. |
| component_categories | | array[string] | No | List of Component categories to run. |
| halt_flow_on_error | | boolean | No | Boolean flag indicating whether to halt the Flow on error. |
| disable_optimizers | | boolean | No | Boolean flag indicating whether to disable optimizers. |
| disable_incremental_metadata_collection | | boolean | No | Boolean flag indicating whether to disable collection of Incremental Read and Transform Component metadata. |
| full_refresh | False | boolean | No | Boolean flag indicating whether to perform a full refresh of each Component. ⚠ If true, will drop all internal data and metadata tables/views and re-compute them from scratch. |
| update_materialization_type | False | boolean | No | Boolean flag indicating whether to update Component materialization types (e.g., changing types between 'simple', 'view', 'incremental', and 'smart'). ⚠ If materialization type changes are detected, existing data and metadata tables/views will be dropped and re-computed from scratch. Otherwise, existing data and metadata tables/views will be preserved and type changes will result in an error. |
| backfill_missing_statistics | True | boolean | No | Boolean flag indicating whether to backfill block statistics for existing data blocks that don't have statistics yet. If true (default), statistics will be computed and stored for data blocks that don't have them yet. |
| runner_overrides | | RunnerConfig | No | Override runner configuration for this specific flow run. If not specified, inherits from the flow's runner configuration, or the deployment/workspace defaults. |
| name | | string | No | Flow run name. |
| flow_name | | string | Yes | Name of the Flow to run. |
| event_start_time | | string | No | Event start time to be used for time-series processing. |
| event_end_time | | string | No | Event end time to be used for time-series processing. |
Profile
A Profile is a set of configuration options and parameters that define the target where customer code is compiled/run.
| Property | Default | Type | Required | Description |
|---|---|---|---|---|
| profile | | | Yes | Options and parameters for Profiles. |
ProfileOptions
Configuration options and parameters for Profiles.
| Property | Default | Type | Required | Description |
|---|---|---|---|---|
| pip_packages | | array[string] | No | Python PIP packages to install. |
| parameters | | object with property values of type None | No | Dictionary of parameters to use for resource. |
| defaults | | array[None] | No | List of default configs with filters that can be applied to a resource config. |
| description | | string | No | Brief description of what the model does. |
| metadata | | | No | Meta information of a resource. In most cases it doesn't affect the system behavior but may be helpful to analyze project resources. |
| name | | string | Yes | The name of the model. |
| ignore | | array[string] | No | Additional ignore patterns to apply when using this profile (follows .gitignore syntax). |
Project
A Project is a group of related Connections, Flows/Components, Profiles, Vaults, Automations and other code/configuration artifacts. Project files define the mapping of filesystem paths to different kinds of artifacts that the platform can access when running Flows for the Project.
| Property | Default | Type | Required | Description |
|---|---|---|---|---|
| project | | | Yes | Project options. |
ProjectOptions
Options that can be specified for a Project.
| Property | Default | Type | Required | Description |
|---|---|---|---|---|
| pip_packages | | array[string] | No | Python PIP packages to install. |
| parameters | | object with property values of type None | No | Dictionary of parameters to use for resource. |
| defaults | | array[None] | No | List of default configs with filters that can be applied to a resource config. |
| description | | string | No | Brief description of what the model does. |
| metadata | | | No | Meta information of a resource. In most cases it doesn't affect the system behavior but may be helpful to analyze project resources. |
| name | | string | Yes | The name of the model. |
| version | | string | No | Project version. |
| connections | ['connections/'] | array[string] | No | List of Connection definition folders used in the Project. |
| flows | ['flows/'] | array[string] | No | List of Flow definition folders used in the Project. |
| profiles | ['profiles/'] | array[string] | No | List of Profile definition folders used in the Project. |
| sources | ['src/'] | array[string] | No | List of source definition folders used in the Project. |
| tests | ['tests/'] | array[string] | No | List of test definition folders used in the Project. |
| vaults | ['vaults/'] | array[string] | No | List of Vault definition folders used in the Project. |
| actions | ['actions/'] | array[string] | No | List of Action definition folders used in the Project. |
| automations | ['automations/'] | array[string] | No | List of Automation definition folders used in the Project. |
| sensors | ['sensors/'] | array[string] | No | List of Sensor definition folders used in the Project. |
| ssh_tunnels | ['ssh_tunnels/'] | array[string] | No | List of SSH tunnel definition folders used in the Project. |
| applications | ['applications/'] | array[string] | No | List of Application definition folders used in the Project. |
ConfigFilter
Filter used to target configuration settings to a specific Flow and/or Component.
| Property | Default | Type | Required | Description |
|---|---|---|---|---|
| kind | | string ("Flow", "Component") | Yes | Resource kind to target with this configuration. |
| name | | Any of: string, array[string], array[None] | Yes | Name of the resource to target with this configuration. |
| flow_name | | Any of: string, array[string], array[None] | No | Name of the Flow to target with this configuration. |
| spec | | Any of: | No | Dictionary of parameters to use for the resource. |
ComponentSpec
Specification for configuration applied to a component at runtime based on the config filter.
| Property | Default | Type | Required | Description |
|---|---|---|---|---|
| skip | | boolean | No | Boolean flag indicating whether to skip processing for the Component or not. |
| retry_strategy | | | No | Retry strategy configuration options for the Component if any exceptions are encountered. |
| data_maintenance | | | No | The data maintenance configuration options for the Component. |
| data_plane | | One of: SnowflakeDataPlane, BigQueryDataPlane, DuckdbDataPlane, DatabricksDataPlane | No | Data Plane-specific configuration options for Components. |
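The ancestor list at the top of this page suggests that compaction settings can be reached through a Profile's `defaults`, where each entry is a ConfigFilter whose `spec` is a ComponentSpec. The fragment below sketches that path. The nesting is inferred from the tables on this page rather than taken from a verified configuration, and the Profile, Flow, and Component names are hypothetical.

```yaml
# Sketch: apply data-table compaction settings to one Component via a Profile default.
profile:
  name: dev                            # hypothetical Profile name
  defaults:
    - kind: Component                  # ConfigFilter: target a Component
      name: orders_cleaned             # hypothetical Component name
      flow_name: orders_flow           # hypothetical Flow name
      spec:                            # ComponentSpec applied at runtime
        data_plane:
          duckdb:                      # DuckDbDataPlaneOptions
            ducklake_data_table_compaction:
              small_file_record_count_limit: 100000
              small_file_count_threshold: 100
              small_file_ratio_threshold: 0.5
```

All three compaction properties are set explicitly here so that the result does not depend on which defaults apply when a property is omitted.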
ReadComponent
Component that reads data from a system.
| Property | Default | Type | Required | Description |
|---|---|---|---|---|
| data_plane | | One of: SnowflakeDataPlane, BigQueryDataPlane, DuckdbDataPlane, DatabricksDataPlane | No | Data Plane-specific configuration options for Components. |
| skip | | boolean | No | Boolean flag indicating whether to skip processing for the Component or not. |
| retry_strategy | | | No | Retry strategy configuration options for the Component if any exceptions are encountered. |
| data_maintenance | | | No | The data maintenance configuration options for the Component. |
| description | | string | No | Brief description of what the model does. |
| metadata | | | No | Meta information of a resource. In most cases it doesn't affect the system behavior but may be helpful to analyze project resources. |
| name | | string | Yes | The name of the model. |
| flow_name | | string | No | Name of the Flow that the Component belongs to. |
| tests | | | No | Defines tests to run on this Component's data. |
| read | | One of: GenericFileReadComponent, LocalFileReadComponent, SFTPReadComponent, S3ReadComponent, GcsReadComponent, AbfsReadComponent, HttpReadComponent, MSSQLReadComponent, MySQLReadComponent, OracleReadComponent, PostgresReadComponent, SnowflakeReadComponent, BigQueryReadComponent, DatabricksReadComponent | Yes | Read component that reads data from a system. |
TaskComponent
| Property | Default | Type | Required | Description |
|---|---|---|---|---|
| data_plane | | One of: SnowflakeDataPlane, BigQueryDataPlane, DuckdbDataPlane, DatabricksDataPlane | No | Data Plane-specific configuration options for Components. |
| skip | | boolean | No | Boolean flag indicating whether to skip processing for the Component or not. |
| retry_strategy | | | No | Retry strategy configuration options for the Component if any exceptions are encountered. |
| data_maintenance | | | No | The data maintenance configuration options for the Component. |
| description | | string | No | Brief description of what the model does. |
| metadata | | | No | Meta information of a resource. In most cases it doesn't affect the system behavior but may be helpful to analyze project resources. |
| name | | string | Yes | The name of the model. |
| flow_name | | string | No | Name of the Flow that the Component belongs to. |
| tests | | | No | Defines tests to run on this Component's data. |
| task | | One of: TaskSqlComponent, TaskPythonComponent | Yes | |
TransformComponent
Component that executes SQL or Python code to transform data.
| Property | Default | Type | Required | Description |
|---|---|---|---|---|
| data_plane | | One of: SnowflakeDataPlane, BigQueryDataPlane, DuckdbDataPlane, DatabricksDataPlane | No | Data Plane-specific configuration options for Components. |
| skip | | boolean | No | Boolean flag indicating whether to skip processing for the Component or not. |
| retry_strategy | | | No | Retry strategy configuration options for the Component if any exceptions are encountered. |
| data_maintenance | | | No | The data maintenance configuration options for the Component. |
| description | | string | No | Brief description of what the model does. |
| metadata | | | No | Meta information of a resource. In most cases it doesn't affect the system behavior but may be helpful to analyze project resources. |
| name | | string | Yes | The name of the model. |
| flow_name | | string | No | Name of the Flow that the Component belongs to. |
| tests | | | No | Defines tests to run on this Component's data. |
| transform | | One of: SqlTransform, PythonTransform, SnowparkTransform, PySparkTransform | Yes | Transform that executes SQL or Python code for data transformation. |
DuckdbDataPlane
| Property | Default | Type | Required | Description |
|---|---|---|---|---|
| duckdb | `{ducklake_data_table_compaction: {small_file_count_threshold: 50, small_file_ratio_threshold: 0.25, small_file_record_count_limit: 100000}, ducklake_metadata_table_compaction: {small_file_count_threshold: 10, small_file_ratio_threshold: null, small_file_record_count_limit: 10}}` | | No | DuckDB configuration options. |
DuckDbDataPlaneOptions
| Property | Default | Type | Required | Description |
|---|---|---|---|---|
| ducklake_metadata_table_compaction | `{small_file_count_threshold: 10, small_file_ratio_threshold: null, small_file_record_count_limit: 10}` | DuckLakeTableCompactionSettings | No | Settings for compacting metadata tables. If present, metadata table compaction is enabled. |
| ducklake_data_table_compaction | `{small_file_count_threshold: 50, small_file_ratio_threshold: 0.25, small_file_record_count_limit: 100000}` | DuckLakeTableCompactionSettings | No | Settings for compacting data tables. If present, data table compaction is enabled. |
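Written out in block form, the two defaults above would look roughly like this when set directly on a Component's DuckDB data plane. The Component name is hypothetical, and the Component's own required key (for example `transform` or `read`) is omitted, so this is a fragment rather than a complete definition.

```yaml
# Fragment: the documented defaults, spelled out explicitly on a Component.
component:
  name: orders_cleaned                 # hypothetical Component name
  data_plane:
    duckdb:
      ducklake_data_table_compaction:
        small_file_record_count_limit: 100000
        small_file_count_threshold: 50
        small_file_ratio_threshold: 0.25
      ducklake_metadata_table_compaction:
        small_file_record_count_limit: 10
        small_file_count_threshold: 10
        small_file_ratio_threshold: null   # no ratio check; only the count check applies
  # transform / read / task configuration omitted from this sketch
```

Metadata tables use a much lower record limit than data tables, which fits their typically small files. The null ratio threshold means compaction of metadata tables depends on the small-file count alone.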