Small File Compaction Settings
Settings for small file compaction thresholds.
SmallFileCompactionSettings
SmallFileCompactionSettings is defined beneath the following ancestor nodes in the YAML structure:
- BackfillRun
- BackfillRunOptions
- Component
- CustomPythonReadComponent
- ExternalTableComponent
- Flow
- FlowOptions
- FlowRun
- FlowRunBaseOptions
- FlowRunOptions
- Profile
- ProfileOptions
- Project
- ProjectOptions
- ConfigFilter
- ComponentSpec
- ReadComponent
- TaskComponent
- TransformComponent
- DuckdbDataPlane
- DuckDbDataPlaneOptions
- DuckLakeDataPlaneOptions
Below are the properties for the SmallFileCompactionSettings. Each property links to the specific details section further down in this page.
| Property | Default | Type | Required | Description |
|---|---|---|---|---|
| file_size_limit | 100 | integer | No | Files smaller than this size (in MB) are considered 'small' and eligible for compaction. |
| count_threshold | 10 | integer | No | Run compaction if the number of small files exceeds this threshold. |
| ratio_threshold | | number | No | Fraction (0.0-1.0) of small files relative to total files. If set, both the absolute count check and the ratio check must pass for compaction to be triggered. If unset (None), only the absolute count check is performed. |
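
The way the three properties combine can be sketched in a few lines of Python. This is only an illustration of the rule described in the table, not the product's implementation: the function name `should_compact` and the `file_sizes_mb` input are hypothetical, and treating a ratio equal to `ratio_threshold` as passing is an assumption, since the table does not say whether that comparison is strict.

```python
from typing import Optional, Sequence


def should_compact(
    file_sizes_mb: Sequence[float],
    file_size_limit: int = 100,
    count_threshold: int = 10,
    ratio_threshold: Optional[float] = None,
) -> bool:
    """Hypothetical helper mirroring the documented thresholds."""
    if not file_sizes_mb:
        return False  # nothing to compact

    # A file is "small" when it is below file_size_limit (in MB).
    small = sum(1 for size in file_sizes_mb if size < file_size_limit)

    # The absolute count check always applies: the count must exceed the threshold.
    if small <= count_threshold:
        return False

    # Without a ratio threshold, the count check alone decides.
    if ratio_threshold is None:
        return True

    # Assumption: a ratio equal to the threshold counts as passing.
    return small / len(file_sizes_mb) >= ratio_threshold
```

With the defaults, 11 files of 5 MB alongside 9 files of 500 MB trigger compaction (11 small files exceeds 10). Adding `ratio_threshold=0.6` to the same input would suppress it, because only 55% of the files are small.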
Property Details
BackfillRun
Defines the parameters for a backfill run.
| Property | Default | Type | Required | Description |
|---|---|---|---|---|
| backfill_run | | | Yes | Backfill run options. |
BackfillRunOptions
Options for a backfill run.
| Property | Default | Type | Required | Description |
|---|---|---|---|---|
| description | | string | No | Brief description of what the model does. |
| metadata | | | No | Meta information of a resource. In most cases it doesn't affect the system behavior but may be helpful to analyze project resources. |
| name | | string | Yes | The name of the model. |
| flow_name | | string | Yes | Name of the Flow that is to be backfilled. |
| start_time | | string | Yes | Start time of the time range to be backfilled. |
| end_time | | string | Yes | End time of the time range to be backfilled. |
| granularity | | string ("day", "week", "month") | Yes | Time granularity to use for backfill. Must be one of: 'day', 'week', 'month'. The backfill runner divides the date range into Flow runs of this granularity and launches these Flow runs. |
| max_concurrent_flow_runs | 1 | integer | No | Maximum number of concurrent Flow runs used for backfill. This is used to limit the number of Flow runners (and hence cluster resources) that are launched simultaneously. |
| backfill_order | | string ("forward_chronological", "reverse_chronological") | No | Order to use for backfilling - either forward or reverse chronological order. |
| flow_run_options | | | No | Additional options for each Flow run launched during the backfill. |
| run_final_sync | | boolean | No | Boolean flag indicating whether to run a final sync after concurrent backfill Flow runs. This final sync is a single Flow run that is executed without any time parameters, and is meant to sync the data to the latest state and capture any missing time intervals. |
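
To make the interaction of start_time, end_time, granularity, and backfill_order concrete, the following Python sketch splits a time range into per-run windows. It is a rough illustration under stated assumptions, not the backfill runner's actual logic: `backfill_windows` and `shift_months` are hypothetical names, windows are assumed to be half-open and stepped from start_time (no alignment to calendar week or month boundaries), and the last window is assumed to be clipped at end_time.

```python
import calendar
from datetime import datetime, timedelta


def shift_months(dt: datetime, months: int) -> datetime:
    """Move dt forward by whole months, clamping the day to the month's length."""
    index = dt.year * 12 + (dt.month - 1) + months
    year, month0 = divmod(index, 12)
    day = min(dt.day, calendar.monthrange(year, month0 + 1)[1])
    return dt.replace(year=year, month=month0 + 1, day=day)


def backfill_windows(start, end, granularity, order="forward_chronological"):
    """Hypothetical helper: return (window_start, window_end) pairs, one per Flow run."""
    if granularity not in ("day", "week", "month"):
        raise ValueError("granularity must be 'day', 'week', or 'month'")

    windows = []
    step = 0
    cursor = start
    while cursor < end:
        step += 1
        # Each boundary is computed from the original start to avoid drift.
        if granularity == "day":
            boundary = start + timedelta(days=step)
        elif granularity == "week":
            boundary = start + timedelta(weeks=step)
        else:
            boundary = shift_months(start, step)
        windows.append((cursor, min(boundary, end)))  # clip the last window
        cursor = boundary

    if order == "reverse_chronological":
        windows.reverse()
    return windows
```

For a range from 2024-01-01 to 2024-01-04 with `granularity="day"`, this yields three one-day windows. With max_concurrent_flow_runs set to 1, a runner following this scheme would process them one at a time in the chosen order.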
Component
A Component is a fundamental building block of a data Flow. Supported Component types include: Read, Transform, Task, Test, and more.
| Property | Default | Type | Required | Description |
|---|---|---|---|---|
| component | One of: CustomPythonReadComponent ApplicationComponent AliasedTableComponent |