Small File Compaction Settings

Settings for small file compaction thresholds.

SmallFileCompactionSettings

SmallFileCompactionSettings is defined beneath the following ancestor nodes in the YAML structure:

BackfillRun
BackfillRunOptions
Component
CustomPythonReadComponent
ExternalTableComponent
Flow
FlowOptions
FlowRun
FlowRunBaseOptions
FlowRunOptions
Profile
ProfileOptions
Project
ProjectOptions
ConfigFilter
ComponentSpec
ReadComponent
TaskComponent
TransformComponent
DuckdbDataPlane
DuckDbDataPlaneOptions
DuckLakeDataPlaneOptions

The properties of SmallFileCompactionSettings are listed below. The Property Details section further down this page describes each ancestor node that can contain these settings.

Property	Default	Type	Required	Description
file_size_limit	100	integer	No	Files smaller than this size (in MB) are considered 'small' and eligible for compaction.
count_threshold	10	integer	No	Run compaction if the number of small files exceeds this threshold.
ratio_threshold		number	No	Fraction (0.0-1.0) of small files relative to total files. If set, both the absolute count check and the ratio check must pass for compaction to be triggered. If unset (null), only the absolute count check is performed.
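
The way the three thresholds combine can be sketched in Python. This is an illustration of the rules in the table above, not the platform's implementation: the helper name should_compact, the 1 MB = 1024 * 1024 bytes conversion, and the "greater than or equal" comparison for the ratio check are all assumptions.

```python
from typing import Optional, Sequence

# Assumption: the documentation only says "MB"; binary megabytes are used here.
BYTES_PER_MB = 1024 * 1024


def should_compact(
    file_sizes_bytes: Sequence[int],
    file_size_limit: int = 100,
    count_threshold: int = 10,
    ratio_threshold: Optional[float] = None,
) -> bool:
    """Hypothetical helper: decide whether small-file compaction would run."""
    if not file_sizes_bytes:
        return False

    limit_bytes = file_size_limit * BYTES_PER_MB
    # Files strictly smaller than the limit count as "small".
    small_count = sum(1 for size in file_sizes_bytes if size < limit_bytes)

    # Absolute check: the number of small files must exceed the threshold.
    if small_count <= count_threshold:
        return False

    # With no ratio threshold, only the absolute check applies.
    if ratio_threshold is None:
        return True

    # Assumption: the ratio check passes when the share of small files
    # meets or exceeds the threshold.
    return small_count / len(file_sizes_bytes) >= ratio_threshold
```

For a table with 12 small files out of 100, should_compact returns True with the property defaults (12 exceeds 10), but False once ratio_threshold=0.25 is set, because only 12% of the files are small.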

Property Details

BackfillRun

Defines the parameters for a backfill run.

Property	Default	Type	Required	Description
backfill_run			Yes	Backfill run options.

BackfillRunOptions

Options for a backfill run.

Property	Default	Type	Required	Description
description		string	No	Brief description of what the model does.
metadata			No	Meta information of a resource. In most cases it doesn't affect the system behavior but may be helpful to analyze project resources.
name		string	Yes	The name of the model
flow_name		string	Yes	Name of the Flow that is to be backfilled.
start_time		string	Yes	Start time of the time range to be backfilled.
end_time		string	Yes	End time of the time range to be backfilled.
granularity		string ("day", "week", "month")	Yes	Time granularity to use for backfill. Must be one of: 'day', 'week', 'month'. The backfill runner divides the date range into Flow runs of this granularity and launches these Flow runs.
max_concurrent_flow_runs	1	integer	No	Maximum number of concurrent Flow runs used for backfill. This is used to limit the number of Flow runners (and hence cluster resources) that are launched simultaneously.
backfill_order		string ("forward_chronological", "reverse_chronological")	No	Order to use for backfilling - either forward or reverse chronological order.
flow_run_options			No	Additional options for each Flow run launched during the backfill.
run_final_sync		boolean	No	Boolean flag indicating whether to run a final sync after concurrent backfill Flow runs. This final sync is a single Flow run that is executed without any time parameters, and is meant to sync the data to the latest state and capture any missing time intervals.
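
The granularity and backfill_order options can be pictured with a short Python sketch of how a time range might be divided into per-run intervals. The function names, the half-open intervals anchored at start_time, the clipping of the last interval to end_time, and the forward default order are assumptions made for illustration; the actual backfill runner may align intervals differently.

```python
import calendar
from datetime import datetime, timedelta
from typing import List, Tuple


def add_months(start: datetime, months: int) -> datetime:
    """Shift `start` by whole calendar months, clamping the day to the month length."""
    month_index = start.month - 1 + months
    year = start.year + month_index // 12
    month = month_index % 12 + 1
    day = min(start.day, calendar.monthrange(year, month)[1])
    return start.replace(year=year, month=month, day=day)


def split_range(
    start: datetime,
    end: datetime,
    granularity: str,
    backfill_order: str = "forward_chronological",  # assumed default
) -> List[Tuple[datetime, datetime]]:
    """Hypothetical helper: split [start, end) into intervals of one granularity unit."""
    if granularity not in ("day", "week", "month"):
        raise ValueError("granularity must be one of: 'day', 'week', 'month'")

    intervals: List[Tuple[datetime, datetime]] = []
    cursor = start
    step = 0
    while cursor < end:
        step += 1
        # Offsets are computed from the original start to avoid drift.
        if granularity == "day":
            upper = start + timedelta(days=step)
        elif granularity == "week":
            upper = start + timedelta(weeks=step)
        else:
            upper = add_months(start, step)
        # Clip the last interval so it never runs past the requested end.
        intervals.append((cursor, min(upper, end)))
        cursor = upper

    if backfill_order == "reverse_chronological":
        intervals.reverse()
    return intervals
```

Each returned pair would correspond to one Flow run; max_concurrent_flow_runs would then cap how many of those runs execute at the same time.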

Component

A Component is a fundamental building block of a data Flow. Supported Component types include: Read, Transform, Task, Test, and more.

Property	Default	Type	Required	Description
component		One of: CustomPythonReadComponent, ApplicationComponent, AliasedTableComponent, ExternalTableComponent, DbtNodeComponent	Yes	Component configuration options.

CustomPythonReadComponent

Component that reads data using user-defined custom Python code.

Property	Type	Required	Description
data_plane	One of: SnowflakeDataPlane, BigQueryDataPlane, DuckdbDataPlane, DatabricksDataPlane	No	Data Plane-specific configuration options for Components.
skip	boolean	No	Boolean flag indicating whether to skip processing for the Component or not.
retry_strategy		No	Retry strategy configuration options for the Component if any exceptions are encountered.
data_maintenance		No	The data maintenance configuration options for the Component.
description	string	No	Brief description of what the model does.
metadata		No	Meta information of a resource. In most cases it doesn't affect the system behavior but may be helpful to analyze project resources.
name	string	Yes	The name of the model
flow_name	string	No	Name of the Flow that the Component belongs to.
tests		No	Defines tests to run on this Component's data.
custom_python_read		Yes

ExternalTableComponent

Component that constructs and updates an External Table. Currently supported for Snowflake only.

Property	Type	Required	Description
data_plane	One of: SnowflakeDataPlane, BigQueryDataPlane, DuckdbDataPlane, DatabricksDataPlane	No	Data Plane-specific configuration options for Components.
skip	boolean	No	Boolean flag indicating whether to skip processing for the Component or not.
retry_strategy		No	Retry strategy configuration options for the Component if any exceptions are encountered.
data_maintenance		No	The data maintenance configuration options for the Component.
description	string	No	Brief description of what the model does.
metadata		No	Meta information of a resource. In most cases it doesn't affect the system behavior but may be helpful to analyze project resources.
name	string	Yes	The name of the model
flow_name	string	No	Name of the Flow that the Component belongs to.
tests		No	Defines tests to run on this Component's data.
external_table	Any of:	Yes	Configuration options for the External Table Component.

Flow

A Flow is the primary unit of execution in Ascend and contains a collection of Components assembled into a directed acyclic graph (DAG).

Property	Default	Type	Required	Description
flow			Yes

FlowOptions

Defines the options for a Flow.

Property	Type	Required	Description
parameters	object with property values of type None	No	Dictionary of parameters to use for resource.
defaults	array[None]	No	List of default configs with filters that can be applied to a resource config.
description	string	No	Brief description of what the model does.
metadata		No	Meta information of a resource. In most cases it doesn't affect the system behavior but may be helpful to analyze project resources.
name	string	Yes	The name of the model
data_plane		No	Data plane to use for the flow.
version	string	No	Flow version.
bootstrap	string	No	Bootstrap command to run within the Docker container.
runner	RunnerConfig	No	Runner configuration.
component_concurrency	integer	No	Maximum number of concurrent Components to run within this Flow.

FlowRun

Defines the run-specific parameters for a Flow. One Flow can have multiple Flow runs.

Property	Default	Type	Required	Description
flow_run			Yes

FlowRunBaseOptions

Base options for a Flow run.

Property	Default	Type	Required	Description
parameters		object with property values of type None	No	Dictionary of parameters to use for resource.
defaults		array[None]	No	List of default configs with filters that can be applied to a resource config.
description		string	No	Brief description of what the model does.
metadata			No	Meta information of a resource. In most cases it doesn't affect the system behavior but may be helpful to analyze project resources.
run_tests	True	boolean	No	Boolean flag indicating whether to run tests after processing data.
store_test_results		boolean	No	Boolean flag indicating whether to store test results.
components		array[string]	No	List of Component names to run.
component_categories		array[string]	No	List of Component categories to run.
halt_flow_on_error		boolean	No	Boolean flag indicating whether to halt the Flow on error.
disable_optimizers		boolean	No	Boolean flag indicating whether to disable optimizers.
disable_incremental_metadata_collection		boolean	No	Boolean flag indicating whether to disable collection of Incremental Read and Transform Component metadata.
full_refresh	False	boolean	No	Boolean flag indicating whether to perform a full refresh of each Component. ⚠ If true, will drop all internal data and metadata tables/views and re-compute them from scratch.
update_materialization_type	False	boolean	No	Boolean flag indicating whether to update Component materialization types (e.g., changing types between 'simple', 'view', 'incremental', and 'smart'). ⚠ If materialization type changes are detected, existing data and metadata tables/views will be dropped and re-computed from scratch. Otherwise, existing data and metadata tables/views will be preserved and type changes will result in an error.
backfill_missing_statistics	True	boolean	No	Boolean flag indicating whether to backfill block statistics for existing data blocks that don't have statistics yet. If true (default), statistics will be computed and stored for data blocks that don't have them yet.
runner_overrides		RunnerConfig	No	Override runner configuration for this specific flow run. If not specified, inherits from the flow's runner configuration, or the deployment/workspace defaults.

FlowRunOptions

Options for a Flow run.

Property	Default	Type	Required	Description
parameters		object with property values of type None	No	Dictionary of parameters to use for resource.
defaults		array[None]	No	List of default configs with filters that can be applied to a resource config.
description		string	No	Brief description of what the model does.
metadata			No	Meta information of a Flow run. In most cases, it doesn't affect the system behavior but may be helpful to analyze project resources.
run_tests	True	boolean	No	Boolean flag indicating whether to run tests after processing data.
store_test_results		boolean	No	Boolean flag indicating whether to store test results.
components		array[string]	No	List of Component names to run.
component_categories		array[string]	No	List of Component categories to run.
halt_flow_on_error		boolean	No	Boolean flag indicating whether to halt the Flow on error.
disable_optimizers		boolean	No	Boolean flag indicating whether to disable optimizers.
disable_incremental_metadata_collection		boolean	No	Boolean flag indicating whether to disable collection of Incremental Read and Transform Component metadata.
full_refresh	False	boolean	No	Boolean flag indicating whether to perform a full refresh of each Component. ⚠ If true, will drop all internal data and metadata tables/views and re-compute them from scratch.
update_materialization_type	False	boolean	No	Boolean flag indicating whether to update Component materialization types (e.g., changing types between 'simple', 'view', 'incremental', and 'smart'). ⚠ If materialization type changes are detected, existing data and metadata tables/views will be dropped and re-computed from scratch. Otherwise, existing data and metadata tables/views will be preserved and type changes will result in an error.
backfill_missing_statistics	True	boolean	No	Boolean flag indicating whether to backfill block statistics for existing data blocks that don't have statistics yet. If true (default), statistics will be computed and stored for data blocks that don't have them yet.
runner_overrides		RunnerConfig	No	Override runner configuration for this specific flow run. If not specified, inherits from the flow's runner configuration, or the deployment/workspace defaults.
name		string	No	Flow run name.
flow_name		string	Yes	Name of the Flow to run.
event_start_time		string	No	Event start time to be used for time-series processing.
event_end_time		string	No	Event end time to be used for time-series processing.

Profile

A Profile is a set of configuration options and parameters that define the target where customer code is compiled/run.

Property	Default	Type	Required	Description
profile			Yes	Options and parameters for Profiles.

ProfileOptions

Configuration options and parameters for Profiles.

Property	Type	Required	Description
pip_packages	array[string]	No	Python PIP packages to install
parameters	object with property values of type None	No	Dictionary of parameters to use for resource.
defaults	array[None]	No	List of default configs with filters that can be applied to a resource config.
description	string	No	Brief description of what the model does.
metadata		No	Meta information of a resource. In most cases it doesn't affect the system behavior but may be helpful to analyze project resources.
name	string	Yes	The name of the model
ignore	array[string]	No	Additional ignore patterns to apply when using this profile (follows .gitignore syntax)

Project

A Project is a group of related Connections, Flows/Components, Profiles, Vaults, Automations and other code/configuration artifacts. Project files define the mapping of filesystem paths to different kinds of artifacts that the platform can access when running Flows for the Project.

Property	Default	Type	Required	Description
project			Yes	Project options.

ProjectOptions

Options that can be specified for a Project.

Property	Default	Type	Required	Description
pip_packages		array[string]	No	Python PIP packages to install
parameters		object with property values of type None	No	Dictionary of parameters to use for resource.
defaults		array[None]	No	List of default configs with filters that can be applied to a resource config.
description		string	No	Brief description of what the model does.
metadata			No	Meta information of a resource. In most cases it doesn't affect the system behavior but may be helpful to analyze project resources.
name		string	Yes	The name of the model
version		string	No	Project version.
connections	['connections/']	array[string]	No	List of Connection definition folders used in the Project.
flows	['flows/']	array[string]	No	List of Flow definition folders used in the Project.
profiles	['profiles/']	array[string]	No	List of Profile definition folders used in the Project.
sources	['src/']	array[string]	No	List of source definition folders used in the Project.
tests	['tests/']	array[string]	No	List of test definition folders used in the Project.
vaults	['vaults/']	array[string]	No	List of Vault definition folders used in the Project.
actions	['actions/']	array[string]	No	List of Action definition folders used in the Project.
automations	['automations/']	array[string]	No	List of Automation definition folders used in the Project.
sensors	['sensors/']	array[string]	No	List of Sensor definition folders used in the Project.
ssh_tunnels	['ssh_tunnels/']	array[string]	No	List of SSH tunnel definition folders used in the Project.
applications	['applications/']	array[string]	No	List of Application definition folders used in the Project.

ConfigFilter

Filter used to target configuration settings to a specific Flow and/or Component.

Property	Type	Required	Description
kind	string ("Flow", "Component")	Yes	Resource kind to target with this configuration.
name	Any of: string, array[string], array[None]	Yes	Name of the resource to target with this configuration.
flow_name	Any of: string, array[string], array[None]	No	Name of the Flow to target with this configuration.
spec	Any of:	No	Dictionary of parameters to use for the resource.

ComponentSpec

Specification for configuration applied to a component at runtime based on the config filter.

Property	Type	Required	Description
skip	boolean	No	Boolean flag indicating whether to skip processing for the Component or not.
retry_strategy		No	Retry strategy configuration options for the Component if any exceptions are encountered.
data_maintenance		No	The data maintenance configuration options for the Component.
data_plane	One of: SnowflakeDataPlane, BigQueryDataPlane, DuckdbDataPlane, DatabricksDataPlane	No	Data Plane-specific configuration options for Components.

ReadComponent

Component that reads data from a system.

Property	Type	Required	Description
data_plane	One of: SnowflakeDataPlane, BigQueryDataPlane, DuckdbDataPlane, DatabricksDataPlane	No	Data Plane-specific configuration options for Components.
skip	boolean	No	Boolean flag indicating whether to skip processing for the Component or not.
retry_strategy		No	Retry strategy configuration options for the Component if any exceptions are encountered.
data_maintenance		No	The data maintenance configuration options for the Component.
description	string	No	Brief description of what the model does.
metadata		No	Meta information of a resource. In most cases it doesn't affect the system behavior but may be helpful to analyze project resources.
name	string	Yes	The name of the model
flow_name	string	No	Name of the Flow that the Component belongs to.
tests		No	Defines tests to run on this Component's data.
read	One of: GenericFileReadComponent, LocalFileReadComponent, SFTPReadComponent, S3ReadComponent, GcsReadComponent, AbfsReadComponent, HttpReadComponent, MSSQLReadComponent, MySQLReadComponent, OracleReadComponent, PostgresReadComponent, SnowflakeReadComponent, BigQueryReadComponent, DatabricksReadComponent	Yes	Read component that reads data from a system.

TaskComponent

Property	Type	Required	Description
data_plane	One of: SnowflakeDataPlane, BigQueryDataPlane, DuckdbDataPlane, DatabricksDataPlane	No	Data Plane-specific configuration options for Components.
skip	boolean	No	Boolean flag indicating whether to skip processing for the Component or not.
retry_strategy		No	Retry strategy configuration options for the Component if any exceptions are encountered.
data_maintenance		No	The data maintenance configuration options for the Component.
description	string	No	Brief description of what the model does.
metadata		No	Meta information of a resource. In most cases it doesn't affect the system behavior but may be helpful to analyze project resources.
name	string	Yes	The name of the model
flow_name	string	No	Name of the Flow that the Component belongs to.
tests		No	Defines tests to run on this Component's data.
task	One of: TaskSqlComponent, TaskPythonComponent	Yes

TransformComponent

Component that executes SQL or Python code to transform data.

Property	Type	Required	Description
data_plane	One of: SnowflakeDataPlane, BigQueryDataPlane, DuckdbDataPlane, DatabricksDataPlane	No	Data Plane-specific configuration options for Components.
skip	boolean	No	Boolean flag indicating whether to skip processing for the Component or not.
retry_strategy		No	Retry strategy configuration options for the Component if any exceptions are encountered.
data_maintenance		No	The data maintenance configuration options for the Component.
description	string	No	Brief description of what the model does.
metadata		No	Meta information of a resource. In most cases it doesn't affect the system behavior but may be helpful to analyze project resources.
name	string	Yes	The name of the model
flow_name	string	No	Name of the Flow that the Component belongs to.
tests		No	Defines tests to run on this Component's data.
transform	One of: SqlTransform, PythonTransform, SnowparkTransform, PySparkTransform	Yes	Transform that executes SQL or Python code for data transformation.

DuckdbDataPlane

Property	Default	Type	Required	Description
duckdb	ducklake: null		No	DuckDB configuration options.

DuckDbDataPlaneOptions

Property	Default	Type	Required	Description
ducklake			No	DuckLake-specific data plane configuration options including table compaction settings.

DuckLakeDataPlaneOptions

DuckLake-specific data plane configuration options.

Property	Default	Type	Required	Description
manual_table_compaction	True	boolean	No	Enable manual table compaction for DuckLake tables.
metadata_small_file_compaction	{count_threshold: 10, file_size_limit: 100, ratio_threshold: null}	SmallFileCompactionSettings	No	Settings for compacting metadata tables.
data_small_file_compaction	{count_threshold: 50, file_size_limit: 100, ratio_threshold: 0.25}	SmallFileCompactionSettings	No	Settings for compacting data tables.
partition_by		array[string]	No	Partition keys to be added to the table. Can be column names or expressions (e.g., ['part_key']).
rewrite_data_files	True	boolean	No	Call DuckLake's ducklake_rewrite_data_files maintenance operation to optimize table storage.
rewrite_data_files_delete_threshold		number	No	Delete threshold for the ducklake_rewrite_data_files operation (0.0-1.0). If unset (null), DuckLake's default value (0.95) is used.
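
To show where SmallFileCompactionSettings sits in the hierarchy, the sketch below writes the nesting implied by the tables above as a Python dictionary, which is the parsed form of the equivalent YAML mapping. The component name and the get_path helper are hypothetical, and a real Component definition would also need its type-specific key (for example read or transform), which is omitted here.

```python
from typing import Any, Dict, Sequence

# Hypothetical Component config; key names follow the property tables above.
component_config: Dict[str, Any] = {
    "component": {
        "name": "example_component",  # placeholder name
        "data_plane": {
            "duckdb": {
                "ducklake": {
                    # Values shown are the documented defaults for data tables.
                    "data_small_file_compaction": {
                        "file_size_limit": 100,
                        "count_threshold": 50,
                        "ratio_threshold": 0.25,
                    },
                    # Values shown are the documented defaults for metadata tables.
                    "metadata_small_file_compaction": {
                        "file_size_limit": 100,
                        "count_threshold": 10,
                        "ratio_threshold": None,
                    },
                }
            }
        },
    }
}


def get_path(config: Dict[str, Any], path: Sequence[str]) -> Any:
    """Hypothetical helper: walk a nested mapping along a list of keys."""
    node: Any = config
    for key in path:
        node = node[key]
    return node


data_settings = get_path(
    component_config,
    ["component", "data_plane", "duckdb", "ducklake", "data_small_file_compaction"],
)
```

The same ducklake block appears wherever a DuckdbDataPlane is accepted, which is why the settings are reachable from so many ancestor nodes.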