Custom Python Read Component

A component that reads data using user-defined custom Python code.

CustomPythonReadComponent

Below are the properties for the CustomPythonReadComponent. Each property links to its details section further down this page.
Property | Default | Type | Required | Description |
---|---|---|---|---|
data_plane | | One of: SnowflakeDataPlane BigQueryDataPlane DuckdbDataPlane SynapseDataPlane FabricDataPlane DatabricksDataPlane | No | Data Plane-specific configuration options for a component. |
description | | string | No | A brief description of what the model does. |
metadata | | ResourceMetadata | No | Meta information of a resource. In most cases it doesn't affect the system behavior but may be helpful to analyze project resources. |
name | | string | Yes | The name of the model. |
flow_name | | string | No | The name of the flow that the component belongs to. |
skip | | boolean | No | A boolean flag indicating whether to skip processing for the component. |
data_maintenance | | DataMaintenance | No | The data maintenance configuration options for the component. |
tests | | ComponentTestOptions | No | Defines tests to run on the data of this component. |
custom_python_read | | CustomPythonReadOptions | Yes | Configuration options for the Custom Python Read component. |
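As a sketch, a minimal component definition could look like the following. Note that this is illustrative only: the exact ancestor nesting in the project YAML is not shown on this page, and the names and file paths are hypothetical.

```yaml
# Hypothetical sketch -- names, paths, and surrounding nesting are assumptions.
name: my_custom_read
flow_name: my_flow
description: Reads data from an internal API.
custom_python_read:
  event_time: updated_at
  strategy: full          # the default; see CustomPythonReadOptions below
  python:
    entrypoint: reader.read_data
    source: python/reader.py
```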
Property Details
Component
A component is a fundamental building block of a data flow. Types of components that are supported include: read, transform, task, test, and more.
Property | Default | Type | Required | Description |
---|---|---|---|---|
component | | One of: ReadComponent TransformComponent TaskComponent SingularTestComponent CustomPythonReadComponent WriteComponent CompoundComponent AliasedTableComponent ExternalTableComponent | Yes | Configuration options for the component. |
CustomPythonReadOptions
Configuration options for the Custom Python Read component.
Property | Default | Type | Required | Description |
---|---|---|---|---|
event_time | | string | No | Timestamp column in the component output used to represent event time. |
strategy | full | Any of: string IncrementalStrategy PartitionedStrategy | No | Ingest strategy. |
python | | Any of: PythonBase PartitionedListRead | Yes | Python code to execute for ingesting data. |
PartitionedListRead
Property | Default | Type | Required | Description |
---|---|---|---|---|
list | | PythonBase | Yes | Python function that lists partitions in the source. |
read | | PythonBase | Yes | Python function that reads a partition from the source. |
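The two functions can be sketched in Python as follows. This is illustrative only: the signatures and return types the runtime actually expects are not specified on this page, and the function names are hypothetical.

```python
# Hypothetical sketch: assumes the list function returns an iterable of
# partition identifiers and the read function returns the rows for one
# partition. The real contract may differ.

def list_partitions():
    # In practice: query the source system, e.g. list date-stamped
    # folders in object storage.
    return ["2024-01-01", "2024-01-02"]

def read_partition(partition):
    # In practice: fetch the given partition's data from the source.
    return [{"partition": partition, "value": 1}]

# Conceptually, the engine calls list first, then read once per partition:
rows = [row for p in list_partitions() for row in read_partition(p)]
```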
PythonBase
Base class for Python-based components and resources.
Property | Default | Type | Required | Description |
---|---|---|---|---|
entrypoint | | string | Yes | The entrypoint for the python transform function. |
source | | string | Yes | The source file for the python transform function. |
BigQueryDataPlane
Property | Default | Type | Required | Description |
---|---|---|---|---|
bigquery | | BigQueryDataPlaneOptions | Yes | BigQuery configuration options. |
BigQueryDataPlaneOptions
Property | Default | Type | Required | Description |
---|---|---|---|---|
partition_by | | Any of: BigQueryRangePartitioning BigQueryTimePartitioning | No | Partition By clause for the table. |
cluster_by | | array[string] | No | Clustering keys to be added to the table. |
BigQueryRangePartitioning
Property | Default | Type | Required | Description |
---|---|---|---|---|
field | | string | Yes | Field to partition by. |
range | | RangeOptions | Yes | Range partitioning options. |
BigQueryTimePartitioning
Property | Default | Type | Required | Description |
---|---|---|---|---|
field | | string | Yes | Field to partition by. |
granularity | | string ("DAY", "HOUR", "MONTH", "YEAR") | Yes | Granularity of the time partitioning. |
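For example, a data_plane block selecting daily time partitioning might be sketched as follows; the column names are hypothetical, and the nesting of the time-partitioning fields directly under partition_by is inferred from the schema above.

```yaml
# Hypothetical column names; nesting inferred from the schema.
data_plane:
  bigquery:
    partition_by:
      field: created_at
      granularity: DAY    # one of DAY, HOUR, MONTH, YEAR
    cluster_by:
      - customer_id
```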
ComponentTestOptions
Options for component tests, including data quality tests and schema checks.
ColumnTestPython
Test to validate data using a Python function for a single column.
Property | Default | Type | Required | Description |
---|---|---|---|---|
severity | error | string ("error", "warn") | No | The severity level for issues raised by the test. Default is 'error'. Use 'error' for critical issues that should interrupt flow processing. Use 'warn' for warnings/minor issues that should not interrupt flow processing. |
name | | string | Yes | |
python | | ColumnTestPythonOptions | Yes | Configuration options for the Python column test. |
ColumnTestPythonOptions
Property | Default | Type | Required | Description |
---|---|---|---|---|
entrypoint | | string | Yes | The entrypoint for the python transform function. |
source | | string | Yes | The source file for the python transform function. |
params | | object with property values of type None | No | Parameters for the Python test function. |
is_asset_test | | boolean | No | |
ColumnTestSql
Test to validate data using an SQL query for a single column.
Property | Default | Type | Required | Description |
---|---|---|---|---|
severity | error | string ("error", "warn") | No | The severity level for issues raised by the test. Default is 'error'. Use 'error' for critical issues that should interrupt flow processing. Use 'warn' for warnings/minor issues that should not interrupt flow processing. |
name | | string | Yes | |
sql | | string | No | SQL query that tests data for conditions. |
CombinationUniqueTest
Test to check if a value is unique.
Property | Default | Type | Required | Description |
---|---|---|---|---|
severity | error | string ("error", "warn") | No | The severity level for issues raised by the test. Default is 'error'. Use 'error' for critical issues that should interrupt flow processing. Use 'warn' for warnings/minor issues that should not interrupt flow processing. |
combination_unique | | CombinationUniqueTestOptions | Yes | Test to check if a value is unique. |
CombinationUniqueTestOptions
Configuration options for the unique test.
Property | Default | Type | Required | Description |
---|---|---|---|---|
columns | | array[string] | Yes | The combination of columns to check for uniqueness. |
ComponentSchemaTest
Test to validate that component columns match expected types.
Property | Default | Type | Required | Description |
---|---|---|---|---|
severity | error | string ("error", "warn") | No | The severity level for issues raised by the test. Default is 'error'. Use 'error' for critical issues that should interrupt flow processing. Use 'warn' for warnings/minor issues that should not interrupt flow processing. |
match | exact | string ("exact", "ignore_missing") | No | The type of schema matching to perform. 'exact' requires all columns to be present, 'ignore_missing' allows for missing columns. |
columns | | object with property values of type string | No | A mapping of column names to their expected types. |
CountDistinctEqualTest
Test to check if the number of distinct values is equal to a certain number.
Property | Default | Type | Required | Description |
---|---|---|---|---|
severity | error | string ("error", "warn") | No | The severity level for issues raised by the test. Default is 'error'. Use 'error' for critical issues that should interrupt flow processing. Use 'warn' for warnings/minor issues that should not interrupt flow processing. |
count_distinct_equal | | CountDistinctEqualTestOptions | Yes | Configuration options for the count_distinct_equal test. |
CountDistinctEqualTestOptions
Configuration options for the count_distinct_equal test.
Property | Default | Type | Required | Description |
---|---|---|---|---|
count | | integer | Yes | The number of distinct values to expect. |
group_by_columns | | array[string] | No | The columns to group by. |
CountEqualTest
Test to check if the number of rows is equal to a certain number.
Property | Default | Type | Required | Description |
---|---|---|---|---|
severity | error | string ("error", "warn") | No | The severity level for issues raised by the test. Default is 'error'. Use 'error' for critical issues that should interrupt flow processing. Use 'warn' for warnings/minor issues that should not interrupt flow processing. |
count_equal | | CountEqualTestOptions | Yes | Configuration options for the count_equal test. |
CountEqualTestOptions
Configuration options for the count_equal test.
Property | Default | Type | Required | Description |
---|---|---|---|---|
count | | integer | Yes | The number of rows to expect. |
CountGreaterThanOrEqualTest
Test to check if the number of rows is greater than or equal to a certain number.
Property | Default | Type | Required | Description |
---|---|---|---|---|
severity | error | string ("error", "warn") | No | The severity level for issues raised by the test. Default is 'error'. Use 'error' for critical issues that should interrupt flow processing. Use 'warn' for warnings/minor issues that should not interrupt flow processing. |
count_greater_than_or_equal | | CountGreaterThanOrEqualTestOptions | Yes | Configuration options for the count_greater_than_or_equal test. |
CountGreaterThanOrEqualTestOptions
Configuration options for the count_greater_than_or_equal test.
Property | Default | Type | Required | Description |
---|---|---|---|---|
count | | integer | Yes | The value to compare against. |
group_by_columns | | array[string] | No | The columns to group by. |
CountGreaterThanTest
Test to check if the number of rows is greater than a certain number.
Property | Default | Type | Required | Description |
---|---|---|---|---|
severity | error | string ("error", "warn") | No | The severity level for issues raised by the test. Default is 'error'. Use 'error' for critical issues that should interrupt flow processing. Use 'warn' for warnings/minor issues that should not interrupt flow processing. |
count_greater_than | | CountGreaterThanTestOptions | Yes | Configuration options for the count_greater_than test. |
CountGreaterThanTestOptions
Configuration options for the count_greater_than test.
Property | Default | Type | Required | Description |
---|---|---|---|---|
count | | integer | Yes | The value to compare against. |
group_by_columns | | array[string] | No | The columns to group by. |
CountLessThanOrEqualTest
Test to check if the number of rows is less than or equal to a certain number.
Property | Default | Type | Required | Description |
---|---|---|---|---|
severity | error | string ("error", "warn") | No | The severity level for issues raised by the test. Default is 'error'. Use 'error' for critical issues that should interrupt flow processing. Use 'warn' for warnings/minor issues that should not interrupt flow processing. |
count_less_than_or_equal | | CountLessThanOrEqualTestOptions | Yes | Configuration options for the count_less_than_or_equal test. |
CountLessThanOrEqualTestOptions
Configuration options for the count_less_than_or_equal test.
Property | Default | Type | Required | Description |
---|---|---|---|---|
count | | integer | Yes | The value to compare against. |
group_by_columns | | array[string] | No | The columns to group by. |
CountLessThanTest
Test to check if the number of rows is less than a certain number.
Property | Default | Type | Required | Description |
---|---|---|---|---|
severity | error | string ("error", "warn") | No | The severity level for issues raised by the test. Default is 'error'. Use 'error' for critical issues that should interrupt flow processing. Use 'warn' for warnings/minor issues that should not interrupt flow processing. |
count_less_than | | CountLessThanTestOptions | Yes | Configuration options for the count_less_than test. |
CountLessThanTestOptions
Configuration options for the count_less_than test.
Property | Default | Type | Required | Description |
---|---|---|---|---|
count | | integer | Yes | The value to compare against. |
group_by_columns | | array[string] | No | The columns to group by. |
DataMaintenance
Data maintenance configuration options for the component.
Property | Default | Type | Required | Description |
---|---|---|---|---|
enabled | | boolean | No | A boolean flag indicating whether data maintenance is enabled for the component. |
DatabricksDataPlane
Property | Default | Type | Required | Description |
---|---|---|---|---|
databricks | cluster_by: null pyspark_job_cluster_id: null table_properties: null | DatabricksDataPlaneOptions | No | Databricks configuration options. |
DatabricksDataPlaneOptions
Property | Default | Type | Required | Description |
---|---|---|---|---|
table_properties | | object with property values of type string | No | Table properties to include when creating the data table. This setting is equivalent to the CREATE TABLE ... TBLPROPERTIES clause. Please refer to the Databricks documentation at https://docs.databricks.com/aws/en/delta/table-properties for available properties depending on your data plane. |
pyspark_job_cluster_id | | string | No | The ID of the compute cluster to use for PySpark jobs. |
cluster_by | | array[string] | No | Clustering keys to be added to the table. |
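As an illustration, a Databricks data_plane block could be sketched as follows; the cluster ID and column name are hypothetical, and the table property shown is a standard Delta Lake property.

```yaml
# Hypothetical values throughout.
data_plane:
  databricks:
    table_properties:
      delta.logRetentionDuration: "interval 30 days"
    pyspark_job_cluster_id: "1234-567890-abcde123"
    cluster_by:
      - event_date
```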
DateInRangeTest
Test to check if a date is within a certain range.
Property | Default | Type | Required | Description |
---|---|---|---|---|
severity | error | string ("error", "warn") | No | The severity level for issues raised by the test. Default is 'error'. Use 'error' for critical issues that should interrupt flow processing. Use 'warn' for warnings/minor issues that should not interrupt flow processing. |
date_in_range | | DateInRangeTestOptions | Yes | Configuration options for the date_in_range test. |
DateInRangeTestOptions
Configuration options for the date_in_range test.
Property | Default | Type | Required | Description |
---|---|---|---|---|
min | | string | Yes | The minimum value to expect. |
max | | string | Yes | The maximum value to expect. |
DuckdbDataPlane
Property | Default | Type | Required | Description |
---|---|---|---|---|
duckdb | | DuckDbDataPlaneOptions | No | Duckdb configuration options. |
DuckDbDataPlaneOptions
No properties defined.
FabricDataPlane
Property | Default | Type | Required | Description |
---|---|---|---|---|
fabric | spark_session_config: null | FabricDataPlaneOptions | No | Fabric configuration options. |
FabricDataPlaneOptions
Property | Default | Type | Required | Description |
---|---|---|---|---|
spark_session_config | | LivySparkSessionConfig | No | Spark session configuration. |
GreaterThanOrEqualTest
Test to check if a value is greater than or equal to a certain number.
Property | Default | Type | Required | Description |
---|---|---|---|---|
severity | error | string ("error", "warn") | No | The severity level for issues raised by the test. Default is 'error'. Use 'error' for critical issues that should interrupt flow processing. Use 'warn' for warnings/minor issues that should not interrupt flow processing. |
greater_than_or_equal | | GreaterThanOrEqualTestOptions | Yes | Configuration options for the greater_than_or_equal test. |
GreaterThanOrEqualTestOptions
Configuration options for the greater_than_or_equal test.
Property | Default | Type | Required | Description |
---|---|---|---|---|
value | | Any of: integer number string | Yes | The value to compare against. |
GreaterThanTest
Test to check if a value is greater than a certain number.
Property | Default | Type | Required | Description |
---|---|---|---|---|
severity | error | string ("error", "warn") | No | The severity level for issues raised by the test. Default is 'error'. Use 'error' for critical issues that should interrupt flow processing. Use 'warn' for warnings/minor issues that should not interrupt flow processing. |
greater_than | | GreaterThanTestOptions | Yes | Configuration options for the greater_than test. |
GreaterThanTestOptions
Configuration options for the greater_than test.
Property | Default | Type | Required | Description |
---|---|---|---|---|
value | | Any of: integer number string | Yes | The value to compare against. |
InRangeTest
Test to check if a value is within a certain range.
Property | Default | Type | Required | Description |
---|---|---|---|---|
severity | error | string ("error", "warn") | No | The severity level for issues raised by the test. Default is 'error'. Use 'error' for critical issues that should interrupt flow processing. Use 'warn' for warnings/minor issues that should not interrupt flow processing. |
in_range | | InRangeTestOptions | Yes | Configuration options for the in_range test. |
InRangeTestOptions
Configuration options for the in_range test.
Property | Default | Type | Required | Description |
---|---|---|---|---|
min | | Any of: integer number string | Yes | The minimum value to expect. |
max | | Any of: integer number string | Yes | The maximum value to expect. |
InSetTest
Test to check if a value is in a set of values.
Property | Default | Type | Required | Description |
---|---|---|---|---|
severity | error | string ("error", "warn") | No | The severity level for issues raised by the test. Default is 'error'. Use 'error' for critical issues that should interrupt flow processing. Use 'warn' for warnings/minor issues that should not interrupt flow processing. |
in_set | | InSetTestOptions | Yes | Configuration options for the in_set test. |
InSetTestOptions
Configuration options for the in_set test.
Property | Default | Type | Required | Description |
---|---|---|---|---|
values | | array[Any of: (integer, number, string)] | Yes | The set of values to expect. |
LessThanOrEqualTest
Test to check if a value is less than or equal to a certain number.
Property | Default | Type | Required | Description |
---|---|---|---|---|
severity | error | string ("error", "warn") | No | The severity level for issues raised by the test. Default is 'error'. Use 'error' for critical issues that should interrupt flow processing. Use 'warn' for warnings/minor issues that should not interrupt flow processing. |
less_than_or_equal | | LessThanOrEqualTestOptions | Yes | Configuration options for the less_than_or_equal test. |
LessThanOrEqualTestOptions
Configuration options for the less_than_or_equal test.
Property | Default | Type | Required | Description |
---|---|---|---|---|
value | | Any of: integer number string | Yes | The value to compare against. |
LessThanTest
Test to check if a value is less than a certain number.
Property | Default | Type | Required | Description |
---|---|---|---|---|
severity | error | string ("error", "warn") | No | The severity level for issues raised by the test. Default is 'error'. Use 'error' for critical issues that should interrupt flow processing. Use 'warn' for warnings/minor issues that should not interrupt flow processing. |
less_than | | LessThanTestOptions | Yes | Configuration options for the less_than test. |
LessThanTestOptions
Configuration options for the less_than test.
Property | Default | Type | Required | Description |
---|---|---|---|---|
value | | Any of: integer number string | Yes | The value to compare against. |
MeanInRangeTest
Test to check if a value is within a certain mean range.
Property | Default | Type | Required | Description |
---|---|---|---|---|
severity | error | string ("error", "warn") | No | The severity level for issues raised by the test. Default is 'error'. Use 'error' for critical issues that should interrupt flow processing. Use 'warn' for warnings/minor issues that should not interrupt flow processing. |
mean_in_range | | MeanInRangeTestOptions | Yes | Configuration options for the mean_in_range test. |
MeanInRangeTestOptions
Configuration options for the mean_in_range test.
Property | Default | Type | Required | Description |
---|---|---|---|---|
min | | Any of: integer number string | Yes | The minimum value to expect. |
max | | Any of: integer number string | Yes | The maximum value to expect. |
NotEmptyTest
Test to check if a value is not empty.
Property | Default | Type | Required | Description |
---|---|---|---|---|
severity | error | string ("error", "warn") | No | The severity level for issues raised by the test. Default is 'error'. Use 'error' for critical issues that should interrupt flow processing. Use 'warn' for warnings/minor issues that should not interrupt flow processing. |
not_empty | | NoTestOptions | No | Test to check if a value is not empty. |
NotNullTest
Test to check if a value is not null.
Property | Default | Type | Required | Description |
---|---|---|---|---|
severity | error | string ("error", "warn") | No | The severity level for issues raised by the test. Default is 'error'. Use 'error' for critical issues that should interrupt flow processing. Use 'warn' for warnings/minor issues that should not interrupt flow processing. |
not_null | | NoTestOptions | No | Test to check if a value is not null. |
RangeOptions
Property | Default | Type | Required | Description |
---|---|---|---|---|
start | | integer | Yes | Start of the range partitioning. |
end | | integer | Yes | End of the range partitioning. |
interval | | integer | Yes | Interval of the range partitioning. |
SnowflakeDataPlane
Property | Default | Type | Required | Description |
---|---|---|---|---|
snowflake | | SnowflakeDataPlaneOptions | Yes | Snowflake configuration options. |
SnowflakeDataPlaneOptions
Property | Default | Type | Required | Description |
---|---|---|---|---|
cluster_by | | array[string] | No | Clustering keys to be added to the table. |
IncrementalStrategy
Incremental Processing Strategy.
Property | Default | Type | Required | Description |
---|---|---|---|---|
incremental | | Any of: string MergeStrategy SCDType2Strategy | Yes | Incremental processing strategy. |
on_schema_change | | string ("ignore", "fail", "append_new_columns", "sync_all_columns") | No | Policy to apply when schema changes are detected. Defaults to 'fail' if not provided. |
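Combining IncrementalStrategy with the MergeStrategy and KeyOptions documented later on this page, a strategy block could be sketched as follows; the column names are hypothetical and the nesting is inferred from the schema.

```yaml
# Hypothetical column names; nesting inferred from the schema.
strategy:
  incremental:
    merge:
      unique_key: id
      deletion_column: is_deleted
  on_schema_change: append_new_columns
```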
PartitionedStrategy
Partitioned Ingest Strategy. The user is expected to provide 2 functions, a list function that lists partitions in the source, and a read function that reads a partition from the source.
Property | Default | Type | Required | Description |
---|---|---|---|---|
partitioned | | PartitionedOptions | No | Options for partitioning data. |
on_schema_change | | string ("ignore", "fail", "append_new_columns", "sync_all_columns") | No | Policy to apply when schema changes are detected. Defaults to 'fail' if not provided. |
PartitionedOptions
Options related to partition optimization - in particular, the policy that determines which partitions to ingest.
Property | Default | Type | Required | Description |
---|---|---|---|---|
enable_substitution_by_partition_name | | boolean | Yes | Enable substitution by partition name. |
output_type | table | string ("table", "view") | No | Output type for partitioned data. Must be either 'table' or 'view'. This strategy applies only to Transforms. |
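Together with the PartitionedListRead options above, a partitioned ingest could be sketched as follows; the entrypoints and paths are hypothetical, and the nesting is inferred from the schema.

```yaml
# Hypothetical names and paths; nesting inferred from the schema.
custom_python_read:
  strategy:
    partitioned:
      enable_substitution_by_partition_name: true
    on_schema_change: ignore
  python:
    list:
      entrypoint: reader.list_partitions
      source: python/reader.py
    read:
      entrypoint: reader.read_partition
      source: python/reader.py
```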
SCDType2Strategy
The SCD Type 2 strategy allows users to track changes to records over time, by tracking the start and end times for each version of a record. A brief overview of the strategy can be found at https://en.wikipedia.org/wiki/Slowly_changing_dimension#Type_2:_add_new_row.
Property | Default | Type | Required | Description |
---|---|---|---|---|
scd_type_2 | | KeyOptions | No | Options for SCD Type 2 strategy. |
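A sketch, assuming the KeyOptions documented later on this page nest under scd_type_2 (the column name is hypothetical):

```yaml
# Hypothetical column name.
strategy:
  incremental:
    scd_type_2:
      unique_key: customer_id
```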
StddevInRangeTest
Test to check if a value is within a certain standard deviation range.
Property | Default | Type | Required | Description |
---|---|---|---|---|
severity | error | string ("error", "warn") | No | The severity level for issues raised by the test. Default is 'error'. Use 'error' for critical issues that should interrupt flow processing. Use 'warn' for warnings/minor issues that should not interrupt flow processing. |
stddev_in_range | | StddevInRangeTestOptions | Yes | Configuration options for the stddev_in_range test. |
StddevInRangeTestOptions
Configuration options for the stddev_in_range test.
Property | Default | Type | Required | Description |
---|---|---|---|---|
min | | Any of: integer number string | Yes | The minimum value to expect. |
max | | Any of: integer number string | Yes | The maximum value to expect. |
SubstringMatchTest
Test to check if a value contains a substring.
Property | Default | Type | Required | Description |
---|---|---|---|---|
severity | error | string ("error", "warn") | No | The severity level for issues raised by the test. Default is 'error'. Use 'error' for critical issues that should interrupt flow processing. Use 'warn' for warnings/minor issues that should not interrupt flow processing. |
substring_match | | SubstringMatchTestOptions | Yes | Configuration options for the substring_match test. |
SubstringMatchTestOptions
Configuration options for the substring_match test.
Property | Default | Type | Required | Description |
---|---|---|---|---|
substring | | string | Yes | The substring to search for. |
SynapseDataPlane
Property | Default | Type | Required | Description |
---|---|---|---|---|
synapse | spark_session_config: null | SynapseDataPlaneOptions | No | Synapse configuration options. |
SynapseDataPlaneOptions
Property | Default | Type | Required | Description |
---|---|---|---|---|
spark_session_config | | LivySparkSessionConfig | No | Spark session configuration. |
LivySparkSessionConfig
Property | Default | Type | Required | Description |
---|---|---|---|---|
pool | | string | No | The pool to use for the Spark session. |
driver_memory | | string | No | The memory to use for the Spark driver. |
driver_cores | | integer | No | The number of cores to use for the Spark driver. |
executor_memory | | string | No | The memory to use for the Spark executor. |
executor_cores | | integer | No | The number of cores to use for each executor. |
num_executors | | integer | No | The number of executors to use for the Spark session. |
session_key_override | | string | No | The key to use for the Spark session. |
max_concurrent_sessions | | integer | No | The maximum number of concurrent sessions of this spec to create. |
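For example, a Synapse data_plane block sizing the Livy session might be sketched as follows; the pool name and resource values are hypothetical.

```yaml
# Hypothetical pool name and sizing values.
data_plane:
  synapse:
    spark_session_config:
      pool: small-pool
      driver_memory: 4g
      driver_cores: 2
      executor_memory: 8g
      executor_cores: 4
      num_executors: 2
```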
UniqueTest
Test to check if a value is unique.
Property | Default | Type | Required | Description |
---|---|---|---|---|
severity | error | string ("error", "warn") | No | The severity level for issues raised by the test. Default is 'error'. Use 'error' for critical issues that should interrupt flow processing. Use 'warn' for warnings/minor issues that should not interrupt flow processing. |
unique | | NoTestOptions | No | Test to check if a value is unique. |
NoTestOptions
Configuration options for tests that have no test body definition (not_null, unique, etc.).
No properties defined.
ResourceMetadata
Meta information of a resource. In most cases it doesn't affect the system behavior but may be helpful to analyze project resources.
Property | Default | Type | Required | Description |
---|---|---|---|---|
source | | ResourceLocation | No | The origin or source information for the resource. |
source_event_uuid | | string | No | UUID of the event that is associated with creation of this resource. |
ResourceLocation
The origin or source information for the resource.
Property | Default | Type | Required | Description |
---|---|---|---|---|
path | | string | Yes | Path within repository files where the resource is defined. |
first_line_number | | integer | No | First line number within path file where the resource is defined. |
MergeStrategy
A strategy that involves merging new data with existing data by updating existing records that match the unique key.
Property | Default | Type | Required | Description |
---|---|---|---|---|
merge | | KeyOptions | No | Options for merge strategy. |
KeyOptions
Column options needed for merge and SCD Type 2 strategies, such as unique key and deletion column name.
Property | Default | Type | Required | Description |
---|---|---|---|---|
unique_key | | string | Yes | Column or comma-separated set of columns used as a unique identifier for records, aiding in the merge process. |
deletion_column | | string | No | Column name used in the upstream source for soft-deleting records. Used when replicating data from a source that supports soft-deletion. If provided, the merge strategy will be able to detect deletions and mark them as deleted in the destination. If not provided, the merge strategy will not be able to detect deletions. |
merge_update_columns | | Any of: string array[string] | No | List of columns to include when updating values in merge. These columns are mutually exclusive with respect to the columns in merge_exclude_columns. |
merge_exclude_columns | | Any of: string array[string] | No | List of columns to exclude when updating values in merge. These columns are mutually exclusive with respect to the columns in merge_update_columns. |
incremental_predicates | | Any of: string array[string] | No | List of conditions to filter incremental data. |
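For instance, to merge on a composite key while never overwriting an immutable column, the KeyOptions could be sketched as follows; the column names are hypothetical.

```yaml
# Hypothetical column names.
merge:
  unique_key: "order_id,line_number"
  merge_exclude_columns:
    - created_at
```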