Generic File Read Component

Component for reading files from a filesystem.

GenericFileReadComponent

GenericFileReadComponent is defined beneath the following ancestor nodes in the YAML structure:

Component
ReadComponent

Below are the properties for the GenericFileReadComponent. Each property links to its details section further down on this page.

Property	Type	Required	Description
dependencies	array[None]	No	List of dependencies that must complete before this Component runs.
event_time	string	No	Timestamp column in the Component output used to represent Event time.
connection	string	No	Name of the Connection to use for reading data.
columns	array[None]	No	List specifying the columns to read from the source and transformations to make during read.
normalize	boolean	No	Boolean flag indicating whether the output column names should be normalized to a standard naming convention after reading.
preserve_case	boolean	No	Boolean flag indicating whether the case of the column names should be preserved after reading.
uppercase	boolean	No	Boolean flag indicating whether the column names should be transformed to uppercase after reading.
strategy	PartitionedStrategy	No	Ingest strategy when reading files.
generic_file		Yes	Options for reading files from a filesystem.

Property Details

Component

A Component is a fundamental building block of a data Flow. Supported Component types include: Read, Transform, Task, Test, and more.

Property	Default	Type	Required	Description
component		One of: CustomPythonReadComponent ApplicationComponent AliasedTableComponent ExternalTableComponent DbtNodeComponent	Yes	Component configuration options.

ReadComponent

Component that reads data from a system.

Property	Type	Required	Description
data_plane	One of: SnowflakeDataPlane BigQueryDataPlane DuckdbDataPlane DatabricksDataPlane	No	Data Plane-specific configuration options for Components.
skip	boolean	No	Boolean flag indicating whether to skip processing for the Component or not.
retry_strategy		No	Retry strategy configuration options for the Component if any exceptions are encountered.
data_maintenance		No	The data maintenance configuration options for the Component.
description	string	No	Brief description of what the model does.
metadata		No	Meta information of a resource. In most cases it doesn't affect the system behavior but may be helpful to analyze project resources.
name	string	Yes	The name of the model.
flow_name	string	No	Name of the Flow that the Component belongs to.
tests		No	Defines tests to run on this Component's data.
read	One of: GenericFileReadComponent LocalFileReadComponent SFTPReadComponent S3ReadComponent GcsReadComponent AbfsReadComponent HttpReadComponent MSSQLReadComponent MySQLReadComponent OracleReadComponent PostgresReadComponent SnowflakeReadComponent BigQueryReadComponent DatabricksReadComponent	Yes	Read component that reads data from a system.

FileReadOptionsBase

Options for locating and parsing files from a specified directory or file path, including file selection criteria and parser to use.

Property	Default	Type	Required	Description
path		string	Yes	Path to the directory or file to read. Path is relative to the connection's root directory, and cannot be an absolute path or traverse outside the root directory.
exclude		array[None]	No	List of conditions to exclude specific files from being processed.
include		array[None]	No	List of conditions to include specific files for processing.
parser	auto	One of: auto avro feather orc parquet pickle text xml csv excel json	No	Parser Resource for reading the files. Defaults to 'auto'. To set specific parser options, use the parser name as a child object.
archive		Any of:	No	Configuration for processing archive files.
load_strategy			No	Strategy for loading files, including limits on number and size of files.
time_based_file_selection		Any of: last_modified	No	Method to use for file selection based on a time window.
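
The constraint on path (relative to the connection's root, not absolute, no traversal outside the root) can be illustrated with a small check. This is a minimal sketch of the rule as stated above, not the validation the Component actually performs; the function name is_valid_read_path is hypothetical, and POSIX-style separators are assumed.

```python
from pathlib import PurePosixPath


def is_valid_read_path(path: str) -> bool:
    """Return True if `path` is relative and never climbs above the root."""
    if path.startswith("/"):
        return False  # absolute paths are rejected
    depth = 0
    for part in PurePosixPath(path).parts:
        if part == "..":
            depth -= 1
            if depth < 0:
                return False  # this step would leave the root directory
        else:
            depth += 1
    return True


print(is_valid_read_path("data/2024/file.csv"))  # True
print(is_valid_read_path("/etc/passwd"))         # False
print(is_valid_read_path("../secrets"))          # False
```

Whether a path such as "a/../b", which stays inside the root, is accepted by the real validation is not stated in this reference; the sketch allows it.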

AutoParser

Parser that automatically detects the file format and settings for parsing.

Property	Default	Type	Required	Description
hive_partitioning	False	boolean	No	Whether to extract Hive partition values from the full file path and add them to the table as partition columns. The dataset should not contain columns with the same names as the partition columns. If false or not set, no partition columns are added. For example, when set to true, a path containing key=value directory names such as "/year=2009/month=11" yields partition columns "year" and "month" with the values "2009" and "11" respectively.
auto			Yes	Options for automatically detecting the file format. None need to be specified.
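
To make the hive_partitioning behavior concrete, the following sketch extracts key=value directory names from a path. It only illustrates the idea described above; the helper name hive_partitions is hypothetical, and the parser's real extraction logic may differ in edge cases.

```python
def hive_partitions(path: str) -> dict:
    """Collect key=value directory names from a file path."""
    partitions = {}
    # Skip the final segment: it is the file name, not a directory.
    for segment in path.strip("/").split("/")[:-1]:
        key, sep, value = segment.partition("=")
        if sep and key and value:
            partitions[key] = value
    return partitions


print(hive_partitions("/year=2009/month=11/data.parquet"))
# {'year': '2009', 'month': '11'}
```

Note that the extracted values are strings, matching the "2009" and "11" values in the description.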

AvroParser

Parser for the Avro file format.

Property	Default	Type	Required	Description
hive_partitioning	False	boolean	No	Whether to extract Hive partition values from the full file path and add them to the table as partition columns. The dataset should not contain columns with the same names as the partition columns. If false or not set, no partition columns are added. For example, when set to true, a path containing key=value directory names such as "/year=2009/month=11" yields partition columns "year" and "month" with the values "2009" and "11" respectively.
avro			Yes	Parsing options for Avro files.

CsvParser

Parser for CSV files.

Property	Default	Type	Required	Description
hive_partitioning	False	boolean	No	Whether to extract Hive partition values from the full file path and add them to the table as partition columns. The dataset should not contain columns with the same names as the partition columns. If false or not set, no partition columns are added. For example, when set to true, a path containing key=value directory names such as "/year=2009/month=11" yields partition columns "year" and "month" with the values "2009" and "11" respectively.
csv			Yes	Parsing options for CSV files.

CsvParserOptions

Parsing options for CSV files, including separators, header presence, and multi-line values.

Property	Default	Type	Required	Description
all_varchar		boolean	No	Option to skip type detection for CSV parsing and assume all columns to be of type VARCHAR.
allow_quoted_nulls		boolean	No	Option to allow the conversion of quoted values to NULL values.
auto_type_candidates		array[string]	No	Types that the sniffer will use when detecting CSV column types. VARCHAR is always included as a fallback.
buffer_size		integer	No	The buffer size in bytes for the CSV reader.
columns		object with property values of type string	No	A struct specifying the column names and types. Using this option implies that auto detection is not used.
compression		string	No	File compression type, detected automatically from the file extension.
dateformat		string	No	Date format to use when parsing dates.
decimal_separator		string	No	Decimal separator of numbers.
delim		string	No	Delimiter of the CSV file.
types		Any of: array[string] object with property values of type string	No	Column types, as either a list (by position) or a struct (by name).
encoding		string	No	Encoding of the CSV file.
escape		string	No	String that should appear before a data character sequence that matches the quote value.
force_not_null		array[string]	No	Do not match the specified columns' values against the NULL string.
header		Any of: boolean integer	No	Indicates whether the file contains a header row. If True, the first row is treated as column names. If False, no header is expected. If an integer, specifies the row number (0-based) to use as the header.
hive_partitioning		boolean	No	Whether to interpret the path as a Hive partitioned path.
ignore_errors		boolean	No	Option to ignore any parsing errors encountered.
new_line		string	No	Line terminator of the CSV file.
max_line_size		integer	No	Maximum line size in bytes.
nullstr		Any of: string array[string]	No	String or strings that represent a NULL value.
names		array[string]	No	List of column names.
normalize_names		boolean	No	Whether column names should be normalized, removing non-alphanumeric characters.
null_padding		boolean	No	Pad remaining columns on the right with null values if a row lacks columns.
parallel		boolean	No	Whether the parallel CSV reader is used.
quote		string	No	String to use as the quoting character.
sample_size		integer	No	Number of sample rows for auto detection of parameters.
sep		string	No	Single-character string to use as the column separator. Default is ','.
skip		integer	No	Number of lines at the top of the file to skip.
timestampformat		string	No	Date format to use when parsing timestamps.
union_by_name		boolean	No	Whether the columns of multiple schemas should be unified by name, rather than by position. Increases memory consumption.
has_header	True	boolean	No	Indicates if the first row of the CSV is treated as the header row. Defaults to True.
has_multi_line_value	False	boolean	No	Indicates if the CSV contains values spanning multiple lines. Defaults to False.
auto_detect	True	boolean	No	Enables auto detection of CSV parameters.
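
The meaning of a few of these options (delim, quote, skip, header, nullstr) can be shown with Python's standard csv module. This is only an illustration of the option semantics under the descriptions above; it is not the parser the Component uses, the read_rows helper is hypothetical, and it does not handle values spanning multiple lines.

```python
import csv


def read_rows(text, delim=",", quote='"', skip=0, header=True, nullstr=""):
    """Parse CSV text into (column names, data rows)."""
    # skip: drop this many lines from the top of the input before parsing.
    lines = text.splitlines()[skip:]
    rows = list(csv.reader(lines, delimiter=delim, quotechar=quote))
    names = None
    if header and rows:
        # header: treat the first remaining row as column names.
        names, rows = rows[0], rows[1:]
    # nullstr: cells equal to this string are read as NULL (None here).
    data = [[None if cell == nullstr else cell for cell in row] for row in rows]
    return names, data


sample = "# exported by a hypothetical tool\nid|name\n1|'a|b'\n2|NA\n"
names, data = read_rows(sample, delim="|", quote="'", skip=1, nullstr="NA")
print(names)  # ['id', 'name']
print(data)   # [['1', 'a|b'], ['2', None]]
```

The quoted value 'a|b' keeps its embedded delimiter, which is the purpose of the quote option.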

ExcelParser

Parser for Excel files.

Property	Default	Type	Required	Description
hive_partitioning	False	boolean	No	Whether to extract Hive partition values from the full file path and add them to the table as partition columns. The dataset should not contain columns with the same names as the partition columns. If false or not set, no partition columns are added. For example, when set to true, a path containing key=value directory names such as "/year=2009/month=11" yields partition columns "year" and "month" with the values "2009" and "11" respectively.
excel		Any of:	Yes	Parsing options for Excel files.

DuckDBExcelParserOptions

Parsing options for Excel files using DuckDB. Refer to the DuckDB documentation for more information here: https://duckdb.org/docs/guides/file_formats/excel_import.html

Property	Default	Type	Required	Description
duckdb_options			Yes	DuckDB-specific options for parsing Excel files.

DuckDBExcelParserOptionsWrapper

Wrapper for the DuckDB Excel Parser Options.

Property	Default	Type	Required	Description
layer		string	No	Layer to use for parsing the Excel file. If not specified, the default layer is used.
open_options		array[string]	No	Additional open_options to pass to the st_read function.

FeatherParser

Parser for the Feather file format.

Property	Default	Type	Required	Description
hive_partitioning	False	boolean	No	Whether to extract Hive partition values from the full file path and add them to the table as partition columns. The dataset should not contain columns with the same names as the partition columns. If false or not set, no partition columns are added. For example, when set to true, a path containing key=value directory names such as "/year=2009/month=11" yields partition columns "year" and "month" with the values "2009" and "11" respectively.
feather			Yes	Parsing options for Feather files.

FilenameTimestampPattern

Pattern to extract timestamps from file names for the purpose of time-based file selection.

Property	Default	Type	Required	Description
pattern		string	Yes	Pattern to extract timestamps from filenames, e.g. r'year=(?P<year>\d{4})/month=(?P<month>\d{2})/day=(?P<day>\d{2})'
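
A pattern of this kind uses named groups to pick the date parts out of a file path. The sketch below assumes the digit counts in the example are regex quantifiers (\d{4}, \d{2}) and that the named groups map directly onto datetime fields; the function name timestamp_from_filename is hypothetical and the Component's own extraction may support more groups or behave differently.

```python
import re
from datetime import datetime
from typing import Optional

# Named groups year, month and day, as in the example pattern.
PATTERN = re.compile(
    r"year=(?P<year>\d{4})/month=(?P<month>\d{2})/day=(?P<day>\d{2})"
)


def timestamp_from_filename(name: str) -> Optional[datetime]:
    """Build a datetime from the named groups, or None if there is no match."""
    match = PATTERN.search(name)
    if match is None:
        return None
    parts = {key: int(value) for key, value in match.groupdict().items()}
    return datetime(**parts)


print(timestamp_from_filename("logs/year=2024/month=03/day=07/a.json"))
# 2024-03-07 00:00:00
```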

JsonParser

Parser for JSON files.

Property	Default	Type	Required	Description
hive_partitioning	False	boolean	No	Whether to extract Hive partition values from the full file path and add them to the table as partition columns. The dataset should not contain columns with the same names as the partition columns. If false or not set, no partition columns are added. For example, when set to true, a path containing key=value directory names such as "/year=2009/month=11" yields partition columns "year" and "month" with the values "2009" and "11" respectively.
json			No	Parsing options for JSON files.

JsonParserOptions

Parsing options for JSON files, particularly focusing on date and timestamp formats.

Property	Type	Required	Description
columns	object with property values of type string	No	Struct specifying key names and value types contained within the JSON file (e.g. `{key1: 'INTEGER', key2: 'VARCHAR'}`).
compression	string	No	File compression type, detected automatically from the file extension by default. Options include 'uncompressed', 'gzip', 'zstd', and 'auto_detect'.
convert_strings_to_integers	boolean	No	Whether strings representing integer values should be converted to a numerical type. Defaults to False.
dateformat	string	No	Specifies the date format to use when parsing dates. The DuckDB default format is 'iso'.
format	string	No	Specifies the format of the JSON file. Can be one of ['auto', 'unstructured', 'newline_delimited', 'array'].
hive_partitioning	boolean	No	Whether to interpret the path as a Hive partitioned path. If not set, DuckDB's default behavior applies.
ignore_errors	boolean	No	Whether to ignore any parsing errors encountered. Defaults to False.
map_inference_threshold	integer	No	Maximum number of elements in a map to use map inference. Set to -1 to use map inference for all maps.
maximum_depth	integer	No	Maximum nesting depth to which the automatic schema detection detects types.
maximum_object_size	integer	No	Maximum size of a JSON object in bytes.
records	string	No	Determines how JSON objects are treated in parsing. Can be one of ['auto', 'true', 'false'].
sample_size	integer	No	Number of sample objects for automatic JSON type detection. Set to -1 to scan the entire input file.
timestampformat	string	No	Specifies the date format to use when parsing timestamps.
union_by_name	boolean	No	Whether the schemas of multiple JSON files should be unified.
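
The difference between the 'newline_delimited' and 'array' values of format can be shown with Python's standard json module. This is an illustrative sketch only, not the parser's implementation; parse_records is a hypothetical helper and the 'auto' and 'unstructured' formats are left out.

```python
import json


def parse_records(text: str, fmt: str) -> list:
    """Return the list of JSON records contained in `text`."""
    if fmt == "newline_delimited":
        # One JSON value per line; blank lines are ignored in this sketch.
        return [json.loads(line) for line in text.splitlines() if line.strip()]
    if fmt == "array":
        # The whole file is a single JSON array of records.
        return json.loads(text)
    raise ValueError(f"format not covered by this sketch: {fmt}")


print(parse_records('{"a": 1}\n{"a": 2}\n', "newline_delimited"))
# [{'a': 1}, {'a': 2}]
print(parse_records('[{"a": 1}, {"a": 2}]', "array"))
# [{'a': 1}, {'a': 2}]
```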

LoadStrategy

Specifications for loading files, including limits on number and size of files.

Property	Type	Required	Description
max_files	integer	No	Maximum number of files to read.
download_threads	integer	No	Number of threads to use for downloading files.
direct_load		No	Override default direct load behavior.

DirectLoadOptions

Property	Default	Type	Required	Description
enable		boolean	No	Boolean flag indicating if direct load should be used if supported by the backend engine. Default is not to use direct load.
ignore_unknown_values	True	boolean	No	Indicates if BigQuery should allow extra values that are not represented in the table schema. If true, the extra values are ignored. If false, records with extra columns are treated as bad records, and if there are too many bad records, an invalid error is returned in the job result. Defaults to True here, whereas BigQuery's own default is false. The sourceFormat property determines what BigQuery treats as an extra value.
max_staging_table_age_in_hours		integer	No	For the BigQuery backend, the maximum age of the staging table in hours. If the staging table is older than this age, it will be dropped. If not set, defaults to 6 hours.
nested_fields_as_string		boolean	No	For BigQuery backend, if true, nested fields will be converted to strings. If false, nested fields will be converted to JSON objects.

OrcParser

Parser for the ORC file format.

Property	Default	Type	Required	Description
hive_partitioning	False	boolean	No	Whether to extract Hive partition values from the full file path and add them to the table as partition columns. The dataset should not contain columns with the same names as the partition columns. If false or not set, no partition columns are added. For example, when set to true, a path containing key=value directory names such as "/year=2009/month=11" yields partition columns "year" and "month" with the values "2009" and "11" respectively.
orc			Yes	Parsing options for ORC files.

PandasExcelParserOptions

Parsing options for Excel files using pandas. Refer to the pandas documentation for more information here: https://pandas.pydata.org/docs/reference/api/pandas.read_excel.html

Property	Default	Type	Required	Description
pandas_options			Yes	pandas-specific options for parsing Excel files.

PandasExcelParserOptionsWrapper

Wrapper for the Pandas Excel Parser Options.

Property	Type	Required	Description
date_format	string	No	Format string for parsing dates, following Python's strftime directives, e.g., '%Y-%m-%d'.
date_columns	array[string]	No	List of column names that should be parsed as dates.
timestamp_columns	array[string]	No	List of column names that should be parsed as timestamps.
sheet_name	string	No	Name of the Excel sheet to parse. If not specified, the first sheet is used by default.
header	Any of: integer array[integer]	No	Row number(s) to use as the column names, and the start of the data. If a list of integers is passed, the values in the list indicate the row number(s) to use as the column names, and the start of the data.
names	array[string]	No	List of column names to use. If not specified, and header is True, then the first row of the data is used. If a list is passed, it must match the length of the data.
index_col	Any of: integer array[integer]	No	Column number(s) to set as the index (0-based). If a list is passed, the values in the list indicate the column number(s) to set as the index.
usecols	Any of: string array[string]	No	Columns to parse. If not specified, all columns are parsed. If a string, indicates a comma-separated list of Excel column letters and column ranges (e.g. "A:E" or "A,C,E:F"). Ranges are inclusive of both sides. If a list of integers, indicates the column numbers to parse (0-indexed). If a list of strings, indicates the column names to parse. If a callable, each column name is evaluated against it and the column is parsed if the callable returns True.
dtype	object with property values of type string	No	A struct that specifies the column names and value types contained within the Excel file (e.g. `{key1: 'INTEGER', key2: 'VARCHAR'}`).
skiprows	Any of: integer array[integer]	No	Row number(s) to skip (0-based) before reading the data. If a list is passed, the values in the list indicate the row number(s) to skip.
nrows	integer	No	Number of rows to read. If not specified, all rows are read.
na_values	Any of: string array[string]	No	String or list of strings to recognize as NA/NaN. If a list is passed, the values in the list indicate the strings to recognize as NA/NaN.
keep_default_na	boolean	No	Whether to include the default NaN values when parsing the data. If True, the default NaN values are included. If False, the default NaN values are not included.
na_filter	boolean	No	Whether to detect missing value markers (empty strings and the values in na_values). If False, no NA detection is performed, which can speed up reading files that contain no NA values.
parse_dates	Any of: boolean array[boolean]	No	Whether to parse dates. If True, dates are parsed. If False, dates are not parsed. If a list is passed, the values in the list indicate whether the corresponding columns should be parsed as dates.
skipfooter	integer	No	Number of rows at the end of the file to skip.

ParquetParser

Parser for the Parquet file format.

Property	Default	Type	Required	Description
hive_partitioning	False	boolean	No	Whether to extract Hive partition values from the full file path and add them to the table as partition columns. The dataset should not contain columns with the same names as the partition columns. If false or not set, no partition columns are added. For example, when set to true, a path containing key=value directory names such as "/year=2009/month=11" yields partition columns "year" and "month" with the values "2009" and "11" respectively.
parquet			Yes	Parsing options for Parquet files.

PickleParser

Parser for the Pickle file format. Pickle files are generated by Python's pickle module.

Property	Default	Type	Required	Description
hive_partitioning	False	boolean	No	Whether to extract Hive partition values from the full file path and add them to the table as partition columns. The dataset should not contain columns with the same names as the partition columns. If false or not set, no partition columns are added. For example, when set to true, a path containing key=value directory names such as "/year=2009/month=11" yields partition columns "year" and "month" with the values "2009" and "11" respectively.
pickle			Yes	Parsing options for Pickle files.

ComponentColumn

Component column expression definition.

No properties defined.

TarArchive

Configuration options for tar archive files.

Property	Default	Type	Required	Description
tar			No	Configuration for tar archive files.

TextParser

Parser for text files.

Property	Default	Type	Required	Description
hive_partitioning	False	boolean	No	Whether to extract Hive partition values from the full file path and add them to the table as partition columns. The dataset should not contain columns with the same names as the partition columns. If false or not set, no partition columns are added. For example, when set to true, a path containing key=value directory names such as "/year=2009/month=11" yields partition columns "year" and "month" with the values "2009" and "11" respectively.
text			Yes	Parsing options for text files.

PartitionedStrategy

Partitioned Ingest Strategy. The user is expected to provide 2 functions, a list function that lists partitions in the source, and a read function that reads a partition from the source.

Property	Default	Type	Required	Description
partitioned			No	Options for partitioning data.
on_schema_change		string ("ignore", "fail", "append_new_columns", "sync_all_columns", "smart")	No	Policy to apply when schema changes are detected. Defaults to 'fail' if not provided.

PartitionedOptions

Options related to partition optimization - in particular, the policy that determines which partitions to ingest.

Property	Default	Type	Required	Description
enable_substitution_by_partition_name		boolean	Yes	Enable substitution by partition name.
output_type	table	string ("table", "view")	No	Output type for partitioned data. Must be either 'table' or 'view'. This strategy only applies to Transforms.

InputComponent

Specification for input Components defining how partitioning behaviors should be handled. This metadata is required when a Component serves as an input to other Components within a Flow. The reshape parameter controls how input data is partitioned and processed. It accepts either full for full reduction operations or map for partition-wise operations.

Property	Type	Required	Description
flow	string	Yes	Name of the parent Flow that the input Component belongs to.
name	string	Yes	Name of the input Component.
alias	string	No	Alias to use for the input Component.
partition_spec	Any of: string ("full_reduction", "map")	No	Internal specification for how Component input data should be partitioned before processing. This field is populated based on the user-facing `reshape` parameter in ref() calls, which accepts `full` (for full reduction operations) or `map` (for partition-wise operations). Input partitioning is applied before the Component's logic is executed.
where	string	No	Optional filter condition to apply to the input Component's data.
partition_binding	Any of: string	No	Optional partition binding specification to apply to the Component on a per-output-partition basis against other inputs' partitions.

PartitionBinding

Property	Default	Type	Required	Description
logical_operator	logical_operator	string ("AND", "OR")	No	Logical operator to use to combine the partition binding predicates provided.
predicates	predicates	array[string]	No	List of partition binding predicates to apply to the input Component's data.
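
As a rough illustration of how a logical operator combines a list of predicates, the sketch below joins predicate strings into one expression. The predicate syntax shown, the combine_predicates helper, and the "AND" default are all assumptions made for the example; this reference does not specify them.

```python
def combine_predicates(predicates, logical_operator="AND"):
    """Join predicate strings with AND or OR, parenthesizing each one."""
    if logical_operator not in ("AND", "OR"):
        raise ValueError("logical_operator must be 'AND' or 'OR'")
    return f" {logical_operator} ".join(f"({p})" for p in predicates)


# Hypothetical predicate strings, for illustration only.
print(combine_predicates(["a.day = b.day", "a.region = b.region"], "OR"))
# (a.day = b.day) OR (a.region = b.region)
```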

RepartitionSpec

Specification for repartitioning operations on the input Component's data.

Property	Default	Type	Required	Description
repartition			No	Options for repartitioning the input Component's data.

RepartitionOptions

Options for repartitioning the input Component's data.

Property	Default	Type	Required	Description
partition_by		string	Yes	Column to partition by.
granularity		string	Yes	Granularity to use for the partitioning.

XmlParser

Parser for XML files.

Property	Default	Type	Required	Description
hive_partitioning	False	boolean	No	Whether to extract Hive partition values from the full file path and add them to the table as partition columns. The dataset should not contain columns with the same names as the partition columns. If false or not set, no partition columns are added. For example, when set to true, a path containing key=value directory names such as "/year=2009/month=11" yields partition columns "year" and "month" with the values "2009" and "11" respectively.
xml			Yes	Parsing options for XML files.

NoParserOptions

No custom parsing options exist for this parser.

No properties defined.

ZipArchive

Configuration options for ZIP archive files.

Property	Default	Type	Required	Description
zip			No	Configuration for ZIP archive files.

FileArchiveOptions

Options for working with files inside of the file structure of an archive file format.

Property	Type	Required	Description
path	string	No	Path to the directory in the archive containing files that should be processed.
include	array[None]	No	List of conditions to include specific files in the archive.
exclude	array[None]	No	List of conditions to exclude specific files in the archive.

FileSelection

Options for selecting files based on various criteria. All criteria specified must be met for a file to be included.

Property	Type	Required	Description
prefix	string	No	Select only files with this prefix. Strongly recommended when parser is not set to 'auto'.
suffix	string	No	Select only files with this suffix. Strongly recommended when parser is not set to 'auto'.
created_at		No	Include files created within the specified date range.
modified_at		No	Include files modified within the specified date range.
timestamp_from_filename		No	Extract and include files based on timestamps parsed from filenames.
glob	string	No	Glob pattern for including files.
regex	string	No	Regular expression pattern for including files.
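
Because every specified criterion must be met, a FileSelection behaves like a logical AND over its fields. The sketch below shows that rule for prefix, suffix, glob and regex using only the standard library. It assumes the criteria are tested against the file name as given and that regex is a search rather than a full match; neither detail is confirmed by this reference, and is_selected is a hypothetical name.

```python
import fnmatch
import re


def is_selected(name, prefix=None, suffix=None, glob=None, regex=None):
    """Return True only if `name` satisfies every criterion that was given."""
    checks = []
    if prefix is not None:
        checks.append(name.startswith(prefix))
    if suffix is not None:
        checks.append(name.endswith(suffix))
    if glob is not None:
        checks.append(fnmatch.fnmatchcase(name, glob))
    if regex is not None:
        checks.append(re.search(regex, name) is not None)
    # all() of an empty list is True: no criteria means nothing is filtered out.
    return all(checks)


print(is_selected("sales_2024.csv", prefix="sales_", suffix=".csv"))   # True
print(is_selected("sales_2024.json", prefix="sales_", suffix=".csv"))  # False
```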

FileTimeFromFilenameSelection

Option to extract timestamps from filenames and include files based on those timestamps.

Property	Type	Required	Description
after	string	No	Include files created or modified after this date.
before	string	No	Include files created or modified before this date.
on_or_after	string	No	Include files created or modified on or after this date.
on_or_before	string	No	Include files created or modified on or before this date.
since		No	Include files created or modified within the specified time delta from now.
pattern	string	Yes	Pattern to extract timestamps from filenames, e.g. r'year=(?P<year>\d{4})/month=(?P<month>\d{2})/day=(?P<day>\d{2})'

FileTimeSelection

Defines time-based selection criteria for including files, based on creation or modification dates.

Property	Type	Required	Description
after	string	No	Include files created or modified after this date.
before	string	No	Include files created or modified before this date.
on_or_after	string	No	Include files created or modified on or after this date.
on_or_before	string	No	Include files created or modified on or before this date.
since		No	Include files created or modified within the specified time delta from now.
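
The four date bounds differ only in whether the boundary itself is included, and since is measured back from the current time. The sketch below spells this out with datetime values. It assumes that all bounds supplied must hold together and that dates have already been parsed from their string form; in_window is a hypothetical helper, not part of the Component.

```python
from datetime import datetime, timedelta


def in_window(ts, *, after=None, before=None, on_or_after=None,
              on_or_before=None, since=None, now=None):
    """Return True if timestamp `ts` satisfies every bound that was given."""
    if after is not None and not ts > after:
        return False  # exclusive lower bound
    if before is not None and not ts < before:
        return False  # exclusive upper bound
    if on_or_after is not None and not ts >= on_or_after:
        return False  # inclusive lower bound
    if on_or_before is not None and not ts <= on_or_before:
        return False  # inclusive upper bound
    if since is not None:
        reference = now if now is not None else datetime.now()
        if ts < reference - since:
            return False  # older than the allowed time delta
    return True


ts = datetime(2024, 1, 15)
print(in_window(ts, after=datetime(2024, 1, 15)))        # False
print(in_window(ts, on_or_after=datetime(2024, 1, 15)))  # True
print(in_window(ts, since=timedelta(days=7), now=datetime(2024, 1, 20)))  # True
```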

TimeDelta

Property	Type	Required	Description
seconds	integer	No	Number of seconds.
minutes	integer	No	Number of minutes.
hours	integer	No	Number of hours.
days	integer	No	Number of days.
weeks	integer	No	Number of weeks.
months	integer	No	Number of months.
years	integer	No	Number of years.