Version: 3.0.0

Generic File Read Component

Component for reading files from a filesystem.

GenericFileReadComponent

GenericFileReadComponent is defined beneath the following ancestor nodes in the YAML structure: Component, ReadComponent.

Below are the properties for the GenericFileReadComponent. Each property links to the specific details section further down in this page.

| Property | Default | Type | Required | Description |
| --- | --- | --- | --- | --- |
| connection | | string | No | The name of the connection to use for reading data. |
| columns | | array[ComponentColumn] | No | A list specifying the columns to read from the source and transformations to make during read. |
| normalize | | boolean | No | A boolean flag indicating if the output column names should be normalized to a standard naming convention after reading. |
| preserve_case | | boolean | No | A boolean flag indicating if the case of the column names should be preserved after reading. |
| uppercase | | boolean | No | A boolean flag indicating if the column names should be transformed to uppercase after reading. |
| materialization | | PartitionMaterialization | No | Resource for data materialization during the read process. |
| generic_file | | FileReadOptionsBase | Yes | Options for reading files from a filesystem. |
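
As a rough illustration of how these properties nest, the sketch below places a GenericFileReadComponent under the component and read ancestor nodes. The connection name `my_filesystem` and the path are placeholders, and the nesting is inferred from the tables on this page rather than copied from a verified project file.

```yaml
# Hypothetical example: the connection name and path are placeholders.
component:
  read:
    connection: my_filesystem   # assumed name of an existing connection
    normalize: true             # normalize output column names after reading
    generic_file:
      path: landing/orders/     # relative to the connection's root directory
      parser: auto              # the documented default
```

Only generic_file (and, within it, path) is marked as required; the other keys are shown to indicate where they would sit.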

Property Details

Component

A component is a fundamental building block of a data flow. Types of components that are supported include: read, transform, task, test, and more.

| Property | Default | Type | Required | Description |
| --- | --- | --- | --- | --- |
| component | | One of: ReadComponent, TransformComponent, TaskComponent, SingularTestComponent, CustomPythonReadComponent, WriteComponent, CompoundComponent, AliasedTableComponent, ExternalTableComponent | Yes | Configuration options for the component. |

ReadComponent

A component that reads data from a data system.

| Property | Default | Type | Required | Description |
| --- | --- | --- | --- | --- |
| data_plane | | One of: SnowflakeDataPlane, BigQueryDataPlane, DuckdbDataPlane, SynapseDataPlane | No | Data Plane-specific configuration options for a component. |
| name | | string | No | The name of the model. |
| description | | string | No | A brief description of what the model does. |
| metadata | | ResourceMetadata | No | Meta information of a resource. In most cases it doesn't affect the system behavior but may be helpful to analyze project resources. |
| flow_name | | string | No | The name of the flow that the component belongs to. |
| skip | | boolean | No | A boolean flag indicating whether to skip processing for the component or not. |
| data_maintenance | | DataMaintenance | No | The data maintenance configuration options for the component. |
| tests | | ComponentTestColumn | No | Defines tests to run on the data of this component. |
| read | | One of: GenericFileReadComponent, LocalFileReadComponent, S3ReadComponent, GcsReadComponent, AbfsReadComponent, HttpReadComponent, MSSQLReadComponent, MySQLReadComponent, OracleReadComponent, PostgresReadComponent, SnowflakeReadComponent, BigQueryReadComponent | Yes | The read component that reads data from a data system. |

FileReadOptionsBase

Options for locating and parsing files from a specified directory or file path, including file selection criteria and parser to use.

| Property | Default | Type | Required | Description |
| --- | --- | --- | --- | --- |
| path | | string | Yes | Path to the directory or file to read. Path is relative to the connection's root directory, and cannot be an absolute path or traverse outside the root directory. |
| exclude | | array[FileSelection] | No | List of conditions to exclude specific files from being processed. |
| include | | array[FileSelection] | No | List of conditions to include specific files for processing. |
| parser | auto | One of: string ("auto") or AutoParser; string ("avro") or AvroParser; string ("feather") or FeatherParser; string ("orc") or OrcParser; string ("parquet") or ParquetParser; string ("pickle") or PickleParser; string ("text") or TextParser; string ("xml") or XmlParser; string ("csv") or CsvParser; string ("excel") or ExcelParser; string ("json") or JsonParser | No | Parser Resource for reading the files. Defaults to 'auto'. To set specific parser options, use the parser name as a child object. |
| archive | | Any of: TarArchive, ZipArchive | No | Configuration for archive files for processing. |
| load_strategy | | LoadStrategy | No | Strategy for loading files, including limits on number and size of files. |
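
The parser property accepts either a bare parser name or an object keyed by the parser name. The two forms might look like the sketch below, written as two separate YAML documents; the paths are placeholders and only the keys shown in the tables on this page are used.

```yaml
# Form 1: parser given as a plain string, with no parser options.
generic_file:
  path: exports/            # placeholder path
  parser: parquet
---
# Form 2: the parser name used as a child object to carry options.
generic_file:
  path: exports/            # placeholder path
  parser:
    csv:
      sep: ";"              # a CsvParserOptions property
```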

AutoParser

Parser that automatically detects the file format and settings for parsing.

| Property | Default | Type | Required | Description |
| --- | --- | --- | --- | --- |
| auto | | NoParserOptions | Yes | Options for automatically detecting the file format. None need to be specified. |

AvroParser

Parser for the Avro file format.

| Property | Default | Type | Required | Description |
| --- | --- | --- | --- | --- |
| avro | | NoParserOptions | Yes | Options for parsing Avro files. |

CsvParser

Parser for CSV files.

| Property | Default | Type | Required | Description |
| --- | --- | --- | --- | --- |
| csv | | CsvParserOptions | Yes | Options for parsing CSV files. |

CsvParserOptions

Options for parsing CSV files, including separators, header presence, and multi-line values.

| Property | Default | Type | Required | Description |
| --- | --- | --- | --- | --- |
| all_varchar | False | boolean | No | Option to skip type detection for CSV parsing and assume all columns to be of type VARCHAR. |
| allow_quoted_nulls | True | boolean | No | Option to allow the conversion of quoted values to NULL values. |
| auto_detect | True | boolean | No | Enables auto detection of CSV parameters. |
| auto_type_candidates | | array[string] | No | Types that the sniffer will use when detecting CSV column types. VARCHAR is always included as a fallback. |
| columns | | object | No | A struct specifying the column names and types. Using this option implies that auto detection is not used. |
| compression | auto | string | No | The compression type for the file, detected automatically from the file extension. |
| dateformat | | string | No | The date format to use when parsing dates. |
| decimal_separator | `.` | string | No | The decimal separator of numbers. |
| escape | `"` | string | No | The string that should appear before a data character sequence that matches the quote value. |
| force_not_null | | array[string] | No | Do not match the specified columns' values against the NULL string. |
| header | False | boolean | No | Specifies that the file contains a header line with the names of each column. |
| hive_partitioning | False | boolean | No | Whether to interpret the path as a Hive partitioned path. |
| ignore_errors | False | boolean | No | Option to ignore any parsing errors encountered. |
| max_line_size | 2097152 | integer | No | The maximum line size in bytes. |
| names | | array[string] | No | The column names as a list. |
| new_line | | string | No | Set the new line character(s) in the file. |
| normalize_names | False | boolean | No | Whether column names should be normalized, removing non-alphanumeric characters. |
| null_padding | False | boolean | No | Pad remaining columns on the right with null values if a row lacks columns. |
| nullstr | | Any of: string, array[string] | No | The string or strings that represent a NULL value. |
| parallel | True | boolean | No | Whether the parallel CSV reader is used. |
| quote | `"` | string | No | The quoting string to be used when a data value is quoted. |
| sample_size | 20480 | integer | No | The number of sample rows for auto detection of parameters. |
| sep | `,` | string | No | Single-character string to use as the column separator. Default is ','. |
| delim | `,` | string | No | Single-character string to use as the column separator. Default is ','. |
| skip | 0 | integer | No | The number of lines at the top of the file to skip. |
| timestampformat | | string | No | The date format to use when parsing timestamps. |
| types | | Any of: array[string], object | No | The column types as either a list (by position) or a struct (by name). |
| dtypes | | Any of: array[string], object | No | The column types as either a list (by position) or a struct (by name). |
| union_by_name | False | boolean | No | Whether the columns of multiple schemas should be unified by name, rather than by position. Increases memory consumption. |
| date_format | | string | No | Format string for parsing dates, following Python's strftime directives, e.g., '%Y-%m-%d'. |
| timestamp_format | | string | No | Format string for parsing timestamps, following Python's strftime directives, e.g., '%Y-%m-%d %H:%M:%S'. |
| has_header | True | boolean | No | Indicates if the first row of the CSV is treated as the header row. Defaults to True. |
| has_multi_line_value | False | boolean | No | Indicates if the CSV contains values spanning multiple lines. Defaults to False. |
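
A CSV configuration that overrides a few of these options could be sketched as follows. The path and the chosen values are illustrative only; this page lists overlapping options (sep and delim, header and has_header) without stating which is preferred, so the sketch picks one of each pair arbitrarily.

```yaml
# Hypothetical example: path and option values are placeholders.
generic_file:
  path: exports/sales/
  parser:
    csv:
      has_header: true          # first row holds the column names
      sep: "|"                  # single-character column separator
      nullstr: ["", "NULL"]     # strings that represent a NULL value
      date_format: "%Y-%m-%d"   # Python strftime directives
      skip: 0                   # lines to skip at the top of the file
```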

ExcelParser

Parser for Excel files.

| Property | Default | Type | Required | Description |
| --- | --- | --- | --- | --- |
| excel | | ExcelParserOptions | Yes | Options for parsing Excel files. |

ExcelParserOptions

Parsing options for Excel files, including which columns should be parsed as dates and the sheet to be used.

| Property | Default | Type | Required | Description |
| --- | --- | --- | --- | --- |
| date_format | | string | No | Format string for parsing dates, following Python's strftime directives, e.g., '%Y-%m-%d'. |
| timestamp_format | | string | No | Format string for parsing timestamps, following Python's strftime directives, e.g., '%Y-%m-%d %H:%M:%S'. |
| date_columns | | array[string] | No | List of column names that should be parsed as dates. |
| timestamp_columns | | array[string] | No | List of column names that should be parsed as timestamps. |
| sheet_name | | string | No | Name of the Excel sheet to parse. If not specified, the first sheet is used by default. |
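
For an Excel workbook, the options above might be combined as in this sketch. The file path, sheet name, and column names are hypothetical.

```yaml
# Hypothetical example: file, sheet, and column names are placeholders.
generic_file:
  path: reports/budget.xlsx
  parser:
    excel:
      sheet_name: Q1                    # omit to read the first sheet
      date_columns: [invoice_date]      # parsed using date_format
      date_format: "%d/%m/%Y"
      timestamp_columns: [updated_at]   # parsed using timestamp_format
      timestamp_format: "%Y-%m-%d %H:%M:%S"
```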

FeatherParser

Parser for the Feather file format.

| Property | Default | Type | Required | Description |
| --- | --- | --- | --- | --- |
| feather | | NoParserOptions | Yes | Options for parsing Feather files. |

JsonParser

Parser for JSON files.

| Property | Default | Type | Required | Description |
| --- | --- | --- | --- | --- |
| json | | JsonParserOptions | No | Options for parsing JSON files. |

JsonParserOptions

Parsing options for JSON files, particularly focusing on date and timestamp formats.

| Property | Default | Type | Required | Description |
| --- | --- | --- | --- | --- |
| columns | | object | No | A struct that specifies the key names and value types contained within the JSON file (e.g. {key1: 'INTEGER', key2: 'VARCHAR'}). |
| compression | auto_detect | string | No | The compression type for the file. Detected automatically from the file extension by default. Options include 'uncompressed', 'gzip', 'zstd', and 'auto_detect'. |
| convert_strings_to_integers | False | boolean | No | Whether strings representing integer values should be converted to a numerical type. Defaults to False. |
| dateformat | iso | string | No | Specifies the date format to use when parsing dates. Default format is 'iso'. |
| format | auto | string | No | Specifies the format of the JSON file. Can be one of ['auto', 'unstructured', 'newline_delimited', 'array']. Default is 'auto'. |
| hive_partitioning | False | boolean | No | Whether or not to interpret the path as a Hive partitioned path. Defaults to False. |
| maximum_depth | -1 | integer | No | Maximum nesting depth to which the automatic schema detection detects types. Set to -1 to fully detect nested JSON types. |
| maximum_object_size | 16777216 | integer | No | The maximum size of a JSON object in bytes. Default is 16777216 bytes. |
| records | auto | string | No | Determines how JSON objects are treated in parsing. Can be one of ['auto', 'true', 'false']. Default is 'auto'. |
| sample_size | 20480 | integer | No | Number of sample objects for automatic JSON type detection. Set to -1 to scan the entire input file. Default is 20480. |
| timestampformat | iso | string | No | Specifies the format to use when parsing timestamps. Default format is 'iso'. |
| union_by_name | False | boolean | No | Whether the schemas of multiple JSON files should be unified. Defaults to False. |
| date_format | | string | No | Format string for parsing dates, following Python's strftime directives, e.g., '%Y-%m-%d'. |
| timestamp_format | | string | No | Format string for parsing timestamps, following Python's strftime directives, e.g., '%Y-%m-%d %H:%M:%S'. |
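
A sketch of a JSON parser configuration for gzip-compressed, newline-delimited files follows. The path and the chosen values are illustrative; the option names and allowed values are the ones listed in the table above.

```yaml
# Hypothetical example: path and option values are placeholders.
generic_file:
  path: events/
  parser:
    json:
      format: newline_delimited   # one of auto, unstructured, newline_delimited, array
      compression: gzip           # default is auto_detect
      maximum_depth: 2            # -1 would detect nested types fully
      sample_size: -1             # scan the entire input file for type detection
```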

LoadStrategy

Specifications for loading files, including limits on number and size of files.

| Property | Default | Type | Required | Description |
| --- | --- | --- | --- | --- |
| max_files | | integer | No | Maximum number of files to read. |
| max_size | | Any of: integer, string | No | Maximum size of files to read. Can be an integer or a string that represents a byte size (e.g., '10MB', '1GB'). |
| download_threads | | integer | No | Number of threads to use for downloading files. |
| direct_load | | DirectLoadOptions | No | Override default direct load behavior. |

DirectLoadOptions

| Property | Default | Type | Required | Description |
| --- | --- | --- | --- | --- |
| enable | | boolean | No | A boolean flag indicating if direct load should be used if supported by the backend engine. Defaults to not using direct load. |
| ignore_unknown_values | True | boolean | No | Optional. Indicates if BigQuery should allow extra values that are not represented in the table schema. If true, the extra values are ignored. If false, records with extra columns are treated as bad records, and if there are too many bad records, an invalid error is returned in the job result. The default value is false. The sourceFormat property determines what BigQuery treats as an extra value. |
| max_staging_table_age_in_hours | | integer | No | For the BigQuery backend, the maximum age of the staging table in hours. If the staging table is older than this age, it will be dropped. If not set, defaults to 6 hours. |
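
The load limits and direct load overrides described in the LoadStrategy and DirectLoadOptions tables might be combined as in the sketch below. The path and all numeric values are placeholders chosen for illustration, and the direct_load keys only matter on backends that support direct load.

```yaml
# Hypothetical example: path and limits are placeholders.
generic_file:
  path: landing/
  load_strategy:
    max_files: 100          # maximum number of files to read
    max_size: 1GB           # integer, or a byte-size string such as '10MB'
    download_threads: 4
    direct_load:
      enable: true          # use direct load if the backend supports it
      max_staging_table_age_in_hours: 12   # BigQuery backend only
```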

OrcParser

Parser for the ORC file format.

| Property | Default | Type | Required | Description |
| --- | --- | --- | --- | --- |
| orc | | NoParserOptions | Yes | Options for parsing ORC files. |

ParquetParser

Parser for the Parquet file format.

| Property | Default | Type | Required | Description |
| --- | --- | --- | --- | --- |
| parquet | | NoParserOptions | Yes | Options for parsing Parquet files. |

PickleParser

Parser for the Pickle file format. Pickle files are generated by Python's pickle module.

| Property | Default | Type | Required | Description |
| --- | --- | --- | --- | --- |
| pickle | | NoParserOptions | Yes | Options for parsing Pickle files. |

ComponentColumn

Component column expression definition.

No properties defined.

TarArchive

Configuration options for tar archive files.

| Property | Default | Type | Required | Description |
| --- | --- | --- | --- | --- |
| tar | | FileArchiveOptions | No | Configuration for tar archive files. |

TextParser

Parser for text files.

| Property | Default | Type | Required | Description |
| --- | --- | --- | --- | --- |
| text | | NoParserOptions | Yes | Options for parsing text files. |

PartitionMaterialization

Container for options that control how data is materialized and stored in the partitioned case.

| Property | Default | Type | Required | Description |
| --- | --- | --- | --- | --- |
| partitioned | | PartitionedOptions | No | Field for options for partitioning data. |

PartitionedOptions

Options for partitioning data.

| Property | Default | Type | Required | Description |
| --- | --- | --- | --- | --- |
| enable_substitution_by_partition_name | | boolean | Yes | Enable substitution by partition name. |
| on_schema_change | | string ("ignore", "fail", "append_new_columns", "sync_all_columns") | No | Policy to apply when schema changes are detected. |
| output_type | table | string ("table", "view") | No | Output type for partitioned data. Must be either 'table' or 'view'. |
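
Because materialization is a property of GenericFileReadComponent itself, it would sit alongside generic_file rather than inside it. A sketch under that reading, with a placeholder connection name and path:

```yaml
# Hypothetical example: the connection name and path are placeholders.
component:
  read:
    connection: my_filesystem
    materialization:
      partitioned:
        enable_substitution_by_partition_name: true   # required
        on_schema_change: append_new_columns          # or ignore, fail, sync_all_columns
        output_type: table                            # or view
    generic_file:
      path: landing/orders/
```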

XmlParser

Parser for XML files.

| Property | Default | Type | Required | Description |
| --- | --- | --- | --- | --- |
| xml | | NoParserOptions | Yes | Options for parsing XML files. |

NoParserOptions

No custom parsing options exist for this parser.

No properties defined.

ZipArchive

Configuration options for ZIP archive files.

| Property | Default | Type | Required | Description |
| --- | --- | --- | --- | --- |
| zip | | FileArchiveOptions | No | Configuration for ZIP archive files. |

FileArchiveOptions

Options for working with files inside of the file structure of an archive file format.

| Property | Default | Type | Required | Description |
| --- | --- | --- | --- | --- |
| path | | string | No | Path to the directory in the archive containing files that should be processed. |
| include | | array[FileSelection] | No | List of conditions to include specific files in the archive. |
| exclude | | array[FileSelection] | No | List of conditions to exclude specific files in the archive. |
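
Reading CSV files out of a ZIP archive might be configured as sketched below. The archive path and the directory inside it are hypothetical, and how the outer path interacts with the archive options is an assumption not spelled out on this page.

```yaml
# Hypothetical example: archive name and inner directory are placeholders.
generic_file:
  path: archives/export.zip   # assumed: the archive file itself
  archive:
    zip:
      path: data/             # directory inside the archive
      include:
        - suffix: .csv        # only process CSV members
  parser: csv
```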

FileSelection

Options for selecting files based on various criteria. All criteria specified must be met for a file to be included.

| Property | Default | Type | Required | Description |
| --- | --- | --- | --- | --- |
| prefix | | string | No | Select only files with this prefix. Strongly recommended when parser is not set to 'auto'. |
| suffix | | string | No | Select only files with this suffix. Strongly recommended when parser is not set to 'auto'. |
| created_at | | FileTimeSelection | No | Include files created within the specified date range. |
| modified_at | | FileTimeSelection | No | Include files modified within the specified date range. |
| timestamp_from_filename | | FileTimeFromFilenameSelection | No | Extract and include files based on timestamps parsed from filenames. |
| glob | | string | No | Glob pattern for including files. |
| regex | | string | No | Regular expression pattern for including files. |
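
Since all criteria inside one FileSelection entry must be met, a single include entry with both prefix and suffix narrows the match to files satisfying both. The sketch below uses placeholder file names; whether prefix applies to the file name or to the full relative path is not stated on this page.

```yaml
# Hypothetical example: path and name patterns are placeholders.
generic_file:
  path: landing/
  include:
    - prefix: orders_         # both criteria in this entry must match
      suffix: .csv
  exclude:
    - glob: "*_draft.csv"     # skip draft exports
  parser: csv
```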

FileTimeFromFilenameSelection

Option to extract timestamps from filenames and include files based on those timestamps.

| Property | Default | Type | Required | Description |
| --- | --- | --- | --- | --- |
| after | | string | No | Include files created or modified after this date. |
| before | | string | No | Include files created or modified before this date. |
| on_or_after | | string | No | Include files created or modified on or after this date. |
| on_or_before | | string | No | Include files created or modified on or before this date. |
| pattern | | string | Yes | Pattern to extract timestamps from filenames, e.g. `r'year=(?P<year>\d{4})/month=(?P<month>\d{2})/day=(?P<day>\d{2})'` |
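
For files laid out in date-named directories such as year=2024/month=01/day=15/, a timestamp_from_filename selection could be sketched as follows. The named groups follow the example pattern in the table, written with \d{4} and \d{2} quantifiers; the date string format for on_or_after is an assumption, since this page only states that the value is a string.

```yaml
# Hypothetical example: path and date value are placeholders.
generic_file:
  path: landing/
  include:
    - timestamp_from_filename:
        # Single quotes keep the backslashes literal in YAML.
        pattern: 'year=(?P<year>\d{4})/month=(?P<month>\d{2})/day=(?P<day>\d{2})'
        on_or_after: "2024-01-01"   # assumed ISO-style date string
```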

FileTimeSelection

Defines time-based selection criteria for including files, based on creation or modification dates.

| Property | Default | Type | Required | Description |
| --- | --- | --- | --- | --- |
| after | | string | No | Include files created or modified after this date. |
| before | | string | No | Include files created or modified before this date. |
| on_or_after | | string | No | Include files created or modified on or after this date. |
| on_or_before | | string | No | Include files created or modified on or before this date. |
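
A time window on modification dates might be expressed by pairing a lower and an upper bound, as in this sketch. The date strings are placeholders in an assumed ISO-style format; the accepted formats are not listed on this page.

```yaml
# Hypothetical example: path and dates are placeholders.
generic_file:
  path: landing/
  include:
    - modified_at:
        on_or_after: "2024-01-01"   # inclusive lower bound
        before: "2024-02-01"        # exclusive upper bound
```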