Version: 3.0.0

ABFS Read Component

Component for reading files from an Azure Blob Storage container.

Examples

```yaml
component:
  read:
    abfs:
      path: /path/to/directory
      include:
        - suffix: .csv
      parser:
        csv:
          has_header: true
    connection: my-abfs-connection
```

AbfsReadComponent

info

AbfsReadComponent is defined beneath the following ancestor nodes in the YAML structure: `component` → `read` (as shown in the example above).

Below are the properties for the AbfsReadComponent. Each property links to the specific details section further down in this page.

| Property | Default | Type | Required | Description |
|---|---|---|---|---|
| event_time | | string | No | Timestamp column in the component output used to represent event time. |
| connection | | string | No | The name of the connection to use for reading data. |
| columns | | array[ComponentColumn] | No | A list specifying the columns to read from the source and transformations to make during read. |
| normalize | | boolean | No | A boolean flag indicating if the output column names should be normalized to a standard naming convention after reading. |
| preserve_case | | boolean | No | A boolean flag indicating if the case of the column names should be preserved after reading. |
| uppercase | | boolean | No | A boolean flag indicating if the column names should be transformed to uppercase after reading. |
| strategy | | PartitionedStrategy | No | Ingest strategy when reading files. |
| abfs | | FileReadOptionsBase | Yes | Options for reading files from an Azure Blob Storage container. |

Property Details

Component

A component is a fundamental building block of a data flow. Types of components that are supported include: read, transform, task, test, and more.

| Property | Default | Type | Required | Description |
|---|---|---|---|---|
| component | | One of: ReadComponent, TransformComponent, TaskComponent, SingularTestComponent, CustomPythonReadComponent, WriteComponent, CompoundComponent, AliasedTableComponent, ExternalTableComponent | Yes | Configuration options for the component. |

ReadComponent

A component that reads data from a data system.

| Property | Default | Type | Required | Description |
|---|---|---|---|---|
| data_plane | | One of: SnowflakeDataPlane, BigQueryDataPlane, DuckdbDataPlane, SynapseDataPlane, FabricDataPlane, DatabricksDataPlane | No | Data Plane-specific configuration options for a component. |
| description | | string | No | A brief description of what the model does. |
| metadata | | ResourceMetadata | No | Meta information of a resource. In most cases it doesn't affect the system behavior but may be helpful to analyze project resources. |
| name | | string | Yes | The name of the model. |
| flow_name | | string | No | The name of the flow that the component belongs to. |
| skip | | boolean | No | A boolean flag indicating whether to skip processing for the component or not. |
| data_maintenance | | DataMaintenance | No | The data maintenance configuration options for the component. |
| tests | | ComponentTestOptions | No | Defines tests to run on the data of this component. |
| read | | One of: GenericFileReadComponent, LocalFileReadComponent, SFTPReadComponent, S3ReadComponent, GcsReadComponent, AbfsReadComponent, HttpReadComponent, MSSQLReadComponent, MySQLReadComponent, OracleReadComponent, PostgresReadComponent, SnowflakeReadComponent, BigQueryReadComponent, DatabricksReadComponent | Yes | The read component that reads data from a data system. |

FileReadOptionsBase

Options for locating and parsing files from a specified directory or file path, including file selection criteria and parser to use.

| Property | Default | Type | Required | Description |
|---|---|---|---|---|
| path | | string | Yes | Path to the directory or file to read. Path is relative to the connection's root directory, and cannot be an absolute path or traverse outside the root directory. |
| exclude | | array[FileSelection] | No | List of conditions to exclude specific files from being processed. |
| include | | array[FileSelection] | No | List of conditions to include specific files for processing. |
| parser | auto | One of: AutoParser, AvroParser, FeatherParser, OrcParser, ParquetParser, PickleParser, TextParser, XmlParser, CsvParser, ExcelParser, JsonParser (each may also be given as a plain string) | No | Parser Resource for reading the files. Defaults to 'auto'. To set specific parser options, use the parser name as a child object. |
| archive | | Any of: TarArchive, ZipArchive | No | Configuration for archive files for processing. |
| load_strategy | | LoadStrategy | No | Strategy for loading files, including limits on number and size of files. |
| time_based_file_selection | | Any of: string, FilenameTimestampPattern | No | Method to use for file selection based on a time window. |
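To show how these options combine, here is a hypothetical sketch of an ABFS read using file selection, a parser, and a load strategy (the path, prefixes, and connection name are illustrative, not taken from a real project):

```yaml
component:
  read:
    abfs:
      path: exports/daily        # relative to the connection's root directory
      include:
        - suffix: .csv           # only CSV files
      exclude:
        - prefix: _tmp           # skip temporary files
      parser:
        csv:
          has_header: true
      load_strategy:
        max_files: 100           # cap the number of files per run
    connection: my-abfs-connection
```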

AutoParser

Parser that automatically detects the file format and settings for parsing.

| Property | Default | Type | Required | Description |
|---|---|---|---|---|
| auto | | NoParserOptions | Yes | Options for automatically detecting the file format. None need to be specified. |

AvroParser

Parser for the Avro file format.

| Property | Default | Type | Required | Description |
|---|---|---|---|---|
| avro | | NoParserOptions | Yes | Options for parsing Avro files. |

CsvParser

Parser for CSV files.

| Property | Default | Type | Required | Description |
|---|---|---|---|---|
| csv | | CsvParserOptions | Yes | Options for parsing CSV files. |

CsvParserOptions

Options for parsing CSV files, including separators, header presence, and multi-line values.

| Property | Default | Type | Required | Description |
|---|---|---|---|---|
| all_varchar | | boolean | No | Option to skip type detection for CSV parsing and assume all columns to be of type VARCHAR. |
| allow_quoted_nulls | | boolean | No | Option to allow the conversion of quoted values to NULL values. |
| auto_type_candidates | | array[string] | No | Types that the sniffer will use when detecting CSV column types. VARCHAR is always included as a fallback. |
| buffer_size | | integer | No | The buffer size in bytes for the CSV reader. |
| columns | | object with property values of type string | No | A struct specifying the column names and types. Using this option implies that auto detection is not used. |
| compression | | string | No | The compression type for the file, detected automatically from the file extension. |
| dateformat | | string | No | The date format to use when parsing dates. |
| decimal_separator | | string | No | The decimal separator of numbers. |
| delim | | string | No | The delimiter of the CSV file. |
| types | | Any of: array[string], object with property values of type string | No | The column types as a struct (by name). |
| encoding | | string | No | The encoding of the CSV file. |
| escape | | string | No | The string that should appear before a data character sequence that matches the quote value. |
| force_not_null | | array[string] | No | Do not match the specified columns' values against the NULL string. |
| header | | Any of: boolean, integer | No | Specifies that the file contains a header line with the names of each column. |
| hive_partitioning | | boolean | No | Whether to interpret the path as a Hive partitioned path. |
| ignore_errors | | boolean | No | Option to ignore any parsing errors encountered. |
| new_line | | string | No | The line terminator of the CSV file. |
| max_line_size | | integer | No | The maximum line size in bytes. |
| nullstr | | Any of: string, array[string] | No | The string or strings that represent a NULL value. |
| names | | array[string] | No | The column names as a list. |
| normalize_names | | boolean | No | Whether column names should be normalized, removing non-alphanumeric characters. |
| null_padding | | boolean | No | Pad remaining columns on the right with null values if a row lacks columns. |
| parallel | | boolean | No | Whether the parallel CSV reader is used. |
| quote | | string | No | The quoting string to be used when a data value is quoted. |
| sample_size | | integer | No | The number of sample rows for auto detection of parameters. |
| sep | | string | No | Single-character string to use as the column separator. Default is ','. |
| skip | | integer | No | The number of lines at the top of the file to skip. |
| timestampformat | | string | No | The date format to use when parsing timestamps. |
| union_by_name | | boolean | No | Whether the columns of multiple schemas should be unified by name, rather than by position. Increases memory consumption. |
| has_header | True | boolean | No | Indicates if the first row of the CSV is treated as the header row. Defaults to True. |
| has_multi_line_value | False | boolean | No | Indicates if the CSV contains values spanning multiple lines. Defaults to False. |
| auto_detect | True | boolean | No | Enables auto detection of CSV parameters. |
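A hypothetical parser block for a pipe-delimited CSV with an explicit schema (column names and types here are invented for illustration; per the table above, supplying `columns` disables auto detection):

```yaml
parser:
  csv:
    delim: "|"
    has_header: true
    skip: 1                      # skip one leading comment line
    nullstr: ["NULL", "N/A"]     # strings treated as NULL
    columns:                     # explicit schema; auto detection is not used
      id: INTEGER
      name: VARCHAR
      amount: DOUBLE
```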

ExcelParser

Parser for Excel files.

| Property | Default | Type | Required | Description |
|---|---|---|---|---|
| excel | | Any of: DuckDBExcelParserOptions, PandasExcelParserOptions | Yes | Options for parsing Excel files. |

DuckDBExcelParserOptions

Parsing options for Excel files using DuckDB. Refer to the DuckDB documentation for more information here: https://duckdb.org/docs/guides/file_formats/excel_import.html

| Property | Default | Type | Required | Description |
|---|---|---|---|---|
| duckdb_options | | DuckDBExcelParserOptionsWrapper | Yes | Options passed to DuckDB for parsing Excel files. |

DuckDBExcelParserOptionsWrapper

Wrapper for the DuckDB Excel Parser Options.

| Property | Default | Type | Required | Description |
|---|---|---|---|---|
| layer | | string | No | The layer to use for parsing the Excel file. If not specified, the default layer is used. |
| open_options | | array[string] | No | Additional open_options to pass to the st_read function. |

FeatherParser

Parser for the Feather file format.

| Property | Default | Type | Required | Description |
|---|---|---|---|---|
| feather | | NoParserOptions | Yes | Options for parsing Feather files. |

FilenameTimestampPattern

Pattern to extract timestamps from file names for the purpose of time-based file selection.

| Property | Default | Type | Required | Description |
|---|---|---|---|---|
| pattern | | string | Yes | Pattern to extract timestamps from filenames, e.g. r'year=(?P<year>\d{4})/month=(?P<month>\d{2})/day=(?P<day>\d{2})'. |
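A hypothetical use of this pattern for time-based file selection (the directory layout is illustrative; it assumes `time_based_file_selection` accepts a FilenameTimestampPattern object, as the FileReadOptionsBase table indicates):

```yaml
abfs:
  path: events
  time_based_file_selection:
    # named groups year/month/day are extracted from each filename
    pattern: 'year=(?P<year>\d{4})/month=(?P<month>\d{2})/day=(?P<day>\d{2})'
```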

JsonParser

Parser for JSON files.

| Property | Default | Type | Required | Description |
|---|---|---|---|---|
| json | | JsonParserOptions | No | Options for parsing JSON files. |

JsonParserOptions

Parsing options for JSON files, particularly focusing on date and timestamp formats.

| Property | Default | Type | Required | Description |
|---|---|---|---|---|
| columns | | object with property values of type string | No | A struct that specifies the key names and value types contained within the JSON file (e.g. {key1: 'INTEGER', key2: 'VARCHAR'}). |
| compression | | string | No | The compression type for the file. Detected automatically from the file extension by default. Options include 'uncompressed', 'gzip', 'zstd', and 'auto_detect'. |
| convert_strings_to_integers | | boolean | No | Whether strings representing integer values should be converted to a numerical type. Defaults to False. |
| dateformat | | string | No | Specifies the date format to use when parsing dates. DuckDB's default format is 'iso'. |
| format | | string | No | Specifies the format of the JSON file. Can be one of ['auto', 'unstructured', 'newline_delimited', 'array']. |
| hive_partitioning | | boolean | No | Whether or not to interpret the path as a Hive partitioned path. Defaults to the DuckDB default. |
| ignore_errors | | boolean | No | Whether to ignore any parsing errors encountered. Defaults to False. |
| map_inference_threshold | | integer | No | The maximum number of elements in a map to use map inference. Set to -1 to use map inference for all maps. |
| maximum_depth | | integer | No | Maximum nesting depth to which the automatic schema detection detects types. |
| maximum_object_size | | integer | No | The maximum size of a JSON object in bytes. |
| records | | string | No | Determines how JSON objects are treated in parsing. Can be one of ['auto', 'true', 'false']. |
| sample_size | | integer | No | Number of sample objects for automatic JSON type detection. Set to -1 to scan the entire input file. |
| timestampformat | | string | No | Specifies the date format to use when parsing timestamps. |
| union_by_name | | boolean | No | Whether the schemas of multiple JSON files should be unified. |
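A hypothetical parser block for gzip-compressed, newline-delimited JSON with an explicit schema (key names and types are invented for illustration):

```yaml
parser:
  json:
    format: newline_delimited   # one JSON object per line
    compression: gzip
    ignore_errors: true         # skip malformed records rather than failing
    columns:
      event_id: VARCHAR
      count: INTEGER
```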

LoadStrategy

Specifications for loading files, including limits on number and size of files.

| Property | Default | Type | Required | Description |
|---|---|---|---|---|
| max_files | | integer | No | Maximum number of files to read. |
| max_size | | Any of: integer, string | No | Maximum size of files to read. Can be an integer or a string that represents a byte size (e.g., '10MB', '1GB'). |
| download_threads | | integer | No | Number of threads to use for downloading files. |
| direct_load | | DirectLoadOptions | No | Override default direct load behavior. |
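A hypothetical load strategy that caps each run's intake and enables direct load (the specific limits are illustrative):

```yaml
load_strategy:
  max_files: 500          # stop after 500 files per run
  max_size: "1GB"         # or an integer number of bytes
  download_threads: 8
  direct_load:
    enable: true          # only takes effect if the backend engine supports it
```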

DirectLoadOptions

| Property | Default | Type | Required | Description |
|---|---|---|---|---|
| enable | | boolean | No | A boolean flag indicating if direct load should be used when supported by the backend engine. Defaults to not using direct load. |
| ignore_unknown_values | True | boolean | No | Optional. Indicates if BigQuery should allow extra values that are not represented in the table schema. If true, the extra values are ignored. If false, records with extra columns are treated as bad records, and if there are too many bad records, an invalid error is returned in the job result. The sourceFormat property determines what BigQuery treats as an extra value. |
| max_staging_table_age_in_hours | | integer | No | For the BigQuery backend, the maximum age of the staging table in hours. If the staging table is older than this age, it will be dropped. If not set, defaults to 6 hours. |
| nested_fields_as_string | | boolean | No | For the BigQuery backend, if true, nested fields will be converted to strings. If false, nested fields will be converted to JSON objects. |

OrcParser

Parser for the ORC file format.

| Property | Default | Type | Required | Description |
|---|---|---|---|---|
| orc | | NoParserOptions | Yes | Options for parsing ORC files. |

PandasExcelParserOptions

Parsing options for Excel files using pandas. Refer to the pandas documentation for more information here: https://pandas.pydata.org/docs/reference/api/pandas.read_excel.html

| Property | Default | Type | Required | Description |
|---|---|---|---|---|
| pandas_options | | PandasExcelParserOptionsWrapper | Yes | Options passed to pandas for parsing Excel files. |

PandasExcelParserOptionsWrapper

Wrapper for the Pandas Excel Parser Options.

| Property | Default | Type | Required | Description |
|---|---|---|---|---|
| date_format | | string | No | Format string for parsing dates, following Python's strftime directives, e.g., '%Y-%m-%d'. |
| date_columns | | array[string] | No | List of column names that should be parsed as dates. |
| timestamp_columns | | array[string] | No | List of column names that should be parsed as timestamps. |
| sheet_name | | string | No | Name of the Excel sheet to parse. If not specified, the first sheet is used by default. |
| header | | Any of: integer, array[integer] | No | Row number(s) to use as the column names, and the start of the data. If a list of integers is passed, the values in the list indicate the row number(s) to use as the column names, and the start of the data. |
| names | | array[string] | No | List of column names to use. If not specified, and header is True, then the first row of the data is used. If a list is passed, it must match the length of the data. |
| index_col | | Any of: integer, array[integer] | No | Column number(s) to set as the index (0-based). If a list is passed, the values in the list indicate the column number(s) to set as the index. |
| usecols | | Any of: string, array[string] | No | If not specified, parse all columns. If a string, a comma-separated list of Excel column letters and column ranges (e.g. 'A:E' or 'A,C,E:F'); ranges are inclusive of both sides. If a list of strings, the column names to be parsed. |
| dtype | | object with property values of type string | No | A struct that specifies the column names and value types (e.g. {col1: 'INTEGER', col2: 'VARCHAR'}). |
| skiprows | | Any of: integer, array[integer] | No | Row number(s) to skip (0-based) before reading the data. If a list is passed, the values in the list indicate the row number(s) to skip. |
| nrows | | integer | No | Number of rows to read. If not specified, all rows are read. |
| na_values | | Any of: string, array[string] | No | String or list of strings to recognize as NA/NaN. If a list is passed, the values in the list indicate the strings to recognize as NA/NaN. |
| keep_default_na | | boolean | No | Whether to include the default NaN values when parsing the data. If True, the default NaN values are included. If False, they are not. |
| na_filter | | boolean | No | Whether to filter out rows with NA values. If True, rows with NA values are filtered out. If False, rows with NA values are included. |
| parse_dates | | Any of: boolean, array[boolean] | No | Whether to parse dates. If a list is passed, the values in the list indicate whether the corresponding columns should be parsed as dates. |
| skipfooter | | integer | No | Number of lines at the bottom of the file to skip. |
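A hypothetical pandas-based Excel parser block (the sheet and column names are invented for illustration):

```yaml
parser:
  excel:
    pandas_options:
      sheet_name: Orders
      header: 0                 # first row holds column names
      usecols: "A:E"            # Excel column range, inclusive
      na_values: ["N/A", "-"]
      date_columns:
        - order_date
```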

ParquetParser

Parser for the Parquet file format.

| Property | Default | Type | Required | Description |
|---|---|---|---|---|
| parquet | | NoParserOptions | Yes | Options for parsing Parquet files. |

PickleParser

Parser for the Pickle file format. Pickle files are generated by Python's pickle module.

| Property | Default | Type | Required | Description |
|---|---|---|---|---|
| pickle | | NoParserOptions | Yes | Options for parsing Pickle files. |

ComponentColumn

Component column expression definition.

No properties defined.

TarArchive

Configuration options for tar archive files.

| Property | Default | Type | Required | Description |
|---|---|---|---|---|
| tar | | FileArchiveOptions | No | Configuration for tar archive files. |

TextParser

Parser for text files.

| Property | Default | Type | Required | Description |
|---|---|---|---|---|
| text | | NoParserOptions | Yes | Options for parsing text files. |

PartitionedStrategy

Partitioned Ingest Strategy. The user is expected to provide 2 functions, a list function that lists partitions in the source, and a read function that reads a partition from the source.

| Property | Default | Type | Required | Description |
|---|---|---|---|---|
| partitioned | | PartitionedOptions | No | Options for partitioning data. |
| on_schema_change | | string ("ignore", "fail", "append_new_columns", "sync_all_columns") | No | Policy to apply when schema changes are detected. Defaults to 'fail' if not provided. |
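A hypothetical `strategy` block for the read component, combining the two properties above (whether these particular values suit a given source depends on the project):

```yaml
strategy:
  partitioned:
    enable_substitution_by_partition_name: true
  on_schema_change: append_new_columns   # instead of the default 'fail'
```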

PartitionedOptions

Options related to partition optimization - in particular, the policy that determines which partitions to ingest.

| Property | Default | Type | Required | Description |
|---|---|---|---|---|
| enable_substitution_by_partition_name | | boolean | Yes | Enable substitution by partition name. |
| output_type | table | string ("table", "view") | No | Output type for partitioned data. Must be either 'table' or 'view'. This strategy applies only to Transforms. |

XmlParser

Parser for XML files.

| Property | Default | Type | Required | Description |
|---|---|---|---|---|
| xml | | NoParserOptions | Yes | Options for parsing XML files. |

NoParserOptions

No custom parsing options exist for this parser.

No properties defined.

ZipArchive

Configuration options for ZIP archive files.

| Property | Default | Type | Required | Description |
|---|---|---|---|---|
| zip | | FileArchiveOptions | No | Configuration for ZIP archive files. |

FileArchiveOptions

Options for working with files inside of the file structure of an archive file format.

| Property | Default | Type | Required | Description |
|---|---|---|---|---|
| path | | string | No | Path to the directory in the archive containing files that should be processed. |
| include | | array[FileSelection] | No | List of conditions to include specific files in the archive. |
| exclude | | array[FileSelection] | No | List of conditions to exclude specific files in the archive. |
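A hypothetical `archive` block selecting CSV files from a subdirectory inside each ZIP file (the inner path is invented for illustration):

```yaml
archive:
  zip:
    path: data           # directory inside the archive
    include:
      - suffix: .csv     # only CSV members of the archive
```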

FileSelection

Options for selecting files based on various criteria. All criteria specified must be met for a file to be included.

| Property | Default | Type | Required | Description |
|---|---|---|---|---|
| prefix | | string | No | Select only files with this prefix. Strongly recommended when parser is not set to 'auto'. |
| suffix | | string | No | Select only files with this suffix. Strongly recommended when parser is not set to 'auto'. |
| created_at | | FileTimeSelection | No | Include files created within the specified date range. |
| modified_at | | FileTimeSelection | No | Include files modified within the specified date range. |
| timestamp_from_filename | | FileTimeFromFilenameSelection | No | Extract and include files based on timestamps parsed from filenames. |
| glob | | string | No | Glob pattern for including files. |
| regex | | string | No | Regular expression pattern for including files. |
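A hypothetical selection block (paths and patterns are invented). Note that criteria within a single list entry must all match, while separate entries are independent conditions:

```yaml
include:
  - glob: "reports/**/*.parquet"   # condition 1: any Parquet file under reports/
  - prefix: daily_                 # condition 2: prefix AND suffix must both match
    suffix: .csv
exclude:
  - regex: ".*_draft.*"            # drop draft files regardless of other matches
```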

FileTimeFromFilenameSelection

Option to extract timestamps from filenames and include files based on those timestamps.

| Property | Default | Type | Required | Description |
|---|---|---|---|---|
| after | | string | No | Include files created or modified after this date. |
| before | | string | No | Include files created or modified before this date. |
| on_or_after | | string | No | Include files created or modified on or after this date. |
| on_or_before | | string | No | Include files created or modified on or before this date. |
| since | | TimeDelta | No | Include files created or modified within the specified time delta from now. |
| pattern | | string | Yes | Pattern to extract timestamps from filenames, e.g. r'year=(?P<year>\d{4})/month=(?P<month>\d{2})/day=(?P<day>\d{2})'. |

FileTimeSelection

Defines time-based selection criteria for including files, based on creation or modification dates.

| Property | Default | Type | Required | Description |
|---|---|---|---|---|
| after | | string | No | Include files created or modified after this date. |
| before | | string | No | Include files created or modified before this date. |
| on_or_after | | string | No | Include files created or modified on or after this date. |
| on_or_before | | string | No | Include files created or modified on or before this date. |
| since | | TimeDelta | No | Include files created or modified within the specified time delta from now. |
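A hypothetical date-bounded selection on modification time (the dates are illustrative):

```yaml
include:
  - modified_at:
      on_or_after: "2024-01-01"   # inclusive lower bound
      before: "2024-07-01"        # exclusive upper bound
```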

TimeDelta

| Property | Default | Type | Required | Description |
|---|---|---|---|---|
| seconds | 0 | integer | No | The number of seconds. |
| minutes | 0 | integer | No | The number of minutes. |
| hours | 0 | integer | No | The number of hours. |
| days | 0 | integer | No | The number of days. |
| weeks | 0 | integer | No | The number of weeks. |
| months | 0 | integer | No | The number of months. |
| years | 0 | integer | No | The number of years. |
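A hypothetical rolling-window selection using a TimeDelta with `since` (the window length is illustrative; unset fields default to 0):

```yaml
include:
  - created_at:
      since:
        days: 7      # files created in the last 7 days and 12 hours
        hours: 12
```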