S3 Read Component

Component for reading files from an S3 bucket.

Examples

read_parquet_s3.yaml

component:
  read:
    connection: my-s3-connection
    s3:
      path: my/parquet/files/
      parser: parquet

s3_read_csv_with_header.yaml

component:
  read:
    connection: my-s3-connection
    s3:
      path: my/csv/files/
      parser:
        csv:
          has_header: true

s3_read_avro_config.yaml

component:
  description: Component to read Avro files from S3.
  read:
    connection: my_s3_connection
    s3:
      include:
      - suffix: .avro
      parser: avro
      path: path/to/your/avro/files

S3ReadComponent

S3ReadComponent is defined beneath the following ancestor nodes in the YAML structure:

Component
ReadComponent

Below are the properties of the S3ReadComponent. Each property is described in more detail in the Property Details section further down this page.

Property	Type	Required	Description
dependencies	array[None]	No	List of dependencies that must complete before this component runs.
event_time	string	No	Timestamp column in the component output used to represent event time.
connection	string	No	The name of the connection to use for reading data.
columns	array[None]	No	A list specifying the columns to read from the source and transformations to make during read.
normalize	boolean	No	A boolean flag indicating if the output column names should be normalized to a standard naming convention after reading.
preserve_case	boolean	No	A boolean flag indicating if the case of the column names should be preserved after reading.
uppercase	boolean	No	A boolean flag indicating if the column names should be transformed to uppercase after reading.
strategy	PartitionedStrategy	No	Ingest strategy when reading files.
s3		Yes	Options for reading files from an S3 bucket.
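
As a rough illustration of how the read-level properties in the table above sit next to the required s3 block, the sketch below sets two of them. The nesting follows the examples at the top of this page; the column name ingested_at is hypothetical and only stands in for a timestamp column in your own data.

```yaml
component:
  read:
    connection: my-s3-connection
    # Hypothetical timestamp column in the component output used as event time.
    event_time: ingested_at
    # Normalize output column names after reading.
    normalize: true
    s3:
      path: my/parquet/files/
      parser: parquet
```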

Property Details

Component

A component is a fundamental building block of a data flow. Types of components that are supported include: read, transform, task, test, and more.

Property	Default	Type	Required	Description
component		One of: CustomPythonReadComponent ApplicationComponent AliasedTableComponent ExternalTableComponent	Yes	Configuration options for the component.

ReadComponent

A component that reads data from a data system.

Property	Type	Required	Description
data_plane	One of: SnowflakeDataPlane BigQueryDataPlane DatabricksDataPlane	No	Data Plane-specific configuration options for a component.
skip	boolean	No	A boolean flag indicating whether to skip processing for the component or not.
retry_strategy		No	The retry strategy configuration options for the component if any exceptions are encountered.
description	string	No	A brief description of what the model does.
metadata		No	Meta information of a resource. In most cases it doesn't affect the system behavior but may be helpful to analyze project resources.
name	string	Yes	The name of the model.
flow_name	string	No	The name of the flow that the component belongs to.
data_maintenance		No	The data maintenance configuration options for the component.
tests		No	Defines tests to run on the data of this component.
read	One of: GenericFileReadComponent LocalFileReadComponent SFTPReadComponent S3ReadComponent GcsReadComponent AbfsReadComponent HttpReadComponent MSSQLReadComponent MySQLReadComponent OracleReadComponent PostgresReadComponent SnowflakeReadComponent BigQueryReadComponent DatabricksReadComponent	Yes	The read component that reads data from a data system.

FileReadOptionsBase

Options for locating and parsing files from a specified directory or file path, including file selection criteria and parser to use.

Property	Default	Type	Required	Description
path		string	Yes	Path to the directory or file to read. Path is relative to the connection's root directory, and cannot be an absolute path or traverse outside the root directory.
exclude		array[None]	No	List of conditions to exclude specific files from being processed.
include		array[None]	No	List of conditions to include specific files for processing.
parser	auto	One of: Any of: auto Any of: avro Any of: feather Any of: orc Any of: parquet Any of: pickle Any of: text Any of: xml Any of: csv Any of: excel Any of: json	No	Parser Resource for reading the files. Defaults to 'auto'. To set specific parser options, use the parser name as a child object.
archive		Any of:	No	Configuration for archive files for processing.
load_strategy			No	Strategy for loading files, including limits on number and size of files.
time_based_file_selection		Any of: last_modified	No	Method to use for file selection based on a time window.
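
A minimal sketch of combining include and exclude conditions under s3. It assumes that exclude entries take the same shape as the include entry shown in the Avro example at the top of this page; the tmp_ prefix is a hypothetical naming convention, and how a prefix is matched against the file path is not specified here.

```yaml
component:
  read:
    connection: my-s3-connection
    s3:
      path: my/csv/files/
      # Only consider files ending in .csv ...
      include:
        - suffix: .csv
      # ... and leave out files with a (hypothetical) temporary prefix.
      exclude:
        - prefix: tmp_
      parser:
        csv:
          has_header: true
```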

AutoParser

Parser that automatically detects the file format and settings for parsing.

Property	Default	Type	Required	Description
hive_partitioning	False	boolean	No	Whether to extract Hive partition values from the full file path and add them to the table as partition columns. When set to true, directory names that are key=value pairs, such as “/year=2009/month=11”, are parsed into the partition columns "year" and "month" with the values "2009" and "11"; the data set should not contain columns with the same names as the partition columns. When false, no partition columns are added. Defaults to false.
auto			Yes	Options for automatically detecting the file format. None need to be specified.

AvroParser

Parser for the Avro file format.

Property	Default	Type	Required	Description
hive_partitioning	False	boolean	No	Whether to extract Hive partition values from the full file path and add them to the table as partition columns. When set to true, directory names that are key=value pairs, such as “/year=2009/month=11”, are parsed into the partition columns "year" and "month" with the values "2009" and "11"; the data set should not contain columns with the same names as the partition columns. When false, no partition columns are added. Defaults to false.
avro			Yes	Options for parsing Avro files.

CsvParser

Parser for CSV files.

Property	Default	Type	Required	Description
hive_partitioning	False	boolean	No	Whether to extract Hive partition values from the full file path and add them to the table as partition columns. When set to true, directory names that are key=value pairs, such as “/year=2009/month=11”, are parsed into the partition columns "year" and "month" with the values "2009" and "11"; the data set should not contain columns with the same names as the partition columns. When false, no partition columns are added. Defaults to false.
csv			Yes	Options for parsing CSV files.

CsvParserOptions

Options for parsing CSV files, including separators, header presence, and multi-line values.

Property	Default	Type	Required	Description
all_varchar		boolean	No	Option to skip type detection for CSV parsing and assume all columns to be of type VARCHAR.
allow_quoted_nulls		boolean	No	Option to allow the conversion of quoted values to NULL values.
auto_type_candidates		array[string]	No	Types that the sniffer will use when detecting CSV column types. VARCHAR is always included as a fallback.
buffer_size		integer	No	The buffer size in bytes for the CSV reader.
columns		object with property values of type string	No	A struct specifying the column names and types. Using this option implies that auto detection is not used.
compression		string	No	The compression type for the file, detected automatically from the file extension.
dateformat		string	No	The date format to use when parsing dates.
decimal_separator		string	No	The decimal separator of numbers.
delim		string	No	The delimiter of the CSV file.
types		Any of: array[string] object with property values of type string	No	The column types as a struct (by name).
encoding		string	No	The encoding of the CSV file.
escape		string	No	The string that should appear before a data character sequence that matches the quote value.
force_not_null		array[string]	No	Do not match the specified columns' values against the NULL string.
header		Any of: boolean integer	No	Specifies that the file contains a header line with the names of each column.
hive_partitioning		boolean	No	Whether to interpret the path as a Hive partitioned path.
ignore_errors		boolean	No	Option to ignore any parsing errors encountered.
new_line		string	No	The line terminator of the CSV file.
max_line_size		integer	No	The maximum line size in bytes.
nullstr		Any of: string array[string]	No	The string or strings that represent a NULL value.
names		array[string]	No	The column names as a list.
normalize_names		boolean	No	Whether column names should be normalized, removing non-alphanumeric characters.
null_padding		boolean	No	Pad remaining columns on the right with null values if a row lacks columns.
parallel		boolean	No	Whether the parallel CSV reader is used.
quote		string	No	The quoting string to be used when a data value is quoted.
sample_size		integer	No	The number of sample rows for auto detection of parameters.
sep		string	No	Single-character string to use as the column separator. Default is ','.
skip		integer	No	The number of lines at the top of the file to skip.
timestampformat		string	No	The date format to use when parsing timestamps.
union_by_name		boolean	No	Whether the columns of multiple schemas should be unified by name, rather than by position. Increases memory consumption.
has_header	True	boolean	No	Indicates if the first row of the CSV is treated as the header row. Defaults to True.
has_multi_line_value	False	boolean	No	Indicates if the CSV contains values spanning multiple lines. Defaults to False.
auto_detect	True	boolean	No	Enables auto detection of CSV parameters.
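
To show how these options nest under the csv key (as in the s3_read_csv_with_header.yaml example), here is a small sketch for a pipe-delimited file. The delimiter and NULL markers are illustrative values, not defaults.

```yaml
component:
  read:
    connection: my-s3-connection
    s3:
      path: my/csv/files/
      parser:
        csv:
          # Illustrative delimiter for pipe-separated files.
          delim: "|"
          has_header: true
          # Treat both empty strings and the literal NULL as NULL values.
          nullstr: ["", "NULL"]
          # Fail on parsing errors rather than skipping them.
          ignore_errors: false
```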

ExcelParser

Parser for Excel files.

Property	Default	Type	Required	Description
hive_partitioning	False	boolean	No	Whether to extract Hive partition values from the full file path and add them to the table as partition columns. When set to true, directory names that are key=value pairs, such as “/year=2009/month=11”, are parsed into the partition columns "year" and "month" with the values "2009" and "11"; the data set should not contain columns with the same names as the partition columns. When false, no partition columns are added. Defaults to false.
excel		Any of:	Yes	Options for parsing Excel files.

DuckDBExcelParserOptions

Parsing options for Excel files using DuckDB. Refer to the DuckDB documentation for more information here: https://duckdb.org/docs/guides/file_formats/excel_import.html

Property	Default	Type	Required	Description
duckdb_options			Yes	Options for parsing Excel files with DuckDB.

DuckDBExcelParserOptionsWrapper

Wrapper for the DuckDB Excel Parser Options.

Property	Default	Type	Required	Description
layer		string	No	The layer to use for parsing the Excel file. If not specified, the default layer is used.
open_options		array[string]	No	Additional open_options to pass to the st_read function.

FeatherParser

Parser for the Feather file format.

Property	Default	Type	Required	Description
hive_partitioning	False	boolean	No	Whether to extract Hive partition values from the full file path and add them to the table as partition columns. When set to true, directory names that are key=value pairs, such as “/year=2009/month=11”, are parsed into the partition columns "year" and "month" with the values "2009" and "11"; the data set should not contain columns with the same names as the partition columns. When false, no partition columns are added. Defaults to false.
feather			Yes	Options for parsing Feather files.

FilenameTimestampPattern

Pattern to extract timestamps from file names for the purpose of time-based file selection.

Property	Default	Type	Required	Description
pattern		string	Yes	Pattern to extract timestamps from filenames, e.g. r'year=(?P<year>\d{4})/month=(?P<month>\d{2})/day=(?P<day>\d{2})'

JsonParser

Parser for JSON files.

Property	Default	Type	Required	Description
hive_partitioning	False	boolean	No	Whether to extract Hive partition values from the full file path and add them to the table as partition columns. When set to true, directory names that are key=value pairs, such as “/year=2009/month=11”, are parsed into the partition columns "year" and "month" with the values "2009" and "11"; the data set should not contain columns with the same names as the partition columns. When false, no partition columns are added. Defaults to false.
json			No	Options for parsing JSON files.

JsonParserOptions

Parsing options for JSON files, particularly focusing on date and timestamp formats.

Property	Type	Required	Description
columns	object with property values of type string	No	A struct that specifies the key names and value types contained within the JSON file (e.g. `{key1: 'INTEGER', key2: 'VARCHAR'}`).
compression	string	No	The compression type for the file. Detected automatically from the file extension by default. Options include 'uncompressed', 'gzip', 'zstd', and 'auto_detect'.
convert_strings_to_integers	boolean	No	Whether strings representing integer values should be converted to a numerical type. Defaults to False.
dateformat	string	No	Specifies the date format to use when parsing dates. The DuckDB default format is 'iso'.
format	string	No	Specifies the format of the JSON file. Can be one of ['auto', 'unstructured', 'newline_delimited', 'array'].
hive_partitioning	boolean	No	Whether to interpret the path as a Hive partitioned path. Defaults to the DuckDB default.
ignore_errors	boolean	No	Whether to ignore any parsing errors encountered. Defaults to False.
map_inference_threshold	integer	No	The maximum number of elements in a map to use map inference. Set to -1 to use map inference for all maps.
maximum_depth	integer	No	Maximum nesting depth to which the automatic schema detection detects types.
maximum_object_size	integer	No	The maximum size of a JSON object in bytes.
records	string	No	Determines how JSON objects are treated in parsing. Can be one of ['auto', 'true', 'false'].
sample_size	integer	No	Number of sample objects for automatic JSON type detection. Set to -1 to scan the entire input file.
timestampformat	string	No	Specifies the date format to use when parsing timestamps.
union_by_name	boolean	No	Whether the schemas of multiple JSON files should be unified.
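
A sketch of a JSON read using a few of the options above, assuming they nest under a json key in the same way the CSV options nest under csv. The .jsonl suffix and the path are hypothetical.

```yaml
component:
  read:
    connection: my-s3-connection
    s3:
      path: my/json/files/
      include:
        - suffix: .jsonl
      parser:
        json:
          # One JSON object per line.
          format: newline_delimited
          # Scan the entire input file when detecting types.
          sample_size: -1
          ignore_errors: false
```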

LoadStrategy

Specifications for loading files, including limits on number and size of files.

Property	Type	Required	Description
max_files	integer	No	Maximum number of files to read.
max_size	Any of: integer string	No	Maximum size of files to read. Can be an integer or a string that represents a byte size (e.g., '10MB', '1GB').
download_threads	integer	No	Number of threads to use for downloading files.
direct_load		No	Override default direct load behavior.
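
For illustration, a load_strategy block under s3 might look like the sketch below. The limits are arbitrary example values; max_size uses the string form described in the table.

```yaml
component:
  read:
    connection: my-s3-connection
    s3:
      path: my/parquet/files/
      parser: parquet
      load_strategy:
        # Read at most 100 files ...
        max_files: 100
        # ... within the given size limit ...
        max_size: 1GB
        # ... using 4 download threads.
        download_threads: 4
```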

DirectLoadOptions

Property	Default	Type	Required	Description
enable		boolean	No	A boolean flag indicating whether direct load should be used if supported by the backend engine. Defaults to not using direct load.
ignore_unknown_values	True	boolean	No	For BigQuery backend, indicates whether BigQuery should allow extra values that are not represented in the table schema. If true, the extra values are ignored. If false, records with extra columns are treated as bad records, and if there are too many bad records, an invalid error is returned in the job result. BigQuery's own default is false; the default listed for this option is True. The sourceFormat property determines what BigQuery treats as an extra value.
max_staging_table_age_in_hours		integer	No	For BigQuery backend, the maximum age of the staging table in hours. If the staging table is older than this age, it will be dropped. Defaults to 6 hours if not set.
nested_fields_as_string		boolean	No	For BigQuery backend, if true, nested fields will be converted to strings. If false, nested fields will be converted to JSON objects.

OrcParser

Parser for the ORC file format.

Property	Default	Type	Required	Description
hive_partitioning	False	boolean	No	Whether to extract Hive partition values from the full file path and add them to the table as partition columns. When set to true, directory names that are key=value pairs, such as “/year=2009/month=11”, are parsed into the partition columns "year" and "month" with the values "2009" and "11"; the data set should not contain columns with the same names as the partition columns. When false, no partition columns are added. Defaults to false.
orc			Yes	Options for parsing ORC files.

PandasExcelParserOptions

Parsing options for Excel files using pandas. Refer to the pandas documentation for more information here: https://pandas.pydata.org/docs/reference/api/pandas.read_excel.html

Property	Default	Type	Required	Description
pandas_options			Yes	Options for parsing Excel files with pandas.

PandasExcelParserOptionsWrapper

Wrapper for the Pandas Excel Parser Options.

Property	Type	Required	Description
date_format	string	No	Format string for parsing dates, following Python's strftime directives, e.g., '%Y-%m-%d'.
date_columns	array[string]	No	List of column names that should be parsed as dates.
timestamp_columns	array[string]	No	List of column names that should be parsed as timestamps.
sheet_name	string	No	Name of the Excel sheet to parse. If not specified, the first sheet is used by default.
header	Any of: integer array[integer]	No	Row number(s) to use as the column names, and the start of the data. If a list of integers is passed, the values in the list indicate the row number(s) to use as the column names, and the start of the data.
names	array[string]	No	List of column names to use. If not specified, and header is True, then the first row of the data is used. If a list is passed, it must match the length of the data.
index_col	Any of: integer array[integer]	No	Column number(s) to set as the index (0-based). If a list is passed, the values in the list indicate the column number(s) to set as the index.
usecols	Any of: string array[string]	No	Columns to parse. If not specified, all columns are parsed. If a string, a comma-separated list of Excel column letters and column ranges (e.g. “A:E” or “A,C,E:F”), with ranges inclusive of both ends. If a list of strings, the names of the columns to parse.
dtype	object with property values of type string	No	A struct that specifies the column names and value types contained within the Excel file (e.g. `{key1: 'INTEGER', key2: 'VARCHAR'}`).
skiprows	Any of: integer array[integer]	No	Row number(s) to skip (0-based) before reading the data. If a list is passed, the values in the list indicate the row number(s) to skip.
nrows	integer	No	Number of rows to read. If not specified, all rows are read.
na_values	Any of: string array[string]	No	String or list of strings to recognize as NA/NaN. If a list is passed, the values in the list indicate the strings to recognize as NA/NaN.
keep_default_na	boolean	No	Whether to include the default NaN values when parsing the data. If True, the default NaN values are included. If False, the default NaN values are not included.
na_filter	boolean	No	Whether to filter out rows with NA values. If True, rows with NA values are filtered out. If False, rows with NA values are included.
parse_dates	Any of: boolean array[boolean]	No	Whether to parse dates. If True, dates are parsed. If False, dates are not parsed. If a list is passed, the values in the list indicate whether the corresponding columns should be parsed as dates.
skipfooter	integer	No	Number of lines at the end of the file to skip.
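
A hedged sketch of an Excel read with pandas options. The nesting (excel, then pandas_options) is inferred from the ExcelParser and PandasExcelParserOptions tables on this page rather than from a confirmed example, and the sheet and column names are hypothetical.

```yaml
component:
  read:
    connection: my-s3-connection
    s3:
      path: my/excel/files/
      include:
        - suffix: .xlsx
      parser:
        excel:
          pandas_options:
            # Hypothetical sheet name.
            sheet_name: Sheet1
            # Skip two rows before reading the data.
            skiprows: 2
            # Hypothetical column to parse as a date.
            date_columns:
              - order_date
            date_format: "%Y-%m-%d"
```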

ParquetParser

Parser for the Parquet file format.

Property	Default	Type	Required	Description
hive_partitioning	False	boolean	No	Whether to extract Hive partition values from the full file path and add them to the table as partition columns. When set to true, directory names that are key=value pairs, such as “/year=2009/month=11”, are parsed into the partition columns "year" and "month" with the values "2009" and "11"; the data set should not contain columns with the same names as the partition columns. When false, no partition columns are added. Defaults to false.
parquet			Yes	Options for parsing Parquet files.
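
As a sketch of Hive partitioning with the Parquet parser: the table above lists hive_partitioning and parquet as sibling properties, so the example places them side by side under parser. That layout, and the use of an empty mapping for parquet, are assumptions drawn from the table and not a confirmed configuration.

```yaml
component:
  read:
    connection: my-s3-connection
    s3:
      # Files are assumed to live under key=value directories,
      # e.g. my/parquet/files/year=2009/month=11/
      path: my/parquet/files/
      parser:
        # Adds "year" and "month" as partition columns.
        hive_partitioning: true
        # No Parquet-specific options set (assumed form).
        parquet: {}
```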

PickleParser

Parser for the Pickle file format. Pickle files are generated by Python's pickle module.

Property	Default	Type	Required	Description
hive_partitioning	False	boolean	No	Whether to extract Hive partition values from the full file path and add them to the table as partition columns. When set to true, directory names that are key=value pairs, such as “/year=2009/month=11”, are parsed into the partition columns "year" and "month" with the values "2009" and "11"; the data set should not contain columns with the same names as the partition columns. When false, no partition columns are added. Defaults to false.
pickle			Yes	Options for parsing Pickle files.

ComponentColumn

Component column expression definition.

No properties defined.

TarArchive

Configuration options for tar archive files.

Property	Default	Type	Required	Description
tar			No	Configuration for tar archive files.

TextParser

Parser for text files.

Property	Default	Type	Required	Description
hive_partitioning	False	boolean	No	Whether to extract Hive partition values from the full file path and add them to the table as partition columns. When set to true, directory names that are key=value pairs, such as “/year=2009/month=11”, are parsed into the partition columns "year" and "month" with the values "2009" and "11"; the data set should not contain columns with the same names as the partition columns. When false, no partition columns are added. Defaults to false.
text			Yes	Options for parsing text files.

PartitionedStrategy

Partitioned ingest strategy. The user is expected to provide two functions: a list function that lists the partitions in the source, and a read function that reads a partition from the source.

Property	Default	Type	Required	Description
partitioned			No	Options for partitioning data.
on_schema_change		string ("ignore", "fail", "append_new_columns", "sync_all_columns")	No	Policy to apply when schema changes are detected. Defaults to 'fail' if not provided.

PartitionedOptions

Options related to partition optimization - in particular, the policy that determines which partitions to ingest.

Property	Default	Type	Required	Description
enable_substitution_by_partition_name		boolean	Yes	Enable substitution by partition name.
output_type	table	string ("table", "view")	No	Output type for partitioned data. Must be either 'table' or 'view'. This strategy applies only to Transforms.
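
A minimal sketch of a partitioned strategy on an S3 read, assuming strategy sits at the read level (as listed in the S3ReadComponent table) and that partitioned and on_schema_change are siblings beneath it. The values chosen are illustrative only.

```yaml
component:
  read:
    connection: my-s3-connection
    strategy:
      partitioned:
        # Required by PartitionedOptions; false is an example value.
        enable_substitution_by_partition_name: false
      # Append newly detected columns; the documented default is 'fail'.
      on_schema_change: append_new_columns
    s3:
      path: my/parquet/files/
      parser: parquet
```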

InputComponent

Specification for input components, including how partitioning behaviors should be handled. This additional metadata is required when a component is used as an input to other components in a flow.

Property	Type	Required	Description
flow	string	Yes	Name of the parent flow that the input component belongs to.
name	string	Yes	The input component name.
alias	string	No	The alias to use for the input component.
partition_spec	Any of: string ("full_reduction", "map")	No	The type of partitioning to apply to the component's input data before processing the component's logic. Input partitioning is applied before the component's logic is executed.
where	string	No	An optional filter condition to apply to the input component's data.
partition_binding	Any of: string	No	An optional partition binding specification to apply to the component on a per-output-partition basis against other inputs' partitions.

PartitionBinding

Property	Default	Type	Required	Description
logical_operator		string ("AND", "OR")	No	The logical operator used to combine the provided partition binding predicates.
predicates		array[string]	No	The list of partition binding predicates to apply to the input component's data.

RepartitionSpec

Specification for repartitioning operations on an input component's data.

Property	Default	Type	Required	Description
repartition			No	Options for repartitioning the input component's data.

RepartitionOptions

Options for repartitioning the input component's data.

Property	Default	Type	Required	Description
partition_by		string	Yes	The column to partition by.
granularity		string	Yes	The granularity to use for the partitioning.

XmlParser

Parser for XML files.

Property	Default	Type	Required	Description
hive_partitioning	False	boolean	No	Whether to extract Hive partition values from the full file path and add them to the table as partition columns. When set to true, directory names that are key=value pairs, such as “/year=2009/month=11”, are parsed into the partition columns "year" and "month" with the values "2009" and "11"; the data set should not contain columns with the same names as the partition columns. When false, no partition columns are added. Defaults to false.
xml			Yes	Options for parsing XML files.

NoParserOptions

No custom parsing options exist for this parser.

No properties defined.

ZipArchive

Configuration options for ZIP archive files.

Property	Default	Type	Required	Description
zip			No	Configuration for ZIP archive files.

FileArchiveOptions

Options for working with files inside of the file structure of an archive file format.

Property	Type	Required	Description
path	string	No	Path to the directory in the archive containing files that should be processed.
include	array[None]	No	List of conditions to include specific files in the archive.
exclude	array[None]	No	List of conditions to exclude specific files in the archive.
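
The sketch below shows one way reading CSV files from inside ZIP archives could be configured. It assumes the zip key under archive accepts the FileArchiveOptions listed above; the paths are hypothetical.

```yaml
component:
  read:
    connection: my-s3-connection
    s3:
      path: my/archives/
      include:
        - suffix: .zip
      archive:
        zip:
          # Hypothetical directory inside each archive.
          path: data/
          # Only process CSV files found in the archive.
          include:
            - suffix: .csv
      parser:
        csv:
          has_header: true
```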

FileSelection

Options for selecting files based on various criteria. All criteria specified must be met for a file to be included.

Property	Type	Required	Description
prefix	string	No	Select only files with this prefix. Strongly recommended when parser is not set to 'auto'.
suffix	string	No	Select only files with this suffix. Strongly recommended when parser is not set to 'auto'.
created_at		No	Include files created within the specified date range.
modified_at		No	Include files modified within the specified date range.
timestamp_from_filename		No	Extract and include files based on timestamps parsed from filenames.
glob	string	No	Glob pattern for including files.
regex	string	No	Regular expression pattern for including files.
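
Because all criteria specified must be met, a selection with both a prefix and a suffix is more restrictive than either alone. The sketch below assumes the criteria of a single include entry are the ones combined this way; the sales_ prefix is a hypothetical naming convention.

```yaml
component:
  read:
    connection: my-s3-connection
    s3:
      path: my/csv/files/
      include:
        # A file must satisfy both criteria to be selected.
        - prefix: sales_
          suffix: .csv
      parser:
        csv:
          has_header: true
```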

FileTimeFromFilenameSelection

Option to extract timestamps from filenames and include files based on those timestamps.

Property	Type	Required	Description
after	string	No	Include files created or modified after this date.
before	string	No	Include files created or modified before this date.
on_or_after	string	No	Include files created or modified on or after this date.
on_or_before	string	No	Include files created or modified on or before this date.
since		No	Include files created or modified within the specified time delta from now.
pattern	string	Yes	Pattern to extract timestamps from filenames, e.g. r'year=(?P<year>\d{4})/month=(?P<month>\d{2})/day=(?P<day>\d{2})'
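
A sketch of selecting files by a timestamp embedded in the path. The pattern is the one given in the table, written with standard regex quantifiers (\d{4}, \d{2}) and single-quoted so that YAML leaves the backslashes untouched. Placing timestamp_from_filename inside an include entry is inferred from the FileSelection table.

```yaml
component:
  read:
    connection: my-s3-connection
    s3:
      path: my/parquet/files/
      include:
        - timestamp_from_filename:
            # Named groups capture year, month and day from the path.
            pattern: 'year=(?P<year>\d{4})/month=(?P<month>\d{2})/day=(?P<day>\d{2})'
            # Keep files whose extracted timestamp falls within the last 7 days.
            since:
              days: 7
      parser: parquet
```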

FileTimeSelection

Defines time-based selection criteria for including files, based on creation or modification dates.

Property	Type	Required	Description
after	string	No	Include files created or modified after this date.
before	string	No	Include files created or modified before this date.
on_or_after	string	No	Include files created or modified on or after this date.
on_or_before	string	No	Include files created or modified on or before this date.
since		No	Include files created or modified within the specified time delta from now.

TimeDelta

Property	Type	Required	Description
seconds	integer	No	The number of seconds.
minutes	integer	No	The number of minutes.
hours	integer	No	The number of hours.
days	integer	No	The number of days.
weeks	integer	No	The number of weeks.
months	integer	No	The number of months.
years	integer	No	The number of years.
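
As an illustration of a TimeDelta in use, the sketch below selects recently modified files through the since property of modified_at. The nesting is inferred from the FileSelection and FileTimeSelection tables; the 24-hour window is an arbitrary example.

```yaml
component:
  read:
    connection: my-s3-connection
    s3:
      path: my/parquet/files/
      include:
        - suffix: .parquet
          modified_at:
            # TimeDelta: files modified within the last 24 hours.
            since:
              hours: 24
      parser: parquet
```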
