S3 Read Component
Component for reading files from an S3 bucket.
Examples
read_parquet_s3.yaml

    component:
      read:
        connection: my-s3-connection
        s3:
          path: my/parquet/files/
          parser: parquet

s3_read_csv_with_header.yaml

    component:
      read:
        connection: my-s3-connection
        s3:
          path: my/csv/files/
          parser:
            csv:
              has_header: true

s3_read_avro_config.yaml

    component:
      description: Component to read Avro files from S3.
      read:
        connection: my_s3_connection
        s3:
          include:
            - suffix: .avro
          parser: avro
          path: path/to/your/avro/files
S3ReadComponent
In the YAML structure, S3ReadComponent is defined beneath the `component` and `read` nodes, as in the examples above.
Below are the properties for the S3ReadComponent. Each property links to its details section further down this page.
Property | Default | Type | Required | Description |
---|---|---|---|---|
dependencies | | array[None] | No | List of dependencies that must complete before this component runs. |
event_time | | string | No | Timestamp column in the component output used to represent event time. |
connection | | string | No | The name of the connection to use for reading data. |
columns | | array[None] | No | A list specifying the columns to read from the source and the transformations to apply during the read. |
normalize | | boolean | No | Whether the output column names should be normalized to a standard naming convention after reading. |
preserve_case | | boolean | No | Whether the case of the column names should be preserved after reading. |
uppercase | | boolean | No | Whether the column names should be transformed to uppercase after reading. |
strategy | | PartitionedStrategy | No | Ingest strategy when reading files. |
s3 | | | Yes | Options for reading files from an S3 bucket. |
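
As a rough illustration of the table above, the sketch below places a few of these properties next to `connection` and `s3` under `read`, following the nesting used in the examples at the top of this page. The connection name, the path, and the column name `event_ts` are placeholders chosen for illustration, not values from a real project.

```yaml
component:
  read:
    connection: my-s3-connection   # placeholder connection name
    event_time: event_ts           # hypothetical timestamp column in the output
    normalize: true                # normalize output column names after reading
    s3:
      path: my/parquet/files/      # relative to the connection's root directory
      parser: parquet
```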
Property Details
Component
A component is a fundamental building block of a data flow. Supported component types include read, transform, task, test, and more.
Property | Default | Type | Required | Description |
---|---|---|---|---|
component | | One of: CustomPythonReadComponent, ApplicationComponent, AliasedTableComponent, ExternalTableComponent | Yes | Configuration options for the component. |
ReadComponent
A component that reads data from a data system.
Property | Default | Type | Required | Description |
---|---|---|---|---|
data_plane | | One of: SnowflakeDataPlane, BigQueryDataPlane, DatabricksDataPlane | No | Data Plane-specific configuration options for a component. |
skip | | boolean | No | Whether to skip processing for the component. |
retry_strategy | | | No | The retry strategy configuration options for the component if any exceptions are encountered. |
description | | string | No | A brief description of what the model does. |
metadata | | | No | Meta information of a resource. In most cases it does not affect system behavior, but it may be helpful when analyzing project resources. |
name | | string | Yes | The name of the model. |
flow_name | | string | No | The name of the flow that the component belongs to. |
data_maintenance | | | No | The data maintenance configuration options for the component. |
tests | | | No | Defines tests to run on the data of this component. |
read | | One of: GenericFileReadComponent, LocalFileReadComponent, SFTPReadComponent, S3ReadComponent, GcsReadComponent, AbfsReadComponent, HttpReadComponent, MSSQLReadComponent, MySQLReadComponent, OracleReadComponent, PostgresReadComponent, SnowflakeReadComponent, BigQueryReadComponent, DatabricksReadComponent | Yes | The read component that reads data from a data system. |
FileReadOptionsBase
Options for locating and parsing files from a specified directory or file path, including file selection criteria and parser to use.
Property | Default | Type | Required | Description |
---|---|---|---|---|
path | | string | Yes | Path to the directory or file to read. The path is relative to the connection's root directory; it cannot be an absolute path or traverse outside the root directory. |
exclude | | array[None] | No | List of conditions to exclude specific files from being processed. |
include | | array[None] | No | List of conditions to include specific files for processing. |
parser | auto | string or parser object | No | Parser resource for reading the files. Defaults to 'auto'. To set specific parser options, use the parser name as a child object. |
archive | | | No | Configuration for processing archive files. |
load_strategy | | | No | Strategy for loading files, including limits on number and size of files. |
time_based_file_selection | | Any of: string | No | Method to use for file selection based on a time window. |
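
The file selection options can be sketched as follows. The `suffix` condition under `include` is taken from the Avro example at the top of this page; using the same `suffix` key under `exclude` is an assumption, as are the path and suffix values.

```yaml
component:
  read:
    connection: my-s3-connection
    s3:
      path: path/to/your/files     # placeholder, relative to the connection root
      include:
        - suffix: .csv             # only pick up CSV files
      exclude:
        - suffix: .tmp             # assumed: exclude accepts the same condition keys as include
      parser: auto                 # the default, written out for clarity
```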
AutoParser
Parser that automatically detects the file format and settings for parsing.
Property | Default | Type | Required | Description |
---|---|---|---|---|
hive_partitioning | False | boolean | No | Whether to extract hive partitioning from the full path name of the file and use it as partition columns. If true, the partition columns will be added to the table as additional columns. The data set should not contain any columns with the same name as the partition columns. If false, the partition columns will not be added to the table. If not set, the default value is false. e.g.: directory names with key=value pairs like “/year=2009/month=11”, when set to true, will be parsed into partition columns with the names "year" and "month" and the values "2009" and "11" respectively. |
auto | | | Yes | Options for automatically detecting the file format. None need to be specified. |
AvroParser
Parser for the Avro file format.
Property | Default | Type | Required | Description |
---|---|---|---|---|
hive_partitioning | False | boolean | No | Whether to extract hive partitioning from the full path name of the file and use it as partition columns. If true, the partition columns will be added to the table as additional columns. The data set should not contain any columns with the same name as the partition columns. If false, the partition columns will not be added to the table. If not set, the default value is false. e.g.: directory names with key=value pairs like “/year=2009/month=11”, when set to true, will be parsed into partition columns with the names "year" and "month" and the values "2009" and "11" respectively. |
avro | | | Yes | Options for parsing Avro files. |
CsvParser
Parser for CSV files.
Property | Default | Type | Required | Description |
---|---|---|---|---|
hive_partitioning | False | boolean | No | Whether to extract hive partitioning from the full path name of the file and use it as partition columns. If true, the partition columns will be added to the table as additional columns. The data set should not contain any columns with the same name as the partition columns. If false, the partition columns will not be added to the table. If not set, the default value is false. e.g.: directory names with key=value pairs like “/year=2009/month=11”, when set to true, will be parsed into partition columns with the names "year" and "month" and the values "2009" and "11" respectively. |
csv | | | Yes | Options for parsing CSV files. |
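
A minimal sketch of enabling Hive partitioning for a CSV read. It assumes `hive_partitioning` sits beside the `csv` key inside `parser`, as the CsvParser table suggests; the directory layout and names are illustrative only.

```yaml
component:
  read:
    connection: my-s3-connection
    s3:
      # Assumed layout: my/csv/files/year=2009/month=11/<file>.csv
      path: my/csv/files/
      parser:
        hive_partitioning: true    # "year" and "month" become additional columns
        csv:
          has_header: true
```

With this layout, the data itself should not contain columns named `year` or `month`, since those names are taken by the partition columns.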
CsvParserOptions
Options for parsing CSV files, including separators, header presence, and multi-line values.
Property | Default | Type | Required | Description |
---|---|---|---|---|
all_varchar | | boolean | No | Option to skip type detection for CSV parsing and assume all columns to be of type VARCHAR. |
allow_quoted_nulls | | boolean | No | Option to allow the conversion of quoted values to NULL values. |
auto_type_candidates | | array[string] | No | Types that the sniffer will use when detecting CSV column types. VARCHAR is always included as a fallback. |
buffer_size | | integer | No | The buffer size in bytes for the CSV reader. |
columns | | object with property values of type string | No | A struct specifying the column names and types. Using this option implies that auto detection is not used. |
compression | | string | No | The compression type for the file, detected automatically from the file extension. |
dateformat | | string | No | The date format to use when parsing dates. |
decimal_separator | | string | No | The decimal separator of numbers. |
delim | | string | No | The delimiter of the CSV file. |
types | | Any of: array[string], object with property values of type string | No | The column types, as either a list (by position) or a struct (by name). |
encoding | | string | No | The encoding of the CSV file. |
escape | | string | No | The string that should appear before a data character sequence that matches the quote value. |
force_not_null | | array[string] | No | Do not match the specified columns' values against the NULL string. |
header | | Any of: boolean, integer | No | Specifies that the file contains a header line with the names of each column. |
hive_partitioning | | boolean | No | Whether to interpret the path as a Hive partitioned path. |
ignore_errors | | boolean | No | Option to ignore any parsing errors encountered. |
new_line | | string | No | The line terminator of the CSV file. |
max_line_size | | integer | No | The maximum line size in bytes. |
nullstr | | Any of: string, array[string] | No | The string or strings that represent a NULL value. |
names | | array[string] | No | The column names as a list. |
normalize_names | | boolean | No | Whether column names should be normalized, removing non-alphanumeric characters. |
null_padding | | boolean | No | Pad remaining columns on the right with null values if a row lacks columns. |
parallel | | boolean | No | Whether the parallel CSV reader is used. |
quote | | string | No | The quoting string to be used when a data value is quoted. |
sample_size | | integer | No | The number of sample rows for auto detection of parameters. |
sep | | string | No | Single-character string to use as the column separator. Default is ','. |
skip | | integer | No | The number of lines at the top of the file to skip. |
timestampformat | | string | No | The date format to use when parsing timestamps. |
union_by_name | | boolean | No | Whether the columns of multiple schemas should be unified by name, rather than by position. Increases memory consumption. |
has_header | True | boolean | No | Indicates if the first row of the CSV is treated as the header row. Defaults to True. |
has_multi_line_value | False | boolean | No | Indicates if the CSV contains values spanning multiple lines. Defaults to False. |
auto_detect | True | boolean | No | Enables auto detection of CSV parameters. |
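
To show how several of these options might be combined, here is a sketch for a pipe-delimited file. Only keys from the table above are used, nested under `parser.csv` as in the CSV example at the top of this page; the specific values (delimiter, NULL markers, date format) are illustrative assumptions.

```yaml
component:
  read:
    connection: my-s3-connection
    s3:
      path: my/csv/files/
      parser:
        csv:
          has_header: true         # first row holds the column names
          delim: "|"               # pipe-delimited instead of comma
          nullstr: ["", "NULL"]    # treat empty strings and NULL as null values
          dateformat: "%Y-%m-%d"   # assumed strftime-style format string
          ignore_errors: true      # skip rows that fail to parse
```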
ExcelParser
Parser for Excel files.
Property | Default | Type | Required | Description |
---|---|---|---|---|
hive_partitioning | False | boolean | No | Whether to extract hive partitioning from the full path name of the file and use it as partition columns. If true, the partition columns will be added to the table as additional columns. The data set should not contain any columns with the same name as the partition columns. If false, the partition columns will not be added to the table. If not set, the default value is false. e.g.: directory names with key=value pairs like "/year=2009/month=11", when set to true, will be parsed into partition columns with the names "year" and "month" and the values "2009" and "11" respectively. |
excel | | | Yes | Options for parsing Excel files. |
DuckDBExcelParserOptions
Parsing options for Excel files using DuckDB. For more information, refer to the DuckDB documentation: https://duckdb.org/docs/guides/file_formats/excel_import.html
Property | Default | Type | Required | Description |
---|---|---|---|---|
duckdb_options | | | Yes | Options for parsing DuckDB files. |
DuckDBExcelParserOptionsWrapper
Wrapper for the DuckDB Excel Parser Options.
Property | Default | Type | Required | Description |
---|---|---|---|---|
layer | | string | No | The layer to use for parsing the Excel file. If not specified, the default layer is used. |
open_options | | array[string] | No | Additional open_options to pass to the st_read function. |
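
A tentative sketch of an Excel read, assuming the nesting `parser.excel.duckdb_options` implied by the three tables above. The sheet name `Sheet1` and the path are placeholders; `HEADERS=FORCE` is an open option that appears in the DuckDB Excel import guide for `st_read`.

```yaml
component:
  read:
    connection: my-s3-connection
    s3:
      path: my/excel/files/        # placeholder path
      parser:
        excel:
          duckdb_options:
            layer: Sheet1          # hypothetical worksheet name
            open_options:
              - HEADERS=FORCE      # treat the first row as column headers
```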
FeatherParser
Parser for the Feather file format.
Property | Default | Type | Required | Description |
---|---|---|---|---|
hive_partitioning | False | boolean | No | Whether to extract hive partitioning from the full path name of the file and use it as partition columns. If true, the partition columns will be added to the table as additional columns. The data set should not contain any columns with the same name as the partition columns. If false, the partition columns will not be added to the table. If not set, the default value is false. e.g.: directory names with key=value pairs like “/year=2009/month=11”, when set to true, will be parsed into partition columns with the names "year" and "month" and the values "2009" and "11" respectively. |
feather | | | Yes | Options for parsing Feather files. |
FilenameTimestampPattern
Pattern to extract timestamps from file names for the purpose of time-based file selection.
Property | Default | Type | Required | Description |
---|---|---|---|---|
pattern | | string | Yes | Pattern to extract timestamps from filenames, e.g.: r'year=(?P<year>\d{4})/month=(?P<month>\d{2})/day=(?P<day>\d{2})' |
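
One way such a pattern might be configured is sketched below, with the digit quantifiers written out in full (`\d{4}`, `\d{2}`). Placing `pattern` directly under `time_based_file_selection` is an assumption based on the FileReadOptionsBase table; the path is a placeholder. Single quotes keep YAML from interpreting the backslashes.

```yaml
component:
  read:
    connection: my-s3-connection
    s3:
      # Assumed layout: my/parquet/files/year=2024/month=05/day=17/<file>.parquet
      path: my/parquet/files/
      parser: parquet
      time_based_file_selection:
        # Named groups capture the year, month, and day from each file path.
        pattern: 'year=(?P<year>\d{4})/month=(?P<month>\d{2})/day=(?P<day>\d{2})'
```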