S3 Read Component
Component for reading files from an S3 bucket.
Examples

read_parquet_s3.yaml

    component:
      read:
        connection: my-s3-connection
        s3:
          path: my/parquet/files/
          parser: parquet

s3_read_csv_with_header.yaml

    component:
      read:
        connection: my-s3-connection
        s3:
          path: my/csv/files/
          parser:
            csv:
              has_header: true

s3_read_avro_config.yaml

    component:
      description: Component to read Avro files from S3.
      read:
        connection: my_s3_connection
        s3:
          include:
            - suffix: .avro
          parser: avro
          path: path/to/your/avro/files

S3ReadComponent
S3ReadComponent is defined beneath the following ancestor nodes in the YAML structure: Component, ReadComponent.
Below are the properties for the S3ReadComponent. Each property is described in more detail in the Property Details section further down this page.
Property | Default | Type | Required | Description |
---|---|---|---|---|
connection | | string | No | The name of the connection to use for reading data. |
columns | | array[ComponentColumn] | No | A list specifying the columns to read from the source and transformations to make during read. |
normalize | | boolean | No | A boolean flag indicating if the output column names should be normalized to a standard naming convention after reading. |
preserve_case | | boolean | No | A boolean flag indicating if the case of the column names should be preserved after reading. |
uppercase | | boolean | No | A boolean flag indicating if the column names should be transformed to uppercase after reading. |
materialization | | PartitionMaterialization | No | Resource for data materialization during the read process. |
s3 | | FileReadOptionsBase | Yes | Options for reading files from an S3 bucket. |
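
The properties above sit directly under `read`, next to the required `s3` block. The following is a minimal sketch only, assuming a connection named `my-s3-connection` and a hypothetical path; it shows where the `normalize` flag would be placed relative to `s3`.

```yaml
component:
  read:
    connection: my-s3-connection   # hypothetical connection name
    normalize: true                # normalize output column names after reading
    s3:
      path: my/parquet/files/      # hypothetical path, relative to the connection root
      parser: parquet
```
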
Property Details
Component
A component is a fundamental building block of a data flow. Types of components that are supported include: read, transform, task, test, and more.
Property | Default | Type | Required | Description |
---|---|---|---|---|
component | | One of: ReadComponent, TransformComponent, TaskComponent, SingularTestComponent, CustomPythonReadComponent, WriteComponent, CompoundComponent, AliasedTableComponent, ExternalTableComponent | Yes | Configuration options for the component. |
ReadComponent
A component that reads data from a data system.
Property | Default | Type | Required | Description |
---|---|---|---|---|
data_plane | | One of: SnowflakeDataPlane, BigQueryDataPlane, DuckdbDataPlane, SynapseDataPlane | No | Data Plane-specific configuration options for a component. |
name | | string | No | The name of the model. |
description | | string | No | A brief description of what the model does. |
metadata | | ResourceMetadata | No | Meta information of a resource. In most cases it doesn't affect the system behavior but may be helpful to analyze project resources. |
flow_name | | string | No | The name of the flow that the component belongs to. |
skip | | boolean | No | A boolean flag indicating whether to skip processing for the component or not. |
data_maintenance | | DataMaintenance | No | The data maintenance configuration options for the component. |
tests | | ComponentTestColumn | No | Defines tests to run on the data of this component. |
read | | One of: GenericFileReadComponent, LocalFileReadComponent, S3ReadComponent, GcsReadComponent, AbfsReadComponent, HttpReadComponent, MSSQLReadComponent, MySQLReadComponent, OracleReadComponent, PostgresReadComponent, SnowflakeReadComponent, BigQueryReadComponent | Yes | The read component that reads data from a data system. |
FileReadOptionsBase
Options for locating and parsing files from a specified directory or file path, including file selection criteria and parser to use.
Property | Default | Type | Required | Description |
---|---|---|---|---|
path | | string | Yes | Path to the directory or file to read. Path is relative to the connection's root directory, and cannot be an absolute path or traverse outside the root directory. |
exclude | | array[FileSelection] | No | List of conditions to exclude specific files from being processed. |
include | | array[FileSelection] | No | List of conditions to include specific files for processing. |
parser | auto | One of: string ("auto") or AutoParser, string ("avro") or AvroParser, string ("feather") or FeatherParser, string ("orc") or OrcParser, string ("parquet") or ParquetParser, string ("pickle") or PickleParser, string ("text") or TextParser, string ("xml") or XmlParser, string ("csv") or CsvParser, string ("excel") or ExcelParser, string ("json") or JsonParser | No | Parser Resource for reading the files. Defaults to 'auto'. To set specific parser options, use the parser name as a child object. |
archive | | Any of: TarArchive, ZipArchive | No | Configuration for archive files for processing. |
load_strategy | | LoadStrategy | No | Strategy for loading files, including limits on number and size of files. |
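
As a rough illustration of how `path`, `include`, `exclude`, and `parser` combine under `s3`, here is a sketch that assumes a hypothetical `exports/daily/` directory and a hypothetical `tmp_` prefix for files to skip. Whether `prefix` is matched against the file name or the full object key is not stated in this reference, so treat that detail as an assumption.

```yaml
component:
  read:
    connection: my-s3-connection   # hypothetical connection name
    s3:
      path: exports/daily/         # relative to the connection's root directory
      include:
        - suffix: .csv             # only pick up CSV files
      exclude:
        - prefix: tmp_             # hypothetical naming convention for temporary files
      parser: csv                  # string form; use a child object to set options
```
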
AutoParser
Parser that automatically detects the file format and settings for parsing.
Property | Default | Type | Required | Description |
---|---|---|---|---|
auto | | NoParserOptions | Yes | Options for automatically detecting the file format. None need to be specified. |
AvroParser
Parser for the Avro file format.
Property | Default | Type | Required | Description |
---|---|---|---|---|
avro | | NoParserOptions | Yes | Options for parsing Avro files. |
CsvParser
Parser for CSV files.
Property | Default | Type | Required | Description |
---|---|---|---|---|
csv | | CsvParserOptions | Yes | Options for parsing CSV files. |
CsvParserOptions
Options for parsing CSV files, including separators, header presence, and multi-line values.
Property | Default | Type | Required | Description |
---|---|---|---|---|
all_varchar | False | boolean | No | Option to skip type detection for CSV parsing and assume all columns to be of type VARCHAR. |
allow_quoted_nulls | True | boolean | No | Option to allow the conversion of quoted values to NULL values. |
auto_detect | True | boolean | No | Enables auto detection of CSV parameters. |
auto_type_candidates | | array[string] | No | Types that the sniffer will use when detecting CSV column types. VARCHAR is always included as a fallback. |
columns | | object | No | A struct specifying the column names and types. Using this option implies that auto detection is not used. |
compression | auto | string | No | The compression type for the file, detected automatically from the file extension. |
dateformat | | string | No | The date format to use when parsing dates. |
decimal_separator | . | string | No | The decimal separator of numbers. |
escape | " | string | No | The string that should appear before a data character sequence that matches the quote value. |
force_not_null | | array[string] | No | Do not match the specified columns' values against the NULL string. |
header | False | boolean | No | Specifies that the file contains a header line with the names of each column. |
hive_partitioning | False | boolean | No | Whether to interpret the path as a Hive partitioned path. |
ignore_errors | False | boolean | No | Option to ignore any parsing errors encountered. |
max_line_size | 2097152 | integer | No | The maximum line size in bytes. |
names | | array[string] | No | The column names as a list. |
new_line | | string | No | Set the new line character(s) in the file. |
normalize_names | False | boolean | No | Whether column names should be normalized, removing non-alphanumeric characters. |
null_padding | False | boolean | No | Pad remaining columns on the right with null values if a row lacks columns. |
nullstr | | Any of: string, array[string] | No | The string or strings that represent a NULL value. |
parallel | True | boolean | No | Whether the parallel CSV reader is used. |
quote | " | string | No | The quoting string to be used when a data value is quoted. |
sample_size | 20480 | integer | No | The number of sample rows for auto detection of parameters. |
sep | , | string | No | Single-character string to use as the column separator. Default is ','. |
delim | , | string | No | Single-character string to use as the column separator. Default is ','. |
skip | 0 | integer | No | The number of lines at the top of the file to skip. |
timestampformat | | string | No | The date format to use when parsing timestamps. |
types | | Any of: array[string], object | No | The column types as either a list (by position) or a struct (by name). |
dtypes | | Any of: array[string], object | No | The column types as either a list (by position) or a struct (by name). |
union_by_name | False | boolean | No | Whether the columns of multiple schemas should be unified by name, rather than by position. Increases memory consumption. |
date_format | | string | No | Format string for parsing dates, following Python's strftime directives, e.g., '%Y-%m-%d'. |
timestamp_format | | string | No | Format string for parsing timestamps, following Python's strftime directives, e.g., '%Y-%m-%d %H:%M:%S'. |
has_header | True | boolean | No | Indicates if the first row of the CSV is treated as the header row. Defaults to True. |
has_multi_line_value | False | boolean | No | Indicates if the CSV contains values spanning multiple lines. Defaults to False. |
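
To set CSV options, the parser is written as a `csv` child object rather than the plain string. The sketch below assumes a semicolon-separated file with a header row and two hypothetical NULL markers; the option names come from the table above, while the values and path are illustrative only.

```yaml
component:
  read:
    connection: my-s3-connection   # hypothetical connection name
    s3:
      path: my/csv/files/
      parser:
        csv:
          sep: ";"                 # column separator
          has_header: true         # first row holds the column names
          nullstr:                 # strings to read as NULL (hypothetical values)
            - "NA"
            - ""
          date_format: "%Y-%m-%d"  # strftime-style date format
          ignore_errors: false     # fail on parsing errors (the default)
```
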
ExcelParser
Parser for Excel files.
Property | Default | Type | Required | Description |
---|---|---|---|---|
excel | | ExcelParserOptions | Yes | Options for parsing Excel files. |
ExcelParserOptions
Parsing options for Excel files, including which columns should be parsed as dates and the sheet to be used.
Property | Default | Type | Required | Description |
---|---|---|---|---|
date_format | | string | No | Format string for parsing dates, following Python's strftime directives, e.g., '%Y-%m-%d'. |
timestamp_format | | string | No | Format string for parsing timestamps, following Python's strftime directives, e.g., '%Y-%m-%d %H:%M:%S'. |
date_columns | | array[string] | No | List of column names that should be parsed as dates. |
timestamp_columns | | array[string] | No | List of column names that should be parsed as timestamps. |
sheet_name | | string | No | Name of the Excel sheet to parse. If not specified, the first sheet is used by default. |
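
A possible Excel configuration, sketched under the assumption of a workbook with a sheet called `Orders` and a date column called `order_date` (both names are hypothetical):

```yaml
component:
  read:
    connection: my-s3-connection   # hypothetical connection name
    s3:
      path: my/excel/files/        # hypothetical path
      include:
        - suffix: .xlsx
      parser:
        excel:
          sheet_name: Orders       # hypothetical sheet; the first sheet is used if omitted
          date_columns:
            - order_date           # hypothetical column to parse as a date
          date_format: "%Y-%m-%d"
```
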
FeatherParser
Parser for the Feather file format.
Property | Default | Type | Required | Description |
---|---|---|---|---|
feather | | NoParserOptions | Yes | Options for parsing Feather files. |
JsonParser
Parser for JSON files.
Property | Default | Type | Required | Description |
---|---|---|---|---|
json | | JsonParserOptions | No | Options for parsing JSON files. |
JsonParserOptions
Parsing options for JSON files, particularly focusing on date and timestamp formats.
Property | Default | Type | Required | Description |
---|---|---|---|---|
columns | | object | No | A struct that specifies the key names and value types contained within the JSON file (e.g. {key1: 'INTEGER', key2: 'VARCHAR'}). |
compression | auto_detect | string | No | The compression type for the file. Detected automatically from the file extension by default. Options include 'uncompressed', 'gzip', 'zstd', and 'auto_detect'. |
convert_strings_to_integers | False | boolean | No | Whether strings representing integer values should be converted to a numerical type. Defaults to False. |
dateformat | iso | string | No | Specifies the date format to use when parsing dates. Default format is 'iso'. |
format | auto | string | No | Specifies the format of the JSON file. Can be one of ['auto', 'unstructured', 'newline_delimited', 'array']. Default is 'auto'. |
hive_partitioning | False | boolean | No | Whether or not to interpret the path as a Hive partitioned path. Defaults to False. |
maximum_depth | -1 | integer | No | Maximum nesting depth to which the automatic schema detection detects types. Set to -1 to fully detect nested JSON types. |
maximum_object_size | 16777216 | integer | No | The maximum size of a JSON object in bytes. Default is 16777216 bytes. |
records | auto | string | No | Determines how JSON objects are treated in parsing. Can be one of ['auto', 'true', 'false']. Default is 'auto'. |
sample_size | 20480 | integer | No | Number of sample objects for automatic JSON type detection. Set to -1 to scan the entire input file. Default is 20480. |
timestampformat | iso | string | No | Specifies the date format to use when parsing timestamps. Default format is 'iso'. |
union_by_name | False | boolean | No | Whether the schemas of multiple JSON files should be unified. Defaults to False. |
date_format | | string | No | Format string for parsing dates, following Python's strftime directives, e.g., '%Y-%m-%d'. |
timestamp_format | | string | No | Format string for parsing timestamps, following Python's strftime directives, e.g., '%Y-%m-%d %H:%M:%S'. |
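
The JSON options can be combined in the same child-object style. This sketch assumes gzip-compressed, newline-delimited files with two hypothetical keys, `id` and `name`; the option names and allowed values are taken from the table above.

```yaml
component:
  read:
    connection: my-s3-connection   # hypothetical connection name
    s3:
      path: my/json/files/         # hypothetical path
      parser:
        json:
          format: newline_delimited
          compression: gzip
          columns:                 # hypothetical keys and types
            id: INTEGER
            name: VARCHAR
          maximum_depth: -1        # fully detect nested types (the default)
```
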
LoadStrategy
Specifications for loading files, including limits on number and size of files.
Property | Default | Type | Required | Description |
---|---|---|---|---|
max_files | | integer | No | Maximum number of files to read. |
max_size | | Any of: integer, string | No | Maximum size of files to read. Can be an integer or a string that represents a byte size (e.g., '10MB', '1GB'). |
download_threads | | integer | No | Number of threads to use for downloading files. |
direct_load | | DirectLoadOptions | No | Override default direct load behavior. |
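
A load strategy is nested under `s3`. The limits below are arbitrary illustrative values, not recommendations, and the path is hypothetical.

```yaml
component:
  read:
    connection: my-s3-connection   # hypothetical connection name
    s3:
      path: my/parquet/files/
      parser: parquet
      load_strategy:
        max_files: 100             # illustrative limit on the number of files
        max_size: 1GB              # byte-size string, as described above
        download_threads: 4        # illustrative thread count
```
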
DirectLoadOptions
Property | Default | Type | Required | Description |
---|---|---|---|---|
enable | | boolean | No | A boolean flag indicating if direct load should be used if supported by the backend engine. Defaults to not using direct load. |
ignore_unknown_values | True | boolean | No | Optional. Indicates if BigQuery should allow extra values that are not represented in the table schema. If true, the extra values are ignored. If false, records with extra columns are treated as bad records, and if there are too many bad records, an invalid error is returned in the job result. BigQuery's own default for this setting is false; the default listed for this option is True. The sourceFormat property determines what BigQuery treats as an extra value. |
max_staging_table_age_in_hours | | integer | No | For the BigQuery backend, the maximum age of the staging table in hours. If the staging table is older than this age, it will be dropped. If not set, defaults to 6 hours. |
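
Direct load options sit one level further down, under `load_strategy.direct_load`. The following is a sketch assuming a BigQuery backend; the staging table age of 12 hours is an arbitrary illustrative value.

```yaml
component:
  read:
    connection: my-s3-connection   # hypothetical connection name
    s3:
      path: my/csv/files/
      parser: csv
      load_strategy:
        direct_load:
          enable: true                        # use direct load if the backend supports it
          ignore_unknown_values: true         # BigQuery: ignore values not in the table schema
          max_staging_table_age_in_hours: 12  # illustrative value; 6 hours if unset
```
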
OrcParser
Parser for the ORC file format.
Property | Default | Type | Required | Description |
---|---|---|---|---|
orc | | NoParserOptions | Yes | Options for parsing ORC files. |
ParquetParser
Parser for the Parquet file format.
Property | Default | Type | Required | Description |
---|---|---|---|---|
parquet | | NoParserOptions | Yes | Options for parsing Parquet files. |
PickleParser
Parser for the Pickle file format. Pickle files are generated by Python's pickle module.
Property | Default | Type | Required | Description |
---|---|---|---|---|
pickle | | NoParserOptions | Yes | Options for parsing Pickle files. |
ComponentColumn
Component column expression definition.
No properties defined.
TarArchive
Configuration options for tar archive files.
Property | Default | Type | Required | Description |
---|---|---|---|---|
tar | | FileArchiveOptions | No | Configuration for tar archive files. |
TextParser
Parser for text files.
Property | Default | Type | Required | Description |
---|---|---|---|---|
text | | NoParserOptions | Yes | Options for parsing text files. |
PartitionMaterialization
Container for options that control how data is materialized and stored in the partitioned case.
Property | Default | Type | Required | Description |
---|---|---|---|---|
partitioned | | PartitionedOptions | No | Field for options for partitioning data. |
PartitionedOptions
Options for partitioning data.
Property | Default | Type | Required | Description |
---|---|---|---|---|
enable_substitution_by_partition_name | | boolean | Yes | Enable substitution by partition name. |
on_schema_change | | string ("ignore", "fail", "append_new_columns", "sync_all_columns") | No | Policy to apply when schema changes are detected. |
output_type | table | string ("table", "view") | No | Output type for partitioned data. Must be either 'table' or 'view'. |
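
Materialization is a property of the read component itself, so it is written next to `s3` rather than inside it. A minimal sketch, with a hypothetical connection name and path and one of the documented `on_schema_change` policies:

```yaml
component:
  read:
    connection: my-s3-connection   # hypothetical connection name
    materialization:
      partitioned:
        enable_substitution_by_partition_name: true   # required field
        on_schema_change: append_new_columns
        output_type: table                            # 'table' (default) or 'view'
    s3:
      path: my/parquet/files/
      parser: parquet
```
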
XmlParser
Parser for XML files.
Property | Default | Type | Required | Description |
---|---|---|---|---|
xml | | NoParserOptions | Yes | Options for parsing XML files. |
NoParserOptions
No custom parsing options exist for this parser.
No properties defined.
ZipArchive
Configuration options for ZIP archive files.
Property | Default | Type | Required | Description |
---|---|---|---|---|
zip | | FileArchiveOptions | No | Configuration for ZIP archive files. |
FileArchiveOptions
Options for working with files inside of the file structure of an archive file format.
Property | Default | Type | Required | Description |
---|---|---|---|---|
path | | string | No | Path to the directory in the archive containing files that should be processed. |
include | | array[FileSelection] | No | List of conditions to include specific files in the archive. |
exclude | | array[FileSelection] | No | List of conditions to exclude specific files in the archive. |
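
Archive handling uses two levels of selection: the outer `include` picks the archive files, and the `include` inside `archive.zip` picks files within each archive. The sketch below assumes ZIP archives that contain CSV files in a hypothetical `data/` directory.

```yaml
component:
  read:
    connection: my-s3-connection   # hypothetical connection name
    s3:
      path: archives/              # hypothetical path holding the ZIP files
      include:
        - suffix: .zip             # select the archives themselves
      archive:
        zip:
          path: data/              # hypothetical directory inside each archive
          include:
            - suffix: .csv         # select files inside the archive
      parser: csv
```
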
FileSelection
Options for selecting files based on various criteria. All criteria specified must be met for a file to be included.
Property | Default | Type | Required | Description |
---|---|---|---|---|
prefix | | string | No | Select only files with this prefix. Strongly recommended when parser is not set to 'auto'. |
suffix | | string | No | Select only files with this suffix. Strongly recommended when parser is not set to 'auto'. |
created_at | | FileTimeSelection | No | Include files created within the specified date range. |
modified_at | | FileTimeSelection | No | Include files modified within the specified date range. |
timestamp_from_filename | | FileTimeFromFilenameSelection | No | Extract and include files based on timestamps parsed from filenames. |
glob | | string | No | Glob pattern for including files. |
regex | | string | No | Regular expression pattern for including files. |
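
Because all criteria within one FileSelection entry must be met, putting `prefix` and `suffix` in the same list item selects only files that satisfy both. This sketch uses hypothetical file naming (`sales_*.parquet`).

```yaml
component:
  read:
    connection: my-s3-connection   # hypothetical connection name
    s3:
      path: my/parquet/files/
      include:
        - prefix: sales_           # hypothetical prefix
          suffix: .parquet         # must also match, since both are in one entry
      parser: parquet
```
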
FileTimeFromFilenameSelection
Option to extract timestamps from filenames and include files based on those timestamps.
Property | Default | Type | Required | Description |
---|---|---|---|---|
after | | string | No | Include files created or modified after this date. |
before | | string | No | Include files created or modified before this date. |
on_or_after | | string | No | Include files created or modified on or after this date. |
on_or_before | | string | No | Include files created or modified on or before this date. |
pattern | | string | Yes | Pattern to extract timestamps from filenames, e.g., r'year=(?P<year>\d{4})/month=(?P<month>\d{2})/day=(?P<day>\d{2})'. |
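
A sketch of timestamp-based selection for a Hive-style layout such as `year=2024/month=01/day=15/`. The ISO-style date string is an assumption, since this reference does not state the accepted date format. The pattern is single-quoted so that YAML leaves the backslashes untouched.

```yaml
component:
  read:
    connection: my-s3-connection   # hypothetical connection name
    s3:
      path: events/                # hypothetical path
      include:
        - timestamp_from_filename:
            # Named groups extract year, month, and day from each file path.
            pattern: 'year=(?P<year>\d{4})/month=(?P<month>\d{2})/day=(?P<day>\d{2})'
            on_or_after: "2024-01-01"   # assumed date format
      parser: parquet
```
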
FileTimeSelection
Defines time-based selection criteria for including files, based on creation or modification dates.
Property | Default | Type | Required | Description |
---|---|---|---|---|
after | | string | No | Include files created or modified after this date. |
before | | string | No | Include files created or modified before this date. |
on_or_after | | string | No | Include files created or modified on or after this date. |
on_or_before | | string | No | Include files created or modified on or before this date. |
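
A time window can be expressed by combining two bounds on `modified_at`. The dates below are illustrative, and their ISO-style format is an assumption because this reference only gives the type as string.

```yaml
component:
  read:
    connection: my-s3-connection   # hypothetical connection name
    s3:
      path: my/csv/files/
      include:
        - suffix: .csv
          modified_at:
            on_or_after: "2024-01-01"   # assumed date format
            before: "2024-02-01"        # assumed date format
      parser: csv
```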