# GCS Read Component

Component for reading files from a Google Cloud Storage bucket.
## Examples
- gcs_read_csv_config.yaml
- gcs_read_json_date_format.yaml
- gcs_read_parquet_shorthand.yaml
```yaml
component:
  read:
    connection: my-gcs-connection
    gcs:
      path: /path/to/csv/files
      include:
        - suffix: .csv
      parser:
        csv:
          has_header: true
```
```yaml
component:
  read:
    connection: my-gcs-connection
    gcs:
      path: /path/to/json/files
      include:
        - suffix: .json
      parser:
        json:
          date_format: "%Y-%m-%d"
          timestamp_format: "%Y-%m-%dT%H:%M:%S"
```
```yaml
component:
  read:
    connection: my-gcs-connection
    gcs:
      path: /path/to/parquet/files
      parser: parquet
      include:
        - modified_at:
            on_or_after: '2023-01-01T00:00:00Z'
            before: '2025-01-01T00:00:00Z'
```
## GcsReadComponent

GcsReadComponent is defined beneath the following ancestor nodes in the YAML structure:

- Component
- ReadComponent

Below are the properties for the GcsReadComponent. Each property links to its specific details section further down this page.
| Property | Default | Type | Required | Description |
|---|---|---|---|---|
| dependencies |  | array[None] | No | List of dependencies that must complete before this Component runs. |
| event_time |  | string | No | Timestamp column in the Component output used to represent Event time. |
| connection |  | string | No | Name of the Connection to use for reading data. |
| columns |  | array[None] | No | List specifying the columns to read from the source and transformations to make during read. |
| normalize |  | boolean | No | Whether the output column names should be normalized to a standard naming convention after reading. |
| preserve_case |  | boolean | No | Whether the case of the column names should be preserved after reading. |
| uppercase |  | boolean | No | Whether the column names should be transformed to uppercase after reading. |
| strategy |  | PartitionedStrategy | No | Ingest strategy when reading files. |
| gcs |  |  | Yes | Options for reading files from a Google Cloud Storage bucket. |
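Combining several of these properties, a minimal sketch of a read Component (the connection name, path, and `event_time` column are illustrative, not required values):

```yaml
component:
  read:
    connection: my-gcs-connection   # name of an existing GCS Connection (illustrative)
    event_time: created_at          # hypothetical timestamp column in the output
    normalize: true                 # normalize output column names
    gcs:
      path: exports/events          # illustrative path, relative to the Connection root
      include:
        - suffix: .parquet
      parser: parquet
```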
## Property Details
### Component

A Component is a fundamental building block of a data Flow. Supported Component types include: Read, Transform, Task, Test, and more.

| Property | Default | Type | Required | Description |
|---|---|---|---|---|
| component |  | One of: CustomPythonReadComponent, ApplicationComponent, AliasedTableComponent, ExternalTableComponent, DbtNodeComponent | Yes | Component configuration options. |
### ReadComponent

Component that reads data from a system.

| Property | Default | Type | Required | Description |
|---|---|---|---|---|
| data_plane |  | One of: SnowflakeDataPlane, BigQueryDataPlane, DuckdbDataPlane, DatabricksDataPlane | No | Data Plane-specific configuration options for Components. |
| skip |  | boolean | No | Whether to skip processing for the Component. |
| retry_strategy |  |  | No | Retry strategy configuration options for the Component if any exceptions are encountered. |
| data_maintenance |  |  | No | The data maintenance configuration options for the Component. |
| description |  | string | No | Brief description of what the model does. |
| metadata |  |  | No | Meta information of a resource. In most cases it doesn't affect the system behavior, but it may be helpful when analyzing project resources. |
| name |  | string | Yes | The name of the model. |
| flow_name |  | string | No | Name of the Flow that the Component belongs to. |
| tests |  |  | No | Defines tests to run on this Component's data. |
| read |  | One of: GenericFileReadComponent, LocalFileReadComponent, SFTPReadComponent, S3ReadComponent, GcsReadComponent, AbfsReadComponent, HttpReadComponent, MSSQLReadComponent, MySQLReadComponent, OracleReadComponent, PostgresReadComponent, SnowflakeReadComponent, BigQueryReadComponent, DatabricksReadComponent | Yes | Read component that reads data from a system. |
### FileReadOptionsBase

Options for locating and parsing files from a specified directory or file path, including file selection criteria and the parser to use.

| Property | Default | Type | Required | Description |
|---|---|---|---|---|
| path |  | string | Yes | Path to the directory or file to read. Path is relative to the connection's root directory, and cannot be an absolute path or traverse outside the root directory. |
| exclude |  | array[None] | No | List of conditions to exclude specific files from being processed. |
| include |  | array[None] | No | List of conditions to include specific files for processing. |
| parser | auto | One of: auto, avro, feather, orc, parquet, pickle, text, xml, csv, excel, json | No | Parser Resource for reading the files. Defaults to 'auto'. To set specific parser options, use the parser name as a child object. |
| archive |  | Any of: | No | Configuration for archive files for processing. |
| load_strategy |  |  | No | Strategy for loading files, including limits on number and size of files. |
| time_based_file_selection |  | Any of: last_modified | No | Method to use for file selection based on a time window. |
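As a sketch of the file-selection options above (the path and limit values are illustrative):

```yaml
gcs:
  path: exports/daily        # relative to the Connection's root directory
  include:
    - suffix: .csv           # only process CSV files
  exclude:
    - suffix: .tmp           # skip temporary files
  parser: auto               # let the parser auto-detect format and settings
  load_strategy:
    max_files: 100           # cap the number of files per run
```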
### AutoParser

Parser that automatically detects the file format and settings for parsing.

| Property | Default | Type | Required | Description |
|---|---|---|---|---|
| hive_partitioning | False | boolean | No | Whether to extract hive partitioning from the full path name of the file and use it as partition columns. If true, the partition columns will be added to the table as additional columns. The dataset should not contain any columns with the same name as the partition columns. If false, the partition columns will not be added to the table. If not set, the default value is false. e.g.: directory names with key=value pairs like “/year=2009/month=11”, when set to true, will be parsed into partition columns with the names "year" and "month" and the values "2009" and "11" respectively. |
| auto |  |  | Yes | Options for automatically detecting the file format. None need to be specified. |
### AvroParser

Parser for the Avro file format.

| Property | Default | Type | Required | Description |
|---|---|---|---|---|
| hive_partitioning | False | boolean | No | Whether to extract hive partitioning from the full path name of the file and use it as partition columns. If true, the partition columns will be added to the table as additional columns. The dataset should not contain any columns with the same name as the partition columns. If false, the partition columns will not be added to the table. If not set, the default value is false. e.g.: directory names with key=value pairs like “/year=2009/month=11”, when set to true, will be parsed into partition columns with the names "year" and "month" and the values "2009" and "11" respectively. |
| avro |  |  | Yes | Parsing options for Avro files. |
### CsvParser

Parser for CSV files.

| Property | Default | Type | Required | Description |
|---|---|---|---|---|
| hive_partitioning | False | boolean | No | Whether to extract hive partitioning from the full path name of the file and use it as partition columns. If true, the partition columns will be added to the table as additional columns. The dataset should not contain any columns with the same name as the partition columns. If false, the partition columns will not be added to the table. If not set, the default value is false. e.g.: directory names with key=value pairs like “/year=2009/month=11”, when set to true, will be parsed into partition columns with the names "year" and "month" and the values "2009" and "11" respectively. |
| csv |  |  | Yes | Parsing options for CSV files. |
### CsvParserOptions

Parsing options for CSV files, including separators, header presence, and multi-line values.

| Property | Default | Type | Required | Description |
|---|---|---|---|---|
| all_varchar |  | boolean | No | Option to skip type detection for CSV parsing and assume all columns to be of type VARCHAR. |
| allow_quoted_nulls |  | boolean | No | Option to allow the conversion of quoted values to NULL values. |
| auto_type_candidates |  | array[string] | No | Types that the sniffer will use when detecting CSV column types. VARCHAR is always included as a fallback. |
| buffer_size |  | integer | No | The buffer size in bytes for the CSV reader. |
| columns |  | object with property values of type string | No | A struct specifying the column names and types. Using this option implies that auto detection is not used. |
| compression |  | string | No | File compression type, detected automatically from the file extension. |
| dateformat |  | string | No | Date format to use when parsing dates. |
| decimal_separator |  | string | No | Decimal separator of numbers. |
| delim |  | string | No | Delimiter of the CSV file. |
| types |  | Any of: array[string], object with property values of type string | No | Column types as a struct (by name). |
| encoding |  | string | No | Encoding of the CSV file. |
| escape |  | string | No | String that should appear before a data character sequence that matches the quote value. |
| force_not_null |  | array[string] | No | Do not match the specified columns' values against the NULL string. |
| header |  | Any of: boolean, integer | No | Indicates whether the file contains a header row. If True, the first row is treated as column names. If False, no header is expected. If an integer, specifies the row number (0-based) to use as the header. |
| hive_partitioning |  | boolean | No | Whether to interpret the path as a Hive partitioned path. |
| ignore_errors |  | boolean | No | Option to ignore any parsing errors encountered. |
| new_line |  | string | No | Line terminator of the CSV file. |
| max_line_size |  | integer | No | Maximum line size in bytes. |
| nullstr |  | Any of: string, array[string] | No | String or strings that represent a NULL value. |
| names |  | array[string] | No | List of column names. |
| normalize_names |  | boolean | No | Whether column names should be normalized, removing non-alphanumeric characters. |
| null_padding |  | boolean | No | Pad remaining columns on the right with null values if a row lacks columns. |
| parallel |  | boolean | No | Whether the parallel CSV reader is used. |
| quote |  | string | No | String to use as the quoting character. |
| sample_size |  | integer | No | Number of sample rows for auto detection of parameters. |
| sep |  | string | No | Single-character string to use as the column separator. Default is ','. |
| skip |  | integer | No | Number of lines at the top of the file to skip. |
| timestampformat |  | string | No | Timestamp format to use when parsing timestamps. |
| union_by_name |  | boolean | No | Whether the columns of multiple schemas should be unified by name, rather than by position. Increases memory consumption. |
| has_header | True | boolean | No | Indicates if the first row of the CSV is treated as the header row. Defaults to True. |
| has_multi_line_value | False | boolean | No | Indicates if the CSV contains values spanning multiple lines. Defaults to False. |
| auto_detect | True | boolean | No | Enables auto detection of CSV parameters. |
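A hedged sketch of a CSV parser block exercising a few of these options (the delimiter, null strings, and date format are illustrative):

```yaml
parser:
  csv:
    has_header: true
    delim: "|"                    # pipe-delimited input (illustrative)
    nullstr: ["NULL", "N/A"]      # strings to treat as NULL
    dateformat: "%d/%m/%Y"        # illustrative date format
    ignore_errors: true           # skip rows that fail to parse
```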
### ExcelParser

Parser for Excel files.

| Property | Default | Type | Required | Description |
|---|---|---|---|---|
| hive_partitioning | False | boolean | No | Whether to extract hive partitioning from the full path name of the file and use it as partition columns. If true, the partition columns will be added to the table as additional columns. The dataset should not contain any columns with the same name as the partition columns. If false, the partition columns will not be added to the table. If not set, the default value is false. e.g.: directory names with key=value pairs like “/year=2009/month=11”, when set to true, will be parsed into partition columns with the names "year" and "month" and the values "2009" and "11" respectively. |
| excel |  | Any of: | Yes | Parsing options for Excel files. |
### DuckDBExcelParserOptions

Parsing options for Excel files using DuckDB. Refer to the DuckDB documentation for more information here: https://duckdb.org/docs/guides/file_formats/excel_import.html

| Property | Default | Type | Required | Description |
|---|---|---|---|---|
| duckdb_options |  |  | Yes | Parsing options for DuckDB files. |
### DuckDBExcelParserOptionsWrapper

Wrapper for the DuckDB Excel Parser Options.

| Property | Default | Type | Required | Description |
|---|---|---|---|---|
| layer |  | string | No | Layer to use for parsing the Excel file. If not specified, the default layer is used. |
| open_options |  | array[string] | No | Additional open_options to pass to the st_read function. |
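As a sketch, these wrapper options might be wired up like this (the `layer` value and `open_options` entry are illustrative, not confirmed defaults):

```yaml
parser:
  excel:
    duckdb_options:
      layer: Sheet1            # worksheet to read (illustrative)
      open_options:
        - "HEADERS=FORCE"      # illustrative open option passed to st_read
```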
### FeatherParser

Parser for the Feather file format.

| Property | Default | Type | Required | Description |
|---|---|---|---|---|
| hive_partitioning | False | boolean | No | Whether to extract hive partitioning from the full path name of the file and use it as partition columns. If true, the partition columns will be added to the table as additional columns. The dataset should not contain any columns with the same name as the partition columns. If false, the partition columns will not be added to the table. If not set, the default value is false. e.g.: directory names with key=value pairs like “/year=2009/month=11”, when set to true, will be parsed into partition columns with the names "year" and "month" and the values "2009" and "11" respectively. |
| feather |  |  | Yes | Parsing options for Feather files. |
### FilenameTimestampPattern

Pattern to extract timestamps from file names for the purpose of time-based file selection.

| Property | Default | Type | Required | Description |
|---|---|---|---|---|
| pattern |  | string | Yes | Pattern to extract timestamps from filenames, e.g. r'year=(?P<year>\d{4})/month=(?P<month>\d{2})/day=(?P<day>\d{2})' |
### JsonParser

Parser for JSON files.

| Property | Default | Type | Required | Description |
|---|---|---|---|---|
| hive_partitioning | False | boolean | No | Whether to extract hive partitioning from the full path name of the file and use it as partition columns. If true, the partition columns will be added to the table as additional columns. The dataset should not contain any columns with the same name as the partition columns. If false, the partition columns will not be added to the table. If not set, the default value is false. e.g.: directory names with key=value pairs like “/year=2009/month=11”, when set to true, will be parsed into partition columns with the names "year" and "month" and the values "2009" and "11" respectively. |
| json |  |  | No | Parsing options for JSON files. |
### JsonParserOptions

Parsing options for JSON files, particularly focusing on date and timestamp formats.

| Property | Default | Type | Required | Description |
|---|---|---|---|---|
| columns |  | object with property values of type string | No | Struct specifying key names and value types contained within the JSON file (e.g. {key1: 'INTEGER', key2: 'VARCHAR'}). |
| compression |  | string | No | File compression type, detected automatically from the file extension by default. Options include 'uncompressed', 'gzip', 'zstd', and 'auto_detect'. |
| convert_strings_to_integers |  | boolean | No | Whether strings representing integer values should be converted to a numerical type. Defaults to False. |
| dateformat |  | string | No | Specifies the date format to use when parsing dates. The DuckDB default format is 'iso'. |
| format |  | string | No | Specifies the format of the JSON file. Can be one of ['auto', 'unstructured', 'newline_delimited', 'array']. |
| hive_partitioning |  | boolean | No | Whether or not to interpret the path as a Hive partitioned path. The default follows DuckDB's behavior. |
| ignore_errors |  | boolean | No | Whether to ignore any parsing errors encountered. Defaults to False. |
| map_inference_threshold |  | integer | No | Maximum number of elements in a map to use map inference. Set to -1 to use map inference for all maps. |
| maximum_depth |  | integer | No | Maximum nesting depth to which the automatic schema detection detects types. |
| maximum_object_size |  | integer | No | Maximum size of a JSON object in bytes. |
| records |  | string | No | Determines how JSON objects are treated in parsing. Can be one of ['auto', 'true', 'false']. |
| sample_size |  | integer | No | Number of sample objects for automatic JSON type detection. Set to -1 to scan the entire input file. |
| timestampformat |  | string | No | Specifies the timestamp format to use when parsing timestamps. |
| union_by_name |  | boolean | No | Whether the schemas of multiple JSON files should be unified. |
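A hedged sketch of a JSON parser block using a few of these options (the format strings are illustrative):

```yaml
parser:
  json:
    format: newline_delimited              # one JSON object per line
    dateformat: "%Y-%m-%d"                 # illustrative date format
    timestampformat: "%Y-%m-%dT%H:%M:%S"   # illustrative timestamp format
    ignore_errors: true
```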
### LoadStrategy

Specifications for loading files, including limits on number and size of files.

| Property | Default | Type | Required | Description |
|---|---|---|---|---|
| max_files |  | integer | No | Maximum number of files to read. |
| download_threads |  | integer | No | Number of threads to use for downloading files. |
| direct_load |  |  | No | Override default direct load behavior. |
### DirectLoadOptions

| Property | Default | Type | Required | Description |
|---|---|---|---|---|
| enable |  | boolean | No | Boolean flag indicating if direct load should be used if supported by the backend engine. Default is not to use direct load. |
| ignore_unknown_values | True | boolean | No | Optional. Indicates if BigQuery should allow extra values that are not represented in the table schema. If true, the extra values are ignored. If false, records with extra columns are treated as bad records, and if there are too many bad records, an invalid error is returned in the job result. The sourceFormat property determines what BigQuery treats as an extra value. |
| max_staging_table_age_in_hours |  | integer | No | For BigQuery backend, the maximum age of the staging table in hours. If the staging table is older than this age, it will be dropped. If not set, defaults to 6 hours. |
| nested_fields_as_string |  | boolean | No | For BigQuery backend, if true, nested fields will be converted to strings. If false, nested fields will be converted to JSON objects. |
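Putting the two tables above together, a hedged sketch of a load strategy with direct load enabled (all values illustrative):

```yaml
load_strategy:
  max_files: 500          # cap the number of files per run
  download_threads: 8     # parallel downloads
  direct_load:
    enable: true                  # use direct load if the backend supports it
    ignore_unknown_values: true   # BigQuery: ignore extra values not in the schema
```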
### OrcParser

Parser for the ORC file format.

| Property | Default | Type | Required | Description |
|---|---|---|---|---|
| hive_partitioning | False | boolean | No | Whether to extract hive partitioning from the full path name of the file and use it as partition columns. If true, the partition columns will be added to the table as additional columns. The dataset should not contain any columns with the same name as the partition columns. If false, the partition columns will not be added to the table. If not set, the default value is false. e.g.: directory names with key=value pairs like “/year=2009/month=11”, when set to true, will be parsed into partition columns with the names "year" and "month" and the values "2009" and "11" respectively. |
| orc |  |  | Yes | Parsing options for ORC files. |
### PandasExcelParserOptions

Parsing options for Excel files using pandas. Refer to the pandas documentation for more information here: https://pandas.pydata.org/docs/reference/api/pandas.read_excel.html

| Property | Default | Type | Required | Description |
|---|---|---|---|---|
| pandas_options |  |  | Yes | Parsing options for Pandas files. |
### PandasExcelParserOptionsWrapper

Wrapper for the Pandas Excel Parser Options.

| Property | Default | Type | Required | Description |
|---|---|---|---|---|
| date_format |  | string | No | Format string for parsing dates, following Python's strftime directives, e.g., '%Y-%m-%d'. |
| date_columns |  | array[string] | No | List of column names that should be parsed as dates. |
| timestamp_columns |  | array[string] | No | List of column names that should be parsed as timestamps. |
| sheet_name |  | string | No | Name of the Excel sheet to parse. If not specified, the first sheet is used by default. |
| header | Any of: |