# GCS Read Component

Component for reading files from a Google Cloud Storage bucket.
## Examples

**gcs_read_csv_config.yaml**

```yaml
component:
  read:
    connection: my-gcs-connection
    gcs:
      path: /path/to/csv/files
      include:
        - suffix: .csv
      parser:
        csv:
          has_header: true
```

**gcs_read_json_date_format.yaml**

```yaml
component:
  read:
    connection: my-gcs-connection
    gcs:
      path: /path/to/json/files
      include:
        - suffix: .json
      parser:
        json:
          date_format: "%Y-%m-%d"
          timestamp_format: "%Y-%m-%dT%H:%M:%S"
```

**gcs_read_parquet_shorthand.yaml**

```yaml
component:
  read:
    connection: my-gcs-connection
    gcs:
      path: /path/to/parquet/files
      parser: parquet
      include:
        - modified_at:
            on_or_after: '2023-01-01T00:00:00Z'
            before: '2025-01-01T00:00:00Z'
```

## GcsReadComponent

GcsReadComponent is defined beneath the `component` and `read` ancestor nodes in the YAML structure.

Below are the properties for the GcsReadComponent. Each property links to its details section further down this page.

| Property | Default | Type | Required | Description |
| --- | --- | --- | --- | --- |
| event_time | | string | No | Timestamp column in the component output used to represent event time. |
| connection | | string | No | The name of the connection to use for reading data. |
| columns | | array[ComponentColumn] | No | A list specifying the columns to read from the source and transformations to make during read. |
| normalize | | boolean | No | A boolean flag indicating if the output column names should be normalized to a standard naming convention after reading. |
| preserve_case | | boolean | No | A boolean flag indicating if the case of the column names should be preserved after reading. |
| uppercase | | boolean | No | A boolean flag indicating if the column names should be transformed to uppercase after reading. |
| strategy | | PartitionedStrategy | No | Ingest strategy when reading files. |
| gcs | | FileReadOptionsBase | Yes | Options for reading files from a Google Cloud Storage bucket. |

## Property Details

### Component

A component is a fundamental building block of a data flow. Types of components that are supported include: read, transform, task, test, and more.

| Property | Default | Type | Required | Description |
| --- | --- | --- | --- | --- |
| component | | One of: ReadComponent, TransformComponent, TaskComponent, SingularTestComponent, CustomPythonReadComponent, WriteComponent, CompoundComponent, AliasedTableComponent, ExternalTableComponent | Yes | Configuration options for the component. |

### ReadComponent

A component that reads data from a data system.

| Property | Default | Type | Required | Description |
| --- | --- | --- | --- | --- |
| data_plane | | One of: SnowflakeDataPlane, BigQueryDataPlane, DuckdbDataPlane, SynapseDataPlane, FabricDataPlane, DatabricksDataPlane | No | Data Plane-specific configuration options for a component. |
| description | | string | No | A brief description of what the model does. |
| metadata | | ResourceMetadata | No | Meta information of a resource. In most cases it doesn't affect the system behavior but may be helpful to analyze project resources. |
| name | | string | Yes | The name of the model. |
| flow_name | | string | No | The name of the flow that the component belongs to. |
| skip | | boolean | No | A boolean flag indicating whether to skip processing for the component or not. |
| data_maintenance | | DataMaintenance | No | The data maintenance configuration options for the component. |
| tests | | ComponentTestOptions | No | Defines tests to run on the data of this component. |
| read | | One of: GenericFileReadComponent, LocalFileReadComponent, SFTPReadComponent, S3ReadComponent, GcsReadComponent, AbfsReadComponent, HttpReadComponent, MSSQLReadComponent, MySQLReadComponent, OracleReadComponent, PostgresReadComponent, SnowflakeReadComponent, BigQueryReadComponent, DatabricksReadComponent | Yes | The read component that reads data from a data system. |

### FileReadOptionsBase

Options for locating and parsing files from a specified directory or file path, including file selection criteria and the parser to use.

| Property | Default | Type | Required | Description |
| --- | --- | --- | --- | --- |
| path | | string | Yes | Path to the directory or file to read. The path is relative to the connection's root directory; it cannot be an absolute path or traverse outside the root directory. |
| exclude | | array[FileSelection] | No | List of conditions to exclude specific files from being processed. |
| include | | array[FileSelection] | No | List of conditions to include specific files for processing. |
| parser | auto | One of: AutoParser, AvroParser, FeatherParser, OrcParser, ParquetParser, PickleParser, TextParser, XmlParser, CsvParser, ExcelParser, JsonParser (each may be given as its string name or as an object) | No | Parser resource for reading the files. Defaults to 'auto'. To set specific parser options, use the parser name as a child object. |
| archive | | Any of: TarArchive, ZipArchive | No | Configuration for archive files for processing. |
| load_strategy | | LoadStrategy | No | Strategy for loading files, including limits on the number and size of files. |
| time_based_file_selection | | Any of: string, FilenameTimestampPattern | No | Method to use for file selection based on a time window. |

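To illustrate how these options compose, here is a hypothetical configuration (the connection name and paths are placeholders) combining `include`, `exclude`, and the string shorthand for `parser`:

```yaml
component:
  read:
    connection: my-gcs-connection   # placeholder connection name
    gcs:
      path: exports/daily           # relative to the connection's root directory
      include:
        - suffix: .csv              # select CSV files...
      exclude:
        - prefix: tmp_              # ...but skip temporary files
      parser: csv                   # string shorthand; use a child object to set parser options
```
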
### AutoParser

Parser that automatically detects the file format and settings for parsing.

| Property | Default | Type | Required | Description |
| --- | --- | --- | --- | --- |
| auto | | NoParserOptions | Yes | Options for automatically detecting the file format. None need to be specified. |

### AvroParser

Parser for the Avro file format.

| Property | Default | Type | Required | Description |
| --- | --- | --- | --- | --- |
| avro | | NoParserOptions | Yes | Options for parsing Avro files. |

### CsvParser

Parser for CSV files.

| Property | Default | Type | Required | Description |
| --- | --- | --- | --- | --- |
| csv | | CsvParserOptions | Yes | Options for parsing CSV files. |

### CsvParserOptions

Options for parsing CSV files, including separators, header presence, and multi-line values.

| Property | Default | Type | Required | Description |
| --- | --- | --- | --- | --- |
| all_varchar | | boolean | No | Option to skip type detection for CSV parsing and assume all columns are of type VARCHAR. |
| allow_quoted_nulls | | boolean | No | Option to allow the conversion of quoted values to NULL values. |
| auto_type_candidates | | array[string] | No | Types that the sniffer will use when detecting CSV column types. VARCHAR is always included as a fallback. |
| buffer_size | | integer | No | The buffer size in bytes for the CSV reader. |
| columns | | object with property values of type string | No | A struct specifying the column names and types. Using this option implies that auto detection is not used. |
| compression | | string | No | The compression type for the file, detected automatically from the file extension. |
| dateformat | | string | No | The date format to use when parsing dates. |
| decimal_separator | | string | No | The decimal separator of numbers. |
| delim | | string | No | The delimiter of the CSV file. |
| types | | Any of: array[string], object with property values of type string | No | The column types, as a list (by position) or a struct (by name). |
| encoding | | string | No | The encoding of the CSV file. |
| escape | | string | No | The string that should appear before a data character sequence that matches the quote value. |
| force_not_null | | array[string] | No | Do not match the specified columns' values against the NULL string. |
| header | | Any of: boolean, integer | No | Specifies that the file contains a header line with the names of each column. |
| hive_partitioning | | boolean | No | Whether to interpret the path as a Hive partitioned path. |
| ignore_errors | | boolean | No | Option to ignore any parsing errors encountered. |
| new_line | | string | No | The line terminator of the CSV file. |
| max_line_size | | integer | No | The maximum line size in bytes. |
| nullstr | | Any of: string, array[string] | No | The string or strings that represent a NULL value. |
| names | | array[string] | No | The column names as a list. |
| normalize_names | | boolean | No | Whether column names should be normalized, removing non-alphanumeric characters. |
| null_padding | | boolean | No | Pad remaining columns on the right with NULL values if a row lacks columns. |
| parallel | | boolean | No | Whether the parallel CSV reader is used. |
| quote | | string | No | The quoting string to be used when a data value is quoted. |
| sample_size | | integer | No | The number of sample rows for auto detection of parameters. |
| sep | | string | No | Single-character string to use as the column separator. Default is ','. |
| skip | | integer | No | The number of lines at the top of the file to skip. |
| timestampformat | | string | No | The timestamp format to use when parsing timestamps. |
| union_by_name | | boolean | No | Whether the columns of multiple schemas should be unified by name, rather than by position. Increases memory consumption. |
| has_header | True | boolean | No | Indicates if the first row of the CSV is treated as the header row. Defaults to True. |
| has_multi_line_value | False | boolean | No | Indicates if the CSV contains values spanning multiple lines. Defaults to False. |
| auto_detect | True | boolean | No | Enables auto detection of CSV parameters. |

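As a sketch of how these options nest under the `csv` key, the following hypothetical parser block reads a pipe-delimited file with a banner line above the header and custom NULL markers (all values illustrative):

```yaml
parser:
  csv:
    has_header: true
    delim: "|"                  # pipe-delimited columns
    skip: 1                     # skip one banner line above the header
    nullstr: ["NA", "null"]     # strings interpreted as NULL
    dateformat: "%Y-%m-%d"      # date layout used during type conversion
```
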
### ExcelParser

Parser for Excel files.

| Property | Default | Type | Required | Description |
| --- | --- | --- | --- | --- |
| excel | | Any of: DuckDBExcelParserOptions, PandasExcelParserOptions | Yes | Options for parsing Excel files. |

### DuckDBExcelParserOptions

Parsing options for Excel files using DuckDB. Refer to the DuckDB documentation for more information: https://duckdb.org/docs/guides/file_formats/excel_import.html

| Property | Default | Type | Required | Description |
| --- | --- | --- | --- | --- |
| duckdb_options | | DuckDBExcelParserOptionsWrapper | Yes | Options for parsing Excel files with DuckDB. |

### DuckDBExcelParserOptionsWrapper

Wrapper for the DuckDB Excel Parser Options.

| Property | Default | Type | Required | Description |
| --- | --- | --- | --- | --- |
| layer | | string | No | The layer to use for parsing the Excel file. If not specified, the default layer is used. |
| open_options | | array[string] | No | Additional open_options to pass to the st_read function. |

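For example, a hypothetical DuckDB-backed Excel parser block might look like the following sketch; the layer name is illustrative, and the `open_options` entry assumes a GDAL-style option that is passed through unchanged to `st_read`:

```yaml
parser:
  excel:
    duckdb_options:
      layer: Sheet1                     # worksheet to read; illustrative name
      open_options: ["HEADERS=FORCE"]   # assumed example; forwarded as-is to st_read
```
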
### FeatherParser

Parser for the Feather file format.

| Property | Default | Type | Required | Description |
| --- | --- | --- | --- | --- |
| feather | | NoParserOptions | Yes | Options for parsing Feather files. |

### FilenameTimestampPattern

Pattern to extract timestamps from file names for the purpose of time-based file selection.

| Property | Default | Type | Required | Description |
| --- | --- | --- | --- | --- |
| pattern | | string | Yes | Pattern to extract timestamps from filenames, e.g. r'year=(?P<year>\d{4})/month=(?P<month>\d{2})/day=(?P<day>\d{2})'. |

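Assuming files are laid out in Hive-style date directories, a `time_based_file_selection` block using this pattern might look like this sketch (the path is a placeholder):

```yaml
gcs:
  path: logs   # placeholder path
  time_based_file_selection:
    pattern: 'year=(?P<year>\d{4})/month=(?P<month>\d{2})/day=(?P<day>\d{2})'
```
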
### JsonParser

Parser for JSON files.

| Property | Default | Type | Required | Description |
| --- | --- | --- | --- | --- |
| json | | JsonParserOptions | No | Options for parsing JSON files. |

### JsonParserOptions

Parsing options for JSON files, particularly focusing on date and timestamp formats.

| Property | Default | Type | Required | Description |
| --- | --- | --- | --- | --- |
| columns | | object with property values of type string | No | A struct that specifies the key names and value types contained within the JSON file (e.g. {key1: 'INTEGER', key2: 'VARCHAR'}). |
| compression | | string | No | The compression type for the file. Detected automatically from the file extension by default. Options include 'uncompressed', 'gzip', 'zstd', and 'auto_detect'. |
| convert_strings_to_integers | | boolean | No | Whether strings representing integer values should be converted to a numerical type. Defaults to False. |
| dateformat | | string | No | Specifies the date format to use when parsing dates. The DuckDB default format is 'iso'. |
| format | | string | No | Specifies the format of the JSON file. Can be one of ['auto', 'unstructured', 'newline_delimited', 'array']. |
| hive_partitioning | | boolean | No | Whether or not to interpret the path as a Hive partitioned path. Defaults to the DuckDB default. |
| ignore_errors | | boolean | No | Whether to ignore any parsing errors encountered. Defaults to False. |
| map_inference_threshold | | integer | No | The maximum number of elements in a map to use map inference. Set to -1 to use map inference for all maps. |
| maximum_depth | | integer | No | Maximum nesting depth to which the automatic schema detection detects types. |
| maximum_object_size | | integer | No | The maximum size of a JSON object in bytes. |
| records | | string | No | Determines how JSON objects are treated in parsing. Can be one of ['auto', 'true', 'false']. |
| sample_size | | integer | No | Number of sample objects for automatic JSON type detection. Set to -1 to scan the entire input file. |
| timestampformat | | string | No | Specifies the timestamp format to use when parsing timestamps. |
| union_by_name | | boolean | No | Whether the schemas of multiple JSON files should be unified. |

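For instance, a hypothetical parser block for newline-delimited JSON with an explicit schema and timestamp layout (all key names and types are illustrative):

```yaml
parser:
  json:
    format: newline_delimited
    timestampformat: "%Y-%m-%dT%H:%M:%S"
    columns:                   # illustrative schema
      id: INTEGER
      name: VARCHAR
      created_at: TIMESTAMP
```
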
### LoadStrategy

Specifications for loading files, including limits on the number and size of files.

| Property | Default | Type | Required | Description |
| --- | --- | --- | --- | --- |
| max_files | | integer | No | Maximum number of files to read. |
| max_size | | Any of: integer, string | No | Maximum size of files to read. Can be an integer or a string that represents a byte size (e.g., '10MB', '1GB'). |
| download_threads | | integer | No | Number of threads to use for downloading files. |
| direct_load | | DirectLoadOptions | No | Override default direct load behavior. |

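A hypothetical `load_strategy` block capping each run at 500 files or 1 GB, with direct load enabled (all limits and paths are illustrative):

```yaml
gcs:
  path: exports   # placeholder path
  load_strategy:
    max_files: 500
    max_size: "1GB"        # byte-size string form; an integer byte count also works
    download_threads: 8
    direct_load:
      enable: true         # use backend-native loading where supported
```
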
### DirectLoadOptions

| Property | Default | Type | Required | Description |
| --- | --- | --- | --- | --- |
| enable | | boolean | No | A boolean flag indicating if direct load should be used when supported by the backend engine. Defaults to not using direct load. |
| ignore_unknown_values | True | boolean | No | Optional. Indicates if BigQuery should allow extra values that are not represented in the table schema. If true, the extra values are ignored. If false, records with extra columns are treated as bad records, and if there are too many bad records, an invalid error is returned in the job result. The sourceFormat property determines what BigQuery treats as an extra value. |
| max_staging_table_age_in_hours | | integer | No | For the BigQuery backend, the maximum age of the staging table in hours. If the staging table is older than this age, it will be dropped. Defaults to 6 hours if not set. |
| nested_fields_as_string | | boolean | No | For the BigQuery backend, if true, nested fields will be converted to strings. If false, nested fields will be converted to JSON objects. |

### OrcParser

Parser for the ORC file format.

| Property | Default | Type | Required | Description |
| --- | --- | --- | --- | --- |
| orc | | NoParserOptions | Yes | Options for parsing ORC files. |

### PandasExcelParserOptions

Parsing options for Excel files using pandas. Refer to the pandas documentation for more information: https://pandas.pydata.org/docs/reference/api/pandas.read_excel.html

| Property | Default | Type | Required | Description |
| --- | --- | --- | --- | --- |
| pandas_options | | PandasExcelParserOptionsWrapper | Yes | Options for parsing Excel files with pandas. |

### PandasExcelParserOptionsWrapper

Wrapper for the pandas Excel Parser Options.

| Property | Default | Type | Required | Description |
| --- | --- | --- | --- | --- |
| date_format | | string | No | Format string for parsing dates, following Python's strftime directives, e.g. '%Y-%m-%d'. |
| date_columns | | array[string] | No | List of column names that should be parsed as dates. |
| timestamp_columns | | array[string] | No | List of column names that should be parsed as timestamps. |
| sheet_name | | string | No | Name of the Excel sheet to parse. If not specified, the first sheet is used by default. |
| header | | Any of: integer, array[integer] | No | Row number(s) to use as the column names and the start of the data. |
| names | | array[string] | No | List of column names to use. If not specified and header is True, the first row of the data is used. If a list is passed, it must match the number of columns in the data. |
| index_col | | Any of: integer, array[integer] | No | Column number(s) to set as the index (0-based). |
| usecols | | Any of: string, array[string] | No | If not specified, parse all columns. If a string, a comma-separated list of Excel column letters and column ranges (e.g. 'A:E' or 'A,C,E:F'); ranges are inclusive of both sides. If a list of integers, the column numbers to parse (0-indexed). If a list of strings, the column names to parse. |
| dtype | | object with property values of type string | No | A struct mapping column names to data types. |
| skiprows | | Any of: integer, array[integer] | No | Row number(s) to skip (0-based) before reading the data. |
| nrows | | integer | No | Number of rows to read. If not specified, all rows are read. |
| na_values | | Any of: string, array[string] | No | String or list of strings to recognize as NA/NaN. |
| keep_default_na | | boolean | No | Whether to include the default NaN values when parsing the data. |
| na_filter | | boolean | No | Whether to filter out rows with NA values. If True, rows with NA values are filtered out; if False, they are included. |
| parse_dates | | Any of: boolean, array[boolean] | No | Whether to parse dates. If a list is passed, the values indicate whether the corresponding columns should be parsed as dates. |
| skipfooter | | integer | No | Number of lines at the bottom of the file to skip. |

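Putting a few of these together, a hypothetical pandas-backed Excel parser block might look like this (sheet and column names are illustrative):

```yaml
parser:
  excel:
    pandas_options:
      sheet_name: Orders         # illustrative sheet name
      header: 0                  # first row holds the column names
      usecols: "A:E"             # Excel column range, inclusive of both sides
      skipfooter: 2              # drop two trailing summary rows
      na_values: ["N/A", "-"]    # extra strings recognized as NA
      date_columns: [order_date] # illustrative column parsed as a date
```
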
### ParquetParser

Parser for the Parquet file format.

| Property | Default | Type | Required | Description |
| --- | --- | --- | --- | --- |
| parquet | | NoParserOptions | Yes | Options for parsing Parquet files. |

### PickleParser

Parser for the Pickle file format. Pickle files are generated by Python's pickle module.

| Property | Default | Type | Required | Description |
| --- | --- | --- | --- | --- |
| pickle | | NoParserOptions | Yes | Options for parsing Pickle files. |

### ComponentColumn

Component column expression definition.

No properties defined.

### TarArchive

Configuration options for tar archive files.

| Property | Default | Type | Required | Description |
| --- | --- | --- | --- | --- |
| tar | | FileArchiveOptions | No | Configuration for tar archive files. |

### TextParser

Parser for text files.

| Property | Default | Type | Required | Description |
| --- | --- | --- | --- | --- |
| text | | NoParserOptions | Yes | Options for parsing text files. |

### PartitionedStrategy

Partitioned ingest strategy. The user is expected to provide two functions: a list function that lists partitions in the source, and a read function that reads a partition from the source.

| Property | Default | Type | Required | Description |
| --- | --- | --- | --- | --- |
| partitioned | | PartitionedOptions | No | Options for partitioning data. |
| on_schema_change | | string ("ignore", "fail", "append_new_columns", "sync_all_columns") | No | Policy to apply when schema changes are detected. Defaults to 'fail' if not provided. |

### PartitionedOptions

Options related to partition optimization - in particular, the policy that determines which partitions to ingest.

| Property | Default | Type | Required | Description |
| --- | --- | --- | --- | --- |
| enable_substitution_by_partition_name | | boolean | Yes | Enable substitution by partition name. |
| output_type | table | string ("table", "view") | No | Output type for partitioned data. Must be either 'table' or 'view'. This strategy applies only to Transforms. |

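As a sketch, the `strategy` property sits under `read` alongside `connection` and `gcs` (names and values are placeholders):

```yaml
component:
  read:
    connection: my-gcs-connection   # placeholder
    gcs:
      path: exports                 # placeholder
    strategy:
      partitioned:
        enable_substitution_by_partition_name: true
      on_schema_change: append_new_columns   # instead of the default 'fail'
```
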
### XmlParser

Parser for XML files.

| Property | Default | Type | Required | Description |
| --- | --- | --- | --- | --- |
| xml | | NoParserOptions | Yes | Options for parsing XML files. |

### NoParserOptions

No custom parsing options exist for this parser.

No properties defined.

### ZipArchive

Configuration options for ZIP archive files.

| Property | Default | Type | Required | Description |
| --- | --- | --- | --- | --- |
| zip | | FileArchiveOptions | No | Configuration for ZIP archive files. |

### FileArchiveOptions

Options for working with files inside of the file structure of an archive file format.

| Property | Default | Type | Required | Description |
| --- | --- | --- | --- | --- |
| path | | string | No | Path to the directory in the archive containing files that should be processed. |
| include | | array[FileSelection] | No | List of conditions to include specific files in the archive. |
| exclude | | array[FileSelection] | No | List of conditions to exclude specific files in the archive. |

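For example, a hypothetical configuration that reads CSV files out of a `data/` directory inside each ZIP archive (all paths are placeholders):

```yaml
gcs:
  path: archives            # placeholder path containing the archives
  include:
    - suffix: .zip          # select the archive files themselves
  archive:
    zip:
      path: data/           # directory inside each archive
      include:
        - suffix: .csv      # files to process within the archive
  parser: csv
```
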
### FileSelection

Options for selecting files based on various criteria. All criteria specified must be met for a file to be included.

| Property | Default | Type | Required | Description |
| --- | --- | --- | --- | --- |
| prefix | | string | No | Select only files with this prefix. Strongly recommended when parser is not set to 'auto'. |
| suffix | | string | No | Select only files with this suffix. Strongly recommended when parser is not set to 'auto'. |
| created_at | | FileTimeSelection | No | Include files created within the specified date range. |
| modified_at | | FileTimeSelection | No | Include files modified within the specified date range. |
| timestamp_from_filename | | FileTimeFromFilenameSelection | No | Extract and include files based on timestamps parsed from filenames. |
| glob | | string | No | Glob pattern for including files. |
| regex | | string | No | Regular expression pattern for including files. |

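Since all criteria within a single selection must match, a hypothetical selection that picks up only recent JSON event files can combine a glob with a modification-time window (the layout is illustrative):

```yaml
include:
  - glob: "events/**/*.json"   # illustrative file layout
    modified_at:
      since:
        days: 7                # only files modified in the last week
```
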
### FileTimeFromFilenameSelection

Option to extract timestamps from filenames and include files based on those timestamps.

| Property | Default | Type | Required | Description |
| --- | --- | --- | --- | --- |
| after | | string | No | Include files created or modified after this date. |
| before | | string | No | Include files created or modified before this date. |
| on_or_after | | string | No | Include files created or modified on or after this date. |
| on_or_before | | string | No | Include files created or modified on or before this date. |
| since | | TimeDelta | No | Include files created or modified within the specified time delta from now. |
| pattern | | string | Yes | Pattern to extract timestamps from filenames, e.g. r'year=(?P<year>\d{4})/month=(?P<month>\d{2})/day=(?P<day>\d{2})'. |

### FileTimeSelection

Defines time-based selection criteria for including files, based on creation or modification dates.

| Property | Default | Type | Required | Description |
| --- | --- | --- | --- | --- |
| after | | string | No | Include files created or modified after this date. |
| before | | string | No | Include files created or modified before this date. |
| on_or_after | | string | No | Include files created or modified on or after this date. |
| on_or_before | | string | No | Include files created or modified on or before this date. |
| since | | TimeDelta | No | Include files created or modified within the specified time delta from now. |

### TimeDelta

| Property | Default | Type | Required | Description |
| --- | --- | --- | --- | --- |
| seconds | 0 | integer | No | The number of seconds. |
| minutes | 0 | integer | No | The number of minutes. |
| hours | 0 | integer | No | The number of hours. |
| days | 0 | integer | No | The number of days. |
| weeks | 0 | integer | No | The number of weeks. |
| months | 0 | integer | No | The number of months. |
| years | 0 | integer | No | The number of years. |

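For example, a hypothetical selection restricted to files created within the last two weeks:

```yaml
include:
  - created_at:
      since:
        weeks: 2   # relative to now
```
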