Read data from Blob Store
In this guide, we will learn how to create a read component to ingest data from file systems, including object stores like Amazon S3, Google Cloud Storage, and Azure Blob Filesystem.
Prerequisites
Before beginning, ensure you have the following:
- An existing Ascend project (see: Create a Project)
- An existing Ascend Workspace (see: Setting Up a Workspace)
- At least one blob store or file system connection
Setting up your Read Component
- Within your Workspace, click on the Files icon on the left navigation bar. This will open the File Tree in the left panel.
- In your File Tree, locate your top-level project folder.
- Within the project folder, expand the `flows` folder and select the flow you want to add your read component to. If you don't have a flow yet, you can learn how to create one here.
- Right-click on the flow folder, select **New File**, and provide a name for your read component, ensuring it has a `.yaml` extension (e.g. `example_rc.yaml`).
Configuring your Read Component
- Add the following YAML code to your file, replacing the value in the `connection` field with the name of the connection you are using.

```yaml
component:
  read:
    connection: lake_on_s3
```
- Next, you'll add a key depending on the source you're reading from. Allowed values include `s3`, `abfs`, `gcs`, and `local`. In this example, we will be reading from an Amazon S3 bucket:

```yaml
component:
  read:
    connection: lake_on_s3
    s3:
```
Selecting files to read
- Add the `path` key to specify the path to the directory or file to read.

```yaml
component:
  read:
    connection: lake_on_s3
    s3:
      path: listing/binary/
```
The path specified in the component is relative to the root directory specified in the connection. For example, if the root on the connection is set to `/data/` and the path is set to `listing/binary/`, the full path we will read from will be `/data/listing/binary/`.
The path cannot be an absolute path or traverse outside the root directory.
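This resolution rule can be sketched in Python. The snippet below is a minimal illustration of the behavior described above, not Ascend's implementation; the `resolve_read_path` helper and its error messages are hypothetical.

```python
import posixpath

def resolve_read_path(root: str, path: str) -> str:
    """Join a connection root and a component path, rejecting absolute
    paths and traversal outside the root (illustrative only)."""
    if posixpath.isabs(path):
        raise ValueError("path must be relative to the connection root")
    # Normalize so segments like "a/../b" collapse before the check.
    full = posixpath.normpath(posixpath.join(root, path))
    if not (full + "/").startswith(posixpath.normpath(root) + "/"):
        raise ValueError("path must not traverse outside the root directory")
    return full

print(resolve_read_path("/data/", "listing/binary/"))  # /data/listing/binary
```

A path such as `../other/` would fail the prefix check and be rejected, matching the rule that reads cannot escape the connection root.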
- We can specify filters on our component to include or exclude certain files, using the `include` or `exclude` keys. To see a full list of file filtering options, check out our reference guide. In the example below, we use the `include` filter to select only the files that match the criteria defined: files that end with the suffix `csv` and were created after `2024-01-01`. Note that `suffix` is just the file type suffix, without a period.

```yaml
component:
  read:
    connection: lake_on_s3
    s3:
      path: listing/binary/
      include:
        - suffix: csv
        - created_at:
            after: 2024-01-01
```
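Filters can also drop files via the `exclude` key mentioned above. The sketch below assumes `exclude` accepts the same filter options as `include` (check the reference guide for specifics); the `tmp` suffix and `_staging` glob are purely illustrative values, not from the example dataset.

```yaml
component:
  read:
    connection: lake_on_s3
    s3:
      path: listing/binary/
      exclude:
        - suffix: tmp               # skip temporary files (illustrative)
        - glob: "**/_staging/**"    # skip a hypothetical staging directory
```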
Selecting a parser type
- Now, you'll define how you want to parse the data. For a full list of parser types and their options, check out our reference guide. In this example, we will use the `auto` parser, which will automatically determine how to parse the data based on the file type.

```yaml
component:
  read:
    connection: lake_on_s3
    s3:
      path: listing/binary/
      include:
        - suffix: csv
        - created_at:
            after: 2024-01-01
      parser: auto
```
Configuring the load strategy
- You can specify a load strategy to limit the number or size of files loaded at once, and to set the number of parallel threads for downloading files from the source. For a full list of load strategy options, check out our reference guide. In this example, we set `max_size` to 1GB: when downloading files, Ascend will attempt to group them into batches that do not exceed 1 GB.

```yaml
component:
  read:
    connection: lake_on_s3
    s3:
      path: listing/binary/
      include:
        - suffix: csv
        - created_at:
            after: 2024-01-01
      parser: auto
      load_strategy:
        max_size: 1GB
```
After you are done configuring your component, you can hit **Save** to save the changes to your project. To run your component, check out our guide on running a component in a Workspace.
More Examples:
Reading CSV data with custom parsing and filtering files by using a glob pattern
In this example, we are reading pipe-delimited (`|`) files that are partitioned by year/month/date.

```yaml
component:
  read:
    connection: lake_on_gcs
    gcs:
      path: listing/partitioned/year=20
      include:
        - glob: "**/year=2024/month=0*/**/*.csv"
      parser:
        csv:
          sep: "|"
          has_header: true
          date_format: "%Y/%m/%d"
          timestamp_format: '%Y-%m-%d %H:%M:%S'
```
Let's break down the configuration:

- `glob: "**/year=2024/month=0*/**/*.csv"` - The `glob` filter option lets you specify a glob pattern to include or exclude certain files. The pattern specified here matches all files that are in the `year=2024` directory, are in any `month=0*` directory, and have a `.csv` extension.
- `sep: "|"` - The delimiter used in the file.
- `has_header: true` - Ensures that the CSV header is read in as the column names. If the CSV file does not have a header, set this to `false`.
- `date_format: "%Y/%m/%d"` - The date format used in the CSV file, so that date columns can be properly loaded as `date` type columns.
- `timestamp_format: '%Y-%m-%d %H:%M:%S'` - The timestamp format used in the CSV file, so that timestamp columns can be properly loaded as `timestamp` type columns.
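Both format strings use standard strftime-style directives. As a quick illustration of how they map values to dates and timestamps (the sample values below are hypothetical, not from the dataset above):

```python
from datetime import datetime

# Hypothetical sample values matching the formats configured above.
date_value = datetime.strptime("2024/03/15", "%Y/%m/%d").date()
timestamp_value = datetime.strptime("2024-03-15 09:30:00", "%Y-%m-%d %H:%M:%S")

print(date_value)       # 2024-03-15
print(timestamp_value)  # 2024-03-15 09:30:00
```

If the formats don't match the actual file contents, parsing the column as a `date` or `timestamp` type will fail, so it's worth verifying them against a sample file.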