Read data from Blob Store
In this guide, we will learn how to create a read component to ingest data from file systems, including object stores like Amazon S3, Google Cloud Storage, and Azure Blob Filesystem.
Prerequisites
Before beginning, ensure you have the following:
- An existing Ascend project (see: Create a Project)
- An existing Ascend Workspace (see: Setting Up a Workspace)
- At least one blob store or file system connection (for example, Amazon S3, Google Cloud Storage, Azure Blob Filesystem, or a local file system)
Setting up your Read Component
- Within your Workspace, click on the Files icon on the left navigation bar. This will open the File Tree in the left panel.
- In your File Tree, locate your top-level project folder.
- Within the project folder, expand the `flows` folder and select the flow you want to add your read component to. If you don't have a flow yet, you can learn how to create one here.
- Right-click on the flow folder, select `New File`, and provide a name for your read component, ensuring it has a `.yaml` extension (e.g. `example_rc.yaml`).
Configuring your Read Component
- Add the following YAML code to your file, replacing the value in the `connection` field with the name of the connection you are using.

my_project/flows/my_flow/components/read_from_s3.yaml

```yaml
component:
  read:
    connection: lake_on_s3
```
- Next, you'll add a key that matches the source you're reading from. Examples of allowed values are `s3`, `abfs`, `gcs`, and `local`. In this example, we will be reading from an Amazon S3 bucket (a variant for another source is sketched after the example):

my_project/flows/my_flow/components/read_from_s3.yaml

```yaml
component:
  read:
    connection: lake_on_s3
    s3:
```
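The structure is identical for the other source types; only the source key and the connection name change. As a minimal sketch for Google Cloud Storage, where the connection name `lake_on_gcs` is an assumed placeholder rather than something defined in this guide:

```yaml
component:
  read:
    connection: lake_on_gcs  # assumed connection name, for illustration only
    gcs:
```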
Selecting files to read
- Add the `path` key to specify the path to the directory or file to read.

my_project/flows/my_flow/components/read_from_s3.yaml

```yaml
component:
  read:
    connection: lake_on_s3
    s3:
      path: listing/binary/
```
File Paths
The path specified in the component is relative to the root directory specified in the connection. For example, if the root on the connection is set to `/data/` and the path is set to `listing/binary/`, the full path we will read from will be `/data/listing/binary/`.
The path cannot be an absolute path or traverse outside the root directory.
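As a quick illustration of this rule, assume (for this sketch only) that the `lake_on_s3` connection was created with a root of `/data/`; the component below would then read from `/data/listing/binary/`:

```yaml
# connection root (assumed):  /data/
# component path:             listing/binary/
# resolved read location:     /data/listing/binary/
component:
  read:
    connection: lake_on_s3
    s3:
      path: listing/binary/   # relative to the connection root; absolute paths are rejected
```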
- We can specify filters on our component to include or exclude certain files, using the `include` or `exclude` keys. To see a full list of file filtering options, check out our reference guide. In the example below, we use the `include` filter to keep only the files that match the defined criteria: files that end with the suffix `csv` and were created after `2024-01-01`. Note that `suffix` is just the file type suffix without a period. (An `exclude` variant is sketched after the example.)

my_project/flows/my_flow/components/read_from_s3.yaml

```yaml
component:
  read:
    connection: lake_on_s3
    s3:
      path: listing/binary/
      include:
        - suffix: csv
        - created_at:
            after: 2024-01-01
```
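If you instead need to drop files, the `exclude` key takes the same style of filters. Below is a hedged sketch that skips temporary files; the `tmp` suffix is an illustrative choice, not a value required by the platform:

```yaml
component:
  read:
    connection: lake_on_s3
    s3:
      path: listing/binary/
      exclude:
        - suffix: tmp   # skip files ending in .tmp (illustrative filter)
```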
Selecting a parser type
- Now, you'll define how you want to parse the data. For a full list of parser types and their options, check out our reference guide. In this example, we will use the `auto` parser, which automatically determines how to parse the data based on the file type.

my_project/flows/my_flow/components/read_from_s3.yaml

```yaml
component:
  read:
    connection: lake_on_s3
    s3:
      path: listing/binary/
      include:
        - suffix: csv
        - created_at:
            after: 2024-01-01
      parser: auto
```
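If you'd rather pin the parser than rely on auto-detection, the reference guide lists the supported types and their options. As a sketch, assuming `csv` is one of the accepted values:

```yaml
component:
  read:
    connection: lake_on_s3
    s3:
      path: listing/binary/
      parser: csv   # explicit parser type; assumes csv appears in the reference guide's list
```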