
Read data from Blob Store

In this guide, we will learn how to create a read component to ingest data from file systems, including object stores like Amazon S3, Google Cloud Storage, and Azure Blob Filesystem.

Prerequisites

Before beginning, ensure you have the following:

Setting up your Read Component

  1. Within your Workspace, click on the Files icon on the left navigation bar. This will open the File Tree in the left panel.
  2. In your File Tree, locate your top-level project folder.
  3. Within the project folder, expand the flows folder and select the flow you want to add your read component to. If you don't have a flow yet, you can learn how to create one here.
  4. Right-click on the flow folder, select New File, and provide a name for your read component, ensuring it has a .yaml extension (e.g. example_rc.yaml).

Configuring your Read Component

  1. Add the following YAML code to your file, replacing the value in the connection field with the name of the connection you are using.
my_project/flows/my_flow/components/read_from_s3.yaml
component:
  read:
    connection: lake_on_s3
  2. Next, you'll add a key that matches the source you're reading from. Allowed values include s3, abfs, gcs, and local. In this example, we will read from an Amazon S3 bucket (an Azure Blob Filesystem variant is sketched after the example below):
my_project/flows/my_flow/components/read_from_s3.yaml
component:
  read:
    connection: lake_on_s3
    s3:
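
Reading from a different source only changes this key. For instance, here is a minimal sketch for Azure Blob Filesystem, assuming a connection named lake_on_abfs (the connection name and file name are hypothetical):

my_project/flows/my_flow/components/read_from_abfs.yaml
component:
  read:
    connection: lake_on_abfs  # hypothetical connection name
    abfs: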

Selecting files to read

  1. Add the path key to specify the path to the directory or file to read.
my_project/flows/my_flow/components/read_from_s3.yaml
component:
  read:
    connection: lake_on_s3
    s3:
      path: listing/binary/
File Paths

The path specified in the component is relative to the root directory specified in the connection. For example, if the root on the connection is set to /data/, and the path is set to listing/binary/, the full path we will read from will be /data/listing/binary/.

The path cannot be an absolute path or traverse outside the root directory.
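
As a minimal illustration of this resolution, assuming the lake_on_s3 connection has its root set to /data/ (the connection-side configuration itself is not shown in this guide):

my_project/flows/my_flow/components/read_from_s3.yaml
component:
  read:
    connection: lake_on_s3  # connection root: /data/
    s3:
      path: listing/binary/  # resolves to /data/listing/binary/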

  2. We can specify filters on our component to include or exclude certain files, using the include or exclude keys. For a full list of file filtering options, check out our reference guide. In the example below, we use the include filter to select only files that end with the suffix csv and were created after 2024-01-01. Note that suffix is the file extension without the leading period. A sketch of the exclude key follows the example below.
my_project/flows/my_flow/components/read_from_s3.yaml
component:
  read:
    connection: lake_on_s3
    s3:
      path: listing/binary/
      include:
        - suffix: csv
        - created_at:
            after: 2024-01-01
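
The exclude key works the same way. Here is a minimal sketch, assuming exclude accepts the same filter options as include (see the reference guide for the full list); the tmp/ pattern is purely illustrative:

my_project/flows/my_flow/components/read_from_s3.yaml
component:
  read:
    connection: lake_on_s3
    s3:
      path: listing/binary/
      include:
        - suffix: csv
      exclude:
        - glob: "**/tmp/**"  # hypothetical pattern: skip files under any tmp/ directory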

Selecting a parser type

  1. Now, you'll define how you want to parse the data. For a full list of parser types and their options, check out our reference guide. In this example, we will use the auto parser, which will automatically determine how to parse the data based on the file type.
my_project/flows/my_flow/components/read_from_s3.yaml
component:
  read:
    connection: lake_on_s3
    s3:
      path: listing/binary/
      include:
        - suffix: csv
        - created_at:
            after: 2024-01-01
      parser: auto

Configuring the load strategy

  1. You can specify a load strategy to limit the number of files or the size of files loaded at once, and set the number of parallel threads for downloading files from the source. For a full list of load strategy options, check out our reference guide. In this example, we will set the max_size to 1GB. When downloading files, Ascend will attempt to group them into batches that do not exceed 1 GB.
my_project/flows/my_flow/components/read_from_s3.yaml
component:
  read:
    connection: lake_on_s3
    s3:
      path: listing/binary/
      include:
        - suffix: csv
        - created_at:
            after: 2024-01-01
      parser: auto
      load_strategy:
        max_size: 1GB

After you are done configuring your component, you can hit Save to save the changes to your project. To run your component, check out our guide on running a component in a Workspace.

More Examples:

Reading CSV data with custom parsing and filtering files using a glob pattern

In this example, we read pipe-delimited (|) files that are partitioned by year/month/date.

my_project/flows/my_flow/components/read_from_gcs.yaml
component:
  read:
    connection: lake_on_gcs
    gcs:
      path: listing/partitioned/year=20
      include:
        - glob: "**/year=2024/month=0*/**/*.csv"
      parser:
        csv:
          sep: "|"
          has_header: true
          date_format: "%Y/%m/%d"
          timestamp_format: '%Y-%m-%d %H:%M:%S'

Let's break down the configuration:

  • glob: "**/year=2024/month=0*/**/*.csv" - The glob filter option lets you specify a glob pattern to include or exclude certain files. The glob pattern we have specified will match all files that are in the year=2024 directory, are in any month=0* directory, and have a .csv extension.
  • sep: "|" - The delimiter used in the file.
  • has_header: true - This ensures that the CSV header is read in as the column names. If the CSV file does not have a header, set this to false.
  • date_format: "%Y/%m/%d" - The date format used in the CSV file so that the date columns can be properly loaded as date type columns.
  • timestamp_format: '%Y-%m-%d %H:%M:%S' - The timestamp format used in the CSV file so that the timestamp columns can be properly loaded as timestamp type columns.
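
Reading files from a local file system

The same pattern applies when reading from a local file system via the local key. A minimal sketch, assuming a connection named local_fs and an input/ directory (both names are hypothetical):

my_project/flows/my_flow/components/read_from_local.yaml
component:
  read:
    connection: local_fs  # hypothetical connection name
    local:
      path: input/  # hypothetical directory, relative to the connection root
      parser: auto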

Next Steps