Version: 3.0.0

Create a Smart Python Read Component

In this guide, we'll build a Smart Python Read Component that leverages Ascend's Smart partitioning feature to efficiently ingest partitioned datasets by discovering and processing partitions in parallel.

Prerequisites

Create a new Component

Begin from your workspace Super Graph view. Follow these steps to create your component:

  1. Double-click the Flow where you want to create your component
  2. Right-click anywhere in the Flow Graph
  3. Hover over Create Component, then over Read in the expanded menu, and click From Scratch
  4. Complete the form with these details:
    • Select your Flow
    • Enter a descriptive Component Name like read_sales
    • Select Python as your file type

Create your Smart Python Read Component

Structure your Smart Python Read Component following this pattern, based on our Otto's Expeditions project:

  1. Import necessary packages: Include Ascend resources (reader), context handlers (ComponentExecutionContext), data processing libraries (Polars and pyarrow in this example), filesystem libraries (gcsfs in this example), and logging utilities (log)
  2. Define a partition listing function with the @reader.list() decorator: Discover and yield partitions to process
  3. Define a partition read function with the @reader.read() decorator: Read each partition identified by the listing function
  4. Return structured data: Ensure the read function returns an Arrow table for efficient downstream processing

The @reader.list() and @reader.read() decorators integrate into Ascend's Smart partitioning framework, enabling parallel ingestion across data partitions.
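
To make the list-then-read pattern concrete outside of Ascend, here is a minimal standalone sketch that uses only the Python standard library. It is not how Ascend implements Smart partitioning: `FAKE_STORE`, `list_partitions`, `read_partition`, and `ingest` are hypothetical stand-ins, and the thread pool only mimics the idea that each listed partition can be read independently of the others.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Iterator

# Hypothetical in-memory "storage": partition name -> rows in that partition.
FAKE_STORE = {
    "year=2024/month=01/day=01/part-0.parquet": [1, 2],
    "year=2024/month=01/day=02/part-0.parquet": [3],
    "year=2024/month=01/day=03/part-0.parquet": [4, 5, 6],
}


def list_partitions() -> Iterator[str]:
    """Discover partitions and yield one name per partition."""
    yield from sorted(FAKE_STORE)


def read_partition(name: str) -> list[int]:
    """Read exactly one partition, independently of all others."""
    return FAKE_STORE[name]


def ingest(max_workers: int = 4) -> dict[str, list[int]]:
    """List all partitions, then read them concurrently."""
    names = list(list_partitions())
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map returns results in the same order as the input names.
        results = list(pool.map(read_partition, names))
    return dict(zip(names, results))
```

Because the read function depends only on the item it receives, the listing step fully determines how the work can be split. That separation is what the two decorators express in an Ascend component.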

Example

smart_read.py
import gcsfs
import polars as pl
import pyarrow as pa

from ascend.resources import reader
from ascend.resources.reader import ListItem
from ascend.common.events import log
from ascend.application.context import ComponentExecutionContext


@reader.list()
def list_partitions(context: ComponentExecutionContext):
    """List parquet file partitions for metabook events."""
    fs = gcsfs.GCSFileSystem()
    # Match every parquet file under the year=/month=/day= partition directories.
    files = fs.glob("gs://ascend-io-gcs-public/ottos-expeditions/lakev0/generated/events/metabook.parquet/year=*/month=*/day=*/*.parquet")
    # Wrap each file path in a ListItem so it can be read as its own partition.
    files = [ListItem(name=file) for file in files]
    log(f"Found {len(files)} partitions")
    yield from files


@reader.read()
def read_partition(context: ComponentExecutionContext, item: ListItem) -> pa.Table:
    """Read a single partition of metabook events."""
    log(f"Reading partition {item.name}")
    # item.name holds the path without a scheme, so prefix it with gs:// before reading.
    df = pl.read_parquet(f"gs://{item.name}")
    # Return an Arrow table for downstream processing.
    return df.to_arrow()
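
The glob pattern above matches Hive-style `year=`, `month=`, and `day=` directories. If you need those values inside the read function, for example to log them or attach them as columns, a small helper along these lines could extract them from a path. This is a sketch, not part of the Ascend API: `parse_partition_values` is a hypothetical name, and the path in the usage line is invented for illustration.

```python
import re

# Matches a single Hive-style "key=value" path segment such as "year=2024".
_PARTITION_SEGMENT = re.compile(r"^([^=/]+)=([^/]*)$")


def parse_partition_values(path: str) -> dict[str, str]:
    """Return the key=value pairs found in the directory segments of a path."""
    values: dict[str, str] = {}
    for segment in path.split("/"):
        match = _PARTITION_SEGMENT.match(segment)
        if match:
            values[match.group(1)] = match.group(2)
    return values


# Hypothetical path, shaped like the ones the listing function yields.
example = "some-bucket/events/metabook.parquet/year=2024/month=01/day=15/part-0.parquet"
partition = parse_partition_values(example)
```

Values are returned as strings so that zero-padded months and days such as `01` are preserved; convert them with `int()` where numeric comparison is needed.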

Check out our reference guide for additional examples and advanced options.

🎉 Congratulations! You've successfully created a Smart Python Read Component in Ascend.