Create a Smart Python Read Component
In this guide, we'll build a Smart Python Read Component that leverages Ascend's Smart partitioning feature to efficiently ingest partitioned datasets by discovering and processing partitions in parallel.
Prerequisites
- Ascend Flow
Create a new Component
Begin from your workspace Super Graph view. Follow these steps to create your component:
Using the Component Form
- Double-click the Flow where you want to create your component
- Right-click anywhere in the Flow Graph
- Hover over Create Component, then over Read in the expanded menu, and click From Scratch
- Complete the form with these details:
  - Select your Flow
  - Enter a descriptive Component Name like read_sales
  - Select Python as your file type

Using the Files Panel
- Open the files panel in the top left corner
- Navigate to and select your desired Flow
- Right-click on the components directory and choose New file
- Name your file with a descriptive name like read_sales.py and press enter
Create your Smart Python Read Component
Structure your Smart Python Read Component following this pattern, based on our Otto's Expeditions project:
- Import necessary packages: Include Ascend resources (reader), context handlers (ComponentExecutionContext), data processing libraries (Polars and pyarrow in this example), filesystem libraries (gcsfs in this example), and logging utilities (log)
- Define a partition listing function with the @reader.list() decorator: Discover and yield partitions to process
- Define a partition read function with the @reader.read() decorator: Read each partition identified by the listing function
- Return structured data: Ensure the read function returns an Arrow table for efficient downstream processing
The @reader.list() and @reader.read() decorators integrate with Ascend's Smart partitioning framework, enabling parallel ingestion across data partitions.
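To make the list-then-read pattern concrete, here is a minimal plain-Python sketch of the general idea: one function yields the items to process, and a second function handles each item independently so the work can be fanned out in parallel. This is only an illustration under stated assumptions, not how Ascend schedules work internally. The names Item, list_items, read_item, and run, as well as the partition paths, are hypothetical and are not part of the Ascend API.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass
from typing import Iterator


@dataclass(frozen=True)
class Item:
    """Hypothetical stand-in for a listed partition (not Ascend's ListItem)."""

    name: str


def list_items() -> Iterator[Item]:
    """Listing step: yield one item per partition to ingest."""
    # Hypothetical partition paths, used only for illustration.
    for day in ("01", "02", "03"):
        yield Item(name=f"events/year=2024/month=01/day={day}/part-0.parquet")


def read_item(item: Item) -> dict:
    """Read step: handle exactly one listed item, independently of the others."""
    # A real read function would load the file; this one only echoes metadata.
    return {"partition": item.name, "name_length": len(item.name)}


def run(max_workers: int = 4) -> list[dict]:
    """Fan the listed items out to the read function in parallel."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map returns results in the order the items were listed.
        return list(pool.map(read_item, list_items()))


if __name__ == "__main__":
    for result in run():
        print(result)
```

Because each call to the read step depends only on its own item, the items can be processed in any order or concurrently, which is the property that makes this kind of partitioned ingestion parallelizable.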
Example
smart_read.py
import gcsfs
import polars as pl
import pyarrow as pa

from ascend.resources import reader
from ascend.resources.reader import ListItem
from ascend.common.events import log
from ascend.application.context import ComponentExecutionContext


@reader.list()
def list_partitions(context: ComponentExecutionContext):
    """List parquet file partitions for metabook events."""
    fs = gcsfs.GCSFileSystem()
    # Match every parquet file under the year/month/day partition directories.
    files = fs.glob("gs://ascend-io-gcs-public/ottos-expeditions/lakev0/generated/events/metabook.parquet/year=*/month=*/day=*/*.parquet")
    items = [ListItem(name=file) for file in files]
    log(f"Found {len(items)} partitions")
    yield from items


@reader.read()
def read_partition(context: ComponentExecutionContext, item: ListItem) -> pa.Table:
    """Read a single partition of metabook events."""
    log(f"Reading partition {item.name}")
    # item.name holds the path without the protocol, so prepend gs:// here.
    df = pl.read_parquet(f"gs://{item.name}")
    return df.to_arrow()
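The glob pattern in the example matches Hive-style year=/month=/day= directories. If you want to narrow the set of files before yielding them, one option is to parse the date out of each path. The sketch below uses only the Python standard library. The helper names partition_date and on_or_after and the sample paths are hypothetical, so adapt them to the actual layout of your bucket.

```python
import re
from datetime import date

# Matches the Hive-style segments produced by paths like
# ".../year=2024/month=03/day=15/part-0.parquet".
_PARTITION_RE = re.compile(r"year=(\d{4})/month=(\d{1,2})/day=(\d{1,2})/")


def partition_date(path: str) -> date:
    """Extract the year/month/day encoded in a Hive-style partition path."""
    match = _PARTITION_RE.search(path)
    if match is None:
        raise ValueError(f"No year=/month=/day= segments in {path!r}")
    year, month, day = (int(group) for group in match.groups())
    return date(year, month, day)


def on_or_after(paths: list[str], cutoff: date) -> list[str]:
    """Keep only the paths whose partition date is on or after the cutoff."""
    return [path for path in paths if partition_date(path) >= cutoff]


if __name__ == "__main__":
    # Hypothetical sample paths, for illustration only.
    sample = [
        "bucket/events/year=2024/month=02/day=28/part-0.parquet",
        "bucket/events/year=2024/month=03/day=15/part-0.parquet",
    ]
    print(on_or_after(sample, date(2024, 3, 1)))
```

A filter like this could be applied to the list of globbed paths inside the listing function before wrapping each one in a ListItem, so that only the partitions you care about are passed on to the read function.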
Check out our reference guide for additional examples and advanced options.
🎉 Congratulations! You've successfully created a Smart Python Read Component in Ascend.