Read
Read components in Ascend help you bring data from various external sources into your Flows running on a given Data Plane. They work with Connections to securely access data, which then flows into your Data Plane for further processing.
Key Features
- Versatile Connections: Connect to various sources, including warehouses, databases, and cloud storage, making data ingestion flexible and efficient.
- Support for Multiple File Formats: Handle formats like JSON, CSV, and Parquet with built-in parsers.
- Data Validation: Keep data quality in check with built-in validation tests.
- Optimization Options: Boost performance by tuning flow resources, adjusting run frequency, and setting up partitioning.
How It Works
You can configure read components to connect to data sources using YAML files or Python. Configuration settings tell the system how to access the data, what to pull, and how to interpret it.
Configuration Requirements
For complete details on the read components available and their configuration options, see the Read Component Reference.
To set up a read component, you'll need:
- The Connection ID: Identifies the data source using the name of the associated Connection file.
- The Source Type: Specifies the type of source, like cloud storage (S3, Azure Blob) or databases (MySQL, PostgreSQL).
- A Path or Query: Defines the data location or SQL query used to fetch data.
- Include/Exclude Patterns: Optional filters to specify which files or datasets to include or exclude.
- Parser Type: Tells how to interpret the data (e.g., csv, json, or parquet). You can also set it to auto.
- Tests: Optional data validation tests to ensure data quality.
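As a rough illustration of how the settings above fit together, here is a minimal Python sketch. It does not use Ascend's API: the ReadConfig class, its field names, and the rule that exactly one of path or query is set are all assumptions made for this example. See the Read Component Reference for the real option names.

```python
from dataclasses import dataclass, field
from typing import List, Optional

# Parser names taken from the list above; "auto" lets the system decide.
PARSERS = {"auto", "csv", "json", "parquet"}


@dataclass
class ReadConfig:
    """Hypothetical container for read component settings (not Ascend's API)."""
    connection: str                      # name of the associated Connection file
    source_type: str                     # e.g. "s3", "mysql", "postgresql"
    path: Optional[str] = None           # data location ...
    query: Optional[str] = None          # ... or SQL query (assumed mutually exclusive)
    include: List[str] = field(default_factory=list)
    exclude: List[str] = field(default_factory=list)
    parser: str = "auto"
    tests: List[str] = field(default_factory=list)

    def problems(self) -> List[str]:
        """Return a list of human-readable configuration problems."""
        issues = []
        if not self.connection:
            issues.append("connection is required")
        if not self.source_type:
            issues.append("source_type is required")
        if (self.path is None) == (self.query is None):
            issues.append("set exactly one of path or query")
        if self.parser not in PARSERS:
            issues.append(f"unknown parser: {self.parser}")
        return issues


orders = ReadConfig(
    connection="my_s3",
    source_type="s3",
    path="raw/orders/",
    include=["*.csv"],
    parser="csv",
)
print(orders.problems())  # []

broken = ReadConfig(connection="warehouse", source_type="postgresql", parser="xml")
print(broken.problems())
# ['set exactly one of path or query', 'unknown parser: xml']
```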
Types of Read Components
A wide variety of sources can be connected to and ingested via Ascend. The categories below are the most common sources you are likely to use.
Aliases
Aliases are references to tables within the Data Plane that your system already has access to. Use them to simplify workflows or reuse existing data without duplication, keeping things consistent and manageable.
External Tables
External tables link the Data Plane directly to external databases or storage systems, offloading heavy processing to improve performance.
Ingested Sources
Data sources that are not currently connected to your Data Plane can be ingested into Ascend using read components. These sources include:
- Cloud Storage & File Sources: Read components can pull data from cloud storage services like AWS S3, Google Cloud Storage, and Azure Blob. They handle file formats like JSON, CSV, and Parquet and let you control data selection using include/exclude patterns.
- Data Warehouses: Data warehouses like Snowflake, Redshift, and BigQuery work with read components, making it easy to pull structured data through SQL queries for analytics and business intelligence.
- Databases: Support for databases like MySQL, PostgreSQL, and SQL Server lets you connect to relational sources and extract data from tables or custom queries.
- APIs and Web Services: Read components can ingest data directly from APIs and web services, integrating data like metrics, logs, or other dynamic sources into your flows. They handle formats such as JSON and XML, adapting to different API structures.
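The include/exclude patterns mentioned for file sources can be pictured with a short Python sketch. The select_files helper is hypothetical, and the shell-style glob semantics of Python's fnmatch module are an assumption; the pattern syntax Ascend actually accepts may differ.

```python
from fnmatch import fnmatchcase
from typing import Iterable, List


def select_files(paths: Iterable[str], include: List[str], exclude: List[str]) -> List[str]:
    """Keep paths matching any include pattern and no exclude pattern.

    An empty include list is treated as "include everything" (an assumption).
    """
    selected = []
    for path in paths:
        included = not include or any(fnmatchcase(path, p) for p in include)
        excluded = any(fnmatchcase(path, p) for p in exclude)
        if included and not excluded:
            selected.append(path)
    return selected


files = [
    "orders/2024-01.csv",
    "orders/2024-02.csv",
    "orders/_tmp/part.csv",
    "orders/readme.txt",
]
print(select_files(files, include=["*.csv"], exclude=["*/_tmp/*"]))
# ['orders/2024-01.csv', 'orders/2024-02.csv']
```

Applying exclude after include means an exclude pattern always wins when both match, which is one common convention rather than a documented Ascend rule.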
Parsing and Schemas
Parsing helps read components correctly interpret data based on its format; this is particularly important for ingested sources coming from cloud storage or APIs. Parsers automatically detect formats like CSV, JSON, and Parquet, but you can set them manually if needed. Schema validation checks incoming data against predefined schemas and flags any mismatches.
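To make the two ideas in this section concrete, the sketch below picks a parser from a file extension and then checks parsed rows against a simple schema. All function names are hypothetical, and extension-based detection is only one possible approach, not necessarily how Ascend detects formats. Parquet is detected but not parsed here, since that would need a third-party library.

```python
import csv
import io
import json
import os
from typing import Any, Dict, List, Optional


def detect_parser(path: str) -> Optional[str]:
    """Guess a parser name from the file extension; None if unrecognized."""
    ext = os.path.splitext(path)[1].lower().lstrip(".")
    return ext if ext in {"csv", "json", "parquet"} else None


def parse(text: str, parser: str) -> List[Dict[str, Any]]:
    """Parse raw text into a list of row dictionaries."""
    if parser == "csv":
        return list(csv.DictReader(io.StringIO(text)))
    if parser == "json":
        data = json.loads(text)
        return data if isinstance(data, list) else [data]
    raise ValueError(f"no parser implemented in this sketch for: {parser}")


def schema_mismatches(rows: List[Dict[str, Any]], schema: Dict[str, type]) -> List[str]:
    """Flag missing fields and fields whose value has an unexpected type."""
    problems = []
    for i, row in enumerate(rows):
        for name, expected in schema.items():
            if name not in row:
                problems.append(f"row {i}: missing {name}")
            elif not isinstance(row[name], expected):
                problems.append(f"row {i}: {name} is not {expected.__name__}")
    return problems


raw = '[{"id": 1, "name": "a"}, {"id": "2"}]'
rows = parse(raw, detect_parser("events.json"))
print(schema_mismatches(rows, {"id": int, "name": str}))
# ['row 1: id is not int', 'row 1: missing name']
```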
Read Strategies
Ascend supports a number of read strategies to help you get the most out of your read components. These include:
Partitioned
Partitioned read components in Ascend.io are designed to read data from partitioned sources, such as partitioned tables in databases or partitioned folders in cloud storage. This approach is particularly useful for large datasets, where reading the entire dataset repeatedly would be inefficient.
Key Differences for Partitioned Reads
- Partition Awareness: Partitioned reads track the partitions that have been read, so that they can be skipped in subsequent reads. This applies to both partitioned tables and file systems.
- Change Detection: Partitioned reads can also detect changes to partitions and only read new or updated data. For example, if a file has been updated, the partitioned read can replace the old file with the new version.
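Partition awareness and change detection can be sketched as a comparison between the partitions recorded on the last run and the partitions currently visible at the source. The plan_partition_read function and the use of a per-partition fingerprint (for example a modification time or checksum) are assumptions for illustration; Ascend's internal bookkeeping is not described here.

```python
from typing import Dict, List


def plan_partition_read(seen: Dict[str, str], current: Dict[str, str]) -> List[str]:
    """Return partitions that are new or whose fingerprint has changed.

    Both arguments map a partition name to a fingerprint string.
    """
    return sorted(name for name, fp in current.items() if seen.get(name) != fp)


# State recorded after the previous run (hypothetical values).
seen = {"dt=2024-01-01": "etag-a", "dt=2024-01-02": "etag-b"}

# What the source looks like now: one partition updated, one partition added.
current = {
    "dt=2024-01-01": "etag-a",
    "dt=2024-01-02": "etag-b2",
    "dt=2024-01-03": "etag-c",
}

to_read = plan_partition_read(seen, current)
print(to_read)  # ['dt=2024-01-02', 'dt=2024-01-03']

# After a successful read, remember the new fingerprints so the next run skips them.
seen.update({name: current[name] for name in to_read})
print(plan_partition_read(seen, current))  # []
```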
Benefits of Partitioned Reads
- Reduced Data Volume: By only reading the necessary partitions, partitioned reads can significantly reduce the amount of data processed, improving performance and reducing costs.
- Faster Data Updates: Partitioned reads can quickly update data in Ascend by only reading new or updated files, without needing to reprocess the entire dataset.
- Increased Parallelism: Partitioned reads can be parallelized, allowing for more efficient use of resources and faster data processing.
- Simplified Data Management: Partitioned reads can make data management simpler by breaking down large datasets into smaller, more manageable pieces.
- Reduced Costs: By reducing the amount of data processed, partitioned reads can also help reduce costs associated with data processing.
Key Elements for Partitioned Reads
- File Systems: Partitioned reads are automatically supported for file systems that are partitioned, such as S3, GCS, and ADLS.
- Warehouses: Partitioned reads are automatically supported for partitioned tables in warehouses.
Incremental
Incremental read components in Ascend.io are designed to ingest data incrementally, meaning they only read new or updated data since the last successful read. This approach is particularly useful for large datasets, where reading the entire dataset repeatedly would be inefficient.
Key Differences for Incremental Reads
Incremental read connectors have distinct features that differentiate them from standard read connectors:
- State Management: These connectors maintain state information to track the last read position or timestamp, ensuring that only new or updated data is read.
- Configuration for Incremental Reads: Incremental read connectors include additional configurations to specify how to identify new or updated data.
Benefits of Incremental Reads
- Improved Efficiency: By only ingesting new or updated data, incremental reads reduce the amount of data processed, saving time and computational resources.
- Scalability: Incremental reads make it feasible to handle large and continuously growing datasets without needing to reprocess the entire dataset.
- Reduced Costs: Minimized data processing leads to lower costs associated with compute resources, particularly in cloud environments where data processing is metered.
- Faster Data Updates: Incremental reads enable quicker updates to data pipelines, ensuring that the latest data is available for analysis without significant delays.
- Lower Impact on Source Systems: Since incremental reads pull smaller, targeted data batches, they reduce the load on source systems compared to full dataset reads.
Key Elements for Incremental Reads
- incremental: Defines the parameters for incremental data ingestion specific to incremental read connectors.
- field: The field used to track new or updated data (e.g., a timestamp or an ID). This ensures that only data added or modified after the last read is ingested.
- start: Specifies the starting point for incremental reads (e.g., a specific timestamp), defining where the incremental reading should begin.
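The roles of field and start, together with the state that is kept between runs, can be illustrated with a small Python sketch. The incremental_read function is hypothetical, and treating start as inclusive on the first run is an assumption; it is not a description of how Ascend evaluates these settings.

```python
from typing import Any, Dict, List, Optional, Tuple

Row = Dict[str, Any]


def incremental_read(
    rows: List[Row], field: str, start: Any, last_value: Optional[Any] = None
) -> Tuple[List[Row], Optional[Any]]:
    """Return rows not yet ingested plus the updated cursor value.

    First run (no saved state): take rows at or after `start`.
    Later runs: take rows strictly after the saved `last_value`.
    """
    if last_value is None:
        fresh = [r for r in rows if r[field] >= start]
    else:
        fresh = [r for r in rows if r[field] > last_value]
    fresh.sort(key=lambda r: r[field])
    next_value = fresh[-1][field] if fresh else last_value
    return fresh, next_value


# ISO 8601 timestamps of equal length compare correctly as strings.
source = [
    {"id": 1, "updated_at": "2024-01-01T00:00:00"},
    {"id": 2, "updated_at": "2024-01-02T00:00:00"},
    {"id": 3, "updated_at": "2024-01-03T00:00:00"},
]

first, state = incremental_read(source, "updated_at", start="2024-01-02T00:00:00")
print([r["id"] for r in first], state)  # [2, 3] 2024-01-03T00:00:00

source.append({"id": 4, "updated_at": "2024-01-04T00:00:00"})
second, state = incremental_read(
    source, "updated_at", start="2024-01-02T00:00:00", last_value=state
)
print([r["id"] for r in second], state)  # [4] 2024-01-04T00:00:00
```

With a strict "greater than" comparison, a row that arrives later with the same value as the saved cursor would be skipped, which is one reason the choice of tracking field matters.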
Best Practices
- Monitor State Management: Regularly review state management settings to ensure the accuracy of incremental reads and avoid missing or duplicating data.
- Optimize Incremental Fields: Choose appropriate fields (like timestamps or unique IDs) that reliably indicate new or updated data for accurate incremental ingestion.
CDC
Native CDC support is available for a number of database sources. When available, CDC can be a more efficient way to replicate data into Ascend than traditional incremental methods.
Key Differences for CDC Reads
- CDC Support: CDC reads require the source database to support Change Data Capture (CDC).
Benefits of CDC Reads
- Efficiency: CDC reads can be more efficient than incremental reads, particularly for large datasets.
- Completeness: CDC reads can provide a more complete history of data changes, as they capture all changes to the data, not just new or updated records.
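The completeness point can be seen in a small sketch that replays a stream of change events onto a keyed table. The event shape (op, key, row) and the apply_changes function are assumptions made for this example; real CDC feeds differ by database and this is not Ascend's event format.

```python
from typing import Any, Dict, List

Row = Dict[str, Any]


def apply_changes(table: Dict[Any, Row], events: List[Dict[str, Any]]) -> Dict[Any, Row]:
    """Apply insert, update, and delete events in order to a copy of the table."""
    result = dict(table)
    for event in events:
        op = event["op"]
        if op in ("insert", "update"):
            result[event["key"]] = event["row"]
        elif op == "delete":
            result.pop(event["key"], None)
        else:
            raise ValueError(f"unknown operation: {op}")
    return result


table = {1: {"id": 1, "status": "new"}}
events = [
    {"op": "update", "key": 1, "row": {"id": 1, "status": "paid"}},
    {"op": "insert", "key": 2, "row": {"id": 2, "status": "new"}},
    {"op": "delete", "key": 1},
]
print(apply_changes(table, events))  # {2: {'id': 2, 'status': 'new'}}
```

The delete of key 1 is the kind of change a read that only looks for new or updated records would not observe.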
Data Update (Merge) Strategies
When a read component is configured to update data in a target component, Ascend uses one of several strategies to perform the update. The specific strategy depends on the type of data source.
Database, Warehouse, & Event Sources
For systems that support incremental & CDC record retrieval, the most common merge strategies are:
- Latest: The latest data is used to update the target. If the target component already contains data, the new data replaces the old data.
- History: The full history of data is merged with the target. This is useful for maintaining a complete history of data changes.
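The difference between the two strategies can be sketched in a few lines of Python. This is an interpretation, not Ascend's implementation: merge_latest assumes replacement happens per record key, and merge_history assumes new records are appended to what is already stored. Both function names are hypothetical.

```python
from typing import Any, Dict, List

Row = Dict[str, Any]


def merge_latest(target: List[Row], incoming: List[Row], key: str) -> List[Row]:
    """Keep one row per key; an incoming row replaces the existing row."""
    merged = {row[key]: row for row in target}
    for row in incoming:
        merged[row[key]] = row
    return list(merged.values())


def merge_history(target: List[Row], incoming: List[Row]) -> List[Row]:
    """Keep every version by appending incoming rows to the existing ones."""
    return target + incoming


target = [{"id": 1, "status": "new"}]
incoming = [{"id": 1, "status": "shipped"}, {"id": 2, "status": "new"}]

print(merge_latest(target, incoming, key="id"))
# [{'id': 1, 'status': 'shipped'}, {'id': 2, 'status': 'new'}]

print(len(merge_history(target, incoming)))  # 3
```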
File Sources
For file sources, the default strategy is to always replace file contents with updated file contents.