# Data engineering best practices in Ascend
Ascend empowers data engineers to implement software development lifecycle (SDLC) best practices to automate their data pipelines. Ascend is highly configurable to meet scalability needs at any data volume and across any data engineering team size. This guide outlines best practices that work well for most organizations. Customize as needed!
## Development and deployment workflow
In Ascend, developers work in their own Workspace to develop and test data pipelines for a Project. Once their code is ready, they merge their changes into the default (usually `main`) Git branch, which automatically deploys the Project to its Deployment in the Development Environment (if, following our best practices here, you deploy it from the default Git branch). From there, code can be promoted to the Staging and Production Deployments. You can configure Development and/or Staging to work on smaller sampled datasets for faster iteration and lower costs through Parameters.
In your Ascend Instance, you'll typically have:
- one or more Projects
  - defined as a Repository + path to a project directory within the repository
  - collections (as directories of files) of Connections, Flows, Automations, Profiles, and more
- one Deployment per Project per Environment (Development, Staging, Production)
  - typically named `[Project] ([Environment])`, e.g. `Otto's Expeditions (Production)`
  - one Git branch typically named `[environment]`, e.g. `production`
    - (recommended) omit for `development` in favor of your default branch, allowing continuous integration testing
- one Workspace per developer (using the Development Environment)
  - typically named `[Developer name]`, e.g. `Cody`, `Sean`, `Shifra`
  - one Workspace can be used to work across different Projects, Profiles, and branches
  - many Git branches, typically named `[developer_name]/[thing_being_developed]`
  - multiple Workspaces are needed if a developer requires access to the Staging or Production Environments
- one Profile for each Workspace and Deployment
  - typically named `workspace_[developer_name].yaml` and `deployment_[environment].yaml`
  - consider keeping a template profile for new developers named `workspace_template.yaml`
## Resource setup

### Repository management
We recommend starting with a single repository (monorepo) named `ascend-projects`. Public project templates are provided for supported Data Planes to get you started following the best practices outlined below. If you don't have a Git provider, we recommend GitHub as it's common and easy to get started with. GitLab, Bitbucket, Azure DevOps, and nearly any other Git provider can also be used.
### Project management

Ascend Projects are specified as a Repository + path to a project directory within the repository. Typically, projects are placed at the root of the repository or in a subdirectory.
An example repository structure:
```
ascend-projects
├── projectA
├── projectB
└── projectC
```
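Within each project directory, the Project's files are organized into directories by type (as noted above: Connections, Flows, Automations, Profiles, and more). One possible layout for a single project, with directory names chosen purely for illustration:

```
projectA
├── connections/
├── flows/
├── automations/
└── profiles/
```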
### Environments and security boundaries
The Ascend Instance is the top-level security boundary. Most organizations will only have one Data Plane (such as Snowflake, BigQuery, or Databricks). The Ascend Instance Store requires data and compute in a Data Plane, and we recommend isolating that Data Plane: only your organization's administrators and the Ascend Instance's identity should have access to the Ascend Instance Store.
How you isolate data and compute will be specific to your Data Plane(s) and your organization's requirements. For BigQuery, we recommend separate GCP projects, though fine-grained access control within a single project may be sufficient. In Snowflake, separate warehouses and databases are recommended, but separate accounts are typically superfluous.
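As a concrete sketch of the BigQuery approach (the project IDs, service account, and role below are illustrative assumptions, not prescriptions):

```shell
# Create one GCP project per Environment (names are illustrative)
$ gcloud projects create acme-ascend-development
$ gcloud projects create acme-ascend-staging
$ gcloud projects create acme-ascend-production

# Grant each Environment's identity access only to its own project
$ gcloud projects add-iam-policy-binding acme-ascend-production \
    --member="serviceAccount:ascend-production@acme-admin.iam.gserviceaccount.com" \
    --role="roles/bigquery.admin"
```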
Typically, organizations will have three Ascend Environments:
- Development: for developers' Ascend Workspaces and continuous integration testing of your default (typically `main`) branch
- Staging: a controlled testing environment that closely resembles production
- Production: the live environment where your data pipelines run
Each Environment should have isolated data and compute resources in the Data Plane, with secure users or roles that have access only to their designated resources. Every Environment, including the Instance itself, comes with a dedicated Ascend Vault for secrets management. While this enables secure separation of secrets between Environments, you can also integrate your own external vaults and control Environment access as needed.
Additional Environments can be created to meet specific organizational needs, such as separating protected health information (PHI) from non-PHI data. In such cases, append the data classification to the Environment name (e.g., `Dev_PHI`).
## Git processes
The default and deployment branches should be protected.
An example branch list:
```shell
$ git branch
  cody/add-new-python-component
  cody/fix-cron-schedule
  cody/more-unit-testing
* main
  production
  staging
```
A developer will merge (through the Ascend UI or your Git provider's own process) their changes into the default branch (`main` in the example above). This will trigger an update in the Development Deployment, reflecting the state of the default branch. Once you're confident in the changes (or automatically through some process), you then merge the default branch into the `staging` branch, resulting in an update of the Staging Deployment. Finally, you merge the `staging` branch into the `production` branch, updating the Production Deployment.
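From the command line, that promotion flow might look like this (a minimal sketch assuming you merge locally; many teams instead open pull requests in their Git provider, especially since these branches should be protected):

```shell
# Promote the default branch to Staging
$ git checkout staging
$ git merge main
$ git push origin staging      # Staging Deployment updates to match

# Promote Staging to Production
$ git checkout production
$ git merge staging
$ git push origin production   # Production Deployment updates to match
```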
## Ascend file naming conventions
Files are sorted alphabetically in the Ascend UI. Common types are often prefixed to group them together.
### Connection naming best practices

When naming connection files, use descriptive identifiers that reflect the resource's purpose, such as `billing_database.yaml` or `customer_data_lake.yaml`. For connections that need different configurations across environments (Development/Staging/Production), we strongly recommend keeping a single YAML file and parameterizing the values within it, as sketched below. You can find examples of this approach across all three Data Planes in Otto's Expeditions.
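For instance, a single connection file might reference a parameter rather than hard-coding an environment-specific value. The YAML below is a hypothetical sketch: the field names and the `${parameter.billing_database}` substitution syntax are illustrative assumptions, not Ascend's actual schema; see the Otto's Expeditions examples for the real syntax.

```yaml
# billing_database.yaml -- one file shared by all Environments (hypothetical schema)
connection:
  snowflake:
    # The database name comes from a parameter, so Development, Staging,
    # and Production can each resolve a different value via their Profiles.
    database: ${parameter.billing_database}
```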
### Data plane connections

Name your Data Plane connection files like `data_plane_[type].yaml`:

```
data_plane_bigquery.yaml
data_plane_databricks.yaml
data_plane_snowflake.yaml
```
Use additional postfix descriptors as needed.
Most cases that appear to call for separate files are better handled with Parameters in your Data Plane configuration file. You should only have multiple files for a given Data Plane type if you are using separate Data Plane instances within a single Environment:

```
data_plane_databricks_eastus.yaml
data_plane_databricks_westus2.yaml
```
### Data connections

Name your data connection files like `[read|write]_[type]_[description].yaml`. Use additional postfix descriptors as needed.

```
read_gcs_raw.yaml
read_abfss_raw.yaml
write_gcs_processed.yaml
```
### Profiles

Name your developer profiles like `workspace_[developer_name].yaml` and deployment profiles like `deployment_[environment].yaml`. Consider keeping a template profile for new developers named `workspace_template.yaml`.
An example list of profiles:
```
deployment_development.yaml
deployment_production.yaml
deployment_staging.yaml
workspace_cody.yaml
workspace_sean.yaml
workspace_shifra.yaml
workspace_template.yaml
```
## Managing parameters
Parameters are inherited through the hierarchy of Project > Profile > Flow > Flow Run > Component.
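As a hypothetical illustration of that hierarchy (the file names, YAML schema, and override semantics below are assumptions for illustration, not Ascend's documented syntax), a Project could declare a default value that a Deployment Profile overrides lower in the hierarchy:

```yaml
# Project-level default (hypothetical file and schema)
parameters:
  billing_database: BILLING_DEV

# deployment_production.yaml -- Profile-level override (hypothetical schema)
parameters:
  billing_database: BILLING_PROD
```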
## Next steps

Learn specific best practices for each of the following Data Planes:

- Snowflake
- BigQuery
- Databricks