Best practices for Databricks
This article extends our general best practices with recommendations specific to Databricks.
Databricks workspace management
We recommend using a single Databricks workspace with Unity Catalog enabled for working with Ascend, though you may need separate workspaces depending on your access control and compute isolation needs. In general, Databricks organizes data globally in Unity Catalog, whereas Databricks workspaces are used to manage access control and compute resources.
Utilize fine-grained permissions to control access levels and resource usage on your Databricks workspace(s).
Unity Catalog catalog management
Create separate catalogs for each Ascend Deployment and Workspace. We recommend naming these [ASCEND_PROJECT_NAME]_[ENVIRONMENT] and [ASCEND_PROJECT_NAME]_WORKSPACE_[DEVELOPER_NAME], respectively.
Though we follow our convention of using uppercase for catalog names, Databricks lowercases them.
- OTTOS_EXPEDITIONS_DEVELOPMENT
- OTTOS_EXPEDITIONS_STAGING
- OTTOS_EXPEDITIONS_PRODUCTION
- OTTOS_EXPEDITIONS_WORKSPACE_CODY
- OTTOS_EXPEDITIONS_WORKSPACE_SEAN
- OTTOS_EXPEDITIONS_WORKSPACE_SHIFRA
When sharing a Databricks workspace with non-Ascend applications, you may want to additionally prefix catalogs with ASCEND_.
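For example, a developer Workspace profile can point at that developer's catalog. This is a minimal sketch: the file name workspace_cody.yaml is hypothetical, and it assumes the same profile parameter schema as the deployment profile shown later in this article.

profile:
  parameters:
    databricks:
      # Developer-specific Workspace catalog following the convention above
      catalog: OTTOS_EXPEDITIONS_WORKSPACE_CODY
      schema: DEFAULT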
Compute resources
Databricks offers multiple compute types (and various flavors of those types). For Ascend, we recommend working with:
- all-purpose clusters (for PySpark)
- SQL warehouses (for SQL and everything else)
Unity Catalog must be enabled on any Databricks compute used with Ascend. Because all-purpose clusters typically take 5+ minutes to start, we strongly recommend using SQL warehouses by default and separating out the Flows that require PySpark onto an all-purpose cluster.
You can accomplish the above with Profile and Flow Parameters. Set both details for a cluster (cluster_id and http_path) alongside the details for a warehouse (http_path) in your profile, e.g. deployment_dev.yaml:
profile:
  parameters:
    databricks:
      workspace_url: <your-databricks-workspace-url>
      client_id: <your-databricks-service-principal-client-id>
      cluster_id: <your-databricks-cluster-id>
      cluster_http_path: <your-databricks-cluster-http-path>
      warehouse_http_path: <your-databricks-warehouse-http-path>
      catalog: OTTOS_EXPEDITIONS_DEV
      schema: DEFAULT
  defaults:
    - kind: Flow
      name:
        regex: .*
      spec:
        data_plane:
          connection_name: data_plane_databricks
Then override the http_path in your Flow:
flow:
  version: 0.1.0
  parameters:
    databricks:
      schema: SQL_FLOW
      http_path: ${parameters.databricks.warehouse_http_path}
Or to use the all-purpose cluster:
flow:
  version: 0.1.0
  parameters:
    databricks:
      schema: PYSPARK_FLOW
      http_path: ${parameters.databricks.cluster_http_path}
Our best practices are implemented as part of Otto's Expeditions!
We recommend creating a separate all-purpose cluster and warehouse for each Environment, named ASCEND_[ENVIRONMENT]_[TYPE].
- ASCEND_DEVELOPMENT_CLUSTER
- ASCEND_DEVELOPMENT_WAREHOUSE
- ASCEND_STAGING_CLUSTER
- ASCEND_STAGING_WAREHOUSE
- ASCEND_PRODUCTION_CLUSTER
- ASCEND_PRODUCTION_WAREHOUSE
Additionally, if using Databricks as your Instance Store, create the ASCEND_INSTANCE_STORE warehouse.
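As a sketch (the file name and placeholder values below are hypothetical), a staging Deployment profile such as deployment_staging.yaml would then pair the staging catalog with the staging compute named above:

profile:
  parameters:
    databricks:
      workspace_url: <your-databricks-workspace-url>
      client_id: <your-staging-service-principal-client-id>
      # ID and HTTP path of ASCEND_STAGING_CLUSTER
      cluster_id: <your-staging-cluster-id>
      cluster_http_path: <your-staging-cluster-http-path>
      # HTTP path of ASCEND_STAGING_WAREHOUSE
      warehouse_http_path: <your-staging-warehouse-http-path>
      catalog: OTTOS_EXPEDITIONS_STAGING
      schema: DEFAULT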
Access control (service principals)
Use Databricks service principals to manage access control to the catalogs and compute above. Create a service principal for each Ascend Environment (Development, Staging, Production) and, optionally, one for the Instance Store. Grant each service principal full permissions on its corresponding catalog and compute resources.
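As a sketch, assuming the profile schema above, each Deployment profile then carries the client_id of its own service principal, e.g. in a hypothetical deployment_production.yaml:

profile:
  parameters:
    databricks:
      # Service principal dedicated to the Production Environment
      client_id: <your-production-service-principal-client-id>
      catalog: OTTOS_EXPEDITIONS_PRODUCTION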