Best practices for Databricks

This article extends our general best practices with recommendations specific to Databricks.

Databricks workspace management

We recommend using a single Databricks workspace with Unity Catalog enabled when working with Ascend. Create separate workspaces only if your needs require it: Databricks organizes data globally in Unity Catalog, whereas workspaces are used to manage access control and compute resources.

Use fine-grained permissions to control access levels and resource usage in your Databricks workspace(s).

Unity Catalog catalog management

Create separate catalogs for each Ascend Deployment and Workspace. We recommend naming these [ASCEND_PROJECT_NAME]_[ENVIRONMENT] and [ASCEND_PROJECT_NAME]_WORKSPACE_[DEVELOPER_NAME], respectively.

tip

Although our convention is to use uppercase for catalog names, note that Databricks lowercases them.

OTTOS_EXPEDITIONS_DEVELOPMENT
OTTOS_EXPEDITIONS_STAGING
OTTOS_EXPEDITIONS_PRODUCTION
OTTOS_EXPEDITIONS_WORKSPACE_CODY
OTTOS_EXPEDITIONS_WORKSPACE_SEAN
OTTOS_EXPEDITIONS_WORKSPACE_SHIFRA

When sharing a Databricks workspace with non-Ascend applications, you may want to additionally prefix catalogs with ASCEND_.
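
For example, the catalogs above would become:

ASCEND_OTTOS_EXPEDITIONS_DEVELOPMENT
ASCEND_OTTOS_EXPEDITIONS_WORKSPACE_CODY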

Compute resources

Databricks offers multiple compute types (and several variants of each). For Ascend, we recommend working with:

  • all-purpose clusters (for PySpark)
  • SQL warehouses (for SQL and everything else)

Unity Catalog must be enabled on any Databricks compute used with Ascend. Because all-purpose clusters typically take 5+ minutes to start, we strongly recommend defaulting to SQL warehouses and isolating PySpark work into separate Flows so only those Flows incur the cluster startup time.

tip

You can accomplish this with Profile and Flow Parameters. Set the cluster details (cluster_id and http_path) alongside the warehouse details (http_path) in your profile, e.g. deployment_dev.yaml:

profile:
  parameters:
    databricks:
      workspace_url: <your-databricks-workspace-url>
      client_id: <your-databricks-service-principal-client-id>
      cluster_id: <your-databricks-cluster-id>
      cluster_http_path: <your-databricks-cluster-http-path>
      warehouse_http_path: <your-databricks-warehouse-http-path>
      catalog: OTTOS_EXPEDITIONS_DEV
      schema: DEFAULT

  defaults:
    - kind: Flow
      name:
        regex: .*
      spec:
        data_plane:
          connection_name: data_plane_databricks

Then override http_path in each Flow. To use the SQL warehouse:

flow:
  version: 0.1.0
  parameters:
    databricks:
      schema: SQL_FLOW
      http_path: ${parameters.databricks.warehouse_http_path}

Or to use the all-purpose cluster:

flow:
  version: 0.1.0
  parameters:
    databricks:
      schema: PYSPARK_FLOW
      http_path: ${parameters.databricks.cluster_http_path}

Our best practices are implemented as part of Otto's Expeditions!

We recommend creating a separate all-purpose cluster and SQL warehouse for each Environment, named ASCEND_[ENVIRONMENT]_[TYPE]:

ASCEND_DEVELOPMENT_CLUSTER
ASCEND_DEVELOPMENT_WAREHOUSE
ASCEND_STAGING_CLUSTER
ASCEND_STAGING_WAREHOUSE
ASCEND_PRODUCTION_CLUSTER
ASCEND_PRODUCTION_WAREHOUSE

Additionally, if you use Databricks as your Instance Store, create an ASCEND_INSTANCE_STORE warehouse.

Access control (service principals)

Use Databricks service principals to manage access to the catalogs and compute above. Create a service principal for each Ascend Environment (Development, Staging, Production), and optionally one for the Instance Store. Grant each service principal full permissions on its corresponding catalog and compute resources.
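
As a minimal sketch, the catalog grant can be issued in Databricks SQL (the catalog name and service principal application ID below are placeholders; substitute your own). Permissions on the all-purpose cluster and SQL warehouse are assigned separately through the workspace UI or Permissions API rather than SQL:

GRANT ALL PRIVILEGES ON CATALOG OTTOS_EXPEDITIONS_DEVELOPMENT TO `<your-databricks-service-principal-application-id>`;

Because Unity Catalog privileges are inherited by child objects, ALL PRIVILEGES on the catalog covers its schemas and tables; scope the grant down if your environment requires tighter control.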