Dash is an ideal Python-based front-end for your Databricks Spark backend

Plotly
June 23, 2020

📌 December 6, 2023: We are continuously sharing new content on Plotly's integration options with Databricks, which now include the Databricks SDK, the Python SQL Connector, the Jobs API, Databricks Connect v2, and the CLI. The integration driving the most adoption is the Python SQL Connector. Check out our technical documentation or official partnership page for updated articles. (NB: Using the licensed Databricks-Dash library in a notebook has proven to be an anti-pattern use of Dash, and it has implementation caveats, particularly for Azure Databricks customers. Please contact Plotly for details.)

We’re delighted to announce that Plotly and Databricks are partnering to bring cloud-distributed Artificial Intelligence (AI) and Machine Learning (ML) to a vastly wider audience of business users. By integrating the Plotly Dash front-end with the Databricks backend, we offer a seamless process to transform AI and ML models into production-ready, dynamic, interactive web applications. This partnership empowers Python developers to easily and quickly build Dash apps that are connected to a Databricks Spark cluster. The direct integration, databricks-dash, is distributed by Plotly and available with Plotly’s Dash Enterprise.

Plotly’s Dash is a Python framework that enables developers to build interactive, data-rich analytical web apps in pure Python, with no JavaScript required. Traditional “full-stack” app development is done in teams, with some members specializing in backend/server technologies like Python, some specializing in front-end technologies like React, and some specializing in data science. Dash provides a tightly integrated backend and front-end, entirely written in Python. This means that data science teams producing models, visualizations, and complex analyses no longer need to rely on backend specialists to expose these models via APIs, or on front-end specialists to build user interfaces on top of those APIs. If you’re interested in Dash’s architecture, please see our “Dash is React for Python” article.
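To make this concrete, here is a minimal sketch of a plain Dash app using the open-source dash package (the Databricks-specific flavor appears later in this post). Both the layout and the callback are ordinary Python:

import dash
import dash_core_components as dcc
import dash_html_components as html
from dash.dependencies import Input, Output

# The layout describes the UI; no JavaScript or HTML templates needed.
app = dash.Dash(__name__)
app.layout = html.Div([
    dcc.Input(id="name", value="world"),
    html.H1(id="greeting"),
])

# Dash wires the front-end input to this Python function automatically.
@app.callback(Output("greeting", "children"), [Input("name", "value")])
def greet(name):
    return "Hello, {}!".format(name)

if __name__ == "__main__":
    app.run_server(debug=True)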

Databricks’ unified platform for data and AI rests on top of Apache Spark, a distributed, general-purpose cluster-computing framework originally developed by the Databricks founders. Given enough hardware and network capacity, Apache Spark’s distributed architecture lets it scale horizontally naturally. Apache Spark offers a rich collection of APIs, the MLlib machine-learning library, and integration with popular Python scientific libraries (e.g. pandas, scikit-learn). The Databricks Data Science Workspace provides managed, optimized, and secure Spark clusters, letting developers and data scientists focus on building and optimizing models rather than on infrastructure concerns such as speed, reliability, and fault tolerance. Databricks also abstracts away many manual administrative duties (such as creating a cluster, auto-scaling hardware, and managing users) and simplifies the development process with IPython-like notebooks.
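As a small illustration of that API surface, here is a sketch assuming a Databricks notebook, where a SparkSession named spark is already provided:

# In a Databricks notebook, a SparkSession named `spark` is provided for you.
from pyspark.sql import functions as F

# Build a million-row DataFrame with a random column.
df = spark.range(1000000).withColumn("x", F.rand())

# The aggregation runs distributed across the cluster...
summary = (df.groupBy((F.col("id") % 10).alias("bucket"))
             .agg(F.avg("x").alias("mean_x")))

# ...and only the small result is collected back as a pandas DataFrame.
pdf = summary.toPandas()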

Databricks is the industry-leading Spark platform, and Plotly’s Dash is the industry-leading library for building UIs and web apps in Python. By using Dash and Databricks together, data scientists can quickly deliver production-ready AI and ML apps, backed by Databricks Spark clusters, to business users. A typical Dash + Databricks app is usually less than a thousand lines of Python (no JavaScript required). These Dash apps range from simple UIs for simulation models to complex dashboards acting as read/write interfaces to your Databricks Spark cluster and large amounts of data stored in a data warehouse. With Dash apps connected to Databricks Spark clusters, Dash + Databricks gives business users the powerful magic of Python and PySpark.

Currently, there are two ways to integrate Dash with Databricks:

  1. databricks-dash supports a Notebook-like approach meant for quick Dash app prototyping within the Databricks notebook environment.
  2. databricks-connect supports an IDE-based development approach meant for taking apps to production.

More details on each integration method follow:

databricks-dash

databricks-dash is a closed-source, custom library that can be installed and imported in any Databricks notebook. After importing it, developers can build Dash applications directly in the notebook. Like regular Dash applications, Dash applications in Databricks notebooks use the familiar app layouts and callbacks. Any PySpark code written in a Databricks notebook, whether it implements complex models or simple ETL processes, can be integrated into a Dash application with minimal code migration. Once the Flask (Python) server runs, the generated Dash application is hosted on your Databricks instance at a unique URL. It is important to note that Dash applications running in Databricks notebooks share cluster resources and lack a load balancer, so databricks-dash is great for quick prototyping and iterating but is not recommended for production deployments. For any data scientist or developer interested in taking a databricks-dash application to production, Plotly’s Dash Enterprise documentation provides all the steps to get there using databricks-connect.

Here is a minimal, self-contained example of using databricks-dash to create a Dash app from the Databricks notebook interface. After installing the databricks-dash library, run the example by copying and pasting the following code block into a Databricks notebook cell. Here’s a video demo of databricks-dash to accompany the code below.

# Imports
import plotly.express as px
import dash_core_components as dcc
import dash_html_components as html
from dash.dependencies import Input, Output
from databricks_dash import DatabricksDash

# Load data
df = px.data.tips()

# Build app
app = DatabricksDash(__name__)
server = app.server

app.layout = html.Div([
    html.H1("DatabricksDash Demo"),
    dcc.Graph(id='graph'),
    html.Label([
        "colorscale",
        dcc.Dropdown(
            id='colorscale-dropdown', clearable=False,
            value='plasma', options=[
                {'label': c, 'value': c}
                for c in px.colors.named_colorscales()
            ]),
    ]),
])

# Define callback to update the graph
@app.callback(
    Output('graph', 'figure'),
    [Input("colorscale-dropdown", "value")]
)
def update_figure(colorscale):
    return px.scatter(
        df, x="total_bill", y="tip", color="size",
        color_continuous_scale=colorscale,
        render_mode="webgl", title="Tips"
    )

if __name__ == "__main__":
    app.run_server(mode='inline', debug=True)

Running this code block produces the following app:

A simple Dash app within a Databricks notebook

Here is a slightly larger example that uses PySpark to perform data pre-processing on the Databricks cluster. The dashboard itself is styled using Dash Design Kit, so the dash-design-kit package must be installed along with databricks-dash. This example is based on the databricks-connect application template but has been modified to use databricks_dash.DatabricksDash instead of dash.Dash.

A more complex Dash app within a Databricks notebook
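To illustrate the pattern behind apps like the one above, here is a minimal, hypothetical sketch: the table name samples.trips and its pickup_zip column are stand-ins for whatever Spark-accessible data you have, and the notebook’s built-in spark session is assumed:

import plotly.express as px
import dash_core_components as dcc
import dash_html_components as html
from databricks_dash import DatabricksDash

# Hypothetical source table; any Spark-accessible table works the same way.
sdf = spark.table("samples.trips")

# Heavy aggregation happens on the cluster; only the small result
# is collected to pandas for plotting.
pdf = (sdf.groupBy("pickup_zip")
          .count()
          .toPandas())

app = DatabricksDash(__name__)
app.layout = html.Div([
    html.H1("Trips per pickup ZIP"),
    dcc.Graph(figure=px.bar(pdf, x="pickup_zip", y="count")),
])

if __name__ == "__main__":
    app.run_server(mode='inline', debug=True)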

databricks-connect

databricks-connect is the recommended way to get PySpark models and Dash applications from Databricks notebooks to production. databricks-connect is a Spark client library distributed by Databricks that allows locally written Spark jobs to be run on a remote Databricks cluster. After installing and configuring databricks-connect and PySpark, developers and data scientists can run Dash and PySpark code in their favorite IDEs and no longer need to use Databricks notebooks. To make this happen, simply import PySpark, as you would any other Python module, and write PySpark code alongside your Dash code base. We’ve made this video demo of how to utilize databricks-connect. The end result is a Dash application that can query your Databricks cluster for distributed processing, which is essential for big-data use cases. This matters because a Dash application built with databricks-connect can be deployed to Plotly’s Dash Enterprise and made production-ready, the ideal all-Python workflow!
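Here is a minimal sketch of that workflow, assuming databricks-connect has already been installed and configured per the Databricks documentation:

import dash
import dash_html_components as html
from pyspark.sql import SparkSession

# With databricks-connect configured, getOrCreate() returns a session
# whose jobs run on the remote Databricks cluster.
spark = SparkSession.builder.getOrCreate()

app = dash.Dash(__name__)

# The Spark count executes remotely; only the number travels back.
row_count = spark.range(10 ** 8).count()

app.layout = html.Div([
    html.H1("Rows counted on the remote cluster: {:,}".format(row_count)),
])

if __name__ == "__main__":
    app.run_server(debug=True)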

Here is an example of a Dash application built with databricks-connect. This Dash application uses Yelp’s open dataset and plots restaurant establishments in Toronto, Calgary, and Montreal on a map. Clicking Submit triggers a Spark job on our Databricks cluster, which filters and matches records based on the given criteria.

A Dash app on Dash Enterprise, connecting to a Databricks Spark cluster through databricks-connect

In summary, the two ways to integrate Dash with Databricks offer, respectively, quick notebook-style prototyping and high-performance production deployment of analytical apps. Both methods provide a path to Plotly’s Dash Enterprise as the recommended solution for operationalizing AI/ML models and delivering data directly to business users.

Databricks brings a best-in-class Python analytics-processing backend, and Plotly’s Dash brings a best-in-class Python front-end! The documentation for installing, creating, and deploying databricks-dash applications will be available with the Dash Enterprise 4.0 release in July 2020.

We’ll be posting some more info about our Databricks partnership in the coming weeks on our Twitter and LinkedIn, so stay tuned! If you have any questions or would like to learn more about Plotly Dash and Databricks integration, email info@plotly.com, and we’ll get you started!
