Create a Databricks Workspace in Azure

Imagine you are working in the sales department of a retail company. You need to analyze customer and sales data to gain insights into customer behavior, product performance, and sales trends. But there is a huge amount of data to analyze, so you need a tool that can process large datasets and surface trends in them. Microsoft Azure provides such a service for big data processing and analytics: Azure Databricks. In this article, let us understand more about Azure Databricks and how to create a Databricks workspace in Azure.

Overview of Azure Databricks

  • Azure Databricks is an Apache Spark-based big data and analytics platform that Microsoft offers as one of its services in Azure.
  • Designed in collaboration with the founders of Apache Spark, Databricks is integrated with Azure to provide one-click setup, streamlined workflows, and an interactive workspace that enables collaboration between data scientists, data engineers, and business analysts.

Features of Azure Databricks

  • As Databricks is an Apache Spark-based environment, one of the key features we get is Spark SQL and DataFrames, a library for working with structured data that behaves much like tables (see the sketch after this list).
  • Databricks also provides services for streaming data, which helps in real-world applications and scenarios like IoT and live systems.
  • It also provides machine learning capabilities through the Spark framework itself, via Spark's MLlib library.
  • As everything is based on the Spark Core API, we have the flexibility to use programming languages such as R, SQL, Python, Java, and Scala. While it is limited to these technologies, for data processing this is more than enough.
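
To get a feel for the DataFrame and Spark SQL features mentioned above, here is a minimal PySpark sketch. The table and column names are made up for illustration, and it assumes it runs in a Databricks notebook, where the spark session is predefined.

    # Build a small DataFrame of sales records (illustrative data only).
    data = [("Laptop", 1200), ("Phone", 800), ("Laptop", 1500)]
    df = spark.createDataFrame(data, ["product", "amount"])

    # DataFrames support table-like operations...
    df.groupBy("product").sum("amount").show()

    # ...and can be registered as a view and queried with Spark SQL.
    df.createOrReplaceTempView("sales")
    spark.sql("SELECT product, SUM(amount) AS total FROM sales GROUP BY product").show()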

Creating a Databricks workspace in Azure

Let us look at a step-by-step approach to creating a Databricks workspace in Azure.

Step 1: Creating an Azure Databricks Workspace instance

Navigate to the Azure portal and sign in with your credentials.

Now, go to the Azure Marketplace, search for the Azure Databricks service, and click on the Create button.

Now you will be navigated to a page where you can enter all the details of your workspace.

In the Basics tab, enter the project details such as the subscription name and resource group name. You can use an existing resource group or create a new one according to your requirements.

Following that, provide the instance details: the name of the workspace, the deployment region, and the pricing tier.

Azure offers three pricing tiers for the Databricks service: premium, standard, and trial. Choose according to your requirements; a short description of each tier is shown on the page itself. To learn more about the pricing tiers of Azure Databricks, check out the official docs.

These are the mandatory details you need to provide to create a simple, basic workspace instance. Other tabs such as networking, security, and encryption can be customized according to your requirements; these details are optional for now.

Now, hit the Review + create button to review and create your workspace.

Once validation passes, review the details of your workspace instance and click on Create; your instance will start deploying.
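
If you prefer automation over the portal, the same deployment can be scripted. Below is a minimal sketch using the Azure SDK for Python; it assumes the azure-identity and azure-mgmt-databricks packages are installed, and the subscription ID, names, and region are placeholders to replace with your own.

    from azure.identity import DefaultAzureCredential
    from azure.mgmt.databricks import AzureDatabricksManagementClient
    from azure.mgmt.databricks.models import Sku, Workspace

    # Placeholder values -- replace with your own subscription and names.
    subscription_id = "<your-subscription-id>"
    resource_group = "my-resource-group"
    workspace_name = "my-databricks-workspace"

    client = AzureDatabricksManagementClient(DefaultAzureCredential(), subscription_id)

    # Start the deployment; this mirrors the Basics tab in the portal.
    poller = client.workspaces.begin_create_or_update(
        resource_group,
        workspace_name,
        Workspace(
            location="eastus",            # deployment region
            sku=Sku(name="standard"),     # pricing tier
            # Databricks provisions its resources into a separate managed
            # resource group, whose ID must be supplied up front.
            managed_resource_group_id=(
                f"/subscriptions/{subscription_id}"
                f"/resourceGroups/{workspace_name}-managed-rg"
            ),
        ),
    )
    print(poller.result().provisioning_state)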

Step 2: Launching the Databricks workspace

Once your workspace instance is deployed, click on the Go to resource button.

Now, click on the Launch Workspace button. Azure Databricks uses single sign-on (SSO), so there is no need to enter your credentials again.

You will land on the Azure Databricks UI.

The options and services available in Databricks depend in part on the pricing tier you chose.

Step 3: Exploring the Azure Databricks workspace

Go to the Workspace tab, where we can see the details and list of workspaces we have.

As we have created a workspace instance, we can see the options available to import or create code in our workspace.

In the workspace folder, we have two subfolders, Users and Shared: the Users folder contains a home folder for each user of the workspace, while the Shared folder holds items that are shared across users.
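
For reference, the same folder structure can be listed from code. Here is a minimal sketch using the Databricks SDK for Python; it assumes the databricks-sdk package is installed and that the DATABRICKS_HOST and DATABRICKS_TOKEN environment variables point at your workspace.

    from databricks.sdk import WorkspaceClient

    # Credentials are read from the environment (assumed to be set).
    w = WorkspaceClient()

    # List the top-level workspace folders (Users, Shared, Repos, ...).
    for obj in w.workspace.list("/"):
        print(obj.object_type, obj.path)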

The next folder is “Repos”. Here we can create a repository and import code from external Git providers like GitHub. Let’s see a demo.

Click on the Add Repo button. Now enter the Git repository URL and the Git provider, then click on Create Repo.

Now we need to authorize Databricks in GitHub. We have two options for authorization: one is directly through a GitHub account, and the other is by using personal access tokens.

Let us do it by linking the GitHub account.

Once the authorization process is done, we will be navigated back to the Databricks UI. Now we can play around with the repo.
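
The same repo link can also be created from code. Here is a minimal sketch with the Databricks SDK for Python, under the same assumptions as before; the repository URL and workspace path are placeholders.

    from databricks.sdk import WorkspaceClient

    w = WorkspaceClient()

    # Clone a GitHub repository into the Repos folder (placeholder values).
    repo = w.repos.create(
        url="https://github.com/your-org/your-repo.git",
        provider="gitHub",
        path="/Repos/your-user@example.com/your-repo",
    )
    print(repo.id, repo.path)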

There are two other options. One is Favorites, where we can add any repos, folders, notebooks, and so on. The other is Trash, which works like a recycle bin: once any resource is deleted within the workspace, it is moved into the Trash, from where we can restore it or permanently delete it if it is no longer needed.

Step 4: Creating clusters in the workspace

An Apache Spark cluster is the compute engine responsible for running your workloads. It is similar to an Azure virtual machine in that we provision resources for computing purposes.

Let us create a cluster. Navigate to the Compute section in the side menu and, under that, choose Clusters.

A basic cluster configuration is provided by default. We can modify it if needed or simply click Create Compute.

Once the cluster starts running, we can execute any kind of code or logic in our workspace. Make sure you have created a compute resource before trying to run any code in the workspace.
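
Clusters can likewise be provisioned from code. Here is a minimal sketch using the Databricks SDK for Python, under the same assumptions as before; the runtime version and node type are example values, so check what your region and tier offer.

    from databricks.sdk import WorkspaceClient

    w = WorkspaceClient()

    # Create a small cluster and block until it is running.
    cluster = w.clusters.create_and_wait(
        cluster_name="demo-cluster",
        spark_version="13.3.x-scala2.12",   # example Databricks runtime
        node_type_id="Standard_DS3_v2",     # example Azure VM size
        num_workers=1,
        autotermination_minutes=30,         # auto-stop idle clusters to save cost
    )
    print(cluster.cluster_id, cluster.state)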

Now let us create a Python notebook in our workspace. Navigate to the home folder of the workspace and click on Create, which shows a dropdown list of available options; choose Notebook.

Now, run some simple Python code.
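
For example, this minimal snippet verifies that the notebook is attached to a running cluster; in Databricks notebooks the spark session is predefined.

    # A simple sanity check that the notebook can execute code.
    print("Hello from Azure Databricks!")

    # Use the predefined spark session to count a million rows.
    print(spark.range(1_000_000).count())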

Databricks is known for its performance; even large queries can often be computed within minutes.

This is how we can create and explore an Azure Databricks workspace.

Azure Databricks Workspace – FAQs

How is Azure Databricks different from Apache Spark?

A. While Azure Databricks is built on Apache Spark, it includes several enhancements for performance, security, and usability, including a fully managed service, automated scaling, and an optimized runtime.

What does the Azure Databricks service offer on top of Apache Spark?

A. The Azure Databricks service offers several advanced features on top of the Spark platform, including:

  • Secure cloud storage integration
  • ACID transactions via Delta Lake integration
  • Unity Catalog for metadata management
  • Cluster management
  • Photon query engine
  • Notebooks and workspaces
  • Administration controls
  • Optimized Spark Runtime
  • Automation tools

What languages can I use in Azure Databricks notebooks?

A. Azure Databricks notebooks support Python, R, Scala, and SQL (Java is supported for jobs and libraries, though not as a notebook language). You can mix these languages in a single notebook by specifying the language at the beginning of a cell with a magic command such as %python, %sql, %scala, or %r.
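
For example, in a notebook whose default language is Python, a magic command on the first line of a cell switches just that cell to another language:

    # Cell 1 -- runs in the notebook's default language (Python here).
    print("Hello from Python")

followed by a second cell that starts with a language magic:

    %sql
    SELECT current_date() AS today  -- this cell runs as Spark SQL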

How do I ingest data into Azure Databricks?

A. Data can be ingested into Azure Databricks using different methods such as Azure Data Factory, directly from Azure storage accounts, or using built-in connectors and APIs to load data from external sources.


