Azure Databricks is an Apache Spark based analytics platform and one of the leading technologies for big data processing, developed together by Microsoft and Databricks. It is used to process large workloads of data and also helps in data engineering, data exploring and visualizing data using Machine learning. It is extremely fast, scalable and easy to use.
Azure Databricks workspace is integrated with Azure to provide streamlined workflows. Data ingestion in Azure happens through Azure Data Factory, in batches or in real-time. This data will be stored in Azure Data Lake storage for a long term usage. Azure Databricks is used to read this data from Blob Storage, Data lake storage and Azure SQL Data warehouse and Cosmos DB.
ADF Pipeline with Databricks configuration :
Databricks delivers a unified analytics platform powered by Apache Spark which is an open-source cluster-computing framework. In Azure Databricks, Databrick Notebook is a web-based document, containing executable code and its output. Each Notebook, contains multiple cells where multiple commands can be executed and can be created in multiple language as Python, Scala, R and SQL. While creating notebooks, default cluster (created already) can be attachedand can be modified later also.
Creating a new Notebook for Testing:
When we talk about Databricks Testing, it is basically the testing of Databricks Notebooks, which can be created to perform specific ETL/ELT tasks and as mentioned previously are configured in ADF Pipelines. One Pipeline can have multiple Notebooks configured in different pipeline activities, which is mostly about Reading the data from different source and writing to various Target systems. It also includes data transformation before writing.
In most of the Analytics Project data is read from ADLS and Azure SQL DWH, then written to the another ADLS folder.
Method 1: When reading the data from the Azure SQL DWH, we need to create the connection first as mentioned in below screenshot:
This Notebook can be used/called in another notebook as:
Here “/Shared/Shared/connections/” is the path of the Notebook, where database connection was created earlier.
After connecting to the database, we can access the database tables. Output of any SQL query can be stored in a dataset.
Spark will automatically read the schema from the database table and maps its types back to Spark SQL types.
You can query this jdbc table using spark SQL, in the same notebook.
In this way, we can easily validate SQL database tables, which are having billions of records and take huge time to process the complete query in SSMS. Syntax in Spark SQL and SQL query is almost same, so we can get the same output in less time and can visualize the data also, which will be more helpful in data analysis.
Method 2: Similar to SQL Data warehouse table, we can access the files from Azure Data Lake storage also, which might be stored in different formats as csv, txt, parquet etc. T
In case of ADLS files, ADLS Storage need to be mounted on Databricks cluster. Once mounting is completed, there are multiple ways to access the ADLS file.
Below is the easiest way to access the ADLS files for data testing, if file location is known.
Create a data frame to access the file from ADLS location –
Create a view on this data frame
Which can be queried again using Spark SQL?
If file location is not sure, we can use dbutils.fs.ls (“<file path>”)to list out all the child folders and file details from a path :
Command used in above in above screenshot, will show all the folders and files created inside folder2 folder.
The goal of Data testing is to make sure that end to end system runs smoothly and error -free while maintaining performance and security. Using Databricks with SparkSQL, makes Data Analysis activities more easy and more effective for data testing. We can create multiple views using the above methods and these views can be joined together to validate data between multiple files/database tables.