Azure Databricks Testing

Azure Databricks is an Apache Spark-based analytics platform developed jointly by Microsoft and Databricks, and one of the leading technologies for big data processing. It is used to process large data workloads and supports data engineering, data exploration, and data visualization with machine learning. It is fast, scalable, and easy to use.

The Azure Databricks workspace is integrated with Azure services to provide streamlined workflows. Data ingestion into Azure typically happens through Azure Data Factory, in batches or in real time, and the data is stored in Azure Data Lake Storage for long-term use. Azure Databricks can then read this data from Blob Storage, Data Lake Storage, Azure SQL Data Warehouse, and Cosmos DB.

ADF pipeline with Databricks configuration:

(screenshot: Databricks configuration and Notebook path)

Databricks delivers a unified analytics platform powered by Apache Spark, an open-source cluster-computing framework. In Azure Databricks, a Databricks notebook is a web-based document containing executable code and its output. Each notebook contains multiple cells in which commands can be executed, and cells can be written in multiple languages such as Python, Scala, R, and SQL. While creating a notebook, an existing default cluster can be attached, and this can also be changed later.

Creating a new Notebook for Testing:

(screenshot: Create Notebook)

When we talk about Databricks testing, it is essentially the testing of Databricks notebooks, which are created to perform specific ETL/ELT tasks and, as mentioned previously, are configured in ADF pipelines. One pipeline can have multiple notebooks configured in different pipeline activities, mostly reading data from different sources and writing it to various target systems, often with data transformation before the write.

In most analytics projects, data is read from ADLS and Azure SQL DWH and then written to another ADLS folder.

Method 1: When reading data from Azure SQL DWH, we first need to create the connection, as shown in the screenshot below:

(screenshot: Sample database connection notebook)
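A connection notebook of this kind usually just defines the JDBC URL and credentials so they can be reused from other notebooks. Below is a minimal PySpark sketch of what it might contain; the server, database, and secret-scope names (dwh-scope, dwh-user, dwh-password) are illustrative placeholders, not values from the screenshot.

# Illustrative shared "connections" notebook (all names are placeholders).
# Credentials are read from an assumed Databricks secret scope rather than hard-coded.
jdbc_hostname = "yourserver.database.windows.net"   # placeholder Azure SQL DWH server
jdbc_database = "your_dwh_db"                       # placeholder database name
jdbc_port = 1433

jdbc_url = (
    f"jdbc:sqlserver://{jdbc_hostname}:{jdbc_port};"
    f"database={jdbc_database};encrypt=true;loginTimeout=30;"
)

connection_properties = {
    "user": dbutils.secrets.get(scope="dwh-scope", key="dwh-user"),
    "password": dbutils.secrets.get(scope="dwh-scope", key="dwh-password"),
    "driver": "com.microsoft.sqlserver.jdbc.SQLServerDriver",
}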

This notebook can then be called from another notebook, as shown below:

(screenshot: calling the connections notebook)

Here, “/Shared/Shared/connections/” is the path of the notebook where the database connection was created earlier.
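A minimal way to reuse it is the %run magic command, which must sit in a cell of its own; the exact notebook name under this path is assumed here.

%run /Shared/Shared/connections

After this cell runs, the variables defined in the connection notebook (such as the JDBC URL and connection properties) are available in the calling notebook.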

After connecting to the database, we can access its tables, and the output of any SQL query can be stored in a dataset.

(screenshot: reading a database table with Spark)

Spark automatically reads the schema from the database table and maps its types back to Spark SQL types.

This JDBC table can then be queried using Spark SQL in the same notebook.

(screenshot: sample Databricks Spark SQL query)
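Putting this together, a test cell might look roughly like the sketch below; the table name dbo.sales_orders is hypothetical, and jdbc_url and connection_properties are assumed to come from the shared connections notebook.

# Read an Azure SQL DWH table over JDBC into a Spark DataFrame (illustrative names).
df = spark.read.jdbc(
    url=jdbc_url,
    table="dbo.sales_orders",          # hypothetical source table
    properties=connection_properties,
)

# Spark infers the schema from the database table and maps it to Spark SQL types.
df.printSchema()

# Register a temporary view so the same data can be queried with Spark SQL.
df.createOrReplaceTempView("sales_orders")
spark.sql("SELECT COUNT(*) AS row_count FROM sales_orders").show()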

In this way, we can easily validate SQL database tables that contain billions of records and would take a long time to query fully in SSMS. Spark SQL syntax is almost the same as standard SQL, so we can get the same output in less time and also visualize the data, which helps greatly with data analysis.

Method 2: Similar to SQL Data Warehouse tables, we can also access files from Azure Data Lake Storage, which might be stored in different formats such as CSV, TXT, or Parquet.

For ADLS files, the ADLS storage needs to be mounted on the Databricks cluster. Once mounting is complete, there are multiple ways to access the ADLS files.
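One common way to mount ADLS Gen2 is with a service principal, roughly as sketched below; the storage account, container, mount point, tenant ID, and secret-scope names are all placeholders.

# Illustrative ADLS Gen2 mount using OAuth with a service principal (placeholder values).
configs = {
    "fs.azure.account.auth.type": "OAuth",
    "fs.azure.account.oauth.provider.type":
        "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
    "fs.azure.account.oauth2.client.id": dbutils.secrets.get("adls-scope", "client-id"),
    "fs.azure.account.oauth2.client.secret": dbutils.secrets.get("adls-scope", "client-secret"),
    "fs.azure.account.oauth2.client.endpoint":
        "https://login.microsoftonline.com/<tenant-id>/oauth2/token",
}

dbutils.fs.mount(
    source="abfss://data@yourstorageaccount.dfs.core.windows.net/",  # placeholder container/account
    mount_point="/mnt/adls",
    extra_configs=configs,
)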

The easiest way to access ADLS files for data testing, when the file location is known, is shown below.

Create a data frame to access the file from the ADLS location:

(screenshot: File1 data frame)

Create a view on this data frame:

(screenshot: create or replace view)

This view can then be queried using Spark SQL:

(screenshot: select query)
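Assuming the storage is mounted at /mnt/adls and using an illustrative file path, the three steps above could look roughly like this:

# 1. Create a data frame from a file in the mounted ADLS location (path is illustrative).
file_df = (
    spark.read
    .format("csv")
    .option("header", "true")
    .option("inferSchema", "true")
    .load("/mnt/adls/folder1/file1.csv")
)

# 2. Create (or replace) a temporary view on the data frame.
file_df.createOrReplaceTempView("file1_view")

# 3. Query the view with Spark SQL in the same notebook.
spark.sql("SELECT * FROM file1_view LIMIT 10").show()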

If the file location is not known, we can use dbutils.fs.ls("<file path>") to list all the child folders and file details under a path:

(screenshot: folder2 listing)

The command in the above screenshot lists all the folders and files created inside the folder2 folder.
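For example (the mounted path below is assumed):

# List child folders and files under a path; each entry exposes path, name, and size.
for entry in dbutils.fs.ls("/mnt/adls/folder2"):
    print(entry.path, entry.size)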

The goal of data testing is to make sure that the end-to-end system runs smoothly and error-free while maintaining performance and security. Using Databricks with Spark SQL makes data analysis activities easier and more effective for data testing. We can create multiple views using the methods above, and these views can be joined together to validate data across multiple files and database tables.
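As an illustration, the two views created earlier could be compared with a join such as the one below; the column names are hypothetical.

# Compare the database view with the file view; any rows returned indicate mismatches.
mismatches = spark.sql("""
    SELECT s.order_id,
           s.amount AS source_amount,
           t.amount AS target_amount
    FROM sales_orders s
    JOIN file1_view t
      ON s.order_id = t.order_id
    WHERE s.amount <> t.amount
""")

mismatches.show()  # an empty result means the source table and target file agree on these columns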
