If you have ever worked on cloud-based solutions, you have probably heard the term “Data Lake”. In Azure, this is called Azure Data Lake Storage, or ADLS.
ADLS is a repository for all types of raw data from different sources, built on top of Blob Storage. This raw data can later be transformed for reporting, visualization, advanced analytics and machine learning. ADLS can store data in different file formats such as “.txt”, “.csv” and “.parquet”. We can preview this data using Azure File Preview in ADLS itself, but only for a few readable formats and only for the first rows of a file (approximately 25). If a source file has more rows than that, we cannot preview all the records; instead, we need to download the file from ADLS. Because Azure limits file downloads to 250 MB, it is not possible to download a file with a large data set, which is a big challenge when you have to analyse raw data or validate transformed data containing billions of records.
PolyBase is a great way of accessing our files in the cloud via simple T-SQL. Using PolyBase, we can create external tables in SQL Data Warehouse over the ADLS files and then easily query the data with T-SQL.
Below is the “External Tables” folder in the SQL database Object Explorer, where you can see the external tables:
Below is the query to create an external table:
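As a minimal sketch of the full setup, the external table needs an external data source and a file format to exist first. All object names, the storage account, the container and the column list below are illustrative assumptions, not taken from the article:

```sql
-- Assumes an ADLS account "mydatalake" and a database-scoped
-- credential "AdlsCredential" already exist; names are hypothetical.
CREATE EXTERNAL DATA SOURCE AdlsDataSource
WITH (
    TYPE = HADOOP,
    LOCATION = 'abfss://container@mydatalake.dfs.core.windows.net',
    CREDENTIAL = AdlsCredential
);

CREATE EXTERNAL FILE FORMAT CsvFileFormat
WITH (
    FORMAT_TYPE = DELIMITEDTEXT,
    FORMAT_OPTIONS (FIELD_TERMINATOR = ',', FIRST_ROW = 2)
);

CREATE EXTERNAL TABLE ext.Sales
(
    SaleId   INT,
    SaleDate DATE,
    Amount   DECIMAL(18, 2)
)
WITH (
    LOCATION = '/raw/sales/',      -- folder or file path inside the data source
    DATA_SOURCE = AdlsDataSource,
    FILE_FORMAT = CsvFileFormat
);
```

No data is moved by these statements; the table is only a schema mapped onto the files at LOCATION.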
LOCATION, DATA_SOURCE and FILE_FORMAT are the three mandatory properties for creating an external table. One can create a separate external table for each file in ADLS. Files that are repeatedly overwritten in ADLS do not require a new table to be created each time, because no data is loaded into the external table; a PolyBase external table is just a utility to access the latest files from ADLS.

Sometimes you might get errors when querying the external tables if records in your source file have issues. For that you can use the REJECT_VALUE property, where you specify how many rows may be rejected; malformed rows are skipped as long as their number stays below this threshold. This serves as a file-format validation when the source data is not in the expected format for all records across the file.
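The rejection tolerance described above can be sketched as follows, reusing the hypothetical data source and file format names from earlier (all names are assumptions):

```sql
-- Tolerates up to 10 malformed rows per query before the query fails.
CREATE EXTERNAL TABLE ext.SalesTolerant
(
    SaleId   INT,
    SaleDate DATE,
    Amount   DECIMAL(18, 2)
)
WITH (
    LOCATION = '/raw/sales/',
    DATA_SOURCE = AdlsDataSource,
    FILE_FORMAT = CsvFileFormat,
    REJECT_TYPE = VALUE,   -- count rejected rows as an absolute number
    REJECT_VALUE = 10
);
```

REJECT_TYPE = PERCENTAGE is also available when a relative threshold is more appropriate for large files.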
Now, for data validation, we can run against external tables all the SQL queries we would use to validate a regular physical table in SQL Data Warehouse: record count validation, NULL validation, primary key/foreign key validation, normalization checks, etc.
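A few of these checks, written against the hypothetical ext.Sales table from the earlier sketch, might look like this:

```sql
-- Record count validation
SELECT COUNT(*) AS TotalRows
FROM ext.Sales;

-- NULL validation on a column expected to be populated
SELECT COUNT(*) AS NullAmounts
FROM ext.Sales
WHERE Amount IS NULL;

-- Duplicate-key check (external tables have no enforced PK,
-- so uniqueness must be verified by query)
SELECT SaleId, COUNT(*) AS Occurrences
FROM ext.Sales
GROUP BY SaleId
HAVING COUNT(*) > 1;
```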
For example: if the target of a data load is the SQL Server DWH, then for data validation we can compare the source file we processed with the data loaded in the DWH, to check whether all records match between source and target. This one-to-one comparison is only applicable when no business rule or data normalization has been applied to the data.
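One way to sketch this one-to-one comparison is with EXCEPT, assuming a hypothetical target table dbo.FactSales with the same columns as the external table:

```sql
-- Rows present in the source file but missing from the target DWH table
SELECT SaleId, SaleDate, Amount FROM ext.Sales
EXCEPT
SELECT SaleId, SaleDate, Amount FROM dbo.FactSales;

-- Rows present in the target but not in the source file
SELECT SaleId, SaleDate, Amount FROM dbo.FactSales
EXCEPT
SELECT SaleId, SaleDate, Amount FROM ext.Sales;
```

Both queries returning zero rows indicates the source and target match record for record.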
In most BI projects, data cleansing is performed to refine the data and improve data quality. This makes data validation more complex when our target is another ADLS layer itself instead of a database. In that case, we can create two external tables: one over the first data lake folder, where we have our source file, and another over the second data lake folder, for the target file. These two external tables can then be compared to validate record counts and any business logic or transformation applied.
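A simple count comparison between the two layers could be sketched as below; ext.RawSales and ext.CuratedSales are hypothetical external tables over the source and target folders:

```sql
-- Side-by-side record counts for the raw and curated ADLS layers
SELECT
    (SELECT COUNT(*) FROM ext.RawSales)     AS SourceCount,
    (SELECT COUNT(*) FROM ext.CuratedSales) AS TargetCount;
```

When a transformation is applied between the layers, the same transformation logic can be expressed in T-SQL over ext.RawSales and compared with ext.CuratedSales.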
Parquet files are not human-readable and cannot be validated directly in ADLS; data validation for such files is only possible through external tables. So we can conclude that ADLS data validation for large data sets is possible, provided the user has read and write access on the data lake as well as on the SQL DWH to create external tables.