...

Data Management with Data LakeHouse

Data Warehouses and Data Lakes have been the traditional data storage architectures.

The Data Warehouse is the oldest of these data management architectures. It holds only structured, relational data, which is used for BI reporting and data visualization, and it provides easy access for end users.

The Data Lake, on the other hand, came into the picture a few years back, when cloud infrastructure was introduced. A Data Lake can store unstructured, structured, and semi-structured data. Data Engineers and Data Scientists access raw data from different data sources in the Data Lake for analysis and to develop machine learning models.

Data Lakehouse?

Data Lake + Data Warehouse = Data Lakehouse

Data Lakehouse

Data Lakehouse is a newer data storage architecture that offers large storage capacity and combines the benefits of the Data Warehouse and the Data Lake. A Lakehouse can store unstructured, structured, and semi-structured data, which can be used by anyone (Data Engineers, Data Analysts, Data Scientists, Business Analysts, etc.) for BI reporting as well as for deep analysis and machine learning models. It provides the ease of access and advanced analytics capabilities of a Data Warehouse at a lower cost, whereas maintaining separate Data Lake and Data Warehouse structures is expensive.

Data Lakehouse architecture

Delta Lake

A Data Lakehouse basically has 3 layers:

  1. Ingestion Layer, also called Bronze, is where raw data is loaded into ingestion tables.
  2. Staging/Refined Layer, also called Silver, contains refined and cleaned data.
  3. Feature/Agg Data Store, also called Gold, contains aggregated/curated data.

After each layer, data quality improves and the data is transformed into a more structured format. In the curated (Gold) layer, data is available in its most refined form and can be consumed for different use cases.
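The flow through the three layers can be illustrated with a minimal sketch in plain Python. The order records, field names, and the `to_silver` / `to_gold` helpers are hypothetical and only serve the illustration; a real Lakehouse would typically run equivalent steps on tables with an engine such as Spark.

```python
from collections import defaultdict

# Bronze: hypothetical raw order records, loaded exactly as they arrived.
bronze = [
    {"order_id": "1", "region": "EU", "amount": "10.50"},
    {"order_id": "2", "region": "eu ", "amount": "4.50"},
    {"order_id": "2", "region": "eu ", "amount": "4.50"},  # duplicate record
    {"order_id": "3", "region": "US", "amount": "n/a"},    # unparseable amount
    {"order_id": "4", "region": "US", "amount": "20"},
]


def to_silver(rows):
    """Bronze -> Silver: parse types, normalize values, drop bad and duplicate rows."""
    seen = set()
    silver = []
    for row in rows:
        try:
            amount = float(row["amount"])
        except ValueError:
            continue  # skip rows whose amount cannot be parsed
        if row["order_id"] in seen:
            continue  # skip duplicates of an order already kept
        seen.add(row["order_id"])
        silver.append({
            "order_id": int(row["order_id"]),
            "region": row["region"].strip().upper(),
            "amount": amount,
        })
    return silver


def to_gold(rows):
    """Silver -> Gold: aggregate the cleaned rows into totals per region."""
    totals = defaultdict(float)
    for row in rows:
        totals[row["region"]] += row["amount"]
    return dict(totals)


silver = to_silver(bronze)
gold = to_gold(silver)
print(gold)
```

Each function takes the previous layer as input and never modifies it, so the raw Bronze records stay available if the cleaning or aggregation rules need to change later.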

Advantages of Data Lakehouse 

Diverse data ingestion: A Data Lakehouse can store structured, unstructured, and semi-structured data, whether it arrives in batches or as streams. Relational data from any database, images, videos, audio, reports, etc. can be stored in raw form in the Data Lakehouse.

Cost effective: A Data Lakehouse is cost effective because it removes the cost of maintaining 2 different data management architectures, especially since managing relational data in a Data Warehouse is expensive.

One platform for all use cases: Curated or structured data from Delta Lake can be used directly for BI reporting and dashboards, advanced analytics, developing machine learning models, etc. There is no need to process this data on any other platform.

Performance: With a high-speed query engine such as the one Databricks provides, queries execute quickly. Also, because the Lakehouse architecture removes some simple ETL steps, data processing is faster.

High-level data quality: From the raw layer to the curated layer, data is cleansed and transformed across multiple staging layers in Delta Lake, and data quality is validated at each layer, which makes the final structured/curated data far less error-prone.
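The per-layer validation described under data quality can be sketched as a small quality gate. This is only an illustration in plain Python: the `check_layer` helper, the rule names, and the customer records are all assumptions made for the example, not part of Delta Lake or any specific product.

```python
def check_layer(rows, rules):
    """Split rows into (valid, rejected) using a dict of named rule predicates."""
    valid, rejected = [], []
    for row in rows:
        # Collect the name of every rule this row fails.
        failed = [name for name, rule in rules.items() if not rule(row)]
        if failed:
            rejected.append((row, failed))
        else:
            valid.append(row)
    return valid, rejected


# Hypothetical rules applied before rows are promoted to the next layer.
silver_rules = {
    "has_customer_id": lambda r: r.get("customer_id") is not None,
    "age_in_range": lambda r: isinstance(r.get("age"), int) and 0 <= r["age"] <= 120,
}

rows = [
    {"customer_id": 1, "age": 34},
    {"customer_id": None, "age": 29},  # fails has_customer_id
    {"customer_id": 3, "age": 250},    # fails age_in_range
]

valid, rejected = check_layer(rows, silver_rules)
print(len(valid), "valid,", len(rejected), "rejected")
```

Keeping the rejected rows together with the names of the rules they failed, rather than silently dropping them, makes it easier to trace quality problems back to their source.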

Conclusion

Data Lakehouse is the latest architecture and can handle all data types on a single platform. Thanks to its high flexibility, low cost, and open standards, it can be adopted for different business use cases. On the other hand, it is quite tricky to implement because of its complex architecture, so before implementing it, consider all the available solutions and choose the best option for your organization's requirements.
