Why metadata is key in data lakes

Directing your data pipelines to a data lake is great - but how do you keep track of the context of your datasets and the interdependencies between them?


Automating data collection and centralization is great - but how do you keep track of the context of your datasets and the interdependencies between them? This is where metadata comes into play - in particular data lineage, which describes the origins and transformation history of a dataset!

This article is based on our research work at the BMW Group on the automated collection and visualization of data lineage throughout data pipelines - check out our paper here.

Why a data lake in the first place?

With the rise of cloud infrastructure, collecting vast amounts of data from various sources is easier than ever before. Analyzing this data is crucial to understand your business, customers and markets and to provide the right services and products. Starting around 2010, distributed data processing frameworks such as Apache Spark have gained momentum to help data engineers and scientists process petabyte-scale data. However, the analysis of a dataset can be arbitrarily complex: the majority of data collected today requires thorough preprocessing and often only provides value in combination with other datasets. This is where the pain for larger companies begins: combining datasets for an analysis can be a challenging task, for example because a) the data analyst might not know that other relevant datasets exist, b) the company lacks an access management process, which slows down getting access to data, or c) the source systems are not designed for analytical usage.

To cope with these challenges, organizations have increasingly started to adopt and implement data lake concepts.

What is a Data Lake?

In a data lake, data is stored and organized centrally, which increases accessibility and simplifies the analysis of datasets. In organizational terms, this means that departments centralize their data and provide easy means for colleagues from other departments to access it. Sharing data across departments lets the company's data ecosystem flourish, enables new perspectives for data analysis and counteracts the limitations of organizational silos. For a more technical perspective on data lakes, check out the AWS whitepaper as a reference implementation.


Data silos (left) compared to a data lake as a centralized data store (right) from the perspective of a data analyst. As all departments provide their data to the data lake, the data analyst can easily extract the required data from a single data store instead of navigating the organizational jungle of departments to access the desired data.

By accepting arbitrary data formats as input, data lakes tremendously lower the barrier to introducing data from new sources. While this is beneficial for collecting data at scale, it also creates complex challenges with regard to discoverability and documentation [1]. Expensively gathered data is rendered useless if there is no effective data governance and metadata management strategy in place, simply because other stakeholders can hardly identify the context of the data [2].

According to Gartner, about 80 percent of data lakes do not effectively collect metadata [2].

TL;DR: Funneling data into a centralized data store will hardly provide added value as long as there is no metadata documented for the ingested data. If you do not know what a dataset is about, you are not going to analyze it and the data is useless.

Why adopting a Data Catalog helps

In order to mitigate this, organizations introduce data catalogs to document (previously) ungoverned data in the platform. A data catalog maintains an inventory of datasets and provides documentation and management capabilities for them [2]. This metadata can then be utilized by (downstream) consumers to identify the context of data. Typical (actively maintained) metadata describes the owner of a dataset, a general description and the size of the dataset, but also column-level descriptions for each contained property.
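To make this more tangible, here is a minimal sketch of what such a catalog entry could contain. The field and dataset names are illustrative assumptions and do not reflect any particular catalog product:

```python
# Hypothetical catalog entry for a dataset; all names and values are made up.
catalog_entry = {
    "name": "patients_depersonalized",
    "owner": "health-analytics-team@example.com",
    "description": "Depersonalized patient master data, refreshed daily.",
    "size_bytes": 12_884_901_888,
    "columns": [
        {"name": "patient_id", "type": "string", "description": "Surrogate key, contains no PII."},
        {"name": "age",        "type": "int",    "description": "Derived from the date of birth."},
    ],
}
```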

While this helps data consumers to better understand and grasp the context of data, such catalogs are often not able to sufficiently capture the lineage of datasets given the complex set of variables involved even with thriving data communities in place.

How Data Lineage complements classic data catalogs

Data lineage (aka Data Provenance) surfaces the origins and transformations of data and provides valuable context for data providers and consumers [3]. We typically differentiate between the following two granularities of lineage for retrospective workflow provenance [4, 5], both sketched in code right after this list:

  • Coarse-grained: describes the interconnections of pipelines, databases and tables.
  • Fine-grained: exposes details on applied transformations that generate and transform data.
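To make the distinction concrete, below is a minimal Python sketch of how the two granularities could be recorded. The class and field names are illustrative assumptions, not the data model from our paper or from any particular lineage tool:

```python
from dataclasses import dataclass

@dataclass
class CoarseGrainedLineage:
    """Dataset-level view: which pipeline read which inputs to produce which output."""
    pipeline: str
    inputs: list[str]
    output: str

@dataclass
class FineGrainedLineage:
    """Column-level view: how a single output column was derived from its sources."""
    output_column: str
    source_columns: list[str]
    transformation: str

# Example records for a hypothetical depersonalization pipeline
coarse = CoarseGrainedLineage(
    pipeline="depersonalize_patients",
    inputs=["raw_patients"],
    output="patients_depersonalized",
)
fine = FineGrainedLineage(
    output_column="patients_depersonalized.age",
    source_columns=["raw_patients.date_of_birth"],
    transformation="floor(datediff(current_date(), date_of_birth) / 365.25)",
)
```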

Why is this important, you might ask? Before a data analyst is able to start, the data is usually processed within multiple ETL pipelines. A common practice is to reintroduce the dataset to the data lake at various maturity levels. In the following example, we can see how patient data is first stored in a raw format, then each entry in the dataset is depersonalized and a new property is calculated. The resulting dataset is then merged with another dataset to form the ready-for-analysis dataset.

Example of a dataset evolving from raw to ready-for-analysis maturity: each state of the data represents a dataset of its own within the data lake and each arrow represents an ETL data pipeline.
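Expressed as code, such a pipeline might look roughly like the following PySpark sketch. The storage paths, column names and the depersonalization logic are purely illustrative assumptions and are not taken from the paper:

```python
# Hypothetical PySpark pipeline for the figure above; all paths and columns are made up.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("patients-maturity-pipeline").getOrCreate()

# 1) Raw patient data as ingested into the data lake
raw = spark.read.parquet("s3://datalake/raw/patients")

# 2) Depersonalize the entries and calculate a new property (age from date of birth)
depersonalized = (
    raw.drop("name", "address", "phone")
       .withColumn("age", F.floor(F.datediff(F.current_date(), F.col("date_of_birth")) / 365.25))
       .drop("date_of_birth")
)
depersonalized.write.mode("overwrite").parquet("s3://datalake/refined/patients")

# 3) Merge with another dataset to produce the ready-for-analysis table
treatments = spark.read.parquet("s3://datalake/refined/treatments")
ready = depersonalized.join(treatments, on="patient_id", how="inner")
ready.write.mode("overwrite").parquet("s3://datalake/curated/patient_treatments")
```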

If we do not track any data lineage, the data pipelines are basically a black box unless there is perfect documentation of the logic they contain. Just knowing which datasets have been merged (coarse-grained lineage) is already a crucial detail that should not be taken for granted.

The spectrum of possible use cases for this kind of metadata is astonishingly wide. In the following, I will elaborate on three use cases to show the diverse applicability of data lineage.

Use cases

Validate dataset requirements

A stakeholder would like to select a semantic dataset for their business analysis, but recently there has been some confusion about an attribute they require. The desired attribute can be extracted from two source databases, which differ greatly in quality. By investigating the attributes within the fine-grained lineage, the stakeholder can trace the attribute back to its initial extraction from the external database and verify that it originates from the desired source. The analyst can then proceed and use the dataset for their work.
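A minimal sketch of such a backward trace, assuming the fine-grained lineage is available as a simple mapping from each column to the columns it was derived from (all dataset and column names below are made up):

```python
# Hypothetical column-level lineage: output column -> columns it was derived from.
column_lineage = {
    "analysis.revenue_eur": ["refined.sales.revenue_eur"],
    "refined.sales.revenue_eur": ["raw.erp_system.revenue"],  # the high-quality source system
}

def trace_to_sources(column: str, lineage: dict[str, list[str]]) -> set[str]:
    """Recursively resolve a column to the source columns it was originally extracted from."""
    parents = lineage.get(column)
    if not parents:                      # no recorded parents -> treat as an original source
        return {column}
    sources: set[str] = set()
    for parent in parents:
        sources |= trace_to_sources(parent, lineage)
    return sources

print(trace_to_sources("analysis.revenue_eur", column_lineage))
# {'raw.erp_system.revenue'} -> the attribute comes from the desired database
```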

Pipeline debugging

If there are complications, the responsible data engineers might need to investigate the corresponding pipelines and how specific attributes have been created. Imagine there was an error in the pipeline for dataset x. The data engineer now has to identify all downstream datasets which consume data from dataset x. By investigating the lineage graph, they can quickly identify all downstream dependencies and re-run the respective pipelines.
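As a sketch, if the coarse-grained lineage is modelled as a directed graph (here with the networkx library and made-up dataset names), finding everything downstream of dataset x is a single reachability query:

```python
# Illustrative coarse-grained lineage graph; an edge A -> B means "a pipeline reads A and writes B".
import networkx as nx

lineage = nx.DiGraph()
lineage.add_edges_from([
    ("dataset_x", "dataset_y"),
    ("dataset_y", "dataset_z"),
    ("dataset_x", "report_a"),
    ("dataset_w", "dataset_z"),
])

affected = nx.descendants(lineage, "dataset_x")
print(affected)  # {'dataset_y', 'dataset_z', 'report_a'} -> their pipelines need a re-run
```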

Identification of compliance issues

There are restrictions on how PII may be utilized and processed within data pipelines and analyses. In order to enforce these compliance requirements, a tool is needed to check data pipelines for compliance. With data lineage, the usage of PII and its depersonalization can be tracked and automatically detected throughout the plethora of pipelines.
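A minimal sketch of such a check, assuming column-level lineage plus a set of columns known to result from approved depersonalization steps; all dataset and column names are illustrative:

```python
# Hypothetical compliance check: flag output columns that still expose PII.
pii_columns = {"raw.patients.name", "raw.patients.date_of_birth"}
depersonalized_columns = {"refined.patients.age"}  # produced by an approved depersonalization step

column_lineage = {
    "refined.patients.age": ["raw.patients.date_of_birth"],
    "curated.patient_treatments.age": ["refined.patients.age"],
    "curated.patient_treatments.name": ["raw.patients.name"],  # PII copied verbatim
}

def exposes_pii(column: str) -> bool:
    """A column exposes PII if it derives from a PII column without a depersonalization step."""
    if column in depersonalized_columns:
        return False
    if column in pii_columns:
        return True
    return any(exposes_pii(parent) for parent in column_lineage.get(column, []))

violations = [col for col in column_lineage if exposes_pii(col)]
print(violations)  # ['curated.patient_treatments.name']
```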

Conclusion

Within this post, we discovered the pitfalls of a basic data lake and how metadata can greatly increase its practicality. In particular, the collection of data lineage reveals valuable insights into the data pipelines and surfaces information about the creation process of the output datasets. This metadata opens up a versatile set of use cases for the involved stakeholders.

References

  • [1] I. Suriarachchi and B. Plale, “Crossing analytics systems: A case for integrated provenance in data lakes,” Proceedings of the 2016 IEEE 12th International Conference on e-Science, e-Science 2016, pp. 349–354, 2016.
  • [2] E. Zaidi, G. De Simoni, R. Edjlali, and A. D. Duncan, “Data Catalogs Are the New Black in Data Management and Analytics,” Gartner, pp. 1–16, 2017.
  • [3] M. Herschel, R. Diestelkämper, and H. Ben Lahmar, “A survey on provenance: What for? What form? What from?” VLDB Journal, vol. 26, no. 6, pp. 881–906, 2017.
  • [4] W.-C. Tan, "Provenance in Databases: Past, Current, and Future," IEEE Data Engineering Bulletin, vol. 30, no. 4, pp. 3–12, 2007. [Online]. Available: sites.computer.org/debull/A07dec/issue1.htm
  • [5] L. Carata, S. Akoush, N. Balakrishnan, T. Bytheway, R. Sohan, M. Seltzer, and A. Hopper, "A Primer on Provenance," ACM Queue, vol. 12, no. 3, pp. 1–14, 2014.