FAQs on Data Lineage

#database #datagovernance #lineage

What is meant by data lineage?

In biology, a lineage is a sequence of species, each of which is considered to have evolved from its predecessor.

Similarly, data lineage is a sequence of transformations through intermediary systems to a final data set. Each data set
is considered to have been created from its predecessor through a specific transformation. A transformation may be a
SQL query or a program in a language such as Python or Scala. Data lineage can be tracked at any level of granularity:
schema, table or column.
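
As a rough illustration of the definition above, table-level lineage can be modelled as a list of edges, each recording a
source, a target and the transformation between them. This is a minimal sketch, not how any particular tool stores lineage.
The table names, the Transformation class and the trace_lineage function are all hypothetical, and the sketch assumes each
data set has exactly one predecessor.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Transformation:
    source: str  # predecessor data set (hypothetical table name)
    target: str  # data set produced by the transformation
    logic: str   # short description of the SQL query or program


# Hypothetical pipeline: raw data -> staging -> report.
EDGES = [
    Transformation("raw.orders", "staging.orders", "SQL: filter cancelled orders"),
    Transformation("staging.orders", "reports.daily_revenue", "Python: aggregate by day"),
]


def trace_lineage(dataset, edges):
    """Walk back from a data set to its origin; return the chain oldest first."""
    # Assumption: each data set has at most one predecessor.
    parents = {e.target: e.source for e in edges}
    chain = [dataset]
    seen = {dataset}
    while chain[-1] in parents:
        parent = parents[chain[-1]]
        if parent in seen:  # guard against accidental cycles
            break
        chain.append(parent)
        seen.add(parent)
    return list(reversed(chain))


LINEAGE = trace_lineage("reports.daily_revenue", EDGES)
```

Here LINEAGE holds the chain from the original table to the final report. Column-level lineage could use the same shape
with column names in place of table names.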

Why is data lineage important?

Data lineage is important because it enables key data governance functions such as:

Business Rules Verification
Change Impact Analysis
Data Quality Verification
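
Change impact analysis, for example, amounts to asking which data sets sit downstream of the one being changed. The sketch
below answers that with a breadth-first traversal of a lineage graph. The edge list and the downstream function are
hypothetical and only illustrate the idea.

```python
from collections import defaultdict, deque

# Hypothetical lineage edges as (source, target) pairs.
LINEAGE_EDGES = [
    ("raw.orders", "staging.orders"),
    ("staging.orders", "reports.daily_revenue"),
    ("staging.orders", "reports.top_customers"),
    ("raw.customers", "reports.top_customers"),
]


def downstream(dataset, edges):
    """Return every data set that directly or indirectly depends on `dataset`."""
    children = defaultdict(list)
    for source, target in edges:
        children[source].append(target)

    affected = set()
    queue = deque([dataset])
    while queue:
        node = queue.popleft()
        for child in children[node]:
            if child not in affected:  # visit each data set once
                affected.add(child)
                queue.append(child)
    return affected


IMPACTED = downstream("raw.orders", LINEAGE_EDGES)
```

With these edges, a change to raw.orders affects the staging table and both reports, while a change to raw.customers
affects only reports.top_customers.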

What is a data lineage tool?

A Data Lineage Tool captures metadata of all data transformations, organizes the metadata in a graph and provides access
to the graph through visual interfaces and programmable APIs.

In general, data lineage tools use two techniques:

Push: ETL platforms push metadata to a data lineage tool during transformations.
Pull: Data lineage tools scan logs and query history from databases and data lakes and generate lineage after the fact.

Some data lineage tools use both techniques.
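
To make the pull technique concrete, here is a minimal sketch that scans a query history and derives table-level lineage
edges from INSERT INTO ... SELECT statements. The queries are made up, and the regular expressions are a deliberate
simplification that handles only this simple statement shape. A real tool would more likely rely on a proper SQL parser.

```python
import re

# Naive patterns (an assumption of this sketch, not a full SQL grammar).
TARGET_RE = re.compile(r"insert\s+into\s+([\w.]+)", re.IGNORECASE)
SOURCE_RE = re.compile(r"(?:from|join)\s+([\w.]+)", re.IGNORECASE)

# Hypothetical query history, as might be read from a database log.
QUERY_HISTORY = [
    "INSERT INTO staging.orders SELECT * FROM raw.orders WHERE status <> 'cancelled'",
    "SELECT count(*) FROM staging.orders",
    "INSERT INTO reports.revenue SELECT o.day, sum(o.amount) "
    "FROM staging.orders o JOIN raw.customers c ON o.cid = c.id GROUP BY o.day",
]


def extract_edges(query_history):
    """Return (source, target) pairs for queries that write into a table."""
    edges = []
    for query in query_history:
        target = TARGET_RE.search(query)
        if target is None:
            continue  # read-only queries create no lineage
        for source in SOURCE_RE.findall(query):
            edges.append((source, target.group(1)))
    return edges


PULLED_EDGES = extract_edges(QUERY_HISTORY)
```

The plain SELECT in the history is skipped because it does not produce a new data set. In the push technique, the ETL job
would report the same (source, target) pairs itself instead of having them reconstructed from logs.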

Are there open source data catalog tools?

How do you build a data lineage solution for databases?

Choose one of the open source data catalog projects such as Amundsen, Apache Atlas or Data Lineage.
Follow the project's installation instructions. Some projects require a Hadoop cluster.
Integrate ETL tools, databases and data engines with the data lineage tool.