Data Lineage (3 Part Series)
I want to introduce data-lineage, an open source Python project for visualizing and analyzing data lineage. The project was developed in collaboration with data teams working on data governance initiatives over the last couple of years.
There are a lot of open source and commercial tools to capture data lineage. However, data engineers report two main problems with them:
- The projects require a lot of effort to get started and maintain.
- They require constant discipline in capturing and sending all the metadata.
Both of these factors result in incomplete projects and lost opportunities to improve performance, ROI and data quality.
data-lineage addresses these problems by focusing on the following goals:
- providing fast access to data lineage
- simple setup
- analysis of the lineage using a graph library
To achieve these goals, data-lineage has the following features:
- Generate data lineage from query history. Most databases maintain query history for a few days, so the cost of setting up infrastructure to capture and store metadata is minimal.
- Use the networkx graph library to create a DAG of the lineage. networkx graphs provide programmatic access to the lineage, which opens up rich opportunities for analysis.
- Use Plotly to visualize the graph. Plotly supports tool tips, color coding and weights based on different attributes of the graph.
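To make the first two features concrete, here is a minimal sketch of the general idea. It is not the data-lineage API: the sample queries, table names, regular expressions and the helper functions `extract_edges` and `build_lineage_graph` are all assumptions made up for illustration, and the regex-based parsing only handles simple `INSERT INTO ... SELECT` statements. Only the networkx calls (`DiGraph`, `ancestors`, `descendants`, `is_directed_acyclic_graph`) are real library functions.

```python
import re

import networkx as nx

# Hypothetical query history. In practice these statements would come
# from the database's own query log.
QUERY_HISTORY = [
    "INSERT INTO staging_orders SELECT * FROM raw_orders",
    "INSERT INTO staging_customers SELECT * FROM raw_customers",
    "INSERT INTO order_report SELECT o.id, c.name "
    "FROM staging_orders o JOIN staging_customers c ON o.customer_id = c.id",
]

# Deliberately naive patterns (an assumption of this sketch): they only
# cover simple INSERT ... SELECT statements, not a full SQL grammar.
TARGET_RE = re.compile(r"insert\s+into\s+(\w+)", re.IGNORECASE)
SOURCE_RE = re.compile(r"(?:from|join)\s+(\w+)", re.IGNORECASE)


def extract_edges(query):
    """Return (source_table, target_table) pairs found in one query."""
    target = TARGET_RE.search(query)
    if target is None:
        # Queries that do not write to a table contribute no lineage.
        return []
    return [(source, target.group(1)) for source in SOURCE_RE.findall(query)]


def build_lineage_graph(queries):
    """Build a directed graph with edges pointing from source to target."""
    graph = nx.DiGraph()
    for query in queries:
        graph.add_edges_from(extract_edges(query))
    return graph


graph = build_lineage_graph(QUERY_HISTORY)

# Every table that feeds order_report, directly or indirectly.
upstream = nx.ancestors(graph, "order_report")

# Every table affected if raw_orders changes.
downstream = nx.descendants(graph, "raw_orders")

# Lineage built this way should normally contain no cycles.
is_dag = nx.is_directed_acyclic_graph(graph)
```

Once the lineage is a networkx graph, questions such as "what feeds this table?" or "what breaks if this table changes?" become single function calls, which is the kind of programmatic analysis the second feature refers to.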
You can get a data lineage graph with fewer than 10 lines of Python code in a Jupyter Notebook.
Right now, data-lineage supports Postgres, and support for more databases is planned.
If you need data lineage for your work, please give it a try. I appreciate any feedback.