Rajat Venkatesh
Data Lineage On Redshift

Data lineage is important for data governance and security. In data warehouses and data lakes, a team of data engineers maintains a canonical set of base tables. The sources of these base tables may be product events, logs, or third-party data from Salesforce, Google Analytics, or Segment.

Data analysts and scientists use the base tables for their reports and machine learning algorithms. They may create derived tables to help with their work. It is not uncommon for a data warehouse or lake to have hundreds or thousands of tables. In such a scenario, it is important to use automation and visual tools to track data lineage.

This post describes automated visualization of data lineage in AWS Redshift from query logs of the data warehouse. The techniques are applicable to other technologies as well.

GitHub logo tokern / data-lineage

Generate and Visualize Data Lineage from query history

Tokern Lineage Engine


Tokern Lineage Engine is a fast and easy-to-use application for collecting, visualizing, and analyzing column-level data lineage in databases, data warehouses, and data lakes on AWS and GCP.

Tokern Lineage helps you browse column-level data lineage

Resources

  • Demo of Tokern Lineage App

data-lineage

Quick Start

Install a demo using Docker and Docker Compose.

Download the docker-compose file from the GitHub repository.

# in a new directory run
wget https://raw.githubusercontent.com/tokern/data-lineage/master/install-manifests/docker-compose/catalog-demo.yml
# or run
curl https://raw.githubusercontent.com/tokern/data-lineage/master/install-manifests/docker-compose/tokern-lineage-engine.yml -o docker-compose.yml

Run docker-compose

docker-compose up -d

Check that the containers are running.

docker ps
CONTAINER ID   IMAGE                                    CREATED        STATUS       PORTS                    NAMES
3f4e77845b81   tokern/data-lineage-viz:latest   ...   4 hours ago    Up 4
…

Workload System of Record

A system of record of all activity in a database is a prerequisite for any type of analysis. For example, AWS Redshift has many system tables and views that record all the activity in the database. Since these tables retain data only for a limited time, it is important to persist the data. AWS provides scripts to store the data in tables within Redshift itself. For analyzing data lineage, the query logs stored in STL_QUERY and STL_QUERYTEXT are the most important.
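One wrinkle when persisting these logs: STL_QUERYTEXT stores each statement's SQL in 200-character chunks, keyed by query id with a `sequence` column giving chunk order. A minimal sketch of reassembling the full SQL, using hypothetical sample rows in place of a live cursor:

```python
from collections import defaultdict

def reassemble_queries(stl_querytext_rows):
    """Rebuild full SQL text from STL_QUERYTEXT rows.

    STL_QUERYTEXT splits each statement into 200-character
    chunks; `sequence` gives the chunk order within a query.
    """
    chunks = defaultdict(list)
    for row in stl_querytext_rows:
        chunks[row["query"]].append((row["sequence"], row["text"]))
    return {
        qid: "".join(text for _, text in sorted(parts))
        for qid, parts in chunks.items()
    }

# Hypothetical sample rows, shaped like a DB-API cursor result
rows = [
    {"query": 42, "sequence": 0, "text": "SELECT * FROM sales "},
    {"query": 42, "sequence": 1, "text": "WHERE region = 'EU'"},
]
print(reassemble_queries(rows)[42])
# SELECT * FROM sales WHERE region = 'EU'
```

The query id and row shape above are illustrative; a real collector would fetch the rows from Redshift and handle statements spanning many chunks the same way.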

Tokern reads and processes the records in STL_QUERY & STL_QUERYTEXT at regular intervals. It adds the following information for every query:

  • Type of query, such as SELECT, DML, DDL, COPY, or UNLOAD.
  • Source tables and columns read by the query, if applicable.
  • Source files in S3 for COPY queries.
  • Target table and columns where the data was loaded, if applicable.
  • Target files in S3 for UNLOAD queries.
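The first step, classifying the query type, can be approximated by inspecting the leading keyword. This is a deliberately naive sketch, not Tokern's actual parser, which must also handle comments, CTEs, and dialect quirks:

```python
def classify_query(sql: str) -> str:
    """Rough query classification by leading keyword."""
    first = sql.lstrip().split(None, 1)[0].upper()
    if first == "COPY":
        return "COPY"
    if first == "UNLOAD":
        return "UNLOAD"
    if first in {"SELECT", "WITH"}:
        return "SELECT"
    if first in {"INSERT", "UPDATE", "DELETE"}:
        return "DML"
    if first in {"CREATE", "ALTER", "DROP"}:
        return "DDL"
    return "OTHER"

print(classify_query("COPY sales FROM 's3://bucket/key'"))  # COPY
print(classify_query("  select count(*) from sales"))       # SELECT
```

Extracting source and target tables is the harder part and requires a real SQL parser; keyword matching alone cannot recover column-level lineage.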

Data Lineage

Tokern uses the system of record to build a network graph for every table and pipeline. An example for infallible_galois is visualized below.

infallible_galois

In the network graph, data moves from left to right. Every node (circle) represents a table. An edge points from one table to another if the data load for the second table reads from the first; a table can be loaded with data from many tables. For example, the data load for hopeful_matsumoto reads data from hungry_margulis. The graph can be analyzed programmatically or used to create interactive charts that help data engineers glean actionable information.
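As a sketch of the programmatic side, the graph can be held as a plain adjacency map and walked in reverse to answer "which tables feed this one?". The edges below reuse table names from the article; base_events is a hypothetical source added for illustration:

```python
from collections import defaultdict, deque

# Source table -> tables whose loads read from it
# (data flows left to right, as in the visualization)
edges = {
    "hungry_margulis": ["hopeful_matsumoto"],
    "base_events": ["hopeful_matsumoto"],
    "hopeful_matsumoto": ["infallible_galois"],
}

def upstream(table, edges):
    """All tables whose data eventually flows into `table`."""
    parents = defaultdict(set)  # reverse the edge direction
    for src, dsts in edges.items():
        for dst in dsts:
            parents[dst].add(src)
    seen, queue = set(), deque([table])
    while queue:  # breadth-first walk over the reversed graph
        for p in parents[queue.popleft()]:
            if p not in seen:
                seen.add(p)
                queue.append(p)
    return seen

print(sorted(upstream("infallible_galois", edges)))
# ['base_events', 'hopeful_matsumoto', 'hungry_margulis']
```

The same reversed walk, run from a table flagged as holding sensitive data, lists every upstream table a copy could have come from.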

Reduce copies of sensitive data

The security engineers at Company S used a scanning tool to find all tables and columns that stored sensitive data such as PII, financial data, and business-sensitive data. They found many more copies than anticipated. An immediate goal was to reduce the number of copies and thereby shrink the security vulnerability surface area. To achieve this, they had to:

  • Identify the owners of the copy.
  • Understand their motivation to create copies.
  • Work with owners to eliminate copies or use masked copies of the data.

The challenges to achieving these goals were:

  • Determine the owners of the data.
  • Number of conversations required to understand the need for copies and agree on a workaround.

Due to the sheer number of tables, the task was daunting and required an automation tool.

Table specific Network Graphs

Tokern provided the security team with actionable information to discover:

  • Owners of copies of data
  • A profile consisting of the queries used to create the table and read it.

The profile helped the security engineers prepare for the conversation with the owners. On a number of occasions,
duplicate tables were eliminated. In other cases, the workflow was changed to eliminate temporary tables for
intermediate data.

In the example below for upbeat_ellis, two intermediate tables with sensitive data were eliminated from the workflow.

upbeat_ellis

Large Network graphs

Another example, a network graph for crazy_tereshkova, is shown below. As seen, the data path is much more complicated, with close to a hundred tasks required to eventually load data into the table. The graph is interactive and can be panned and zoomed to focus on specific parts. To deal with such complicated data pipelines, metadata for each task, such as start time, run time, and table names, is shown by hovering over the nodes in the network graph.

tereshkova

Conclusion

Security Engineers need data lineage and related automation tools to manage database security. All databases provide a workload system of record. Data Lineage tools can use this information to visualize data lineage as well as use rules to automate checks.

If open source data lineage tools interest you, check out the data-lineage GitHub project.

