Imagine this scenario...
You have started a new job as a Data Engineer. It's an exciting time: new data, new tools, new people! While you are still finding your feet, your manager, who is equally excited to have you look at their backlog, sends a request your way.
"One of our stakeholders has been waiting on this for a long time - could you look into this job that's resulting in a very large table in the data warehouse. The stakeholder would love an aggregated table with the data rolled up to a week."
What is one of the first things you do?
Data Profiling!
Data profiling is the systematic analysis of the data in a table to understand its structure and relationships, leading to conclusions about how usable it is.
The tools for extracting this information are measures from descriptive statistics: min, max, mode, frequency, sum, count (you get the picture).
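As a rough sketch, here is what collecting those measures might look like in Python with pandas. The file name and column names (orders_extract.parquet, order_amount, order_status) are invented for illustration and are not from the original scenario.

```python
import pandas as pd

# Hypothetical extract of the large warehouse table
orders = pd.read_parquet("orders_extract.parquet")

# Column-level descriptive statistics: count, min, max, mean, quartiles
print(orders["order_amount"].describe())

# Mode and frequency distribution of a categorical column
print(orders["order_status"].mode())
print(orders["order_status"].value_counts())
```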
The next set of properties to examine are data types, nulls, and uniqueness, which answer questions about missing data, duplicates, and so on.
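These checks might look something like the sketch below, again using the made-up orders extract from the previous example:

```python
import pandas as pd

orders = pd.read_parquet("orders_extract.parquet")  # hypothetical extract

# Data types of every column
print(orders.dtypes)

# Missing data: percentage of nulls per column
print(orders.isna().sum() / len(orders) * 100)

# Uniqueness: distinct values per column, and fully duplicated rows
print(orders.nunique())
print(f"Duplicate rows: {orders.duplicated().sum()}")
```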
You can run this analysis on individual columns or perform a cross-column analysis.
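A cross-column analysis looks at how columns relate to each other rather than at one column in isolation. Here is a small sketch under the same assumptions; the rule and column names (order_date, ship_date, customer_id, order_id) are hypothetical examples of the kind of relationship you might check:

```python
import pandas as pd

orders = pd.read_parquet("orders_extract.parquet")  # hypothetical extract

# Cross-column rule: a shipped order should not ship before it was placed
bad_dates = orders[orders["ship_date"] < orders["order_date"]]
print(f"Rows where ship_date precedes order_date: {len(bad_dates)}")

# Is the assumed business key actually unique across the table?
dupes = orders.duplicated(subset=["customer_id", "order_id"]).sum()
print(f"Duplicate (customer_id, order_id) pairs: {dupes}")
```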
At the end of this exercise, you should be able to answer a few questions:
1) Is this data able to support the analysis/reporting use case?
2) Do the data quality issues originate from the source or from the ETL jobs?
3) Are there inconsistencies in the structure of the table?
So, that was a short introduction to the data profiling technique. For further reading, here are some links.