I remember, many years ago while working on a project for Anglo American, one of my responsibilities was to write extensive, complex SQL queries spanning hundreds of thousands of records, ranging over almost 10 years of data. This, in turn, was wrapped by a RESTful API and consumed by a bunch of graph-intensive frontends. That was the high-level view of the requirements; the details were far messier, the data was incomplete for the most part, and, as they say, the rest is history. The point is that I spent the next 9 months aggregating data from other sources, extrapolating some pieces out of thin air, and packing it all into ETL (Extract, Transform and Load) processes to fix the existing data and make sure new data remained spotless for years to come. Only later was I able to define what I had been doing as Data Enrichment.
Before we go any further, let’s first define Data Enrichment. Techopedia defines data enrichment as “a general term that refers to processes used to enhance, refine or otherwise improve raw data”. This is quite a broad and slightly ambiguous definition; however, it does give us the gist of it, which is to make data better in every possible way.
As for why we would want to do this: it depends on who is asking, really. If you were to explain this to a product owner or stakeholder, you would focus on the value it adds to the business, the product, and the overall bottom line. To a developer, you would argue that this is one of the most effective ways to put a smile on your boss’s face. It is also a way to make sure that the data you are visualizing is correct, makes sense, and is easy to work with, since there are fewer gaps to account for and fewer edge cases to worry about. You would really have to go through a lot of trouble, or be speaking to someone with no clue about how the digital world works, to mess up a motivation for why data enrichment is a positive thing.
This is where the rubber meets the road. There is data that needs enrichment, but where do I start? I think a good place to start is to consider manual versus automated processes.
The manual way is the oldest method of doing data enrichment. It is also still without equal at handling the most intricate edge cases in your data. The human mind and eye are experts at spotting fictitious data once the data set is understood, and can, for instance, categorize an image based on its content far more easily than a computer can (for now, at least). The use cases where manual data enrichment is needed are endless and will undoubtedly present themselves quite clearly.
The automated way is pretty old as well; integration with third-party sources and services started happening as soon as third-party sources and services became a thing. Since its inception, the possibilities have increased almost as rapidly as Moore’s Law itself. Today we have an endless array of approaches, implementations, and integrations to choose from. It ranges from algorithms designed to fix spelling mistakes in your data, adding simple data sets, doing data integration, filling in the missing pieces for conventional data, and algorithmic and statistical analysis, to machine learning frameworks like TensorFlow and Hadoop clusters circling your data lakes. Each of these methods deserves its own book (never mind a blog post) and would keep any developer extremely busy trying to master it all.
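To make the simplest end of that spectrum concrete, here is a minimal sketch of an automated clean-up pass: normalizing a free-text field against a canonical lookup table. The field name and the misspelling table are hypothetical examples, not from any particular project.

```javascript
// Map of known variants/misspellings to their canonical form.
// In a real pipeline this table would be built from profiling the data.
const CANONICAL = {
  "south afrika": "South Africa",
  "sa": "South Africa",
  "u.k.": "United Kingdom",
};

// Normalize one raw value: trim, lowercase, and look it up;
// fall back to the trimmed original when no rule matches.
function normalizeCountry(raw) {
  const key = raw.trim().toLowerCase();
  return CANONICAL[key] ?? raw.trim();
}

const records = [
  { id: 1, country: "South Afrika " },
  { id: 2, country: "U.K." },
  { id: 3, country: "Germany" },
];

const cleaned = records.map(r => ({ ...r, country: normalizeCountry(r.country) }));
console.log(cleaned.map(r => r.country)); // [ 'South Africa', 'United Kingdom', 'Germany' ]
```

A lookup table like this is obviously crude next to statistical or ML approaches, but it is often the first repeatable process you discover when profiling messy data.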
In my mind, data enrichment tools can be grouped into three categories, namely:

- ETL (Extract, Transform and Load)
- Adding or completing information
- Big Data
1. ETL (Extract, Transform and Load)

The first category of data enrichment tools is ETL. Although it is a necessary step towards data warehousing, it can also exist as a standalone solution for various data enrichment needs. Regardless of the technical definition, the idea of ETL is to take data, do something with it, and store it again. Typically, when marshaling data for the first time, we discover patterns and repeatable processes, and we can then use that knowledge to write algorithms that determine what to do with the data. Another case is where different data sources need to be integrated at a lower level.
If your needs are enterprise-grade, consider tools like SQL Server Integration Services or IBM InfoSphere DataStage. If your needs are of the medium-to-small startup variety, I’d suggest skipping the heavy frameworks and going with a roll-your-own (RYO) approach, using a simple Node.js script and AWS Lambda combination or something similar.
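The RYO approach can be sketched as three plain functions. The source rows and the transformation rules below are illustrative assumptions; in a real script, “extract” would query a database or API and “load” would write to your warehouse rather than an in-memory array.

```javascript
// Extract: pretend these rows came from a legacy table.
function extract() {
  return [
    { plant: "MINE-01", tonnage: "1200", date: "2010/03/15" },
    { plant: "mine-01", tonnage: null,   date: "2010/03/16" },
  ];
}

// Transform: normalize keys, coerce types, and flag gaps
// so missing values can be enriched later instead of silently kept.
function transform(rows) {
  return rows.map(r => ({
    plant: r.plant.toLowerCase(),
    tonnage: r.tonnage === null ? 0 : Number(r.tonnage),
    date: r.date.replace(/\//g, "-"),
    needsReview: r.tonnage === null,
  }));
}

// Load: here we just collect the rows; a real script would INSERT them.
function load(rows, sink = []) {
  sink.push(...rows);
  return sink;
}

const warehouse = load(transform(extract()));
console.log(warehouse[0]); // { plant: 'mine-01', tonnage: 1200, date: '2010-03-15', needsReview: false }
```

Keeping each stage a pure function makes the pipeline trivial to test locally before wiring it up to Lambda.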
2. Adding information or filling in the gaps
The second category is the tools that help you add to a simple data set or fill in its gaps. This is almost the most obvious way of doing data enrichment: you have data, it’s good data, but it could benefit from enhancement. For example, you have user emails but no phone numbers; tools like Lusha or LeadGenius can add that information. And that is just one example. There are so many ways to enrich data using third-party services, from extensive car information to aerial data to related map data based on location. The options are endless.
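Whatever the provider, the shape of this kind of pass is the same: find records with a gap, query the service, and merge the result. The lookup function below is a hypothetical stand-in (Lusha and LeadGenius each have their own real APIs and authentication), so the sketch takes it as a parameter.

```javascript
// Fill the missing `phone` field on each contact by asking a
// third-party lookup service, leaving complete records untouched.
async function enrichPhones(contacts, lookupPhone) {
  const enriched = [];
  for (const c of contacts) {
    if (c.phone) {
      enriched.push(c); // already complete, leave as-is
    } else {
      const phone = await lookupPhone(c.email); // third-party call
      enriched.push({ ...c, phone: phone ?? null });
    }
  }
  return enriched;
}

// Usage with a fake lookup standing in for the real service:
const fakeLookup = async (email) =>
  email === "jane@example.com" ? "+27-11-555-0100" : undefined;

enrichPhones(
  [
    { email: "jane@example.com", phone: null },
    { email: "joe@example.com", phone: "+27-21-555-0199" },
  ],
  fakeLookup
).then(result => console.log(result));
```

Passing the lookup in as a function also makes the enrichment step testable without ever hitting the paid API.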
3. Big Data
Lastly are the big data tools. This, I admit, is the category where my own hands-on experience with actual big data falters completely, not to mention the tools you can use to enrich such data. As mentioned previously, the potential of machine learning is fascinating and could change data enrichment forever by the time it matures. I researched this quite a lot, and one service I found that seems to tap into this market is Datanyze.
I think it is safe to say that Data Enrichment is a necessary process if your data lacks luster, and that it may also be exactly what is needed to grow a product to its next level. This has been a wide overview, and hopefully it makes you think about the many possibilities and applications of Data Enrichment.