Some consider data enrichment cheating, others magic. In fact, it is neither. Data enrichment is an umbrella term for a range of methods spanning engineering and science, plus a pinch of experience. In this blog post, I would like to shed some light on the different methods of data enrichment, their concepts, and their requirements. No worries, I will not dive into the math.
Several rather different methods are typically grouped under this term. But like almost any type of data analysis, data enrichment starts with data preparation:
Cleaning and pre-processing are prerequisites for data enrichment. Typical tasks include detecting and handling outliers, correcting data types, and normalising values. To some extent, these tasks can be handled by more or less sophisticated automated methods (outlier detection algorithms, type correction, etc.), but a pinch of experience helps to identify critical issues.
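To make this concrete, here is a minimal cleaning sketch in Python with pandas. The dataset, column names, and value ranges are made up for illustration:

```python
import numpy as np
import pandas as pd

# Hypothetical raw data with typical quality issues:
# mixed types, a free-text missing value, and an implausible outlier.
raw = pd.DataFrame({
    "age": ["34", "29", "not available", "41", "999"],
    "city": ["Berlin", "berlin ", "Hamburg", None, "Munich"],
})

# Type correction: coerce non-numeric entries to NaN.
raw["age"] = pd.to_numeric(raw["age"], errors="coerce")

# Simple rule-based outlier handling: implausible ages become NaN.
raw.loc[~raw["age"].between(0, 120), "age"] = np.nan

# Normalisation of string columns: trim whitespace, unify casing.
raw["city"] = raw["city"].str.strip().str.title()

print(raw)
```

In practice, the rules (such as the plausible age range) come from domain knowledge, which is where the pinch of experience enters.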
After the preparation, there are basically two options for performing the enrichment: record linkage and statistical matching procedures.
Record linkage in its simplest form is actually not a statistical method but an engineering task. Nevertheless, it is an important technique for combining different datasets. Record linkage means finding records of an identical object in another dataset. At its most basic, it is nothing but a database join operation: you join records from different data sources using a common key. Often, however, record linkage is more complicated, and then it can easily become science. A nice example of the different levels of difficulty is address matching: the same address may be written in many ways (abbreviations, typos, different orderings of its parts), so an exact key match can fail where a human would immediately recognise the same object.
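The simplest case, a plain join on a shared key, can be sketched with pandas. The customer and order tables here are hypothetical:

```python
import pandas as pd

# Two hypothetical data sources sharing a common key ("customer_id").
customers = pd.DataFrame({
    "customer_id": [1, 2, 3],
    "name": ["Alice", "Bob", "Carol"],
})
orders = pd.DataFrame({
    "customer_id": [1, 1, 3],
    "amount": [25.0, 40.0, 15.0],
})

# In its simplest form, record linkage is a database-style join:
linked = customers.merge(orders, on="customer_id", how="left")
print(linked)
```

A left join keeps customers without orders in the result (with missing amounts), which is usually what you want when enriching a master dataset.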
Typical methods used for record linkage range from exact joins and rule-based matching to similarity-based (fuzzy) matching and probabilistic record linkage.
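As a sketch of a similarity-based method, the following uses Python's standard-library `difflib` to score two hypothetical spellings of the same address. Real projects typically use dedicated linkage libraries and more elaborate normalisation; the threshold below is an arbitrary assumption:

```python
from difflib import SequenceMatcher

def _normalise(s: str) -> str:
    # Lightweight normalisation: lowercase, drop dots, collapse whitespace.
    return " ".join(s.lower().replace(".", "").split())

def similarity(a: str, b: str) -> float:
    """Normalised string similarity in [0, 1]."""
    return SequenceMatcher(None, _normalise(a), _normalise(b)).ratio()

# Two hypothetical spellings of the same address.
a = "Main Street 12, 10115 Berlin"
b = "Main Str. 12, 10115 berlin"

score = similarity(a, b)

# A threshold (chosen here arbitrarily) turns the score into a decision.
is_match = score > 0.85
print(round(score, 2), is_match)
```

Choosing that threshold, and deciding which normalisations are safe, is exactly where record linkage stops being pure engineering.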
Statistical matching differs from record linkage in that it does not try to match identical objects, such as the same person or the same product occurring in two different tables “A” and “B”. Instead, additional data is inferred from the known data using statistical methods. This can be done by finding, for each entity in dataset “A”, a matching partner in dataset “B” with similar features. Alternatively, the missing piece of information can be predicted with a model that leverages the overlap between the datasets. Both approaches are only valid if the overlapping variables are statistically correlated with the variables to be enriched. The overlap can be either a set of common variables (columns) available in both datasets, or a common set of entities (rows), such as persons or products, that exist in both datasets and can be identified by record linkage, so that all variables on both sides are known for those entities.
Let’s illustrate each case with an example:
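Both approaches can be sketched in a few lines of NumPy, using hypothetical data: dataset “A” knows only age, dataset “B” knows age and spending, and age is the overlapping variable (all names and numbers are made up for illustration):

```python
import numpy as np

# Hypothetical setting: dataset "A" knows only age; dataset "B" knows
# age and spending. Age is the overlapping variable; spending is the
# variable we want to enrich "A" with.
age_a = np.array([25, 35, 45, 55], dtype=float)

age_b = np.array([27, 44, 52], dtype=float)
spending_b = np.array([1200, 2100, 2400], dtype=float)

# Approach 1: statistical matching -- each record in "A" receives the
# spending of its nearest neighbour in "B" on the overlapping variable.
nn_idx = np.abs(age_a[:, None] - age_b[None, :]).argmin(axis=1)
spending_for_a = spending_b[nn_idx]

# Approach 2: prediction model -- fit a simple linear model of spending
# on age in "B", then predict spending for every record in "A".
slope, intercept = np.polyfit(age_b, spending_b, deg=1)
predicted_spending_a = slope * age_a + intercept

print(spending_for_a)
print(predicted_spending_a.round(0))
```

Nearest-neighbour matching copies observed values and therefore preserves their distribution, while the model produces smoothed predictions; which behaviour is preferable depends on the use case.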
Data enrichment can be either engineering or science, depending on how the datasets are connected. Where science is involved, a proven modelling method should always be used. Most modelling methods allow the error to be quantified, so the quality of the enrichment can be measured and monitored. Statistical matching is only possible if the datasets are correlated; if the correlation is too weak, this will show up in the quality indicators of the model. Since data enrichment is not magic, there are limits to its applicability.
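Two such quality indicators can be sketched with NumPy, again on hypothetical overlap data: the correlation between the overlapping variable and the target, and the prediction error on held-out records:

```python
import numpy as np

# Hypothetical overlap data: statistical matching is only valid if the
# overlapping variable (age) actually correlates with the enriched
# variable (spending).
age = np.array([23, 31, 38, 44, 52, 60], dtype=float)
spending = np.array([900, 1400, 1700, 2000, 2300, 2600], dtype=float)

# Quality indicator 1: correlation between overlap and target.
corr = np.corrcoef(age, spending)[0, 1]

# Quality indicator 2: prediction error on held-out records.
train, holdout = slice(0, 4), slice(4, 6)
slope, intercept = np.polyfit(age[train], spending[train], deg=1)
pred = slope * age[holdout] + intercept
mae = np.abs(pred - spending[holdout]).mean()

print(round(corr, 2), round(mae, 1))
```

If the correlation is weak or the hold-out error is large relative to the scale of the target, the enrichment should not be trusted, regardless of how plausible the matched values look.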