Supercrunch Blog

Markus Lilienthal · May 23, 2017 · Big Data

Data Enrichment: Pimp my data!

Some consider data enrichment cheating, some consider it magic. Actually, it is neither. Data enrichment is a term that comprises a range of methods spanning engineering and science, plus a pinch of experience. In this blog post, I would like to shed some light on the different methods of data enrichment, their concepts and their requirements. No worries, I will not dive into math.

What data enrichment is about

A couple of rather different methods are typically grouped under this term. But like almost any type of data analysis, data enrichment starts with data preparation:

Cleaning & pre-processing

Cleaning and pre-processing is a prerequisite for data enrichment. It involves, for example:

  • Error correction
  • Unification
  • Removal of data, for example erroneous duplicates
  • Occasionally outlier handling, smoothing etc.

To some extent, these tasks can be handled by more or less sophisticated automated methods (outlier detection algorithms, type correction, etc.), but a pinch of experience helps to identify the critical issues.
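The unification and duplicate-removal steps above can be sketched in a few lines of Python. This is a minimal illustration with invented records and field names, not a production pipeline: spelling and whitespace are unified first, so that duplicates of the same entity actually compare equal.

```python
# Minimal cleaning sketch (hypothetical records): unify case and
# whitespace, then drop erroneous exact duplicates.

records = [
    {"name": "Alice Smith ", "city": "Berlin"},
    {"name": "alice smith",  "city": "berlin"},   # duplicate after unification
    {"name": "Bob Jones",    "city": "Hamburg"},
]

def unify(record):
    """Normalise whitespace and case so equal entities compare equal."""
    return {k: " ".join(v.split()).lower() for k, v in record.items()}

cleaned = [unify(r) for r in records]

# Remove exact duplicates while preserving order.
seen, deduped = set(), []
for r in cleaned:
    key = tuple(sorted(r.items()))
    if key not in seen:
        seen.add(key)
        deduped.append(r)

print(len(deduped))  # 2 records remain
```

Real-world cleaning is rarely this tidy, of course; near-duplicates with typos already require the fuzzy-matching machinery discussed below.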

After the preparation, there are basically two options for performing the enrichment: record linkage and statistical matching procedures.

Record linkage

Record linkage is, in its basic form, not a statistical method but an engineering task. It is nonetheless an important technique for combining different datasets. Record linkage means finding records of an identical object in another dataset. In its simplest form, it is nothing but a database join operation – you join records from different data sources using a common key. But often record linkage is more complicated, and then it can easily become science. A nice example for illustrating the different levels of difficulty is address matching:

  • Record linkage might involve typo correction or encoding correction at different degrees of complexity.
  • Matching differently coded variables (e.g., addresses and geolocations) might require an additional lookup database – or a mapping algorithm assigning locations to regions.
  • Address linkage mutates into a real machine learning problem if the task is to link context like “Shakespeare’s place of birth” with the town “Stratford-upon-Avon, United Kingdom”.

Typical methods used for record linkage are:

  • Table joins
  • Regular expression matching
  • Nearest neighbour matching
  • Supervised learning techniques
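Nearest-neighbour matching on strings can be sketched with the standard library alone. The snippet below is a toy example (the place names are just for illustration): it links a misspelled address to the closest entry in a reference list using `difflib`'s string similarity, which is one simple choice of distance measure among many.

```python
import difflib

# Reference dataset of place names (illustrative only).
reference = ["Stratford-upon-Avon", "Stratford, London", "Strasbourg"]

def link(query, candidates, cutoff=0.6):
    """Return the best-matching reference record, or None if nothing
    is similar enough (similarity below the cutoff)."""
    matches = difflib.get_close_matches(query, candidates, n=1, cutoff=cutoff)
    return matches[0] if matches else None

# A misspelled query still links to the right record.
print(link("Stradford upon Avon", reference))  # Stratford-upon-Avon
```

The cutoff controls the precision/recall trade-off: a high cutoff avoids false links at the cost of leaving some records unmatched.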

Record linkage example: address matching with spelling correction

Statistical matching

The difference between statistical matching and record linkage is that statistical matching does not try to match identical objects – an identical person or an identical product occurring in two different tables “A” and “B”. Instead, we infer additional data from the known data using statistical methods. This can be done by finding, for an entity in dataset “A”, a matching partner with similar features in dataset “B”. Or we can predict the missing piece of information with a prediction model that leverages the overlap between the datasets. Both approaches are only valid if there is a statistical correlation between the overlapping variables and the variables to be enriched. The overlap can be either a set of common variables (columns) available in both datasets, or a common share of entities (rows) – persons or products that exist in both datasets and can be identified by record linkage, such that all variables on both sides are known for those entities.

Let’s illustrate each case with an example:

  1. Common variables (columns): An online retail shop wants to understand off-site shopping behaviour. We statistically match consumer panel data with the shop’s customer database by matching records with similar on-site buying behaviour. The buying behaviour would be represented by derived descriptive features such as purchase frequency, category purchase shares etc.

    Statistical matching using common variables (blue shape illustrates overlapping variables)

  2. Common entities (rows): Assume that 30% of the customers contained in a retailer’s database are participating in an extended loyalty programme. For these customers, there exists another dataset with richer information. Using the 30% overlap, the missing information can be predicted with some statistical error for the remaining 70% of the customers, using the overlap as a training set.

    Statistical matching using common observations (light shapes illustrate statistically filled information)
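The first case above can be sketched as a nearest-neighbour (“hot-deck”) match over the common variables. All numbers and field names below are invented for illustration: for each recipient record in dataset A, we find the donor in dataset B with the most similar buying-behaviour features and copy over the enriched variable.

```python
import math

# Dataset B (donors): common variables plus the enriched variable.
donors = [
    {"freq": 2.0, "basket": 15.0, "loyalty_score": 0.3},
    {"freq": 8.0, "basket": 60.0, "loyalty_score": 0.9},
]

# Dataset A (recipients): common variables only.
recipients = [
    {"freq": 7.5, "basket": 55.0},
]

def distance(a, b):
    """Euclidean distance over the overlapping variables."""
    return math.hypot(a["freq"] - b["freq"], a["basket"] - b["basket"])

for r in recipients:
    donor = min(donors, key=lambda d: distance(r, d))
    r["loyalty_score"] = donor["loyalty_score"]  # enrich dataset A

print(recipients[0]["loyalty_score"])  # 0.9 (copied from the closest donor)
```

In the second case, the same idea appears in supervised form: the 30% overlap serves as a training set for a prediction model, and the model's error on held-out overlap records quantifies the quality of the enrichment.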

Neither a cheat, nor magic: rather solid methodology

Data enrichment can be either engineering or science, depending on the connection between the datasets. If it involves science, it will always rely on a proven modelling method. Most modelling methods allow the error to be quantified, so the quality of the enrichment can be measured and monitored. Statistical matching is only possible if the datasets are correlated; if the correlation between the datasets is too weak, this will show up in the quality indicators of the modelling. Since data enrichment is not magic, its applicability has limits.