Supercrunch Blog

Markus Lilienthal December 1, 2017 Data Science

Prediction – The Master Discipline

Prediction does not only mean forecasting a future value. It can also be used to evaluate any hypothetical situation, such as a what-if scenario, or serve as a prerequisite for making an optimal decision. Prediction means that a trained model is fed with a previously unseen situation to produce the expected result. Predictive modelling is often seen as the master discipline of data analysis, but it is not the only type of analysis.

Good prediction = good model?

The philosophy of science defines a (scientific) model as a simplified concept of our reality. In this sense, a statistical model represents a typically unknown data-generating process in an idealized form. The model thus remains an abstraction of what is actually happening and carries some inherent simplification error. This is often the case for marketing or psychological models, which are the most frequent in market research.

The predictive power of a model is only one of the quality criteria for any model. More precisely, it is one aspect of validity (predictive validity). Validity is the extent to which the model describes the real world accurately. Reliability is a weaker criterion: it means producing similar results under similar conditions, but not necessarily measuring the correct thing. Besides validity and reliability, there are also more technical, implicit requirements on a model, such as identifiability for the sake of model selection, practical computational feasibility or interpretability of the trained parameters.
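The difference between reliability and validity can be made concrete with a small simulation. The sketch below is only an illustration under assumed numbers: the instrument, its bias and its noise levels are hypothetical and not taken from any real study. One simulated instrument is consistent but systematically off (reliable, not valid), the other is correct on average but very noisy (valid on average, not reliable).

```python
import random
import statistics

TRUE_VALUE = 100.0  # assumed ground truth; known here only because we simulate


def simulate_readings(bias, noise_sd, n, seed):
    """Simulate n repeated readings from a hypothetical instrument."""
    rng = random.Random(seed)
    return [TRUE_VALUE + bias + rng.gauss(0.0, noise_sd) for _ in range(n)]


def summarize(readings):
    """Return (spread, offset): low spread = reliable, offset near 0 = valid on average."""
    spread = statistics.stdev(readings)
    offset = statistics.mean(readings) - TRUE_VALUE
    return spread, offset


# Reliable but not valid: tiny noise, large systematic bias.
biased = simulate_readings(bias=10.0, noise_sd=0.5, n=1000, seed=1)
# Valid on average but unreliable: no bias, large noise.
noisy = simulate_readings(bias=0.0, noise_sd=10.0, n=1000, seed=2)

biased_spread, biased_offset = summarize(biased)
noisy_spread, noisy_offset = summarize(noisy)

print(f"biased instrument: spread={biased_spread:.2f}, offset={biased_offset:.2f}")
print(f"noisy instrument:  spread={noisy_spread:.2f}, offset={noisy_offset:.2f}")
```

The first instrument repeats itself almost perfectly and would pass any stability check, yet every reading is about ten units off. Stability alone therefore says nothing about whether the right thing is being measured.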

Prediction is not always the purpose of data modelling, and in such cases predictive power is not a primary measure of validity. Examples of non-predictive analysis are explorative or structure-revealing unsupervised learning techniques such as clustering, seasonality detection or outlier detection. In these examples, data about the “truth” is not needed and typically also not available. If validity becomes difficult to assess, reliability becomes more important. For example, the stability of an unsupervised clustering is an important quality criterion.

Determinants of a good prediction

The path to a good prediction depends on several data and model properties:

Data size and completeness: The more data, the better. But more data does not help if the most important variables are nonetheless missing or the model is inappropriate.

Accurate data: The accuracy of the training data, and particularly of the prediction input, always constrains the accuracy of the prediction for individual data points. If, for example, the exact measurement date of some variables in a daily dataset is unknown (e.g., a human estimate), it will be very difficult to estimate their effect on a daily basis.

Variation in the observations: Explanatory variables that do not vary much in the training data will show little observable effect on the dependent variable, so their effect is harder to separate from noise. In the extreme case, if a variable does not vary at all, there is no way to compute its effect.

Feature engineering: In traditional statistical modelling, feature engineering is an important success factor for handling nonlinear effects correctly. In modern machine learning, the importance of feature engineering is somewhat reduced by the flexibility of the model itself, but it remains a success factor.

Aggregation level: Some variables might be difficult to predict on an individual level, but on an aggregate level precise prediction is easier, because the individual randomness is cancelled out.

Appropriate model: The model should be able to reflect all the features of the data which we know are important in the particular case, for example the domain of the output (e.g. shares between 0 and 1) or heterogeneity among the individuals (items or persons). It depends on the problem whether cross-sectional (effects across items or persons) or longitudinal data (effects within items over time) are of primary interest.
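Two of the points above, variation in the observations and aggregation level, lend themselves to a quick simulation. The following is a minimal sketch with made-up parameters (true slope, noise level, sample sizes), not a recipe for real data. It estimates how unstable a regression slope becomes when the explanatory variable barely varies, and how much the error shrinks when predicting a group average instead of a single individual.

```python
import random
import statistics


def fit_slope(xs, ys):
    """Ordinary least-squares slope for a single explanatory variable."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    if sxx == 0.0:
        raise ValueError("x does not vary, so its effect cannot be estimated")
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return sxy / sxx


def slope_spread(x_sd, n_obs=50, n_repeats=300, true_slope=2.0, noise_sd=1.0, seed=0):
    """Standard deviation of the estimated slope across repeated simulated samples."""
    rng = random.Random(seed)
    slopes = []
    for _ in range(n_repeats):
        xs = [rng.gauss(0.0, x_sd) for _ in range(n_obs)]
        ys = [true_slope * x + rng.gauss(0.0, noise_sd) for x in xs]
        slopes.append(fit_slope(xs, ys))
    return statistics.stdev(slopes)


def mean_abs_error_of_average(group_size, n_groups=500, noise_sd=1.0, seed=0):
    """Mean absolute error when the group average (true value 0) is predicted as 0."""
    rng = random.Random(seed)
    errors = []
    for _ in range(n_groups):
        group = [rng.gauss(0.0, noise_sd) for _ in range(group_size)]
        errors.append(abs(statistics.fmean(group)))
    return statistics.fmean(errors)


narrow = slope_spread(x_sd=0.1)   # x hardly varies in the training data
wide = slope_spread(x_sd=2.0)     # x varies a lot
individual = mean_abs_error_of_average(group_size=1)
aggregated = mean_abs_error_of_average(group_size=100)

print(f"slope uncertainty, narrow x: {narrow:.3f}, wide x: {wide:.3f}")
print(f"error, individual: {individual:.3f}, aggregate of 100: {aggregated:.3f}")
```

With the same noise level, the slope estimate is far less stable when x has little spread, and a constant x makes `fit_slope` fail outright. The aggregate error shrinks because independent individual deviations partly cancel; this would not hold if the individual errors were strongly correlated.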

Natural boundaries of a good prediction

Even the best model and the best data imaginable come with some natural boundaries for prediction:

Causality: It is very hard to prove real causality. Several approaches exist, but each of them estimates one specific definition of causality. Granger causality, for example, relies on temporal order (before and after). You should be aware of which specific meaning of causality a method uses before interpreting its results.

Extrapolation: A model can typically only predict well within the range of its training data. If the trained model is fed with new data far outside the value ranges of the training data, we run into a scenario the model was not fitted for. How well the model will perform depends on the robustness and type of the model. For example, a linear growth assumption is in most cases only valid within a limited range; beyond it, growth is often better described by an S-shaped curve, provided we can properly derive the saturation level from the data. Nonlinear machine learning models can add more flexibility in the shape, but even a machine learning model does not automatically capture saturation effects correctly if they appear only outside the observed training data.

Chaos and the governance of noise: Chaos theory shows that some processes cannot be forecast arbitrarily far into the future, because beyond some point even very small differences in the starting configuration lead to completely different results. Such horizons, beyond which forecasts no longer make sense, exist in practice, for example for meteorological or economic forecasts.

External shocks: In most cases the trained model does not account for exogenous shocks that change the situation completely. Such shocks could be, for example, natural disasters, an economic crash or the sudden entry of a game-changing competitor into the market that destroys all past dynamics.
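The extrapolation and chaos boundaries described above can both be seen in a few lines of code. This is a sketch under stated assumptions: the S-shaped "true" process and all its parameters are invented for illustration, and the logistic map is used here simply as a standard textbook example of a chaotic system. The first part fits a straight line to the early growth phase of a saturating process and then extrapolates far beyond it. The second part starts two trajectories a tiny distance apart and tracks how quickly they separate.

```python
import math


def logistic_growth(t, saturation=100.0, rate=0.5, midpoint=10.0):
    """Assumed 'true' S-shaped process, used only to generate illustrative data."""
    return saturation / (1.0 + math.exp(-rate * (t - midpoint)))


def fit_line(ts, ys):
    """Least-squares straight line; returns (intercept, slope)."""
    n = len(ts)
    mt, my = sum(ts) / n, sum(ys) / n
    sxy = sum((t - mt) * (y - my) for t, y in zip(ts, ys))
    sxx = sum((t - mt) ** 2 for t in ts)
    slope = sxy / sxx
    return my - slope * mt, slope


# Part 1: extrapolation. Train only on the early, roughly linear growth phase.
train_t = [5.0, 6.0, 7.0, 8.0, 9.0, 10.0]
train_y = [logistic_growth(t) for t in train_t]
intercept, slope = fit_line(train_t, train_y)


def linear_prediction(t):
    return intercept + slope * t


in_range_error = max(abs(linear_prediction(t) - y) for t, y in zip(train_t, train_y))
far_t = 30.0  # far outside the training range, where the true process has saturated
extrapolation_error = abs(linear_prediction(far_t) - logistic_growth(far_t))


# Part 2: chaos. Two logistic-map trajectories that start almost identically.
def logistic_map_gaps(x0, delta, steps):
    """Absolute gap after each step between trajectories started delta apart."""
    a, b = x0, x0 + delta
    gaps = []
    for _ in range(steps):
        a, b = 4.0 * a * (1.0 - a), 4.0 * b * (1.0 - b)
        gaps.append(abs(a - b))
    return gaps


gaps = logistic_map_gaps(x0=0.2, delta=1e-10, steps=60)

print(f"linear fit, worst error inside training range: {in_range_error:.2f}")
print(f"linear fit, error at t={far_t:.0f}: {extrapolation_error:.2f}")
print(f"trajectory gap after 5 steps: {gaps[4]:.2e}, largest gap in steps 40-60: {max(gaps[39:]):.2f}")
```

The straight line looks acceptable on the data it was trained on, yet it predicts values well above the saturation level once it is asked about a range it never saw. In the second part, a starting difference far below any realistic measurement precision grows until the two trajectories are unrelated, which is why a forecast horizon exists for such systems regardless of model quality.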


So you see – while predictive analytics might be the master discipline of modelling, there is a lot to consider. What are your experiences with forecasting and predictive modelling?