What’s the fuss about predictive modelling?
By Ryan Wedlake, data analyst at PBT Group
Too often the terms ‘predictive modelling’ and ‘machine learning’ are used interchangeably. However, there are important differences between the two that should be kept in mind when using these techniques for model building.
Predictive modelling involves the use of more traditional models that have a classical underlying mathematical foundation, the simplest of these being linear and logistic regression. These models use past data to predict future outcomes. They are often the first to be used to ‘solve a problem’, but they require a large amount of historical data to perform well. They also tend not to ‘learn’ from the data they are fitted on, and they generalise poorly to new data unless they are regularly refitted by hand. However, their accuracy is often quite good for the data they are fitted on. It is also important to keep in mind that these models assume that the data they are fitted on follows a particular statistical distribution, so these assumptions need to be validated beforehand, especially if hypothesis testing is to be performed.
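To make the idea concrete, here is a minimal sketch of the simplest case: a straight line fitted to past data by ordinary least squares and then used to predict the next value. The figures are made up, and the names `fit_line` and `predict` are hypothetical helpers written for this illustration rather than part of any particular library.

```python
from statistics import mean


def fit_line(xs, ys):
    """Ordinary least squares with one predictor: returns (slope, intercept)."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return slope, intercept


def predict(model, x):
    """Apply a fitted (slope, intercept) pair to a new x value."""
    slope, intercept = model
    return slope * x + intercept


# Hypothetical history: month number vs. units sold (illustrative values only)
months = [1, 2, 3, 4, 5]
units = [12, 14, 16, 18, 20]

model = fit_line(months, units)
forecast = predict(model, 6)  # prediction for the next, unseen month

# Residuals on the fitted data; inspecting these is one way to start
# checking the distributional assumptions before any hypothesis testing
residuals = [y - predict(model, x) for x, y in zip(months, units)]
```

The fitted line does not change when new months arrive. It has to be refitted on the extended history, which is the manual updating described above.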
Machine learning (ML), on the other hand, has its origins in computer science. ML models tend to be more complex than the traditional mathematical models and require more computing power for training. They can be trained on less historical data and tend to adapt themselves and learn from experience. This makes ML models better candidates for putting into production, because they do not need to be refined as frequently as the traditional predictive models. Another distinguishing feature is that ML models can learn in either a supervised or an unsupervised way. Supervised ML models include the traditional regression models that are used to predict an outcome, whereas unsupervised models do not need a dependent variable in order to learn from the data. An example of the latter is a model used for cluster analysis.
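As an illustration of learning without a dependent variable, the sketch below groups values with a very small one-dimensional k-means routine. This is only one of many clustering techniques, the spend figures are invented, and `kmeans_1d` is a hypothetical helper written for this example. Note that no outcome column is supplied: the groups emerge from the values themselves.

```python
def kmeans_1d(values, centroids, iterations=10):
    """Tiny 1-D k-means: repeatedly assign each value to its nearest centroid."""
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for v in values:
            nearest = min(range(len(centroids)), key=lambda i: abs(v - centroids[i]))
            clusters[nearest].append(v)
        # Move each centroid to the mean of its cluster (unchanged if the cluster is empty)
        centroids = [
            sum(c) / len(c) if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
    return centroids, clusters


# Hypothetical monthly spend per customer (illustrative values only)
spend = [10, 12, 11, 50, 52, 49]

# Assumption: start the two centroids at the smallest and largest value
centroids, clusters = kmeans_1d(spend, [min(spend), max(spend)])
```

With these figures the routine separates the low spenders from the high spenders, even though nobody told it which customer belongs to which group.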
A significant advantage that traditional models have over ML models is that they can be more easily explained to an audience that does not have the background of a statistician or data scientist. As a result, traditional models are typically used first for solving a business problem. The complexity of ML models makes them difficult for the business to understand, and their outcomes can be difficult to explain. Many businesses are therefore hesitant to use these models for predictions about significant business processes. However, this mindset is changing because of the superior performance of ML models, their resilience to new data and the relative ease of putting them into production.
Whether traditional models or ML models are used, there is one inescapable fact: the data the models are trained and validated with needs to be of good quality. Data quality is less of an issue with ML models, but it is always recommended to use the best quality data available. Arriving at quality input data for models is not a trivial exercise, which is why PBT Group places a lot of importance on the role of data specialists. Data preparation often takes up more time than any other step in the modelling process.
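A first pass at data quality can be as simple as counting incomplete and duplicated records before any model is fitted. The sketch below is a hypothetical example, not a description of any particular toolset: the field names, the sample rows and the `quality_report` helper are all assumptions made for illustration.

```python
def quality_report(rows, required):
    """Count rows with missing required fields and exact duplicate rows."""
    missing = sum(
        1 for r in rows
        if any(r.get(field) in (None, "") for field in required)
    )
    seen, duplicates = set(), 0
    for r in rows:
        key = tuple(sorted(r.items()))  # order-independent fingerprint of the row
        if key in seen:
            duplicates += 1
        seen.add(key)
    return {"rows": len(rows), "missing": missing, "duplicates": duplicates}


# Hypothetical customer records (illustrative values only)
rows = [
    {"id": 1, "age": 34, "spend": 120.0},
    {"id": 2, "age": None, "spend": 80.0},   # missing age
    {"id": 1, "age": 34, "spend": 120.0},    # exact duplicate of the first row
]

report = quality_report(rows, required=["age", "spend"])
```

Real data preparation goes much further than this, covering inconsistent formats, outliers and records that contradict each other, which is part of why it consumes so much of the modelling effort.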
The task of fitting models to answer business questions should be left to data scientists or people trained in advanced analytics, who tend to understand more about the complexities of the models and when it is appropriate to use the different kinds of models with different types of data in business situations. The interpretation of model results can also be a minefield, and it too requires specialised skills.
Excitingly, PBT Group has ventured into the domain of data science, with the aim of having a well-rounded consultancy base within the company. This will enable its clients to utilise the company’s expanding expertise, whether for the first step of getting the data quality up to standard for modelling, or for the subsequent steps of fitting appropriate models and correctly interpreting and utilising the results.