How do you choose which machine learning algorithm to use? It is a fascinating question, as an enormous number of algorithms have been developed in recent years. To address this question in some detail in this (short) blog post, we will limit ourselves to classification algorithms, which are the algorithms most commonly used in our branch.
As a first step, it is worthwhile to read some of the empirical research on this topic. A notable paper was written by Fernández-Delgado et al.* back in 2014. In it, they conducted an empirical study of the performance of a whopping 179 classification algorithms on 121 datasets. One can imagine that the number of available algorithms has grown even further since then. You should be careful in interpreting the results of these kinds of papers (as the authors themselves note), but they do give a general overview of which algorithms generally perform well.
So, are we done now? Should we simply read a couple of these papers and choose the algorithms that perform best? No. Unfortunately, there is no single model that works best on all datasets. This insight is also known as the no free lunch theorem**, and the intuition behind it runs as follows:
- Every model is a simplification of reality
- Every simplification is based on assumptions
- All assumptions fail in certain situations
- Therefore, no one model works best for all possible situations
To visualize the no free lunch theorem, consider the graphs below***. On the far left you see three different input datasets, each containing red dots and blue dots. For each of the three datasets, ten algorithms are trained, each of which splits the area into a red region and a blue region. All observations within the red region are classified by the model as red, and all observations within the blue region are classified as blue. One can clearly see that the models split the area very differently: each model takes its own approach, which leads to different results. Moreover, some models do a better job on one type of input data and a worse job on another. For instance, consider QDA on the far right and compare its second and third rows. In the second row it performs poorly, whereas in the third row it performs very well.
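The effect described above can be reproduced in a few lines. The sketch below is not the code behind the graphs. It assumes scikit-learn is available and uses three synthetic two-dimensional datasets (two moons, concentric circles and a roughly linearly separable set) that were picked here purely for illustration. The function names, the dataset parameters and the choice of the two models next to QDA are all assumptions.

```python
from sklearn.base import clone
from sklearn.datasets import make_circles, make_classification, make_moons
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier


def make_datasets(n_samples=500, seed=0):
    """Three illustrative 2-D datasets (assumed shapes, not the original data)."""
    return {
        "moons": make_moons(n_samples=n_samples, noise=0.3, random_state=seed),
        "circles": make_circles(
            n_samples=n_samples, noise=0.2, factor=0.5, random_state=seed
        ),
        "linear": make_classification(
            n_samples=n_samples,
            n_features=2,
            n_informative=2,
            n_redundant=0,
            n_clusters_per_class=1,
            random_state=seed,
        ),
    }


def score_models_on_datasets(seed=0):
    """Return {(dataset name, model name): accuracy on a held-out test set}."""
    models = {
        "QDA": QuadraticDiscriminantAnalysis(),
        "logistic regression": LogisticRegression(max_iter=1000),
        "k-nearest neighbours": KNeighborsClassifier(n_neighbors=5),
    }
    scores = {}
    for data_name, (X, y) in make_datasets(seed=seed).items():
        X_train, X_test, y_train, y_test = train_test_split(
            X, y, test_size=0.4, random_state=seed
        )
        for model_name, model in models.items():
            # clone() gives a fresh, unfitted copy for every dataset.
            fitted = clone(model).fit(X_train, y_train)
            scores[(data_name, model_name)] = fitted.score(X_test, y_test)
    return scores


if __name__ == "__main__":
    for (data_name, model_name), accuracy in score_models_on_datasets().items():
        print(f"{data_name:>8} | {model_name:<22} | {accuracy:.2f}")
```

Reading the printed table row by row gives the same kind of picture as the graphs: the ranking of the models can differ from one dataset to the next.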
So how do we deal with this complexity? The answer is to learn which algorithms perform well in which situations. For example, support vector machines generally perform well on datasets that have many variables compared to the number of rows. However, the results of this method are difficult to interpret. Boosting algorithms are generally excellent at achieving high accuracy, but they overfit easily. Neural networks can sometimes perform excellently in situations in which other models fail, but you need to invest a large amount of time in tuning the model, and the model is a black box. If you want the model to be interpretable, you might just want to stick to logistic regression or a decision tree. And if you have no clue what to do, Random Forest is usually a good option, because it performs well on many types of datasets.
As there are so many ways in which algorithms can be compared, it is easy to lose yourself in comparisons. In practice, it is wise to simply try out a couple of algorithms on a dataset to see which performs well. Often, this gives you valuable insights into what works and what doesn’t. As the datasets that are available in your branch will probably have a similar structure, algorithms that work well on one dataset will often perform well on similar ones too. In science, you also see that research areas have their standard machine learning algorithms. For example, in genetics, Support Vector Machines are often used, as these datasets have many variables (gene locations, for example) relative to the number of data points.
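Trying out a couple of algorithms can be as simple as the following sketch. It assumes scikit-learn and scores five common classifiers with cross-validated accuracy. The helper names `candidate_models` and `compare_classifiers`, the selection of models and the synthetic data in the usage part are illustrative assumptions; in practice you would pass in your own `X` and `y`.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier


def candidate_models(seed=0):
    """A small, assumed shortlist of classifiers with default settings."""
    return {
        # Scaling is part of the pipeline, so it is refitted inside every fold.
        "logistic regression": make_pipeline(
            StandardScaler(), LogisticRegression(max_iter=1000)
        ),
        "decision tree": DecisionTreeClassifier(random_state=seed),
        "random forest": RandomForestClassifier(n_estimators=100, random_state=seed),
        "gradient boosting": GradientBoostingClassifier(random_state=seed),
        "support vector machine": make_pipeline(StandardScaler(), SVC()),
    }


def compare_classifiers(X, y, cv=5, seed=0):
    """Return {model name: mean cross-validated accuracy}, best model first."""
    results = {}
    for name, model in candidate_models(seed).items():
        fold_scores = cross_val_score(model, X, y, cv=cv, scoring="accuracy")
        results[name] = fold_scores.mean()
    return dict(sorted(results.items(), key=lambda item: item[1], reverse=True))


if __name__ == "__main__":
    # Synthetic stand-in data; replace with your own dataset.
    X, y = make_classification(
        n_samples=500, n_features=20, n_informative=5, random_state=0
    )
    for name, accuracy in compare_classifiers(X, y).items():
        print(f"{name:>24}: {accuracy:.3f}")
```

Default hyperparameters are used on purpose: the goal is a quick first impression of what works on the data, not a tuned final model.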
So far, we have discussed only a few criteria on which to judge a classification algorithm, but we have identified at least 20 others. For example: how well does the algorithm perform on unbalanced datasets? How does multicollinearity influence the performance of the model? And so on. From these, we have distilled the seven most important criteria for modelling in a business context, on which all classification models can be scored. Depending on the situation at hand, one can then determine which model is most likely to perform best. The seven criteria are as follows:
- Prediction accuracy: The percentage of correct classifications on a test set. Or stated differently: how close are your model predictions to reality? Note: one can use many different measures aside from accuracy.
- Interpretability of output: Is it possible to evaluate which variables were the most important variables for the classification? In other words: is it not a black box?
- Easy to explain: Could you explain how this model works to your grandparent? When a model is easy to explain, it is also easy to understand. This increases the likelihood that it will be adopted by the users.
- Training speed: What happens to the training time (the time it takes to train a model on a dataset) when we put 1,000,000 observations into the model? Some models take 5 seconds to train, others 5 minutes, and very complex models, such as neural networks, can need 5 hours. The more observations, the longer training takes.
- Effect of many features: What happens when we put 1000 variables into the model? Most models only work with a small set of explanatory variables: between 5 and 25 variables is very common. However, working with 1000 variables requires a different approach for a model to capture the dynamics correctly in the data.
- Learns feature interactions: Can it learn that you like chocolate and salami, but do NOT like the combination? Not all models pick up this kind of interaction by themselves. For those models, a data scientist has to add the interaction to the model explicitly.
- Amount of parameter tuning: How much parameter tuning do you typically need to do to obtain good results?
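The chocolate-and-salami criterion can be made concrete with a toy dataset. The sketch below assumes scikit-learn and NumPy, and it encodes the taste as "likes exactly one of the two", which adds the assumption that a plate with neither is not liked. On such data no linear model on the raw features can get more than three of the four combinations right, while adding the product of the two features makes the classes linearly separable. A decision tree is included as an example of a model that can represent the pattern from the raw features. All names in the code are hypothetical.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier


def make_taste_data(repeats=100):
    """Columns: chocolate, salami (0/1). Label 1 = liked (exactly one present)."""
    combos = np.array([[0, 0], [1, 0], [0, 1], [1, 1]])
    X = np.tile(combos, (repeats, 1))
    y = np.logical_xor(X[:, 0], X[:, 1]).astype(int)
    return X, y


def interaction_demo():
    """Accuracy on the toy data itself; this tests expressiveness, not generalization."""
    X, y = make_taste_data()
    # Hand-made interaction feature: 1 only when both ingredients are present.
    X_with_interaction = np.column_stack([X, X[:, 0] * X[:, 1]])

    linear_raw = LogisticRegression().fit(X, y)
    linear_interaction = LogisticRegression().fit(X_with_interaction, y)
    tree_raw = DecisionTreeClassifier(random_state=0).fit(X, y)

    return {
        "logistic regression, raw features": linear_raw.score(X, y),
        "logistic regression, with interaction": linear_interaction.score(
            X_with_interaction, y
        ),
        "decision tree, raw features": tree_raw.score(X, y),
    }


if __name__ == "__main__":
    for name, accuracy in interaction_demo().items():
        print(f"{name:<40} {accuracy:.2f}")
```

The model is scored on the same four combinations it was trained on, because the question here is whether the model can express the interaction at all, not how well it generalizes.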
One can do a similar assessment for other models such as regression models and unsupervised models.
When we consider the seven most often used classification models, how would they score on each of the seven criteria? See the diagram below. Here, the earlier example on modelling gene locations shows that Support Vector Machines indeed work well with many features, but neural networks are a good alternative as well. Note: these algorithms are improved regularly, so the scores might change over time.
To conclude: it is difficult to determine beforehand which machine learning technique is going to be successful on your data set, but by following these steps, you will at least get closer to the answer:
- Get an overview of which algorithms generally work well
- Determine the most important criteria to judge a model on for your type of data sets
- Pick the algorithms that fit your needs best
- Always try a few algorithms
Want to know more and practice?
Join us at one of the many Machine Learning training modules.
*Manuel Fernández-Delgado, Eva Cernadas, Senén Barro and Dinani Amorim. Do we Need Hundreds of Classifiers to Solve Real World Classification Problems? Journal of Machine Learning Research 15 (2014) 3133-3181
**D.H. Wolpert. The supervised learning no-free-lunch theorems. In Soft Computing and Industry, pages 25-42. Springer, 2002