Class Imbalance in Machine Learning
By Chris Hilsinger-Pate
January 4, 2023
The buzz around machine learning is palpable. It's nearly impossible to browse LinkedIn without coming across articles detailing how companies and organizations across industries are harnessing machine learning. The rise of machine learning has also coincided with the field becoming more accessible: packages like tidymodels and pandas make building machine learning models in R and Python rather easy. But while building models has never been easier, there are still barriers to building effective models. One of those barriers is class imbalance.
A dataset is said to suffer from class imbalance when its classes are unevenly distributed. Class imbalance is an obstacle because it leads to poor model performance: models trained on imbalanced datasets are typically highly effective at predicting the majority class but perform poorly when predicting the minority class.
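To see the problem concretely, here is a minimal sketch in Python; the 95/5 class split, scikit-learn's make_classification, and the always-predict-the-majority DummyClassifier are illustrative assumptions, not from any real project:

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, recall_score

# Synthetic dataset: roughly 95% class 0 (majority), 5% class 1 (minority)
X, y = make_classification(
    n_samples=10_000, weights=[0.95, 0.05], random_state=42
)

# A degenerate "model" that always predicts the majority class
model = DummyClassifier(strategy="most_frequent").fit(X, y)
y_pred = model.predict(X)

print(accuracy_score(y, y_pred))  # ~0.95 -- looks impressive
print(recall_score(y, y_pred))    # 0.0 -- never finds the minority class
```

The accuracy looks excellent, yet the model never identifies a single minority-class observation.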
Learning how to deal with class imbalance is imperative, as many of the most valuable potential applications of machine learning depend on accurately predicting the minority class of a dataset. One such application is in the world of medicine. The vast majority of the population does not have cancer, but the success of medical teams depends on being able to identify and offer remedies to the minority of the population that does have cancer. In this case, accurately predicting the minority class is worth more than accurately predicting the majority class.
Class imbalance can derail the effectiveness of machine learning models. Fortunately, there are options to address class imbalance. Two of the most common (and simplest) ways to counteract class imbalance are oversampling and undersampling. Oversampling involves increasing the number of observations from the minority class, typically by duplicating existing observations. As one might expect, undersampling does the opposite, removing observations from the majority class. Here is a blog on how to execute oversampling and undersampling in R with tidymodels.
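For readers working in Python rather than R, the sketch below shows one naive way to implement random oversampling and undersampling with pandas; the DataFrame and its label column are hypothetical stand-ins for a real dataset:

```python
import pandas as pd

def oversample(df: pd.DataFrame, label_col: str, seed: int = 42) -> pd.DataFrame:
    """Duplicate rows (sampling with replacement) until every class
    matches the majority-class count."""
    max_count = df[label_col].value_counts().max()
    groups = [
        group.sample(n=max_count, replace=True, random_state=seed)
        for _, group in df.groupby(label_col)
    ]
    return pd.concat(groups).reset_index(drop=True)

def undersample(df: pd.DataFrame, label_col: str, seed: int = 42) -> pd.DataFrame:
    """Drop rows (sampling without replacement) until every class
    matches the minority-class count."""
    min_count = df[label_col].value_counts().min()
    groups = [
        group.sample(n=min_count, replace=False, random_state=seed)
        for _, group in df.groupby(label_col)
    ]
    return pd.concat(groups).reset_index(drop=True)

# Hypothetical imbalanced dataset: 950 negatives, 50 positives
df = pd.DataFrame({"label": [0] * 950 + [1] * 50})
print(oversample(df, "label")["label"].value_counts())   # 950 of each class
print(undersample(df, "label")["label"].value_counts())  # 50 of each class
```

One caveat worth noting: resample only the training split, never the test set, so that evaluation still reflects the real-world class distribution.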
After constructing and testing a machine learning model, one must evaluate its performance. The instinct is often to look at the model's accuracy to determine its effectiveness, but when looking at a model's metrics -- particularly for a model trained on an imbalanced dataset -- it's best to consider metrics such as sensitivity (also known as recall) and specificity. A model trained on an imbalanced dataset may boast an accuracy upwards of 90%, yet it would hardly be considered useful if its sensitivity (true positive rate; TP/(TP + FN)) is 10%. Compensating for class imbalance will often lower a model's accuracy while producing more balanced sensitivity and specificity (true negative rate; TN/(TN + FP)) values. Here is an example of how a model's metrics changed as adjustments to an imbalanced dataset were made.
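To make those formulas concrete, here is a minimal sketch using scikit-learn; the label and prediction counts are made-up numbers chosen to mirror the 90%-accuracy, 10%-sensitivity scenario described above:

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Hypothetical test set: 90 negatives (all predicted correctly) and
# 10 positives (only 1 predicted correctly)
y_true = [0] * 90 + [1] * 10
y_pred = [0] * 90 + [0] * 9 + [1] * 1

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

accuracy = accuracy_score(y_true, y_pred)  # (TP + TN) / total
sensitivity = tp / (tp + fn)               # true positive rate
specificity = tn / (tn + fp)               # true negative rate

print(f"accuracy:    {accuracy:.2f}")     # 0.91 -- looks strong
print(f"sensitivity: {sensitivity:.2f}")  # 0.10 -- misses 9 of 10 positives
print(f"specificity: {specificity:.2f}")  # 1.00
```

A 91%-accurate model that catches only one positive case in ten would be useless in a setting like cancer screening.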
As we continue to explore the technology and embark on our own machine-learning journeys, we should keep in mind that the most valuable solutions to many of the world's most pressing problems lie in being able to accurately identify the few among the masses.