
Imputation Using Multi-class Classification


Multi-class classification is a machine learning classification task involving more than two groups (also called classes or categories). Before we go deeper into multi-class classification, let us look at the bigger picture of Machine Learning. Machine Learning has become a buzzword recently, with almost every institution mentioning Machine Learning applications. Machine Learning, as the name suggests, is the process of teaching a computer system (the machine) specific algorithms that can improve themselves with experience (lots of data). This process of 'teaching the machine' is grouped into two categories: Supervised Machine Learning and Unsupervised Machine Learning. The difference between the two is simply whether or not the training data used to build the model contains the required answers (labels). If the training data includes a label (for example, if we want to identify a person's gender from a set of data and the training data contains a gender variable), this is Supervised Machine Learning. Unsupervised Machine Learning is when the training dataset does not contain the required answers (labels); the algorithm instead learns through techniques such as clustering and dimensionality reduction, and a significant application is anomaly detection.

Supervised models are trained on a labelled dataset. The label can be either continuous or categorical: a Regression model is used for data with a continuous label, whereas Classification models are used for data with a categorical label. There are three types of classification models: Binary Classification deals with a label that has only two categories; Multi-class Classification handles a label with more than two categories; and Multi-label Classification is used when a single observation carries multiple labels.
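To make the distinction concrete, here is a minimal multi-class classification sketch using scikit-learn's iris dataset, which has three classes; the dataset and classifier are illustrative choices only, not part of the census workflow discussed later.

```python
# Minimal multi-class classification: three flower species (classes) in iris.
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)  # 150 samples, categorical label with 3 classes
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

# Supervised learning: the training data carries the answers (labels).
clf = LogisticRegression(max_iter=1000)
clf.fit(X_train, y_train)
print(clf.score(X_test, y_test))  # mean accuracy on held-out data
```

Because the label has three categories rather than two, this is a multi-class (not binary) classification task.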


Many imputation techniques exist, and the best choice depends on the dataset. This article discusses the application of multi-class classification as an imputation method for the case of more than two classes. In searching for the best method for imputing missing values in the Population Census dataset, specifically for age group, our team applied several methods, including Multivariate Imputation by Chained Equations (MICE). Using the mice package in R and the PyCaret.classification library in Python, we found that the Light Gradient Boosting Machine (LightGBM) generated the best estimates for this case. Some benefits of using PyCaret.classification are listed below:

  • it is an open-source, low-code machine learning library in Python that lets you go from preparing your data to deploying your model within minutes in your choice of notebook environment;

  • its low-code design makes you more productive (less time spent coding);

  • it is a simple, easy-to-use library that helps you perform end-to-end ML experiments with fewer lines of code; and

  • it trains multiple models in one run and outputs a table comparing the performance of each model across several performance metrics.

By default, there are 15 models applied in PyCaret.classification:

1) Naive Bayes

2) K Neighbors Classifier

3) Extreme Gradient Boosting

4) Light Gradient Boosting Machine

5) Random Forest Classifier

6) Quadratic Discriminant Analysis

7) Gradient Boosting Classifier

8) Extra Trees Classifier

9) CatBoost Classifier

10) Logistic Regression

11) Ada Boost Classifier

12) Decision Tree Classifier

13) Linear Discriminant Analysis

14) Ridge Classifier

15) SVM - Linear Kernel


LightGBM is an efficient, fast, distributed gradient boosting tree algorithm, widely used for classification, regression, and ranking. Ke et al. (2017) found that LightGBM speeds up the training process of the conventional Gradient Boosting Decision Tree (GBDT) by up to over 20 times while achieving almost the same accuracy. Having overcome the shortcomings of traditional models, LightGBM supports efficient parallel training and offers fast training speed, low memory consumption, and the ability to process large amounts of data.


Below is a summary of applying multi-class classification for missing-value imputation using PyCaret.classification. As the results suggest, the best model for predicting Age group under this modelling approach is the Light Gradient Boosting Machine (LightGBM), with a training time of 0.1530 seconds.


Below is a simple example of executing LightGBM that you can try yourself.


Alternatively, you can explore multi-class classification more extensively through PyCaret.classification.


Reference:

Ke, G., Meng, Q., Finley, T., Wang, T., Chen, W., Ma, W., ... & Liu, T. Y. (2017). LightGBM: A highly efficient gradient boosting decision tree. Advances in Neural Information Processing Systems, 30.

Schlomer, G. L., Bauman, S., & Card, N. A. (2010). Best practices for missing data management in counseling psychology. Journal of Counseling Psychology, 57(1), 1.

Saar-Tsechansky, M., & Provost, F. (2007). Handling missing values when applying classification models. Journal of Machine Learning Research, 8.





