- 1 Handling Missing Data – Abstract
- 1.1 Introduction – Handling Missing Data
- 1.2 Handling Missing Data
- 1.3 Ignoring
- 1.4 Removing (Deletion)
- 1.5 Imputation (Fill-in)
- 1.6 Experiment
- 1.7 Results and Discussion
- 1.8 Effect of the Amount of Missing Data
- 1.9 Effect of the Missing Generator and Handling Method
- 1.10 Recap
- 1.11 Reference
Handling Missing Data – Abstract
This article discusses the main types of missing data and how to handle them. Through some experiments, we demonstrate how prediction results are affected by the amount and type of missing data, as well as by the method used to handle it.
Introduction – Handling Missing Data
For any real data set, missing data is almost unavoidable. There are many possible reasons for this, including changes in the design of data collection, the precision of the data that users entered, the unwillingness of surveyed participants to answer some questions, and so on. Detecting and handling these missing values is part of the data wrangling process.
There are three major types of missing data:
- Missing Completely at Random (MCAR): this is the truly random case. Whether a value is missing is purely random, and there is no relationship between the missingness and any value in the dataset, whether missing or observed.
- Missing at Random (MAR): the propensity for a data point to be missing is not related to the missing value itself, but to some of the observed data. For example, in a market research survey, interviewers in some cities forgot, for whatever reason, to ask interviewees about their income, which led to a higher ratio of missing income values in these cities than in others. We can consider this Missing at Random.
- Missing Not at Random (MNAR): this is the highly biased case. The missingness is related to the value of the missing observation itself. In some cases, the dataset should be re-collected to avoid this type of missingness. For example, interviewees with high incomes refusing to state their income would cause this type of missingness.
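The three mechanisms can be simulated on toy data to see how they differ. The sketch below is only an illustration: the column names (`city`, `income`) and all probabilities are assumptions chosen to mirror the survey examples above, not values from a real study.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 10_000

# Hypothetical survey data: city is always observed, income is the column to mask.
df = pd.DataFrame({
    "city": rng.choice(["A", "B"], size=n),
    "income": rng.lognormal(mean=10.0, sigma=0.5, size=n),
})

# MCAR: every income value has the same 20% chance of being missing.
mcar_mask = rng.random(n) < 0.20

# MAR: missingness depends only on an observed column (city), not on income itself.
p_mar = np.where(df["city"] == "A", 0.35, 0.05)
mar_mask = rng.random(n) < p_mar

# MNAR: missingness depends on the unobserved value itself
# (the top 20% of earners decline to answer far more often).
p_mnar = np.where(df["income"] > df["income"].quantile(0.8), 0.60, 0.10)
mnar_mask = rng.random(n) < p_mnar

def observed_mean(mask):
    """Mean income computed only from the values that remain observed."""
    return df.loc[~mask, "income"].mean()

summary = {
    "true": df["income"].mean(),
    "MCAR": observed_mean(mcar_mask),
    "MAR": observed_mean(mar_mask),
    "MNAR": observed_mean(mnar_mask),
}
print(summary)
```

In this toy setup the MNAR mask pulls the observed mean income below the true mean, because high values are the ones most likely to disappear, while the MCAR mask leaves it roughly unchanged.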
Handling Missing Data
Ignoring
Yes, you can simply ignore the missing data if you know it is MCAR. Although you do nothing yourself, the library (XGBoost, for example) does the work for you by choosing an appropriate method. So technically, this method can be counted as a case of one of the other methods, depending on the circumstances.
Removing (Deletion)
- Column deletion: another simple way to handle missing data is to remove the affected attribute (column deletion). It can be applied when the missing ratio is high (at least 60% as a rule of thumb, but this is not a fixed rule) and the variable is insignificant.
- Row deletion: if the missing values are MCAR and the missing ratio is not very high, we can drop the entire record (row). This method is known as listwise deletion. If the missingness is not MCAR, however, this method can introduce bias into the dataset.
- Pairwise deletion: instead of completely removing records with unknown values, we maximize data usage by omitting a record only from the analyses that need its missing value. Pairwise deletion can be considered a way to reduce the data loss caused by listwise deletion.
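The three deletion strategies above can be sketched with pandas as follows. The tiny data frame is made up for illustration, and the 60% threshold is just the rule of thumb mentioned in the list.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "age":   [22.0, np.nan, 35.0, 41.0, np.nan, 58.0],
    "fare":  [7.3, 71.3, np.nan, 53.1, 8.1, 26.6],
    "cabin": [np.nan, "C85", np.nan, np.nan, np.nan, np.nan],
    "survived": [0, 1, 1, 1, 0, 0],
})

# Column deletion: drop columns whose missing ratio is at least 60%.
missing_ratio = df.isna().mean()
col_deleted = df.loc[:, missing_ratio < 0.60]

# Row (listwise) deletion: drop every row that has at least one missing value.
listwise = col_deleted.dropna()

# Pairwise deletion: each statistic uses all rows where the columns it needs
# are present. DataFrame.corr skips missing values pair by pair.
pairwise_corr = col_deleted.corr()

print(col_deleted.columns.tolist())
print(len(listwise), "rows survive listwise deletion")
print(pairwise_corr.round(2))
```

Here listwise deletion keeps only 3 of the 6 rows, while the pairwise correlation between `fare` and `survived` is still computed from the 5 rows where both are present.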
Imputation (Fill-in)
- Imputation with median/mean/mode values: these values are commonly used to fill the missing positions. Most of the time, the mean is used; filling with the mean keeps the mean of the attribute unchanged after processing. For a categorical variable, the most frequent value (mode) can be used. This kind of imputation can decrease the variance of the attribute. We can extend it by adding a boolean column that records whether each value was imputed or comes from the original dataset (some documents call this technique marking imputed values). However, one must be careful with this method: if the data is not missing at random, filling with the mean can introduce outliers into the data.
- Algorithm-based imputation: instead of using a constant to impute missing values, we can model the variable with missing values as a function of the other features. A regression algorithm can then predict the missing values under some assumptions.
  - If linear regression is used, we must assume that the variables have a linear relationship.
  - If the missing values are filled from neighboring records after ordering the data by a highly correlated column, the process is called hot-deck imputation.
- KNN imputation: this method can be considered a variant of median/mean/mode imputation, but instead of calculating these values across all observations, it calculates them only among the K nearest observations. One question to think about is how to measure the distance between observations.
- Multivariate Imputation by Chained Equations: instead of imputing each column separately, we repeatedly estimate the missing values of each variable based on the distribution of the other variables. The process repeats until the data becomes stable. This approach has two settings, producing either a single imputed data set or multiple ones (the latter is also referred to as Multiple Imputation by Chained Equations, MICE).
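To make the first and third kinds of imputation in the list concrete, here is a small sketch using scikit-learn's `SimpleImputer` (with `add_indicator=True` to mark imputed values) and `KNNImputer`. The synthetic `age`/`fare` relationship and the 30% masking rate are assumptions made for the example.

```python
import numpy as np
from sklearn.impute import KNNImputer, SimpleImputer

rng = np.random.default_rng(0)
n = 1_000
age = rng.normal(30, 10, size=n)
fare = 2.0 * age + rng.normal(0, 5, size=n)   # fare is strongly related to age
X_full = np.column_stack([age, fare])

# Mask 30% of age completely at random.
X = X_full.copy()
missing = rng.random(n) < 0.30
X[missing, 0] = np.nan

# Mean imputation plus a boolean marker column for the imputed entries.
mean_imputer = SimpleImputer(strategy="mean", add_indicator=True)
X_mean = mean_imputer.fit_transform(X)   # columns: age, fare, age_was_missing

# KNN imputation: average age over the 5 nearest rows,
# with distance measured on the features that are observed.
X_knn = KNNImputer(n_neighbors=5).fit_transform(X)

def rmse(imputed):
    """Error of the imputed ages against the true (masked) ages."""
    return float(np.sqrt(np.mean((imputed[missing, 0] - X_full[missing, 0]) ** 2)))

print("variance of age, original:    ", round(float(X_full[:, 0].var()), 1))
print("variance of age, mean-imputed:", round(float(X_mean[:, 0].var()), 1))
print("RMSE mean:", round(rmse(X_mean), 2), " RMSE KNN:", round(rmse(X_knn), 2))
```

Mean imputation keeps the column mean but shrinks its variance, whereas KNN imputation can exploit the related `fare` column and recovers the masked ages more closely in this toy setup.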
Figure: One iteration of MICE
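The chained-equations idea can be sketched in a few lines of NumPy. This is a simplified, deterministic single-imputation variant written for illustration: it uses plain linear regression and omits the random draws and the multiple imputed data sets that full MICE produces. The function name `chained_equations_impute` and the test data are hypothetical.

```python
import numpy as np

def chained_equations_impute(X, n_iter=10, tol=1e-4):
    """Simplified single-imputation chained equations with linear regression."""
    X = np.asarray(X, dtype=float)
    missing = np.isnan(X)
    # Step 0: start from column means so every cell has a value.
    filled = np.where(missing, np.nanmean(X, axis=0), X)

    for _ in range(n_iter):
        previous = filled.copy()
        for j in range(X.shape[1]):
            rows_missing = missing[:, j]
            if not rows_missing.any():
                continue
            # Regress column j on all other columns (current fill-ins included),
            # fitting only on rows where column j was actually observed.
            others = np.delete(filled, j, axis=1)
            design = np.column_stack([np.ones(len(others)), others])
            coef, *_ = np.linalg.lstsq(
                design[~rows_missing], X[~rows_missing, j], rcond=None
            )
            # Overwrite only the originally missing cells with the predictions.
            filled[rows_missing, j] = design[rows_missing] @ coef
        # Stop once the imputed values no longer change noticeably.
        if np.max(np.abs(filled - previous)) < tol:
            break
    return filled

# Usage on synthetic, linearly related columns with 20% MCAR missingness.
rng = np.random.default_rng(0)
a = rng.normal(size=500)
b = 0.8 * a + rng.normal(scale=0.3, size=500)
c = a - b + rng.normal(scale=0.3, size=500)
X_true = np.column_stack([a, b, c])
X_obs = X_true.copy()
X_obs[rng.random(X_true.shape) < 0.2] = np.nan

X_imputed = chained_equations_impute(X_obs)
print("remaining NaNs:", int(np.isnan(X_imputed).sum()))
```

Each pass over the columns corresponds to one iteration: every variable with missing values is re-estimated from the current state of the others, and the loop stops when the fill-ins stabilize.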
Experiment
We use the Titanic dataset for the experiments, which is familiar to most data scientists. The original data consists of 12 variables, including both categorical and numerical variables. The original task is to predict whether each passenger survived or not.
We perform the classification task with Logistic Regression (fixed across trials). In each experiment, we simulate missing data by removing some existing values from some features of the input data. There are two ways of removing data: completely at random (MCAR Generator) and at random (MAR Generator). With the MAR Generator, in each trial, values are removed at different ratios depending on the values of another feature (specifically Pclass, a variable highly correlated with the Survived status). We track the change in accuracy across the different settings. For cross-validation, we apply K-Fold with K=5.
In experiment 1, we observe the change in accuracy when we remove different amounts of data from some features.
In experiment 2, we generate missing data using the MCAR and MAR Generators and use two MCAR-compatible methods to handle them. We want to find out whether these methods decrease the accuracy of the classifier.
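The experimental setup can be sketched roughly as below. This is not the original experiment code: the data is a synthetic stand-in for the Titanic features, and the helper names (`mask_mcar`, `mask_mar`, `cv_accuracy`) as well as the way the MAR probabilities are derived from Pclass are assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
n = 900

# Synthetic stand-in for a few Titanic features (not the real data).
pclass = rng.choice([1, 2, 3], size=n, p=[0.25, 0.20, 0.55])
age = rng.normal(30, 13, size=n).clip(1, 80)
fare = rng.gamma(2.0, 40.0 / pclass)
logit = 1.5 - 0.9 * (pclass - 1) - 0.02 * age
survived = (rng.random(n) < 1 / (1 + np.exp(-logit))).astype(int)
data = pd.DataFrame({"Pclass": pclass, "Age": age, "Fare": fare, "Survived": survived})

def mask_mcar(df, column, ratio, rng):
    """MCAR generator: each value of `column` is removed with the same probability."""
    out = df.copy()
    out.loc[rng.random(len(out)) < ratio, column] = np.nan
    return out

def mask_mar(df, column, ratio, by, rng):
    """MAR generator: removal probability depends on the observed column `by`."""
    out = df.copy()
    # Scale probabilities by the rank of `by` so the overall rate stays near `ratio`.
    weight = out[by].rank(pct=True)
    prob = (ratio * weight / weight.mean()).clip(0, 1)
    out.loc[rng.random(len(out)) < prob, column] = np.nan
    return out

def cv_accuracy(df):
    """5-fold accuracy of logistic regression after mean imputation."""
    X, y = df.drop(columns="Survived"), df["Survived"]
    model = make_pipeline(SimpleImputer(strategy="mean"), StandardScaler(),
                          LogisticRegression(max_iter=1000))
    folds = KFold(n_splits=5, shuffle=True, random_state=0)
    return cross_val_score(model, X, y, cv=folds, scoring="accuracy").mean()

results = {}
for ratio in (0.0, 0.2, 0.4, 0.6):
    results[ratio] = {
        "MCAR": cv_accuracy(mask_mcar(data, "Age", ratio, rng)),
        "MAR": cv_accuracy(mask_mar(data, "Age", ratio, "Pclass", rng)),
    }
print(pd.DataFrame(results).T.round(3))
```

Swapping the imputation step for `dropna()` before the split would give the listwise deletion variant of the same loop.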
Results and Discussion
Effect of the Amount of Missing Data
In this experiment, we look for the relationship (in the general sense, not the correlation coefficient) between the amount of missing data and the output of the learning model, as well as the method used to handle it. We do this by masking different ratios of a few columns under the MCAR setting.
Figure 3: Effect of the missing ratio on accuracy under four settings: masking Sex, dropping Title; masking Age, dropping Title; masking Age and Sex, dropping Title; masking Age and Sex, keeping Title. The column immediately to the right of each accuracy column shows the difference between the original (0%) and the current setting.
As can be seen, the more values are removed, the more the accuracy decreases. But this happens only under some settings.
The amount of missing data has a significant effect only if the feature carries “unique” information. In the presence of the Title feature (extracted from Name), missing values in the Sex column do not decrease the performance of the model, even with 99% of the data missing. This is because most values of the Title column (Mr, Mrs, Ms, Dr…) imply the information in the Sex column.
When some important features are highly correlated with the features that have missing values, the effect of the missing data becomes negligible. One lesson is that, despite its simplicity, removing an entire variable should be considered in many cases, especially when other features correlate highly with it. This is valuable if we do not want to sacrifice performance and spend effort just to gain a small amount of accuracy (around 1%).
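The role of a redundant feature can be reproduced on toy data. In the sketch below, a binary `title` column almost always reveals `sex`, and the label depends on `sex`; all rates (97% agreement, 15% label noise, 99% masking) are assumptions for illustration and are not taken from the Titanic results.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)
n = 2_000

# Toy data: `title` (think Mr vs Mrs/Miss) almost always reveals `sex`.
sex = rng.integers(0, 2, size=n).astype(float)
title = np.where(rng.random(n) < 0.97, sex, 1 - sex)
noise = rng.normal(size=n)
# The label depends on sex, with 15% of the labels flipped.
y = np.where(rng.random(n) < 0.85, sex, 1 - sex).astype(int)

# Mask 99% of the sex column completely at random.
sex_masked = sex.copy()
sex_masked[rng.random(n) < 0.99] = np.nan

def cv_accuracy(X):
    """5-fold accuracy of logistic regression after mean imputation."""
    model = make_pipeline(SimpleImputer(strategy="mean"), LogisticRegression())
    return cross_val_score(model, X, y, cv=5, scoring="accuracy").mean()

acc_full = cv_accuracy(np.column_stack([sex, title, noise]))
acc_masked_with_title = cv_accuracy(np.column_stack([sex_masked, title, noise]))
acc_masked_without_title = cv_accuracy(np.column_stack([sex_masked, noise]))

print(f"no missing data:             {acc_full:.3f}")
print(f"99% of sex masked, title in: {acc_masked_with_title:.3f}")
print(f"99% of sex masked, no title: {acc_masked_without_title:.3f}")
```

With `title` present, masking nearly all of `sex` costs little accuracy; without it, the model has almost nothing left to learn from.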
Effect of the Missing Generator and Handling Method
In this experiment, we use the MCAR and MAR generators to create modified datasets. Each removal method is applied to numerical columns (Age and Fare). We then handle the missing values with Mean Imputation (which is why we chose numerical features for removal) and Listwise Deletion, both of which are compatible with the MCAR setting, and observe the difference in accuracy.
Figure 4: Different missing generators with different MCAR handling methods. Two tables, one for handling by Mean Imputation and one for handling by Listwise Deletion, each with the columns: Missing ratio, MCAR Missing Generator (Age), MAR Missing Generator (Age), Difference.
Once again, we notice that with Mean Imputation there is no significant improvement when we use the MCAR Missing Generator instead of the MAR one. Although Mean Imputation (which is considered an MCAR-compatible handling method) can distort the correlation between features under the MAR Missing Generator, the classifier still achieves comparable accuracy.
On the other hand, with Listwise Deletion, the classifier accuracy is higher when the handling method matches the generator (MCAR Missing Generator). This can be explained by the fact that listwise deletion also throws away data from the other variables. In the MAR Generator case, rows are therefore removed by a non-random mechanism (they are still removed randomly in the MCAR Generator case), which worsens the classifier’s accuracy. Note that in one column there is an increase at the 60% setting. This happens because removing more rows makes both the training and testing folds smaller. We should not read it as an improvement of the model as the missing ratio increases.
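The explanation above, that listwise deletion under MAR removes rows non-randomly, can be checked on synthetic data. The class proportions, survival rates, and removal probabilities below are assumptions picked only to make the effect visible.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 20_000
pclass = rng.choice([1, 2, 3], size=n, p=[0.25, 0.20, 0.55])
# Hypothetical survival rates per class, chosen only for illustration.
survive_rate = np.select([pclass == 1, pclass == 2], [0.65, 0.45], default=0.25)
df = pd.DataFrame({
    "Pclass": pclass,
    "Age": rng.normal(30, 13, size=n),
    "Survived": (rng.random(n) < survive_rate).astype(int),
})

ratio = 0.4
# MCAR: the same removal probability for every passenger.
mcar = df.copy()
mcar.loc[rng.random(n) < ratio, "Age"] = np.nan
# MAR: third-class passengers lose Age far more often (overall rate still near 40%).
mar = df.copy()
p_mar = df["Pclass"].map({1: 0.05, 2: 0.20, 3: 0.63})
mar.loc[rng.random(n) < p_mar, "Age"] = np.nan

versions = {
    "original": df,
    "MCAR + listwise": mcar.dropna(),
    "MAR + listwise": mar.dropna(),
}
summary = pd.DataFrame({
    "rows kept": {k: len(v) for k, v in versions.items()},
    "share of Pclass 3": {k: v["Pclass"].eq(3).mean() for k, v in versions.items()},
    "survival rate": {k: v["Survived"].mean() for k, v in versions.items()},
})
print(summary.round(3))
```

After listwise deletion, the MCAR version keeps roughly the original class mix and survival rate, while the MAR version under-represents third class and shifts the survival rate, so the classifier is trained and evaluated on a distorted sample.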
Recap
All methods of handling missing data can be helpful, but the right choice depends on the circumstances. To choose well, data scientists should understand the process that generated the dataset and have knowledge of the domain.
Considering the correlation between features is important when deciding whether missing data should be handled, ignored, or deleted from the dataset.
There are also some aspects of handling missing data that we wanted to show, but due to time and resource limitations, we have not run those experiments yet. We would like to experiment with more complex methods such as algorithm-based handling, and to compare the effects across different datasets. We hope to come back to these problems some day.
Reference
Multiple Imputation by Chained Equations (MICE): https://www.youtube.com/watch?v=zX-pacwVyvU