
R for Data: Imputation Techniques In Data Science In R

Introduction


Data science, as we know it, is the ability to convert data into information and then translate that information into insight.

Data science is rarely about the obvious; it is about extracting useful insight from what looks obvious. The data scientist's job is to dig into the data, analyse its patterns and draw conclusions. But what if bits and pieces of the data are missing?

Data with missing values, whether many or few, is a hindrance to analysis.

TABLE OF CONTENTS

1. Introduction
2. Mtcars dataset
3. Importance of imputation
   3.1 Plot of dataset with NA
   3.2 Plot of original mtcars dataset
4. Imputation
5. Imputation methods
   5.1 KnnImputation
   5.2 Mean
   5.3 Median
   5.4 Mode
   5.5 MICE

Mtcars data set

Let us understand this through an example. Consider mtcars, a well-known data set that ships with R.

Its description states: the data was extracted from the 1974 Motor Trend US magazine and comprises fuel consumption and 10 aspects of automobile design and performance for 32 automobiles (1973–74 models).

Have a look at the features of the data set.

All of its variables are numeric. After inspecting the structure, we will check it for missing values (NA).

str(mtcars) # structure of the mtcars dataset
'data.frame':	32 obs. of  11 variables:
 $ mpg : num  21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
 $ cyl : num  6 6 4 6 8 6 8 4 4 6 ...
 $ disp: num  160 160 108 258 360 ...
 $ hp  : num  110 110 93 110 175 105 245 62 95 123 ...
 $ drat: num  3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
 $ wt  : num  2.62 2.88 2.32 3.21 3.44 ...
 $ qsec: num  16.5 17 18.6 19.4 17 ...
 $ vs  : num  0 0 1 1 0 1 0 1 1 1 ...
 $ am  : num  1 1 1 0 0 0 0 0 0 0 ...
 $ gear: num  4 4 4 3 3 3 3 4 4 4 ...
 $ carb: num  4 4 1 1 2 1 4 2 2 4 ...

SUM

sum(is.na(mtcars))
[1] 0

As we can see, the data set has no NA (missing) values.
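
A single total can hide where the gaps are. As a small sketch using only base R, the count per column and the number of complete rows can be checked as below. The object names na_per_column and complete_rows are illustrative and not part of the original walkthrough.

```r
# Count missing values in each column of the built-in mtcars data set
na_per_column <- colSums(is.na(mtcars))
print(na_per_column)

# Number of rows that contain no missing value at all
complete_rows <- sum(complete.cases(mtcars))
print(complete_rows)
```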

The car names are stored as row names, so we move them into a column to make them a regular attribute.

CREATING NEW COLUMN

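s = mtcars
data = cbind(rownames(s), s)     # car names become the first column
rownames(data) = NULL            # drop the now redundant row names
colnames(data)[1] = "carNames"
# R >= 4.0 keeps strings as character, so convert to a factor explicitly
data$carNames = as.factor(data$carNames)
View(data)

The structure of data now shows an extra column, carNames, whose class is factor.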

Let us introduce some NA values into the data set.

INDUCING NA'S

data$hp[c(1,4)] = NA   # two missing values in hp
data$disp[4] = NA      # one missing value in disp
## check the total number of NA values
sum(is.na(data))
[1] 3

Importance of imputation

Before going through the procedures for imputing these NA values, here is a demonstration of the impact they can have.

We will plot the relation between displacement (disp) and horsepower (hp) for this data (with NA) and for the original mtcars (without NA), and compare what we see.

PLOT OF DATASET WITH NA

library(ggplot2)
g = data
ggplot(data = g, mapping = aes(x=disp,y = hp))+
   geom_point()+
   geom_smooth(alpha = 1/20)


fig(1) = with NA

Now plot the same graph for the original mtcars data set, stored in the variable orig.


PLOT OF ORIGINAL MTCARS DATASET

orig = mtcars
ggplot(data = orig, mapping = aes(x=disp,y = hp))+
  geom_point()+
  geom_smooth(alpha = 1/20)


fig(2) = without NA

At first glance the two figures look the same, but a closer look shows a difference in the curve between 200 and 300 on the x-axis.

In fig(2) the curve bends inwards, while in fig(1) it sits a little higher. Even with fewer than 5 percent of the values missing we can see a difference in the curve, which shows the importance of imputing missing values.


    Imputation - the process of replacing NA or missing values with estimated values, using certain techniques, so that we can make more sense of the data and make more accurate predictions.

Note -

  • As a rule of thumb, we impute when missing values make up less than 5 percent of the data.
  • When more than 40 percent of a column is missing, we either drop that column or drop the rows in which it is missing.
  • Sometimes an important variable has a high share of NA and we still have no option but to impute, because dropping it would lose variance that helps explain the target variable.
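
These thresholds are easy to check before deciding what to do. The sketch below rebuilds the example data (mtcars with NA in hp rows 1 and 4 and disp row 4) and sorts each column into a rough category. The names d, pct_missing and triage, and the category labels, are illustrative assumptions; the cut-offs are guidelines, not fixed rules.

```r
# Rebuild the example: mtcars with the NA values induced earlier
d <- mtcars
d$hp[c(1, 4)] <- NA
d$disp[4] <- NA

# Percentage of missing values in each column
pct_missing <- colMeans(is.na(d)) * 100

# Rough triage following the rule of thumb (intervals are closed on the right)
triage <- cut(pct_missing,
              breaks = c(-Inf, 0, 5, 40, Inf),
              labels = c("complete", "impute", "judgement call", "consider dropping"))
print(data.frame(pct_missing = round(pct_missing, 2), triage))

# Share of missing cells in the whole data set
print(mean(is.na(d)) * 100)
```

Looking at the share per column as well as the overall share matters: a data set can be almost complete overall while one column is well above the 5 percent guideline.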

Since fewer than 1 percent of the cells in our example are missing (two values in hp and one in disp), we go ahead with imputing them.

R has a variety of packages that deal with the imputation of missing values. We will discuss only some of the most commonly used options:

  1. DMwR (knnImputation)
  2. Mean (base package)
  3. Median (base package)
  4. Mode (for categorical variables only)
  5. MICE


Imputation methods


Data never lies, so the curve plotted from the data with NA should come as close as possible to the curve plotted from the original, NA-free data set. That is why we resort to imputation.





 1. KnnImputation -
knnImputation is a function in the DMwR package. It works on the nearest-neighbour principle: it imputes a missing value by averaging the values of the rows nearest to it, and it is mostly used for numeric variables.

Let us try this.

KNN IMPUTATION

library(DMwR)
# impute into a new object so that data and g keep their NA values
g_knn = knnImputation(data)
View(g_knn)


It fills in every missing value. How accurate the result is depends on how close the imputed values are to the original ones, which we know from the original mtcars data set.

The imputation above gives 123, 122 and 221, against the exact values 110, 110 and 258 respectively.
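
To make the nearest-neighbour idea concrete, here is a hand-rolled sketch in base R. It is not DMwR's implementation: knn_impute_column, target and k are hypothetical names, the plain mean of the k nearest rows is used, and the sketch assumes that all other columns are numeric and complete.

```r
# Hand-rolled nearest-neighbour imputation for one column (illustration only).
# Assumes every column other than `target` is numeric and has no NA.
knn_impute_column <- function(df, target, k = 5) {
  predictors <- setdiff(names(df), target)
  # Scale predictors so that no single column dominates the distance
  x <- scale(as.matrix(df[, predictors, drop = FALSE]))
  missing_rows  <- which(is.na(df[[target]]))
  observed_rows <- which(!is.na(df[[target]]))
  for (i in missing_rows) {
    # Euclidean distance from row i to every row with an observed target
    diffs <- sweep(x[observed_rows, , drop = FALSE], 2, x[i, ])
    dists <- sqrt(rowSums(diffs^2))
    nearest <- observed_rows[order(dists)[seq_len(k)]]
    # Impute with the plain mean of the k nearest observed values
    df[[target]][i] <- mean(df[[target]][nearest])
  }
  df
}

d <- mtcars
d$hp[c(1, 4)] <- NA
imputed <- knn_impute_column(d, "hp", k = 5)
# Compare the imputed values with the true ones
print(cbind(imputed = imputed$hp[c(1, 4)], original = mtcars$hp[c(1, 4)]))
```

Because the estimate is an average of observed values, it always stays within the observed range of the column.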

2. Mean - mean() is a base R function. As the name suggests, each missing value is replaced by the mean of all observed values of that variable.


MEAN IMPUTATION

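g_mean = data   # start from data, which still contains the NA values
g_mean$hp[is.na(g_mean$hp)] = mean(g_mean$hp, na.rm = TRUE)
g_mean$disp[is.na(g_mean$disp)] = mean(g_mean$disp, na.rm = TRUE)
View(g_mean)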

It gives about 149, 149 and 229, against the exact values 110, 110 and 258.

3. Median - median() is a base R function. As the name suggests, each missing value is replaced by the median of all observed values of that variable. It is generally used for numeric variables.


MEDIAN IMPUTATION

e = data   # start from data, which still contains the NA values
e$hp[is.na(e$hp)] = median(e$hp, na.rm = TRUE)
e$disp[is.na(e$disp)] = median(e$disp, na.rm = TRUE)
View(e)

It gives 136, 136 and 165, against the exact values 110, 110 and 258 of the original mtcars data.
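
Mean and median imputation follow the same pattern, so they can share one small helper. The sketch below is an illustration in base R: impute_with is a hypothetical name, and only hp is blanked out to keep the comparison short.

```r
# impute_with is a hypothetical helper: fill the NA values of a numeric
# vector with a summary statistic computed from the observed values.
impute_with <- function(x, stat = mean) {
  x[is.na(x)] <- stat(x, na.rm = TRUE)
  x
}

d <- mtcars
d$hp[c(1, 4)] <- NA

hp_mean   <- impute_with(d$hp, mean)
hp_median <- impute_with(d$hp, median)

# Absolute error of each method on the two rows that were blanked out
truth <- mtcars$hp[c(1, 4)]
print(rbind(mean   = abs(hp_mean[c(1, 4)]   - truth),
            median = abs(hp_median[c(1, 4)] - truth)))
```

Both methods put the same value into every missing cell of a column, which is why they tend to flatten the relationship between variables more than a neighbour-based method does.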

4. Mode - It is used mostly for categorical variables and, as the name suggests, imputes the value that occurs most often (a majority vote).

Here carNames is a categorical variable with 32 values, so we could induce an NA and impute it with the mode. But every car name occurs exactly once, so the mode can never recover the missing name: it can only return a value that is already present in another row.

For a variable like this, one alternative to the mode is to convert the factor into numeric codes and then apply one of the imputation methods for numeric variables.
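
Base R has no function for the statistical mode (its mode() returns the storage type of an object), so a small helper is needed. The sketch below uses cyl as a stand-in categorical variable because, unlike carNames, its values repeat. stat_mode is a hypothetical name.

```r
# stat_mode is a hypothetical helper that returns the most frequent value.
stat_mode <- function(x) {
  counts <- table(x)                 # table() ignores NA by default
  names(counts)[which.max(counts)]
}

# cyl treated as a categorical variable, with one value blanked out
cyl <- factor(mtcars$cyl)
cyl[3] <- NA
cyl[is.na(cyl)] <- stat_mode(cyl)
print(table(cyl))
```

The imputed value is simply the most common level, so it need not match the true one: row 3 is a 4-cylinder car, while the most frequent level in mtcars is 8.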


CONVERTING FACTOR INTO NUMERIC

g$carNames= as.numeric(g$carNames)
class(g$carNames)
[1] "numeric"
str(g$carNames)
 num [1:32] 18 NA 5 13 14 31 7 21 20 22 ...

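After imputing the numeric codes with any method other than the mode, we get a numeric estimate in place of the NA, which we can then convert back into a factor with as.factor. How close that estimate is to the original value depends on the method and the data.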
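
One detail of the round trip is worth spelling out. Calling as.factor directly on the numeric codes produces levels such as "1", "2", "3" rather than the car names, so the original levels have to be kept and used for the mapping. A minimal sketch in base R, with car_names, codes and restored as illustrative names:

```r
car_names <- factor(rownames(mtcars))
codes <- as.numeric(car_names)     # integer codes, in alphabetical level order

# Map the codes back to labels through the original levels
restored <- factor(levels(car_names)[codes], levels = levels(car_names))
print(identical(restored, car_names))

# Going through as.factor() alone loses the names
print(head(levels(as.factor(codes))))
```

An imputed code may also be a non-integer, in which case it has to be rounded to a valid level index before mapping back. The codes follow alphabetical order, so treating them as numbers imposes an ordering that has no real meaning.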


5. MICE - MICE (Multivariate Imputation by Chained Equations) is one of the most commonly used imputation packages in R.


It works on the assumption that the data is missing at random (MAR), which means that the probability of a value being missing depends only on the observed values. It builds an imputation model for each variable and imputes the missing values variable by variable.

Once this cycle is complete, multiple data sets are generated. These data sets differ only in the imputed values. It is generally considered good practice to build a model on each of these data sets separately and then combine the results.

The methods available for imputation include:

  1. pmm (Predictive Mean Matching) – for numeric variables
  2. logreg (Logistic Regression) – for binary variables (2 levels)
  3. polyreg (Bayesian polytomous regression) – for unordered factor variables (> 2 levels)
  4. polr (Proportional odds model) – for ordered factor variables (> 2 levels)
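
To see which method mice would pick for each column, a common idiom is a dry run with maxit = 0, which sets everything up without imputing. The sketch below assumes the mice package is installed and uses a modified copy of mtcars (am turned into a binary factor, with a few NA values added) purely for illustration. The names d, ini and meth are hypothetical.

```r
if (requireNamespace("mice", quietly = TRUE)) {
  d <- mtcars
  d$am <- factor(d$am)      # binary factor, so logreg applies
  d$hp[c(1, 4)] <- NA       # numeric column with missing values
  d$am[2] <- NA             # factor column with a missing value

  # Dry run: no iterations, only used to read the default method vector
  ini <- mice::mice(d, maxit = 0, printFlag = FALSE)
  meth <- ini$method
  print(meth)
}
```

The method vector can be edited and passed back through the method argument of mice() when a different choice is wanted for a particular column.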

carNames is a unique label for each row and carries no information that helps imputation, so we remove it before running MICE.


GETTING PATTERN OF NA

g= g[,-1]
install.packages("mice")
library(mice)
md.pattern(g)
   mpg cyl disp drat wt qsec vs am gear carb hp  
29   1   1    1    1  1    1  1  1    1    1  1 0
 3   1   1    1    1  1    1  1  1    1    1  0 1
     0   0    0    0  0    0  0  0    0    0  3 3

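Let's understand this table. There are 29 observations with no missing values and 3 observations with a missing value in hp. Note that this output comes from a run in which hp was missing in three rows and disp was complete, which differs from the NA values induced earlier.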
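
The same summary can be reproduced with base R alone, which helps when reading the md.pattern table for the first time. This sketch assumes a copy of mtcars with hp missing in rows 2, 4 and 6, matching the pattern in the table above. The names d and pattern are illustrative.

```r
d <- mtcars
d$hp[c(2, 4, 6)] <- NA

# One string per row: 1 = observed, 0 = missing, in column order
pattern <- apply(!is.na(d), 1, function(r) paste(as.integer(r), collapse = ""))
print(table(pattern))

# Missing values per column (the bottom row of the md.pattern table)
print(colSums(is.na(d)))

# Rows with no missing value (the first count in the md.pattern table)
print(sum(complete.cases(d)))
```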


PLOTTING GRAPH USING VIM PACKAGE

install.packages("VIM")
library(VIM)
mice_plot <- aggr(g, col = c('navyblue', 'yellow'),
                  numbers = TRUE, sortVars = TRUE,
                  labels = names(g), cex.axis = .7,
                  gap = 3, ylab = c("Missing values", "Pattern"))


Only hp has missing values; no other column does.

Here are the meanings of some of the parameters used in mice():

  1. m – the number of imputed data sets to generate (5 here)
  2. maxit – the number of iterations used to impute the missing values
  3. method – the imputation method; we use predictive mean matching (pmm)




MICE IMPUTATION

imputed_Data <- mice(g, m=5, maxit = 50, method = 'pmm', seed = 500)
summary(imputed_Data)



The imputed values are as follows.

IMPUTED VALUES

# check the imputed values
imputed_Data$imp$hp
    1   2   3   4   5
2 175 123 150 180 109
4 109  93  52 110 123
6 113 109  97 113 123


These are the 5 imputed data sets, each giving a different value for the same 3 missing values in the hp column.

We can choose any one of the 5 imputed data sets, or combine them to get an aggregate value for each missing value.
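
Both options can be sketched with functions from the mice package: complete() extracts one completed data set, while with() and pool() fit a model on every completed data set and combine the results. The example assumes mice is installed and uses a copy of mtcars with hp missing in rows 2, 4 and 6. The model mpg ~ hp + wt, the shorter maxit and the object names are illustrative choices, not part of the walkthrough above.

```r
if (requireNamespace("mice", quietly = TRUE)) {
  library(mice)
  d <- mtcars
  d$hp[c(2, 4, 6)] <- NA
  imp <- mice(d, m = 5, maxit = 5, method = "pmm", seed = 500, printFlag = FALSE)

  # Option 1: extract a single completed data set (here the second of five)
  completed_2 <- complete(imp, 2)
  print(completed_2$hp[c(2, 4, 6)])

  # Option 2: fit the same model on every completed data set and pool the fits
  fits   <- with(imp, lm(mpg ~ hp + wt))
  pooled <- pool(fits)
  print(summary(pooled))
}
```

Pooling model results is usually preferred over averaging the imputed values themselves, because it carries the uncertainty between the imputed data sets into the final estimates.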