
R for Data: Exploring and Visualizing Data - Loan Automation Example (2)

The data visualizations so far have given us a clear sense of how much each independent variable contributes to the target variable. The next step is to examine the data for missing values and outliers, since we need a clean dataset to build a predictive model and generate predictions on the test data.

After data exploration and data visualization, we need to focus on the following steps.

Missing value imputation

As we already know, missing values are never a good sign in a dataset: they severely compromise our ability to make correct predictions. We therefore need to handle them with whichever imputation technique gives us good accuracy.

Only a few variables possess missing values, so we will impute them to tidy the dataset.
The missing values in Loan_Status belong to the test data, where it is the value we need to predict; apart from that column, the rest are genuine missing values.


    full$ApplicantIncome[is.na(full$ApplicantIncome)] <- median(full$ApplicantIncome, na.rm = TRUE)

    full$LoanAmount[is.na(full$LoanAmount)] <- median(full$LoanAmount, na.rm = TRUE)

    full$Credit_History[is.na(full$Credit_History)] <- median(full$Credit_History, na.rm = TRUE)

    full$Loan_Amount_Term[is.na(full$Loan_Amount_Term)] <- median(full$Loan_Amount_Term, na.rm = TRUE)

We can check for missing values again in the same way to make sure we have taken care of them. Next, we need to check whether the dataset contains any outliers that may have crept in through data-entry mistakes or other errors.
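As a minimal sketch of that check (a toy data frame stands in for the real `full` here, so the block is self-contained):

```r
# Toy data frame standing in for `full`; the real one comes from the loan data
full <- data.frame(ApplicantIncome = c(5000, NA, 7000),
                   LoanAmount      = c(120, 150, NA))

# Number of missing values in each column
na_counts <- colSums(is.na(full))
na_counts
```

After imputation, every entry of `na_counts` (except Loan_Status, whose NAs mark the test rows) should be zero.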

Outliers - We need to make sure that there are no invalid outliers that may compromise the accuracy of our model.


   library(outliers)   # provides grubbs.test
   grubbs.test(full$ApplicantIncome, type = 10)

### output 
Grubbs test for one outlier

data:  full$ApplicantIncome
G = 13.31300, U = 0.81896, p-value < 2.2e-16
alternative hypothesis: highest value 81000 is an outlier

As we see, the Grubbs test flags the applicant with the highest income (81000) as an outlier. It falls to us to decide whether to treat it as one; if we do, we would impute it the same way we imputed the missing values. A boxplot helps us visualize the situation.

Here we see that there are indeed certain incomes well above the usual range that register as outliers, but since they can genuinely be the higher incomes of real applicants, we leave them as they are. The same process can be repeated for the other variables as well.
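A minimal boxplot sketch with ggplot2 (the toy incomes here stand in for the real `full$ApplicantIncome`, including the flagged 81000 value):

```r
library(ggplot2)

# Toy incomes standing in for full$ApplicantIncome
incomes <- data.frame(ApplicantIncome = c(2500, 3800, 4100, 5500, 6000, 81000))

# Extreme values show up as points beyond the whiskers
p <- ggplot(incomes, aes(x = "", y = ApplicantIncome)) +
  geom_boxplot() +
  labs(x = NULL, y = "Applicant income")
p
```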

Feature engineering 
This is arguably the most critical portion, and it tests one's thinking skills: you can create different variables from the same information and thereby increase the prediction accuracy or interpretability of the model. We create some new features below, but you are free to create others that reflect your own creativity and understanding of the problem at hand.

Creating features

 full$total_income <- full$ApplicantIncome + full$CoapplicantIncome

 library(dplyr)   # for %>%, mutate and case_when

 ## pipe function used to create income factors
 df3 <- full %>%
   mutate(income_status = case_when(
     total_income < 5000                           ~ "low income",
     total_income >= 5000  & total_income < 13000  ~ "middle income",
     total_income >= 13000 & total_income < 28000  ~ "good income",
     total_income >= 28000                         ~ "high income",
     TRUE                                          ~ "NA"
   ))
 full <- df3

 ## feature engineering for dependent members
 df4 <- full %>%
   mutate(dependent_members = case_when(
     Dependents == "0"                     ~ "no dependents",
     Dependents == "1" | Dependents == "2" ~ "moderately dependent",
     Dependents == "3+"                    ~ "highly dependent",
     TRUE                                  ~ "NA"
   ))
 full <- df4

You can go one step further and create more variables from the independent ones that have a positive effect on the model.
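For example, one hypothetical extra feature is an approximate monthly installment (EMI) and the income left over after paying it; the names `emi` and `balance_income` are my own, not part of the original dataset (toy rows stand in for `full`, and LoanAmount is in thousands in this dataset):

```r
library(dplyr)

# Toy rows standing in for `full`
full <- data.frame(LoanAmount        = c(120, 150),
                   Loan_Amount_Term  = c(360, 180),
                   ApplicantIncome   = c(5000, 4000),
                   CoapplicantIncome = c(0, 1500))

full <- full %>%
  mutate(emi = LoanAmount * 1000 / Loan_Amount_Term,  # rough monthly payment
         balance_income = ApplicantIncome + CoapplicantIncome - emi)
```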

   ggplot(full[1:614,], mapping = aes(x = factor(Loan_Status), fill = income_status)) +
     geom_bar(position = "dodge")

Preparing for model 

The next step is to convert the categorical variables into factors so that our machine learning algorithm can handle them properly and give us good predictive accuracy.

 ## converting variables into factors in order to avoid warning messages
 full$Gender            <- as.factor(full$Gender)
 full$Married           <- as.factor(full$Married)
 full$Education         <- as.factor(full$Education)
 full$Self_Employed     <- as.factor(full$Self_Employed)
 full$Property_Area     <- as.factor(full$Property_Area)
 full$Dependents        <- as.factor(full$Dependents)
 full$dependent_members <- as.factor(full$dependent_members)
 full$income_status     <- as.factor(full$income_status)

   train <- full[1:614,]
   test <- full[615:981,]

   ## removing the identifier Loan_ID
   train <- train[,-1]

ML algorithm -

This is where we finally get to see the result of our hard work on the dataset during the earlier stages. I apply the random forest algorithm here, but you are free to apply other algorithms of your choice, and even to perform parameter tuning to increase your accuracy further.

Random forest

 library(randomForest)

 ## the test set needs a placeholder Loan_Status so predict() sees the same columns
 test$Loan_Status <- "0"
 test$Loan_Status <- as.factor(test$Loan_Status)
 train$Loan_Status <- as.factor(train$Loan_Status)

 ## prediction with random forest
 fit_classify <- randomForest(Loan_Status ~ ., train, importance = TRUE, ntree = 800)

 pred <- predict(fit_classify, test)

 ## saving it as a submission file
 solution <- data.frame(Loan_ID = test$Loan_ID, Loan_Status = pred)
 # Write the solution to file
 write.csv(solution, file = 'loan_prediction_solution.csv', row.names = FALSE)

Random forest also provides an importance measure that tells us the relative importance of the independent variables, so we can remove the least important ones to improve accuracy.
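As a sketch of reading that importance measure (shown here on the built-in iris data so the block is self-contained; with the model above you would pass `fit_classify` instead of `fit`):

```r
library(randomForest)

set.seed(42)
fit <- randomForest(Species ~ ., data = iris, importance = TRUE, ntree = 100)

imp <- importance(fit)   # matrix with MeanDecreaseAccuracy / MeanDecreaseGini per variable
imp
varImpPlot(fit)          # dot chart of the same information
```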

You can also use other ML algorithms or even try ensembling to improve accuracy and produce better results.
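One simple ensembling sketch is majority voting over the class predictions of several models; the prediction vectors `p1`-`p3` below are made up for illustration:

```r
# Hypothetical class predictions from three different models
p1 <- c("Y", "N", "Y", "Y")
p2 <- c("Y", "N", "Y", "N")
p3 <- c("N", "Y", "Y", "Y")

votes <- cbind(p1, p2, p3)

# Majority vote per observation (ties would go to the first level alphabetically)
majority <- apply(votes, 1, function(r) names(which.max(table(r))))
majority   # "Y" "N" "Y" "Y"
```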