
R for Data: Exploring and Visualizing Data - Loan Automation Example (1)


Introduction

Data science is valuable because it turns data into information and insight. Over the last few years it has become a hot topic across many fields, since data is everywhere. In the early years most analysis was done in SAS, but effective open-source tools and languages have since spread and now do the job well.

Data science is not about finding the obvious but about understanding the fine detail inside the data, processing it, and arriving at insights that explain the correlations and relationships between factors and make it possible to predict the outcome efficiently.

TABLE OF CONTENTS

1 Introduction
2 Example
  2.1 Problem statement
3 Data exploration
4 Data visualization
5 Missing value imputation
6 Outliers
7 Feature engineering
8 Preparing for model
9 ML algorithms
 

Example 

A finance company provides home loans to customers and has a presence across rural and urban areas. Customers apply for a loan, and the company validates each application before granting it.

Problem statement

The company wants to automate the loan eligibility process in real time, based on the details customers provide while filling in the application form. These details include Gender, Marital Status, Education, Number of Dependents, Income, Loan Amount, Credit History, etc. To automate the process, we need to identify the customer segments that are eligible for a loan so that the company can specifically target those customers.

The dataset is taken from the Analytics Vidhya Loan Prediction problem.

We will solve this problem in R, which requires a few basic packages that help us understand various facets of the dataset and develop insights.


Loading libraries

library('ggplot2') # visualization
library('ggthemes') # visualization
library('scales') # visualization
library('dplyr') # data manipulation
library('mice') # imputation
library('randomForest') # classification algorithm
library('data.table') # data manipulation


Data exploration - Let us first explore the dataset and understand the features, or columns, provided to us.

There are two types of variables in a dataset -

Independent variables - The input variables (predictors), such as Gender or Marital Status, from which we have to predict the result, here Loan Status.

Dependent variable - The variable whose value depends on the other variables; here Loan Status depends on Gender, Marital Status, and the others. It is also called the target variable.
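As a minimal sketch of this split, the toy data frame below separates the predictor columns from the target column. The column names mirror ones used in this tutorial, but the rows are invented for illustration and are not taken from the loan dataset.

```r
# Toy data frame; the values are invented for illustration only
toy <- data.frame(
  Gender      = c("Male", "Female", "Male", "Female"),
  Education   = c("Graduate", "Graduate", "Not Graduate", "Graduate"),
  Loan_Status = c("Y", "N", "Y", "N"),
  stringsAsFactors = FALSE
)

target     <- "Loan_Status"                # dependent (target) variable
predictors <- setdiff(names(toy), target)  # independent variables

X <- toy[, predictors, drop = FALSE]  # inputs used for prediction
y <- toy[[target]]                    # outcome to be predicted
```

Keeping the target name in one variable makes it easy to reuse the same split on the real `train` data later.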


Exploring data

train <- read.csv("train_u6lujuX_CVtuZ9i.csv", stringsAsFactors = FALSE)
test <- read.csv("test_Y3wMUE5_7gLdaTN.csv", stringsAsFactors = FALSE)

test$Loan_Status <- NA # to match columns of train dataset
full <- rbind(train,test) # binding them together 
str(full)

str(full) shows the details of every variable in the dataset, including its data type, which lets us look at the features closely. We can further check the class, or data type, of a single variable using the class function.
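The class check can be sketched as follows. The data frame here is a small stand-in for `full` with invented values, and the column name `ApplicantIncome` is an assumption about how the income column is spelled in the real file.

```r
# Toy stand-in for `full`; values are invented for illustration only
toy <- data.frame(
  Gender          = c("Male", "Female", "Male"),
  ApplicantIncome = c(5000L, 3000L, 4200L),  # assumed column name
  stringsAsFactors = FALSE
)

gender_class <- class(toy$Gender)   # class of a single column
all_classes  <- sapply(toy, class)  # class of every column at once
print(all_classes)
```

Using `sapply` over the data frame gives a compact overview when `str` prints more detail than needed.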

Some variables are categorical, such as Gender, Marital Status, Education, and Self_Employed, while others are continuous, such as Applicant Income and Co-applicant Income. We need to predict Loan Status, which is itself a categorical variable.
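Because the files were read with stringsAsFactors = FALSE, categorical columns arrive as plain character vectors. One possible way to mark them as categorical is to convert them to factors, sketched here on invented toy values (the level names are assumptions, not taken from the real data).

```r
# Toy data; values and level names are invented for illustration only
toy <- data.frame(
  Gender      = c("Male", "Female", "Male"),
  Education   = c("Graduate", "Not Graduate", "Graduate"),
  Loan_Status = c("Y", "N", NA),  # NA mimics a test row with unknown status
  stringsAsFactors = FALSE
)

cat_cols <- c("Gender", "Education", "Loan_Status")
toy[cat_cols] <- lapply(toy[cat_cols], factor)  # character -> factor

str(toy)  # the converted columns now show up as factors with levels
```

Missing values stay NA after conversion and do not become a level of their own.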

Data exploration

library(gmodels)
CrossTable(full$Gender, full$Loan_Status)

table(full$Self_Employed, full$Loan_Status)

table(full$Education, full$Loan_Status)
     

Our exploration provides us with these results.



As we see, most of the loans have been granted to males, who constitute almost 79 percent of the training set. Let us look at the other two outputs.




Here, too, we see that loans are favoured towards people who are self-employed and people who are at least graduates. We can do the same for each of the other variables to understand how Loan Status varies with them.
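Raw counts can be hard to compare when the groups differ in size, so it may help to turn a table into row-wise shares. The sketch below uses prop.table from base R on invented toy values; it is one possible approach, not necessarily how the figures above were produced.

```r
# Toy data; values are invented for illustration only
toy <- data.frame(
  Gender      = c("Male", "Male", "Male", "Female", "Female", "Male"),
  Loan_Status = c("Y", "Y", "N", "Y", "N", "Y"),
  stringsAsFactors = FALSE
)

counts    <- table(toy$Gender, toy$Loan_Status)  # raw counts per group
row_share <- prop.table(counts, margin = 1)      # shares within each row

print(round(row_share, 2))
```

With margin = 1 each row sums to 1, so the "Y" column reads directly as the approval share within that group.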


Data visualization - Looking at tables alone is not sufficient; visualizing the data helps us connect the dots and understand the larger picture across different combinations of variables.

ggplot(full[1:614, ], mapping = aes(x = Loan_Status, fill = Dependents)) +
  geom_bar(position = "dodge") # grouped bars, one per Dependents value




visualization

ggplot(full[1:614, ], mapping = aes(x = Loan_Status, fill = Property_Area)) +
  geom_bar(position = "dodge") # grouped bars, one per Property_Area value



ggplot(full[1:614, ], mapping = aes(x = Loan_Status, fill = Education)) +
  geom_bar(position = "fill") + # stacked bars scaled to proportions
  theme_few()



As we see, graduates are more often approved for the loan, which suggests that this variable is significant for predicting Loan Status.
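A plot suggests a relationship but does not quantify it. One possible check, sketched here on invented toy counts rather than the real data, is to compare approval shares per education level and run a chi-squared test of independence with chisq.test from base R. This is an illustrative addition, not a step taken in the analysis above.

```r
# Toy data; counts are invented for illustration only
toy <- data.frame(
  Education   = rep(c("Graduate", "Not Graduate"), times = c(80, 40)),
  Loan_Status = c(rep(c("Y", "N"), times = c(60, 20)),   # graduates
                  rep(c("Y", "N"), times = c(20, 20))),  # non-graduates
  stringsAsFactors = FALSE
)

tab <- table(toy$Education, toy$Loan_Status)

# Share of approved loans within each education level
approval_rate <- prop.table(tab, margin = 1)[, "Y"]
print(approval_rate)

# Chi-squared test of independence between Education and Loan_Status
res <- chisq.test(tab)
print(res$p.value)
```

A small p-value would indicate that approval and education are unlikely to be independent in the sample; it says nothing about how strong a predictor the variable will be in a model.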


Some of our other tutorials related to data analysis with R