R For Data and Visualization: Case Study: Retail Analytics 2 - A Data Science Story


In the previous part, we explored the data by creating univariate and multivariate visualizations, which gave us a clear picture of the relationships between several variables in the dataset.

A few more points to explore:

ggplot

ggplot(data = train, mapping = aes(x = Outlet_Location_Type, y = Item_Outlet_Sales)) +
  geom_jitter(mapping = aes(color = Outlet_Type))
## Tier 1 and Tier 2 locations show a high density of Supermarket Type1 sales,
## which is mostly driven by the outlet (mart) type.

ggplot

ggplot(data = train, mapping = aes(x = Item_Type, y = Item_Outlet_Sales)) +
  geom_jitter(mapping = aes(color = Outlet_Type))

## Grocery store items of every type sold in low numbers,
## Supermarket Type1 items of every type sold in high numbers,
## and Supermarket Type2 and Type3 items sold in moderate numbers.
## For every item type, the outlet type plays a major part in sales.


Now we will move to the next stage: preparing the data for modeling and selecting the variables to be used for predicting the target variable.



Model preparation


Missing values
As we observed, there are some missing values that we need to take care of before building a model.


missing value

# count missing values per column in each dataset
colSums(is.na(train))
colSums(is.na(test))


Now we need to take care of the missing values in both datasets.
To impute a categorical variable like Outlet_Size, we will use the mode (the most frequent level).

missing value imputation

## imputing a categorical variable by its mode
# note: base R's mode() returns the storage mode of an object, not the
# statistical mode, so we take the most frequent level from a frequency table
outlet_size_mode <- names(which.max(table(train$Outlet_Size)))
train$Outlet_Size[is.na(train$Outlet_Size)] <- outlet_size_mode
train$Outlet_Size <- as.factor(train$Outlet_Size)

## imputing a continuous variable by its mean

train$Item_Weight[is.na(train$Item_Weight)] <- mean(train$Item_Weight, na.rm = TRUE)

To improve accuracy, we can also use the mice package to impute both categorical and continuous variables,
and the same imputation steps can then be replicated on the test dataset.
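
Below is a minimal sketch of mice-based imputation, as an alternative to the mode/mean approach above; it assumes the mice package is installed and lets mice pick its default method per column.

mice imputation

library(mice)

# identifier columns should be excluded before imputing;
# mice picks a default method per column: "pmm" for numeric,
# "logreg"/"polyreg" for binary/multi-level factors
imp <- mice(train, m = 5, seed = 123)

# use the first of the five completed datasets
train_imputed <- complete(imp, 1)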


Outliers
We further need to check for outliers, see how prevalent they are, and treat them accordingly.


Boxplot

# draw the boxplot again, with the mean marked
boxplot(train$Item_Visibility, col = "blue")
abline(h = mean(train$Item_Visibility), col = "orange", lwd = 2)



outliers

library(outliers)

# Grubbs test for a single outlier (type = 10) in Item_MRP
grubbs.test(train$Item_MRP, type = 10)

# locate the flagged value
(v <- which(train$Item_MRP == 266.8884))

boxplot(train$Item_MRP, col = "red", main = "Boxplot - Item_MRP")
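
The Grubbs test above only flags the single most extreme value. One common treatment, sketched below for Item_Visibility, is to cap (winsorize) values beyond the usual 1.5 * IQR boxplot whiskers; the choice of variable and rule here is illustrative.

outlier treatment

# cap Item_Visibility at the boxplot whiskers (1.5 * IQR rule)
q <- quantile(train$Item_Visibility, c(0.25, 0.75), na.rm = TRUE)
iqr <- q[2] - q[1]
upper <- q[2] + 1.5 * iqr
lower <- q[1] - 1.5 * iqr
train$Item_Visibility[train$Item_Visibility > upper] <- upper
train$Item_Visibility[train$Item_Visibility < lower] <- lower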

Modification of features
Now we need to modify the features to make them suitable as input to a machine learning algorithm.


features

# converting character columns into factors
# (Outlet_Size is already a factor from the imputation step above)
train$Item_Fat_Content <- as.factor(train$Item_Fat_Content)
unique(train$Item_Fat_Content)
train$Item_Type <- as.factor(train$Item_Type)
train$Outlet_Location_Type <- as.factor(train$Outlet_Location_Type)
train$Outlet_Type <- as.factor(train$Outlet_Type)


features


# test: converting character columns into factors
test$Outlet_Type <- as.factor(test$Outlet_Type)
test$Outlet_Location_Type <- as.factor(test$Outlet_Location_Type)
test$Item_Type <- as.factor(test$Item_Type)
test$Item_Fat_Content <- as.factor(test$Item_Fat_Content)
str(test)



We further remove identifiers such as Item_Identifier and Outlet_Identifier so that we can build an ML model on the remaining features.

removing identifier

# drop Item_Identifier (column 1) and Outlet_Identifier (column 7)
train <- train[, -c(1, 7)]

Machine learning 

Let us apply ML algorithms to predict the target variable.

We start with a random forest; from there, one can apply different algorithms or try alternatives such as cross-validation and ensembling to improve accuracy (a cross-validation sketch follows the random forest code below).

ML

library(randomForest)
rf_model1 <- randomForest(Item_Outlet_Sales ~ ., data = train,
                          importance = TRUE, ntree = 800)  # taking all variables
 

# test must have gone through the same imputation and factor conversion
pred_rf1 <- predict(rf_model1, test)
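
As mentioned above, cross-validation gives a more reliable accuracy estimate than a single fit. Here is a minimal sketch using the caret package; the fold count, mtry grid, and ntree value are illustrative, not tuned.

cross-validation

library(caret)

# 5-fold cross-validation of the random forest, tuning mtry
# (caret::train is written out in full so it is not confused with the train data frame)
ctrl <- trainControl(method = "cv", number = 5)
cv_model <- caret::train(Item_Outlet_Sales ~ ., data = train,
                         method = "rf", trControl = ctrl,
                         tuneGrid = data.frame(mtry = c(2, 4, 6)),
                         ntree = 300)
cv_model$results  # RMSE and R-squared for each mtry value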


To further check the model, we can simply remove the least important variables one by one and observe the effect on accuracy.

ML

rf_model1              # details about the model
varImpPlot(rf_model1)  # importance of variables with respect to the target variable
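
A minimal sketch of that drop-and-compare idea: find the least important variable by %IncMSE, refit without it, and compare the out-of-bag error of the two fits. The name rf_model2 is just for illustration.

ML

# least important variable by %IncMSE (type = 1 requires importance = TRUE)
imp_scores <- importance(rf_model1, type = 1)
least <- rownames(imp_scores)[which.min(imp_scores[, 1])]

# refit without that variable
rf_model2 <- randomForest(Item_Outlet_Sales ~ .,
                          data = train[, setdiff(names(train), least)],
                          importance = TRUE, ntree = 800)

# final out-of-bag MSE of each fit
tail(rf_model1$mse, 1)
tail(rf_model2$mse, 1)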

We can further save the result as a CSV file and submit it.

ML

# pred_rf1 comes from the random forest model above
solution2 <- data.frame(Item_Identifier = test$Item_Identifier,
                        Outlet_Identifier = test$Outlet_Identifier,
                        Item_Outlet_Sales = pred_rf1)

write.csv(solution2, file = "bigmart_sales_2.csv", row.names = FALSE)

We can try different methods, such as creating new features (feature engineering), and try boosting algorithms like xgboost or GBM to further enhance the results.
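
To close, here is a minimal xgboost sketch; xgboost needs a numeric matrix, so factor columns are one-hot encoded with model.matrix first, and the hyperparameters shown are illustrative rather than tuned.

ML

library(xgboost)

# one-hot encode factor columns; xgboost needs a numeric matrix
x_train <- model.matrix(Item_Outlet_Sales ~ . - 1, data = train)
y_train <- train$Item_Outlet_Sales

xgb_model <- xgboost(data = x_train, label = y_train,
                     nrounds = 200, objective = "reg:squarederror",
                     eta = 0.1, max_depth = 6, verbose = 0)

# the test set must be one-hot encoded with the same columns and
# factor levels before calling predict(xgb_model, x_test)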