
R For Data and Visualization: Case Study: Retail Analytics 1 - GGPlot Examples


Everywhere we go, we see new malls and stores opening, and business seems brisk: people generally prefer to visit newly opened stores and end up buying once they see the offers and deals. One question worth asking is how these big malls woo customers to come and shop. The word of the moment is 'data': these malls continuously store data about purchases and combinations of purchases.

These malls and stores store customer data effectively and update it regularly. Small shops do not, which is why they lose their competitive edge even after serving customers for years.

In this tutorial, we explore how such big malls use analytics to predict sales, gain a competitive edge, and thereby stay ahead of the competition.

Table of contents

1 Introduction
2 Problem statement
3 Loading libraries
4 Loading data
5 Exploratory data analysis
6 Data visualization
7 Model preparation
8 Machine learning algorithm


To start, we consider the problem statement of the dataset chosen for this task.
The dataset used for this retail study is taken from the Analytics Vidhya site and is called Big Mart Sales.

Problem Statement

The data scientists at BigMart have collected 2013 sales data for 1559 products across 10 stores in different cities. Certain attributes of each product and store have also been defined. The aim is to build a predictive model and find out the sales of each product at a particular store. Using this model, BigMart will try to understand the properties of products and stores that play a key role in increasing sales. We need to predict the sales for the test dataset.

Loading libraries

We start by loading the required libraries.


library(data.table) # data manipulation
library(ggplot2)    # visualization

Loading dataset


train <- fread("Train_UWu5bXk.csv", header = TRUE, stringsAsFactors = FALSE) # fast read of the training set

test <- fread("Test_u94Q5KV.csv", header = TRUE, stringsAsFactors = FALSE) # fast read of the test set

Data exploration
Let's start by exploring the dataset in order to meet our objectives.

Inspecting the structure of the training data gives the following variables:

Observations: 8,523
Variables: 12
$ Item_Identifier           <chr> "FDA15", "DRC01", "FDN15", "FDX07", "NCD19", "FDP36", "FDO10", "FDP10...
$ Item_Weight               <dbl> 9.300, 5.920, 17.500, 19.200, 8.930, 10.395, 13.650, NA, 16.200, 19.2...
$ Item_Fat_Content          <chr> "Low Fat", "Regular", "Low Fat", "Regular", "Low Fat", "Regular", "Re...
$ Item_Visibility           <dbl> 0.016047301, 0.019278216, 0.016760075, 0.000000000, 0.000000000, 0.00...
$ Item_Type                 <chr> "Dairy", "Soft Drinks", "Meat", "Fruits and Vegetables", "Household",...
$ Item_MRP                  <dbl> 249.8092, 48.2692, 141.6180, 182.0950, 53.8614, 51.4008, 57.6588, 107...
$ Outlet_Identifier         <chr> "OUT049", "OUT018", "OUT049", "OUT010", "OUT013", "OUT018", "OUT013",...
$ Outlet_Establishment_Year <int> 1999, 2009, 1999, 1998, 1987, 2009, 1987, 1985, 2002, 2007, 1999, 199...
$ Outlet_Size               <chr> "Medium", "Medium", "Medium", "", "High", "Medium", "High", "Medium",...
$ Outlet_Location_Type      <chr> "Tier 1", "Tier 3", "Tier 1", "Tier 3", "Tier 3", "Tier 3", "Tier 3",...
$ Outlet_Type               <chr> "Supermarket Type1", "Supermarket Type2", "Supermarket Type1", "Groce...
$ Item_Outlet_Sales         <dbl> 3735.1380, 443.4228, 2097.2700, 732.3800, 994.7052, 556.6088, 343.552...

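As we can see, the dataset contains variables of different types (character, double and integer), and each of them reveals different information. Note that Item_Weight already shows an NA and Outlet_Size an empty string.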
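The structure output above already hints at missing values. A minimal sketch of how they could be counted per column follows; it uses only the first eight values of Item_Weight and Outlet_Size copied from that output, and the names sample_df and count_missing are hypothetical, chosen for illustration.

```r
# First eight values of two columns, copied from the structure output above
sample_df <- data.frame(
  Item_Weight = c(9.300, 5.920, 17.500, 19.200, 8.930, 10.395, 13.650, NA),
  Outlet_Size = c("Medium", "Medium", "Medium", "", "High", "Medium", "High", "Medium"),
  stringsAsFactors = FALSE
)

# Count values that are NA or an empty string
# (count_missing is a hypothetical helper, not part of any package)
count_missing <- function(x) sum(is.na(x) | x == "", na.rm = TRUE)

missing_counts <- sapply(sample_df, count_missing)
print(missing_counts)
```

On the full data, the same helper could be applied with sapply(train, count_missing), assuming train has been loaded as shown earlier.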

Data visualization
To understand the data and its relevance to the target variable Item_Outlet_Sales, we create several plots.

Univariate analysis


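ggplot(data = train, mapping = aes(x = Item_Weight, y = Item_Outlet_Sales)) +
  geom_point()  # the geom layer is missing in the original; a scatter plot is assumed here
## the weight of an item has no visible effect on sales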
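The visual impression that weight does not drive sales could be backed by a correlation coefficient. The sketch below is only an illustration of the call: it uses the first six rows copied from the structure output shown earlier, which is far too few to draw any conclusion, and the variable names are hypothetical.

```r
# First six rows, copied from the structure output shown earlier
item_weight <- c(9.300, 5.920, 17.500, 19.200, 8.930, 10.395)
item_sales  <- c(3735.1380, 443.4228, 2097.2700, 732.3800, 994.7052, 556.6088)

# Pearson correlation; "complete.obs" drops pairs where either value is NA
weight_sales_cor <- cor(item_weight, item_sales, use = "complete.obs")
print(weight_sales_cor)
```

On the full data, the equivalent call would be cor(train$Item_Weight, train$Item_Outlet_Sales, use = "complete.obs"); a value near zero would be consistent with the plot.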


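ggplot(data = train, mapping = aes(x = Item_Fat_Content, y = Item_Outlet_Sales)) +
  geom_jitter()  # the geom layer is missing in the original; jittered points are assumed here

## Regular > Low Fat > LF > low fat == reg; all are treated as different categories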

As is evident, the majority of sales in terms of fat content belong to the Low Fat and Regular categories.
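The labels LF, low fat and reg look like spelling variants of the two main categories. Assuming that this reading is right (the source does not confirm it), a minimal sketch of how the labels could be merged is given below; label_map and the example vector are hypothetical names and values used for illustration.

```r
# One example of each label seen in the plot
fat <- c("Low Fat", "Regular", "LF", "low fat", "reg")

# Assumed mapping from spelling variants to the two main categories
label_map <- c("LF" = "Low Fat", "low fat" = "Low Fat", "reg" = "Regular")

# Replace a label only when it appears in the mapping; keep it otherwise
clean_fat <- unname(ifelse(fat %in% names(label_map), label_map[fat], fat))
print(clean_fat)
```

Applied to train$Item_Fat_Content, the same replacement would leave only two categories on the x axis of the plot.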

We can explore the remaining variables in the same way to understand how they relate to sales.

Multivariate analysis


ggplot(data = train, mapping = aes(x = Item_Visibility, y = Item_Outlet_Sales)) +
  geom_jitter(mapping = aes(color = Item_MRP))

## items with low visibility have good sales,
## maybe because their prices are lower


ggplot(data = train, mapping = aes(x = Item_Visibility, y = Item_Outlet_Sales)) +
  geom_jitter(mapping = aes(color = Outlet_Type))

The above plot reveals some insights.

Sales follow the order Supermarket Type1 > Supermarket Type2 > Supermarket Type3 > Grocery Store, which suggests that the type of store matters more than the visibility of the item.
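A ranking of outlet types read from a plot can be checked with a grouped summary. The sketch below shows the data.table expression on the first three rows copied from the structure output shown earlier; sample_dt and the column names of the summary are hypothetical, and three rows only illustrate the mechanics.

```r
library(data.table)

# First three rows of two columns, copied from the structure output shown earlier
sample_dt <- data.table(
  Outlet_Type = c("Supermarket Type1", "Supermarket Type2", "Supermarket Type1"),
  Item_Outlet_Sales = c(3735.1380, 443.4228, 2097.2700)
)

# Mean and median sales per outlet type, highest mean first
sales_by_type <- sample_dt[, .(mean_sales = mean(Item_Outlet_Sales),
                               median_sales = median(Item_Outlet_Sales),
                               n_items = .N),
                           by = Outlet_Type][order(-mean_sales)]
print(sales_by_type)
```

Running the same expression on train instead of sample_dt would give one row per outlet type, which can be compared against the ordering seen in the plot.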

In a similar way, we can produce more visualizations to extract patterns and insights that help us understand the individual variables.


## converting empty values into NA before creating the factor,
## so that "" does not remain as an unused level
train$Outlet_Size[train$Outlet_Size == ""] <- NA
train$Outlet_Size <- as.factor(train$Outlet_Size)

ggplot(data = train, mapping = aes(x = Outlet_Size, y = Item_Outlet_Sales)) +
  geom_jitter(mapping = aes(color = Outlet_Type))

There are still NA values in the dataset, which we will impute in the next part, but the plot already gives us an idea of the distribution.
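To see how many values remain to be imputed, the missing entries can be counted after the conversion. The following is a minimal sketch on the first eight Outlet_Size values copied from the structure output shown earlier; the vector name outlet_size is hypothetical.

```r
# First eight Outlet_Size values, copied from the structure output shown earlier
outlet_size <- c("Medium", "Medium", "Medium", "", "High", "Medium", "High", "Medium")

# Mark empty strings as missing before building the factor,
# so that "" does not survive as an unused level
outlet_size[outlet_size == ""] <- NA
outlet_size <- factor(outlet_size)

print(levels(outlet_size))

# useNA = "ifany" adds a column for the missing values to the counts
print(table(outlet_size, useNA = "ifany"))
```

The same table() call on train$Outlet_Size would show how large the NA group is compared with the known outlet sizes.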