
R for Data: Using ggplot To Create Visualizations In R


1. Introduction

Data is everywhere and in huge volumes, and with it comes the need to understand that data and to base our decisions on the inferences we draw from it.

One of the first major steps in data science is exploring the data and presenting it in the form of informative plots. This step is referred to as data visualization.

We are very visual creatures, as a large portion of the brain is dedicated to visual processing. Images grab our attention easily, and we are immediately drawn to them.

John Tukey was an American mathematician best known for the development of the FFT algorithm and the boxplot. Here are some of his notable quotes on the importance of visualization through graphs.

“The simple graph has brought more information to the data analyst’s mind than any other device.” - John Tukey

“The greatest value of a picture is when it forces us to notice what we never expected to see.” - John Tukey

Graphs give us the ability to show connections or interrelations among two or more things using distinctive dots, lines, bars, and so on.


TABLE OF CONTENTS

1. Introduction
2. Loading ggplot and the mtcars dataset
3. Types of plots
  3.1. Scatterplots
  3.2. Line chart
  3.3. Bar chart
  3.4. Boxplot


2. Loading ggplot and the mtcars dataset

First step: installing the package and loading its library.

INSTALLING GGPLOT

install.packages("ggplot2")
library(ggplot2)

Note: the package needs to be installed only once, but the library must be loaded in every new R session.
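Since installation is needed only once, a script that is run repeatedly can check whether the package is already present before installing it. The snippet below is a minimal sketch of that idea using base R's requireNamespace(); it is one common pattern, not the only way to do this.

```r
# Install ggplot2 only when it is not already available on this machine.
if (!requireNamespace("ggplot2", quietly = TRUE)) {
  install.packages("ggplot2")
}

# Loading the library is still needed in every new session.
library(ggplot2)
```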


Second step- Loading the mtcars data frame.

To understand the basics clearly, we are going to use a simple and small dataset, mtcars, which ships with R itself.

mtcars comprises observations of fuel consumption and 10 aspects of automobile design and performance for 32 automobiles (1973–74 models).

LOADING MTCARS

data()        # showing all inbuilt datasets
mtcars        # showing mtcars dataset
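Before plotting, it usually helps to take a quick look at the shape and the column types of the dataset. The following sketch uses only base R functions; the variable name dims is chosen here for illustration.

```r
# First six rows of the dataset
print(head(mtcars))

# Column names and their types (all columns in mtcars are numeric)
str(mtcars)

# Number of rows (cars) and columns (variables)
dims <- dim(mtcars)
print(dims)
```

Running ?mtcars in an interactive session opens the help page that describes each column.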



3. Types of plots

In this article, we are going to learn how to use ggplot2, one of the most important R packages for generating plots, to make data exploration a fun process.

This article describes the different types of graphs that ggplot2 offers for data exploration, namely:

  1. Scatterplots
  2. Line chart
  3. Bar chart
  4. Boxplot


3.1. Scatterplots

Scatterplots represent the relationship between two variables as points scattered across the plot. They are the most basic plots and simple to make, and every data point is represented individually. However, an overall trend can be harder to see than in a line chart.

A geom is the geometrical object that a plot uses to represent data. Different plots use different geoms: a bar chart uses bar geoms, a line chart uses line geoms, and a boxplot uses boxplot geoms.

Mappings connect variables to aesthetics and are written inside the brackets of aes(). The x and y aesthetics assign variables to the axes, and other aesthetics such as color and fill can be mapped to further variables to be represented in the plot.
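To see how data, mappings and geoms fit together, here is a small sketch on mtcars. The choice of the wt (weight) column and the names base, p_points and p_colored are assumptions made for illustration only.

```r
library(ggplot2)

# Data and aesthetic mappings are declared once in ggplot().
base <- ggplot(data = mtcars, mapping = aes(x = wt, y = mpg))

# Adding a geom decides how the mapped data is drawn: here, as points.
p_points <- base + geom_point()

# An extra aesthetic (color) is mapped to another variable (cyl).
# factor() makes ggplot2 treat the cylinder count as a category.
p_colored <- base + geom_point(mapping = aes(color = factor(cyl)))

print(p_points)
print(p_colored)
```

Both plots share the same data and axes. Only the mapping inside aes() changes what the points encode.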

 

CREATING PLOT FOR MTCARS

ggplot(data = mtcars) +
  geom_point(mapping = aes(x = mpg, y = cyl))  # assigning variables to the x and y axes


mpg - miles per gallon.

cyl - number of cylinders.

Plot no 1

As one can see, more cylinders mean fewer miles per gallon.

One possible explanation is that more cylinders provide more power at the cost of mileage, so next we check this hypothesis.

GGPLOT HP VS CYL

ggplot(data = mtcars)+
geom_point(mapping = aes(x = hp, y = cyl))



Plot no 2

hp - horsepower.


As we can see, more cylinders go along with more horsepower, which supports our hypothesis that the lower mileage comes at the expense of increased horsepower.
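The trends read off the two plots can also be checked numerically. The sketch below uses base R's aggregate() and cor(); the names by_cyl, cor_mpg and cor_hp are illustrative. Note that these numbers describe an association in this small dataset and do not by themselves prove a cause.

```r
# Mean mpg and mean hp for each cylinder count
by_cyl <- aggregate(cbind(mpg, hp) ~ cyl, data = mtcars, FUN = mean)
print(by_cyl)

# Correlation of cylinder count with mileage and with horsepower
cor_mpg <- cor(mtcars$cyl, mtcars$mpg)  # negative: more cylinders, lower mpg
cor_hp  <- cor(mtcars$cyl, mtcars$hp)   # positive: more cylinders, more hp
print(c(cyl_vs_mpg = cor_mpg, cyl_vs_hp = cor_hp))
```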

The car names are stored as row names rather than as a column. To treat them as a variable, we use the following code to move them into a column so that we can represent them in plots as we did for hp and mpg.

CREATING COLUMN CARNAMES

d <- mtcars                     # storing the dataset in a variable
d <- cbind(rownames(d), d)      # column-binding the row names to the dataset
rownames(d) <- NULL             # resetting the row names
colnames(d)[1] <- "CarNames"    # naming the new first column


A slight variation in the plot code gives us richer aesthetics: we can easily add color to make the visualization more informative.


COLOURED GGPLOT

ggplot(data = d) +
  geom_point(mapping = aes(x = mpg, y = hp, color = CarNames))



Plot no 3

Here we have mapped the color aesthetic to a variable, CarNames. Similarly, we can map other aesthetics to variables.

Such mappings do not always work well: CarNames has a different value for every car, so the color scale does not scale well. We will therefore consider another dataset, mpg.
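A quick way to judge whether a variable suits a color mapping is to count its distinct values. The sketch below compares the car names in mtcars with the class column of the mpg dataset (the vehicle type, which is mapped to color later in this article); the variable names are illustrative.

```r
library(ggplot2)  # provides the mpg dataset

# One distinct name per car: far too many colors for a readable legend
n_names <- length(unique(rownames(mtcars)))

# Only a handful of vehicle classes: a manageable color scale
n_class <- length(unique(mpg$class))

print(c(car_names = n_names, mpg_classes = n_class))
```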

mpg comprises observations collected by the US Environmental Protection Agency on 38 car models, with variables such as displ (a car’s engine size in litres) and hwy (a car’s fuel efficiency on the highway, in miles per gallon).

GGPLOT WITH DIFFERENT COLOUR

ggplot(data = mpg) + 
geom_point(mapping = aes(x = displ, y = hwy), color = "orange")



Plot no 4


We can choose any color we like to suit the plot. Because color is set outside aes() here, it applies to every point instead of being mapped to a variable.
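The position of the color argument matters: inside aes() it is mapped to a variable, outside aes() it is a fixed setting. The following sketch contrasts the two on the same data; the names p_mapped and p_fixed are illustrative.

```r
library(ggplot2)

# Inside aes(): color is mapped to the class variable,
# so each vehicle class gets its own color and a legend is added.
p_mapped <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy, color = class))

# Outside aes(): color is a fixed value applied to every point.
p_fixed <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy), color = "orange")

print(p_mapped)
print(p_fixed)
```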

Note: the + sign must be placed at the end of a code line, not at the beginning of the next line.


3.2. Line chart

A line chart or line graph is a type of chart that displays information as a series of data points called 'markers' connected by straight line segments.

LINE CHART DISPL VS HWY

ggplot(data = mpg) + 
geom_smooth(mapping = aes(x = displ, y = hwy))




Plot no 5


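This is a line plot for the same data as the scatter plot above (plot no 4). Note that geom_smooth() draws a smoothed trend line with a shaded confidence band rather than joining the individual points.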
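For a line chart in the strict sense, with markers joined by straight segments, geom_line() can be used. The sketch below first averages hwy for each engine size so that there is one marker per displ value; this averaging step and the names avg_hwy and p_line are assumptions made for illustration.

```r
library(ggplot2)

# One row per engine size, holding the mean highway mileage
avg_hwy <- aggregate(hwy ~ displ, data = mpg, FUN = mean)

# geom_line() joins the markers with straight segments;
# geom_point() draws the markers themselves.
p_line <- ggplot(data = avg_hwy, mapping = aes(x = displ, y = hwy)) +
  geom_line() +
  geom_point()

print(p_line)
```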

We can even combine both, as given below.

COMBINED PLOT FOR LINES AND POINTS

ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) + 
 geom_point(mapping = aes(color = class)) + 
 geom_smooth()



Plot no 6


Apart from combining the line and point geoms, we have represented the different classes of vehicles with different colors, as we did above. This time there are far fewer categories, so the colors are easily distinguishable.
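The smoothing layer can be adjusted through its arguments. As a sketch, the variation below replaces the default smoother with a straight linear fit (method = "lm") and hides the confidence band (se = FALSE); whether a straight line suits this data is a judgment call, and the name p_lm is illustrative.

```r
library(ggplot2)

p_lm <- ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
  geom_point(mapping = aes(color = class)) +
  # Linear fit over all points, without the shaded confidence band
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE)

print(p_lm)
```

Because color is mapped only inside geom_point(), the fitted line is computed over all vehicles rather than once per class.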



3.3. Bar chart


Bar charts are mostly used to represent the frequencies of different categories. Often we need to compare the results of different surveys, or of different conditions within the same overall survey. Bar charts are often excellent for illustrating differences between two distributions.

Consider another dataset, diamonds, to plot bar geoms.


BAR CHART USING DIAMONDS DATASET

ggplot(data = diamonds) +
  geom_bar(mapping = aes(x = cut))


Plot no 7

cut represents the quality of the cut of the diamond.

count represents the number of diamonds in each cut category.
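The count on the y-axis is not a column of diamonds; geom_bar() computes it by counting the rows in each cut category. The sketch below reproduces those numbers with base R's table() and compares them with the bar heights reported by layer_data(); the variable names are illustrative.

```r
library(ggplot2)

# Number of diamonds per cut, computed directly from the data
cut_counts <- table(diamonds$cut)
print(cut_counts)

# The same counts as computed by geom_bar() for the bar heights
p_bar <- ggplot(data = diamonds) +
  geom_bar(mapping = aes(x = cut))
bar_heights <- layer_data(p_bar)$count
print(bar_heights)
```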


COLOURED BAR CHART

ggplot(data = diamonds) + 
  geom_bar(mapping = aes(x = cut, fill = cut))


We can color a bar chart using fill, which fills the entire bars with color.

Plot no 8


GGPLOT USING FILL

ggplot(data = diamonds) + 
  geom_bar(mapping = aes(x = cut, fill = color), position = "fill")



Plot no 9


Here color in the chart represents the diamond colors, from J (worst) to D (best).

Also, position = "fill" stacks the bars and stretches each stack to the same height, which makes proportions easier to compare across cuts.

To make comparison even easier, we can use position = "dodge", which places the bars side by side instead of stacking them.

GGPLOT USING DODGE

ggplot(data = diamonds) +
  geom_bar(mapping = aes(x = cut, fill = color), position = "dodge")


Plot no 10
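The two position settings show different numbers: with "dodge" each bar shows a raw count, while with "fill" each segment shows a share within its cut. A rough sketch of the underlying tables, using base R's table() and prop.table() (the names counts and shares are illustrative):

```r
library(ggplot2)  # provides the diamonds dataset

# Raw counts per cut (rows) and color (columns): what "dodge" displays
counts <- table(diamonds$cut, diamonds$color)
print(counts)

# Share of each color within a cut: what "fill" displays.
# margin = 1 makes every row sum to 1.
shares <- prop.table(counts, margin = 1)
print(round(shares, 2))
```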



3.4. Boxplot

Boxplots are mainly used to show the difference between distributions and to reveal outliers for the different categories of a variable. The box spans the interquartile range (IQR), from the lower quartile to the upper quartile:

IQR = Upper quartile - Lower quartile

Some critical points about boxplots are as follows:

  1. A distribution with a positive skew would have a longer whisker in the positive direction than in the negative direction.
  2. A larger mean than median would also indicate a positive skew.
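The quantities behind a boxplot can be computed by hand. The sketch below does this for the highway mileage of one vehicle class in the mpg dataset; the choice of the "suv" class and all variable names are assumptions made for illustration. By default, geom_boxplot() draws points lying more than 1.5 times the IQR beyond the box as outliers.

```r
library(ggplot2)  # provides the mpg dataset

# Highway mileage for a single vehicle class
hwy_suv <- mpg$hwy[mpg$class == "suv"]

# Lower quartile, median and upper quartile
q <- unname(quantile(hwy_suv, probs = c(0.25, 0.5, 0.75)))

# IQR = Upper quartile - Lower quartile
iqr <- q[3] - q[1]

# Values beyond these fences are drawn as outlier points
lower_fence <- q[1] - 1.5 * iqr
upper_fence <- q[3] + 1.5 * iqr
outliers <- hwy_suv[hwy_suv < lower_fence | hwy_suv > upper_fence]

# A positive difference (mean above median) hints at a positive skew
skew_hint <- mean(hwy_suv) - median(hwy_suv)

print(c(Q1 = q[1], median = q[2], Q3 = q[3], IQR = iqr))
print(outliers)
print(skew_hint)
```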

BOXPLOT

ggplot(data = mpg, mapping = aes(x = class, y = hwy)) + 
  geom_boxplot()

Plot no 11


A slight variation results from switching the x and y-axes.

coord_flip() switches the x and y-axes. This is useful if you want horizontal boxplots. It’s also useful for long labels, as it is sometimes hard to get them to fit on the x-axis without overlapping.

GGPLOT WITH COORD_FLIP

ggplot(data = mpg, mapping = aes(x = class, y = hwy)) +
  geom_boxplot() +
  coord_flip()


Plot no 12
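Horizontal boxplots are often easier to compare when the categories are sorted. As an optional sketch that goes slightly beyond the plots above, base R's reorder() can order the vehicle classes by their median hwy before plotting; the name p_sorted and the axis label are illustrative.

```r
library(ggplot2)

p_sorted <- ggplot(
  data = mpg,
  # reorder() sorts the classes by the median of hwy within each class
  mapping = aes(x = reorder(class, hwy, FUN = median), y = hwy)
) +
  geom_boxplot() +
  coord_flip() +
  labs(x = "class")  # replace the long default axis title

print(p_sorted)
```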