
R for Data: Data transformation in R using dplyr


Introduction

Real-world data is usually messy and unorganized, so it needs preprocessing and some manipulation before it can be analysed. We therefore need fast, reliable ways to transform it.

Datasets vary widely in size and in the kinds of problems they contain. R has a package dedicated to data manipulation: dplyr. In this article we go through the core dplyr functions for manipulating data in R.


TABLE OF CONTENTS

1. Introduction
2. Dplyr package
3. Functions in dplyr
   3.1 Filter
   3.2 Arrange
   3.3 Select
   3.4 Mutate
   3.5 Summarise



Dplyr package
The dplyr package is one of the most widely used R packages for data manipulation. It provides a small set of functions, each of which handles one common manipulation task.

LOADING DPLYR

install.packages("dplyr")  # install from CRAN; only needed once
library(dplyr)             # load the package in every new session


Functions in dplyr

  •  Filter - Pick observations by their values.
  •  Arrange - Reorder the rows.
  •  Select - Pick variables by their names.
  •  Mutate - Create new variables with functions of existing variables.
  •  Summarise - Collapse many values down to a single summary.
These five functions cover most everyday data manipulation tasks. Let's explore them one by one.



1. Filter
It lets us pick observations (rows) by their values; in simple words, it filters a dataset.
We will use the airquality dataset that ships with R to demonstrate all of these functions.

LOADING AIRQUALITY

airq <- airquality[1:5, ]  # keep only the first 5 rows; airq is reused in later examples
airq

  Ozone Solar.R Wind Temp Month Day
1    41     190  7.4   67     5   1
2    36     118  8.0   72     5   2
3    12     149 12.6   74     5   3
4    18     313 11.5   62     5   4
5    NA      NA 14.3   56     5   5


The airquality dataset is a data frame with 153 observations on 6 variables:

[,1] Ozone  numeric  Ozone (ppb)
[,2] Solar.R  numeric  Solar R (lang)
[,3] Wind  numeric  Wind (mph)
[,4] Temp  numeric  Temperature (degrees F)
[,5] Month  numeric  Month (1--12)
[,6] Day  numeric  Day of month (1--31)


Here we can use the filter function to keep only the rows that satisfy a condition. An everyday analogy is straining out the tea leaves while pouring tea. The condition is easy to change, and unlike in many other programming languages we do not need to write a loop to traverse the data frame.

FILTER FUNCTION

a <- filter(airquality, Month == 7, Day == 2)  # keep rows where both conditions hold; result saved to a

a  # print the result
  Ozone Solar.R Wind Temp Month Day
1    49     248  9.2   85     7   2

The meaning of the arguments inside the filter function

  • The first argument is the data frame.
  • The subsequent arguments are the conditions, written in terms of the data frame's variables, that a row must satisfy to be kept. Multiple conditions are combined with AND.
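
Conditions do not have to be simple equality checks. The following is a minimal sketch, using the built-in airquality data, of a few other condition styles; the variable names hot_or_windy, summer and missing_ozone and the thresholds are only illustrative.

```r
library(dplyr)

# Rows where either condition holds: | means OR
hot_or_windy <- filter(airquality, Temp > 90 | Wind > 15)

# Rows from several months at once: %in% matches any of the listed values
summer <- filter(airquality, Month %in% c(6, 7, 8))

# filter() drops rows where the condition evaluates to NA,
# so missing values have to be requested explicitly with is.na()
missing_ozone <- filter(airquality, is.na(Ozone))

nrow(hot_or_windy)
nrow(summer)
nrow(missing_ozone)
```

Separating conditions with commas, as in the example above, is equivalent to joining them with &.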



2. Arrange 
 It works similarly to filter, but instead of selecting rows it changes their order. Just as in everyday usage, to arrange means to put things in a set order. Arrange does not take a condition the way filter does; it takes the names of the columns to sort the rows by, as the following example shows.

ARRANGE FUNCTION



arrange(airquality[1:5, ], desc(Day))  # first five rows, sorted by Day in descending order

  Ozone Solar.R Wind Temp Month Day
1    NA      NA 14.3   56     5   5
2    18     313 11.5   62     5   4
3    12     149 12.6   74     5   3
4    36     118  8.0   72     5   2
5    41     190  7.4   67     5   1


We can pass more than one variable name to sort the data frame. The first variable takes precedence, and each additional variable is used only to break ties in the preceding ones.
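
As a small sketch of sorting on more than one variable, the call below orders the full airquality data by Month and then, within each month, by Temp from highest to lowest. The names sorted and by_ozone are only illustrative.

```r
library(dplyr)

# Sort by Month first (ascending); within each month, hottest days first
sorted <- arrange(airquality, Month, desc(Temp))
head(sorted, 5)

# Missing values are placed at the end, even when desc() is used
by_ozone <- arrange(airquality, desc(Ozone))
tail(by_ozone, 3)
```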

3. Select
 Select lets us pick variables (columns) by name, or by ranges and patterns of names, and drop the ones we do not need. It is useful whenever only a subset of the columns matters for the task at hand.

Suppose you are participating in a data science hackathon. After drawing a correlation plot you know that only a few variables are worth keeping for model building, but the dataset has a large number of variables and the useful ones are scattered across it. Without select you would have to track down the column number of each required variable, which takes time. With select you simply write the names of the few variables you want and get the required data frame, which saves valuable time in a hackathon.

SELECT FUNCTION

select(airq, Ozone, Month)  # keep only the named columns
  Ozone Month
1    41     5
2    36     5
3    12     5
4    18     5
5    NA     5


select(airq, Ozone:Month)  # keep every column from Ozone through Month
  Ozone Solar.R Wind Temp Month
1    41     190  7.4   67     5
2    36     118  8.0   72     5
3    12     149 12.6   74     5
4    18     313 11.5   62     5
5    NA      NA 14.3   56     5
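
Besides exact names and ranges, select also accepts a minus sign and helper functions. The sketch below shows three common variations on the full airquality data; the result names no_day, s_cols and reordered are only illustrative.

```r
library(dplyr)

# Drop a column by prefixing its name with a minus sign
no_day <- select(airquality, -Day)

# Keep columns whose names start with a given prefix (here only Solar.R)
s_cols <- select(airquality, starts_with("S"))

# Move chosen columns to the front and keep all the others after them
reordered <- select(airquality, Month, Day, everything())

names(no_day)
names(s_cols)
names(reordered)
```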





4. Mutate
Mutate adds new variables (features) to a data frame, computed from existing variables. Several new variables can be created in a single call, and mutate works especially well with pipes in R, which we will cover later.

MUTATE FUNCTION

mutate(airq, Ozonepercent = Ozone / 100)  # append a column computed from Ozone
  Ozone Solar.R Wind Temp Month Day Ozonepercent
1    41     190  7.4   67     5   1         0.41
2    36     118  8.0   72     5   2         0.36
3    12     149 12.6   74     5   3         0.12
4    18     313 11.5   62     5   4         0.18
5    NA      NA 14.3   56     5   5           NA
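
To illustrate creating several variables in one call, here is a minimal sketch on the full airquality data. The column names TempC, WindKmh and Hot, and the 30-degree threshold, are assumptions chosen for the example and are not part of the dataset.

```r
library(dplyr)

converted <- mutate(
  airquality,
  TempC   = (Temp - 32) * 5 / 9,  # Fahrenheit to Celsius
  WindKmh = Wind * 1.609344,      # miles per hour to kilometres per hour
  Hot     = TempC > 30            # refers to TempC, created just above
)
head(converted, 3)
```

A variable defined earlier in the same mutate call can be used by the ones that follow it, which is why Hot can refer to TempC here.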
  
If we want to keep only the newly created variables, we can use transmute instead: it returns a data frame containing just the new variables, so we do not have to build them separately one by one.

TRANSMUTE

transmute(airq, Ozonepercent = Ozone / 100)  # keep only the new column
  Ozonepercent
1         0.41
2         0.36
3         0.12
4         0.18
5           NA



5. Summarise

Summarise collapses a data frame to a single row of summary values, which condenses the information and makes it easier to process.
It becomes far more useful together with group_by, as shown below.
group_by splits the data frame into groups of rows that share the same values of the grouping variables, so that summarise returns one row per group instead of one row overall.


SUMMARISE WITH GROUP_BY

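summarise(airq, windmean = mean(Wind))  # one row: mean Wind over the five rows of airq
  windmean
1    10.76

# Assumed definition: `by` is not defined in the text; the output below is
# consistent with grouping the full dataset by Day and Wind
by <- group_by(airquality, Day, Wind)
summarise(by, windmean = mean(Wind))  # each group has a single Wind value, so windmean equals Wind
# A tibble: 139 x 3
# Groups:   Day [?]
     Day  Wind windmean
   <int> <dbl>    <dbl>
 1     1   4.1      4.1
 2     1   6.9      6.9
 3     1   7.4      7.4
 4     1   8.6      8.6
 5     2   5.1      5.1
 6     2   8.0      8.0
 7     2   9.2      9.2
 8     2   9.7      9.7
 9     2  13.8     13.8
10     3   2.8      2.8
# ... with 129 more rows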
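
Grouping by a variable with only a few distinct values usually gives a more informative summary. The sketch below groups the full airquality data by Month and computes one row per month; the names by_month, monthly, mean_temp, mean_ozone and days are only illustrative.

```r
library(dplyr)

by_month <- group_by(airquality, Month)

monthly <- summarise(
  by_month,
  mean_temp  = mean(Temp),
  mean_ozone = mean(Ozone, na.rm = TRUE),  # ignore missing Ozone readings
  days       = n()                         # number of rows in each group
)
monthly
```

Without na.rm = TRUE, the mean of a group that contains any missing Ozone value would itself be NA.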