Introduction
Real-world data is usually messy and unorganized, so it needs preprocessing and manipulation before it can be analysed, and we need fast, reliable ways to perform those transformations.
Datasets vary widely in size and in the kinds of ambiguity they contain. R has a package dedicated to manipulating data effectively: dplyr. In this article we will go through the core functions it provides for data manipulation.
TABLE OF CONTENTS
1. Introduction
2. Dplyr package
3. Functions in dplyr
3.1 Filter
3.2 Arrange
3.3 Select
3.4 Mutate
3.5 Summarise
Dplyr package
dplyr is one of the most important R packages for data manipulation. It provides a small set of functions that cover the most common manipulation tasks effectively and efficiently.
LOADING DPLYR
install.packages("dplyr")  # run once to install the package
library(dplyr)             # run in every session to load it
Functions in dplyr
- Filter - Pick observations by their values.
- Arrange - Reorder the rows.
- Select - Pick variables by their names.
- Mutate - Create new variables with functions of existing variables.
- Summarise - Collapse many values down to a single summary.
These five functions cover most everyday data manipulation tasks. Let's explore each of them one by one.
1. Filter
It enables us to pick observations by their values. In simple words, we can easily keep only the rows of a dataset that we are interested in.
We will use the airquality dataset that ships with R to apply all these functions.
LOADING AIRQUALITY
airq <- airquality[1:5, ]  # keep only the first 5 rows for the short examples
airq
Ozone Solar.R Wind Temp Month Day
1 41 190 7.4 67 5 1
2 36 118 8.0 72 5 2
3 12 149 12.6 74 5 3
4 18 313 11.5 62 5 4
5 NA NA 14.3 56 5 5
The airquality dataset is a data frame with 153 observations on 6 variables:
[,1] Ozone numeric Ozone (ppb)
[,2] Solar.R numeric Solar R (lang)
[,3] Wind numeric Wind (mph)
[,4] Temp numeric Temperature (degrees F)
[,5] Month numeric Month (1--12)
[,6] Day numeric Day of month (1--31)
Here we can use the filter function to keep only the rows that satisfy a condition. An everyday analogy is straining out the tea leaves when serving tea. The condition is easy to change to suit our needs, and we don't need to write a loop to traverse the data frame, as we would in many other programming languages.
FILTER FUNCTION
a <- filter(airquality, Month == 7, Day == 2)  # keep rows where Month is 7 and Day is 2
a  # print the result
Ozone Solar.R Wind Temp Month Day
1 49 248 9.2 85 7 2
The meaning of the arguments inside the filter function
- The first argument is the data frame.
- The subsequent arguments are the conditions, written in terms of the data frame's variables, that a row must satisfy to be kept. Multiple conditions are combined with AND.
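To illustrate how conditions are passed to filter, here is a minimal sketch. The names hot_or_windy and july_measured and the thresholds are made up for this example. Comma-separated conditions must all hold, `|` expresses OR, and `!is.na()` drops rows where a value is missing.

```r
library(dplyr)

# rows where it was very hot OR very windy (thresholds chosen arbitrarily)
hot_or_windy <- filter(airquality, Temp > 90 | Wind > 15)

# rows from July that have an Ozone measurement (missing values removed)
july_measured <- filter(airquality, Month == 7, !is.na(Ozone))

head(hot_or_windy)
head(july_measured)
```

Note that filter silently drops rows for which a condition evaluates to NA, so an explicit is.na() check is only needed when you want to keep or inspect the missing values.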
2. Arrange
It works in a similar fashion to filter, but instead of selecting rows it changes their order, just as we use the word arrange in day-to-day life for putting things in an established order. It cannot apply a condition the way filter does, but it sorts the rows by one or more columns, in ascending order by default or in descending order with desc(), as the following example demonstrates.
ARRANGE FUNCTION
arrange(airquality[1:5, ], desc(Day))  # first 5 rows only, sorted by Day in descending order
Ozone Solar.R Wind Temp Month Day
1 NA NA 14.3 56 5 5
2 18 313 11.5 62 5 4
3 12 149 12.6 74 5 3
4 36 118 8.0 72 5 2
5 41 190 7.4 67 5 1
We can use more than one variable name to sort the data frame. The first variable mentioned takes precedence, and each additional variable is used only to break ties in the preceding ones.
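A small sketch of sorting by two variables, assuming the full airquality data frame; the name by_month_temp is made up for this example. Rows are ordered by Month first, and Temp only decides the order among rows of the same month.

```r
library(dplyr)

# sort by Month (ascending); within each month, hottest days first
by_month_temp <- arrange(airquality, Month, desc(Temp))

head(by_month_temp)
```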
3. Select
It allows us to pick variables by name, or by a range of names, and drop the ones we don't need. In other words, select keeps only the required columns, which is especially useful for wide datasets.
Suppose you are participating in a data science hackathon, and after drawing a correlation plot you know that only a few variables are worth considering for model building, but the dataset has a large number of variables and the useful ones are spread across it. Without select you would have to track down the column number of each required variable, which takes time. With select you simply write the names of the few variables you need and get the required data frame, which saves valuable time in a hackathon.
SELECT FUNCTION
select(airq, Ozone, Month)  # keep only the Ozone and Month columns
Ozone Month
1 41 5
2 36 5
3 12 5
4 18 5
5 NA 5
select(airq, Ozone:Month)  # keep every column from Ozone through Month
Ozone Solar.R Wind Temp Month
1 41 190 7.4 67 5
2 36 118 8.0 72 5
3 12 149 12.6 74 5
4 18 313 11.5 62 5
5 NA NA 14.3 56 5
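Besides names and ranges, select accepts a few other ways of describing columns. The sketch below shows three of them on the full airquality data frame; the result names (no_day, solar_only, date_first) are made up for this example.

```r
library(dplyr)

# drop a single column by prefixing its name with a minus sign
no_day <- select(airquality, -Day)

# keep the columns whose names start with a given prefix
solar_only <- select(airquality, starts_with("Solar"))

# move Month and Day to the front, then keep the rest in their original order
date_first <- select(airquality, Month, Day, everything())

names(no_day)
names(solar_only)
names(date_first)
```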
4. Mutate
It is used to create new variables, that is, new features to be added to the data frame. It builds them from existing variables, and several can be created in a single call. It also works especially well with pipes in R, which we will cover later. The new variables are appended to the existing data frame.
MUTATE FUNCTION
mutate(airq, Ozonepercent = Ozone / 100)  # add Ozone divided by 100 as a new column
Ozone Solar.R Wind Temp Month Day Ozonepercent
1 41 190 7.4 67 5 1 0.41
2 36 118 8.0 72 5 2 0.36
3 12 149 12.6 74 5 3 0.12
4 18 313 11.5 62 5 4 0.18
5 NA NA 14.3 56 5 5 NA
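To show several variables being created in one call, here is a minimal sketch on the full airquality data frame. The column names TempC, WindKmh and TempK are made up for this example. A variable defined earlier in the same mutate call can be reused by the ones that follow it.

```r
library(dplyr)

converted <- mutate(
  airquality,
  TempC   = (Temp - 32) * 5 / 9,  # Fahrenheit to Celsius
  WindKmh = Wind * 1.609344,      # miles per hour to kilometres per hour
  TempK   = TempC + 273.15        # reuses TempC, created just above
)

head(converted)
```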
If we intend to keep only the newly created variables, we can use transmute, which returns a data frame containing just those variables. As with mutate, several variables can be created in one call rather than one by one.
TRANSMUTE
transmute(airq, Ozonepercent = Ozone / 100)  # keep only the new column
Ozonepercent
1 0.41
2 0.36
3 0.12
4 0.18
5 NA
5. Summarise
It collapses a data frame to a single row of summary values, such as a mean, which condenses the information and makes it easier to process.
It becomes far more useful in combination with group_by, as shown below.
group_by splits the data frame into groups of rows that share the same values of the grouping variables, so that summarise returns one row per group instead of one row for the whole data frame.
SUMMARISE WITH GROUP_BY
summarise(airq, ozonemean = mean(Wind))  # note: ozonemean holds the mean of Wind, not Ozone
ozonemean
1 10.76
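by <- group_by(airquality, Day, Wind)  # grouping inferred from the output below: one row per Day and Wind pair
summarise(by, ozonemean = mean(Wind))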
# A tibble: 139 x 3
# Groups: Day [?]
Day Wind ozonemean
<int> <dbl> <dbl>
1 1 4.1 4.1
2 1 6.9 6.9
3 1 7.4 7.4
4 1 8.6 8.6
5 2 5.1 5.1
6 2 8.0 8.0
7 2 9.2 9.2
8 2 9.7 9.7
9 2 13.8 13.8
10 3 2.8 2.8
# ... with 129 more rows
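Grouping by a variable with only a few distinct values usually gives a more readable summary. The sketch below, which is an illustration and not part of the original example, groups the full airquality data frame by Month; the names by_month, monthly, days, wind_mean and ozone_mean are made up here.

```r
library(dplyr)

by_month <- group_by(airquality, Month)

monthly <- summarise(
  by_month,
  days       = n(),                       # number of rows in each month
  wind_mean  = mean(Wind),                # Wind has no missing values
  ozone_mean = mean(Ozone, na.rm = TRUE)  # ignore missing Ozone readings
)

monthly
```

The result has one row per month. Without na.rm = TRUE, the Ozone mean would be NA for any month that contains a missing reading.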