1. Introduction

Data is everywhere and there is a great deal of it, and with that comes the need to understand it and to base our decisions on the inferences we draw from it. One of the first major steps in data science is exploring the data and presenting it as informative plots, which is referred to as Data Visualization. We are very visual creatures: a large portion of the brain is dedicated to visual processing, so images grab our attention easily and we are immediately drawn to them.

John Tukey was an American mathematician best known for the development of the FFT algorithm and the boxplot. Here are two of his notable quotes on the importance of visualization through graphs:

“The simple graph has brought more information to the data analyst’s mind than any other device.”

“The greatest value of a picture is when it forces us to notice what we never expected to see.”

Graphs give us the ability to map connections or interrelations among two or more things using distinctive dots, lines, bars, etc.

2. Loading ggplot and the mtcars dataset

First step: install the package and load its library. Note: a package needs to be installed only once, but its library must be loaded in every new session.

Second step: load the mtcars data frame. To understand the basics clearly we are going to use mtcars, a small and simple dataset that ships with R itself and comprises fuel consumption and ten other aspects of design and performance for 32 cars.
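The two steps above can be sketched as follows. This is a minimal example, assuming the package meant is ggplot2 from CRAN; the install line is commented out because it only needs to run once.

```r
# Install once (uncomment on first use).
# install.packages("ggplot2")

# Load the library in every new R session.
library(ggplot2)

# mtcars ships with base R (datasets package); data() makes the load explicit.
data(mtcars)

# Quick look at the structure and the first few rows.
str(mtcars)
head(mtcars)
```

The car names appear as row names in the output of head(), not as a regular column, which matters later when we want to plot them.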
3. Types of plots

In this article we are going to learn how to use ggplot2, one of the most crucial plotting packages in R, to generate plots of data and make data exploration a fun process.
1. Scatterplots

Scatterplots represent the relationship between two variables as points scattered across the plot. They are among the most basic and simplest plots to make, and every data point gets represented; however, the overall trend can be harder to see than in a line chart.

A geom is a geometrical object that a plot uses to represent data. Different plots use different geoms: a bar chart uses bar geoms, a line chart uses line geoms, and a boxplot uses boxplot geoms. Mappings, written in the brackets of aes(), assign variables to aesthetics: the x and y axes, and other aesthetics such as color and fill.
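As a minimal sketch of how a geom and a mapping fit together, the code below draws a scatterplot from the built-in mtcars data. The choice of cyl for the x axis and mpg for the y axis is an assumption for illustration, as is the object name p; the article's original plotting code is not shown.

```r
library(ggplot2)

# Scatterplot: geom_point() draws one point per car,
# and aes() maps cyl to the x axis and mpg to the y axis.
p <- ggplot(data = mtcars) +
  geom_point(mapping = aes(x = cyl, y = mpg))

print(p)
```

Swapping geom_point() for another geom changes how the same mapping is drawn, which is why the mapping and the geom are specified separately.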
mpg - miles per gallon. cyl - number of cylinders.

Plot no 1

As one can see, more cylinders mean fewer miles per gallon. We might suspect that mileage is lower because more cylinders provide more power, but we now have to check that hypothesis.

Plot no 2

hp - horsepower.

As we can see, cars with more cylinders show higher horsepower, which supports our hypothesis that the lower mileage comes at the expense of increased horsepower.

The car names are stored as row names rather than as a variable. To use them in plots as we did for hp and mpg, we first need to bring them into a column.

A slight variation in the plot code gives us richer aesthetics: we can easily add color to the plots to make the visualization more informative.

Plot no 3

Here we have mapped the color aesthetic to a variable, Car Names; other aesthetics can be mapped in the same way. Mapping such aesthetics sometimes becomes difficult: here the Car Names variable has a lot of values and does not scale well, so we will consider another dataset.
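One possible way to turn the car names into a column and map color to them is sketched below, using base R only for the conversion. The names mtcars_named, car_name and p_color are hypothetical, and the hp and mpg axes are an assumption; the article's own code for this step is not shown.

```r
library(ggplot2)

# Work on a copy so the built-in mtcars stays untouched.
mtcars_named <- mtcars

# The car names live in the row names; copy them into an ordinary column.
mtcars_named$car_name <- rownames(mtcars_named)
rownames(mtcars_named) <- NULL

# Map the color aesthetic to the new column.
p_color <- ggplot(data = mtcars_named) +
  geom_point(mapping = aes(x = hp, y = mpg, color = car_name))

print(p_color)
```

Every car has a distinct name, so the legend gets one entry per point, which illustrates why a variable with this many values does not scale well as a color aesthetic.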
Plot no 4

We can choose different colors of our own to make the plot look more suitable. Note: the + sign must be at the end of a code line, not at the beginning of the next line.

2. Line chart

A line chart or line graph is a type of chart which displays information as a series of data points called 'markers' connected by straight line segments.

Plot no 5

This is the line plot for the same data as the scatterplot above, plot no 4. We can even combine the two, as given below.

Plot no 6

Apart from combining line and point geoms, we have represented the different classes of vehicles with different colors as we did above; this time, with fewer categories, the result looks cleaner and is easily distinguishable.

3. Bar charts

Bar charts are mostly used to represent the frequencies of different categories. Often we need to compare the results of different surveys, or of different conditions within the same overall survey. Bar charts are often excellent for illustrating differences between two distributions. Consider another dataset.

Plot no 7

cut represents the quality of the diamond's cut. count represents the number of diamonds in each category. We can color a bar chart using fill, which fills the entire bars with color.

Plot no 8

Plot no 9

Here the colors in the chart represent diamond colors, from J (worst) to D (best). Also, position = "fill" stacks the bars to the same height, so each bar shows proportions rather than counts. To make comparison easier still, we can use position = "dodge", which places the bars side by side instead of stacking them.

Plot no 10

4. Boxplot

Boxplots are mainly used to show the difference between distributions and to highlight outliers for different categories of a variable.

IQR = Upper quartile - Lower quartile

Some critical points about boxplots are as follows.
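The IQR formula above can be checked numerically in base R. The sketch below uses the mpg column of mtcars as an assumed example and applies the common 1.5 * IQR rule to flag outliers; the article does not state which rule it has in mind, so treat the fences as an assumption. All variable names are hypothetical.

```r
# Lower and upper quartiles of miles per gallon in the built-in mtcars data.
q <- quantile(mtcars$mpg, probs = c(0.25, 0.75), names = FALSE)
lower_q <- q[1]
upper_q <- q[2]

# IQR = upper quartile - lower quartile
iqr <- upper_q - lower_q

# Assumption: the common 1.5 * IQR rule for flagging outliers.
lower_fence <- lower_q - 1.5 * iqr
upper_fence <- upper_q + 1.5 * iqr

# Values outside the fences would be drawn as individual points.
outliers <- mtcars$mpg[mtcars$mpg < lower_fence | mtcars$mpg > upper_fence]

print(c(lower = lower_q, upper = upper_q, IQR = iqr))
print(outliers)
```

The built-in IQR() function gives the same value as the manual subtraction, so it can be used as a shortcut once the definition is clear.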
Plot no 11

A slight variation results from switching the x and y axes.
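A boxplot and its axis-switched variant might look like the sketch below. The dataset (mtcars), the variables (mpg grouped by cyl) and the object names are assumptions for illustration, since the article does not show which data its boxplots use; coord_flip() is one way to switch the axes.

```r
library(ggplot2)

# Boxplot of mpg for each cylinder count; factor() makes cyl categorical.
p_box <- ggplot(data = mtcars) +
  geom_boxplot(mapping = aes(x = factor(cyl), y = mpg))

# The same plot with the x and y axes switched.
p_flipped <- p_box + coord_flip()

print(p_box)
print(p_flipped)
```

Flipping the axes tends to help when the category labels are long, because they then read horizontally along the side of the plot.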
Plot no 12