Python for Data: (5) Data Analysis with Pandas (Basic)

Why pandas?

There are many data analysis tools available, so why pandas?
If you've spent time in spreadsheet software like Microsoft Excel, Apple Numbers, or Google Sheets and are eager to take your data analysis skills to the next level, this tutorial is for you.

Pandas is a powerful tool for working with large data sets. Typical tasks include analyzing, organizing, sorting, filtering, pivoting, aggregating, munging, cleaning, and calculating. You can think of it as "Excel on steroids"!

The data set (also available in the reference) is the one on which we are going to analyse some important statistics. We are going to use 3 Python libraries: pandas for data analysis, numpy for calculation and mathematics, and matplotlib for plotting graphs.
Now let's read the data.
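A minimal sketch of this step follows. So that it runs without the file, it reads a tiny invented stand-in for 'house.csv' that has only the column names mentioned in this tutorial (nbhd, sqft, bedroom, price). The values are made up, and the real file has more rows and columns.

```python
import io

import pandas as pd

# Invented stand-in for 'house.csv' so the snippet runs on its own.
# Column names come from this tutorial; every value is made up.
SAMPLE_CSV = """nbhd,sqft,bedroom,price
nbhd01,1500,2,98000
nbhd01,1800,3,117000
nbhd01,2100,3,130000
nbhd02,1600,3,121000
nbhd02,2000,3,139000
nbhd02,2400,4,160000
nbhd03,1700,2,152000
nbhd03,1900,3,164000
nbhd03,2300,4,194000
"""

# With the real file this would be: df = pd.read_csv("house.csv")
df = pd.read_csv(io.StringIO(SAMPLE_CSV))

print(df.shape)   # (number of rows, number of columns)
print(df.head())  # first 5 rows with all columns
```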
Here we can see the shape of our data and its head (the top 5 rows) with all the columns.
You can see that we have 128 rows and 8 columns here.
We are going to perform the following tasks with this 'house.csv' data set.

1. Load the data using pandas

2. Summarize each field in the data, e.g. mean, max, standard deviation, etc.

3. Group data by the field [nbhd].

(a) Give the average sqft, average price, and average number of bedrooms of each group.

(b) Plot each field ([sqft], [bedroom], [price], etc.) using a boxplot that visualizes its statistical information.

(c) For each group of [nbhd], draw a prediction line for [price] vs [sqft].

Here we are going to look at the data description, i.e. the mean, max, standard deviation, etc. of each column. 'Transpose' works like a matrix transpose: it swaps rows and columns, so each column of our data becomes a row of the summary.
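As a sketch of this step, `describe()` followed by `transpose()` gives one row per numeric field and one column per statistic. The rows below are invented stand-ins, not the real 'house.csv' values.

```python
import pandas as pd

# Invented stand-in rows; with the real data use pd.read_csv("house.csv").
df = pd.DataFrame({
    "nbhd": ["nbhd01"] * 3 + ["nbhd02"] * 3 + ["nbhd03"] * 3,
    "sqft": [1500, 1800, 2100, 1600, 2000, 2400, 1700, 1900, 2300],
    "bedroom": [2, 3, 3, 3, 3, 4, 2, 3, 4],
    "price": [98000, 117000, 130000, 121000, 139000, 160000,
              152000, 164000, 194000],
})

# describe() summarizes the numeric columns: one column per field,
# one row per statistic (count, mean, std, min, quartiles, max).
summary = df.describe()

# transpose() swaps the axes: one row per field, one column per statistic.
summary_t = summary.transpose()
print(summary_t)
```

The text column nbhd is left out of the summary because `describe()` only covers numeric columns by default when a DataFrame mixes types.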
Now let's summarize our data by group using pandas 'groupby'.
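A minimal sketch of the grouped summary, again on invented stand-in rows: group by nbhd, then take the mean of sqft, bedroom and price within each group.

```python
import pandas as pd

# Invented stand-in rows; with the real data use pd.read_csv("house.csv").
df = pd.DataFrame({
    "nbhd": ["nbhd01"] * 3 + ["nbhd02"] * 3 + ["nbhd03"] * 3,
    "sqft": [1500, 1800, 2100, 1600, 2000, 2400, 1700, 1900, 2300],
    "bedroom": [2, 3, 3, 3, 3, 4, 2, 3, 4],
    "price": [98000, 117000, 130000, 121000, 139000, 160000,
              152000, 164000, 194000],
})

# One row per neighborhood, holding the average of each selected field.
group_means = df.groupby("nbhd")[["sqft", "bedroom", "price"]].mean()
print(group_means)
```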
Here is the output of this line of our code. 
We want to see how price depends on the size of the house, the number of bathrooms, and so on, so we use 'groupby' here. Boxplots of sqft, bedroom, and price are drawn with respect to the "nbhd" groupby field.
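One way to draw such boxplots is pandas' own `DataFrame.boxplot` with its `by` argument, sketched below on invented stand-in rows. The output file name and the figure layout are illustrative choices, not taken from the original.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; not needed in a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Invented stand-in rows; with the real data use pd.read_csv("house.csv").
df = pd.DataFrame({
    "nbhd": ["nbhd01"] * 3 + ["nbhd02"] * 3 + ["nbhd03"] * 3,
    "sqft": [1500, 1800, 2100, 1600, 2000, 2400, 1700, 1900, 2300],
    "bedroom": [2, 3, 3, 3, 3, 4, 2, 3, 4],
    "price": [98000, 117000, 130000, 121000, 139000, 160000,
              152000, 164000, 194000],
})

# One panel per field; each panel has one box per neighborhood.
fig, axes = plt.subplots(1, 3, figsize=(12, 4))
for ax, col in zip(axes, ["sqft", "bedroom", "price"]):
    df.boxplot(column=col, by="nbhd", ax=ax)
    ax.set_title(col)
    ax.set_xlabel("nbhd")

fig.suptitle("")  # clear the automatic "Boxplot grouped by nbhd" title
fig.tight_layout()
fig.savefig("boxplots_by_nbhd.png")  # hypothetical output file name
```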
Next is the prediction line plot of "price vs sqft" for "nbhd" = nbhd01.
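The tutorial does not say how the prediction line is computed. The sketch below assumes an ordinary least-squares straight line fitted with `numpy.polyfit`, and uses invented stand-in rows for the nbhd01 group.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; not needed in a notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Invented stand-in rows; with the real data use pd.read_csv("house.csv").
df = pd.DataFrame({
    "nbhd": ["nbhd01"] * 3 + ["nbhd02"] * 3,
    "sqft": [1500, 1800, 2100, 1600, 2000, 2400],
    "price": [98000, 117000, 130000, 121000, 139000, 160000],
})

# Keep only the rows of the first neighborhood.
group = df[df["nbhd"] == "nbhd01"]
x = group["sqft"].to_numpy()
y = group["price"].to_numpy()

# Assumption: the prediction line is a degree-1 least-squares fit.
slope, intercept = np.polyfit(x, y, deg=1)

# Evaluate the fitted line over the observed sqft range.
xs = np.linspace(x.min(), x.max(), 50)

fig, ax = plt.subplots()
ax.scatter(x, y, label="observed")
ax.plot(xs, slope * xs + intercept, color="red", label="fitted line")
ax.set_xlabel("sqft")
ax.set_ylabel("price")
ax.set_title("price vs sqft for nbhd01")
ax.legend()
fig.savefig("price_vs_sqft_nbhd01.png")  # hypothetical output file name
```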
Here one can see the plot, a nice visualization produced by the 'matplotlib' library.
Here you can see how price increases with the square footage (size) of the house for group 01, i.e. [nbhd01]. Now let's look at the second group.
Here is the plot.
Similarly for the last group, [nbhd03].
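Rather than repeating the same code for each group, the three fits can also be produced in one loop over `groupby`. This is only a sketch: it assumes a least-squares straight line via `numpy.polyfit` and runs on invented stand-in rows.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; not needed in a notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Invented stand-in rows; with the real data use pd.read_csv("house.csv").
df = pd.DataFrame({
    "nbhd": ["nbhd01"] * 3 + ["nbhd02"] * 3 + ["nbhd03"] * 3,
    "sqft": [1500, 1800, 2100, 1600, 2000, 2400, 1700, 1900, 2300],
    "price": [98000, 117000, 130000, 121000, 139000, 160000,
              152000, 164000, 194000],
})

fits = {}  # neighborhood name -> (slope, intercept)
fig, ax = plt.subplots()

for name, group in df.groupby("nbhd"):
    x = group["sqft"].to_numpy()
    y = group["price"].to_numpy()
    # Assumption: degree-1 least-squares fit of price on sqft.
    slope, intercept = np.polyfit(x, y, deg=1)
    fits[name] = (slope, intercept)

    xs = np.linspace(x.min(), x.max(), 50)
    ax.scatter(x, y)
    ax.plot(xs, slope * xs + intercept, label=name)

ax.set_xlabel("sqft")
ax.set_ylabel("price")
ax.legend(title="nbhd")
fig.savefig("price_vs_sqft_all_groups.png")  # hypothetical output file name

for name, (slope, intercept) in fits.items():
    print(f"{name}: price = {slope:.2f} * sqft + {intercept:.2f}")
```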
Here we are done with the basic pandas operations. We have seen how to analyze data and visualize the results, and how price depends on the size of the house and other factors across the 3 groups of house data.


What's Next 

Linear Regression via Normal Equations using the same 'house.csv' data,
where we will learn the following matrix methods to perform linear regression:
(a) Gaussian elimination
(b) Cholesky decomposition
(c) QR decomposition