
Python for Data: (6) Data pre-processing & Linear Regression with Gradient Descent


Hello Machine Learners & practitioners, 

In this blog we are going to learn how to optimize our parameters to get the best prediction from linear regression. Before that, though, we will work with a data set that needs refining, so you will also see how to refine/pre-process data for this kind of task.

Here are the dependencies that we will be using throughout this blog. pandas is what we use to load the data: since we are working with a text file, it makes loading easy. Here you can download this data set. Let's get started!

Data Pre-processing 

In this part we will refine our data. It includes the following tasks:

1. Convert any non-numeric values to numeric values. For example, you can replace a country name
with an integer value or, more appropriately, use one-hot encoding. [Hint: use a hashmap (dict) or
pandas.get_dummies].

2. If required, drop the rows with missing values (NA).

3. Split the data into train (80%) and test (20%) sets.
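
The three tasks above can be sketched roughly like this. It is only a minimal illustration on a made-up table: the column names ("airline", "distance", "fare"), the values and the random seed are assumptions, not the real data set.

```python
import numpy as np
import pandas as pd

# Toy data; column names and values are made up for illustration.
df = pd.DataFrame({
    "airline": ["AA", "DL", None, "UA", "AA", "DL", "UA", "AA", "DL", "UA"],
    "distance": [500.0, 1200.0, 800.0, np.nan, 950.0, 300.0, 1500.0, 700.0, 1100.0, 400.0],
    "fare": [120.0, 250.0, 180.0, 210.0, 200.0, 90.0, 310.0, 150.0, 240.0, 110.0],
})

# Task 2: drop rows with missing values. Done first, because get_dummies
# would otherwise turn a missing category into a row of zeros.
df = df.dropna()

# Task 1: one-hot encode the non-numeric column.
df = pd.get_dummies(df, columns=["airline"], dtype=float)

# Task 3: shuffle the rows, then split into 80% train and 20% test.
shuffled = df.sample(frac=1.0, random_state=0)
n_train = int(0.8 * len(shuffled))
train, test = shuffled.iloc[:n_train], shuffled.iloc[n_train:]

print(train.shape, test.shape)
```

Shuffling before the split is a design choice here, so that the test rows are not simply the last rows of the file.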




As one can see, we have loaded a text file with the help of pandas.read_fwf. colspecs is a list of pairs (tuples) giving the extents of the fixed-width fields of each line as half-open intervals (i.e., [from, to[ ). The string value ‘infer’ can be used to instruct the parser to try detecting the column specifications from the first 100 rows of the data which are not being skipped via skiprows. In simple words, colspecs defines your columns by character indices: these are the numbers passed as colspecs in the read_fwf call.
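
To make the idea of colspecs concrete, here is a small sketch that reads two fixed-width lines from an in-memory string. The sample lines, the column positions and the column names are assumptions for illustration; the real file's layout may differ.

```python
import io
import pandas as pd

# Two made-up fixed-width lines standing in for the text file.
raw = (
    "CAK ATL  114.47  528\n"
    "CAK MCO  122.47  860\n"
)

# Each (start, end) pair is a half-open character range for one column.
colspecs = [(0, 3), (4, 7), (8, 15), (16, 20)]
names = ["city1", "city2", "fare", "distance"]  # hypothetical column names

df = pd.read_fwf(io.StringIO(raw), colspecs=colspecs, header=None, names=names)
print(df)
```
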
Another way is to split your columns on the comma (,). This only works when the columns in your data are separated by commas; otherwise use the approach above.
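
A comma-separated version of the same idea might look like the sketch below. Again, the sample lines and the column names are made up for illustration.

```python
import io
import pandas as pd

# Made-up comma-separated lines standing in for the text file.
raw = "CAK,ATL,114.47,528\nCAK,MCO,122.47,860\n"
names = ["city1", "city2", "fare", "distance"]  # hypothetical column names

# sep="," is the default for read_csv; it is spelled out here for clarity.
df = pd.read_csv(io.StringIO(raw), sep=",", header=None, names=names)
print(df)
```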

Both ways are correct and work fine.

This is what our data looks like. Here we need to convert categorical data into dummies. Look at the columns [city1], [city2] and so on: these strings need to become integers before we can process them numerically.

Here we convert the categorical data into indicator/dummy variables, and we drop both 'city' columns by using "iloc" since they are not relevant for this task.
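
As a rough sketch of that step, assuming the two city columns come first and that one other string column (called "airline" here, a hypothetical name) still needs encoding:

```python
import pandas as pd

# Made-up rows; only the overall shape mirrors the description above.
df = pd.DataFrame({
    "city1": ["CAK", "CAK", "ALB"],
    "city2": ["ATL", "MCO", "ATL"],
    "fare": [114.47, 122.47, 214.42],
    "airline": ["FL", "DL", "DL"],
})

# Drop the first two (city) columns by position.
df = df.iloc[:, 2:]

# Turn the remaining string column into 0/1 indicator columns.
df = pd.get_dummies(df, columns=["airline"], dtype=int)
print(df)
```
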
Here we are finished with the pre-processing part. We have successfully modified this text file called 'airq402' as per our requirements to perform regression with gradient descent. Let's jump to the next task now, which is to optimize the loss and choose the best parameters to reach the best prediction.
Here is the gradient descent Python function, which accepts 5 inputs: X, y, theta, alpha and the number of iterations. Alpha is the learning rate, meaning how fast you want to travel across your function to reach the minimum loss. Remember, big alpha values can lead to divergence, while small alpha values can take a long time and many iterations to reach the optimum point of the function. Therefore, one needs to choose alpha wisely.
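
A function with those five inputs could look roughly like the sketch below. It is not necessarily the exact function used in this blog: the name gradient_descent, the 1/(2n) scaling of the cost and the returned cost history are assumptions made for illustration.

```python
import numpy as np

def gradient_descent(X, y, theta, alpha, num_iters):
    """Batch gradient descent for linear regression (illustrative sketch)."""
    n_samples = len(y)
    cost_history = []
    for _ in range(num_iters):
        residual = X @ theta - y  # prediction minus actual value
        # Cost at the current theta, before this iteration's update.
        cost_history.append(residual @ residual / (2 * n_samples))
        gradient = X.T @ residual / n_samples
        theta = theta - alpha * gradient  # step against the gradient
    return theta, cost_history

# Tiny noise-free example: y = 1 + 2x, first column of X is the intercept.
x = np.arange(5, dtype=float)
X = np.column_stack([np.ones_like(x), x])
y = 1.0 + 2.0 * x
theta_fit, cost_history = gradient_descent(X, y, np.ones(2), alpha=0.1, num_iters=2000)
print(theta_fit)
```

The toy features here are small, so a fairly large alpha works; unscaled real-world columns usually need a much smaller one.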

It turns out that to make the best line to model the data, we want to pick parameters β (i.e. theta in our code) that allow our predicted value to be as close to the actual value as possible. In other words, we want the distance or residual between our hypothesis m(x) and y to be minimized. So we formally define a cost function using ordinary least squares, which is simply the sum of the squared distances. To find the linear regression line, we minimize this cost function.
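
Written out, and assuming the hypothesis is linear in the parameters with n training rows (n and the index i are notation introduced here for illustration), the cost and its gradient look roughly like this:

```latex
J(\beta) = \sum_{i=1}^{n} \bigl( m(x_i) - y_i \bigr)^2,
\qquad m(x_i) = \beta^{\top} x_i

\nabla_{\beta} J(\beta) = 2 \sum_{i=1}^{n} \bigl( m(x_i) - y_i \bigr)\, x_i
```

Many implementations scale J by 1/(2n). That changes the size of each gradient step but not where the minimum is.
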
Here you can see that we have set our parameters to alpha = 0.00000001 and theta to 1, and we run our algorithm 100 times, meaning the number of iterations was set to 100. The cost is reduced iteration after iteration. You can run the same part of the code with different alpha values and see the difference in convergence time and rate.
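
To get a feel for how alpha changes the behaviour, here is a small self-contained experiment. The data is synthetic (one large-valued feature plus an intercept), and the helper name run_gradient_descent and the two extra alpha values are assumptions chosen for illustration; only alpha = 0.00000001, theta = 1 and the 100 iterations come from the run described above.

```python
import numpy as np

def run_gradient_descent(X, y, theta, alpha, num_iters):
    """Return the final theta and the cost recorded before each update."""
    n_samples = len(y)
    costs = []
    for _ in range(num_iters):
        residual = X @ theta - y
        costs.append(residual @ residual / (2 * n_samples))
        theta = theta - alpha * (X.T @ residual) / n_samples
    return theta, costs

# Synthetic data with a large, unscaled feature.
x = np.linspace(100.0, 2000.0, 50)
X = np.column_stack([np.ones_like(x), x])
y = 50.0 + 0.1 * x

results = {}
for alpha in (1e-8, 1e-7, 1e-5):
    _, costs = run_gradient_descent(X, y, np.ones(2), alpha, 100)
    results[alpha] = costs
    print(f"alpha={alpha:g}: first cost {costs[0]:.3g}, last cost {costs[-1]:.3g}")
```

On this toy data the smallest alpha lowers the cost slowly, the middle one lowers it faster, and the largest one makes the cost grow instead of shrink, which is the divergence mentioned earlier.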

Here we changed the value of our learning rate (alpha). Let's see the output - 

That was gradient descent minimizing your cost function to fit the model effectively and efficiently. With that, we are done with this blog.

What's next 

Regression with Tensorflow
Linear Classification with Stochastic Gradient Descend