
Python for Data: (7) Linear Classification with Stochastic Gradient Descent

Dear Machine Learners, 
Enough with regression. After regression, classification is the most widely used technique in the world of data analytics/science. Here we look at linear classification with SGD (stochastic gradient descent). SGD is used here to optimize our betas (the model parameters). This time we are using a data-set called 'bank.csv'. Find it here.

Data-set description : 
The data is related to direct marketing campaigns of a Portuguese banking institution. The marketing campaigns were based on phone calls. Often, more than one contact with the same client was required in order to assess whether the product (bank term deposit) would be subscribed ('yes') or not ('no').

Attribute Information:
Input variables:
# bank client data:
1 - age (numeric)
2 - job : type of job (categorical: 'admin.','blue-collar','entrepreneur','housemaid','management','retired','self-employed','services','student','technician','unemployed','unknown')
3 - marital : marital status (categorical: 'divorced','married','single','unknown'; note: 'divorced' means divorced or widowed)
4 - education (categorical: 'basic.4y','basic.6y','basic.9y','','illiterate','professional.course','','unknown')
5 - default: has credit in default? (categorical: 'no','yes','unknown')
6 - housing: has housing loan? (categorical: 'no','yes','unknown')
7 - loan: has personal loan? (categorical: 'no','yes','unknown')
# related with the last contact of the current campaign:
8 - contact: contact communication type (categorical: 'cellular','telephone')
9 - month: last contact month of year (categorical: 'jan', 'feb', 'mar', ..., 'nov', 'dec')
10 - day_of_week: last contact day of the week (categorical: 'mon','tue','wed','thu','fri')
11 - duration: last contact duration, in seconds (numeric). Important note: this attribute highly affects the output target (e.g., if duration=0 then y='no'). Yet, the duration is not known before a call is performed. Also, after the end of the call y is obviously known. Thus, this input should only be included for benchmark purposes and should be discarded if the intention is to have a realistic predictive model.
# other attributes:
12 - campaign: number of contacts performed during this campaign and for this client (numeric, includes last contact)
13 - pdays: number of days that passed by after the client was last contacted from a previous campaign (numeric; 999 means client was not previously contacted)
14 - previous: number of contacts performed before this campaign and for this client (numeric)
15 - poutcome: outcome of the previous marketing campaign (categorical: 'failure','nonexistent','success')
# social and economic context attributes
16 - emp.var.rate: employment variation rate - quarterly indicator (numeric)
17 - cons.price.idx: consumer price index - monthly indicator (numeric) 
18 - cons.conf.idx: consumer confidence index - monthly indicator (numeric) 
19 - euribor3m: euribor 3 month rate - daily indicator (numeric)
20 - nr.employed: number of employees - quarterly indicator (numeric)

Output variable (desired target):
21 - y - has the client subscribed a term deposit? (binary: 'yes','no')

Our target is to classify whether a given client has subscribed to a term deposit or not, based on prior data and observations.
Let's get started.... 
Importing dependencies and loading/reading the data-set. 
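As a rough sketch of this step, the data could be read with pandas. The helper name `load_bank` and the ';' separator are assumptions (the UCI copy of this data-set is semicolon-separated; adjust `sep` if your file differs). A tiny in-memory sample stands in for 'bank.csv' so that the snippet runs on its own.

```python
import io

import pandas as pd


def load_bank(source, sep=";"):
    """Read the bank marketing data into a DataFrame.

    `source` can be a path such as 'bank.csv' or a file-like object.
    The ';' separator is an assumption, not something fixed by this post.
    """
    return pd.read_csv(source, sep=sep)


# Tiny in-memory stand-in for 'bank.csv' (rows invented for illustration).
sample_csv = io.StringIO(
    "age;job;marital;y\n"
    "30;blue-collar;married;no\n"
    "39;services;single;no\n"
    "25;student;single;yes\n"
)

df = load_bank(sample_csv)
print(df.head())  # shows up to the first 5 rows
```

With the real file, the call would simply be `load_bank('bank.csv')`.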
Here are the top 5 rows of the data. Next we refine our data-set, meaning we drop irrelevant columns and convert the categorical variables into dummy variables.
We drop the three columns above because they do not play a vital or influential role in our classification.
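A minimal sketch of the refinement step, assuming pandas. The miniature data frame and the column chosen for dropping are placeholders for illustration only; they are not necessarily the three columns dropped in this post.

```python
import pandas as pd

# Hypothetical miniature of the data-set (values invented for illustration).
df = pd.DataFrame({
    "age": [30, 39, 25, 47],
    "marital": ["married", "single", "single", "divorced"],
    "duration": [487, 346, 227, 58],
    "y": ["no", "no", "yes", "no"],
})

# Placeholder choice of a column to drop.
df = df.drop(columns=["duration"])

# Map the target to 0/1, then one-hot encode the categorical column.
df["y"] = (df["y"] == "yes").astype(int)
df = pd.get_dummies(df, columns=["marital"], dtype=int)

print(df.columns.tolist())
```

Each category becomes its own 0/1 column, so the classifier only ever sees numbers.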
Now we normalize the data.
Since the range of values of raw data varies widely, in some machine learning algorithms the objective functions will not work properly without normalization. For example, many classifiers (such as k-nearest neighbours) calculate the distance between two points using the Euclidean distance. If one of the features has a broad range of values, the distance will be governed by this particular feature. Therefore, the range of all features should be normalized so that each feature contributes approximately proportionately to the final distance.

Another reason why feature scaling is applied is that gradient descent converges much faster with feature scaling than without it.
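One common way to do this is z-score standardization, sketched below with NumPy. Which scaling this post actually uses is an assumption here (min-max scaling would be another option), and the function name `standardize` is made up for the example.

```python
import numpy as np


def standardize(X_train, X_test):
    """Scale each column to zero mean and unit variance.

    The mean and standard deviation come from the training data only,
    so no information from the test data leaks into the scaling.
    """
    mean = X_train.mean(axis=0)
    std = X_train.std(axis=0)
    std[std == 0] = 1.0  # leave constant columns unscaled
    return (X_train - mean) / std, (X_test - mean) / std


# Invented numbers: two features on very different scales.
X_train = np.array([[20.0, 1000.0], [40.0, 3000.0], [60.0, 5000.0]])
X_test = np.array([[40.0, 2000.0]])

X_train_s, X_test_s = standardize(X_train, X_test)
print(X_train_s)
print(X_test_s)
```

After scaling, both columns live on a comparable scale, so neither dominates the gradient.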
Test Data set
np.hstack: stacks arrays in sequence horizontally (column-wise).

1. It takes a sequence of arrays and stacks them horizontally to make a single array. It rebuilds arrays divided by hsplit.
2. This function continues to be supported for backward compatibility,
   but you should prefer np.concatenate or np.stack. The np.stack function was added in NumPy 1.10.
3. flatten: returns a flattened copy of the matrix.
4. All N elements of the matrix are placed into a single row.
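To make these two calls concrete, here is a small sketch. Using np.hstack to prepend a column of ones (so that the first beta acts as an intercept) is an assumption about why the function is needed here; the arrays are invented.

```python
import numpy as np

X = np.array([[0.5, 1.2],
              [0.1, 0.7],
              [0.9, 0.3]])
y = np.array([[1], [0], [1]])  # labels as a (3, 1) column

# Prepend a column of ones so the first beta plays the role of the intercept.
ones = np.ones((X.shape[0], 1))
X_b = np.hstack((ones, X))

# Turn the (3, 1) label column into a flat array of length 3.
y_flat = y.flatten()

print(X_b.shape, y_flat.shape)
```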

Defining functions for randomly shuffling the data and for calculating the log-loss.
1. Random shuffle: This is needed for each epoch of the algorithm. One epoch is defined as the iterations the model takes to see the whole data once. Since SGD sees the records one by one, one epoch is complete when the algorithm has seen the data from the first record to the last.
That is why a random shuffle is required before the start of the next epoch, so that the algorithm doesn't see the data in the same order again. Randomness helps the solution generalize better.

2. In the log-loss function, an offset "epsilon = 1e-15" is used to nudge the value of the logistic (sigmoid) function (exp(x)/(1+exp(x))) slightly, so that it is never exactly 0 or 1. At 0 or 1 the log-loss becomes infinite and hence the algorithm will not converge.
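The two helpers described above might look roughly like this. The function names and signatures are assumptions made for this sketch; only the epsilon value comes from the text.

```python
import numpy as np


def shuffle_data(X, y, rng):
    """Shuffle X and y with the same random permutation."""
    idx = rng.permutation(len(y))
    return X[idx], y[idx]


def sigmoid(z):
    """Logistic function; equal to exp(z) / (1 + exp(z))."""
    return 1.0 / (1.0 + np.exp(-z))


def logloss(y_true, y_prob, epsilon=1e-15):
    """Mean log-loss, with probabilities clipped away from exactly 0 and 1."""
    p = np.clip(y_prob, epsilon, 1 - epsilon)
    return -np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))


rng = np.random.default_rng(0)
X = np.arange(10, dtype=float).reshape(5, 2)
y = np.array([0.0, 1.0, 0.0, 1.0, 1.0])
X_shuf, y_shuf = shuffle_data(X, y, rng)

# Without clipping, a probability of exactly 0 for a 'yes' record would give log(0).
worst_case = logloss(np.array([1.0, 0.0]), np.array([0.0, 1.0]))
print(worst_case)
```

Clipping keeps the worst case large but finite, so the loss can still be compared between epochs.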

Here we need to decide which step-length strategy to use for SGD. Many are out there, such as:
  1. Bold driver
  2. AdaGrad
  3. Adam
  4. Adamax
  5. Adadelta
  6. Nadam
  7. RMSprop
We are going to work with bold driver in this post; we will look at AdaGrad in the next one.
Bold-driver step length for SGD: this algorithm increases the step length mu to (mu x mu_plus) if the new log-loss is better than the old log-loss; otherwise it decreases the step length to (mu x mu_minus). The step size is adapted only once after each epoch, not for every (inner) iteration.
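The bold-driver rule fits in a few lines. The default factors below (mu_plus = 1.1, mu_minus = 0.5) are illustrative assumptions; the rule only needs mu_plus > 1 and 0 < mu_minus < 1.

```python
def bold_driver(mu, old_loss, new_loss, mu_plus=1.1, mu_minus=0.5):
    """Return the step length to use in the next epoch."""
    if new_loss < old_loss:
        # The loss improved: be bolder and take a slightly larger step.
        return mu * mu_plus
    # The loss got worse (or stalled): shrink the step.
    return mu * mu_minus


mu_after_good_epoch = bold_driver(0.1, old_loss=1.0, new_loss=0.9)
mu_after_bad_epoch = bold_driver(0.1, old_loss=1.0, new_loss=1.2)
print(mu_after_good_epoch, mu_after_bad_epoch)
```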

Logistic Regression with SGD:
The log-loss function is minimized to obtain the values of the parameters. The log-loss has no closed-form minimizer, so we need to use gradient descent to optimize the solution.

This is the Python function to perform SGD with logistic regression. Now we are set to use this algorithm on our data-set. Let's see what our loss is at every epoch.
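As a rough sketch of the idea (not necessarily the exact function used in this post), SGD for logistic regression with a bold-driver step length could look like this. The function name, its default values and the synthetic data are all assumptions made so that the snippet runs on its own.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def logloss(y, p, epsilon=1e-15):
    p = np.clip(p, epsilon, 1 - epsilon)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))


def sgd_logistic(X, y, epochs=50, mu=0.05, mu_plus=1.05, mu_minus=0.5, seed=0):
    """Fit logistic regression by SGD; return betas and per-epoch train log-loss."""
    rng = np.random.default_rng(seed)
    beta = np.zeros(X.shape[1])
    old_loss = logloss(y, sigmoid(X @ beta))
    history = []
    for _ in range(epochs):
        # Visit every record once per epoch, in a fresh random order.
        for i in rng.permutation(len(y)):
            error = sigmoid(X[i] @ beta) - y[i]
            beta -= mu * error * X[i]  # log-loss gradient for a single record
        new_loss = logloss(y, sigmoid(X @ beta))
        # Bold driver: adapt the step length once per epoch.
        mu = mu * mu_plus if new_loss < old_loss else mu * mu_minus
        old_loss = new_loss
        history.append(new_loss)
    return beta, history


# Synthetic stand-in data (not the bank data): intercept column plus one feature.
rng = np.random.default_rng(1)
x = rng.normal(size=200)
X = np.hstack((np.ones((200, 1)), x.reshape(-1, 1)))
y = (x + 0.3 * rng.normal(size=200) > 0).astype(float)

beta, history = sgd_logistic(X, y)
print(history[0], history[-1])
```

With all betas at zero the model predicts 0.5 for every record, which gives a log-loss of ln(2), so any useful fit should end up below that value.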

|f(xi−1) − f(xi)| graph: The graph of |f(xi−1) − f(xi)| vs. iteration shows the convergence of the log-loss function. For a good algorithm the difference in the log-loss should decrease with the iterations, and as we move towards the optimal solution the log-loss decreases only minimally.
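Given a list of per-epoch log-loss values, the quantity on the y-axis of this graph is just the absolute difference between consecutive entries. The loss values below are invented for illustration.

```python
import numpy as np

# Hypothetical per-epoch log-loss values (invented for illustration).
losses = np.array([0.69, 0.52, 0.45, 0.42, 0.41, 0.405])

# |f(x_{i-1}) - f(x_i)| for each pair of consecutive epochs.
diffs = np.abs(np.diff(losses))
print(diffs)
```

The resulting array has one entry fewer than the loss history and can be passed straight to a plotting function.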
Plots are always a great way to see convergence. Let's see the results.

As we can see, the loss reaches almost its minimum (zero) after 20 iterations/epochs.
Test log-loss graph: This should also decrease with the iterations. Since the algorithm doesn't see the test data during training, it is important to check the performance of the algorithm on the test data. If the algorithm performs badly on the test data, the model is either overfitting or otherwise failing to generalize.

Conclusion: Here we get worse results, meaning the model does not perform well on the test data (i.e. data the algorithm has not seen). The loss is high and needs to be reduced by using a different model.

What's Next
Ada-grad vs Bold-driver
Let's see who will win the race.