Python for Data: (14) Support Vector Machines (SVM) Using SkLearn


A Support Vector Machine (SVM) is also called a large margin classifier because it separates the classes with a boundary that lies as far as possible from the nearest points of each class.
An SVM maps the input space to a feature space, identifies the support vectors (the training points closest to the boundary), and builds the margin that separates the classes.
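
As a rough illustration of the margin and the support vectors, here is a minimal sketch on a made-up toy data set. The arrays X_toy and y_toy, the name toy_clf, and the large C value are assumptions chosen for this example only; they are not part of the tutorial's data.

```python
import numpy as np
from sklearn.svm import SVC

# Hypothetical toy data: two well-separated clusters in 2-D.
X_toy = np.array([[1.0, 1.0], [1.5, 2.0], [2.0, 1.0],
                  [5.0, 5.0], [5.5, 6.0], [6.0, 5.0]])
y_toy = np.array([0, 0, 0, 1, 1, 1])

# A large C approximates a hard margin when the classes are separable.
toy_clf = SVC(kernel='linear', C=1000.0)
toy_clf.fit(X_toy, y_toy)

# Support vectors are the training points that define the margin.
print(toy_clf.support_vectors_)

# For a linear kernel, the margin width is 2 / ||w||.
w = toy_clf.coef_[0]
margin_width = 2.0 / np.linalg.norm(w)
print(margin_width)
```

Only the points nearest the other class end up as support vectors; the remaining points could move (without crossing the margin) and the boundary would stay the same.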

Importing the libraries

import numpy as np
import pandas as pd

Importing the dataset

dataset = pd.read_csv('Social_Network_Ads.csv')
X = dataset.iloc[:, [2, 3]].values
y = dataset.iloc[:, 4].values

Let's see how our data looks in the data frame after importing, and in matrix form as X and y.
array([[    19,  19000],
       [    35,  20000],
       [    26,  43000],
       [    27,  57000],
       [    19,  76000],
       [    27,  58000],
       [    27,  84000],...])
array([0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, ....])

dataset.head() shows the first five rows (by default) of the data frame named dataset, which holds the data imported from the file.
X and y are the arrays that hold the features and the value to predict, respectively.
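
If the CSV file is not at hand, the same steps can be tried on a small in-memory frame. This is only a sketch: the frame demo, its column names, and the 'User ID' and 'Gender' values are assumptions for illustration; the ages and salaries are the first rows of X shown above.

```python
import pandas as pd

# Hypothetical stand-in for the CSV file; column names are assumed.
demo = pd.DataFrame({
    'User ID': [1, 2, 3, 4, 5, 6, 7],                      # made-up IDs
    'Gender': ['M', 'M', 'F', 'F', 'M', 'M', 'F'],         # made-up values
    'Age': [19, 35, 26, 27, 19, 27, 27],
    'EstimatedSalary': [19000, 20000, 43000, 57000, 76000, 58000, 84000],
    'Purchased': [0, 0, 0, 0, 0, 0, 0],
})

# head() returns the first 5 rows unless another count is passed.
print(demo.head())

# Columns 2 and 3 become a 2-D feature array, column 4 a 1-D label array.
X_demo = demo.iloc[:, [2, 3]].values
y_demo = demo.iloc[:, 4].values
print(X_demo.shape, y_demo.shape)
```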

Splitting the dataset into the Training set and Test set

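It is always recommended to split the data so that the accuracy of the model can be tested on data it has not seen during training. You can play around with test_size and random_state: test_size sets the fraction of the data held out for testing, and random_state seeds the random selection of data points so that the split is reproducible.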

from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test = train_test_split(X, y, test_size = 0.25, random_state = 0)
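
To see the effect of test_size and random_state, here is a small sketch on random data. The arrays X_demo and y_demo are hypothetical; 400 rows are used so that test_size = 0.25 yields 100 test rows.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 400 rows, 2 features, binary labels.
rng = np.random.RandomState(42)
X_demo = rng.rand(400, 2)
y_demo = rng.randint(0, 2, size=400)

# test_size=0.25 holds out a quarter of the rows for testing.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=0)
print(X_tr.shape, X_te.shape)

# Repeating the call with the same random_state reproduces the split.
X_tr2, X_te2, y_tr2, y_te2 = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=0)
same_split = np.array_equal(X_te, X_te2)
print(same_split)
```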

Feature Scaling

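The columns in a data set may not share a similar scale, for example the number of rooms versus the size of a room. An SVM relies on distances between data points, so a feature with a large range would dominate the others. It is therefore good practice to apply feature scaling before feeding the data to the training model. The scaler is fitted on the training set only, and the same transformation is then applied to the test set.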

from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)
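
What the scaler does can be checked by hand. The sketch below uses a few of the rows shown earlier as hypothetical train and test arrays, and compares the output of StandardScaler with the formula (x - mean) / std computed from the training rows.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# A few (age, salary) rows, split by hand for illustration only.
train = np.array([[19, 19000], [35, 20000], [26, 43000], [27, 57000]], dtype=float)
test = np.array([[19, 76000], [27, 58000]], dtype=float)

scaler = StandardScaler()
train_scaled = scaler.fit_transform(train)   # learns mean and std from train
test_scaled = scaler.transform(test)         # reuses the training mean and std

# After scaling, each training column has mean about 0 and std about 1.
print(train_scaled.mean(axis=0), train_scaled.std(axis=0))

# The test rows are scaled with the training statistics, not their own.
manual = (test - train.mean(axis=0)) / train.std(axis=0)
print(np.allclose(test_scaled, manual))
```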

Fitting SVM to the Training set

Defining the SVM object and fitting it with our training data.

from sklearn.svm import SVC
classifier = SVC(kernel = 'linear', random_state = 0)
classifier.fit(X_train, y_train)
The kernels we can use in the SVM model are 'linear', 'poly', 'rbf', 'sigmoid' and 'precomputed'.
Init signature (from the scikit-learn version used here; newer releases default to gamma='scale'):
SVC(C=1.0, kernel='rbf', degree=3, gamma='auto', coef0=0.0, shrinking=True, probability=False, tol=0.001, cache_size=200, class_weight=None, verbose=False, max_iter=-1, decision_function_shape='ovr', random_state=None)
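
One way to choose among the kernels is to fit each one and compare test accuracy. The sketch below does this on a synthetic data set from make_moons, which is an assumption made for the example and not the tutorial's data; 'precomputed' is left out because it expects a kernel matrix rather than raw features.

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic two-class data that is not linearly separable.
X_m, y_m = make_moons(n_samples=400, noise=0.2, random_state=0)
X_m_train, X_m_test, y_m_train, y_m_test = train_test_split(
    X_m, y_m, test_size=0.25, random_state=0)

# Scale using training statistics only.
scaler = StandardScaler()
X_m_train = scaler.fit_transform(X_m_train)
X_m_test = scaler.transform(X_m_test)

# Fit one classifier per kernel and record its test accuracy.
kernel_scores = {}
for kernel in ['linear', 'poly', 'rbf', 'sigmoid']:
    clf = SVC(kernel=kernel, random_state=0)
    clf.fit(X_m_train, y_m_train)
    kernel_scores[kernel] = clf.score(X_m_test, y_m_test)

print(kernel_scores)
```

The ranking of the kernels depends on the data, so the scores from this synthetic set say nothing about which kernel suits the tutorial's data set.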

Predicting the Test set results

y_pred = classifier.predict(X_test)
from sklearn.metrics import confusion_matrix,classification_report
print(confusion_matrix(y_test, y_pred))
    [[66 2]
     [ 8 24]]

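The confusion matrix is a table that summarizes the performance of a classification model (a "classifier") on a set of test data for which the true values are known. Rows are the true classes and columns are the predicted classes, so here 66 + 24 = 90 of the 100 test points are classified correctly and 2 + 8 = 10 are misclassified.

print(classification_report(y_test, y_pred))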

             precision    recall  f1-score   support

          0       0.89      0.97      0.93        68
          1       0.92      0.75      0.83        32

avg / total       0.90      0.90      0.90       100
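
The numbers in the report follow directly from the confusion matrix. As a worked check, the sketch below recomputes them from the matrix printed above; the variable names are chosen for this example.

```python
import numpy as np

# Confusion matrix from above: rows are true classes, columns are predictions.
cm = np.array([[66, 2],
               [8, 24]])
tn, fp, fn, tp = cm.ravel()

accuracy = (tn + tp) / cm.sum()

# Class 1: of the points predicted 1, how many are truly 1 (precision),
# and of the points that are truly 1, how many were found (recall).
precision_1 = tp / (tp + fp)
recall_1 = tp / (tp + fn)
f1_1 = 2 * precision_1 * recall_1 / (precision_1 + recall_1)

# Class 0: the same formulas with the roles of the classes swapped.
precision_0 = tn / (tn + fn)
recall_0 = tn / (tn + fp)
f1_0 = 2 * precision_0 * recall_0 / (precision_0 + recall_0)

print(f"accuracy {accuracy:.2f}")
print(f"class 0: precision {precision_0:.2f} recall {recall_0:.2f} f1 {f1_0:.2f}")
print(f"class 1: precision {precision_1:.2f} recall {recall_1:.2f} f1 {f1_1:.2f}")
```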

In a plot of the test set predictions, red and green mark the two classes identified by the model; a red point in the green region, or a green point in the red region, is a misclassified prediction (an error).
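
The misclassified points can also be picked out without a plot by comparing the predictions with the true labels. This is a minimal sketch on synthetic data from make_classification, which is an assumption for the example; the names ending in _s are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic two-feature, two-class data standing in for the real data set.
X_s, y_s = make_classification(n_samples=400, n_features=2, n_informative=2,
                               n_redundant=0, random_state=0)
X_s_train, X_s_test, y_s_train, y_s_test = train_test_split(
    X_s, y_s, test_size=0.25, random_state=0)

scaler = StandardScaler()
X_s_train = scaler.fit_transform(X_s_train)
X_s_test = scaler.transform(X_s_test)

clf = SVC(kernel='linear', random_state=0)
clf.fit(X_s_train, y_s_train)
y_s_pred = clf.predict(X_s_test)

# Boolean mask: True where the prediction disagrees with the true label.
misclassified = y_s_pred != y_s_test
wrong_points = X_s_test[misclassified]

# The count matches the off-diagonal entries of the confusion matrix.
cm_s = confusion_matrix(y_s_test, y_s_pred)
n_errors = cm_s[0, 1] + cm_s[1, 0]
print(misclassified.sum(), n_errors)
```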