Importing the libraries
import numpy as np
import pandas as pd
Importing the dataset
dataset = pd.read_csv('Social_Network_Ads.csv')
X = dataset.iloc[:, [2, 3]].values
y = dataset.iloc[:, 4].values
Let's see how our data looks in the data frame after importing, and as the matrices X and y.
dataset.head()
X
array([[ 19, 19000],
[ 35, 20000],
[ 26, 43000],
[ 27, 57000],
[ 19, 76000],
[ 27, 58000],
[ 27, 84000],...])
y
array([0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, ....])
dataset.head() shows at most the first 5 rows of the data imported from the file into the data frame named dataset; X and y are the arrays holding the features and the target values, respectively.
Splitting the dataset into the Training set and Test set
It is always recommended to split the data so you can test the model's accuracy on unseen data. You can play around with test_size and random_state; they define the fraction of data held out for testing and the random selection of data points, respectively.
from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test = train_test_split(X, y, test_size = 0.25, random_state = 0)
Feature Scaling
In a dataset, all columns may not be on a similar scale, e.g. the number of rooms versus the size of a room. It is always good practice to apply feature scaling to the data before feeding it to the training model.
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)
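To see what StandardScaler does, the toy example below (with hypothetical room-count and room-size features, not the article's dataset) checks that each scaled column ends up with mean 0 and standard deviation 1:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Two features on very different scales: number of rooms vs. size in sq. metres.
X = np.array([[3.0, 80.0],
              [4.0, 120.0],
              [2.0, 55.0],
              [5.0, 150.0]])

sc = StandardScaler()
X_scaled = sc.fit_transform(X)

# Each column is now centred at 0 with unit standard deviation.
print(X_scaled.mean(axis=0).round(6))
print(X_scaled.std(axis=0).round(6))
```

Note that the scaler is fitted on the training set only and then reused to transform the test set, so no information from the test data leaks into the preprocessing.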
Fitting SVM to the Training set
Defining the SVM object and fitting it with our training data.
from sklearn.svm import SVC
classifier = SVC(kernel = 'linear', random_state = 0)
classifier.fit(X_train, y_train)
Other kernels we can use in the SVM model are 'linear', 'poly', 'rbf', 'sigmoid', and 'precomputed'.
Init signature: SVC(C=1.0, kernel='rbf', degree=3, gamma='auto', coef0=0.0, shrinking=True, probability=False, tol=0.001, cache_size=200, class_weight=None, verbose=False, max_iter=-1, decision_function_shape='ovr', random_state=None)
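Since Social_Network_Ads.csv may not be at hand, the sketch below compares the non-precomputed kernels on a synthetic dataset built with make_classification; the data and the resulting scores are illustrative only, not the article's results.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic two-feature binary classification problem as a stand-in dataset.
X, y = make_classification(n_samples=400, n_features=2, n_informative=2,
                           n_redundant=0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)

# Fit one SVC per kernel and record its test accuracy.
scores = {}
for kernel in ['linear', 'poly', 'rbf', 'sigmoid']:
    clf = SVC(kernel=kernel, random_state=0)
    clf.fit(X_train, y_train)
    scores[kernel] = clf.score(X_test, y_test)
    print(kernel, scores[kernel])
```

On nonlinear data the 'rbf' kernel often outperforms 'linear', which is why it is the SVC default.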
Predicting the Test set results
y_pred = classifier.predict(X_test)
classifier.score(X_test,y_test)
0.90000000000000002
from sklearn.metrics import confusion_matrix,classification_report
print(confusion_matrix(y_test, y_pred))
[[66 2]
[ 8 24]]
The confusion matrix is a table that shows the performance of a classification model (or "classifier") on a set of test data for which the true values are known.
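From the printed matrix we can recover the reported scores by hand: with rows as true classes and columns as predicted classes, the diagonal holds the correct predictions.

```python
import numpy as np

# Confusion matrix printed above: rows = true class, columns = predicted class.
cm = np.array([[66, 2],
               [8, 24]])
tn, fp = cm[0]   # true negatives, false positives
fn, tp = cm[1]   # false negatives, true positives

accuracy = (tn + tp) / cm.sum()      # correct predictions / all predictions
precision_1 = tp / (tp + fp)         # precision for class 1
recall_1 = tp / (tp + fn)            # recall for class 1

print(accuracy)               # 0.9
print(round(precision_1, 2))  # 0.92
print(round(recall_1, 2))     # 0.75
```

These match classifier.score and the class-1 row of the classification report below.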
print(classification_report(y_test,y_pred))
precision recall f1-score support
0 0.89 0.97 0.93 68
1 0.92 0.75 0.83 32
avg / total 0.90 0.90 0.90 100
The test set prediction looks like the plot below, where red and green are the two classes identified; red points in the green region (or green points in the red region) are the misclassified predictions (errors).
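Such a decision-region plot is typically drawn by predicting the class over a dense grid of the two features and colouring the regions. A minimal matplotlib sketch, using synthetic stand-in data since the scaled X_train from above isn't available in isolation:

```python
import numpy as np
import matplotlib
matplotlib.use('Agg')  # headless backend so the script runs without a display
import matplotlib.pyplot as plt
from sklearn.svm import SVC

# Synthetic stand-in for the scaled (Age, EstimatedSalary) training data.
rng = np.random.RandomState(0)
X_train = rng.randn(200, 2)
y_train = (X_train[:, 0] + X_train[:, 1] > 0).astype(int)

classifier = SVC(kernel='linear', random_state=0).fit(X_train, y_train)

# Predict the class over a dense grid covering the feature space.
x1, x2 = np.meshgrid(
    np.arange(X_train[:, 0].min() - 1, X_train[:, 0].max() + 1, 0.02),
    np.arange(X_train[:, 1].min() - 1, X_train[:, 1].max() + 1, 0.02))
Z = classifier.predict(np.c_[x1.ravel(), x2.ravel()]).reshape(x1.shape)

# Shade the decision regions and overlay the training points.
plt.contourf(x1, x2, Z, alpha=0.3, cmap=plt.cm.RdYlGn)
plt.scatter(X_train[:, 0], X_train[:, 1], c=y_train,
            cmap=plt.cm.RdYlGn, edgecolors='k')
plt.xlabel('Age (scaled)')
plt.ylabel('Estimated Salary (scaled)')
plt.savefig('svm_decision_boundary.png')
```

With a linear kernel the boundary between the two shaded regions is a straight line; swapping in kernel='rbf' would produce a curved boundary.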