Python for Data: (13) Naive Bayes Classifier using SkLearn

Introduction

The core idea is to apply Bayes' theorem on conditional probability together with a strong (naive) assumption that the features are independent of one another given the class.

Naive Bayes classifiers have worked quite well in many real-world situations, famously document classification and spam filtering.
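To make the independence assumption concrete, here is a minimal sketch of how a Gaussian Naive Bayes posterior could be computed by hand. The function names and the per-class priors, means and variances are hypothetical values made up for illustration; they are not taken from the dataset used below, and this is not how scikit-learn is implemented internally.

```python
import math

def gaussian_pdf(x, mean, var):
    """Density of a normal distribution with the given mean and variance."""
    return math.exp(-((x - mean) ** 2) / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)

def naive_bayes_posterior(x, priors, means, variances):
    """Return P(class | x) for each class, assuming independent Gaussian features."""
    scores = {}
    for c in priors:
        # Naive assumption: the joint likelihood is the product
        # of the per-feature likelihoods.
        likelihood = 1.0
        for value, mean, var in zip(x, means[c], variances[c]):
            likelihood *= gaussian_pdf(value, mean, var)
        scores[c] = priors[c] * likelihood
    # Normalise so the posteriors sum to 1 (the denominator in Bayes' theorem).
    total = sum(scores.values())
    return {c: s / total for c, s in scores.items()}

# Hypothetical per-class statistics for two features (illustrative only).
priors = {0: 0.6, 1: 0.4}
means = {0: [30.0, 50000.0], 1: [45.0, 90000.0]}
variances = {0: [64.0, 4.0e8], 1: [64.0, 4.0e8]}

posterior = naive_bayes_posterior([42.0, 85000.0], priors, means, variances)
predicted = max(posterior, key=posterior.get)
print(posterior, predicted)
```

The predicted class is simply the one with the largest posterior. Practical implementations usually work with log-probabilities to avoid numerical underflow when there are many features.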

import numpy as np
import pandas as pd

Importing the dataset

dataset = pd.read_csv('Social_Network_Ads.csv')
X = dataset.iloc[:, [2, 3]].values
y = dataset.iloc[:, 4].values
Separating the features (columns 2 and 3) from the prediction values (column 4).
dataset.head()
X
array([[    19,  19000],
       [    35,  20000],
       [    26,  43000],
       [    27,  57000],
       [    19,  76000],
       [    27,  58000],
       [    27,  84000],...])
y
array([0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, ....])

dataset.head() shows the first five rows of the data imported from the file into the data frame named dataset. X and y are the arrays that hold the features and the prediction values, respectively.

Splitting the dataset into the Training set and Test set

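It is always recommended to split the data and measure the accuracy of the model on the held-out test set. You can play around with test_size and random_state: test_size sets the fraction of the data points held out for testing, and random_state seeds the random selection of those points so that the split is reproducible.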

from sklearn.model_selection import train_test_split  # sklearn.cross_validation was removed in scikit-learn 0.20
X_train,X_test,y_train,y_test = train_test_split(X, y, test_size = 0.25, random_state = 0)
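The roles of test_size and random_state can be illustrated with a small hand-written hold-out split. This is only a sketch of the idea: simple_train_test_split is a hypothetical helper, the 400 integers are a stand-in for real data points, and the shuffling does not reproduce the exact ordering scikit-learn would produce.

```python
import random

def simple_train_test_split(rows, test_size=0.25, seed=0):
    """Shuffle row indices with a seeded generator and hold out a fraction for testing."""
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)      # same seed -> same shuffle
    n_test = int(round(len(rows) * test_size))
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    return [rows[i] for i in train_idx], [rows[i] for i in test_idx]

rows = list(range(400))  # stand-in for 400 data points

train_a, test_a = simple_train_test_split(rows, test_size=0.25, seed=0)
train_b, test_b = simple_train_test_split(rows, test_size=0.25, seed=0)
train_c, test_c = simple_train_test_split(rows, test_size=0.25, seed=1)

print(len(train_a), len(test_a))   # sizes follow test_size
print(test_a == test_b)            # same seed, same split
print(test_a == test_c)            # different seed, different split
```

Fixing the seed is what makes results comparable between runs; changing it gives a different but equally valid split.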

Feature Scaling

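Feature scaling is a method used to standardize the range of independent variables or features of data. In data processing, it is also known as data normalization and is generally performed during the data preprocessing step. Note that the scaler below is fitted on the training set only and then applied, unchanged, to the test set.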

from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)
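What standardization amounts to can be sketched by hand for a single feature column: subtract the mean and divide by the standard deviation, both learned from the training data only. The helper names and the numbers are hypothetical and chosen for easy arithmetic; the population standard deviation is used here because that is what StandardScaler computes.

```python
import statistics

def fit_scaler(column):
    """Learn the mean and population standard deviation of one feature column."""
    return statistics.mean(column), statistics.pstdev(column)

def apply_scaler(column, mean, std):
    """Standardize values using previously learned statistics."""
    return [(v - mean) / std for v in column]

# Made-up values for one feature (illustrative only).
train_col = [20.0, 30.0, 40.0, 50.0]
test_col = [35.0, 60.0]

mean, std = fit_scaler(train_col)               # statistics come from the training data
train_scaled = apply_scaler(train_col, mean, std)
test_scaled = apply_scaler(test_col, mean, std)  # reuse them for the test data

print(train_scaled)
print(test_scaled)
```

Reusing the training statistics on the test column mirrors the fit_transform / transform pair above and keeps information about the test set out of the preprocessing step.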

Fitting Naive Bayes to the Training set

from sklearn.naive_bayes import GaussianNB  # import the model
classifier = GaussianNB() # define classifier object
classifier.fit(X_train, y_train) # fitting the model

Prediction and evaluation of the Test set results

y_pred = classifier.predict(X_test)
classifier.score(X_test,y_test)
    0.90000000000000002
from sklearn.metrics import confusion_matrix,classification_report
print(confusion_matrix(y_test, y_pred))
    [[65 3]
     [ 7 25]]

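The confusion matrix is a table that shows the performance of a classification model (or "classifier") on a set of test data for which the true values are known. In scikit-learn the rows are the true classes and the columns are the predicted classes, so here 65 + 25 = 90 of the 100 test points are classified correctly, which matches the score of 0.90.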
print(classification_report(y_test,y_pred))
             precision    recall  f1-score   support

          0       0.90      0.96      0.93        68
          1       0.89      0.78      0.83        32

avg / total       0.90      0.90      0.90       100

Precision, recall and F1 score are metrics for checking the performance of the model on the data; they are calculated from the values in the confusion matrix.
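As a check, the figures in the report can be recomputed directly from the confusion matrix printed earlier. The sketch below assumes the scikit-learn layout (rows are true classes, columns are predicted classes); per_class_metrics is a hypothetical helper written for this illustration.

```python
def per_class_metrics(cm, k):
    """Precision, recall and F1 for class k of a confusion matrix
    whose rows are true classes and columns are predicted classes."""
    tp = cm[k][k]
    predicted_k = sum(row[k] for row in cm)   # column total: everything predicted as k
    actual_k = sum(cm[k])                     # row total: everything truly k
    precision = tp / predicted_k
    recall = tp / actual_k
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

cm = [[65, 3],
      [7, 25]]

# Accuracy is the share of the diagonal (correct predictions) in the whole table.
accuracy = sum(cm[i][i] for i in range(len(cm))) / sum(sum(row) for row in cm)

# Round to two decimals, as in the printed report.
report = {k: tuple(round(m, 2) for m in per_class_metrics(cm, k))
          for k in range(len(cm))}

print(accuracy)
print(report)
```

For class 1, for example, precision is 25 / (3 + 25) and recall is 25 / (7 + 25), which round to the 0.89 and 0.78 shown in the report.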


The test set predictions look like the plot above, where red and green are the two classes identified; red points in the green region or green points in the red region are the misclassified predictions (errors).