
### Python for Data: (13) Naive Bayes Classifier using SkLearn

#### Introduction

The whole idea is to apply Bayes' theorem for conditional probability with a strong (naive) assumption that the features are independent of one another given the class.
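As a rough sketch of what that assumption buys, the class posterior factorizes into one term per feature. The symbols here are notation introduced for illustration and are not in the original text: $y$ is the class label and $x_1, \dots, x_n$ are the feature values.

```latex
P(y \mid x_1, \dots, x_n) \;\propto\; P(y) \prod_{i=1}^{n} P(x_i \mid y),
\qquad
\hat{y} = \arg\max_{y} \; P(y) \prod_{i=1}^{n} P(x_i \mid y)
```

In the Gaussian variant used later in this post, each $P(x_i \mid y)$ is modeled as a normal distribution whose mean and variance are estimated per class from the training data.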

Naive Bayes classifiers have worked quite well in many real-world situations, famously document classification and spam filtering.

`import numpy as np`
`import pandas as pd`

#### Importing the dataset

`dataset = pd.read_csv('Social_Network_Ads.csv')`
`X = dataset.iloc[:, [2, 3]].values`
`y = dataset.iloc[:, 4].values`
This separates the feature columns (`X`) from the column of values to predict (`y`).
`dataset.head()`
`X`
```
array([[    19,  19000],
       [    35,  20000],
       [    26,  43000],
       [    27,  57000],
       [    19,  76000],
       [    27,  58000],
       [    27,  84000],...])
```
`y`
`array([0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, ....])`

`dataset.head()` shows the first five rows of the data imported from the file into the data frame named `dataset`. `X` and `y` are the arrays that hold the features and the prediction values respectively.

#### Splitting the dataset into the Training set and Test set

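Splitting the data into a training set and a test set is always recommended, so that accuracy is measured on data the model has not seen. You can play around with `test_size` and `random_state`: the first sets the fraction of the data held out for testing, and the second seeds the random selection of data points so that the split is reproducible.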

`from sklearn.model_selection import train_test_split  # sklearn.cross_validation was removed in scikit-learn 0.20`
`X_train,X_test,y_train,y_test = train_test_split(X, y, test_size = 0.25, random_state = 0)`
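To see what `test_size` and `random_state` do, here is a minimal sketch on a small synthetic array. The names `X_demo` and `y_demo` and the 20-row data are made up for illustration and are not the Social_Network_Ads data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 20 samples, 2 features, alternating labels.
X_demo = np.arange(40).reshape(20, 2)
y_demo = np.arange(20) % 2

# test_size=0.25 holds out a quarter of the rows for testing.
Xa_train, Xa_test, ya_train, ya_test = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=0)

# Reusing the same random_state reproduces exactly the same split.
Xb_train, Xb_test, yb_train, yb_test = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=0)

print(Xa_train.shape, Xa_test.shape)
print(np.array_equal(Xa_test, Xb_test))
```

Changing `random_state` to another integer selects a different set of rows, while the train and test sizes stay the same.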

#### Feature Scaling

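Feature scaling is a method used to standardize the range of the independent variables (features) of the data. In data processing it is also known as data normalization, and it is generally performed during the data preprocessing step. Here the scaler is fitted on the training set only and then applied to the test set, so the test data does not influence the scaling parameters.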

`from sklearn.preprocessing import StandardScaler`
`sc = StandardScaler()`
`X_train = sc.fit_transform(X_train)`
`X_test = sc.transform(X_test)`
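What `fit_transform` and `transform` do can be sketched in plain NumPy: learn the per-column mean and standard deviation from the training rows, then apply the same values to both sets. The rows below are the first rows of `X` printed earlier; splitting them 5/2 into train and test is an assumption made only for this illustration.

```python
import numpy as np

# Illustrative split of the first rows of X (age, salary); not the real split.
X_train_demo = np.array([[19, 19000], [35, 20000], [26, 43000],
                         [27, 57000], [19, 76000]], dtype=float)
X_test_demo = np.array([[27, 58000], [27, 84000]], dtype=float)

# Statistics come from the training rows only.
mean = X_train_demo.mean(axis=0)
std = X_train_demo.std(axis=0)  # population std (ddof=0), as StandardScaler uses

# The same mean and std are applied to both sets.
X_train_scaled = (X_train_demo - mean) / std
X_test_scaled = (X_test_demo - mean) / std

print(X_train_scaled.mean(axis=0))  # close to 0 for each column
print(X_train_scaled.std(axis=0))   # close to 1 for each column
```

After scaling, age and salary are on a comparable scale even though their raw ranges differ by several orders of magnitude.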

#### Fitting Naive Bayes to the Training set

`from sklearn.naive_bayes import GaussianNB  # import the model`
`classifier = GaussianNB() # define classifier object`
`classifier.fit(X_train, y_train) # fitting the model`
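To get a feel for what fitting stores, the following sketch trains `GaussianNB` on synthetic two-class data and inspects the learned class priors and per-class feature means. The data, the seed, and names such as `demo_clf` are hypothetical and chosen only for illustration.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

# Hypothetical data: two well-separated Gaussian blobs, 50 points each.
rng = np.random.default_rng(0)
class0 = rng.normal(loc=[-1.0, -1.0], scale=0.5, size=(50, 2))
class1 = rng.normal(loc=[1.0, 1.0], scale=0.5, size=(50, 2))
X_demo = np.vstack([class0, class1])
y_demo = np.array([0] * 50 + [1] * 50)

demo_clf = GaussianNB()
demo_clf.fit(X_demo, y_demo)

print(demo_clf.class_prior_)  # estimated P(y) for each class
print(demo_clf.theta_)        # per-class mean of each feature

# Class probabilities for two new points, one near each blob centre.
proba = demo_clf.predict_proba([[-1.0, -1.0], [1.0, 1.0]])
print(proba)
```

Each row of `predict_proba` sums to 1, and `predict` simply returns the class with the larger probability.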

#### Prediction and evaluation of the Test set results

`y_pred = classifier.predict(X_test)`
`classifier.score(X_test,y_test)`
`    0.90000000000000002`
`from sklearn.metrics import confusion_matrix,classification_report`
`print(confusion_matrix(y_test, y_pred))`
`[[65  3]`
` [ 7 25]]`

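The confusion matrix is a table that summarizes the performance of a classification model (or "classifier") on a set of test data for which the true values are known. Rows are the true classes and columns are the predicted classes, so here 65 class-0 and 25 class-1 samples are predicted correctly, 3 class-0 samples are predicted as class 1, and 7 class-1 samples are predicted as class 0.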
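As a quick check, the accuracy reported by `classifier.score` can be recovered from the confusion matrix printed above. This is a small sketch that types the matrix in by hand; the variable names are illustrative.

```python
import numpy as np

# Confusion matrix from the output above: rows = true class, columns = predicted.
cm = np.array([[65, 3],
               [7, 25]])

# For a binary problem, ravel() yields TN, FP, FN, TP in that order.
tn, fp, fn, tp = cm.ravel()

# Accuracy is the share of predictions on the diagonal.
accuracy = (tn + tp) / cm.sum()
print(accuracy)  # 0.9
```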
`print(classification_report(y_test,y_pred))`
```
             precision    recall  f1-score   support

          0       0.90      0.96      0.93        68
          1       0.89      0.78      0.83        32

avg / total       0.90      0.90      0.90       100
```

Precision, recall and F1 score are the terms used to check the performance of the model on the data; they are calculated from the values in the confusion matrix.
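The per-class numbers in the report can be reproduced from the four cells of the confusion matrix. The helper `precision_recall_f1` below is a hypothetical name written for this sketch and is not part of scikit-learn.

```python
def precision_recall_f1(tp, fp, fn):
    """Return (precision, recall, f1) for one class from its counts."""
    precision = tp / (tp + fp)  # of everything predicted as this class, how much is right
    recall = tp / (tp + fn)     # of everything truly in this class, how much is found
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Cells of the confusion matrix [[65, 3], [7, 25]].
tn, fp, fn, tp = 65, 3, 7, 25

# Class 1 is the "positive" class.
p1, r1, f1_1 = precision_recall_f1(tp=tp, fp=fp, fn=fn)

# For class 0 the roles swap: its hits are TN, its false alarms are FN.
p0, r0, f1_0 = precision_recall_f1(tp=tn, fp=fn, fn=fp)

print(round(p0, 2), round(r0, 2), round(f1_0, 2))  # 0.9 0.96 0.93
print(round(p1, 2), round(r1, 2), round(f1_1, 2))  # 0.89 0.78 0.83
```

The rounded values match the two class rows of the report, and the support column (68 and 32) is the row sum of the confusion matrix for each class.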

In the plot of the test set predictions, red and green mark the two classes identified; red points in the green region, or green points in the red region, are the misclassified predictions (errors).