The whole idea of Naive Bayes is to apply conditional probability (Bayes' theorem) under a strong (naive) assumption: that the features are independent of one another given the class.
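Written out for a class $y$ and features $x_1, \dots, x_n$, Bayes' theorem combined with the independence assumption takes the standard form below. In the Gaussian variant, each per-feature likelihood is modeled as a normal density; the symbols $\mu_{iy}$ and $\sigma_{iy}^2$ (class-specific mean and variance of feature $i$) are notation introduced here for illustration.

```latex
\[
P(y \mid x_1, \dots, x_n) \;\propto\; P(y) \prod_{i=1}^{n} P(x_i \mid y),
\qquad
P(x_i \mid y) = \frac{1}{\sqrt{2\pi\sigma_{iy}^2}}
\exp\!\left(-\frac{(x_i - \mu_{iy})^2}{2\sigma_{iy}^2}\right)
\]
```

The predicted class is the $y$ that maximizes the left-hand side.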
dataset.head()
X
array([[ 19, 19000],
[ 35, 20000],
[ 26, 43000],
[ 27, 57000],
[ 19, 76000],
[ 27, 58000],
[ 27, 84000],...])
y
array([0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, ....])
dataset.head() shows the first rows (five by default) of the data imported from the file into the data frame
named dataset. X and y are the arrays that hold the features and the value to predict, respectively.
Splitting the dataset into the Training set and Test set
It is always recommended to split the data and measure the model's accuracy on the part it was not trained on. You can play around with test_size and random_state: they set the fraction of the data held out for testing and the seed for the random selection of data points, respectively.
from sklearn.model_selection import train_test_split # sklearn.cross_validation was removed in scikit-learn 0.20
X_train,X_test,y_train,y_test = train_test_split(X, y, test_size = 0.25, random_state = 0)
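To make the roles of test_size and random_state concrete, here is a minimal pure-Python sketch of a shuffled split. It is not scikit-learn's implementation: shuffled_split is a hypothetical helper, the rounding rule is an assumption, and the last toy row and its label are made up for illustration.

```python
import math
import random

def shuffled_split(rows, labels, test_size=0.25, seed=0):
    """Hypothetical helper: shuffle the indices, then hold out a fraction for testing."""
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)        # seed plays the role of random_state
    n_test = math.ceil(test_size * len(rows))   # held-out count, rounded up (assumption)
    test_idx, train_idx = indices[:n_test], indices[n_test:]

    def pick(seq, idx):
        return [seq[i] for i in idx]

    return (pick(rows, train_idx), pick(rows, test_idx),
            pick(labels, train_idx), pick(labels, test_idx))

# First seven rows come from the X array shown earlier; the eighth is made up
rows = [[19, 19000], [35, 20000], [26, 43000], [27, 57000],
        [19, 76000], [27, 58000], [27, 84000], [32, 150000]]
labels = [0, 0, 0, 0, 0, 0, 0, 1]

X_tr, X_te, y_tr, y_te = shuffled_split(rows, labels, test_size=0.25, seed=0)
print(len(X_tr), len(X_te))  # sizes of the training and test parts
```

Calling the helper again with the same seed returns the same split, which is why fixing random_state makes results reproducible.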
Feature Scaling
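Feature scaling is a method used to standardize the range of independent variables or features of data. In data processing, it is also known as data normalization and is generally performed during the data preprocessing step. Here the scaler is fitted on the training set only, and the same transformation is then applied to the test set.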
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)
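As a rough illustration of what standardization computes, the sketch below rescales one feature to zero mean and unit variance using only the standard library. The helper names fit_scaler and transform are hypothetical, and the test ages are made up; the point is that the mean and standard deviation come from the training values alone.

```python
from statistics import fmean, pstdev

def fit_scaler(train_column):
    """Learn the mean and population standard deviation from training values only."""
    return fmean(train_column), pstdev(train_column)

def transform(column, mu, sigma):
    """Apply z = (x - mu) / sigma with the statistics learned on the training data."""
    return [(x - mu) / sigma for x in column]

train_ages = [19, 35, 26, 27, 19, 27, 27]  # age values from the X array shown earlier
test_ages = [30, 45]                        # made-up test values for illustration

mu, sigma = fit_scaler(train_ages)
train_scaled = transform(train_ages, mu, sigma)
test_scaled = transform(test_ages, mu, sigma)  # reuses the training mean and std

print([round(z, 2) for z in train_scaled])
```

Fitting on the training data and only transforming the test data keeps information about the test set from leaking into the preprocessing.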
Fitting Naive Bayes to the Training set
from sklearn.naive_bayes import GaussianNB # import the model
classifier = GaussianNB() # define classifier object
classifier.fit(X_train, y_train) # fitting the model
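For intuition about what fitting a Gaussian Naive Bayes model involves, here is a from-scratch sketch under simplifying assumptions. It is not scikit-learn's code: fit_gaussian_nb and predict_one are hypothetical helpers, the variance floor of 1e-9 is an arbitrary choice, and the toy data is made up.

```python
import math
from collections import defaultdict

def fit_gaussian_nb(X, y):
    """Estimate a prior and per-feature mean/variance for each class."""
    by_class = defaultdict(list)
    for row, label in zip(X, y):
        by_class[label].append(row)
    model = {}
    for label, rows in by_class.items():
        n = len(rows)
        columns = list(zip(*rows))                      # one tuple per feature
        means = [sum(col) / n for col in columns]
        variances = [sum((v - m) ** 2 for v in col) / n + 1e-9  # small floor avoids zero variance
                     for col, m in zip(columns, means)]
        model[label] = (n / len(X), means, variances)   # (prior, means, variances)
    return model

def predict_one(model, row):
    """Return the class with the highest log posterior (up to a shared constant)."""
    def log_posterior(prior, means, variances):
        log_lik = sum(-0.5 * math.log(2 * math.pi * var) - (x - m) ** 2 / (2 * var)
                      for x, m, var in zip(row, means, variances))
        return math.log(prior) + log_lik
    return max(model, key=lambda label: log_posterior(*model[label]))

# Made-up, already-scaled toy data: two features, two classes
X_toy = [[-1.2, -0.9], [-0.8, -1.1], [-1.0, -0.7],
         [1.1, 0.9], [0.9, 1.2], [1.3, 0.8]]
y_toy = [0, 0, 0, 1, 1, 1]

model = fit_gaussian_nb(X_toy, y_toy)
print(predict_one(model, [1.0, 1.0]))    # point near the class-1 cluster
print(predict_one(model, [-1.0, -1.0]))  # point near the class-0 cluster
```

Summing log-likelihoods per feature is exactly where the independence assumption enters: each feature contributes its own term, with no interaction between features.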
Prediction and evaluation of the Test set results
y_pred = classifier.predict(X_test)
classifier.score(X_test,y_test)
0.90000000000000002
from sklearn.metrics import confusion_matrix,classification_report
print(confusion_matrix(y_test, y_pred))
[[65 3]
[ 7 25]]
The confusion matrix is a table that summarizes the performance of a classification model (or "classifier") on a set of test data for which the true values are known. Rows are the true classes and columns are the predicted classes.
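The four cells of the matrix printed above can be read off directly, and the diagonal reproduces the accuracy returned by classifier.score. A small sketch, assuming scikit-learn's layout for binary labels 0 and 1 (the variable names tn, fp, fn, tp are introduced here):

```python
# Confusion matrix printed above: rows are true classes, columns are predicted classes
cm = [[65, 3],
      [7, 25]]

(tn, fp), (fn, tp) = cm        # true negatives, false positives, false negatives, true positives
total = tn + fp + fn + tp      # number of test samples
accuracy = (tn + tp) / total   # correct predictions lie on the diagonal

print(f"accuracy = {accuracy:.2f}")  # accuracy = 0.90
```

The off-diagonal cells are the errors: 3 points of class 0 predicted as class 1, and 7 points of class 1 predicted as class 0.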
print(classification_report(y_test,y_pred))
             precision    recall  f1-score   support
          0       0.90      0.96      0.93        68
          1       0.89      0.78      0.83        32
avg / total       0.90      0.90      0.90       100
Precision, recall and F1 score measure the model's performance on the test data for each class;
they are calculated from the values in the confusion matrix.
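As a check on that statement, the per-class rows of the report can be recomputed from the confusion matrix alone. This is a minimal sketch using the usual definitions; per_class_metrics is a hypothetical helper, not a scikit-learn function.

```python
cm = [[65, 3],
      [7, 25]]  # rows: true class, columns: predicted class

def per_class_metrics(cm, k):
    """Precision, recall, F1 and support for class k of a square confusion matrix."""
    tp = cm[k][k]
    predicted_k = sum(row[k] for row in cm)  # column sum: everything predicted as k
    actual_k = sum(cm[k])                    # row sum: everything truly k (the support)
    precision = tp / predicted_k
    recall = tp / actual_k
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1, actual_k

for k in (0, 1):
    p, r, f1, support = per_class_metrics(cm, k)
    print(f"{k}  {p:.2f}  {r:.2f}  {f1:.2f}  {support}")
# 0  0.90  0.96  0.93  68
# 1  0.89  0.78  0.83  32
```

Precision asks how many of the points predicted as a class really belong to it; recall asks how many of the points that belong to a class were found; F1 is their harmonic mean.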
In the plot of the test set predictions, red and green mark the two classes identified by the model; a red point in the green region, or a green point in the red region, is a misclassified prediction (an error).