Naive Bayes Classification Model

: given the features, computes the probability of each label: P(label|feature)

 

P(A|B) = P(B|A)P(A)/P(B)

P(A|B) : the conditional probability that event A occurs given that event B has occurred

P(label|feature) = P(feature|label)P(label)/P(feature)
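
For example, plugging numbers into Bayes' rule directly (the values below are made up purely for illustration, not taken from any dataset):

p_label = 0.2                 # prior P(label), e.g. 20% of messages are spam
p_feature_given_label = 0.6   # likelihood P(feature|label)
p_feature = 0.2               # evidence P(feature)

p_label_given_feature = p_feature_given_label * p_label / p_feature
print(p_label_given_feature)  # ~0.6 -> posterior P(label|feature)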

 

 * bayes1.py

from sklearn.naive_bayes import GaussianNB
import numpy as np
from sklearn import metrics

x = np.array([1,2,3,4,5])
x = x[:, np.newaxis] # np.newaxis expands the 1-D array into a column vector
print(x)
'''
[[1]
 [2]
 [3]
 [4]
 [5]]
'''
y = np.array([1,3,5,7,9])
print(y)

model = GaussianNB().fit(x, y)
pred = model.predict(x)
print(pred) # [1 3 5 7 9]
print('acc :', metrics.accuracy_score(y, pred)) # acc : 1.0

# new data
new_x = np.array([[0.5], [2.3], [12], [0.1]])
new_pred = model.predict(new_x)
print(new_pred) # [1 3 9 1]
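
GaussianNB can also return the posterior probability of each class, not just the most likely label. A minimal sketch reusing the model above (the exact probabilities depend on the fitted Gaussians):

proba = model.predict_proba(new_x)
print(model.classes_)  # [1 3 5 7 9]
print(proba.round(3))  # each row sums to 1; the argmax of each row matches predict()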

 

 - One-hot encoding : converts data into 0s and 1s (binary)

: one-hot encode the feature data

: doing so can improve model performance

x = '1,2,3,4,5'
x = x.split(',')
x = np.eye(len(x))  # the identity matrix is a one-hot encoding of the 5 values
print(x)
'''
[[1. 0. 0. 0. 0.]
 [0. 1. 0. 0. 0.]
 [0. 0. 1. 0. 0.]
 [0. 0. 0. 1. 0.]
 [0. 0. 0. 0. 1.]]
'''
y = np.array([1,3,5,7,9])

model = GaussianNB().fit(x, y)
pred = model.predict(x)
print(pred) # [1 3 5 7 9]
print('acc :', metrics.accuracy_score(y, pred)) # acc : 1.0
# The same one-hot encoding via sklearn's OneHotEncoder
from sklearn.preprocessing import OneHotEncoder
x = '1,2,3,4,5'
x = x.split(',')
x = np.array(x)
x = x[:, np.newaxis]
'''
[['1']
 ['2']
 ['3']
 ['4']
 ['5']]
'''

one_hot = OneHotEncoder(categories='auto')  # 'auto' infers the categories from the data
x = one_hot.fit_transform(x).toarray()
print(x)
'''
[[1. 0. 0. 0. 0.]
 [0. 1. 0. 0. 0.]
 [0. 0. 1. 0. 0.]
 [0. 0. 0. 1. 0.]
 [0. 0. 0. 0. 1.]]
'''
y = np.array([1,3,5,7,9])

model = GaussianNB().fit(x, y)
pred = model.predict(x)
print(pred) # [1 3 5 7 9]
print('acc :', metrics.accuracy_score(y, pred)) # acc : 1.0
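
Note that GaussianNB assumes each feature follows a normal distribution, which is an awkward fit for 0/1 one-hot features; BernoulliNB models binary features directly. A minimal sketch on the same one-hot data, as an alternative (it also fits this toy set perfectly):

from sklearn.naive_bayes import BernoulliNB

model2 = BernoulliNB().fit(x, y)
print(model2.predict(x))  # [1 3 5 7 9]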

 

 * bayes3_text.py

# Text classification with the Naive Bayes model
from sklearn.datasets import fetch_20newsgroups

data = fetch_20newsgroups()
print(data.target_names)

categories = ['talk.religion.misc', 'soc.religion.christian',
              'sci.space', 'comp.graphics']

train = fetch_20newsgroups(subset='train', categories=categories)
test = fetch_20newsgroups(subset='test', categories=categories)
print(train.data[5])  # a representative item from the data

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Convert each string's content into a vector of numbers
model = make_pipeline(TfidfVectorizer(), MultinomialNB())  # chain the steps into one pipeline
model.fit(train.data, train.target)
labels = model.predict(test.data)

from sklearn.metrics import confusion_matrix
import matplotlib.pyplot as plt
import seaborn as sns

mat = confusion_matrix(test.target, labels)  # build the confusion matrix
sns.heatmap(mat.T, square=True, annot=True, fmt='d', cbar=False,
            xticklabels=train.target_names, yticklabels=train.target_names)
plt.xlabel('true label')
plt.ylabel('predicted label')
plt.show()

# Utility function that returns the predicted category for a single string
def predict_category(s, train=train, model=model):
    pred = model.predict([s])
    return train.target_names[pred[0]]

print(predict_category('sending a payload to the ISS'))
print(predict_category('discussing islam vs atheism'))
print(predict_category('determining the screen resolution'))
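
To complement the heatmap with a single number, the overall test accuracy can be printed as well. A short sketch reusing the labels computed above:

from sklearn.metrics import accuracy_score
print('test acc :', accuracy_score(test.target, labels))  # overall accuracy on the test set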

# Reference : Python Data Science Handbook (Korean edition published by Wikibooks)

 

 
