Naive Bayes classification model
: computes the probability of a label given the features: P(label|feature)
Bayes' theorem: P(A|B) = P(B|A)P(A)/P(B)
P(A|B) : the conditional probability that event A occurs given that event B has already occurred
Substituting the classification terms gives P(label|feature) = P(feature|label)P(label)/P(feature).
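As a quick numeric illustration (the numbers below are made up for illustration, not from any dataset), the formula can be applied directly:
p_a = 0.3          # P(A): prior probability of the label (hypothetical value)
p_b_given_a = 0.8  # P(B|A): likelihood of the feature given the label (hypothetical value)
p_b = 0.5          # P(B): marginal probability of the feature (hypothetical value)
print(p_b_given_a * p_a / p_b)  # P(A|B) = 0.48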
* bayes1.py
from sklearn.naive_bayes import GaussianNB
import numpy as np
from sklearn import metrics
x = np.array([1,2,3,4,5])
x = x[:, np.newaxis] # np.newaxis adds a dimension, turning shape (5,) into (5, 1)
print(x)
'''
[[1]
[2]
[3]
[4]
[5]]
'''
y = np.array([1,3,5,7,9])
print(y)
model = GaussianNB().fit(x, y)
pred = model.predict(x)
print(pred) # [1 3 5 7 9]
print('acc :', metrics.accuracy_score(y, pred)) # acc : 1.0
# new data
new_x = np.array([[0.5], [2.3], [12], [0.1]])
new_pred = model.predict(new_x)
print(new_pred) # [1 3 9 1]
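Since the model is probabilistic, P(label|feature) itself can be inspected with predict_proba (a short sketch added here; the exact probabilities depend on the fitted Gaussians, so they are not shown):
proba = model.predict_proba(new_x)  # each row holds P(label|feature) for one sample
print(model.classes_)  # [1 3 5 7 9] - column order of proba
print(proba.shape)     # (4, 5) - one probability per class for each new sample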
- One-hot encoding : converts data into vectors of 0s and 1s (binary)
: applied to the feature data before fitting
: can improve model performance
x = '1,2,3,4,5'
x = x.split(',')
x = np.eye(len(x))  # identity matrix: row i is the one-hot vector for the i-th value
print(x)
'''
[[1. 0. 0. 0. 0.]
[0. 1. 0. 0. 0.]
[0. 0. 1. 0. 0.]
[0. 0. 0. 1. 0.]
[0. 0. 0. 0. 1.]]
'''
y = np.array([1,3,5,7,9])
model = GaussianNB().fit(x, y)
pred = model.predict(x)
print(pred) # [1 3 5 7 9]
print('acc :', metrics.accuracy_score(y, pred)) # acc : 1.0
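For reference, sklearn's LabelBinarizer (not used in the original post) produces the same 5x5 one-hot matrix directly from the raw values:
from sklearn.preprocessing import LabelBinarizer
lb = LabelBinarizer()
print(lb.fit_transform([1, 2, 3, 4, 5]))  # same identity-style 5x5 one-hot matrix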
from sklearn.preprocessing import OneHotEncoder
x = '1,2,3,4,5'
x = x.split(',')
x = np.array(x)
x = x[:, np.newaxis]
print(x)
'''
[['1']
['2']
['3']
['4']
['5']]
'''
one_hot = OneHotEncoder(categories='auto')  # categories are inferred from the data
x = one_hot.fit_transform(x).toarray()
print(x)
'''
[[1. 0. 0. 0. 0.]
[0. 1. 0. 0. 0.]
[0. 0. 1. 0. 0.]
[0. 0. 0. 1. 0.]
[0. 0. 0. 0. 1.]]
'''
y = np.array([1,3,5,7,9])
model = GaussianNB().fit(x, y)
pred = model.predict(x)
print(pred) # [1 3 5 7 9]
print('acc :', metrics.accuracy_score(y, pred)) # acc : 1.0
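The fitted encoder also remembers its categories, which makes decoding and handling unseen values possible; the sketch below uses handle_unknown='ignore', an option not used above, to map an unseen category to an all-zero row:
print(one_hot.categories_)           # the categories learned from the data
print(one_hot.inverse_transform(x))  # recovers the original string labels
safe = OneHotEncoder(categories='auto', handle_unknown='ignore')
safe.fit(np.array(['1', '2', '3'])[:, np.newaxis])
print(safe.transform([['9']]).toarray())  # [[0. 0. 0.]] - unseen value encodes as all zeros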
* bayes3_text.py
# Text classification with a Naive Bayes model
from sklearn.datasets import fetch_20newsgroups
data = fetch_20newsgroups()
print(data.target_names)
categories = ['talk.religion.misc', 'soc.religion.christian',
'sci.space', 'comp.graphics']
train = fetch_20newsgroups(subset='train', categories=categories)
test = fetch_20newsgroups(subset='test', categories=categories)
print(train.data[5]) # a sample document from the training data
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
# convert each document's text content into a numeric vector
model = make_pipeline(TfidfVectorizer(), MultinomialNB()) # chains the two steps so they run in sequence
model.fit(train.data, train.target)
labels = model.predict(test.data)
from sklearn.metrics import confusion_matrix
import matplotlib.pyplot as plt
import seaborn as sns
mat = confusion_matrix(test.target, labels) # build the confusion matrix
sns.heatmap(mat.T, square=True, annot=True, fmt='d', cbar=False,
xticklabels=train.target_names, yticklabels=train.target_names)
plt.xlabel('true label')
plt.ylabel('predicted label')
plt.show()
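The post only visualizes the confusion matrix; a one-line overall score on the test split (not shown in the original) can be added with accuracy_score:
from sklearn.metrics import accuracy_score
print('test acc :', accuracy_score(test.target, labels))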
# utility function that returns the predicted category for a single string
def predict_category(s, train=train, model=model):
    pred = model.predict([s])
    return train.target_names[pred[0]]
print(predict_category('sending a payload to the ISS'))
print(predict_category('discussing islam vs atheism'))
print(predict_category('determining the screen resolution'))
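Because MultinomialNB exposes predict_proba, the pipeline can also report how confident each prediction is; predict_with_proba below is a hypothetical helper in the same style as predict_category:
def predict_with_proba(s, train=train, model=model):
    # runs the same TfidfVectorizer + MultinomialNB steps as predict, but returns probabilities
    proba = model.predict_proba([s])[0]
    idx = proba.argmax()
    return train.target_names[idx], proba[idx]

print(predict_with_proba('sending a payload to the ISS'))  # (category, confidence)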
# Reference: Python Data Science Handbook (publisher: Wikibooks)