Posts in the 'BACK END/Deep Learning' category (Page 2)

[Deep Learning] Neural Network

2021. 3. 19. 12:02

Artificial neural networks

- Theory

brunch.co.kr/@gdhan/6

x1 -> w1 (weight) -> [neuron]

x2 -> w2 (weight) -> w1*x1 + w2*x2 + ... -> output : y

...

↖ compare the predicted value with the actual value and feed the error back to adjust the weights ↙

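Compare the cost (loss) against the weight and find the weight at which the cost is smallest.

Use partial derivatives to find the point where the slope is 0.

learning rate : the step-size ratio applied between one feedback update and the next.

epoch : the number of feedback passes over the training data.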
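
The feedback loop described above can be sketched as plain gradient descent on a single weight. This is only an illustration: it assumes a mean-squared-error cost and made-up data following y = 2x, and the names `fit_weight`, `lr`, and `epochs` are hypothetical.

```python
def fit_weight(xs, ys, lr=0.05, epochs=200):
    """Fit y = w * x by gradient descent on the mean squared error."""
    w = 0.0
    n = len(xs)
    for _ in range(epochs):  # one epoch = one feedback pass over the data
        # slope of the cost with respect to w, for cost = mean((w*x - y)^2)
        grad = sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / n
        w -= lr * grad  # the learning rate scales the size of each correction
    return w

xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]  # made-up data generated from y = 2x
w = fit_weight(xs, ys)
print(round(w, 3))
```

The loop stops changing w once the slope reaches 0, which is the minimum-cost weight.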

=> multiple linear regression

=> y1 = w*x + b (trend line)

=> logistic regression

=> y2 = 1 / (1 + e^(-y1))

=> MLP
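
A small sketch of how the linear output feeds the logistic (sigmoid) function. The weight and bias values are hypothetical and chosen only so that the boundary falls at x = 2.

```python
import math

def linear(x, w, b):
    """Linear regression output: y1 = w*x + b."""
    return w * x + b

def sigmoid(z):
    """Logistic function: squashes any real value into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

w, b = 1.5, -3.0  # hypothetical parameters
for x in (0.0, 2.0, 4.0):
    y1 = linear(x, w, b)
    y2 = sigmoid(y1)
    print(x, y1, round(y2, 3))
```

Values of y2 above 0.5 are read as class 1 and values below as class 0.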

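Single-layer neural network (neuron, node)

: multiplies each input by its weight, sums the results, and compares the sum against a threshold (activation function), which allows binary classification. It can also be used for prediction.

Classifying logic circuits with a single-layer network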

* neural1.py

def or_func(x1, x2):
    w1, w2, theta = 0.5, 0.5, 0.3
    sigma = w1 * x1 + w2 * x2 + 0
    if sigma <= theta:
        return 0
    elif sigma > theta:
        return 1

print(or_func(0, 0)) # 0
print(or_func(1, 0)) # 1
print(or_func(0, 1)) # 1
print(or_func(1, 1)) # 1
print()

def and_func(x1, x2):
    w1, w2, theta = 0.5, 0.5, 0.7
    sigma = w1 * x1 + w2 * x2 + 0
    if sigma <= theta:
        return 0
    elif sigma > theta:
        return 1
    
print(and_func(0, 0)) # 0
print(and_func(1, 0)) # 0
print(and_func(0, 1)) # 0
print(and_func(1, 1)) # 1
print()

def xor_func(x1, x2):
    w1, w2, theta = 0.5, 0.5, 0.5
    sigma = w1 * x1 + w2 * x2 + 0
    if sigma <= theta:
        return 0
    elif sigma > theta:
        return 1
    
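print(xor_func(0, 0)) # 0
print(xor_func(1, 0)) # 0 (sigma 0.5 <= theta 0.5), but XOR expects 1
print(xor_func(0, 1)) # 0 (sigma 0.5 <= theta 0.5), but XOR expects 1
print(xor_func(1, 1)) # 1, but XOR expects 0
print()
# XOR is not satisfied: a single neuron cannot separate it, whatever w1, w2, theta are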

import numpy as np
from sklearn.linear_model import Perceptron

feature = np.array([[0,0], [0,1], [1,0], [1,1]])
#print(feature)
#label = np.array([0, 0, 0, 1]) # and
#label = np.array([0, 1, 1, 1]) # or
label = np.array([1, 1, 1, 0]) # nand
#label = np.array([0, 1, 1, 0]) # xor

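ml = Perceptron(max_iter = 100).fit(feature, label) # max_iter: maximum number of training passes (epochs)
print(ml.predict(feature))
# [0 0 0 1] and
# [0 1 1 1] or
# [1 0 0 0] nand => not satisfied in this run (NAND is linearly separable, so this likely reflects early stopping rather than a limit of the model)
# [0 0 0 0] xor => not satisfied (XOR is not linearly separable)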

from sklearn.linear_model import Perceptron

Perceptron(max_iter = ).fit(x, y) : simple artificial neural network (single-layer perceptron). max_iter - maximum number of training passes

- Perceptron API

scikit-learn.org/stable/modules/generated/sklearn.linear_model.Perceptron.html

MLP

: classifying logic circuits with a multilayer neural network

x1	x2	nand	or	xor
0	0	1	0	0
0	1	1	1	1
1	0	1	1	1
1	1	0	1	0

x1, x2 -> nand --+
                 +--> xor -> y
x1, x2 -> or ----+
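
The table above shows that xor is 1 exactly when both nand and or are 1, so two layers of single neurons are enough. A minimal sketch, assuming hand-picked weights and biases (they are not taken from the post) and hypothetical function names:

```python
def perceptron(x1, x2, w1, w2, bias):
    """Single neuron: outputs 1 when the weighted sum plus bias is positive."""
    return 1 if w1 * x1 + w2 * x2 + bias > 0 else 0

def nand_gate(x1, x2):
    return perceptron(x1, x2, -0.5, -0.5, 0.7)

def or_gate(x1, x2):
    return perceptron(x1, x2, 0.5, 0.5, -0.3)

def and_gate(x1, x2):
    return perceptron(x1, x2, 0.5, 0.5, -0.7)

def xor_gate(x1, x2):
    # Layer 1 computes nand and or; layer 2 combines them with and
    return and_gate(nand_gate(x1, x2), or_gate(x1, x2))

for a, b in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(a, b, xor_gate(a, b))
```

MLPClassifier learns comparable hidden units from the data instead of having them written by hand.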

* neural4_mlp1.py

import numpy as np
from sklearn.neural_network import MLPClassifier

feature = np.array([[0,0], [0,1], [1,0], [1,1]])
#label = np.array([0, 0, 0, 1]) # and
label = np.array([0, 1, 1, 1]) # or
#label = np.array([1, 1, 1, 0]) # nand
#label = np.array([0, 1, 1, 0]) # xor

#ml = MLPClassifier(hidden_layer_sizes=30).fit(feature, label) # hidden_layer_sizes - number of nodes in each hidden layer
#ml = MLPClassifier(hidden_layer_sizes=30, max_iter=400, verbose=1, learning_rate_init=0.1).fit(feature, label)
ml = MLPClassifier(hidden_layer_sizes=(10, 10, 10), max_iter=400, verbose=1, learning_rate_init=0.1).fit(feature, label)
# verbose - print training progress. max_iter - maximum number of training iterations (default 200).
# learning_rate_init - initial learning rate (step size). Larger values train faster but can overshoot; smaller values are finer-grained but slower.
print(ml)
print(ml.predict(feature))
# [0 0 0 1] and
# [0 1 1 1] or
# [1 1 1 0] nand
# [0 1 1 0] xor => all satisfied

from sklearn.neural_network import MLPClassifier

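MLPClassifier(hidden_layer_sizes=, max_iter=, verbose=, learning_rate_init=).fit(x, y) : multilayer neural network.

hidden_layer_sizes : number of nodes in each hidden layer

verbose : print a training progress log

max_iter : maximum number of training iterations (default 200)

learning_rate_init : initial learning rate. (Larger values train faster but can overshoot; smaller values are finer-grained but slower.)

- MLPClassifier API (deep learning)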

scikit-learn.org/stable/modules/generated/sklearn.neural_network.MLPClassifier.html

[Deep Learning] KNN

2021. 3. 18. 15:26

KNN

: K-nearest neighbors algorithm

- Theory

onikaze.tistory.com/368

- anaconda prompt

pip install mglearn

=> installs the module

* knn1.py

import mglearn     # pip install mglearn
import matplotlib.pyplot as plt
plt.rc('font', family='malgun gothic')

# -------------------------
# Classification
mglearn.plots.plot_knn_classification(n_neighbors=1)
plt.show()

mglearn.plots.plot_knn_classification(n_neighbors=3)
plt.show()

mglearn.plots.plot_knn_classification(n_neighbors=5)
plt.show()

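=> The simplest k-NN algorithm finds the single closest training data point and uses it as the nearest neighbor for the prediction.
=> The output of that training data point simply becomes the prediction.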
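
The idea fits in a few lines of plain Python. This is a sketch, not scikit-learn's implementation: the training points are made up, `knn_predict` is a hypothetical name, and Euclidean distance is assumed.

```python
import math
from collections import Counter

def knn_predict(train_X, train_y, query, k=1):
    """Predict the label of `query` by majority vote among its k nearest training points."""
    # Order the training indices by Euclidean distance to the query
    order = sorted(range(len(train_X)), key=lambda i: math.dist(train_X[i], query))
    nearest_labels = [train_y[i] for i in order[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]

# Made-up 2-D training data with two classes
train_X = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.5), (5.0, 5.0), (6.0, 5.5)]
train_y = [0, 0, 0, 1, 1]

print(knn_predict(train_X, train_y, (1.2, 1.1), k=1))
print(knn_predict(train_X, train_y, (5.5, 5.0), k=3))
```

With k=1 the prediction is just the label of the closest point; with k=3 the three closest labels vote.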

import mglearn

mglearn.plots.plot_knn_classification(n_neighbors=) : plots a k-NN classification example. n_neighbors - the value of k.

# Regression
mglearn.plots.plot_knn_regression(n_neighbors=1)
plt.show()

mglearn.plots.plot_knn_regression(n_neighbors=3)
plt.show()

import mglearn

mglearn.plots.plot_knn_regression(n_neighbors=) : plots a k-NN regression example. n_neighbors - the value of k.

from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split

X, y = mglearn.datasets.make_forge() # load the forge dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=12) # split into train and test sets
print(X_train, ' ', X_train.shape)  # [[ 8.92229526 -0.63993225] ...   (19, 2)
print(X_test, ' ', X_test.shape)    #  (7, 2)
print(y_train)  # [0 0 1 1 0 1 0 1 1 1 0 1 0 0 0 1 0 1 0]

model = KNeighborsClassifier(n_neighbors=3)
model.fit(X_train, y_train)
print("test prediction: {}".format(model.predict(X_test)))
# test prediction: [0 0 1 0 1 1 1]
print("test accuracy: {:.2f}".format(model.score(X_test, y_test)))
# test accuracy: 0.86
print("train accuracy: {:.2f}".format(model.score(X_train, y_train)))
# train accuracy: 0.95

fig, axes = plt.subplots(1, 3, figsize=(10, 5))

for n_neighbors, ax in zip([1, 3, 9], axes):
    model2 = KNeighborsClassifier(n_neighbors=n_neighbors).fit(X, y)
    mglearn.plots.plot_2d_separator(model2, X, fill=True, eps=0.5, ax=ax, alpha=.4)
    mglearn.discrete_scatter(X[:, 0], X[:, 1], y, ax=ax)
    ax.set_title("{} neighbor(s)".format(n_neighbors))
    ax.set_xlabel("feature 0")
    ax.set_ylabel("feature 1")
axes[0].legend(loc=1)  # only needs to run once, so it sits outside the loop

plt.show()

import mglearn

mglearn.datasets.make_forge() : forge dataset

from sklearn.neighbors import KNeighborsClassifier

KNeighborsClassifier(n_neighbors=) : k-NN classification algorithm

model.score(x, y) : accuracy

mglearn.plots.plot_2d_separator(model2, X, fill=True, eps=0.5, ax=ax, alpha=.4)

mglearn.discrete_scatter(X[:, 0], X[:, 1], y, ax=ax)

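In the left plot, with a single neighbor, the decision boundary follows the training data closely.
As the number of neighbors grows, the decision boundary becomes smoother. A smoother boundary means a simpler model.
In other words, using few neighbors raises model complexity (the right side of the book's model-complexity [figure]) and using many neighbors lowers it (the left side of that [figure]).

In the extreme case where the number of neighbors equals the total number of training points, every test point has the same neighbors (all of the training data), so every prediction is the same value.
That is, the class with the most data points in the training set becomes the prediction.
In general, the KNeighbors classifier has two important parameters: the way the distance between data points is measured, and the number of neighbors.
In practice a small number of neighbors such as 3 or 5 often works well, but this parameter still has to be tuned.
By default, distance is measured as Euclidean distance.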
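
The extreme case is easy to check by hand. A minimal sketch with made-up points and a hypothetical `knn_vote` helper: the query sits right next to the class-1 points, yet once every training point votes the majority class wins.

```python
import math
from collections import Counter

def knn_vote(train_X, train_y, query, k):
    """Majority label among the k training points closest to `query` (Euclidean distance)."""
    order = sorted(range(len(train_X)), key=lambda i: math.dist(train_X[i], query))
    return Counter(train_y[i] for i in order[:k]).most_common(1)[0][0]

train_X = [(0.0, 0.0), (0.2, 0.1), (0.1, 0.3), (4.0, 4.0), (4.2, 3.9)]
train_y = [0, 0, 0, 1, 1]  # class 0 is the majority (3 of 5)

query = (4.1, 4.0)  # placed next to the two class-1 points
few = knn_vote(train_X, train_y, query, k=1)              # only the closest point votes
all_ = knn_vote(train_X, train_y, query, k=len(train_X))  # every training point votes
print(few, all_)
```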

Practice with the breast_cancer dataset

from sklearn.datasets import load_breast_cancer

cancer = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(cancer.data, cancer.target, stratify=cancer.target, random_state=66)

training_accuracy = []
test_accuracy = []
# try n_neighbors from 1 to 10
neighbors_settings = range(1, 11)

for n_neighbors in neighbors_settings:
    clf = KNeighborsClassifier(n_neighbors=n_neighbors)  # build the model
    clf.fit(X_train, y_train)
    # record accuracy on the train dataset
    training_accuracy.append(clf.score(X_train, y_train))
    # record accuracy on the test dataset
    test_accuracy.append(clf.score(X_test, y_test))

import numpy as np
print("mean accuracy :", np.mean(test_accuracy))
# mean accuracy : 0.918881118881119
plt.plot(neighbors_settings, training_accuracy, label="training accuracy")
plt.plot(neighbors_settings, test_accuracy, label="test accuracy")
plt.ylabel("accuracy")
plt.xlabel("n_neighbors")
plt.legend()
plt.show()

from sklearn.datasets import load_breast_cancer
load_breast_cancer()

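This plot shows train-set and test-set accuracy (y axis) against the number of n_neighbors (x axis).
Real plots like this are rarely smooth, but the signs of overfitting and underfitting are still visible here
(because fewer neighbors means a more complex model, the graph is the horizontally flipped form of the [figure]).
With a single nearest neighbor, prediction on the training data is perfect.
As the number of neighbors grows, the model becomes simpler and training accuracy drops.
Test-set accuracy with one neighbor is lower than with more neighbors.
This shows that 1-nearest-neighbor makes the model too complex.
Conversely, with 10 neighbors the model is too simple and accuracy gets worse.
Accuracy is best at a middle value, around six neighbors.

Reference : parts of 파이썬 라이브러리를 활용한 머신러닝 (Hanbit Media) were used.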

* knn2.py

from sklearn.neighbors import KNeighborsClassifier

kmodel = KNeighborsClassifier(n_neighbors = 3, weights = 'distance')

train = [
    [5, 3, 2],
    [1, 3, 5],
    [4, 5, 7]
    ]
label = [0, 1, 1]

import matplotlib.pyplot as plt

plt.plot(train, 'o')
plt.xlim([-1, 5])
plt.ylim([0, 10])
plt.show()

scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html

from sklearn.neighbors import KNeighborsClassifier
KNeighborsClassifier(n_neighbors = 3, weights = 'distance')

kmodel.fit(train, label)
pred = kmodel.predict(train)
print('pred :', pred)                        # pred : [0 1 1]
print('acc :', kmodel.score(train, label))   # acc : 1.0

new_data = [[1, 2, 8], [6, 4, 1]]
new_pred = kmodel.predict(new_data)
print('new_pred :', new_pred)                # new_pred : [1 0]

* regression_test.py

# Regression practice with several representative classification/prediction models
import pandas as pd
import numpy as np
from sklearn.neighbors import KNeighborsRegressor
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from xgboost import XGBRegressor
from sklearn.metrics import r2_score

adver = pd.read_csv('../testdata/Advertising.csv', usecols=[1,2,3,4])
print(adver.head(2))
'''
      tv  radio  newspaper  sales
0  230.1   37.8       69.2   22.1
1   44.5   39.3       45.1   10.4
'''

x = np.array(adver.loc[:, 'tv':'newspaper'])
y = np.array(adver.sales)
print(x[:2]) # [[230.1  37.8  69.2] [ 44.5  39.3  45.1]]
print(y[:2]) # [22.1 10.4]

# KNeighborsRegressor
kmodel = KNeighborsRegressor(n_neighbors=3).fit(x, y)
print(kmodel)
kpred = kmodel.predict(x)
print('pred :', kpred[:5]) # pred : [20.4        10.43333333  8.56666667 18.2        14.2       ]
print('r2 :', r2_score(y, kpred))  # r2 : 0.968012077694316
print()

# LinearRegression
lmodel = LinearRegression().fit(x, y)
print(lmodel)
lpred = lmodel.predict(x)
print('pred :', lpred[:5]) # pred : [20.52397441 12.33785482 12.30767078 17.59782951 13.18867186]
print('r2 :', r2_score(y, lpred))  # r2 : 0.8972106381789522
print()

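# RandomForestRegressor
# criterion='mse' matches scikit-learn 0.24; from 1.2 on the same criterion is named 'squared_error'
rmodel = RandomForestRegressor(n_estimators=100, criterion='mse').fit(x, y)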
print(rmodel)
rpred = rmodel.predict(x)
print('pred :', rpred[:5]) # pred : [21.942 10.669  8.859 18.281 13.44 ]
print('r2 :', r2_score(y, rpred))  # r2 : 0.9971466378876895
print()

# XGBRegressor
xmodel = XGBRegressor(n_estimators=100).fit(x, y)
print(xmodel)
xpred = xmodel.predict(x)
print('pred :', xpred[:5]) # pred : [22.095655  10.40437    9.302584  18.499216  12.9007015]
print('r2 :', r2_score(y, xpred))  # r2 : 0.9999996661140423
print()

[Deep Learning] RandomForest

2021. 3. 17. 17:29

RandomForest

: ensemble technique (bundles several Decision Trees into a single model)

: quantitative analysis model

RandomForestClassifier classification practice

* randomForest1.py

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
import pandas as pd
from sklearn import model_selection
import numpy as np

df = pd.read_csv('https://raw.githubusercontent.com/pykwon/python/master/testdata_utf8/titanic_data.csv')
print(df.head(3), df.shape) # (891, 12)
'''
   PassengerId  Survived  Pclass  ...     Fare Cabin  Embarked
0            1         0       3  ...   7.2500   NaN         S
1            2         1       1  ...  71.2833   C85         C
2            3         1       3  ...   7.9250   NaN         S
'''
print(df.columns)
# Index(['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp',
#        'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked'],
#       dtype='object')
print(df.info())
 #   Column       Non-Null Count  Dtype  
# ---  ------       --------------  -----  
#  0   PassengerId  891 non-null    int64  
#  1   Survived     891 non-null    int64  
#  2   Pclass       891 non-null    int64  
#  3   Name         891 non-null    object 
#  4   Sex          891 non-null    object 
#  5   Age          714 non-null    float64
#  6   SibSp        891 non-null    int64  
#  7   Parch        891 non-null    int64  
#  8   Ticket       891 non-null    object 
#  9   Fare         891 non-null    float64
#  10  Cabin        204 non-null    object 
#  11  Embarked     889 non-null    object 

print(df.isnull().any())
# PassengerId    False
# Survived       False
# Pclass         False
# Name           False
# Sex            False
# Age             True
# SibSp          False
# Parch          False
# Ticket         False
# Fare           False
# Cabin           True
# Embarked        True

df.isnull().any() : checks each column for null values.

df = df.dropna(subset=['Pclass','Age','Sex'])
print(df.head(3), df.shape) # (714, 12)

df_x = df[['Pclass','Age','Sex']].copy() # copy so that later column assignments do not trigger SettingWithCopyWarning
print(df_x.head(3))
'''
   Pclass   Age     Sex
0       3  22.0    male
1       1  38.0  female
2       3  26.0  female
'''

df.dropna(subset=['col1', 'col2',..]) : drops rows that have missing values in the listed columns.

from sklearn.preprocessing import LabelEncoder, OneHotEncoder

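df_x.loc[:, 'Sex'] = LabelEncoder().fit_transform(df_x['Sex']) # female : 0, male : 1
# Equivalent alternative. Run only one of the two: applied after the line above, it would turn every value into 0.
# df_x['Sex'] = df_x['Sex'].apply(lambda x: 1 if x=='male' else 0)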

print(df_x.head(3), df_x.shape) # (714, 3)
'''
   Pclass   Age  Sex
0       3  22.0    1
1       1  38.0    0
2       3  26.0    0
'''

df_y = df['Survived']
print(df_y.head(3), df_y.shape) # (714,)
'''
0    0
1    1
2    1
'''

df_x2 = pd.DataFrame(OneHotEncoder().fit_transform(df_x['Pclass'].values[:,np.newaxis]).toarray(),\
                     columns = ['f_class', 's_class', 't_class'], index=df_x.index)
print(df_x2.head(3))
'''
   f_class  s_class  t_class
0      0.0      0.0      1.0
1      1.0      0.0      0.0
2      0.0      0.0      1.0
'''

df_x = pd.concat([df_x, df_x2], axis=1)
print(df_x.head(3))
'''
   Pclass   Age  Sex  f_class  s_class  t_class
0       3  22.0    1      0.0      0.0      1.0
1       1  38.0    0      1.0      0.0      0.0
2       3  26.0    0      0.0      0.0      1.0
'''

from sklearn.preprocessing import LabelEncoder

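LabelEncoder().fit_transform(df['categorical column']) : converts categorical data into numeric codes.

from sklearn.preprocessing import OneHotEncoder

OneHotEncoder().fit_transform(2-D array of one column).toarray() : one-hot encoding (the input must be 2-dimensional)

np.newaxis : adds a dimension.

pd.concat([DataFrame, .. ], axis=1) : concatenates column-wise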

# train / test
(train_x,  test_x, train_y, test_y) = train_test_split(df_x, df_y)

# model
from sklearn.metrics import accuracy_score

model = RandomForestClassifier(n_estimators=500, criterion='entropy')
fit_model = model.fit(train_x, train_y)

pred = fit_model.predict(test_x)
print('predicted:', pred[:10])           # [0 0 0 1 0 0 0 1 0 0]
print('actual:', test_y[:10].ravel()) # [0 1 0 0 1 0 0 1 0 0]

print('acc :', sum(test_y == pred) / len(test_y))
print('acc :', accuracy_score(test_y, pred))

from sklearn.ensemble import RandomForestClassifier

RandomForestClassifier(n_estimators=100, criterion='entropy') : n_estimators : number of trees, criterion : measure of split quality.

from sklearn.metrics import accuracy_score

accuracy_score(actual, predicted) : computes accuracy.

ravel() : flattens to 1-D.

- RandomForestClassifier API

scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html

Predicting median house prices in the Boston area

* randomForest_regressor.py

import pandas as pd
import matplotlib.pyplot as plt
from sklearn.datasets import load_boston  # available in scikit-learn 0.24; removed from 1.2 on

boston = load_boston()
print(boston.DESCR)
# MEDV     Median value of owner-occupied homes in $1000's

# convert to DataFrames
# independent variables only
dfx = pd.DataFrame(boston.data, columns = boston.feature_names)
# dependent variable
dfy = pd.DataFrame(boston.target, columns = ['MEDV'])

print(dfx.head(3), dfx.shape) # (506, 13)
'''
      CRIM    ZN  INDUS  CHAS    NOX  ...  RAD    TAX  PTRATIO       B  LSTAT
0  0.00632  18.0   2.31   0.0  0.538  ...  1.0  296.0     15.3  396.90   4.98
1  0.02731   0.0   7.07   0.0  0.469  ...  2.0  242.0     17.8  396.90   9.14
2  0.02729   0.0   7.07   0.0  0.469  ...  2.0  242.0     17.8  392.83   4.03
'''
print(dfy.head(3), dfy.shape) # (506, 1)
'''
   MEDV
0  24.0
1  21.6
2  34.7 
'''
df = pd.concat([dfx, dfy], axis=1)
print(df.head(3))
'''
      CRIM    ZN  INDUS  CHAS    NOX  ...    TAX  PTRATIO       B  LSTAT  MEDV
0  0.00632  18.0   2.31   0.0  0.538  ...  296.0     15.3  396.90   4.98  24.0
1  0.02731   0.0   7.07   0.0  0.469  ...  242.0     17.8  396.90   9.14  21.6
2  0.02729   0.0   7.07   0.0  0.469  ...  242.0     17.8  392.83   4.03  34.7
'''

from sklearn.datasets import load_boston

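load_boston() : Boston housing dataset.

- Correlation coefficients

pd.set_option('display.max_columns', 100) # show columns that would otherwise be elided when printing a DataFrame
print(df.corr()) # check the correlation coefficients
# RM       average number of rooms per dwelling.                   correlation with MEDV : 0.695360
# AGE      proportion of owner-occupied units built prior to 1940. correlation with MEDV : -0.376955
# LSTAT    % lower status of the population                        correlation with MEDV : -0.737663

pd.set_option('display.max_columns', 100) : shows up to 100 columns when printing a DataFrame instead of eliding them.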

- Visualization

import seaborn as sns
cols = ['MEDV', 'RM', 'AGE', 'LSTAT']
sns.pairplot(df[cols])
plt.show()

import seaborn as sns

sns.pairplot(data) : plots pairwise scatter plots between variables.

- Reshape the data for sklearn

x = df[['LSTAT']].values # sklearn expects the independent variables as a 2-D array
y = df['MEDV'].values
print(x[:2])             # [[4.98] [9.14]]
print(y[:2])             # [24.  21.6]

- DecisionTreeRegressor

# Exercise 1
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import r2_score

model = DecisionTreeRegressor(max_depth=3).fit(x, y)
print('predict :', model.predict(x)[:5]) # predict : [30.47142857 25.84701493 37.315625   43.98888889 30.47142857]
print('real :', y[:5])                   # real : [24.  21.6 34.7 33.4 36.2]
r2 = r2_score(y, model.predict(x))
print('coefficient of determination (R2) :', r2)          # coefficient of determination (R2) : 0.6993833085636556

from sklearn.tree import DecisionTreeRegressor

DecisionTreeRegressor(max_depth=).fit(x, y) : decision tree regression

from sklearn.metrics import r2_score

r2_score(actual, predicted) : computes the R-squared value
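
For reference, R-squared is 1 minus the ratio of the residual sum of squares to the total sum of squares. A minimal sketch in plain Python; the helper name `r2` is hypothetical, and the values are the first five MEDV entries printed earlier.

```python
def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))  # unexplained variation
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)             # total variation
    return 1 - ss_res / ss_tot

y_true = [24.0, 21.6, 34.7, 33.4, 36.2]
mean_y = sum(y_true) / len(y_true)

perfect = r2(y_true, y_true)                   # predictions equal to the targets
baseline = r2(y_true, [mean_y] * len(y_true))  # always predicting the mean
print(perfect, baseline)
```

A model that always predicts the mean scores 0, so R-squared measures how much better the model is than that baseline.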

- RandomForestRegressor

# Exercise 2
from sklearn.ensemble import RandomForestRegressor

model2 = RandomForestRegressor(n_estimators=1000, criterion='mse', random_state=123).fit(x, y) # criterion='mse': mean squared error (named 'squared_error' from scikit-learn 1.2 on)
print('predict2 :', model2.predict(x)[:5]) # predict : [24.7535     22.0408     35.2609581  38.8436     32.00298571]
print('real :', y[:5])                     # real : [24.  21.6 34.7 33.4 36.2]
r2_1 = r2_score(y, model2.predict(x))
print('coefficient of determination (R2) :', r2_1)          # coefficient of determination (R2) : 0.9096858991691069

from sklearn.ensemble import RandomForestRegressor

RandomForestRegressor(n_estimators=, criterion='mse', random_state=).fit(x, y) : criterion='mse' is mean squared error

- Split into training and test data

from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3, random_state=123)
model2.fit(x_train, y_train)

r2_train = r2_score(y_train, model2.predict(x_train))
print('R2 on train :', r2_train)  # R2 on train : 0.9090659680794153

r2_test = r2_score(y_test, model2.predict(x_test))
print('R2 on test :', r2_test)    # R2 on test : 0.5779609792473676
# The gap between train and test R2 points to overfitting; adding more independent variables tends to improve the test result.

from sklearn.model_selection import train_test_split

train_test_split(x, y, test_size=, random_state=) : splits the data into train/test sets

- Visualization

from matplotlib import style 
style.use('seaborn-talk')
plt.scatter(x, y, c='lightgray', label='train data')
plt.scatter(x_test, model2.predict(x_test), c='r', label='predict data, $R^2=%.2f$'%r2_test)
plt.xlabel('LSTAT')
plt.ylabel('MEDV')
plt.legend()
plt.show()

from matplotlib import style

print(plt.style.available) : prints the available styles.

style.use('style name') : applies a matplotlib style.

- Predicting with new values

import numpy as np
print(x_test[:3])                        # [[10.11] [ 6.53] [ 3.76]]
x_new = [[50.11], [26.53], [1.76]]
print('predicted house price :', model2.predict(x_new)) # predicted house price : [ 9.6527  11.0907  45.34095]

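Bagging / Boosting

Bagging - Random Forest
  : an algorithm that generates several bootstrap datasets from the data, fits a model on each, and combines them into a final predictive model.
    Short for bootstrap aggregating: the data is swept into a bag and sampled with replacement to build several samples, a model is trained on each sample, and the results are merged into one model.
    The benefit of bagging is the stability of the algorithm.
    Building models from many different samples represents the population better than building a single model from data drawn with one seed.
    For categorical data the predictions are combined by voting or by taking the highest probability; for continuous (numeric) data they are averaged.
    Bagging can also be parallelized: each model is built independently from its own sample, so model construction is very efficient.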
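
The steps above (bootstrap samples, one model per sample, a vote at the end) can be sketched in plain Python. This is an illustration only: the 1-D data is made up, and `train_stump` is a hypothetical weak learner that assumes class 1 has the larger feature values.

```python
import random
from collections import Counter

def bootstrap_sample(data, rng):
    """Draw len(data) items with replacement."""
    return [rng.choice(data) for _ in data]

def train_stump(sample):
    """Weak learner: threshold halfway between the two class means of a 1-D feature."""
    xs0 = [x for x, y in sample if y == 0]
    xs1 = [x for x, y in sample if y == 1]
    if not xs0 or not xs1:
        # The sample happened to contain one class only, so always predict it
        only = 0 if xs0 else 1
        return lambda x: only
    threshold = (sum(xs0) / len(xs0) + sum(xs1) / len(xs1)) / 2
    return lambda x: 1 if x > threshold else 0

def bagging_predict(models, x):
    """Combine the models by majority vote (the categorical case)."""
    return Counter(m(x) for m in models).most_common(1)[0][0]

rng = random.Random(0)
data = [(1.0, 0), (1.5, 0), (2.0, 0), (2.5, 0), (6.0, 1), (6.5, 1), (7.0, 1), (7.5, 1)]
models = [train_stump(bootstrap_sample(data, rng)) for _ in range(25)]
print(bagging_predict(models, 1.2), bagging_predict(models, 7.2))
```

Because each stump depends only on its own sample, the list comprehension that builds `models` could run in parallel.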

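Boosting - XGBoost
  : builds the final predictive model by repeatedly putting more weight on misclassified items and generating new classification rules.
    In more detail, boosting is the process of combining weak classifiers into a strong classifier.
    Suppose classifiers A, B, and C each reach an accuracy of about 0.3.
    Combining A, B, and C to reach a higher accuracy, say about 0.7, is the basic principle of ensemble algorithms.
    Boosting carries out this process sequentially.
    It builds classifier A, uses that information to build classifier B, and then uses that in turn to build classifier C.
    Finally, all of the classifiers are combined into the final model; that is the principle of boosting.
    A representative algorithm is AdaBoost, short for Adaptive Boosting.
    AdaBoost is an ensemble-based classifier that applies weak classifiers repeatedly to pick up the characteristics of the data.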
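
A compact sketch of the A, B, C sequence in the AdaBoost style. It assumes labels of +1/-1, threshold stumps on a single made-up feature, and hypothetical function names; it is not XGBoost's algorithm, which fits trees to gradients instead of reweighting samples.

```python
import math

def best_stump(xs, ys, weights):
    """Pick the threshold and polarity with the lowest weighted error."""
    best = None
    for threshold in xs:
        for polarity in (1, -1):
            preds = [polarity if x > threshold else -polarity for x in xs]
            err = sum(w for w, p, y in zip(weights, preds, ys) if p != y)
            if best is None or err < best[0]:
                best = (err, threshold, polarity)
    return best

def adaboost(xs, ys, rounds=3):
    n = len(xs)
    weights = [1.0 / n] * n  # every sample starts with equal weight
    ensemble = []
    for _ in range(rounds):
        err, threshold, polarity = best_stump(xs, ys, weights)
        err = max(err, 1e-10)  # guard against division by zero
        alpha = 0.5 * math.log((1 - err) / err)  # say of this classifier
        ensemble.append((alpha, threshold, polarity))
        # Raise the weight of misclassified samples, lower the rest, renormalize
        updated = []
        for w, x, y in zip(weights, xs, ys):
            pred = polarity if x > threshold else -polarity
            updated.append(w * math.exp(-alpha * y * pred))
        total = sum(updated)
        weights = [w / total for w in updated]
    return ensemble

def predict(ensemble, x):
    score = sum(a * (p if x > t else -p) for a, t, p in ensemble)
    return 1 if score > 0 else -1

xs = [1, 2, 3, 4, 5, 6]
ys = [1, 1, -1, -1, 1, 1]  # no single threshold separates this pattern
ensemble = adaboost(xs, ys, rounds=3)
print([predict(ensemble, x) for x in xs])
```

Each round depends on the weights left by the previous one, which is why boosting runs sequentially while bagging can run in parallel.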

- anaconda prompt

pip install xgboost

* xgboost1.py

# RandomForest vs xgboost
import pandas as pd
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn import metrics
import numpy as np
import xgboost as xgb       # pip install xgboost  
 
if __name__ == '__main__':
    iris = datasets.load_iris()
    print('iris species :', iris.target_names)
    print('feature names :', iris.feature_names)
 
    # build a DataFrame from the iris data
    data = pd.DataFrame(
        {
            'sepal length': iris.data[:, 0],
            'sepal width': iris.data[:, 1],
            'petal length': iris.data[:, 2],
            'petal width': iris.data[:, 3],
            'species': iris.target
        }
    )
    print(data.head(2))
    '''
           sepal length  sepal width  petal length  petal width  species
    0           5.1          3.5           1.4          0.2        0
    1           4.9          3.0           1.4          0.2        0
    '''
 
    x = data[['sepal length', 'sepal width', 'petal length', 'petal width']]
    y = data['species']
 
    # 30% test data
    x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3, random_state=123)
 
    # training
    # Two candidate models: the second assignment overrides the first, so comment one out to compare them.
    model = RandomForestClassifier(n_estimators=100)  # RandomForestClassifier - bagging: trees can be built in parallel
    model = xgb.XGBClassifier(booster='gbtree', max_depth=4, n_estimators=100) # XGBClassifier - boosting: trees are built sequentially
    # booster: tree-based model (gbtree) or linear model (gblinear)
    # max_depth [default: 6]: limits tree depth to curb overfitting; tune it with CV, typically in the 3-10 range.

    model.fit(x_train, y_train)
 
    # prediction
    y_pred = model.predict(x_test)
    print('predicted : ', y_pred[:5])
    # predicted :  [1 2 2 1 0]

    print('actual : ', np.array(y_test[:5]))
    # actual :  [1 2 2 1 0]
 
    print('accuracy : ', metrics.accuracy_score(y_test, y_pred))
    # accuracy :  0.9333333333333333

import xgboost as xgb

xgb.XGBClassifier(booster='gbtree', max_depth=, n_estimators=) : XGBoost classification - boosting (sequential)
booster : tree-based model (gbtree) or linear model (gblinear)
max_depth : limits tree depth to curb overfitting; tune it with CV, typically in the 3-10 range.

(default: 6)

[Deep Learning] Decision Tree

2021. 3. 17. 13:17

Decision Tree

: CART - supports both classification and regression
: a model built on a simple algorithm that classifies or predicts by applying a series of rules in sequence

Random Forest

ensemble model

uses the Decision Tree as its base model

* tree1.py

import pydotplus
from sklearn import tree

# classify man/women from height and hair length
x = [[180, 15],
     [177, 42],
     [156, 35],
     [174, 5],
     [166, 33]]

y = ['man', 'women', 'women', 'man', 'women']
label_names = ['height', 'hair length']

model = tree.DecisionTreeClassifier(criterion='entropy', random_state=0)
print(model)
fit = model.fit(x, y)
print('acc :{:.3f}'.format(fit.score(x, y))) # acc :1.000

mydata = [[171, 8]]
pred =  fit.predict(mydata)
print('pred :', pred) # pred : ['man']
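
With criterion='entropy' the tree prefers the split that removes the most entropy. A rough sketch on the five samples above, using hypothetical helper names and hand-picked candidate thresholds (scikit-learn searches the thresholds itself):

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(values, labels, threshold):
    """Entropy removed by splitting on value <= threshold versus value > threshold."""
    left = [l for v, l in zip(values, labels) if v <= threshold]
    right = [l for v, l in zip(values, labels) if v > threshold]
    n = len(labels)
    weighted = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(labels) - weighted

height = [180, 177, 156, 174, 166]
hair = [15, 42, 35, 5, 33]
labels = ['man', 'women', 'women', 'man', 'women']

gain_hair = information_gain(hair, labels, 24)        # candidate split: hair length <= 24
gain_height = information_gain(height, labels, 175)   # candidate split: height <= 175
print(round(entropy(labels), 3), round(gain_hair, 3), round(gain_height, 3))
```

The hair-length split leaves both sides pure, so its gain equals the full entropy of the root, which is why one rule is enough to reach acc 1.000 on this data.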

from sklearn import tree

tree.DecisionTreeClassifier(criterion=, random_state=) : decision tree classifier. criterion - measure of split quality.

scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html

# visualization - uses the graphviz tool
import collections

dot_data = tree.export_graphviz(model, feature_names=label_names, out_file=None,\
                                filled = True, rounded=True)
graph = pydotplus.graph_from_dot_data(dot_data)
colors = ('red', 'orange')
edges = collections.defaultdict(list) # dict whose missing keys default to an empty list

for e in graph.get_edge_list():
    edges[e.get_source()].append(int(e.get_destination()))

for e in edges:
    edges[e].sort()
    for i in range(2):
        dest = graph.get_node(str(edges[e][i]))[0]
        dest.set_fillcolor(colors[i])

graph.write_png('tree.png') # save the image

import matplotlib.pyplot as plt

img = plt.imread('tree.png')
plt.imshow(img)
plt.show()

* tree2_iris.py

...

# decision tree model
from sklearn.tree import DecisionTreeClassifier
model = DecisionTreeClassifier(criterion='entropy', max_depth=5)

...

...
# feature importances of the tree : how much each feature matters to the tree's decisions overall
print('feature importances : \n{}'.format(model.feature_importances_))

def plot_feature_importances(model):
    n_features = x.shape[1] # 4
    plt.barh(range(n_features), model.feature_importances_, align='center')
    #plt.yticks(np.arange(n_features), iris.feature_names[2:4])
    plt.xlabel('feature importance')
    plt.ylabel('feature')
    plt.ylim(-1, n_features)

plot_feature_importances(model)
plt.show()

# graphviz
from sklearn import tree
from io import StringIO
import pydotplus

dot_data = StringIO() # in-memory object that behaves like a file
tree.export_graphviz(model, out_file = dot_data,\
                     feature_names = iris.feature_names[2:4])
graph = pydotplus.graph_from_dot_data(dot_data.getvalue())
graph.write_png('tree2.png')

import matplotlib.pyplot as plt

img = plt.imread('tree2.png')
plt.imshow(img)
plt.show()

[Deep Learning] Naive Bayes

2021. 3. 17. 13:02

Naive Bayes classification model

: computes the probability of a label given the features. P(L|Feature)

P(A|B) = P(B|A)P(A)/P(B)

P(A|B) : the conditional probability that event A occurs given that event B has occurred

P(label|feature)
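
A worked example of the formula with made-up numbers: the label is "spam", the feature is "the message contains a certain word". All probabilities below are hypothetical.

```python
def posterior(prior_a, likelihood_b_given_a, evidence_b):
    """Bayes' rule: P(A|B) = P(B|A) * P(A) / P(B)."""
    return likelihood_b_given_a * prior_a / evidence_b

p_spam = 0.2              # P(label = spam)
p_word_given_spam = 0.6   # P(feature | spam)
p_word_given_ham = 0.05   # P(feature | not spam)

# P(feature) by the law of total probability
p_word = p_word_given_spam * p_spam + p_word_given_ham * (1 - p_spam)

p_spam_given_word = posterior(p_spam, p_word_given_spam, p_word)
print(round(p_spam_given_word, 3))
```

Seeing the word moves the probability of spam from the prior 0.2 to 0.12 / 0.16 = 0.75.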

* bayes1.py

from sklearn.naive_bayes import GaussianNB
import numpy as np
from sklearn import metrics

x = np.array([1,2,3,4,5])
x = x[:, np.newaxis] # np.newaxis adds a dimension
print(x)
'''
[[1]
 [2]
 [3]
 [4]
 [5]]
'''
y = np.array([1,3,5,7,9])
print(y)

model = GaussianNB().fit(x, y)
pred = model.predict(x)
print(pred) # [1 3 5 7 9]
print('acc :', metrics.accuracy_score(y, pred)) # acc : 1.0

from sklearn.naive_bayes import GaussianNB

GaussianNB()

# new data
new_x = np.array([[0.5],[2.3], [12], [0.1]])
new_pred = model.predict(new_x)
print(new_pred) # [1 3 9 1]

- One-hot encoding : converts each category into a vector of 0s and 1s

: applied to the feature data

: can improve model performance
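
What the encoding does, written out in plain Python. The helper `one_hot` is hypothetical and orders the categories alphabetically; it is only meant to show the shape of the result.

```python
def one_hot(values):
    """Map each category to a row with a single 1 in that category's column."""
    categories = sorted(set(values))
    index = {c: i for i, c in enumerate(categories)}
    return [[1 if index[v] == j else 0 for j in range(len(categories))] for v in values]

encoded = one_hot(['male', 'female', 'female'])
print(encoded)
```

Each row has exactly one 1, so no artificial ordering between categories is implied.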

x = '1,2,3,4,5'
x = x.split(',')
x = np.eye(len(x))
print(x)
'''
[[1. 0. 0. 0. 0.]
 [0. 1. 0. 0. 0.]
 [0. 0. 1. 0. 0.]
 [0. 0. 0. 1. 0.]
 [0. 0. 0. 0. 1.]]
'''
y = np.array([1,3,5,7,9])

model = GaussianNB().fit(x, y)
pred = model.predict(x)
print(pred) # [1 3 5 7 9]
print('acc :', metrics.accuracy_score(y, pred)) # acc : 1.0

from sklearn.preprocessing import OneHotEncoder
x = '1,2,3,4,5'
x = x.split(',')
x = np.array(x)
x = x[:, np.newaxis]
'''
[['1']
 ['2']
 ['3']
 ['4']
 ['5']]
'''

one_hot = OneHotEncoder(categories = 'auto')
x = one_hot.fit_transform(x).toarray()
print(x)
'''
[[1. 0. 0. 0. 0.]
 [0. 1. 0. 0. 0.]
 [0. 0. 1. 0. 0.]
 [0. 0. 0. 1. 0.]
 [0. 0. 0. 0. 1.]]
'''
y = np.array([1,3,5,7,9])

model = GaussianNB().fit(x, y)
pred = model.predict(x)
print(pred) # [1 3 5 7 9]
print('acc :', metrics.accuracy_score(y, pred)) # acc : 1.0

* bayes3_text.py

# text classification with a Naive Bayes model
from sklearn.datasets import fetch_20newsgroups

data = fetch_20newsgroups()
print(data.target_names)

categories = ['talk.religion.misc', 'soc.religion.christian',
              'sci.space', 'comp.graphics']

train = fetch_20newsgroups(subset='train', categories=categories)
test = fetch_20newsgroups(subset='test', categories=categories)
print(train.data[5])  # one representative item from the data

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# turn the content of each string into a numeric vector
model = make_pipeline(TfidfVectorizer(), MultinomialNB())  # chains the steps so they run in sequence
model.fit(train.data, train.target)
labels = model.predict(test.data)

from sklearn.metrics import confusion_matrix
import matplotlib.pyplot as plt
import seaborn as sns

mat = confusion_matrix(test.target, labels)  # build the confusion matrix
sns.heatmap(mat.T, square=True, annot=True, fmt='d', cbar=False,
            xticklabels=train.target_names, yticklabels=train.target_names)
plt.xlabel('true label')
plt.ylabel('predicted label')
plt.show()

# utility function that returns the predicted category for a single string
def predict_category(s, train=train, model=model):
    pred = model.predict([s])
    return train.target_names[pred[0]]

print(predict_category('sending a payload to the ISS'))
print(predict_category('discussing islam vs atheism'))
print(predict_category('determining the screen resolution'))

# Reference book : 파이썬 데이터사이언스 핸드북 (publisher : 위키북스)

[Deep Learning] PCA

2021. 3. 16. 16:28

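PCA (Principal Component Analysis), a feature engineering technique
: rather than simply selecting features, it creates new features from combinations of the existing ones

: PCA belongs to the feature extraction techniques

Dimensionality reduction with the iris dataset (4 columns down to 2, named sepal and petal in the code, although each component is a combination of all four features)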

* pca_test.py

from sklearn.decomposition import PCA
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
from sklearn.datasets import load_iris
plt.rc('font', family='malgun gothic')

iris = load_iris()
n = 10
x = iris.data[:n, :2] # look at the pattern in the sepal data
print('x before dimensionality reduction:\n', x, x.shape, type(x)) # (10, 2) <class 'numpy.ndarray'>
'''
 [[5.1 3.5]
 [4.9 3. ]
 [4.7 3.2]
 [4.6 3.1]
 [5.  3.6]
 [5.4 3.9]
 [4.6 3.4]
 [5.  3.4]
 [4.4 2.9]
 [4.9 3.1]]
'''
print(x.T)
# [[5.1 4.9 4.7 4.6 5.  5.4 4.6 5.  4.4 4.9]
#  [3.5 3.  3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1]]

from sklearn.datasets import load_iris

load_iris() : loads the iris dataset as ndarrays.

# visualization
plt.plot(x.T, 'o:')
plt.xticks(range(2), labels=['sepal length', 'sepal width'])
plt.xlim(-0.5, 2)
plt.ylim(2.5, 6)
plt.title('iris features')
plt.legend(['sample {}'.format(i + 1) for i in range(n)])
plt.show()

# visualization 2 : scatter plot
plt.figure(figsize=(8, 8))
df = pd.DataFrame(x)
ax = sns.scatterplot(x=df[0], y=df[1], data=df , marker='s', s = 100, color=".2")  # x/y passed by keyword, as seaborn 0.12+ requires
for i in range(n):
    ax.text(x[i, 0] - 0.05, x[i, 1] + 0.03, 'sample {}'.format(i + 1))
    
plt.xlabel('sepal length')
plt.ylabel('sepal width')
plt.title('iris features')
plt.show()

# PCA
pca1 = PCA(n_components = 1)
x_row = pca1.fit_transform(x) # returns the 1-D approximation of the data; unsupervised learning
print('x_row :\n', x_row, x_row.shape) # (10, 1)
'''
[[ 0.30270263]
 [-0.1990931 ]
 [-0.18962889]
 [-0.33097106]
 [ 0.30743473]
 [ 0.79976625]
 [-0.11185966]
 [ 0.16136046]
 [-0.61365539]
 [-0.12605597]]
'''

x2 = pca1.inverse_transform(x_row)
print('values after inverse transform:\n', x2, x2.shape) # (10, 2)
'''
 [[5.06676112 3.53108532]
 [4.7240094  3.1645881 ]
 [4.73047393 3.17150049]
 [4.63393012 3.06826822]
 [5.06999338 3.53454152]
 [5.40628057 3.89412635]
 [4.78359423 3.22830091]
 [4.97021731 3.42785306]
 [4.44084251 2.86180369]
 [4.77389743 3.21793233]]
'''
print(x_row[0]) # [0.30270263]
print(x2[0, :]) # [5.06676112 3.53108532]

# visualization 2: scatter plot with the PCA projection
df = pd.DataFrame(x)
# pass x and y by keyword (recent seaborn versions do not accept them positionally)
ax = sns.scatterplot(x=df[0], y=df[1], marker='s', s=100, color=".2")
for i in range(n):
    # put the label above the point if it lies above its projection, otherwise below
    d = 0.03 if x[i, 1] > x2[i, 1] else -0.04
    ax.text(x[i, 0] - 0.05, x[i, 1] + d, 'sample {}'.format(i + 1))
    plt.plot([x[i, 0], x2[i, 0]], [x[i, 1], x2[i, 1]], "k--")
plt.plot(x2[:, 0], x2[:, 1], "o-", markersize=10, color="b")
plt.plot(x[:, 0].mean(), x[:, 1].mean(), markersize=10, marker="D")
plt.axvline(x[:, 0].mean(), c='r') # vertical line
plt.axhline(x[:, 1].mean(), c='r') # horizontal line
plt.xlabel('sepal length')
plt.ylabel('sepal width')
plt.title('iris features')
plt.show()

x = iris.data
pca2 = PCA(n_components = 2)
x_row2 = pca2.fit_transform(x)
print('x_row2 :\n', x_row2, x_row2.shape)

x4 = pca2.inverse_transform(x_row2)
print('original :', x[0])         # original : [5.1 3.5 1.4 0.2]
print('reduced :', x_row2[0])     # reduced : [-2.68412563  0.31939725]
print('restored :', x4[0, :])     # restored : [5.08303897 3.51741393 1.40321372 0.21353169]

print()
iris2 = pd.DataFrame(x_row2, columns=['sepal', 'petal'])
iris1 = pd.DataFrame(x, columns=['sepal_Length', 'sepal_width', 'petal_Length', 'petal_width'])
print(iris2.head(3)) # reduced data (the two columns are principal components)
'''
      sepal     petal
0 -2.684126  0.319397
1 -2.714142 -0.177001
2 -2.888991 -0.144949
'''
print(iris1.head(3)) # original data
'''
   sepal_Length  sepal_width  petal_Length  petal_width
0           5.1          3.5           1.4          0.2
1           4.9          3.0           1.4          0.2
2           4.7          3.2           1.3          0.2
'''


[Deep Learning] SVM

2021. 3. 16. 12:23

SVM(Support Vector Machine)

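: Used to separate two classes of data.

: Finds the optimal hyperplane, i.e. the boundary with the largest margin between the two classes.

: The data points closest to the hyperplane are called support vectors.

: Can handle XOR (with a nonlinear kernel).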
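Why XOR needs more than a straight-line boundary can be checked by brute force. This is a minimal sketch and not how `SVC` works internally: it uses an explicit, hand-picked extra feature `x1 * x2` in place of a kernel, and the helper names and the weight grid are assumptions made for illustration.

```python
from itertools import product

XOR = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]

def linear_rule(weights, bias, features):
    """Predict 1 when the weighted sum plus bias is positive, else 0."""
    return 1 if sum(w * f for w, f in zip(weights, features)) + bias > 0 else 0

def separable(samples, grid):
    """Search a coarse grid of weights and biases for a rule that fits every sample."""
    dim = len(samples[0][0])
    for *weights, bias in product(grid, repeat=dim + 1):
        if all(linear_rule(weights, bias, x) == y for x, y in samples):
            return True
    return False

grid = [i / 2 for i in range(-6, 7)]  # -3.0, -2.5, ..., 3.0

# 1) original 2-D inputs: no line in the grid separates XOR
print(separable(XOR, grid))     # False

# 2) add the product x1*x2 as a third feature: a plane now separates it
lifted = [((x1, x2, x1 * x2), y) for (x1, x2), y in XOR]
print(separable(lifted, grid))  # True, e.g. weights (1, 1, -2) with bias -0.5
```

A kernel SVM gets a similar effect without building the extra features explicitly, which is why the default `SVC()` in svm1.py can fit XOR while `LogisticRegression()` cannot.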

Classifying the XOR operation

* svm1.py

xor_data = [
    [0,0,0],
    [0,1,1],
    [1,0,1],
    [1,1,0],
]
#print(xor_data)

import pandas as pd
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn import svm

xor_df = pd.DataFrame(xor_data)

feature = np.array(xor_df.iloc[:, 0:2])
label = np.array(xor_df.iloc[:, 2])
print(feature)
'''
[[0 0]
 [0 1]
 [1 0]
 [1 1]]
'''
print(label) # [0 1 1 0]

model = LogisticRegression() # linear classification model
model.fit(feature, label)
pred = model.predict(feature)
print('pred :', pred)
# pred : [0 0 0 0]

model = svm.SVC()             # linear or nonlinear (kernel trick) classifier; the default kernel is 'rbf'
model.fit(feature, label)
pred = model.predict(feature)
print('pred :', pred)
# pred : [0 1 1 0]

# Inspecting the support vectors
from sklearn.datasets import make_blobs
import matplotlib.pyplot as plt
import numpy as np

plt.rc('font', family='malgun gothic')

X, y = make_blobs(n_samples=50, centers=2, cluster_std=0.5, random_state=4)
y = 2 * y - 1

plt.scatter(X[y == -1, 0], X[y == -1, 1], marker='o', label="class -1")
plt.scatter(X[y == +1, 0], X[y == +1, 1], marker='x', label="class +1")
plt.xlabel("x1")
plt.ylabel("x2")
plt.legend()
plt.title("training data")
plt.show()

from sklearn.svm import SVC
model = SVC(kernel='linear', C=1.0).fit(X, y)  # try changing the tuning parameter C

xmin = X[:, 0].min()
xmax = X[:, 0].max()
ymin = X[:, 1].min()
ymax = X[:, 1].max()
xx = np.linspace(xmin, xmax, 10)
yy = np.linspace(ymin, ymax, 10)
X1, X2 = np.meshgrid(xx, yy)

z = np.empty(X1.shape)
for (i, j), val in np.ndenumerate(X1):    # ndenumerate yields (index, value) pairs of the array
    x1 = val
    x2 = X2[i, j]
    p = model.decision_function([[x1, x2]])
    z[i, j] = p[0]

levels = [-1, 0, 1]
linestyles = ['dashed', 'solid', 'dashed']
plt.scatter(X[y == -1, 0], X[y == -1, 1], marker='o', label="class -1")
plt.scatter(X[y == +1, 0], X[y == +1, 1], marker='x', label="class +1")
plt.contour(X1, X2, z, levels, colors='k', linestyles=linestyles)
plt.scatter(model.support_vectors_[:, 0], model.support_vectors_[:, 1], s=300, alpha=0.3)

x_new = [10, 2]
plt.scatter(x_new[0], x_new[1], marker='^', s=100)
plt.text(x_new[0] + 0.03, x_new[1] + 0.08, "test data")

plt.xlabel("x1")
plt.ylabel("x2")
plt.legend()
plt.title("SVM prediction result")
plt.show()

# print the support vectors
print(model.support_vectors_)
'''
[[9.03715314 1.71813465]
 [9.17124955 3.52485535]]
'''

* svm2_iris.py

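Generate a large amount of data with the BMI formula, then classify it with a model

Formula: Body mass index (BMI) = weight (kg) / [height (m)]^2
Criteria:
Underweight    below 20
Normal    20 - 24
Overweight    25 - 29
Obese    30 or above

Note: svm3_bmi.py uses its own simplified cut-offs (18.5 and 23) and three labels (thin, normal, fat) instead of the table above.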

* svm3_bmi.py

print(67/((170 / 100) * (170 / 100)))

import random

def calc_bmi(h,w):
    bmi = w / (h / 100)**2
    if bmi < 18.5: return 'thin'
    if bmi < 23: return 'normal'
    return 'fat'
print(calc_bmi(170, 65))

cnt = {'thin':0, 'normal':0, 'fat':0}

with open('bmi.csv', 'w') as fp:
    # header without spaces, so pandas reads the columns as 'height', 'weight', 'label'
    fp.write('height,weight,label\n')
    for i in range(50000):
        h = random.randint(150, 200)
        w = random.randint(35, 100)
        label = calc_bmi(h, w)
        cnt[label] += 1
        fp.write('{0},{1},{2}\n'.format(h, w, label))
print('good')

# classify using the BMI dataset
from sklearn import svm, metrics
from sklearn.model_selection import train_test_split
import pandas as pd
import matplotlib.pyplot as plt

tbl = pd.read_csv('bmi.csv')

# normalize the columns
label = tbl['label']
print(label)
w = tbl['weight'] / 100
h = tbl['height'] / 200
wh = pd.concat([w, h], axis=1)
print(wh.head(5), wh.shape)
'''
   weight  height
0    0.69   0.850
1    0.51   0.835
2    0.70   0.830
3    0.71   0.945
4    0.50   0.980 (50000, 2)
'''
label = label.map({'thin':0, 'normal':1, 'fat':2})
'''
0    2
1    0
2    2
3    1
4    0
'''
print(label[:5], label.shape) # (50000,)

# train/test
data_train, data_test, label_train, label_test = train_test_split(wh, label)
print(data_train.shape, data_test.shape) # (37500, 2) (12500, 2)

# model
model = svm.SVC(C=0.01).fit(data_train, label_train)
#model = svm.LinearSVC().fit(data_train, label_train)
print(model)

# cross-validation to check whether the trained model's results are reliable (p. 221)
from sklearn import model_selection
cross_vali = model_selection.cross_val_score(model, wh, label, cv=3)
# k-fold cross-validation
# cv=3: the data passed in (here all of wh, label) is split into 3 folds; each fold takes one turn as the validation set
# validate train    train
# train    validate train
# train    train    validate
print('score per fold:', cross_vali)          # [0.96754065 0.96400072 0.96783871]
print('mean score:', cross_vali.mean())    # 0.9664600275737195

pred = model.predict(data_test)
ac_score = metrics.accuracy_score(label_test, pred)
print('classification accuracy :', ac_score) # classification accuracy : 0.96816
print(metrics.classification_report(label_test, pred))
'''
              precision    recall  f1-score   support

           0       0.98      0.97      0.98      4263
           1       0.91      0.94      0.93      2644
           2       0.98      0.98      0.98      5593

    accuracy                           0.97     12500
   macro avg       0.96      0.96      0.96     12500
weighted avg       0.97      0.97      0.97     12500
'''

# visualization
tbl2 = pd.read_csv('bmi.csv', index_col = 2)
print(tbl2[:3])
'''
       height  weight
label                
fat       170      69
thin      167      51
fat       166      70
'''

def scatter_func(lbl, color):
    b = tbl2.loc[lbl]
    plt.scatter(b['weight'], b['height'], c=color, label=lbl)


fig = plt.figure()
scatter_func('fat', 'red')
scatter_func('normal', 'yellow')
scatter_func('thin', 'blue')
plt.legend()
plt.savefig('bmi_test.png')
plt.show()
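The 3-fold scheme noted in the comments of svm3_bmi.py (each third of the data takes one turn as the validation set) can be sketched in plain Python. This illustrates the splitting idea only: `k_fold_indices` is a hypothetical helper, and `cross_val_score` itself uses stratified folds for classifiers rather than this simple contiguous split.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, validation_indices) for k contiguous folds."""
    # spread the remainder over the first folds so fold sizes differ by at most 1
    sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in sizes:
        validation = list(range(start, start + size))
        train = [i for i in range(n_samples) if not (start <= i < start + size)]
        yield train, validation
        start += size

folds = list(k_fold_indices(10, 3))
for train, validation in folds:
    print('train:', train, 'validation:', validation)
```

Every sample lands in exactly one validation fold, so the mean of the per-fold scores uses each row once for validation.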

Image classification with an SVM model

* svm4.py

from sklearn.datasets import fetch_lfw_people

fetch_lfw_people(min_faces_per_person = 60) : loads the Labeled Faces in the Wild (LFW) face-photo dataset. min_faces_per_person : keeps only people who have at least this many pictures

scikit-learn.org/stable/modules/generated/sklearn.datasets.fetch_lfw_people.html


import matplotlib.pyplot as plt
from sklearn.metrics import classification_report  # import from the public module, not the private one

faces = fetch_lfw_people(min_faces_per_person = 60) 
print(faces)

print(faces.DESCR)
print(faces.data)
print(faces.data.shape) # (729, 2914)
print(faces.target)
print(faces.target_names)
print(faces.images.shape) # (729, 62, 47)

print(faces.images[0])
print(faces.target_names[faces.target[0]])
plt.imshow(faces.images[0], cmap='bone') # cmap : colormap
plt.show()

fig, ax = plt.subplots(3, 5)
print(fig)          # Figure(640x480)
print(ax.flat)      # <numpy.flatiter object at 0x00000235198C5D30>
print(len(ax.flat)) # 15
for i, axi in enumerate(ax.flat):
    axi.imshow(faces.images[i], cmap='bone')
    axi.set(xticks=[], yticks=[], xlabel=faces.target_names[faces.target[i]])
plt.show()

- Reduce the image dimensionality with principal component analysis, then classify

from sklearn.svm import SVC
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline

m_pca = PCA(n_components=150, whiten=True, random_state = 0)
m_svc = SVC(C=1)
model = make_pipeline(m_pca, m_svc)
print(model)
# Pipeline(steps=[('pca', PCA(n_components=150, random_state=0, whiten=True)),
#                 ('svc', SVC(C=1))])

- train/test

from sklearn.model_selection import train_test_split

x_train, x_test, y_train, y_test = train_test_split(faces.data, faces.target, random_state=1)
print(x_train[0], x_train.shape) # (546, 2914)
print(y_train[0], y_train.shape) # (546,)

model.fit(x_train, y_train)  # fit the model on the training data
pred = model.predict(x_test)
print('pred :', pred)   # pred : [1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 ..
print('real :', y_test) # real : [0 1 1 1 1 1 1 1 1 1 1 0 1 1 1 ..

- Classification accuracy

from sklearn.metrics import classification_report

print(classification_report(y_test, pred, target_names = faces.target_names)) # precision, recall and f1 per class
'''
                   precision    recall  f1-score   support

  Donald Rumsfeld       1.00      0.25      0.40        20
    George W Bush       0.80      1.00      0.89       132
Gerhard Schroeder       1.00      0.45      0.62        31

         accuracy                           0.83       183
        macro avg       0.93      0.57      0.64       183
     weighted avg       0.86      0.83      0.79       183
=> f1-score/accuracy -> 0.83
'''

from sklearn.metrics import confusion_matrix, accuracy_score

mat = confusion_matrix(y_test, pred)
print('confusion_matrix :\n', mat)
'''
 [[  5  15   0]
 [  0 132   0]
 [  0  17  14]]
 '''
print('acc :', accuracy_score(y_test, pred)) # 0.82513

- Visualizing the classification results

# preview a single sample, x_test[0]
plt.subplots(1, 1)
print(x_test[0], ' ', x_test[0].shape)
# [ 24.333334  33.        72.666664 ... 201.66667  201.33333  155.33333 ]   (2914,)
print(x_test[0].reshape(62, 47)) # reshape 1-D to 2-D so it can be displayed as an image
plt.imshow(x_test[0].reshape(62, 47), cmap='bone')
plt.show()

fig, ax = plt.subplots(4, 6)
for i, axi in enumerate(ax.flat):
    axi.imshow(x_test[i].reshape(62, 47), cmap='bone')
    axi.set(xticks=[], yticks=[])
    # label with the predicted surname; red marks a wrong prediction
    axi.set_ylabel(faces.target_names[pred[i]].split()[-1], color='black' if pred[i] == y_test[i] else 'red')
fig.suptitle('pred result', size = 14)  # set the figure title once, outside the loop
plt.show()

- Visualizing the confusion matrix

import seaborn as sns
sns.heatmap(mat.T, square = True, annot=True, fmt='d', cbar=False, \
            xticklabels=faces.target_names, yticklabels=faces.target_names)
plt.xlabel('true (real) label')
plt.ylabel('predicted label')
plt.show()


[Deep Learning] Logistic Regression

2021. 3. 15. 11:34

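- Logistic regression analysis

: Binary classification analysis

: logit(), glm()
: Independent variables: continuous; dependent variable: categorical

- A continuous output is converted into a binary class via odds -> odds ratio -> logit function -> sigmoid function

- odds

: A transformed probability. It expresses how many times higher the probability of success (outcome 1) is than the probability of failure (outcome 0): Odds(p) = p / (1 - p).

- odds ratio

: The ratio of two odds. If the probability p ranges over (0, 1), then Odds(p) ranges over (0, ∞).

- logit

: The logarithm of the odds. log(Odds(p)) takes values in (-∞, ∞), that is, over all real numbers. The sigmoid function maps such a value back into the range (0, 1).

- sigmoid

: Because log(Odds(p)) ranges over all real numbers, it is meaningful to fit a linear model to it; the log of an odds ratio is also approximately normally distributed. Setting log(Odds(p)) = wx + b, the coefficients w and b can be estimated. Solving this equation for p gives the sigmoid function, p = 1 / (1 + e^(-(wx + b))). Thresholding its output at 0.5 splits the result into the two classes 1 and 0.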
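The chain probability -> odds -> logit -> sigmoid can be traced numerically. This is a minimal sketch using only the standard library; the function names `odds` and `logit` are chosen here for illustration and are not the statsmodels API.

```python
import math

def odds(p):
    """Odds of success for a probability p in (0, 1)."""
    return p / (1 - p)

def logit(p):
    """Log-odds: maps (0, 1) onto the whole real line."""
    return math.log(odds(p))

def sigmoid(z):
    """Inverse of logit: maps any real number back into (0, 1)."""
    return 1 / (1 + math.exp(-z))

for p in (0.1, 0.5, 0.8):
    z = logit(p)
    # sigmoid(logit(p)) recovers p, which is why the two functions are inverses
    print(p, round(odds(p), 4), round(z, 4), round(sigmoid(z), 4))

# classify with the 0.5 threshold: positive log-odds -> class 1
print([1 if sigmoid(z) > 0.5 else 0 for z in (-6, -0.2, 0.6, 6)])  # [0, 0, 1, 1]
```

A probability of 0.5 corresponds to odds of 1 and a logit of 0, so thresholding the sigmoid output at 0.5 is the same as checking the sign of wx + b.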

* logistic1.py

import math
import numpy as np

def sigFunc(x):
    return 1 / ( 1 + math.exp(-x)) # math.exp(x) : e^x

print(sigFunc(0.6))
print(sigFunc(0.2))
print(sigFunc(6))
print(sigFunc(-6))
print(np.around(sigFunc(6)))   # 1.0
print(np.around(sigFunc(-6)))  # 0.0

import statsmodels.api as sm

mtcars = sm.datasets.get_rdataset('mtcars').data
print(mtcars.head(3)) # mtcars data read
'''
                mpg  cyl   disp   hp  drat     wt   qsec  vs  am  gear  carb
Mazda RX4      21.0    6  160.0  110  3.90  2.620  16.46   0   1     4     4
Mazda RX4 Wag  21.0    6  160.0  110  3.90  2.875  17.02   0   1     4     4
Datsun 710     22.8    4  108.0   93  3.85  2.320  18.61   1   1     4     1
'''
print(mtcars['am'].unique()) # [1 0]

import statsmodels.api as sm

sm.datasets.get_rdataset('dataset name').data : reads the data of an R dataset (fetched from the Rdatasets collection).

- Method 1: logit()

import statsmodels.formula.api as smf

formula = 'am ~ mpg + hp'           # fuel economy, horsepower -> automatic/manual transmission
result = smf.logit(formula=formula, data=mtcars).fit()
print(result)
'''
Optimization terminated successfully.
         Current function value: 0.300509
         Iterations 9
<statsmodels.discrete.discrete_model.BinaryResultsWrapper object at 0x000001F6244B8040>
'''
print(result.summary())
# p-value < 0.05  =>  significant

pred = result.predict(mtcars[:10])
#print('predicted : \n', pred)
print('predicted : \n', np.around(pred))
'''
predicted : 
 Mazda RX4            0.0
Mazda RX4 Wag        0.0
Datsun 710           1.0
Hornet 4 Drive       0.0
Hornet Sportabout    0.0
Valiant              0.0
Duster 360           0.0
Merc 240D            1.0
Merc 230             1.0
Merc 280             0.0
'''

print('actual : \n', mtcars['am'][:10])
'''
actual : 
 Mazda RX4            1
Mazda RX4 Wag        1
Datsun 710           1
Hornet 4 Drive       0
Hornet Sportabout    0
Valiant              0
Duster 360           0
Merc 240D            0
Merc 230             0
Merc 280             0
'''

import statsmodels.formula.api as smf

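smf.logit(formula='dependent variable ~ independent variable + ...', data=data).fit() : creates and fits a logistic regression model

model.predict(data) : computes the model's predicted values

- Classification accuracy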

conf_tab = result.pred_table() # confusion matrix: rows = actual (0, 1), columns = predicted (0, 1)
print(conf_tab)
'''
            predicted 0  predicted 1
actual 0  [[16.(TN)      3.(FP)]
actual 1   [ 3.(FN)     10.(TP)]]
'''
print('classification accuracy :', (16+10) / len(mtcars)) # 0.8125
print('classification accuracy :', (conf_tab[0][0] + conf_tab[1][1])/ len(mtcars)) # 0.8125

from sklearn.metrics import accuracy_score
pred2 = result.predict(mtcars)
print('classification accuracy :', accuracy_score(mtcars['am'], np.around(pred2))) # 0.8125

model.pred_table() : builds the confusion matrix

from sklearn.metrics import accuracy_score

accuracy_score(actual values, predicted values) : computes the classification accuracy

		predicted
		positive	negative
actual	true	TP	FN
actual	false	FP	TN

=> TP, TN : the predicted value matches the actual value
=> accuracy = (TP + TN) / total count

=> precision = TP / (TP + FP)

=> recall = TP / (TP + FN)

=> specificity = TN / (FP + TN)

=> F1 score = 2 x recall x precision / (recall + precision)
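The formulas above can be evaluated directly from the four counts. This is a minimal sketch with a hypothetical helper `classification_metrics`. It assumes the statsmodels convention that `pred_table()` rows are observed classes and columns are predicted classes, so the mtcars table [[16, 3], [3, 10]] gives TN=16, FP=3, FN=3, TP=10 when am == 1 is taken as the positive class.

```python
def classification_metrics(tp, fn, fp, tn):
    """Compute accuracy, precision, recall, specificity and F1 from the four counts."""
    total = tp + fn + fp + tn
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        'accuracy': (tp + tn) / total,
        'precision': precision,
        'recall': recall,
        'specificity': tn / (fp + tn),
        'f1': 2 * recall * precision / (recall + precision),
    }

# counts read off the mtcars pred_table, with am == 1 as the positive class
m = classification_metrics(tp=10, fn=3, fp=3, tn=16)
for name, value in m.items():
    print(name, round(value, 4))
```

The accuracy comes out as 26 / 32 = 0.8125, matching the value printed by `accuracy_score` in logistic1.py.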

- Method 2: glm()

import statsmodels.formula.api as smf
import statsmodels.api as sm

result2 = smf.glm(formula=formula, data=mtcars, family=sm.families.Binomial()).fit()
print(result2)
print(result2.summary())

glm_pred = result2.predict(mtcars[:5])
print('glm predicted :\n', glm_pred)
'''
 Mazda RX4            0.250047
Mazda RX4 Wag        0.250047
Datsun 710           0.558034
Hornet 4 Drive       0.355600
Hornet Sportabout    0.397097
'''
print('actual :\n', mtcars['am'][:5])
glm_pred2 = result2.predict(mtcars)
print('classification accuracy :', accuracy_score(mtcars['am'], np.around(glm_pred2))) # 0.8125

smf.glm(formula='dependent variable ~ independent variable +...', data=data, family=sm.families.Binomial()).fit() : creates and fits a logistic regression model

- Classifying new values

new_df = mtcars.iloc[:2].copy()
new_df['mpg'] = [10, 30]
new_df['hp'] = [100, 130]
print(new_df)
'''
               mpg  cyl   disp   hp  drat     wt   qsec  vs  am  gear  carb
Mazda RX4       10    6  160.0  100   3.9  2.620  16.46   0   1     4     4
Mazda RX4 Wag   30    6  160.0  130   3.9  2.875  17.02   0   1     4     4
'''

glm_pred_new = result2.predict(new_df)
print('classification of new values :\n', np.around(glm_pred_new))
print('classification of new values :\n', np.rint(glm_pred_new))
'''
 Mazda RX4        0.0
Mazda RX4 Wag    1.0
'''

import pandas as pd
new_df2 = pd.DataFrame({'mpg':[10, 35], 'hp':[100, 145]})
glm_pred_new2 = result2.predict(new_df2)
print('classification of new values :\n', np.around(glm_pred_new2))
'''
 0    0.0
1    1.0
'''

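np.around(number) : rounds to the given number of decimals (0 by default); values exactly halfway round to the nearest even number

np.rint(number) : rounds to the nearest integer

- Logistic regression analysis

: Weather forecast - rain forecast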

* logistic2.py

import pandas as pd
from sklearn.model_selection import train_test_split
import statsmodels.api as sm
import statsmodels.formula.api as smf
import numpy as np

data = pd.read_csv('../testdata/weather.csv')
print(data.head(2), data.shape, data.columns) # (366, 12)
'''
         Date  MinTemp  MaxTemp  Rainfall  ...  Cloud  Temp  RainToday  RainTomorrow
0  2016-11-01      8.0     24.3       0.0  ...      7  23.6         No           Yes
1  2016-11-02     14.0     26.9       3.6  ...      3  25.7        Yes           Yes
Index(['Date', 'MinTemp', 'MaxTemp', 'Rainfall', 'Sunshine', 'WindSpeed',
       'Humidity', 'Pressure', 'Cloud', 'Temp', 'RainToday', 'RainTomorrow']
'''

data2 = data.drop(['Date', 'RainToday'], axis=1)  # drop() already returns a new DataFrame
data2['RainTomorrow'] = data2['RainTomorrow'].map({'Yes':1, 'No':0})
print(data2.head(5))
'''
   MinTemp  MaxTemp  Rainfall  Sunshine  ...  Pressure  Cloud  Temp  RainTomorrow
0      8.0     24.3       0.0       6.3  ...    1015.0      7  23.6             1
1     14.0     26.9       3.6       9.7  ...    1008.4      3  25.7             1
2     13.7     23.4       3.6       3.3  ...    1007.2      7  20.2             1
3     13.3     15.5      39.8       9.1  ...    1007.0      7  14.1             1
4      7.6     16.1       2.8      10.6  ...    1018.5      7  15.4             0
'''

data.drop([column1, ... ], axis=1) : drops the listed columns

data.map({'key1':value1, 'key2':value2}) : replaces each value that equals a key with the corresponding value.

- Split into train (to fit the model) / test (to validate the model) : guards against overfitting

train, test = train_test_split(data2, test_size=0.3, random_state = 42) # random sampling; random_state : seed number
print(train.shape, test.shape) # (256, 10) (110, 10)

from sklearn.model_selection import train_test_split

train_test_split(data, test_size=0.3, random_state = seed) : splits the data into train and test sets, with test_size as the test fraction.

- Classification model

#my_formula = 'RainTomorrow ~ MinTemp + MaxTemp + ...'
col_sel = "+".join(train.columns.difference(['RainTomorrow'])) # difference(x) : all columns except x
my_formula = 'RainTomorrow ~ ' + col_sel
print(my_formula) 
# RainTomorrow ~ Cloud+Humidity+MaxTemp+MinTemp+Pressure+Rainfall+Sunshine+Temp+WindSpeed

model = smf.logit(formula=my_formula, data = train).fit()
#model = smf.glm(formula=my_formula, data = train, family=sm.families.Binomial()).fit()

print(model)
print(model.params)
print('predicted:\n', np.around(model.predict(test)[:5]))
'''
 193    0.0
33     0.0
15     0.0
310    0.0
57     0.0
'''
print('actual:\n', test['RainTomorrow'][:5])
'''
 193    0
33     0
15     0
310    0
57     0
'''

separator.join(data.difference([x, .. ])) : joins the items with the separator between them. difference(x) : leaves x out of the join.

- Accuracy

con_mat = model.pred_table() # available on smf.logit() results, not on smf.glm() results
print('con_mat : \n', con_mat)
'''
 [[197.   9.]
 [ 21.  26.]]
'''
print('train classification accuracy :', (con_mat[0][0] + con_mat[1][1])/ len(train)) # 0.87109375

from sklearn.metrics import accuracy_score
pred = model.predict(test) # probabilities produced by the sigmoid function
print('test classification accuracy :', accuracy_score(test['RainTomorrow'], np.around(pred))) # 0.87272727

model.pred_table() : builds the classification table (confusion matrix). Supported by logit(); glm() does not support it.

from sklearn.metrics import accuracy_score

accuracy_score(actual values, np.around(predicted values)) : computes the accuracy

Split into virginica vs. setosa + versicolor and visualize the decision boundary

* logistic3.py

from sklearn import datasets
from sklearn.linear_model import LogisticRegression
import numpy as np

iris = datasets.load_iris()
print(iris)
print(iris.keys())
# dict_keys(['data', 'target', 'frame', 'target_names', 'DESCR', 'feature_names', 'filename'])
print(iris.target)

x = iris['data'][:, 3:] # practice with petal width only
print(x[:5])
# [0.2 0.2 0.2 0.2 0.2]

y = (iris['target'] == 2).astype(int) # np.int was removed in NumPy 1.24; the built-in int works
print(y[:5])
# [0 0 0 0 0]
print()

log_reg = LogisticRegression().fit(x,y) # create and fit the model
print(log_reg)

x_new = np.linspace(0, 3, 1000).reshape(-1,1) # 1000 evenly spaced values between 0 and 3 (not random)
print(x_new.shape) # (1000, 1)
y_proba = log_reg.predict_proba(x_new) # class probabilities
print(y_proba)
'''
[[9.99250016e-01 7.49984089e-04]
 [9.99240201e-01 7.59799387e-04] ...
 
'''

import matplotlib.pyplot as plt
plt.plot(x_new, y_proba[:, 1], 'r-', label='virginica')
plt.plot(x_new, y_proba[:, 0], 'b--', label='setosa + versicolor')
plt.xlabel('petal width')
plt.legend()
plt.show()

print(log_reg.predict([[1.5],[1.7]]))       # [0 1]
print(log_reg.predict([[2.5],[0.7]]))       # [1 0]
print(log_reg.predict_proba([[2.5],[0.7]])) # [[0.02563061 0.97436939]  [0.98465572 0.01534428]]

Classifying iris species with LogisticRegression

* logistic4

from sklearn import datasets
from sklearn.linear_model import LogisticRegression
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import StandardScaler, MinMaxScaler
import pandas as pd

iris = datasets.load_iris()
print(iris.data[:3])
'''
[[5.1 3.5 1.4 0.2]
 [4.9 3.  1.4 0.2]
 [4.7 3.2 1.3 0.2]]
'''
print(np.corrcoef(iris.data[:, 2], iris.data[:, 3]))

x = iris.data[:, [2, 3]] # features (independent variables, x) : petal length, petal width
y = iris.target # label, class
print(type(x), type(y), x.shape, y.shape) # ndarray, ndarray (150, 2) (150,)
print(set(y)) # {0, 1, 2}

- train / test split

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.3, random_state=0)
print(x_train.shape, x_test.shape, y_train.shape, y_test.shape) # (105, 2) (45, 2) (105,) (45,)

- scaling (standardization: when two or more features have different units, standardizing them can improve the model's performance)

print(x_train[:3])
'''
[[3.5 1. ]
 [5.5 1.8]
 [5.7 2.5]]
'''
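sc = StandardScaler()
sc.fit(x_train)                  # learn the mean and standard deviation from the training data only
x_train = sc.transform(x_train)
x_test = sc.transform(x_test)    # reuse the training statistics for the test data
print(x_train[:3])
'''
Output of the original run, in which the scaler was also refitted on x_test
(the values differ slightly when it is fitted on x_train only):
[[-0.05624622 -0.18650096]
 [ 1.14902997  0.93250481]
 [ 1.26955759  1.91163486]]
'''
# restore the standardized values to the original scale
# inver_x_train = sc.inverse_transform(x_train)
# print(inver_x_train[:3])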

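- Classification model

: logit(), glm() : binary classification - activation function - sigmoid : the output becomes 1 or 0 depending on whether it is above or below 0.5
: LogisticRegression : multiclass classification - activation function - softmax : the class with the largest of several probabilities is chosen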
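The difference between the two activation functions can be seen on toy numbers. This is a minimal sketch: the scores are hypothetical values made up for illustration, and it shows only the final step (scores to class), not how `LogisticRegression` estimates its coefficients.

```python
import math

def sigmoid(z):
    """Map one real-valued score to a probability in (0, 1)."""
    return 1 / (1 + math.exp(-z))

def softmax(scores):
    """Turn a list of real-valued scores into probabilities that sum to 1."""
    shift = max(scores)                      # subtract the max for numerical stability
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# binary case: one score, threshold the sigmoid output at 0.5
print(1 if sigmoid(0.8) > 0.5 else 0)

# multiclass case: one score per class, pick the most probable class
probs = softmax([1.2, 0.3, 2.5])             # hypothetical scores for classes 0, 1, 2
print([round(p, 3) for p in probs], probs.index(max(probs)))
```

With three classes, as in the iris example, the predicted label is simply the index of the largest softmax probability.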

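model = LogisticRegression(C=1.0, random_state = 0) # C : inverse of the regularization strength (L2 penalty by default); a smaller C penalizes the model more, which helps prevent overfitting
model.fit(x_train, y_train) # supervised learning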

- Classification prediction

y_pred = model.predict(x_test) # evaluate on the test data
print('predicted :', y_pred)
print('actual :', y_test)

- Classification accuracy

print('total : %d, errors : %d'%(len(y_test), (y_test != y_pred).sum())) # total : 45, errors : 2
print('classification accuracy 1: %.3f'%accuracy_score(y_test, y_pred))          # classification accuracy 1: 0.956

con_mat = pd.crosstab(y_test, y_pred, rownames = ['actual'], colnames=['predicted']) # rows come from y_test, columns from y_pred
print(con_mat)
'''
predicted   0   1   2
actual               
0          16   0   0
1           0  17   1
2           0   1  10
'''

print('classification accuracy 2:', (con_mat[0][0] + con_mat[1][1] + con_mat[2][2]) / len(y_test))
# classification accuracy 2: 0.9555555555555556

print('classification accuracy 3:', model.score(x_test, y_test))   # test
# classification accuracy 3: 0.9555555555555556
print('classification accuracy 3:', model.score(x_train, y_train)) # train
# classification accuracy 3: 0.9523809523809523

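- Predicting with new values

new_data = np.array([[5.1, 2.4], [1.1, 1.4], [8.1, 8.4]])
# standardize with the scaler already fitted on the training data (do not refit it on the new data)
new_data = sc.transform(new_data)
new_pred = model.predict(new_data)
print('prediction for new values :', new_pred)
# the original run, which refitted the scaler on new_data, printed [1 0 2]

- Plotting the logistic regression result for the iris data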

import numpy as np
import matplotlib.pyplot as plt
from matplotlib.colors import ListedColormap
from matplotlib import font_manager, rc

plt.rc('font', family='malgun gothic')      
plt.rcParams['axes.unicode_minus']= False

def plot_decision_region(X, y, classifier, test_idx=None, resolution=0.02, title=''):
    markers = ('s', 'x', 'o', '^', 'v')  # five marker shapes
    colors = ('r', 'b', 'lightgreen', 'gray', 'cyan')
    cmap = ListedColormap(colors[:len(np.unique(y))])
    #print('cmap : ', cmap.colors[0], cmap.colors[1], cmap.colors[2])

    # draw the decision surface
    x1_min, x1_max = X[:, 0].min() - 1, X[:, 0].max() + 1
    x2_min, x2_max = X[:, 1].min() - 1, X[:, 1].max() + 1  # second feature (column 1)
    xx, yy = np.meshgrid(np.arange(x1_min, x1_max, resolution), np.arange(x2_min, x2_max, resolution))

    # flatten xx and yy with ravel(), stack and transpose them into (n_points, 2),
    # pass them to the classifier's predict(), and store the predictions in Z
    Z = classifier.predict(np.array([xx.ravel(), yy.ravel()]).T)
    Z = Z.reshape(xx.shape)   # reshape Z back to the grid shape

    # draw filled contours of Z over the xx, yy grid using cmap
    plt.contourf(xx, yy, Z, alpha=0.5, cmap=cmap)
    plt.xlim(xx.min(), xx.max())
    plt.ylim(yy.min(), yy.max())

    for idx, cl in enumerate(np.unique(y)):
        plt.scatter(x=X[y==cl, 0], y=X[y==cl, 1], c=cmap(idx), marker=markers[idx], label=cl)

    if test_idx is not None:
        # circle the test samples with hollow markers
        X_test = X[test_idx, :]
        plt.scatter(X_test[:, 0], X_test[:, 1], facecolors='none', edgecolors='k', linewidth=1, marker='o', s=80, label='testset')

    plt.xlabel('petal length')
    plt.ylabel('petal width')
    plt.legend(loc=2)
    plt.title(title)
    plt.show()

x_combined_std = np.vstack((x_train, x_test))
y_combined = np.hstack((y_train, y_test))
plot_decision_region(X=x_combined_std, y=y_combined, classifier=model, test_idx=range(105, 150), title='provided by scikit-learn')

- Normalization

- Standardization
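The two headings above carry no body text in the source, so the following is only a hedged sketch of the usual meaning of the terms: normalization taken as min-max scaling into [0, 1] (what `MinMaxScaler` does by default) and standardization taken as the z-score (what `StandardScaler` does). The helper names are made up for this illustration.

```python
import statistics

def min_max_normalize(values):
    """Rescale values linearly into the range [0, 1]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def standardize(values):
    """Shift to mean 0 and scale to unit (population) standard deviation."""
    mean = statistics.mean(values)
    std = statistics.pstdev(values)
    return [(v - mean) / std for v in values]

petal_length = [3.5, 5.5, 5.7]   # first column of the x_train rows printed in logistic4
normalized = min_max_normalize(petal_length)
standardized = standardize(petal_length)
print([round(v, 3) for v in normalized])
print([round(v, 3) for v in standardized])
```

Normalization keeps values inside a fixed range, while standardization centers them on 0 and is less dominated by a single extreme value.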

ROC curve

: Evaluates the performance of a classification model

* logistic5.py

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
import numpy as np
import pandas as pd

x, y = make_classification(n_samples=16, n_features=2, n_informative=2, n_redundant=0, random_state=12)
# : dataset
# n_samples : number of samples, n_features : number of independent variables
print(x)
'''
[[-1.03701295 -0.8840986 ]
 [-1.181542    1.35572706]
 [-1.57888668 -0.13665031]
 [-2.04426219  0.79930258]
 [-1.42777756  0.2448902 ]
 [ 1.26492389  1.54672358]
 [ 2.53102266  1.99835068]
 [-1.66485782  0.71855249]
 [ 0.96918839 -1.25885923]
 [-3.23328615  1.58405095]
 [ 1.79298809  1.77564192]
 [ 1.34738938  0.66463162]
 [-0.35655805  0.33163742]
 [ 1.39723888  1.23611398]
 [ 0.93616267 -1.36918874]
 [ 0.69830946 -2.46962002]]
'''
print(y)
# [0 1 0 0 1 1 1 0 0 1 1 0 1 1 0 0]

model = LogisticRegression().fit(x, y) # model
y_hat = model.predict(x)               # prediction
print(y_hat)
# [0 1 0 1 0 1 1 1 0 1 1 1 0 1 0 0]

f_value = model.decision_function(x)
# decision function: a signed score per sample (positive -> class 1); roc_curve sweeps its threshold over these scores
print(f_value)
'''
[ 0.37829565  1.6336573  -1.42938156  1.21967832  2.06504666 -4.11896895
 -1.04677034 -1.21469968  1.62496692 -0.43866584 -0.92693183 -0.76588836
  0.09428499  1.62617134 -2.08158634  2.36316277]
'''

df = pd.DataFrame(np.vstack([f_value, y_hat, y]).T, columns= ['f', 'y_hat', 'y'])
df.sort_values("f", ascending=False).reset_index(drop=True)  # note: the sorted copy is not assigned, so df keeps its original order
print(df)
'''
           f  y_hat    y
0  -1.902803    0.0  0.0
1   1.000982    1.0  1.0
2  -1.008356    0.0  0.0
3   0.143868    1.0  0.0
4  -0.487168    0.0  1.0
5   1.620022    1.0  1.0
6   2.401185    1.0  1.0 ...
'''

# ROC
from sklearn.metrics import confusion_matrix
print(confusion_matrix(y, y_hat, labels=[1, 0]))
# [[6 2]     rows: actual 1, 0; columns: predicted 1, 0  ->  TP=6, FN=2
#  [3 5]]                                                     FP=3, TN=5
accuracy = (6 + 5) / (6 + 2 + 3 + 5)
print('accuracy : ', accuracy) # accuracy :  0.6875
recall = 6 / (6 + 2)           # recall (TPR) = TP / (TP + FN)
print('recall : ', recall)     # recall :  0.75
fallout = 3 / (3 + 5)          # fall-out, the false positive rate (FPR) = FP / (FP + TN)
print('fallout : ', fallout)   # fallout :  0.375

from sklearn import metrics
acc_sco = metrics.accuracy_score(y, y_hat)
cl_rep = metrics.classification_report(y, y_hat)
print('acc_sco : ', acc_sco)   # acc_sco :  0.6875
print('cl_rep : \n', cl_rep)
'''
               precision    recall  f1-score   support

           0       0.71      0.62      0.67         8
           1       0.67      0.75      0.71         8

    accuracy                           0.69        16
   macro avg       0.69      0.69      0.69        16
weighted avg       0.69      0.69      0.69        16
'''

from sklearn.metrics import roc_curve
fpr, tpr, thresholds = roc_curve(y, model.decision_function(x))
print('fpr :', fpr)             # fpr : [0.    0.    0.    0.375 0.375 1.   ]
print('tpr :', tpr)             # tpr : [0.    0.125 0.75  0.75  1.    1.   ]
print('thresholds', thresholds) # thresholds [ 3.40118546  2.40118546  0.98927765  0.09570707 -0.48716822 -3.71164276]

import matplotlib.pyplot as plt
plt.plot(fpr, tpr, 'o-', label='Logistic Regression')
plt.plot([0, 1], [0, 1], 'k--', label='random guess')
plt.plot([fallout], [recall], 'ro', ms=10)
plt.xlabel('FPR')
plt.ylabel('TPR')
plt.title('ROC')
plt.show()

# AUC (Area Under the Curve) : the area under the ROC curve
from sklearn.metrics import auc
print('auc :', auc(fpr, tpr)) # auc : 0.90625
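What `roc_curve` and `auc` compute can be sketched in plain Python: lower the decision threshold step by step, record (FPR, TPR) at each step, and integrate with the trapezoidal rule. This is a minimal sketch, not scikit-learn's implementation; the helper names, labels and scores are made up for illustration.

```python
def roc_points(y_true, scores):
    """Return (fpr, tpr) lists obtained by lowering the threshold score by score."""
    positives = sum(y_true)
    negatives = len(y_true) - positives
    fpr, tpr = [0.0], [0.0]
    # visit the distinct thresholds from the highest score to the lowest
    for threshold in sorted(set(scores), reverse=True):
        predicted = [1 if s >= threshold else 0 for s in scores]
        tp = sum(1 for y, p in zip(y_true, predicted) if y == 1 and p == 1)
        fp = sum(1 for y, p in zip(y_true, predicted) if y == 0 and p == 1)
        tpr.append(tp / positives)
        fpr.append(fp / negatives)
    return fpr, tpr

def auc_trapezoid(fpr, tpr):
    """Area under the piecewise-linear ROC curve (trapezoidal rule)."""
    return sum((fpr[i] - fpr[i - 1]) * (tpr[i] + tpr[i - 1]) / 2
               for i in range(1, len(fpr)))

# hypothetical labels and decision scores, made up for this sketch
y_true = [0, 0, 1, 1, 0, 1]
scores = [-1.5, -0.2, 0.1, 0.9, 0.4, 2.0]

fpr, tpr = roc_points(y_true, scores)
print(fpr)
print(tpr)
print(auc_trapezoid(fpr, tpr))
```

For this toy data the area is 8/9: of the nine positive/negative pairs, eight have the positive sample scored higher, which is the ranking interpretation of AUC.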


[Deep Learning] Polynomial Regression

2021. 3. 12. 12:51

Polynomial regression

Converting a linear regression model into polynomial regression

* linear_reg10.py

import numpy as np
import matplotlib.pyplot as plt

x = np.array([1,2,3,4,5])
y = np.array([4,2,1,3,7])
plt.scatter(x, y) # scatter plot
plt.show()

Linear regression model

from sklearn.linear_model import LinearRegression
x = x[:, np.newaxis] # the input must be a matrix, so add a dimension
print(x)
'''
[[1]
 [2]
 [3]
 [4]
 [5]]
'''
model = LinearRegression().fit(x, y) # linear regression model
y_pred = model.predict(x) # predicted values
print(y_pred)
# [2.  2.7 3.4 4.1 4.8]

plt.scatter(x, y) # scatter plot
plt.plot(x, y_pred, c='red') # trend line
plt.show()

Adding polynomial features

# for nonlinear data, add polynomial features
from sklearn.preprocessing import PolynomialFeatures
poly = PolynomialFeatures(degree=3, include_bias = False) # degree : highest power to generate, include_bias : whether to add the constant (bias) column
print(poly)
x2 = poly.fit_transform(x) # build the feature matrix
print(x2)
'''
[[  1.   1.   1.]
 [  2.   4.   8.]
 [  3.   9.  27.]
 [  4.  16.  64.]
 [  5.  25. 125.]]
  ---- powers ---->
'''

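model2 = LinearRegression().fit(x2, y) # linear regression model on the polynomial features
y_pred2 = model2.predict(x2) # predicted values
print(y_pred2)
# [4.04285714 1.82857143 1.25714286 2.82857143 7.04285714]

plt.scatter(x, y) # scatter plot
plt.plot(x, y_pred2, c='red') # fitted curve
plt.show()

Converting a linear regression model into polynomial regression

: Add polynomial terms. Build the feature matrix.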
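Building the feature matrix amounts to adding powers of the input as extra columns. The following minimal sketch mirrors `PolynomialFeatures(degree=3, include_bias=False)` for a single input column only; `polynomial_features` is a hypothetical helper, and the real class also generates cross terms when there are several input features, which this sketch does not cover.

```python
def polynomial_features(xs, degree):
    """Expand each scalar x into [x, x**2, ..., x**degree] (no bias column)."""
    return [[x ** d for d in range(1, degree + 1)] for x in xs]

rows = polynomial_features([1, 2, 3, 4, 5], degree=3)
for row in rows:
    print(row)
# [1, 1, 1]
# [2, 4, 8]
# [3, 9, 27]
# [4, 16, 64]
# [5, 25, 125]
```

A plain linear regression fitted on these columns is still linear in its coefficients, but it traces a cubic curve in the original x.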

* linear_reg11.py

import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
from sklearn.metrics import mean_squared_error, r2_score

x = np.array([258, 270, 294, 320, 342, 368, 396, 446, 480, 586])[:, np.newaxis]
print(x)
'''
[[258]
 [270]
 [294]
 [320]
 [342]
 [368]
 [396]
 [446]
 [480]
 [586]] 
'''
y = np.array([236, 234, 253, 298, 314, 342, 360, 368, 391, 390])

# for comparison: one plain linear model and one model for the polynomial features
lr = LinearRegression()
pr = LinearRegression()
polyf = PolynomialFeatures(degree=2) # generates the feature matrix
x_quad = polyf.fit_transform(x)
print(x_quad)
'''
[[1.00000e+00 2.58000e+02 6.65640e+04]
 [1.00000e+00 2.70000e+02 7.29000e+04]
 [1.00000e+00 2.94000e+02 8.64360e+04]
'''
lr.fit(x, y)
x_fit = np.arange(250, 600, 10)[:, np.newaxis]
print(x_fit)
'''
[[250]
 [260]
 [270]
 [280]
 [290]
'''

y_lin_fit = lr.predict(x_fit)
print(y_lin_fit)
# [250.63869122 256.03244588 261.42620055 ...

pr.fit(x_quad, y)
y_quad_fit = pr.predict(polyf.transform(x_fit))  # polyf is already fitted, so transform() is enough
print(y_quad_fit)
# [215.50100168 228.03388862 240.11490613 ...

# visualization
plt.scatter(x, y, label='train points')
plt.plot(x_fit, y_lin_fit, label='linear fit', linestyle='--', c='red')
plt.plot(x_fit, y_quad_fit, label='quadratic fit', linestyle='-', c='blue')
plt.legend()
plt.show()
print()

# check the MSE (mean squared error) and R2 (coefficient of determination)
y_lin_pred = lr.predict(x)
print('y_lin_pred :\n', y_lin_pred)
'''
 [254.95369495 261.42620055 274.37121174 288.39497387 300.26123414
 314.28499627 329.38750933 356.35628266 374.69504852 431.86884797]
'''
y_quad_pred = pr.predict(x_quad)
print('y_quad_pred :\n', y_quad_pred)
'''
 [225.56346079 240.11490613 267.26572086 293.74195218 313.75904653
 334.59594733 353.61955374 378.77882554 389.43443486 389.12615204]
'''

print('train MSE: linear model %.3f, polynomial model %.3f'%(mean_squared_error(y, y_lin_pred), mean_squared_error(y, y_quad_pred)))
# train MSE: linear model 570.885, polynomial model 58.294

print('train R2: linear model %.3f, polynomial model %.3f'%(r2_score(y, y_lin_pred), r2_score(y, y_quad_pred)))
# train R2: linear model 0.831, polynomial model 0.983
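
The two steps used in linear_reg11.py (build the polynomial feature matrix, then fit LinearRegression) can also be chained into a single estimator. The following is a minimal sketch, assuming scikit-learn's make_pipeline fits the use case; the names lin_model and quad_model are hypothetical and do not appear in the original listing.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Same training data as linear_reg11.py
x = np.array([258, 270, 294, 320, 342, 368, 396, 446, 480, 586])[:, np.newaxis]
y = np.array([236, 234, 253, 298, 314, 342, 360, 368, 391, 390])

# Plain linear fit for comparison
lin_model = LinearRegression().fit(x, y)

# quad_model (hypothetical name): expands x to [1, x, x^2], then fits a linear model
quad_model = make_pipeline(PolynomialFeatures(degree=2), LinearRegression()).fit(x, y)

r2_lin = r2_score(y, lin_model.predict(x))
r2_quad = r2_score(y, quad_model.predict(x))
print('train R2: linear %.3f, quadratic %.3f' % (r2_lin, r2_quad))
```

With a pipeline, new inputs go straight into quad_model.predict(), so the transform step cannot be forgotten or refit by accident.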


[Deep Learning] Simple Linear Regression, Multiple Linear Regression

2021. 3. 11. 10:36

Simple Linear Regression

: ols()

: Independent variable: continuous; dependent variable: continuous.
: One independent variable

* linear_reg4.py

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
plt.rc('font', family='malgun gothic')

df = pd.read_csv('../testdata/drinking_water.csv')
print(df.head(3), '\n', df.describe())
'''
   친밀도  적절성  만족도
0    3    4    3
1    3    3    2
2    4    4    4 
'''

print(df.corr()) # correlation between 적절성 (suitability) and 만족도 (satisfaction) : 0.766853

print('----------------------------------------------------------------------')
import statsmodels.formula.api as smf

model = smf.ols(formula='만족도 ~ 적절성', data=df).fit()
print(model.summary())

                            OLS Regression Results                            
==============================================================================
Dep. Variable:                    만족도   R-squared:                       0.588
Model:                            OLS   Adj. R-squared:                  0.586
Method:                 Least Squares   F-statistic:                     374.0
Date:                Thu, 11 Mar 2021   Prob (F-statistic):           2.24e-52
Time:                        10:07:49   Log-Likelihood:                -207.44
No. Observations:                 264   AIC:                             418.9
Df Residuals:                     262   BIC:                             426.0
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept      0.7789      0.124      6.273      0.000       0.534       1.023
적절성            0.7393      0.038     19.340      0.000       0.664       0.815
==============================================================================
Omnibus:                       11.674   Durbin-Watson:                   2.185
Prob(Omnibus):                  0.003   Jarque-Bera (JB):               16.003
Skew:                          -0.328   Prob(JB):                     0.000335
Kurtosis:                       4.012   Cond. No.                         13.4
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

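[Interpretation]

For simple linear regression, correlation coefficient ** 2 = coefficient of determination.

print(0.766853 ** 2) # 0.588063523609
R-squared : coefficient of determination (explanatory power), the square of the correlation coefficient R : 0.588
: 1 - (SSE / SST)
: SSE (sum of squared errors) : squared y-differences between the data and the fitted line
: SST (total sum of squares) : squared y-differences between the data and their mean

=> overfitting : an R2 very close to 1 (the fit almost reproduces the training data) can mean the model explains new data poorly.
p-value of 적절성 : 0.000 < 0.05 => the coefficient is statistically significant.
std err (standard error) : 0.038
Intercept (y-intercept) : 0.7789
coef (slope) : 0.7393
t = slope / standard error : 19.340

print(0.7393 / 0.038) # 19.455263157894738 (differs from 19.340 only because the inputs are rounded)
F-statistic = t**2 (with a single independent variable) : 374.0

print(19.340 ** 2) # 374.0356

With many independent variables, a large gap between R-squared and Adj. R-squared suggests that some of them add little explanatory power, so they should be reviewed.
Kurtosis : 4.012 => a value above 3 means the distribution is more peaked, with heavier tails, than a normal distribution.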
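
The identities in the interpretation (R2 = 1 - SSE/SST, R2 = r**2, and F = t**2 for one predictor) can be checked numerically. The sketch below uses NumPy only and synthetic data, since drinking_water.csv is not available here; the data and all variable names are assumptions for illustration.

```python
import numpy as np

# Synthetic data (assumption): y depends linearly on x plus noise
rng = np.random.default_rng(0)
x = rng.uniform(1, 5, size=200)
y = 0.8 + 0.7 * x + rng.normal(0, 0.5, size=200)
n = len(x)

slope, intercept = np.polyfit(x, y, 1)      # least-squares line
y_hat = slope * x + intercept

sse = np.sum((y - y_hat) ** 2)              # data vs. fitted line
sst = np.sum((y - y.mean()) ** 2)           # data vs. mean
r2 = 1 - sse / sst

r = np.corrcoef(x, y)[0, 1]                 # correlation coefficient

# Standard error of the slope, then t and F
se_slope = np.sqrt(sse / (n - 2) / np.sum((x - x.mean()) ** 2))
t_stat = slope / se_slope
f_stat = (sst - sse) / (sse / (n - 2))      # one predictor, so the model has 1 degree of freedom

print('R2 =', r2, ' r**2 =', r ** 2)
print('t**2 =', t_stat ** 2, ' F =', f_stat)
```

Both printed pairs should agree up to floating-point error, which is why the summary table's 19.340 ** 2 lands on the F-statistic of 374.0.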

print(model.params) # intercept and slope
# Intercept    0.778858
#적절성          0.739276

print(model.rsquared) # 0.5880630629464404
print()
print(model.pvalues)
'''
Intercept    1.454388e-09
적절성          2.235345e-52
'''
#print(model.predict()) # predicted values
print(df.만족도[0],' ', model.predict()[0]) # 3   3.7359630488589186

# Predicting new values
print(df.적절성[:5])
'''
0    4
1    3
2    4
3    2
4    2
'''

print(df.만족도[:5])
'''
0    3
1    2
2    4
3    2
4    2
'''

print(model.predict()[:5]) # [3.73596305 2.99668687 3.73596305 2.25741069 2.25741069]
print()

new_df = pd.DataFrame({'적절성':[6,5,4,3,22]})
new_pred = model.predict(new_df)
print('new_pred :\n', new_pred)
'''
 0     5.214515
1     4.475239
2     3.735963
3     2.996687
4    17.042934
'''

plt.scatter(df.적절성, df.만족도)
slope, intercept = np.polyfit(df.적절성, df.만족도, 1) # like R's abline
plt.plot(df.적절성, df.적절성 * slope + intercept, 'b') # fitted line
plt.show()

Multiple Linear Regression

: More than one independent variable

model2 = smf.ols(formula='만족도 ~ 적절성 + 친밀도', data=df).fit()
print(model2.summary())

                            OLS Regression Results                            
==============================================================================
Dep. Variable:                    만족도   R-squared:                       0.598
Model:                            OLS   Adj. R-squared:                  0.594
Method:                 Least Squares   F-statistic:                     193.8
Date:                Thu, 11 Mar 2021   Prob (F-statistic):           2.61e-52
Time:                        11:19:33   Log-Likelihood:                -204.37
No. Observations:                 264   AIC:                             414.7
Df Residuals:                     261   BIC:                             425.5
Df Model:                           2                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept      0.6673      0.131      5.096      0.000       0.409       0.925
적절성            0.6852      0.044     15.684      0.000       0.599       0.771
친밀도            0.0959      0.039      2.478      0.014       0.020       0.172
==============================================================================
Omnibus:                       13.103   Durbin-Watson:                   2.174
Prob(Omnibus):                  0.001   Jarque-Bera (JB):               17.256
Skew:                          -0.382   Prob(JB):                     0.000179
Kurtosis:                       3.992   Cond. No.                         18.8
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

Simple linear regression

: iris dataset, using ols(). Models are built with a weakly correlated and a strongly correlated variable.

* linear_reg5.py

import pandas as pd
import statsmodels.formula.api as smf
import seaborn as sns
iris = sns.load_dataset('iris')
print(iris.head(3))
'''
   sepal_length  sepal_width  petal_length  petal_width species
0           5.1          3.5           1.4          0.2  setosa
1           4.9          3.0           1.4          0.2  setosa
2           4.7          3.2           1.3          0.2  setosa
'''

print(iris.corr(numeric_only=True)) # skip the non-numeric species column
'''
              sepal_length  sepal_width  petal_length  petal_width
sepal_length      1.000000    -0.117570      0.871754     0.817941
sepal_width      -0.117570     1.000000     -0.428440    -0.366126
petal_length      0.871754    -0.428440      1.000000     0.962865
petal_width       0.817941    -0.366126      0.962865     1.000000
'''

# Simple linear regression model : correlation r = -0.117570 (sepal_length/sepal_width)
result = smf.ols(formula = 'sepal_length ~ sepal_width', data=iris).fit()
#print(result.summary()) #  R2 : 0.014
print(result.rsquared)# 0.01382      => almost no explanatory power
print(result.pvalues) # sepal_width : 1.518983e-01 > 0.05 => not significant

result2 = smf.ols(formula = 'sepal_length ~ petal_length', data=iris).fit()
print(result2.summary()) #  R2 : 0.760      => explanatory power
print(result2.rsquared)# 0.7599      => high explanatory power
print(result2.pvalues) # petal_length : 1.038667e-47 < 0.05 => significant
print()

pred = result2.predict()
print('actual :', iris.sepal_length[0]) # actual : 5.1
print('predicted :', pred[0])           # predicted : 4.879094603339241

# Predicting with new data
print(iris.petal_length[1:5])
new_data = pd.DataFrame({'petal_length':[1.4, 0.5, 8.5, 12.123]})
print(new_data)
'''
   petal_length
0         1.400
1         0.500
2         8.500
3        12.123
'''
y_pred_new = result2.predict(new_data)
print('sepal_length predicted from new data :\n', y_pred_new)
'''
 0    4.879095
1    4.511065
2    7.782443
3    9.263968
'''

Multiple linear regression

result3 = smf.ols(formula = 'sepal_length ~ petal_length + petal_width', data=iris).fit()
print(result3.summary()) #  R2 : 0.766      => explanatory power

                            OLS Regression Results                            
==============================================================================
Dep. Variable:           sepal_length   R-squared:                       0.766
Model:                            OLS   Adj. R-squared:                  0.763
Method:                 Least Squares   F-statistic:                     241.0
Date:                Thu, 11 Mar 2021   Prob (F-statistic):           4.00e-47
Time:                        12:09:43   Log-Likelihood:                -75.023
No. Observations:                 150   AIC:                             156.0
Df Residuals:                     147   BIC:                             165.1
Df Model:                           2                                         
Covariance Type:            nonrobust                                         
================================================================================
                   coef    std err          t      P>|t|      [0.025      0.975]
--------------------------------------------------------------------------------
Intercept        4.1906      0.097     43.181      0.000       3.999       4.382
petal_length     0.5418      0.069      7.820      0.000       0.405       0.679
petal_width     -0.3196      0.160     -1.992      0.048      -0.637      -0.002
==============================================================================
Omnibus:                        0.383   Durbin-Watson:                   1.826
Prob(Omnibus):                  0.826   Jarque-Bera (JB):                0.540
Skew:                           0.060   Prob(JB):                        0.763
Kurtosis:                       2.732   Cond. No.                         25.3
==============================================================================

print('R-squared :', result3.rsquared)# 0.7662      => high explanatory power
print('p-value', result3.pvalues)
# petal_length    9.414477e-13
# petal_width     4.827246e-02
# both p-values < 0.05 => both coefficients are significant
# y = 0.5418 * x1 -0.3196 * x2 + 4.1906

# Predicting with new data
new_data2 = pd.DataFrame({'petal_length':[8.5, 12.12], 'petal_width':[8.5, 12.5]})
y_pred_new2 = result3.predict(new_data2)
print('sepal_length predicted from new data :\n', y_pred_new2)
'''
 0    6.079508
1    6.762540
'''
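
The regression equation y = 0.5418 * x1 - 0.3196 * x2 + 4.1906 can be evaluated by hand to see where the two predictions come from. This is a small sketch using the rounded coefficients from the summary table; the helper name predict_sepal_length is hypothetical.

```python
# Coefficients as rounded in the summary table
intercept, b_petal_length, b_petal_width = 4.1906, 0.5418, -0.3196

def predict_sepal_length(petal_length, petal_width):
    """Evaluate y = b1 * x1 + b2 * x2 + intercept by hand (hypothetical helper)."""
    return b_petal_length * petal_length + b_petal_width * petal_width + intercept

manual = [predict_sepal_length(8.5, 8.5), predict_sepal_length(12.12, 12.5)]
print(manual)
```

The results match result3.predict() to about three decimal places; the small gap comes from using coefficients rounded to four digits.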

Linear regression analysis

: mtcars dataset, using ols(). Build a model, then obtain estimates.

* linear_reg6.py

import statsmodels.api
import statsmodels.formula.api as smf
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

plt.rc('font', family='malgun gothic')

mtcars = statsmodels.api.datasets.get_rdataset('mtcars').data
print(mtcars)
'''
                      mpg  cyl   disp   hp  drat  ...   qsec  vs  am  gear  carb
Mazda RX4            21.0    6  160.0  110  3.90  ...  16.46   0   1     4     4
Mazda RX4 Wag        21.0    6  160.0  110  3.90  ...  17.02   0   1     4     4
'''
print(mtcars.columns) # Index(['mpg', 'cyl', 'disp', 'hp', 'drat', 'wt', 'qsec', 'vs', 'am', 'gear', 'carb'], dtype='object')
print(mtcars.describe())
print(np.corrcoef(mtcars.hp, mtcars.mpg)) # correlation coefficient : -0.77616837
print(np.corrcoef(mtcars.wt, mtcars.mpg)) # correlation coefficient : -0.86765938
print(mtcars.corr())

# Visualization
plt.scatter(mtcars.hp, mtcars.mpg)
plt.xlabel('horsepower')
plt.ylabel('mpg')
slope, intercept = np.polyfit(mtcars.hp, mtcars.mpg, 1) # degree 1
plt.plot(mtcars.hp, mtcars.hp * slope + intercept, 'r')
plt.show()

# Simple linear regression
result = smf.ols('mpg ~ hp', data=mtcars).fit()
print(result.summary())
print(result.conf_int(alpha=0.05)) # 95% confidence intervals; upper bound of Intercept : 33.435772
print(result.summary().tables[1])  # coefficient table : coef * x + Intercept
print('predicted mpg for hp 110 :', -0.0682 * 110 + 30.0989) # 22.5969
print('predicted mpg for hp 50 :', -0.0682 * 50 + 30.0989)   # 26.6889
# As horsepower increases, mpg decreases (negative correlation). Treat the estimates as a reference only.

# Multiple linear regression
result2 = smf.ols('mpg ~ hp + wt', data=mtcars).fit()
print(result2.summary())
print(result2.conf_int(alpha=0.05))
print(result2.summary().tables[1])
print('predicted mpg for hp 110 and weight 5 :', ((-0.0318 * 110) +(-3.8778 * 5) + 37.2273)) # 14.3403

print('Estimating mpg from vehicle weight')
result3 = smf.ols('mpg ~ wt', data=mtcars).fit()
print(result3.summary())
print('R-squared :', result3.rsquared) # 0.7528327936582646 => high explanatory power
pred = result3.predict()

# Compare the actual and the predicted (estimated) value for one row
print(mtcars.mpg.iloc[0])
print(pred[0]) # estimated mpg for the first car

data = {
    'mpg':mtcars.mpg,
    'mpg_pred':pred
    }
df = pd.DataFrame(data)
print(df)
'''
                      mpg   mpg_pred
Mazda RX4            21.0  23.282611
Mazda RX4 Wag        21.0  21.919770
Datsun 710           22.8  24.885952
'''

# Estimate mpg for a new vehicle weight
new_wt_value = float(input('Enter vehicle weight: ')) # keep the input separate so the mtcars.wt column is not overwritten
new_pred = result3.predict(pd.DataFrame({'wt': [new_wt_value]}))
print('For vehicle weight {}, the estimated mpg is {}'.format(new_wt_value, new_pred[0]))
# For vehicle weight 1.0, the estimated mpg is 31.940654594619367

# Estimate mpg for several vehicle weights
new_wt = pd.DataFrame({'wt':[6, 3, 0.5]})
new_pred2 = result3.predict(new_wt)
print('estimated mpg : \n', np.round(new_pred2.values, 2)) #  [ 5.22 21.25 34.61]

Linear regression analysis

: Sales data by advertising spend across several media, using ols(). Build a model, then obtain estimates.

* linear_reg7.py

import statsmodels.api
import statsmodels.formula.api as smf
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns # needed by sns.regplot in the visualization step


adf_df = pd.read_csv('../testdata/Advertising.csv', usecols=[1,2,3,4])
print(adf_df.head(3), ' ', adf_df.shape) # (200, 4)
print(adf_df.index, adf_df.columns)
print(adf_df.info())
'''
      tv  radio  newspaper  sales
0  230.1   37.8       69.2   22.1
1   44.5   39.3       45.1   10.4
2   17.2   45.9       69.3    9.3
'''

print('correlation coefficient r : \n', adf_df.loc[:, ['sales', 'tv']].corr())
'''
           sales        tv
sales  1.000000  0.782224
tv     0.782224  1.000000
'''
# r : 0.782224 => a strong positive correlation. Correlation alone does not establish causation.
print()

lm = smf.ols(formula='sales ~ tv', data=adf_df).fit()
print(lm.summary()) # R-squared : 0.612, p : 1.47e-42
print(lm.params)
print(lm.pvalues)
print(lm.rsquared)

# Visualization
plt.scatter(adf_df.tv, adf_df.sales)
plt.xlabel('tv')
plt.ylabel('sales')
x = pd.DataFrame({'tv':[adf_df.tv.min(), adf_df.tv.max()]})
y_pred = lm.predict(x)
plt.plot(x, y_pred, c='red')
plt.title('Linear Regression')
sns.regplot(x=adf_df.tv, y=adf_df.sales, scatter_kws = {'color':'r'}) # keyword arguments are required in recent seaborn
plt.xlim(-50, 350)
plt.ylim(ymin=0)
plt.show()

# Prediction : estimate sales for new tv values
x_new = pd.DataFrame({'tv':[230.1, 44.5, 100]})
pred = lm.predict(x_new)
print('estimates :\n', pred)
'''
0    17.970775
1     9.147974
2    11.786258
'''

print('\nMultiple linear regression model')
lm_mul = smf.ols(formula = 'sales ~ tv + radio + newspaper', data = adf_df).fit()
# R2 barely changes with or without newspaper, so it is a candidate for removal.
print(lm_mul.summary())
print(adf_df.corr())

# Prediction 2 : estimate sales for new tv, radio, and newspaper values
x_new2 = pd.DataFrame({'tv':[230.1, 44.5, 100], 'radio':[30.1, 40.1, 50.1],\
                      'newspaper':[10.1, 10.1, 10.1]})
pred2 = lm_mul.predict(x_new2) # use the multiple regression model, not the tv-only lm
print('estimates :\n', pred2)

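Conditions for an adequate regression model

: If any of the conditions below is violated, consider carefully whether to remove or transform variables.

- Normality : the residuals should follow a normal distribution.
- Independence : the residuals should be uncorrelated with one another (no autocorrelation).
- Linearity : the dependent variable should change linearly with the independent variables; residuals plotted against fitted values should show no systematic pattern.
- Homoscedasticity : the variance of the errors (residuals) should be constant, spread evenly with no particular pattern.
- Multicollinearity : strong correlations among the independent variables should not cause problems.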

# Residuals
fitted = lm_mul.predict(adf_df)     # fitted values
print(fitted)
'''
0      20.523974
1      12.337855
2      12.307671
'''
residual = adf_df['sales'] - fitted # residuals

import seaborn as sns
print('Linearity - residuals should stay around 0 across the fitted values')
sns.regplot(x=fitted, y=residual, lowess = True, line_kws = {'color':'red'})
plt.plot([fitted.min(), fitted.max()], [0, 0], '--', color='grey')
plt.show() # linearity is not satisfied.

print('Normality - check whether the residuals follow a normal distribution')
import scipy.stats as stats
sr = stats.zscore(residual)
(x, y), _ = stats.probplot(sr)
sns.scatterplot(x=x, y=y)
plt.plot([-3, 3], [-3, 3], '--', color="grey")
plt.show() # normality is not satisfied.
print('residual test :', stats.shapiro(residual))
# residual test : ShapiroResult(statistic=0.9176644086837769, pvalue=3.938041004403203e-09)
# pvalue=3.938041004403203e-09 < 0.05 => normality is not satisfied.

print('Independence - check whether the residuals are autocorrelated (errors of adjacent observations are correlated)')
# model.summary() Durbin-Watson:2.084 => checks independence of the residuals. A value near 2 means no autocorrelation (the residuals are independent of one another).
# Near 0 means positive autocorrelation; near 4 means negative autocorrelation.
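
The Durbin-Watson statistic read from the summary can be reproduced from the residuals: it is the sum of squared differences between consecutive residuals divided by the sum of squared residuals. The sketch below uses NumPy and synthetic residuals (an assumption, since Advertising.csv is not available here); the function name durbin_watson_stat is hypothetical.

```python
import numpy as np

def durbin_watson_stat(residuals):
    """Sum of squared successive differences divided by the sum of squared residuals."""
    e = np.asarray(residuals, dtype=float)
    return np.sum(np.diff(e) ** 2) / np.sum(e ** 2)

rng = np.random.default_rng(0)
independent = rng.normal(size=1000)          # no autocorrelation
autocorrelated = np.cumsum(independent)      # each value builds on the previous one

dw_independent = durbin_watson_stat(independent)
dw_autocorrelated = durbin_watson_stat(autocorrelated)
print(dw_independent, dw_autocorrelated)
```

The independent series should land near 2 and the cumulative series near 0, matching the reading rule in the comments. statsmodels also ships this statistic as statsmodels.stats.stattools.durbin_watson, which could be applied to residual directly.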

print('Homoscedasticity - check whether the variance of the residuals is constant')
sns.regplot(x=fitted, y=np.sqrt(np.abs(sr)), lowess = True, line_kws = {'color':'red'})
plt.show()
# The trend line is not horizontal, so homoscedasticity is not satisfied.

print('Multicollinearity - check for strong correlations among the independent variables')
# A variable whose VIF (Variance Inflation Factor) exceeds 10 can be considered multicollinear.
# These calls pass every column of adf_df, including the dependent variable sales, and no constant term.
from statsmodels.stats.outliers_influence import variance_inflation_factor
print(variance_inflation_factor(adf_df.values, 0)) # 23.198876299003153
print(variance_inflation_factor(adf_df.values, 1)) # 12.570312383503682
print(variance_inflation_factor(adf_df.values, 2)) # 3.1534983754953845
print(variance_inflation_factor(adf_df.values, 3)) # 55.3039198336228

# View as a DataFrame
vif_df = pd.DataFrame()
vif_df['vif_value'] = [variance_inflation_factor(adf_df.values, i) for i in range(adf_df.shape[1])]
print(vif_df)
'''
   vif_value
0  23.198876
1  12.570312
2   3.153498
3  55.303920
'''
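
A common convention computes VIF on the predictors only, regressing each predictor on the remaining ones with a constant term and taking 1 / (1 - R2). The following is a minimal NumPy sketch of that definition on synthetic predictors (an assumption for illustration); the function name vif is hypothetical and is not the statsmodels API.

```python
import numpy as np

def vif(X, j):
    """VIF of column j: regress it on the other columns plus a constant, then 1 / (1 - R2)."""
    target = X[:, j]
    others = np.delete(X, j, axis=1)
    design = np.column_stack([np.ones(len(X)), others])
    coef, *_ = np.linalg.lstsq(design, target, rcond=None)
    resid = target - design @ coef
    r2 = 1 - np.sum(resid ** 2) / np.sum((target - target.mean()) ** 2)
    return 1.0 / (1.0 - r2)

# Synthetic predictors (assumption): x3 is almost a copy of x1, x2 is unrelated
rng = np.random.default_rng(0)
x1 = rng.normal(size=500)
x2 = rng.normal(size=500)
x3 = x1 + 0.05 * rng.normal(size=500)
X = np.column_stack([x1, x2, x3])

vifs = [vif(X, j) for j in range(X.shape[1])]
print(vifs)
```

Here x1 and x3 should show VIF far above 10 while x2 stays close to 1, which is the pattern the threshold of 10 is meant to flag.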

print("Note : Cook's distance - identify observations with extreme influence")
from statsmodels.stats.outliers_influence import OLSInfluence
cd, _ = OLSInfluence(lm_mul).cooks_distance
print(cd.sort_values(ascending=False).head())
'''
130    0.272956
5      0.128306
75     0.056313
35     0.051275
178    0.045921
'''

import statsmodels.api as sm
sm.graphics.influence_plot(lm_mul, criterion='cooks')
plt.show()

print(adf_df.iloc[[130, 5, 75, 35, 178]]) # extreme values; excluding them from the analysis is recommended.
'''
        tv  radio  newspaper  sales
130    0.7   39.6        8.7    1.6
5      8.7   48.9       75.0    7.2
75    16.9   43.7       89.4    8.7
35   290.7    4.1        8.5   12.8
178  276.7    2.3       23.7   11.8
'''

* linear_reg8.py

from sklearn.linear_model import LinearRegression
import statsmodels.api

mtcars = statsmodels.api.datasets.get_rdataset('mtcars').data
print(mtcars[:3])
'''
                mpg  cyl   disp   hp  drat     wt   qsec  vs  am  gear  carb
Mazda RX4      21.0    6  160.0  110  3.90  2.620  16.46   0   1     4     4
Mazda RX4 Wag  21.0    6  160.0  110  3.90  2.875  17.02   0   1     4     4
Datsun 710     22.8    4  108.0   93  3.85  2.320  18.61   1   1     4     1
'''

# Does hp (horsepower) affect mpg (fuel economy)? If there is a relationship, estimate the size of its effect on mpg (quantitative analysis)
x = mtcars[['hp']].values
y = mtcars[['mpg']].values
print(x[:3])
'''
[[110]
 [110]
 [ 93]]
'''
print(y[:3])
'''
[[21. ]
 [21. ]
 [22.8]]
'''

import matplotlib.pyplot as plt
plt.scatter(x, y) # scatter plot
plt.show()

fit_model = LinearRegression().fit(x, y)   # fit the model
print('slope :', fit_model.coef_[0])       # slope : [-0.06822828]
print('intercept :', fit_model.intercept_) # intercept : [30.09886054]
# newY = fit_model.coef_[0] * newX + fit_model.intercept_

pred = fit_model.predict(x)
print(pred[:3])
print('predicted :', pred[:3].flatten()) # predicted : [22.59374995 22.59374995 23.75363068]
print('actual :', y[:3].flatten())       # actual : [21.  21.  22.8]
print()

# Assess model performance with R2 or RMSE
from sklearn.metrics import mean_squared_error
import numpy as np

lin_mse = mean_squared_error(y, pred)   # mean squared error
lin_rmse = np.sqrt(lin_mse)             # square root
print("MSE : ", lin_mse)                # MSE :  13.989822298268805
print("RMSE : ", lin_rmse)              # RMSE :  3.7402970868994894
print()

# Estimated mpg for a given horsepower
new_hp = [[100]]
new_pred = fit_model.predict(new_hp)
print('For %s hp, the estimated mpg is %s'%(new_hp[0][0], new_pred[0][0]))
# For 100 hp, the estimated mpg is 23.27603273246613

Linear regression analysis : Linear Regression
Ridge, Lasso, and ElasticNet to guard against overfitting

* linear_reg9.py

import numpy as np
import pandas as pd
from sklearn.datasets import load_iris

iris = load_iris()
print(iris)
'''
[[5.1, 3.5, 1.4, 0.2],
[4.9, 3. , 1.4, 0.2],
[4.7, 3.2, 1.3, 0.2],
[4.6, 3.1, 1.5, 0.2],
[5. , 3.6, 1.4, 0.2],
[5.4, 3.9, 1.7, 0.4],
[4.6, 3.4, 1.4, 0.3],
'''
print(iris.feature_names) # ['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']
print(iris.target)       
print(iris.target_names) # ['setosa' 'versicolor' 'virginica']
print()

iris_df = pd.DataFrame(iris.data, columns=iris.feature_names)
iris_df['target'] = iris.target
iris_df['target_names'] = iris.target_names[iris.target]
print(iris_df.head(3), ' ', iris_df.shape) # (150, 6)
'''
   sepal length (cm)  sepal width (cm)  ...  target  target_names
0                5.1               3.5  ...       0        setosa
1                4.9               3.0  ...       0        setosa
2                4.7               3.2  ...       0        setosa
'''

Source : https://www.educative.io/edpresso/overfitting-and-underfitting

# train / test split : one of the safeguards against overfitting
from sklearn.model_selection import train_test_split
train_set, test_set = train_test_split(iris_df, test_size = 0.3) # split the data 70% train / 30% test
print(train_set.head(2), ' ', train_set.shape) # (105, 6)
print(test_set.head(2), ' ', test_set.shape) # (45, 6)

# Linear regression
# Regularized linear regression adds a constraint on the regression coefficients (weights), which keeps the model from being over-optimized (overfitting).
from sklearn.linear_model import LinearRegression as lm
import matplotlib.pyplot as plt

print(train_set.iloc[:, [2]]) # petal.length
print(train_set.iloc[:, [3]]) # petal.width

model_ols = lm().fit(X=train_set.iloc[:, [2]], y=train_set.iloc[:, [3]])
print(model_ols.coef_[0])     # [0.41268804]
print(model_ols.intercept_)   # [-0.35472987]
pred = model_ols.predict(test_set.iloc[:, [2]]) # predict once, directly from the test petal lengths
print('ols_pred :\n', pred[:5])

print('ols_real :\n', test_set.iloc[:, [3]][:5])
'''
      petal width (cm)
138               1.8
143               2.3
142               1.9
79                1.0
45                0.3
'''

# Regression method - Ridge: adds an L2 penalty (sum of squared weights) whose strength is set by alpha, to limit overfitting. Effective against multicollinearity.
from sklearn.linear_model import Ridge
model_ridge = Ridge(alpha=10).fit(X=train_set.iloc[:, [2]], y=train_set.iloc[:, [3]])

# scores
print(model_ridge.score(X=train_set.iloc[:, [2]], y=train_set.iloc[:, [3]])) #0.91923658601
print(model_ridge.score(X=test_set.iloc[:, [2]], y=test_set.iloc[:, [3]]))   #0.935219182367
print('ridge predict : ', model_ridge.predict(test_set.iloc[:, [2]]))
plt.scatter(train_set.iloc[:, [2]], train_set.iloc[:, [3]],  color='red')
plt.plot(test_set.iloc[:, [2]], model_ridge.predict(test_set.iloc[:, [2]]))
plt.show()
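
How alpha shrinks the weights in Ridge can be seen from its closed-form solution. The following is a minimal NumPy sketch on synthetic data (an assumption for illustration); ridge_fit is a hypothetical helper, not scikit-learn's implementation, and it centers the data so that the intercept is left unpenalized.

```python
import numpy as np

def ridge_fit(X, y, alpha):
    """Closed-form ridge weights on centered data: (X^T X + alpha * I)^-1 X^T y."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    n_features = X.shape[1]
    return np.linalg.solve(Xc.T @ Xc + alpha * np.eye(n_features), Xc.T @ yc)

# Synthetic data (assumption): two informative features plus noise
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(0, 0.5, size=100)

w_ols = ridge_fit(X, y, alpha=0.0)      # alpha = 0 reduces to ordinary least squares
w_ridge = ridge_fit(X, y, alpha=10.0)   # larger alpha shrinks the weights
print(w_ols, w_ridge)
```

The ridge weights have a smaller norm than the least-squares weights, which is the constraint on the coefficients that the regularization comment describes.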

print('\nLasso')
# Regression method - Lasso: adds an L1 penalty (sum of absolute weights) whose strength is set by alpha, to limit overfitting.
from sklearn.linear_model import Lasso
model_lasso = Lasso(alpha=0.1, max_iter=1000).fit(X=train_set.iloc[:, [0,1,2]], y=train_set.iloc[:, [3]])

# scores
print(model_lasso.score(X=train_set.iloc[:, [0,1,2]], y=train_set.iloc[:, [3]])) #0.921241848687
print(model_lasso.score(X=test_set.iloc[:, [0,1,2]], y=test_set.iloc[:, [3]]))   #0.913186971647
print('number of features used : ', np.sum(model_lasso.coef_ != 0))   # number of features used :  1
plt.scatter(train_set.iloc[:, [2]], train_set.iloc[:, [3]],  color='red')
plt.scatter(test_set.iloc[:, [2]], model_lasso.predict(test_set.iloc[:, [0,1,2]])) # Lasso predictions on the test set
plt.show()

# Regression method - Elastic Net : Ridge + Lasso
# Uses both the sum of squared weights and the sum of absolute weights as constraints at the same time
from sklearn.linear_model import ElasticNet
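
The listing stops at the ElasticNet import. A possible continuation, mirroring the Lasso step, is sketched below. It is self-contained and makes its own split with a fixed random_state; the alpha and l1_ratio values and the name model_elastic are assumptions for illustration, not values from the original post.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import ElasticNet
from sklearn.model_selection import train_test_split

iris = load_iris()
X = iris.data[:, [0, 1, 2]]   # sepal length, sepal width, petal length
y = iris.data[:, 3]           # petal width

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# alpha and l1_ratio are illustrative assumptions, not tuned values
model_elastic = ElasticNet(alpha=0.1, l1_ratio=0.5).fit(X_train, y_train)

train_score = model_elastic.score(X_train, y_train)
test_score = model_elastic.score(X_test, y_test)
n_used = int(np.sum(model_elastic.coef_ != 0))
print(train_score, test_score, n_used)
```

l1_ratio sets the mix between the two penalties: 1.0 behaves like Lasso and values near 0 behave like Ridge.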
