[코딩] Circle Square

전체 글

[딥러닝] GAN 2021.04.12
[딥러닝] RNN, NLP 2021.04.05
[딥러닝] Tensorflow - 이미지 분류 2021.04.01
[LINUX] 리눅스 - 명령어 /eclipse /FlashPlayer /DB /apache /R /anaconda /hadoop 2021.03.26
[딥러닝] Keras - Logistic 2021.03.25
[딥러닝] Keras - Linear 2021.03.23
[딥러닝] TensorFlow 2021.03.22
[딥러닝] TensorFlow 환경설정 2021.03.22
[딥러닝] DBScan 2021.03.22
[딥러닝] k-means 2021.03.22
[딥러닝] 클러스터링 2021.03.19
[딥러닝] Neural Network 2021.03.19
[딥러닝] KNN 2021.03.18
[딥러닝] RandomForest 2021.03.17
[딥러닝] Decision Tree 2021.03.17
[딥러닝] 나이브 베이즈 2021.03.17
[딥러닝] PCA 2021.03.16
[딥러닝] SVM 2021.03.16
[딥러닝] 로지스틱 회귀 2021.03.15
[딥러닝] 다항회귀 2021.03.12
[딥러닝] 단순선형 회귀, 다중선형 회귀 2021.03.11
[딥러닝] 선형회귀 2021.03.10
[딥러닝] 공분산, 상관계수 2021.03.10
[딥러닝] 이항검정 2021.03.10
[딥러닝] ANOVA 2021.03.08
[딥러닝] T 검정 2021.03.05
[딥러닝] 카이제곱 2021.03.04
[Pandas] pandas 정리2 - db, django 2021.03.02
[MatPlotLib] matplotlib 정리 2021.03.02
[Pandas] pandas 정리 2021.02.24

PREV 1 2 3 4 NEXT

[딥러닝] GAN

2021. 4. 12. 16:15

Autoencoder

- Auto Encoder

excelsior-cjh.tistory.com/187

08. 오토인코더 (AutoEncoder)

이번 포스팅은 핸즈온 머신러닝 교재를 가지고 공부한 것을 정리한 포스팅입니다. 08. 오토인코더 - Autoencoder 저번 포스팅 07. 순환 신경망, RNN에서는 자연어, 음성신호, 주식과 같은 연속적인 데

excelsior-cjh.tistory.com

- Autoencoder
: 실제 이미지를 이용하여 가상의 이미지를 생성
: 데이터 수가 충분하지않은 이미지를 얻고자 할 경우 사용

* tf_16_autoencoder.ipynb

from tensorflow.keras.datasets import mnist, fashion_mnist
from tensorflow.keras.models import Sequential, Model
from tensorflow.keras.layers import Input, Dense, MaxPool2D, Conv2D, UpSampling2D, Flatten, Reshape
import matplotlib.pyplot as plt
import numpy as np

(x_train, _), (x_test, _) = fashion_mnist.load_data()   # 비지도 학습이므로 feature만 사용
#(x_train, _), (x_test, _) = mnist.load_data()   # 비지도 학습이므로 feature만 사용
print(x_train[:1])
print(x_train.shape[0])
x_train = x_train.reshape(x_train.shape[0], 28, 28, 1).astype('float32') / 255
x_test = x_test.reshape(x_test.shape[0], 28, 28, 1).astype('float32') / 255
#print(x_train[:1])

[[[  0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
     0   0   0   0   0   0   0   0   0   0   0]
  [  0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
     0   0   0   0   0   0   0   0   0   0   0]
  [  0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
     0   0   0   0   0   0   0   0   0   0   0]
  [  0   0   0   0   0   0   0   0   0   0   0   0   1   0   0  13  73
     0   0   1   4   0   0   0   0   1   1   0]
  [  0   0   0   0   0   0   0   0   0   0   0   0   3   0  36 136 127
    62  54   0   0   0   1   3   4   0   0   3]
  [  0   0   0   0   0   0   0   0   0   0   0   0   6   0 102 204 176
   134 144 123  23   0   0   0   0  12  10   0]
  [  0   0   0   0   0   0   0   0   0   0   0   0   0   0 155 236 207
   178 107 156 161 109  64  23  77 130  72  15]
  [  0   0   0   0   0   0   0   0   0   0   0   1   0  69 207 223 218
   216 216 163 127 121 122 146 141  88 172  66]
  [  0   0   0   0   0   0   0   0   0   1   1   1   0 200 232 232 233
   229 223 223 215 213 164 127 123 196 229   0]
  [  0   0   0   0   0   0   0   0   0   0   0   0   0 183 225 216 223
   228 235 227 224 222 224 221 223 245 173   0]
  [  0   0   0   0   0   0   0   0   0   0   0   0   0 193 228 218 213
   198 180 212 210 211 213 223 220 243 202   0]
  [  0   0   0   0   0   0   0   0   0   1   3   0  12 219 220 212 218
   192 169 227 208 218 224 212 226 197 209  52]
  [  0   0   0   0   0   0   0   0   0   0   6   0  99 244 222 220 218
   203 198 221 215 213 222 220 245 119 167  56]
  [  0   0   0   0   0   0   0   0   0   4   0   0  55 236 228 230 228
   240 232 213 218 223 234 217 217 209  92   0]
  [  0   0   1   4   6   7   2   0   0   0   0   0 237 226 217 223 222
   219 222 221 216 223 229 215 218 255  77   0]
  [  0   3   0   0   0   0   0   0   0  62 145 204 228 207 213 221 218
   208 211 218 224 223 219 215 224 244 159   0]
  [  0   0   0   0  18  44  82 107 189 228 220 222 217 226 200 205 211
   230 224 234 176 188 250 248 233 238 215   0]
  [  0  57 187 208 224 221 224 208 204 214 208 209 200 159 245 193 206
   223 255 255 221 234 221 211 220 232 246   0]
  [  3 202 228 224 221 211 211 214 205 205 205 220 240  80 150 255 229
   221 188 154 191 210 204 209 222 228 225   0]
  [ 98 233 198 210 222 229 229 234 249 220 194 215 217 241  65  73 106
   117 168 219 221 215 217 223 223 224 229  29]
  [ 75 204 212 204 193 205 211 225 216 185 197 206 198 213 240 195 227
   245 239 223 218 212 209 222 220 221 230  67]
  [ 48 203 183 194 213 197 185 190 194 192 202 214 219 221 220 236 225
   216 199 206 186 181 177 172 181 205 206 115]
  [  0 122 219 193 179 171 183 196 204 210 213 207 211 210 200 196 194
   191 195 191 198 192 176 156 167 177 210  92]
  [  0   0  74 189 212 191 175 172 175 181 185 188 189 188 193 198 204
   209 210 210 211 188 188 194 192 216 170   0]
  [  2   0   0   0  66 200 222 237 239 242 246 243 244 221 220 193 191
   179 182 182 181 176 166 168  99  58   0   0]
  [  0   0   0   0   0   0   0  40  61  44  72  41  35   0   0   0   0
     0   0   0   0   0   0   0   0   0   0   0]
  [  0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
     0   0   0   0   0   0   0   0   0   0   0]
  [  0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
     0   0   0   0   0   0   0   0   0   0   0]]]
60000

autoencoder = Sequential()

# 인코더 : 차원 축소
autoencoder.add(Conv2D(32, kernel_size=3, padding='same', input_shape=(28, 28, 1), activation='relu'))
autoencoder.add(MaxPool2D(pool_size=2, padding='same'))
autoencoder.add(Conv2D(16, kernel_size=3, padding='same', activation='relu'))
autoencoder.add(MaxPool2D(pool_size=2, padding='same'))
autoencoder.add(Conv2D(8, kernel_size=3, padding='same', activation='relu'))

# 디코더 : 차원 확장
autoencoder.add(Conv2D(8, kernel_size=3, padding='same', activation='relu'))
autoencoder.add(UpSampling2D())
autoencoder.add(Conv2D(16, kernel_size=3, padding='same', activation='relu'))
autoencoder.add(UpSampling2D())
autoencoder.add(Conv2D(32, kernel_size=3, padding='same', activation='relu'))

autoencoder.add(Conv2D(1, kernel_size=3, padding='same', activation='sigmoid'))

print(autoencoder.summary())
# Layer (type)                 Output Shape              Param #   
# =================================================================
# conv2d_11 (Conv2D)           (None, 28, 28, 32)        320       
# _________________________________________________________________
# max_pooling2d_4 (MaxPooling2 (None, 14, 14, 32)        0         
# _________________________________________________________________
# conv2d_12 (Conv2D)           (None, 14, 14, 16)        4624      
# _________________________________________________________________
# max_pooling2d_5 (MaxPooling2 (None, 7, 7, 16)          0         
# _________________________________________________________________
# conv2d_13 (Conv2D)           (None, 7, 7, 8)           1160      
# _________________________________________________________________
# conv2d_14 (Conv2D)           (None, 7, 7, 8)           584       
# _________________________________________________________________
# up_sampling2d_2 (UpSampling2 (None, 14, 14, 8)         0         
# _________________________________________________________________
# conv2d_15 (Conv2D)           (None, 14, 14, 16)        1168      
# _________________________________________________________________
# up_sampling2d_3 (UpSampling2 (None, 28, 28, 16)        0         
# _________________________________________________________________
# conv2d_16 (Conv2D)           (None, 28, 28, 32)        4640      
# _________________________________________________________________
# conv2d_17 (Conv2D)           (None, 28, 28, 1)         289       
# =================================================================
# Total params: 12,785

autoencoder.compile(optimizer='adam', loss='binary_crossentropy')

autoencoder.fit(x_train, x_train, epochs=30, batch_size=128, validation_data=(x_test, x_test), verbose=2)

random_test = np.random.randint(x_test.shape[0], size=5)
ae_imgs = autoencoder.predict(x_test)

plt.figure(figsize=(7, 2))

for i, image_idx in enumerate(random_test):
    ax = plt.subplot(2, 7, i+1)
    plt.imshow(x_test[image_idx].reshape(28, 28))
    ax = plt.subplot(2, 7, 7 + i + 1)
    plt.imshow(ae_imgs[image_idx].reshape(28, 28))
    ax.axis('off')
plt.show()

GAN

- GAN에 대해 알고 싶다면..

cafe.daum.net/flowlife/S2Ul/13

- GAN

dreamgonfly.github.io/blog/gan-explained/

- GAN : DcGAN(CNN을 GAN에 적용한 알고리즘)

: MNIST dataset으로 새로운 숫자를 생성

* tf_17_GAN.ipynb

from tensorflow.keras.datasets import mnist
from tensorflow.keras.layers import Input, Dense, Reshape, Flatten, Dropout, BatchNormalization, Activation, LeakyReLU, UpSampling2D, Conv2D
from tensorflow.keras.models import Sequential, Model
import numpy as np
import matplotlib.pyplot as plt
import tensorflow as tf
import os

if  not os.path.exists("./gen_imgs"):
    os.makedirs("./gen_imgs")

np.random.seed(3)
tf.random.set_seed(3)

generator = Sequential()
generator.add(Dense(128 * 7 * 7, input_dim = 100, activation=LeakyReLU(alpha=0.2)))
generator.add(BatchNormalization())
generator.add(Reshape((7, 7, 128)))
generator.add(UpSampling2D())
generator.add(Conv2D(64, kernel_size=5, padding='same'))

generator.add(BatchNormalization())
generator.add(Activation(LeakyReLU(alpha=0.2)))
generator.add(UpSampling2D())
generator.add(Conv2D(1, kernel_size=5, padding='same', activation='tanh'))

print(generator.summary())
# Layer (type)                 Output Shape              Param #   
# =================================================================
# dense_3 (Dense)              (None, 6272)              633472    
# _________________________________________________________________
# batch_normalization_4 (Batch (None, 6272)              25088     
# _________________________________________________________________
# reshape_3 (Reshape)          (None, 7, 7, 128)         0         
# _________________________________________________________________
# up_sampling2d_2 (UpSampling2 (None, 14, 14, 128)       0         
# _________________________________________________________________
# conv2d_2 (Conv2D)            (None, 14, 14, 64)        204864    
# _________________________________________________________________
# batch_normalization_5 (Batch (None, 14, 14, 64)        256       
# _________________________________________________________________
# activation_1 (Activation)    (None, 14, 14, 64)        0         
# _________________________________________________________________
# up_sampling2d_3 (UpSampling2 (None, 28, 28, 64)        0         
# _________________________________________________________________
# conv2d_3 (Conv2D)            (None, 28, 28, 1)         1601      
# =================================================================
# Total params: 865,281

discriminator = Sequential()
discriminator.add(Conv2D(64, kernel_size=5, strides=2, input_shape=(28, 28, 1), padding='same'))
discriminator.add(Activation(LeakyReLU(alpha=0.2)))

discriminator.add(Conv2D(128, kernel_size=5, strides=2, padding='same'))
discriminator.add(Activation(LeakyReLU(alpha=0.2)))

discriminator.add(Flatten())

discriminator.add(Dense(1, activation='sigmoid'))
print(discriminator.summary())
# Layer (type)                 Output Shape              Param #   
# =================================================================
# conv2d_10 (Conv2D)           (None, 14, 14, 64)        1664      
# _________________________________________________________________
# activation_7 (Activation)    (None, 14, 14, 64)        0         
# _________________________________________________________________
# conv2d_11 (Conv2D)           (None, 7, 7, 128)         204928    
# _________________________________________________________________
# activation_8 (Activation)    (None, 7, 7, 128)         0         
# _________________________________________________________________
# flatten (Flatten)            (None, 6272)              0         
# _________________________________________________________________
# dense_5 (Dense)              (None, 1)                 6273      
# =================================================================
# Total params: 212,865

discriminator.compile(loss='binary_crossentropy', optimizer='adam')
discriminator.trainable = False # 학습기능 해제

# GAN 모델
ginput = Input(shape=(100, ))
dis_output = discriminator(generator(ginput))
gan = Model(ginput, dis_output)
gan.compile(loss='binary_crossentropy', optimizer='adam')
print(gan.summary())
# Layer (type)                 Output Shape              Param #   
# =================================================================
# input_2 (InputLayer)         [(None, 100)]             0         
# _________________________________________________________________
# sequential_4 (Sequential)    (None, 28, 28, 1)         865281    
# _________________________________________________________________
# sequential_9 (Sequential)    (None, 1)                 212865    
# =================================================================
# Total params: 1,078,146

# 신경망 실행 함수
def gan_train(epoch, batch_size, save_interval):
    (x_train, _), (_, _) = mnist.load_data()

    x_train = x_train.reshape(x_train.shape[0], 28, 28, 1).astype('float32')
    x_train = (x_train - 127.5) / 127.5

    #print(x_train[0]) # -1 ~ 1 사이의 값으로 변경. generator에서 활성화 함수로 tanh를 사용했으므로

    true = np.ones((batch_size, 1))
    fake = np.zeros((batch_size, 1))
    for i in range(epoch):
        # 실제 데이터를 판별자에 입력
        idx = np.random.randint(0, x_train.shape[0], batch_size)
        imgs = x_train[idx]
        d_loss_real = discriminator.train_on_batch(imgs, true)

        # 가상 데이터를 판병자에 입력
        noise = np.random.normal(0, 1, (batch_size, 100))
        gen_images = generator.predict(noise)
        d_loss_fake = discriminator.train_on_batch(gen_images, fake)

        # 판별자와 생성자의 오차 계산
        d_loss = 0.5 * np.add(d_loss_real, d_loss_fake)
        g_loss = gan.train_on_batch(noise, true)
        print('epoch :%d'%i, ', d_loss : %.3f'%d_loss, ', g_loss : %.3f'%g_loss)
        if i % save_interval == 0:
            noise = np.random.normal(0, 1, (25, 100))
            gen_images = generator.predict(noise)
            gen_images = 0.5 * gen_images + 0.5
            
            fig, axs = plt.subplots(5, 5)
            count = 0
            for j in range(5):
                for k in range(5):
                    axs[j, k].imshow(gen_images[count, :, :, 0], cmap='gray')
                    axs[j, k].axis('off')
                    count += 1
            fig.savefig('./gen_imgs/gan_mnist_%d.png'%i)

'BACK END > Deep Learning' 카테고리의 다른 글

[딥러닝] RNN, NLP (0)	2021.04.05
[딥러닝] Tensorflow - 이미지 분류 (0)	2021.04.01
[딥러닝] Keras - Logistic (0)	2021.03.25
[딥러닝] Keras - Linear (0)	2021.03.23
[딥러닝] TensorFlow (0)	2021.03.22

[딥러닝] RNN, NLP

2021. 4. 5. 10:29

순환신경망 (Recurrent Neueal Network, RNN)

: 시퀀스 단위의 입력을 시퀀스 단위의 출력으로 처리하는 모델

: 시계열 데이터 처리 - 자연어, 번역, 이미지 캡션, 채팅, 주식 ...

: LSTM, GRU, ..

- RNN

wikidocs.net/22886

위키독스

온라인 책을 제작 공유하는 플랫폼 서비스

wikidocs.net

* tf_rnn.ipynb

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import SimpleRNN, LSTM

model = Sequential()

model.add(SimpleRNN(3, input_shape = (2, 10)))
#model.add(SimpleRNN(3, input_length = 2, input_dim = 10))

print(model.summary())                # Total params: 42

from tensorflow.keras.layers. import SimpleRNN

SimpleRNN(a, batch_input_shape = (b, c, d)) : 출력 수(a), batch_size(b), sequence(c), 입력수(d)

model = Sequential()

model.add(LSTM(3, input_shape = (2, 10)))

print(model.summary())                # Total params: 168

from tensorflow.keras.layers. import LSTM

LSTM(a, batch_input_shape = (b, c, d)) : 출력 수(a), batch_size(b), sequence(c), 입력수(d)

model = Sequential()

model.add(SimpleRNN(3, batch_input_shape = (8, 2, 10)))
# batch_size : 8, sequence : 2, 입력수 : 10, 출력 수 : 3

print(model.summary())                # Total params: 42

model = Sequential()

model.add(LSTM(3, batch_input_shape = (8, 2, 10)))

print(model.summary())                # Total params: 168

model = Sequential()

model.add(SimpleRNN(3, batch_input_shape = (8, 2, 10), return_sequences=True))

print(model.summary())                # Total params: 42

model = Sequential()

model.add(LSTM(3, batch_input_shape = (8, 2, 10), return_sequences=True))

print(model.summary())               # Total params: 168

NLP(자연어 처리)

자연어 : 순차적

문장 -> 단어/문자열/형태소/자소(자음/모음) -> code화(숫자) -> one -hot encoding or word2vec(단어간 관계 예측) -> embeding처리

4개의 숫자를 통해 그 다음 숫자를 예측하는 RNN 모델 생성

* tf_rnn2.ipynb

import tensorflow as tf
import numpy as np

x = []
y = []
for i in range(6):              # 0 ~ 5
    lst = list(range(i, i + 4)) # 0 ~ 3, 1 ~ 4, 2 ~ 5 ...
    print(lst)
    x.append(list(map(lambda c:[c / 10], lst)))
    y.append((i + 4) /10)

print(x)
# [[[0.0], [0.1], [0.2], [0.3]], [[0.1], [0.2], [0.3], [0.4]], [[0.2], [0.3], [0.4], [0.5]], [[0.3], [0.4], [0.5], [0.6]], [[0.4], [0.5], [0.6], [0.7]], [[0.5], [0.6], [0.7], [0.8]]]
print(y)
# [0.4, 0.5, 0.6, 0.7, 0.8, 0.9]

x = np.array(x)
y = np.array(y)

for i in range(len(x)): # 0 ~ 5
    print(x[i], y[i])
# [[0. ]
#  [0.1]
#  [0.2]
#  [0.3]] 0.4
# [[0.1]
#  [0.2]
#  [0.3]
#  [0.4]] 0.5
# ...

model = tf.keras.Sequential([
    #tf.keras.layers.SimpleRNN(units=10, activation='tanh', input_shape=[4, 1]),  # Total params: 131
    #tf.keras.layers.LSTM(units=10, activation='tanh', input_shape=[4, 1]),       # Total params: 491
    tf.keras.layers.GRU(units=10, activation='tanh', input_shape=[4, 1]),         # Total params: 401

    tf.keras.layers.Dense(1)
])

model.compile(optimizer='adam', loss='mse')
print(model.summary())

model.fit(x, y, epochs=100, verbose=0)
print('예측값 :', model.predict(x))
# 예측값 : [[0.380357  ]
#  [0.50256383]
#  [0.6151931 ]
#  [0.7163592 ]
#  [0.8050702 ]
#  [0.88109314]]
print('실제값 :', y)
# 실제값 : [0.4 0.5 0.6 0.7 0.8 0.9]
print()

print(model.predict(np.array([[[0.6], [0.7], [0.9]]])))
# [[0.76507646]]
print(model.predict(np.array([[[-0.1], [0.2], [0.4], [0.9]]])))
# [[0.6791251]]

Dimension for RNN

: RNN모형을 구현할 때 핵심이 되는 데이터 구조

: many-to-many, many-to-one, one-to-many

* tf_rnn3.ipynb

import numpy as np
import tensorflow as tf
from tensorflow import keras

# many-to-one
x = np.array([[[1], [2], [3]], [[2], [3], [4]], [[3], [4], [5]]], dtype=np.float32)
y = np.array([[4], [5], [6]])
print(x.shape, y.shape) # (3, 3, 1) (3, 1)

# function API 사용
layer_input = keras.Input(shape=(3, 1))
layer_rnn = keras.layers.SimpleRNN(100, activation='tanh')(layer_input)
layer_output = keras.layers.Dense(1)(layer_rnn)

model = keras.Model(layer_input, layer_output)
model.compile(loss = 'mse', optimizer='adam')
model._name = 'many-to-one'
print(model.summary())
# Layer (type)                 Output Shape              Param #   
# =================================================================
# input_2 (InputLayer)         [(None, 3, 1)]            0         
# _________________________________________________________________
# simple_rnn_1 (SimpleRNN)     (None, 100)               10200     
# _________________________________________________________________
# dense_1 (Dense)              (None, 1)                 101       
# =================================================================
# Total params: 10,301
model.fit(x, y, epochs=100, batch_size=1, verbose=1)
print('pred :', model.predict(x).flatten())
# pred : [3.902048  5.1808596 5.8828716]
print('real :', y.flatten())
# real : [4 5 6]

# many-to-many
x = np.array([[[1], [2], [3]], [[2], [3], [4]], [[3], [4], [5]]], dtype=np.float32)
y = np.array([[4], [5], [6]])
print(x.shape, y.shape) # (3, 3, 1) (3, 1)

# function API 사용
layer_input = keras.Input(shape=(3, 1))
layer_rnn = keras.layers.SimpleRNN(100, activation='tanh', return_sequences=True)(layer_input)
layer_output = keras.layers.TimeDistributed(keras.layers.Dense(1))(layer_rnn)

model = keras.Model(layer_input, layer_output)
model.compile(loss = 'mse', optimizer='adam')
model._name = 'many-to-many'
print(model.summary())
# Layer (type)                 Output Shape              Param #   
# =================================================================
# input_5 (InputLayer)         [(None, 3, 1)]            0         
# _________________________________________________________________
# simple_rnn_4 (SimpleRNN)     (None, 3, 100)            10200     
# _________________________________________________________________
# dense_4 (Dense)              (None, 3, 1)              101       
# =================================================================
# Total params: 10,301
model.fit(x, y, epochs=100, batch_size=1, verbose=1)
print('pred :', model.predict(x).flatten())
# pred : [3.429767  3.9655545 4.002289  5.02977   5.0609956 4.999564  6.3015547 5.9251485 6.001438 ]

print('real :', y.flatten())
# real : [4 5 6]

SimpleRNN(100, activation='tanh', return_sequences=True)

TimeDistributed(Dense(1))

# stacked many-to-one
x = np.array([[[1], [2], [3]], [[2], [3], [4]], [[3], [4], [5]]], dtype=np.float32)
y = np.array([[4], [5], [6]])
print(x.shape, y.shape) # (3, 3, 1) (3, 1)

# function API 사용
layer_input = keras.Input(shape=(3, 1))
layer_rnn1 = keras.layers.SimpleRNN(100, activation='tanh', return_sequences=True)(layer_input)
layer_rnn2 = keras.layers.SimpleRNN(100, activation='tanh', return_sequences=True)(layer_rnn1)
layer_output = keras.layers.Dense(1)(layer_rnn2)

model = keras.Model(layer_input, layer_output)
model.compile(loss = 'mse', optimizer='adam')
model._name = 'stacked-many-to-many'
print(model.summary())
# Layer (type)                 Output Shape              Param #   
# =================================================================
# input_8 (InputLayer)         [(None, 3, 1)]            0         
# _________________________________________________________________
# simple_rnn_8 (SimpleRNN)     (None, 3, 100)            10200     
# _________________________________________________________________
# simple_rnn_9 (SimpleRNN)     (None, 3, 100)            20100     
# _________________________________________________________________
# dense_7 (Dense)              (None, 3, 1)              101       
# =================================================================
# Total params: 30,401
model.fit(x, y, epochs=100, batch_size=1, verbose=1)
print('pred :', model.predict(x).flatten())
# pred : [3.6705296 3.9618802 3.9781656 5.160208  5.0984097 5.0837207 6.031624 5.93854   5.9503284]

print('real :', y.flatten())
# real : [4 5 6]

- 자연어 처리

wikidocs.net/book/2155

위키독스

온라인 책을 제작 공유하는 플랫폼 서비스

wikidocs.net

- 한국어 불용어

www.ranks.nl/stopwords/korean

Korean Stopwords

www.ranks.nl

문자열 토큰화 + LSTM 감성분류

* tf_rnn4.ipynb

# token, corpus, vocabulary, one-hot, word2vec, tfidf,

from tensorflow.keras.preprocessing.text import Tokenizer

samples = ['The cat say on the mat.', 'The dog ate my homework.'] # list type

# token 처리 1 - word index
token_index = {}
for sam in samples:
    for word in sam.split(sep=' '):
        if word not in token_index:
            #print(word)
            token_index[word] = len(token_index)
print(token_index)
# {'The': 0, 'cat': 1, 'say': 2, 'on': 3, 'the': 4, 'mat.': 5, 'dog': 6, 'ate': 7, 'my': 8, 'homework.': 9}

print()
# token 처리 2 - word index
# tokenizer = Tokenizer(num_words=3) # num_words=3 빈도가 높은 3개의 토큰 만 작업에 참여
tokenizer = Tokenizer()
tokenizer.fit_on_texts(samples)
token_seq = tokenizer.texts_to_sequences(samples)   # 문자열을 index로 표현
print(token_seq)
# [[1, 2, 3, 4, 1, 5], [1, 6, 7, 8, 9]]
print(tokenizer.word_index)                        # 특수 문자 제거 및 대문자를 소문자로 변환
# {'the': 1, 'cat': 2, 'say': 3, 'on': 4, 'mat': 5, 'dog': 6, 'ate': 7, 'my': 8, 'homework': 9}

from tensorflow.keras.preprocessing.text import Tokenizer

- Tokenizer API

www.tensorflow.org/api_docs/python/tf/keras/preprocessing/text/Tokenizer

tf.keras.preprocessing.text.Tokenizer | TensorFlow Core v2.4.1

Text tokenization utility class.

www.tensorflow.org

tokenizer = Tokenizer(num_words=3) : num_words=3 빈도가 높은 3개의 토큰 만 작업에 참여

token_seq = tokenizer.texts_to_sequences(samples)

tokenizer.fit_on_texts(data) :

tokenizer.word_index :

token_mat = tokenizer.texts_to_matrix(samples, mode='binary')    # 있으면 1 없으면 0
# token_mat = tokenizer.texts_to_matrix(samples, mode='freq')      # 빈도수로 표현
# token_mat = tokenizer.texts_to_matrix(samples, mode='tfidf')   # 단어의 중요정도를 가중치로 표현
# token_mat = tokenizer.texts_to_matrix(samples, mode='count')
print(token_mat)
# [[0. 2. 1. 1. 1. 1. 0. 0. 0. 0.]
#  [0. 1. 0. 0. 0. 0. 1. 1. 1. 1.]]

# [[0.         0.86490296 0.69314718 0.69314718 0.69314718 0.69314718
#   0.         0.         0.         0.        ]
#  [0.         0.51082562 0.         0.         0.         0.
#   0.69314718 0.69314718 0.69314718 0.69314718]]

# [[0.         0.33333333 0.16666667 0.16666667 0.16666667 0.16666667
#   0.         0.         0.         0.        ]
#  [0.         0.2        0.         0.         0.         0.
#   0.2        0.2        0.2        0.2       ]]

# [[0. 1. 1. 1. 1. 1. 0. 0. 0. 0.]
#  [0. 1. 0. 0. 0. 0. 1. 1. 1. 1.]]

print(tokenizer.word_counts)
# OrderedDict([('the', 3), ('cat', 1), ('say', 1), ('on', 1), ('mat', 1), ('dog', 1), ('ate', 1), ('my', 1), ('homework', 1)])
print(tokenizer.document_count)  # 2
print(tokenizer.word_docs)
# defaultdict(<class 'int'>, {'the': 2, 'on': 1, 'cat': 1, 'mat': 1, 'say': 1, 'my': 1, 'ate': 1, 'dog': 1, 'homework': 1})

from tensorflow.keras.utils import to_categorical
token_seq = to_categorical(token_seq[0], num_classes=6)
print(token_seq)
# [[0. 1. 0. 0. 0. 0.]
#  [0. 0. 1. 0. 0. 0.]
#  [0. 0. 0. 1. 0. 0.]
#  [0. 0. 0. 0. 1. 0.]
#  [0. 1. 0. 0. 0. 0.]
#  [0. 0. 0. 0. 0. 1.]]

- 영화 리뷰 자료로 간단한 감성분석

import numpy as np
docs = ['너무 재밌어요', '또 보고 싶어요', '참 잘 만든 영화네요', '친구에게 추천할래요', '배우가 멋져요',\
        '별로예요', '지루하고 재미없어요', '연기가 어색해요', '재미없어요', '돈 아까워요']
classes = np.array([1,1,1,1,1,0,0,0,0,0])

token = Tokenizer()
token.fit_on_texts(docs)
print(token.word_index)
# {'재미없어요': 1, '너무': 2, '재밌어요': 3, '또': 4, '보고': 5, '싶어요': 6, '참': 7, '잘': 8, '만든': 9, '영화네요': 10, '친구에게': 11,
# '추천할래요': 12, '배우가': 13, '멋져요': 14, '별로예요': 15, '지루하고': 16, '연기가': 17, '어색해요': 18, '돈': 19, '아까워요': 20}

x = token.texts_to_sequences(docs)
print('토큰화 결과 :', x)
# 토큰화 결과 : [[2, 3], [4, 5, 6], [7, 8, 9, 10], [11, 12], [13, 14], [15], [16, 1], [17, 18], [1], [19, 20]]

from tensorflow.keras.preprocessing.sequence import pad_sequences
padded_x = pad_sequences(x, 4) # 패딩 : 서로 다른 길이의 데이터를 가장 긴 데이터의 길이로 맞춘다.
# 병렬 연산을 위해서 여러 문장의 길이를 임의로 동일하게 맞춰주는 작업이 필요
print(padded_x)
# [[ 0  0  2  3]
#  [ 0  4  5  6]
#  [ 7  8  9 10]
#  [ 0  0 11 12]
#  [ 0  0 13 14]
#  [ 0  0  0 15]
#  [ 0  0 16  1]
#  [ 0  0 17 18]
#  [ 0  0  0  1]
#  [ 0  0 19 20]]

# 모델
word_size = len(token.word_index) + 1
print(word_size) # 22

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Flatten, Embedding, LSTM
model = Sequential()
model.add(Embedding(word_size, output_dim=8, input_length=4)) # model.add(Embedding(vocabulary, output_dim, input_length))
model.add(LSTM(32, activation='tanh'))
model.add(Flatten()) # FC Layer
model.add(Dense(32, activation='relu'))
model.add(Dense(1, activation='sigmoid'))

model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
model.fit(padded_x, classes, epochs=10, verbose=1)
print('acc :', model.evaluate(padded_x, classes)[1])
# acc : 1.0
print('pred :', model.predict(padded_x).flatten())
# pred : [0.50115323 0.5027714  0.5044522  0.5016046  0.50048715 0.49916682  0.4971649  0.49867162 0.49731275 0.4972709 ]
print('real :', classes)
# real : [1 1 1 1 1 0 0 0 0 0]

RNN을 이용한 텍스트 생성

: 모델이 문맥을 학습한 후 텍스트를 작성

* tf_rnn5_token.ipynb

from tensorflow.keras.preprocessing.text import Tokenizer

text = """운동장에 눈이 많이 쌓여 있다
그 사람의 눈이 빛난다
맑은 눈이 사람 마음을 곱게 만든다"""

tok = Tokenizer()
tok.fit_on_texts([text])
encoded = tok.texts_to_sequences([text])
print(encoded)
# [[2, 1, 3, 4, 5, 6, 7, 1, 8, 9, 1, 10, 11, 12, 13]]
print(tok.word_index)
# {'눈이': 1, '운동장에': 2, '많이': 3, '쌓여': 4, '있다': 5, '그': 6, '사람의': 7, '빛난다': 8, '맑은': 9, '사람': 10, '마음을': 11, '곱게': 12, '만든다': 13}

vocab_size = len(tok.word_index) + 1
print('단어 집합의 크기 :%d'%vocab_size)
# 단어 집합의 크기 :14

sequences = list()   # feature
for line in text.split('\n'):
    encoded = tok.texts_to_sequences([line])[0]
    #print(encoded)
    # [2, 1, 3, 4, 5]
    # [6, 7, 1, 8]
    # [9, 1, 10, 11, 12, 13]
    for i in range(1, len(encoded)):
        sequ = encoded[:i + 1]
        #print(sequ)
        # [2, 1]
        # [2, 1, 3]
        # [2, 1, 3, 4]
        # [2, 1, 3, 4, 5]
        # [6, 7]
        # [6, 7, 1]
        # [6, 7, 1, 8]
        # [9, 1]
        # [9, 1, 10]
        # [9, 1, 10, 11]
        # [9, 1, 10, 11, 12]
        # [9, 1, 10, 11, 12, 13]
        sequences.append(sequ)
print(sequences)
# [[2, 1], [2, 1, 3], [2, 1, 3, 4], [2, 1, 3, 4, 5], [6, 7], [6, 7, 1], [6, 7, 1, 8], [9, 1], [9, 1, 10], [9, 1, 10, 11], [9, 1, 10, 11, 12], [9, 1, 10, 11, 12, 13]]

print('학습에 참여할 샘플 수 :', len(sequences)) # 12
print(max(len(i) for i in sequences)) # 6

# padding
from tensorflow.keras.preprocessing.sequence import pad_sequences
max_len = max(len(i) for i in sequences)
psequences = pad_sequences(sequences, maxlen = max_len, padding='pre')
# psequences = pad_sequences(sequences, maxlen = max_len, padding='post')
print(psequences)
# [[ 0  0  0  0  2  1]
#  [ 0  0  0  2  1  3]
#  [ 0  0  2  1  3  4]
#  [ 0  2  1  3  4  5]
#  [ 0  0  0  0  6  7]
#  [ 0  0  0  6  7  1]
#  [ 0  0  6  7  1  8]
#  [ 0  0  0  0  9  1]
#  [ 0  0  0  9  1 10]
#  [ 0  0  9  1 10 11]
#  [ 0  9  1 10 11 12]
#  [ 9  1 10 11 12 13]]

import numpy as np
psequences = np.array(psequences)
x = psequences[:, :-1] # feature
y = psequences[:, -1]  # label
print(x)
# [[ 0  0  0  0  2]
#  [ 0  0  0  2  1]
#  [ 0  0  2  1  3]
#  [ 0  2  1  3  4]
#  [ 0  0  0  0  6]
#  [ 0  0  0  6  7]
#  [ 0  0  6  7  1]
#  [ 0  0  0  0  9]
#  [ 0  0  0  9  1]
#  [ 0  0  9  1 10]
#  [ 0  9  1 10 11]
#  [ 9  1 10 11 12]]
print(y)
# [ 1  3  4  5  7  1  8  1 10 11 12 13]
from tensorflow.keras.utils import to_categorical
y = to_categorical(y, num_classes = vocab_size)
print(y)
# [[0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
#  [0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
#  [0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
#  [0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0.]
#  [0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0.]
#  [0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
#  [0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0.]
#  [0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
#  [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0.]
#  [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0.]
#  [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0.]
#  [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1.]]

# model
from tensorflow.keras.layers import Embedding, Dense, SimpleRNN, LSTM, Flatten
from tensorflow.keras.models import Sequential

model = Sequential()
model.add(Embedding(vocab_size, 32, input_length=max_len - 1))
model.add(LSTM(32, activation='tanh'))
model.add(Flatten())
model.add(Dense(32, activation='relu'))
model.add(Dense(vocab_size, activation='softmax'))

model.compile(loss = 'categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
print(model.summary())        # Total params: 10,286
model.fit(x, y, epochs=200, verbose=2)
print(model.evaluate(x, y))   # [0.07081326842308044, 1.0]

def sentence_gen(model, t, current_word, n):
    init_word = current_word
    sentence = ''
    for _ in range(n):
        encoded = t.texts_to_sequences([current_word])[0]
        encoded = pad_sequences([encoded], maxlen = max_len - 1, padding = 'pre')
        result = np.argmax(model.predict(encoded))
        # print(result)
        for word, index in t.word_index.items():
            #print('word:', word, ', index:', index)
            # word: 눈이 , index: 1
            # word: 운동장에 , index: 2
            # word: 많이 , index: 3
            # word: 쌓여 , index: 4
            # word: 있다 , index: 5
            # word: 그 , index: 6
            # word: 사람의 , index: 7
            # word: 빛난다 , index: 8
            # word: 맑은 , index: 9
            # word: 사람 , index: 10
            # word: 마음을 , index: 11
            # word: 곱게 , index: 12
            # word: 만든다 , index: 13
            if index == result:
                break
        current_word = current_word + ' ' + word
        sentence = sentence + ' '  + word # 예측단어를 문장에 저장
    sentence = init_word + sentence
    return sentence


print(sentence_gen(model, tok, '운동장에', 1))
print(sentence_gen(model, tok, '운동장에', 3))
print(sentence_gen(model, tok, '맑은', 1))
print(sentence_gen(model, tok, '맑은', 2))
print(sentence_gen(model, tok, '맑은', 3))
print(sentence_gen(model, tok, '한국', 3))
print(sentence_gen(model, tok, '파이썬', 5))
# 운동장에 눈이
# 운동장에 눈이 많이 쌓여
# 맑은 눈이
# 맑은 눈이 사람
# 맑은 눈이 사람 마음을
# 한국 눈이 눈이 사람

소설을 학습하여 새로운 소설생성

* tf_rnn6_토지소설.ipynb

- RNN 관련

cafe.daum.net/flowlife/S2Ul/33

Daum 카페

cafe.daum.net

import numpy as np
import random, sys
import tensorflow as tf

f = open("rnn_test_toji.txt", 'r', encoding="utf-8")
text = f.read()
#print(text)
f.close();

print('텍스트 행 수: ', len(text))  # 306967
print(set(text))  # set 집합형 함수를 이용해 중복 제거{'얻', '턴', '옮', '쩐', '제', '평',...

chars = sorted(list(set(text)))     # 중복이 제거된 문자를 하나하나 읽어 들여 정렬 
print(chars)                        # ['\n', ' ', '!', ... , '0', '1', ... 'a', 'c', 'f', '...
print('사용되고 있는 문자 수:', len(chars))   # 1469

char_indices = dict((c, i) for i, c in enumerate(chars)) # 문자와 ID
indices_char = dict((i, c) for i, c in enumerate(chars)) # ID와 문자
print(char_indices)            # ... '것': 106, '겄': 107, '겅': 108,...
print(indices_char)            # ... 106: '것', 107: '겄', 108: '겅',...

# 텍스트를 maxlen개의 문자로 자르고 다음에 오는 문자 등록하기
maxlen = 20
step = 3
sentences = []
next_chars = []

for i in range(0, len(text) - maxlen, step):
    #print(text[i: i + maxlen])
    sentences.append(text[i: i + maxlen])
    next_chars.append(text[i + maxlen])

print('학습할 구문 수:', len(sentences))        # 102316

print('텍스트를 ID 벡터로 변환')
X = np.zeros((len(sentences), maxlen, len(chars)), dtype=np.bool)
y = np.zeros((len(sentences), len(chars)), dtype=np.bool)
print(X[:3])
print(y[:3])

for i, sent in enumerate(sentences):
    #print(sent)
    for t, char in enumerate(sent):
        #print(t, ' ', char)
        X[i, t, char_indices[char]] = 1
    y[i, char_indices[next_chars[i]]] = 1

print(X[:5])  # 찾은 글자에만 True, 나머지는 False 기억
print(y[:5])

# 모델 구축하기(LSTM(RNN의 개량종)) -------------
# 하나의 LSTM 층과 그 뒤에 Dense 분류층 추가
model = tf.keras.Sequential()
model.add(tf.keras.layers.LSTM(128, activation='tanh', input_shape=(maxlen, len(chars))))

model.add(tf.keras.layers.Dense(128))
model.add(tf.keras.layers.Activation('relu'))
model.add(tf.keras.layers.Dense(len(chars)))
model.add(tf.keras.layers.Activation('softmax'))

opti = tf.keras.optimizers.Adam(lr=0.001)
model.compile(loss='categorical_crossentropy', optimizer=opti, metrics=['acc'])

from tensorflow.keras.callbacks import EarlyStopping

es = EarlyStopping(patience = 5, monitor='loss')
model.fit(X, y, epochs=500, batch_size=64, verbose=2, callbacks=[es])

print(model.evaluate(X, y))
# 확률적 샘플링 처리 함수(무작위적으로 샘플링하기 위함)
# 모델의 예측이 주어졌을 때 새로운 글자를 샘플링 
def sample_func(preds, variety=1.0):        # 후보를 배열에서 꺼내기
    # array():복사본, asarray():참조본 생성 - 원본 변경시 복사본은 변경X 참조본은 변경O
    preds = np.asarray(preds).astype('float64') 
    preds = np.log(preds) / variety         # 로그확률 벡터식을 코딩
    exp_preds = np.exp(preds)               # 자연상수 얻기
    preds = exp_preds / np.sum(exp_preds)   # softmax 공식 참조
    probas = np.random.multinomial(1, preds, 1)  # 다항식분포로 샘플 얻기
    return np.argmax(probas)

for num in range(1, 2):   # 학습시키고 텍스트 생성하기 반복    1, 60
    print()
    print('--' * 30)
    print('반복 =', num)

    # 데이터에서 한 번만 반복해서 모델 학습
    model.fit(X, y, batch_size=128, epochs=1, verbose=0) 

    # 임의의 시작 텍스트 선택하기
    start_index = random.randint(0, len(text) - maxlen - 1)

    for variety in [0.2, 0.5, 1.0, 1.2]:     # 다양한 문장 생성
        print('\n--- 다양성 = ', variety)    # 다양성 = 0.2 -> 다양성 =  0.5 -> ...
        generated = ''
        sentence = text[start_index: start_index + maxlen]
        generated += sentence
        print('--- 시드 = "' + sentence + '"')  # --- 시드 = "께 간뎅이가 부어서, 시부릴기력 있거"...
        sys.stdout.write(generated)

        # 시드를 기반으로 텍스트 자동 생성. 시드 텍스트에서 시작해서 500개의 글자를 생성
        for i in range(500):
            x = np.zeros((1, maxlen, len(chars))) # 지금까지 생성된 글자를 원핫인코딩 처리

            for t, char in enumerate(sentence):
                x[0, t, char_indices[char]] = 1.

            # 다음에 올 문자를 예측하기(다음 글자를 샘플링)
            preds = model.predict(x, verbose=0)[0]
            next_index = sample_func(preds, variety)    # 다양한 문장 생성을 위함
            next_char = indices_char[next_index]

            # 출력하기
            generated += next_char
            sentence = sentence[1:] + next_char
            sys.stdout.write(next_char)
            sys.stdout.flush()
        print()

- Google colab

cafe.daum.net/flowlife/S2Ul/24

Daum 카페

cafe.daum.net

- 리눅스 기본 명령어

cafe.daum.net/flowlife/9A8Q/161

리눅스 기본 명령어

vmwarehttps://www.vmware.com/kr.html무료사용 제품VMware Workstation Playerhttps://www.centos.org/vmware에 centos 설치 하기https://jhnyang.tistory.com/280https://www.ubuntu-kr.org/1. 데비안(Debian)Debian은

cafe.daum.net

- 가상환경 tool

vmware

virtual box

- telnet 와 ssh의 차이

m.blog.naver.com/PostView.nhn?blogId=ahnsh09&logNo=40171391492&proxyReferer=https:%2F%2Fwww.google.com%2F

telnet 와 ssh의 차이

telnet이란? 원격 접속 서비스로서 특정 사용자가 네트워크를 통해 다른 컴퓨터에 연결하여 그 컴퓨터에서 ...

blog.naver.com

뉴욕타임즈 기사의 일부 자료로 RNN 학습 모델을 만들어 기사 생성하기

* tf_rnn7_뉴욕타임즈기사.ipynb

import pandas as pd
df = pd.read_csv('https://raw.githubusercontent.com/pykwon/python/master/testdata_utf8/articlesapril.csv')
print(df.head())
print(df.count)
print(df.columns)
# Index(['articleID', 'articleWordCount', 'byline', 'documentType', 'headline', 'keywords', 'multimedia', 
#'newDesk', 'printPage', 'pubDate','sectionName', 'snippet', 'source', 'typeOfMaterial', 'webURL'], dtype='object')

print(len(df.columns)) # 15

print(df['headline'].head())
# 0    Former N.F.L. Cheerleaders’ Settlement Offer: ...
# 1    E.P.A. to Unveil a New Rule. Its Effect: Less ...
# 2                              The New Noma, Explained
# 3                                              Unknown
# 4                                              Unknown
print(df.headline.values)
# ['Former N.F.L. Cheerleaders’ Settlement Offer: $1 and a Meeting With Goodell'
#  'E.P.A. to Unveil a New Rule. Its Effect: Less Science in Policymaking.'
#  'The New Noma, Explained' ...
#  'Gen. Michael Hayden Has One Regret: Russia'
#  'There Is Nothin’ Like a Tune' 'Unknown']

headline = []
headline.extend(list(df.headline.values))
print(headline[:10])
# ['Former N.F.L. Cheerleaders’ Settlement Offer: $1 and a Meeting With Goodell',
# 'E.P.A. to Unveil a New Rule. Its Effect: Less Science in Policymaking.', 'The New Noma, Explained', 'Unknown', 'Unknown', 'Unknown', 'Unknown', 'Unknown', 'How a Bag of Texas Dirt  Became a Times Tradition', 'Is School a Place for Self-Expression?']

# Unknown 값은 노이즈로 판단해 제거
print(len(headline)) # 1324
headline = [n for n in headline if n != 'Unknown']
print(len(headline)) # 1214

# 구굿점 제거, 소문자 처리
print('He하이llo 가a나b123다'.encode('ascii', errors="ignore").decode()) # Hello ab123
from string import punctuation
print(", python.'".strip(punctuation))                                   #  python
print(", py thon.'".strip(punctuation + ' '))                            # py thon
#-------------------------------------------------------------------

def repre_func(s):
    s = s.encode('utf8').decode('ascii', 'ignore')
    return ''.join(c for c in s if c not in punctuation).lower()

text = [repre_func(s) for s in headline]
print(text[:10])
# ['former nfl cheerleaders settlement offer 1 and a meeting with goodell', 'epa to unveil a new rule its effect less science in policymaking', 'the new noma explained', 'how a bag of texas dirt  became a times tradition', 'is school a place for selfexpression', 'commuter reprogramming', 'ford changed leaders looking for a lift its still looking', 'romney failed to win at utah convention but few believe hes doomed', 'chain reaction', 'he forced the vatican to investigate sex abuse now hes meeting with pope francis']

a.extend(b) : a에 b의 각 원소를 추가.

a.encode('ascii', errors="ignore").decode() : a에서 영문/숫자 아닌값 무시하여 제거.

from string import punctuation

a.strip(punctuation) : 구둣점 제거

- 단어 집합 생성

from keras_preprocessing.text import Tokenizer
tok = Tokenizer()
tok.fit_on_texts(text)
print(tok.word_index)
# {'the': 1, 'a': 2, 'to': 3, 'of': 4, 'in': 5, 'for': 6, 'and': 7,  ...

vocab_size = len(tok.word_index) + 1
print('vocab_size :', vocab_size)   # 3494

sequences = list()
for line in text:
    enc = tok.texts_to_sequences([line])[0]
    for i in range(1, len(enc)):
        se = enc[:i + 1]
        sequences.append(se)

print(sequences[:11])
# [[99, 269], [99, 269, 371], [99, 269, 371, 1115], [99, 269, 371, 1115, 582], [99, 269, 371, 1115, 582, 52], [99, 269, 371, 1115, 582, 52, 7], [99, 269, 371, 1115, 582, 52, 7, 2], [99, 269, 371, 1115, 582, 52, 7, 2, 372], [99, 269, 371, 1115, 582, 52, 7, 2, 372, 10], [99, 269, 371, 1115, 582, 52, 7, 2, 372, 10, 1116], [100, 3]]

index_to_word = {}
for key, value in tok.word_index.items():
    index_to_word[value] = key
print(index_to_word)
# {1: 'the', 2: 'a', 3: 'to', 4: 'of', 5: 'in', 6: 'for', 7: 'and', ...
print(index_to_word[150])       # fire

max_len = max(len(i) for i in sequences)
print('max_len :', max_len)     # 24

from tensorflow.keras.preprocessing.sequence import pad_sequences
sequences = pad_sequences(sequences, maxlen = max_len, padding = 'pre')
print(sequences[:3])
# [[   0    0    0    0    0    0    0    0    0    0    0    0    0    0     0    0    0    0    0    0    0    0   99  269]
#  [   0    0    0    0    0    0    0    0    0    0    0    0    0    0     0    0    0    0    0    0    0   99  269  371]
#  [   0    0    0    0    0    0    0    0    0    0    0    0    0    0     0    0    0    0    0    0   99  269  371 1115]]

import numpy as np
sequences = np.array(sequences)
x = sequences[:, :-1] # feature
y = sequences[:, -1]  # label
print(x[:3])
# [[  0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0    0   0   0   0  99]
#  [  0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0    0   0   0  99 269]
#  [  0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0    0   0  99 269 371]]

print(y[:3])
# [ 269  371 1115]

from tensorflow.keras.utils import to_categorical
y = to_categorical(y, num_classes=vocab_size)
print(y[0])
# [0. 0. 0. ... 0. 0. 0.]

from tensorflow.keras.layers import Embedding, Dense, LSTM
from tensorflow.keras.models import Sequential

model = Sequential()
model.add(Embedding(vocab_size, 32, input_length = max_len -1))
model.add(LSTM(128, activation='tanh'))
model.add(Dense(vocab_size, activation='softmax'))

model.compile(loss = 'categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
model.fit(x, y, epochs=50, verbose=2, batch_size=32)
print(model.evaluate(x, y))
# [1.3969029188156128, 0.7689350247383118]

def sentence_gen(model, t, current_word, n):
    init_word = current_word
    sentence = ''
    for _ in range(n):
        encoded = t.texts_to_sequences([current_word])[0]
        encoded = pad_sequences([encoded], maxlen = max_len - 1, padding = 'pre')
        result = np.argmax(model.predict(encoded))
        # print(result)
        for word, index in t.word_index.items():
            #print('word:', word, ', index:', index)
            if index == result:
                break
        current_word = current_word + ' ' + word
        sentence = sentence + ' '  + word # 예측단어를 문장에 저장
    sentence = init_word + sentence
    return sentence

print(sentence_gen(model, tok, 'i', 10))
print(sentence_gen(model, tok, 'how', 10))
print(sentence_gen(model, tok, 'how', 100))
print(sentence_gen(model, tok, 'good', 200))
print(sentence_gen(model, tok, 'python', 10))
# i brain injuries are tied to dementia abuse him slippery crashes
# how to serve a deranged tyrant stoically a pope fields for
# how to serve a deranged tyrant stoically a pope fields for a cathedral todo meet in a cathedral strike president apply for her police in privatized scientists about fast denmark says shot was life according at 92 was michael whims of webs and comey memoir too life aids still alive too african life on still loss to exfbi chief in new york lifts renewable sources to doing apply at 92 for say he police at pope francis say it was was too aids to behind was back to 92 was back to type not too common beach reimaginedjurassic african apartheid on
# good calls off trip to latin america citing crisis in syria not to invade back at meeting from pope francis doomed it recalls it was back to be focus of them to comey francis say risk risk it recalls about it us potent tolerance of others or products slippery leak of journalist it just hes aids hes risk it comey francis rude it was back to was not too was was rude francis it was endorse rival endorse rude was still alive 1738 african was shot him didnt him didnt it was endorse rival too was was it was endorse rival too rude apply to them to comey he officials to back to smiles at pope francis say it recalls it was back on not from uk officials of not 2002 not too pope francis too was too doomed francis not trying to them war uk officials say lawyers apply to agreement from muppets children say been mainstream it us border architect of misconduct to not francis it was say to invade endorse rival was behind apply to agreement on nafta children about gay draws near to director for north korea us children pledges recalls it was too rude francis risk
# python to men pushed to the edge investigation syria trump about

자연어 생성 글자 단위, 단어단위, 자소 단위

자연어 생성 : 단어 단위 생성

* tf_rnn8_토지_단어단위.ipynb

# 토지 또는 조선왕조실록 데이터 파일 다운로드
# https://github.com/wikibook/tf2/blob/master/Chapter7.ipynb
import tensorflow as tf
import numpy as np 

path_to_file = tf.keras.utils.get_file('toji.txt', 'https://raw.githubusercontent.com/pykwon/etc/master/rnn_test_toji.txt')
#path_to_file = 'silrok.txt'
# 데이터 로드 및 확인. encoding 형식으로 utf-8 을 지정해야합니다.
train_text = open(path_to_file, 'rb').read().decode(encoding='utf-8')

# 텍스트가 총 몇 자인지 확인합니다.
print('Length of text: {} characters'.format(len(train_text))) # Length of text: 695685 characters

# 처음 100 자를 확인해봅니다.
print(train_text[:100])
# 제 1 편 어둠의 발소리
# 1897년의 한가위.
# 까치들이 울타리 안 감나무에 와서 아침 인사를 하기도 전에, 무색 옷에 댕기꼬리를 늘인 
# 아이들은 송편을 입에 물고 마을길을 쏘

# 훈련 데이터 입력 정제
import re
# From https://github.com/yoonkim/CNN_sentence/blob/master/process_data.py
def clean_str(string):    
    string = re.sub(r"[^가-힣A-Za-z0-9(),!?\'\`]", " ", string)
    string = re.sub(r"\'ll", " \'ll", string)
    string = re.sub(r",", " , ", string)
    string = re.sub(r"!", " ! ", string)
    string = re.sub(r"\(", "", string)
    string = re.sub(r"\)", "", string)
    string = re.sub(r"\?", " \? ", string)
    string = re.sub(r"\s{2,}", " ", string)
    string = re.sub(r"\'{2,}", "\'", string)
    string = re.sub(r"\'", "", string)
    return string

train_text = train_text.split('\n')
train_text = [clean_str(sentence) for sentence in train_text]
train_text_X = []
for sentence in train_text:
    train_text_X.extend(sentence.split(' '))
    train_text_X.append('\n')
    
train_text_X = [word for word in train_text_X if word != '']

print(train_text_X[:20])  
# ['제', '1', '편', '어둠의', '발소리', '\n', '1897년의', '한가위', '\n', '까치들이', '울타리', '안', '감나무에', '와서', '아침', '인사를', '하기도', '전에', ',', '무색']

# 단어 토큰화
# 단어의 set을 만듭니다.
vocab = sorted(set(train_text_X))
vocab.append('UNK')   # 텍스트 안에 존재하지 않는 토큰을 나타내는 'UNK' 사용
print ('{} unique words'.format(len(vocab)))

# vocab list를 숫자로 맵핑하고, 반대도 실행합니다.
word2idx = {u:i for i, u in enumerate(vocab)}
idx2word = np.array(vocab)

text_as_int = np.array([word2idx[c] for c in train_text_X])

# word2idx 의 일부를 알아보기 쉽게 print 해봅니다.
print('{')
for word,_ in zip(word2idx, range(10)):
    print('  {:4s}: {:3d},'.format(repr(word), word2idx[word]))
print('  ...\n}')

print('index of UNK: {}'.format(word2idx['UNK']))

# 토큰 데이터 확인. 20개만 확인
print(train_text_X[:20])  
print(text_as_int[:20])

# 기본 데이터셋 만들기
seq_length = 25  # 25개의 단어가 주어질 경우 다음 단어를 예측하도록 데이터를 만듦
examples_per_epoch = len(text_as_int) // seq_length
sentence_dataset = tf.data.Dataset.from_tensor_slices(text_as_int)

# seq_length + 1 은 처음 25개 단어와 그 뒤에 나오는 정답이 될  1 단어를 합쳐 함께 반환하기 위함
# drop_remainder=True 남는 부분은 제거 속성
sentence_dataset = sentence_dataset.batch(seq_length + 1, drop_remainder=True)

for item in sentence_dataset.take(1):
    print(idx2word[item.numpy()])
    print(item.numpy())

# 학습 데이터셋 만들기
# 26개의 단어가 각각 입력과 정답으로 묶어서 ([25단어], 1단어) 형태의 데이터를 반환하기 위한 작업
def split_input_target(chunk):
    return [chunk[:-1], chunk[-1]]

train_dataset = sentence_dataset.map(split_input_target)
for x,y in train_dataset.take(1):
    print(idx2word[x.numpy()])
    print(x.numpy())
    print(idx2word[y.numpy()])
    print(y.numpy())

# 데이터셋 shuffle, batch 설정
BATCH_SIZE = 64
steps_per_epoch = examples_per_epoch // BATCH_SIZE
BUFFER_SIZE = 5000

train_dataset = train_dataset.shuffle(BUFFER_SIZE).batch(BATCH_SIZE, drop_remainder=True)

# 단어 단위 생성 모델 정의
total_words = len(vocab)
model = tf.keras.Sequential([
    tf.keras.layers.Embedding(total_words, 100, input_length=seq_length),
    tf.keras.layers.LSTM(units=100, return_sequences=True),
    tf.keras.layers.Dropout(0.2),
    tf.keras.layers.LSTM(units=100),
    tf.keras.layers.Dense(total_words, activation='softmax')
])

model.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])
model.summary()

# 단어 단위 생성 모델 학습
from tensorflow.keras.preprocessing.sequence import pad_sequences

def testmodel(epoch, logs):
    if epoch % 5 != 0 and epoch != 49:
        return
    test_sentence = train_text[0]

    next_words = 100
    for _ in range(next_words):
        test_text_X = test_sentence.split(' ')[-seq_length:]
        test_text_X = np.array([word2idx[c] if c in word2idx else word2idx['UNK'] for c in test_text_X])
        test_text_X = pad_sequences([test_text_X], maxlen=seq_length, padding='pre', value=word2idx['UNK'])

        output_idx = model.predict_classes(test_text_X)
        test_sentence += ' ' + idx2word[output_idx[0]]
    
    print()
    print(test_sentence)
    print()

# 모델을 학습시키며 모델이 생성한 결과물을 확인하기 위해 LambdaCallback 함수 생성
testmodelcb = tf.keras.callbacks.LambdaCallback(on_epoch_end=testmodel)

history = model.fit(train_dataset.repeat(), epochs=50, 
                steps_per_epoch=steps_per_epoch, 
                callbacks=[testmodelcb], verbose=2)

model.save('rnnmodel.hdf5')
del model
from tensorflow.keras.models import load_model
model=load_model('rnnmodel.hdf5')

# 임의의 문장을 사용한 생성 결과 확인
test_sentence = '최참판댁 사랑은 무인지경처럼 적막하다'
#test_sentence = '동헌에 나가 공무를 본 후 활 십오 순을 쏘았다'

next_words = 500
for _ in range(next_words):
    # 임의 문장 입력 후 뒤에서 부터 seq_length 만킁ㅁ의 단어(25개) 선택
    test_text_X = test_sentence.split(' ')[-seq_length:]  
    
    # 문장의 단어를 인덱스 토큰으로 바꿈. 사전에 등록되지 않은 경우에는 'UNK' 코큰값으로 변경
    test_text_X = np.array([word2idx[c] if c in word2idx else word2idx['UNK'] for c in test_text_X])
    # 문장의 앞쪽에 빈자리가 있을 경우 25개 단어가 채워지도록 패딩
    test_text_X = pad_sequences([test_text_X], maxlen=seq_length, padding='pre', value=word2idx['UNK'])
    
    # 출력 중에서 가장 값이 큰 인덱스 반환
    output_idx = model.predict_classes(test_text_X) 
    test_sentence += ' ' + idx2word[output_idx[0]] # 출력단어는 test_sentence에 누적해 다음 스테의 입력으로 활용

print(test_sentence)


# LambdaCallback
# keras에서 여러가지 상황에서 콜백이되는 class들이 만들어져 있는데, LambdaCallback 등의 Callback class들은 
# 기본적으로 keras.callbacks.Callback class를 상속받아서 특정 상황마다 콜백되는 메소드들을 재정의하여 사용합니다. 
# LambdaCallback는 lambda 평션을 작성하여 생성자에 넘기는 방식으로 사용 할 수 있습니다. 
# callback 시 받는 arg는 Callbakc class에 정의 되어 있는대로 맞춰 주어야 합니다. 
# on_epoch_end메소드로 정의하여 epoch이 끝날 때 마다 확인해보도록 하겠습니다.
# 아래 처럼 lambda 함수를 작성하여 LambdaCallback를 만들어 주고, 이때 epoch, logs는 신경 안쓰시고 arg 형태만 맞춰주면 됩니다.
# from keras.callbacks import LambdaCallback
# print_weights = LambdaCallback(on_epoch_end=lambda epoch, logs: print(model.layers[3].get_weights()))

- jamo tools

dschloe.github.io/python/tensorflow2.0/ch7_4_naturallanguagegeneration2/

Tensorflow 2.0 Tutorial ch7.4 - (2) 단어 단위 생성

공지 본 Tutorial은 교재 시작하세요 텐서플로 2.0 프로그래밍의 강사에게 국비교육 강의를 듣는 사람들에게 자료 제공을 목적으로 제작하였습니다. 강사의 주관적인 판단으로 압축해서 자료를 정

dschloe.github.io

* tf_rnn9_토지_자소단위.ipynb

!pip install jamotools

import tensorflow as tf
import numpy as np
import jamotools

path_to_file = tf.keras.utils.get_file('toji.txt', 'https://raw.githubusercontent.com/pykwon/etc/master/rnn_test_toji.txt')
#path_to_file = 'silrok.txt'
# 데이터 로드 및 확인. encoding 형식으로 utf-8 을 지정해야합니다.
train_text = open(path_to_file, 'rb').read().decode(encoding='utf-8')

# 텍스트가 총 몇 자인지 확인합니다.
print('Length of text: {} characters'.format(len(train_text))) # Length of text: 695685 characters
print()

# 처음 100 자를 확인해봅니다.
s = train_text[:100]
print(s)

# 한글 텍스트를 자모 단위로 분리해줍니다. 한자 등에는 영향이 없습니다.
s_split = jamotools.split_syllables(s)
print(s_split)

Length of text: 695685 characters

제 1 편 어둠의 발소리
1897년의 한가위.
까치들이 울타리 안 감나무에 와서 아침 인사를 하기도 전에, 무색 옷에 댕기꼬리를 늘인 
아이들은 송편을 입에 물고 마을길을 쏘
ㅈㅔ 1 ㅍㅕㄴ ㅇㅓㄷㅜㅁㅇㅢ ㅂㅏㄹㅅㅗㄹㅣ
1897ㄴㅕㄴㅇㅢ ㅎㅏㄴㄱㅏㅇㅟ.
ㄲㅏㅊㅣㄷㅡㄹㅇㅣ ㅇㅜㄹㅌㅏㄹㅣ ㅇㅏㄴ ㄱㅏㅁㄴㅏㅁㅜㅇㅔ ㅇㅘㅅㅓ ㅇㅏㅊㅣㅁ ㅇㅣㄴㅅㅏㄹㅡㄹ ㅎㅏㄱㅣㄷㅗ ㅈㅓㄴㅇㅔ, ㅁㅜㅅㅐㄱ ㅇㅗㅅㅇㅔ ㄷㅐㅇㄱㅣㄲㅗㄹㅣㄹㅡㄹ ㄴㅡㄹㅇㅣㄴ 
ㅇㅏㅇㅣㄷㅡㄹㅇㅡㄴ ㅅㅗㅇㅍㅕㄴㅇㅡㄹ ㅇㅣㅂㅇㅔ ㅁㅜㄹㄱㅗ ㅁㅏㅇㅡㄹㄱㅣㄹㅇㅡㄹ ㅆㅗ

import jamotools

jamotools.split_syllables(s) :

# 7.45 자모 결합 테스트
s2 = jamotools.join_jamos(s_split)
print(s2)
print(s == s2)

# 7.46 자모 토큰화
# 텍스트를 자모 단위로 나눕니다. 데이터가 크기 때문에 약간 시간이 걸립니다.
train_text_X = jamotools.split_syllables(train_text)
vocab = sorted(set(train_text_X))
vocab.append('UNK')
print ('{} unique characters'.format(len(vocab))) # 179 unique characters

# vocab list를 숫자로 맵핑하고, 반대도 실행합니다.
char2idx = {u:i for i, u in enumerate(vocab)}
idx2char = np.array(vocab)

text_as_int = np.array([char2idx[c] for c in train_text_X])
print(text_as_int) # [69 81  2 ...  2  1  0]

# word2idx 의 일부를 알아보기 쉽게 print 해봅니다.
print('{')
for char,_ in zip(char2idx, range(10)):
    print('  {:4s}: {:3d},'.format(repr(char), char2idx[char]))
print('  ...\n}')

print('index of UNK: {}'.format(char2idx['UNK']))

제 1 편 어둠의 발소리
1897년의 한가위.
까치들이 울타리 안 감나무에 와서 아침 인사를 하기도 전에, 무색 옷에 댕기꼬리를 늘인 
아이들은 송편을 입에 물고 마을길을 쏘
True
179 unique characters
[69 81  2 ...  2  1  0]
{
  '\n':   0,
  '\r':   1,
  ' ' :   2,
  '!' :   3,
  '"' :   4,
  "'" :   5,
  '(' :   6,
  ')' :   7,
  ',' :   8,
  '-' :   9,
  ...
}
index of UNK: 178

# 7.47 토큰 데이터 확인
print(train_text_X[:20])
print(text_as_int[:20])

# 7.48 학습 데이터세트 생성
seq_length = 80
examples_per_epoch = len(text_as_int) // seq_length
print('examples_per_epoch :', examples_per_epoch)
# examples_per_epoch : 16815
char_dataset = tf.data.Dataset.from_tensor_slices(text_as_int)

char_dataset = char_dataset.batch(seq_length+1, drop_remainder=True) # drop_remainder : 잔여 데이터 제거
for item in char_dataset.take(1):
    print(idx2char[item.numpy()])
#     ['ㅈ' 'ㅔ' ' ' '1' ' ' 'ㅍ' 'ㅕ' 'ㄴ' ' ' 'ㅇ' 'ㅓ' 'ㄷ' 'ㅜ' 'ㅁ' 'ㅇ' 'ㅢ' ' ' 'ㅂ'
#  'ㅏ' 'ㄹ' 'ㅅ' 'ㅗ' 'ㄹ' 'ㅣ' '\r' '\n' '1' '8' '9' '7' 'ㄴ' 'ㅕ' 'ㄴ' 'ㅇ' 'ㅢ' ' '
#  'ㅎ' 'ㅏ' 'ㄴ' 'ㄱ' 'ㅏ' 'ㅇ' 'ㅟ' '.' '\r' '\n' 'ㄲ' 'ㅏ' 'ㅊ' 'ㅣ' 'ㄷ' 'ㅡ' 'ㄹ' 'ㅇ'
#  'ㅣ' ' ' 'ㅇ' 'ㅜ' 'ㄹ' 'ㅌ' 'ㅏ' 'ㄹ' 'ㅣ' ' ' 'ㅇ' 'ㅏ' 'ㄴ' ' ' 'ㄱ' 'ㅏ' 'ㅁ' 'ㄴ'
#  'ㅏ' 'ㅁ' 'ㅜ' 'ㅇ' 'ㅔ' ' ' 'ㅇ' 'ㅘ' 'ㅅ']
    print('item.numpy() :', item.numpy())
# item.numpy() : [69 81  2 13  2 74 82 49  2 68 80 52 89 62 68 95  2 63 76 54 66 84 54 96
#   1  0 13 20 21 19 49 82 49 68 95  2 75 76 49 46 76 68 92 10  1  0 47 76
#  71 96 52 94 54 68 96  2 68 89 54 73 76 54 96  2 68 76 49  2 46 76 62 49
#  76 62 89 68 81  2 68 85 66]

def split_input_target(chunk):
    return [chunk[:-1], chunk[-1]]

train_dataset = char_dataset.map(split_input_target)
for x,y in train_dataset.take(1):
    print(idx2char[x.numpy()])
    print(x.numpy())
    print(idx2char[y.numpy()])
    print(y.numpy())
    
BATCH_SIZE = 64
steps_per_epoch = examples_per_epoch // BATCH_SIZE
BUFFER_SIZE = 5000

train_dataset = train_dataset.shuffle(BUFFER_SIZE).batch(BATCH_SIZE, drop_remainder=True)

ㅈㅔ 1 ㅍㅕㄴ ㅇㅓㄷㅜㅁㅇㅢ ㅂㅏㄹ
[69 81  2 13  2 74 82 49  2 68 80 52 89 62 68 95  2 63 76 54]
examples_per_epoch : 16815
['ㅈ' 'ㅔ' ' ' '1' ' ' 'ㅍ' 'ㅕ' 'ㄴ' ' ' 'ㅇ' 'ㅓ' 'ㄷ' 'ㅜ' 'ㅁ' 'ㅇ' 'ㅢ' ' ' 'ㅂ'
 'ㅏ' 'ㄹ' 'ㅅ' 'ㅗ' 'ㄹ' 'ㅣ' '\r' '\n' '1' '8' '9' '7' 'ㄴ' 'ㅕ' 'ㄴ' 'ㅇ' 'ㅢ' ' '
 'ㅎ' 'ㅏ' 'ㄴ' 'ㄱ' 'ㅏ' 'ㅇ' 'ㅟ' '.' '\r' '\n' 'ㄲ' 'ㅏ' 'ㅊ' 'ㅣ' 'ㄷ' 'ㅡ' 'ㄹ' 'ㅇ'
 'ㅣ' ' ' 'ㅇ' 'ㅜ' 'ㄹ' 'ㅌ' 'ㅏ' 'ㄹ' 'ㅣ' ' ' 'ㅇ' 'ㅏ' 'ㄴ' ' ' 'ㄱ' 'ㅏ' 'ㅁ' 'ㄴ'
 'ㅏ' 'ㅁ' 'ㅜ' 'ㅇ' 'ㅔ' ' ' 'ㅇ' 'ㅘ' 'ㅅ']
item.numpy() : [69 81  2 13  2 74 82 49  2 68 80 52 89 62 68 95  2 63 76 54 66 84 54 96
  1  0 13 20 21 19 49 82 49 68 95  2 75 76 49 46 76 68 92 10  1  0 47 76
 71 96 52 94 54 68 96  2 68 89 54 73 76 54 96  2 68 76 49  2 46 76 62 49
 76 62 89 68 81  2 68 85 66]
['ㅈ' 'ㅔ' ' ' '1' ' ' 'ㅍ' 'ㅕ' 'ㄴ' ' ' 'ㅇ' 'ㅓ' 'ㄷ' 'ㅜ' 'ㅁ' 'ㅇ' 'ㅢ' ' ' 'ㅂ'
 'ㅏ' 'ㄹ' 'ㅅ' 'ㅗ' 'ㄹ' 'ㅣ' '\r' '\n' '1' '8' '9' '7' 'ㄴ' 'ㅕ' 'ㄴ' 'ㅇ' 'ㅢ' ' '
 'ㅎ' 'ㅏ' 'ㄴ' 'ㄱ' 'ㅏ' 'ㅇ' 'ㅟ' '.' '\r' '\n' 'ㄲ' 'ㅏ' 'ㅊ' 'ㅣ' 'ㄷ' 'ㅡ' 'ㄹ' 'ㅇ'
 'ㅣ' ' ' 'ㅇ' 'ㅜ' 'ㄹ' 'ㅌ' 'ㅏ' 'ㄹ' 'ㅣ' ' ' 'ㅇ' 'ㅏ' 'ㄴ' ' ' 'ㄱ' 'ㅏ' 'ㅁ' 'ㄴ'
 'ㅏ' 'ㅁ' 'ㅜ' 'ㅇ' 'ㅔ' ' ' 'ㅇ' 'ㅘ']
[69 81  2 13  2 74 82 49  2 68 80 52 89 62 68 95  2 63 76 54 66 84 54 96
  1  0 13 20 21 19 49 82 49 68 95  2 75 76 49 46 76 68 92 10  1  0 47 76
 71 96 52 94 54 68 96  2 68 89 54 73 76 54 96  2 68 76 49  2 46 76 62 49
 76 62 89 68 81  2 68 85]
ㅅ
66

# 7.49 자소 단위 생성 모델 정의
total_chars = len(vocab)
model = tf.keras.Sequential([
    tf.keras.layers.Embedding(total_chars, 100, input_length=seq_length),
    tf.keras.layers.LSTM(units=400, activation='tanh'),
    tf.keras.layers.Dense(total_chars, activation='softmax')
])

model.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])
model.summary()   # Total params: 891,279

# 7.50 자소 단위 생성 모델 학습
from tensorflow.keras.preprocessing.sequence import pad_sequences

def testmodel(epoch, logs):
    if epoch % 5 != 0 and epoch != 99:
        return
    
    test_sentence = train_text[:48]
    test_sentence = jamotools.split_syllables(test_sentence)

    next_chars = 300
    for _ in range(next_chars):
        test_text_X = test_sentence[-seq_length:]
        test_text_X = np.array([char2idx[c] if c in char2idx else char2idx['UNK'] for c in test_text_X])
        test_text_X = pad_sequences([test_text_X], maxlen=seq_length, padding='pre', value=char2idx['UNK'])

        output_idx = model.predict_classes(test_text_X)
        test_sentence += idx2char[output_idx[0]]
    
    print()
    print(jamotools.join_jamos(test_sentence))
    print()

testmodelcb = tf.keras.callbacks.LambdaCallback(on_epoch_end=testmodel)

history = model.fit(train_dataset.repeat(), epochs=50, steps_per_epoch=steps_per_epoch, \
                    callbacks=[testmodelcb], verbose=2)

Epoch 1/50
262/262 - 37s - loss: 2.9122 - accuracy: 0.2075
/usr/local/lib/python3.7/dist-packages/tensorflow/python/keras/engine/sequential.py:450: UserWarning: `model.predict_classes()` is deprecated and will be removed after 2021-01-01. Please use instead:* `np.argmax(model.predict(x), axis=-1)`,   if your model does multi-class classification   (e.g. if it uses a `softmax` last-layer activation).* `(model.predict(x) > 0.5).astype("int32")`,   if your model does binary classification   (e.g. if it uses a `sigmoid` last-layer activation).
  warnings.warn('`model.predict_classes()` is deprecated and '

제 1 편 어둠의 발소리
1897년의 한가위.
까치들이 울타리 안 감나무에 와서 안이이 알이이 알이이 알이이 알이이 알이이 알이이 알이이 알이이 알이이 알이이 알이이 알이이 알이이 알이이 알이이 알이이 알이이 알이이 알이이 알이이 알이이 알이이 알이이 알이이 알이이 알이이 알이이 알이이 알이이 알이이 알이이 알이이 알이이 알이이 알이이 알이이 알잉

Epoch 2/50
262/262 - 7s - loss: 2.3712 - accuracy: 0.3002
Epoch 3/50
262/262 - 7s - loss: 2.2434 - accuracy: 0.3256
Epoch 4/50
262/262 - 7s - loss: 2.1652 - accuracy: 0.3414
Epoch 5/50
262/262 - 7s - loss: 2.1132 - accuracy: 0.3491
Epoch 6/50
262/262 - 7s - loss: 2.0670 - accuracy: 0.3600

제 1 편 어둠의 발소리
1897년의 한가위.
까치들이 울타리 안 감나무에 와서 았다.  "아난 강이는 갈이는 갈이는 갈이는 갈이는 갈이는 갈이는 갈이는 갈이는 갈이는 갈이는 갈이는 갈이는 갈이는 갈이는 갈이는 갈이는 갈이는 갈이는 갈이는 갈이는 갈이는 갈이는 갈이는 갈이는 갈이는 갈이는 갈이는 갈이는 갈이는 갈이는 갈이는

Epoch 7/50
262/262 - 7s - loss: 2.0299 - accuracy: 0.3709
Epoch 8/50
262/262 - 7s - loss: 1.9852 - accuracy: 0.3810
Epoch 9/50
262/262 - 7s - loss: 1.9415 - accuracy: 0.3978
Epoch 10/50
262/262 - 7s - loss: 1.9119 - accuracy: 0.4020
Epoch 11/50
262/262 - 7s - loss: 1.8684 - accuracy: 0.4153

제 1 편 어둠의 발소리
1897년의 한가위.
까치들이 울타리 안 감나무에 와서 아니라고 날 간 간 간 간 간 간 간 간 간 간 간 간 간 간 간 간 간 간 간 간 간 간 간 간 간 간 간 간 간 간 간 간 간 간 간 간 간 간 간 간 간 간 간 간 간 간 간 간 간 간 간 간 간 간 간 간 간 간 간 간 간 간 간 간 간 간 간 간 간 간 간 간 ㄱ

Epoch 12/50
262/262 - 7s - loss: 1.8237 - accuracy: 0.4272
Epoch 13/50
262/262 - 7s - loss: 1.7745 - accuracy: 0.4429
Epoch 14/50
262/262 - 7s - loss: 1.7272 - accuracy: 0.4625
Epoch 15/50
262/262 - 7s - loss: 1.6779 - accuracy: 0.4688
Epoch 16/50
262/262 - 7s - loss: 1.6217 - accuracy: 0.4902

제 1 편 어둠의 발소리
1897년의 한가위.
까치들이 울타리 안 감나무에 와서 안 간 간 간 간 간 간 간 간 간 간 간 간 간 간 간 간 간 간 간 간 간 간 간 간 간 간 간 간 간 간 간 간 간 간 간 간 간 간 간 간 간 간 간 간 간 간 간 간 간 간 간 간 간 간 간 간 간 간 간 간 간 간 간 간 간 간 간 간 간 간 간 간 간 간 가

Epoch 17/50
262/262 - 7s - loss: 1.5658 - accuracy: 0.5041
Epoch 18/50
262/262 - 7s - loss: 1.4984 - accuracy: 0.5252
Epoch 19/50
262/262 - 7s - loss: 1.4413 - accuracy: 0.5443
Epoch 20/50
262/262 - 7s - loss: 1.3629 - accuracy: 0.5704
Epoch 21/50
262/262 - 7s - loss: 1.2936 - accuracy: 0.5923

제 1 편 어둠의 발소리
1897년의 한가위.
까치들이 울타리 안 감나무에 와서 아니요."
  "아는 말이 잡아낙에 사람 가나가 가라. 가나구가 가람 가나고. 사남이 바람이 그렇게 없는 노루가 가나가오. 아니라."
  "아는 말이 잡아낙에 사람 가나가 가라. 가나구가 가람 가나고. 사남이 바람이 그렇게 없는 노루가 가나가오. 아니라."
  "아는 말이 잡아낙에 사람 가나가 ㄱ

Epoch 22/50
262/262 - 7s - loss: 1.2142 - accuracy: 0.6217
Epoch 23/50
262/262 - 7s - loss: 1.1281 - accuracy: 0.6505
Epoch 24/50
262/262 - 7s - loss: 1.0444 - accuracy: 0.6786
Epoch 25/50
262/262 - 7s - loss: 0.9711 - accuracy: 0.7047
Epoch 26/50
262/262 - 7s - loss: 0.8712 - accuracy: 0.7445

제 1 편 어둠의 발소리
1897년의 한가위.
까치들이 울타리 안 감나무에 와서 아니요."
  "예, 서방이 타람 자기 있는 놀을 벤  앙이는  곡서방을 마지 않았다. 장모수는 잠 밀 앞은 알 앞은 것이다. 그러나 속으로 나랑치를  그렇더면 정을 비한 것은 알굴이 말고 마른 안 부리 전 물어지를 하는 것이다. 그런 소릴 긴데 없는  갈아조 말 앞은 ㅇ

Epoch 27/50
262/262 - 7s - loss: 0.8168 - accuracy: 0.7620
Epoch 28/50
262/262 - 7s - loss: 0.7244 - accuracy: 0.7985
Epoch 29/50
262/262 - 7s - loss: 0.6301 - accuracy: 0.8362
Epoch 30/50
262/262 - 7s - loss: 0.5399 - accuracy: 0.8695
Epoch 31/50
262/262 - 7s - loss: 0.4745 - accuracy: 0.8950

제 1 편 어둠의 발소리
1897년의 한가위.
까치들이 울타리 안 감나무에 와서 아니요."
 "그건 치신하게 소기일이고 나랑치가  가래 참아노려 하고 사람을 딸려들  하더나건서반다.
 "줄에 나무장 다과 있는데 마을 세나강이 사이오. 나익은 노영은 물을 딸로나 잉인이의 얼굴이  없고 바람을 들었다. 그 천덕이 속을 거밀렀다. 지녁해지직 때 났는데 이 ㅇ

Epoch 32/50
262/262 - 7s - loss: 0.3956 - accuracy: 0.9234
Epoch 33/50
262/262 - 7s - loss: 0.3326 - accuracy: 0.9429
Epoch 34/50
262/262 - 7s - loss: 0.2787 - accuracy: 0.9577
Epoch 35/50
262/262 - 7s - loss: 0.2249 - accuracy: 0.9738
Epoch 36/50
262/262 - 7s - loss: 0.1822 - accuracy: 0.9837

제 1 편 어둠의 발소리
1897년의 한가위.
까치들이 울타리 안 감나무에 와서 아니요."
 "그래 갱째기를 하였던 것이다. 그러나 소습으로  있을 물었다. 가나게, 한장한다. 
 "안 빌리 장에서  나였다. 체신기린 조한 시릴 세에 있이는 노랭이 되었다. 나무지 족은 야우는 물을 만다.  울씨는 지소 가라! 담하는 누눌이 말씨갔다. 
 "일서 좀은 이융이의 ㄴ

Epoch 37/50
262/262 - 7s - loss: 0.1399 - accuracy: 0.9902
Epoch 38/50
262/262 - 7s - loss: 0.1123 - accuracy: 0.9942
Epoch 39/50
262/262 - 7s - loss: 0.0864 - accuracy: 0.9968
Epoch 40/50
262/262 - 7s - loss: 0.0713 - accuracy: 0.9979
Epoch 41/50
262/262 - 7s - loss: 0.0552 - accuracy: 0.9989

제 1 편 어둠의 발소리
1897년의 한가위.
까치들이 울타리 안 감나무에 와서 아니요."
  "전을 짐 집 갚고 불랑이 떳어지는 닷마닥네서 쑤어직 때 가자. 다동이 타타자그마."
  "전을 지작한고, 그런 세닉을 바들 가는고 마진 오를 기는 불림이 최치른다. 한 일을 물었다.
  "눈저 살아노, 흔자하는 나루에."
  "저는 물을 물어져들  자몬 아니요."
  

Epoch 42/50
262/262 - 7s - loss: 0.0431 - accuracy: 0.9994
Epoch 43/50
262/262 - 7s - loss: 0.0325 - accuracy: 0.9998
Epoch 44/50
262/262 - 7s - loss: 0.2960 - accuracy: 0.9110
Epoch 45/50
262/262 - 7s - loss: 0.1939 - accuracy: 0.9540
Epoch 46/50
262/262 - 7s - loss: 0.0542 - accuracy: 0.9979

제 1 편 어둠의 발소리
1897년의 한가위.
까치들이 울타리 안 감나무에 와서 아니요."
  "전을 떰은 탕첫
은 안나무네."
  "그건 심심해 사람을 놀었다. 음....간 들지 휜 오를 
까끄치올을 쓸어얐다. 베여 갈인 덧못이야. 그건 시김이 잡아나게 생가가지가 다라고들 하고 사람들 색했으면 조신 소리를 필징이 갔다. 체공인 든장은 무소 그러핬다

Epoch 47/50
262/262 - 7s - loss: 0.0265 - accuracy: 0.9999
Epoch 48/50
262/262 - 7s - loss: 0.0189 - accuracy: 1.0000
Epoch 49/50
262/262 - 7s - loss: 0.0156 - accuracy: 1.0000
Epoch 50/50
262/262 - 7s - loss: 0.0131 - accuracy: 1.0000

model.save('rnnmodel2.hdf5')

# 7.51 임의의 문장을 사용한 생성 결과 확인
from tensorflow.keras.preprocessing.sequence import pad_sequences
test_sentence = '최참판댁 사랑은 무인지경처럼 적막하다'
test_sentence = jamotools.split_syllables(test_sentence)

next_chars = 5000
for _ in range(next_chars):
    test_text_X = test_sentence[-seq_length:]
    test_text_X = np.array([char2idx[c] if c in char2idx else char2idx['UNK'] for c in test_text_X])
    test_text_X = pad_sequences([test_text_X], maxlen=seq_length, padding='pre', value=char2idx['UNK'])
    
    output_idx = model.predict_classes(test_text_X)
    test_sentence += idx2char[output_idx[0]]
    

print(jamotools.join_jamos(test_sentence))

/usr/local/lib/python3.7/dist-packages/tensorflow/python/keras/engine/sequential.py:450: UserWarning: `model.predict_classes()` is deprecated and will be removed after 2021-01-01. Please use instead:* `np.argmax(model.predict(x), axis=-1)`,   if your model does multi-class classification   (e.g. if it uses a `softmax` last-layer activation).* `(model.predict(x) > 0.5).astype("int32")`,   if your model does binary classification   (e.g. if it uses a `sigmoid` last-layer activation).
  warnings.warn('`model.predict_classes()` is deprecated and '
최참판댁 사랑은 무인지경처럼 적막하다가 최차는 야우는 물을 물었다.
  "내가 적에서 이자사 아니요."
  "전을 떰은 안 불이라 강천 갓촉을 농해 났이 마으러서 같은 노웅이 것을  치문이 참만난 함잉이의 송순이라 강을 덜걱을  눈치를  최철었다.
  "오를 놀짝은 노영은 뭇
이들을 달려들  도만 살알이 되었다. 나무지 족은 야움을 놀을 허렸다. 그건 신기를 물을 달려들 딸을 동렸다. 
“선으로 바람 사람을 바습으로 나왔다. 
간가나게, 사람들 사람을  했는 노래원이 머를 세 있는 것 같았다.
  강산이 족은 양피는 가날이 아니요."
  "예, 산고 가탕이 다시 죽얼 지 구천하게  그러세 얼굴을 달련 덧을  부칠리 질 없는 농은 언분네요."
  "그건 소리가 아니물이 용닌이 되수  성을 흠금글고 달려가지 아나갔다. 
 "그러나 내간이고 내발이 탂어서 달려가지 않나. 무고 고랭각은 다사 죽어러들어지 않았다. 
간곡이얐다. 지낙해지 안 가기요."
 "잔은 안 불린 것 같았다.
  강산이 종굴이  왔고 싶은 상모수는 곳한 곰집은 것 같았다.
  강산이 족은 양피는 가날이 아니요."
  "예, 산고 가탕이 다시 죽얼 지 구천하게  그러세 얼굴을 달련 덧을  부칠리 질 없는 농은 언분네요."
  "그건 소리가 아니물이 용닌이 되수  성을 흠금글고 달려가지 아나갔다. 
 "그러나 내간이고 내발이 탂어서 달려가지 않나. 무고 고랭각은 다사 죽어러들어지 않았다. 
간곡이얐다. 지낙해지 안 가기요."
 "잔은 안 불린 것 같았다.
  강산이 종굴이  왔고 싶은 상모수는 곳한 곰집은 것 같았다.
  강산이 족은 양피는 가날이 아니요."
  "예, 산고 가탕이 다시 죽얼 지 구천하게  그러세 얼굴을 달련 덧을  부칠리 질 없는 농은 언분네요."
  "그건 소리가 아니물이 용닌이 되수  성을 흠금글고 달려가지 아나갔다. 
 "그러나 내간이고 내발이 탂어서 달려가지 않나. 무고 고랭각은 다사 죽어러들어지 않았다. 
간곡이얐다. 지낙해지 안 가기요."
 "잔은 안 불린 것 같았다.
  강산이 종굴이  왔고 싶은 상모수는 곳한 곰집은 것 같았다.
  강산이 족은 양피는 가날이 아니요."
  "예, 산고 가탕이 다시 죽얼 지 구천하게  그러세 얼굴을 달련 덧을  부칠리 질 없는 농은 언분네요."
  "그건 소리가 아니물이 용닌이 되수  성을 흠금글고 달려가지 아나갔다. 
 "그러나 내간이고 내발이 탂어서 달려가지 않나. 무고 고랭각은 다사 죽어러들어지 않았다. 
간곡이얐다. 지낙해지 안 가기요."
 "잔은 안 불린 것 같았다.
  강산이 종굴이  왔고 싶은 상모수는 곳한 곰집은 것 같았다.
  강산이 족은 양피는 가날이 아니요."
  "예, 산고 가탕이 다시 죽얼 지 구천하게  그러세 얼굴을 달련 덧을  부칠리 질 없는 농은 언분네요."
  "그건 소리가 아니물이 용닌이 되수  성을 흠금글고 달려가지 아나갔다. 
 "그러나 내간이고 내발이 탂어서 달려가지 않나. 무고 고랭각은 다사 죽어러들어지 않았다. 
간곡이얐다. 지낙해지 안 가기요."
 "잔은 안 불린 것 같았다.
  강산이 종굴이  왔고 싶은 상모수는 곳한 곰집은 것 같았다.
  강산이 족은 양피는 가날이 아니요."
  "예, 산고 가탕이 다시 죽얼 지 구천하게  그러세 얼굴을 달련 덧을  부칠리 질 없는 농은 언분네요."
  "그건 소리가 아니물이 용닌이 되수  성을 흠금글고 달려가지 아나갔다. 
 "그러나 내간이고 내발이 탂어서 달려가지 않나. 무고 고랭각은 다사 죽어러들어지 않았다. 
간곡이얐다. 지낙해지 안 가기요."
 "잔은 안 불린 것 같았다.
  강산이 종굴이  왔고 싶은 상모수는 곳한 곰집은 것 같았다.
  강산이 족은 양피는 가날이 아니요."
  "예, 산고 가탕이 다시 죽얼 지 구천하게  그러세 얼굴을 달련 덧을  부칠리 질 없는 농은 언분네요."
  "그건 소리가 아니물이 용닌이 되수  성을 흠금글고 달려가지 아나갔다. 
 "그러나 내간이고 내발이 탂어서 달려가지 않나. 무고 고랭각은 다사 죽어러들어지 않았다. 
간곡이얐다. 지낙해지 안 가기요."
 "잔은 안 불린 것 같았다.
  강산이 종굴이  왔고 싶은 상모수는 곳한 곰집은 것 같았다.
  강산이 족은 양피는 가날이 아니요."
  "예, 산고 가탕이 다시 죽얼 지 구천하게  그러세 얼굴을 달련 덧을  부칠리 질 없는 농은 언분네요."
  "그건 소리가 아니물이 용닌이 되수  성을 흠금글고 달려가지 아나갔다. 
 "그러나 내간이고 내발이 탂어서 달려가지 않나. 무고 고랭각은 다사 죽어러들어지 않았다. 
간곡이얐다. 지낙해지 안 가기요."
 "잔은 안 불린 것 같았다.
  강산이 종굴이  왔고 싶은 상모수는 곳한 곰집은 것 같았다.
  강산이 족은 양피는 가날이 아니요."
  "예, 산고 가탕이 다시 죽얼 지 구천하게  그러세 얼굴을 달련 덧을  부칠리 질 없는 농은 언분네요."
  "그건 소리가 아니물이 용닌이 되수  성을 흠금글고 달려가지 아나갔다. 
 "그러나 내간이고 내발이 탂어서 달려가지 않나. 무고 고랭각은 다사 죽어러들어지 않았다. 
간곡이얐다. 지낙해지 안 가기요."
 "잔은 안 불린 것 같았다.
  강산이 종굴이  왔고 싶은 상모수는 곳한 곰지

RNN을 이용한 스펨메일 분류(이진 분류)

* tf_rnn10_스팸메일분류.ipynb

import pandas as pd

data = pd.read_csv('https://raw.githubusercontent.com/pykwon/python/master/testdata_utf8/spam.csv', encoding='latin1')
print(data.head())
print('샘플 수 : ', len(data))                       # 샘플 수 :  5572
del data['Unnamed: 2']
del data['Unnamed: 3']
del data['Unnamed: 4']
print(data.head())
#      v1  ... Unnamed: 4
# 0   ham  ...        NaN
# 1   ham  ...        NaN
# 2  spam  ...        NaN
# 3   ham  ...        NaN
# 4   ham  ...        NaN
print(data.v1.unique())                              # ['ham' 'spam']
data['v1'] = data['v1'].replace(['ham', 'spam'], [0, 1])
print(data.head())

# Null 여부 확인
print(data.isnull().values.any())                     # False
print(data.info())

# 중복 데이터 확인
print(data['v2'].nunique())                           # 5169

data.drop_duplicates(subset=['v2'], inplace=True)
print('중복 데이터 제거 후 샘플 수 : ', len(data))    # 5169

print(data.groupby('v1').size().reset_index(name='count'))
#    v1  count
# 0   0   4516
# 1   1    653

# feature(v2), label(v1) 분리
xdata = data['v2']
ydata = data['v1']
print(xdata[:3])
# 0    Go until jurong point, crazy.. Available only ...
# 1                        Ok lar... Joking wif u oni...
# 2    Free entry in 2 a wkly comp to win FA Cup fina...

print(ydata[:3])
# 0    0
# 1    0
# 2    1

- token 처리

from tensorflow.keras.preprocessing.text import Tokenizer
tok = Tokenizer()
tok.fit_on_texts(xdata)
print(tok.word_index) # {'i': 1, 'to': 2, 'you': 3, 'a': 4, 'the': 5, 'u': 6, 'and': 7, 'in': 8, 'is': 9, 'me': 10 ...
sequences = tok.texts_to_sequences(xdata)
print(xdata[:5])
# 0    Go until jurong point, crazy.. Available only ...
# 1                        Ok lar... Joking wif u oni...
# 2    Free entry in 2 a wkly comp to win FA Cup fina...
# 3    U dun say so early hor... U c already then say...
# 4    Nah I don't think he goes to usf, he lives aro...
print(sequences[:5])
# [[47, 433, 4013, 780, 705, 662, 64, 8, 1202, 94, 121, 434, 1203, ...
word_index = tok.word_index
print(word_index)
# {'i': 1, 'to': 2, 'you': 3, 'a': 4, 'the': 5, 'u': 6, 'and': 7, 'in': 8, 'is': 9, 'me': 10, ...
print(len(word_index)) # 8920

# 전체 자료 중 등장빈도 수, 비율 확인
threshold = 2                   # 등장빈도 수를 제한
total_count = len(word_index)   # 전체 단어 수
rare_count = 0                  # 빈도 수가 threshold 보다 작은 경우
total_freq = 0                  # 전체 단어 빈도 수 총합 비율
rare_freq = 0                   # 빈도 수 가 threshold보다 작은 경우의 단어 빈도 수 총합 비율 전체 자료 중 등장빈도 수, 비율 확인
threshold = 2                   # 등장빈도 수를 제한
total_count = len(word_index)   # 전체 단어 수
rare_count = 0                  # 빈도 수가 threshold 보다 작은 경우
total_freq = 0                  # 전체 단어 빈도 수 총합 비율
rare_freq = 0                   # 빈도 수 가 threshold보다 작은 경우의 단어 빈도 수 총합 비율

# dict type의 단어/빈도수 얻기
for key, value in tok.word_counts.items():
    #print('k:{} va:{}'.format(key, value))
    # k:jd va:1
    # k:accounts va:1
    # k:executive va:2
    # k:parents' va:2
    # k:picked va:7
    # k:downstem va:1
    # k:08718730555 va:
    total_freq = total_freq + value

    if value < threshold:
        rare_count = rare_count + 1
        rare_freq = rare_freq + value

print('등장빈도가 1회인 단어 수 :', rare_count)                                # 등장빈도가 1회인 단어 수 : 4908
print('등장빈도가 1회인 단어 비율 :', (rare_count / total_count) * 100)        # 등장빈도가 1회인 단어 비율 : 55.02242152466368
print('전체 중 등장빈도가 1회인 단어 비율 :', (rare_freq / total_freq) * 100)  # 전체 중 등장빈도가 1회인 단어 비율 : 6.082538108811501

tok = Tokenizer(num_words= total_count - rare_count + 1)

vocab_size = len(word_index) + 1
print('단어 집합 크기 :', vocab_size)   # 단어 집합 크기 : 8921

# train/test 8:2
n_of_train = int(len(sequences) * 0.8)
n_of_test = int(len(sequences) - n_of_train)
print('train lenghth :', n_of_train)    # train lenghth : 4135
print('test lenghth :', n_of_test)      # test lenghth : 1034

# 메일의 길이 확인
x_data = sequences
print('메일의 최대 길이 :', max((len(i) for i in x_data)))          # 메일의 최대 길이 : 189
print('메일의 평균 길이 :', (sum(map(len, x_data)) / len(x_data)))  # 메일의 평균 길이 : 15.610369510543626

# 시각화
import matplotlib.pyplot as plt
plt.hist([len(siz) for siz in x_data], bins=50)
plt.xlabel('length')
plt.ylabel('count')
plt.show()

from tensorflow.keras.preprocessing.sequence import pad_sequences
max_len = max((len(i) for i in x_data))
data = pad_sequences(x_data, maxlen=max_len)
print(data.shape)       # (5169, 189)

# train/test 분리
import numpy as np
x_train = data[:n_of_train]
y_train = np.array(ydata[:n_of_train])
x_test = data[n_of_train:]
y_test = np.array(ydata[n_of_train:])
print(x_train.shape, x_train[:2])   # (4135, 189)
print(y_train.shape, y_train[:2])   # (4135,)
print(x_test.shape, y_test.shape)   # (1034, 189) (1034,)

# 모델
from tensorflow.keras.layers import LSTM, Embedding, Dense, Dropout
from tensorflow.keras.models import Sequential

model = Sequential()
model.add(Embedding(vocab_size, 32))
model.add(LSTM(32, activation='tanh'))
model.add(Dense(32, activation='relu'))
model.add(Dropout(0.2))
model.add(Dense(1, activation='sigmoid'))
print(model.summary())      # Total params: 294,881

model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['acc'])
history = model.fit(x_train, y_train, epochs=5, batch_size=32, validation_split=0.25, verbose=2)
print('loss, acc :', model.evaluate(x_test, y_test))    # loss, acc : [0.05419406294822693, 0.9893617033958435]

# print(x_test[0])

# loss, acc 변화에 대한 시각화
epochs = range(1, len(history.history['acc']) + 1)
plt.plot(epochs, history.history['loss'])
plt.plot(epochs, history.history['val_loss'])
plt.xlabel('epoch')
plt.ylabel('loss')
plt.legend(['train loss', 'validation_loss'])
plt.show()

plt.plot(epochs, history.history['acc'])
plt.plot(epochs, history.history['val_acc'])
plt.xlabel('epoch')
plt.ylabel('acc')
plt.legend(['train acc', 'validation_acc'])
plt.show()

로이터 뉴스 분류하기

wikidocs.net/22933

위키독스

온라인 책을 제작 공유하는 플랫폼 서비스

wikidocs.net

Keras를 이용한 One-hot encoding, Embedding

cafe.daum.net/flowlife/S2Ul/19

dacon.io/codeshare/1892

wikidocs.net/33520

word2vec tf idf

blog.naver.com/PostView.nhn?blogId=happyrachy&logNo=221285427229&parentCategoryNo=&categoryNo=16&viewDate=&isShowPopularPosts=false&from=postView

* tf_rnn11_뉴스카테고리분류.ipynb

import numpy as np
import tensorflow as tf
from tensorflow.keras.datasets import reuters
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense, Embedding
from tensorflow.keras.preprocessing import sequence
from tensorflow.keras.utils import to_categorical

np.random.seed(3)
tf.random.set_seed(3)

#print(reuters.load_data())
# (x_train, y_train), (x_test, y_test) = reuters.load_data()
# print(x_train.shape, y_train.shape, x_test.shape, y_test.shape) # (8982,) (8982,) (2246,) (2246,)

(x_train, y_train), (x_test, y_test) = reuters.load_data(num_words=1000, test_split=0.2) # test_split=0.2 default, num_words=1000 : 빈도 순위 1000이하 값만 출력
#(x_train, y_train), (x_test, y_test) = reuters.load_data(num_words=None, test_split=0.2) # test_split=0.2 default
print(x_train.shape, y_train.shape, x_test.shape, y_test.shape) # (8982,) (8982,) (2246,) (2246,)

category = np.max(y_train) + 1
print('category :', category)   # category : 46
print(x_train[:3])              # 숫자가 작을 수록 빈도수가 높음
# [list([1, 2, 2, 8, 43, 10, 447, 5, 25, 207, 270, 5, 2, 111, 16, 369, 186, 90, 67, 7, 89, ... 109, 15, 17, 12])
#  list([1, 2, 699, 2, 2, 56, 2, 2, 9, 56, 2, 2, 81, 5, 2, 57, 366, 737, 132, 20, 2, 7, 2, ...  2, 505, 17, 12])
#  list([1, 53, 12, 284, 15, 14, 272, 26, 53, 959, 32, 818, 15, 14, 272, 26, 39, 684, 70, ... 59, 11, 17, 12])]
print(y_train[:3])              # [3 4 3]
print(len(s) for s in x_train)

import matplotlib.pyplot as plt
plt.hist([len(s) for s in x_train], bins = 50)
plt.xlabel('length')
plt.ylabel('number')
plt.show()

- 데이터 구성

word_index = reuters.get_word_index()
print(word_index)       # {'mdbl': 10996, 'fawc': 16260, 'degussa': 12089, 'woods': 8803, 'hanging': 13796, ... }

index_to_word = {}
for k, v in word_index.items():
    index_to_word[v] = k

print(index_to_word)    # {10996: 'mdbl', 16260: 'fawc', 12089: 'degussa', 8803: 'woods', 13796: 'hanging ... }
print(index_to_word[1]) # the
print(index_to_word[10])    # for
print(index_to_word[100])   # group

print(x_train[0])                                       # [1, 2, 2, 8, 43, 10, 447, 5, 25, 207, 270, 5, 2, 111, 16, 369, ...
print(' '.join(index_to_word[i] for i in x_train[0]))   # he of of mln loss for plc said at only ended said of  ...

- 모델

x_train = sequence.pad_sequences(x_train, maxlen=100)
x_test = sequence.pad_sequences(x_test, maxlen=100)
y_train = to_categorical(y_train)
y_test = to_categorical(y_test)
print(x_test)
# [[  0   0   0 ...  15  17  12]
#  [  0   0   0 ... 505  17  12]
#  [ 19 758  15 ...  11  17  12]
#  ...
#  [  0   0   0 ... 407  17  12]
#  [ 88   2  72 ... 364  17  12]
#  [125   2  21 ... 113  17  12]]
#print(y_test)

model = Sequential()
model.add(Embedding(1000, 100))
model.add(LSTM(100, activation='tanh'))
model.add(Dense(46, activation='softmax'))
print(model.summary())  # Total params: 185,046

model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])

history = model.fit(x_train, y_train, batch_size=64, epochs=50, validation_data=(x_test, y_test), verbose=2)

- 시각화

vloss = history.history['val_loss']
loss = history.history['loss']
x_len = np.arange(len(loss))
plt.plot(x_len, vloss, marker='.', c='red', label='train val_loss')
plt.plot(x_len, loss, marker='o', c='blue', label='train loss')
plt.xlabel('epoch')
plt.ylabel('loss')
plt.legend()
plt.show()

vacc = history.history['val_accuracy']
acc = history.history['accuracy']
x_len = np.arange(len(acc))
plt.plot(x_len, vacc, marker='.', c='red', label='train val_acc')
plt.plot(x_len, acc, marker='o', c='blue', label='train acc')
plt.xlabel('epoch')
plt.ylabel('acc')
plt.legend()
plt.show()

- IMDB

wikidocs.net/24586

위키독스

온라인 책을 제작 공유하는 플랫폼 서비스

wikidocs.net

tf 2 에서 RNN(LSTM) sample

cafe.daum.net/flowlife/S2Ul/21

* tf_rnn12_IMDB감성분류.ipynb

import numpy as np
import tensorflow as tf
from tensorflow.keras.datasets import imdb
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense, Embedding
from tensorflow.keras.preprocessing import sequence
from tensorflow.keras.utils import to_categorical
import matplotlib.pyplot as plt

np.random.seed(3)
tf.random.set_seed(3)
# print(imdb.load_data())
vocab_size = 10000
(x_train, y_train), (x_test, y_test) = imdb.load_data(num_words=vocab_size)
print(x_train.shape, y_train.shape, x_test.shape, y_test.shape) # (25000,) (25000,) (25000,) (25000,)

print(y_train[:3])                              # [1 0 0]
num_classes = max(y_train) + 1
print('num_classes :',num_classes)              # num_classes : 2
print(set(y_train), ' ', np.unique(y_train))    # {0, 1}   [0 1]

print(x_train[0])                               # [1, 14, 22, 16, 43, 530, 973, 1622, 1385, 65, 458, 4468, 66, 3941, ...
print(y_train[0])                               # 1

- 시각화 : 훈련용 리뷰 분포

len_result = [len(s) for s in x_train]
print('리뷰 최대 길이 :', np.max(len_result))  # 2494
print('리뷰 평균 길이 :', np.mean(len_result)) # 238.71364

plt.subplot(1, 2, 1)
plt.boxplot(len_result)
plt.subplot(1, 2, 2)
plt.hist(len_result, bins=50)
plt.show()

- 긍/부정 빈도수

unique_ele, counts_ele = np.unique(y_train, return_counts=True)
print(np.asarray((unique_ele, counts_ele)))
# [[    0     1]
#  [12500 12500]]

- index에 대한 단어 출력

word_to_index = imdb.get_word_index()
index_to_word = {}
for k, v in word_to_index.items():
    index_to_word[v] = k
print(index_to_word)        # {34701: 'fawn', 52006: 'tsukino', 52007: 'nunnery', 16816: 'sonja', ...
print(index_to_word[1])     # the
print(index_to_word[1408])  # woods

print(x_train[0])           # [1, 14, 22, 16, 43, 530, 973, 1622, ...
print(y_train[0])           # 1
print(' '.join([index_to_word[index] for index in x_train[0]]))     # the as you with out themselves powerful lets loves their ...

- LSTM으로 감성분류

from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.callbacks import EarlyStopping, ModelCheckpoint
from tensorflow.keras.models import load_model

max_len = 500
x_train = pad_sequences(x_train, maxlen=max_len)
x_test = pad_sequences(x_test, maxlen=max_len)
# print(x_train[0])

model = Sequential()
model.add(Embedding(vocab_size, 100))
model.add(LSTM(120, activation='tanh'))
model.add(Dense(1, activation='sigmoid'))

print(model.summary())      # Total params: 1,106,201

model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['acc'])
es = EarlyStopping(monitor='val_loss', mode='auto', patience=3, baseline=0.01)
ms = ModelCheckpoint('tfrmm12.h5', monitor='val_acc', mode='max', save_best_only = True)
model.fit(x_train, y_train, validation_data=(x_test, y_test), epochs=100, batch_size=64, verbose=2, callbacks=[es, ms])

loaded_model = load_model('tfrmm12.h5')
print('acc :',loaded_model.evaluate(x_test, y_test)[1])     # acc : 0.8718400001525879
print('loss :',loaded_model.evaluate(x_test, y_test)[0])    # loss : 0.3080214262008667

- CNN으로 텍스트 분류

from tensorflow.keras.layers import Conv1D, GlobalMaxPooling1D, MaxPooling1D, Dropout

model = Sequential()
model.add(Embedding(vocab_size, 256))
model.add(Conv1D(256, kernel_size=3, padding='valid', activation='relu', strides=1))
model.add(GlobalMaxPooling1D())
model.add(Dense(1, activation='sigmoid'))
print(model.summary())          # Total params: 2,757,121

model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['acc'])
es = EarlyStopping(monitor='val_loss', mode='auto', patience=3, baseline=0.01)
ms = ModelCheckpoint('tfrmm12_1.h5', monitor='val_acc', mode='max', save_best_only = True)
history = model.fit(x_train, y_train, validation_data=(x_test, y_test), epochs=100, batch_size=64, verbose=2, callbacks=[es, ms])

loaded_model = load_model('tfrmm12_1.h5')
print('acc :',loaded_model.evaluate(x_test, y_test)[1])     # acc : 0.8984400033950806
print('loss :',loaded_model.evaluate(x_test, y_test)[0])    # loss : 0.24771703779697418

- 시각화

vloss = history.history['val_loss']
loss = history.history['loss']
x_len = np.arange(len(loss))
plt.plot(x_len, vloss, marker='+', c='black', label='val_loss')
plt.plot(x_len, loss, marker='s', c='red', label='loss')
plt.legend()
plt.grid()
plt.show()

import re
def sentiment_predict(new_sentence):
  new_sentence = re.sub('[^0-9a-zA-Z ]', '', new_sentence).lower()

  # 정수 인코딩
  encoded = []
  for word in new_sentence.split():
    # 단어 집합의 크기를 10,000으로 제한.
    try :
      if word_to_index[word] <= 10000:
        encoded.append(word_to_index[word]+3)
      else:
        encoded.append(2)   # 10,000 이상의 숫자는 <unk> 토큰으로 취급.
    except KeyError:
      encoded.append(2)   # 단어 집합에 없는 단어는 <unk> 토큰으로 취급.

  pad_new = pad_sequences([encoded], maxlen = max_len) # 패딩
  
  # 예측하기
  score = float(loaded_model.predict(pad_new)) 
  if(score > 0.5):
    print("{:.2f}% 확률로 긍정!.".format(score * 100))
  else:
    print("{:.2f}% 확률로 부정!".format((1 - score) * 100))
# 99.57% 확률로 긍정!.
# 53.55% 확률로 긍정!.

# 긍/부정 분류 예측
#temp_str = "This movie was just way too overrated. The fighting was not professional and in slow motion."
temp_str = "This movie was a very touching movie."
sentiment_predict(temp_str)

temp_str = " I was lucky enough to be included in the group to see the advanced screening in Melbourne on the 15th of April, 2012. And, firstly, I need to say a big thank-you to Disney and Marvel Studios."
sentiment_predict(temp_str)

네이버 영화 리뷰 데이터를 이용해 분류 모델 작성

한국어 불용어, 토크나이징 툴

cafe.daum.net/flowlife/9A8Q/156

* tf_rnn13_naver감성분류.ipynb

#! pip install konlpy

import numpy as np
import pandas as pd
import matplotlib as plt
import re
from konlpy.tag import Okt
from tensorflow.keras.layers import Embedding, Dense, LSTM, Dropout
from tensorflow.keras.models import Sequential, load_model
from tensorflow.keras.callbacks import EarlyStopping, ModelCheckpoint
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

train_data = pd.read_table('https://raw.githubusercontent.com/pykwon/python/master/testdata_utf8/ratings_train.txt')
test_data = pd.read_table('https://raw.githubusercontent.com/pykwon/python/master/testdata_utf8/ratings_test.txt')
print(train_data[:3], len(train_data))  # 150000
print(test_data[:3], len(test_data))    # 50000
#          id                           document  label
# 0   9976970                아 더빙.. 진짜 짜증나네요 목소리      0
# 1   3819312  흠...포스터보고 초딩영화줄....오버연기조차 가볍지 않구나      1
# 2  10265843                  너무재밓었다그래서보는것을추천한다      0
print(train_data.columns)               # Index(['id', 'document', 'label'], dtype='object')
# imsi = train_data.sample(n=1000, random_state=123)
# print(imsi)

- 데이터 전처리

# 데이터 전처리
print(train_data['document'].nunique(), test_data['document'].nunique())    # 146182 49157 => 중복자료가 있다.

train_data.drop_duplicates(subset=['document'], inplace=True)               # 중복 제거
print(len(train_data['document']))  # 146183
print(set(train_data['label']))     # {0 부정, 1 긍정}

train_data['label'].value_counts().plot(kind='bar')
plt.show()

print(train_data.groupby('label').size())
# label
# 0    73342
# 1    72841

- Null 값 확인

print(train_data.isnull().values.any()) # True
print(train_data.isnull().sum())
# id          0
# document    1
# label       0
print(train_data.loc[train_data.document.isnull()])
#             id document  label
# 25857  2172111      NaN      1

train_data = train_data.dropna(how='any')
print(train_data.isnull().values.any())     # False
print(len(train_data))                      # 146182

- 순수 한글 관련 자료 이외의 구둣점 등은 제거

print(train_data[:3])
train_data['document'] = train_data['document'].str.replace("[^ㄱ-ㅎ ㅏ-ㅣ 가-힣]","")
print(train_data[:3])

train_data['document'].replace('', np.nan, inplace=True)
print(train_data.isnull().sum())
# id            0
# document    391
# label         0

train_data = train_data.dropna(how='any')
print(train_data.isnull().values.any()) # False
print(len(train_data))                  # 145791

# test
test_data.drop_duplicates(subset=['document'], inplace=True)               # 중복 제거
test_data['document'] = test_data['document'].str.replace("[^ㄱ-ㅎ ㅏ-ㅣ 가-힣]","")
test_data['document'].replace('', np.nan, inplace=True)
test_data = test_data.dropna(how='any')
print(test_data.isnull().values.any())  # False
print(len(test_data))                   # 48995

- 불용어 제거 & 형태소 분류

# 불용어 제거
stopwords = ['아','휴','아이구','아이쿠','아이고','어','나','우리','저희','따라','의해','을','를','에','의','가','으로','로','에게','뿐이다','의거하여']

# 형태소 분류
okt = Okt()
x_train = []
for sen in train_data['document']:
    imsi = []
    imsi = okt.morphs(sen, stem=True)   # stem=True : 어간 추출
    imsi = [word for word in imsi if not word in stopwords]
    x_train.append(imsi)

print(x_train[:3])
# [['더빙', '진짜', '짜증나다', '목소리'], ['흠', '포스터', '보고', '초딩', '영화', '줄', '오버', '연기', '조차', '가볍다', '않다'], ['너', '무재', '밓었', '다그', '래서', '보다', '추천', '한', '다']]

x_test = []
for sen in test_data['document']:
    imsi = []
    imsi = okt.morphs(sen, stem=True)   # stem=True : 어간 추출
    imsi = [word for word in imsi if not word in stopwords]
    x_test.append(imsi)

print(x_test[:3])
# [['굳다', 'ㅋ'], ['뭐', '야', '이', '평점', '들', '은', '나쁘다', '않다', '점', '짜다', '리', '는', '더', '더욱', '아니다'], ['지루하다', '않다', '완전', '막장', '임', '돈', '주다', '보기', '에는']]

- 워드 임베딩

tok = Tokenizer()
tok.fit_on_texts(x_train)
print(tok.word_index)
# {'이': 1, '영화': 2, '보다': 3, '하다': 4, '도': 5, '들': 6, '는': 7, '은': 8, '없다': 9, '이다': 10, '있다': 11, '좋다': 12, ...

- 등장 빈도수를 확인해서 비중이 적은 자료는 배제

threshold = 3
total_cnt = len(tok.word_index)
rare_cnt = 0
total_freq = 0
rare_freq = 0

for k, v in tok.word_counts.items():
    total_freq = total_freq + v
    if v < threshold:
        rare_cnt = rare_cnt + 1
        rare_freq = rare_freq + v

print('total_cnt :', total_cnt)                         # 43753
print('rare_cnt :', rare_cnt)                           # 24340
print('rare_freq :', (rare_cnt / total_cnt) * 100)      # 55.63047105341348
print('total_cnt :', (rare_freq / total_freq) * 100)    # 1.71278110414947
# 2회 이하인 단어 전체 비중 1.7%이므로 2회 이하인 단어들은 배제해도 문제가 없을 것 같다

- OOV(Out of Vocabulary) : 단어 사전에 없으면 index자체를 할 수 없게 되는데 이런 문제를 OOV

vocab_size = total_cnt - rare_cnt + 2
print('vocab_size 크기 :', vocab_size)                  # 19415

tok = Tokenizer(vocab_size, oov_token='OOV')
tok.fit_on_texts(x_train)
x_train = tok.texts_to_sequences(x_train)
x_test = tok.texts_to_sequences(x_test)
print(x_train[:3])
# [[462, 23, 268, 665], [953, 463, 47, 609, 3, 221, 1454, 31, 967, 682, 25], [393, 2447, 1, 2317, 5669, 4, 227, 17, 15]]

y_train = np.array(train_data['label'])
y_test = np.array(test_data['label'])

- 비어 있는 샘플은 제거

drop_train = [index for index, sen in enumerate(x_train) if len(sen) < 1]

x_train = np.delete(x_train, drop_train, axis = 0)
y_train = np.delete(y_train, drop_train, axis = 0)
print(len(x_train), ' ', len(y_train))

print('리뷰 최대 길이 :', max(len(i) for i in x_train))             # 75
print('리뷰 평균 길이 :', sum(map(len, x_train)) / len(x_train))    # 12.169516185172293

plt.hist([len(s) for s in x_train], bins = 50)
plt.show()

- 전체 샘플 중에서 길이가 max_len 이하인 샘플 비율이 몇 % 인지 확인 함수 작성

def below_threshold_len(max_len, nested_list):
    cnt = 0
    for s in nested_list:
        if len(s) < max_len:
            cnt = cnt + 1
    print('전체 샘플 중에서 길이가 %s 이하인 샘플 비율 : %s'%(max_len, (cnt / len(nested_list)) * 100 ))
    # 전체 샘플 중에서 길이가 30 이하인 샘플 비율 : 92.13574660633485

max_len = 30
below_threshold_len(max_len, x_train)   # 92% 정도가 30 이하의 길이를 가짐

x_train = pad_sequences(x_train, maxlen=max_len)
x_test = pad_sequences(x_test, maxlen=max_len)
print(x_train[:5])
# [[    0     0     0     0     0     0     0     0     0     0     0     0
#       0     0     0     0     0     0     0     0     0     0     0     0
#       0     0   462    23   268   665]
#  [    0     0     0     0     0     0     0     0     0     0     0     0
#       0     0     0     0     0     0     0   953   463    47   609     3
#     221  1454    31   967   682    25]
#  [    0     0     0     0     0     0     0     0     0     0     0     0
#       0     0     0     0     0     0     0     0     0   393  2447     1
#    2317  5669     4   227    17    15]
#  [    0     0     0     0     0     0     0     0     0     0     0     0
#       0     0     0     0     0     0     0     0     0  6492   112  8118
#     225    62     8    10    33  3604]
#  [    0     0     0     0     0     0     0     0     0     0     0     0
#       0  1029     1    36  9143    31   837     3  2579    27  1114   246
#       5 14239     1  1080   260   246]]

- 모델

model = Sequential()
model.add(Embedding(vocab_size, 100))
model.add(LSTM(128, activation='tanh'))
model.add(Dense(128, activation='relu'))
model.add(Dropout(0.25))
model.add(Dense(1, activation='sigmoid'))
print(model.summary())  # Total params: 2,075,389

model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['acc'])

es = EarlyStopping(monitor='val_loss', mode = 'min', verbose=1, patience=3)
mc = ModelCheckpoint('tfrnn13.h5', monitor='val_acc', mode = 'max', save_best_only=True)
history = model.fit(x_train, y_train, epochs=10, callbacks=[es, mc], batch_size=64, validation_split=0.2)

- 저장된 모델로 나머지 작업

load_model = load_model('tfrnn13.h5')
print('acc :', load_model.evaluate(x_test, y_test)[1])  # acc : 0.847739577293396
print('loss :', load_model.evaluate(x_test, y_test)[0]) # loss : 0.35434380173683167

- 예측

def new_pred(new_sentence):
    new_sentence = okt.morphs(new_sentence, stem = True)
    new_sentence = [word for word in new_sentence if not word in stopwords]
    encoded = tok.texts_to_sequences([new_sentence])
    pad_new = pad_sequences(encoded, maxlen=max_len)
    pred = float(load_model.predict(pad_new))
    if pred < 0.5:
        print('{:.2f}% 확률로 긍정'.format(pred * 100))
    else:
        print('{:.2f}% 확률로 부정'.format((1 - pred) * 100))

new_pred('영화가 재밌네요')
new_pred('심하다 지루하고 졸려')
new_pred('주인공이 너무 멋있어 추천하고 싶네요')
# 6.13% 확률로 부정
# 0.05% 확률로 긍정
# 0.87% 확률로 부정

Sequence-to-Sequence

- 시퀀스 투 시퀀스

wikidocs.net/24996

위키독스

온라인 책을 제작 공유하는 플랫폼 서비스

wikidocs.net

- 영어를 불어로 번역하는 작업

* tf_rnn14_s2s번역.ipynb

import pandas as pd
import urllib3
import zipfile
import shutil
import os
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.utils import to_categorical

...

Attension

RNN 관련

cafe.daum.net/flowlife/S2Ul/33

자연어처리모델.pdf

attention.pdf

RNN → LSTM → Seq2Seq → Transformer → GPT-1 → BERT → GPT-3

자연어 처리를 위한 작업 2

cafe.daum.net/flowlife/S2Ul/29

- 양방향 LSTM + 어텐션 메커니즘(BiLSTM with Attention Mechanism)

wikidocs.net/48920

- IMDB 리뷰 데이터로 감성 분류 : LSTM + Attension (Transformer 기반 기술)

* tf_rnn15_attention.ipynb

# IMDB 리뷰 데이터로 감성 분류 : LSTM + Attension (Transformer 기반 기술)
# 양방향 LSTM과 어텐션 메커니즘(BiLSTM with Attention mechanism)

from tensorflow.keras.datasets import imdb
from tensorflow.keras.utils import to_categorical
from tensorflow.keras.preprocessing.sequence import pad_sequences

vocab_size = 10000
(X_train, y_train), (X_test, y_test) = imdb.load_data(num_words = vocab_size)

print('리뷰의 최대 길이 : {}'.format(max(len(l) for l in X_train)))         # 리뷰의 최대 길이 : 2494
print('리뷰의 평균 길이 : {}'.format(sum(map(len, X_train))/len(X_train)))  # 리뷰의 평균 길이 : 238.71364

max_len = 500
X_train = pad_sequences(X_train, maxlen=max_len)
X_test = pad_sequences(X_test, maxlen=max_len)

...

'BACK END > Deep Learning' 카테고리의 다른 글

[딥러닝] GAN (0)	2021.04.12
[딥러닝] Tensorflow - 이미지 분류 (0)	2021.04.01
[딥러닝] Keras - Logistic (0)	2021.03.25
[딥러닝] Keras - Linear (0)	2021.03.23
[딥러닝] TensorFlow (0)	2021.03.22

[딥러닝] Tensorflow - 이미지 분류

2021. 4. 1. 13:32

Tensorflow - 이미지 분류

- ImageData

www.tensorflow.org/api_docs/python/tf/keras/preprocessing/image/ImageDataGenerator

tf.keras.preprocessing.image.ImageDataGenerator

Generate batches of tensor image data with real-time data augmentation.

www.tensorflow.org

- LeakyReLU

excelsior-cjh.tistory.com/177

05-1. 심층 신경망 학습 - 활성화 함수, 가중치 초기화

5-1. 심층 신경망 학습 - 활성화 함수, 가중치 초기화 저번 포스팅 04. 인공신경망에서 예제로 살펴본 신경망은 hidden layer가 2개인 얕은 DNN에 대해 다루었다. 하지만, 모델이 복잡해질수록 hidden layer

excelsior-cjh.tistory.com

CIRAR-10

: 10개의 레이블, 6만장의 칼라 이미지(5만장 - train, 1만장 - test)

- tf_cnn_cifar10.ipynb

#airplane, automobile, bird, cat, deer, dog, frog, horse, ship, truck
# DENSE 레이어로만 분류작업1

import numpy as np
import matplotlib.pyplot as plt
from tensorflow.keras.layers import Input, Flatten, Dense, Conv2D
from tensorflow.keras.models import Sequential, Model
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.utils import to_categorical
from tensorflow.keras.datasets import cifar10

NUM_CLASSES = 10

(x_train, y_train), (x_test, y_test) = cifar10.load_data()

print('train data')
print(x_train.shape)     # (50000, 32, 32, 3)
print(x_train.shape[0])
print(x_train.shape[3])

print('test data')
print(x_test.shape)     # (10000, 32, 32, 3)

print(x_train[0])       # [[[ 59  62  63] ...
print(y_train[0])       # [6] frog

plt.figure(figsize=(12, 4))
plt.subplot(131)
plt.imshow(x_train[0], interpolation='bicubic')
plt.subplot(132)
plt.imshow(x_train[1], interpolation='bicubic')
plt.subplot(133)
plt.imshow(x_train[2], interpolation='bicubic')

x_train = x_train.astype('float32') / 255.0
x_test = x_test.astype('float32') / 255.0

y_train = to_categorical(y_train, NUM_CLASSES)
y_test = to_categorical(y_test, NUM_CLASSES)
print(x_train[54, 12, 13, 1]) # 0.36862746
print(x_train[1,12,13,2])  # 0.59607846

- 방법 1 Sequential API 사용(CNN 사용 X)

model = Sequential([
        Dense(512, input_shape=(32, 32, 3), activation='relu'),
        Flatten(),
        Dense(128, activation='relu'),
        Dense(NUM_CLASSES, activation='softmax')
])
print(model.summary()) # Total params: 67,112,330

- 방법 2 function API 사용(CNN 사용 X)

input_layer = Input((32, 32, 3))
x = Flatten()(input_layer)
x = Dense(512, activation='relu')(x)
x = Dense(128, activation='relu')(x)
output_layer = Dense(NUM_CLASSES, activation='softmax')(x)

model = Model(input_layer, output_layer)
print(model.summary()) # Total params: 1,640,330

- train

opt = Adam(lr=0.01)
model.compile(loss='categorical_crossentropy', optimizer=opt, metrics=['accuracy'])
model.fit(x_train, y_train, batch_size=128, epochs=10, shuffle=True, verbose=2)
print('acc : %.4f'%(model.evaluate(x_test, y_test, batch_size=128)[1]))  # acc : 0.1000
print('loss : %.4f'%(model.evaluate(x_test, y_test, batch_size=128)[0])) # loss : 2.3030

CLASSES = np.array(['airplane', 'automobile', 'bird', 'cat', 'deer', 'dog', 'frog', 'horse', 'ship', 'truck'])

pred = model.predict(x_test[:10])
pred_single = CLASSES[np.argmax(pred, axis = -1)]
actual_single = CLASSES[np.argmax(y_test[:10], axis = -1)]
print('예측값 :', pred_single)
# 예측값 : ['frog' 'frog' 'frog' 'frog' 'frog' 'frog' 'frog' 'frog' 'frog' 'frog']
print('실제값 :', actual_single)
# 실제값 : ['cat' 'ship' 'ship' 'airplane' 'frog' 'frog' 'automobile' 'frog' 'cat' 'automobile']
print('분류 실패 수 :', (pred_single != actual_single).sum())
# 분류 실패 수 : 7

- 시각화

fig = plt.figure(figsize=(15, 3))
fig.subplots_adjust(hspace = 0.4, wspace = 0.4)

for i, idx in enumerate(range(len(x_test[:10]))):
    img = x_test[idx]
    ax = fig.add_subplot(1, len(x_test[:10]), i+1)
    ax.axis('off')
    ax.text(0.5, -0.35, 'pred=' + str(pred_single[idx]),\
            fontsize=10, ha = 'center', transform = ax.transAxes)
    ax.text(0.5, -0.7, 'actual=' + str(actual_single[idx]),\
            fontsize=10, ha = 'center', transform = ax.transAxes)
    ax.imshow(img)

plt.show()

- CNN + DENSE 레이어로만 분류작업2

import numpy as np
import matplotlib.pyplot as plt
from tensorflow.keras.layers import Input, Flatten, Dense, Conv2D, Activation, BatchNormalization, ReLU, LeakyReLU, MaxPool2D
from tensorflow.keras.models import Sequential, Model
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.utils import to_categorical
from tensorflow.keras.datasets import cifar10

NUM_CLASSES = 10

(x_train, y_train), (x_test, y_test) = cifar10.load_data()

x_train = x_train.astype('float32') / 255.0
x_test = x_test.astype('float32') / 255.0

y_train = to_categorical(y_train, NUM_CLASSES)
y_test = to_categorical(y_test, NUM_CLASSES)

- function API : CNN + DENSE

input_layer = Input(shape=(32,32,3))
conv_layer1 = Conv2D(filters=64, kernel_size=3, strides=2, padding='same')(input_layer)
conv_layer2 = Conv2D(filters=64, kernel_size=3, strides=2, padding='same')(conv_layer1)

flatten_layer = Flatten()(conv_layer2)

output_layer = Dense(units=10, activation='softmax')(flatten_layer)
model = Model(input_layer,  output_layer)
print(model.summary()) # Total params: 79,690

input_layer = Input(shape=(32,32,3))
x = Conv2D(filters=64, kernel_size=3, strides=2, padding='same')(input_layer)
x = MaxPool2D(pool_size=(2,2))(x)
#x = ReLU(x)
x = BatchNormalization()(x)
x = LeakyReLU()(x)

x = Conv2D(filters=64, kernel_size=3, strides=2, padding='same')(x)
x = MaxPool2D(pool_size=(2,2))(x)
x = BatchNormalization()(x)
x = LeakyReLU()(x)

x = Flatten()(x)

x = Dense(512)(x)
x = BatchNormalization()(x)
x = LeakyReLU()(x)

x = Dense(128)(x)
x = BatchNormalization()(x)
x = LeakyReLU()(x)

x = Dense(NUM_CLASSES)(x)
output_layer = Activation('softmax')(x)

model = Model(input_layer, output_layer)

- train

opt = Adam(lr=0.01)
model.compile(loss='categorical_crossentropy', optimizer=opt, metrics=['accuracy'])
model.fit(x_train, y_train, batch_size=128, epochs=10, shuffle=True, verbose=2)
print('acc : %.4f'%(model.evaluate(x_test, y_test, batch_size=128)[1]))  # acc : 0.5986
print('loss : %.4f'%(model.evaluate(x_test, y_test, batch_size=128)[0])) # loss : 1.3376

Tensor : image process, CNN

cafe.daum.net/flowlife/S2Ul/3

Daum 카페

cafe.daum.net

CNN을 이용하여Tensor : image process, CNN 고차원적인 이미지 분류

https://wiserloner.tistory.com/1046?category=837669

텐서플로 2.0 공홈 탐방 (cat and dog image classification)

- 이번에는 CNN을 이용하여 조금 더 고차원적인 이미지 분류를 해보겠습니다. - 과연 머신이 영상을 보고 이것이 개인지 고양이인지를 분류해낼수 있을까요? 딥러닝, 그중에 CNN을 사용하면 놀랍

wiserloner.tistory.com

- tf_cnn_dogcat.ipynb

1. 라이브러리 임포트

import tensorflow as tf

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Conv2D, Flatten, Dropout, MaxPooling2D
from tensorflow.keras.preprocessing.image import ImageDataGenerator

import os
import numpy as np
import matplotlib.pyplot as plt

2. 데이터 다운로드

_URL = 'https://storage.googleapis.com/mledu-datasets/cats_and_dogs_filtered.zip'
path_to_zip = tf.keras.utils.get_file('cats_and_dogs.zip', origin=_URL, extract=True)
PATH = os.path.join(os.path.dirname(path_to_zip), 'cats_and_dogs_filtered')

batch_size = 128
epochs = 15
IMG_HEIGHT = 150
IMG_WIDTH = 150

3. 데이터 준비

train_dir = os.path.join(PATH, 'train')
validation_dir = os.path.join(PATH, 'validation')

train_cats_dir = os.path.join(train_dir, 'cats')  # directory with our training cat pictures
train_dogs_dir = os.path.join(train_dir, 'dogs')  # directory with our training dog pictures
validation_cats_dir = os.path.join(validation_dir, 'cats')  # directory with our validation cat pictures
validation_dogs_dir = os.path.join(validation_dir, 'dogs')  # directory with our validation dog pictures

- 이미지를 확인

num_cats_tr = len(os.listdir(train_cats_dir))
num_dogs_tr = len(os.listdir(train_dogs_dir))
# num_cats_te = len(os.listdir(test_cats_dir))
# num_dogs_te = len(os.listdir(test_dogs_dir))

num_cats_val = len(os.listdir(validation_cats_dir))
num_dogs_val = len(os.listdir(validation_dogs_dir))

total_train = num_cats_tr + num_dogs_tr
total_val = num_cats_val + num_dogs_val
# total_te = num_cats_te + num_dogs_te

print('total training cat images:', num_cats_tr)
print('total training dog images:', num_dogs_tr)
# print('total test dog images:', total_te)
# total training cat images: 1000
# total training dog images: 1000

print('total validation cat images:', num_cats_val)
print('total validation dog images:', num_dogs_val)
# total validation cat images: 500
# total validation dog images: 500
print("--")
print("Total training images:", total_train)
print("Total validation images:", total_val)
# Total training images: 2000
# Total validation images: 1000

- ImageDataGenerator

train_image_generator = ImageDataGenerator(rescale=1./255) # Generator for our training data
validation_image_generator = ImageDataGenerator(rescale=1./255) # Generator for our validation data

train_data_gen = train_image_generator.flow_from_directory(batch_size=batch_size,
                                                           directory=train_dir,
                                                           shuffle=True,
                                                           target_size=(IMG_HEIGHT, IMG_WIDTH),
                                                           class_mode='binary')
val_data_gen = validation_image_generator.flow_from_directory(batch_size=batch_size,
                                                              directory=validation_dir,
                                                              target_size=(IMG_HEIGHT, IMG_WIDTH),
                                                              class_mode='binary')

4. 데이터 확인

sample_training_images, _ = next(train_data_gen)

# This function will plot images in the form of a grid with 1 row and 5 columns where images are placed in each column.
def plotImages(images_arr):
    fig, axes = plt.subplots(1, 5, figsize=(20,20))
    axes = axes.flatten()
    for img, ax in zip( images_arr, axes):
        ax.imshow(img)
        ax.axis('off')
    plt.tight_layout()
    plt.show()
    
plotImages(sample_training_images[:5])

5. 모델 생성

model = Sequential([
    Conv2D(16, 3, padding='same', activation='relu', input_shape=(IMG_HEIGHT, IMG_WIDTH ,3)),
    MaxPooling2D(),
    Conv2D(32, 3, padding='same', activation='relu'),
    MaxPooling2D(),
    Conv2D(64, 3, padding='same', activation='relu'),
    MaxPooling2D(),
    Flatten(),
    Dense(512, activation='relu'),
    # Dense(1)
    Dense(1, activation='sigmoid')
])

6. 모델 컴파일

model.compile(optimizer='adam',
              loss=tf.keras.losses.BinaryCrossentropy(from_logits=True),
              metrics=['accuracy'])

7. 모델 확인

model.summary() # Total params: 10,641,441

8. 학습

history = model.fit_generator(
    train_data_gen,
    steps_per_epoch=total_train // batch_size,
    epochs=epochs,
    validation_data=val_data_gen,
    validation_steps=total_val // batch_size
)

9. 학습 결과 시각화

acc = history.history['accuracy']
val_acc = history.history['val_accuracy']

loss=history.history['loss']
val_loss=history.history['val_loss']

epochs_range = range(epochs)

plt.figure(figsize=(8, 8))
plt.subplot(1, 2, 1)
plt.plot(epochs_range, acc, label='Training Accuracy')
plt.plot(epochs_range, val_acc, label='Validation Accuracy')
plt.legend(loc='lower right')
plt.title('Training and Validation Accuracy')

plt.subplot(1, 2, 2)
plt.plot(epochs_range, loss, label='Training Loss')
plt.plot(epochs_range, val_loss, label='Validation Loss')
plt.legend(loc='upper right')
plt.title('Training and Validation Loss')
plt.show()

오버피팅 처리 오버피팅 처리

image_gen = ImageDataGenerator(rescale=1./255, horizontal_flip=True)
train_data_gen = image_gen.flow_from_directory(batch_size=batch_size,
                                               directory=train_dir,shuffle=True,
                                               target_size=(IMG_HEIGHT, IMG_WIDTH))
augmented_images = [train_data_gen[0][0][0] for i in range(5)]

# Re-use the same custom plotting f
image_gen = ImageDataGenerator(rescale=1./255, horizontal_flip=True)
train_data_gen = image_gen.flow_from_directory(batch_size=batch_size,
                                               directory=train_dir,
                                               shuffle=True,
                                               target_size=(IMG_HEIGHT, IMG_WIDTH))
                                               
augmented_images = [train_data_gen[0][0][0] for i in range(5)]

# Re-use the same custom plotting function defined and used
# above to visualize the training images
plotImages(augmented_images)

전부 적용

image_gen_train = ImageDataGenerator(
                    rescale=1./255,
                    rotation_range=45,
                    width_shift_range=.15,
                    height_shift_range=.15,
                    horizontal_flip=True,
                    zoom_range=0.5
                    )
                    
train_data_gen = image_gen_train.flow_from_directory(batch_size=batch_size,
                                                     directory=train_dir,
                                                     shuffle=True,
                                                     target_size=(IMG_HEIGHT, IMG_WIDTH),
                                                     class_mode='binary')
                                                     
augmented_images = [train_data_gen[0][0][0] for i in range(5)]
plotImages(augmented_images)

image_gen_val = ImageDataGenerator(rescale=1./255)

val_data_gen = image_gen_val.flow_from_directory(batch_size=batch_size,
                                                 directory=validation_dir,
                                                 target_size=(IMG_HEIGHT, IMG_WIDTH),
                                                 class_mode='binary')

model_new = Sequential([
    Conv2D(16, 3, padding='same', activation='relu', 
           input_shape=(IMG_HEIGHT, IMG_WIDTH ,3)),
    MaxPooling2D(),
    Dropout(0.2),
    Conv2D(32, 3, padding='same', activation='relu'),
    MaxPooling2D(),
    Conv2D(64, 3, padding='same', activation='relu'),
    MaxPooling2D(),
    Dropout(0.2),
    Flatten(),
    Dense(512, activation='relu'),
    Dense(1)
])

model_new.compile(optimizer='adam',
                  loss=tf.keras.losses.BinaryCrossentropy(from_logits=True),
                  metrics=['accuracy'])

model_new.summary() # Total params: 10,641,441

11. 학습 및 확인

history = model_new.fit_generator(
    train_data_gen,
    steps_per_epoch=total_train // batch_size,
    epochs=epochs,
    validation_data=val_data_gen,
    validation_steps=total_val // batch_size
)
acc = history.history['accuracy']
val_acc = history.history['val_accuracy']

loss = history.history['loss']
val_loss = history.history['val_loss']

epochs_range = range(epochs)

plt.figure(figsize=(8, 8))
plt.subplot(1, 2, 1)
plt.plot(epochs_range, acc, label='Training Accuracy')
plt.plot(epochs_range, val_acc, label='Validation Accuracy')
plt.legend(loc='lower right')
plt.title('Training and Validation Accuracy')

plt.subplot(1, 2, 2)
plt.plot(epochs_range, loss, label='Training Loss')
plt.plot(epochs_range, val_loss, label='Validation Loss')
plt.legend(loc='upper right')
plt.title('Training and Validation Loss')
plt.show()

Transfer Learning(전이 학습)

: 부족한 데이터로 모델 생성시 성능이 약한 모델이 생성된다. 이에 전문회사에서 제공하는 모델을 사용하여 성능을 높인다.(모델을 라이브러리 처럼 사용)

: 미리 학습된 모델을 사용하여 내가 분류하고자 하는 데이터를 이용해 약간의 학습으로 성능 좋은 이미지 분류 모델을 얻을 수 있다.

이미지 분류 모형

cafe.daum.net/flowlife/S2Ul/31

Daum 카페

cafe.daum.net

Transfer Learning

cafe.daum.net/flowlife/S2Ul/32

Daum 카페

cafe.daum.net

* tf_cnn_trans_learn.ipynb

! ls -al
! pip install tensorflow-datasets
import os
import numpy as np
import matplotlib.pyplot as plt
import tensorflow as tf
import tensorflow_datasets as tfds

tfds.disable_progress_bar()

(raw_train, raw_validation, raw_test), metadata = tfds.load('cats_vs_dogs',
                            split = ['train[:80%]', 'train[80%:90%]', 'train[90%:]'], with_info=True, as_supervised=True)

print(raw_train)
print(raw_validation)
print(raw_test)

print(metadata)

<PrefetchDataset shapes: ((None, None, 3), ()), types: (tf.uint8, tf.int64)>
<PrefetchDataset shapes: ((None, None, 3), ()), types: (tf.uint8, tf.int64)>
<PrefetchDataset shapes: ((None, None, 3), ()), types: (tf.uint8, tf.int64)>
tfds.core.DatasetInfo(
    name='cats_vs_dogs',
    version=4.0.0,
    description='A large set of images of cats and dogs.There are 1738 corrupted images that are dropped.',
    homepage='https://www.microsoft.com/en-us/download/details.aspx?id=54765',
    features=FeaturesDict({
        'image': Image(shape=(None, None, 3), dtype=tf.uint8),
        'image/filename': Text(shape=(), dtype=tf.string),
        'label': ClassLabel(shape=(), dtype=tf.int64, num_classes=2),
    }),
    total_num_examples=23262,
    splits={
        'train': 23262,
    },
    supervised_keys=('image', 'label'),
    citation="""@Inproceedings (Conference){asirra-a-captcha-that-exploits-interest-aligned-manual-image-categorization,
    author = {Elson, Jeremy and Douceur, John (JD) and Howell, Jon and Saul, Jared},
    title = {Asirra: A CAPTCHA that Exploits Interest-Aligned Manual Image Categorization},
    booktitle = {Proceedings of 14th ACM Conference on Computer and Communications Security (CCS)},
    year = {2007},
    month = {October},
    publisher = {Association for Computing Machinery, Inc.},
    url = {https://www.microsoft.com/en-us/research/publication/asirra-a-captcha-that-exploits-interest-aligned-manual-image-categorization/},
    edition = {Proceedings of 14th ACM Conference on Computer and Communications Security (CCS)},
    }""",
    redistribution_info=,
)

get_label_name = metadata.features['label'].int2str
print(get_label_name)

for image, label in raw_train.take(2):
    plt.figure()
    plt.imshow(image)
    plt.title(get_label_name(label))
    plt.show()

IMG_SIZE = 160   # All images will be resized to 160 by160

def format_example(image, label):
    image = tf.cast(image, tf.float32)
    image = (image/127.5) - 1
    image = tf.image.resize(image, (IMG_SIZE, IMG_SIZE))
    return image, label

train = raw_train.map(format_example)
validation = raw_validation.map(format_example)
test = raw_test.map(format_example)

# 4. 이미지 셔플링 배칭
BATCH_SIZE = 32
SHUFFLE_BUFFER_SIZE = 1000

train_batches = train.shuffle(SHUFFLE_BUFFER_SIZE).batch(BATCH_SIZE)
validation_batches = validation.batch(BATCH_SIZE)
test_batches = test.batch(BATCH_SIZE)
# 학습 데이터는 임의로 셔플하고 배치 크기를 정하여 배치로 나누어준다.

for image_batch, label_batch in train_batches.take(1):
    pass

print(image_batch.shape)    # [32, 160, 160, 3]

# 5. 베이스 모델 생성 : 전이학습에서 사용할 베이스 모델은 Google에서 개발한 MobileNet V2 모델 사용.
IMG_SHAPE = (IMG_SIZE, IMG_SIZE, 3)

# Create the base model from the pre-trained model MobileNet V2
base_model = tf.keras.applications.MobileNetV2(input_shape=IMG_SHAPE, include_top=False, weights='imagenet')

feature_batch = base_model(image_batch)
print(feature_batch.shape)   # (32, 5, 5, 1280)

# include_top=False : 입력층 -> CNN 계층 -> 특징 추출 -> 완전 연결층

- 계층 동결

base_model.trainable = False # MobileNet V2 학습 정지
print(base_model.summary()) # Total params: 2,257,984

- 전이 학습을 위한 모델 생성

global_average_layer = tf.keras.layers.GlobalAveragePooling2D() # 급격히 feature의 수를 줄여주는 역할
feature_batch_average = global_average_layer(feature_batch)
print(feature_batch_average) # (32, 1280)

prediction_layer = tf.keras.layers.Dense(1)
prediction_batch = prediction_layer(feature_batch_average)
print(prediction_batch)      # (32, 1)

model = tf.keras.Sequential([
        base_model,
        global_average_layer,
        prediction_layer
])

base_learning_rate = 0.0001
model.compile(optimizer=tf.keras.optimizers.RMSprop(lr=base_learning_rate),\
              loss = tf.keras.losses.BinaryCrossentropy(from_logits=True), metrics=['accuracy'])
print(model.summary())
'''
Layer (type)                 Output Shape              Param #   
=================================================================
mobilenetv2_1.00_160 (Functi (None, 5, 5, 1280)        2257984   
_________________________________________________________________
global_average_pooling2d_3 ( (None, 1280)              0         
_________________________________________________________________
dense_2 (Dense)              (None, 1)                 1281      
=================================================================
Total params: 2,259,265
'''

- 현재 모델 확인

validation_steps = 20
loss0, accuracy0 = model.evaluate(validation_batches, steps=validation_steps)
print('initial loss : {:.2f}'.format(loss0))    # initial loss : 0.92
print('initial acc : {:.2f}'.format(accuracy0)) # initial acc : 0.35

- 모델 학습

initial_epochs = 5 # 10
history = model.fit(train_batches, epochs=initial_epochs, validation_data =validation_batches)

- 학습 시각화

acc = history.history['accuracy']
val_acc = history.history['val_accuracy']
loss = history.history['loss']
val_loss = history.history['val_loss']

plt.figure(figsize=(8, 8))
plt.subplot(2,1,1)
plt.plot(acc, label ='Train accuracy')
plt.plot(val_acc, label ='Validation accuracy')
plt.legend(loc='lower right')
plt.ylabel('Accuracy')
plt.ylim([min(plt.ylim()), 1])
plt.title('Training and Validation Accuracy')

plt.subplot(2,1,2)
plt.plot(loss, label ='Train losss')
plt.plot(val_loss, label ='Validation loss')
plt.legend(loc='upper right')
plt.ylabel('Cross entropy')
plt.ylim([0, 1.0])
plt.title('Training and Validation Loss')
plt.xlabel('epochs')
plt.show()

전이 학습 파이 튜닝 : 미리 학습된 ConvNet의 마지막 FC Layer만 변경해 분류 실행

이전 학습의 모바일넷을 동경시키고 새로 추가한 레이어만 학습 (베이스 모델의 후방 레이어 일부만 다시 학습)

먼저 베이스 모델을 동결한 후 학습 진행 -> 학습이 끝나면 동결 해제

base_model.trainable = True
print('베이스 모델의 레이어 :', len(base_model.layers)) # 베이스 모델의 레이어 : 154

fine_tune_at = 100

for layer in base_model.layers[:fine_tune_at]:
    layer.trainable = False

model.compile(loss = tf.keras.losses.BinaryCrossentropy(from_logits= True),\
              optimizer = tf.keras.optimizers.RMSprop(lr=base_learning_rate / 10), metrics=['accuracy'])
print(model.summary()) # Total params: 2,259,265

# 파일 튜인 학습
fine_tune_epochs = 2
initial_epochs = 5
total_epochs = initial_epochs + fine_tune_epochs
history_fine = model.fit(train_batches, epochs = total_epochs, initial_epoch=history.epoch[-1],\
                         validation_data = validation_batches)

- 시각화

print(history_fine.history)
acc += history_fine.history['accuracy']
val_acc += history_fine.history['val_accuracy']
loss += history_fine.history['loss']
val_loss += history_fine.history['val_loss']

plt.figure(figsize=(8, 8))
plt.subplot(2,1,1)
plt.plot(acc, label ='Train accuracy')
plt.plot(val_acc, label ='Validation accuracy')
plt.legend(loc='lower right')
plt.plot([initial_epochs -1, initial_epochs -1], plt.ylim(), label='Start fine tuning')
plt.ylabel('Accuracy')
plt.ylim([0.8, 1])
plt.title('Training and Validation Accuracy')

plt.subplot(2,1,2)
plt.plot(loss, label ='Train losss')
plt.plot(val_loss, label ='Validation loss')
plt.legend(loc='upper right')
plt.plot([initial_epochs -1, initial_epochs -1], plt.ylim(), label='Start fine tuning')
plt.ylabel('Cross entropy')
plt.ylim([0, 1.0])
plt.title('Training and Validation Loss')
plt.xlabel('epochs')
plt.show()

ANN, RNN(LSTM, GRU)

cafe.daum.net/flowlife/S2Ul/12

Daum 카페

cafe.daum.net

RNN

m.blog.naver.com/PostView.nhn?blogId=magnking&logNo=221311273459&proxyReferer=https:%2F%2Fwww.google.com%2F

[AI] RNN, LSTM이란?

RNN(Recurrent Neural Networks)은 다른 신경망과 어떻게 다른가?RNN은 이름에서 알 수 있는 것처...

blog.naver.com

RNN (순환신경망)

: 시계열 데이터 처리 - 자연어, 번역, 이미지 캡션, 채팅, 주식 ...

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import SimpleRNN, LSTM

SimpleRNN(3, input_shape =) :

LSTM(3, input_shape =) :

model = Sequential()
model.add(SimpleRNN(3, input_shape = (2, 10)))             # Total params: 42
model.add(SimpleRNN(3, input_length = 2, input_dim = 10))
model.add(LSTM(3, input_shape = (2, 10)))                   # Total params: 168

print(model.summary())

model = Sequential()
#model.add(SimpleRNN(3, batch_input_shape = (8, 2, 10))) # batch_size : 8, sequence : 2, 입력수 : 10, 출력 수 : 3
# Total params: 42

model.add(LSTM(3, batch_input_shape = (8, 2, 10)))  # Total params: 168

print(model.summary())

model = Sequential()
#model.add(SimpleRNN(3, batch_input_shape = (8, 2, 10), return_sequences=True))
model.add(LSTM(3, batch_input_shape = (8, 2, 10), return_sequences=True))
print(model.summary())

- SimpleRNN

www.tensorflow.org/api_docs/python/tf/keras/layers/SimpleRNN

tf.keras.layers.SimpleRNN | TensorFlow Core v2.4.1

Fully-connected RNN where the output is to be fed back to input.

www.tensorflow.org

- LSTM

www.tensorflow.org/api_docs/python/tf/keras/layers/LSTM

tf.keras.layers.LSTM | TensorFlow Core v2.4.1

Long Short-Term Memory layer - Hochreiter 1997.

www.tensorflow.org

'BACK END > Deep Learning' 카테고리의 다른 글

[딥러닝] GAN (0)	2021.04.12
[딥러닝] RNN, NLP (0)	2021.04.05
[딥러닝] Keras - Logistic (0)	2021.03.25
[딥러닝] Keras - Linear (0)	2021.03.23
[딥러닝] TensorFlow (0)	2021.03.22

[LINUX] 리눅스 - 명령어 /eclipse /FlashPlayer /DB /apache /R /anaconda /hadoop

2021. 3. 26. 18:17

리눅스

- 리눅스 기본 명령어

cafe.daum.net/flowlife/9A8Q/161

리눅스 기본 명령어

cafe.daum.net

- 리눅스 기본 편집기 vi/vim 명령어

inpages.tistory.com/124

- 터미널 명령어 실행

pwd						: 사용자의 경로
ls						: 디렉토리
-l						: 자세히 보기
-a						: 숨김파일 확인
-al						: 숨김파일 자세히 보기
~						: 현재 사용자의 경로
/						: 리눅스의 경로
vi aaa					: 파일 생성
esc						: 입력어 대기화면
i, a					: append
shift + space			: 한영
:q!						: 미저장 종료
:wq						: 저장 후 종료
vi bbb.txt				: 파일 생성
리눅스에선 확장자 X
vi .ccc					: 숨김 파일 생성
.						: 숨김 파일
yy		  				: 복사
p 		  				: 붙여넣기
dd        				: 한줄 지우기
숫자+명령어   				: 명령어 여러번 반복
:set nu   				: 줄 번호 넣기
:set nonu 				: 줄 번호 제거
whoami    				: id 확인
s /home					: id 확인
mkdir kbs 				: 디렉토리 생성
mkdir -p  mbc/sbs 		: 상위 디렉토리 자동 생성
cd 디렉토리명 				: 디렉토리 변경
cd ~  					: home으로
cd .. 					: 상위 디렉토리로 이동
rmdir 디렉토리명   			: 디렉토리 삭제
rm 파일명         			: 파일 삭제
rm -rf kbs      		: 파일 있어도 삭제 가능
touch 파일명      			: 빈 파일 생성
cat 파일명        			: 파일 내용 확인
cat 파일명1 파일명2 			: 파일 내용확인
cp 파일명 디렉토리명/ 			: 파일 복사
cp 파일명 디렉토리명/새파일명 	: rename 파일 복사
mv 파일명 디렉토리        	: 파일 이동
rename 파일명 happy 새파일명 : 파일명 변경
mv 파일명 새파일명 			: 파일명 변경
head -3 파일명 			: 앞에 3줄 확인
tail -3 파일명 			: 뒤에 3줄 확인 
more +10 파일명 			: 
grep "검색어" 파일명 		: 검색
grep "[^A-Z]" 파일명* 		: 해당 파일명을 가진 모든 파일에서 대문자가 아닌 값 검색
whereis java 			: 파일 설치 모든 경로
whichis java 			: 파일 설치 경로
ifconfig				: ip 확인
su - root				: super user 접속
su - hadoop				: hadoop 일반 계정에 접속
useradd tom				: tom 일반 계정 만들기
passwd tom				: tom 일반 계정에 암호설정
exit					: 해당 계정 logout
userdel james			: 계정 삭제(logout상태에서)
rm -rf james			: 잔여 계정목록 삭제
-rwxrw-r--				: user, group, others
chmod u-w test.txt		: 현재 사용자에게 쓰기 권한 제거
chmod u-r test.txt		: 현재 사용자에게 읽기 권한 제거
chmod u+rwx test. extxt	: 현재 사용자에게 읽기/쓰기/excute 권한 추가
./test.txt				: 명령어 실행
chmod 777 test.txt		: 모든 권한 다 주기
chmod 111 test.txt		: excute 권한 다 주기
ls -l					: 목록 확인
rpm -qa					: 설치 목록확인
rpm -qa java*			: java 설치 경로
yum install gimp		: 설치
su -c 'yum install gimp': 설치
su -c 'yum remove gimp'	: 제거

- gimp 설치

yum list gi*			: 특정단어가 들어간 리스트
which gimp				: gimp 설치 여부 확인
ls -a /usr/bin/gimp
yum info gimp			: 패키지 정보
gimp					: 실행
wget https://t1.daumcdn.net/daumtop_chanel/op/20200723055344399.png : 웹상에서 다운로드
wget https://archive.apache.org/dist/httpd/Announcement1.3.txt		: 웹상에서 다운로드

- 파일 압축

vi aa
gzip aa					: 압축
gzip -d aa.gz			: 압축해제
bzip2 aa				: 압축
bzip2 -d aa.bz2			: 압축해제
tar cvf	my.tar aa bb	: 파일 묶기
tar xvf my.tar			: 압축 해제
java -version			: 자바 버전 확인
su -					: 관리자 접속
yum update				: 패키지 업데이트
rpm -qa | grep java*	: rpm 사용하여 파일 설치
java					: 자바 실행
javac					: javac 실행
yum remove java-1.8.0-openjdk-headless.x86_64	: 삭제
yum -y install java-11-openjdk-devel			: 패키지 설치
rpm -qa | java			: 설치 확인

- 이클립스 다운로드

mkdir work				: 디렉토리 생성
cd work
eclipse.org - download - 탐색기 - 다운로드 - 복사 - work 붙여넣기
tar xvfz eclipse tab키	: 압축해제
cd eclipse/
./eclipse
work -> jsou
open spective - java
Gerneral - Workspace - utf-8
Web  - utf-8
file - new - java project - pro1 / javaSE-11 - Don't create
new - class - test / main
sysout alt /
sum inss

- FlashPlayer 설치

get.adobe.com/kr/flashplayer 접속

wget ftp://ftp.pbone.net/mirror/www.mde.djura.org/2007.0/RPMS/FlashPlayer-9.0.31.0-1mde2007.0.i586.rpm

rm -f FlashPlayer-9.0.31.0-1mde2007.0.i586.rpm

chmod 777 FlashPlayer-9.0.31.0-1mde2007.0.i586.rpm

rpm -Uvh --nodeps FlashPlayer*.rpm

- putty

putty -> ifconfig의 ip -> open -> 예

- centos rpm

www.leafcats.com/171

- CentOS에 MariaDB 설치

cafe.daum.net/flowlife/HqLk/81

- MySql, MariaDB

cafe.daum.net/flowlife/HqLk/63 에서 다운로드
Build path - Configure Buil - Librarie - classpath - AddExternel JARs - apply

- apache server 다운로드

apache.org 접속
맨아래 Tomcat 클릭
Download / Tomcat 10 클릭
tar.gz (pgp, sha512) 다운로드

터미널 창 접속
ls /~다운로드
cd work
mv ~/다운로드.apach ~~ .gz ./
tar xvfz apache-tomcat-10.0.5.tar.gz
cd apache-tomcat-10.0.5/
cd bin/
pwd
./startup.sh							: 서버 실행
http://localhost:8080/ 접속
./shutdown.sh							: 서버 종료
./catalina.sh run						: 개발자용 서버 실행
http://localhost:8080/ 접속
ctrl + c								: 개발자용 서버 종료
cd conf
ls se*
vi server.xml
cd ~
pwd

eclipse 접속
Dynimic web project 생성
이름 : webpro / Targert runtime -> apache 10 및 work - apache 폴더 연결
webapp에 abc.html 생성.
window - web browser - firefox
내용 작성 후 서버 실행.

- R, R Studio Server 설치 및 기타

cafe.daum.net/flowlife/RlkF/11

- R 설치

su - root
yum install epel-release
yum install dnf-plugins-core
yum config-manager --set-enabled powertools
yum install R
R

a <- 10
a
df <- data.frame(value=rnorm(1000, 1, 1))
head(df)
head(df, 3)
q()

- R studio 설치

wget https://download1.rstudio.org/desktop/fedora28/x86_64/rstudio-1.2.5033-x86_64.rpm
rpm -ivh rstudio-1.2.5033-x86_64.rpm
exit
rstudio

install.packages("ggplot2")
library(ggplot2)
a <- 10
print(a)
print(head(iris, 3))
ggplot(data=iris, aes(x = Petal.Length, y = Petal.Width)) + geom_point()

- R studio 서버

vi /etc/selinux/config
SELINUX=permissive으로 수정
reboot
su - root
vi /etc/selinux/config
cd /usr/lib/firewalld/services/
ls http*
cp http.xml R.xml
vi R.xml
port="8787"으로 수정
firewall-cmd --permanent --zone=public --add-service=R
firewall-cmd --reload
https://www.rstudio.com/products/rstudio/download-server/
redhat/centos
wget https://download2.rstudio.org/server/centos7/x86_64/rstudio-server-rhel-1.4.1106-x86_64.rpm
wget https://download2.rstudio.org/server/fedora28/x86_64/rstudio-server-rhel-1.2.5033-x86_64.rpm

yum install --nogpgcheck rstudio-server-rhel-1.4.1106-x86_64.rpm
systemctl status rstudio-server
systemctl start rstudio-server

http://ip addr:8787/auth-sign-in	: 본인 ip에 접속

useradd testuser
passwd testuser
ls /home

mkdir mbc
cd mbc
ls

- 프로그램 셋팅 : ipython, anaconda

cafe.daum.net/flowlife/RUrO/44

주의 : VmWare를 사용할 때는 Virtual Machine Settings에서 Processors를 2 이상으로 주도록 하자.

- anaconda 설치

www.anaconda.com/products/individual

Download - Linux / 64-Bit (x86) Installer (529 MB) - 링크주소 복사

wget https://repo.anaconda.com/archive/Anaconda3-2020.02-Linux-x86_64.sh
chmod +x Anaconda~~
./Anaconda~~
q
yes
vi .bash_profile		: 파일 수정
=============================================
PATH=$PATH:$HOME/bin
export PATH

# added by Anaconda3 installer
export PATH="/home/hadoop/anaconda3/bin:$PATH"
==============================================
source .bash_profile
python3
conda deactivate     : 가상환경 나오기
source activate base : 가상환경 들어가기
jupyter notebook
jupyter lab
localhost:8888/tree
ctrl + c
y
jupyter notebook --generate-config
python
from notebook.auth import passwd
passwd()     #sha1 값 얻기
quit()
비밀번호 설정
Verify password 복사
vi .jupyter/jupyter_notebook_config.py		: 파일 수정
ctrl + end
==============================================
c.NotebookApp.password =u'       Verify password 붙여넣기		'
==============================================
quit()
vi .jupyter/jupyter_notebook_config.py
su -
cd /usr/lib/firewalld/services/
cp http.xml NoteBook.xml
vi NoteBook.xml
8001포트 설정
:wq
systemctl status firewalld.service           방화벽 서비스 상태 확인  start, stop, enable
firewall-cmd --permanent --zone=public --add-service=NoteBook
firewall-cmd --reload 
jupyter notebook --ip=0.0.0.0 --port=8001 --allow-root
jupyter lab --ip=0.0.0.0 --port=8001 --allow-root
vi .jupyter/jupyter_notebook_config.py
==============================================
c.NotebookApp.open_browser = False
==============================================

- 하둡 ppt

cafe.daum.net/flowlife/RZ23/24

: 빅데이터 파일을 여러 대의 서버에 분산 저장(HDFS, Hadoop Distributed File Syetem).

: 각 서버에서 분산병렬 처리.(MapReduce)

: 하둡은 데이터 수정불가.

: sale-out 사용.

- Hadoop 싱글노드로 설치 : Centos 기반 (2020년 기준)

cafe.daum.net/flowlife/RZ23/11

- 하둡 설치

conda deactivate	: 가상환경 나오기
java -version		: 자바 버전 확인
ls -al /etc/alternatives/java	: 자바 경로 확인 후 경로 복사
 => /usr/ ~~ .x86_64 복사
https://apache.org/ 접속 => 맨 밑에 hadoop접속 => Download => 
su -				: 관리자 접속
cd /usr/lib/firewalld/services/	: 특정 port 방화벽 해제
cp http.xml hadoop.xml
vi hadoop.xml					: 파일 수정 후 저장(:wq)
==============================================
<?xml version="1.0" encoding="utf-8"?>
<service>
    <short>Hadoop</short>
    <description></description>
    <port protocol="tcp" port="8042"/>    
    <port protocol="tcp" port="9864"/>
    <port protocol="tcp" port="9870"/>
    <port protocol="tcp" port="8088"/>
    <port protocol="tcp" port="19888"/>
</service>
==============================================
firewall-cmd --permanent --zone=public --add-service=hadoop
firewall-cmd --reload
exit
wget http://apache.mirror.cdnetworks.com/hadoop/common/hadoop-3.2.1/hadoop-3.2.1.tar.gz
tar xvzf hadoop-3.2.1.tar.gz
vi .bash_profile
==============================================
# User specific environment and startup programs
export JAVA_HOME=/usr/ ~~ .x86_64 붙여넣기
export HADOOP_HOME=/home/사용자명/hadoop-3.2.1     
PATH=$PATH:$HOME/.local/bin:$HOME/bin:$HADOOP_HOME/bin:$HADOOP_HOME/sbin  #추가
export PATH
==============================================
source .bash_profile		: .bash_profile을 수정된 내용으로 등록
conda deactivate
ls
cd hadoop-3.2.1/
vi README.txt
vi kor.txt					: 임의 내용 입력
cd etc
cd hadoop
pwd
vi hadoop-env.sh			: 파일 상단에 입력
==============================================
export JAVA_HOME=/usr/ ~~ .x86_64 붙여넣기
==============================================

vi core-site.xml        : 하둡 공통 설정을 기술
==============================================
<configuration>
        <property>
                <name>fs.default.name</name>
                <value>hdfs://localhost:9000</value>
        </property>
      <property>
                 <name>hadoop.tmp.dir</name> 
                       <value>/home/사용자명/hadoop-3.2.1/tmp/</value>
            </property>
</configuration>
==============================================

vi hdfs-site.xml      : HDFS 동작에 관한 설정
==============================================
<configuration>
     <property>
          <name>dfs.replication</name>
          <value>1</value>
     </property>
</configuration>
==============================================

vi yarn-env.sh         : yarn 설정 파일
==============================================
export JAVA_HOME=/usr/ ~~ .x86_64 붙여넣기
==============================================

vi mapred-site.xml      : yarn 설정 파일
==============================================
<configuration>
     <property>
          <name>mapreduce.framework.name</name>
          <value>yarn</value>
     </property>
     <property>
        <name>mapreduce.admin.user.env</name>
        <value>HADOOP_MAPRED_HOME=$HADOOP_COMMON_HOME</value>
    </property>
    <property>
        <name>yarn.app.mapreduce.am.env</name>
        <value>HADOOP_MAPRED_HOME=$HADOOP_COMMON_HOME</value>
    </property>
</configuration>
==============================================

vi yarn-site.xml
==============================================
<configuration>
<!-- Site specific YARN configuration properties -->
     <property>
          <name>yarn.nodemanager.aux-services</name>
          <value>mapreduce_shuffle</value>
     </property>
     <property>
          <name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
          <value>org.apache.hadoop.mapred.ShuffleHandler</value>
     </property>
</configuration>
==============================================

cd					: root로 이동
hdfs namenode -format
hadoop version		: 하둡 버전 확인

ssh					: 원격지 시스템에 접근하여 암호화된 메시지를 전송할 수 있는 프로그램
ssh-keygen -t rsa	: ssh 키 생성. enter 3번.
cd .ssh
scp id_rsa.pub /home/사용자명/.ssh/authorized_keys	: 생성키를 접속할 때 사용하도록 복사함. yes
ssh 사용자명@localhost

start-all.sh		: deprecated
start-dfs.sh
start-yarn.sh
mr-jobhistory-daemon.sh start historyserver	: 안해도됨

jps					: 하둡 실행 상태 확인 - NameNode와 DataNode의 동작여부 확인

stop-all.sh			: deprecated
stop-dfs.sh
stop-yarn.sh

http://localhost:9870/ 접속 : Summary(HDFS 상태 확인)
http://localhost:8088/ 접속 : All Applications

- Map/Reduce 처리과정

cafe.daum.net/flowlife/RZ23/17

- 워드 카운트

conda deactivate
cd hadoop-3.2.1/
hdfs dfs -mkdir /test
hdfs dfs -ls/test
hdfs dfs -copyFromLocal ./README.txt /test	: 복사
hdfs dfs -cat /test/README.txt
hadoop jar /home/사용자명/hadoop-3.2.1/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.2.1.jar wordcount /test/README.txt /output			: map/reduce 작업 진행.
hdfs dfs -ls /output						: 목록확인
hdfs dfs -cat /output/part-r-00000
hdfs dfs -ls /
hdfs dfs -rm /output/part*					: 파일삭제
hdfs dfs -rm /output/_SUCCESS
hdfs dfs -rmdir /output

hdfs dfs -put ./kor.txt /test
hadoop jar /home/사용자명/hadoop-3.2.1/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.2.1.jar wordcount /test/kor.txt /daumdata
hdfs dfs -get /daumdata/part* happy.txt		: 하둡에서 파일 가져오기
ls
vi happy.txt

- eclipse + hadoop 연동

conda deactivate
start-dfs.sh
start-yarn.sh
jps
http://localhost:9870/ 접속	: Summary(HDFS 상태 확인)
http://localhost:8088/ 접속	: All Applications
eclipse 실행

Tensorflow.js 간단한 예

cafe.daum.net/flowlife/S2Ul/27

* abc.html

<!DOCTYPE html>
<html>
<head>
    <meta charset="UTF-8">
    <title>tensorflow.js sample</title>
    <script src="https://cdn.jsdelivr.net/npm/@tensorflow/tfjs@1.0.0/dist/tf.min.js"></script>

    <script type="text/javascript">
    function abc(){
        // 선형회귀 모델 생성
        const model = tf.sequential();
        model.add(tf.layers.dense({units: 1, inputShape: [1]}));
 
        // 학습을 위한 준비 : 손실 함수와 최적화 함수를 설정
        model.compile({loss: 'meanSquaredError', optimizer: 'sgd'});
 
        // 학습 데이터
        const xs = tf.tensor2d([1, 2, 3, 4], [4, 1]);
        const ys = tf.tensor2d([1, 3, 5, 7], [4, 1]);
 
        // 데이터를 사용해서 학습
        model.fit(xs, ys).then(() => {
            // 학습된 모델로 결과 예측값 얻기
            pred = model.predict(tf.tensor2d([5], [1, 1]))
            pred.print()    // console로 출력
  document.write('예측값 : ', pred);
        });
    }
    </script>	
</head>
<body>
    결과는 브라우저의 콘솔로 확인하세요.
   <br>
    <button xxxxxxxxxxxxonclick="abc()">클릭</button>
</body>
</html>

propertive - javaSE
프로젝트 생성
hadoop_pro
Java1.8
프로젝트 오른쪽 클릭
Configure - Convert to Maven Project

* pom.xml

maven.org

hadoop-common 검색
apache
3.2.1
<dependency>
  <groupId>org.apache.hadoop</groupId>
  <artifactId>hadoop-common</artifactId>
  <version>3.2.1</version>
</dependency>

hadoop-client
apache
3.2.1
<dependency>
  <groupId>org.apache.hadoop</groupId>
  <artifactId>hadoop-client</artifactId>
  <version>3.2.1</version>
</dependency>

hadoop-hdfs
apache
3.2.1
<dependency>
  <groupId>org.apache.hadoop</groupId>
  <artifactId>hadoop-hdfs</artifactId>
  <version>3.2.2</version>
</dependency>

hadoop-mapreduce-client-core
apache
3.2.1
<dependency>
  <groupId>org.apache.hadoop</groupId>
  <artifactId>hadoop-mapreduce-client-core</artifactId>
  <version>3.2.2</version>
</dependency>

- example code download

hadoop.apache.org/

Documentation - 3.2.2

MapReduce
Tutorial
Example: WordCount v1.0 - Source Code

src - new - class - pack.WordCount

* WordCount.java

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  public static class TokenizerMapper
       extends Mapper<Object, Text, Text, IntWritable>{

    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(Object key, Text value, Context context
                    ) throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);
      }
    }
  }

  public static class IntSumReducer
       extends Reducer<Text,IntWritable,Text,IntWritable> {
    private IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values,
                       Context context
                       ) throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
	
	conf.set("fs.default.name", "hdfs://localhost:9000");
	
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
	
    FileInputFormat.addInputPath(job, new Path("hdfs://localhost:9000/test/data.txt"));
    FileOutputFormat.setOutputPath(job, new Path("hdfs://localhost:9000/out1"));
	System.out.println("success");
	
    //FileInputFormat.addInputPath(job, new Path(args[0]));
    //FileOutputFormat.setOutputPath(job, new Path(args[1]));
	
	System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

ls
cd hadoop-3.2.1/
vi data.txt
임의 텍스트 입력
hdfs dfs -put data.txt /test
hdfs dfs -get /out1/part* result1.txt
vi result1.txt
vi vote.txt
임의 텍스트 입력
hdfs dfs -put vote.txt /test

src - new - class - pack2.VoteMapper

* VoteMapper.java

import java.util.regex.*;
import org.apache.hadoop.io.IntWritable
import org.apache.hadoop.io.Text
import org.apache.hadoop.mapreduce.Mapper


public class VoteMapper extends Mapper<Object, Text, Text, IntWritable>{
	private final static IntWritable one =  new IntWritable(1);
	private Text word = new Text();
	
	@Override
	public void map(Object ket, Text value, Context context)throws IOException, InterruptedException{
		String str = value.toString();
		System.out.println(str);
		String regex = "홍길동|신기해|한국인";
		Pattern pattern = Pattern.compile(regex);
		Matcher matcher = pattern.matcher(str);
		String result = "";
		while(match.find()){
			result += matcher.group() + "";
		}
		StringTokenizer itr = new StringTokenizer(result);
		
		while(itr.hasMoreTokens()){
			word.set(itr.nextToken());
			context.write(word, one);
		}
	}
}

src - new - class - pack2.VoteReducer

* VoteReducer.java

public class VoteReducer extends Reducer<Textm IntWritable, Text, IntWritable>{
	private IntWritable result = new IntWritable();
	
	@Override
	public void reduce(Text key, Iterable<IntWritable> values, Context context) throw IOException, InterruptedException{
		int sum = 0;
		for(IntWritable val:values){
			sum += val.get();
		}
		
		result.set(sum);
		context.wrtie(key, result);
	}
}

src - new - class - pack2.VoteCount

* VoteCount.java

public class VoteCount{
	public static void main(String[] args) throws Exception{
		Configuration conf = new Configuration();
		conf.set("fs.default.name", "hdfs://localhost:9000");
		
		Job job = Job.getInstance(conf, "vote");
		job.setJarByClass(VoteCount.class);
		job.setMapperClass(VoteMapper.class);
		job.setCombinerClass(VoteReducer.class);
		job.setReducerClass(VoteReducer.class);
		job.setOutputKeyClass(Text.class);
		job.setOutputValueClass(IntWritable.class);
		
		FileInputFormat.addInputPath(job, new Path("hdfs://localhost:9000/test/vote.txt"));
		FileOutputFormat.setOutputPath(job, new Path("hdfs://localhost:9000/out2"));
		
		System.out.println("end");
		
		System.exit(job.waitForCompletion(true)?0:1);
	}
}

hdfs dfs -get /out2/part* result2.txt

'BACK END > Python Library' 카테고리의 다른 글

[Pandas] pandas 정리2 - db, django (0)	2021.03.02
[MatPlotLib] matplotlib 정리 (0)	2021.03.02
[Pandas] pandas 정리 (0)	2021.02.24
[NumPy] numpy 정리 (0)	2021.02.23

[딥러닝] Keras - Logistic

2021. 3. 25. 12:53

Keras - Logistic

tf 1.x 와 2.x : 단순선형회귀/로지스틱회귀 소스 코드

cafe.daum.net/flowlife/S2Ul/17

로지스틱 회귀 분석) 1.x

* ke12_classification_tf1.py

import tensorflow.compat.v1 as tf   # tf2.x 환경에서 1.x 소스 실행 시
tf.disable_v2_behavior()            # tf2.x 환경에서 1.x 소스 실행 시

x_data = [[1,2],[2,3],[3,4],[4,3],[3,2],[2,1]]
y_data = [[0],[0],[0],[1],[1],[1]]

# placeholders for a tensor that will be always fed.
X = tf.placeholder(tf.float32, shape=[None, 2])
Y = tf.placeholder(tf.float32, shape=[None, 1])
W = tf.Variable(tf.random_normal([2, 1]), name='weight')
b = tf.Variable(tf.random_normal([1]), name='bias')

# Hypothesis using sigmoid: tf.div(1., 1. + tf.exp(tf.matmul(X, W)))
hypothesis = tf.sigmoid(tf.matmul(X, W) + b)

# 로지스틱 회귀에서 Cost function 구하기
cost = -tf.reduce_mean(Y * tf.log(hypothesis) + (1 - Y) * tf.log(1 - hypothesis))

# Optimizer(코스트 함수의 최소값을 찾는 알고리즘) 구하기
train = tf.train.GradientDescentOptimizer(learning_rate=0.01).minimize(cost)

predicted = tf.cast(hypothesis > 0.5, dtype=tf.float32)
accuracy = tf.reduce_mean(tf.cast(tf.equal(predicted, Y), dtype=tf.float32))

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for step in range(10001):
        cost_val, _ = sess.run([cost, train], feed_dict={X: x_data, Y: y_data})
        if step % 200 == 0:
            print(step, cost_val)
 
    # Accuracy report (정확도 출력)
    h, c, a = sess.run([hypothesis, predicted, accuracy],feed_dict={X: x_data, Y: y_data})
    print("\nHypothesis: ", h, "\nCorrect (Y): ", c, "\nAccuracy: ", a)

import tensorflow.compat.v1 as tf

tf.disable_v2_behavior() : 텐서플로우 2환경에서 1 소스 실행 시 사용

tf.placeholder(자료형, shape=형태, name=) :

tf.matmul() :

tf.sigmoid() :

tf.reduce_mean() :

tf.log() :

tf.train.GradientDescentOptimizer(learning_rate=0.01) :

.minimize(cost) :

tf.cast() :

tf.Session() :

sess.run() :

tf.global_variables_initializer() :

로지스틱 회귀 분석) 2.x

* ke12_classification_tf2.py

import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Activation

np.random.seed(0)

x = np.array([[1,2],[2,3],[3,4],[4,3],[3,2],[2,1]])
y = np.array([[0],[0],[0],[1],[1],[1]])

model = Sequential([
    Dense(units = 1, input_dim=2),  # input_shape=(2,)
    Activation('sigmoid')
])

model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

model.fit(x, y, epochs=1000, batch_size=1, verbose=1)

meval = model.evaluate(x,y)
print(meval)            # [0.209698(loss),  1.0(정확도)]

pred = model.predict(np.array([[1,2],[10,5]]))
print('예측 결과 : ', pred)     # [[0.16490099] [0.9996613 ]]
print('예측 결과 : ', np.squeeze(np.where(pred > 0.5, 1, 0)))  # [0 1]

for i in pred:
    print(1 if i > 0.5 else print(0))
print([1 if i > 0.5 else 0 for i in pred])

# 2. function API 사용
from tensorflow.keras.layers import Input
from tensorflow.keras.models import Model

inputs = Input(shape=(2,))
outputs = Dense(1, activation='sigmoid')
model2 = Model(inputs, outputs)

model2.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

model2.fit(x, y, epochs=500, batch_size=1, verbose=0)

meval2 = model2.evaluate(x,y)
print(meval2)            # [0.209698(loss),  1.0(정확도)]

- activation function

subinium.github.io/introduction-to-activation/

Introduction to Activation Function

activation을 알아봅시다.

subinium.github.io

- 와인 등급, 맛, 산도 등을 측정해 얻은 자료로 레드 와인과 화이트 와인 분류

* ke13_wine.py

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras import optimizers
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import tensorflow as tf
from tensorflow.keras.callbacks import ModelCheckpoint, EarlyStopping
from sklearn.model_selection import train_test_split

wdf = pd.read_csv("https://raw.githubusercontent.com/pykwon/python/master/testdata_utf8/wine.csv", header=None)
print(wdf.head(2))
'''
    0     1    2    3      4     5     6       7     8     9    10  11  12
0  7.4  0.70  0.0  1.9  0.076  11.0  34.0  0.9978  3.51  0.56  9.4   5   1
1  7.8  0.88  0.0  2.6  0.098  25.0  67.0  0.9968  3.20  0.68  9.8   5   1
'''
print(wdf.info())
print(wdf.iloc[:, 12].unique()) # [1 0] wine 종류

dataset = wdf.values
print(dataset)
'''
[[ 7.4   0.7   0.   ...  9.4   5.    1.  ]
 [ 7.8   0.88  0.   ...  9.8   5.    1.  ]
 [ 7.8   0.76  0.04 ...  9.8   5.    1.  ]
 ...
 [ 6.5   0.24  0.19 ...  9.4   6.    0.  ]
 [ 5.5   0.29  0.3  ... 12.8   7.    0.  ]
 [ 6.    0.21  0.38 ... 11.8   6.    0.  ]]
'''
x = dataset[:, 0:12] # feature 값
y = dataset[:, -1]   # label 값
print(x[0]) # [ 7.4  0.7  0.  1.9  0.076  11.  34.  0.9978  3.51  0.56  9.4  5.]
print(y[0]) # 1.0

# 과적합 방지 - train/test
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3, random_state=12)
print(x_train.shape, x_test.shape, y_train.shape)     # (4547, 12) (1950, 12) (4547,)

# model
model = Sequential()
model.add(Dense(30, input_dim=12, activation='relu'))
model.add(tf.keras.layers.BatchNormalization()) # 배치정규화. 그래디언트 손실과 폭주 문제 개선
model.add(Dense(15, activation='relu'))
model.add(tf.keras.layers.BatchNormalization()) # 배치정규화. 그래디언트 손실과 폭주 문제 개선
model.add(Dense(8, activation='relu'))
model.add(tf.keras.layers.BatchNormalization()) # 배치정규화. 그래디언트 손실과 폭주 문제 개선
model.add(Dense(1, activation='sigmoid'))
print(model.summary()) # Total params: 992

# 학습 설정
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])

# 모델 평가
loss, acc = model.evaluate(x_train, y_train, verbose=2)
print('훈련되지않은 모델의 분류 정확도 :{:5.2f}%'.format(100 * acc))  # 훈련되지않은 모델의 평가 :25.14%

model.add(tf.keras.layers.BatchNormalization()) : 배치정규화. 그래디언트 손실과 폭주 문제 개선

- BatchNormalization

eehoeskrap.tistory.com/430

[Deep Learning] Batch Normalization (배치 정규화)

사람은 역시 기본에 충실해야 하므로 ... 딥러닝의 기본중 기본인 배치 정규화(Batch Normalization)에 대해서 정리하고자 한다. 배치 정규화 (Batch Normalization) 란? 배치 정규화는 2015년 arXiv에 발표된 후

eehoeskrap.tistory.com

# 모델 저장 및 폴더 설정
import os
MODEL_DIR = './model/'
if not os.path.exists(MODEL_DIR): # 폴더가 없으면 생성
    os.mkdir(MODEL_DIR)

# 모델 저장조건 설정
modelPath = "model/{epoch:02d}-{loss:4f}.hdf5"

# 모델 학습 시 모니터링의 결과를 파일로 저장
chkpoint = ModelCheckpoint(filepath='./model/abc.hdf5', monitor='loss', save_best_only=True)
#chkpoint = ModelCheckpoint(filepath=modelPath, monitor='loss', save_best_only=True)

# 학습 조기 종료
early_stop = EarlyStopping(monitor='loss', patience=5)

# 훈련
# 과적합 방지 - validation_split
history = model.fit(x_train, y_train, epochs=10000, batch_size=64,\
                    validation_split=0.3, callbacks=[early_stop, chkpoint])

model.load_weights('./model/abc.hdf5')

from tensorflow.keras.callbacks import ModelCheckpoint

checkkpoint = ModelCheckpoint(filepath=경로, monitor='loss', save_best_only=True) : 모델 학습 시 모니터링의 결과를 파일로 저장

from tensorflow.keras.callbacks import EarlyStopping

early_stop = EarlyStopping(monitor='loss', patience=5) : 학습 조기 종료

model.fit(x, y, epochs=, batch_size=, validation_split=, callbacks=[early_stop, checkpoint])

model.load_weights(경로) : 모델 load

# 모델 평가
loss, acc = model.evaluate(x_test, y_test, verbose=2, batch_size=64)
print('훈련된 모델의 분류 정확도 :{:5.2f}%'.format(100 * acc))     # 훈련된 모델의 분류 정확도 :98.09%

# loss, val_loss
vloss = history.history['val_loss']
print('vloss :', vloss, len(vloss))

loss = history.history['loss']
print('loss :', loss, len(loss))

acc = history.history['accuracy']
print('acc :', acc, len(acc))
'''
vloss : [0.3071061074733734, 0.24310727417469025, 0.21292203664779663, 0.20357123017311096, 0.19876249134540558, 0.19339516758918762, 0.18849460780620575, 0.19663989543914795, 0.18071356415748596, 0.17616882920265198, 0.17531293630599976, 0.1801542490720749, 0.15864963829517365, 0.15213842689990997, 0.14762602746486664, 0.1503043919801712, 0.14793048799037933, 0.1309681385755539, 0.13258206844329834, 0.13192133605480194, 0.1243339478969574, 0.11655988544225693, 0.12307717651128769, 0.12738896906375885, 0.1113310232758522, 0.10832417756319046, 0.10952667146921158, 0.10551106929779053, 0.10609143227338791, 0.10121085494756699, 0.09997127950191498, 0.09778153896331787, 0.09552880376577377, 0.09823410212993622, 0.09609625488519669, 0.09461705386638641, 0.09470073878765106, 0.10075356811285019, 0.08981592953205109, 0.12177421152591705, 0.0883333757519722, 0.0909857228398323, 0.08964037150144577, 0.10728123784065247, 0.0898541733622551, 0.09610393643379211, 0.09143698215484619, 0.090325728058815, 0.08899156004190445, 0.08767704665660858, 0.08600322902202606, 0.08517392724752426, 0.092035673558712, 0.09141630679368973, 0.092674620449543, 0.10688834637403488, 0.12232159823179245, 0.08342760801315308, 0.08450359851121902, 0.09528715908527374, 0.08286084979772568, 0.0855109766125679, 0.09981518238782883, 0.10567736625671387, 0.08503438532352448] 65
loss : [0.5793761014938354, 0.2694554328918457, 0.2323148101568222, 0.21022693812847137, 0.20312409102916718, 0.19902488589286804, 0.19371536374092102, 0.18744204938411713, 0.1861375868320465, 0.18172481656074524, 0.17715702950954437, 0.17380622029304504, 0.16577215492725372, 0.15683749318122864, 0.15192237496376038, 0.14693987369537354, 0.14464591443538666, 0.13748657703399658, 0.13230560719966888, 0.13056866824626923, 0.12020964175462723, 0.11942493915557861, 0.11398345232009888, 0.11165868490934372, 0.10952220112085342, 0.10379171371459961, 0.09987008571624756, 0.10752293467521667, 0.09674300253391266, 0.09209998697042465, 0.09165043383836746, 0.0861961618065834, 0.0874367281794548, 0.08328106254339218, 0.07987993955612183, 0.07834275811910629, 0.07953618466854095, 0.08022965490818024, 0.07551567256450653, 0.07456657290458679, 0.08024302124977112, 0.06953852623701096, 0.07057023793458939, 0.06981713324785233, 0.07673583924770355, 0.06896857917308807, 0.06751637160778046, 0.0666055828332901, 0.06451215595006943, 0.06433264911174774, 0.0721585601568222, 0.072028249502182, 0.06898234039545059, 0.0603899322450161, 0.06275985389947891, 0.05977606773376465, 0.06264647841453552, 0.06375902146100998, 0.05906158685684204, 0.05760310962796211, 0.06351816654205322, 0.06012773886322975, 0.061231035739183426, 0.05984795466065407, 0.07533899694681168] 65
acc : [0.79572594165802, 0.9204902648925781, 0.9226901531219482, 0.9292897582054138, 0.930232584476471, 0.930232584476471, 0.9327467083930969, 0.9340037703514099, 0.934946596622467, 0.9380892515182495, 0.9377749562263489, 0.9390320777893066, 0.9396606087684631, 0.9434317946434021, 0.9424890279769897, 0.9437460899353027, 0.9472030401229858, 0.9500313997268677, 0.9487743377685547, 0.9538026452064514, 0.9550597071647644, 0.9569453001022339, 0.959145188331604, 0.9607165455818176, 0.9619736075401306, 0.9619736075401306, 0.9648020267486572, 0.9619736075401306, 0.9676304459571838, 0.9692017436027527, 0.9701445698738098, 0.9710873961448669, 0.9710873961448669, 0.9729729890823364, 0.9761156439781189, 0.975801408290863, 0.9786297678947449, 0.9739157557487488, 0.9764299392700195, 0.9786297678947449, 0.9732872247695923, 0.978315532207489, 0.975801408290863, 0.9786297678947449, 0.9745442867279053, 0.9776870012283325, 0.9811439514160156, 0.982086718082428, 0.9814581871032715, 0.9824010133743286, 0.9767441749572754, 0.9786297678947449, 0.9802011251449585, 0.9805154204368591, 0.9792582988739014, 0.9830295443534851, 0.9792582988739014, 0.9802011251449585, 0.9830295443534851, 0.980829656124115, 0.9798868894577026, 0.9817724823951721, 0.9811439514160156, 0.9827152490615845, 0.9751728177070618] 65
'''

# 시각화
epoch_len = np.arange(len(acc))
plt.plot(epoch_len, vloss, c='red', label='val_loss')
plt.plot(epoch_len, loss, c='blue', label='loss')
plt.xlabel('epochs')
plt.ylabel('loss')
plt.legend(loc='best')
plt.show()

plt.plot(epoch_len, acc, c='red', label='acc')
plt.xlabel('epochs')
plt.ylabel('acc')
plt.legend(loc='best')
plt.show()

# 예측
np.set_printoptions(suppress = True) # 과학적 표기 형식 해제
new_data = x_test[:5, :]
print(new_data)
'''
[[  7.2       0.15      0.39      1.8       0.043    21.      159.
    0.9948    3.52      0.47     10.        5.     ]
 [  6.9       0.3       0.29      1.3       0.053    24.      189.
    0.99362   3.29      0.54      9.9       4.     ]]
'''
pred = model.predict(new_data)
print('예측결과 :', np.where(pred > 0.5, 1, 0).flatten()) # 예측결과 : [0 0 0 0 1]

np.set_printoptions(suppress = True) : 과학적 표기 형식 해제

- K-Fold Cross Validation(교차검증)

nonmeyet.tistory.com/entry/KFold-Cross-Validation%EA%B5%90%EC%B0%A8%EA%B2%80%EC%A6%9D-%EC%A0%95%EC%9D%98-%EB%B0%8F-%EC%84%A4%EB%AA%85

K-Fold Cross Validation(교차검증) 정의 및 설명

정의 - K개의 fold를 만들어서 진행하는 교차검증 사용 이유 - 총 데이터 갯수가 적은 데이터 셋에 대하여 정확도를 향상시킬수 있음 - 이는 기존에 Training / Validation / Test 세 개의 집단으로 분류하

nonmeyet.tistory.com

- k-fold 교차 검증

: train data에 대해 k겹으로 나눠, 모든 데이터가 최소 1번은 test data로 학습에 사용되도록 하는 방법.
: k-fold 교차검증을 할때는 validation_split은 사용하지않는다.

: 데이터 양이 적을 경우 많이 사용되는 방법.

* ke14_k_fold.py

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras import optimizers
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import tensorflow as tf

# 데이터 수집
data = np.loadtxt('https://raw.githubusercontent.com/pykwon/python/master/testdata_utf8/diabetes.csv',\
                  dtype=np.float32, delimiter=',')
print(data[:2], data.shape) #(759, 9)
'''
[[-0.294118    0.487437    0.180328   -0.292929    0.          0.00149028
  -0.53117    -0.0333333   0.        ]
 [-0.882353   -0.145729    0.0819672  -0.414141    0.         -0.207153
  -0.766866   -0.666667    1.        ]]
'''

x = data[:, 0:-1]
y = data[:, -1]
print(x[:2])
'''
[[-0.294118    0.487437    0.180328   -0.292929    0.          0.00149028
  -0.53117    -0.0333333 ]
 [-0.882353   -0.145729    0.0819672  -0.414141    0.         -0.207153
  -0.766866   -0.666667  ]]
'''
print(y[:2])
# [0. 1.]

- 일반적인 모델 네트워크

model = Sequential([
    Dense(units=64, input_dim = 8, activation='relu'),
    Dense(units=32, activation='relu'),
    Dense(units=1, activation='sigmoid')
])

# 학습설정
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])

# 훈련
model.fit(x, y, batch_size=32, epochs=200, verbose=2)

# 모델평가
print(model.evaluate(x, y)) #loss, acc : [0.2690807580947876, 0.8761528134346008]

pred = model.predict(x[:3, :])
print('pred :', pred.flatten()) # pred : [0.03489202 0.9996008  0.04337612]
print('real :', y[:3])          # real : [0. 1. 0.]

- 일반적인 모델 네트워크2

def build_model():
    model = Sequential()
    model.add(Dense(units=64, input_dim = 8, activation='relu'))
    model.add(Dense(units=32, activation='relu'))
    model.add(Dense(units=1, activation='sigmoid'))
    model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
    return model

- K-겹 교차검증 사용한 모델 네트워크

estimatorModel = KerasClassifier(build_fn = build_model, batch_size=32, epochs=200, verbose=2)
kfold = KFold(n_splits=5, shuffle=True, random_state=12) # n_splits : 분리 개수
print(cross_val_score(estimatorModel, x, y, cv=kfold))

# 훈련
estimatorModel.fit(x, y, batch_size=32, epochs=200, verbose=2)

# 모델평가
#print(estimatorModel.evaluate(x, y)) # AttributeError: 'KerasClassifier' object has no attribute 'evaluate'
pred2 = estimatorModel.predict(x[:3, :])
print('pred2 :', pred2.flatten()) # pred2 : [0. 1. 0.]
print('real  :', y[:3])            # real  : [0. 1. 0.]

from tensorflow.keras.wrappers.scikit_learn import KerasClassifier

from sklearn.model_selection import KFold, cross_val_score

estimatorModel = KerasClassifier(build_fn = 모델 함수, batch_size=, epochs=, verbose=) :

kfold = KFold(n_splits=, shuffle=True, random_state=) : n_splits : 분리 개수
cross_val_score(estimatorModel, x, y, cv=kfold) :

- KFold API

scikit-learn.org/stable/modules/generated/sklearn.model_selection.KFold.html

sklearn.model_selection.KFold — scikit-learn 0.24.1 documentation

scikit-learn.org

from sklearn.metrics import accuracy_score
print('분류 정확도(estimatorModel) :', accuracy_score(y, estimatorModel.predict(x)))
# 분류 정확도(estimatorModel) : 0.8774703557312253

영화 리뷰를 이용한 텍스트 분류

www.tensorflow.org/tutorials/keras/text_classification

영화 리뷰를 사용한 텍스트 분류 | TensorFlow Core

Note: 이 문서는 텐서플로 커뮤니티에서 번역했습니다. 커뮤니티 번역 활동의 특성상 정확한 번역과 최신 내용을 반영하기 위해 노력함에도 불구하고 공식 영문 문서의 내용과 일치하지 않을 수

www.tensorflow.org

* ke15_imdb.py

'''
여기에서는 인터넷 영화 데이터베이스(Internet Movie Database)에서 수집한 50,000개의 영화 리뷰 텍스트를 담은 
IMDB 데이터셋을 사용하겠습니다. 25,000개 리뷰는 훈련용으로, 25,000개는 테스트용으로 나뉘어져 있습니다. 
훈련 세트와 테스트 세트의 클래스는 균형이 잡혀 있습니다. 즉 긍정적인 리뷰와 부정적인 리뷰의 개수가 동일합니다.
매개변수 num_words=10000은 훈련 데이터에서 가장 많이 등장하는 상위 10,000개의 단어를 선택합니다.
데이터 크기를 적당하게 유지하기 위해 드물에 등장하는 단어는 제외하겠습니다.
'''

from tensorflow.keras.datasets import imdb
(train_data, train_labels), (test_data, test_labels) = imdb.load_data(num_words=10000)

print(train_data[0])   # 각 숫자는 사전에 있는 전체 문서에 나타난 모든 단어에 고유한 번호를 부여한 어휘사전
# [1, 14, 22, 16, 43, 530, 973, ...

print(train_labels) # 긍정 1 부정0
# [1 0 0 ... 0 1 0]

aa = []
for seq in train_data:
    #print(max(seq))
    aa.append(max(seq))

print(max(aa), len(aa))
# 9999 25000

word_index = imdb.get_word_index() # 단어와 정수 인덱스를 매핑한 딕셔너리
reverse_word_index = dict([(value, key) for (key, value) in word_index.items()])
decord_review = ' '.join([reverse_word_index.get(i - 3, '?') for i in train_data[0]])
print(decord_review)
# ? this film was just brilliant casting location scenery story direction ...

- 데이터 준비 : list -> tensor로 변환. Onehot vector.

import numpy as np

def vector_seq(sequences, dim=10000):
    results = np.zeros((len(sequences), dim))
    for i, seq in enumerate(sequences):
        results[i, seq] = 1
    return results

x_train = vector_seq(train_data)
x_test = vector_seq(test_data)
print(x_train,' ', x_train.shape)
'''
[[0. 1. 1. ... 0. 0. 0.]
 [0. 1. 1. ... 0. 0. 0.]
 [0. 1. 1. ... 0. 0. 0.]
 ...
 [0. 1. 1. ... 0. 0. 0.]
 [0. 1. 1. ... 0. 0. 0.]
 [0. 1. 1. ... 0. 0. 0.]]   (25000, 10000)
'''

y_train = train_labels
y_test = test_labels
print(y_train) # [1 0 0 ... 0 1 0]

- 신경망 모델

from tensorflow.keras import models, layers, regularizers

model = models.Sequential()
model.add(layers.Dense(16, activation='relu', input_shape=(10000, ), kernel_regularizer=regularizers.l2(0.01)))
# regularizers.l2(0.001) : 가중치 행렬의 모든 원소를 제곱하고 0.001을 곱하여 네트워크의 전체 손실에 더해진다는 의미, 이 규제(패널티)는 훈련할 때만 추가됨
model.add(layers.Dropout(0.3)) # 과적합 방지를 목적으로 노드 일부는 학습에 참여하지 않음
model.add(layers.Dense(16, activation='relu'))
model.add(layers.Dense(1, activation='sigmoid'))

model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['acc'])

print(model.summary())

layers.Dropout(n) : 과적합 방지를 목적으로 노드 일부는 학습에 참여하지 않음

from tensorflow.keras import models, layers, regularizers

Dense(units=, activation=, input_shape=, kernel_regularizer=regularizers.l2(0.01))

- drop out

ko.d2l.ai/chapter_deep-learning-basics/dropout.html

3.13. 드롭아웃(dropout) — Dive into Deep Learning documentation

ko.d2l.ai

- regularizers

wdprogrammer.tistory.com/33

Regularization과 딥러닝의 일반적인 흐름 정리

2019-01-13-deeplearning-flow- 최적화(optimization) : 가능한 훈련 데이터에서 최고의 성능을 얻으려고 모델을 조정하는 과정 일반화(generalization) : 훈련된 모델이 이전에 본 적 없는 데이..

wdprogrammer.tistory.com

- 훈련시 검증 데이터 (validation data)

x_val = x_train[:10000]
partial_x_train = x_train[10000:]
print(len(x_val), len(partial_x_train)) # 10000 10000

y_val = y_train[:10000]
partial_y_train = y_train[10000:]

history = model.fit(partial_x_train, partial_y_train, batch_size=512, epochs=10, \
                    validation_data=(x_val, y_val))

print(model.evaluate(x_test, y_test))

- 시각화

import matplotlib.pyplot as plt
history_dict = history.history
loss = history_dict['loss']
val_loss = history_dict['val_loss'] 

epochs = range(1, len(loss) + 1)

# "bo"는 "파란색 점"입니다
plt.plot(epochs, loss, 'bo', label='Training loss')
# b는 "파란 실선"입니다
plt.plot(epochs, val_loss, 'b', label='Validation loss')
plt.title('Training and validation loss')
plt.xlabel('Epochs')
plt.ylabel('Loss')
plt.legend()
plt.show()

acc = history_dict['acc']
val_acc = history_dict['val_acc'] 

plt.plot(epochs, acc, 'bo', label='Training acc')
plt.plot(epochs, val_acc, 'b', label='Validation acc')
plt.title('Training and validation acc')
plt.xlabel('Epochs')
plt.ylabel('acc')
plt.legend()
plt.show()

import numpy as np
pred = model.predict(x_test[:5])
print('예측값 :', np.where(pred > 0.5, 1, 0).flatten()) # 예측값 : [0 1 1 1 1]
print('실제값 :', y_test[:5])                           # 실제값 : [0 1 1 0 1]

softmax

- softmax

m.blog.naver.com/wideeyed/221021710286

[딥러닝] 활성화 함수 소프트맥스(Softmax)

Softmax(소프트맥스)는 입력받은 값을 출력으로 0~1사이의 값으로 모두 정규화하며 출력 값들의 총합은 항...

blog.naver.com

- 활성화 함수를 softmax를 사용하여 다항분류

* ke16.py

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Activation
from tensorflow.keras.utils import to_categorical
import numpy as np

x_data = np.array([[1,2,1,4],
                  [1,3,1,6],
                  [1,4,1,8],
                  [2,1,2,1],
                  [3,1,3,1],
                  [5,1,5,1],
                  [1,2,3,4],
                  [5,6,7,8]], dtype=np.float32)
#y_data = [[0., 0., 1.] ...]
y_data = to_categorical([2,2,2,1,1,1,0,0]) # One-hot encoding
print(x_data)
'''
[[1. 2. 1. 4.]
 [1. 3. 1. 6.]
 [1. 4. 1. 8.]
 [2. 1. 2. 1.]
 [3. 1. 3. 1.]
 [5. 1. 5. 1.]
 [1. 2. 3. 4.]
 [5. 6. 7. 8.]]
'''
print(y_data)
'''
[[0. 0. 1.]
 [0. 0. 1.]
 [0. 0. 1.]
 [0. 1. 0.]
 [0. 1. 0.]
 [0. 1. 0.]
 [1. 0. 0.]
 [1. 0. 0.]]
'''

from tensorflow.keras.utils import to_categorical

to_categorical(데이터) : One-hot encoding

model = Sequential()
model.add(Dense(50, input_shape = (4,)))
model.add(Activation('relu'))
model.add(Dense(50))
model.add(Activation('relu'))
model.add(Dense(3))
model.add(Activation('softmax'))
print(model.summary()) # Total params: 2,953

opti = 'adam' # sgd, rmsprop,...
model.compile(optimizer=opti, loss='categorical_crossentropy', metrics=['acc'])

model.add(Activation('softmax')) :

model.compile(optimizer=, loss='categorical_crossentropy', metrics=) :

model.fit(x_data, y_data, epochs=100)
print(model.evaluate(x_data, y_data))        # [0.10124918818473816, 1.0]
print(np.argmax(model.predict(np.array([[1,8,1,8]]))))  # 2
print(np.argmax(model.predict(np.array([[10,8,5,1]])))) # 1

np.argmax() :

- 다항분류 : 동물 type

* ke17_zoo.py

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Activation
import numpy as np
from tensorflow.keras.utils import to_categorical

xy = np.loadtxt('https://raw.githubusercontent.com/pykwon/python/master/testdata_utf8/zoo.csv', delimiter=',')
print(xy[:2], xy.shape) # (101, 17)

x_data = xy[:, 0:-1] # feature
y_data = xy[:, [-1]]   # label(class), type열
print(x_data[:2])
'''
[[1. 0. 0. 1. 0. 0. 1. 1. 1. 1. 0. 0. 4. 0. 0. 1.]
 [1. 0. 0. 1. 0. 0. 0. 1. 1. 1. 0. 0. 4. 1. 0. 1.]]
'''
print(y_data[:2]) # [0. 0.]
print(set(y_data.ravel())) # {0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0}

nb_classes = 7
y_one_hot = to_categorical(y_data, num_classes = nb_classes) # label에 대한 one-hot encoding
# num_classes : vector 수
print(y_one_hot[:3])
'''
[[1. 0. 0. 0. 0. 0. 0.]
 [1. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 1. 0. 0. 0.]]
'''

model = Sequential()
model.add(Dense(32, input_shape=(16, ), activation='relu'))
model.add(Dense(32, activation='relu'))
model.add(Dense(nb_classes, activation='softmax'))

opti='adam'
model.compile(optimizer=opti, loss='categorical_crossentropy', metrics=['acc'])

history = model.fit(x_data, y_one_hot, batch_size=32, epochs=100, verbose=0, validation_split=0.3)
print(model.evaluate(x_data, y_one_hot))
# [0.2325848489999771, 0.9306930899620056]

history_dict = history.history
loss = history_dict['loss']
val_loss = history_dict['val_loss']
acc = history_dict['acc']
val_acc = history_dict['val_acc']

# 시각화
import matplotlib.pyplot as plt
plt.plot(loss, 'b-', label='train loss')
plt.plot(val_loss, 'r--', label='train val_loss')
plt.xlabel('epoch')
plt.ylabel('loss')
plt.legend()
plt.show()

plt.plot(acc, 'b-', label='train acc')
plt.plot(val_acc, 'r--', label='train val_acc')
plt.xlabel('epoch')
plt.ylabel('acc')
plt.legend()
plt.show()

#predict
pred_data = x_data[:1] # 한개만
pred = np.argmax(model.predict(pred_data))
print(pred) # 0
print()

pred_datas = x_data[:5] # 여러개
preds = [np.argmax(i) for i in model.predict(pred_datas)]
print('예측값 : ', preds)
# 예측값 :  [0, 0, 3, 0, 0]
print('실제값: ', y_data[:5].flatten())
# 실제값:  [0. 0. 3. 0. 0.]

# 새로운 data
print(x_data[:1])
new_data = [[1., 0., 0., 1., 0., 0., 1., 1., 1., 1., 0., 0., 4., 0., 0., 1.]]

new_pred = np.argmax(model.predict(new_data))
print('예측값 : ', new_pred) # 예측값 :  0

다항분류 softmax + roc curve

: iris dataset으로 분류 모델 작성 후 ROC curve 출력

* ke18_iris.py

- 데이터 수집

import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder, StandardScaler

iris = load_iris() # iris dataset
print(iris.DESCR)

x = iris.data # feature
print(x[:2])
# [[5.1 3.5 1.4 0.2]
#  [4.9 3.  1.4 0.2]]
y = iris.target # label
print(y)
# [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
#  0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
#  1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2
#  2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
#  2 2]
print(set(y)) # 집합
# {0, 1, 2}

names = iris.target_names
print(names)  # ['setosa' 'versicolor' 'virginica']

feature_iris = iris.feature_names
print(feature_iris) # ['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']

- label 원-핫 인코딩

one_hot = OneHotEncoder() # to_categorical() ..
y = one_hot.fit_transform(y[:, np.newaxis]).toarray()
print(y[:2])
# [[1. 0. 0.]
#  [1. 0. 0.]]

- feature 표준화

scaler = StandardScaler()
x_scaler = scaler.fit_transform(x)
print(x_scaler[:2])
# [[-0.90068117  1.01900435 -1.34022653 -1.3154443 ]
#  [-1.14301691 -0.13197948 -1.34022653 -1.3154443 ]]

- train / test

x_train, x_test, y_train, y_test = train_test_split(x_scaler, y, test_size=0.3, random_state=1)
n_features = x_train.shape[1] # 열
n_classes = y_train.shape[1]  # 열
print(n_features, n_classes)  # 4 3 => input, output수

- n의 개수 만큼 모델 생성 함수

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense

def create_custom_model(input_dim, output_dim, out_node, n, model_name='model'):
    def create_model():
        model = Sequential(name = model_name)
        for _ in range(n): # layer 생성
            model.add(Dense(out_node, input_dim = input_dim, activation='relu'))
        
        model.add(Dense(output_dim, activation='softmax'))
        model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['acc'])
        return model
    return create_model # 주소 반환(클로저)

models = [create_custom_model(n_features, n_classes, 10, n, 'model_{}'.format(n)) for n in range(1, 4)]
# layer수가 2 ~ 5개 인 모델 생성

for create_model in models:
    print('-------------------------')
    create_model().summary()
    # Total params: 83
    # Total params: 193
    # Total params: 303

- train

history_dict = {}

for create_model in models: # 각 모델 loss, acc 출력
    model = create_model()
    print('Model names :', model.name)
    # 훈련
    history = model.fit(x_train, y_train, batch_size=5, epochs=50, verbose=0, validation_split=0.3)
    # 평가
    score = model.evaluate(x_test, y_test)
    print('test dataset loss', score[0])
    print('test dataset acc', score[1])
    history_dict[model.name] = [history, model]
    
print(history_dict)
# {'model_1': [<tensorflow.python.keras.callbacks.History object at 0x00000273BA4E7280>, <tensorflow.python.keras.engine.sequential.Sequential object at 0x00000273B9B22A90>], ...}

- 시각화

fig, (ax1, ax2) = plt.subplots(2, 1, figsize=(8, 6))
print(fig, ax1, ax2)

for model_name in history_dict: # 각 모델의 acc, val_acc, val_loss
    print('h_d :', history_dict[model_name][0].history['acc'])
    
    val_acc = history_dict[model_name][0].history['val_acc']
    val_loss = history_dict[model_name][0].history['val_loss']
    ax1.plot(val_acc, label=model_name)
    ax2.plot(val_loss, label=model_name)
    ax1.set_ylabel('validation acc')
    ax2.set_ylabel('validation loss')
    ax2.set_xlabel('epochs')
    ax1.legend()
    ax2.legend()

plt.show()

=> model1 < model2 < model3 모델 순으로 성능 우수

- 분류 모델에 대한 성능 평가 : ROC curve

plt.figure()
plt.plot([0, 1], [0, 1], 'k--')

from sklearn.metrics import roc_curve, auc

for model_name in history_dict: # 각 모델의 모델
    model = history_dict[model_name][1]
    y_pred = model.predict(x_test)
    fpr, tpr, _ = roc_curve(y_test.ravel(), y_pred.ravel())
    plt.plot(fpr, tpr, label='{}, AUC value : {:.3}'.format(model_name, auc(fpr, tpr)))

plt.xlabel('fpr')
plt.ylabel('tpr')
plt.title('ROC curve')
plt.legend()
plt.show()

- k-fold 교차 검증 - over fitting 방지

from tensorflow.keras.wrappers.scikit_learn import KerasClassifier
from sklearn.model_selection import cross_val_score

creater_model = create_custom_model(n_features, n_classes, 10, 3)
estimator = KerasClassifier(build_fn = create_model, epochs=50, batch_size=10, verbose=2)
scores = cross_val_score(estimator, x_scaler, y, cv=10)
print('accuracy : {:0.2f}(+/-{:0.2f})'.format(scores.mean(), scores.std()))
# accuracy : 0.92(+/-0.11)

- 모델 3의 성능이 가장 우수

model = Sequential()

model.add(Dense(10, input_dim=4, activation='relu'))
model.add(Dense(10, activation='relu'))
model.add(Dense(10, activation='relu'))
model.add(Dense(3, activation='softmax'))

model.compile(loss='categorical_crossentropy', optimizer='adam',  metrics=['acc'])
model.fit(x_train, y_train, epochs=50, batch_size=10, verbose=2)
print(model.evaluate(x_test, y_test))
# [0.20484387874603271, 0.8888888955116272]

y_pred = np.argmax(model.predict(x_test), axis=1)
print('예측값 :', y_pred)
# 예측값 : [0 1 1 0 2 2 2 0 0 2 1 0 2 1 1 0 1 2 0 0 1 2 2 0 2 1 0 0 1 2 1 2 1 2 2 0 1
#  0 1 2 2 0 1 2 1]

real_y = np.argmax(y_test, axis=1).reshape(-1, 1)
print('실제값 :', real_y.ravel())
# 실제값 : [0 1 1 0 2 1 2 0 0 2 1 0 2 1 1 0 1 1 0 0 1 1 1 0 2 1 0 0 1 2 1 2 1 2 2 0 1
#  0 1 2 2 0 2 2 1]

print('분류 실패 수 :', (y_pred != real_y.ravel()).sum())
# 분류 실패 수 : 5

from sklearn.metrics import confusion_matrix, classification_report, accuracy_score
print(confusion_matrix(real_y, y_pred))
# [[14  0  0]
#  [ 0 17  1]
#  [ 0  1 12]]

print(accuracy_score(real_y, y_pred)) # 0.9555555555555556
print(classification_report(real_y, y_pred))
#               precision    recall  f1-score   support
# 
#            0       1.00      1.00      1.00        14
#            1       0.94      0.94      0.94        18
#            2       0.92      0.92      0.92        13
# 
#     accuracy                           0.96        45
#    macro avg       0.96      0.96      0.96        45
# weighted avg       0.96      0.96      0.96        45

- 새로운 값으로 예측

new_x = [[5.5, 3.3, 1.2, 1.3], [3.5, 3.3, 0.2, 0.3], [1.5, 1.3, 6.2, 6.3]]
new_x = StandardScaler().fit_transform(new_x)
new_pred = model.predict(new_x)
print('예측값 :', np.argmax(new_pred, axis=1).reshape(-1, 1).flatten()) # 예측값 : [1 0 2]

숫자 이미지(MNIST) dataset으로 image 분류 모델

: 숫자 이미지를 metrics로 만들어 이미지에 대한 분류 결과를 mapping한 dataset

- mnist dataset

sdc-james.gitbook.io/onebook/4.-and/5.1./5.1.3.-mnist-dataset

5.1.3. MNIST Dataset 소개

sdc-james.gitbook.io

* ke19_mist.py

import tensorflow as tf
import sys

(x_train, y_train),(x_test, y_test) = tf.keras.datasets.mnist.load_data()
print(len(x_train), len(x_test),len(y_train), len(y_test)) # 60000 10000 60000 10000
print(x_train.shape, y_train.shape)                        # (60000, 28, 28) (60000,)
print(x_train[0])

for i in x_train[0]:
    for j in i:
        sys.stdout.write('%s   '%j)
    sys.stdout.write('\n')

x_train = x_train.reshape(60000, 784).astype('float32') # 3차원 -> 2차원
x_test = x_test.reshape(10000, 784).astype('float32')

import matplotlib.pyplot as plt
plt.imshow(x_train[0].reshape(28,28), cmap='Greys')
plt.show()
print(y_train[0]) # 5

plt.imshow(x_train[1].reshape(28,28), cmap='Greys')
plt.show()
print(y_train[1]) # 0

# 정규화
x_train /= 255 # 0 ~ 255 사이의 값을 0 ~ 1사이로 정규화
x_test /= 255
print(x_train[0])

print(set(y_train)) # {0, 1, 2, 3, 4, 5, 6, 7, 8, 9}

y_train = tf.keras.utils.to_categorical(y_train, 10) # one-hot encoding
y_test = tf.keras.utils.to_categorical(y_test, 10)   # one-hot encoding
print(y_train[0])   # [0. 0. 0. 0. 0. 1. 0. 0. 0. 0.]

- train dataset의 일부를 validation dataset

x_val = x_train[50000:60000]
y_val = y_train[50000:60000]
x_train = x_train[0:50000]
y_train = y_train[0:50000]
print(x_val.shape, ' ', x_train.shape) # (10000, 28, 28)   (50000, 28, 28)
print(y_val.shape, ' ', y_train.shape) # (10000, 10)   (50000, 10)

model = tf.keras.Sequential()

model.add(tf.keras.layers.Dense(512, input_shape=(784, )))
model.add(tf.keras.layers.Activation('relu'))
model.add(tf.keras.layers.Dropout(0.2)) # 20% drop -> over fitting 방지

model.add(tf.keras.layers.Dense(512))
# model.add(tf.keras.layers.Dense(512, kernel_regularizer=tf.keras.regularizers.l2(0.001))) # 가중치 규제
model.add(tf.keras.layers.Activation('relu'))
model.add(tf.keras.layers.Dropout(0.2))

model.add(tf.keras.layers.Dense(10))
model.add(tf.keras.layers.Activation('softmax'))

model.compile(optimizer=tf.keras.optimizers.Adam(lr=0.01), loss='categorical_crossentropy', metrics=['accuracy'])
print(model.summary()) # Total params: 669,706

- 훈련

from tensorflow.keras.callbacks import EarlyStopping
e_stop = EarlyStopping(patience=5, monitor='loss')

history = model.fit(x_train, y_train, epochs=1000, batch_size=256, validation_data=(x_val, y_val),\
                    callbacks=[e_stop], verbose=1)
print(history.history.keys()) # dict_keys(['loss', 'accuracy', 'val_loss', 'val_accuracy'])


print('loss :', history.history['loss'],', val_loss :', history.history['val_loss'])
print('accuracy :', history.history['accuracy'],', val_accuracy :', history.history['val_accuracy'])

plt.plot(history.history['loss'], label='loss')
plt.plot(history.history['val_loss'], label='val_loss')
plt.xlabel('epochs')
plt.ylabel('loss')
plt.legend()
plt.show()

plt.plot(history.history['accuracy'], label='accuracy')
plt.plot(history.history['val_accuracy'], label='val_accuracy')
plt.xlabel('epochs')
plt.ylabel('accuracy')
plt.legend()
plt.show()

score = model.evaluate(x_test, y_test)
print('score loss :', score[0])
# score loss : 0.12402850389480591

print('score accuracy :', score[1])
# score accuracy : 0.9718999862670898

model.save('ke19.hdf5')

model = tf.keras.models.load_model('ke19.hdf5')

- 예측

pred = model.predict(x_test[:1])
print('예측값 :', pred)
# 예측값 : [[4.3060442e-27 3.1736336e-14 3.9369942e-17 3.7753089e-14 6.8288101e-22
#   5.2651956e-21 2.7473105e-33 1.0000000e+00 1.6139679e-21 1.6997739e-14]]
# [7]

import numpy as np
print(np.argmax(pred, 1))
print('실제값 :', y_test[:1])
# 실제값 : [[0. 0. 0. 0. 0. 0. 0. 1. 0. 0.]]
print('실제값 :', np.argmax(y_test[:1], 1))
# 실제값 : [7]

- 새로운 이미지로 분류

from PIL import Image
im = Image.open('num.png')
img = np.array(im.resize((28, 28), Image.ANTIALIAS).convert('L'))
print(img, img.shape) # (28, 28)

plt.imshow(img, cmap='Greys')
plt.show()

from PIL import Image

Image.open('파일경로') : 이미지 파일 open.

Image.ANTIALIAS : 높은 해상도의 사진 또는 영상을 낮은 해상도로 변환하거나 나타낼 시의 깨짐을 최소화 시켜주는 방법.

convert('L') : grey scale로 변환.

data = img.reshape([1, 784])
data = data/255  # 정규화
print(data)

new_pred = model.predict(data)
print('new_pred :', new_pred)
# new_pred : [[4.92454797e-04 1.15842435e-04 6.54530758e-03 5.23587340e-04
#   3.31552816e-04 5.98833859e-01 3.87458414e-01 9.34154059e-07
#   5.55288605e-03 1.45193975e-04]]
print('new_pred :', np.argmax(new_pred, 1))
# new_pred : [5]

이미지 분류 패션 MNIST

- Fashion MNIST

www.kaggle.com/zalando-research/fashionmnist

Fashion MNIST

An MNIST-like dataset of 70,000 28x28 labeled fashion images

www.kaggle.com

* ke20_fasion.py

import tensorflow as tf
from tensorflow import keras
import numpy as np
import matplotlib.pyplot as plt

fashion_mnist = tf.keras.datasets.fashion_mnist
(train_image, train_labels), (test_image, test_labels) = fashion_mnist.load_data()
print(train_image.shape, train_labels.shape, test_image.shape)
# (60000, 28, 28) (60000,)

print(set(train_labels))
# {0, 1, 2, 3, 4, 5, 6, 7, 8, 9}
class_names = ['T-shirt/top', 'Trouser', 'Pullover', 'Dress', 'Coat', 'Sandal', 'Shirt', 'Sneaker', 'Bag', 'Ankle boot']

plt.imshow(train_image[0])
plt.colorbar()
plt.show()

plt.figure(figsize=(10, 10))
for i in range(25):
    plt.subplot(5, 5, i+1)
    plt.xticks([])
    plt.yticks([])
    plt.xlabel(class_names[train_labels[i]])
    plt.imshow(train_image[i])
 
plt.show()

- 정규화

# print(train_image[0])
train_image = train_image/255
# print(train_image[0])
test_image = test_image/255

- 모델 구성

model = tf.keras.Sequential([
    tf.keras.layers.Flatten(input_shape = (28, 28)), # 차원 축소. 일반적으로 생략 가능(자동 동작).
    tf.keras.layers.Dense(512, activation = tf.nn.relu),
    tf.keras.layers.Dense(128, activation = tf.nn.relu),
    tf.keras.layers.Dense(10, activation = tf.nn.softmax)
    ])

model.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy']) # label에 대해서 one-hot encoding

model.fit(train_image, train_labels, batch_size=128, epochs=5, verbose=1)

model.save('ke20.hdf5')

model = tf.keras.models.load_model('ke20.hdf5')

model.compile(optimizer=, loss='sparse_categorical_crossentropy', metrics=) : label에 대해서 one-hot encoding

test_loss, test_acc = model.evaluate(test_image, test_labels)
print('loss :', test_loss)
# loss : 0.34757569432258606
print('acc :', test_acc)
# acc : 0.8747000098228455

pred = model.predict(test_image)
print(pred[0])
# [8.5175507e-06 1.2854183e-06 8.2240956e-07 1.3558407e-05 2.0901878e-06
#  1.3651027e-02 7.2083326e-06 4.6001904e-02 2.0302361e-05 9.4029325e-01]
print('예측값 :', np.argmax(pred[0]))
# 예측값 : 9
print('실제값 :', test_labels[0])
# 실제값 : 9

- 각 이미지 출력용 함수

def plot_image(i, pred_arr, true_label, img):
    pred_arr, true_label, img = pred_arr[i], true_label[i], img[i]
    plt.xticks([])
    plt.yticks([])
    plt.imshow(img, cmap='Greys')
    
    pred_label = np.argmax(pred_arr)
    if pred_label == true_label:
        color = 'blue'
    else:
        color = 'red'
        
    plt.xlabel('{} {:2.0f}% ({})'.format(class_names[pred_label], 100 * np.max(pred_arr), \
                                         class_names[true_label]), color = color)

i = 0
plt.figure(figsize = (6, 3))
plt.subplot(1, 2, 1)
plot_image(i, pred, test_labels, test_image)
plt.show()

def plot_value_arr(i, pred_arr, true_label):
    pred_arr, true_label = pred_arr[i], true_label[i]
    thisplot = plt.bar(range(10), pred_arr)
    plt.ylim([0, 1])
    pred_label = np.argmax(pred_arr)
    thisplot[pred_label].set_color('red')
    thisplot[true_label].set_color('blue')


i = 12
plt.figure(figsize = (6, 3))
plt.subplot(1, 2, 1)
plot_image(i, pred, test_labels, test_image)
plt.subplot(1, 2, 2)
plot_value_arr(i, pred, test_labels)
plt.show()

합성곱 신경망 (Convolutional Neural Network, CNN)

: 원본 이미지(행렬)를 CNN의 필터(행렬)로 합성 곱을 하여 행렬 크기를 줄여 분류한다.

: 부하를 줄이며, 이미지 분류 향상에 영향을 준다.

- CNN

untitledtblog.tistory.com/150

[머신 러닝/딥 러닝] 합성곱 신경망 (Convolutional Neural Network, CNN)과 학습 알고리즘

1. 이미지 처리와 필터링 기법 필터링은 이미지 처리 분야에서 광범위하게 이용되고 있는 기법으로써, 이미지에서 테두리 부분을 추출하거나 이미지를 흐릿하게 만드는 등의 기능을 수행하기

untitledtblog.tistory.com

=> input -> [ conv -> relu -> pooling ] -> ... -> Flatten -> Dense -> ... -> output

- MNIST dataset으로 cnn진행

* ke21_cnn.py

import tensorflow as tf
from tensorflow.keras import datasets, models, layers

(train_images, train_labels),(test_images, test_labels) = datasets.mnist.load_data()
print(train_images.shape)                    # (60000, 28, 28)

from tensorflow.keras import datasets

datasets.mnist.load_data() : mnist dataset

- CNN : 3차원을 4차원(+channel(RGB))으로 구조 변경

train_images = train_images.reshape((60000, 28, 28, 1))
print(train_images.shape, train_images.ndim) # (60000, 28, 28, 1) 4
train_images = train_images / 255.0 # 정규화
print(train_images[0])

test_images = test_images.reshape((10000, 28, 28, 1))
test_images = test_images / 255.0 # 정규화

print(train_labels[:3]) # [5 0 4]

channel 수 : 흑백 - 1, 컬러 - 3

- 모델

input_shape = (28, 28, 1)
model = models.Sequential()

# 형식 : tf.keras.layers.Conv2D(filters, kernel_size, strides=(1, 1), padding='valid', ...
model.add(layers.Conv2D(64, kernel_size = (3, 3), strides=(1, 1), padding ='valid',\
                        activation='relu', input_shape=input_shape))
model.add(layers.MaxPooling2D(pool_size=(2, 2), strides=None))
model.add(layers.Dropout(0.2))

model.add(layers.Conv2D(32, kernel_size = (3, 3), strides=(1, 1), padding ='valid', activation='relu'))
model.add(layers.MaxPooling2D(pool_size=(2, 2), strides=None))
model.add(layers.Dropout(0.2))

model.add(layers.Conv2D(16, kernel_size = (3, 3), strides=(1, 1), padding ='valid', activation='relu'))
model.add(layers.MaxPooling2D(pool_size=(2, 2), strides=None))
model.add(layers.Dropout(0.2))

model.add(layers.Flatten()) # Fully Connect layer - CNN 처리된 데이터를 1차원 자료로 변경

from tensorflow.keras import layers

layers.Conv2D(output수, kernel_size=, strides=, padding=, activation=, input_shape=) : CNN Conv

strides : 보폭, None - pool_size와 동일
padding : valid - 영역 밖에 0으로 채우지 않고 곱 진행, same - 영역 밖에 0으로 채우고 곱 진행.

layers.MaxPooling2D(pool_size=, strides=) : CNN Pooling

layers.Flatten() : Fully Connect layer - CNN 처리된 데이터를 1차원 자료로 변경

- Conv2D

www.tensorflow.org/api_docs/python/tf/keras/layers/Conv2D

tf.keras.layers.Conv2D | TensorFlow Core v2.4.1

2D convolution layer (e.g. spatial convolution over images).

www.tensorflow.org

- 모델

model.add(layers.Dense(64, activation='relu'))
model.add(layers.Dense(32, activation='relu'))
model.add(layers.Dense(10, activation='softmax'))

print(model.summary())

- 학습설정

model.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])
# label에 대해서 one-hot encoding

model.compile(optimizer='', loss='sparse_categorical_crossentropy', metrics=) : label에 대해서 one-hot encoding.

- 훈련

from tensorflow.keras.callbacks import EarlyStopping
early_stop = EarlyStopping(monitor='val_loss', patience=3) # 조기 종료

histoy = model.fit(train_images, train_labels, batch_size=128, epochs=100, verbose=1, validation_split=0.2,\
                   callbacks = [early_stop])

- 평가

train_loss, train_acc = model.evaluate(train_images, train_labels)
print('train_loss :', train_loss)
print('train_acc :', train_acc)

test_loss, test_acc = model.evaluate(test_images, test_labels)
print('test_loss :', test_loss)
print('test_acc :', test_acc)
# test_loss : 0.06314415484666824
# test_acc : 0.9812999963760376

- 모델 저장

model.save('ke21.h5')

model = tf.keras.models.load_model('ke21.h5')

import pickle
histoy = histoy.history # loss, acc

with open('data.pickle', 'wb') as f: # 파일 저장
    pickle.dump(histoy)              # 객체 저장

with open('data.pickle', 'rb') as f: # 파일 읽기
    history = pickle.load(f)         # 객체 읽기

import pickle

pickle.dump(객체) : 객체 저장

pickle.load(f) : 객체 불러오기

- 예측

import numpy as np
print('예측값 :', np.argmax(model.predict(test_images[:1])))
print('예측값 :', np.argmax(model.predict(test_images[[0]])))
print('실제값 :', test_labels[0])
# 예측값 : 7
# 예측값 : 7
# 실제값 : 7

print('예측값 :', np.argmax(model.predict(test_images[[1]])))
print('실제값 :', test_labels[1])
# 예측값 : 2
# 실제값 : 2

- acc와 loss로 시각화

import matplotlib.pyplot as plt

def plot_acc(title = None):
    plt.plot(history['accuracy'])
    plt.plot(history['val_accuracy'])
    if title is not None:
        plt.title(title)
    plt.ylabel(title)
    plt.xlabel('epoch')
    plt.legend(['train data', 'validation data'], loc = 0)
    
plot_acc('accuracy')
plt.show()

def plot_loss(title = None):
    plt.plot(history['loss'])
    plt.plot(history['val_loss'])
    if title is not None:
        plt.title(title)
    plt.ylabel(title)
    plt.xlabel('epoch')
    plt.legend(['train data', 'validation data'], loc = 0)
    
plot_loss('loss')
plt.show()

Tensor : image process, CNN

cafe.daum.net/flowlife/S2Ul/3

Daum 카페

cafe.daum.net

- 딥러닝 적용사례

brunch.co.kr/@itschloe1/23

딥러닝의 30가지 적용 사례

비전문가들도 이해할 수 있을 구체적 예시 | *본 글은 Yaron Hadad의 블로그 'http://www.yaronhadad.com/deep-learning-most-amazing-applications/'를 동의 하에 번역하였습니다. 최근 몇 년간 딥러닝은 컴퓨터 비전부

brunch.co.kr

CNN - 이미지 분류

RNN - 시계열. ex) 자연어, ..

GAN - 창조

- CNN

taewan.kim/post/cnn/

CNN, Convolutional Neural Network 요약

Convolutional Neural Network, CNN을 정리합니다.

taewan.kim

* tf_cnn_mnist_subclassing.ipynb

- MNIST로 cnn 연습

import tensorflow as tf
from tensorflow.keras import datasets, models, layers, Model
from tensorflow.keras.layers import Dense, Flatten, Conv2D, MaxPool2D, Dropout

(train_images, train_labels),(test_images, test_labels) = tf.keras.datasets.mnist.load_data()
print(train_images.shape)                    # (60000, 28, 28)

train_images = train_images.reshape((60000, 28, 28, 1))
print(train_images.shape, train_images.ndim) # (60000, 28, 28, 1) 4
train_images = train_images / 255.0 # 정규화
#print(train_images[0])

test_images = test_images.reshape((10000, 28, 28, 1))
test_images = test_images / 255.0 # 정규화

print(train_labels[:3]) # [5 0 4]

- 데이터 섞기

import numpy as np
x = np.random.sample((5,2))
print(x)
'''
[[0.19516051 0.38639727]
 [0.89418845 0.05847686]
 [0.16835491 0.11172334]
 [0.8109798  0.68812899]
 [0.03361333 0.83081767]]
'''
dset = tf.data.Dataset.from_tensor_slices(x)
print(dset) # <TensorSliceDataset shapes: (2,), types: tf.float64>
dset = tf.data.Dataset.from_tensor_slices(x).shuffle(1000).batch(2) # batch(묶음수), shuffle(buffer수) : 섞음 
print(dset) # <BatchDataset shapes: (None, 2), types: tf.float64>
for a in dset:
    print(a)
    '''
    tf.Tensor(
[[0.93919653 0.52250196]
 [0.44236167 0.53000042]
 [0.69057762 0.32003977]], shape=(3, 2), dtype=float64)
tf.Tensor(
[[0.09166211 0.67060753]
 [0.39949866 0.57685399]], shape=(2, 2), dtype=float64)
 '''

tf.data.Dataset.from_tensor_slices(x).shuffle(1000).batch(3) : batch(묶음수), shuffle(buffer수) : 섞음

- MNIST이 train data를 섞기

train_ds = tf.data.Dataset.from_tensor_slices(((train_images, train_labels))).shuffle(60000).batch(28)
test_ds = tf.data.Dataset.from_tensor_slices(((test_images, test_labels))).batch(28)
print(train_ds)
print(test_ds)

- 모델 생성방법 : subclassing API 사용

class MyModel(Model):
    def __init__(self):
        super(MyModel, self).__init__()
        self.conv1 = Conv2D(filters=32, kernel_size = [3,3], padding ='valid', activation='relu')
        self.pool1 = MaxPool2D((2, 2))

        self.conv2 = Conv2D(filters=32, kernel_size = [3,3], padding ='valid', activation='relu')
        self.pool2 = MaxPool2D((2, 2))

        self.flatten = Flatten(dtype='float32')

        self.d1 = Dense(64, activation='relu')
        self.drop1 = Dropout(rate = 0.3)
        self.d2 = Dense(10, activation='softmax')

    def call(self, inputs):
        net = self.conv1(inputs)
        net = self.pool1(net)
        net = self.conv2(net)
        net = self.pool2(net)
        net = self.flatten(net)
        net = self.d1(net)
        net = self.drop1(net)
        net = self.d2(net)
        return net

model = MyModel()
temp_inputs = tf.keras.Input(shape=(28, 28, 1))
model(temp_inputs)
print(model.summary())
'''
Layer (type)                 Output Shape              Param #   
=================================================================
conv2d_2 (Conv2D)            multiple                  320       
_________________________________________________________________
max_pooling2d (MaxPooling2D) multiple                  0         
_________________________________________________________________
conv2d_3 (Conv2D)            multiple                  9248      
_________________________________________________________________
max_pooling2d_1 (MaxPooling2 multiple                  0         
_________________________________________________________________
flatten (Flatten)            multiple                  0         
_________________________________________________________________
dense (Dense)                multiple                  51264     
_________________________________________________________________
dropout (Dropout)            multiple                  0         
_________________________________________________________________
dense_1 (Dense)              multiple                  650       
=================================================================
Total params: 61,482
'''

- 일반적 모델학습 방법1

loss_object = tf.keras.losses.SparseCategoricalCrossentropy()
optimizer = tf.keras.optimizers.Adam()

# 일반적 모델학습 방법1
model.compile(optimizer=optimizer, loss=loss_object, metrics=['acc'])
model.fit(train_images, train_labels, batch_size=128, epochs=5, verbose=2, max_queue_size=10, workers=1, use_multiprocessing=True)
# use_multiprocessing : 프로세스 기반의 
score = model.evaluate(test_images, test_labels)
print('test loss :', score[0])
print('test acc :', score[1])
# test loss : 0.028807897120714188
# test acc : 0.9907000064849854

import numpy as np
print('예측값 :', np.argmax(model.predict(test_images[:2]), 1))
print('실제값 :', test_labels[:2])
# 예측값 : [7 2]
# 실제값 : [7 2]

- 모델 학습방법2: GradientTape

train_loss = tf.keras.metrics.Mean()
train_accuracy = tf.keras.metrics.SparseCategoricalAccuracy()

test_loss = tf.keras.metrics.Mean()
test_accuracy = tf.keras.metrics.SparseCategoricalAccuracy()

@tf.function
def train_step(images, labels):
    with tf.GradientTape() as tape:
        predictions = model(images)
        loss = loss_object(labels, predictions)

    gradients = tape.gradient(loss, model.trainable_variables)
    optimizer.apply_gradients(zip(gradients, model.trainable_variables))
    train_loss(loss) # 가중치 평균 계산       loss = loss_object(labels, predictions)
    train_accuracy(labels, predictions)

@tf.function
def test_step(images, labels):
    predictions = model(images)
    t_loss = loss_object(labels, predictions)
    test_loss(t_loss)
    test_accuracy(labels, predictions)

EPOCHS = 5

for epoch in range(EPOCHS):
    for train_images, train_labels in train_ds:
        train_step(train_images, train_labels)
    
    for test_images, test_labels in test_ds:
        test_step(test_images, test_labels)
    
    templates = 'epochs:{}, train_loss:{}, train_acc:{}, test_loss:{}, test_acc:{}'
    print(templates.format(epoch + 1, train_loss.result(), train_accuracy.result()*100,\
                           test_loss.result(), test_accuracy.result()*100))

print('예측값 :', np.argmax(model.predict(test_images[:2]), 1))
print('실제값 :', test_labels[:2].numpy())
# 예측값 : [3 4]
# 실제값 : [3 4]

- image data generator

: 샘플수가 적을 경우 사용.

chancoding.tistory.com/93

[Keras] CNN ImageDataGenerator : 손글씨 글자 분류

안녕하세요. 이전 포스팅을 통해서 CNN을 활용한 직접 만든 손글씨 이미지 분류 작업을 진행했습니다. 생각보다 데이터가 부족했음에도 80% 정도의 정확도를 보여주었습니다. 이번 포스팅에서는

chancoding.tistory.com

+ 이미지 보강

* tf_cnn_image_generator.ipynb

import tensorflow as tf
from tensorflow.keras.datasets import mnist
from tensorflow.keras.utils import to_categorical
from tensorflow.keras.callbacks import ModelCheckpoint, EarlyStopping
import matplotlib.pyplot as plt
import numpy as np
import sys

np.random.seed(0)
tf.random.set_seed(3)

(x_train, y_train), (x_test, y_test) = mnist.load_data()

x_train = x_train.reshape(-1, 28, 28, 1).astype('float32') /255
x_test = x_test.reshape(-1, 28, 28, 1).astype('float32') /255

#print(x_train[0])
# print(y_train[0])
y_train = to_categorical(y_train)
print(y_train[0]) # [0. 0. 0. 0. 0. 1. 0. 0. 0. 0.]
y_test = to_categorical(y_test)

- 이미지 보강 클래스 : 기존 이미지를 좌우대칭, 회전, 기울기, 이동 등을 통해 이미지의 양을 늘림

from tensorflow.keras.preprocessing.image import ImageDataGenerator
# 연습
img_gen = ImageDataGenerator(
    rotation_range = 10, # 회전 범위
    zoom_range = 0.1, # 확대 축소
    shear_range = 0.5, # 축 기준 
    width_shift_range = 0.1, # 평행이동
    height_shift_range = 0.1, # 수직이동
    horizontal_flip = True, # 좌우 반전
    vertical_flip = False # 상하 반전
)
augument_size = 100
x_augument = img_gen.flow(np.tile(x_train[0].reshape(28*28), 100).reshape(-1, 28, 28, 1),
                          np.zeros(augument_size),
                          batch_size = augument_size,
                          shuffle = False).next()[0]
plt.figure(figsize=(10, 10))
for c in range(100):
    plt.subplot(10, 10, c+1)
    plt.axis('off')
    plt.imshow(x_augument[c].reshape(28, 28), cmap='gray')
plt.show()

img_generate = ImageDataGenerator(
    rotation_range = 10, # 회전 범위
    zoom_range = 0.1, # 확대 축소
    shear_range = 0.5, # 축 기준 
    width_shift_range = 0.1, # 평행이동
    height_shift_range = 0.1, # 수직이동
    horizontal_flip = False, # 좌우 반전
    vertical_flip = False # 상하 반전
)
augument_size = 30000 # 변형 이미지 3만개
randIdx = np.random.randint(x_train.shape[0], size = augument_size)
x_augment = x_train[randIdx].copy()
y_augment = y_train[randIdx].copy()

x_augument = img_generate.flow(x_augment,
                          np.zeros(augument_size),
                          batch_size = augument_size,
                          shuffle = False).next()[0]

# 원래 이미지에 증식된 이미지를 추가
x_train = np.concatenate((x_train, x_augment))
y_train = np.concatenate((y_train, y_augment))
print(x_train.shape) # (90000, 28, 28, 1)

model = tf.keras.models.Sequential([
    tf.keras.layers.Conv2D(filters=32, kernel_size=(3, 3), input_shape=(28, 28, 1), padding='same', activation='relu'),
    tf.keras.layers.MaxPooling2D(pool_size=(2,2)),
    tf.keras.layers.Dropout(0.3),

    tf.keras.layers.Conv2D(filters=32, kernel_size=(3, 3), input_shape=(28, 28, 1), padding='same', activation='relu'),
    tf.keras.layers.MaxPooling2D(pool_size=(2,2)),

    tf.keras.layers.Flatten(),

    tf.keras.layers.Dense(units=128, activation='relu'),
    tf.keras.layers.Dropout(0.3),
    tf.keras.layers.Dense(units=64, activation='relu'),
    tf.keras.layers.Dropout(0.3),
    tf.keras.layers.Dense(units=10, activation='softmax')
])
model.compile(optimizer='Adam', loss='categorical_crossentropy', metrics=['accuracy'])
print(model.summary())
'''
Layer (type)                 Output Shape              Param #   
=================================================================
conv2d_8 (Conv2D)            (None, 28, 28, 32)        320       
_________________________________________________________________
max_pooling2d_7 (MaxPooling2 (None, 14, 14, 32)        0         
_________________________________________________________________
dropout_6 (Dropout)          (None, 14, 14, 32)        0         
_________________________________________________________________
conv2d_9 (Conv2D)            (None, 14, 14, 32)        9248      
_________________________________________________________________
max_pooling2d_8 (MaxPooling2 (None, 7, 7, 32)          0         
_________________________________________________________________
flatten_1 (Flatten)          (None, 1568)              0         
_________________________________________________________________
dense_2 (Dense)              (None, 128)               200832    
_________________________________________________________________
dropout_7 (Dropout)          (None, 128)               0         
_________________________________________________________________
dense_3 (Dense)              (None, 64)                8256      
_________________________________________________________________
dropout_8 (Dropout)          (None, 64)                0         
_________________________________________________________________
dense_4 (Dense)              (None, 10)                650       
=================================================================
Total params: 219,306
'''

early_stop = EarlyStopping(monitor='val_loss', patience=3)
history = model.fit(x_train, y_train, validation_split=0.2, epochs=100, batch_size=64, \
                     verbose=2, callbacks=[early_stop])
print('Accuracy : %.3f'%(model.evaluate(x_test, y_test)[1]))
# Accuracy : 0.992

print('accuracy :%.3f'%(model.evaluate(x_test, y_test)[1]))
# accuracy :0.992

# 시각화
plt.figure(figsize=(12,4))
plt.subplot(1, 2, 1)
plt.plot(history.history['accuracy'], marker = 'o', c='red', label='acc')
plt.plot(history.history['val_accuracy'], marker = 's', c='blue', label='val_acc')
plt.xlabel('epochs')
plt.ylim(0.5, 1)
plt.legend(loc='lower right')

plt.subplot(1, 2, 2)
plt.plot(history.history['loss'], marker = 'o', c='red', label='loss')
plt.plot(history.history['val_loss'], marker = 's', c='blue', label='val_loss')
plt.xlabel('epochs')
plt.legend(loc='upper right')
plt.show()

'BACK END > Deep Learning' 카테고리의 다른 글

[딥러닝] RNN, NLP (0)	2021.04.05
[딥러닝] Tensorflow - 이미지 분류 (0)	2021.04.01
[딥러닝] Keras - Linear (0)	2021.03.23
[딥러닝] TensorFlow (0)	2021.03.22
[딥러닝] TensorFlow 환경설정 (0)	2021.03.22

[딥러닝] Keras - Linear

2021. 3. 23. 13:16

Keras

: Layer로 이루어진 모델을 생성.

: layer간의 개별적인 parameter 운용이 가능.

- Keras Sequential API

keras.io/ko/models/sequential/

Sequential - Keras Documentation

Sequential 모델 API 시작하려면, 케라스 Sequential 모델 가이드를 읽어보십시오. Sequential 모델 메서드 compile compile(optimizer, loss=None, metrics=None, loss_weights=None, sample_weight_mode=None, weighted_metrics=None, target_te

keras.io

- Keras 기본 개념 및 모델링 순서

cafe.daum.net/flowlife/S2Ul/10

Daum 카페

cafe.daum.net

- activation function

선형회귀 : Linear : mse

이항분류 : step function/sigmoid function/Relu

다항분류 : softmax

layer : 병렬처리 node 구조

dense : layer 정의

sequential : hidden layer의 network 구조. 내부 relu + 종단 sigmoid or softmax

실제값과 예측값에 차이가 클 경우 feedback(역전파 - backpropagation)으로 모델 개선

실제값과 예측값이 완전히 같은 경우 overfitting 문제 발생.

- 역전파

m.blog.naver.com/samsjang/221033626685

[35편] 딥러닝의 핵심 개념 - 역전파(backpropagation) 이해하기1

1958년 퍼셉트론이 발표된 후 같은 해 7월 8일자 뉴욕타임즈는 앞으로 조만간 걷고, 말하고 자아를 인식하...

blog.naver.com

Keras 모듈로 논리회로 처리 모델(분류)

* ke1.py

import tensorflow as tf
import numpy as np

print(tf.keras.__version__)

1. 데이터 수집 및 가공

x = np.array([[0,0],[0,1],[1,0],[1,1]])
#y = np.array([0,1,1,1]) # or
#y = np.array([0,0,0,1]) # and
y = np.array([0,1,1,0]) # xor : node가 1인 경우 처리 불가

2. 모델 생성(네트워크 구성)

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Activation

model = Sequential([
    Dense(input_dim =2, units=1),
    Activation('sigmoid')
    ])
    
model = Sequential()
model.add(Dense(units=1, input_dim=2))
model.add(Activation('sigmoid'))
# input_dim : 입력층의 뉴런 수
# units : 출력 뉴런 수

from tensorflow.keras.models import Sequential

from tensorflow.keras.layers import Dense, Activation

model = Sequential() : 네트워크 생성

model.add(함수) : 모델 속성 설정

Dense(units=, input_dim=) : Layer 정의

input_dim : 입력층의 뉴런 수
units : 출력 뉴런 수

init : 가중치 초기화 방법. uniform(균일분포)/normal(가우시안 분포)

Activation('수식명') : 활성함수 설정. linear(선형회귀)/sigmoid(이진분류)/softmax(다항분류)/relu(은닉층)

3. 모델 학습과정 설정

model.compile(optimizer='sgd', loss='binary_crossentropy', metrics=['accuracy'])
model.compile(optimizer='rmsprop', loss='binary_crossentropy', metrics=['accuracy'])
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])

from tensorflow.keras.optimizers import SGD, RMSprop, Adam

model.compile(optimizer=SGD(lr=0.01), loss='binary_crossentropy', metrics=['accuracy'])
model.compile(optimizer=SGD(lr=0.01, momentum=0.9), loss='binary_crossentropy', metrics=['accuracy'])
model.compile(optimizer=RMSprop(lr=0.01), loss='binary_crossentropy', metrics=['accuracy'])
model.compile(optimizer=Adam(lr=0.01), loss='binary_crossentropy', metrics=['accuracy'])

from tensorflow.keras.optimizers import SGD, RMSprop, Adam

compile(optimizer=, loss='binary_crossentropy', metrics=['accuracy']) : 학습설정

SGD : 확률적 경사 하강법(Stochastic Gradient Descent)
RMSprop : Adagrad는 학습을 계속 진행한 경우에는, 나중에 가서는 학습률이 지나치게 떨어진다는 단점
Adam : Momentum과 RMSprop의 장점을 이용한 방법

lr : learning rate. 학습률.
momentum : 관성

4. 모델 학습

model.fit(x, y, epochs=1000, batch_size=1, verbose=1)

model.fit(x, y, epochs=, batch_size=, verbose=) : 모델 학습

epochs : 학습횟수

batch_size : 가중치 갱신시 묶음 횟수, (가중치 갱신 횟수 = 데이터수 / batch size), 속도에 영향을 줌.

5. 모델평가

loss_metrics = model.evaluate(x, y)
print('loss_metrics :', loss_metrics)
# loss_metrics : [0.4869873821735382, 0.75]

evaluate(feature, label) : 모델 성능평가

6. 예측값

pred = model.predict(x)
print('pred :\n', pred)
'''
 [[0.36190987]
 [0.85991323]
 [0.8816227 ]
 [0.98774564]]
'''
pred = (model.predict(x) > 0.5).astype('int32')
print('pred :\n', pred.flatten())
#  [0 1 1 1]

7. 모델 저장

# 완벽한 모델이라 판단되면 모델을 저장
model.save('test.hdf5')
del model # 모델 삭제

from tensorflow.keras.models import load_model
model2 = load_model('test.hdf5')
pred2 = (model2.predict(x) > 0.5).astype('int32')
print('pred2 :\n', pred2.flatten())

model.save('파일명.hdf5') : 모델 삭제

del model : 모델 삭제

from tensorflow.keras.models import load_model
model = load_model('파일명.hdf5') : 모델 불러오기

논리 게이트 XOR 해결을 위해 Node 추가

* ke2.py

import tensorflow as tf
import numpy as np

# 1. 데이터 수집 및 가공
x = np.array([[0,0],[0,1],[1,0],[1,1]])
y = np.array([0,1,1,0]) # xor : node가 1인 경우 처리 불가

# 2. 모델 생성(네트워크 구성)
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Activation

model = Sequential()
model.add(Dense(units=5, input_dim=2))
model.add(Activation('relu'))
model.add(Dense(units=5))
model.add(Activation('relu'))
model.add(Dense(units=1))
model.add(Activation('sigmoid'))

model.add(Dense(units=5, input_dim=2, activation='relu'))
model.add(Dense(5, activation='relu' ))
model.add(Dense(1, activation='sigmoid'))

# 모델 파라미터 확인
print(model.summary())
'''
Layer (type)                 Output Shape              Param #   
=================================================================
dense (Dense)                (None, 5)                 15        
_________________________________________________________________
activation (Activation)      (None, 5)                 0         
_________________________________________________________________
dense_1 (Dense)              (None, 5)                 30        
_________________________________________________________________
activation_1 (Activation)    (None, 5)                 0         
_________________________________________________________________
dense_2 (Dense)              (None, 1)                 6         
_________________________________________________________________
activation_2 (Activation)    (None, 1)                 0         
=================================================================
Total params: 51
Trainable params: 51
Non-trainable params: 0
'''

=> Param : (2+1) * 5 = 15 -> (5+1) * 5 = 30 -> (5+1)*1 = 6

: (input_dim + 1) * units
=> Total params : 15 + 30 + 6 = 51

# 3. 모델 학습과정 설정
model.compile(optimizer=Adam(0.01), loss='binary_crossentropy', metrics=['accuracy'])

# 4. 모델 학습
history = model.fit(x, y, epochs=100, batch_size=1, verbose=1)

# 5. 모델 성능 평가
loss_metrics = model.evaluate(x, y)
print('loss_metrics :', loss_metrics) # loss_metrics : [0.13949958980083466, 1.0]

pred = (model.predict(x) > 0.5).astype('int32')
print('pred :\n', pred.flatten())
print('------------')
print(model.input)
print(model.output)
print(model.weights) # kernel(가중치), bias 값 확인.

print('------------')
print(history.history['loss'])     # 학습 중의 데이터 확인
print(history.history['accuracy'])

# 모델학습 시 발생하는 loss 값 시각화
import matplotlib.pyplot as plt
plt.plot(history.history['loss'], label='train loss')
plt.xlabel('epochs')
plt.show()

import pandas as pd
pd.DataFrame(history.history).plot(figsize=(8, 5))
plt.show()

- 시뮬레이션

playground.tensorflow.org/#activation=tanh&batchSize=10&dataset=circle&regDataset=reg-plane&learningRate=0.03&regularizationRate=0&noise=0&networkShape=4,2&seed=0.26884&showTestData=false&discretize=false&percTrainData=50&x=true&y=true&xTimesY=false&xSquared=false&ySquared=false&cosX=false&sinX=false&cosY=false&sinY=false&collectStats=false&problem=classification&initZero=false&hideText=false

Tensorflow — Neural Network Playground

Tinker with a real neural network right here in your browser.

playground.tensorflow.org

cost function

: cost(loss, 손실)가 최소가 되는 weight 값 찾기

* ke3.py

import tensorflow as tf
import matplotlib.pyplot as plt

x = [1, 2, 3]
y = [1, 2, 3]
b = 0

w = 1
hypothesis = x * w + b # 예측값
cost =  tf.reduce_sum(tf.pow(hypothesis - y, 2)) / len(x)

w_val = []
cost_val = []

for i in range(-30, 50):
    feed_w = i * 0.1 # 0.1 : learning rate(학습률)
    hypothesis = tf.multiply(feed_w, x) + b
    cost =  tf.reduce_mean(tf.square(hypothesis - y))
    cost_val.append(cost)
    w_val.append(feed_w)
    print(str(i) + ' ' + ', cost:' + str(cost.numpy()) + ', w:', str(feed_w))
    
plt.plot(w_val, cost_val)
plt.xlabel('w')
plt.ylabel('cost')
plt.show()

Gradient Tape()을 이용한 최적의 w 얻기

: 경사하강법으로 cost를 최소화

* ke4.py

# 단순 선형회귀 예측 모형 작성
# x = 5일 때 f(x) = 50에 가까워지는 w 값 찾기

import tensorflow as tf
import numpy as np

x = tf.Variable(5.0)
w = tf.Variable(0.0)

@tf.function
def train_step():
    with tf.GradientTape() as tape: # 자동 미분을 위한 API 제공
        #print(tape.watch(w))
        y = tf.multiply(w, x) + 0
        loss = tf.square(tf.subtract(y, 50)) # (예상값 - 실제값)의 제곱
    grad = tape.gradient(loss, w)  # 자동 미분
    mu = 0.01 # 학습율
    w.assign_sub(mu * grad)
    return loss

for i in range(10):
    loss = train_step()
    print('{:1}, w:{:4.3}, loss:{:4.5}'.format(i, w.numpy(), loss.numpy()))
'''
0, w: 5.0, loss:2500.0
1, w: 7.5, loss:625.0
2, w:8.75, loss:156.25
3, w:9.38, loss:39.062
4, w:9.69, loss:9.7656
5, w:9.84, loss:2.4414
6, w:9.92, loss:0.61035
7, w:9.96, loss:0.15259
8, w:9.98, loss:0.038147
9, w:9.99, loss:0.0095367
'''

tf.GradientTape() :

gradient(loss, w) : 자동미분

# 옵티마이저 객체 사용
opti = tf.keras.optimizers.SGD()

x = tf.Variable(5.0)
w = tf.Variable(0.0)

@tf.function
def train_step2():
    with tf.GradientTape() as tape:          # 자동 미분을 위한 API 제공
        y = tf.multiply(w, x) + 0
        loss = tf.square(tf.subtract(y, 50)) # (예상값 - 실제값)의 제곱
    grad = tape.gradient(loss, w)            # 자동 미분
    opti.apply_gradients([(grad, w)])
    return loss

for i in range(10):
    loss = train_step2()
    print('{:1}, w:{:4.3}, loss:{:4.5}'.format(i, w.numpy(), loss.numpy()))

opti = tf.keras.optimizers.SGD() :

opti.apply_gradients([(grad, w)]) :

# 최적의 기울기, y절편 구하기
opti = tf.keras.optimizers.SGD()

x = tf.Variable(5.0)
w = tf.Variable(0.0)
b = tf.Variable(0.0)

@tf.function
def train_step3():
    with tf.GradientTape() as tape:          # 자동 미분을 위한 API 제공
        #y = tf.multiply(w, x) + b
        y = tf.add(tf.multiply(w, x), b)
        loss = tf.square(tf.subtract(y, 50)) # (예상값 - 실제값)의 제곱
    grad = tape.gradient(loss, [w, b])       # 자동 미분
    opti.apply_gradients(zip(grad, [w, b]))
    return loss

w_val = []     # 시각화 목적으로 사용
cost_val = []

for i in range(10):
    loss = train_step3()
    print('{:1}, w:{:4.3}, loss:{:4.5}, b:{:4.3}'.format(i, w.numpy(), loss.numpy(), b.numpy()))
    w_val.append(w.numpy())
    cost_val.append(loss.numpy())

'''
0, w: 5.0, loss:2500.0, b: 1.0
1, w: 7.4, loss:576.0, b:1.48
2, w:8.55, loss:132.71, b:1.71
3, w: 9.1, loss:30.576, b:1.82
4, w:9.37, loss:7.0448, b:1.87
5, w: 9.5, loss:1.6231, b: 1.9
6, w:9.56, loss:0.37397, b:1.91
7, w:9.59, loss:0.086163, b:1.92
8, w: 9.6, loss:0.019853, b:1.92
9, w:9.61, loss:0.0045738, b:1.92
'''
    
import matplotlib.pyplot as plt
plt.plot(w_val, cost_val, 'o')
plt.xlabel('w')
plt.ylabel('cost')
plt.show()

# 선형회귀 모델작성
opti = tf.keras.optimizers.SGD()

w = tf.Variable(tf.random.normal((1,)))
b = tf.Variable(tf.random.normal((1,)))

@tf.function
def train_step4(x, y):
    with tf.GradientTape() as tape:          # 자동 미분을 위한 API 제공
        hypo = tf.add(tf.multiply(w, x), b)
        loss = tf.reduce_mean(tf.square(tf.subtract(hypo, y))) # (예상값 - 실제값)의 제곱
    grad = tape.gradient(loss, [w, b])       # 자동 미분
    opti.apply_gradients(zip(grad, [w, b]))
    return loss

x = [1., 2., 3., 4., 5.]      # feature 
y = [1.2, 2.0, 3.0, 3.5, 5.5] # label

w_vals = []     # 시각화 목적으로 사용
loss_vals = []

for i in range(100):
    loss_val = train_step4(x, y)
    loss_vals.append(loss_val.numpy())
    if i % 10 ==0:
        print(loss_val)
    w_vals.append(w.numpy())
    
print('loss_vals :', loss_vals)
print('w_vals :', w_vals)
# loss_vals : [2.457926, 1.4767673, 0.904997, 0.57179654, 0.37762302, 0.26446754, 0.19852567, 0.16009742, 0.13770261, 0.12465141, 0.1170452, 0.112612054, 0.11002797, 0.10852148, 0.10764296, 0.10713041, 0.10683115, 0.10665612, 0.10655358, 0.10649315, 0.10645743, 0.10643599, 0.106422946, 0.10641475, 0.10640935, 0.10640564, 0.10640299, 0.10640085, 0.106398985, 0.106397435, 0.10639594, 0.10639451, 0.10639312, 0.10639181, 0.10639049, 0.10638924, 0.1063879, 0.10638668, 0.10638543, 0.10638411, 0.10638293, 0.1063817, 0.10638044, 0.10637925, 0.106378004, 0.10637681, 0.10637561, 0.106374465, 0.10637329, 0.10637212, 0.10637095, 0.10636979, 0.10636864, 0.10636745, 0.10636636, 0.10636526, 0.10636415, 0.10636302, 0.10636191, 0.10636077, 0.10635972, 0.10635866, 0.10635759, 0.10635649, 0.1063555, 0.10635439, 0.10635338, 0.10635233, 0.10635128, 0.106350325, 0.106349275, 0.1063483, 0.106347285, 0.10634627, 0.10634525, 0.10634433, 0.10634329, 0.10634241, 0.10634136, 0.10634048, 0.10633947, 0.106338575, 0.10633757, 0.10633665, 0.10633578, 0.10633484, 0.10633397, 0.106333, 0.10633211, 0.10633123, 0.106330395, 0.10632948, 0.10632862, 0.1063277, 0.10632684, 0.10632604, 0.10632517, 0.10632436, 0.106323466, 0.10632266]
# w_vals : [array([1.3279629], dtype=float32), array([1.2503898], dtype=float32), array([1.1911799], dtype=float32), array([1.145988], dtype=float32), array([1.1114972], dtype=float32), array([1.0851754], dtype=float32), array([1.0650897], dtype=float32), array([1.0497644], dtype=float32), array([1.0380731], dtype=float32), array([1.0291559], dtype=float32), array([1.0223563], dtype=float32), array([1.0171733], dtype=float32), array([1.0132244], dtype=float32), array([1.0102174], dtype=float32), array([1.0079296], dtype=float32), array([1.0061907], dtype=float32), array([1.0048708], dtype=float32), array([1.0038706], dtype=float32), array([1.0031146], dtype=float32), array([1.002545], dtype=float32), array([1.0021176], dtype=float32), array([1.0017987], dtype=float32), array([1.0015627], dtype=float32), array([1.0013899], dtype=float32), array([1.0012653], dtype=float32), array([1.0011774], dtype=float32), array([1.0011177], dtype=float32), array([1.0010793], dtype=float32), array([1.0010573], dtype=float32), array([1.0010476], dtype=float32), array([1.0010475], dtype=float32), array([1.0010545], dtype=float32), array([1.001067], dtype=float32), array([1.0010837], dtype=float32), array([1.0011035], dtype=float32), array([1.0011257], dtype=float32), array([1.0011497], dtype=float32), array([1.001175], dtype=float32), array([1.0012014], dtype=float32), array([1.0012285], dtype=float32), array([1.0012561], dtype=float32), array([1.0012841], dtype=float32), array([1.0013124], dtype=float32), array([1.0013409], dtype=float32), array([1.0013695], dtype=float32), array([1.0013981], dtype=float32), array([1.0014268], dtype=float32), array([1.0014554], dtype=float32), array([1.001484], dtype=float32), array([1.0015126], dtype=float32), array([1.0015413], dtype=float32), array([1.0015697], dtype=float32), array([1.0015981], dtype=float32), array([1.0016265], dtype=float32), array([1.0016547], dtype=float32), array([1.0016829], dtype=float32), array([1.001711], dtype=float32), array([1.001739], dtype=float32), array([1.0017669], dtype=float32), array([1.0017947], dtype=float32), array([1.0018225], dtype=float32), array([1.0018501], dtype=float32), array([1.0018777], dtype=float32), array([1.0019051], dtype=float32), array([1.0019325], dtype=float32), array([1.0019598], dtype=float32), array([1.001987], dtype=float32), array([1.002014], dtype=float32), array([1.0020411], dtype=float32), array([1.002068], dtype=float32), array([1.0020949], dtype=float32), array([1.0021216], dtype=float32), array([1.0021482], dtype=float32), array([1.0021747], dtype=float32), array([1.0022012], dtype=float32), array([1.0022275], dtype=float32), array([1.0022538], dtype=float32), array([1.00228], dtype=float32), array([1.0023061], dtype=float32), array([1.0023321], dtype=float32), array([1.0023581], dtype=float32), array([1.002384], dtype=float32), array([1.0024097], dtype=float32), array([1.0024353], dtype=float32), array([1.002461], dtype=float32), array([1.0024865], dtype=float32), array([1.0025119], dtype=float32), array([1.0025371], dtype=float32), array([1.0025624], dtype=float32), array([1.0025876], dtype=float32), array([1.0026126], dtype=float32), array([1.0026375], dtype=float32), array([1.0026624], dtype=float32), array([1.0026872], dtype=float32), array([1.0027119], dtype=float32), array([1.0027366], dtype=float32), array([1.0027611], dtype=float32), array([1.0027856], dtype=float32), array([1.00281], dtype=float32), array([1.0028343], dtype=float32)]

plt.plot(w_vals, loss_vals, 'o--')
plt.xlabel('w')
plt.ylabel('cost')
plt.show()

# 선형회귀선 시각화
y_pred = tf.multiply(x, w) + b    # 모델 완성
print('y_pred :', y_pred.numpy())

plt.plot(x, y, 'ro')
plt.plot(x, y_pred, 'b--')
plt.show()

tf 1.x 와 2.x : 단순선형회귀/로지스틱회귀 소스 코드

cafe.daum.net/flowlife/S2Ul/17

Daum 카페

cafe.daum.net

tensorflow 1.x 사용

단순선형회귀 - 경사하강법 함수 사용 1.x

* ke5_tensorflow1.py

import tensorflow.compat.v1 as tf   # tensorflow 1.x 소스 실행 시
tf.disable_v2_behavior()            # tensorflow 1.x 소스 실행 시

import matplotlib.pyplot as plt

x_data = [1.,2.,3.,4.,5.]
y_data = [1.2,2.0,3.0,3.5,5.5]

x = tf.placeholder(tf.float32)
y = tf.placeholder(tf.float32)
w = tf.Variable(tf.random_normal([1]))
b = tf.Variable(tf.random_normal([1]))

hypothesis = x * w + b
cost = tf.reduce_mean(tf.square(hypothesis - y))

print('\n경사하강법 메소드 사용------------')
optimizer = tf.train.GradientDescentOptimizer(learning_rate=0.01)
train = optimizer.minimize(cost)

sess = tf.Session()   # Launch the graph in a session.
sess.run(tf.global_variables_initializer())

w_val = []
cost_val = []

for i in range(501):
    _, curr_cost, curr_w, curr_b = sess.run([train, cost, w, b], {x:x_data, y:y_data})
    w_val.append(curr_w)
    cost_val.append(curr_cost)
    if i  % 10 == 0:
        print(str(i) + ' cost:' + str(curr_cost) + ' weight:' + str(curr_w) +' b:' + str(curr_b))

plt.plot(w_val, cost_val)
plt.xlabel('w')
plt.ylabel('cost')
plt.show()

print('--회귀분석 모델로 Y 값 예측------------------')
print(sess.run(hypothesis, feed_dict={x:[5]}))        # [5.0563836]
print(sess.run(hypothesis, feed_dict={x:[2.5]}))      # [2.5046895]
print(sess.run(hypothesis, feed_dict={x:[1.5, 3.3]})) # [1.4840119 3.3212316]

선형회귀분석 기본 - Keras 사용 2.x

* ke5_tensorflow2.py

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras import optimizers 

x_data = [1.,2.,3.,4.,5.]
y_data = [1.2,2.0,3.0,3.5,5.5]

model=Sequential()   # 계층구조(Linear layer stack)를 이루는 모델을 정의
model.add(Dense(1, input_dim=1, activation='linear'))

# activation function의 종류 : https://www.tensorflow.org/versions/r2.0/api_docs/python/tf/keras/activations
sgd=optimizers.SGD(lr=0.01)  # 학습률(learning rate, lr)은 0.01
model.compile(optimizer=sgd, loss='mse',metrics=['mse'])
lossmetrics = model.eval‎uate(x_data,y_data)
print(lossmetrics)

# 옵티마이저는 경사하강법의 일종인 확률적 경사 하강법 sgd를 사용.
# 손실 함수(Loss function)은 평균제곱오차 mse를 사용.
# 주어진 X와 y데이터에 대해서 오차를 최소화하는 작업을 100번 시도.
model.fit(x_data, y_data, batch_size=1, epochs=100, shuffle=False, verbose=2)

from sklearn.metrics import r2_score
print('설명력 : ', r2_score(y_data, model.predict(x_data)))

print('예상 수 : ', model.predict([5]))         # [[4.801656]]
print('예상 수 : ', model.predict([2.5]))       # [[2.490468]]
print('예상 수 : ', model.predict([1.5, 3.3]))  # [[1.565993][3.230048]]

단순선형모델 작성

keras model 작성방법 3가지 / 최적모델 찾기

cafe.daum.net/flowlife/S2Ul/22

Daum 카페

cafe.daum.net

출처 : https://www.pyimagesearch.com/2019/10/28/3-ways-to-create-a-keras-model-with-tensorflow-2-0-sequential-functional-and-model-subclassing/

- 공부시간에 따른 성적 결과 예측 - 모델 작성방법 3가지

* ke6_regression.py

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras import optimizers
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import tensorflow as tf

x_data = np.array([1,2,3,4,5], dtype=np.float32)      # feature
y_data = np.array([11,32,53,64,70], dtype=np.float32) # label

print(np.corrcoef(x_data, y_data))   # 0.9743547 인과관계가 있다고 가정

- 방법 1 : Sequential API 사용 - 여러개의 층을 순서대로 쌓아올린 완전 연결모델

model = Sequential()
model.add(Dense(units=1, input_dim=1, activation='linear'))
model.add(Dense(units=1, activation='linear'))
print(model.summary())

opti = optimizers.Adam(lr=0.01)
model.compile(optimizer=opti, loss='mse', metrics=['mse'])
model.fit(x=x_data, y=y_data, batch_size=1, epochs=100, verbose=1)
loss_metrics = model.evaluate(x=x_data, y=y_data)
print('loss_metrics: ', loss_metrics)
# loss_metrics:  [61.95122146606445, 61.95122146606445]

from sklearn.metrics import r2_score
print('설명력 : ', r2_score(y_data, model.predict(x_data))) # 설명력 :  0.8693012272129582

print('실제값 : ', y_data)                          # 실제값 :  [11. 32. 53. 64. 70.]
print('예측값 : ', model.predict(x_data).flatten()) # 예측값 :  [26.136082 36.97163  47.807175 58.642727 69.478264]

print('예상점수 : ', model.predict([0.5, 3.45, 6.7]).flatten())
# 예상점수 :  [22.367954 50.166172 80.79132 ]

plt.plot(x_data, model.predict(x_data), 'b', x_data, y_data, 'ko')
plt.show()

- 방법 2 : funcion API 사용 - Sequential API보다 유연한 모델을 작성

from tensorflow.keras.layers import Input
from tensorflow.keras.models import Model

inputs = Input(shape=(1, )) # input layer 생성
output1 = Dense(2, activation='linear')(inputs)
output2 = Dense(1, activation='linear')(output1)
model2 = Model(inputs, output2)

from tensorflow.keras.layers import Input

from tensorflow.keras.models import Model

input = Input(shape=(입력수, )) : input layer 생성
output = Dense(출력수, activation='linear')(input) : output 연결
model2 = Model(input, output) : 모델 생성

opti = optimizers.Adam(lr=0.01)
model2.compile(optimizer=opti, loss='mse', metrics=['mse'])
model2.fit(x=x_data, y=y_data, batch_size=1, epochs=100, verbose=1)
loss_metrics = model2.evaluate(x=x_data, y=y_data)
print('loss_metrics: ', loss_metrics) # loss_metrics:  [46.31613540649414, 46.31613540649414]
print('설명력 : ', r2_score(y_data, model2.predict(x_data))) # 설명력 :  0.8923131851204337

- 방법 3 : Model subclassing API 사용 - 동적인 모델을 작성

class MyModel(Model):
    def __init__(self): # 생성자
        super(MyModel, self).__init__()
        self.d1 = Dense(2, activation='linear') # layer 생성
        self.d2 = Dense(1, activation='linear')
    
    def call(self, x):  # 모델.fit()에서 호출
        x = self.d1(x)
        return self.d2(x)
        
model3 = MyModel()   # init 호출

opti = optimizers.Adam(lr=0.01)
model3.compile(optimizer=opti, loss='mse', metrics=['mse'])
model3.fit(x=x_data, y=y_data, batch_size=1, epochs=100, verbose=1)
loss_metrics = model3.evaluate(x=x_data, y=y_data)
print('loss_metrics: ', loss_metrics) # loss_metrics:  [41.4090576171875, 41.4090576171875]
print('설명력 : ', r2_score(y_data, model3.predict(x_data))) # 설명력 :  0.9126391191522784

다중 선형회귀 모델 + 텐서보드(모델의 구조 및 학습과정/결과를 시각화)

5명의 3번 시험 점수로 다음 시험점수 예측

* ke7.py

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras import optimizers
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import tensorflow as tf

# 데이터 수집
x_data = np.array([[70, 85, 80], [71, 89, 78], [50, 80, 60], [66, 20, 60], [50, 30, 10]])
y_data = np.array([73, 82, 72, 57, 34])

# Sequential API 사용

# 모델생성
model = Sequential()
#model.add(Dense(1, input_dim=3, activation='linear'))

# 모델 설정
model.add(Dense(6, input_dim=3, activation='linear', name='a'))
model.add(Dense(3, activation='linear', name='b'))
model.add(Dense(1, activation='linear', name='c'))
print(model.summary())

# 학습설정
opti = optimizers.Adam(lr=0.01)
model.compile(optimizer=opti, loss='mse', metrics=['mse'])
history = model.fit(x_data, y_data, batch_size=1, epochs=30, verbose=2)

# 시각화
plt.plot(history.history['loss'])
plt.xlabel('epochs')
plt.ylabel('loss')
plt.show()

# 모델 평가
loss_metrics = model.evaluate(x=x_data, y=y_data)

from sklearn.metrics import r2_score

print('loss_metrics: ', loss_metrics)
print('설명력 : ', r2_score(y_data, model.predict(x_data)))
# 설명력 :  0.7680899501992267
print('예측값 :', model.predict(x_data).flatten())
# 예측값 : [84.357574 83.79331  66.111855 57.75085  21.302818]

# funcion API 사용
from tensorflow.keras.layers import Input
from tensorflow.keras.models import Model

inputs = Input(shape=(3,))
output1 = Dense(6, activation='linear', name='a')(inputs)
output2 = Dense(3, activation='linear', name='b')(output1)
output3 = Dense(1, activation='linear', name='c')(output2)

linaer_model = Model(inputs, output3)
print(linaer_model.summary())

- TensorBoard : 알고리즘에 대한 동작을 확인하여 시행착오를 최소화

from tensorflow.keras.callbacks import TensorBoard

tb = TensorBoard(log_dir ='.\\my',
                 histogram_freq=True,
                 write_graph=True,
                 write_images=False)

# 학습설정
opti = optimizers.Adam(lr=0.01)
linear_model.compile(optimizer=opti, loss='mse', metrics=['mse'])
history = linear_model.fit(x_data, y_data, batch_size=1, epochs=30, verbose=2,\
                    callbacks = [tb])

# 모델 평가
loss_metrics = linear_model.evaluate(x=x_data, y=y_data)

from sklearn.metrics import r2_score

print('loss_metrics: ', loss_metrics)
# loss_metrics:  [26.276317596435547, 26.276317596435547]
print('설명력 : ', r2_score(y_data, linear_model.predict(x_data)))
# 설명력 :  0.9072950307860612
print('예측값 :', linear_model.predict(x_data).flatten())
# 예측값 : [80.09034  80.80026  63.217213 55.48591  33.510746]

# 새로운 값 예측
x_new = np.array([[50, 55, 50], [91, 99, 98]])
print('예상점수 :', linear_model.predict(x_new).flatten())
# 예상점수 : [53.61225  98.894615]

from tensorflow.keras.callbacks import TensorBoard

tb = TensorBoard(log_dir ='', histogram_freq=, write_graph=, write_images=) :

log_dir : 로그 경로 설정

histogram_freq : 히스토그램 표시

wrtie_graph : 그래프 그리기

write_images : 실행도중 사용 이미지 유무

model.fit(x, y, batch_size=, epochs=, verbose=, callbacks = [tb]) :

- TensorBoard의 결과확인은 cmd창에서 확인한다.

cd C:\work\psou\pro4\tf_test2
tensorboard --logdir my/

TensorBoard 2.4.1 at http://localhost:6006/ (Press CTRL+C to quit)

=> http://localhost:6006/ 접속

- TensorBoard 사용방법

pythonkim.tistory.com/39

텐서보드 사용법

TensorBoard는 TensorFlow에 기록된 로그를 그래프로 시각화시켜서 보여주는 도구다. 1. TensorBoard 실행 tensorboard --logdir=/tmp/sample 루트(/) 폴더 밑의 tmp 폴더 밑의 sample 폴더에 기록된 로그를 보겠..

pythonkim.tistory.com

정규화/표준화

: 데이터 간에 단위에 차이가 큰 경우

- scaler 종류

zereight.tistory.com/268

Scaler 의 종류

https://mkjjo.github.io/python/2019/01/10/scaler.html 스케일링의 종류 Scikit-Learn에서는 다양한 종류의 스케일러를 제공하고 있다. 그중 대표적인 기법들이다. 종류 설명 1 StandardScaler 기본 스케일. 평..

zereight.tistory.com

StandardScaler	기본 스케일. 평균과 표준편차 사용
MinMaxScaler	최대/최소값이 각각 1, 0이 되도록 스케일링
MaxAbsScaler	최대절대값과 0이 각각 1, 0이 되도록 스케일링
RobustScaler	중앙값(median)과 IQR(interquartile range) 사용. 아웃라이어의 영향을 최소화

* ke8_scaler.py

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras import optimizers
from tensorflow.keras.optimizers import SGD, RMSprop, Adam

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import tensorflow as tf
from sklearn.preprocessing import MinMaxScaler, minmax_scale, StandardScaler, RobustScaler

data = pd.read_csv('https://raw.githubusercontent.com/pykwon/python/master/testdata_utf8/Advertising.csv')
del data['no']
print(data.head(2))
'''
      tv  radio  newspaper  sales
0  230.1   37.8       69.2   22.1
1   44.5   39.3       45.1   10.4
'''

# 정규화 : 0 ~ 1사이로 표현
xy = minmax_scale(data, axis=0, copy=True)
print(xy[:2])
# [[0.77578627 0.76209677 0.60598065 0.80708661]
#  [0.1481231  0.79233871 0.39401935 0.34645669]]

from sklearn.preprocessing import MinMaxScaler, minmax_scale, StandardScaler, RobustScaler

minmax_scale(data, axis=, copy=) : 정규화

# train/test : 과적합 방지
from sklearn.model_selection import train_test_split

x_train, x_test, y_train, y_test = train_test_split(xy[:, :-1], xy[:, -1], \
                                                    test_size=0.3, random_state=123)
print(x_train[:2], x_train.shape) # tv, radio, newpaper
print(x_test[:2], x_test.shape)
print(y_train[:2], y_train.shape) # sales
print(y_test[:2], y_test.shape)
'''
[[0.80858979 0.08266129 0.32189974]
 [0.30334799 0.00604839 0.20140721]] (140, 3)
[[0.67331755 0.0625     0.30167106]
 [0.26885357 0.         0.07827617]] (60, 3)
[0.42125984 0.27952756] (140,)
[0.38582677 0.28346457] (60,)

# 모델 생성
model = Sequential()

model.add(Dense(1, input_dim =3)) # 레이어 1개
model.add(Activation('linear'))

model.add(Dense(1, input_dim =3, activation='linear'))
print(model.summary())
tf.keras.utils.plot_model(model,'abc.png')

tf.keras.utils.plot_model(model,'파일명') : 레이어 도식화하여 파일 저장.

# 학습설정
model.compile(optimizer=Adam(0.01), loss='mse', metrics=['mse'])
history = model.fit(x_train, y_train, batch_size=32, epochs=100, verbose=1,\
          validation_split = 0.2) # train data를 8:2로 분리해서 학습도중 검증 추가.
print('history:', history.history)

# 모델 평가
loss = model.evaluate(x_test, y_test)
print('loss :', loss)
# loss : [0.003264167346060276, 0.003264167346060276]

from sklearn.metrics import r2_score

pred = model.predict(x_test)
print('예측값 : ', pred[:3].flatten())
# 예측값 :  [0.4591275  0.21831244 0.569612  ]
print('실제값 : ', y_test[:3])
# 실제값 :  [0.38582677 0.28346457 0.51574803]
print('설명력 : ', r2_score(y_test, pred))
# 설명력 :  0.920154340793872

model.fit(x_train, y_train, batch_size=, epochs=, verbose=, validation_split = 0.2) : train data를 8:2로 분리해서 학습도중 검증 추가.

주식 데이터 회귀분석

* ke9_stock.py

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras import optimizers
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import tensorflow as tf
from sklearn.preprocessing import MinMaxScaler, minmax_scale, StandardScaler, RobustScaler

xy = np.loadtxt('https://raw.githubusercontent.com/pykwon/python/master/testdata_utf8/stockdaily.csv',\
                delimiter=',', skiprows=1)
print(xy[:2], len(xy))
'''
[[8.28659973e+02 8.33450012e+02 8.28349976e+02 1.24770000e+06
  8.31659973e+02]
 [8.23020020e+02 8.28070007e+02 8.21655029e+02 1.59780000e+06
  8.28070007e+02]] 732
'''

# 정규화
scaler = MinMaxScaler(feature_range=(0, 1))
xy = scaler.fit_transform(xy)
print(xy[:3])
'''
[[0.97333581 0.97543152 1.         0.11112306 0.98831302]
 [0.95690035 0.95988111 0.9803545  0.14250246 0.97785024]
 [0.94789567 0.94927335 0.97250489 0.11417048 0.96645463]]
'''

x_data = xy[:, 0:-1]
y_data = xy[:, -1]
print(x_data[0], y_data[0])
# [0.97333581 0.97543152 1.         0.11112306] 0.9883130206172026
print(x_data[1], y_data[1])
# [0.95690035 0.95988111 0.9803545  0.14250246] 0.9778502390712853

# 하루전 데이터로 다음날 종가 예측
x_data = np.delete(x_data, -1, 0) # 마지막행 삭제
y_data = np.delete(y_data, 0)     # 0 행 삭제
print()

print('predict tomorrow')
print(x_data[0], '>=', y_data[0])
# [0.97333581 0.97543152 1.         0.11112306] >= 0.9778502390712853

model = Sequential()
model.add(Dense(input_dim=4, units=1))

model.compile(optimizer='adam', loss='mse', metrics=['mse'])
model.fit(x_data, y_data, epochs=100, verbose=2)
print(x_data[10])
# [0.88894325 0.88357424 0.90287217 0.10453527]

test = x_data[10].reshape(-1, 4)
print(test)
# [[0.88894325 0.88357424 0.90287217 0.10453527]]
print('실제값 :', y_data[10], ', 예측값 :', model.predict(test).flatten())
# 실제값 : 0.9003840704898083 , 예측값 : [0.8847432]

from sklearn.metrics import r2_score

pred = model.predict(x_data)
print('설명력 : ', r2_score(y_data, pred))
# 설명력 :  0.995010085719306

# 데이터를 분리
train_size = int(len(x_data) * 0.7)
test_size = len(x_data) - train_size
print(train_size, test_size)       # 511 220
x_train, x_test = x_data[0:train_size], x_data[train_size:len(x_data)]
print(x_train[:2], x_train.shape)  #  (511, 4)
y_train, y_test = y_data[0:train_size], y_data[train_size:len(x_data)]
print(y_train[:2], y_train.shape)  #  (511,)

model2 = Sequential()
model2.add(Dense(input_dim=4, units=1))

model2.compile(optimizer='adam', loss='mse', metrics=['mse'])
model2.fit(x_train, y_train, epochs=100, verbose=0)

result = model.evaluate(x_test, y_test)
print('result :', result)       # result : [0.0038084371481090784, 0.0038084371481090784]
pred2 = model2.predict(x_test)
print('설명력 : ', r2_score(y_test, pred2)) # 설명력 :  0.8712214499209135

plt.plot(y_test, 'b')
plt.plot(pred2, 'r--')
plt.show()

# 머신러닝 이슈는 최적화와 일반화의 줄다리기
# 최적화 : 성능 좋은 모델 생성. 과적합 발생.
# 일반화 : 모델이 새로운 데이터에 대한 분류/예측을 잘함

boston dataset으로 주택가격 예측

* ke10_boston.py

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras import optimizers
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import tensorflow as tf
from tensorflow.keras.datasets import boston_housing

#print(boston_housing.load_data())
(x_train, y_train), (x_test, y_test) = boston_housing.load_data()
print(x_train[:2], x_train.shape) # (404, 13)
print(y_train[:2], y_train.shape) # (404,)
print(x_test[:2], x_test.shape)   # (102, 13)
print(y_test[:2], y_test.shape)   # (102,)
'''
CRIM: 자치시(town) 별 1인당 범죄율
ZN: 25,000 평방피트를 초과하는 거주지역의 비율
INDUS:비소매상업지역이 점유하고 있는 토지의 비율
CHAS: 찰스강에 대한 더미변수(강의 경계에 위치한 경우는 1, 아니면 0)
NOX: 10ppm 당 농축 일산화질소
RM: 주택 1가구당 평균 방의 개수
AGE: 1940년 이전에 건축된 소유주택의 비율
DIS: 5개의 보스턴 직업센터까지의 접근성 지수
RAD: 방사형 도로까지의 접근성 지수
TAX: 10,000 달러 당 재산세율
PTRATIO: 자치시(town)별 학생/교사 비율
B: 1000(Bk-0.63)^2, 여기서 Bk는 자치시별 흑인의 비율을 말함.
LSTAT: 모집단의 하위계층의 비율(%)
MEDV: 본인 소유의 주택가격(중앙값) (단위: $1,000)
'''

from sklearn.preprocessing import MinMaxScaler, minmax_scale, StandardScaler
# 표준화 : (요소값 - 평균) / 표준편차
x_train = StandardScaler().fit_transform(x_train)
x_test = StandardScaler().fit_transform(x_test)
print(x_train[:2])
'''
[[-0.27224633 -0.48361547 -0.43576161 -0.25683275 -0.1652266  -0.1764426
   0.81306188  0.1166983  -0.62624905 -0.59517003  1.14850044  0.44807713
   0.8252202 ]
 [-0.40342651  2.99178419 -1.33391162 -0.25683275 -1.21518188  1.89434613
  -1.91036058  1.24758524 -0.85646254 -0.34843254 -1.71818909  0.43190599
  -1.32920239]]
'''

def build_model():
    model = Sequential()
    model.add(Dense(64, activation='linear', input_shape = (x_train.shape[1], )))
    model.add(Dense(32, activation='linear'))
    model.add(Dense(1, activation='linear')) # 보통 출력수를 줄여나감.
    
    model.compile(optimizer='adam', loss='mse', metrics=['mse'])
    return model
    
model = build_model()
print(model.summary())

# 연습 1 : trian/test로 학습. validation dataset 미사용
history = model.fit(x_train, y_train, epochs=50, batch_size=10, verbose=0)
mse_history = history.history['mse'] # loss, mse 중 mse
print('mse_history :', mse_history)
# mse_history : [548.9549560546875, 466.8479919433594, 353.4585876464844, 186.83999633789062, 58.98761749267578, 26.056533813476562, 23.167158126831055, 23.637117385864258, 23.369510650634766, 22.879520416259766, 23.390832901000977, 23.419946670532227, 23.037487030029297, 23.752803802490234, 23.961477279663086, 23.314424514770508, 23.156572341918945, 24.04509162902832, 23.13265609741211, 24.095226287841797, 23.08273696899414, 23.30631446838379, 24.038318634033203, 23.243263244628906, 23.506254196166992, 23.377840042114258, 23.529315948486328, 23.724761962890625, 23.4329891204834, 23.686052322387695, 23.25194549560547, 23.544504165649414, 23.093494415283203, 22.901500701904297, 23.991165161132812, 23.457441329956055, 24.34749412536621, 23.256059646606445, 23.843273162841797, 23.13270378112793, 24.404985427856445, 24.354494094848633, 23.51766014099121, 23.392494201660156, 23.11193084716797, 23.509197235107422, 23.29837417602539, 24.12410545349121, 23.416379928588867, 23.74490737915039]

# 연습 2 : trian/test로 학습. validation dataset 사용
history = model.fit(x_train, y_train, epochs=50, batch_size=10, verbose=0,\
                    validation_split = 0.3)
mse_history = history.history['mse'] # loss, mse, val_loss, val_mse 중 mse
print('mse_history :', mse_history)
# mse_history : [19.48627281188965, 19.15229606628418, 18.982120513916016, 19.509700775146484, 19.484264373779297, 19.066728591918945, 20.140111923217773, 19.462392807006836, 19.258283615112305, 18.974916458129883, 20.06231117248535, 19.748247146606445, 20.13493537902832, 19.995471954345703, 19.182003021240234, 19.42215347290039, 19.571495056152344, 19.24733543395996, 19.52226448059082, 19.074302673339844, 19.558866500854492, 19.209842681884766, 18.880287170410156, 19.14659309387207, 19.033899307250977, 19.366600036621094, 18.843536376953125, 19.674291610717773, 19.239337921142578, 19.594730377197266, 19.586498260498047, 19.684917449951172, 19.49432945251465, 19.398204803466797, 19.537694931030273, 19.503393173217773, 19.27028465270996, 19.265226364135742, 19.07738494873047, 19.075668334960938, 19.237651824951172, 19.83896827697754, 18.86182403564453, 19.732463836669922, 20.0035400390625, 19.034374237060547, 18.72059440612793, 19.841144561767578, 19.51473045349121, 19.27489471435547]
val_mse_history = history.history['val_mse'] # loss, mse, val_loss, val_mse 중 val_mse
print('val_mse_history :', mse_history)
# val_mse_history : [19.911706924438477, 19.533662796020508, 20.14069366455078, 20.71445655822754, 19.561399459838867, 19.340707778930664, 19.23623275756836, 19.126638412475586, 19.64912223815918, 19.517324447631836, 20.47089958190918, 19.591028213500977, 19.35943603515625, 20.017181396484375, 19.332469940185547, 19.519393920898438, 20.045940399169922, 18.939823150634766, 20.331043243408203, 19.793170928955078, 19.281906127929688, 19.30805778503418, 18.842435836791992, 19.221630096435547, 19.322744369506836, 19.64993667602539, 19.05265998840332, 18.85285758972168, 19.07070541381836, 19.016603469848633, 19.707555770874023, 18.752607345581055, 19.066970825195312, 19.616897583007812, 19.585346221923828, 19.096216201782227, 19.127830505371094, 19.077239990234375, 19.891225814819336, 19.251203536987305, 19.305219650268555, 18.768598556518555, 19.763708114624023, 19.80074119567871, 19.371135711669922, 19.151229858398438, 19.302906036376953, 19.169986724853516, 19.26124382019043, 19.901819229125977]

# 시각화
plt.plot(mse_history, 'r')
plt.plot(val_mse_history, 'b--')
plt.xlabel('epoch')
plt.ylabel('mse, val_mse')
plt.show()

from sklearn.metrics import r2_score

print('설명력 : ', r2_score(y_test, model.predict(x_test)))
# 설명력 :  0.7525586754103629

회귀분석 모델 : 자동차 연비 예측

* ke11_cars.py

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import tensorflow as tf
from tensorflow.keras import layers 

dataset = pd.read_csv('https://raw.githubusercontent.com/pykwon/python/master/testdata_utf8/auto-mpg.csv')
del dataset['car name']
print(dataset.head(2))
pd.set_option('display.max_columns', 100)
print(dataset.corr())
'''
                   mpg  cylinders  displacement    weight  acceleration  \
mpg           1.000000  -0.775396     -0.804203 -0.831741      0.420289   
cylinders    -0.775396   1.000000      0.950721  0.896017     -0.505419   
displacement -0.804203   0.950721      1.000000  0.932824     -0.543684   
weight       -0.831741   0.896017      0.932824  1.000000     -0.417457   
acceleration  0.420289  -0.505419     -0.543684 -0.417457      1.000000   
model year    0.579267  -0.348746     -0.370164 -0.306564      0.288137   
origin        0.563450  -0.562543     -0.609409 -0.581024      0.205873   

              model year    origin  
mpg             0.579267  0.563450  
cylinders      -0.348746 -0.562543  
displacement   -0.370164 -0.609409  
weight         -0.306564 -0.581024  
acceleration    0.288137  0.205873  
model year      1.000000  0.180662  
origin          0.180662  1.000000 
'''
dataset.drop(['cylinders','acceleration', 'model year', 'origin'], axis='columns', inplace=True)
print()
print(dataset.head(2))
'''
    mpg  displacement horsepower  weight
0  18.0         307.0        130    3504
1  15.0         350.0        165    3693
'''
dataset['horsepower'] = dataset['horsepower'].apply(pd.to_numeric, errors = 'coerce') # errors = 'coerce' : 에러 무시 
# data 중에 ?가 있어 형변환시 NaN 발생.
print(dataset.info())
print(dataset.isnull().sum()) # horsepower      6
dataset = dataset.dropna()
print('----------------------------------------------------')
print(dataset)

import seaborn as sns
sns.pairplot(dataset[['mpg', 'displacement', 'horsepower', 'weight']], diag_kind='kde')
plt.show()

# train/test
train_dataset = dataset.sample(frac= 0.7, random_state=123)
test_dataset = dataset.drop(train_dataset.index)
print(train_dataset.shape) # (274, 4)
print(test_dataset.shape)  # (118, 4)

# 표준화 작업 (수식을 직접 사용)을 위한 작업
train_stat = train_dataset.describe()
print(train_stat)
#train_dataset.pop('mpg')
train_stat = train_stat.transpose()
print(train_stat)

# label : mpg
train_labels = train_dataset.pop('mpg')
print(train_labels[:2])
'''
222    17.0
247    39.4
'''
test_labels = test_dataset.pop('mpg')
print(train_dataset)
'''
     displacement  horsepower  weight
222         260.0       110.0    4060
247          85.0        70.0    2070
136         302.0       140.0    4141
'''
print(test_labels[:2])
'''
1    15.0
2    18.0
'''
print(test_dataset)

def st_func(x):
    return ((x - train_stat['mean']) / train_stat['std'])

print(st_func(10))
'''
mpg            -1.706214
displacement   -1.745771
horsepower     -2.403940
weight         -3.440126
'''
print(train_dataset[:3])
'''
     displacement  horsepower  weight
222         260.0       110.0    4060
247          85.0        70.0    2070
136         302.0       140.0    4141
'''
print(st_func(train_dataset[:3]))
'''
     displacement  horsepower  mpg    weight
222      0.599039    0.133053  NaN  1.247890
247     -1.042328   -0.881744  NaN -1.055604
136      0.992967    0.894151  NaN  1.341651
'''
st_train_data = st_func(train_dataset) # train feature
st_test_data = st_func(test_dataset)   # test feature
st_train_data.pop('mpg')
st_test_data.pop('mpg')
print(st_train_data)
print(st_test_data)

# 모델에 적용할 dataset 준비완료
# Model
def build_model():
    network = tf.keras.Sequential([
        layers.Dense(units=64, input_shape=[3], activation='linear'),
        layers.Dense(64, activation='linear'), # relu
        layers.Dense(1, activation='linear')
        ])
    #opti = tf.keras.optimizers.RMSprop(0.01)
    opti = tf.keras.optimizers.Adam(0.01)
    network.compile(optimizer=opti, loss='mean_squared_error', \
                    metrics=['mean_absolute_error', 'mean_squared_error'])
    return network

print(build_model().summary())   # Total params: 4,481
# fit() 전에 모델을 실행해볼수도 있다.
model = build_model()
print(st_train_data[:1])
print(model.predict(st_train_data[:1])) # 결과 무시

# 훈련
epochs = 10

# 학습 조기 종료
early_stop = tf.keras.callbacks.EarlyStopping(monitor='loss', patience=3)

history = model.fit(st_train_data, train_labels, batch_size=32,\
                    epochs=epochs, validation_split=0.2, verbose=1)
df = pd.DataFrame(history.history)
print(df.head(3))
print(df.columns)

tf.keras.callbacks.EarlyStopping(monitor='loss', patience=3) : 학습 조기 종료

# 시각화
def plot_history(history):
    hist = pd.DataFrame(history.history)
    hist['epoch'] = history.epoch
    plt.figure(figsize = (8,12))
    
    plt.subplot(2, 1, 1)
    plt.xlabel('epoch')
    plt.ylabel('Mean Abs Error[MPG]')
    plt.plot(hist['epoch'], hist['mean_absolute_error'], label='train error')
    plt.plot(hist['epoch'], hist['val_mean_absolute_error'], label='val error')
    #plt.ylim([0, 5])
    plt.legend()
    
    plt.subplot(2, 1, 2)
    plt.xlabel('epoch')
    plt.ylabel('Mean Squared Error[MPG]')
    plt.plot(hist['epoch'], hist['mean_squared_error'], label='train error')
    plt.plot(hist['epoch'], hist['val_mean_squared_error'], label='error')
    #plt.ylim([0, 20])
    plt.legend()
    plt.show()

plot_history(history)

'BACK END > Deep Learning' 카테고리의 다른 글

[딥러닝] Tensorflow - 이미지 분류 (0)	2021.04.01
[딥러닝] Keras - Logistic (0)	2021.03.25
[딥러닝] TensorFlow (0)	2021.03.22
[딥러닝] TensorFlow 환경설정 (0)	2021.03.22
[딥러닝] DBScan (0)	2021.03.22

[딥러닝] TensorFlow

2021. 3. 22. 17:25

TensorFlow

: 케라스를 사용하여 분류/예측 진행.

: 그래프(node, edge로 구성) 정보를 가짐.

: C로 만듦

www.tensorflow.org/guide/intro_to_graphs?hl=ko

그래프 및 함수 소개 | TensorFlow Core

이 가이드는 TensorFlow 및 Keras의 내부를 살펴봄으로써 TensorFlow의 동작 방식을 알아봅니다. 대신 Keras를 바로 시작하려면 Keras 가이드 모음을 참조하세요. 이 가이드에서는 TensorFlow 코드를 간단하게

www.tensorflow.org

* tf1.py

- 상수

import tensorflow as tf

print(tf.__version__)
print('GPU 사용가능' if tf.test.is_gpu_available() else '사용불가')

# 상수
print(1, type(1))                           # 1 <class 'int'>
print(tf.constant(1), type(tf.constant(1))) # scala  0-D tensor
# tf.Tensor(1, shape=(), dtype=int32) <class 'tensorflow.python.framework.ops.EagerTensor'>

print(tf.constant([1]))                     # vector 1-D tensor
# tf.Tensor([1], shape=(1,), dtype=int32)

print(tf.constant([[1]]))                   # matrix 2-D tensor
# tf.Tensor([[1]], shape=(1, 1), dtype=int32)
print()

a = tf.constant([1, 2])
b = tf.constant([3, 4])
c = a + b
print(c)               # tf.Tensor([4 6], shape=(2,), dtype=int32)
c = tf.add(a, b)
print(c)               # tf.Tensor([4 6], shape=(2,), dtype=int32)
d = tf.constant([3])
e = c + d              # broad casting
print(e)               # tf.Tensor([7 9], shape=(2,), dtype=int32)
print()

print(7)
print(tf.convert_to_tensor(7, dtype=tf.float32)) # tf.Tensor(7.0, shape=(), dtype=float32)
print(tf.cast(7, dtype=tf.float32))              # tf.Tensor(7.0, shape=(), dtype=float32)
print(tf.constant(7.0))

import tensorflow as tf

tf.constant(1)

tf.constant([1])

tf.constant([[1]])

tf.add(a,b) : tensor 더하기

tf.convert_to_tensor(7, dtype=tf.float32) : tensor로 형변환

tf.cast(7, dtype=tf.float32) : tensor로 형변환

- nump의 ndarray와 tensor 사이의 형변환

import numpy as np
arr = np.array([1, 2])
print(arr, type(arr))   # [1 2] <class 'numpy.ndarray'>
tfarr = tf.add(arr, 5)  # ndarray에서 tensor type으로 자동 형변환
print(tfarr)            # tf.Tensor([6 7], shape=(2,), dtype=int32)
print(tfarr.numpy())    # [6 7]
                        # tensor type에서 ndarray로 강제 형변환. 
print(np.add(tfarr, 3)) # [ 9 10]
                        # tensor type에서 ndarray로 자동 형변환.

numpy() : ndarray로 형변환

* tf2.py

- 변수

import tensorflow as tf

print(tf.constant([1]))  # tf.Tensor([1], shape=(1,), dtype=int32)

f = tf.Variable(1)
print(f)                 # <tf.Variable 'Variable:0' shape=() dtype=int32, numpy=1>

v = tf.Variable(tf.ones(2,))   # 1-D
m = tf.Variable(tf.ones(2, 1)) # 2-D
print(v, m)
# <tf.Variable 'Variable:0' shape=(2,) dtype=float32, numpy=array([1., 1.], dtype=float32)>
# <tf.Variable 'Variable:0' shape=(2,) dtype=float32, numpy=array([1., 1.], dtype=float32)>
print(m.numpy()) # [1. 1.]
print()

v1 = tf.Variable(1)
print(v1)      # <tf.Variable 'Variable:0' shape=() dtype=int32, numpy=1>
v1.assign(10)
print(v1, type(v1))      # <tf.Variable 'Variable:0' shape=() dtype=int32, numpy=10>
# <class 'tensorflow.python.ops.resource_variable_ops.ResourceVariable'>

v2 = tf.Variable(tf.ones(shape=(1)))
v2.assign([20])
print(v2)  # <tf.Variable 'Variable:0' shape=(1,) dtype=float32, numpy=array([20.], dtype=float32)>

v3 = tf.Variable(tf.ones(shape=(1, 2)))
v3.assign([[30, 40]])
print(v3)  # <tf.Variable 'Variable:0' shape=(1, 2) dtype=float32, numpy=array([[30., 40.]], dtype=float32)>
print()

v1 = tf.Variable([3])
v2 = tf.Variable([5])
v3 = v1 * v2 + 1
print(v3)           # tf.Tensor([16], shape=(1,), dtype=int32)
print()

var = tf.Variable([1,2,3,4,5], dtype=tf.float64)
result1 = var + 1
print(result1)      # tf.Tensor([2. 3. 4. 5. 6.], shape=(5,), dtype=float64)

v1 = tf.Variable(1, dtype=tf.float64) :

tf.Variable(tf.ones(2,)) :

tf.Variable(tf.ones(2, 1)) :

tf.Variable(tf.ones(shape=(1,))) :

tf.Variable(tf.ones(shape=(1, 2))) :

tf.ones(2,) :

v1.assign(10) :

w = tf.Variable(tf.ones(shape=(1,)))
b = tf.Variable(tf.ones(shape=(1,)))
w.assign([2])
b.assign([2])

def func1(x): # 파이썬 함수
    return w * x + b

print(func1(3))     # tf.Tensor([8.], shape=(1,), dtype=float32)
print(func1([3]))   # tf.Tensor([8.], shape=(1,), dtype=float32)
print(func1([[3]])) # tf.Tensor([[8.]], shape=(1, 1), dtype=float32)

@tf.function  # auto graph 기능 : tf.Graph + tf.Session. 파이썬 함수를 호출가능한 그래프 객체로변환. 텐서 플로우 그래프에 포함 되어 실행됨. 속도향상.
def func2(x): # 파이썬 함수
    return w * x + b

print(func2(3))

w = tf.Variable(tf.keras.backend.random_normal([5, 5], mean=0, stddev=0.3))
print(w.numpy())
'''
[[-1.3490540e-01 -1.9329010e-01  3.0367750e-01 -8.5950837e-02
  -4.1638307e-02]
 [ 2.0019636e-02 -2.5594628e-01  3.2065052e-01 -2.9873247e-03
  -1.8881789e-01]
 [ 4.1752983e-02  5.6705410e-03  2.5054044e-01  3.9801872e-03
   8.7102905e-02]
 [ 2.2132353e-01  2.5961196e-01  5.9260022e-02 -3.5298767e-04
   8.9973018e-02]
 [ 2.1339096e-01  2.9289970e-01  8.9739263e-02 -3.5879064e-01
   1.7020643e-01]]
'''
print(w.numpy().mean())   # 0.046684794
import numpy as np
print(np.mean(w.numpy())) # 0.046684794
b = tf.Variable(tf.zeros([5]))
print(w + b)
'''
[[-0.1764095  -0.2845988   0.12445427 -0.2934744   0.02773428]
 [-0.13376766  0.4082014  -0.26797575  0.23485608  0.10693993]
 [-0.15702389 -0.29115614  0.05970388 -0.01733402  0.11660431]
 [ 0.0814186   0.00365748  0.09495246 -0.17214663 -0.3305759 ]
 [-0.05509191 -0.29747888 -0.25892213 -0.20705828  0.3140773 ]], shape=(5, 5), dtype=float32)
'''

# assign
aa = tf.ones((2, 1))
print(aa.numpy()) # [[1.] [1.]]

m = tf.Variable(tf.zeros((2, 1)))
m.assign(aa)
print(m.numpy())  # [[1.] [1.]]

m.assign_add(aa)
print(m.numpy())  # [[2.] [2.]]

m.assign_sub(aa)
print(m.numpy())  # [[1.] [1.]]
print()

m.assign(2 * m)
print(m.numpy())  # [[2.] [2.]]

: 텐서플로우는 텐서 계산을 그래프로 작업한다.

: 2.x부터는 그래프가 묵시적으로 활동한다.
: 그래프는 계산의 단위를 나타내는 tf.Operation 객체와 연산 간에 흐르는 데이터의 단위를 나타내는 tf.Tensor 객체의

세트를 포함한다.
: 데이터 구조는 tf. 컨텍스트에서 정의됩니다.

* tf3.py

import tensorflow as tf

a = tf.constant(1)
print(a)            # tf.Tensor(1, shape=(), dtype=int32)

g1 = tf.Graph()

with g1.as_default():
    c1 = tf.constant(1, name= "c_one")
    c2 = tf.constant(1, name= "c_two")
    print(c1)       # Tensor("c_one:0", shape=(), dtype=int32)
    print(type(c1)) # <class 'tensorflow.python.framework.ops.Tensor'>
    print()
    print(c1.op)    # c1은 constant를 가리키는 pointer
    '''
    name: "c_one"
    op: "Const"
    attr {
      key: "dtype"
      value {
        type: DT_INT32
      }
    }
    attr {
      key: "value"
      value {
        tensor {
          dtype: DT_INT32
          tensor_shape {
          }
          int_val: 1
        }
      }
    }
    '''
    print()
    print(g1.as_graph_def())
    # tensor board : graph를 시각화
    '''
    node {
      name: "c_one"
      op: "Const"
      attr {
        key: "dtype"
        value {
          type: DT_INT32
        }
      }
      attr {
        key: "value"
        value {
          tensor {
            dtype: DT_INT32
            tensor_shape {
            }
            int_val: 1
          }
        }
      }
    }
    node {
      name: "c_two"
      op: "Const"
      attr {
        key: "dtype"
        value {
          type: DT_INT32
        }
      }
      attr {
        key: "value"
        value {
          tensor {
            dtype: DT_INT32
            tensor_shape {
            }
            int_val: 1
          }
        }
      }
    }
    versions {
      producer: 561
    }
    '''

g1 = tf.Graph() : 그래프 생성

g1.as_default() :

g1.as_graph_def() :

c1 = tf.constant(1, name= "") : 상수 생성

c1.op : 내부 구조 확인

v1 = tf.Variable(initial_value=1, name='') : 변수 생성

v1.op : 내부 구조 확인

g2 = tf.Graph()
with g2.as_default():
    v1 = tf.Variable(initial_value=1, name='v1')
    print(v1)        # <tf.Variable 'v1:0' shape=() dtype=int32>
    print(type(v1))  # <class 'tensorflow.python.ops.resource_variable_ops.ResourceVariable'>
    print()
    print(v1.op)
    '''
    name: "v1"
    op: "VarHandleOp"
    attr {
      key: "_class"
      value {
        list {
          s: "loc:@v1"
        }
      }
    }
    attr {
      key: "allowed_devices"
      value {
        list {
        }
      }
    }
    attr {
      key: "container"
      value {
        s: ""
      }
    }
    attr {
      key: "dtype"
      value {
        type: DT_INT32
      }
    }
    attr {
      key: "shape"
      value {
        shape {
        }
      }
    }
    attr {
      key: "shared_name"
      value {
        s: "v1"
      }
    }
    '''
    print()
print(g2.as_graph_def())
'''
node {
  name: "v1/Initializer/initial_value"
  op: "Const"
  attr {
    key: "_class"
    value {
      list {
        s: "loc:@v1"
      }
    }
  }
  attr {
    key: "dtype"
    value {
      type: DT_INT32
    }
  }
  attr {
    key: "value"
    value {
      tensor {
        dtype: DT_INT32
        tensor_shape {
        }
        int_val: 1
      }
    }
  }
}
node {
  name: "v1"
  op: "VarHandleOp"
  attr {
    key: "_class"
    value {
      list {
        s: "loc:@v1"
      }
    }
  }
  attr {
    key: "allowed_devices"
    value {
      list {
      }
    }
  }
  attr {
    key: "container"
    value {
      s: ""
    }
  }
  attr {
    key: "dtype"
    value {
      type: DT_INT32
    }
  }
  attr {
    key: "shape"
    value {
      shape {
      }
    }
  }
  attr {
    key: "shared_name"
    value {
      s: "v1"
    }
  }
}
node {
  name: "v1/IsInitialized/VarIsInitializedOp"
  op: "VarIsInitializedOp"
  input: "v1"
}
node {
  name: "v1/Assign"
  op: "AssignVariableOp"
  input: "v1"
  input: "v1/Initializer/initial_value"
  attr {
    key: "dtype"
    value {
      type: DT_INT32
    }
  }
}
node {
  name: "v1/Read/ReadVariableOp"
  op: "ReadVariableOp"
  input: "v1"
  attr {
    key: "dtype"
    value {
      type: DT_INT32
    }
  }
}
versions {
  producer: 561
}
'''

tf.constant : 텐서(상수값) 기억

tf.Variable : 텐서가 저장된 주소를 기억

* tf4.py

import numpy as np
import tensorflow as tf

a = 10
print(a, type(a))    # 10 <class 'int'>
print()

b = tf.constant(10)
print(b, type(b))   
# tf.Tensor(10, shape=(), dtype=int32) 
# <class 'tensorflow.python.framework.ops.EagerTensor'>

c = tf.Variable(10)
print(c, type(c))
# <tf.Variable 'Variable:0' shape=() dtype=int32, numpy=10>
# <class 'tensorflow.python.ops.resource_variable_ops.ResourceVariable'>
print()

node1 = tf.constant(3.0, tf.float32)
node2 = tf.constant(4.0)
print(node1)        # tf.Tensor(3.0, shape=(), dtype=float32)
print(node2)        # tf.Tensor(4.0, shape=(), dtype=float32)
node3 = tf.add(node1, node2)
print(node3)        # tf.Tensor(7.0, shape=(), dtype=float32)

v = tf.Variable(1) # 1

def find_next_odd():       # 파이썬 함수
    v.assign(v + 1)        # 2
    if tf.equal(v % 2, 0): # 파이썬 제어문
        v.assign(v + 10)   # 12

@tf.function
def find_next_odd():       # auto graph 기능에 의해 tenserflow의 Graph 객체 환경에서 작업할 수 있도록 코드 변형.
    v.assign(v + 1)        # 2
    if tf.equal(v % 2, 0): #  Graph 객체 환경에서 사용하는  제어문으로 코드 변환
        v.assign(v + 10)   # 12
        
find_next_odd()
print(v.numpy()) # 12

@tf.function :

tf.equal(변수, 비교값) :

- Auto Graph

rfriend.tistory.com/555

TensorFlow 2.0 의 AutoGraph 와 tf.function 으로 Python코드를 TF Graph로 자동 변환하기

TensorFlow에서 그래프(Graphs)는 tf.Operation 객체의 집합을 포함하고 있는 데이터 구조로서, 연산의 단위와 텐서 객체, 연산간에 흐르는 데이터 단위를 나타냅니다. ("Graphs are data structures that contain..

rfriend.tistory.com

=> @tf.function 사용시 function내부에서 데이터 강제 가공 처리 불가.

=> @tf.function 사용 전 funcion 실행하여 정상 실행 여부 확인 후 추가.

def func():
    temp = tf.constant(0)
    # temp=0
    su = 1
    for _ in range(3):
        temp = tf.add(temp, su)
        # temp += su
    return temp

kbs = func()
print(kbs) # tf.Tensor(3, shape=(), dtype=int32)
print(kbs.numpy(), ' ', np.array(kbs)) # 3   3

temp = tf.constant(0)
@tf.function
def func2():
    #temp = tf.constant(0)
    global temp
    su = 1
    for _ in range(3):
        temp = tf.add(temp, su)
    return temp

mbc = func2()
print(mbc) # tf.Tensor(3, shape=(), dtype=int32)

global

#@tf.function 사용불가
def func3():
    temp = tf.Variable(0)
    su = 1
    for _ in range(3):
        #temp = tf.add(temp, su)
        temp = temp +su #temp += su 불가
    return temp

sbs = func3()
print(sbs)

=> tf.Variable() 내부에 사용시 @tf.function 사용불가

temp = tf.Variable(0) # auto graph 외부에 선언
@tf.function
def func4():
    su = 1
    for _ in range(3):
        #temp = tf.add(temp, su) 불가
        #temp = temp +su 불가
        temp.assign_add(su) # 누적방법
    return temp

ytn = func4()
print(ytn)

=> tf.Variable() 외부에 사용시 @tf.function 사용 가능하며 누적은 temp.assign_add(su)로만 가능

# 구구단
@tf.function
def gugu1(dan):
    su = 0
    for _ in range(9):
        su = tf.add(su, 1)
        # print(su.numpy())
        # AttributeError: 'Tensor' object has no attribute 'numpy'
        # print('{} * {} = {:2}'.format(dan, su, dan * su))
        # TypeError: unsupported format string passed to Tensor.__format__

print(gugu1(3))

=> @tf.function사용 시 numpy() 강제 형변환, format 사용 불가.

@tf.function
def gugu2(dan):
    for i in range(1, 10):
        result = tf.multiply(dan, i)
        # print(result.numpy()) # AttributeError: 'Tensor' object has no attribute 'numpy'
        print(result)

print(gugu2(3))

연산자와 기본 함수

* tf5.py

import tensorflow as tf
import numpy as np

x = tf.constant(7)
y = 3

# 삼항 연산
result1 = tf.cond(x > y, lambda:tf.add(x,y), lambda:tf.subtract(x, y))
print(result1, ' ', result1.numpy()) # tf.Tensor(10, shape=(), dtype=int32)   10

tf.cond(조건, 참일때 실행함수, 거짓일때 실행함수) : 삼항연산자

# case 조건
f1 = lambda:tf.constant(1)
print(f1) # 주소

f2 = lambda:tf.constant(2)
print(f2()) # 실행값

a = tf.constant(3)
b = tf.constant(4)
result2 = tf.case([(tf.less(a, b), f1)], default=f2)
print(result2, ' ', result2.numpy()) # tf.Tensor(1, shape=(), dtype=int32)   1
print()

# 관계연산
print(tf.equal(1, 2).numpy())    # False
print(tf.not_equal(1, 2))        # tf.Tensor(True, shape=(), dtype=bool)
print(tf.less(1, 2))             # tf.Tensor(True, shape=(), dtype=bool)
print(tf.greater(1, 2))          # tf.Tensor(False, shape=(), dtype=bool)
print(tf.greater_equal(1, 2))    # tf.Tensor(False, shape=(), dtype=bool)

# 논리연산
print(tf.logical_and(True, False).numpy())  # False
print(tf.logical_or(True, False).numpy())   # True
print(tf.logical_not(True).numpy())         # False

kbs = tf.constant([1,2,2,2,3])
val, idx = tf.unique(kbs)
print(val.numpy()) # [1 2 3]
print(idx.numpy()) # [0 1 1 1 2]
print()

ar = [[1,2],[3,4]]
print(tf.reduce_mean(ar).numpy()) # 차원 축소를 하며 평균 산출 => 2
print(tf.reduce_mean(ar, axis=0).numpy()) # 열방향 [2 3]
print(tf.reduce_mean(ar, axis=1).numpy()) # 행방향 [1 3]
print(tf.reduce_sum(ar).numpy())  # 차원 축소를 하며 합 산출   => 10
print()

t = np.array([[[0,1,2],[3,4,5]],[[6,7,8],[9,10,11]]])
print(t.shape) # (2, 2, 3)
print(tf.reshape(t, shape=[2, 6]))
print(tf.reshape(t, shape=[-1, 6]))
print(tf.reshape(t, shape=[2, -1]))
'''
tf.Tensor(
[[ 0  1  2  3  4  5]
 [ 6  7  8  9 10 11]], shape=(2, 6), dtype=int32)
tf.Tensor(
[[ 0  1  2  3  4  5]
 [ 6  7  8  9 10 11]], shape=(2, 6), dtype=int32)
tf.Tensor(
[[ 0  1  2  3  4  5]
 [ 6  7  8  9 10 11]], shape=(2, 6), dtype=int32)
'''
print()
print(tf.squeeze(t)) # 열 요소가 1개인 경우 차원 축소
'''
tf.Tensor(
[[[ 0  1  2]
  [ 3  4  5]]

 [[ 6  7  8]
  [ 9 10 11]]], shape=(2, 2, 3), dtype=int32)
'''
print()

aa = np.array([[1], [2], [3], [4]])
print(aa.shape)     # (4, 1)
bb = tf.squeeze(aa)
print(bb, bb.shape) # tf.Tensor([1 2 3 4], shape=(4,), dtype=int32) (4,)
print()

print(tf.expand_dims(t, 0)) # 차원 확장
'''
tf.Tensor(
[[[[ 0  1  2]
   [ 3  4  5]]

  [[ 6  7  8]
   [ 9 10 11]]]], shape=(1, 2, 2, 3), dtype=int32)
'''
print(tf.expand_dims(t, 1))  # shape=(2, 1, 2, 3), dtype=int32)
print(tf.expand_dims(t, -1)) # shape=(2, 2, 3, 1), dtype=int32)
print()

print(tf.one_hot([0,1,2,0], depth=2))
'''
tf.Tensor(
[[1. 0.]
 [0. 1.]
 [0. 0.]
 [1. 0.]], shape=(4, 2), dtype=float32)
'''
print(tf.one_hot([0,1,2,0], depth=3))
'''
tf.Tensor(
[[1. 0. 0.]
 [0. 1. 0.]
 [0. 0. 1.]
 [1. 0. 0.]], shape=(4, 3), dtype=float32)
'''
print(tf.one_hot([0,1,2,0], depth=3))
print(tf.argmax(tf.one_hot([0,1,2,0], depth=3)).numpy()) # 각행에서가장 큰값 출력
# [0 1 2]

'BACK END > Deep Learning' 카테고리의 다른 글

[딥러닝] Keras - Logistic (0)	2021.03.25
[딥러닝] Keras - Linear (0)	2021.03.23
[딥러닝] TensorFlow 환경설정 (0)	2021.03.22
[딥러닝] DBScan (0)	2021.03.22
[딥러닝] k-means (0)	2021.03.22

[딥러닝] TensorFlow 환경설정

2021. 3. 22. 16:32

가상 드라이버 설치

① my.vmware.com/en/web/vmware/downloads 접속

https://my.vmware.com/en/web/vmware/downloads

my.vmware.com

② VMware Workstation Player Download하여 [VMware-player-16.1.0-17198959.exe] 설치

CentOS ISO 파일 다운로드

① www.centos.org/ - Download

The CentOS Project

CfP open for May virtual Dojo We plan to host an online dojo, May 13th and 14th. Details and the call for presentations are now available on the events wiki. FOSDEM CentOS Dojo Recap We held the annual CentOS Dojo at FOSDEM on Feburuary 4th and 5th. Catch

www.centos.org

② IOS - x86_64 접속

③ [http://mirror.navercorp.com/centos/8.3.2011/isos/x86_64/] 선택하여 CentOS-8.3.2011-x86_64-dvd1.iso 다운로드

Linux 설치

① [VMware-player-16.1.0-17198959.exe] 실행후 설치

- [i will install the operating system later] 선택
- [Linux], [CentOS 8 64-bit] 선택
- [my]입력, brow 선택 후 경로 [C:\work\vm\test] 선택
- 용량 입력(20GB) - Finish

② Edit Virtual machine setting

- 용량 선택(4GB)

- process - 2로 수정
- cd/dvd - user iso image brow - centos 선택
- network - advance - Generate 선택
- ok

③ my 선택 및 Play virtual machine
- i enter

- 한국어 선택 및 계속 진행

④ 설정

- 언어지원 - 완료
- 시간 및 날짜 - 아시아 - 서울 - 켬
- 설치 소스 - 자동감지
- 소프트웨어 선택 - 워크스테이션 - GNOME , 인터넷 프로그램, 개발용 툴, 네트워크 서버 선택 완료.
- 오토매틱
- 이더넷 켬
- 암호설정 1234 완료 두번
- 사용자 생성 - hadoop 완료 두번

- 설치 시작

⑤ 접속

- 약관동의 및 동의 선택
- 설정완료 선택
- 한국어(hangul) 선택

⑥ 리눅스 업데이트

- 리눅스 접속
- 소프트웨어 클릭
- 업데이트 클릭
- 다운로드 클릭

C++ 설치

- support.microsoft.com/ko-kr/help/2977003/the-latest-supported-visual-c-downloads 접속

- [vc_redist.x64.exe] 다운로드 및 실행하여 설치

tensorflow 설치

- anaconda prompt 실행

pip install --upgrade pip 입력
pip install --upgrade tensorflow 입력

python
import tensorflow as tf
tf.__version__

'BACK END > Deep Learning' 카테고리의 다른 글

[딥러닝] Keras - Linear (0)	2021.03.23
[딥러닝] TensorFlow (0)	2021.03.22
[딥러닝] DBScan (0)	2021.03.22
[딥러닝] k-means (0)	2021.03.22
[딥러닝] 클러스터링 (0)	2021.03.19

[딥러닝] DBScan

2021. 3. 22. 12:56

DBSCAN

: 밀도 기반 클러스터링 : kmeans와 달리 k를 지정하지않음

- 이론

untitledtblog.tistory.com/146

[데이터 마이닝] DBSCAN과 밀도 기반 클러스터링

1. 밀도 기반 클러스터링 (Density-based clustering) 클러스터링 알고리즘은 크게 중심 기반 (center-based) 알고리즘과 밀도 기반 (density-based) 알고리즘으로 나눌 수 있다. 중심 기반 알고리즘의 가장 대표

untitledtblog.tistory.com

- K-Means / DBSCAN 동작 보기

https://www.naftaliharris.com/blog/visualizing-dbscan-clustering/

Visualizing DBSCAN Clustering

January 24, 2015 A previous post covered clustering with the k-means algorithm. In this post, we consider a fundamentally different, density-based approach called DBSCAN. In contrast to k-means, which modeled clusters as sets of points near to their center

www.naftaliharris.com

* cluster6_dbscan

import matplotlib.pyplot as plt
import numpy as np
from sklearn.datasets import make_moons
from sklearn.cluster import KMeans, DBSCAN, AgglomerativeClustering

x, y = make_moons(n_samples = 200, noise = 0.05, random_state=0)
print(x)
print(y)
plt.scatter(x[:, 0], x[:, 1])
plt.show()

from sklearn.datasets import make_moons

make_moons(n_samples = 샘플수, noise = 노이즈 정도, random_state=난수seed) :

def plotResult(x, y):
    plt.scatter(x[y == 0, 0], x[y == 0, 1], c='blue', marker='o', label='clu-1')
    plt.scatter(x[y == 1, 0], x[y == 1, 1], c='red', marker='s', label='clu-2')
    plt.legend()
    plt.show()

- KMEANS 사용 : 완전한 분리 X

km = KMeans(n_clusters=2, random_state=0)
pred1 = km.fit_predict(x)

plotResult(x, pred1)

- DBSCAN 사용

dm = DBSCAN(eps=0.2, min_samples=5, metric = 'euclidean')
pred2 = dm.fit_predict(x)
print(pred2)

plotResult(x, pred2)

'BACK END > Deep Learning' 카테고리의 다른 글

[딥러닝] TensorFlow (0)	2021.03.22
[딥러닝] TensorFlow 환경설정 (0)	2021.03.22
[딥러닝] k-means (0)	2021.03.22
[딥러닝] 클러스터링 (0)	2021.03.19
[딥러닝] Neural Network (0)	2021.03.19

[딥러닝] k-means

2021. 3. 22. 10:35

k-means

: 비계층 군집분석

: 특정한 임의 지점을 선택해 해당 중심에 가까운 포인트들을 선택하는 군집화 기법

- 이론

ratsgo.github.io/machine%20learning/2017/04/19/KC/

K-평균 군집화(K-means Clustering) · ratsgo's blog

이번 글에서는 K-평균 군집화(K-means Clustering)에 대해 살펴보겠습니다. (줄여서 KC라 부르겠습니다) 이번 글은 고려대 강필성 교수님과 역시 같은 대학의 김성범 교수님 강의를 정리했음을 먼저 밝

ratsgo.github.io

* cluster3_kmeans.py

import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs

print(make_blobs)

x, y = make_blobs(n_samples=150, n_features=2, centers=3, cluster_std = 0.5, shuffle = True, random_state=0)
print(x)
'''
[[ 2.60509732  1.22529553]
 [ 0.5323772   3.31338909]
 [ 0.802314    4.38196181]
 [ 0.5285368   4.49723858]
 [ 2.61858548  0.35769791]
 [ 1.59141542  4.90497725]
 ...
]
'''
print(y)
# [1 0 0 0 1 0 0 1 2 0 1 2 2 ...]

plt.scatter(x[:, 0], x[:, 1], c='gray', marker='o')
plt.grid(True)
plt.show()

from sklearn.datasets import make_blobs

make_blobs(n_samples=샘플수, n_features=, centers=중심점수, cluster_std = 분산, shuffle = True, random_state=난수 seed) : blobs dataset 생성k-means

from sklearn.cluster import KMeans

kmodel = KMeans(n_clusters = 3, init='k-means++', random_state = 0).fit(x)
print(kmodel)


pred = kmodel.fit_predict(x)
print('pred:', pred)
'''
pred: [1 2 2 2 1 2 2 1 0 2 1 0 0 2 2 0 0 1 0 1 2 1 2 2 0 1 1 2 0 1 0 0 0 0 2 1 1
 1 2 2 0 0 2 1 1 1 0 2 0 2 1 2 2 1 1 0 2 1 0 2 0 0 0 0 2 0 2 1 2 2 2 1 1 2
 1 2 2 0 0 2 1 1 2 2 1 1 1 0 0 1 1 2 1 2 1 2 0 0 1 1 1 1 0 1 1 2 0 2 2 2 0
 2 1 0 2 0 2 2 0 0 2 1 2 2 1 1 0 1 0 0 0 0 1 0 0 0 2 0 1 0 2 2 1 1 0 0 0 0
 1 1]
'''
print(x[pred == 0])
'''
[[-2.12133364  2.66447408]
 [-0.37494566  2.38787435]
 [-1.84562253  2.71924635]
 ...
]
'''
print()
print(x[pred == 1])
'''
[[ 2.60509732  1.22529553]
 [ 2.61858548  0.35769791]
 [ 2.37533328  0.08918564]
 ...
]
'''
print()
print(x[pred == 2])
'''
[[ 0.5323772   3.31338909]
 [ 0.802314    4.38196181]
 [ 0.5285368   4.49723858]
 ...
]
'''
print()

from sklearn.cluster import KMeans

KMeans(n_clusters = 군집 수, init='k-means++', random_state = 난수 seed).fit(x) : kmeans

- KMeans API

scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html

sklearn.cluster.KMeans — scikit-learn 0.24.1 documentation

scikit-learn.org

plt.scatter(x[pred==0, 0], x[pred==0, 1], c = 'red', marker='o', label='cluster1')
plt.scatter(x[pred==1, 0], x[pred==1, 1], c = 'green', marker='s', label='cluster2')
plt.scatter(x[pred==2, 0], x[pred==2, 1], c = 'blue', marker='v', label='cluster3')
plt.scatter(kmodel.cluster_centers_[:, 0], kmodel.cluster_centers_[:, 1], c = 'black', marker='+', s=50, label='center')
plt.legend()
plt.grid(True)
plt.show()

# 몇개의 그룹으로 나눌지가 중요. k의 값.
# 방법 1 : elbow - 클러스터간 SSE(오차 제곱의 함, sum of squares error)의 차이를 이용해 k 개수를 알 수 있다.
plt.rc('font', family = 'malgun gothic')

def elbow(x):
    sse = []
    for i in range(1, 11): # KMeans 모델을 10번 실행
        km = KMeans(n_clusters = i, init='k-means++', random_state = 0).fit(x)
        sse.append(km.inertia_)
    print(sse)
    plt.plot(range(1, 11), sse, marker='o')
    plt.xlabel('클러스터 수')
    plt.ylabel('SSE')
    plt.show()

elbow(x) # k는 3을 추천

# 방법 2 : silhoutte
'''
실루엣(silhouette) 기법
  클러스터링의 품질을 정량적으로 계산해 주는 방법이다.
  클러스터의 개수가 최적화되어 있으면 실루엣 계수의 값은 1에 가까운 값이 된다.
  실루엣 기법은 k-means 클러스터링 기법 이외에 다른 클러스터링에도 적용이 가능하다
'''
import numpy as np
from sklearn.metrics import silhouette_samples
from matplotlib import cm
 
# 데이터 X와 X를 임의의 클러스터 개수로 계산한 k-means 결과인 y_km을 인자로 받아 각 클러스터에 속하는 데이터의 실루엣 계수값을 수평 막대 그래프로 그려주는 함수를 작성함.
# y_km의 고유값을 멤버로 하는 numpy 배열을 cluster_labels에 저장. y_km의 고유값 개수는 클러스터의 개수와 동일함.
def plotSilhouette(x, pred):
    cluster_labels = np.unique(pred)
    n_clusters = cluster_labels.shape[0]   # 클러스터 개수를 n_clusters에 저장
    sil_val = silhouette_samples(x, pred, metric='euclidean')  # 실루엣 계수를 계산
    y_ax_lower, y_ax_upper = 0, 0
    yticks = []
    for i, c in enumerate(cluster_labels):
        # 각 클러스터에 속하는 데이터들에 대한 실루엣 값을 수평 막대 그래프로 그려주기
        c_sil_value = sil_val[pred == c]
        c_sil_value.sort()
        y_ax_upper += len(c_sil_value)
       
        plt.barh(range(y_ax_lower, y_ax_upper), c_sil_value, height=1.0, edgecolor='none')
        yticks.append((y_ax_lower + y_ax_upper) / 2)
        y_ax_lower += len(c_sil_value)
   
    sil_avg = np.mean(sil_val)         # 평균 저장
    plt.axvline(sil_avg, color='red', linestyle='--')  # 계산된 실루엣 계수의 평균값을 빨간 점선으로 표시
    plt.yticks(yticks, cluster_labels + 1)
    plt.ylabel('클러스터')
    plt.xlabel('실루엣 개수')
    plt.show() 
'''
그래프를 보면 클러스터 1~3 에 속하는 데이터들의 실루엣 계수가 0으로 된 값이 아무것도 없으며, 실루엣 계수의 평균이 0.7 보다 크므로 잘 분류된 결과라 볼 수 있다.
'''

X, y = make_blobs(n_samples=150, n_features=2, centers=3, cluster_std=0.5, shuffle=True, random_state=0)
km = KMeans(n_clusters=3, random_state=0) 
y_km = km.fit_predict(X)

plotSilhouette(X, y_km)

* cluster4.py

# 숫자 이미지 데이터에 K-평균 알고리즘 사용하기
import matplotlib.pyplot as plt
import seaborn as sns; sns.set()  
import numpy as np
from sklearn.datasets import load_digits

digits = load_digits()      # 64개의 특징(feature)을 가진 1797개의 표본으로 구성된 숫자 데이터
print(digits.data.shape)  # (1797, 64) 64개의 특징은 8*8 이미지의 픽셀당 밝기를 나타냄

from sklearn.cluster import KMeans
kmeans = KMeans(n_clusters=10, random_state=0)
clusters = kmeans.fit_predict(digits.data)
print(kmeans.cluster_centers_.shape)  # (10, 64)  # 64차원의 군집 10개를 얻음

# 군집중심이 어떻게 보이는지 시각화
fig, ax = plt.subplots(2, 5, figsize=(8, 3))
centers = kmeans.cluster_centers_.reshape(10, 8, 8)
for axi, center in zip(ax.flat, centers):
    axi.set(xticks=[], yticks=[])
    axi.imshow(center, interpolation='nearest')
plt.show()  # 결과를 통해 KMeans가 레이블 없이도 1과 8을 제외하면 
# 인식 가능한 숫자를 중심으로 갖는 군집을 구할 수 있다는 사실을 알 수 있다. 

# k평균은 군집의 정체에 대해 모르기 때문에 0-9까지 레이블은 바뀔 수 있다.
# 이 문제는 각 학습된 군집 레이블을 그 군집 내에서 발견된 실제 레이블과 매칭해 보면 해결할 수 있다.
from scipy.stats import mode

labels = np.zeros_like(clusters)
for i in range(10):
    mask = (clusters == i)
    labels[mask] = mode(digits.target[mask])[0]

# 정확도 확인
from sklearn.metrics import accuracy_score
print(accuracy_score(digits.target, labels))  # 0.79354479

# 오차행렬로 시각화
from sklearn.metrics import confusion_matrix
mat = confusion_matrix(digits.target, labels)
sns.heatmap(mat.T, square=True, annot=True, fmt='d', cbar=False,
            xticklabels=digits.target_names,
            yticklabels=digits.target_names)
plt.xlabel('true label')
plt.ylabel('predicted label')
plt.show()  # 오차의 주요 지점은 1과 8에 있다.

# 참고로 t분포 확률 알고리즘을 사용하면 분류 정확도가 높아진다.
from sklearn.manifold import TSNE

# 시간이 약간 걸림
tsne = TSNE(n_components=2, init='random', random_state=0)
digits_proj = tsne.fit_transform(digits.data)

# Compute the clusters
kmeans = KMeans(n_clusters=10, random_state=0)
clusters = kmeans.fit_predict(digits_proj)

# Permute the labels
labels = np.zeros_like(clusters)
for i in range(10):
    mask = (clusters == i)
    labels[mask] = mode(digits.target[mask])[0]

# Compute the accuracy
print(accuracy_score(digits.target, labels))  # 0.93266555

iris dataset으로 지도/비지도 학습 - KNN, KMEANS

* cluster5_iris.py

from sklearn.datasets import load_iris

iris_dataset = load_iris()
print(iris_dataset.keys())
# dict_keys(['data', 'target', 'frame', 'target_names', 'DESCR', 'feature_names', 'filename'])

print(iris_dataset['data'][:3])
'''
[[5.1 3.5 1.4 0.2]
 [4.9 3.  1.4 0.2]
 [4.7 3.2 1.3 0.2]]
'''
print(iris_dataset['feature_names'])
# ['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']
print(iris_dataset['target'][:3]) # [0 0 0]
print(iris_dataset['target_names']) # ['setosa' 'versicolor' 'virginica']

# train/test
from sklearn.model_selection import train_test_split
train_x, test_x, train_y, test_y = train_test_split(iris_dataset['data'], iris_dataset['target'], test_size = 0.25, random_state = 42)
print(train_x.shape, test_x.shape) # (112, 4) (38, 4)

- 지도학습 : KNN

from sklearn.neighbors import KNeighborsClassifier

knnModel = KNeighborsClassifier(n_neighbors=1, weights='distance', metric = 'euclidean')
print(knnModel)
knnModel.fit(train_x, train_y)

# 모델 성능 
import numpy as np
predict_label = knnModel.predict(test_x)
print('예측값 :', predict_label) # [1 0 2 1 1 0 1 2 1 1 2 0 0 0 0 1 2 1 1 2 0 2 0 2 2 2 2 2 0 0 0 0 1 0 0 2 1 0]
print('실제값 :', test_y)        # [1 0 2 1 1 0 1 2 1 1 2 0 0 0 0 1 2 1 1 2 0 2 0 2 2 2 2 2 0 0 0 0 1 0 0 2 1 0]
print('test acc : {:.3f}'.format(np.mean(predict_label == test_y))) # test acc : 1.000

from sklearn import metrics
print('test acc :', metrics.accuracy_score(test_y, predict_label))  # test acc : 1.0
print()

# 새로운 값을 분류
new_input = np.array([[6.6, 5.5, 4.4, 1.1]])
print(knnModel.predict(new_input))       # [1]
print(knnModel.predict_proba(new_input)) # [[0. 1. 0.]]

dist, index = knnModel.kneighbors(new_input)
print(dist, index) # [[2.24276615]] [[3]]
print()

- 비지도학습 : K-MEANS

from sklearn.cluster import KMeans
kmeansModel = KMeans(n_clusters = 3, init='k-means++', random_state=0)
kmeansModel.fit(train_x) # feature만 참여
print(kmeansModel.labels_) # [1 1 0 0 0 1 1 0 0 2 0 2 0 2 0 1 ...

print('0 cluster : ', train_y[kmeansModel.labels_ == 0])
# 0 cluster :  [2 1 1 1 2 1 1 1 1 1 2 1 1 1 2 2 2 1 1 1 1 1 2 1 1 1 1 2 1 1 1 2 1 1 1 1 1 1 1 1 1 1 1 2 2 1 2 1]
print('1 cluster : ', train_y[kmeansModel.labels_ == 1])
# 1 cluster :  [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
print('2 cluster : ', train_y[kmeansModel.labels_ == 2])
# 2 cluster :  [2 2 2 2 2 2 2 1 2 2 2 2 2 2 2 1 2 2 2 2 2 2 2 2 1 2 2 2 2]

# 새로운 값을 분류
new_input = np.array([[6.6, 5.5, 4.4, 1.1]])
predict_cluster = kmeansModel.predict(new_input)
print(predict_cluster)       # [2]
print()

# 성능 측정
predict_test_x = kmeansModel.predict(test_x)
print(predict_test_x)

np_arr = np.array(predict_test_x)
np_arr[np_arr == 0], np_arr[np_arr == 1], np_arr[np_arr == 2] = 3, 4, 5 # 임시 저장용
print(np_arr)
np_arr[np_arr == 3] = 1 # 군집3을 1로 versicolor로 변경
np_arr[np_arr == 4] = 0 # 군집4을 0로 setosa로 변경
np_arr[np_arr == 5] = 2 # 군집5을 2로 verginica로 변경
print(np_arr)

predict_label = np_arr.tolist()
print(predict_label)
print('test acc :{:.3f}'.format(np.mean(predict_label == test_y))) # test acc :0.947

ㅁ

'BACK END > Deep Learning' 카테고리의 다른 글

[딥러닝] TensorFlow 환경설정 (0)	2021.03.22
[딥러닝] DBScan (0)	2021.03.22
[딥러닝] 클러스터링 (0)	2021.03.19
[딥러닝] Neural Network (0)	2021.03.19
[딥러닝] KNN (0)	2021.03.18

[딥러닝] 클러스터링

2021. 3. 19. 13:09

클러스터링(Clustering)

: 비지도 학습의 일종
: 계층적 군집분석

- 계층적 군집분석 종류

- 응집형 : 자료 하나하나를 군집으로 간주하고, 가까운 군집끼리 연결하는 방법. 군집의 크기를 점점 늘려가는 알고리즘. 상향식

- 분리형 : 전체 자료를 큰 군집으로 간주하고, 유의미한 부분을 분리해 나가는 방법. 군집의 크기를 점점 줄여가는 알고리즘. 하향식

k-means : 군집 수(k) 지정. 거리(유클리디안 거리 계산 법)들의 평균으로 비계층적 군집분석 진행.

- 이론

m.blog.naver.com/PostView.nhn?blogId=gkenq&logNo=10188552802&proxyReferer=https:%2F%2Fwww.google.com%2F

군집 분석 (Clustering analysis)

군집 분석은 각 개체의 유사성을 측정하여 높은 대상 집단을 분류하고, 군집에 속한 개체들의 유사성과 서...

blog.naver.com

* cluster1.py

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
plt.rc('font', family='malgun gothic')

np.random.seed(123)
var = ['x', 'y']
labels = ['점0','점1', '점2', '점3', '점4']
x = np.random.random_sample([5, 2]) * 10
df = pd.DataFrame(x, columns = var, index = labels)
print(df)
'''           x         y
점0  6.964692  2.861393
점1  2.268515  5.513148
점2  7.194690  4.231065
점3  9.807642  6.848297
점4  4.809319  3.921175
'''

plt.scatter(x[:, 0], x[:, 1], c='blue', marker='o')
plt.grid(True)
plt.show()

from scipy.spatial.distance import pdist, squareform

dist_vec = pdist(df, metric='euclidean') # 데이터간 거리를 유클리디안 거리계산을 사용하여 측정
print('distmatrix :', dist_vec) 
# [5.3931329  1.38884785 4.89671004 2.40182631 5.09027885 7.6564396 2.99834352 3.69830057 2.40541571 5.79234641]

print(squareform(dist_vec)) # 데이터를 테이블 형태로 변경
'''
[[0.         5.3931329  1.38884785 4.89671004 2.40182631]
 [5.3931329  0.         5.09027885 7.6564396  2.99834352]
 [1.38884785 5.09027885 0.         3.69830057 2.40541571]
 [4.89671004 7.6564396  3.69830057 0.         5.79234641]
 [2.40182631 2.99834352 2.40541571 5.79234641 0.        ]]
'''

row_dist = pd.DataFrame(squareform(dist_vec))
print(row_dist)
'''
          0         1         2         3         4
0  0.000000  5.393133  1.388848  4.896710  2.401826
1  5.393133  0.000000  5.090279  7.656440  2.998344
2  1.388848  5.090279  0.000000  3.698301  2.405416
3  4.896710  7.656440  3.698301  0.000000  5.792346
4  2.401826  2.998344  2.405416  5.792346  0.000000
'''

from scipy.spatial.distance import pdist

distance=pdist(df, metric='euclidean') : 데이터간 거리를 유클리디안 거리계산을 사용하여 측정

from scipy.spatial.distance import squareform

squareform(distance) : 데이터를 테이블 형태로 변경

from scipy.cluster.hierarchy import linkage # 응집형 계층적 군집분석

row_clusters = linkage(dist_vec, method='ward') # method : complete, single, average, .. 
print(row_clusters)
'''
[[0.         2.         1.38884785 2.        ]
 [4.         5.         2.65710936 3.        ]
 [1.         6.         5.45400408 4.        ]
 [3.         7.         6.64710151 5.        ]]
'''

df = pd.DataFrame(row_clusters, columns=['클러스터1', '클러스터2', '거리', '멤버 수'])
print(df)
'''
   클러스터1  클러스터2        거리  멤버 수
0    0.0    2.0  1.388848   2.0
1    4.0    5.0  2.657109   3.0
2    1.0    6.0  5.454004   4.0
3    3.0    7.0  6.647102   5.0
'''

from scipy.cluster.hierarchy import linkage

linkage(distance, method='ward') :응집형 계층적 군집분석

method : complete, single, average, ...

- linkage API

docs.scipy.org/doc/scipy/reference/generated/scipy.cluster.hierarchy.linkage.html

scipy.cluster.hierarchy.linkage — SciPy v1.6.1 Reference Guide

For method ‘single’, an optimized algorithm based on minimum spanning tree is implemented. It has time complexity \(O(n^2)\). For methods ‘complete’, ‘average’, ‘weighted’ and ‘ward’, an algorithm called nearest-neighbors chain is imple

docs.scipy.org

from scipy.cluster.hierarchy import dendrogram

dendrogram(row_clusters, labels=labels)
plt.tight_layout()
plt.ylabel('유클리드 거리')
plt.show()

from scipy.cluster.hierarchy import dendrogram

dendrogram(linkage값, labels=) : dendrogram 생성

- 계층적 클러스터 분류 결과 시각화

from sklearn.cluster import AgglomerativeClustering

ac = AgglomerativeClustering(n_clusters = 3, affinity='euclidean', linkage='ward')
labels = ac.fit_predict(x)
print('결과 :', labels) # 결과 : [0 2 0 1 0]

from sklearn.cluster import AgglomerativeClustering
AgglomerativeClustering(n_clueters = 3, affinty='euclidean', linkage='ward') : 병합 군집 알고리즘

a = labels.reshape(-1, 1)
print(a)
'''
[[0]
 [2]
 [0]
 [1]
 [0]]
'''
x1 = np.hstack([x, a])
print('x1 :', x1)
'''
x1 : 
[[6.96469186 2.86139335 0.        ]
 [2.26851454 5.51314769 2.        ]
 [7.1946897  4.2310646  0.        ]
 [9.80764198 6.84829739 1.        ]
 [4.80931901 3.92117518 0.        ]]
'''
x_0 = x1[x1[:, 2] == 0, :]
x_1 = x1[x1[:, 2] == 1, :]
x_2 = x1[x1[:, 2] == 2, :]

plt.scatter(x_0[:, 0], x_0[:, 1])
plt.scatter(x_1[:, 0], x_1[:, 1])
plt.scatter(x_2[:, 0], x_2[:, 1])
plt.legend(['cluster0', 'cluster1', 'cluster2'])
plt.show()

계층적 클러스터링 : iris

* cluster2.py

import pandas as pd
from sklearn.datasets import load_iris
import matplotlib.pyplot as plt

iris = load_iris()
iris_df = pd.DataFrame(iris.data, columns=iris.feature_names)
print(iris_df.head(3))
'''
   sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm)
0                5.1               3.5                1.4               0.2
1                4.9               3.0                1.4               0.2
2                4.7               3.2                1.3               0.2
'''
print(iris_df.loc[0:4, ['sepal length (cm)', 'sepal width (cm)']])
'''
   sepal length (cm)  sepal width (cm)
0                5.1               3.5
1                4.9               3.0
2                4.7               3.2
3                4.6               3.1
4                5.0               3.6
'''

from scipy.spatial.distance import pdist, squareform

#dist_vec = pdist(iris_df.loc[:, ['sepal length (cm)', 'sepal width (cm)']], metric = 'euclidean')
dist_vec = pdist(iris_df.loc[0:4, ['sepal length (cm)', 'sepal width (cm)']], metric = 'euclidean')
print(dist_vec)   # 데이터간 거리
# [0.53851648 0.5        0.64031242 0.14142136 0.28284271 0.31622777
#  0.60827625 0.14142136 0.5        0.64031242]

row_dist = pd.DataFrame(squareform(dist_vec)) # 테이블 형태로 변경
print('row_dist :\n', row_dist)
'''
           0         1         2         3         4
0  0.000000  0.538516  0.500000  0.640312  0.141421
1  0.538516  0.000000  0.282843  0.316228  0.608276
2  0.500000  0.282843  0.000000  0.141421  0.500000
3  0.640312  0.316228  0.141421  0.000000  0.640312
4  0.141421  0.608276  0.500000  0.640312  0.000000
'''

from scipy.cluster.hierarchy import linkage, dendrogram
row_clusters = linkage(dist_vec, method='complete') # 응집형 계층적 군집 분석
print('row_clusters :\n', row_clusters)
'''
[[0.         4.         0.14142136 2.        ]
 [2.         3.         0.14142136 2.        ]
 [1.         6.         0.31622777 3.        ]
 [5.         7.         0.64031242 5.        ]]
'''

df = pd.DataFrame(row_clusters, columns=['id1', 'id2', 'dist', 'count'])
print(df)
'''
   id1  id2      dist  count
0  0.0  4.0  0.141421    2.0
1  2.0  3.0  0.141421    2.0
2  1.0  6.0  0.316228    3.0
3  5.0  7.0  0.640312    5.0
'''
row_dend = dendrogram(row_clusters)  # dendrodgram
plt.ylabel('dist test')
plt.show()

from sklearn.cluster import AgglomerativeClustering

ac = AgglomerativeClustering(n_clusters = 2, affinity='euclidean', linkage='complete')
x = iris_df.loc[0:4, ['sepal length (cm)', 'sepal width (cm)']]
labels = ac.fit_predict(x)
print('클러스터 결과 :', labels) # 결과 : [1 0 0 0 1]

plt.hist(labels)
plt.grid(True)
plt.show()

비계층적 군집분석

yganalyst.github.io/ml/ML_clustering/

[클러스터링] 비계층적(K-means, DBSCAN) 군집분석

비계층적 군집분석 방법인 K-means와 DBSCAN에 대해 알아보자

yganalyst.github.io

'BACK END > Deep Learning' 카테고리의 다른 글

[딥러닝] DBScan (0)	2021.03.22
[딥러닝] k-means (0)	2021.03.22
[딥러닝] Neural Network (0)	2021.03.19
[딥러닝] KNN (0)	2021.03.18
[딥러닝] RandomForest (0)	2021.03.17

[딥러닝] Neural Network

2021. 3. 19. 12:02

인공 신경망

- 이론

brunch.co.kr/@gdhan/6

인공신경망 개념(Neural Network)

[인공지능 이야기] 생물학적 신경망, 인공신경망, 퍼셉트론, MLP | 인공신경망은 두뇌의 신경세포, 즉 뉴런이 연결된 형태를 모방한 모델이다. 인공신경망(ANN, Artificial Neural Network)은 간략히 신경

brunch.co.kr

x1 -> w1(가중치) -> [뉴런]

x2 -> w2(가중치) -> w1*x1 + w2*x2 + ... -> output : y

...

↖예측값과 실제값 비교 feedback하여 가중치 조절↙

cost(손실) 값과 weight(가중치)값을 비교하여 cost 값이 최소가 되는 지점의 weight 산출.

편미분으로 산출하여 기울기가 0인 지점 산출.

learning rate (학습률) : feedback하여 값을 산출할 경우 다음 feedback 간 간격 비율.

epoch(학습 수) : feedback 수

=> 다중 선형회귀

=> y1 = w*x + b (추세선)

=> 로지스틱 회귀

=> y2 = 1 / (1 + e^(y1) )

=> MLP

단층 신경망(뉴런, Node)

: 입력자료에 각각의 가중치를 곱해 더한 값을 대상으로 임계값(활성화 함수)을 기준하여 이항 분류가 가능. 예측도 가능

단층 신경망으로 논리회로 분류

* neural1.py

def or_func(x1, x2):
    w1, w2, theta = 0.5, 0.5, 0.3
    sigma = w1 * x1 + w2 * x2 + 0
    if sigma <= theta:
        return 0
    elif sigma > theta:
        return 1

print(or_func(0, 0)) # 0
print(or_func(1, 0)) # 1
print(or_func(0, 1)) # 1
print(or_func(1, 1)) # 1
print()

def and_func(x1, x2):
    w1, w2, theta = 0.5, 0.5, 0.7
    sigma = w1 * x1 + w2 * x2 + 0
    if sigma <= theta:
        return 0
    elif sigma > theta:
        return 1
    
print(and_func(0, 0)) # 0
print(and_func(1, 0)) # 0
print(and_func(0, 1)) # 0
print(and_func(1, 1)) # 1
print()

def xor_func(x1, x2):
    w1, w2, theta = 0.5, 0.5, 0.5
    sigma = w1 * x1 + w2 * x2 + 0
    if sigma <= theta:
        return 0
    elif sigma > theta:
        return 1
    
print(xor_func(0, 0)) # 0
print(xor_func(1, 0)) # 1
print(xor_func(0, 1)) # 1
print(xor_func(1, 1)) # 1
print()
# 만족하지 못함

import numpy as np
from sklearn.linear_model import Perceptron

feature = np.array([[0,0], [0,1], [1,0], [1,1]])
#print(feature)
#label = np.array([0, 0, 0, 1]) # and
#label = np.array([0, 1, 1, 1]) # or
label = np.array([1, 1, 1, 0]) # nand
#label = np.array([0, 1, 1, 0]) # xor

ml = Perceptron(max_iter = 100).fit(feature, label) # max_iter: 학습 수
print(ml.predict(feature))
# [0 0 0 1] and
# [0 1 1 1] or
# [1 0 0 0] nand => 만족하지못함
# [0 0 0 0] xor => 만족하지못함

from sklearn.linear_model import Perceptron

Perceptron(max_iter = ).fit(x, y) : 단순인공 신경망. max_iter - 학습 수

- Perceptron api

scikit-learn.org/stable/modules/generated/sklearn.linear_model.Perceptron.html

sklearn.linear_model.Perceptron — scikit-learn 0.24.1 documentation

scikit-learn.org

MLP

: 다층 신경망 논리 회로 분류

x1	x2	nand	or	xor
0	0	1	0	0
0	1	1	1	1
1	0	1	1	1
1	1	0	1	0

x1 --> nand -> xor -> y

x2 or

* neural4_mlp1.py

import numpy as np
from sklearn.neural_network import MLPClassifier

feature = np.array([[0,0], [0,1], [1,0], [1,1]])
#label = np.array([0, 0, 0, 1]) # and
label = np.array([0, 1, 1, 1]) # or
#label = np.array([1, 1, 1, 0]) # nand
#label = np.array([0, 1, 1, 0]) # xor

#ml = MLPClassifier(hidden_layer_sizes=30).fit(feature, label) # hidden_layer_sizes - node 수
#ml = MLPClassifier(hidden_layer_sizes=30, max_iter=400, verbose=1, learning_rate_init=0.1).fit(feature, label)
ml = MLPClassifier(hidden_layer_sizes=(10, 10, 10), max_iter=400, verbose=1, learning_rate_init=0.1).fit(feature, label)
# verbose - 진행가정 확인. # max_iter default 200. max_iter - 학습수. learning_rate_init - 학습 진행률. 클수록 세밀한 분석을 되나 속도는 저하
print(ml)
print(ml.predict(feature))
# [0 0 0 1] and
# [0 1 1 1] or
# [1 1 1 0] nand
# [0 1 1 0] xor => 모두 만족

from sklearn.neural_network import MLPClassifier

MLPClassifier(hidden_layer_sizes=, max_iter=, verbose=, learning_rate_init=).fit(x, y) : 다층 신경망.

hidden_layer_sizes : node 수

verbose : 진행가정 log 추가

max_iter : 학습 수 (default 200)

learning_rate_init : 학습 진행률. (클수록 세밀한 분석을 되나 속도는 저하)

- MLPClassifier api (deep learning)

scikit-learn.org/stable/modules/generated/sklearn.neural_network.MLPClassifier.html

sklearn.neural_network.MLPClassifier — scikit-learn 0.24.1 documentation

scikit-learn.org

'BACK END > Deep Learning' 카테고리의 다른 글

[딥러닝] k-means (0)	2021.03.22
[딥러닝] 클러스터링 (0)	2021.03.19
[딥러닝] KNN (0)	2021.03.18
[딥러닝] RandomForest (0)	2021.03.17
[딥러닝] Decision Tree (0)	2021.03.17

[딥러닝] KNN

2021. 3. 18. 15:26

KNN

: K 최근접 이웃 알고리즘

- 이론

onikaze.tistory.com/368

Machine Learning - (2) kNN 모델

이 글을 읽기 전에 반드시 참고하셔야 할 부분이 있음을 알려드립니다. 인터넷 상에 제 글이 검색이 되어 다른 분들도 한 번 혹은 그 이상은 거쳐가는 곳인 것은 사실이지만, 어디까지나 저는 Mac

onikaze.tistory.com

- anaconda prompt

pip install mglearn

=> 모듈 다운로드

* knn1.py

import mglearn     # pip install mglearn
import matplotlib.pyplot as plt
plt.rc('font', family='malgun gothic')

# -------------------------
# Classification
mglearn.plots.plot_knn_classification(n_neighbors=1)
plt.show()

mglearn.plots.plot_knn_classification(n_neighbors=3)
plt.show()

mglearn.plots.plot_knn_classification(n_neighbors=5)
plt.show()

=> 가장 간단한 k-NN 알고리즘은 가장 가까운 훈련 데이터 포인트 하나를 최근접 이웃으로 찾아 예측에 사용합니다.
=> 단순히 이 훈련 데이터 포인트의 출력이 예측이 됩니다.

import mglearn

mglearn.plots.plot_knn_classification(n_neighbors=) : classification knn 알고리즘. n_neighbors - k값.

# Regression
mglearn.plots.plot_knn_regression(n_neighbors=1)
plt.show()

mglearn.plots.plot_knn_regression(n_neighbors=3)
plt.show()

import mglearn

mglearn.plots.plot_knn_regression(n_neighbors=) : regression knn 알고리즘. n_neighbors - k값.

from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split

X, y = mglearn.datasets.make_forge() # forge dataset load
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=12) # train, test로 나눔
print(X_train, ' ', X_train.shape)  # [[ 8.92229526 -0.63993225] ...   (19, 2)
print(X_test, ' ', X_test.shape)    #  (7, 2)
print(y_train)  # [0 0 1 1 0 1 0 1 1 1 0 1 0 0 0 1 0 1 0]

model = KNeighborsClassifier(n_neighbors=3)
model.fit(X_train, y_train)
print("test 예측: {}".format(model.predict(X_test)))
# test 예측: [0 0 1 0 1 1 1]
print("test 정확도: {:.2f}".format(model.score(X_test, y_test)))
# test 정확도: 0.86
print("train 정확도: {:.2f}".format(model.score(X_train, y_train)))
# train 정확도: 0.95

fig, axes = plt.subplots(1, 3, figsize=(10, 5))

for n_neighbors, ax in zip([1, 3, 9], axes):
    model2 = KNeighborsClassifier(n_neighbors=n_neighbors).fit(X, y)
    mglearn.plots.plot_2d_separator(model2, X, fill=True, eps=0.5, ax=ax, alpha=.4)
    mglearn.discrete_scatter(X[:, 0], X[:, 1], y, ax=ax)
    ax.set_title("{} 이웃".format(n_neighbors))
    ax.set_xlabel("특성 0")
    ax.set_ylabel("특성 1")
    axes[0].legend(loc=1)

plt.show()

import mglearn

mglearn.datasets.make_forge() : forge dataset

from sklearn.neighbors import KNeighborsClassifier

KNeighborsClassifier(n_neighbors=) : knn classification 알고리즘

model.score(x, y) : 정확도

mglearn.plots.plot_2d_separator(model2, X, fill=True, eps=0.5, ax=ax, alpha=.4)

mglearn.discrete_scatter(X[:, 0], X[:, 1], y, ax=ax)

왼쪽 그림을 보면 이웃을 하나 선택했을 때는 결정 경계가 훈련 데이터에 가깝게 따라가고 있습니다.
이웃의 수를 늘릴수록 결정 경계는 더 부드러워집니다. 부드러운 경계는 더 단순한 모델을 의미합니다.
다시 말해 이웃을 적게 사용하면 모델의 복잡도가 높아지고([그림]의 오른쪽) 많이 사용하면 복잡도는 낮아집니다([그림]의 왼쪽).

훈련 데이터 전체 개수를 이웃의 수로 지정하는 극단적인 경우에는 모든 테스트 포인트가 같은 이웃(모든 훈련 데이터)을 가지게 되므로 테스트 포인트에 대한 예측은 모두 같은 값이 됩니다.
즉 훈련 세트에서 가장 많은 데이터 포인트를 가진 클래스가 예측값이 됩니다.
일반적으로 KNeighbors 분류기에 중요한 매개변수는 두 개입니다. 데이터 포인트 사이의 거리를 재는 방법과 이웃의 수입니다.
실제로 이웃의 수는 3개나 5개 정도로 적을 때 잘 작동하지만, 이 매개변수는 잘 조정해야 합니다.
거리 재는 방법은 기본적으로 유클리디안 거리 방식을 사용합니다.

breast_cancer dataset으로 실습

from sklearn.datasets import load_breast_cancer

cancer = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(cancer.data, cancer.target, stratify=cancer.target, random_state=66)

training_accuracy = []
test_accuracy = []
# 1에서 10까지 n_neighbors를 적용
neighbors_settings = range(1, 11)

for n_neighbors in neighbors_settings:
    clf = KNeighborsClassifier(n_neighbors=n_neighbors)  # 모델 생성
    clf.fit(X_train, y_train)
    # train dataset 정확도 저장
    training_accuracy.append(clf.score(X_train, y_train))
    # test dataset 정확도 저장
    test_accuracy.append(clf.score(X_test, y_test))

import numpy as np
print("평균 정확도 :", np.mean(test_accuracy))
# 평균 정확도 : 0.918881118881119
plt.plot(neighbors_settings, training_accuracy, label="훈련 정확도")
plt.plot(neighbors_settings, test_accuracy, label="테스트 정확도")
plt.ylabel("정확도")
plt.xlabel("n_neighbors")
plt.legend()
plt.show()

from sklearn.datasets import load_breast_cancer
load_breast_cancer()

이 그림은 n_neighbors 수(x축)에 따른 훈련 세트와 테스트 세트 정확도(y축)를 보여줍니다.
실제 이런 그래프는 매끈하게 나오지 않지만, 여기서도 과대적합과 과소적합의 특징을 볼 수 있습니다
(이웃의 수가 적을수록 모델이 복잡해지므로 [그림]의 그래프가 수평으로 뒤집힌 형태입니다).
최근접 이웃의 수가 하나일 때는 훈련 데이터에 대한 예측이 완벽합니다.
하지만 이웃의 수가 늘어나면 모델은 단순해지고 훈련 데이터의 정확도는 줄어듭니다.
이웃을 하나 사용한 테스트 세트의 정확도는 이웃을 많이 사용했을 때보다 낮습니다.
이것은 1-최근접 이웃이 모델을 너무 복잡하게 만든다는 것을 설명해줍니다.
반대로 이웃을 10개 사용했을 때는 모델이 너무 단순해서 정확도는 더 나빠집니다.
정확도가 가장 좋을 때는 중간 정도인 여섯 개를 사용한 경우입니다.

참고 : 파이썬 라이브러리를 활용한 머신러닝 (한빛미디어 출판사)의 일부분을 사용했습니다.

* knn2.py

from sklearn.neighbors import KNeighborsClassifier

kmodel = KNeighborsClassifier(n_neighbors = 3, weights = 'distance')

train = [
    [5, 3, 2],
    [1, 3, 5],
    [4, 5, 7]
    ]
label = [0, 1, 1]

import matplotlib.pyplot as plt

plt.plot(train, 'o')
plt.xlim([-1, 5])
plt.ylim([0, 10])
plt.show()

scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html

sklearn.neighbors.KNeighborsClassifier — scikit-learn 0.24.1 documentation

scikit-learn.org

from sklearn.neighbors import KNeighborsClassifier
KNeighborsClassifier(n_neighbors = 3, weights = 'distance')

kmodel.fit(train, label)
pred = kmodel.predict(train)
print('pred :', pred)                        # pred : [0 1 1]
print('acc :', kmodel.score(train, label))   # acc : 1.0

new_data = [[1, 2, 8], [6, 4, 1]]
new_pred = kmodel.predict(new_data)
print('new_pred :', new_pred)                # new_pred : [1 0]

* regression_test.py

 # 대표적인 분류/예측 모델로 Regression 연습
import pandas as pd
import numpy as np
from sklearn.neighbors import KNeighborsRegressor
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from xgboost import XGBRegressor
from sklearn.metrics import r2_score

adver = pd.read_csv('../testdata/Advertising.csv', usecols=[1,2,3,4])
print(adver.head(2))
'''
      tv  radio  newspaper  sales
0  230.1   37.8       69.2   22.1
1   44.5   39.3       45.1   10.4
'''

x = np.array(adver.loc[:, 'tv':'newspaper'])
y = np.array(adver.sales)
print(x[:2]) # [[230.1  37.8  69.2] [ 44.5  39.3  45.1]]
print(y[:2]) # [22.1 10.4]

# KNeighborsRegressor
kmodel = KNeighborsRegressor(n_neighbors=3).fit(x, y)
print(kmodel)
kpred = kmodel.predict(x)
print('pred :', kpred[:5]) # pred : [20.4        10.43333333  8.56666667 18.2        14.2       ]
print('r2 :', r2_score(y, kpred))  # r2 : 0.968012077694316
print()

# LinearRegression
lmodel = LinearRegression().fit(x, y)
print(lmodel)
lpred = lmodel.predict(x)
print('pred :', lpred[:5]) # pred : [20.52397441 12.33785482 12.30767078 17.59782951 13.18867186]
print('r2 :', r2_score(y, lpred))  # r2 : 0.8972106381789522
print()

# RandomForestRegressor
rmodel = RandomForestRegressor(n_estimators=100, criterion='mse').fit(x, y)
print(rmodel)
rpred = rmodel.predict(x)
print('pred :', rpred[:5]) # pred : [21.942 10.669  8.859 18.281 13.44 ]
print('r2 :', r2_score(y, rpred))  # r2 : 0.9971466378876895
print()

# XGBRegressor
xmodel = XGBRegressor(n_estimators=100).fit(x, y)
print(xmodel)
xpred = xmodel.predict(x)
print('pred :', xpred[:5]) # pred : [22.095655  10.40437    9.302584  18.499216  12.9007015]
print('r2 :', r2_score(y, xpred))  # r2 : 0.9999996661140423
print()

'BACK END > Deep Learning' 카테고리의 다른 글

[딥러닝] 클러스터링 (0)	2021.03.19
[딥러닝] Neural Network (0)	2021.03.19
[딥러닝] RandomForest (0)	2021.03.17
[딥러닝] Decision Tree (0)	2021.03.17
[딥러닝] 나이브 베이즈 (0)	2021.03.17

[딥러닝] RandomForest

2021. 3. 17. 17:29

RandomForest

: 앙상블 기법(여러개의 Decision Tree를 묶어 하나의 모델로 사용)

: 정량적인 분석 모델

RandomForestClassifier 분류 모델 연습

* randomForest1.py

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
import pandas as pd
from sklearn import model_selection
import numpy as np
from sklearn.metrics._scorer import accuracy_scorer

df = pd.read_csv('https://raw.githubusercontent.com/pykwon/python/master/testdata_utf8/titanic_data.csv')
print(df.head(3), df.shape) # (891, 12)
'''
   PassengerId  Survived  Pclass  ...     Fare Cabin  Embarked
0            1         0       3  ...   7.2500   NaN         S
1            2         1       1  ...  71.2833   C85         C
2            3         1       3  ...   7.9250   NaN         S
'''
print(df.columns)
# Index(['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp',
#        'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked'],
#       dtype='object')
print(df.info())
 #   Column       Non-Null Count  Dtype  
# ---  ------       --------------  -----  
#  0   PassengerId  891 non-null    int64  
#  1   Survived     891 non-null    int64  
#  2   Pclass       891 non-null    int64  
#  3   Name         891 non-null    object 
#  4   Sex          891 non-null    object 
#  5   Age          714 non-null    float64
#  6   SibSp        891 non-null    int64  
#  7   Parch        891 non-null    int64  
#  8   Ticket       891 non-null    object 
#  9   Fare         891 non-null    float64
#  10  Cabin        204 non-null    object 
#  11  Embarked     889 non-null    object 

print(df.isnull().any())
# PassengerId    False
# Survived       False
# Pclass         False
# Name           False
# Sex            False
# Age             True
# SibSp          False
# Parch          False
# Ticket         False
# Fare           False
# Cabin           True
# Embarked        True

df.isnull().any() : null 값 확인.

df = df.dropna(subset=['Pclass','Age','Sex'])
print(df.head(3), df.shape) # (714, 12)

df_x = df[['Pclass','Age','Sex']]
print(df_x.head(3))
'''
   Pclass   Age     Sex
0       3  22.0    male
1       1  38.0  female
2       3  26.0  female
'''

df.dropna(subset=['칼럼1', '칼럼2',..]) : 칼럼에 결측치가 있으면 제거.

from sklearn.preprocessing import LabelEncoder, OneHotEncoder

df_x.loc[:, 'Sex'] = LabelEncoder().fit_transform(df_x['Sex']) # female : 0, male : 1
df_x['Sex'] = df_x['Sex'].apply(lambda x: 1 if x=='male' else 0) # 위와 동일

print(df_x.head(3), df_x.shape) # (714, 3)
'''
   Pclass   Age  Sex
0       3  22.0    1
1       1  38.0    0
2       3  26.0    0
'''

df_y = df['Survived']
print(df_y.head(3), df_y.shape) # (714,)
'''
0    0
1    1
2    1
'''

df_x2 = pd.DataFrame(OneHotEncoder().fit_transform(df_x['Pclass'].values[:,np.newaxis]).toarray(),\
                     columns = ['f_class', 's_class', 't_class'], index=df_x.index)
print(df_x2.head(3))
'''
   f_class  s_class  t_class
0      0.0      0.0      1.0
1      1.0      0.0      0.0
2      0.0      0.0      1.0
'''

df_x = pd.concat([df_x, df_x2], axis=1)
print(df_x.head(3))
'''
   Pclass   Age  Sex  f_class  s_class  t_class
0       3  22.0    1      0.0      0.0      1.0
1       1  38.0    0      1.0      0.0      0.0
2       3  26.0    0      0.0      0.0      1.0
'''

from sklearn.preprocessing import LabelEncoder

LabelEncoder().fit_transform(df['범주형 칼럼']) : 범주형 데이터를 수치형으로 변환.

from sklearn.preprocessing import OneHotEncoder

OneHotEncoder().fit_transform(df['칼럼명']).toarray() : One hot encoding

np.newaxis : 차원 증가.

pd.concat([칼럼, .. ], axis=1) : 열 방향 합치기

# train / test
(train_x,  test_x, train_y, test_y) = train_test_split(df_x, df_y)

# model
from sklearn.metrics import accuracy_score

model = RandomForestClassifier(n_estimators=500, criterion='entropy')
fit_model = model.fit(train_x, train_y)

pred = fit_model.predict(test_x)
print('예측값:', pred[:10])           # [0 0 0 1 0 0 0 1 0 0]
print('실제값:', test_y[:10].ravel()) # [0 1 0 0 1 0 0 1 0 0]

print('acc :', sum(test_y == pred) / len(test_y))
print('acc :', accuracy_score(test_y, pred))

from sklearn.ensemble import RandomForestClassifier

RandomForestClassifier(n_estimators=100, criterion='entropy') : n_estimators : 트리 수, criterion : 분할 품질 측정 방법.

from sklearn.metrics import accuracy_score

accuracy_score(실제값, 예측값) : 정확도 산출.

ravel() : 차원 축소.

- RandomForestClassifier API

scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html

sklearn.ensemble.RandomForestClassifier — scikit-learn 0.24.1 documentation

scikit-learn.org

보스톤 지역의 주택 평균가격 예측

* randomForest_regressor.py

import pandas as pd
import matplotlib.pyplot as plt
from sklearn.datasets import load_boston

boston = load_boston()
print(boston.DESCR)
# MEDV     Median value of owner-occupied homes in $1000's

# DataFrame 값으로 변환
dfx = pd.DataFrame(boston.data, columns = boston.feature_names)
# dataset에서 독립변수 값만 추출
dfy = pd.DataFrame(boston.target, columns = ['MEDV'])
# dataset에서 종속변수 값 추출

print(dfx.head(3), dfx.shape) # (506, 13)
'''
      CRIM    ZN  INDUS  CHAS    NOX  ...  RAD    TAX  PTRATIO       B  LSTAT
0  0.00632  18.0   2.31   0.0  0.538  ...  1.0  296.0     15.3  396.90   4.98
1  0.02731   0.0   7.07   0.0  0.469  ...  2.0  242.0     17.8  396.90   9.14
2  0.02729   0.0   7.07   0.0  0.469  ...  2.0  242.0     17.8  392.83   4.03
'''
print(dfy.head(3), dfy.shape) # (506, 1)
'''
   MEDV
0  24.0
1  21.6
2  34.7 
'''
df = pd.concat([dfx, dfy], axis=1)
print(df.head(3))
'''
      CRIM    ZN  INDUS  CHAS    NOX  ...    TAX  PTRATIO       B  LSTAT  MEDV
0  0.00632  18.0   2.31   0.0  0.538  ...  296.0     15.3  396.90   4.98  24.0
1  0.02731   0.0   7.07   0.0  0.469  ...  242.0     17.8  396.90   9.14  21.6
2  0.02729   0.0   7.07   0.0  0.469  ...  242.0     17.8  392.83   4.03  34.7
'''

from sklearn.datasets import load_boston

load_boston() : boston 부동산관련 dataset.

- 상관계수

pd.set_option('display.max_columns', 100) # 데이터 프레임 출력시 생략 값 출력.
print(df.corr()) # 상관계수 확인
# RM       average number of rooms per dwelling.                   상관계수 : 0.695360
# AGE      proportion of owner-occupied units built prior to 1940. 상관계수 : -0.376955
# LSTAT    % lower status of the population                        상관계수 : -0.737663

pd.set_option('display.max_columns', 100) :데이터 프레임 출력시 컬럼 생략 값 출력.

- 시각화

import seaborn as sns
cols = ['MEDV', 'RM', 'AGE', 'LSTAT']
sns.pairplot(df[cols])
plt.show()

import seaborn as sns

sns.pairplot(데이터) : 변수 간 산점 분포도 출력.

- sklearn에 맞게 데이터 변환

x = df[['LSTAT']].values # sklearn에서 득립변수는 2차원
y = df['MEDV'].values
print(x[:2])             # [[4.98] [9.14]]
print(y[:2])             # [24.  21.6]

- DecisionTreeRegressor

# 실습 1
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import r2_score

model = DecisionTreeRegressor(max_depth=3).fit(x, y)
print('predict :', model.predict(x)[:5]) # predict : [30.47142857 25.84701493 37.315625   43.98888889 30.47142857]
print('real :', y[:5])                   # real : [24.  21.6 34.7 33.4 36.2]
r2 = r2_score(y, model.predict(x))
print('결정계수(R2, 설명력) :', r2)          # 결정계수(R2, 설명력) : 0.6993833085636556

from sklearn.tree import DecisionTreeRegressor

DecisionTreeRegressor(max_depth=).fit(x, y) : 결정 트리 회귀

from sklearn.metrics import r2_score

r2_score(실제값, 예측값) : r square 값 산출

- RandomForestRegressor

# 실습 2
from sklearn.ensemble import RandomForestRegressor

model2 = RandomForestRegressor(n_estimators=1000, criterion='mse', random_state=123).fit(x, y) # criterion='mse' 평균 제곱오차
print('predict2 :', model2.predict(x)[:5]) # predict : [24.7535     22.0408     35.2609581  38.8436     32.00298571]
print('real :', y[:5])                     # real : [24.  21.6 34.7 33.4 36.2]
r2_1 = r2_score(y, model2.predict(x))
print('결정계수(R2, 설명력) :', r2_1)          # 결정계수(R2, 설명력) : 0.9096858991691069

from sklearn.ensemble import RandomForestRegressor

RandomForestRegressor(n_estimators=, criterion='mse', random_state=).fit(x, y) : criterion='mse' 평균 제곱오차

- 학습/검정 자료로 분리

from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3, random_state=123)
model2.fit(x_train, y_train)

r2_train = r2_score(y_train, model2.predict(x_train))
print('train에 대한 설명력 :', r2_train)  # train에 대한 설명력 : 0.9090659680794153

r2_test = r2_score(y_test, model2.predict(x_test))
print('test에 대한 설명력 :', r2_test)    # test에 대한 설명력 : 0.5779609792473676
# 독립변수의 수를 늘려주면 결과는 개선됨.

from sklearn.model_selection import train_test_split

train_test_split(x, y, test_size=, random_state=) : train/test로 분리

- 시각화

from matplotlib import style 
style.use('seaborn-talk')
plt.scatter(x, y, c='lightgray', label='train data')
plt.scatter(x_test, model2.predict(x_test), c='r', label='predict data, $R^2=%.2f$'%r2_test)
plt.xlabel('LSTAT')
plt.ylabel('MEDV')
plt.legend()
plt.show()

from matplotlib import style

print(plt.style.available) : 사용 가능 스타일 출력.

style.use('스타일명') : matplot 스타일 사용.

- 새로운 값으로 예측

import numpy as np
print(x_test[:3])                        # [[10.11] [ 6.53] [ 3.76]]
x_new = [[50.11], [26.53], [1.76]]
print('예상 집값 :', model2.predict(x_new)) # 예상 집값 : [ 9.6527  11.0907  45.34095]

배깅 / 부스팅

배깅(Bagging) - Random Forest
  : 데이터에서 여러 bootstrap 자료 생성, 모델링 후 결합하여 최종 예측 모형을 만드는 알고리즘
    boostrap aggregating의 약어로 데이터를 가방(bag)에 쓸어 담아 복원 추출하여 여러 개의 표본을 만들어 이를 기반으로 각각의 모델을 개발한 후에 결과를 하나로 합쳐 하나의 모델을 만들어 내는 것이다.
    배깅을 통해서 얻을 수 있는 효과는 '알고리즘의 안정성'이다.
    단일 seed 하나의 값을 기준으로 데이터를 추출하여 모델을 생성해 나는 것보다, 여러 개의 다양한 표본을 사용함으로써 모델을 만드는 것이 모집단을 잘 대표할 수 있게 된다.
    또한 명목형 변수 (Categorical data)의 경우 투표(voting) 방식, 혹은 가장 높은 확률값으로 예측 결과값을 합치며 연속형 변수(numeric data)의 경우에는 평균(average)으로 값을 집계한다.
    또한 배깅은 병렬 처리를 사용할 수 있는데, 독립적인 데이터 셋으로 독립된 모델을 만들기 때문에 모델 생성에 있어서 매우 효율적이다.

부스팅(Boosting) - XGBoost
  : 오분류 개체들에 가중치를 적용하여 새로운 분류 규칙 생성 반복 기반 최종 예측 모형 생성
    좀 더 알아보자면 Boosting이란 약한 분류기를 결합하여 강한 분류기를 만드는 과정이다.
    분류기 A, B, C 가 있고, 각각의 0.3 정도의 accuracy를 보여준다고 하자.
    A, B, C를 결합하여 더 높은 정확도, 예를 들어 0.7 정도의 accuracy를 얻는 게 앙상블 알고리즘의 기본 원리다.
    Boosting은 이 과정을 순차적으로 실행한다.
    A 분류기를 만든 후, 그 정보를 바탕으로 B 분류기를 만들고, 다시 그 정보를 바탕으로 C 분류기를 만든다.
   그리고 최종적으로 만들어진 분류기들을 모두 결합하여 최종 모델을 만드는 것이 Boosting의 원리다.
   대표적인 알고리즘으로 에이다부스트가 있다. AdaBoost는 Adaptive Boosting의 약자이다.
   Adaboost는 ensemble-based classifier의 일종으로 weak classifier를 반복적으로 적용해서, data의 특징을 찾아가는 알고리즘.

- anaconda prompt

pip install xgboost

* xgboost1.py

# RandomForest vs xgboost
import pandas as pd
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn import metrics
import numpy as np
import xgboost as xgb       # pip install xgboost  
 
if __name__ == '__main__':
    iris = datasets.load_iris()
    print('아이리스 종류 :', iris.target_names)
    print('데이터 열 이름 :', iris.feature_names)
 
    # iris data로 Dataframe
    data = pd.DataFrame(
        {
            'sepal length': iris.data[:, 0],
            'sepal width': iris.data[:, 1],
            'petal length': iris.data[:, 2],
            'petal width': iris.data[:, 3],
            'species': iris.target
        }
    )
    print(data.head(2))
    '''
           sepal length  sepal width  petal length  petal width  species
    0           5.1          3.5           1.4          0.2        0
    1           4.9          3.0           1.4          0.2        0
    '''
 
    x = data[['sepal length', 'sepal width', 'petal length', 'petal width']]
    y = data['species']
 
    # 테스트 데이터 30%
    x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3, random_state=123)
 
    # 학습 진행
    model = RandomForestClassifier(n_estimators=100)  # RandomForestClassifier - Bagging 방법 : 병렬 처리
    model = xgb.XGBClassifier(booster='gbtree', max_depth=4, n_estimators=100) # XGBClassifier - Boosting : 직렬처리
    # 속성 - booster: 의사결정 기반 모형(gbtree), 선형 모형(linear)
    #    - max_depth [기본값: 6]: 과적합 방지를 위해서 사용되며 CV를 사용해서 적절한 값이 제시되어야 하고 보통 3-10 사이 값이 적용된다.

    model.fit(x_train, y_train)
 
    # 예측
    y_pred = model.predict(x_test)
    print('예측값 : ', y_pred[:5])
    # 예측값 :  [1 2 2 1 0]

    print('실제값 : ', np.array(y_test[:5]))
    # 실제값 :  [1 2 2 1 0]
 
    print('정확도 : ', metrics.accuracy_score(y_test, y_pred))
    # 정확도 :  0.9333333333333333

import xgboost as xgb

xgb.XGBClassifier(booster='gbtree', max_depth=, n_estimators=) : XGBoost 분류 - Boosting(직렬처리)
booster : 의사결정 기반 모형(gbtree), 선형 모형(linear)
max_depth : 과적합 방지를 위해서 사용되며 CV를 사용해서 적절한 값이 제시되어야 하고 보통 3-10 사이 값이 적용됨.

(default: 6)

'BACK END > Deep Learning' 카테고리의 다른 글

[딥러닝] Neural Network (0)	2021.03.19
[딥러닝] KNN (0)	2021.03.18
[딥러닝] Decision Tree (0)	2021.03.17
[딥러닝] 나이브 베이즈 (0)	2021.03.17
[딥러닝] PCA (0)	2021.03.16

[딥러닝] Decision Tree

2021. 3. 17. 13:17

의사결정 나무(Decision Tree)

: CART - classification과 Regression 모두 가능
: 여러 규칙을 순차적으로 적용하면서 분류나 예측을 진행하는 단순 알고리즘 사용 모델

Random Forest

앙상블 모델

base 모델로 Decision Tree

* tree1.py

import pydotplus
from sklearn import tree

# height, hair로 남녀 구분
x = [[180, 15],
     [177, 42],
     [156, 35],
     [174, 5],
     [166, 33]]

y = ['man', 'women', 'women', 'man', 'women']
label_names = ['height', 'hair Legnth']

model = tree.DecisionTreeClassifier(criterion='entropy', random_state=0)
print(model)
fit = model.fit(x, y)
print('acc :{:.3f}'.format(fit.score(x, y))) # acc :1.000

mydata = [[171, 8]]
pred =  fit.predict(mydata)
print('pred :', pred) # pred : ['man']

from sklearn import tree

tree.DecisionTreeClassifier() :

scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html

sklearn.tree.DecisionTreeClassifier — scikit-learn 0.24.1 documentation

scikit-learn.org

# 시각화 - graphviz 툴을 사용
import collections

dot_data = tree.export_graphviz(model, feature_names=label_names, out_file=None,\
                                filled = True, rounded=True)
graph = pydotplus.graph_from_dot_data(dot_data)
colors = ('red', 'orange')
edges = collections.defaultdict(list) # list type 변수

for e in graph.get_edge_list():
    edges[e.get_source()].append(int(e.get_destination()))

for e in edges:
    edges[e].sort()
    for i in range(2):
        dest = graph.get_node(str(edges[e][i]))[0]
        dest.set_fillcolor(colors[i])

graph.write_png('tree.png') # 이미지 저장

import matplotlib.pyplot as plt

img = plt.imread('tree.png')
plt.imshow(img)
plt.show()

* tree2_iris.py

...

# 의사결정 나무 모델
from sklearn.tree import DecisionTreeClassifier
model = DecisionTreeClassifier(criterion='entropy', max_depth=5)

...

...
# 트리의 특성 중요도 : 전체 트리 결정에 각 특성이 어느정도 중요한지 평가
print('특성 중요도 : \n{}'.format(model.feature_importances_))

def plot_feature_importances(model):
    n_features = x.shape[1] # 4
    plt.barh(range(n_features), model.feature_importances_, align='center')
    #plt.yticks(np.range(n_features), iris.featrue_names[2:4])
    plt.xlabel('특성중요도')
    plt.ylabel('특성')
    plt.ylim(-1, n_features)

plot_feature_importances(model)
plt.show()

# graphviz
from sklearn import tree
from io import StringIO
import pydotplus

dot_data = StringIO() # 파일 흉내를 내는 역할
tree.export_graphviz(model, out_file = dot_data,\
                     feature_names = iris.feature_names[2:4])
graph = pydotplus.graph_from_dot_data(dot_data.getvalue())
graph.write_png('tree2.png')

import matplotlib.pyplot as plt

img = plt.imread('tree2.png')
plt.imshow(img)
plt.show()

'BACK END > Deep Learning' 카테고리의 다른 글

[딥러닝] KNN (0)	2021.03.18
[딥러닝] RandomForest (0)	2021.03.17
[딥러닝] 나이브 베이즈 (0)	2021.03.17
[딥러닝] PCA (0)	2021.03.16
[딥러닝] SVM (0)	2021.03.16

[딥러닝] 나이브 베이즈

2021. 3. 17. 13:02

나이브 베이즈(Naive Bayes) 분류 모델

: feature가 주어졌을 때 label의 확률을 구함. P(L|Feature)

P(A|B) = P(B|A)P(A)/P(B)

P(A|B) : 사건B가 발생한 상태에서 사건A가 발생할 조건부 확률

P(label|feature)

* bayes1.py

from sklearn.naive_bayes import GaussianNB
import numpy as np
from sklearn import metrics

x = np.array([1,2,3,4,5])
x = x[:, np.newaxis] # np.newaxis 차원 확대
print(x)
'''
[[1]
 [2]
 [3]
 [4]
 [5]]
'''
y = np.array([1,3,5,7,9])
print(y)

model = GaussianNB().fit(x, y)
pred = model.predict(x)
print(pred) # [1 3 5 7 9]
print('acc :', metrics.accuracy_score(y, pred)) # acc : 1.0

from sklearn.naive_bayes import GaussianNB

GaussianNB()

# new data
new_x = np.array([[0.5],[2.3], [12], [0.1]])
new_pred = model.predict(new_x)
print(new_pred) # [1 3 9 1]

- One-hot encoding : 데이터를 0과 1로 변환(2진수)

: feature 데이터를 One-hot encoding

: 모델의 성능향상

x = '1,2,3,4,5'
x = x.split(',')
x = np.eye(len(x))
print(x)
'''
[[1. 0. 0. 0. 0.]
 [0. 1. 0. 0. 0.]
 [0. 0. 1. 0. 0.]
 [0. 0. 0. 1. 0.]
 [0. 0. 0. 0. 1.]]
'''
y = np.array([1,3,5,7,9])

model = GaussianNB().fit(x, y)
pred = model.predict(x)
print(pred) # [1 3 5 7 9]
print('acc :', metrics.accuracy_score(y, pred)) # acc : 1.0

from sklearn.preprocessing import OneHotEncoder
x = '1,2,3,4,5'
x = x.split(',')
x = np.array(x)
x = x[:, np.newaxis]
'''
[['1']
 ['2']
 ['3']
 ['4']
 ['5']]
'''

one_hot = OneHotEncoder(categories = 'auto')
x = one_hot.fit_transform(x).toarray()
print(x)
'''
[[1. 0. 0. 0. 0.]
 [0. 1. 0. 0. 0.]
 [0. 0. 1. 0. 0.]
 [0. 0. 0. 1. 0.]
 [0. 0. 0. 0. 1.]]
'''
y = np.array([1,3,5,7,9])

model = GaussianNB().fit(x, y)
pred = model.predict(x)
print(pred) # [1 3 5 7 9]
print('acc :', metrics.accuracy_score(y, pred)) # acc : 1.0

* bayes3_text.py

# 나이브베이즈 분류모델로 텍스트 분류
from sklearn.datasets import fetch_20newsgroups

data = fetch_20newsgroups()
print(data.target_names)

categories = ['talk.religion.misc', 'soc.religion.christian',
              'sci.space', 'comp.graphics']

train = fetch_20newsgroups(subset='train', categories=categories)
test = fetch_20newsgroups(subset='test', categories=categories)
print(train.data[5])  # 데이터 중 대표항목

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# 각 문자열의 콘텐츠를 숫자벡터로 전환
model = make_pipeline(TfidfVectorizer(), MultinomialNB())  # 작업을 연속적으로 진행
model.fit(train.data, train.target)
labels = model.predict(test.data)

from sklearn.metrics import confusion_matrix
import matplotlib.pyplot as plt
import seaborn as sns

mat = confusion_matrix(test.target, labels)  # 오차행렬 보기
sns.heatmap(mat.T, square=True, annot=True, fmt='d', cbar=False,
            xticklabels=train.target_names, yticklabels=train.target_names)
plt.xlabel('true label')
plt.ylabel('predicted label')
plt.show()

# 하나의 문자열에 대해 예측한 범주 변환용 유틸 함수 작성
def predict_category(s, train=train, model=model):
    pred = model.predict([s])
    return train.target_names[pred[0]]

print(predict_category('sending a payload to the ISS'))
print(predict_category('discussing islam vs atheism'))
print(predict_category('determining the screen resolution'))

# 참고 도서 : 파이썬 데이터사이언스 핸드북 ( 출판사 : 위키북스)

'BACK END > Deep Learning' 카테고리의 다른 글

[딥러닝] RandomForest (0)	2021.03.17
[딥러닝] Decision Tree (0)	2021.03.17
[딥러닝] PCA (0)	2021.03.16
[딥러닝] SVM (0)	2021.03.16
[딥러닝] 로지스틱 회귀 (0)	2021.03.15

[딥러닝] PCA

2021. 3. 16. 16:28

특성공학중 PCA(Principal Component Analysis)
: 특성을 단순히 선택하는 것이 아니라 특성들의 조합으로 새로운 특성을 생성

: PCA(주성분 분석)는 특성 추출(Feature Extraction) 기법에 속함

iris dataset으로 차원 축소 (4개의 열을 2(sepal, petal))

* pca_test.py

from sklearn.decomposition import PCA
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
from sklearn.datasets import load_iris
plt.rc('font', family='malgun gothic')

iris = load_iris()
n = 10
x = iris.data[:n, :2] # sepal 자료로 패턴확인
print('차원 축소 전  x:\n', x, x.shape, type(x)) # (10, 2) <class 'numpy.ndarray'>
'''
 [[5.1 3.5]
 [4.9 3. ]
 [4.7 3.2]
 [4.6 3.1]
 [5.  3.6]
 [5.4 3.9]
 [4.6 3.4]
 [5.  3.4]
 [4.4 2.9]
 [4.9 3.1]]
'''
print(x.T)
# [[5.1 4.9 4.7 4.6 5.  5.4 4.6 5.  4.4 4.9]
#  [3.5 3.  3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1]]

from sklearn.datasets import load_iris

load_iris() : ndarray type의 iris dataset load.

# 시각화
plt.plot(x.T, 'o:')
plt.xticks(range(2), labels=['꽃받침 길이', '꽃받침 폭'])
plt.xlim(-0.5, 2)
plt.ylim(2.5, 6)
plt.title('iris 특성')
plt.legend(['표본{}'.format(i + 1) for i in range(n)])
plt.show()

# 시각화2 : 산포도
plt.figure(figsize=(8, 8))
df = pd.DataFrame(x)
ax = sns.scatterplot(df[0], df[1], data=df , marker='s', s = 100, color=".2")
for i in range(n):
    ax.text(x[i, 0] - 0.05, x[i, 1] + 0.03, '표본{}'.format(i + 1))
    
plt.xlabel('꽃받침 길이')
plt.ylabel('꽃받침 폭')
plt.title('iris 특성')
plt.show()

# PCA
pca1 = PCA(n_components = 1)
x_row = pca1.fit_transform(x) # 1차원 근사데이터를 반환. 비 지도 학습
print('x_row :\n', x_row, x_row.shape) # (10, 1)
'''
[[ 0.30270263]
 [-0.1990931 ]
 [-0.18962889]
 [-0.33097106]
 [ 0.30743473]
 [ 0.79976625]
 [-0.11185966]
 [ 0.16136046]
 [-0.61365539]
 [-0.12605597]]
'''

x2 = pca1.inverse_transform(x_row)
print('복귀 후 값:\n', x2, x2.shape) # (10, 2)
'''
 [[5.06676112 3.53108532]
 [4.7240094  3.1645881 ]
 [4.73047393 3.17150049]
 [4.63393012 3.06826822]
 [5.06999338 3.53454152]
 [5.40628057 3.89412635]
 [4.78359423 3.22830091]
 [4.97021731 3.42785306]
 [4.44084251 2.86180369]
 [4.77389743 3.21793233]]
'''
print(x_row[0]) # [0.30270263]
print(x2[0, :]) # [5.06676112 3.53108532]

# 시각화2 : 산포도 - 사용
df = pd.DataFrame(x)
ax = sns.scatterplot(df[0], df[1], data=df , marker='s', s = 100, color=".2")
for i in range(n):
    d = 0.03 if x[i, 1] > x2[i, 1] else -0.04
    ax.text(x[i, 0] - 0.05, x[i, 1] + 0.03, '표본{}'.format(i + 1))
    plt.plot([x[i, 0], x2[i, 0]], [x[i, 1], x2[i, 1]], "k--")
plt.plot(x2[:, 0], x2[:, 1], "o-", markersize=10, color="b")
plt.plot(x[:, 0].mean(), x[:, 1].mean(), markersize=10, marker="D")
plt.axvline(x[:, 0].mean(), c='r') # 세로선
plt.axhline(x[:, 1].mean(), c='r') # 가로선
plt.xlabel('꽃받침 길이')
plt.ylabel('꽃받침 폭')
plt.title('iris 특성')
plt.show()

x = iris.data
pca2 = PCA(n_components = 2)
x_row2 = pca2.fit_transform(x)
print('x_row2 :\n', x_row2, x_row2.shape)

x4 = pca2.inverse_transform(x_row2)
print('최초자료 :', x[0])         # 최초자료 : [5.1 3.5 1.4 0.2]
print('차원축소 :', x_row2[0])    # 차원축소 : [-2.68412563  0.31939725]
print('최초복귀 :', x4[0, :])     # 최초복귀 : [5.08303897 3.51741393 1.40321372 0.21353169]

print()
iris2 = pd.DataFrame(x_row2, columns=['sepal', 'petal'])
iris1 = pd.DataFrame(x, columns=['sepal_Length', 'sepal_width', 'petal_Length', 'petal_width'])
print(iris2.head(3)) # 차원 축소
'''
      sepal     petal
0 -2.684126  0.319397
1 -2.714142 -0.177001
2 -2.888991 -0.144949
'''
print(iris1.head(3)) # 본래 데이터
'''
   sepal_Length  sepal_width  petal_Length  petal_width
0           5.1          3.5           1.4          0.2
1           4.9          3.0           1.4          0.2
2           4.7          3.2           1.3          0.2
'''

'BACK END > Deep Learning' 카테고리의 다른 글

[딥러닝] Decision Tree (0)	2021.03.17
[딥러닝] 나이브 베이즈 (0)	2021.03.17
[딥러닝] SVM (0)	2021.03.16
[딥러닝] 로지스틱 회귀 (0)	2021.03.15
[딥러닝] 다항회귀 (0)	2021.03.12

[딥러닝] SVM

2021. 3. 16. 12:23

SVM(Support Vector Machine)

: 두 데이터 사이에 구분을 위해 사용

: 각 데이터의 중심을 기준으로 초평면(Optimal Hyper Plane) 구한다.

: 초평면과 가까운 데이터를 support vector라 한다.

: XOR 처리 가능

XOR 연산 처리(분류)

* svm1.py

xor_data = [
    [0,0,0],
    [0,1,1],
    [1,0,1],
    [1,1,0],
]
#print(xor_data)

import pandas as pd
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn import svm

xor_df = pd.DataFrame(xor_data)

feature = np.array(xor_df.iloc[:, 0:2])
label = np.array(xor_df.iloc[:, 2])
print(feature)
'''
[[0 0]
 [0 1]
 [1 0]
 [1 1]]
'''
print(label) # [0 1 1 0]

model = LogisticRegression() # 선형분류 모델
model.fit(feature, label)
pred = model.predict(feature)
print('pred :', pred)
# pred : [0 0 0 0]

model = svm.SVC()             # 선형, 비선형(kernel trick 사용) 분류모델
model.fit(feature, label)
pred = model.predict(feature)
print('pred :', pred)
# pred : [0 1 1 0]

# Sopport vector 확인해보기 
from sklearn.datasets import make_blobs
import matplotlib.pyplot as plt
import numpy as np

plt.rc('font', family='malgun gothic')

X, y = make_blobs(n_samples=50, centers=2, cluster_std=0.5, random_state=4)
y = 2 * y - 1

plt.scatter(X[y == -1, 0], X[y == -1, 1], marker='o', label="-1 클래스")
plt.scatter(X[y == +1, 0], X[y == +1, 1], marker='x', label="+1 클래스")
plt.xlabel("x1")
plt.ylabel("x2")
plt.legend()
plt.title("학습용 데이터")
plt.show()

from sklearn.svm import SVC
model = SVC(kernel='linear', C=1.0).fit(X, y)  # tuning parameter  값을 변경해보자.

xmin = X[:, 0].min()
xmax = X[:, 0].max()
ymin = X[:, 1].min()
ymax = X[:, 1].max()
xx = np.linspace(xmin, xmax, 10)
yy = np.linspace(ymin, ymax, 10)
X1, X2 = np.meshgrid(xx, yy)

z = np.empty(X1.shape)
for (i, j), val in np.ndenumerate(X1):    # 배열 좌표와 값 쌍을 생성하는 반복기를 반환
    x1 = val
    x2 = X2[i, j]
    p = model.decision_function([[x1, x2]])
    z[i, j] = p[0]

levels = [-1, 0, 1]
linestyles = ['dashed', 'solid', 'dashed']
plt.scatter(X[y == -1, 0], X[y == -1, 1], marker='o', label="-1 클래스")
plt.scatter(X[y == +1, 0], X[y == +1, 1], marker='x', label="+1 클래스")
plt.contour(X1, X2, z, levels, colors='k', linestyles=linestyles)
plt.scatter(model.support_vectors_[:, 0], model.support_vectors_[:, 1], s=300, alpha=0.3)

x_new = [10, 2]
plt.scatter(x_new[0], x_new[1], marker='^', s=100)
plt.text(x_new[0] + 0.03, x_new[1] + 0.08, "테스트 데이터")

plt.xlabel("x1")
plt.ylabel("x2")
plt.legend()
plt.title("SVM 예측 결과")
plt.show()

# Support Vectors 값 출력
print(model.support_vectors_)
'''
[[9.03715314 1.71813465]
 [9.17124955 3.52485535]]
'''

* svm2_iris.py

BMI의 계산방법을 이용하여 많은 양의 자료를 생성한 후 분류 모델로 처리

계산식    신체질량지수(BMI)=체중(kg)/[신장(m)]2
판정기준    저체중    20 미만
정상    20 - 24
과체중    25 - 29
비만    30 이상

* svm3_bmi.py

print(67/((170 / 100) * (170 / 100)))

import random

def calc_bmi(h,w):
    bmi = w / (h / 100)**2
    if bmi < 18.5: return 'thin'
    if bmi < 23: return 'normal'
    return 'fat'
print(calc_bmi(170, 65))

fp = open('bmi.csv', 'w')
fp.write('height, weight, label\n')

cnt = {'thin':0, 'normal':0, 'fat':0}

for i in range(50000):
    h = random.randint(150, 200)
    w = random.randint(35, 100)
    label = calc_bmi(h, w)
    cnt[label] += 1
    fp.write('{0},{1},{2}\n'.format(h, w, label))
fp.close()
print('good')

# BMI dataset으로 분류
from sklearn import svm, metrics
from sklearn.model_selection import train_test_split
import pandas as pd
import matplotlib.pyplot as plt

tbl = pd.read_csv('bmi.csv')

# 칼럼을 정규화
label = tbl['label']
print(label)
w = tbl['weight'] / 100
h = tbl['height'] / 200
wh = pd.concat([w, h], axis=1)
print(wh.head(5), wh.shape)
'''
   weight  height
0    0.69   0.850
1    0.51   0.835
2    0.70   0.830
3    0.71   0.945
4    0.50   0.980 (50000, 2)
'''
label = label.map({'thin':0, 'normal':1, 'fat':2})
'''
0    2
1    0
2    2
3    1
4    0
'''
print(label[:5], label.shape) # (50000,)

# train/test
data_train, data_test, label_train, label_test = train_test_split(wh, label)
print(data_train.shape, data_test.shape) # (37500, 2) (12500, 2)

# model
model = svm.SVC(C=0.01).fit(data_train, label_train)
#model = svm.LinearSVC().fit(data_train, label_train)
print(model)

# 학습한 데이터의 결과가 신뢰성이 있는지 확인하기 위해 교차검증 p221
from sklearn import model_selection
cross_vali = model_selection.cross_val_score(model, wh, label, cv=3)
# k ford classification
# train 7, test 3 => train으로 3등분 하여 재검증
# 검증 학습 학습
# 학습 검증 학습
# 학습 학습 검증
print('각각의 검증 결과:', cross_vali)          # [0.96754065 0.96400072 0.96783871]
print('평균 검증 결과:', cross_vali.mean())    # 0.9664600275737195

pred = model.predict(data_test)
ac_score = metrics.accuracy_score(label_test, pred)
print('분류 정확도 :', ac_score) # 분류 정확도 : 0.96816
print(metrics.classification_report(label_test, pred))
'''
              precision    recall  f1-score   support

           0       0.98      0.97      0.98      4263
           1       0.91      0.94      0.93      2644
           2       0.98      0.98      0.98      5593

    accuracy                           0.97     12500
   macro avg       0.96      0.96      0.96     12500
weighted avg       0.97      0.97      0.97     12500
'''

# 시각화
tbl2 = pd.read_csv('bmi.csv', index_col = 2)
print(tbl2[:3])
'''
       height  weight
label                
fat       170      69
thin      167      51
fat       166      70
'''

def scatter_func(lbl, color):
    b = tbl2.loc[lbl]
    plt.scatter(b['weight'], b['height'], c=color, label=lbl)


fig = plt.figure()
scatter_func('fat', 'red')
scatter_func('normal', 'yellow')
scatter_func('thin', 'blue')
plt.legend()
plt.savefig('bmi_test.png')
plt.show()

SVM 모델로 이미지 분류

* svm4.py

from sklearn.datasets import fetch_lfw_people

fetch_lfw_people(min_faces_per_person = 60) : 인물 사진 data load. min_faces_per_person : 최초

scikit-learn.org/stable/modules/generated/sklearn.datasets.fetch_lfw_people.html

sklearn.datasets.fetch_lfw_people — scikit-learn 0.24.1 documentation

scikit-learn.org

import matplotlib.pyplot as plt
from sklearn.metrics._classification import classification_report

faces = fetch_lfw_people(min_faces_per_person = 60) 
print(faces)

print(faces.DESCR)
print(faces.data)
print(faces.data.shape) # (729, 2914)
print(faces.target)
print(faces.target_names)
print(faces.images.shape) # (729, 62, 47)

print(faces.images[0])
print(faces.target_names[faces.target[0]])
plt.imshow(faces.images[0], cmap='bone') # cmap : 색
plt.show()

fig, ax = plt.subplots(3, 5)
print(fig)          # Figure(640x480)
print(ax.flat)      # <numpy.flatiter object at 0x00000235198C5D30>
print(len(ax.flat)) # 15
for i, axi in enumerate(ax.flat):
    axi.imshow(faces.images[i], cmap='bone')
    axi.set(xticks=[], yticks=[], xlabel=faces.target_names[faces.target[i]])
plt.show()

- 주성분 분석으로 이미지 차원을 축소시켜 분류작업을 진행

from sklearn.svm import SVC
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline

m_pca = PCA(n_components=150, whiten=True, random_state = 0)
m_svc = SVC(C=1)
model = make_pipeline(m_pca, m_svc)
print(model)
# Pipeline(steps=[('pca', PCA(n_components=150, random_state=0, whiten=True)),
#                 ('svc', SVC(C=1))])

- train/test

from sklearn.model_selection import train_test_split

x_train, x_test, y_train, y_test = train_test_split(faces.data, faces.target, random_state=1)
print(x_train[0], x_train.shape) # (546, 2914)
print(y_train[0], y_train.shape) # (546,)

model.fit(x_train, y_train)  # train data로 모델 fitting
pred = model.predict(x_test)
print('pred :', pred)   # pred : [1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 ..
print('read :', y_test) # read : [0 1 1 1 1 1 1 1 1 1 1 0 1 1 1 ..

- 분류 정확도

from sklearn.metrics import classification_report

print(classification_report(y_test, pred, target_names = faces.target_names)) # 분류 정확도
'''
                   precision    recall  f1-score   support

  Donald Rumsfeld       1.00      0.25      0.40        20
    George W Bush       0.80      1.00      0.89       132
Gerhard Schroeder       1.00      0.45      0.62        31

         accuracy                           0.83       183
        macro avg       0.93      0.57      0.64       183
     weighted avg       0.86      0.83      0.79       183
=> f1-score/accuracy -> 0.83
'''

from sklearn.metrics import confusion_matrix, accuracy_score

mat = confusion_matrix(y_test, pred)
print('confusion_matrix :\n', mat)
'''
 [[  5  15   0]
 [  0 132   0]
 [  0  17  14]]
 '''
print('acc :', accuracy_score(y_test, pred)) # 0.82513

- 분류결과를 시각화

# x_test[0] 하나 미리보기.
plt.subplots(1, 1)
print(x_test[0], ' ', x_test[0].shape)
# [ 24.333334  33.        72.666664 ... 201.66667  201.33333  155.33333 ]   (2914,)
print(x_test[0].reshape(62, 47)) # 1차원을 2차원으로 변환해야 이미지 출력 가능
plt.imshow(x_test[0].reshape(62, 47), cmap='bone')
plt.show()

fig, ax = plt.subplots(4, 6)
for i, axi in enumerate(ax.flat):
    axi.imshow(x_test[i].reshape(62, 47), cmap='bone')
    axi.set(xticks=[], yticks=[])
    axi.set_ylabel(faces.target_names[pred[i]].split()[-1], color='black' if pred[i] == y_test[i] else 'red')
    fig.suptitle('pred result', size = 14)
plt.show()

- 5차 행렬 시각화

import seaborn as sns
sns.heatmap(mat.T, square = True, annot=True, fmt='d', cbar=False, \
            xticklabels=faces.target_names, yticklabels=faces.target_names)
plt.xlabel('true(read) label')
plt.ylabel('predicted label')
plt.show()

'BACK END > Deep Learning' 카테고리의 다른 글

[딥러닝] 나이브 베이즈 (0)	2021.03.17
[딥러닝] PCA (0)	2021.03.16
[딥러닝] 로지스틱 회귀 (0)	2021.03.15
[딥러닝] 다항회귀 (0)	2021.03.12
[딥러닝] 단순선형 회귀, 다중선형 회귀 (0)	2021.03.11

[딥러닝] 로지스틱 회귀

2021. 3. 15. 11:34

- 로지스틱 회귀분석

: 이항분류 분석

: logit(), glm()
: 독립변수 : 연속형, 종속변수 : 범주형

- 출력된 연속형 자료에 대해 odds -> odds ratio -> logit function -> sigmoid function으로 이항분류

- odds(오즈)

: 확률을 바꾼 값. 성공확률(혹은 1일)이 실패확률(0일)에 비해 몇 배 더 높은가를 나타낸다.

- odds ratio(오즈비)

: 두 개의 오즈 비율. 확률 p의 범위가 (0,1)이라면 Odds(p)의 범위는 (0, ∞)이 된다.

- logit(로짓)

: 오즈비에 로그를 취한 값. Odds ratio에 로그함수를 취한 log(Odds(p))은 입력값의 범위가 (-∞ ~ ∞)이 된다. 즉, 범위가 실수 전체다. 이러한 입력 값의 범위를 (0 ~ 1)로 조정한다.

- sigmoid(시그모이드)

: log(Odds(p))의 범위가 실수이므로 이 값에 대한 선형회귀분석을 하는 것은 의미가 있다. 왜냐하면 오즈비(두 개의 odd 비율)에 로그를 씌우면 오즈비 값들이 정규분포를 이루기 때문이다. log(Odds(p))=wx+b로 선형회귀분석을 실시해서 w와 b를 얻을 수 있다. 위 식을 이용한 것이 sigmoid function이다. 이를 통해 0.5을 기준으로 1과 0의 양분된 값을 된다.

* logistic1.py

import math
import numpy as np
from sklearn.metrics._scorer import accuracy_scorer

def sigFunc(x):
    return 1 / ( 1 + math.exp(-x)) # math.exp(x) : e^x

print(sigFunc(0.6))
print(sigFunc(0.2))
print(sigFunc(6))
print(sigFunc(-6))
print(np.around(sigFunc(6)))   # 1.0
print(np.around(sigFunc(-6)))  # 0.0

import statsmodels.api as sm

mtcars = sm.datasets.get_rdataset('mtcars').data
print(mtcars.head(3)) # mtcars data read
'''
                mpg  cyl   disp   hp  drat     wt   qsec  vs  am  gear  carb
Mazda RX4      21.0    6  160.0  110  3.90  2.620  16.46   0   1     4     4
Mazda RX4 Wag  21.0    6  160.0  110  3.90  2.875  17.02   0   1     4     4
Datsun 710     22.8    4  108.0   93  3.85  2.320  18.61   1   1     4     1
'''
print(mtcars['am'].unique()) # [1 0]

import statsmodels.api as sm

sm.datasets.get_rdataset('데이터명').data : 내장 데이터 셋의 데이터 read.

- 방법1 : logit()

import statsmodels.formula.api as smf

formula = 'am ~ mpg + hp'           # 연비, 마력  ->  자/수동 상관관계
result = smf.logit(formula=formula, data=mtcars).fit()
print(result)
'''
Optimization terminated successfully.
         Current function value: 0.300509
         Iterations 9
<statsmodels.discrete.discrete_model.BinaryResultsWrapper object at 0x000001F6244B8040>
'''
print(result.summary())
# p-value < 0.05  =>  유효

pred = result.predict(mtcars[:10])
#print('예측값 : \n', pred)
print('예측값 : \n', np.around(pred))
'''
예측값 : 
 Mazda RX4            0.0
Mazda RX4 Wag        0.0
Datsun 710           1.0
Hornet 4 Drive       0.0
Hornet Sportabout    0.0
Valiant              0.0
Duster 360           0.0
Merc 240D            1.0
Merc 230             1.0
Merc 280             0.0
'''

print('실제값 : \n', mtcars['am'][:10])
'''
실제값 : 
 Mazda RX4            1
Mazda RX4 Wag        1
Datsun 710           1
Hornet 4 Drive       0
Hornet Sportabout    0
Valiant              0
Duster 360           0
Merc 240D            0
Merc 230             0
Merc 280             0
'''

import statsmodels.formula.api as smf

smf.logit(formula='종속변수 ~ 독립변수 + ...', data=데이터).fit() : 로지스틱 회귀 모델 생성

model.predict(데이터) : 모델에 대한 예측 값 산출

- 분류정확도

conf_tab = result.pred_table() # confusion matrix
print(conf_tab)
'''
       예측값   p        n
실제값 참 [[16.(TP)  3.(FN)]
      거짓 [ 3.(FP)  10.(TN)]]
'''
print('분류 정확도 :', (16+10) / len(mtcars)) # 0.8125
print('분류 정확도 :', (conf_tab[0][0] + conf_tab[1][1])/ len(mtcars)) # 0.8125

from sklearn.metrics import accuracy_score
pred2 = result.predict(mtcars)
print('분류 정확도 :', accuracy_score(mtcars['am'], np.around(pred2))) # 0.8125

model.pred_table() : confusion matrix 생성

from sklearn.metrics import accuracy_score

accuracy_score(실제 값, 예측 값) : 분류 정확도 산출

		예측값
		positive	negative
실제값	참	TP	FN
실제값	거짓	FP	TN

=> TP, TN : 예측값과 실제값이 일치
=> 정확도(accuracy) = TP + TN / 전체 개수

=> 정밀도(pecision) = TP / (TP + FP)

=> 재현율(recall) = TP / (TP + FN)

=> 특이도 = TN / (FP + TN)

=> F1 score = 2 x 재현율 x 정밀도 / (재현율 + 정밀도)

- 방법2 : glm()

import statsmodels.formula.api as smf
import statsmodels.api as sm

result2 = smf.glm(formula=formula, data=mtcars, family=sm.families.Binomial()).fit()
print(result2)
print(result2.summary())

glm_pred = result2.predict(mtcars[:5])
print('glm 예측값 :\n', glm_pred)
'''
 Mazda RX4            0.250047
Mazda RX4 Wag        0.250047
Datsun 710           0.558034
Hornet 4 Drive       0.355600
Hornet Sportabout    0.397097
'''
print('실제값 :\n', mtcars['am'][:5])
glm_pred2 = result2.predict(mtcars)
print('분류 정확도 :', accuracy_score(mtcars['am'], np.around(glm_pred2))) # 0.8125

smf.glm(formula='종속변수 ~ 독립변수 +...', data=데이터, family=sm.families.Binomial()).fit() : 로지스틱 회귀 모델 생성

- 새로운 값을 분류

new_df = mtcars.iloc[:2].copy()
new_df['mpg'] = [10, 30]
new_df['hp'] = [100, 130]
print(new_df)
'''
               mpg  cyl   disp   hp  drat     wt   qsec  vs  am  gear  carb
Mazda RX4       10    6  160.0  100   3.9  2.620  16.46   0   1     4     4
Mazda RX4 Wag   30    6  160.0  130   3.9  2.875  17.02   0   1     4     4
'''

glm_pred_new = result2.predict(new_df)
print('새로운 값 분류 결과 :\n', np.around(glm_pred_new))
print('새로운 값 분류 결과 :\n', np.rint(glm_pred_new))
'''
 Mazda RX4        0.0
Mazda RX4 Wag    1.0
'''

import pandas as pd
new_df2 = pd.DataFrame({'mpg':[10, 35], 'hp':[100, 145]})
glm_pred_new2 = result2.predict(new_df2)
print('새로운 값 분류 결과 :\n', np.around(glm_pred_new2))
'''
 0    0.0
1    1.0
'''

np.around(숫자) : 반올림

np.rint(숫자) : 반올림

- 로지스틱 회귀분석

: 날씨 예보 - 강수 예보

* logistic2.py

import pandas as pd
from sklearn.model_selection._split import train_test_split
import statsmodels.api as sm
import statsmodels.formula.api as smf
import numpy as np

data = pd.read_csv('../testdata/weather.csv')
print(data.head(2), data.shape, data.columns) # (366, 12)
'''
         Date  MinTemp  MaxTemp  Rainfall  ...  Cloud  Temp  RainToday  RainTomorrow
0  2016-11-01      8.0     24.3       0.0  ...      7  23.6         No           Yes
1  2016-11-02     14.0     26.9       3.6  ...      3  25.7        Yes           Yes
Index(['Date', 'MinTemp', 'MaxTemp', 'Rainfall', 'Sunshine', 'WindSpeed',
       'Humidity', 'Pressure', 'Cloud', 'Temp', 'RainToday', 'RainTomorrow']
'''

data2 = pd.DataFrame()
data2 = data.drop(['Date', 'RainToday'], axis=1)
data2['RainTomorrow'] = data2['RainTomorrow'].map({'Yes':1, 'No':0})
print(data2.head(5))
'''
   MinTemp  MaxTemp  Rainfall  Sunshine  ...  Pressure  Cloud  Temp  RainTomorrow
0      8.0     24.3       0.0       6.3  ...    1015.0      7  23.6             1
1     14.0     26.9       3.6       9.7  ...    1008.4      3  25.7             1
2     13.7     23.4       3.6       3.3  ...    1007.2      7  20.2             1
3     13.3     15.5      39.8       9.1  ...    1007.0      7  14.1             1
4      7.6     16.1       2.8      10.6  ...    1018.5      7  15.4             0
'''

데이터.drop([칼럼1, ... ], axis=1) : 칼럼 단위 자르기

데이터.map({'key1':value1, 'key2':value2}) : 데이터의 key와 동일할 경우 value로 set.

- train (모델을 학습) / test (모델을 검증)로 분리 : 과적합 분리

train, test = train_test_split(data2, test_size=0.3, random_state = 42) # 샘플링, random_state : seed no
print(train.shape, test.shape) # (256, 10) (110, 10)

from sklearn.model_selection._split import train_test_split

train_test_split(데이터, test_size=0.3, random_state = seed넘버) : 데이터를 train, test로 test_size 비율로 분할.

- 분류 모델

#my_formula = 'RainTomorrow ~ MinTemp + MaxTemp + ...'
col_sel = "+".join(train.columns.difference(['RainTomorrow'])) # difference(x) : x 제외
my_formula = 'RainTomorrow ~ ' + col_sel
print(my_formula) 
# RainTomorrow ~ Cloud+Humidity+MaxTemp+MinTemp+Pressure+Rainfall+Sunshine+Temp+WindSpeed

model = smf.logit(formula=my_formula, data = train).fit()
#model = smf.glm(formula=my_formula, data = train, family=sm.families.Binomial()).fit()

print(model)
print(model.params)
print('예측값:\n', np.around(model.predict(test)[:5]))
'''
 193    0.0
33     0.0
15     0.0
310    0.0
57     0.0
'''
print('실제값:\n', test['RainTomorrow'][:5])
'''
 193    0
33     0
15     0
310    0
57     0
'''

구분자.join(데이터.difference([x, .. ])) : 데이터 사이에 구분자를 포함하여 결합. difference(x) : join시 x는 제외.

- 정확도

con_mat = model.pred_table() # smf.logit()에서 지원, smf.glm()에서 지원하지않음.
print('con_mat : \n', con_mat)
'''
 [[197.   9.]
 [ 21.  26.]]
'''
print('train 분류 정확도 :', (con_mat[0][0] + con_mat[1][1])/ len(train)) # 0.87109375

from sklearn.metrics import accuracy_score
pred = model.predict(test) # sigmoid function에 의해 출력
print('test 분류 정확도 :', accuracy_score(test['RainTomorrow'], np.around(pred))) # 0.87272727

model.pred_table() : 분류 정확도 테이블 생성. logit()에서 지원. gim()은 지원하지않음.

from sklearn.metrics import accuracy_score

accuracy_score(실제값, np.around(예측값)) : 정확도 산출

verginica, setosa + versicolor로 분리해 구분 결정간격 시각화

* logistic3.py

from sklearn import datasets
from sklearn.linear_model import LogisticRegression
import numpy as np

iris = datasets.load_iris()
print(iris)
print(iris.keys())
# dict_keys(['data', 'target', 'frame', 'target_names', 'DESCR', 'feature_names', 'filename'])
print(iris.target)

x = iris['data'][:, 3:] # petal width로 실습
print(x[:5])
# [0.2 0.2 0.2 0.2 0.2]

y = (iris['target'] == 2).astype(np.int)
print(y[:5])
# [0 0 0 0 0]
print()

log_reg = LogisticRegression().fit(x,y) # 모델생성
print(log_reg)

x_new = np.linspace(0, 3, 1000).reshape(-1,1) # 0 ~ 3 사이 1000개의 난수 발생
print(x_new.shape) # (1000, 1)
y_proba = log_reg.predict_proba(x_new) # 확률값
print(y_proba)
'''
[[9.99250016e-01 7.49984089e-04]
 [9.99240201e-01 7.59799387e-04] ...
 
'''

import matplotlib.pyplot as plt
plt.plot(x_new, y_proba[:, 1], 'r-', label='verginica')
plt.plot(x_new, y_proba[:, 0], 'b--', label='setosa + versicolor')
plt.xlabel('petal width')
plt.legend()
plt.show()

print(log_reg.predict([[1.5],[1.7]]))       # [0 1]
print(log_reg.predict([[2.5],[0.7]]))       # [1 0]
print(log_reg.predict_proba([[2.5],[0.7]])) # [[0.02563061 0.97436939]  [0.98465572 0.01534428]]

LogisticRegression으로 iris의 꽃의 종류를 분류

* logistic4

from sklearn import datasets
from sklearn.linear_model import LogisticRegression
import numpy as np
from sklearn.model_selection._split import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import StandardScaler, MinMaxScaler
import pandas as pd

iris = datasets.load_iris()
print(iris.data[:3])
'''
[[5.1 3.5 1.4 0.2]
 [4.9 3.  1.4 0.2]
 [4.7 3.2 1.3 0.2]]
'''
print(np.corrcoef(iris.data[:, 2], iris.data[:, 3]))

x = iris.data[:, [2, 3]] # feature(독립변수, x) : petal length, petal width
y = iris.target # label, class
print(type(x), type(y), x.shape, y.shape) # ndarray, ndarray (150, 2) (150,)
print(set(y)) # {0, 1, 2}

- train / test 분리

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.3, random_state=0)
print(x_train.shape, x_test.shape, y_train.shape, y_test.shape) # (105, 2) (45, 2) (105,) (45,)

- scaling(표준화 : 단위가 다른 feature가 두개 이상인 경우 표준화를 진행하여 모델의 성능을 향상시킨다)

print(x_train[:3])
'''
[[3.5 1. ]
 [5.5 1.8]
 [5.7 2.5]]
'''
sc = StandardScaler()
sc.fit(x_train)
sc.fit(x_test)
x_train = sc.transform(x_train)
x_test = sc.transform(x_test)
print(x_train[:3])
'''
[[-0.05624622 -0.18650096]
 [ 1.14902997  0.93250481]
 [ 1.26955759  1.91163486]]
'''
# 표준화 값을 원래 값으로 복귀
# inver_x_train = sc.inverse_transform(x_train)
# print(inver_x_train[:3])

- 분류 모델

: logit(), glm() : 이항분류 - 활성화 함수 - sigmoid : 출력 값이 0.5 기준으로 크고 작음에 따라 1, 2로 변경
: LogisticRegression : 다항분류 - 활성화 함수 - softmax : 복수의 확률값 중 가장 큰 값을 채택

model = LogisticRegression(C=1.0, random_state = 0) # C속성 : 모델에 패널티를 적용(L2 정규화) - 과적합 방지
model.fit(x_train, y_train) # 지도학습

- 분류 예측

y_pred = model.predict(x_test) # 검정자료는 test
print('예측값 :', y_pred)
print('실제값 :', y_test)

- 분류 정확도

print('총 개수 : %d, 오류수:%d'%(len(y_test), (y_test != y_pred).sum())) # 총 개수 : 45, 오류수:2
print('분류 정확도 출력 1: %.3f'%accuracy_score(y_test, y_pred))          # 분류 정확도 출력 1: 0.956

con_mat = pd.crosstab(y_test, y_pred, rownames = ['예측치'], colnames=['실제치'])
print(con_mat)
'''
실제치   0   1   2
예측치            
0    16   0   0
1     0  17   1
2     0   1  10
'''

print('분류 정확도 출력 2:', (con_mat[0][0] + con_mat[1][1] + con_mat[2][2]) / len(y_test))
# 분류 정확도 출력 2: 0.9555555555555556

print('분류 정확도 출력 3:', model.score(x_test, y_test))   # test
# 분류 정확도 출력 3: 0.9555555555555556
print('분류 정확도 출력 3:', model.score(x_train, y_train)) # train
# 분류 정확도 출력 3: 0.9523809523809523

- 새로운 값으로 예측

new_data = np.array([[5.1, 2.4], [1.1, 1.4], [8.1, 8.4]])
# 표준화
sc.fit(new_data)
new_data = sc.transform(new_data)
new_pred = model.predict(new_data)
print('새로운 값으로 예측 :', new_pred) #  [1 0 2]

- 붓꽃 자료에 대한 로지스틱 회귀 결과를 차트로 그리기

import numpy as np
import matplotlib.pyplot as plt
from matplotlib.colors import ListedColormap
from matplotlib import font_manager, rc

plt.rc('font', family='malgun gothic')      
plt.rcParams['axes.unicode_minus']= False

def plot_decision_region(X, y, classifier, test_idx=None, resolution=0.02, title=''):
    markers = ('s', 'x', 'o', '^', 'v')  # 점 표시 모양 5개 정의
    colors = ('r', 'b', 'lightgreen', 'gray', 'cyan')
    cmap = ListedColormap(colors[:len(np.unique(y))])
    #print('cmap : ', cmap.colors[0], cmap.colors[1], cmap.colors[2])

    # decision surface 그리기
    x1_min, x1_max = X[:, 0].min() - 1, X[:, 0].max() + 1
    x2_min, x2_max = X[:, 0].min() - 1, X[:, 0].max() + 1
    xx, yy = np.meshgrid(np.arange(x1_min, x1_max, resolution), np.arange(x2_min, x2_max, resolution))

    # xx, yy를 ravel()를 이용해 1차원 배열로 만든 후 전치행렬로 변환하여 퍼셉트론 분류기의 
    # predict()의 인자로 입력하여 계산된 예측값을 Z로 둔다.
    Z = classifier.predict(np.array([xx.ravel(), yy.ravel()]).T)
    Z = Z.reshape(xx.shape)   # Z를 reshape()을 이용해 원래 배열 모양으로 복원한다.

    # X를 xx, yy가 축인 그래프 상에 cmap을 이용해 등고선을 그림
    plt.contourf(xx, yy, Z, alpha=0.5, cmap=cmap)
    plt.xlim(xx.min(), xx.max())
    plt.ylim(yy.min(), yy.max())

    X_test = X[test_idx, :]
    for idx, cl in enumerate(np.unique(y)):
        plt.scatter(x=X[y==cl, 0], y=X[y==cl, 1], c=cmap(idx), marker=markers[idx], label=cl)

    if test_idx:
        X_test = X[test_idx, :]
        plt.scatter(X_test[:, 0], X_test[:, 1], c=[], linewidth=1, marker='o', s=80, label='testset')

    plt.xlabel('꽃잎 길이')
    plt.ylabel('꽃잎 너비')
    plt.legend(loc=2)
    plt.title(title)
    plt.show()

x_combined_std = np.vstack((x_train, x_test))
y_combined = np.hstack((y_train, y_test))
plot_decision_region(X=x_combined_std, y=y_combined, classifier=model, test_idx=range(105, 150), title='scikit-learn제공')

- 정규화

- 표준화

ROC curve

: 분류모델 성능 평가

* logistic5.py

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
import numpy as np
import pandas as pd

x, y = make_classification(n_samples=16, n_features=2, n_informative=2, n_redundant=0, random_state=12)
# : dataset
# n_samples : 표준 데이터수, n_features : 독립변수 수
print(x)
'''
[[-1.03701295 -0.8840986 ]
 [-1.181542    1.35572706]
 [-1.57888668 -0.13665031]
 [-2.04426219  0.79930258]
 [-1.42777756  0.2448902 ]
 [ 1.26492389  1.54672358]
 [ 2.53102266  1.99835068]
 [-1.66485782  0.71855249]
 [ 0.96918839 -1.25885923]
 [-3.23328615  1.58405095]
 [ 1.79298809  1.77564192]
 [ 1.34738938  0.66463162]
 [-0.35655805  0.33163742]
 [ 1.39723888  1.23611398]
 [ 0.93616267 -1.36918874]
 [ 0.69830946 -2.46962002]]
'''
print(y)
# [0 1 0 0 1 1 1 0 0 1 1 0 1 1 0 0]

model = LogisticRegression().fit(x, y) # 모델
y_hat = model.predict(x)               # 예측
print(y_hat)
# [0 1 0 1 0 1 1 1 0 1 1 1 0 1 0 0]

f_value = model.decision_function(x)
# 결정/판별/불확실성 추정 합수. ROC curve의 판별 경계선 설정을 위한 sample data 제공
print(f_value)
'''
[ 0.37829565  1.6336573  -1.42938156  1.21967832  2.06504666 -4.11896895
 -1.04677034 -1.21469968  1.62496692 -0.43866584 -0.92693183 -0.76588836
  0.09428499  1.62617134 -2.08158634  2.36316277]
'''

df = pd.DataFrame(np.vstack([f_value, y_hat, y]).T, columns= ['f', 'y_hat', 'y'])
df.sort_values("f", ascending=False).reset_index(drop=True)
print(df)
'''
           f  y_hat    y
0  -1.902803    0.0  0.0
1   1.000982    1.0  1.0
2  -1.008356    0.0  0.0
3   0.143868    1.0  0.0
4  -0.487168    0.0  1.0
5   1.620022    1.0  1.0
6   2.401185    1.0  1.0 ...
'''

# ROC
from sklearn.metrics import confusion_matrix
print(confusion_matrix(y, y_hat, labels=[1, 0]))
# [[6 2]
#  [3 5]]
accuracy = (6 + 5) / (6 + 2 + 3 + 5)
print('accuracy : ', accuracy) # accuracy :  0.6875
recall = 6 / (6 + 3)           # 재현율 TPR
print('recall : ', recall)     # recall :  0.6666666666666666
fallout = 3 / (3 + 5)          # 위 양선율 FPR
print('fallout : ', fallout)   # fallout :  0.375

from sklearn import metrics
acc_sco = metrics.accuracy_score(y, y_hat)
cl_rep = metrics.classification_report(y, y_hat)
print('acc_sco : ', acc_sco)   # acc_sco :  0.6875
print('cl_rep : \n', cl_rep)
'''
               precision    recall  f1-score   support

           0       0.71      0.62      0.67         8
           1       0.67      0.75      0.71         8

    accuracy                           0.69        16
   macro avg       0.69      0.69      0.69        16
weighted avg       0.69      0.69      0.69        16
'''

from sklearn.metrics import roc_curve
fpr, tpr, thresholds = roc_curve(y, model.decision_function(x))
print('fpr :', fpr)             # fpr : [0.    0.    0.    0.375 0.375 1.   ]
print('tpr :', tpr)             # tpr : [0.    0.125 0.75  0.75  1.    1.   ]
print('thresholds', thresholds) # thresholds [ 3.40118546  2.40118546  0.98927765  0.09570707 -0.48716822 -3.71164276]

import matplotlib.pyplot as plt
plt.plot(fpr, tpr, 'o-', label='Logistic Regression')
plt.plot([0, 1], [0, 1], 'k--', label='random guess')
plt.plot([fallout], [recall], 'ro', ms=10)
plt.xlabel('FPR')
plt.ylabel('TPR')
plt.title('ROC')
plt.show()

# AUC (Area Under the Curve) : ROC 커브의 면적
from sklearn.metrics import auc
print('auc :', auc(fpr, tpr)) # auc : 0.90625

'BACK END > Deep Learning' 카테고리의 다른 글

[딥러닝] PCA (0)	2021.03.16
[딥러닝] SVM (0)	2021.03.16
[딥러닝] 다항회귀 (0)	2021.03.12
[딥러닝] 단순선형 회귀, 다중선형 회귀 (0)	2021.03.11
[딥러닝] 선형회귀 (0)	2021.03.10

[딥러닝] 다항회귀

2021. 3. 12. 12:51

다항회귀

선형회귀 모델을 다항회귀로 변환

* linear_reg10.py

import numpy as np
import matplotlib.pyplot as plt

x = np.array([1,2,3,4,5])
y = np.array([4,2,1,3,7])
plt.scatter(x, y) # 산포도
plt.show()

선형회귀 모델

from sklearn.linear_model import LinearRegression
x = x[:, np.newaxis] # 입력을 matrix로 주어야함으로 차원 확대
print(x)
'''
[[1]
 [2]
 [3]
 [4]
 [5]]
'''
model = LinearRegression().fit(x, y) # 선형회귀 모델
y_pred = model.predict(x) # 예측값
print(y_pred)
# [2.  2.7 3.4 4.1 4.8]

plt.scatter(x, y) # 산포도
plt.plot(x, y_pred, c='red') # 추세선 그래프
plt.show()

다항식 특징을 추가

# 비선형인 경우 다항식 특징을 추가해서 작업한다.
from sklearn.preprocessing import PolynomialFeatures
poly = PolynomialFeatures(degree=3, include_bias = False) # degree : 열 개수, include_bias : 편향
print(poly)
x2 = poly.fit_transform(x) # 특징 행렬 생성
print(x2)
'''
[[  1.   1.   1.]
 [  2.   4.   8.]
 [  3.   9.  27.]
 [  4.  16.  64.]
 [  5.  25. 125.]]
  ----제곱------>
'''

model2 = LinearRegression().fit(x2, y) # 선형회귀 모델
y_pred2 = model2.predict(x2) # 예측값
print(y_pred2)
# [4.04285714 1.82857143 1.25714286 2.82857143 7.04285714]

plt.scatter(x, y) # 산포도
plt.plot(x, y_pred2, c='red') # 추세선 그래프
plt.show()

선형회귀 모델을 다항회귀로 변환

: 다항식 추가. 특징행렬 생성.

* linear_reg11.py

import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
from sklearn.metrics._regression import mean_squared_error, r2_score

x = np.array([258, 270, 294, 320, 342, 368, 396, 446, 480, 586])[:, np.newaxis]
print(x)
'''
[[258]
 [270]
 [294]
 [320]
 [342]
 [368]
 [396]
 [446]
 [480]
 [586]] 
'''
y = np.array([236, 234, 253, 298, 314, 342, 360, 368, 391, 390])

# 비교목적으로 일반회귀 모델 클래스와 다항식 모델 클래스
lr = LinearRegression()
pr = LinearRegression()
polyf = PolynomialFeatures(degree=2) # 특징 행렬 생성
x_quad = polyf.fit_transform(x)
print(x_quad)
'''
[[1.00000e+00 2.58000e+02 6.65640e+04]
 [1.00000e+00 2.70000e+02 7.29000e+04]
 [1.00000e+00 2.94000e+02 8.64360e+04]
'''
lr.fit(x, y)
x_fit = np.arange(250, 600, 10)[:, np.newaxis]
print(x_fit)
'''
[[250]
 [260]
 [270]
 [280]
 [290]
'''

y_lin_fit = lr.predict(x_fit)
print(y_lin_fit)
# [250.63869122 256.03244588 261.42620055 ...

pr.fit(x_quad, y)
y_quad_fit = pr.predict(polyf.fit_transform(x_fit))
print(y_quad_fit)
# [215.50100168 228.03388862 240.11490613 ...

# 시각화
plt.scatter(x, y, label='train points')
plt.plot(x_fit, y_lin_fit, label='linear fit', linestyle='--', c='red')
plt.plot(x_fit, y_quad_fit, label='quadratic fit', linestyle='-', c='blue')
plt.legend()
plt.show()
print()

# MSE(평균 제곱오차)와 R2(결정계수) 확인
y_lin_pred = lr.predict(x)
print('y_lin_pred :\n', y_lin_pred)
'''
 [254.95369495 261.42620055 274.37121174 288.39497387 300.26123414
 314.28499627 329.38750933 356.35628266 374.69504852 431.86884797]
'''
y_quad_pred = pr.predict(x_quad)
print('y_quad_pred :\n', y_quad_pred)
'''
 [225.56346079 240.11490613 267.26572086 293.74195218 313.75904653
 334.59594733 353.61955374 378.77882554 389.43443486 389.12615204]
'''

print('train MSE 비교 : 선형모델은 %.3f, 다항모델은 %.3f'%(mean_squared_error(y, y_lin_pred), mean_squared_error(y, y_quad_pred)))
# train MSE 비교 : 선형모델은 570.885, 다항모델은 58.294

print('train 결정계수 비교 : 선형모델은 %.3f, 다항모델은 %.3f'%(r2_score(y, y_lin_pred), r2_score(y, y_quad_pred)))
# train 결정계수 비교 : 선형모델은 0.831, 다항모델은 0.983

'BACK END > Deep Learning' 카테고리의 다른 글

[딥러닝] SVM (0)	2021.03.16
[딥러닝] 로지스틱 회귀 (0)	2021.03.15
[딥러닝] 단순선형 회귀, 다중선형 회귀 (0)	2021.03.11
[딥러닝] 선형회귀 (0)	2021.03.10
[딥러닝] 공분산, 상관계수 (0)	2021.03.10

[딥러닝] 단순선형 회귀, 다중선형 회귀

2021. 3. 11. 10:36

단순선형 회귀 Simple Linear Regression

: ols()

: 독립변수 - 연속형, 종속변수 - 연속형.
: 독립변수 1개

* linear_reg4.py

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
plt.rc('font', family='malgun gothic')

df = pd.read_csv('../testdata/drinking_water.csv')
print(df.head(3), '\n', df.describe())
'''
   친밀도  적절성  만족도
0    3    4    3
1    3    3    2
2    4    4    4 
'''

print(df.corr()) # 적절성/만족도 상관계수 : 0.766853

print('----------------------------------------------------------------------')
import statsmodels.formula.api as smf

model = smf.ols(formula='만족도 ~ 적절성', data=df).fit()
print(model.summary())

                            OLS Regression Results                            
==============================================================================
Dep. Variable:                    만족도   R-squared:                       0.588
Model:                            OLS   Adj. R-squared:                  0.586
Method:                 Least Squares   F-statistic:                     374.0
Date:                Thu, 11 Mar 2021   Prob (F-statistic):           2.24e-52
Time:                        10:07:49   Log-Likelihood:                -207.44
No. Observations:                 264   AIC:                             418.9
Df Residuals:                     262   BIC:                             426.0
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept      0.7789      0.124      6.273      0.000       0.534       1.023
적절성            0.7393      0.038     19.340      0.000       0.664       0.815
==============================================================================
Omnibus:                       11.674   Durbin-Watson:                   2.185
Prob(Omnibus):                  0.003   Jarque-Bera (JB):               16.003
Skew:                          -0.328   Prob(JB):                     0.000335
Kurtosis:                       4.012   Cond. No.                         13.4
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

[해석]

상관계수 ** 2 = 결정계수

print(0.766853 ** 2) # 0.588063523609
R-squared : 결정계수(설명력), 상관계수 R의 제곱 : 0.588
: 1 - (SSE(explain sum of square-추세선과 데이터간 y값) / SST(total sum of square - 평균과 추세선간 y값

차이) )

: 1 - (SSE / SST)

=> over fitting : R2가 1에 아주 가까우면(기존 데이터와 추사) 새로운 데이터에 대해 설명력이 좋지않다.
적절성의 p-value : 0.000 < 0.05 => 모델은 유효하다.
std err(표준 오차) : 0.038
Intercept(y절편) : 0.7789
coef(기울기) : 0.7393
t = 기울기/ 표준오차 : 19.340

print(0.7393 / 0.038) # 19.455263157894738
F-statistic = t**2 : 374.0

print(19.340 ** 2) # 374.0356

독립변수가 많을 경우 R-squared과 Adj. R-squared의 차이가 클 경우 독립변수 이상치를 확인해야한다.
Kurtosis : 4.012 => 3보다 클경우 평균에 데이터가 몰려있다.

print(model.params) # y절편과 기울기 산출
# Intercept    0.778858
#적절성          0.739276

print(model.rsquared) # 0.5880630629464404
print()
print(model.pvalues)
'''
Intercept    1.454388e-09
적절성          2.235345e-52
'''
#print(model.predict()) # 예측값
print(df.만족도[0],' ', model.predict()[0]) # 3   3.7359630488589186

# 새로운 값 예측
print(df.적절성[:5])
'''
3   3.7359630488589186
0    4
1    3
2    4
3    2
4    2
'''

print(df.만족도[:5])
'''
0    3
1    2
2    4
3    2
4    2
'''

print(model.predict()[:5]) # [3.73596305 2.99668687 3.73596305 2.25741069 2.25741069]
print()

new_df = pd.DataFrame({'적절성':[6,5,4,3,22]})
new_pred = model.predict(new_df)
print('new_pred :\n', new_pred)
'''
 0     5.214515
1     4.475239
2     3.735963
3     2.996687
4    17.042934
'''

plt.scatter(df.적절성, df.만족도)
slope, intercept = np.polyfit(df.적절성, df.만족도, 1) # R의 abline 기능
plt.plot(df.적절성, df.적절성 * slope + intercept, 'b') # 추세선
plt.show()

다중 선형회귀 Multiple Linear Regression

: 독립변수가 복수

model2 = smf.ols(formula='만족도 ~ 적절성 + 친밀도', data=df).fit()
print(model2.summary())

                            OLS Regression Results                            
==============================================================================
Dep. Variable:                    만족도   R-squared:                       0.598
Model:                            OLS   Adj. R-squared:                  0.594
Method:                 Least Squares   F-statistic:                     193.8
Date:                Thu, 11 Mar 2021   Prob (F-statistic):           2.61e-52
Time:                        11:19:33   Log-Likelihood:                -204.37
No. Observations:                 264   AIC:                             414.7
Df Residuals:                     261   BIC:                             425.5
Df Model:                           2                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept      0.6673      0.131      5.096      0.000       0.409       0.925
적절성            0.6852      0.044     15.684      0.000       0.599       0.771
친밀도            0.0959      0.039      2.478      0.014       0.020       0.172
==============================================================================
Omnibus:                       13.103   Durbin-Watson:                   2.174
Prob(Omnibus):                  0.001   Jarque-Bera (JB):               17.256
Skew:                          -0.382   Prob(JB):                     0.000179
Kurtosis:                       3.992   Cond. No.                         18.8
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

단순 선형 회귀

: iris dataset, ols() 사용. 상관관계가 약한/강한 변수로 모델 작성.

* linear_reg5.py

import pandas as pd
import statsmodels.formula.api as smf
import seaborn as sns
iris = sns.load_dataset('iris')
print(iris.head(3))
'''
   sepal_length  sepal_width  petal_length  petal_width species
0           5.1          3.5           1.4          0.2  setosa
1           4.9          3.0           1.4          0.2  setosa
2           4.7          3.2           1.3          0.2  setosa
'''

print(iris.corr())
'''
              sepal_length  sepal_width  petal_length  petal_width
sepal_length      1.000000    -0.117570      0.871754     0.817941
sepal_width      -0.117570     1.000000     -0.428440    -0.366126
petal_length      0.871754    -0.428440      1.000000     0.962865
petal_width       0.817941    -0.366126      0.962865     1.000000
'''

# 단순 선형회귀 모델 : 상관관계 r = -0.117570(sepal_length/sepal_width)
result = smf.ols(formula = 'sepal_length ~ sepal_width', data=iris).fit()
#print(result.summary()) #  R2 : 0.014
print(result.rsquared)# 0.01382 < 0.05      => 의미없는 모델
print(result.pvalues) # 1.518983e-01 > 0.05

result2 = smf.ols(formula = 'sepal_length ~ petal_length', data=iris).fit()
print(result2.summary()) #  R2 : 0.760      => 설명력
print(result2.rsquared)# 0.7599 > 0.05      => 의미있는 모델
print(result2.pvalues) # 1.038667e-47 < 0.05
print()

pred = result2.predict()
print('실제값 :', iris.sepal_length[0]) # 실제값 : 5.1
print('예측값 :', pred[0])              # 예측값 : 4.879094603339241

# 새로운 데이터로 예측
print(iris.petal_length[1:5])
new_data = pd.DataFrame({'petal_length':[1.4, 0.5, 8.5, 12.123]})
print(new_data)
'''
   petal_length
0         1.400
1         0.500
2         8.500
3        12.123
'''
y_pred_new = result2.predict(new_data)
print('새로운 데이터로 sepal_length예측 :\n', y_pred_new)
'''
 0    4.879095
1    4.511065
2    7.782443
3    9.263968
'''

다중 선형 회귀

result3 = smf.ols(formula = 'sepal_length ~ petal_length + petal_width', data=iris).fit()
print(result3.summary()) #  R2 : 0.760      => 설명력

                            OLS Regression Results                            
==============================================================================
Dep. Variable:           sepal_length   R-squared:                       0.766
Model:                            OLS   Adj. R-squared:                  0.763
Method:                 Least Squares   F-statistic:                     241.0
Date:                Thu, 11 Mar 2021   Prob (F-statistic):           4.00e-47
Time:                        12:09:43   Log-Likelihood:                -75.023
No. Observations:                 150   AIC:                             156.0
Df Residuals:                     147   BIC:                             165.1
Df Model:                           2                                         
Covariance Type:            nonrobust                                         
================================================================================
                   coef    std err          t      P>|t|      [0.025      0.975]
--------------------------------------------------------------------------------
Intercept        4.1906      0.097     43.181      0.000       3.999       4.382
petal_length     0.5418      0.069      7.820      0.000       0.405       0.679
petal_width     -0.3196      0.160     -1.992      0.048      -0.637      -0.002
==============================================================================
Omnibus:                        0.383   Durbin-Watson:                   1.826
Prob(Omnibus):                  0.826   Jarque-Bera (JB):                0.540
Skew:                           0.060   Prob(JB):                        0.763
Kurtosis:                       2.732   Cond. No.                         25.3
==============================================================================

print('R-squared :', result3.rsquared)# 0.7662 > 0.05      => 의미있는 모델
print('p-value', result3.pvalues)
# petal_length    9.414477e-13
# petal_width     4.827246e-02
# y = 0.5418 * x1 -0.3196 * x2 + 4.1906

# 새로운 데이터로 예측
new_data2 = pd.DataFrame({'petal_length':[8.5, 12.12], 'petal_width':[8.5, 12.5]})
y_pred_new2 = result3.predict(new_data2)
print('새로운 데이터로 sepal_length예측 :\n', y_pred_new2)
'''
 0    6.079508
1    6.762540
'''

선형 회귀 분석

: mtcars dataset, ols() 사용. 모델작성 후 추정치 얻기

* linear_reg6.py

import statsmodels.api
import statsmodels.formula.api as smf
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

plt.rc('font', family='malgun gothic')

mtcars = statsmodels.api.datasets.get_rdataset('mtcars').data
print(mtcars)
'''
                      mpg  cyl   disp   hp  drat  ...   qsec  vs  am  gear  carb
Mazda RX4            21.0    6  160.0  110  3.90  ...  16.46   0   1     4     4
Mazda RX4 Wag        21.0    6  160.0  110  3.90  ...  17.02   0   1     4     4
'''
print(mtcars.columns) # Index(['mpg', 'cyl', 'disp', 'hp', 'drat', 'wt', 'qsec', 'vs', 'am', 'gear', 'carb'], dtype='object')
print(mtcars.describe())
print(np.corrcoef(mtcars.hp, mtcars.mpg)) # 상관계수 : -0.77616837
print(np.corrcoef(mtcars.wt, mtcars.mpg)) # 상관계수 : -0.86765938
print(mtcars.corr())

# 시각화
plt.scatter(mtcars.hp, mtcars.mpg)
plt.xlabel('마력 수')
plt.ylabel('연비')
slope, intercept = np.polyfit(mtcars.hp, mtcars.mpg, 1) # 1차원
plt.plot(mtcars.hp, mtcars.hp * slope + intercept, 'r')
plt.show()

# 단순선형 회귀
result = smf.ols('mpg ~ hp', data=mtcars).fit()
print(result.summary())
print(result.conf_int(alpha=0.05)) # 33.435772
print(result.summary().tables[0])  # coef * x + Intercept
print('마력수  110에 대한 연비 예측 :', -0.0682 * 110 + 30.0989) # 22.5969
print('마력수  50에 대한 연비 예측 :', -0.0682 * 50 + 30.0989)   # 26.6889
# 마력이 증가하면 연비는 줄어든다. 음의 상관관계이므로 결과는 반비례한다. 참고 자료로만 활용해야한다.

# 다중선형 회귀
result2 = smf.ols('mpg ~ hp + wt', data=mtcars).fit()
print(result2.summary())
print(result2.conf_int(alpha=0.05))
print(result2.summary().tables[0])
print('마력수 110 + 무게 5에 대한 연비 예측 :', ((-0.0318 * 110) +(-3.8778 * 5) + 37.2273)) # 14.3403

print('추정치 구하기 차체 무게를 입력해 연비를 추정')
result3 = smf.ols('mpg ~ wt', data=mtcars).fit()
print(result3.summary())
print('결정계수 :', result3.rsquared) # 0.7528327936582646 > 0.05 설명력이 우수한 모델
pred = result3.predict()

# 1개의 자료로 실제값과 예측값(추정값) 저장 후 비교
print(mtcars.mpg[0])
print(pred[0]) # 모든 자동차 차체 무게에 대한 연비 추정치 출력

data = {
    'mpg':mtcars.mpg,
    'mpg_pred':pred
    }
df = pd.DataFrame(data)
print(df)
'''
                      mpg   mpg_pred
Mazda RX4            21.0  23.282611
Mazda RX4 Wag        21.0  21.919770
Datsun 710           22.8  24.885952
'''

# 새로운 차체 무게로 연비 추정하기
mtcars.wt = float(input('차체 무게 입력:'))
new_pred = result3.predict(pd.DataFrame(mtcars.wt))
print('차체 무게 {}일때 예상연비{}이다'.format(mtcars.wt[0], new_pred[0]))
# 차체 무게 1일때 예상연비31.940654594619367이다

# 여러 차제 무게에 대한 연비 추정
new_wt = pd.DataFrame({'wt':[6, 3, 0.5]})
new_pred2 = result3.predict(pd.DataFrame(new_wt))
print('예상연비 : \n', np.round(new_pred2.values, 2)) #  [ 5.22 21.25 34.61]

선형 회귀 분석

: 여러매체의 광고비에 따른 판매량 데이터, ols() 사용. 모델작성 후 추정치 얻기

* linear_reg7

import statsmodels.api
import statsmodels.formula.api as smf
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt


adf_df = pd.read_csv('../testdata/Advertising.csv', usecols=[1,2,3,4])
print(adf_df.head(3), ' ', adf_df.shape) # (200, 4)
print(adf_df.index, adf_df.columns)
print(adf_df.info())
'''
      tv  radio  newspaper  sales
0  230.1   37.8       69.2   22.1
1   44.5   39.3       45.1   10.4
2   17.2   45.9       69.3    9.3
'''

print('상관계수 r : \n', adf_df.loc[:, ['sales', 'tv']].corr())
'''
           sales        tv
sales  1.000000  0.782224
tv     0.782224  1.000000
'''
# r : 0.782224 > 0.05 => 강한 양의 상관관계이고, 인과관계임을 알 수 있다.
print()

lm = smf.ols(formula='sales ~ tv', data=adf_df).fit()
print(lm.summary()) # R-squared : 0.612, p : 1.47e-42
print(lm.params)
print(lm.pvalues)
print(lm.rsquared)

# 시각화
plt.scatter(adf_df.tv, adf_df.sales)
plt.xlabel('tv')
plt.ylabel('sales')
x = pd.DataFrame({'tv':[adf_df.tv.min(), adf_df.tv.max()]})
y_pred = lm.predict(x)
plt.plot(x, y_pred, c='red')
plt.title('Linear Regression')
sns.regplot(adf_df.tv, adf_df.sales, scatter_kws = {'color':'r'})
plt.xlim(-50, 350)
plt.ylim(ymin=0)
plt.show()

# 예측 : 새로운 tv값으로 sales를 추정
x_new = pd.DataFrame({'tv':[230.1, 44.5, 100]})
pred = lm.predict(x_new)
print('추정값 :\n', pred)
'''
0    17.970775
1     9.147974
2    11.786258
'''

print('\n다중 선형회귀 모델 ')
lm_mul = smf.ols(formula = 'sales ~ tv + radio + newspaper', data = adf_df).fit()
#  + newspaper 포함시와 미포함시의 R2값 변화가 없어 제거 필요.
print(lm_mul.summary())
print(adf_df.corr())

# 예측2 : 새로운 tv, radio값으로 sales를 추정
x_new2 = pd.DataFrame({'tv':[230.1, 44.5, 100], 'radio':[30.1, 40.1, 50.1],\
                      'newspaper':[10.1, 10.1, 10.1]})
pred2 = lm.predict(x_new2)
print('추정값 :\n', pred2)
'''
0    17.970775
1     9.147974
2    11.786258
'''

회귀분석모형의 적절성을 위한 조건

: 아래의 조건 위배 시에는 변수 제거나 조정을 신중히 고려해야 함.

- 정규성 : 독립변수들의 잔차항이 정규분포를 따라야 한다.
- 독립성 : 독립변수들 간의 값이 서로 관련성이 없어야 한다.
- 선형성 : 독립변수의 변화에 따라 종속변수도 변화하나 일정한 패턴을 가지면 좋지 않다.
- 등분산성 : 독립변수들의 오차(잔차)의 분산은 일정해야 한다. 특정한 패턴 없이 고르게 분포되어야 한다.
- 다중공선성 : 독립변수들 간에 강한 상관관계로 인한 문제가 발생하지 않아야 한다.

# 잔차항
fitted = lm_mul.predict(adf_df)     # 예측값
print(fitted)
'''
0      20.523974
1      12.337855
2      12.307671
'''
residual = adf_df['sales'] - fitted # 잔차

import seaborn as sns
print('선형성 - 예측값과 잔차가 비슷하게 유지')
sns.regplot(fitted, residual, lowess = True, line_kws = {'color':'red'})
plt.plot([fitted.min(), fitted.max()], [0, 0], '--', color='grey')
plt.show() # 선형성을 만족하지 못한다.

print('정규성- 잔차가 정규분포를 따르는 지 확인')
import scipy.stats as stats
sr = stats.zscore(residual)
(x, y), _ = stats.probplot(sr)
sns.scatterplot(x, y)
plt.plot([-3, 3], [-3, 3], '--', color="grey")
plt.show() # 선형성을 만족하지 못한다. 
print('residual test :', stats.shapiro(residual))
# residual test : ShapiroResult(statistic=0.9176644086837769, pvalue=3.938041004403203e-09)
# pvalue=3.938041004403203e-09 < 0.05 => 정규성을 만족하지못함.

print('독립성 - 잔차가 자기상관(인접 관측치의 오차가 상관되어 있음)이 있는지 확인')
# 모델.summary() Durbin-Watson:2.084 => 잔차항이 독립성을 만족하는 지 확인. 2에 가까우면 자기상관이 없다.(서로 독립- 잔차끼리 상관관계가 없다)
# 0에 가까우면 양의 상관, 4에 가까우면 음의 상관.

print('등분산성 - 잔차의 분산이 일정한지 확인')
sns.regplot(fitted, np.sqrt(np.abs(sr)), lowess = True, line_kws = {'color':'red'})
plt.show()
# 추세선이 수평선을 그리지않으므로 등분산성을 만족하지 못한다.

print('다중공선성 - 독립변수들 간에 강한 상관관계 확인')
# VIF(Variance Inflation Factors - 분산 팽창 요인) 값이 10을 넘으면 다중공선성이 발생하는 변수라고 할 수 있다.
from statsmodels.stats.outliers_influence import variance_inflation_factor
print(variance_inflation_factor(adf_df.values, 0)) # 23.198876299003153
print(variance_inflation_factor(adf_df.values, 1)) # 12.570312383503682
print(variance_inflation_factor(adf_df.values, 2)) # 3.1534983754953845
print(variance_inflation_factor(adf_df.values, 3)) # 55.3039198336228

# DataFrame으로 보기
vif_df = pd.DataFrame()
vif_df['vid_value'] = [variance_inflation_factor(adf_df.values, i) for i in range(adf_df.shape[1])]
print(vif_df)
'''
   vid_value
0  23.198876
1  12.570312
2   3.153498
3  55.303920
'''

print('참고 : cooks distance - 극단값을 나타내는 지료 확인')
from statsmodels.stats.outliers_influence import OLSInfluence
cd, _ = OLSInfluence(lm_mul).cooks_distance
print(cd.sort_values(ascending=False).head())
'''
130    0.272956
5      0.128306
75     0.056313
35     0.051275
178    0.045921
'''

import statsmodels.api as sm
sm.graphics.influence_plot(lm_mul, criterion='cooks')
plt.show()

print(adf_df.iloc[[130, 5, 75, 35, 178]]) # 극단 값으로 작업에서 제외 권장.
'''
        tv  radio  newspaper  sales
130    0.7   39.6        8.7    1.6
5      8.7   48.9       75.0    7.2
75    16.9   43.7       89.4    8.7
35   290.7    4.1        8.5   12.8
178  276.7    2.3       23.7   11.8
'''

* linear_reg8.py

from sklearn.linear_model import LinearRegression
import statsmodels.api

mtcars = statsmodels.api.datasets.get_rdataset('mtcars').data
print(mtcars[:3])
'''
                mpg  cyl   disp   hp  drat     wt   qsec  vs  am  gear  carb
Mazda RX4      21.0    6  160.0  110  3.90  2.620  16.46   0   1     4     4
Mazda RX4 Wag  21.0    6  160.0  110  3.90  2.875  17.02   0   1     4     4
Datsun 710     22.8    4  108.0   93  3.85  2.320  18.61   1   1     4     1
'''

# hp(마력수)가 mpg(연비)에 영향을 미치지는 지, 인과관계가 있다면 연비에 미치는 영향값(추정치, 예측치)을 예측 (정량적 분석)
x = mtcars[['hp']].values
y = mtcars[['mpg']].values
print(x[:3])
'''
[[110]
 [110]
 [ 93]]
'''
print(y[:3])
'''
[[21. ]
 [21. ]
 [22.8]]
'''

import matplotlib.pyplot as plt
plt.scatter(x, y) # 산포도 출력
plt.show()

fit_model = LinearRegression().fit(x, y)   # 모델 생성
print('slope :', fit_model.coef_[0])       # 기울기 : [-0.06822828]
print('intercept :', fit_model.intercept_) # y절편 : [30.09886054]
# newY = fit_model.coef_[0] * newX + fit_model.intercept_

pred = fit_model.predict(x)
print(pred[:3])
print('예측값 :', pred[:3].flatten()) # 예측값 : [22.59374995 22.59374995 23.75363068]
print('실제값 :', y[:3].flatten())    # 실제값 : [21.  21.  22.8]
print()

# 모델 성능 파악 시 R2 또는 RMSE
from sklearn.metrics import mean_squared_error
import numpy as np

lin_mse = mean_squared_error(y, pred)   # 평균 제곱 오차
lin_rmse = np.sqrt(lin_mse)             # 루트
print("평균 제곱 오차 : ", lin_mse)          # 평균 제곱 오차 :  13.989822298268805
print("평균 제곱근 편차(RMSE) : ", lin_rmse) # 평균 제곱근 편차(RMSE) :  3.7402970868994894
print()

# 마력에 따른 연비 추정치
new_hp = [[100]]
new_pred = fit_model.predict(new_hp)
print('%s 마력인 경우 연비 추정치는 %s'%(new_hp[0][0], new_pred[0][0]))
# 100 마력인 경우 연비 추정치는 23.27603273246613

선형회귀 분석 : Linear Regression
과적합 방지를 위해 Ridgo, Lasso, ElasticNet

* linear_reg9.py

import numpy as np
import pandas as pd
from sklearn.datasets import load_iris

iris = load_iris()
print(iris)
'''
[[5.1, 3.5, 1.4, 0.2],
[4.9, 3. , 1.4, 0.2],
[4.7, 3.2, 1.3, 0.2],
[4.6, 3.1, 1.5, 0.2],
[5. , 3.6, 1.4, 0.2],
[5.4, 3.9, 1.7, 0.4],
[4.6, 3.4, 1.4, 0.3],
'''
print(iris.feature_names) # ['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']
print(iris.target)       
print(iris.target_names) # ['setosa' 'versicolor' 'virginica']
print()

iris_df = pd.DataFrame(iris.data, columns=iris.feature_names)
iris_df['target'] = iris.target
iris_df['target_names'] = iris.target_names[iris.target]
print(iris_df.head(3), ' ', iris_df.shape) # (150, 6)
'''
   sepal length (cm)  sepal width (cm)  ...  target  target_names
0                5.1               3.5  ...       0        setosa
1                4.9               3.0  ...       0        setosa
2                4.7               3.2  ...       0        setosa
'''

출처 : https://www.educative.io/edpresso/overfitting-and-underfitting

# train / test 분리 : 과적합 방지 방법 중 1
from sklearn.model_selection import train_test_split
train_set, test_set = train_test_split(iris_df, test_size = 0.3) # data를 train 0.7, test 0.3 배율로 나눔
print(train_set.head(2), ' ', train_set.shape) # (105, 6)
print(test_set.head(2), ' ', test_set.shape) # (45, 6)

# 선형회귀
# 정규화 선형회귀 방법은 선형회귀계수(weight)에 대한 제약조건을 추가함으로 해서, 모형이 과도라게 최적화(오버피팅)되는 현상을 방지할 수 있다.
from sklearn.linear_model import LinearRegression as lm
import matplotlib.pyplot as plt

print(train_set.iloc[:, [2]]) # petal.length
print(train_set.iloc[:, [3]]) # petal.width

model_ols = lm().fit(X=train_set.iloc[:, [2]], y=train_set.iloc[:, [3]])
print(model_ols.coef_[0])     # [0.41268804]
print(model_ols.intercept_)   # [-0.35472987]
pred = model_ols.predict(model_ols.predict(test_set.iloc[:, [2]]))
print('ols_pred :\n', pred[:5])
'''
 [[ 0.31044183]
 [ 0.49464775]
 [ 0.3606798 ]
 [ 0.09274392]
 [-0.25892194]]
'''

print('ols_real :\n', test_set.iloc[:, [3]][:5])
'''
      petal width (cm)
138               1.8
143               2.3
142               1.9
79                1.0
45                0.3
'''

# 회귀분석 방법 - Ridge: alpha값을 조정(가중치 제곱합을 최소화)하여 과대/과소적합을 피한다. 다중공선성 문제 처리에 효과적.
from sklearn.linear_model import Ridge
model_ridge = Ridge(alpha=10).fit(X=train_set.iloc[:, [2]], y=train_set.iloc[:, [3]])

#점수
print(model_ridge.score(X=train_set.iloc[:, [2]], y=train_set.iloc[:, [3]])) #0.91923658601
print(model_ridge.score(X=test_set.iloc[:, [2]], y=test_set.iloc[:, [3]]))   #0.935219182367
print('ridge predict : ', model_ridge.predict(test_set.iloc[:, [2]]))
plt.scatter(train_set.iloc[:, [2]], train_set.iloc[:, [3]],  color='red')
plt.plot(test_set.iloc[:, [2]], model_ridge.predict(test_set.iloc[:, [2]]))
plt.show()

print('\nLasso')
# 회귀분석 방법 - Lasso: alpha값을 조정(가중치 절대값의 합을 최소화)하여 과대/과소적합을 피한다.
from sklearn.linear_model import Lasso
model_lasso = Lasso(alpha=0.1, max_iter=1000).fit(X=train_set.iloc[:, [0,1,2]], y=train_set.iloc[:, [3]])

#점수
print(model_lasso.score(X=train_set.iloc[:, [0,1,2]], y=train_set.iloc[:, [3]])) #0.921241848687
print(model_lasso.score(X=test_set.iloc[:, [0,1,2]], y=test_set.iloc[:, [3]]))   #0.913186971647
print('사용한 특성수 : ', np.sum(model_lasso.coef_ != 0))   # 사용한 특성수 :  1
plt.scatter(train_set.iloc[:, [2]], train_set.iloc[:, [3]],  color='red')
plt.plot(test_set.iloc[:, [2]], model_ridge.predict(test_set.iloc[:, [2]]))
plt.show()

# 회귀분석 방법 4 - Elastic Net 회귀모형 : Ridge + Lasso
# 가중치 제곱합을 최소화, 거중치 절대값의 합을 최소화, 두가지를 동시에 제약조건으로 사용
from sklearn.linear_model import ElasticNet

'BACK END > Deep Learning' 카테고리의 다른 글

[딥러닝] 로지스틱 회귀 (0)	2021.03.15
[딥러닝] 다항회귀 (0)	2021.03.12
[딥러닝] 선형회귀 (0)	2021.03.10
[딥러닝] 공분산, 상관계수 (0)	2021.03.10
[딥러닝] 이항검정 (0)	2021.03.10

[딥러닝] 선형회귀

2021. 3. 10. 15:13

회귀분석 Regression

: 각각의 데이터에 대한 잔차 제곱합이 최소가 되는 추세선을 만들고, 이를 통해 독립 변수가 종속변수에 얼마나 영향을 주는지 인과관계를 분석.

: 독립변수 - 연속형, 종속변수 - 연속형.

: 두 변수는 상관관계 및 인과관계가 있어야한다. (상관계수 > 0.3)

: 정량적인 모델 생성.

- 기계 학습(지도학습) : 학습을 통해 모델 생성 후, 새로운 데이터에 대한 예측 및 분류

선형회귀 Linear Regression

최소 제곱법(Least Square Method)

Y = a + b * X

선형회귀분석의 기존 가정 충족 조건

선형성 : 독립변수(feature)의 변화에 따라 종속변수도 일정 크기로 변화해야 한다.
정규성 : 잔차항이 정규분포를 따라야 한다.
독립성 : 독립변수의 값이 서로 관련되지 않아야 한다.
등분산성 : 그룹간의 분산이 유사해야 한다. 독립변수의 모든 값에 대한 오차들의 분산은 일정해야 한다.
다중공선성 : 다중회귀 분석 시 3 개 이상의 독립변수 간에 강한 상관관계가 있어서는 안된다.

- 최소 제곱해를 선형 행렬 방정식으로 얻기

* linear_reg1.py

import numpy.linalg as lin
import numpy as np
import matplotlib.pyplot as plt

x = np.array([0, 1, 2, 3])
y = np.array([-1, 0.2, 0.9, 2.1])

plt.plot(x, y)
plt.grid(True)
plt.show()

A = np.vstack([x, np.ones(len(x))]).T
print(A)
'''
[[0. 1.]
 [1. 1.]
 [2. 1.]
 [3. 1.]]
'''

# y = mx + c
m, c = np.linalg.lstsq(A, y, rcond=None)[0]
print('기울기 :', m, ' y절편:', c) # 기울기 : 0.9999999999999997   y절편 : -0.949999999999999

plt.plot(x, y, 'o', label='Original data', markersize=10)
plt.plot(x, m*x + c, 'r', label='Fitted line')
plt.legend()
plt.show()

np.vstack(x, y) : x에 y 행 추가

np.ones(x) : x X x의 1로 채워진 행렬 생성

x.T : 행열 변경

import numpy.linalg as lin

lin.lstsq() : 최소제곱법

# yhat = 0.9999999999999997 * x -0.949999999999999
print(0.9999999999999997 * 1 -0.949999999999999) # 0.05000000000000071
print(0.9999999999999997 * 3 -0.949999999999999) # 2.0500000000000003
print(0.9999999999999997 * 123 -0.949999999999999) # 122.04999999999995

모델 생성

* linear_reg2.py

방법 1 : make_regression을 사용, model X

import statsmodels.api as sm
from sklearn.datasets import make_regression
import numpy as np

np.random.seed(12)
x, y, coef = make_regression(n_samples=50, n_features=1, bias=100, coef=True)
print('x :{}, y:{}, coef:{}'.format(x, y, coef))
'''
x :[[-1.70073563]
 [-0.67794537]
 y:[ -52.17214291   39.34130801  

  '''
# 기울기 coef:89.47430739278907
# 회귀식 y = a + bx      y = 100 + 89.47430739278907 * x
y_pred = 100 + 89.47430739278907  * -1.70073563
print('y_pred :', y_pred) # y_pred : -52.17214255248879

xx = x
yy = y

방법 2 : Linear Regression을 사용. model O

from sklearn.linear_model import LinearRegression
model = LinearRegression()
fit_model = model.fit(xx, yy) # 학습 데이터로 모형 추정 : y절편, 기울기 get.
print(fit_model.coef_)      # 기울기 89.47430739
print(fit_model.intercept_) # y절편 100.0

# 예측값 확인 함수
y_pred2 = fit_model.predict(xx[[0]])
print('y_pred2 :', y_pred2)

y_pred2_new = fit_model.predict([[66]])
print('y_pred2_new :', y_pred2_new)

방법 3 : ols 사용. model O.

import statsmodels.formula.api as smf
import pandas as pd

x1 = xx.flatten() # 차원 축소
print(x1.shape)
y1 = yy
print(y1)

data = np.array([x1, y1])
df = pd.DataFrame(data.T)
df.columns = ['x1', 'y1']
print(df.head(3))
'''
         x1          y1
0 -1.700736  -52.172143
1 -0.677945   39.341308
2  0.318665  128.512356
'''

model2 = smf.ols(formula='y1 ~ x1', data=df).fit()
print(model2.summary())

# 예측값 확인 함수
print(x1[:2]) # [-1.70073563 -0.67794537]
new_df = pd.DataFrame({'x1':[-1.70073563, -0.67794537]}) # 기존 자료로 검증
new_pred = model2.predict(new_df)
print('new_pred :\n', new_pred)
'''
new_pred :
 0   -52.172143
1    39.341308
'''

new2_df = pd.DataFrame({'x1':[123, -2.34567]}) # 새로운 값에 대한 예측 결과 확인
new2_pred = model2.predict(new2_df)
print('new2_pred :\n', new2_pred)
'''
new2_pred :
 0    11105.339809
1     -109.877199
'''

방법 4 : linregress 사용. model O

* liner_reg3.py

from scipy import stats
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

score_iq = pd.read_csv('../testdata/score_iq.csv')
print(score_iq.head(3))
'''
     sid  score   iq  academy  game  tv
0  10001     90  140        2     1   0
1  10002     75  125        1     3   3
2  10003     77  120        1     0   4
'''
print(score_iq.info())

# iq가 score에 영향을 주는 지 검정
# iq로 score(시험점수) 값 예측 - 정량적 분석

x = score_iq.iq
y = score_iq.score

# 상관계수
print(np.corrcoef(x, y)) # numpy  0.88222034
print(score_iq.corr())   # pandas 0.882220

# 두 변수는 인과 관계가 있다고 보고, 선형회귀 분석 진행.
model = stats.linregress(x, y)
print(model)
print('p-value :', model.pvalue)  # p-value : 2.8476895206683644e-50
print('기울기 :', model.slope)      # 기울기 : 0.6514309527270075
print('y절편 :', model.intercept)  # y절편 : -2.8564471221974657
# pvalue=2.8476895206683644e-50 < 0.05 이므로 현재 모델은 유의하다.
# iq가 score에 영향을 준다.
# y = 0.6514309527270075 * x -2.8564471221974657 
print('예측결과 :', 0.6514309527270075 * 140 -2.8564471221974657)
# 예측결과 : 88.34388625958358
print('예측결과 :', 0.6514309527270075 * 125 -2.8564471221974657)
# 예측결과 : 78.57242196867847
print('예측결과 :', 0.6514309527270075 * 80 -2.8564471221974657)
# 예측결과 : 49.25802909596313
print('예측결과 :', 0.6514309527270075 * 155 -2.8564471221974657)
# 예측결과 : 98.11535055048869
print('예측결과 :', model.slope * 155 + model.intercept)
# 예측결과 : 98.11535055048869

# linregress는 predict()가 지원되지않음. numpy의 polyval 이용.
#print('예측결과 :', np.polyval([model.slope, model.intercept], np.array(score_iq['iq'])))
new_df = pd.DataFrame({'iq':[55, 66, 77, 88, 155]})
print('예측결과 :\n', np.polyval([model.slope, model.intercept], new_df))
'''
예측결과 :
 [[32.97225528]
 [40.13799576]
 [47.30373624]
 [54.46947672]
 [98.11535055]]
'''

np.polyval([기울기, y절편], data) : numpy predict함수

'BACK END > Deep Learning' 카테고리의 다른 글

[딥러닝] 다항회귀 (0)	2021.03.12
[딥러닝] 단순선형 회귀, 다중선형 회귀 (0)	2021.03.11
[딥러닝] 공분산, 상관계수 (0)	2021.03.10
[딥러닝] 이항검정 (0)	2021.03.10
[딥러닝] ANOVA (0)	2021.03.08

[딥러닝] 공분산, 상관계수

2021. 3. 10. 11:33

공분산

: 공분산 : 두 개 이상의 확률변수에 대한 관계를 알려주는 값.
: 값의 범위가 정해져 있지않아 어떤 값을 기준으로 정하기가 모호하다.

* relation1.py

import numpy as np
print(np.cov(np.arange(1, 6), np.arange(2, 7)))          # 공분산 2.5
'''
[[2.5 2.5]
 [2.5 2.5]]
'''
print(np.cov(np.arange(10, 60), np.arange(20, 70)))      # 212.5
print(np.cov(np.arange(1, 6), np.arange(6, 1, -1)))      # -2.5
print(np.cov(np.arange(1, 6), (3, 3, 3, 3, 3)))          # 0

np.cor(x, y) : 공분산

피어슨 상관계수 공식

: -1에서 1사이 값을 가진다.

: 절대값이 1에 가까울 수록 두 데이터가 관련이 높다.

: 양의 값일 경우 독립 변수가 증가할 수록 종속변수도 증가하는 데이터.

: 음의 값일 경우 독립 변수가 증가할 수록 종속변수도 감소하는 데이터.

: 선형 데이터만 사용 가능

r (범위)		관계
-1.0	-0.7	강한 음적 선형관계
-0.7	-0.3	뚜렷한 음적 선형관계
-0.3	-0.1	약한 음적 선형관계
-0.1	0.1	거의 무시될 수 있는 선형관계
0.1	0.3	약한 양적 선형관계
0.3	0.7	뚜렷한 양적 선형관계
0.7	1.0	강한 양적 선형관계

print(np.corrcoef(np.arange(1, 6), np.arange(2, 7)))     # 1
'''
[[1. 1.]
 [1. 1.]]
 '''
print(np.corrcoef(np.arange(10, 60), np.arange(20, 70))) # 1
print()

np.corrcoef(x, y) : 상관계수

x = [8,3,6,6,9,4,3,9,3,4]
x = [800,300,600,600,900,400,300,900,300,400]
print('x 평균 :', np.mean(x)) # x 평균 : 5.5
print('x 분산 :', np.var(x))  # x 분산 : 5.45

y = [6,2,4,6,9,5,1,8,4,5]
y = [600,200,400,600,900,500,100,800,400,500]
print('y 평균 :', np.mean(y)) # y 평균 : 5.0
print('y 분산 :', np.var(y))  # y 분산 : 5.4
print()

# 두 변수 간의 관계확인
print('x, y 공분산 :', np.cov(x,y)[0, 1]) #  두 변수 간에 데이터 크기에 따라 동적
# x, y 공분산 : 5.222222222222222
print('x, y 상관계수 :', np.corrcoef(x, y)[0, 1]) # 두 변수 간에 데이터 크기에 따라 정적 
# x, y 상관계수 : 0.8663686463212853

import matplotlib.pyplot as plt
plt.plot(x, y, 'o')
plt.show()

m = [-3, -2, -1, 0, 1, 2 , 3]
n = [9, 4, 1, 0, 1, 4, 9]

plt.plot(m, n, '+')
plt.show()
print('m, n 상관계수 :', np.corrcoef(m, n)[0, 1])
# m, n 상관계수 : 0.0
# 선형인 데이터만 사용 가능.

상관분석

: 두 변수 간에 상관관계의 강도를 분석
: 이론적 타당성(독립성) 확인. 독립변수 대상 변수들은 서로 간에 독립적이어야 함.
: 독립변수 대상 변수들은 다중 공선성이 발생할 수 있는데, 이를 확인
: 밀도를 수치로 표현. 관계의 친밀함을 수치로 표현.

- 어떤 상품에 대한 친밀도, 적절성, 만족도에 대한 상관관계 확인.

* relation2.py

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
plt.rc('font', family='malgun gothic')


df = pd.read_csv('../testdata/drinking_water.csv')
print(df.head(3), '\n', df.describe())
'''
   친밀도  적절성  만족도
0    3    4    3
1    3    3    2
2    4    4    4
'''
print()
print(np.std(df.친밀도))  # 0.968505126935272
print(np.std(df.적절성))  # 0.8580277077642035
print(np.std(df.만족도))  # 0.8271724742228969

print('* 공분산')
print(np.cov(df.친밀도, df.적절성))
'''[[0.94156873 0.41642182]
 [0.41642182 0.73901083]]
'''
print(np.cov(df.친밀도, df.만족도))
print(np.cov(df.적절성, df.만족도))
print()
print(df.cov()) # pandas
'''
          친밀도       적절성       만족도
친밀도  0.941569  0.416422  0.375663
적절성  0.416422  0.739011  0.546333
만족도  0.375663  0.546333  0.686816
'''

print('* 상관계수')
print(np.corrcoef(df.친밀도, df.적절성))
'''
[[1.         0.49920861]
 [0.49920861 1.        ]]
'''
print(np.corrcoef(df.친밀도, df.만족도))
print(np.corrcoef(df.적절성, df.만족도))

print(df.corr()) # pandas
'''
          친밀도       적절성       만족도
친밀도  1.000000  0.499209  0.467145
적절성  0.499209  1.000000  0.766853
만족도  0.467145  0.766853  1.000000
'''
co_re = df.corr() # default : pearson
print(co_re['만족도'].sort_values(ascending=False))
'''
만족도    1.000000
적절성    0.766853
친밀도    0.467145
'''
print(df.corr())
'''
          친밀도       적절성       만족도
친밀도  1.000000  0.499209  0.467145
적절성  0.499209  1.000000  0.766853
만족도  0.467145  0.766853  1.000000
'''
print(df.corr(method='pearson'))  # 변수가 등간/비율 척도. 정규성을 따를 경우 사용.
print(df.corr(method='spearman')) # 변수가 서열척도. 정규성을 따르지 않을 경우 사용.
print(df.corr(method='kendall'))

# 시각화
df.plot(kind="scatter", x='만족도', y='적절성')
plt.show()

from pandas.plotting import scatter_matrix
attr = ['친밀도', '적절성', '만족도']
scatter_matrix(df[attr], figsize=(10, 6)) # 히스토그램
plt.show()

# heatmap : 밀도를 색으로 표현
import seaborn as sns
sns.heatmap(df.corr())
plt.show()

# hitmap에 텍스트 표시 추가사항 적용해 보기
corr = df.corr()
# Generate a mask for the upper triangle
mask = np.zeros_like(corr, dtype=np.bool)  # 상관계수값 표시
mask[np.triu_indices_from(mask)] = True
# Draw the heatmap with the mask and correct aspect ratio
vmax = np.abs(corr.values[~mask]).max()
fig, ax = plt.subplots()     # Set up the matplotlib figure

sns.heatmap(corr, mask=mask, vmin=-vmax, vmax=vmax, square=True, linecolor="lightgray", linewidths=1, ax=ax)

for i in range(len(corr)):
    ax.text(i + 0.5, len(corr) - (i + 0.5), corr.columns[i], ha="center", va="center", rotation=45)
    for j in range(i + 1, len(corr)):
        s = "{:.3f}".format(corr.values[i, j])
        ax.text(j + 0.5, len(corr) - (i + 0.5), s, ha="center", va="center")
ax.axis("off")
plt.show()

공공 데이터(외국인 관광객의 국내 관광지 입장자료로 상관관계 분석)

import json
import matplotlib.pyplot as plt
plt.rc('font', family='malgun gothic')
import pandas as pd


def Start():
    # 서울 관광지 정보
    fname = '서울특별시_관광지.json'
    jsonTP = json.loads(open(fname, 'r', encoding='utf-8').read()) # str => json
    #print(jsonTP)
    
    tour_table = pd.DataFrame(jsonTP, columns=('yyyymm', 'resNm', 'ForNum')) # 년월, 관광지명, 입장객수
    tour_table = tour_table.set_index('yyyymm')
    print(tour_table)
    '''
                    resNm  ForNum
    yyyymm                   
    201101        창덕궁   14137
    201101        운현궁       0
    '''
    resNm = tour_table.resNm.unique()
    print('resNum :', resNm[:5]) # 5개 샘플. resNum : ['창덕궁' '운현궁' '경복궁' '창경궁' '종묘']
    
    
    # 중국인 관광정보
    cdf = '중국인방문객.json'
    jdata = json.loads(open(cdf, 'r', encoding='utf-8').read())
    #print(jdata)
    
    china_table = pd.DataFrame(jdata, columns=('yyyymm', 'visit_cnt')) # 년월, 방문객수
    china_table = china_table.rename(columns={'visit_cnt':'china'})
    china_table = china_table.set_index('yyyymm')
    print(china_table)
    '''
             china
    yyyymm        
    201101   91252
    201102  140571
    201103  141457
    '''
    
    # 일본인 관광정보
    jdf = '일본인방문객.json'
    jdata = json.loads(open(jdf, 'r', encoding='utf-8').read())
    #print(jdata)
    
    japan_table = pd.DataFrame(jdata, columns=('yyyymm', 'visit_cnt')) # 년월, 방문객수
    japan_table = japan_table.rename(columns={'visit_cnt':'japan'})
    japan_table = japan_table.set_index('yyyymm')
    print(japan_table)
    '''
             japan
    yyyymm        
    201101  209184
    201102  230362
    '''
    
    # 미국인 관광정보
    udf = '미국인방문객.json'
    jdata = json.loads(open(udf, 'r', encoding='utf-8').read())
    #print(jdata)
    
    usa_table = pd.DataFrame(jdata, columns=('yyyymm', 'visit_cnt')) # 년월, 방문객수
    usa_table = usa_table.rename(columns={'visit_cnt':'usa'})
    usa_table = usa_table.set_index('yyyymm')
    print(usa_table)
    '''
              usa
    yyyymm       
    201101  43065
    201102  41077
    '''
    
    all_table = pd.merge(china_table, japan_table, left_index=True, right_index=True)
    all_table = pd.merge(all_table, usa_table, left_index=True, right_index=True)
    print(all_table)
    '''
                 china   japan    usa
    yyyymm                       
    201101   91252  209184  43065
    201102  140571  230362  41077
    '''
    r_list = []
    for tourPoint in resNm[:5]:
        r_list.append(SetScatterGraph(tour_table, all_table, tourPoint))
        #print(r_list)
        
    r_df = pd.DataFrame(r_list, columns=['관광지명', '중국','일본','미국'])
    r_df = r_df.set_index('관광지명')
    print(r_df)
    '''
                중국        일본        미국
    관광지명                              
    창덕궁  -0.058791  0.277444  0.402816
    운현궁   0.445945  0.302615  0.281258
    경복궁   0.525673 -0.435228  0.425137
    창경궁   0.451233 -0.164586  0.624540
    종묘   -0.583422  0.529870 -0.121127
    '''
    r_df.plot(kind='bar', rot=60)
    plt.show()

def SetScatterGraph(tour_table, all_table, tourPoint):
    tour = tour_table[tour_table['resNm'] == tourPoint]
    #print(tour)
    merge_table = pd.merge(tour, all_table, left_index=True, right_index=True)
    print(merge_table) # 광광지 자료중 앞에 5개만 참여
    '''
               resNm  ForNum   china   japan    usa
    yyyymm                                     
    201101   창덕궁   14137   91252  209184  43065
    201102   창덕궁   18114  140571  230362  41077
    '''
    # 시각화 + 상관관계
    fig = plt.figure()
    fig.suptitle(tourPoint + ' 상관관계 분석')
    
    plt.subplot(1, 3, 1)
    plt.xlabel('중국인 수')
    plt.ylabel('외국인 입장수')
    lamb1 = lambda p:merge_table['china'].corr(merge_table['ForNum'])
    r1 = lamb1(merge_table)
    print('r1 :', r1)
    plt.title('r={:.3f}'.format(r1))
    plt.scatter(merge_table['china'], merge_table['ForNum'], s=6, c='black')
    
    plt.subplot(1, 3, 2)
    plt.xlabel('일본인 수')
    plt.ylabel('외국인 입장수')
    lamb2 = lambda p:merge_table['japan'].corr(merge_table['ForNum'])
    r2 = lamb2(merge_table)
    print('r2 :', r2)
    plt.title('r={:.3f}'.format(r2))
    plt.scatter(merge_table['japan'], merge_table['ForNum'], s=6, c='red')
    
    plt.subplot(1, 3, 3)
    plt.xlabel('미국인 수')
    plt.ylabel('외국인 입장수')
    lamb3 = lambda p:merge_table['usa'].corr(merge_table['ForNum'])
    r3 = lamb3(merge_table)
    print('r3 :', r3)
    plt.title('r={:.3f}'.format(r3))
    plt.scatter(merge_table['usa'], merge_table['ForNum'], s=6, c='blue')
    
    plt.show()
    return [tourPoint, r1, r2, r3]
    '''
        r1 : -0.05879110406006314
    r2 : 0.2774443570141011
    r3 : 0.4028160633050156
    r1 : 0.44594488384450376
    r2 : 0.30261521828798604
    r3 : 0.2812576500158649
    r1 : 0.5256734293511215
    r2 : -0.43522818613412334
    r3 : 0.4251372638704492
    r1 : 0.4512325398089607
    r2 : -0.16458589402253013
    r3 : 0.6245403780269381
    r1 : -0.5834218986767473
    r2 : 0.5298702802205213
    r3 : -0.1211266682929496
    '''
if __name__ == '__main__':
    Start()

'BACK END > Deep Learning' 카테고리의 다른 글

[딥러닝] 단순선형 회귀, 다중선형 회귀 (0)	2021.03.11
[딥러닝] 선형회귀 (0)	2021.03.10
[딥러닝] 이항검정 (0)	2021.03.10
[딥러닝] ANOVA (0)	2021.03.08
[딥러닝] T 검정 (0)	2021.03.05

[딥러닝] 이항검정

2021. 3. 10. 10:17

이항검정

: 결과가 두 가지 값을 가지는 확률변수의 분포를 판단하는데 효과적
: 이산변량을 대상으로 한다.

형식 : stats.binom_test() : 명목척도의 비율을 바탕으로 이항분포 검정

귀무가설 : 직원을 대상으로 고객 대응 교육 후 고객 안내 서비스 만족율이 80%다.
대립가설 : 직원을 대상으로 고객 대응 교육 후 고객 안내 서비스 만족율이 80%가 아니다.

* binom.py

import pandas as pd
import scipy.stats as stats
from pandas.core.reshape.pivot import crosstab

data = pd.read_csv("../testdata/one_sample.csv")
print(data.head(3))
print()
'''
   no    gender  survey time
0   1         2       1  5.1
1   2         2       0  5.2
2   3         2       1  4.7
'''

ctab = crosstab(index=data['survey'], columns = 'count')
ctab.index = ['불만족', '만족']
print(ctab)
'''
col_0  count
불만족       14
만족       136
'''

print('양측 검정(기존 80% 만족율 기준 검증을 실시) : 방향성이 없다.')
x = stats.binom_test([136, 14], p = 0.8, alternative="two-sided")
print(x)

stats.binom_test(x, p = , alternative="two-sided") : 양측 검정

p-value : 0.00067 < 0.05 이므로 귀무가설 기각.
대립가설 : 직원을 대상으로 고객 대응 교육 후 고객 안내 서비스 만족율이 80%가 아니다.
양측검정에서는 크다, 작다로 방향성으로 제시 하지않는다.

print('\n양측 검정(기존 80% 불만족율 기준 검증을 실시) : 방향성이 없다.')
x = stats.binom_test([14, 136], p = 0.2, alternative="two-sided")
print(x)

stats.binom_test(x, p = , alternative="two-sided") : 양측 검정(조건 반전)

p-value : 0.00067 < 0.05 이므로 귀무가설 기각.

print('\n단측 검정(기존 80% 만족율 기준 검증을 실시) : 방향성이 있다.')
x = stats.binom_test([136, 14], p = 0.8, alternative="greater")
print(x)# p-value : 0.000317 < 0.05 이므로 귀무가설 기각.

stats.binom_test(x, p = , alternative="greater") : 단측검정

p-value : 0.000317 < 0.05 이므로 귀무가설 기각.

print('\n단측 검정(기존 80% 불만족율 기준 검증을 실시) : 방향성이 있다.')
x = stats.binom_test([14, 136], p = 0.2, alternative="less")
print(x)# p-value : 0.000317 < 0.05 이므로 귀무가설 기각.

stats.binom_test(x, p = , alternative="less") : 단측검정(조건 반전)

p-value : 0.000317 < 0.05 이므로 귀무가설 기각.

비율 검정

: 집단의 비율이 어떤 특정한 값과 같은지를 검정

* one-sample
a회사에는 100명 중 45명이 흡연을 한다. 국가통계에서는 국민 흡연율은 35%라고 한다. 비율의 동일여부를 검정하라.

귀무가설 : a회사의 흡연율과 국민 흡연율의 비율은 같다.
대립가설 : a회사의 흡연율과 국민 흡연율의 비율은 다르다.

import numpy as np
from statsmodels.stats.proportion import proportions_ztest

count = np.array([45])
nobs = np.array([100])
val = 0.35

z, p = proportions_ztest(count=count, nobs=nobs, value=val)
print('z : {}, p : {}'.format(z, p)) # z : [2.01007563], p : [0.04442318]

proportions_ztest(count=count, nobs=nobs, value=val) : 비율 검정.

p-value : 0.04442318 < 0.05 이므로 귀무가설 기각.
대립가설 : a회사의 흡연율과 국민 흡연율의 비율은 다르다.

* two-sample
a회사 직원 300명중 100명이 햄버거를 취식, b회사 직원 400명중 170명이 햄버거를 취식시 두 집단의 햄버거 취식비율의 차이 검정.

귀무 가설 : 차이가 없다.
대립 가설 : 차이가 있다.

count = np.array([100, 170])
nobs = np.array([300, 400])

z, p = proportions_ztest(count=count, nobs=nobs, value=0)
print('z : {}, p : {}'.format(z, p)) # z : -2.4656701201792273, p : 0.013675721698622408

proportions_ztest(count=count, nobs=nobs, value=0) : 비율 검정.

p-value : 0.013675 < 0.05 이므로 귀무가설 기각.
대립 가설 : 차이가 있다.

'BACK END > Deep Learning' 카테고리의 다른 글

[딥러닝] 선형회귀 (0)	2021.03.10
[딥러닝] 공분산, 상관계수 (0)	2021.03.10
[딥러닝] ANOVA (0)	2021.03.08
[딥러닝] T 검정 (0)	2021.03.05
[딥러닝] 카이제곱 (0)	2021.03.04

[딥러닝] ANOVA

2021. 3. 8. 16:17

ANOVA(analysis of variance)

: 독립변수 범주형(3개 이상), 종속변수 연속형

독립변수 x	종속변수 y	분석 방법
범주형	범주형	카이제곱 검정(교차분석)	일원 카이제곱 : 변인 1 개 import scipy.stats as stats stats.chisquare()
범주형	범주형	카이제곱 검정(교차분석)	일원 카이제곱 : 변인 2 개 이상 stats.chi2_contingency()
범주형	연속형	T 검정 : 범주형 값 2개 이하	단일 표본 검정 (one sample t-test) : 집단 1개 stats.ttest_1samp(데이터, popmean=모집단 평균)
			독립 표본 검정(independent samples t test) : 두 집단 정규분포/ 분산 동일 stats.ttest_ind(데이터,..., , equal_var=False)
			대응 표본 검정(paired samples t test) : 동일한 관찰 대상의 처리 전과 처리 후 비교 stats.ttest_rel(데이터, .. )
		ANOVA :범주형 값 3개 이상	일원 분산분석(one-way anova) : 1개의 요인에 집단이 3개 import statsmodels.api as sm from statsmodels.formula.api import ols model = ols('종속변수 ~ 독립변수', data).fit() sm.stats.anova_lm(model, type=2) model = ols('종속변수 ~ 독립변수1 + 독립변수2', data).fit() stats.f_oneway(gr1, gr2, gr3)
연속형	범주형	로지스틱 회귀 분석
연속형	연속형	회귀분석, 구조 방정식

일원 분산분석(one-way anova)

: 1개의 요인에 집단이 3개

실습 1 : 세 가지 교육방법을 적용하여 1개월 동안 교육받은 교육생 80 명을 대상으로 실기시험을 실시 .

귀무가설 : 교육생을 대상으로 3가지 교육방법에 따른 실기시험 평균의 차이가 없다.
대립가설 : 교육생을 대상으로 3가지 교육방법에 따른 실기시험 평균의 차이가 있다.

* anova1.py

import pandas as pd
import scipy.stats as stats
import matplotlib.pyplot as plt

data = pd.read_csv('../testdata/three_sample.csv')
print(data.head(3), data.shape)
'''
   no  method  survey  score
0   1       1       1     72
1   2       3       1     87
2   3       2       1     78 (80, 4)
'''
print(data.describe())

- 이상치 제거

data = data.query('score <= 100')
plt.boxplot(data.score)
#plt.show()

# 독립성 : 상관관계를 확인가능

# 등분산성
result = data[['method', 'score']]
m1 = result[result['method'] == 1]
m2 = result[result['method'] == 2]
m3 = result[result['method'] == 3]
#print(m1)
score1 = m1['score']
score2 = m2['score']
score3 = m3['score']
print('등분산성 :', stats.levene(score1, score2, score3).pvalue)    # 등분산성 : 0.11322 > 0.05 이므로 만족
print('등분산성 :', stats.fligner(score1, score2, score3).pvalue)   # 등분산성 : 0.10847
print('등분산성 :', stats.bartlett(score1, score2, score3).pvalue)  # 등분산성 : 0.15251

- 정규성

print(stats.shapiro(score1))
print('정규성 확인 :', stats.ks_2samp(score1, score2).pvalue) # pvalue=0.3096 > 0.05 이므로 만족
print('정규성 확인 :', stats.ks_2samp(score1, score3).pvalue) # pvalue=0.7162 > 0.05 이므로 만족
print('정규성 확인 :', stats.ks_2samp(score2, score3).pvalue) # pvalue=0.7724 > 0.05 이므로 만족

stats.ks_2samp(data1, data2).pvalue : 정규성 확인

print('교육방법별 건수')
data2 = pd.crosstab(index = data['method'], columns = 'count')
print(data2)
'''
교육방법별 건수
col_0   count
method       
1          26
2          28
3          24
'''

print('교육방법별 만족 여부 건수')
data3 = pd.crosstab(data['method'], data['survey'])
data3.index = ['방법1', '방법2', '방법3']
data3.columns = ['만족', '불만족']
print(data3)
'''
교육방법별 만족 여부 건수
     만족  불만족
방법1   9   17
방법2  10   18
방법3   8   16
'''

- ANOVA

import statsmodels.api as sm
from statsmodels.formula.api import ols
model = ols('score ~ method', data).fit()
table = sm.stats.anova_lm(model, type=2)
print(table)
'''            df        sum_sq     mean_sq         F    PR(>F)
method     1.0     27.980888   27.980888  0.122228  0.727597
Residual  76.0  17398.134497  228.922822       NaN       NaN
'''
print(model.summary())

import statsmodels.api as sm

from statsmodels.formula.api import ols

model = ols('종속변수 ~ 독립변수', data).fit() : model
sm.stats.anova_lm(model, type=2) : anova

p-value : 0.727597 > 0.05 이므로 귀무가설 채택.
# 귀무가설 : 교육생을 대상으로 3가지 교육방법에 따른 실기시험 평균의 차이가 없다.

# ANOVA 다중회귀 : 독립변수 2
model2 = ols('score ~ method + survey', data).fit()
table2 = sm.stats.anova_lm(model2, type=2)
print(table2)
'''
            df        sum_sq     mean_sq         F    PR(>F)
method     1.0     27.980888   27.980888  0.120810  0.729131
survey     1.0     27.324458   27.324458  0.117976  0.732201
Residual  75.0  17370.810039  231.610801       NaN       NaN
'''
# mean_sq = sum_sq / df

import numpy as np
print()
print(np.mean(score1)) # 67.38461538461539
print(np.mean(score2)) # 68.35714285714286
print(np.mean(score3)) # 68.875
print()

model = ols('종속변수 ~ 독립변수1 + 독립변수2', data).fit() : 다중회귀

사후 검정(Post Hoc Test)

: 가능한 모든 쌍을 비교하며 예상되는 표준 오류보다 큰 두 가지 방법의 차이를 정확하게 식별하는 데 사용 할 수 있다.
: 그룹 간에 평균값 차이가 의미가 있는 지 확인.

from statsmodels.stats.multicomp import pairwise_tukeyhsd
tResult = pairwise_tukeyhsd(data, data.method)
print(tResult)
'''
 Multiple Comparison of Means - Tukey HSD, FWER=0.05 
=====================================================
group1 group2 meandiff p-adj   lower    upper  reject
-----------------------------------------------------
     1      2   3.8352 0.7997 -11.3827  19.053  False
     1      3  -0.0577    0.9 -15.8743 15.7589  False
     2      3  -3.8929 0.8018  -19.436 11.6503  False
-----------------------------------------------------
'''
tResult.plot_simultaneous()
plt.show()

from statsmodels.stats.multicomp import pairwise_tukeyhsd

result = pairwise_tukeyhsd(data, data.독립변수) : 사후검정.

result.plot_simultaneous() : 시각화.

일원분산으로 집단 간의 평균 차이 검증

강남구 소재 GS 편의점 3개 지역 , 알바생의 급여에 대한 평균의 차이를 검정

귀무가설 : 3개 지역 급여에 대한 평균에 차이가 없다.
대립가설 : 3개 지역 급여에 대한 평균에 차이가 있다.

* anova2.py

import scipy.stats as stats
import pandas as pd
import  numpy as np
import urllib.request
import matplotlib.pyplot as plt
from statsmodels.formula.api import ols
from statsmodels.stats.anova import anova_lm

url = "https://raw.githubusercontent.com/pykwon/python/master/testdata_utf8/group3.txt"
data = np.genfromtxt(urllib.request.urlopen(url), delimiter=',')
print(data)
'''
[[243.   1.]
 [251.   1.]
 [275.   1.]
 [291.   1.]
 ...
 
'''

np.genfromtxt(urllib.request.urlopen(url), delimiter=',') : url 읽기

# 세 개 집단
gr1 = data[data[:, 1] == 1, 0]
gr2 = data[data[:, 1] == 2, 0]
gr3 = data[data[:, 1] == 3, 0]
print(gr1, np.average(gr1))
# [243. 251. 275. 291. 347. 354. 380. 392.] 316.625
print(gr2, np.average(gr2))
# [206. 210. 226. 249. 255. 273. 285. 295. 309.] 256.44444444444446
print(gr3, np.average(gr3))
# [241. 258. 270. 293. 328.] 278.0

# 정규성
print(stats.shapiro(gr1)) # pvalue=0.3336 > 0.05 정규성 만족
print(stats.shapiro(gr2)) # pvalue=0.6561 > 0.05 정규성 만족
print(stats.shapiro(gr3)) # pvalue=0.8324 > 0.05 정규성 만족

# 등분산성
print(stats.bartlett(gr1, gr2, gr3)) # pvalue=0.3508 > 0.05 등분산성 만족

# 시각화
plot_data = [gr1, gr2, gr3]
plt.boxplot(plot_data)
#plt.show()

# 방법 1
df = pd.DataFrame(data, columns=['value', 'group'])
print(df)
'''
    value  group
0   243.0    1.0
1   251.0    1.0
2   275.0    1.0
3   291.0    1.0
4   347.0    1.0
5   354.0    1.0
'''
model = ols('value ~ C(group)', df).fit() # C(변수명 + ..) : 범주형임을 명시적으로 표시
print(anova_lm(model))
print()

# 방법 2
f_statistic, p_val = stats.f_oneway(gr1, gr2, gr3)
print('f_statistic : {}, p_val : {}'.format(f_statistic, p_val))
# f_statistic : 3.7113359882669763, p_val : 0.043589334959178244

model = ols('종속변수 ~ C(독립변수, ... )', df).fit() : C(변수명 + ..) : 범주형임을 명시적으로 표시

f_statistic, p_val = stats.f_oneway(gr1, gr2, gr3) : 일원 분산 검정

일원 분산분석

어느 음식점의 매출자료와 날씨 자료를 이용하여 온도에 따른 매출의 평균의 차이에 대한 검정.

온도를 3 그룹으로 분리.

귀무가설 : 온도에 따른 매출액 평균에 차이가 없다.
대립가설 : 온도에 따른 매출액 평균에 차이가 있다.

* anova3.py

import numpy as np
import scipy.stats as stats
import pandas as pd
import matplotlib.pyplot as plt

# 매출자료
sales_data = pd.read_csv('https://raw.githubusercontent.com/pykwon/python/master/testdata_utf8/tsales.csv', dtype={'YMD':'object'})
print(sales_data.head(3)) # 328행
'''
        YMD    AMT  CNT
0  20190514      0    1
1  20190519  18000    1
2  20190521  50000    4
'''
print(sales_data.info())

# 날씨 자료
wt_data = pd.read_csv('https://raw.githubusercontent.com/pykwon/python/master/testdata_utf8/tweather.csv')
print(wt_data.head(3)) # 702행
'''
   stnId          tm  avgTa  minTa  maxTa  sumRn  maxWs  avgWs  ddMes
0    108  2018-06-01   23.8   17.5   30.2    0.0    4.3    1.9    0.0
1    108  2018-06-02   23.4   17.6   30.1    0.0    4.5    2.0    0.0
2    108  2018-06-03   24.0   16.9   30.8    0.0    4.2    1.6    0.0
'''
print(wt_data.info())

# 날짜를 기준으로 join
wt_data.tm = wt_data.tm.map(lambda x : x.replace('-','')) # wt_data.tm에서 '-' 제거 
#print(wt_data.head(3))
frame = sales_data.merge(wt_data, how='left', left_on='YMD', right_on='tm') # join
print(frame.head(3), frame.shape) # (328, 12)
'''
        YMD    AMT  CNT  stnId        tm  ...  maxTa  sumRn  maxWs  avgWs  ddMes
0  20190514      0    1    108  20190514  ...   26.9    0.0    4.1    1.6    0.0
1  20190519  18000    1    108  20190519  ...   21.6   22.0    2.7    1.2    0.0
2  20190521  50000    4    108  20190521  ...   23.8    0.0    5.9    2.9    0.0
'''
print(frame.columns)

# 분석에 참여할 칼럼만 추출
data = frame.iloc[:, [0,1,7,8]]
print(data.head(3))
'''
        YMD    AMT  maxTa  sumRn
0  20190514      0   26.9    0.0
1  20190519  18000   21.6   22.0
2  20190521  50000   23.8    0.0
'''

- 일별 최고온도를 구간설정을 통해 연속형 변수를 명목형(범주형) 변수로 변경

print(data.maxTa.describe())
plt.boxplot(data.maxTa)
plt.show()

# 온도 추움, 보통, 더움(0, 1, 2)
data['Ta_gubun'] = pd.cut(data.maxTa, bins = [-5, 8, 24, 37], labels = [0, 1, 2])
print(data.head(5))
#print(data.isnull().sum())
#data = data[data.Ta_gubun.notna()] # na가 있다면 제거

print(data['Ta_gubun'].unique())

# 상관관계
print(data.corr())
'''
            AMT     maxTa     sumRn
AMT    1.000000 -0.660066 -0.080907
maxTa -0.660066  1.000000  0.119268
sumRn -0.080907  0.119268  1.000000
'''

# 3그룹으로 데이터를 나눈 후 등분산성, 정규성 검정.
x1 = np.array(data[data.Ta_gubun == 0].AMT)
x2 = np.array(data[data.Ta_gubun == 1].AMT)
x3 = np.array(data[data.Ta_gubun == 2].AMT)
print(x1)
print(x2)
print(x3)

print(stats.levene(x1, x2, x3)) # pvalue=0.0390 < 0.05 등분산성 만족 X
print(stats.ks_2samp(x1, x2).pvalue) # 9.28938415079017e-09   < 0.05 정규성 만족 X
print(stats.ks_2samp(x1, x3).pvalue) # 1.198570472122961e-28  < 0.05 정규성 만족 X
print(stats.ks_2samp(x2, x3).pvalue) # 1.4133139103478243e-13 < 0.05 정규성 만족 X

# 온도별 매출액 평균
spp = data.loc[:, ['AMT', 'Ta_gubun']]
print(spp.groupby('Ta_gubun').mean())
print(pd.pivot_table(spp, index = ['Ta_gubun'], aggfunc = 'mean'))
'''
                   AMT
Ta_gubun              
0         1.032362e+06
1         8.181069e+05
2         5.537109e+05
'''

# ANOVA 진행
sp = np.array(spp)
group1 = sp[sp[:, 1] == 0, 0]
group2 = sp[sp[:, 1] == 1, 0]
group3 = sp[sp[:, 1] == 2, 0]
print(group1)
print(group2)
print(group3)

print(stats.f_oneway(group1, group2, group3))

pvalue=2.360737101089604e-34 < 0.05 이므로 귀무가설 기각.
# 대립가설 : 온도에 따른 매출액 평균에 차이가 있다.

등분산성 만족 하지않을 경우 Welch's ANOVA를 사용

anaconda prompt 접속

pip install pingouin

from pingouin import welch_anova
df = data
print(welch_anova(data = df, dv = 'AMT', between='Ta_gubun')) # p-unc = 7.907874e-35 < 0.05
'''
     Source  ddof1     ddof2           F         p-unc       np2
0  Ta_gubun      2  189.6514  122.221242  7.907874e-35  0.379038
'''

정규성을 만족하지 못한 경우 kruskal-wallis test 사용

print(stats.kruskal(group1, group2, group3))
#KruskalResult(statistic=132.7022591443371, pvalue=1.5278142583114522e-29)

pvalue < 0.05 이므로 귀무가설 기각.
#결론 : 온도에 따른 매출액의 차이가 있다.

# 사후 검정
from statsmodels.stats.multicomp import pairwise_tukeyhsd
posthoc = pairwise_tukeyhsd(spp['AMT'], spp['Ta_gubun'])
print(posthoc)
'''
       Multiple Comparison of Means - Tukey HSD, FWER=0.05       
=================================================================
group1 group2   meandiff   p-adj    lower        upper     reject
-----------------------------------------------------------------
     0      1 -214255.4486 0.001 -296759.7083  -131751.189   True
     0      2 -478651.3813 0.001 -561488.5315 -395814.2311   True
     1      2 -264395.9327 0.001 -333329.5099 -195462.3555   True
-----------------------------------------------------------------
'''
posthoc.plot_simultaneous()
plt.show()

이원 분산분석

: 요인 2개

귀무가설 : 태아와 관측자수는 태아의 머리둘레의 평균과 관련이 없다.
대립가설 : 태아와 관측자수는 태아의 머리둘레의 평균과 관련이 있다.

* anova.py

import scipy.stats as stats
import pandas as pd
import  numpy as np
import urllib.request
import matplotlib.pyplot as plt
plt.rc('font', family="malgun gothic")
from statsmodels.formula.api import ols
from statsmodels.stats.anova import anova_lm

url = "https://raw.githubusercontent.com/pykwon/python/master/testdata_utf8/group3_2.txt"
data = pd.read_csv(urllib.request.urlopen(url), delimiter=',')
print(data)
'''
    머리둘레  태아수  관측자수
0   14.3    1     1
1   14.0    1     1
2   14.8    1     1
3   13.6    1     2
4   13.6    1     2
 
'''

data.boxplot(column='머리둘레', by='태아수', grid=False)
plt.show()

reg = ols('data["머리둘레"] ~ C(data["태아수"]) + C(data["관측자수"])', data = data).fit()
result = anova_lm(reg, type=2)
print(result)
'''
                   df      sum_sq     mean_sq            F        PR(>F)
C(data["태아수"])    2.0  324.008889  162.004444  2023.182239  1.006291e-32
C(data["관측자수"])   3.0    1.198611    0.399537     4.989593  6.316641e-03
Residual         30.0    2.402222    0.080074          NaN           NaN
'''

# 두개의 요소 상호작용이 있는 형태로 처리
formula = '머리둘레 ~ C(태아수) + C(관측자수) + C(태아수):C(관측자수)'
reg2 = ols(formula, data).fit()
print(reg2)
result2 = anova_lm(reg2, type=2)
print(result2)
'''
                  df      sum_sq     mean_sq            F        PR(>F)
C(태아수)           2.0  324.008889  162.004444  2113.101449  1.051039e-27
C(관측자수)          3.0    1.198611    0.399537     5.211353  6.497055e-03
C(태아수):C(관측자수)   6.0    0.562222    0.093704     1.222222  3.295509e-01
Residual        24.0    1.840000    0.076667          NaN           NaN
'''

p-value PR(>F) C(태아수):C(관측자수) : 3.295509e-01 > 0.05 이므로 귀무가설 채택.
# 귀무가설 : 태아와 관측자수는 태아의 머리둘레의 평균과 관련이 없다.

jikwon 테이블 정보로 chi, t검정, anova

* anova5_t.py

import MySQLdb
import ast
import pandas as pd
import numpy as np
import scipy.stats as stats
import statsmodels.stats.api as sm
import matplotlib.pyplot as plt

try:
    with open('mariadb.txt', 'r') as f:
        config = f.read()
except Exception as e:
    print('err :', e)

config = ast.literal_eval(config)

conn = MySQLdb.connect(**config)
cursor = conn.cursor()

print("=================================================================")
print(' * 교차분석 (이원 카이제곱 검정 : 각 부서(범주형)와 직원평가 점수(범주형) 간의 관련성 분석) *')
# 독립변수 : 범주형, 종속변수 : 범주형
# 귀무가설 : 각 부서와 직원평가 점수 간 관련이 없다.(독립)
# 대립가설 : 각 부서와 직원평가 점수 간 관련이 있다.

df = pd.read_sql("select * from jikwon", conn)
print(df.head(3))
'''
   jikwon_no jikwon_name  buser_num  ... jikwon_ibsail  jikwon_gen jikwon_rating
0          1         홍길동         10  ...    2008-09-01           남             a
1          2         한송이         20  ...    2010-01-03           여             b
2          3         이순신         20  ...    2010-03-03           남             b
'''

buser = df['buser_num']
rating = df['jikwon_rating']

ctab = pd.crosstab(buser, rating) # 교차표 작성
print(ctab)
'''
jikwon_rating  a  b  c
buser_num             
10             5  1  1
20             3  6  3
30             5  2  0
40             2  2  0
'''

chi, p, df, exp = stats.chi2_contingency(ctab)
print('chi : {}, p : {}, df : {}'.format(chi, p, df))
# chi : 7.339285714285714, p : 0.2906064076671985, df : 6
# p-value : 0.2906 > 0.05 이므로 귀무가설 채택.
# 귀무가설 : 각 부서와 직원평가 점수 간 관련이 없다.(독립)

print("=================================================================")
print(' * 교차분석 (이원 카이제곱 검정 : 각 부서(범주형)와 직급(범주형) 간의 관련성 분석) *')
# 귀무가설 : 각 부서와 직급 간 관련이 없다.(독립)
# 대립가설 : 각 부서와 직급 간 관련이 있다.

df2 = pd.read_sql("select buser_num, jikwon_jik from jikwon", conn)
print(df2.head(3))
buser = df2.buser_num
jik = df2.jikwon_jik

ctab2 = pd.crosstab(buser, jik) # 교차표 작성
print(ctab2)

chi, p, df, exp = stats.chi2_contingency(ctab2)
print('chi : {}, p : {}, df : {}'.format(chi, p, df))
# chi : 9.620617477760335, p : 0.6492046290079438, df : 12
# p-value : 0.6492 > 0.05 이므로 귀무가설 채택.
# 귀무가설 : 각 부서와 직급 간 관련이 없다.(독립)
print()

print("=================================================================")
print(' * 차이분석 (t-test : 10, 20번 부서(범주형)와 평균 연봉(연속형) 간의 차이 분석) *')
# 독립변수 : 범주형, 종속변수 : 연속형 

# 귀무가설 : 두 부서 간 연봉 평균의 차이가 없다.
# 대립가설 : 두 부서 간 연봉 평균의 차이가 있다.

#df_10 = pd.read_sql("select buser_num, jikwon_pay from jikwon where buser_num in (10, 20)", conn)
df_10 = pd.read_sql("select buser_num, jikwon_pay from jikwon where buser_num = 10", conn)
df_20 = pd.read_sql("select buser_num, jikwon_pay from jikwon where buser_num = 20", conn)
buser_10 = df_10['jikwon_pay']
buser_20 = df_20['jikwon_pay']

print('평균 :',np.mean(buser_10), ' ', np.mean(buser_20))
# 평균 : 5414.285714285715   4908.333333333333

t_result = stats.ttest_ind(buser_10, buser_20)
print(t_result)
# pvalue=0.6523 > 0.05 이므로 귀무가설 채택
# 귀무가설 : 두 부서 간 연봉 평균의 차이가 없다.
print()

print("=================================================================")
print(' * 분산분석 (ANOVA : 각 부서(부서라는 1개의 요인에 4그룹으로 분리. 범주형)와 평균 연봉(연속형) 간의 차이 분석) *')
# 독립변수 : 범주형, 종속변수 : 연속형 

# 귀무가설 : 4개의 부서 간 연봉 평균의 차이가 없다.
# 대립가설 : 4개의 부서 간 연봉 평균의 차이가 있다.

df3 = pd.read_sql("select buser_num, jikwon_pay from jikwon", conn)
buser = df3['buser_num']
pay = df3['jikwon_pay']

gr1 = df3[df3['buser_num'] == 10 ]['jikwon_pay']
gr2 = df3[df3['buser_num'] == 20 ]['jikwon_pay']
gr3 = df3[df3['buser_num'] == 30 ]['jikwon_pay']
gr4 = df3[df3['buser_num'] == 40 ]['jikwon_pay']
print(gr1)

# 시각화
plt.boxplot([gr1, gr2, gr3, gr4])
#plt.show()

# 방법 1
f_sta, pv = stats.f_oneway(gr1, gr2, gr3, gr4)
print('f : {}, p : {}'.format(f_sta, pv))
# f : 0.41244077160708414, p : 0.7454421884076983
# p > 0.05 이므로 귀무가설 채택
# 귀무가설 : 4개의 부서 간 연봉 평균의 차이가 없다.


# 방법 2
lmodel = ols('jikwon_pay ~ C(buser_num)', data = df3).fit()
result = anova_lm(lmodel, type=2)
print(result)
'''
                df        sum_sq       mean_sq         F    PR(>F)
C(buser_num)   3.0  5.642851e+06  1.880950e+06  0.412441  0.745442
Residual      26.0  1.185739e+08  4.560535e+06       NaN       NaN
'''
# P : 0.745442 > 0.05  이므로 귀무가설 채택
# 귀무가설 : 4개의 부서 간 연봉 평균의 차이가 없다.

# 사후검정
from statsmodels.stats.multicomp import pairwise_tukeyhsd

tukey = pairwise_tukeyhsd(df3.jikwon_pay, df3.buser_num)
print(tukey)
'''
   Multiple Comparison of Means - Tukey HSD, FWER=0.05    
==========================================================
group1 group2  meandiff p-adj    lower      upper   reject
----------------------------------------------------------
    10     20 -505.9524    0.9 -3292.2958  2280.391  False
    10     30  -85.7143    0.9 -3217.2939 3045.8654  False
    10     40  848.2143    0.9 -2823.8884 4520.3169  False
    20     30  420.2381    0.9 -2366.1053 3206.5815  False
    20     40 1354.1667 0.6754  -2028.326 4736.6593  False
    30     40  933.9286 0.8955 -2738.1741 4606.0312  False
'''
tukey.plot_simultaneous()
plt.show()

'BACK END > Deep Learning' 카테고리의 다른 글

[딥러닝] 선형회귀 (0)	2021.03.10
[딥러닝] 공분산, 상관계수 (0)	2021.03.10
[딥러닝] 이항검정 (0)	2021.03.10
[딥러닝] T 검정 (0)	2021.03.05
[딥러닝] 카이제곱 (0)	2021.03.04

[딥러닝] T 검정

2021. 3. 5. 15:24

T 검정

: 집단 간 평균(비율) 차이 검정
: 평균값의 차이와 표준편차의 비율이 얼마나 큰지 혹은 작은지 통계적으로 검정하는 방법

독립변수 : 범주형 => t-test(2개이하) / anova(3개이상)
종속변수 : 연속형

T 검정 종류

One sample t test
Indepenent t test
Paired t test

단일 표본 검정 (one sample t-test)

: 하나의 집단에 대한 표본평균이 예측된 평균과 같은지 여부를 검정하나의 집단에 대한 표본평균과 새롭게 수집된 데이터의 예측된 평균이 같은지 여부를 검정

실습 1 : 어느 남성 집단의 평균 키 검정

귀무가설 : 남성 집단의 평균 키가 177이다. 샘플의 평균과 모집단의 평균은 같다.
대립가설 : 남성 집단의 평균 키가 177아니다. 샘플의 평균과 모집단의 평균은 다르다.

* t1.py

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import scipy.stats as stats

one_sample =[177.0, 182.7, 169.6, 176.8, 180.0]
print(np.array(one_sample).mean())  # 평균 : 177.21999999999997

one_sample2 =[167.0, 162.7, 169.6, 176.8, 170.0]
print(np.array(one_sample2).mean()) # 평균 : 169.21999999999997

result1 = stats.ttest_1samp(one_sample, popmean=177) # popmean : 모집단의 평균
print('result1 :', result1)
# result1 : Ttest_1sampResult(statistic=0.10039070766877535, pvalue=0.9248646407498543)

p-value=0.92486 > 0.05 이므로 귀무가설 채택
# 귀무가설 : 남성 집단의 평균 키가 177이다. 샘플의 평균과 모집단의 평균은 같다.

result2 = stats.ttest_1samp(one_sample2, popmean=177)
print('result2 :', result2)
# result2 : Ttest_1sampResult(statistic=-3.3850411682038235, pvalue=0.02765632703927135)

p-value=0.02765 < 0.05 이므로 귀무가설 기각
# 대립가설 : 남성 집단의 평균 키가 177아니다. 샘플의 평균과 모집단의 평균은 다르다.

result3 = stats.ttest_1samp(one_sample, popmean=165)
print('result3 :', result3)

p-value=0.005 < 0.05 이므로 귀무가설 기각
# 대립가설 : 남성 집단의 평균 키가 177아니다. 샘플의 평균과 모집단의 평균은 다르다.

stats.ttest_1samp(데이터, popmean=모집단 평균) : one sample t-test 함수

실습 2 : 어느 집단 자료 평균 검정

귀무가설 : 자료들의 평균은 0이다.
대립가설 : 자료들의 평균은 0이 아니다.

np.random.seed(123)
mu = 0
n = 10
x = stats.norm(mu).rvs(n) # 평균이 0인 가우시안 정규분포를 따르는 난수 10개 출력
print(x)
#[-1.0856306   0.99734545  0.2829785  -1.50629471 -0.57860025  1.65143654 -2.42667924 -0.42891263  1.26593626 -0.8667404 ]
print(np.mean(x)) #-0.26951611032632805

import seaborn as sns

# x 데이터의 정규성 만족 여부 확인
sns.distplot(x, kde=False, fit=stats.norm) # 분포 시각화
plt.show()

import seaborn as sns

sns.distplot(x, kde=, fit=) : 히스토그램

print(stats.shapiro(x)) # 정규성 검정 함수
# ShapiroResult(statistic=0.9674148559570312, pvalue=0.8658965229988098)

p-value 0.865 > 0.05 이므로 정규성 만족

stats.shapiro(데이터) : 정규성 검정 함수

result4 = stats.ttest_1samp(x, popmean=0) # 모수의 평균 키
print('result4 :', result4)
# result4 : Ttest_1sampResult(statistic=-0.6540040368674593, pvalue=0.5294637946339893)

p-value=0.529 > 0.05 이므로 귀무가설 채택
# 귀무가설 : 자료들의 평균은 0이다.

# 참고 : 모수의 평균이 0.8이라고 하면
result5 = stats.ttest_1samp(x, popmean=0.8) # 모수의 평균 키
print('result5 :', result5)
# result5 : Ttest_1sampResult(statistic=-2.595272886660806, pvalue=0.028961904737220684)

p-value=0.028 < 0.05 이므로 귀무가설 채택
# 대립가설 : 자료들의 평균은 0이 아니다.

실습 3 : A 중학교 1 학년 1 반 학생들의 시험결과가 담긴 파일을 읽어 처리 (국어 점수 80점에 대한 평균검정)

귀무가설 : 학생들의 국어점수의 평균은 80이다.
대립가설 : 학생들의 국어점수의 평균은 80이 아니다.

* t2.py

import pandas as pd
import scipy.stats as stats
import numpy as np

data = pd.read_csv("../testdata/student.csv")
print(data.head(3))
'''
    이름  국어  영어  수학
0  박치기  90  85  55
1  홍길동  70  65  80
2  김치국  92  95  76
'''
print(data.describe())

print(np.mean(data.국어)) # 72.9와 80은 평균에 차이가 있는지 확인.
print(stats.shapiro(data.국어))

result = stats.ttest_1samp(data.국어, popmean=80)
print('result :',result)

pvalue=0.19 > 0.05 이므로 귀무가설 채택.
# 귀무가설 : 학생들의 국어점수의 평균은 80이다.

result2 = stats.ttest_1samp(data.국어, popmean=60)
print('result2 :',result2)

pvalue=0.02568 < 0.05 이므로 귀무가설 기각.
# 대립가설 : 학생들의 국어점수의 평균은 80이 아니다.

실습4 : 여아 신생아 몸무게의 평균 검정 수행

여아 신생아의 몸무게는 평균이 2800(g) 으로 알려져 왔으나 이보다 더 크다는 주장이 나왔다.
표본으로 여아 18 명을 뽑아 체중을 측정하였다고 할 때 새로운 주장이 맞는지 검정해 보자

귀무가설 : 여아 신생아 몸무게의 평균이 2800g이다.
대립가설 : 여아 신생아 몸무게의 평균이 2800g 보다 크다.

data = pd.read_csv("https://raw.githubusercontent.com/pykwon/python/master/testdata_utf8/babyboom.csv")
print(data.head(3))
'''
   time  gender  weight  minutes
0     5       1    3837        5
1   104       1    3334       64
2   118       2    3554       78
'''
print(data.describe()) # gender - 1 : 여아, 2: 남아

fdata = data[data.gender == 1]
print(fdata.head(3), len(fdata), np.mean(fdata.weight))
'''
   time  gender  weight  minutes
0     5       1    3837        5
1   104       1    3334       64
5   405       1    2208      245
18명, 평균 : 3132.4444444444443
'''
# 3132(g)과 2800(g)에 대해 평균에 차이가 있는 지 검정.

# 정규성 확인
# 통계 추론시 대부분이 모집단은 정규분포를 따른다는 가정하에 진행하는 것이 일반적이다. 중심 극한의 원리에 의해
# 중심 극한의 원리 : 표본의 크기가 커질 수록 표본 평균의 분포는 모집단의 분포 모양과는 관계없이 정규 분포에 가까워진다.
print(stats.shapiro(fdata.weight)) # p-value=0.0179 < 0.05 => 정규성 위반

import matplotlib.pyplot as plt
import seaborn as sns
sns.distplot(fdata.iloc[:, 2], fit=stats.norm)
plt.show()

stats.probplot(fdata.iloc[:, 2], plot=plt) # Q-Q plot 상에서 정규성 확인 - 잔차의 정규성
plt.show()

stats.probplot(데이터, plot=plt) : Q-Q plot 상에서 정규성 확인 - 잔차의 정규성

result3 = stats.ttest_1samp(fdata.weight, popmean=2800)
print('result3 :',result3) # pvalue=0.03926844173060218

pvalue=0.0392 < 0.05 이므로 귀무가설 기각.
# 대립가설 : 여아 신생아 몸무게의 평균이 2800g 보다 크다.

서로 독립인 두 집단의 평균 차이 검정 (independent samples t test)

선행조건 : 두 집단은 정규분포를 따라야 하며, 두 집단의 분산이 동일해야한다.

남녀의 성적 A 반과 B 반의 키 경기도와 충청도의 소득 따위의 서로 독립인 두 집단에서 얻은 표본을 독립표본 (two sample) 이라고 한다

실습1 : 남녀 두 집단 간 파이썬 시험의 평균 차이 검정

귀무가설 : 남녀 두 집단 간 파이썬 시험의 평균에 차이가 없다.
대립가설 : 남녀 두 집단 간 파이썬 시험의 평균에 차이가 있다.

* t3.py

import numpy as np
import scipy.stats as stats

male= [75, 85, 100, 72.5, 86.5]
female = [63.2, 76, 52, 100, 70]

print(np.mean(male))    # 83.8
print(np.mean(female))  # 72.24

# 데이터에 대한 정규성/ 등분산성은 생략

two_sample = stats.ttest_ind(male, female)
two_sample = stats.ttest_ind(male, female, equal_var = True)
# equal_var=True(default): 등분산을 만족한 경우
print(two_sample.pvalue)
print(two_sample)
# Ttest_indResult(statistic=1.233193127514512, pvalue=0.25250768448532773)

stats.ttest_ind(x, y, equal_var =) : 독립 샘플 t 검정

pvalue=0.2525 > 0.05 이므로 귀무채택.
# 귀무가설 : 남녀 두 집단 간 파이썬 시험의 평균에 차이가 없다.

실습2 : 두 가지 교육방법에 따른 평균시험 점수에 대한 검정 수행

귀무가설 : 두 가지 교육방법에 따른 평균시험 점수에 차이가 없다.
대립가설 : 두 가지 교육방법에 따른 평균시험 점수에 차이가 있다.

import pandas as pd
data = pd.read_csv('../testdata/two_sample.csv')
print(data.head(3))
'''
   no  gender  method  survey  score
0   1       1       1       1    5.1
1   2       1       2       0    NaN
2   3       1       1       1    4.7
'''

ms = data[['method', 'score']]
print(data['method'].unique()) # [1 2]

# 교육방법 별로 데이터 분리
m1 = ms[ms['method'] == 1]
m2 = ms[ms['method'] == 2]
print(m1[:5])
'''
    method  score
0        1    5.1
2        1    4.7
4        1    5.4
7        1    5.2
10       1    5.7
'''
print(m2[:5])
'''
   method  score
1       2    NaN
3       2    NaN
5       2    4.4
6       2    4.9
8       2    4.3
'''

sco1 = m1['score']
sco2 = m2['score']

# NaN : 임의의 값으로 대체, 평균으로 대체, 제거
#sco1 = sco1.fillna(0)
sco1 = sco1.fillna(sco1.mean())
sco2 = sco2.fillna(sco2.mean())
print()
print(sco1[:5])
'''
0     5.1
2     4.7
4     5.4
7     5.2
10    5.7
'''
print(sco2[:5])
'''
1    5.246667
3    5.246667
5    4.400000
6    4.900000
8    4.300000
'''

fillna(대체값) : 결측치 임의의 값으로 대체.

- 정규성 확인

import matplotlib.pyplot as plt
import seaborn as sns
sns.distplot(sco1, kde = False, fit=stats.norm)
sns.distplot(sco2, kde = False, fit=stats.norm)
plt.show()

print(stats.shapiro(sco1)) # pvalue=0.3679 > 0.05 정규성 만족.
print(stats.shapiro(sco2)) # pvalue=0.6714 > 0.05

print(stats.levene(sco1, sco2))   # pvalue=0.4568 > 0.05 등분산성 만족
print(stats.fligner(sco1, sco2))  # pvalue=0.4432
print(stats.bartlett(sco1, sco2)) # 비모수 30개 이상. pvalue=0.2678

result = stats.ttest_ind(sco1, sco2, equal_var=True) # 정규성 만족, 등분산성 만족할 경우
print(result) # pvalue=0.8450

stats.levene(x, y) : 등분산성 검증(레빈).

stats.fligner(x, y) : 등분산성 검증(플리그너).

stats.bartlett(x, y) : 등분산성 검증(바틀렛).

결론 : pvalue=0.8450 > 0.05 이므로 귀무채택.
# 귀무가설 : 두 가지 교육방법에 따른 평균시험 점수에 차이가 없다.

# 참고
result = stats.ttest_ind(sco1, sco2, equal_var=False) # 정규성 만족, 등분산성 만족하지 못할 경우
result = stats.wilcoxon(sco1, sco2)
print(result)

실습3 : 어느 음식점의 매출자료와 날씨 자료를 이용하여 강수여부에 따른 매출의 평균의 차이에 대한 검정

집단 1 : 비가 올때의 매출, 집단 2 : 비가 안올때의 매출

귀무가설 : 강수여부에 따른 매출액 평균에 차이가 없다.
대립가설 : 강수여부에 따른 매출액 평균에 차이가 있다.

* t4.py

- 매출자료

import numpy as np
import scipy.stats as stats
import pandas as pd
import matplotlib.pyplot as plt

sales_data = pd.read_csv('https://raw.githubusercontent.com/pykwon/python/master/testdata_utf8/tsales.csv', dtype={'YMD':'object'})
print(sales_data.head(3)) # 328행
'''
        YMD    AMT  CNT
0  20190514      0    1
1  20190519  18000    1
2  20190521  50000    4
'''
print(sales_data.info())

- 날씨 자료

wt_data = pd.read_csv('https://raw.githubusercontent.com/pykwon/python/master/testdata_utf8/tweather.csv')
print(wt_data.head(3)) # 702행
'''
   stnId          tm  avgTa  minTa  maxTa  sumRn  maxWs  avgWs  ddMes
0    108  2018-06-01   23.8   17.5   30.2    0.0    4.3    1.9    0.0
1    108  2018-06-02   23.4   17.6   30.1    0.0    4.5    2.0    0.0
2    108  2018-06-03   24.0   16.9   30.8    0.0    4.2    1.6    0.0
'''
print(wt_data.info())

- 날짜를 기준으로 join

wt_data.tm = wt_data.tm.map(lambda x : x.replace('-','')) # wt_data.tm에서 '-' 제거 
#print(wt_data.head(3))
frame = sales_data.merge(wt_data, how='left', left_on='YMD', right_on='tm') # join
print(frame.head(3), frame.shape) # (328, 12)
'''
        YMD    AMT  CNT  stnId        tm  ...  maxTa  sumRn  maxWs  avgWs  ddMes
0  20190514      0    1    108  20190514  ...   26.9    0.0    4.1    1.6    0.0
1  20190519  18000    1    108  20190519  ...   21.6   22.0    2.7    1.2    0.0
2  20190521  50000    4    108  20190521  ...   23.8    0.0    5.9    2.9    0.0
'''
print(frame.columns)

df1.merge(df2, how='left', left_on='df1 칼럼', right_on='df2 칼럼') : join

- 분석에 참여할 칼럼만 추출

data = frame.iloc[:, [0,1,7,8]]
print(data.head(3))
'''
        YMD    AMT  maxTa  sumRn
0  20190514      0   26.9    0.0
1  20190519  18000   21.6   22.0
2  20190521  50000   23.8    0.0
'''

# 방법 1
print(data['sumRn'] > 0)
data['rain_yn'] = (data['sumRn'] > 0).astype(int)
print(data.head(3))
'''
        YMD    AMT  maxTa  sumRn  rain_yn
0  20190514      0   26.9    0.0        0
1  20190519  18000   21.6   22.0        1
2  20190521  50000   23.8    0.0        0
'''

# 방법 2
print(True * 1, False * 1) # 1 0
data['rain_yn'] = (data.loc[:, ('sumRn')] > 0) * 1
print(data.head(3))

df.astype(타입명) : 타입변경

- 강수여부에 따른 매출액 비교용 시각화

sp = np.array(data.iloc[:, [1, 4]])  #  AMT, rain_yn
tg1 = sp[sp[:, 1] == 0, 0] # 비가 안올때의 매출액
tg2 = sp[sp[:, 1] == 1, 0] # 비가 올때의 매출액
print(tg1[:3]) # [     0  50000 125000]
print(tg2[:3]) # [ 18000 274000 318000]
print(np.mean(tg1), np.mean(tg2)) # 761040.2542372881 757331.5217391305

plt.plot(tg1)
plt.show()
plt.plot(tg2)
plt.show()

plt.boxplot([tg1, tg2], meanline=True, showmeans=True, notch=True)
plt.show()

plt.boxplot([data1, data2], meanline=True, showmeans=True, notch=True) : 상자 막대그래프

- 두 집단 평균차이 검정

# 정규성 (N > 30)
print(len(tg1), ' ', len(tg2))    # 236   92
print(stats.shapiro(tg1).pvalue) # 0.0560 > 0.05 => 정규성 만족 
print(stats.shapiro(tg2).pvalue) # 0.8827

# 등분산
print(stats.levene(tg1, tg2).pvalue) # 0.7123 > 0.05 => 등분산성 만족

print(stats.ttest_ind(tg1, tg2, equal_var=True)) # pvalue=0.9195

결론 : pvalue=0.9195 > 0.05 이므로 귀무가설 채택.
# 귀무가설 : 강수여부에 따른 매출액 평균에 차이가 없다.

서로 대응인 두 집단의 평균 차이 검정 (paired samples t test)

: 처리 이전과 처리 이후를 각각의 모집단으로 판단하여 동일한 관찰 대상으로부터 처리 이전과 처리 이후를 1:1 로 대응시킨 두 집단으로 부터의 표본을 대응표본 (paired sample) 이라고 한다.
: 대응인 두 집단의 평균 비교는 동일한 관찰 대상으로부터 처리 이전의 관찰과 이후의 관찰을 비교하여 영향을 미친 정도를 밝히는데 주로 사용하고 있다 집단 간 비교가 아니므로 등분산 검정을 할 필요가 없다.
: 광고 전/후의 상품 판매량의 차이, 운동 전/후의 근육량의 차이

실습 1 : 특강 전/후의 시험점수는 차이

귀무가설 : 특강 전/후의 시험점수는 차이가 없다.
대립가설 : 특강 전/후의 시험점수는 차이가 있다.

* t5.py

import numpy as np
import scipy.stats as stats
from seaborn.distributions import distplot

# 대응 표본 t 검정 1
np.random.seed(12)
x1 = np.random.normal(80, 10, 100)
x2 = np.random.normal(77, 10, 100)

# 정규성
import matplotlib.pyplot as plt
import seaborn as sns
sns.distplot(x1, kde=False, fit=stats.norm)
sns.distplot(x2, kde=False, fit=stats.norm)
plt.show()

print(stats.shapiro(x1)) # pvalue=0.9942 > 0.05 => 정규성 만족
print(stats.shapiro(x2)) # pvalue=0.7985

# 집단이 하나이므로 등분산성 검즘은 하지않는다.

print(stats.ttest_rel(x1, x2))

stats.ttest_rel(x1, x2) : 서로 대응인 두 집단의 평균 차이 검정

결론 : pvalue=0.0187 < 0.05 이므로 귀무가설 기각.
# 대립가설 : 특강 전/후의 시험점수는 차이가 있다.

실습 2 : 환자 9명의 복부 수술 전/후 몸무게 변화

귀무가설 : 복부 수술 전/후 몸무게 변화가 없다.
대립가설 : 복부 수술 전/후 몸무게 변화가 있다.

baseline = [67.2, 67.4, 71.5, 77.6, 86.0, 89.1, 59.5, 81.9, 105.5]
follow_up = [62.4, 64.6, 70.4, 62.6, 80.1, 73.2, 58.2, 71.0, 101.0]

print(np.mean(baseline)) # 78.41111111111111

print(np.mean(follow_up)) # 71.5
print(np.mean(baseline) - np.mean(follow_up)) # 6.911111111111111

result = stats.ttest_rel(baseline, follow_up)
print(result)

pvalue=0.0063 < 0.05 => 귀무가설 기각.
# 대립가설 : 복부 수술 전/후 몸무게 변화가 있다.

'BACK END > Deep Learning' 카테고리의 다른 글

[딥러닝] 선형회귀 (0)	2021.03.10
[딥러닝] 공분산, 상관계수 (0)	2021.03.10
[딥러닝] 이항검정 (0)	2021.03.10
[딥러닝] ANOVA (0)	2021.03.08
[딥러닝] 카이제곱 (0)	2021.03.04

[딥러닝] 카이제곱

2021. 3. 4. 09:46

귀무가설(영가설, H0) : 변함없는 생각.
대립가설(연구가설, H1) : 귀무가설에 반대하는 새로운 의견. 연구가설.

점추정 : 단일값을 모집단에서 샘플링
구간추정 : 범위를 모집단에서 샘플링. 신뢰도 상승.

가설검증

① 분포표 사용 -> 임계값 산출
임계값이 critical value 왼쪽에 있을 경우 귀무가설 채택/기각.

② p-value 사용(tool에서 산출). 유의확률. (경험적 수치)
p-value > 0.05(a)  : 귀무가설 채택. 우연히 발생할 확률이 0.05보다 낮아 연구가설 기각.
p-value < 0.05(a)  : 귀무가설 기각. 좋다
p-value < 0.01(a)  : 귀무가설 기각. 더 좋다
p-value < 0.001(a) : 귀무가설 기각. 더 더 좋다 => 신뢰 할 수 있다.

척도에 따른 데이터 분석방법

독립변수 x	종속변수 y	분석 방법
범주형	범주형	카이제곱 검정(교차분석)
범주형	연속형	T 검정(범주형 값 2개 : 집단 2개 이하), ANOVA(범주형 값 3개 : 집단 2개 이상)
연속형	범주형	로지스틱 회귀 분석
연속형	연속형	회귀분석, 구조 방정식

1종 오류 : 귀무가설이 채택(참)인데, 이를 기각.
2종 오류 : 귀무가설이 기각(거짓)인데, 이를 채택.

분산(표준편차)의 중요도 - 데이터의 분포

* temp.py

import scipy.stats as stats
import numpy as np
import matplotlib.pyplot as plt

centers = [1, 1.5, 2] # 3개 집단
col = 'rgb'
data = []
std = 0.01 # 표준편차 - 작으면 작을 수록 응집도가 높다.

for i in range(len(centers)):
    data.append(stats.norm(centers[i], std).rvs(100)) # norm() : 가우시안 정규분포. rvs(100) : 랜덤값 생성.
    plt.plot(np.arange(100) + i * 100, data[i], color=col[i])   # 그래프
    plt.scatter(np.arange(100) + i * 100, data[i], color=col[i]) # 산포도
plt.show()
print(data)

import scipy.stats as stats

stats.norm(평균, 표준편차).rvs(size=크기, random_state=seed) : 가우시안 정규 분포 랜덤 값 생성.

import matplotlib.pyplot as plt

plt.plot(x, y) : 그래프 출력.

교차분석 (카이제곱) 가설 검정

데이터나 집단의 분산을 추정하고 검정할때 사용
독립변수, 종속변수 : 범주형
일원 카이제곱 : 변수 단수. 적합성(선호도) 검정. 교차분할표 사용하지않음.
이원 카이제곱 : 변수 복수. 독립성(동질성) 검정. 교차분할표 사용.
절차 : 가설 설정 -> 유의 수준 결정 -> 검정통계량 계산 -> 귀무가설 채택여부 판단 -> 검정결과 진술
수식 : sum((관찰빈도 - 기대빈도)^2 / 기대빈도)

수식에 따른 카이제곱 검정

* chi1.py

import pandas as pd

data = pd.read_csv('../testdata/pass_cross.csv', encoding='euc-kr') # csv 파일 읽기
print(data.head(3))
'''
   공부함  공부안함  합격  불합격
0    1     0   1    0
1    1     0   1    0
2    0     1   0    1
'''
print(data.shape) # (행, 열) : (50, 4)

귀무가설 : 벼락치기와 합격여부는 관계가 없다.
대립가설 : 벼락치기와 합격여부는 관계가 있다.

print(data[(data['공부함'] == 1) & (data['합격'] == 1)].shape[0]) # 18 - 공부하고 합격
print(data[(data['공부함'] == 1) & (data['불합격'] == 1)].shape[0]) # 7 - 공부하고 불합격

빈도표

data2 = pd.crosstab(index=data['공부안함'], columns=data['불합격'], margins=True)
'''
불합격     0   1
공부안함        
0     18   7
1     12  13
'''
data2.columns = ['합격', '불합격', '행 합']
data2.index = ['공부함', '공부안함', '열 합']
print(data2)
'''
             합격  불합격  행 합
공부함    18    7   25
공부안함  12   13   25
열 합      30   20   50
'''

기대도수 = 각 행 합 * 각 열 합 / 총합

'''
        합격  불합격  행 합
공부함    15    10   25
공부안함  15   10   25
열 합      30   20   50
'''

카이제곱

ch2 = ((18 - 15) ** 2 / 15) + ((7 - 10) ** 2 / 10) + ((12 - 15) ** 2 / 15) + ((13 - 10) ** 2 / 10)
print('카이제곱 :', ch2) # 3.0

자유도(df) = (행 개수 - 1) * (열 개수 - 1) = (2-1) * (2-1) = 1

카이제곱 분포표에서 확인한 임계치 : 3.84

math100.tistory.com/45

카이제곱분포표 보는 법

카이제곱분포는 t분포와 마찬가지로 확률을 구할 때 사용하는 분포가 아니라, 나중에 신뢰구간이랑 가설검정에서 사용하는 분포다. 그래서 카이제곱분포표는 “t분포표 보는 법”과 얼추 비슷

math100.tistory.com

# 결론1 : 카이제곱 3 < 유의수준 3.84 => 귀무가설 채택 내에서 존재하여 귀무가설 채택.
( 귀무가설 : 벼락치기와 합격여부는 관계가 없다. )

전문가 제공 모듈 사용(chi2_contingency())

import scipy.stats as stats
chi2, p, ddof, expected = stats.chi2_contingency(data2)
print('chi2\t:', chi2)
print('p\t:', p)
print('ddof\t:', ddof)
print('expected :\n', expected)
'''
chi2    : 3.0
p    : 0.5578254003710748
ddof    : 4
expected :
 [[15. 10. 25.]
 [15. 10. 25.]
 [30. 20. 50.]]
'''

# 결론2 : 유의확률(p-value) 0.5578 > 유의수준 0.05 => 귀무가설 채택.
( 귀무가설 : 벼락치기와 합격여부는 관계가 없다. )

stats.chi2_contigency(데이터) : 이원카이제곱 함수

일원카이제곱(chisquare())

: 관찰도수가 기대도수와 일치하는 지를 검정하는 방법
: 종류 적합도 선호도 검정 - 범주형 변수가 한 가지로 관찰도수가 기대도수에 일치하는지 검정한다

적합도검정

: 자연현상이나 각종 실험을 통해 관찰되는 도수들이 귀무가설 하의 분포 범주형 자료의 각 수준별 비율 에 얼마나 일치하는 가에 대한 분석을 적합도 검정이라 한다
: 관측값들이 어떤 이론적 분포를 따르고 있는지를 검정으로 한 개의 요인을 대상으로 함

<실습 : 적합도 검정>

주사위를 60 회 던져서 나온 관측도수 기대도수가 아래와 같이 나온 경우에 이 주사위는 적합한 주사위가 맞는가를 일원카이제곱 검정
으로 분석하자

주사위 눈금	1	2	3	4	5	6
관측도수	4	6	17	16	8	9
기대도수	10	10	10	10	10	10

귀무가설(영가설, H0) : 기대빈도와 관찰빈도는 차이가 없다. 현재 주사위는 평범한 주사위다.
대립가설(연구가설, H1) : 기대빈도와 관찰빈도는 차이가 있다. 현재 주사위는 평범하지않은 주사위다.

변수 1개 : 주사위를 던진 횟수. 기대빈도와 관찰빈도의 차이를 확인. stats.chisquare(관찰빈도, 기대빈도)

* chi2.py

import pandas as pd
import scipy.stats as stats

data = [4, 6, 17, 16, 8, 9] # 관찰빈도
result = stats.chisquare(data)
print(result)
# Power_divergenceResult(statistic=14.200000000000001, pvalue=0.014387678176921308)

data2 = [10, 10, 10, 10, 10, 10] # 기대빈도
result2 = stats.chisquare(data, data2)
print(result2)

print('통계량(x2) : %.5f, p-value : %.5f'%result)
# 통계량(x2) : 14.20000, p-value : 0.01439
# 통계량과 p-value는 반비례 관계.

# 결론 1 : p-value 0.01439 < 0.05 이므로 유의미한 수준(a=0.05)에서 귀무 가설 기각. 연구가설 채택.
( 대립가설 : 기대빈도와 관찰빈도는 차이가 있다. 현재 주사위는 평범하지않은 주사위다. )

# 결론 2 : df 5(N-1), 임계값 : 카이제곱 분포표를 참고시 11.07
statistic(x2) 14.200 > 임계값 11.07 이므로 귀무 가설 기각. 연구가설 채택.
( 대립가설 : 기대빈도와 관찰빈도는 차이가 있다. 현재 주사위는 평범하지않은 주사위다. )

관찰빈도와 기대빈도 사이에 유의한 차이가 있는 지 일원 카이제곱을 사용하여 검정

stats.chisquare(데이터) : 일원카이제곱 함수

<실습 : 선호도 분석>

5개의 스포츠 음료에 대한 선호도에 차이가 있는지 검정하기

귀무 가설 : 스포츠 음료에 대한 선호도 차이가 없다.
대립 가설 : 스포츠 음료에 대한 선호도 차이가 있다.

import numpy as np
data3 = pd.read_csv('../testdata/drinkdata.csv')
print(data3, ' ', sum(data3['관측도수']))
'''
  음료종류  관측도수
0   s1    41
1   s2    30
2   s3    51
3   s4    71
4   s5    61   254
'''
print(stats.chisquare(data3['관측도수']))
# Power_divergenceResult(statistic=20.488188976377952, pvalue=0.00039991784008227264)

# 결론 : pvalue 0.00039 < 유의수준 0.05 이기 때문에 귀무가설 기각. 대립가설 채택.
( 대립 가설 : 스포츠 음료에 대한 선호도 차이가 있다. 향후 작업에 대한 참조자료로 이용. )

이원카이제곱 - 교차분할표 이용

: 두 개 이상의 변인 집단 또는 범주 을 대상으로 검정을 수행한다
분석대상의 집단 수에 의해서 독립성 검정과 동질성 검정으로 나뉜다

독립성 검정 : 두 변인의 관계가 독립인지 검정
동질성 검정 : 두 변인 간의 비율이 서로 동일한지를 검정

독립성 (관련성) 검정

- 동일 집단의 두 변인 학력수준과 대학진학 여부 을 대상으로 관련성이 있는가 없는가
- 독립성 검정은 두 변수 사이의 연관성을 검정한다

<실습 : 교육수준과 흡연율 간의 관련성 분석>

집단 2개 : 교육수준, 흡연율
귀무가설 : 교육수준과 흡연율 간에 관련이 없다. (독립이다)
대립가설 : 교육수준과 흡연율 간에 관련이 있다. (독립이 아니다)

*chi3.py

import pandas as pd
import scipy.stats as stats
data = pd.read_csv('../testdata/smoke.csv')
print(data)
'''
     education  smoking
0            1        1
1            1        1
2            1        1
3            1        1
4            1        1
..         ...      ...
350          3        3
351          3        3
352          3        3
353          3        3
354          3        3
'''
print(data['education'].unique()) # [1 2 3]
print(data['smoking'].unique())   # [1 2 3]

교육 수준, 흡연인원를 이용한 교차표 작성

# education : 독립변수, smoking : 종속변수
ctab = pd.crosstab(index = data['education'], columns = data['smoking']) # 빈도수.
ctab = pd.crosstab(index = data['education'], columns = data['smoking'], normalize=True) # normalize : 비율
ctab.index = ['대학원졸', '대졸', '고졸']
ctab.columns = ['골초', '보통', '노담']
print(ctab)
'''
             골초  보통  노담
대학원졸  51  92  68
대졸       22  21   9
고졸       43  28  21
'''

pd.crosstab(index=, columns=) : 교차 테이블

이원 카이 제곱을 지원하는 함수

chi_result = [ctab.loc['대학원졸'], ctab.loc['대졸'], ctab.loc['고졸']]
chi, p, _, _ = stats.chi2_contingency(chi_result)
chi, p, _, _ = stats.chi2_contingency(ctab)

print('chi :', chi)    # chi : 18.910915739853955
print('p-value :', p)  # p-value : 0.0008182572832162924

# 결론 : p-value 0.0008 < 0.05 이므로 귀무가설 기각.
( 대립가설 : 교육수준과 흡연율 간에 관련이 있다. (독립이 아니다) )

stats.chi2_contigency(데이터) : 이원카이제곱 함수

야트보정

분할표의 자유도가 1인 경우는 x^2값이 약간 높게 계산된다. 그러므로 이에 하기 식으로 야트보정이라 한다.
x^2=∑(|O-E|-0.5)^2/E

=> 상기 함수는 자동으로 야트보정이 이루어진다.

<실습 : 국가전체와 지역에 대한 인종 간 인원수로 독립성 검정 실습>

두 집단 국가전체 national, 특정지역 la) 의 인종 간 인원수의 분포가 관련이 있는가

귀무가설 : 국가전체와 지역에 대한 인종 간 인원수는 관련이 없다. 독립적.
대립가설 : 국가전체와 지역에 대한 인종 간 인원수는 관련이 있다. 독립적이지 않다.

national = pd.DataFrame(["white"] * 100000 + ["hispanic"] * 60000 + 
                         ["black"] * 50000 + ["asian"] * 15000 +["other"] * 35000)
la = pd.DataFrame(["white"] * 600 + ["hispanic"] * 300 + ["black"] * 250 + 
                  ["asian"] * 75 + ["other"] * 150)
print(national) # [260000 rows x 1 columns]
print(la)       # [1375 rows x 1 columns]

na_table = pd.crosstab(index = national[0], columns = 'count')
la_table = pd.crosstab(index = la[0], columns = 'count')
print(la_table)
'''
col_0     count
0              
asian        75
black       250
hispanic    300
other       150
white       600
'''
na_table['count_la'] = la_table['count'] # 칼럼 추가
print(na_table)

'''
col_0      count  count_la
0                         
asian      15000        75
black      50000       250
hispanic   60000       300
other      35000       150
white     100000       600
'''
chi, p, _, _ = stats.chi2_contingency(na_table)
print('chi :', chi)    # chi : 18.099524243141698
print('p-value :', p)  # p-value : 0.0011800326671747886

# 결론 : p-value : 0.0011 < 0.05 이므로 귀무가설 기각.
( 대립가설 : 국가전체와 지역에 대한 인종 간 인원수는 관련이 있다. 독립적이지 않다.)

이원카이제곱

동질성 검정

- 두 집단의 분포가 동일한가 다른 분포인가 를 검증하는 방법이다 두 집단 이상에서 각 범주 집단 간의 비율이 서로 동일한가를 검정하게 된다 두 개 이상의 범주형 자료가 동일한 분포를 갖는 모집단에서 추출된 것인지 검정하는 방법이다

<실습 1 : 교육방법에 따른 교육생들의 만족도 분석 - 동질성 검정>

귀무가설 : 교육방법에 따른 교육생들의 만족도에 차이가 없다.
대립가설 : 교육방법에 따른 교육생들의 만족도에 차이가 있다.

* chi4.py

import pandas as pd
import scipy.stats as stats

data = pd.read_csv("../testdata/survey_method.csv")
print(data.head(6))
'''
   no  method  survey
0   1       1       1
1   2       2       2
2   3       3       3
3   4       1       4
4   5       2       5
5   6       3       2
'''
print(data['survey'].unique()) # [1 2 3 4 5] - survey 데이터 종류
print(data['method'].unique()) # [1 2 3]     - method 데이터 종류

# 교차표 작성
ctab = pd.crosstab(index=data['method'], columns=data['survey'])
ctab.columns = ['매우만족', '만족', '보통', '불만족', '매우불만족']
ctab.index = ['방법1', '방법2', '방법3']
print(ctab)
'''
     매우만족  만족  보통  불만족  매우불만족
방법1     5   8  15   16      6
방법2     8  14  11   11      6
방법3     8   7  11   15      9
'''

# 카이제곱
chi, p, df, ex = stats.chi2_contingency(ctab)
msg = "chi2 : {}, p-value:{}, df:{}"
print(msg.format(chi, p, df))
# chi2 : 6.544667820529891, p-value:0.5864574374550608, df:8

# 결론 : p-value 0.586 > 0.05 이므로 귀무가설 채택.
( 귀무가설 : 교육방법에 따른 교육생들의 만족도에 차이가 없다. )

<실습 : 연령대별 sns 이용률의 동질성 검정>

20대에서 40 대까지 연령대별로 서로 조금씩 그 특성이 다른 SNS 서비스들에 대해 이용 현황을 조사한 자료를 바탕으로 연령대별로 홍보 전략을 세우고자 한다. 연령대별로 이용 현황이 서로 동일한지 검정해 보도록 하자

귀무가설 : 연령대별로 SNS 서비스 이용 현황에 차이가 없다(동질이다).
대립가설 : 연령대별로 SNS 서비스 이용 현황에 차이가 있다(동질이 없다).

import pandas as pd
import scipy.stats as stats

data2 = pd.read_csv("../testdata/snsbyage.csv")
print(data2.head(), ' ', data2.shape)
'''
   age service
0    1       F
1    1       F
2    1       F
3    1       F
4    1       F   (1439, 2)
'''
print(data2['age'].unique())     # [1 2 3]
print(data2['service'].unique()) # ['F' 'T' 'K' 'C' 'E']

# 교차표 작성
ctab2 = pd.crosstab(index = data2['age'], columns = data2['service'])
print(ctab2)
'''
service    C   E    F    K    T
age                            
1         81  16  207  111  117
2        109  15  107  236  104
3         32  17   78  133   76
'''

# 카이제곱
chi, p, df, ex = stats.chi2_contingency(ctab2)
msg2 = "chi2 : {}, p-value:{}, df:{}"
print(msg2.format(chi, p, df))
# chi2 : 102.75202494484225, p-value:1.1679064204212775e-18, df:8

# 결론 : p-value 1.1679064204212775e-18 < 0.05 이므로 귀무가설 기각.
( 대립가설 : 연령대별로 SNS 서비스 이용 현황에 차이가 있다(동질이 없다) )

# 참고 : 상기 데이터가 모집단이었다면 샘플링을 진행하여 샘플링 데이터로 작업 진행.
sample_data = data2.sample(500, replace=True)

카이제곱 + Django

= django_use02_chi(PyDev Django Project)

Create application - coffesurvey

* settings

...

INSTALLED_APPS = [
    
    ...
    
    'coffeesurvey',
]

...

DATABASES = {
    'default': {
        'ENGINE': 'django.db.backends.mysql', 
        'NAME': 'coffeedb',            # DB명 : db는 미리 작성되어 있어야 함.       
        'USER': 'root',                # 계정명 
        'PASSWORD': '123',             # 계정 암호           
        'HOST': '127.0.0.1',           # DB가 설치된 컴의 ip          
        'PORT': '3306',                # DBMS의 port 번호     
    }
}

* MariaDB

create database coffeedb;

use coffeedb;

create table survey(rnum int primary key auto_increment, gender varchar(4),age int(3),co_survey varchar(10));

insert into survey (gender,age,co_survey) values ('남',10,'스타벅스');
insert into survey (gender,age,co_survey) values ('여',20,'스타벅스');

* anaconda prompt

cd C:\work\psou\django_use02_chi

python manage.py inspectdb > aaa.py

* models

from django.db import models

# Create your models here.
class Survey(models.Model):
    rnum = models.AutoField(primary_key=True)
    gender = models.CharField(max_length=4, blank=True, null=True)
    age = models.IntegerField(blank=True, null=True)
    co_survey = models.CharField(max_length=10, blank=True, null=True)

    class Meta:
        managed = False
        db_table = 'survey'

Make migrations - coffeesurvey

Migrate

* urls(django_use02_chi)

from django.contrib import admin
from django.urls import path
from coffeesurvey import views
from django.urls.conf import include

urlpatterns = [
    path('admin/', admin.site.urls),
    path('', views.Main), # url 없을 경우
    path('coffee/', include('coffeesurvey.urls')), # 위임
]

* views

from django.shortcuts import render
from coffeesurvey.models import Survey

# Create your views here.
def Main(request):
    return render(request, 'main.html')

...

* main.html

<body>
<h2>메인</h2>
<ul>
	<li>메뉴</li>
	<li>신제품</li>
	<li><a href="coffee/survey">설문조사</a></li>
</ul>
</body>

* urls(coffeesurvey)

from django.urls import path
from coffeesurvey import views

urlpatterns = [
    path('survey', views.SurveyView), 
    path('surveyprocess', views.SurveyProcess),
]

* views

...

def SurveyView(request):
    return render(request, 'survey.html')

* survey.html

<body>
<h2>* 커피전문점에 대한 소비자 인식조사 *</h2>
<form action="/coffee/surveyprocess" method="post">{% csrf_token %}
<table>
	<tr>
		<td>귀하의 성별은?</td>
		<td>
			<label for="genM">남</label>
			<input type="radio" id="genM" name="gender" value="남" checked="checked">
			&nbsp
			<label for="genF">여</label>
			<input type="radio" id="genF" name="gender" value="여">
		</td>
	</tr>
	<tr>
		<td>귀하의 나이는?</td>
		<td>
			<label for="age10">10대</label>
			<input type="radio" id="age10" name="age" value="10" checked="checked">
			&nbsp
			<label for="age20">20대</label>
			<input type="radio" id="age20" name="age" value="20">
			&nbsp
			<label for="age10">30대</label>
			<input type="radio" id="age30" name="age" value="30">
			&nbsp
			<label for="age40">40대</label>
			<input type="radio" id="age40" name="age" value="40">
			&nbsp
			<label for="age50">50대</label>
			<input type="radio" id="age50" name="age" value="50">
		</td>
	</tr>
	<tr>
		<td>선호하는 커피전문점은?</td>
		<td>
			<label for="startbucks">스타벅스</label>
			<input type="radio" id="genM" name="co_survey" value="스타벅스" checked="checked">
			&nbsp
			<label for="coffeebean">커피빈</label>
			<input type="radio" id="coffeebean" name="co_survey" value="커피빈">
			&nbsp
			<label for="ediya">이디아</label>
			<input type="radio" id="ediya" name="co_survey" value="이디아">
			&nbsp
			<label for="tomntoms">탐앤탐스</label>
			<input type="radio" id="tomntoms" name="co_survey" value="탐앤탐스">
		</td>
	</tr>
	<tr>
		<td colspan="2">
		<br>
		<input type="submit" value="설문완료">
		<input type="reset" value="초기화">
	</tr>
</table>
</form>
</body>

* views

...

import pandas as pd
import scipy.stats as stats
import matplotlib.pyplot as plt

plt.rc('font', family="malgun gothic")

def SurveyProcess(request):
    InsertData(request)
    rdata = list(Survey.objects.all().values())
    # print(rdata) # [{'rnum': 1, 'gender': '남', 'age': 10, 'co_survey': '스타벅스'}, ... 
    
    df, crossTab, results = Analysis(rdata)
    
    # 시각화
    fig = plt.gcf()
    gen_group = df['co_survey'].groupby(df['coNum']).count()
    gen_group.index = ['스타벅스', '커피빈', '이디아', '탐앤탐스']
    gen_group.plot.bar(subplots=True, color=['red', 'blue'], width=0.5)
    plt.xlabel('커피샵')
    plt.ylabel('선호도 건수')
    plt.title('커피샵 별 선호건수')
    fig.savefig('django_use02_chi/coffeesurvey/static/images/co.png')
    
    return render(request, 'list.html', {'df':df.to_html(index=False), \
                                         'crossTab':crossTab.to_html(), 'results':results})

def InsertData(request): # 입력자료 DB에 저장
    if request.method == 'POST':
        Survey(
            gender = request.POST.get('gender'),
            age = request.POST.get('age'),
            co_survey = request.POST.get('co_survey')
            ).save()
            # 또는 SQL문을 직접 사용해도 무관.

def Analysis(rdata): # 분석
    df = pd.DataFrame(rdata)
    
    
    df.dropna() # 결측치 처리가 필요한 경우에는 진행. 이상치도 처리.
    df['genNum'] = df['gender'].apply(lambda g:1 if g =='남' else 2)
    df['coNum'] = df['co_survey'].apply(lambda c:1 if c =='스타벅스' \
        else 2 if c =='커피빈' else 3 if c =='이디아' else 4)
    print(df)
    '''
            rnum gender  age co_survey  genNum  coNum
    0      1      남   10      스타벅스       1      1
    1      2      여   20      스타벅스       2      1
    2      3      남   10      스타벅스       1      1
    3      4      남   10      스타벅스       1      1
    4      5      남   10      탐앤탐스       1      4
    5      6      남   40       커피빈       1      2
    6      7      남   10      스타벅스       1      1
    7      8      남   10      스타벅스       1      1
    8      9      남   10      스타벅스       1      1
    9     10      남   10      스타벅스       1      1
    10    11      남   10      스타벅스       1      1
    11    12      남   10      스타벅스       1      1
    '''
    
    # 교차 빈도표
    crossTab = pd.crosstab(index=df['gender'], columns=df['co_survey'])
    #crossTab = pd.crosstab(index=df['genNum'], columns=df['coNum'])
    print(crossTab)
    
    chi, pv, _, _ = stats.chi2_contingency(crossTab)
    
    if pv > 0.05:
        results = 'p값이 {}이므로 유의수준 0.05 <b>이상</b>의 값을 가지므로 <br> "
        +"성별에 따라 선호하는 커피브랜드에는 <b>차이가 없다.</b> (귀무가설 채택)'.format(pv)
    else:
        results = 'p값이 {}이므로 유의수준 0.05 <b>이하</b>의 값을 가지므로 <br> "
        +"성별에 따라 선호하는 커피브랜드에는 <b>차이가 있다.</b> (연구가설 채택)'.format(pv)
    
    return df, crossTab, results

* list.html

<body>
<h2>* 커피전문점에 대한 소비자 인식 조사 결과 *</h2>

<a href="/">메인화면</a><br>
<a href="/coffee/survey">다시 설문조사 하기</a><br>
{% if df %}
	{{df|safe}}
{% endif %}<br>

{% if crossTab %}
	{{crossTab|safe}}
{% endif %}<br>

{% if results %}
	{{results|safe}}
{% endif %}<br>
<img src="/static/images/co.png" width="400" />
</body>

'BACK END > Deep Learning' 카테고리의 다른 글

[딥러닝] 선형회귀 (0)	2021.03.10
[딥러닝] 공분산, 상관계수 (0)	2021.03.10
[딥러닝] 이항검정 (0)	2021.03.10
[딥러닝] ANOVA (0)	2021.03.08
[딥러닝] T 검정 (0)	2021.03.05

[Pandas] pandas 정리2 - db, django

2021. 3. 2. 15:23

local db

* db1.py

# sqlite : db 자료 -> DataFrame -> db

import sqlite3

sql = "create table if not exists test(product varchar(10), maker varchar(10), weight real, price integer)"

conn = sqlite3.connect(':memory:')
#conn = sqlite3.connect('mydb.db')
conn.execute(sql)

data = [('mouse', 'samsung', 12.5, 6000), ('keyboard', 'lg', 502.0, 86000)]
stmt = "insert into test values(?, ?, ?, ?)"
conn.executemany(stmt, data)

data1 = ('연필', '모나미', 3.5, 500)
conn.execute(stmt, data1)
conn.commit()

cursor = conn.execute("select * from test")
rows = cursor.fetchall()
for a in rows:
    print(a)
print()

# DataFrame에 저장 1 - cursor.fetchall() 이용
import pandas as pd
#df1 = pd.DataFrame(rows, columns = ['product', 'maker', 'weight', 'price'])
print(*cursor.description)
df1 = pd.DataFrame(rows, columns = list(zip(*cursor.description))[0])
print(df1)

# DataFrame에 저장 2 - pd.read_sql() 이용
df2 = pd.read_sql("select * from test", conn)
print(df2)
print()

print(df2.to_html())
print()

# DataFrame의 자료를 DB로 저장
data = {
    'irum':['신선해', '신기해', '신기한'],
    'nai':[22, 25, 27]
}
frame = pd.DataFrame(data)
print(frame)
print()

conn = sqlite3.connect('test.db')
frame.to_sql('mytable', conn, if_exists = 'append', index = False)
df3 = pd.read_sql("select * from mytable", conn)
print(df3)

cursor.close()
conn.close()

교차 테이블(교차표) - 행과 열로 구성된 교차표로 결과(빈도수)를 요약

* cross_test

import pandas as pd

y_true = pd.Series([2, 0, 2, 2, 0, 1, 1, 2, 2, 0, 1, 2])
y_pred = pd.Series([0, 0, 2, 1, 0, 2, 1, 0, 2, 0, 2, 2])

result = pd.crosstab(y_true, y_pred, rownames=['True'], colnames=['Predicted'], margins=True)
print(result)
'''
Predicted  0  1  2  All
True                   
0          3  0  0    3
1          0  1  2    3
2          2  1  3    6
All        5  2  5   12
'''

# 인구통계 dataset 읽기
des = pd.read_csv('https://raw.githubusercontent.com/pykwon/python/master/testdata_utf8/descriptive.csv') 
print(des.info())

# 5개 칼럼만 선택하여 data frame 생성 
data = des[['resident','gender','age','level','pass']]
print(data[:5])

# 지역과 성별 칼럼 교차테이블 
table = pd.crosstab(data.resident, data.gender)
print(table)

# 지역과 성별 칼럼 기준 - 학력수준 교차테이블 
table = pd.crosstab([data.resident, data.gender], data.level)
print(table)

원격 DB 연동

* db2_remote.py

import MySQLdb
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
plt.rc('font', family='malgun gothic')

import csv
import ast
import sys

try:
    with open('mariadb.txt', 'r') as f:
        config = f.read()
except Exception as e:
    print('read err :', e)
    sys.exit()
    
config = ast.literal_eval(config)
print(config)
# {'host': '127.0.0.1', 'user': 'root', 'password': '123', 'database': 'test', 
# 'port': 3306, 'charset': 'utf8', 'use_unicode': True}

try:
    conn = MySQLdb.connect(**config)
    cursor = conn.cursor()
    sql = """
    select jikwon_no, jikwon_name, jikwon_jik, buser_name, jikwon_gen, jikwon_pay
    from jikwon inner join buser
    on jikwon.buser_num = buser.buser_no
    """
    cursor.execute(sql)
    
    for (jikwon_no, jikwon_name, jikwon_jik, buser_name, jikwon_gen, jikwon_pay) in cursor:
        print(jikwon_no, jikwon_name, jikwon_jik, buser_name, jikwon_gen, jikwon_pay)

    # jikwon.csv 파일로 저장
    with open('jikwon.csv', 'w', encoding='utf-8') as fw:
        writer = csv.writer(fw)
        for row in cursor:
            writer.writerow(row)
        print('저장성공')

    # csv 파일 읽기 1
    df1 = pd.read_csv('jikwon.csv', header=None, names = ('번호', '이름', '직급', '부서', '성별', '연봉'))
    print(df1.head(3))
    print(df1.shape)  # (30, 6)

    # csv 파일 읽기 2
    df2 = pd.read_sql(sql, conn)
    df2.columns = ('번호', '이름', '직급', '부서', '성별', '연봉')
    print(df2.head(3))
    '''
           번호   이름  직급   부서 성별    연봉
    0   1  홍길동  이사  총무부  남  9900
    1   2  한송이  부장  영업부  여  8800
    2   3  이순신  과장  영업부  남  7900
    '''

    print('건수 :', len(df2))
    print('건수 :', df2['이름'].count()) # 건수 : 30
    print()
    
    print('직급별 인원 수 :\n', df2['직급'].value_counts())
    print()
    
    print('연봉 평균 :\n', df2.loc[:,'연봉'].sum() / len(df2))
    print('연봉 평균 :\n', df2.loc[:,'연봉'].mean())
    print()
    
    print('연봉 요약 통계 :\n', df2.loc[:,'연봉'].describe())
    print()
    
    print('연봉이 8000이상 : \n', df2.loc[df2['연봉'] >= 8000])
    print()
    
    print('연봉이 5000이상인 영업부 : \n', df2.loc[(df2['연봉'] >= 5000) & (df2['부서'] == '영업부')])
    print()
    
    print('* crosstab')
    ctab = pd.crosstab(df2['성별'], df2['직급'], margins=True)
    print(ctab)
    print()
    
    print('* groupby')
    print(df2.groupby(['성별', '직급'])['이름'].count())
    print()
    
    print('* pivot table')
    print(df2.pivot_table(['연봉'], index=['성별'], columns=['직급'], aggfunc = np.mean))
    print()

    # 시각화 - pie 차트
    # 직급별 연봉 평균
    jik_ypay = df2.groupby(['직급'])['연봉'].mean()
    print(jik_ypay, type(jik_ypay)) # Series
    print(jik_ypay.index)
    print(jik_ypay.values)
    
    plt.pie(jik_ypay,
            labels=jik_ypay.index, 
            labeldistance=0.5,
            counterclock=False,
            shadow=True,
            explode=(0.2, 0, 0, 0.3, 0))
    plt.show()
    
except Exception as e:
    print('process err :', e)
    
finally:
    cursor.close()
    conn.close()

# DataFrame의 자료를 DB로 저장
data = {
    'irum':['tom', 'james', 'john'],
    'nai':[22, 25, 27]
}
frame = pd.DataFrame(data)
print(frame)
print()

# pip install sqlalchemy
# pip install pymysql
from sqlalchemy import create_engine
import pymysql # MySQL Connector using pymysql

pymysql.install_as_MySQLdb()
engine = create_engine("mysql+mysqldb://root:"+"123"+"@Localhost/test", encoding='utf-8')
conn = engine.connect()

# MySQL에 저장하기
# pandas의 to_sql 함수 사용 저장
frame.to_sql(name='mytable', con = engine, if_exists = 'append', index = False)
df3 = pd.read_sql("select * from mytable", conn)
print(df3)

Django

PyDev Django Project 생성

* Django - Create application - myjikwonapp

= django_use01

* settings

...

INSTALLED_APPS = [
    
    ...
    
    'myjikwonapp',
]

...

DATABASES = {
    'default': {
        'ENGINE': 'django.db.backends.mysql', 
        'NAME': 'test',                # DB명 : db는 미리 작성되어 있어야 함.       
        'USER': 'root',                # 계정명 
        'PASSWORD': '123',             # 계정 암호           
        'HOST': '127.0.0.1',           # DB가 설치된 컴의 ip          
        'PORT': '3306',                # DBMS의 port 번호     
    }
}

* anaconda prompt

cd C:\work\psou\django_use01
python manage.py inspectdb > aaa.py

* models

from django.db import models

# Create your models here.
class Jikwon(models.Model):
    jikwon_no = models.IntegerField(primary_key=True)
    jikwon_name = models.CharField(max_length=10)
    buser_num = models.IntegerField()
    jikwon_jik = models.CharField(max_length=10, blank=True, null=True)
    jikwon_pay = models.IntegerField(blank=True, null=True)
    jikwon_ibsail = models.DateField(blank=True, null=True)
    jikwon_gen = models.CharField(max_length=4, blank=True, null=True)
    jikwon_rating = models.CharField(max_length=3, blank=True, null=True)

    class Meta:
        managed = False
        db_table = 'jikwon'

* Django - Create Migrations - myjikwonapp

* Django - Migrate

* urls

from django.contrib import admin
from django.urls import path
from myjikwonapp import views

urlpatterns = [
    path('admin/', admin.site.urls),
    path('', views.MainFunc),
    path('showdata', views.ShowFunc),
]

* views

from django.shortcuts import render
from myjikwonapp.models import Jikwon
import pandas as pd
import matplotlib.pyplot as plt
plt.rc('font', family='malgun gothic')

# Create your views here.
def MainFunc(request):
    return render(request, 'main.html')

def ShowFunc(request):
    #datas = Jikwon.objects.all()         # jikwon table의 모든 데이터 조회
    datas = Jikwon.objects.all().values() # dict type
    #print(datas)                         # <QuerySet [{'jikwon_no': 1, 'jikwon_name': '홍길동', ...
    
    pd.set_option('display.max_columns', 500) # width : 500
    df = pd.DataFrame(datas)
    df.columns = ['사번', '직원명','부서코드', '직급', '연봉', '입사일', '성별', '평점']
    #print(df)
    '''
            사번  직원명  부서코드  직급    연봉         입사일 성별 평점
    0    1  홍길동    10  이사  9900  2008-09-01  남  a
    1    2  한송이    20  부장  8800  2010-01-03  여  b
    2    3  이순신    20  과장  7900  2010-03-03  남  b
    3    4  이미라    30  대리  4500  2014-01-04  여  b
    4    5  이순라    20  사원  3000  2017-08-05  여  b
    '''
    
    # 부서별 급여 합/평균
    buser_group = df['연봉'].groupby(df['부서코드'])
    buser_group_detail = {'sum':buser_group.sum(), 'avg':buser_group.mean()}
    #print(buser_group_detail)
    '''
    {'sum': 부서코드
    10    37900
    20    58900
    30    37300
    40    25050,
    'avg': 부서코드
    10    5414.285714
    20    4908.333333
    30    5328.571429
    40    6262.500000
    }
    '''
    
    # 차트를 이미지로 저장
    bu_result = buser_group.agg(['sum', 'mean'])
    bu_result.plot.bar()
    #bu_result.plot(kind='bar')
    plt.title("부서별 급여 합/평균")
    fig = plt.gcf()
    fig.savefig('django_use01/myjikwonapp/static/images/jik.png')
    
    return render(request, 'show.html', {'msg':'직원정보', 'datas':df.to_html(), 'buser_group':buser_group_detail})

* main.html

<body>
 <h2>메인</h2>
 <a href="showdata">직원정보</a>
</body>

* show.html

<body>
  <h2>{{msg}} (DB -> pandas 이용)</h2>
  {% if datas %}
  {{datas|safe}}
  {% endif %}
  <hr>
  <h2>부서별 급여합</h2>
  총무부 : {{buser_group.sum.10}}<br>
  영업부 : {{buser_group.sum.20}}<br>
  전산부 : {{buser_group.sum.30}}<br>
  관리부 : {{buser_group.sum.40}}<br><br>
  <h2>부서별 급여평균</h2>
  총무부 : {{buser_group.avg.10}}<br>
  영업부 : {{buser_group.avg.20}}<br>
  전산부 : {{buser_group.avg.30}}<br>
  관리부 : {{buser_group.avg.40}}<br><br>
  <img alt="사진" src="/static/images/jik.png" title="차트 1-1">
</body>

* desc_stat

# 기술 통계
'''
기술통계(descriptive statistics)란 수집한 데이터의 특성을 표현하고 요약하는 통계 기법이다. 
기술통계는 샘플(전체 자료일수도 있다)이 있으면, 그 자료들에 대해  수치적으로 요약정보를 표현하거나, 
데이터 시각화를 한다. 
즉, 자료의 특징을 파악하는 관점으로 보면 된다. 평균, 분산, 표준편차 등이 기술통계에 속한다.
'''

# 도수 분포표
import pandas as pd

frame = pd.read_csv('../testdata/ex_studentlist.csv')
print(frame.head(2))
print(frame.info())
'''
  name sex  age  grade absence bloodtype  height  weight
0  김길동  남자   23      3       유         O   165.3    68.2
1  이미린  여자   22      2       무        AB   170.1    53.0
'''
print('나이\t:', frame['age'].mean())
print('나이\t:', frame['age'].var())
print('나이\t:', frame['age'].std())
print('혈액형\t:', frame['bloodtype'].unique())
print(frame.describe().T)
print()

# 혈액형별 인원수
data1 = frame.groupby(['bloodtype'])['bloodtype'].count()
print('혈액형별 인원수 : ', data1)
'''
혈액형별 인원수 :  bloodtype
A     3
AB    3
B     4
O     5
'''
print()

data2 = pd.crosstab(index = frame['bloodtype'], columns = 'count')
print('혈액형별 인원수 : ', data2)
'''
혈액형별 인원수 :  col_0      count
bloodtype       
A              3
AB             3
B              4
O              5
'''
print()

# 성별, 혈액형별 인원수
data3 = pd.crosstab(index = frame['bloodtype'], columns = frame['sex'])
data3 = pd.crosstab(index = frame['bloodtype'], columns = frame['sex'], margins=True) # 소계
data3.columns = ['남', '여', '행합']
data3.index = ['A', 'AB', 'B', 'O', '열합']
print('성별, 혈액형별 인원수 : ', data3)
'''
성별, 혈액형별 인원수 :      남  여  행합
A   1  2   3
AB  2  1   3
B   3  1   4
O   2  3   5
열합  8  7  15
'''
print()

print(data3/data3.loc['열합','행합'])
print()
'''
           남         여        행합
A   0.066667  0.133333  0.200000
AB  0.133333  0.066667  0.200000
B   0.200000  0.066667  0.266667
O   0.133333  0.200000  0.333333
열합  0.533333  0.466667  1.000000
'''

'BACK END > Python Library' 카테고리의 다른 글

[LINUX] 리눅스 - 명령어 /eclipse /FlashPlayer /DB /apache /R /anaconda /hadoop (0)	2021.03.26
[MatPlotLib] matplotlib 정리 (0)	2021.03.02
[Pandas] pandas 정리 (0)	2021.02.24
[NumPy] numpy 정리 (0)	2021.02.23

[MatPlotLib] matplotlib 정리

2021. 3. 2. 10:35

matplotlib : ploting library. 그래프 생성을 위한 다양한 함수 지원.

* mat1.py

import numpy as np
import matplotlib.pyplot as plt

plt.rc('font', family = 'malgun gothic')   # 한글 깨짐 방지
plt.rcParams['axes.unicode_minus'] = False # 음수 깨짐 방지

x = ["서울", "인천", "수원"]
y = [5, 3, 7]
# tuple 가능. set 불가능.

plt.xlabel("지역")                 # x축 라벨
plt.xlim([-1, 3])                 # x축 범위
plt.ylim([0, 10])                 # y축 범위
plt.yticks(list(range(0, 11, 3))) # y축 칸 number 지정
plt.plot(x, y)                    # 그래프 생성
plt.show()                        # 그래프 출력

data = np.arange(1, 11, 2)
print(data)    # [1 3 5 7 9]
plt.plot(data) # 그래프 생성
x = [0, 1, 2, 3, 4]
for a, b in zip(x, data):
    plt.text(a, b, str(b)) # 그래프 라인에 text 표시
plt.show()     # 그래프 출력

plt.plot(data)
plt.plot(data, data, 'r')
plt.show()

x = np.arange(10)
y = np.sin(x)
print(x, y)
plt.plot(x, y, 'bo')  # 파란색 o로 표시. style 옵션 적용.
plt.plot(x, y, 'r+')  # 빨간색 +로 표시
plt.plot(x, y, 'go--', linewidth = 2, markersize = 10) # 초록색 o 표시
plt.show()

# 홀드 : 그림 겹쳐 보기
x = np.arange(0, np.pi * 3, 0.1)
print(x)
y_sin = np.sin(x)
y_cos = np.cos(x)
plt.plot(x, y_sin, 'r')     # 직선, 곡선
plt.scatter(x, y_cos)       # 산포도
#plt.plot(x, y_cos, 'b')
plt.xlabel('x축')
plt.ylabel('y축')
plt.legend(['사인', '코사인']) # 범례
plt.title('차트 제목')         # 차트 제목
plt.show()

# subplot : Figure를 여러 행열로 분리
x = np.arange(0, np.pi * 3, 0.1)
y_sin = np.sin(x)
y_cos = np.cos(x)

plt.subplot(2, 1, 1)
plt.plot(x, y_sin)
plt.title('사인')

plt.subplot(2, 1, 2)
plt.plot(x, y_cos)
plt.title('코사인')
plt.show()

irum = ['a', 'b', 'c', 'd', 'e']
kor = [80, 50, 70, 70, 90]
eng = [60, 70, 80, 70, 60]
plt.plot(irum, kor, 'ro-')
plt.plot(irum, eng, 'gs--')
plt.ylim([0, 100])
plt.legend(['국어', '영어'], loc = 4)
plt.grid(True)          # grid 추가

fig = plt.gcf()         # 이미지 저장 선언
plt.show()              # 이미지 출력
fig.savefig('test.png') # 이미지 저장

from matplotlib.pyplot import imread
img = imread('test.png') # 이미지 read
plt.imshow(img)          # 이미지 출력
plt.show()

차트의 종류 결정

* mat2.py

import numpy as np
import matplotlib.pyplot as plt

fig = plt.figure()             # 명시적으로 차트 영역 객체 선언
ax1 = fig.add_subplot(1, 2, 1) # 1행 2열 중 1열에 표기
ax2 = fig.add_subplot(1, 2, 2) # 1행 2열 중 2열에 표기

ax1.hist(np.random.randn(10), bins=10, alpha=0.9) # 히스토그램 출력 - bins : 구간 수, alpha: 투명도
ax2.plot(np.random.randn(10)) # plot 출력
plt.show()

fig, ax = plt.subplots(nrows=2, ncols=1)
ax[0].plot(np.random.randn(10))
ax[1].plot(np.random.randn(10) + 20)
plt.show()

data = [50, 80, 100, 70, 90]
plt.bar(range(len(data)), data)
plt.barh(range(len(data)), data)
plt.show()

data = [50, 80, 100, 70, 90]
err = np.random.randn(len(data))
plt.barh(range(len(data)), data, xerr=err, alpha = 0.6)
plt.show()

plt.pie(data, explode=(0, 0.2, 0, 0, 0), colors = ['yellow', 'blue', 'red'])
plt.show()

n = 30
np.random.seed(42)
x = np.random.rand(n)
y = np.random.rand(n)
color = np.random.rand(n)
scale = np.pi * (15 * np.random.rand(n)) ** 2
plt.scatter(x, y, s = scale, c = color)
plt.show()

import pandas as pd
sdata = pd.Series(np.random.rand(10).cumsum(), index = np.arange(0, 100, 10))
plt.plot(sdata)
plt.show()

# DataFrame
fdata = pd.DataFrame(np.random.randn(1000, 4), index = pd.date_range('1/1/2000', periods = 1000),\
                     columns = list('ABCD'))
print(fdata)
fdata = fdata.cumsum()
plt.plot(fdata)
plt.show()

seaborn : matplotlib 라이브러리의 기능을 보완하기 위해 사용

* mat3.py

import matplotlib.pyplot as plt
import seaborn as sns

titanic = sns.load_dataset("titanic")
print(titanic.info())
print(titanic.head(3))

age = titanic['age']
sns.kdeplot(age) # 밀도
plt.show()

sns.distplot(age) # kdeplot + hist
plt.show()

import matplotlib.pyplot as plt
import seaborn as sns

titanic = sns.load_dataset("titanic")
print(titanic.info())
print(titanic.head(3))

age = titanic['age']
sns.kdeplot(age) # 밀도
plt.show()

sns.relplot(x = 'who', y = 'age', data=titanic)
plt.show()

sns.countplot(x = 'class', data=titanic, hue='who') # hue : 카테고리
plt.show()

t_pivot = titanic.pivot_table(index='class', columns='age', aggfunc='size')
print(t_pivot)
sns.heatmap(t_pivot, cmap=sns.light_palette('gray', as_cmap=True), fmt = 'd', annot=True)
plt.show()

import pandas as pd
iris_data = pd.read_csv('../testdata/iris.csv')
print(iris_data.info())
print(iris_data.head(3))
'''
   Sepal.Length  Sepal.Width  Petal.Length  Petal.Width Species
0           5.1          3.5           1.4          0.2  setosa
1           4.9          3.0           1.4          0.2  setosa
2           4.7          3.2           1.3          0.2  setosa
'''

plt.scatter(iris_data['Sepal.Length'], iris_data['Petal.Length'])
plt.xlabel('Sepal.Length')
plt.ylabel('Petal.Length')
plt.show()

cols = []
for s in iris_data['Species']:
    choice = 0
    if s == 'setosa' : choice = 1
    elif s == 'versicolor' : choice = 2
    else: choice = 3
    cols.append(choice)
    
plt.scatter(iris_data['Sepal.Length'], iris_data['Petal.Length'], c = cols)
plt.xlabel('Sepal.Length')
plt.ylabel('Petal.Length')
plt.show()

# pandas의 시각화 기능
print(type(iris_data)) # DataFrame
from pandas.plotting import scatter_matrix 
#scatter_matrix(iris_data, diagonal='hist')
scatter_matrix(iris_data, diagonal='kde')
plt.show()

# seabon
sns.pairplot(iris_data, hue = 'Species', height = 1)
plt.show()

x = iris_data['Sepal.Length'].values
sns.rugplot(x)
plt.show()

sns.kdeplot(x)
plt.show()

# pandas의 시각화 기능
import numpy as np 

df = pd.DataFrame(np.random.randn(10, 3), index = pd.date_range('1/1/2000', periods=10), columns=['a', 'b', 'c'])
print(df)

#df.plot() # 꺽은선
#df.plot(kind= 'bar')
df.plot(kind= 'box')
plt.xlabel('time')
plt.ylabel('data')
plt.show()

df[:5].plot.bar(rot=0)
plt.show()

* mat4.py

import pandas as pd

tips = pd.read_csv('../testdata/tips.csv')
print(tips.info())
print(tips.head(3))
'''
   total_bill   tip     sex smoker  day    time  size
0       16.99  1.01  Female     No  Sun  Dinner     2
1       10.34  1.66    Male     No  Sun  Dinner     3
2       21.01  3.50    Male     No  Sun  Dinner     3
'''

tips['gender'] = tips['sex'] # sex 칼럼을 gender 칼럼으로 변경
del tips['sex']
print(tips.head(3))
'''
   total_bill   tip smoker  day    time  size  gender
0       16.99  1.01     No  Sun  Dinner     2  Female
1       10.34  1.66     No  Sun  Dinner     3    Male
2       21.01  3.50     No  Sun  Dinner     3    Male
'''

# tip 비율 추가
tips['tip_pct'] = tips['tip'] / tips['total_bill']
print(tips.head(3))
'''
   total_bill   tip smoker  day    time  size  gender   tip_pct
0       16.99  1.01     No  Sun  Dinner     2  Female  0.059447
1       10.34  1.66     No  Sun  Dinner     3    Male  0.160542
2       21.01  3.50     No  Sun  Dinner     3    Male  0.166587
'''

tip_pct_group = tips['tip_pct'].groupby([tips['gender'], tips['smoker']]) # 성별, 흡연자별 그륩화
print(tip_pct_group)
print(tip_pct_group.sum())
print(tip_pct_group.max())
print(tip_pct_group.min())

result = tip_pct_group.describe()
print(result)
'''
               count      mean       std  ...       50%       75%       max
gender smoker                             ...                              
Female No       54.0  0.156921  0.036421  ...  0.149691  0.181630  0.252672
       Yes      33.0  0.182150  0.071595  ...  0.173913  0.198216  0.416667
Male   No       97.0  0.160669  0.041849  ...  0.157604  0.186220  0.291990
       Yes      60.0  0.152771  0.090588  ...  0.141015  0.191697  0.710345

print(tip_pct_group.agg('sum'))
print(tip_pct_group.agg('mean'))
print(tip_pct_group.agg('max'))
print(tip_pct_group.agg('min'))

def diffFunc(group):
    diff = group.max() - group.min()
    return diff

result2 = tip_pct_group.agg(['var', 'mean', 'max', diffFunc])
print(result2)
'''
                    var      mean       max  diffFunc
gender smoker                                        
Female No      0.001327  0.156921  0.252672  0.195876
       Yes     0.005126  0.182150  0.416667  0.360233
Male   No      0.001751  0.160669  0.291990  0.220186
       Yes     0.008206  0.152771  0.710345  0.674707
'''

import matplotlib.pyplot as plt
result2.plot(kind = 'barh', title = 'aff fund', stacked = True)
plt.show()

'BACK END > Python Library' 카테고리의 다른 글

[LINUX] 리눅스 - 명령어 /eclipse /FlashPlayer /DB /apache /R /anaconda /hadoop (0)	2021.03.26
[Pandas] pandas 정리2 - db, django (0)	2021.03.02
[Pandas] pandas 정리 (0)	2021.02.24
[NumPy] numpy 정리 (0)	2021.02.23

[Pandas] pandas 정리

2021. 2. 24. 10:00

pandas : 고수준의 자료구조(Series, DataFrame)를 지원

축약연산, 누락된 데이터 처리, Sql Query, 데이터 조작, 인덱싱, 시각화 .. 등 다양한 기능을 제공.

1. Series : 일련의 데이터를 기억할 수 있는 1차원 배열과 같은 자료 구조로 명시적인 색인을 갖는다.

* pdex1.py

Series(순서가 있는 자료형) : series 생성

from pandas import Series
import numpy as np
obj = Series([3, 7, -5, 4])     # int
obj = Series([3, 7, -5, '4'])  # string - 전체가 string으로 변환
obj = Series([3, 7, -5, 4.5])  # float  - 전체가 float으로 변환
obj = Series((3, 7, -5, 4))    # tuple  - 순서가 있는 자료형 사용 가능
obj = Series({3, 7, -5, 4})    # set    - 순서가 없어 error 발생
print(obj, type(obj))
'''
0    3
1    7
2   -5
3    4
dtype: int64 <class 'pandas.core.series.Series'>
'''

Series( , index = ) : 색인 지정

obj2 = Series([3, 7, -5, 4], index = ['a', 'b', 'c', 'd']) # index : 색인 지정
print(obj2)
'''
a    3
b    7
c   -5
d    4
dtype: int64
'''

np.sum(obj2) = obj2.sum()

print(sum(obj2), np.sum(obj2), obj2.sum()) # pandas는 numpy 함수를 기본적으로 계승해서 사용.
'''
9 9 9
'''

obj.values : value를 list형으로 리턴

obj.index : index를 object형으로 리턴

print(obj2.values) # 값만 출력
print(obj2.index)  # index만 출력
'''
[ 3  7 -5  4]
Index(['a', 'b', 'c', 'd'], dtype='object')
'''

슬라이싱

'''
a    3
b    7
c   -5
d    4
'''
print(obj2['a'], obj2[['a']])
# 3 a    3

print(obj2[['a', 'b']])
'''
a    3
b    7
'''
print(obj2['a':'c'])
'''
a    3
b    7
c   -5
'''
print(obj2[2]) # -5
print(obj2[1:4])
'''
b    7
c   -5
d    4
'''
print(obj2[[2,1]])
'''
c   -5
b    7
'''
print(obj2 > 0)
'''
a     True
b     True
c    False
d     True
'''
print('a' in obj2)
# True

dict type으로 Series 생성

names = {'mouse':5000, 'keyboard':25000, 'monitor':55000}
print(names)
# {'mouse': 5000, 'keyboard': 25000, 'monitor': 55000}

obj3 = Series(names)
print(obj3, ' ', type(obj3))
'''
mouse        5000
keyboard    25000
monitor     55000
dtype: int64   <class 'pandas.core.series.Series'>
'''
print(obj3['mouse'])
# 5000

obj3.name = '상품가격' # Series 객체에 이름 부여
print(obj3)
'''
mouse        5000
keyboard    25000
monitor     55000
Name: 상품가격, dtype: int64
'''

DataFrame : 표 모양(2차원)의 자료 구조. Series가 여러개 합쳐진 형태. 각 칼럼마다 type이 다를 수 있다.

from pandas import DataFrame
df = DataFrame(obj3) # Series로 DataFrame 생성
print(df)
'''
           상품가격
mouse      5000
keyboard  25000
monitor   55000
'''
data = {
    'irum':['홍길동', '한국인', '신기해', '공기밥', '한가해'],
    'juso':['역삼동', '신당동', '역삼동', '역삼동', '신사동'],
    'nai':[23, 25, 33, 30, 35]
    }
print(data, type(data))
'''
{'irum': ['홍길동', '한국인', '신기해', '공기밥', '한가해'],
 'juso': ['역삼동', '신당동', '역삼동', '역삼동', '신사동'],
 'nai': [23, 25, 33, 30, 35]}
 <class 'dict'>
'''
frame = DataFrame(data) # dict로 DataFrame 생성
print(frame)
'''
  irum juso  nai
0  홍길동  역삼동   23
1  한국인  신당동   25
2  신기해  역삼동   33
3  공기밥  역삼동   30
4  한가해  신사동   35
'''
print(frame['irum']) # 칼럼을 dict 형식으로 접근
'''
0    홍길동
1    한국인
2    신기해
3    공기밥
4    한가해
'''
print(frame.irum, ' ', type(frame.irum))    # 칼럼을 속성 형식으로 접근
'''
0    홍길동
1    한국인
2    신기해
3    공기밥
4    한가해
Name: irum, dtype: object
<class 'pandas.core.series.Series'>
'''
print(DataFrame(data, columns=['juso', 'irum', 'nai'])) # 칼럼의 순서변경
'''
  juso irum  nai
0  역삼동  홍길동   23
1  신당동  한국인   25
2  역삼동  신기해   33
3  역삼동  공기밥   30
4  신사동  한가해   35
'''
print()
frame2 = DataFrame(data, columns=['juso', 'irum', 'nai', 'tel'], index = ['a', 'b', 'c', 'd', 'e'])
print(frame2)
'''
  juso irum  nai  tel
a  역삼동  홍길동   23  NaN
b  신당동  한국인   25  NaN
c  역삼동  신기해   33  NaN
d  역삼동  공기밥   30  NaN
e  신사동  한가해   35  NaN
'''
frame2['tel'] = '111-1111' # tel 칼럼의 모든행에 적용
print(frame2)
'''
  juso irum  nai       tel
a  역삼동  홍길동   23  111-1111
b  신당동  한국인   25  111-1111
c  역삼동  신기해   33  111-1111
d  역삼동  공기밥   30  111-1111
e  신사동  한가해   35  111-1111
'''
print()
val = Series(['222-2222', '333-2222', '444-2222'], index = ['b', 'c', 'e'])
frame2['tel'] = val
print(frame2)
'''
  juso irum  nai       tel
a  역삼동  홍길동   23       NaN
b  신당동  한국인   25  222-2222
c  역삼동  신기해   33  333-2222
d  역삼동  공기밥   30       NaN
e  신사동  한가해   35  444-2222

'''
print()
print(frame2.T) # 행과 열 swap
'''
        a         b         c    d         e
juso  역삼동       신당동       역삼동  역삼동       신사동
irum  홍길동       한국인       신기해  공기밥       한가해
nai    23        25        33   30        35
tel   NaN  222-2222  333-2222  NaN  444-2222
'''
print(frame2.values)
'''
[['역삼동' '홍길동' 23 nan]
 ['신당동' '한국인' 25 '222-2222']
 ['역삼동' '신기해' 33 '333-2222']
 ['역삼동' '공기밥' 30 nan]
 ['신사동' '한가해' 35 '444-2222']]
'''
print(frame2.values[0, 1]) # 0행 1열. 홍길동
print(frame2.values[0:2])  # 0 ~ 1행
'''
[['역삼동' '홍길동' 23 nan]
 ['신당동' '한국인' 25 '222-2222']]
'''

행/열 삭제

#frame3 = frame2.drop('d')         # index가 d인 행 삭제
frame3 = frame2.drop('d', axis=0) # index가 d인 행 삭제
print(frame3)
'''
  juso irum  nai       tel
a  역삼동  홍길동   23       NaN
b  신당동  한국인   25  222-2222
c  역삼동  신기해   33  333-2222
e  신사동  한가해   35  444-2222
'''
frame3 = frame2.drop('tel', axis = 1) # index가 tel인 열 삭제
print(frame3)
'''
  juso irum  nai
a  역삼동  홍길동   23
b  신당동  한국인   25
c  역삼동  신기해   33
d  역삼동  공기밥   30
e  신사동  한가해   35
'''

정렬

print(frame2.sort_index(axis=0, ascending=False)) # 행 단위. 내림차순
'''
  juso irum  nai       tel
e  신사동  한가해   35  444-2222
d  역삼동  공기밥   30       NaN
c  역삼동  신기해   33  333-2222
b  신당동  한국인   25  222-2222
a  역삼동  홍길동   23       NaN
'''
print(frame2.sort_index(axis=1, ascending=True)) # 열 단위. 오름차순
'''
  irum juso  nai       tel
a  홍길동  역삼동   23       NaN
b  한국인  신당동   25  222-2222
c  신기해  역삼동   33  333-2222
d  공기밥  역삼동   30       NaN
e  한가해  신사동   35  444-2222
'''
print(frame2.rank(axis=0)) # 행 단위. 사전 순위로 칼럼 값 순서를 매김
'''
   juso  irum  nai  tel
a   4.0   5.0  1.0  NaN
b   1.0   4.0  2.0  1.0
c   4.0   2.0  4.0  2.0
d   4.0   1.0  3.0  NaN
e   2.0   3.0  5.0  3.0
'''
print(frame2['juso'].value_counts()) # 칼럼의 개수
'''
역삼동    3
신사동    1
신당동    1
'''

문자열 자르기

data = {
    'juso':['강남구 역삼동', '중구 신당동', '강남구 대치동'],
    'inwon':[22,23,24]
}
fr = DataFrame(data)
print(fr)
'''
      juso  inwon
0  강남구 역삼동     22
1   중구 신당동     23
2  강남구 대치동     24
'''
result1 = Series([x.split()[0] for x in fr.juso])
result2 = Series([x.split()[1] for x in fr.juso])
print(result1)
'''
0    강남구
1     중구
2    강남구
'''
print(result2)
'''
0    역삼동
1    신당동
2    대치동
'''
print(result1.value_counts())
'''
강남구    2
중구     1
'''

재 색인, NaN, bool처리, 슬라이싱 관련 메소드, 연산

* pdex2.py

Series 재 색인

data = Series([1,3,2], index = (1,4,2))
print(data)
'''
1    1
4    3
2    2
'''

data2 = data.reindex((1,2,4)) # 해당 index 순서로 정렬
print(data2)
'''
1    1
2    2
4    3
'''

재색인 시 값 채워 넣기

data3 = data2.reindex([0,1,2,3,4,5]) # 대응 값이 없는 인덱스는 NaN(결측값)이 됨.
print(data3)
'''
0    NaN
1    1.0
2    2.0
3    NaN
4    3.0
5    NaN
'''

data3 = data2.reindex([0,1,2,3,4,5], fill_value = 333) # 대응 값이 없는 인덱스는 fill_value으로 대입.
print(data3)
'''
0    333
1      1
2      2
3    333
4      3
5    333
'''

data3 = data2.reindex([0,1,2,3,4,5], method='ffill') # 대응 값이 없는 인덱스는 이전 index의 값으로 대입.
print(data3)
data3 = data2.reindex([0,1,2,3,4,5], method='pad') # 대응 값이 없는 인덱스는 이전 index의 값으로 대입.
print(data3)
'''
0    NaN
1    1.0
2    2.0
3    2.0
4    3.0
5    3.0
'''

data3 = data2.reindex([0,1,2,3,4,5], method='bfill') # 대응 값이 없는 인덱스는 다음 index의 값으로 대입.
print(data3)
data3 = data2.reindex([0,1,2,3,4,5], method='backfill') # 대응 값이 없는 인덱스는 다음 index의 값으로 대입.
print(data3)
'''
0    1.0
1    1.0
2    2.0
3    3.0
4    3.0
5    NaN
'''

bool 처리 / 슬라이싱

df = DataFrame(np.arange(12).reshape(4,3), index=['1월', '2월', '3월', '4월'], columns=['강남', '강북', '서초'])
print(df)
'''
    강남  강북  서초
1월   0   1   2
2월   3   4   5
3월   6   7   8
4월   9  10  11
'''

print(df['강남'])
'''
1월    0
2월    3
3월    6
4월    9
'''

print(df['강남'] > 3 ) # True나 False 반환
'''
1월    False
2월    False
3월     True
4월     True
'''

print(df[df['강남'] > 3] ) # 강남이 3보다 큰 값 월의 값을 출력
'''
    강남  강북  서초
3월   6   7   8
4월   9  10  11
'''

df[df < 3] = 0 # 3보다 작으면 0 대입
print(df)
'''
    강남  강북  서초
1월   0   0   0
2월   3   4   5
3월   6   7   8
4월   9  10  11
'''

print(df.loc['3월', :]) # 복수 indexing. loc : 라벨 지원. iloc : 숫자 지원
print(df.loc['3월', ]) 
# 3월 행의 모든 값을 출력
'''
강남    6
강북    7
서초    8
'''
print(df.loc[:'2월']) # 2월행 이하의 모든 값 출력
'''
    강남  강북  서초
1월   0   0   0
2월   3   4   5
'''
print(df.loc[:'2월',['서초']]) # 2월 이하 서초 열만 출력
'''
    서초
1월   0
2월   5
'''
print(df.iloc[2]) # 2행의 모든 값 출력
print(df.iloc[2, :]) # 2행의 모든 값 출력
'''
강남    6
강북    7
서초    8
'''

print(df.iloc[:3]) # 3행 미만 행 모든 값 출력
'''
    강남  강북  서초
1월   0   0   0
2월   3   4   5
3월   6   7   8
'''
print(df.iloc[:3, 2]) # 3행 미만 행 2열 미만 출력
'''
1월    0
2월    5
3월    8
'''
print(df.iloc[:3, 1:3]) # 3행 미만 행 1~2열 출력
'''
    강북  서초
1월   0   0
2월   4   5
3월   7   8
'''

연산

s1 = Series([1,2,3], index = ['a', 'b', 'c'])
s2 = Series([4,5,6,7], index = ['a', 'b', 'd', 'c'])
print(s1)
'''
a    1
b    2
c    3
'''
print(s2)
'''
a    4
b    5
d    6
c    7
'''

print(s1 + s2) # 인덱스가 값은 경우만 연산, 불일치시 NaN
print(s1.add(s2))
'''
a     5.0
b     7.0
c    10.0
d     NaN
'''

df1 = DataFrame(np.arange(9).reshape(3,3), columns=list('kbs'), index=['서울', '대전', '부산'])
print(df1)
'''
    k  b  s
서울  0  1  2
대전  3  4  5
부산  6  7  8
'''

df2 = DataFrame(np.arange(12).reshape(4,3), columns=list('kbs'), index=['서울', '대전', '제주', '수원'])
print(df2)
'''
    k   b   s
서울  0   1   2
대전  3   4   5
제주  6   7   8
수원  9  10  11
'''

print(df1 + df2) # 대응되는 index만 연산
print(df1.add(df2))
'''
      k    b     s
대전  6.0  8.0  10.0
부산  NaN  NaN   NaN
서울  0.0  2.0   4.0
수원  NaN  NaN   NaN
제주  NaN  NaN   NaN
'''
print(df1.add(df2, fill_value = 0)) # 대응되지않는 값은 0으로 대입 후 연산
'''
      k     b     s
대전  6.0   8.0  10.0
부산  6.0   7.0   8.0
서울  0.0   2.0   4.0
수원  9.0  10.0  11.0
제주  6.0   7.0   8.0
'''

seri = df1.iloc[0] # 0행만 추출
print(seri)
'''
k    0
b    1
s    2
'''

print(df1)
'''
    k  b  s
서울  0  1  2
대전  3  4  5
부산  6  7  8
'''

print(df1 - seri)
'''
    k  b  s
서울  0  0  0
대전  3  3  3
부산  6  6  6
'''

함수

df = DataFrame([[1.4, np.nan], [7, -4.5], [np.NaN, np.NaN, None], [0.5, -1]]) # , index=['one', 'two', 'three', 'four']
print(df)
'''
     0    1     2
0  1.4  NaN  None
1  7.0 -4.5  None
2  NaN  NaN  None
3  0.5 -1.0  None
'''
print(df.isnull()) # null 여부 확인
'''
       0      1     2
0  False   True  True
1  False  False  True
2   True   True  True
3  False  False  True
'''
print(df.notnull())
'''
       0      1      2
0   True  False  False
1   True   True  False
2  False  False  False
3   True   True  False
'''
print(df.drop(1)) # 행 삭제
'''
     0    1     2
0  1.4  NaN  None
2  NaN  NaN  None
3  0.5 -1.0  None
'''
print()
print(df.dropna()) # na가 포함된 행 삭제
'''
Empty DataFrame
Columns: [0, 1, 2]
Index: []
'''
print()
print(df.dropna(how='any')) # na가 포함된 행 삭제
'''
Empty DataFrame
Columns: [0, 1, 2]
Index: []
'''
print(df.dropna(how='all')) # 모든 값이 na인 행 삭제
'''
     0    1     2
0  1.4  NaN  None
1  7.0 -4.5  None
3  0.5 -1.0  None
'''
print(df.dropna(axis='rows')) # na가 포함된 행 삭제
'''
Empty DataFrame
Columns: [0, 1, 2]
Index: []
'''
print(df.dropna(axis='columns')) # na가 포함된 열 삭제
'''
Empty DataFrame
Columns: []
Index: [0, 1, 2, 3]
'''
print(df.fillna(0)) # na를 0으로 채움
'''
     0    1  2
0  1.4  0.0  0
1  7.0 -4.5  0
2  0.0  0.0  0
3  0.5 -1.0  0
'''
#print(df.dropna(subset = ['one']))
print(df)
print('--------')
print(df.sum()) # 열 단위 합
print(df.sum(axis = 0)) # 열 단위 합
'''
0    8.9
1   -5.5
2    0.0
'''
print(df.sum(axis = 1)) # 행 단위 합
'''
0    1.4
1    2.5
2    0.0
3   -0.5
'''
print(df.mean(axis = 1)) # 행단위 평균
'''
0    1.40
1    1.25
2     NaN
3   -0.25
'''
print(df.mean(axis = 1, skipna= False)) # 행단위 평균
'''
0     NaN
1    1.25
2     NaN
3   -0.25
'''
print(df.mean(axis = 1, skipna= True)) # 행단위 평균
'''
0    1.40
1    1.25
2     NaN
3   -0.25
'''
print()
print(df.describe()) # 요약 통계량
'''
              0         1
count  3.000000  2.000000
mean   2.966667 -2.750000
std    3.521837  2.474874
min    0.500000 -4.500000
25%    0.950000 -3.625000
50%    1.400000 -2.750000
75%    4.200000 -1.875000
max    7.000000 -1.000000
'''
print(df.info()) # 구조확인
'''
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4 entries, 0 to 3
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   0       3 non-null      float64
 1   1       2 non-null      float64
 2   2       0 non-null      object 
dtypes: float64(2), object(1)
memory usage: 224.0+ bytes
None
'''
print()
words = Series(['봄', '여름', '가을', '봄'])
print(words.describe())
'''
count     4
unique    3
top       봄
freq      2
dtype: object
'''
#print(words.info()) #AttributeError: 'Series' object has no attribute 'info'

구조 : stack, unstack, cut, merge, concat, pivot

* pdex3.py

import pandas as pd
import numpy as np

df = pd.DataFrame(1000 + np.arange(6).reshape(2, 3), index=['대전', '서울'],\
                   columns = ['2019', '2020', '2021'])
print(df)
'''
    2019  2020  2021
대전  1000  1001  1002
서울  1003  1004  1005
'''
#print(df.info()) # 정보 보기
print()

df_row = df.stack() # 칼럼을 기준으로 stack 구조로 변경
print(df_row)
'''
대전  2019    1000
    2020    1001
    2021    1002
서울  2019    1003
    2020    1004
    2021    1005
'''
df_col = df_row.unstack()
print(df_col)
'''
    2019  2020  2021
대전  1000  1001  1002
서울  1003  1004  1005
'''

범주화

price = [10.3, 5.5, 7.8, 3.6] # data
cut = [3, 7, 9, 11] # 범주
result_cut = pd.cut(price, cut) # data를 범주 기준으로 나눠 범주화 진행. 
print(result_cut)
'''
[(9, 11], (3, 7], (7, 9], (3, 7]]
Categories (3, interval[int64]): [(3, 7] < (7, 9] < (9, 11]]
'''

print(pd.value_counts(result_cut)) # value의 수
'''
(3, 7]     2
(9, 11]    1
(7, 9]     1
'''
print()

datas = pd.Series(np.arange(1, 1001))
print(datas.head(3))
'''
0    1
1    2
2    3
'''
print(datas.tail(4))
'''
996     997
997     998
998     999
999    1000
'''
result_cut2 = pd.cut(datas, 3)
print(result_cut2)
'''
0       (0.001, 334.0]
1       (0.001, 334.0]
2       (0.001, 334.0]
3       (0.001, 334.0]
4       (0.001, 334.0]
            ...       
995    (667.0, 1000.0]
996    (667.0, 1000.0]
997    (667.0, 1000.0]
998    (667.0, 1000.0]
999    (667.0, 1000.0]
Length: 1000, dtype: category
Categories (3, interval[float64]): [(0.001, 334.0] < (334.0, 667.0] < (667.0, 1000.0]]
'''
print(pd.value_counts(result_cut2))
'''
(0.001, 334.0]     334
(667.0, 1000.0]    333
(334.0, 667.0]     333
'''

merge

df1 = pd.DataFrame({'data1':range(7), 'key':['b', 'b', 'a', 'c', 'a', 'a', 'b']})
print(df1)
'''
   data1 key
0      0   b
1      1   b
2      2   a
3      3   c
4      4   a
5      5   a
6      6   b
'''
df2 = pd.DataFrame({'key':['a', 'b', 'd'], 'data2':range(3)})
print(df2)
'''
  key  data2
0   a      0
1   b      1
2   d      2
'''
print()

print(pd.merge(df1, df2)) # inner join
print(pd.merge(df1, df2, on = 'key')) # inner join
print(pd.merge(df1, df2, on = 'key', how = 'inner')) # inner join
'''
   data1 key  data2
0      0   b      1
1      1   b      1
2      6   b      1
3      2   a      0
4      4   a      0
5      5   a      0
'''

print(pd.merge(df1, df2, on = 'key', how = 'outer')) # full outer join
'''
   data1 key  data2
0    0.0   b    1.0
1    1.0   b    1.0
2    6.0   b    1.0
3    2.0   a    0.0
4    4.0   a    0.0
5    5.0   a    0.0
6    3.0   c    NaN
7    NaN   d    2.0
'''
print(pd.merge(df1, df2, on = 'key', how = 'left')) # left outer join
'''
   data1 key  data2
0      0   b    1.0
1      1   b    1.0
2      2   a    0.0
3      3   c    NaN
4      4   a    0.0
5      5   a    0.0
6      6   b    1.0
'''
print(pd.merge(df1, df2, on = 'key', how = 'right')) # right outer join
'''
   data1 key  data2
0    2.0   a      0
1    4.0   a      0
2    5.0   a      0
3    0.0   b      1
4    1.0   b      1
5    6.0   b      1
6    NaN   d      2
'''

공통 칼럼명이 없는 경우

df3 = pd.DataFrame({'key2':['a','b','d'], 'data2':range(3)})
print(df3)
'''
  key2  data2
0    a      0
1    b      1
2    d      2
'''

print(df1)
'''
   data1 key
0      0   b
1      1   b
2      2   a
3      3   c
4      4   a
5      5   a
6      6   b
'''

print(pd.merge(df1, df3, left_on = 'key', right_on = 'key2'))
'''
   data1 key key2  data2
0      0   b    b      1
1      1   b    b      1
2      6   b    b      1
3      2   a    a      0
4      4   a    a      0
5      5   a    a      0
'''
print()
print(pd.concat([df1, df3]))
print(pd.concat([df1, df3], axis = 0)) # 열 단위
'''
   data1  key key2  data2
0    0.0    b  NaN    NaN
1    1.0    b  NaN    NaN
2    2.0    a  NaN    NaN
3    3.0    c  NaN    NaN
4    4.0    a  NaN    NaN
5    5.0    a  NaN    NaN
6    6.0    b  NaN    NaN
0    NaN  NaN    a    0.0
1    NaN  NaN    b    1.0
2    NaN  NaN    d    2.0
'''

print(pd.concat([df1, df3], axis = 1)) # 행 단위
'''
   data1 key key2  data2
0      0   b    a    0.0
1      1   b    b    1.0
2      2   a    d    2.0
3      3   c  NaN    NaN
4      4   a  NaN    NaN
5      5   a  NaN    NaN
6      6   b  NaN    NaN
'''

피벗 테이블: 데이터의 행렬을 재구성하여 그륩화 처리

data = {'city':['강남', '강북', '강남', '강북'],
        'year':[2000, 2001, 2002, 2002],
        'pop':[3.3, 2.5, 3.0, 2]
        }
df = pd.DataFrame(data)
print(df)
'''
  city  year  pop
0   강남  2000  3.3
1   강북  2001  2.5
2   강남  2002  3.0
3   강북  2002  2.0
'''

print(df.pivot('city', 'year', 'pop')) # city별, year별 pop의 평균
print(df.set_index(['city', 'year']).unstack()) # 기존의 행 인덱스를 제거하고 첫번째 열 인덱스 설정 
'''
year  2000  2001  2002
city                  
강남     3.3   NaN   3.0
강북     NaN   2.5   2.0
'''
print(df.pivot('year', 'city', 'pop')) # year별 , city별 pop의 평균
'''
city   강남   강북
year          
2000  3.3  NaN
2001  NaN  2.5
2002  3.0  2.0
'''
print(df['pop'].describe())
'''
count    4.000000
mean     2.700000
std      0.571548
min      2.000000
25%      2.375000
50%      2.750000
75%      3.075000
max      3.300000
'''

groupby

hap = df.groupby(['city'])
print(hap.sum()) # city별 합

print(df.groupby(['city']).sum()) # city별 합
'''
      year  pop
city           
강남    4002  6.3
강북    4003  4.5
'''
print(df.groupby(['city', 'year']).sum()) # city, year 별 합
'''
           pop
city year     
강남   2000  3.3
     2002  3.0
강북   2001  2.5
     2002  2.0
'''
print(df.groupby(['city', 'year']).mean()) # city, year별 평균
'''
           pop
city year     
강남   2000  3.3
     2002  3.0
강북   2001  2.5
     2002  2.0
'''

pivot_table : pivot, groupby의 중간 성격

print(df)
'''
  city  year  pop
0   강남  2000  3.3
1   강북  2001  2.5
2   강남  2002  3.0
3   강북  2002  2.0
'''

print(df.pivot_table(index=['city']))
print(df.pivot_table(index=['city'], aggfunc=np.mean)) # default : aggfunc=np.mean 
'''
       pop    year
city              
강남    3.15  2001.0
강북    2.25  2001.5
'''

print(df.pivot_table(index=['city', 'year'], aggfunc=[len, np.sum]))
'''
           len  sum
           pop  pop
city year          
강남   2000  1.0  3.3
     2002  1.0  3.0
강북   2001  1.0  2.5
     2002  1.0  2.0
'''

print(df.pivot_table(values=['pop'], index = 'city')) # city별 합의 평균
print(df.pivot_table(values=['pop'], index = 'city', aggfunc=np.mean))
'''
       pop
city      
강남    3.15
강북    2.25
'''

print(df.pivot_table(values=['pop'], index = 'city', aggfunc=len))
'''
      pop
city     
강남    2.0
강북    2.0
'''

print(df.pivot_table(values=['pop'], index = ['year'], columns=['city']))
'''
      pop     
city   강남   강북
year          
2000  3.3  NaN
2001  NaN  2.5
2002  3.0  2.0
'''

print(df.pivot_table(values=['pop'], index = ['year'], columns=['city'], margins=True))
'''
       pop           
city    강남    강북  All
year                 
2000  3.30   NaN  3.3
2001   NaN  2.50  2.5
2002  3.00  2.00  2.5
All   3.15  2.25  2.7
'''

print(df.pivot_table(values=['pop'], index = ['year'], columns=['city'], margins=True, fill_value=0))
'''
       pop           
city    강남    강북  All
year                 
2000  3.30  0.00  3.3
2001  0.00  2.50  2.5
2002  3.00  2.00  2.5
All   3.15  2.25  2.7
'''

file i/o

* pdex4_fileio.py

import pandas as pd

#local
df = pd.read_csv(r'../testdata/ex1.csv')
#web
df = pd.read_csv(r'https://raw.githubusercontent.com/pykwon/python/master/testdata_utf8/ex1.csv')

print(df, type(df))
'''
   bunho irum  kor  eng
0      1  홍길동   90   90
1      2  신기해   95   80
2      3  한국인  100   85
3      4  박치기   67   54
4      5  마당쇠   55  100 <class 'pandas.core.frame.DataFrame'>
'''

df = pd.read_csv(r'../testdata/ex2.csv', header=None)
print(df, type(df))
'''
   0   1   2   3      4
0  1   2   3   4  hello
1  5   6   7   8  world
2  9  10  11  12    foo <class 'pandas.core.frame.DataFrame'>
'''

df = pd.read_csv(r'../testdata/ex2.csv', names=['a', 'b', 'c', 'd', 'e'])
print(df, type(df))
'''
   a   b   c   d      e
0  1   2   3   4  hello
1  5   6   7   8  world
2  9  10  11  12    foo <class 'pandas.core.frame.DataFrame'>
'''

df = pd.read_csv(r'../testdata/ex2.csv', names=['a', 'b', 'c', 'd', 'e'], index_col = 'e')
print(df, type(df))
'''
       a   b   c   d
e                   
hello  1   2   3   4
world  5   6   7   8
foo    9  10  11  12 <class 'pandas.core.frame.DataFrame'>
'''

df = pd.read_csv(r'../testdata/ex3.txt')
print(df, type(df))
'''
               A         B         C
0  aaa -0.264438 -1.026059 -0.619500
1  bbb  0.927272  0.302904 -0.032399
2  ccc -0.264273 -0.386314 -0.217601
3  ddd -0.871858 -0.348382  1.100491 <class 'pandas.core.frame.DataFrame'>
'''

df = pd.read_csv(r'../testdata/ex3.txt', sep = '\s+', skiprows=[1, 3])
print(df, type(df))
'''
            A         B         C
bbb  0.927272  0.302904 -0.032399
ddd -0.871858 -0.348382  1.100491 <class 'pandas.core.frame.DataFrame'>
'''

df = pd.read_fwf(r'../testdata/data_fwt.txt', encoding = 'utf8', widths=(10, 3, 5), names = ('date', 'name', 'price'))
print(df, type(df))
'''
         date name  price
0  2017-04-10  네이버  32000
1  2017-04-11  네이버  34000
2  2017-04-12  네이버  33000
3  2017-04-10  코리아  22000
4  2017-04-11  코리아  21000
5  2017-04-12  코리아  24000 <class 'pandas.core.frame.DataFrame'>
'''

print(df['date'])
'''
0    2017-04-10
1    2017-04-11
2    2017-04-12
3    2017-04-10
4    2017-04-11
5    2017-04-12
'''

chunksize : 파일이 너무 큰 경우에는 나눠서 읽기 옵션을 사용

test = pd.read_csv('../testdata/data_csv2.csv', header=None, chunksize = 3)
print(test)
'''
    0      1     2
0   1     사과  3500
1   2      배  5000
2   3    바나나  1000
3   4     수박  7000
4   5     참외  2000
5   6  에스프레소  5500
6   7  아메리카노  2000
7   8   카페라떼  4000
8   9   카푸치노  4000
9  10   카페모카  5000

<pandas.io.parsers.TextFileReader object at 0x00000240E6E521C0>
'''
for p in test:
    #print(p)
    print(p.sort_values(by=2, ascending=True))
    print()
'''
   0    1     2
2  3  바나나  1000
0  1   사과  3500
1  2    배  5000

   0      1     2
4  5     참외  2000
5  6  에스프레소  5500
3  4     수박  7000

   0      1     2
6  7  아메리카노  2000
7  8   카페라떼  4000
8  9   카푸치노  4000

    0     1     2
9  10  카페모카  5000
'''

파일로 저장

items = {'apple':{'count':10, 'price':1500},
         'orange':{'count':5, 'price':500}}
print(items)
df = pd.DataFrame(items)
print(df)
'''
count     10       5
price   1500     500
'''

csv

df.to_csv('result1.csv', sep=',')
df.to_csv('result2.csv', sep=',', index=False) # 색인 제외
df.to_csv('result3.csv', sep=',', index=False, header=False) # 색인, 칼럼명 제외

html

data = df.T
'''
        count  price
apple      10   1500
orange      5    500
'''
print(data)
data.to_html('result1.html')

excel로 저장

df2 = pd.DataFrame({'data':[1,2,3,4,5]})
wr = pd.ExcelWriter('good.xlsx', engine='xlsxwriter')
df2.to_excel(wr, sheet_name = 'Sheet1')
wr.save()

excel로 읽기

exf = pd.ExcelFile('good.xlsx')
print(exf.sheet_names)

dfdf = exf.parse('Sheet1')
print(dfdf)
dfdf2 = pd.read_excel(open('good.xlsx','rb'), sheet_name='Sheet1')
print(dfdf2)
'''
['Sheet1']
   Unnamed: 0  data
0           0     1
1           1     2
2           2     3
3           3     4
4           4     5
'''

ElementTree : XML, HTML 문서 읽기

* pdex5_etree.py

import xml.etree.ElementTree as etree

# 일반적인 형태의 파일읽기
xml_f = open("pdex5.xml","r", encoding="utf-8").read()
print(xml_f,"\n",type(xml_f))  # <class str> 이므로 str관련 명령만 사용 가능
print()

root = etree.fromstring(xml_f)
print(root,'\n', type(root))   # <class xml.etree.ElementTree.Element> Element 관련 명령 사용가능
print(root.tag, len(root.tag)) # items 5

xmlfile = etree.parse("pdex5.xml")
print(xmlfile)
root = xmlfile.getroot()
print(root.tag)                   # items
print(root[0].tag)                # item
print(root[0][0].tag)             # name
print(root[0][0].attrib)          # {'id': 'ks1'}
print(root[0][2].attrib.keys())   # dict_keys(['kor', 'eng'])
print(root[0][2].attrib.values()) # dict_values(['90', '80'])

* pdex5.xml

<?xml version="1.0" encoding="UTF-8"?>
<!-- xml문서 최상위 해당 라인이 추가되어야 한다. -->
<items>
	<item>
		<name id="ks1">홍길동</name>
		<tel>111-1111</tel>
		<exam kor="90" eng="80"/>
	</item>
	<item>
		<name id="ks2">신길동</name>
		<tel>111-2222</tel>
		<exam kor="100" eng="50"/>
	</item>
</items>

Beautifulsoup : XML, HTML문서의 일부 자료 추출

* pdex6_beautifulsoup.py

# Beautifulsoup : XML, HTML문서의 일부 자료 추출
# 라이브러리 설치 beautifulsoup4, request, lxml
# find(), select()
 
import requests  
from bs4 import BeautifulSoup
def go():
    base_url = "http://www.naver.com/index.html"
    
    #storing all the information including headers in the variable source code
    source_code = requests.get(base_url)
    
    #sort source code and store only the plaintext
    plain_text = source_code.text            # 문자열
    #print(plain_text)
    
    #converting plain_text to Beautiful Soup object so the library can sort thru it
    convert_data = BeautifulSoup(plain_text, 'lxml') # lxml의 HTML 해석기 사용
    
    # 해석 라이브러리
    # BeautifulSoup(markup, "html.parser")
    # BeautifulSoup(markup, "lxml")
    # BeautifulSoup(markup, ["lxml", "xml"])
    # BeautifulSoup(markup, "xml")
    # BeautifulSoup(markup, html5lib)
    
    # find 함수
    # find()
    # find_next()
    # find_all()
    
    for link in convert_data.findAll('a'):   # a tag search 
        href = link.get('href')              # href 속성 get
        #href = base_url + link.get('href')  #Building a clickable url
        print(href)                          #displaying href
go()

Beautifulsoup의 find(), select()

* pdex7_beautifulsoup.py

from bs4 import BeautifulSoup

html_page = """
<html><body>
<h1>제목태그</h1>
<p>웹문서 읽기</p>
<p>원하는 자료 선택</p>
</body></html>
"""
print(html_page, type(html_page)) # <class str>

soup = BeautifulSoup(html_page, 'html.parser') # BeautifulSoup 객체 생성
print(type(soup))  # bs4.BeautifulSoup BeautifulSoup이 제공하는 명령 사용 가능
print()

h1 = soup.html.body.h1
print('h1 :', h1.string)  # h1 : 제목태그
p1 = soup.html.body.p     # 최초의 p
print('p1 :', p1.string)  # p1 : 웹문서 읽기

p2 = p1.next_sibling      # </p>
p2 = p1.next_sibling.next_sibling
print('p2 :', p2.string)  # p2 : 원하는 자료 선택

find() 사용

html_page2 = """
<html><body>
<h1 id='title'>제목태그</h1>
<p>웹문서 읽기</p>
<p id='my'>원하는 자료 선택</p>
</body></html>
"""

soup2 = BeautifulSoup(html_page2, 'html.parser')
print(soup2.p, ' ', soup2.p.string)    # 직접 최초 tag 선택가능
                                        # <p>웹문서 읽기</p>   웹문서 읽기

print(soup2.find('p').string)          # find('태그명')
                                        # 웹문서 읽기

print(soup2.find('p', id='my').string) # find('태그명', id='아이디명')
                                        # 원하는 자료 선택

print(soup2.find_all('p'))             # find_all('태그명') 
                                        # [<p>웹문서 읽기</p>, <p id="my">원하는 자료 선택</p>]

print(soup2.find(id='title').string)   # find(id='아이디명')
                                        # 제목태그

print(soup2.find(id='my').string)
                                        # 원하는 자료 선택

find_all(), findeAll() 사용

html_page3 = """
<html><body>
<h1 id='title'>제목태그</h1>
<p>웹문서 읽기</p>
<p id='my'>원하는 자료 선택</p>
<div>
    <a href="https://www.naver.com">naver</a><br/>
    <a href="https://www.daum.net">daum</a><br/>
</div>
</body></html>
"""

soup3 = BeautifulSoup(html_page3, 'html.parser')
print(soup3.find('a'))             # <a href="https://www.naver.com">naver</a>

print(soup3.find('a').string)
# naver

print(soup3.find(['a', 'i']))      # find(['태그명1', '태그명2']) 
                                    # <a href="https://www.naver.com">naver</a>

print(soup3.find_all(['a', 'p']))  # [<p>웹문서 읽기</p>, <p id="my">원하는 자료 선택</p>, <a href="https://www.naver.com">naver</a>, <a href="https://www.daum.net">daum</a>]
print(soup3.findAll(['a', 'p']))   # 위와 동일

print(soup3.find_all('a'))         # a태그만
                                    # [<a href="https://www.naver.com">naver</a>, <a href="https://www.daum.net">daum</a>]

print(soup3)
print(soup3.prettify()) # 들여쓰기
print()

links = soup3.find_all('a')        # [<a href="https://www.naver.com">naver</a>, <a href="https://www.daum.net">daum</a>]
print(links)
for i in links:
    href = i.attrs['href']
    text = i.string
    print(href, text)
# https://www.naver.com naver
# https://www.daum.net daum

find() 정규표현식 사용

import re
links2 = soup3.find_all(href=re.compile(r'^https://'))
print(links2)                 # [<a href="https://www.naver.com">naver</a>, <a href="https://www.daum.net">daum</a>]

for k in links2:
    print(k.attrs['href'])
# https://www.naver.com
# https://www.daum.net

select() 사용(css의 selector)

html_page4 = """
<html><body>
<div id='hello'>
    <a href="https://www.naver.com">naver</a><br/>
    <a href="https://www.daum.net">daum</a><br/>
    <ul class="world">
        <li>안녕</li>
        <li>반가워</li>
    </ul>
</div>
<div id='hi'>
    seconddiv
</div>
</body></html>
"""

soup4 = BeautifulSoup(html_page4, 'lxml')
aa = soup4.select_one('div#hello > a').string # select_one() : 하나만 선택
print("aa :", aa)            # aa : naver

bb = soup4.select("div#hello ul.world > li") # > : 직계, 공백 : 자손, select() : 복수 선택. 객체리턴.
print("bb :", bb) # bb : [<li>안녕</li>, <li>반가워</li>]

for i in bb:
    print("li :", i.string)
# li : 안녕
# li : 반가워

웹문서 읽기 - web scraping

* pdex8.py

import urllib.request as req
from bs4 import BeautifulSoup

url = "https://ko.wikipedia.org/wiki/%EC%9D%B4%EC%88%9C%EC%8B%A0"
wiki = req.urlopen(url)
print(wiki) # <http.client.HTTPResponse object at 0x00000267F71B1550>

soup = BeautifulSoup(wiki, 'html.parser')
# Chrome - F12 - 태그 오른쪽 클릭  - Copy - Copy selector
print(soup.select_one("#mw-content-text > div.mw-parser-output > p"))

url = "https://news.daum.net/society#1"
daum = req.urlopen(url)
soup = BeautifulSoup(daum, 'html.parser')
print(soup.select_one("div#kakaoIndex > a").string) # 본문 바로가기
datas = soup.select("div#kakaoIndex > a")

for i in datas:
    href = i.attrs['href']
    text = i.string
    print("href :{}, text:{}".format(href, text))
    #print("href :%s, text:%s"%(href, text))
# href :#kakaoBody, text:본문 바로가기
# href :#kakaoGnb, text:메뉴 바로가기
print()

datas2 = soup.findAll('a')
#print(datas2)
for i in datas2[:2]:
    href = i.attrs['href']
    text = i.string
    print("href :{}, text:{}".format(href, text))
    
# href :#kakaoBody, text:본문 바로가기
# href :#kakaoGnb, text:메뉴 바로가기

날씨 정보 예보

* pdex9_weather.py

#<![CDATA[ ]]>
import urllib.request
import urllib.parse
from bs4 import BeautifulSoup
import pandas as pd
url = "http://www.weather.go.kr/weather/forecast/mid-term-rss3.jsp"
data = urllib.request.urlopen(url).read()
#print(data.decode('utf-8'))
soup = BeautifulSoup(urllib.request.urlopen(url).read(), 'lxml')
#print(soup)
print()

title = soup.find('title').string
print(title) # 기상청 육상 중기예보
#wf = soup.find('wf')
wf = soup.select_one('item > description > header > wf')
print(wf)


city = soup.find_all('city')
#print(city)
cityDatas = []
for c in city:
    #print(c.string)
    cityDatas.append(c.string)

df = pd.DataFrame()
df['city'] = cityDatas
print(df.head(3))
'''
  city
0   서울
1   인천
2   수원
'''

#tmEfs = soup.select_one('location data tmef')
#tmEfs = soup.select_one('location > data > tmef')
tmEfs = soup.select_one('location > province + city + data > tmef') # + 아래 형제
print(tmEfs) # <tmef>2021-02-28 00:00</tmef>

tempMins = soup.select('location > province + city + data > tmn')
tempDatas = []
for t in tempMins:
    tempDatas.append(t.string)
    
df['temp_min'] = tempDatas
print(df.head(3), len(df))
'''
  city temp_min
0   서울        3
1   인천        3
2   수원        2 41
'''
df.columns = ['지역', '최저기온']
print(df.head(3))
'''
   지역 최저기온
0  서울    3
1  인천    3
2  수원    2
'''
print(df.describe())
print(df.info())

# 파일로 저장
df.to_csv('날씨정보.csv', index=False)
df2 = pd.read_csv('날씨정보.csv')

print(df2.head(2))
print(df2[0:2])

print(df2.tail(2))
print(df2[-2:len(df)])
'''
     지역  최저기온
39   제주    10
40  서귀포    10
'''

print(df.iloc[0])
'''
지역      서울
최저기온     3
'''
print(type(df.iloc[0])) # Series

print(df.iloc[0:2, :])
'''
   지역 최저기온
0  서울    3
1  인천    3
'''
print(type(df.iloc[0:2, :])) # DataFrame
print("---")
print(df.iloc[0:2, 0:2])
'''
   지역 최저기온
0  서울    3
1  인천    3
0    서울
1    인천
'''
print(df['지역'][0:2])
'''
0    서울
1    인천
'''
print(df['지역'][:2])
'''
0    서울
1    인천
'''
print("---")
print(df.loc[1:3])
'''
   지역 최저기온
1  인천    3
2  수원    2
3  파주   -2
'''
print(df.loc[[1, 3]])
'''
   지역 최저기온
1  인천    3
3  파주   -2
'''

print(df.loc[:, '지역'])
'''
0      서울
1      인천
2      수원
3      파주
'''
print('----------')
df = df.astype({'최저기온':'int'})
print(df.info())
print('----------')
print(df['최저기온'].mean()) # 2.1951219512195124
print(df['최저기온'].std()) # 3.034958914014504
print(df['최저기온'].describe())
#print(df['최저기온'] >= 5)
print(df.loc[df['최저기온'] >= 5])
'''
     지역  최저기온
17   여수     5
19   광양     5
27   부산     7
28   울산     5
32   통영     6
35   포항     6
39   제주    10
40  서귀포    10
'''
print('----------')
print(df.sort_values(['최저기온'], ascending=True))
'''
     지역  최저기온
31   거창    -3
6    춘천    -3
3    파주    -2
4    이천    -2
34   안동    -2
13   충주    -1
7    원주    -1
'''

웹 문서를 다운받아 파일로 저장하기 - 스케줄러

* pdex10_schedule.py

# 웹 문서를 다운받아 파일로 저장하기 - 스케줄러
from bs4 import BeautifulSoup
import urllib.request as req
import datetime

def working():
    url = "https://finance.naver.com/marketindex/"
    data = req.urlopen(url)
    soup = BeautifulSoup(data, 'html.parser')
    price = soup.select_one("div.head_info > span.value").string
    print("미국 USD :", price) # 미국 USD : 1,108.90
    
    t = datetime.datetime.now()
    print(t) # 2021-02-25 14:52:30.108522
    fname = './usd/' + t.strftime('%Y-%m-%d-%H-%M-%S') + ".txt"
    print(fname) # 2021-02-25-14-53-59.txt
    
    with open(fname, 'w') as f:
        f.write(price)
    
# 스케줄러
import schedule   # pip install schedule
import time
 
# 한번만 실행
working(); 
# 10초에 한번씩 실행
schedule.every(10).second.do(working)
# 10분에 한번씩 실행
schedule.every(10).minutes.do(working)
# 매 시간 실행
schedule.every().hour.do(working)
# 매일 10:30 에 실행
schedule.every().day.at("10:30").do(working)
# 매주 월요일 실행
schedule.every().monday.do(working)
# 매주 수요일 13:15 에 실행
schedule.every().wednesday.at("13:15").do(working)

while True:
    schedule.run_pending()
    time.sleep(1)

urllib.request, requests 모듈로 웹 자료 읽기

* pdex11.py

방법 1

from bs4 import BeautifulSoup

import urllib.request
url = "https://movie.naver.com/movie/sdb/rank/rmovie.nhn" # 네이버 영화 랭킹 정보
data = urllib.request.urlopen(url).read()
print(data)
soup = BeautifulSoup(data, 'lxml')
print(soup)
print(soup.select("div.tit3"))
print(soup.select("div[class=tit3]")) # 위와 동일

for tag in soup.select("div[class=tit3]"):
    print(tag.text.strip())
'''
미션 파서블
극장판 귀멸의 칼날: 무한열차편
소울
퍼펙트 케어
새해전야
몬스터 헌터
'''

방법 2

import requests
data = requests.get(url);
print(data.status_code, data.encoding) # 정보 반환. 200 MS949
datas = data.text
print(datas)

datas = requests.get(url).text; # 명령을 연속적으로 주고 읽음
soup2 = BeautifulSoup(datas, "lxml")
print(soup2)

m_list = soup2.findAll("div", "tit3")          # findAll("태그명", "속성명")
m_list = soup2.findAll("div", {'class':'tit3'}) # findAll("태그명", {'속성':'속성명'}) 
print(m_list)
count = 1
for i in m_list:
    title = i.find('a')
    #print(title)
    print(str(count) + "위 : " + title.string)
    count += 1
'''
1위 : 미션 파서블
2위 : 극장판 귀멸의 칼날: 무한열차편
3위 : 소울
4위 : 퍼펙트 케어
5위 : 새해전야
6위 : 몬스터 헌터
7위 : 더블패티
'''

네이버 실시간 검색어

import requests
from bs4 import BeautifulSoup  # html 분석 라이브러리

# 유저 설정
url = 'https://datalab.naver.com/keyword/realtimeList.naver?where=main'
headers = {'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.122 Safari/537.36'}

res = requests.get(url, headers = headers)
soup = BeautifulSoup(res.content, 'html.parser')

# span.item_title 정보를 선택
data = soup.select('span.item_title')
i = 1

for item in data:
    print(str(i) + ')' + item.get_text())
    i += 1
'''
1)함소원 진화
2)함소원
3)오은영
4)기성용
'''

* pdex12_search.py

구글 검색기능

import requests
from bs4 import BeautifulSoup
import webbrowser

def searchFunc(search_word):
    base_url = "https://www.google.com/search?q={0}"
    #sword = base_url.format("colab")
    sword = base_url.format(search_word)
    print(sword) # https://www.google.com/search?q=colab
    
    headers = {'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.122 Safari/537.36'}
    #plain_text = requests.get(sword)
    plain_text = requests.get(sword, headers=headers)
    #print(plain_text.text)
    soup = BeautifulSoup(plain_text.text, 'lxml')
    #print(soup)
    link_data = soup.select('div.yuRUbf > a')
    #print(link_data)
    for link in link_data:
        #print(link.attrs['href'])
        #print(type(link),type(str(link)))
        print(str(link).find('https'),str(link).find('ping') - 2)
        urls = str(link)[str(link).find('https'):str(link).find('ping') -2]
        print(urls)
        webbrowser.open(urls)
        
search_word = "파이썬"
    
searchFunc(search_word)

XML

* pdex13_xml.py

xmlObj.select('태그명')

# XML로 제공된 강남구 도서간 정보 읽기
import urllib.request as req
from bs4 import BeautifulSoup

url = "http://openapi.seoul.go.kr:8088/sample/xml/SeoulLibraryTime/1/5/"
plainText = req.urlopen(url).read().decode() # url의 text read
#print(plainText)

xmlObj = BeautifulSoup(plainText, 'lxml') # xml 해석기 사용
libData = xmlObj.select('row')            # row 태그의 데이터 정보 객체에 저장
#print(libData)

for data in libData:
    name = data.find('lbrry_name').text   # row태그내 lbrry_name 태그 정보 read
    addr = data.find('adres').text        # row태그내 adres 태그 정보 read
    print('도서관명\t:', name,'\n주소\t:',addr)

json

* pdex14.json

{
	"직원":{
		"이름":"홍길동",
		"직급":"대리",
		"전화":"010-222-2222"
	},
	"웹사이트":{
		"카페명":"cafe.daum.net/flowlife",
		"userid":"hell"
	}
}

* pdex14_json.py

json.loads('json 문자열')

import json

json_file = "./pdex14.json" # 파일경로
json_data = {}

def readData(filename):
    f = open(filename, 'r', encoding="utf-8") # 파일 열기
    lines = f.read()                          # 파일 읽기
    f.close()                                 # 파일 닫기
    #print(lines)
    return json.loads(lines)                  # decoding str -> dict                        
    
def main():
    global json_file                          # 전역 변수 사용
    json_data = readData(json_file)           # json파일의 내용을 dict타입으로 반환
    print(type(json_data))                    # dict
    
    d1 = json_data['직원']['이름']              # data의 key로 value read
    d2 = json_data['직원']['직급']
    d3 = json_data['직원']['전화']
    print("이름 : " + d1 +" 직급 : "+d2 + " 전화 : "+d3)
    # 이름 : 홍길동 직급 : 대리 전화 : 010-222-2222

if __name__ == "__main__":
    main()

* pdex15_json.py

jsonData.get('태그명')

# JSON으로 제공된 강남구 도서간 정보 읽기
import urllib.request as req
import json

url = "http://openapi.seoul.go.kr:8088/sample/json/SeoulLibraryTime/1/5/ "
plainText = req.urlopen(url).read().decode() # url text read 및 decode
print(plainText)

jsonData = json.loads(plainText)
print(jsonData['SeoulLibraryTime']['row'][0]["LBRRY_NAME"]) # LH강남3단지작은도서관

# get 함수
libData = jsonData.get("SeoulLibraryTime").get("row")
print(libData)

name = libData[0].get('LBRRY_NAME') # LH강남3단지작은도서관
print(name)
print()

for msg in libData:
    name = msg.get('LBRRY_NAME')
    tel = msg.get('TEL_NO')
    addr = msg.get('ADRES')
    print('도서관명\t:', name,'\n주소\t:',addr,'\n전화\t:',tel)
'''
도서관명    : LH강남3단지작은도서관 
주소    : 서울특별시 강남구 자곡로3길 22 
전화    : 02-459-8700
도서관명    : 강남구립못골도서관 
주소    : 서울시 강남구 자곡로 116 
전화    : 02-459-5522
도서관명    : 강남역삼푸른솔도서관 
주소    : 서울특별시 강남구 테헤란로8길 36. 4층 
전화    : 02-2051-1178
도서관명    : 강남한신휴플러스8단지작은도서관 
주소    : 서울특별시 강남구 밤고개로27길 20(율현동, 강남한신휴플러스8단지) 
전화    : 
도서관명    : 강남한양수자인작은씨앗도서관 
주소    : 서울특별시 강남구 자곡로 260 
전화    : 
'''

selenium : 자동 웹 브라우저 제어

anaconda 접속
pip install selenium

https://sites.google.com/a/chromium.org/chromedriver/home 접속
Latest stable release: ChromeDriver 88.0.4324.96 접속
chromedriver_win32.zip download
dowload 파일 특정경로에 압축풀어 이동

anaconda 접속

python
from selenium import webdriver
browser = webdriver.Chrome('D:/1. 프로그래밍/0. 설치 Program/Python/chromedriver')
browser.implicitly_wait(5)
browser.get('https://daum.net')
browser.quit()

import time
from selenium import webdriver
browser = webdriver.Chrome('D:/1. 프로그래밍/0. 설치 Program/Python/chromedriver')
browser.get('http://www.google.com/xhtml');
time.sleep(5)
search_box = browser.find_element_by_name('q')
search_box.send_keys('파이썬')
search_box.submit()
time.sleep(5)
browser.quit()

* pdex16_selenium.py

# 셀레니움으로 임의의 사이트 화면 캡처
from selenium import webdriver

try:
    url = "http://www.daum.net"
    browser = webdriver.Chrome('D:/1. 프로그래밍/0. 설치 Program/Python/chromedriver')
    browser.implicitly_wait(3)

    browser.get(url);
    browser.save_screenshot("daum_img.png")
    browser.quit()
    print('성공')

except Exception:
    print('에러')

형태소 분석

* nlp01.py

# 말뭉치 : 자연어 연구를 목적으로 수집된 샘플 dataset
# 형태소(단어로써 의미를 가지는 최소 단위) 분석 (한글) - 어근, 접두사, 접미사, 품사 형태로 분리한 데이터로 분석작업

from konlpy.tag import Kkma

kkma = Kkma()
#print(kkma)
phrase = "영국 제약사 아스트라제네카의 백신 1차분 접종은 오전 9시부터 전국의 요양원 직원들과 약 5200만명의 환자들에게 투여되기 시작했다. 반가워요"
print(kkma.sentences(phrase))
print(kkma.nouns(phrase)) # 명사 출력
print()

from konlpy.tag import Okt
okt = Okt()
print(okt.pos(phrase))               # 단어 + 품사
print(okt.pos(phrase, stem=True))    # 
print(okt.nouns(phrase))             # 명사 출력
print(okt.morphs(phrase))            # 모든 품사 출력

단어 빈도수

* nlp2_webscrap

# 위키백과 사이트에서 원하는 단어 검색 후 형태소 분석. 단어 출현 빈도 수 출력.

import urllib
from bs4 import BeautifulSoup
from konlpy.tag import Okt
from urllib import parse # 한글 인코딩용

okt = Okt()
#para = input('검색 단어 입력 : ')
para = "이순신"
para = parse.quote(para)    # 한글 인코딩
url = "https://ko.wikipedia.org/wiki/" + para
print(url) # https://ko.wikipedia.org/wiki/%EC%9D%B4%EC%88%9C%EC%8B%A0

page = urllib.request.urlopen(url)        # url 열기
#print(page)

soup = BeautifulSoup(page.read(), 'lxml') # url 읽어. xml 해석기 사용
#print(soup)

wordList = []               # 형태소 분석으로 명사만 추출해 기억

for item in soup.select("#mw-content-text > div > p"): # p태그의 데이터를 읽는다
    #print(item)
    if item.string != None: # 태그 내부가 비어있으면 저장 하지않는다.
        #print(item.string)
        ss = item.string
        wordList += okt.nouns(ss) # 명사만 추출

print("wordList :", wordList)
print("단어 수 :", len(wordList))
'''
wordList : ['당시', '조산', '만호', '이순신', ...]
단어 수 : 241
'''

# 단어의 발생횟수를 dict type 당시 : 2 조산:5
word_dict = {}

for i in wordList:
    if i in word_dict: # dict에 없으면 추가 있으면 count+1
        word_dict[i] += 1
    else:
        word_dict[i] = 1
print("word_dict :", word_dict)
'''
word_dict : {'당시': 3, '조산': 1, '만호': 1,  ...}
'''

setdata = set(wordList)
print(setdata)
print("단어 수(중복 제거 후) :", len(setdata)) # 단어 수(중복 제거 후) : 169
print()

# 판다스의 series type으로 처리
import pandas as pd
woList = pd.Series(wordList)
print(woList[:5])
print(woList.value_counts()[:5]) # 단어 별 횟수 총 갯수 top 5
'''
0     당시
1     조산
2     만호
3    이순신
4     북방
dtype: object
이순신    14
척       7
배       7
대한      5
그       4
dtype: int64

당시      3
조산      1
만호      1
이순신    14
북방      1
'''
print()

woDict = pd.Series(word_dict)
print(woDict[:5])
print(woDict.value_counts())
'''
1     133
2      23
3       7
7       2
4       2
14      1
5       1
'''
print()

# DataFrame으로 처리
df1 = pd.DataFrame(wordList, columns =['단어'])
print(df1.head(5))
print()
'''
    단어
0   당시
1   조산
2   만호
3  이순신
4   북방
'''

# 단어 / 빈도수
df2 = pd.DataFrame([word_dict.keys(), word_dict.values()])
df2 = df2.T
df2.columns = ['단어', '빈도수']
print(df2.head(3))
'''
   단어  빈도수
0  당시    3
1  조산    1
2  만호    1
'''
df2.to_csv("./이순신.csv", sep=',', index=False)
df3 = pd.read_csv("./이순신.csv")
print(df3.head(3))
'''
   단어  빈도수
0  당시    3
1  조산    1
2  만호    1
'''

one_hot encoding

* nlp3.py

# Word를 수치화해서 vector에 담기
import numpy as np

# 단어 one_hot encoding
data_list = ['python', 'lan', 'program', 'computer', 'say']
print(data_list)

values = []
for x in range(len(data_list)):
    values.append(x)

print(values) # [0, 1, 2, 3, 4]

values_len = len(values) # 5

one_hot = np.eye(values_len)
print(one_hot) # one_hot encoding
'''
[[1. 0. 0. 0. 0.]
 [0. 1. 0. 0. 0.]
 [0. 0. 1. 0. 0.]
 [0. 0. 0. 1. 0.]
 [0. 0. 0. 0. 1.]]
'''

word2vec

: 단어를 벡터로 변경

anaconda 접속
pip install gensim
conda remove --force scipy
pip install scipy

from gensim.models import word2vec 

sentence = [['python', 'lan', 'program', 'computer', 'say']]
model =word2vec.Word2Vec(sentences = sentence, min_count=1, size = 50) # 50개의 크기
print(model.wv)
word_vectors = model.wv
print("word_vectors.vocab : ", word_vectors.vocab) # key, value로 구성된 vocab obj
print()

vocabs = word_vectors.vocab.keys()
print("vocabs : ", vocabs)
# vocabs :  dict_keys(['python', 'lan', 'program', 'computer', 'say'])

vocab_val = word_vectors.vocab.values()
print("vocab_val : ", vocab_val)
# vocab_val :  dict_values([<gensim.models.keyedvectors.Vocab object at 0x00000234EE66F610>, <gensim.models.keyedvectors.Vocab object at 0x00000234F6674130>, <gensim.models.keyedvectors.Vocab object at 0x00000234F6674190>, <gensim.models.keyedvectors.Vocab object at 0x00000234F6674220>, <gensim.models.keyedvectors.Vocab object at 0x00000234F6809F10>])
print()

word_vectors_list = [word_vectors[v] for v in vocabs]
print(word_vectors_list)
print()
'''
[array([ 1.3842779e-03,  7.4106529e-03,  2.4765935e-03, -8.9635467e-03,
        8.0429604e-03,  7.1699792e-03, -1.5191999e-03,  3.6448328e-04,
       -1.7622416e-03, -5.8619846e-03, -5.2785235e-03,  1.9480551e-03,
'''

similarity() : 코사인 유사도(Cosine Similarity) 수식에 의한 단어 사이의 거리를 출력

print(word_vectors.similarity(w1 = 'lan', w2 = 'program')) # 단어간 유사도 거리 : 0.20018259
print(word_vectors.similarity(w1 = 'lan', w2 = 'say'))     # 단어간 유사도 거리 : -0.10073644
print()
print(model.wv.most_similar(positive='lan'))
# [('program', 0.20018261671066284), ('computer', 0.158003032207489), ('python', 0.11154142022132874), ('say', -0.10073643922805786)]
# 1에 가까울 수록 유사도가 좋음

* nlp4.py

# 웹 뉴스 정보를 읽어 형태소 분석 = > 단어별 유사도 출력
import pandas as pd
from konlpy.tag import Okt

okt = Okt()

with open('news.txt', mode='r', encoding='utf8') as f:
    #print(f.read())
    lines = f.read().split('\n') # 줄 나누기
    print(len(lines))
    
wordDic = {} # 단어 수 확인을 위한 dict type

for line in lines:
    datas = okt.pos(line) # 품사 태깅
    #print(datas) # [('(', 'Punctuation'), ('경남', 'Noun'), ('=', 'Punctuation'), ...

    for word in datas:
        if word[1] == 'Noun':  # 명사만 작업에 참여
            #print(word) # ('경남', 'Noun')
            #print(word[0] in wordDic)
            if not (word[0] in wordDic): # 없으면 0, 있으면 count + 1
                wordDic[word[0]] = 0 # {word[0] : count, ... } 
            wordDic[word[0]] += 1

print(wordDic)
# {'경남': 9, '뉴스': 4, '김다솜': 1, '기자': 1, '가덕도': 14, ...

# 단어 건수 별 내림차순 정렬
keys = sorted(wordDic.items(), key= lambda x : x[1], reverse=True) # count로 내림차순 정렬
print(keys)

# DataFrame에 담기 - 단어, 건수
wordList = []
countList = []

for word, count in keys[:20]: #상위 20개만 작업
    wordList.append(word)
    countList.append(count)

df = pd.DataFrame()
df['word'] = wordList
df['count'] = countList
print(df) 
'''
      word  count
0       공항     19
1      가덕도     14
2       경남      9
3      특별법      9
4   시민사회단체      8
5       도내      6
'''

# word2vec
result = []
with open('news.txt', mode='r', encoding='utf8') as fr:
    lines = fr.read().split('\n') # 줄 나누기
    for line in lines:
        datas = okt.pos(line, stem=True) # 품사 태깅. stem : 원형 어근 형태로 처리. 한가하고 -> 한가하다
        #print(datas) # [('(', 'Punctuation'), ('경남', 'Noun'), ('=', 'Punctuation'), ...
        temp = []
        for word in datas:
            if not word[1] in ['Punctuation', 'Suffix', 'Josa', 'Verb', 'Modifier', 'Number', 'Determiner', 'Foreign']:
                temp.append(word[0]) # 한 줄당 단어
        temp2 = (" ".join(temp)).strip() # 공백 제거 후 띄어쓰기를 채워 합친다.
        result.append(temp2)
        
print(result)

fileName = "news2.txt"
with open(fileName, mode="w", encoding='utf8') as fw:
    fw.write('\n'.join(result)) # 각 줄을 \n로 이어준 후 저장
    print('저장 완료')

* news.txt

(경남=뉴스1) 김다솜 기자 = 가덕도신공항 특별법의 국회통과를 앞둔 26일 경남 도내 시민사회단체와 정당에서 반대 의견이 나오고 있다.
경남기후위기비상행동, 민주노총 경남본부 등 도내 시민사회단체는 이날 오전 경남도청 앞에서 기자회견을 열어 관련 부처에서도 반대 의견을 표명한 가덕도신공항 특별법이 기득권 양당의 담합 행위라고 반발했다.
정의당 경남도당도 같은 입장을 담은 성명을 발표했다.
이들은 관련 부처에서도 반대하는 사업을 받아들일 수 없다고 강조했다. 국토교통부의 가덕도신공항 의견보고서에서 위험성, 효율성 등 부정적인 측면이 지적된 만큼 신중하게 판단해야 한다는 것이다.
국토교통부는 지난 24일 안정성, 시공성, 운영성 등 7가지 점검 내용이 담긴 의견보고서를 제시했다. 보고서는 가덕도신공항 건설 시 안전사고 위험성이 크게 증가하고, 환경훼손도 뒤따른다고 지적하고 있다.
법무부도 가덕도신공항 특별법이 개별적·구체적 사건만 규율하고 있어서 적법 절차 및 평등원칙에 위배될 우려가 있다는 입장을 전했다.
이번 가덕도신공항 특별법에 포함된 예비타당성 조사 면제에 대한 반응도 좋지 않다. 기획재정부는 대규모 신규 사업에서 예산 낭비를 방지하려면 타당성을 검증할 필요가 있다는 의견을 전했다.
도내 시민사회단체는 “관계 부처까지도 수용이 곤란하다고 말하고 있다”며 “설계 없이 공사를 할 수 있게 한 유례를 찾을 수 없는 이 기상천외한 가덕도신공항 특별법은 위험하기 짝이 없다”고 비판했다.
가덕도신공항 특별법 처리를 앞두고 도내 시민사회단체와 민주노총 경남본부, 정의당 경남도당은 26일 법안 폐기를 촉구했다. 이들은 가덕도신공항 건설로 얻는 피해가 막대하다는 점을 거듭 강조했다. (경남 시민사회단체 제공) © 뉴스1
가덕도신공항 특별법 처리를 앞두고 도내 시민사회단체와 민주노총 경남본부, 정의당 경남도당은 26일 법안 폐기를 촉구했다. 이들은 가덕도신공항 건설로 얻는 피해가 막대하다는 점을 거듭 강조했다. (경남 시민사회단체 제공) © 뉴스1
당초 동남권 관문공항을 세우기 위한 목적에도 어긋난다는 지적도 나온다. 부산시가 발표한 가덕도신공항 건설안은 국제선만 개항하고, 국내선은 김해공항만 개항하도록 했는데 동남권 관문공항으로 보기에는 현실적이지 못하다는 설명이다.
국제선과 국내선, 군시설 등 동남권 관문공항의 기본 요소를 갖추려면 부산시에서 추산한 7조5000억 원보다 많은 28조7000억 원이 필요하다는 점도 짚었다. 이는 이명박 정부에 시행된 4대강 사업 예산보다도 많은 액수다.
도내 시민사회단체는 “정부와 여당이 적폐라 비난했던 이명박의 4대강 살리기 사업보다 더 나갔다”며 “동네 하천 정비도 이렇게 하지 않는다”고 일갈했다.
정의당 경남도당도 “모든 분야에서 부적격하다는 가덕도신공항 특별법을 밀어붙이는 건 집권 여당의 명백한 입법권 남용”이라며 “1년 임시 부산시장 자리를 위해 백년지대계인 공항건설을 선거지대계로 전락시킨 더불어민주당과 국민의힘을 규탄한다”고 전했다.
한편 가덕도신공항 특별법은 오늘 국회 본회의를 거쳐 통과될 전망이다. 이 법안은 예비타당성조사를 면제하고, 다른 법률에 우선 적용시키자는 내용 등을 담고 있다.
allcotton@news1.kr
Copyright ⓒ 뉴스1코리아 www.news1.kr 무단복제 및 전재 – 재배포금지

* news2.txt

경남 뉴스 김다솜 기자 가덕도 공항 특별법 국회통과 경남 도내 시민사회단체 정당 반대 의견 있다
경남 기후 위기 비상 행동 민주 노총 경남 본부 등 도내 시민사회단체 날 오전 경남 도청 앞 기자회견 관련 부처 반대 의견 표명 가덕도 공항 특별법 기득권 당 담합 행위 반발
정의당 남도 같다 입장 성명 발표
이 관련 부처 반대 사업 수 없다 강조 국토교통부 가덕도 공항 의견 보고서 위험성 효율 등 부정 측면 지적 만큼 신중하다 판단 것
국토교통부 지난 안정 공성 운영 등 가지 점검 내용 의견 보고서 제시 보고서 가덕도 공항 건설 시 안전 사고 위험성 크게 증가 환경 훼손 지적 있다
법무부 가덕도 공항 특별법 개별 구체 사건 규율 있다 적법 절차 및 평등 원칙 위배 우려 있다 입장 전
이번 가덕도 공항 특별법 포함 예비 타당성 조사 면제 대한 반응 좋다 기획재정부 대규모 신규 사업 예산 낭비 방지 타당성 검증 필요 있다 의견 전
도내 시민사회단체 관계 부처 수용 곤란하다 말 있다 며 설계 없이 공사 수 있다 유례 수 없다 이 기상 외한 가덕도 공항 특별법 위험하다 짝 없다 고 비판
가덕도 공항 특별법 처리 도내 시민사회단체 민주 노총 경남 본부 정의당 남도 법안 폐기 촉구 이 가덕도 공항 건설 피해 막대 점 거듭 강조 경남 시민사회단체 제공 뉴스
가덕도 공항 특별법 처리 도내 시민사회단체 민주 노총 경남 본부 정의당 남도 법안 폐기 촉구 이 가덕도 공항 건설 피해 막대 점 거듭 강조 경남 시민사회단체 제공 뉴스
당초 동남권 관문 공항 위 목적 지적도 부산시 발표 가덕도 공항 건설 국제선 개항 국내선 김해 공항 개항 동남권 관문 공항 보기 현실 못 설명
국제선 국내선 군시설 등 동남권 관문 공항 기본 요소 부산시 추산 원보 많다 원 필요하다 점도 이명박 정부 시행 대강 사업 예산 많다 액수
도내 시민사회단체 정부 여당 적폐 비난 이명박 대강 사업 더 며 동네 하천 정비 이렇게 고 일
정의당 남도 모든 분야 부 적격하다 가덕도 공항 특별법 건 집권 여당 명백하다 입법권 남용 라며 임시 부산시 자리 위해 백년 지대 공항 건설 선거 지대 전락 민주당 국민 힘 규탄 고 전
한편 가덕도 공항 특별법 오늘 국회 본회의 통과 전망 이 법안 예비 타당성조사 면제 다른 법률 우선 적용 내용 등 있다
allcotton@news1.kr
Copyright 뉴스 코리아 www.news1.kr 무단 복제 및 재 재 배포 금지

# 모델 생성
model = word2vec.Word2Vec(genObj, size=100, window=10, min_count=2, sg=1) # window : 참조 주변 단어 수, min_count : 2보다 적은 값은 모델 구성에서 제외, sg=0 cbow, sg=1 skip-gram,
#Cbow 주변 단어로 중심단어 예측
#skip-gram : 중심 단어로 주변단어 추측
print(model)
model.init_sims(replace=True) # 모델 제작 중 생성된 필요없는 메모리 해제

# 학습 시킨 모델은 저장 후 재사용 가능
try:
    model.save('news.model')
except Exception as e:
    print('err', e)

model = word2vec.Word2Vec.load('news.model')
print(model.wv.most_similar(positive=['사업']))
print(model.wv.most_similar(positive=['사업'], topn=3))
print(model.wv.most_similar(positive=['사업', '남도'], topn=3))
# positive : 단어 사전에 해당단어가 있을 확률
# negative : 단어 사전에 해당단어가 없을 확률
result = model.wv.most_similar(positive=['사업', '남도'], negative=['건설'])
print(result)

클라우드 차트

* nlp5.py

# 검색 결과를 형태소 분석하여 단어 빈도수를 구하고 이를 기초로 워드 클라우드 차트 출력
from bs4 import BeautifulSoup
import urllib.request
from urllib.parse import quote

#keyboard = input("검색어")
keyboard = "주식"
#print(quote(keyboard)) # encoding

# 동아일보 검색 기능
target_url ="https://www.donga.com/news/search?query=" + quote(keyboard)
print(target_url)
source_code = urllib.request.urlopen(target_url)
soup = BeautifulSoup(source_code, 'lxml', from_encoding='utf-8')

msg = ""
for title in soup.find_all("p", "tit"):
    title_link = title.select("a")
    #print(title_link)
    article_url = title_link[0]['href'] # [<a href="https://www.donga.com/news/Issue/031407" target="_blank">독감<span cl ... 
    #print(article_url) # https://bizn.donga.com/3/all/20210226/105634947/2 ..
    
    source_article = urllib.request.urlopen(article_url) # 실제 기사
    soup = BeautifulSoup(source_article, 'lxml', from_encoding='utf-8')
    cotents = soup.select('div.article_txt')
    #print(cotents)
    for temp in cotents:
        item = str(temp.find_all(text=True))
        #print(item)
        msg = msg + item

print(msg)

from urllib.parse import quote

quote(문자열) : encoding

from konlpy.tag import Okt
from collections import Counter

nlp = Okt()
nouns = nlp.nouns(msg)
result = []
for temp in nouns:
    if len(temp) > 1:
        result.append(temp)
print(result)
print(len(result))
count = Counter(result)
print(count)
tag = count.most_common(50) # 상위 50개만 작업에 참여

anaconda prompt 실행
pip install simplejson
pip install pytagcloud

import pytagcloud

taglist = pytagcloud.make_tags(tag, maxsize=100)
print(taglist)

pytagcloud.create_tag_image(taglist, "word.png", size=(1000, 600), fontname="korean", rectangular=False)

# 저장된 이미지 읽기
import matplotlib.pyplot as plt
import matplotlib.image as mpimg
# %matplotlib inline # 주피터
img = mpimg.imread("word.png")
plt.imshow(img)
plt.show()

# 브라우저로 출력
import webbrowser
webbrowser.open("word.png")

C:\Windows\Fonts 폰트 복사
C:\anaconda3\Lib\site-packages\pytagcloud\fonts 붙여넣기
fonts.json

{
        "name": "korean",
        "ttf": "malgun.ttf",
        "web": "http://fonts.googleapis.com/css?family=Nobile"
},

'BACK END > Python Library' 카테고리의 다른 글

[LINUX] 리눅스 - 명령어 /eclipse /FlashPlayer /DB /apache /R /anaconda /hadoop (0)	2021.03.26
[Pandas] pandas 정리2 - db, django (0)	2021.03.02
[MatPlotLib] matplotlib 정리 (0)	2021.03.02
[NumPy] numpy 정리 (0)	2021.02.23