[Deep Learning] Simple Linear Regression, Multiple Linear Regression

2021. 3. 11. 10:36

Simple Linear Regression

: ols()

: Independent variable: continuous. Dependent variable: continuous.
: One independent variable.

* linear_reg4.py

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
plt.rc('font', family='malgun gothic')

df = pd.read_csv('../testdata/drinking_water.csv')
print(df.head(3), '\n', df.describe())
'''
   친밀도  적절성  만족도
0    3    4    3
1    3    3    2
2    4    4    4 
'''

print(df.corr()) # correlation coefficient between 적절성 and 만족도 : 0.766853

print('----------------------------------------------------------------------')
import statsmodels.formula.api as smf

model = smf.ols(formula='만족도 ~ 적절성', data=df).fit()
print(model.summary())

                            OLS Regression Results                            
==============================================================================
Dep. Variable:                    만족도   R-squared:                       0.588
Model:                            OLS   Adj. R-squared:                  0.586
Method:                 Least Squares   F-statistic:                     374.0
Date:                Thu, 11 Mar 2021   Prob (F-statistic):           2.24e-52
Time:                        10:07:49   Log-Likelihood:                -207.44
No. Observations:                 264   AIC:                             418.9
Df Residuals:                     262   BIC:                             426.0
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept      0.7789      0.124      6.273      0.000       0.534       1.023
적절성            0.7393      0.038     19.340      0.000       0.664       0.815
==============================================================================
Omnibus:                       11.674   Durbin-Watson:                   2.185
Prob(Omnibus):                  0.003   Jarque-Bera (JB):               16.003
Skew:                          -0.328   Prob(JB):                     0.000335
Kurtosis:                       4.012   Cond. No.                         13.4
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

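[Interpretation]

Correlation coefficient ** 2 = coefficient of determination (this holds for simple regression with one independent variable)

print(0.766853 ** 2) # 0.588063523609
R-squared : coefficient of determination (explanatory power), the square of the correlation coefficient R : 0.588
: 1 - (SSE / SST)
: SSE (error sum of squares) - sum of squared differences between the observed y values and the trend line
: SST (total sum of squares) - sum of squared differences between the observed y values and their mean

=> Overfitting : an R2 very close to 1 (the line follows the existing data almost exactly) can mean the model explains new data poorly.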
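p-value of 적절성 : 0.000 < 0.05 => the coefficient is statistically significant, so the model is valid.
std err (standard error) : 0.038
Intercept (y-intercept) : 0.7789
coef (slope) : 0.7393
t = slope / standard error : 19.340

print(0.7393 / 0.038) # 19.455263157894738 (differs slightly from 19.340 because the inputs are rounded)
F-statistic = t**2 : 374.0 (true only when there is a single independent variable)

print(19.340 ** 2) # 374.0356

With many independent variables, a large gap between R-squared and Adj. R-squared suggests that some variables contribute little, so the independent variables (including their outliers) should be reviewed.
Kurtosis : 4.012 => a value above 3 means the distribution is more peaked around the mean, with heavier tails, than a normal distribution.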
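
The identities used in the interpretation above (R-squared = r ** 2, t = slope / standard error, F = t ** 2) can be checked numerically. The sketch below uses synthetic data generated with NumPy, not drinking_water.csv, and all variable names are assumptions made for illustration.

```python
import numpy as np

# Synthetic data (an assumption for illustration; not the drinking_water.csv data)
rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = 0.8 + 0.7 * x + rng.normal(scale=0.5, size=200)

n = len(x)
slope, intercept = np.polyfit(x, y, 1)      # least-squares line
y_hat = intercept + slope * x

sse = np.sum((y - y_hat) ** 2)              # error (residual) sum of squares
sst = np.sum((y - y.mean()) ** 2)           # total sum of squares
r_squared = 1 - sse / sst

r = np.corrcoef(x, y)[0, 1]                 # Pearson correlation coefficient

# Standard error of the slope, its t statistic, and the F statistic
se_slope = np.sqrt(sse / (n - 2) / np.sum((x - x.mean()) ** 2))
t_slope = slope / se_slope
f_stat = (sst - sse) / (sse / (n - 2))      # one independent variable, so df_model = 1

print('R-squared :', r_squared, ' r ** 2 :', r ** 2)
print('t ** 2 :', t_slope ** 2, ' F :', f_stat)
```

Both printed pairs should agree up to floating-point error. With more than one independent variable, F no longer equals the square of a single t value.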

print(model.params) # intercept and slope
# Intercept    0.778858
# 적절성          0.739276

print(model.rsquared) # 0.5880630629464404
print()
print(model.pvalues)
'''
Intercept    1.454388e-09
적절성          2.235345e-52
'''
#print(model.predict()) # predicted values
print(df.만족도[0],' ', model.predict()[0]) # 3   3.7359630488589186

# Predict new values
print(df.적절성[:5])
'''
0    4
1    3
2    4
3    2
4    2
'''

print(df.만족도[:5])
'''
0    3
1    2
2    4
3    2
4    2
'''

print(model.predict()[:5]) # [3.73596305 2.99668687 3.73596305 2.25741069 2.25741069]
print()

new_df = pd.DataFrame({'적절성':[6,5,4,3,22]})
new_pred = model.predict(new_df)
print('new_pred :\n', new_pred)
'''
 0     5.214515
1     4.475239
2     3.735963
3     2.996687
4    17.042934
'''

plt.scatter(df.적절성, df.만족도)
slope, intercept = np.polyfit(df.적절성, df.만족도, 1) # similar to abline in R
plt.plot(df.적절성, df.적절성 * slope + intercept, 'b') # trend line
plt.show()

Multiple Linear Regression

: more than one independent variable

model2 = smf.ols(formula='만족도 ~ 적절성 + 친밀도', data=df).fit()
print(model2.summary())

                            OLS Regression Results                            
==============================================================================
Dep. Variable:                    만족도   R-squared:                       0.598
Model:                            OLS   Adj. R-squared:                  0.594
Method:                 Least Squares   F-statistic:                     193.8
Date:                Thu, 11 Mar 2021   Prob (F-statistic):           2.61e-52
Time:                        11:19:33   Log-Likelihood:                -204.37
No. Observations:                 264   AIC:                             414.7
Df Residuals:                     261   BIC:                             425.5
Df Model:                           2                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept      0.6673      0.131      5.096      0.000       0.409       0.925
적절성            0.6852      0.044     15.684      0.000       0.599       0.771
친밀도            0.0959      0.039      2.478      0.014       0.020       0.172
==============================================================================
Omnibus:                       13.103   Durbin-Watson:                   2.174
Prob(Omnibus):                  0.001   Jarque-Bera (JB):               17.256
Skew:                          -0.382   Prob(JB):                     0.000179
Kurtosis:                       3.992   Cond. No.                         18.8
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

Simple Linear Regression

: iris dataset, using ols(). Build models with weakly and strongly correlated variables.

* linear_reg5.py

import pandas as pd
import statsmodels.formula.api as smf
import seaborn as sns
iris = sns.load_dataset('iris')
print(iris.head(3))
'''
   sepal_length  sepal_width  petal_length  petal_width species
0           5.1          3.5           1.4          0.2  setosa
1           4.9          3.0           1.4          0.2  setosa
2           4.7          3.2           1.3          0.2  setosa
'''

print(iris.drop(columns='species').corr()) # exclude the non-numeric species column
'''
              sepal_length  sepal_width  petal_length  petal_width
sepal_length      1.000000    -0.117570      0.871754     0.817941
sepal_width      -0.117570     1.000000     -0.428440    -0.366126
petal_length      0.871754    -0.428440      1.000000     0.962865
petal_width       0.817941    -0.366126      0.962865     1.000000
'''

# Simple linear regression model : correlation r = -0.117570 (sepal_length/sepal_width)
result = smf.ols(formula = 'sepal_length ~ sepal_width', data=iris).fit()
#print(result.summary()) #  R2 : 0.014
print(result.rsquared)# 0.01382 : the model explains only about 1.4% of the variance
print(result.pvalues) # 1.518983e-01 > 0.05 => not a meaningful model

result2 = smf.ols(formula = 'sepal_length ~ petal_length', data=iris).fit()
print(result2.summary()) #  R2 : 0.760      => explanatory power
print(result2.rsquared)# 0.7599 : the model explains about 76% of the variance
print(result2.pvalues) # 1.038667e-47 < 0.05 => meaningful model
print()

pred = result2.predict()
print('actual value :', iris.sepal_length[0]) # actual value : 5.1
print('predicted value :', pred[0])           # predicted value : 4.879094603339241

# Predict with new data
print(iris.petal_length[1:5])
new_data = pd.DataFrame({'petal_length':[1.4, 0.5, 8.5, 12.123]})
print(new_data)
'''
   petal_length
0         1.400
1         0.500
2         8.500
3        12.123
'''
y_pred_new = result2.predict(new_data)
print('Predicted sepal_length for new data :\n', y_pred_new)
'''
 0    4.879095
1    4.511065
2    7.782443
3    9.263968
'''

Multiple Linear Regression

result3 = smf.ols(formula = 'sepal_length ~ petal_length + petal_width', data=iris).fit()
print(result3.summary()) #  R2 : 0.766      => explanatory power

                            OLS Regression Results                            
==============================================================================
Dep. Variable:           sepal_length   R-squared:                       0.766
Model:                            OLS   Adj. R-squared:                  0.763
Method:                 Least Squares   F-statistic:                     241.0
Date:                Thu, 11 Mar 2021   Prob (F-statistic):           4.00e-47
Time:                        12:09:43   Log-Likelihood:                -75.023
No. Observations:                 150   AIC:                             156.0
Df Residuals:                     147   BIC:                             165.1
Df Model:                           2                                         
Covariance Type:            nonrobust                                         
================================================================================
                   coef    std err          t      P>|t|      [0.025      0.975]
--------------------------------------------------------------------------------
Intercept        4.1906      0.097     43.181      0.000       3.999       4.382
petal_length     0.5418      0.069      7.820      0.000       0.405       0.679
petal_width     -0.3196      0.160     -1.992      0.048      -0.637      -0.002
==============================================================================
Omnibus:                        0.383   Durbin-Watson:                   1.826
Prob(Omnibus):                  0.826   Jarque-Bera (JB):                0.540
Skew:                           0.060   Prob(JB):                        0.763
Kurtosis:                       2.732   Cond. No.                         25.3
==============================================================================

print('R-squared :', result3.rsquared)# 0.7662 : the model explains about 77% of the variance
print('p-value', result3.pvalues)
# petal_length    9.414477e-13
# petal_width     4.827246e-02  (both < 0.05 => meaningful model)
# y = 0.5418 * x1 -0.3196 * x2 + 4.1906

# Predict with new data
new_data2 = pd.DataFrame({'petal_length':[8.5, 12.12], 'petal_width':[8.5, 12.5]})
y_pred_new2 = result3.predict(new_data2)
print('Predicted sepal_length for new data :\n', y_pred_new2)
'''
 0    6.079508
1    6.762540
'''

Linear Regression Analysis

: mtcars dataset, using ols(). Build a model, then obtain estimates.

* linear_reg6.py

import statsmodels.api
import statsmodels.formula.api as smf
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

plt.rc('font', family='malgun gothic')

mtcars = statsmodels.api.datasets.get_rdataset('mtcars').data
print(mtcars)
'''
                      mpg  cyl   disp   hp  drat  ...   qsec  vs  am  gear  carb
Mazda RX4            21.0    6  160.0  110  3.90  ...  16.46   0   1     4     4
Mazda RX4 Wag        21.0    6  160.0  110  3.90  ...  17.02   0   1     4     4
'''
print(mtcars.columns) # Index(['mpg', 'cyl', 'disp', 'hp', 'drat', 'wt', 'qsec', 'vs', 'am', 'gear', 'carb'], dtype='object')
print(mtcars.describe())
print(np.corrcoef(mtcars.hp, mtcars.mpg)) # correlation coefficient : -0.77616837
print(np.corrcoef(mtcars.wt, mtcars.mpg)) # correlation coefficient : -0.86765938
print(mtcars.corr())

# Visualization
plt.scatter(mtcars.hp, mtcars.mpg)
plt.xlabel('horsepower')
plt.ylabel('mpg')
slope, intercept = np.polyfit(mtcars.hp, mtcars.mpg, 1) # degree-1 (linear) fit
plt.plot(mtcars.hp, mtcars.hp * slope + intercept, 'r')
plt.show()

# Simple linear regression
result = smf.ols('mpg ~ hp', data=mtcars).fit()
print(result.summary())
print(result.conf_int(alpha=0.05)) # 95% confidence intervals; upper bound for Intercept : 33.435772
print(result.summary().tables[0])
# Prediction formula : coef * x + Intercept
print('Predicted mpg for 110 hp :', -0.0682 * 110 + 30.0989) # 22.5969
print('Predicted mpg for 50 hp :', -0.0682 * 50 + 30.0989)   # 26.6889
# As horsepower increases, mpg decreases (a negative linear relationship). Use the predictions as a reference only.

# Multiple linear regression
result2 = smf.ols('mpg ~ hp + wt', data=mtcars).fit()
print(result2.summary())
print(result2.conf_int(alpha=0.05))
print(result2.summary().tables[0])
print('Predicted mpg for 110 hp and weight 5 :', ((-0.0318 * 110) +(-3.8778 * 5) + 37.2273)) # 14.3403

print('Estimate mpg from an entered car weight')
result3 = smf.ols('mpg ~ wt', data=mtcars).fit()
print(result3.summary())
print('R-squared :', result3.rsquared) # 0.7528327936582646 : explains about 75% of the variance, so the model has good explanatory power
pred = result3.predict()

# Compare the actual value and the predicted (estimated) value for one record
print(mtcars.mpg.iloc[0])
print(pred[0]) # estimated mpg for the first car's weight

data = {
    'mpg':mtcars.mpg,
    'mpg_pred':pred
    }
df = pd.DataFrame(data)
print(df)
'''
                      mpg   mpg_pred
Mazda RX4            21.0  23.282611
Mazda RX4 Wag        21.0  21.919770
Datsun 710           22.8  24.885952
'''

# Estimate mpg for a new car weight
new_wt_value = float(input('Enter car weight:'))  # keep the input separate so the wt column is not overwritten
new_pred = result3.predict(pd.DataFrame({'wt': [new_wt_value]}))
print('For car weight {}, the predicted mpg is {}'.format(new_wt_value, new_pred.iloc[0]))
# For car weight 1.0, the predicted mpg is 31.940654594619367

# Estimate mpg for several car weights
new_wt = pd.DataFrame({'wt':[6, 3, 0.5]})
new_pred2 = result3.predict(new_wt)
print('Predicted mpg : \n', np.round(new_pred2.values, 2)) #  [ 5.22 21.25 34.61]

Linear Regression Analysis

: Sales data by advertising spend across several media, using ols(). Build a model, then obtain estimates.

* linear_reg7

import statsmodels.api
import statsmodels.formula.api as smf
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns


adf_df = pd.read_csv('../testdata/Advertising.csv', usecols=[1,2,3,4])
print(adf_df.head(3), ' ', adf_df.shape) # (200, 4)
print(adf_df.index, adf_df.columns)
print(adf_df.info())
'''
      tv  radio  newspaper  sales
0  230.1   37.8       69.2   22.1
1   44.5   39.3       45.1   10.4
2   17.2   45.9       69.3    9.3
'''

print('correlation coefficient r : \n', adf_df.loc[:, ['sales', 'tv']].corr())
'''
           sales        tv
sales  1.000000  0.782224
tv     0.782224  1.000000
'''
# r : 0.782224 => a strong positive correlation. Correlation alone does not establish causation.
print()

lm = smf.ols(formula='sales ~ tv', data=adf_df).fit()
print(lm.summary()) # R-squared : 0.612, p : 1.47e-42
print(lm.params)
print(lm.pvalues)
print(lm.rsquared)

# Visualization
plt.scatter(adf_df.tv, adf_df.sales)
plt.xlabel('tv')
plt.ylabel('sales')
x = pd.DataFrame({'tv':[adf_df.tv.min(), adf_df.tv.max()]})
y_pred = lm.predict(x)
plt.plot(x, y_pred, c='red')
plt.title('Linear Regression')
sns.regplot(x=adf_df.tv, y=adf_df.sales, scatter_kws = {'color':'r'})
plt.xlim(-50, 350)
plt.ylim(ymin=0)
plt.show()

# Prediction : estimate sales for new tv values
x_new = pd.DataFrame({'tv':[230.1, 44.5, 100]})
pred = lm.predict(x_new)
print('estimates :\n', pred)
'''
0    17.970775
1     9.147974
2    11.786258
'''

print('\nMultiple linear regression model')
lm_mul = smf.ols(formula = 'sales ~ tv + radio + newspaper', data = adf_df).fit()
#  R2 is unchanged whether or not newspaper is included, so it should be removed.
print(lm_mul.summary())
print(adf_df.corr())

# Prediction 2 : estimate sales for new tv, radio, and newspaper values with the multiple regression model
x_new2 = pd.DataFrame({'tv':[230.1, 44.5, 100], 'radio':[30.1, 40.1, 50.1],\
                      'newspaper':[10.1, 10.1, 10.1]})
pred2 = lm_mul.predict(x_new2)
print('estimates :\n', pred2)

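Conditions for a Valid Regression Model

: If any condition below is violated, carefully consider removing or adjusting variables.

- Normality : the residuals should follow a normal distribution.
- Independence : the residuals should be independent of one another (no autocorrelation).
- Linearity : the dependent variable should change linearly with the independent variables; a systematic pattern in the residuals is a bad sign.
- Homoscedasticity : the variance of the errors (residuals) should be constant, spread evenly without any particular pattern.
- Multicollinearity : strong correlation among the independent variables should not cause problems.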

# Residuals
fitted = lm_mul.predict(adf_df)     # predicted values
print(fitted)
'''
0      20.523974
1      12.337855
2      12.307671
'''
residual = adf_df['sales'] - fitted # residuals

import seaborn as sns
print('Linearity - residuals should stay near 0 across the predicted values')
sns.regplot(x=fitted, y=residual, lowess = True, line_kws = {'color':'red'})
plt.plot([fitted.min(), fitted.max()], [0, 0], '--', color='grey')
plt.show() # Linearity is not satisfied.

print('Normality - check whether the residuals follow a normal distribution')
import scipy.stats as stats
sr = stats.zscore(residual)
(x, y), _ = stats.probplot(sr)
sns.scatterplot(x=x, y=y)
plt.plot([-3, 3], [-3, 3], '--', color="grey")
plt.show() # Normality is not satisfied.
print('residual test :', stats.shapiro(residual))
# residual test : ShapiroResult(statistic=0.9176644086837769, pvalue=3.938041004403203e-09)
# pvalue=3.938041004403203e-09 < 0.05 => normality is not satisfied.

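print('Independence - check whether the residuals are autocorrelated (errors of adjacent observations are correlated)')
# Durbin-Watson in model.summary() : 2.084 => checks whether the residuals are independent. A value near 2 means no autocorrelation (the residuals are uncorrelated with one another).
# Near 0 means positive autocorrelation; near 4 means negative autocorrelation.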
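
The Durbin-Watson statistic described above can be computed by hand as the sum of squared successive differences of the residuals divided by their sum of squares. The sketch below uses simulated residuals, not the Advertising data; `durbin_watson_stat` and `ar1` are hypothetical helper names introduced for illustration.

```python
import numpy as np

def durbin_watson_stat(resid):
    """Durbin-Watson: sum of squared successive differences / sum of squares."""
    resid = np.asarray(resid, dtype=float)
    return np.sum(np.diff(resid) ** 2) / np.sum(resid ** 2)

def ar1(phi, noise):
    """Build an AR(1) series e[t] = phi * e[t-1] + noise[t] (hypothetical helper)."""
    e = np.zeros(len(noise))
    for t in range(1, len(noise)):
        e[t] = phi * e[t - 1] + noise[t]
    return e

rng = np.random.default_rng(1)
noise = rng.normal(size=500)

dw_independent = durbin_watson_stat(noise)          # independent residuals
dw_positive = durbin_watson_stat(ar1(0.9, noise))   # positive autocorrelation
dw_negative = durbin_watson_stat(ar1(-0.9, noise))  # negative autocorrelation

print('independent :', dw_independent)
print('positive autocorrelation :', dw_positive)
print('negative autocorrelation :', dw_negative)
```

The three values should land near 2, near 0, and near 4 respectively. For a fitted statsmodels model, `statsmodels.stats.stattools.durbin_watson` applied to the residuals should give the same figure that summary() reports.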

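print('Homoscedasticity - check whether the variance of the residuals is constant')
sns.regplot(x=fitted, y=np.sqrt(np.abs(sr)), lowess = True, line_kws = {'color':'red'})
plt.show()
# The trend line is not horizontal, so homoscedasticity is not satisfied.

print('Multicollinearity - check for strong correlation among the independent variables')
# A variable whose VIF (Variance Inflation Factor) exceeds 10 can be regarded as causing multicollinearity.
# Note : adf_df.values also contains the dependent variable sales (column 3) and has no constant column; VIF is normally computed on the independent variables only.
from statsmodels.stats.outliers_influence import variance_inflation_factor
print(variance_inflation_factor(adf_df.values, 0)) # 23.198876299003153
print(variance_inflation_factor(adf_df.values, 1)) # 12.570312383503682
print(variance_inflation_factor(adf_df.values, 2)) # 3.1534983754953845
print(variance_inflation_factor(adf_df.values, 3)) # 55.3039198336228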
# View as a DataFrame
vif_df = pd.DataFrame()
vif_df['vif_value'] = [variance_inflation_factor(adf_df.values, i) for i in range(adf_df.shape[1])]
print(vif_df)
'''
   vif_value
0  23.198876
1  12.570312
2   3.153498
3  55.303920
'''
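
For reference, the VIF of a variable equals 1 / (1 - R2), where R2 comes from regressing that variable on the other independent variables. The sketch below is a minimal NumPy version under that definition, using synthetic predictors rather than the Advertising data; the function name `vif` and the columns a, b, c are assumptions for illustration.

```python
import numpy as np

def vif(X):
    """VIF for each column of X (independent variables only, no constant column)."""
    X = np.asarray(X, dtype=float)
    n, k = X.shape
    values = []
    for j in range(k):
        target = X[:, j]
        others = np.delete(X, j, axis=1)
        A = np.column_stack([np.ones(n), others])      # add an intercept column
        coef, *_ = np.linalg.lstsq(A, target, rcond=None)
        resid = target - A @ coef
        r2 = 1 - (resid @ resid) / np.sum((target - target.mean()) ** 2)
        values.append(1.0 / (1.0 - r2))
    return np.array(values)

rng = np.random.default_rng(2)
a = rng.normal(size=300)
b = rng.normal(size=300)                 # unrelated to a
c = a + 0.1 * rng.normal(size=300)       # nearly a copy of a

vifs = vif(np.column_stack([a, b, c]))
print(vifs)   # a and c should be far above 10, b close to 1
```

Because this version regresses on the independent variables only and includes an intercept, its values are not directly comparable to the figures printed above, which were computed on all four columns.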

print("Reference : Cook's distance - an indicator of extreme values")
from statsmodels.stats.outliers_influence import OLSInfluence
cd, _ = OLSInfluence(lm_mul).cooks_distance
print(cd.sort_values(ascending=False).head())
'''
130    0.272956
5      0.128306
75     0.056313
35     0.051275
178    0.045921
'''

import statsmodels.api as sm
sm.graphics.influence_plot(lm_mul, criterion='cooks')
plt.show()

print(adf_df.iloc[[130, 5, 75, 35, 178]]) # extreme values; excluding them from the analysis is recommended.
'''
        tv  radio  newspaper  sales
130    0.7   39.6        8.7    1.6
5      8.7   48.9       75.0    7.2
75    16.9   43.7       89.4    8.7
35   290.7    4.1        8.5   12.8
178  276.7    2.3       23.7   11.8
'''

* linear_reg8.py

from sklearn.linear_model import LinearRegression
import statsmodels.api

mtcars = statsmodels.api.datasets.get_rdataset('mtcars').data
print(mtcars[:3])
'''
                mpg  cyl   disp   hp  drat     wt   qsec  vs  am  gear  carb
Mazda RX4      21.0    6  160.0  110  3.90  2.620  16.46   0   1     4     4
Mazda RX4 Wag  21.0    6  160.0  110  3.90  2.875  17.02   0   1     4     4
Datsun 710     22.8    4  108.0   93  3.85  2.320  18.61   1   1     4     1
'''

# Does hp (horsepower) affect mpg (fuel efficiency)? If there is a causal relationship, predict the size of the effect on mpg (estimate, prediction) (quantitative analysis)
x = mtcars[['hp']].values
y = mtcars[['mpg']].values
print(x[:3])
'''
[[110]
 [110]
 [ 93]]
'''
print(y[:3])
'''
[[21. ]
 [21. ]
 [22.8]]
'''

import matplotlib.pyplot as plt
plt.scatter(x, y) # scatter plot
plt.show()

fit_model = LinearRegression().fit(x, y)   # fit the model
print('slope :', fit_model.coef_[0])       # slope : [-0.06822828]
print('intercept :', fit_model.intercept_) # intercept : [30.09886054]
# newY = fit_model.coef_[0] * newX + fit_model.intercept_

pred = fit_model.predict(x)
print(pred[:3])
print('predicted :', pred[:3].flatten()) # predicted : [22.59374995 22.59374995 23.75363068]
print('actual :', y[:3].flatten())       # actual : [21.  21.  22.8]
print()

# Evaluate model performance with R2 or RMSE
from sklearn.metrics import mean_squared_error
import numpy as np

lin_mse = mean_squared_error(y, pred)   # mean squared error
lin_rmse = np.sqrt(lin_mse)             # square root
print("MSE : ", lin_mse)                # MSE :  13.989822298268805
print("RMSE : ", lin_rmse)              # RMSE :  3.7402970868994894
print()

# Estimated mpg for a given horsepower
new_hp = [[100]]
new_pred = fit_model.predict(new_hp)
print('For %s hp, the estimated mpg is %s'%(new_hp[0][0], new_pred[0][0]))
# For 100 hp, the estimated mpg is 23.27603273246613

Linear Regression Analysis : Linear Regression
Ridge, Lasso, and ElasticNet for preventing overfitting

* linear_reg9.py

import numpy as np
import pandas as pd
from sklearn.datasets import load_iris

iris = load_iris()
print(iris)
'''
[[5.1, 3.5, 1.4, 0.2],
[4.9, 3. , 1.4, 0.2],
[4.7, 3.2, 1.3, 0.2],
[4.6, 3.1, 1.5, 0.2],
[5. , 3.6, 1.4, 0.2],
[5.4, 3.9, 1.7, 0.4],
[4.6, 3.4, 1.4, 0.3],
'''
print(iris.feature_names) # ['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']
print(iris.target)       
print(iris.target_names) # ['setosa' 'versicolor' 'virginica']
print()

iris_df = pd.DataFrame(iris.data, columns=iris.feature_names)
iris_df['target'] = iris.target
iris_df['target_names'] = iris.target_names[iris.target]
print(iris_df.head(3), ' ', iris_df.shape) # (150, 6)
'''
   sepal length (cm)  sepal width (cm)  ...  target  target_names
0                5.1               3.5  ...       0        setosa
1                4.9               3.0  ...       0        setosa
2                4.7               3.2  ...       0        setosa
'''

Source : https://www.educative.io/edpresso/overfitting-and-underfitting

# train / test split : one way to guard against overfitting
from sklearn.model_selection import train_test_split
train_set, test_set = train_test_split(iris_df, test_size = 0.3) # split the data into train 0.7 and test 0.3
print(train_set.head(2), ' ', train_set.shape) # (105, 6)
print(test_set.head(2), ' ', test_set.shape) # (45, 6)

# Linear regression
# Regularized linear regression adds a constraint on the regression coefficients (weights), which helps prevent the model from being over-optimized (overfitting).
from sklearn.linear_model import LinearRegression as lm
import matplotlib.pyplot as plt

print(train_set.iloc[:, [2]]) # petal.length
print(train_set.iloc[:, [3]]) # petal.width

model_ols = lm().fit(X=train_set.iloc[:, [2]], y=train_set.iloc[:, [3]])
print(model_ols.coef_[0])     # [0.41268804]
print(model_ols.intercept_)   # [-0.35472987]
pred = model_ols.predict(test_set.iloc[:, [2]])  # predict petal width from petal length
print('ols_pred :\n', pred[:5])

print('ols_real :\n', test_set.iloc[:, [3]][:5])
'''
      petal width (cm)
138               1.8
143               2.3
142               1.9
79                1.0
45                0.3
'''

# Regression method - Ridge : tune alpha (the penalty on the sum of squared weights) to avoid over/underfitting. Effective for handling multicollinearity.
from sklearn.linear_model import Ridge
model_ridge = Ridge(alpha=10).fit(X=train_set.iloc[:, [2]], y=train_set.iloc[:, [3]])

# scores
print(model_ridge.score(X=train_set.iloc[:, [2]], y=train_set.iloc[:, [3]])) #0.91923658601
print(model_ridge.score(X=test_set.iloc[:, [2]], y=test_set.iloc[:, [3]]))   #0.935219182367
print('ridge predict : ', model_ridge.predict(test_set.iloc[:, [2]]))
plt.scatter(train_set.iloc[:, [2]], train_set.iloc[:, [3]],  color='red')
plt.plot(test_set.iloc[:, [2]], model_ridge.predict(test_set.iloc[:, [2]]))
plt.show()

print('\nLasso')
# Regression method - Lasso : tune alpha (the penalty on the sum of absolute weights) to avoid over/underfitting.
from sklearn.linear_model import Lasso
model_lasso = Lasso(alpha=0.1, max_iter=1000).fit(X=train_set.iloc[:, [0,1,2]], y=train_set.iloc[:, [3]])

# scores
print(model_lasso.score(X=train_set.iloc[:, [0,1,2]], y=train_set.iloc[:, [3]])) #0.921241848687
print(model_lasso.score(X=test_set.iloc[:, [0,1,2]], y=test_set.iloc[:, [3]]))   #0.913186971647
print('number of features used : ', np.sum(model_lasso.coef_ != 0))   # number of features used :  1
plt.scatter(train_set.iloc[:, [2]], train_set.iloc[:, [3]],  color='red')
plt.plot(test_set.iloc[:, [2]], model_lasso.predict(test_set.iloc[:, [0,1,2]]))  # plot the Lasso predictions against petal length
plt.show()

# Regression method - Elastic Net : Ridge + Lasso
# Penalizes both the sum of squared weights and the sum of absolute weights, using the two constraints at the same time
from sklearn.linear_model import ElasticNet
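
The script stops at the ElasticNet import. A minimal sketch of how the model might be fitted, mirroring the Lasso step, is shown below. The values alpha=0.1, l1_ratio=0.5, and random_state=12 are assumptions chosen for illustration and are not taken from the original post.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import ElasticNet
from sklearn.model_selection import train_test_split

iris = load_iris()
X = iris.data[:, :3]     # sepal length, sepal width, petal length
y = iris.data[:, 3]      # petal width

# random_state is fixed only so that the sketch is reproducible (assumed value)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=12)

# l1_ratio mixes the two penalties: 0 is Ridge-like (L2), 1 is Lasso (L1)
model_elastic = ElasticNet(alpha=0.1, l1_ratio=0.5).fit(X_train, y_train)

train_score = model_elastic.score(X_train, y_train)   # R2 on the training set
test_score = model_elastic.score(X_test, y_test)      # R2 on the test set
n_used = int(np.sum(model_elastic.coef_ != 0))        # features with nonzero weight

print('train R2 :', train_score)
print('test R2 :', test_score)
print('number of features used :', n_used)
```

Comparing the train and test scores across Ridge, Lasso, and ElasticNet for several alpha values is one way to see how the strength of regularization affects overfitting.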

[Deep Learning] Linear Regression

2021. 3. 10. 15:13

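Regression Analysis

: Fit the trend line that minimizes the sum of squared residuals over the data, and use it to analyze how strongly the independent variable influences the dependent variable.

: Independent variable: continuous. Dependent variable: continuous.

: The two variables should be correlated and causally related. (|correlation coefficient| > 0.3)

: Produces a quantitative model.

- Machine learning (supervised learning) : build a model through training, then predict or classify new data

Linear Regression

Least Square Method

Y = a + b * X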
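
For a single independent variable, the least-squares values of a and b have a closed form: b = sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x)) ** 2) and a = mean(y) - b * mean(x). The sketch below applies it to the same four points used in linear_reg1.py; the variable names are chosen for illustration.

```python
import numpy as np

x = np.array([0, 1, 2, 3], dtype=float)
y = np.array([-1, 0.2, 0.9, 2.1])

# Slope: co-variation of x and y divided by the variation of x
b = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
# Intercept: the fitted line passes through the point (mean(x), mean(y))
a = y.mean() - b * x.mean()

print('slope b :', b, ' intercept a :', a)   # approximately 1.0 and -0.95
```

These match the slope and intercept that np.linalg.lstsq returns for the same points.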

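Basic Assumptions of Linear Regression

Linearity : the dependent variable should change by a constant amount as the independent variable (feature) changes.
Normality : the residuals should follow a normal distribution.
Independence : the residuals should be independent of one another.
Homoscedasticity : the variance of the errors should be constant across all values of the independent variable.
Multicollinearity : in multiple regression, there should be no strong correlation among two or more independent variables.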

- Obtaining the least-squares solution from a linear matrix equation

* linear_reg1.py

import numpy.linalg as lin
import numpy as np
import matplotlib.pyplot as plt

x = np.array([0, 1, 2, 3])
y = np.array([-1, 0.2, 0.9, 2.1])

plt.plot(x, y)
plt.grid(True)
plt.show()

A = np.vstack([x, np.ones(len(x))]).T
print(A)
'''
[[0. 1.]
 [1. 1.]
 [2. 1.]
 [3. 1.]]
'''

# y = mx + c
m, c = np.linalg.lstsq(A, y, rcond=None)[0]
print('slope :', m, ' intercept :', c) # slope : 0.9999999999999997   intercept : -0.949999999999999

plt.plot(x, y, 'o', label='Original data', markersize=10)
plt.plot(x, m*x + c, 'r', label='Fitted line')
plt.legend()
plt.show()

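np.vstack([x, y]) : stack y as a new row under x (the arrays are passed as one sequence)

np.ones(n) : create a 1-D array of n ones (np.ones((n, n)) creates an n x n matrix of ones)

x.T : transpose (swap rows and columns)

import numpy.linalg as lin

lin.lstsq() : least squares method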

# yhat = 0.9999999999999997 * x -0.949999999999999
print(0.9999999999999997 * 1 -0.949999999999999) # 0.05000000000000071
print(0.9999999999999997 * 3 -0.949999999999999) # 2.0500000000000003
print(0.9999999999999997 * 123 -0.949999999999999) # 122.04999999999995

Building a Model

* linear_reg2.py

Method 1 : use make_regression, without a model

import statsmodels.api as sm
from sklearn.datasets import make_regression
import numpy as np

np.random.seed(12)
x, y, coef = make_regression(n_samples=50, n_features=1, bias=100, coef=True)
print('x :{}, y:{}, coef:{}'.format(x, y, coef))
'''
x :[[-1.70073563]
 [-0.67794537]
 y:[ -52.17214291   39.34130801  

  '''
# slope coef:89.47430739278907
# regression equation y = a + bx      y = 100 + 89.47430739278907 * x
y_pred = 100 + 89.47430739278907  * -1.70073563
print('y_pred :', y_pred) # y_pred : -52.17214255248879

xx = x
yy = y

Method 2 : use LinearRegression, with a model

from sklearn.linear_model import LinearRegression
model = LinearRegression()
fit_model = model.fit(xx, yy) # estimate the model from the training data : get the intercept and slope
print(fit_model.coef_)      # slope 89.47430739
print(fit_model.intercept_) # intercept 100.0

# Check predictions with predict()
y_pred2 = fit_model.predict(xx[[0]])
print('y_pred2 :', y_pred2)

y_pred2_new = fit_model.predict([[66]])
print('y_pred2_new :', y_pred2_new)

Method 3 : use ols, with a model

import statsmodels.formula.api as smf
import pandas as pd

x1 = xx.flatten() # flatten to 1-D
print(x1.shape)
y1 = yy
print(y1)

data = np.array([x1, y1])
df = pd.DataFrame(data.T)
df.columns = ['x1', 'y1']
print(df.head(3))
'''
         x1          y1
0 -1.700736  -52.172143
1 -0.677945   39.341308
2  0.318665  128.512356
'''

model2 = smf.ols(formula='y1 ~ x1', data=df).fit()
print(model2.summary())

# Check predictions with predict()
print(x1[:2]) # [-1.70073563 -0.67794537]
new_df = pd.DataFrame({'x1':[-1.70073563, -0.67794537]}) # verify with existing data
new_pred = model2.predict(new_df)
print('new_pred :\n', new_pred)
'''
new_pred :
 0   -52.172143
1    39.341308
'''

new2_df = pd.DataFrame({'x1':[123, -2.34567]}) # check predictions for new values
new2_pred = model2.predict(new2_df)
print('new2_pred :\n', new2_pred)
'''
new2_pred :
 0    11105.339809
1     -109.877199
'''

Method 4 : use linregress, with a model

* linear_reg3.py

from scipy import stats
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

score_iq = pd.read_csv('../testdata/score_iq.csv')
print(score_iq.head(3))
'''
     sid  score   iq  academy  game  tv
0  10001     90  140        2     1   0
1  10002     75  125        1     3   3
2  10003     77  120        1     0   4
'''
print(score_iq.info())

# Test whether iq affects score
# Predict score (test score) from iq - quantitative analysis

x = score_iq.iq
y = score_iq.score

# Correlation coefficient
print(np.corrcoef(x, y)) # numpy  0.88222034
print(score_iq.corr())   # pandas 0.882220

# Assume the two variables are causally related and proceed with linear regression.
model = stats.linregress(x, y)
print(model)
print('p-value :', model.pvalue)       # p-value : 2.8476895206683644e-50
print('slope :', model.slope)          # slope : 0.6514309527270075
print('intercept :', model.intercept)  # intercept : -2.8564471221974657
# pvalue=2.8476895206683644e-50 < 0.05, so the model is significant.
# iq affects score.
# y = 0.6514309527270075 * x -2.8564471221974657 
print('prediction :', 0.6514309527270075 * 140 -2.8564471221974657)
# prediction : 88.34388625958358
print('prediction :', 0.6514309527270075 * 125 -2.8564471221974657)
# prediction : 78.57242196867847
print('prediction :', 0.6514309527270075 * 80 -2.8564471221974657)
# prediction : 49.25802909596313
print('prediction :', 0.6514309527270075 * 155 -2.8564471221974657)
# prediction : 98.11535055048869
print('prediction :', model.slope * 155 + model.intercept)
# prediction : 98.11535055048869

# linregress does not provide predict(). Use numpy's polyval instead.
#print('prediction :', np.polyval([model.slope, model.intercept], np.array(score_iq['iq'])))
new_df = pd.DataFrame({'iq':[55, 66, 77, 88, 155]})
print('prediction :\n', np.polyval([model.slope, model.intercept], new_df))
'''
prediction :
 [[32.97225528]
 [40.13799576]
 [47.30373624]
 [54.46947672]
 [98.11535055]]
'''

np.polyval([slope, intercept], data) : works like a predict function in numpy
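
np.polyval evaluates a polynomial whose coefficients are listed from the highest degree down, so [slope, intercept] gives slope * x + intercept. The sketch below reuses the slope and intercept printed above and checks that against the manual formula; the array names are assumptions for illustration.

```python
import numpy as np

# Slope and intercept as printed by stats.linregress in the script above
slope = 0.6514309527270075
intercept = -2.8564471221974657

iq = np.array([55, 66, 77, 88, 155])

pred_polyval = np.polyval([slope, intercept], iq)   # coefficients, highest degree first
pred_manual = slope * iq + intercept                # the same line written out

print(pred_polyval)
print(np.allclose(pred_polyval, pred_manual))
```

Passing a 1-D array returns a 1-D array of predictions, whereas the script passes a DataFrame and therefore gets a column-shaped result.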

[Deep Learning] Covariance, Correlation Coefficient

2021. 3. 10. 11:33

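Covariance

: Covariance : a value that describes the relationship between two random variables.
: Its range is unbounded, so it is hard to pick a threshold for judging the strength of the relationship.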

* relation1.py

import numpy as np
print(np.cov(np.arange(1, 6), np.arange(2, 7)))          # covariance 2.5
'''
[[2.5 2.5]
 [2.5 2.5]]
'''
print(np.cov(np.arange(10, 60), np.arange(20, 70)))      # 212.5
print(np.cov(np.arange(1, 6), np.arange(6, 1, -1)))      # -2.5
print(np.cov(np.arange(1, 6), (3, 3, 3, 3, 3)))          # 0

np.cov(x, y) : covariance

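Pearson Correlation Coefficient

: Takes values between -1 and 1.

: The closer the absolute value is to 1, the more strongly the two variables are related.

: A positive value means the dependent variable increases as the independent variable increases.

: A negative value means the dependent variable decreases as the independent variable increases.

: Measures linear relationships only.

r (range)		Relationship
-1.0	-0.7	Strong negative linear relationship
-0.7	-0.3	Clear negative linear relationship
-0.3	-0.1	Weak negative linear relationship
-0.1	0.1	Negligible linear relationship
0.1	0.3	Weak positive linear relationship
0.3	0.7	Clear positive linear relationship
0.7	1.0	Strong positive linear relationship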
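
The Pearson coefficient is the covariance divided by the product of the two standard deviations: r = cov(x, y) / (std(x) * std(y)). The sketch below computes it by hand and compares it with np.corrcoef; the data are the small x and y lists used in relation1.py, and the variable names are assumptions for illustration.

```python
import numpy as np

x = np.array([8, 3, 6, 6, 9, 4, 3, 9, 3, 4], dtype=float)
y = np.array([6, 2, 4, 6, 9, 5, 1, 8, 4, 5], dtype=float)

# Sample covariance (divide by n - 1, as np.cov does by default)
cov_xy = np.sum((x - x.mean()) * (y - y.mean())) / (len(x) - 1)

# Divide by the sample standard deviations (ddof=1 to match the covariance)
r_manual = cov_xy / (np.std(x, ddof=1) * np.std(y, ddof=1))
r_numpy = np.corrcoef(x, y)[0, 1]

print('covariance :', cov_xy)
print('r (manual) :', r_manual, ' r (np.corrcoef) :', r_numpy)
```

Dividing by the standard deviations is what removes the units and bounds the result to the range -1 to 1.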

print(np.corrcoef(np.arange(1, 6), np.arange(2, 7)))     # 1
'''
[[1. 1.]
 [1. 1.]]
 '''
print(np.corrcoef(np.arange(10, 60), np.arange(20, 70))) # 1
print()

np.corrcoef(x, y) : correlation coefficient

x = [8,3,6,6,9,4,3,9,3,4]
# x = [800,300,600,600,900,400,300,900,300,400] # x scaled by 100, for comparing covariance and correlation
print('x mean :', np.mean(x))     # x mean : 5.5
print('x variance :', np.var(x))  # x variance : 5.45

y = [6,2,4,6,9,5,1,8,4,5]
# y = [600,200,400,600,900,500,100,800,400,500] # y scaled by 100
print('y mean :', np.mean(y))     # y mean : 5.0
print('y variance :', np.var(y))  # y variance : 5.4
print()

# Check the relationship between the two variables
print('x, y covariance :', np.cov(x,y)[0, 1]) # changes with the scale of the data
# x, y covariance : 5.222222222222222
print('x, y correlation coefficient :', np.corrcoef(x, y)[0, 1]) # unchanged by the scale of the data
# x, y correlation coefficient : 0.8663686463212853
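
The two pairs of lists above (the original values and the same values multiplied by 100) illustrate that covariance depends on the scale of the data while the correlation coefficient does not. A minimal sketch of that comparison, with variable names chosen for illustration:

```python
import numpy as np

x = np.array([8, 3, 6, 6, 9, 4, 3, 9, 3, 4], dtype=float)
y = np.array([6, 2, 4, 6, 9, 5, 1, 8, 4, 5], dtype=float)

cov_small = np.cov(x, y)[0, 1]
cov_large = np.cov(x * 100, y * 100)[0, 1]     # both variables scaled by 100

r_small = np.corrcoef(x, y)[0, 1]
r_large = np.corrcoef(x * 100, y * 100)[0, 1]

print('covariance :', cov_small, '->', cov_large)   # grows by a factor of 100 * 100
print('correlation :', r_small, '->', r_large)      # stays the same
```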

import matplotlib.pyplot as plt
plt.plot(x, y, 'o')
plt.show()

m = [-3, -2, -1, 0, 1, 2 , 3]
n = [9, 4, 1, 0, 1, 4, 9]

plt.plot(m, n, '+')
plt.show()
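print('m, n correlation coefficient :', np.corrcoef(m, n)[0, 1])
# m, n correlation coefficient : 0.0
# The Pearson coefficient only captures linear relationships: here n = m ** 2 is an exact nonlinear relationship, yet r is 0.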

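Correlation Analysis

: Analyzes the strength of the correlation between two variables.
: Checks theoretical validity (independence). Candidate independent variables should be independent of one another.
: Candidate independent variables may exhibit multicollinearity; correlation analysis helps detect it.
: Expresses the closeness of a relationship as a number.

- Check the correlations among familiarity (친밀도), adequacy (적절성), and satisfaction (만족도) for a product.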

* relation2.py

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
plt.rc('font', family='malgun gothic')


df = pd.read_csv('../testdata/drinking_water.csv')
print(df.head(3), '\n', df.describe())
'''
   친밀도  적절성  만족도
0    3    4    3
1    3    3    2
2    4    4    4
'''
print()
print(np.std(df.친밀도))  # 0.968505126935272
print(np.std(df.적절성))  # 0.8580277077642035
print(np.std(df.만족도))  # 0.8271724742228969

print('* covariance')
print(np.cov(df.친밀도, df.적절성))
'''[[0.94156873 0.41642182]
 [0.41642182 0.73901083]]
'''
print(np.cov(df.친밀도, df.만족도))
print(np.cov(df.적절성, df.만족도))
print()
print(df.cov()) # pandas
'''
          친밀도       적절성       만족도
친밀도  0.941569  0.416422  0.375663
적절성  0.416422  0.739011  0.546333
만족도  0.375663  0.546333  0.686816
'''

print('* correlation coefficient')
print(np.corrcoef(df.친밀도, df.적절성))
'''
[[1.         0.49920861]
 [0.49920861 1.        ]]
'''
print(np.corrcoef(df.친밀도, df.만족도))
print(np.corrcoef(df.적절성, df.만족도))

print(df.corr()) # pandas
'''
          친밀도       적절성       만족도
친밀도  1.000000  0.499209  0.467145
적절성  0.499209  1.000000  0.766853
만족도  0.467145  0.766853  1.000000
'''
co_re = df.corr() # default : pearson
print(co_re['만족도'].sort_values(ascending=False))
'''
만족도    1.000000
적절성    0.766853
친밀도    0.467145
'''
print(df.corr(method='pearson'))  # for interval/ratio-scale variables that are roughly normal
print(df.corr(method='spearman')) # for ordinal variables, or when normality does not hold
print(df.corr(method='kendall'))  # rank-based, like Spearman

# Visualization
df.plot(kind="scatter", x='만족도', y='적절성')
plt.show()

from pandas.plotting import scatter_matrix
attr = ['친밀도', '적절성', '만족도']
scatter_matrix(df[attr], figsize=(10, 6)) # scatter plots off the diagonal, histograms on the diagonal
plt.show()

# heatmap : shows the strength of each correlation as a color
import seaborn as sns
sns.heatmap(df.corr())
plt.show()

# Heatmap with the correlation values written as text
corr = df.corr()
# Generate a mask for the upper triangle
mask = np.zeros_like(corr, dtype=bool)  # np.bool was removed in NumPy 1.24; the builtin bool works everywhere
mask[np.triu_indices_from(mask)] = True
# Draw the heatmap with the mask and correct aspect ratio
vmax = np.abs(corr.values[~mask]).max()
fig, ax = plt.subplots()     # Set up the matplotlib figure

sns.heatmap(corr, mask=mask, vmin=-vmax, vmax=vmax, square=True, linecolor="lightgray", linewidths=1, ax=ax)

for i in range(len(corr)):
    ax.text(i + 0.5, len(corr) - (i + 0.5), corr.columns[i], ha="center", va="center", rotation=45)
    for j in range(i + 1, len(corr)):
        s = "{:.3f}".format(corr.values[i, j])
        ax.text(j + 0.5, len(corr) - (i + 0.5), s, ha="center", va="center")
ax.axis("off")
plt.show()
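The `method=` options used with `df.corr()` above differ most visibly on data that is monotonic but not linear. A small sketch with made-up values (x and y here are hypothetical, not the drinking_water data), using the corresponding SciPy functions:

```python
import numpy as np
from scipy import stats

x = np.arange(1, 11)
y = x ** 3  # monotonic, but not linear

r, _ = stats.pearsonr(x, y)      # linear association
rho, _ = stats.spearmanr(x, y)   # rank-based
tau, _ = stats.kendalltau(x, y)  # rank-based

print(r, rho, tau)
```

The rank-based coefficients report a perfect monotonic relationship, while Pearson's r stays below 1 because the points do not lie on a straight line.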

Public data (correlation analysis using admission counts of foreign tourists at domestic tourist sites)

import json
import matplotlib.pyplot as plt
plt.rc('font', family='malgun gothic')
import pandas as pd


def Start():
    # Seoul tourist-site data
    fname = '서울특별시_관광지.json'
    with open(fname, 'r', encoding='utf-8') as f:  # close the file after reading
        jsonTP = json.load(f)                      # JSON text => Python objects
    #print(jsonTP)
    
    tour_table = pd.DataFrame(jsonTP, columns=('yyyymm', 'resNm', 'ForNum')) # year-month, site name, number of admissions
    tour_table = tour_table.set_index('yyyymm')
    print(tour_table)
    '''
                    resNm  ForNum
    yyyymm                   
    201101        창덕궁   14137
    201101        운현궁       0
    '''
    resNm = tour_table.resNm.unique()
    print('resNum :', resNm[:5]) # first 5 sites. resNum : ['창덕궁' '운현궁' '경복궁' '창경궁' '종묘']
    
    
    # Chinese visitor data
    cdf = '중국인방문객.json'
    jdata = json.loads(open(cdf, 'r', encoding='utf-8').read())
    #print(jdata)
    
    china_table = pd.DataFrame(jdata, columns=('yyyymm', 'visit_cnt')) # year-month, number of visitors
    china_table = china_table.rename(columns={'visit_cnt':'china'})
    china_table = china_table.set_index('yyyymm')
    print(china_table)
    '''
             china
    yyyymm        
    201101   91252
    201102  140571
    201103  141457
    '''
    
    # Japanese visitor data
    jdf = '일본인방문객.json'
    jdata = json.loads(open(jdf, 'r', encoding='utf-8').read())
    #print(jdata)
    
    japan_table = pd.DataFrame(jdata, columns=('yyyymm', 'visit_cnt')) # year-month, number of visitors
    japan_table = japan_table.rename(columns={'visit_cnt':'japan'})
    japan_table = japan_table.set_index('yyyymm')
    print(japan_table)
    '''
             japan
    yyyymm        
    201101  209184
    201102  230362
    '''
    
    # US visitor data
    udf = '미국인방문객.json'
    jdata = json.loads(open(udf, 'r', encoding='utf-8').read())
    #print(jdata)
    
    usa_table = pd.DataFrame(jdata, columns=('yyyymm', 'visit_cnt')) # year-month, number of visitors
    usa_table = usa_table.rename(columns={'visit_cnt':'usa'})
    usa_table = usa_table.set_index('yyyymm')
    print(usa_table)
    '''
              usa
    yyyymm       
    201101  43065
    201102  41077
    '''
    
    all_table = pd.merge(china_table, japan_table, left_index=True, right_index=True)
    all_table = pd.merge(all_table, usa_table, left_index=True, right_index=True)
    print(all_table)
    '''
                 china   japan    usa
    yyyymm                       
    201101   91252  209184  43065
    201102  140571  230362  41077
    '''
    r_list = []
    for tourPoint in resNm[:5]:
        r_list.append(SetScatterGraph(tour_table, all_table, tourPoint))
        #print(r_list)
        
    r_df = pd.DataFrame(r_list, columns=['관광지명', '중국','일본','미국'])
    r_df = r_df.set_index('관광지명')
    print(r_df)
    '''
                중국        일본        미국
    관광지명                              
    창덕궁  -0.058791  0.277444  0.402816
    운현궁   0.445945  0.302615  0.281258
    경복궁   0.525673 -0.435228  0.425137
    창경궁   0.451233 -0.164586  0.624540
    종묘   -0.583422  0.529870 -0.121127
    '''
    r_df.plot(kind='bar', rot=60)
    plt.show()

def SetScatterGraph(tour_table, all_table, tourPoint):
    tour = tour_table[tour_table['resNm'] == tourPoint]
    #print(tour)
    merge_table = pd.merge(tour, all_table, left_index=True, right_index=True)
    print(merge_table) # only the first 5 tourist sites take part
    '''
               resNm  ForNum   china   japan    usa
    yyyymm                                     
    201101   창덕궁   14137   91252  209184  43065
    201102   창덕궁   18114  140571  230362  41077
    '''
    # Visualization + correlation
    fig = plt.figure()
    fig.suptitle(tourPoint + ' correlation analysis')
    
    plt.subplot(1, 3, 1)
    plt.xlabel('Chinese visitors')
    plt.ylabel('Foreign admissions')
    r1 = merge_table['china'].corr(merge_table['ForNum'])  # Pearson r
    print('r1 :', r1)
    plt.title('r={:.3f}'.format(r1))
    plt.scatter(merge_table['china'], merge_table['ForNum'], s=6, c='black')
    
    plt.subplot(1, 3, 2)
    plt.xlabel('Japanese visitors')
    plt.ylabel('Foreign admissions')
    r2 = merge_table['japan'].corr(merge_table['ForNum'])
    print('r2 :', r2)
    plt.title('r={:.3f}'.format(r2))
    plt.scatter(merge_table['japan'], merge_table['ForNum'], s=6, c='red')
    
    plt.subplot(1, 3, 3)
    plt.xlabel('US visitors')
    plt.ylabel('Foreign admissions')
    r3 = merge_table['usa'].corr(merge_table['ForNum'])
    print('r3 :', r3)
    plt.title('r={:.3f}'.format(r3))
    plt.scatter(merge_table['usa'], merge_table['ForNum'], s=6, c='blue')
    
    plt.show()
    return [tourPoint, r1, r2, r3]
    '''
        r1 : -0.05879110406006314
    r2 : 0.2774443570141011
    r3 : 0.4028160633050156
    r1 : 0.44594488384450376
    r2 : 0.30261521828798604
    r3 : 0.2812576500158649
    r1 : 0.5256734293511215
    r2 : -0.43522818613412334
    r3 : 0.4251372638704492
    r1 : 0.4512325398089607
    r2 : -0.16458589402253013
    r3 : 0.6245403780269381
    r1 : -0.5834218986767473
    r2 : 0.5298702802205213
    r3 : -0.1211266682929496
    '''
if __name__ == '__main__':
    Start()
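The three per-country correlations against ForNum can also be computed in one call with `DataFrame.corrwith`. This is only a sketch of the call pattern; the numbers below are hypothetical and are not the tourism data.

```python
import pandas as pd

# Hypothetical monthly counts, only to show the call pattern
merge_table = pd.DataFrame({
    'ForNum': [100, 120, 90, 150, 130],
    'china':  [10, 14, 9, 20, 15],
    'japan':  [30, 25, 33, 20, 27],
    'usa':    [5, 6, 5, 7, 8],
})

# Pearson r of each country column against the admission counts
r = merge_table[['china', 'japan', 'usa']].corrwith(merge_table['ForNum'])
print(r)
```

The result is a Series indexed by column name, so it can be appended to a list of rows in the same way as the `[tourPoint, r1, r2, r3]` list in the script.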

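[Deep Learning] Binomial test

2021. 3. 10. 10:17

Binomial test

: Useful for judging the distribution of a random variable whose outcome takes one of two values.
: Applies to discrete variables.

Usage : stats.binom_test() : binomial test based on the proportion of a nominal-scale variable

Null hypothesis : after customer-service training for staff, the satisfaction rate with the customer guidance service is 80%.
Alternative hypothesis : after customer-service training for staff, the satisfaction rate with the customer guidance service is not 80%.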

* binom.py

import pandas as pd
import scipy.stats as stats

data = pd.read_csv("../testdata/one_sample.csv")
print(data.head(3))
print()
'''
   no    gender  survey time
0   1         2       1  5.1
1   2         2       0  5.2
2   3         2       1  4.7
'''

ctab = pd.crosstab(index=data['survey'], columns='count')  # public pd.crosstab instead of the pandas.core internal path
ctab.index = ['불만족', '만족']  # dissatisfied, satisfied
print(ctab)
'''
col_0  count
불만족       14
만족       136
'''

print('Two-sided test (against the existing 80% satisfaction rate) : no direction.')
x = stats.binom_test([136, 14], p = 0.8, alternative="two-sided")
print(x)

stats.binom_test(x, p = , alternative="two-sided") : two-sided test

p-value : 0.00067 < 0.05, so the null hypothesis is rejected.
Alternative hypothesis : after customer-service training for staff, the satisfaction rate with the customer guidance service is not 80%.
A two-sided test does not say whether the rate is higher or lower, only that it differs.

print('\nTwo-sided test (against the existing 20% dissatisfaction rate) : no direction.')
x = stats.binom_test([14, 136], p = 0.2, alternative="two-sided")
print(x)

stats.binom_test(x, p = , alternative="two-sided") : two-sided test (condition reversed)

p-value : 0.00067 < 0.05, so the null hypothesis is rejected.

print('\nOne-sided test (against the existing 80% satisfaction rate) : directional.')
x = stats.binom_test([136, 14], p = 0.8, alternative="greater")
print(x) # p-value : 0.000317 < 0.05, so the null hypothesis is rejected.

stats.binom_test(x, p = , alternative="greater") : one-sided test

p-value : 0.000317 < 0.05, so the null hypothesis is rejected.

print('\nOne-sided test (against the existing 20% dissatisfaction rate) : directional.')
x = stats.binom_test([14, 136], p = 0.2, alternative="less")
print(x) # p-value : 0.000317 < 0.05, so the null hypothesis is rejected.

stats.binom_test(x, p = , alternative="less") : one-sided test (condition reversed)

p-value : 0.000317 < 0.05, so the null hypothesis is rejected.
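Newer SciPy releases replace `stats.binom_test` with `stats.binomtest`, which takes the success count and the sample size separately and returns a result object. The sketch below assumes SciPy 1.7 or later and reuses the counts from this exercise (136 satisfied out of 150); it also shows where the one-sided p-value comes from.

```python
from scipy import stats

k, n, p0 = 136, 150, 0.8  # satisfied count, sample size, hypothesized rate

two_sided = stats.binomtest(k, n=n, p=p0, alternative='two-sided')
greater = stats.binomtest(k, n=n, p=p0, alternative='greater')
print(two_sided.pvalue, greater.pvalue)

# The one-sided p-value is the upper tail P(X >= 136) under Binomial(150, 0.8)
tail = stats.binom.sf(k - 1, n, p0)
print(tail)
```

The survival function `sf(k - 1)` gives P(X > k - 1), which is the same as P(X >= k) for a discrete variable.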

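Proportion test

: Tests whether the proportion in a group equals a specific value

* one-sample
At company a, 45 of 100 employees smoke. National statistics put the smoking rate at 35%. Test whether the two proportions are the same.

Null hypothesis : the smoking rate at company a equals the national smoking rate.
Alternative hypothesis : the smoking rate at company a differs from the national smoking rate.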

import numpy as np
from statsmodels.stats.proportion import proportions_ztest

count = np.array([45])
nobs = np.array([100])
val = 0.35

z, p = proportions_ztest(count=count, nobs=nobs, value=val)
print('z : {}, p : {}'.format(z, p)) # z : [2.01007563], p : [0.04442318]

proportions_ztest(count=count, nobs=nobs, value=val) : proportion test.

p-value : 0.04442318 < 0.05, so the null hypothesis is rejected.
Alternative hypothesis : the smoking rate at company a differs from the national smoking rate.
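The z statistic above can be reproduced by hand. A minimal sketch, assuming the standard error is computed from the sample proportion (this choice matches the printed z value):

```python
import numpy as np
from scipy import stats

count, nobs, p0 = 45, 100, 0.35
p_hat = count / nobs

# Standard error based on the sample proportion
se = np.sqrt(p_hat * (1 - p_hat) / nobs)
z = (p_hat - p0) / se
p_value = 2 * stats.norm.sf(abs(z))  # two-sided p-value
print(z, p_value)
```

Using the hypothesized rate p0 in the standard error instead would give a slightly different z, so it is worth knowing which variant a given function uses.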

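* two-sample
At company a, 100 of 300 employees eat hamburgers; at company b, 170 of 400 do. Test the difference between the two proportions.

Null hypothesis : there is no difference.
Alternative hypothesis : there is a difference.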

count = np.array([100, 170])
nobs = np.array([300, 400])

z, p = proportions_ztest(count=count, nobs=nobs, value=0)
print('z : {}, p : {}'.format(z, p)) # z : -2.4656701201792273, p : 0.013675721698622408

proportions_ztest(count=count, nobs=nobs, value=0) : proportion test.

p-value : 0.013675 < 0.05, so the null hypothesis is rejected.
Alternative hypothesis : there is a difference.
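For two samples, the printed z value corresponds to a pooled standard error. A short sketch of that calculation with the same counts:

```python
import numpy as np
from scipy import stats

count = np.array([100, 170])
nobs = np.array([300, 400])

p_hat = count / nobs                 # per-group proportions
p_pool = count.sum() / nobs.sum()    # pooled proportion under the null hypothesis
se = np.sqrt(p_pool * (1 - p_pool) * (1 / nobs).sum())

z = (p_hat[0] - p_hat[1]) / se
p_value = 2 * stats.norm.sf(abs(z))  # two-sided p-value
print(z, p_value)
```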

[Deep Learning] ANOVA

2021. 3. 8. 16:17

ANOVA(analysis of variance)

: independent variable categorical (3 or more groups), dependent variable continuous

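Independent variable x	Dependent variable y	Analysis method	Variant and function
Categorical	Categorical	Chi-square test (cross-tabulation)	One-way chi-square : 1 variable. import scipy.stats as stats, stats.chisquare()
Categorical	Categorical	Chi-square test (cross-tabulation)	Two-way chi-square : 2 or more variables. stats.chi2_contingency()
Categorical	Continuous	T test : 2 or fewer categories	One-sample t-test : 1 group. stats.ttest_1samp(data, popmean=population mean)
Categorical	Continuous	T test : 2 or fewer categories	Independent samples t test : two groups, normally distributed with equal variances. stats.ttest_ind(data, ..., equal_var=True)
Categorical	Continuous	T test : 2 or fewer categories	Paired samples t test : the same subjects before and after a treatment. stats.ttest_rel(data, ..)
Categorical	Continuous	ANOVA : 3 or more categories	One-way ANOVA : 1 factor with 3 or more groups. import statsmodels.api as sm, from statsmodels.formula.api import ols, model = ols('dependent ~ independent', data).fit(), sm.stats.anova_lm(model, typ=2), stats.f_oneway(gr1, gr2, gr3). With two factors : model = ols('dependent ~ independent1 + independent2', data).fit()
Continuous	Categorical	Logistic regression
Continuous	Continuous	Regression analysis, structural equation modeling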

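One-way ANOVA

: 1 factor with 3 groups

Exercise 1 : three teaching methods were applied for one month, and 80 trainees then took a practical exam.

Null hypothesis : there is no difference in mean practical-exam scores among the 3 teaching methods.
Alternative hypothesis : there is a difference in mean practical-exam scores among the 3 teaching methods.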

* anova1.py

import pandas as pd
import scipy.stats as stats
import matplotlib.pyplot as plt

data = pd.read_csv('../testdata/three_sample.csv')
print(data.head(3), data.shape)
'''
   no  method  survey  score
0   1       1       1     72
1   2       3       1     87
2   3       2       1     78 (80, 4)
'''
print(data.describe())

- Remove outliers

data = data.query('score <= 100')
plt.boxplot(data.score)
#plt.show()

# Independence : can be checked through correlation

# Equal variance
result = data[['method', 'score']]
m1 = result[result['method'] == 1]
m2 = result[result['method'] == 2]
m3 = result[result['method'] == 3]
#print(m1)
score1 = m1['score']
score2 = m2['score']
score3 = m3['score']
print('equal variance :', stats.levene(score1, score2, score3).pvalue)    # 0.11322 > 0.05, so equal variances hold
print('equal variance :', stats.fligner(score1, score2, score3).pvalue)   # 0.10847
print('equal variance :', stats.bartlett(score1, score2, score3).pvalue)  # 0.15251

- Normality

print(stats.shapiro(score1))  # Shapiro-Wilk tests one sample for normality; repeat for score2 and score3
print('same distribution :', stats.ks_2samp(score1, score2).pvalue) # pvalue=0.3096 > 0.05, no evidence the two groups differ in distribution
print('same distribution :', stats.ks_2samp(score1, score3).pvalue) # pvalue=0.7162 > 0.05
print('same distribution :', stats.ks_2samp(score2, score3).pvalue) # pvalue=0.7724 > 0.05

stats.ks_2samp(data1, data2).pvalue : tests whether two samples come from the same distribution (it does not test normality by itself; use stats.shapiro for that)
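The difference between the two checks is easiest to see on data that is clearly not normal. In the sketch below, the two samples are hypothetical draws from the same skewed distribution: `ks_2samp` only compares the samples with each other, while `shapiro` compares each sample with a normal distribution.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Two hypothetical samples from the same right-skewed (exponential) distribution
a = rng.exponential(scale=1.0, size=200)
b = rng.exponential(scale=1.0, size=200)

print(stats.ks_2samp(a, b).pvalue)  # compares a with b; says nothing about normality
print(stats.shapiro(a).pvalue)      # tests a against a normal distribution
print(stats.shapiro(b).pvalue)      # tests b against a normal distribution
```

A large `ks_2samp` p-value is therefore compatible with both groups being non-normal, so a per-group Shapiro-Wilk test is the safer normality check before ANOVA.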

print('Counts by teaching method')
data2 = pd.crosstab(index = data['method'], columns = 'count')
print(data2)
'''
Counts by teaching method
col_0   count
method       
1          26
2          28
3          24
'''

print('Satisfaction counts by teaching method')
data3 = pd.crosstab(data['method'], data['survey'])
data3.index = ['method 1', 'method 2', 'method 3']
data3.columns = ['satisfied', 'dissatisfied']
print(data3)
'''
Satisfaction counts by teaching method
          satisfied  dissatisfied
method 1          9            17
method 2         10            18
method 3          8            16
'''

- ANOVA

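import statsmodels.api as sm
from statsmodels.formula.api import ols
# method is coded 1/2/3, so 'score ~ method' treats it as a numeric predictor (df = 1);
# write C(method) to compare the three teaching methods as categories.
model = ols('score ~ method', data).fit()
table = sm.stats.anova_lm(model, typ=1)  # the keyword is typ; an unknown keyword such as type=2 is silently ignored, leaving the default typ=1
print(table)
'''            df        sum_sq     mean_sq         F    PR(>F)
method     1.0     27.980888   27.980888  0.122228  0.727597
Residual  76.0  17398.134497  228.922822       NaN       NaN
'''
print(model.summary())

import statsmodels.api as sm

from statsmodels.formula.api import ols

model = ols('dependent ~ independent', data).fit() : model
sm.stats.anova_lm(model, typ=2) : anova

p-value : 0.727597 > 0.05, so we fail to reject the null hypothesis.
# Null hypothesis : there is no difference in mean practical-exam scores among the 3 teaching methods.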

# ANOVA with two independent variables
model2 = ols('score ~ method + survey', data).fit()
table2 = sm.stats.anova_lm(model2, typ=1)  # the keyword is typ, not type; the table below is the Type I (sequential) table
print(table2)
'''
            df        sum_sq     mean_sq         F    PR(>F)
method     1.0     27.980888   27.980888  0.120810  0.729131
survey     1.0     27.324458   27.324458  0.117976  0.732201
Residual  75.0  17370.810039  231.610801       NaN       NaN
'''
# mean_sq = sum_sq / df

import numpy as np
print()
print(np.mean(score1)) # 67.38461538461539
print(np.mean(score2)) # 68.35714285714286
print(np.mean(score3)) # 68.875
print()

model = ols('dependent ~ independent1 + independent2', data).fit() : multiple regression
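The columns of an ANOVA table are linked by simple arithmetic: mean_sq = sum_sq / df, F is the ratio of the two mean squares, and PR(>F) is the upper tail of the F distribution. A short check using the values printed for the `score ~ method` model:

```python
from scipy import stats

# Values copied from the printed table for 'score ~ method'
ss_method, df_method = 27.980888, 1
ss_resid, df_resid = 17398.134497, 76

ms_method = ss_method / df_method
ms_resid = ss_resid / df_resid
F = ms_method / ms_resid
p = stats.f.sf(F, df_method, df_resid)  # upper tail of the F distribution

print(ms_resid, F, p)
```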

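Post hoc test

: Compares every possible pair of groups and identifies which pairs differ by more than the expected standard error.
: Checks whether the difference in group means is meaningful.

from statsmodels.stats.multicomp import pairwise_tukeyhsd
# The first argument must be the response column (score). The sample output below was generated with the
# whole DataFrame as the first argument, so its meandiff values do not match the group means (67.38, 68.36, 68.88).
tResult = pairwise_tukeyhsd(data.score, data.method)
print(tResult)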
'''
 Multiple Comparison of Means - Tukey HSD, FWER=0.05 
=====================================================
group1 group2 meandiff p-adj   lower    upper  reject
-----------------------------------------------------
     1      2   3.8352 0.7997 -11.3827  19.053  False
     1      3  -0.0577    0.9 -15.8743 15.7589  False
     2      3  -3.8929 0.8018  -19.436 11.6503  False
-----------------------------------------------------
'''
tResult.plot_simultaneous()
plt.show()

from statsmodels.stats.multicomp import pairwise_tukeyhsd

result = pairwise_tukeyhsd(data.dependent, data.independent) : post hoc test.

result.plot_simultaneous() : visualization.

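Testing the difference in group means with one-way ANOVA

Test the difference in mean pay of part-time workers at GS convenience stores in 3 areas of Gangnam-gu

Null hypothesis : there is no difference in mean pay among the 3 areas.
Alternative hypothesis : there is a difference in mean pay among the 3 areas.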

* anova2.py

import scipy.stats as stats
import pandas as pd
import  numpy as np
import urllib.request
import matplotlib.pyplot as plt
from statsmodels.formula.api import ols
from statsmodels.stats.anova import anova_lm

url = "https://raw.githubusercontent.com/pykwon/python/master/testdata_utf8/group3.txt"
data = np.genfromtxt(urllib.request.urlopen(url), delimiter=',')
print(data)
'''
[[243.   1.]
 [251.   1.]
 [275.   1.]
 [291.   1.]
 ...
 
'''

np.genfromtxt(urllib.request.urlopen(url), delimiter=',') : read data from a URL

# Three groups
gr1 = data[data[:, 1] == 1, 0]
gr2 = data[data[:, 1] == 2, 0]
gr3 = data[data[:, 1] == 3, 0]
print(gr1, np.average(gr1))
# [243. 251. 275. 291. 347. 354. 380. 392.] 316.625
print(gr2, np.average(gr2))
# [206. 210. 226. 249. 255. 273. 285. 295. 309.] 256.44444444444446
print(gr3, np.average(gr3))
# [241. 258. 270. 293. 328.] 278.0

# Normality
print(stats.shapiro(gr1)) # pvalue=0.3336 > 0.05, normality holds
print(stats.shapiro(gr2)) # pvalue=0.6561 > 0.05, normality holds
print(stats.shapiro(gr3)) # pvalue=0.8324 > 0.05, normality holds

# Equal variance
print(stats.bartlett(gr1, gr2, gr3)) # pvalue=0.3508 > 0.05, equal variances hold

# Visualization
plot_data = [gr1, gr2, gr3]
plt.boxplot(plot_data)
#plt.show()

# Method 1
df = pd.DataFrame(data, columns=['value', 'group'])
print(df)
'''
    value  group
0   243.0    1.0
1   251.0    1.0
2   275.0    1.0
3   291.0    1.0
4   347.0    1.0
5   354.0    1.0
'''
model = ols('value ~ C(group)', df).fit() # C(name) marks the variable explicitly as categorical
print(anova_lm(model))
print()

# Method 2
f_statistic, p_val = stats.f_oneway(gr1, gr2, gr3)
print('f_statistic : {}, p_val : {}'.format(f_statistic, p_val))
# f_statistic : 3.7113359882669763, p_val : 0.043589334959178244
# p_val < 0.05, so the null hypothesis is rejected: mean pay differs among the 3 areas.

model = ols('dependent ~ C(independent)', df).fit() : C(name) marks the variable explicitly as categorical

f_statistic, p_val = stats.f_oneway(gr1, gr2, gr3) : one-way ANOVA
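What `stats.f_oneway` computes can be written out with between-group and within-group sums of squares. The sketch below uses the three groups printed above and reproduces the same F statistic and p-value.

```python
import numpy as np
from scipy import stats

gr1 = np.array([243, 251, 275, 291, 347, 354, 380, 392], dtype=float)
gr2 = np.array([206, 210, 226, 249, 255, 273, 285, 295, 309], dtype=float)
gr3 = np.array([241, 258, 270, 293, 328], dtype=float)
groups = [gr1, gr2, gr3]

all_values = np.concatenate(groups)
grand_mean = all_values.mean()
k, n = len(groups), len(all_values)

# Between-group variation: group means around the grand mean
ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
# Within-group variation: values around their own group mean
ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)

F = (ss_between / (k - 1)) / (ss_within / (n - k))
p = stats.f.sf(F, k - 1, n - k)
print(F, p)
print(stats.f_oneway(gr1, gr2, gr3))
```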

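One-way ANOVA

Using a restaurant's sales data and weather data, test whether mean sales differ by temperature.

Temperature is split into 3 groups.

Null hypothesis : mean sales do not differ by temperature.
Alternative hypothesis : mean sales differ by temperature.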

* anova3.py

import numpy as np
import scipy.stats as stats
import pandas as pd
import matplotlib.pyplot as plt

# Sales data
sales_data = pd.read_csv('https://raw.githubusercontent.com/pykwon/python/master/testdata_utf8/tsales.csv', dtype={'YMD':'object'})
print(sales_data.head(3)) # 328 rows
'''
        YMD    AMT  CNT
0  20190514      0    1
1  20190519  18000    1
2  20190521  50000    4
'''
print(sales_data.info())

# Weather data
wt_data = pd.read_csv('https://raw.githubusercontent.com/pykwon/python/master/testdata_utf8/tweather.csv')
print(wt_data.head(3)) # 702 rows
'''
   stnId          tm  avgTa  minTa  maxTa  sumRn  maxWs  avgWs  ddMes
0    108  2018-06-01   23.8   17.5   30.2    0.0    4.3    1.9    0.0
1    108  2018-06-02   23.4   17.6   30.1    0.0    4.5    2.0    0.0
2    108  2018-06-03   24.0   16.9   30.8    0.0    4.2    1.6    0.0
'''
print(wt_data.info())

# Join on the date
wt_data.tm = wt_data.tm.map(lambda x : x.replace('-','')) # remove '-' from wt_data.tm
#print(wt_data.head(3))
frame = sales_data.merge(wt_data, how='left', left_on='YMD', right_on='tm') # join
print(frame.head(3), frame.shape) # (328, 12)
'''
        YMD    AMT  CNT  stnId        tm  ...  maxTa  sumRn  maxWs  avgWs  ddMes
0  20190514      0    1    108  20190514  ...   26.9    0.0    4.1    1.6    0.0
1  20190519  18000    1    108  20190519  ...   21.6   22.0    2.7    1.2    0.0
2  20190521  50000    4    108  20190521  ...   23.8    0.0    5.9    2.9    0.0
'''
print(frame.columns)

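# Keep only the columns used in the analysis
data = frame.iloc[:, [0,1,7,8]].copy()  # copy, so the column added later does not trigger SettingWithCopyWarning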
print(data.head(3))
'''
        YMD    AMT  maxTa  sumRn
0  20190514      0   26.9    0.0
1  20190519  18000   21.6   22.0
2  20190521  50000   23.8    0.0
'''

- Convert the continuous daily maximum temperature into a nominal (categorical) variable by binning

print(data.maxTa.describe())
plt.boxplot(data.maxTa)
plt.show()

# Temperature groups: cold, moderate, hot (0, 1, 2)
data['Ta_gubun'] = pd.cut(data.maxTa, bins = [-5, 8, 24, 37], labels = [0, 1, 2])
print(data.head(5))
#print(data.isnull().sum())
#data = data[data.Ta_gubun.notna()] # drop rows with NA, if any

print(data['Ta_gubun'].unique())
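By default `pd.cut` builds right-closed intervals, so with these bins the groups are (-5, 8], (8, 24], and (24, 37]. A small sketch with hypothetical temperatures shows how boundary values and out-of-range values are handled:

```python
import pandas as pd

temps = pd.Series([-5.0, 8.0, 8.1, 24.0, 37.0, 40.0])  # hypothetical temperatures
groups = pd.cut(temps, bins=[-5, 8, 24, 37], labels=[0, 1, 2])

# -5.0 sits on the open left edge and 40.0 is above the last bin, so both become NaN
print(groups.tolist())
```

This is why the commented-out `notna()` filter in the script can matter: any maxTa outside the bin range would otherwise carry a missing group label.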

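# Correlation (numeric columns only; YMD is a string and Ta_gubun is categorical)
print(data[['AMT', 'maxTa', 'sumRn']].corr())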
'''
            AMT     maxTa     sumRn
AMT    1.000000 -0.660066 -0.080907
maxTa -0.660066  1.000000  0.119268
sumRn -0.080907  0.119268  1.000000
'''

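# Split the data into 3 groups, then check equal variance and distribution.
x1 = np.array(data[data.Ta_gubun == 0].AMT)
x2 = np.array(data[data.Ta_gubun == 1].AMT)
x3 = np.array(data[data.Ta_gubun == 2].AMT)
print(x1)
print(x2)
print(x3)

print(stats.levene(x1, x2, x3)) # pvalue=0.0390 < 0.05, equal variances do not hold
# ks_2samp compares two samples with each other; use stats.shapiro per group to test normality itself
print(stats.ks_2samp(x1, x2).pvalue) # 9.28938415079017e-09   < 0.05, the two groups differ in distribution
print(stats.ks_2samp(x1, x3).pvalue) # 1.198570472122961e-28  < 0.05, the two groups differ in distribution
print(stats.ks_2samp(x2, x3).pvalue) # 1.4133139103478243e-13 < 0.05, the two groups differ in distribution

# Mean sales by temperature group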
spp = data.loc[:, ['AMT', 'Ta_gubun']]
print(spp.groupby('Ta_gubun').mean())
print(pd.pivot_table(spp, index = ['Ta_gubun'], aggfunc = 'mean'))
'''
                   AMT
Ta_gubun              
0         1.032362e+06
1         8.181069e+05
2         5.537109e+05
'''

# Run the ANOVA
sp = np.array(spp)
group1 = sp[sp[:, 1] == 0, 0]
group2 = sp[sp[:, 1] == 1, 0]
group3 = sp[sp[:, 1] == 2, 0]
print(group1)
print(group2)
print(group3)

print(stats.f_oneway(group1, group2, group3))

pvalue=2.360737101089604e-34 < 0.05, so the null hypothesis is rejected.
# Alternative hypothesis : mean sales differ by temperature.

When equal variances do not hold, use Welch's ANOVA

Open the Anaconda prompt

pip install pingouin

from pingouin import welch_anova
df = data
print(welch_anova(data = df, dv = 'AMT', between='Ta_gubun')) # p-unc = 7.907874e-35 < 0.05
'''
     Source  ddof1     ddof2           F         p-unc       np2
0  Ta_gubun      2  189.6514  122.221242  7.907874e-35  0.379038
'''

When normality does not hold, use the Kruskal-Wallis test

print(stats.kruskal(group1, group2, group3))
#KruskalResult(statistic=132.7022591443371, pvalue=1.5278142583114522e-29)

pvalue < 0.05, so the null hypothesis is rejected.
# Conclusion : sales differ by temperature.

# Post hoc test
from statsmodels.stats.multicomp import pairwise_tukeyhsd
posthoc = pairwise_tukeyhsd(spp['AMT'], spp['Ta_gubun'])
print(posthoc)
'''
       Multiple Comparison of Means - Tukey HSD, FWER=0.05       
=================================================================
group1 group2   meandiff   p-adj    lower        upper     reject
-----------------------------------------------------------------
     0      1 -214255.4486 0.001 -296759.7083  -131751.189   True
     0      2 -478651.3813 0.001 -561488.5315 -395814.2311   True
     1      2 -264395.9327 0.001 -333329.5099 -195462.3555   True
-----------------------------------------------------------------
'''
posthoc.plot_simultaneous()
plt.show()

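Two-way ANOVA

: 2 factors

Null hypothesis : fetus (태아수) and observer (관측자수) are not related to the mean fetal head circumference (머리둘레).
Alternative hypothesis : fetus (태아수) and observer (관측자수) are related to the mean fetal head circumference (머리둘레).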

* anova.py

import scipy.stats as stats
import pandas as pd
import  numpy as np
import urllib.request
import matplotlib.pyplot as plt
plt.rc('font', family="malgun gothic")
from statsmodels.formula.api import ols
from statsmodels.stats.anova import anova_lm

url = "https://raw.githubusercontent.com/pykwon/python/master/testdata_utf8/group3_2.txt"
data = pd.read_csv(urllib.request.urlopen(url), delimiter=',')
print(data)
'''
    머리둘레  태아수  관측자수
0   14.3    1     1
1   14.0    1     1
2   14.8    1     1
3   13.6    1     2
4   13.6    1     2
 
'''

data.boxplot(column='머리둘레', by='태아수', grid=False)
plt.show()

reg = ols('data["머리둘레"] ~ C(data["태아수"]) + C(data["관측자수"])', data = data).fit()
result = anova_lm(reg, typ=1)  # the keyword is typ, not type; the table below is the Type I table
print(result)
'''
                   df      sum_sq     mean_sq            F        PR(>F)
C(data["태아수"])    2.0  324.008889  162.004444  2023.182239  1.006291e-32
C(data["관측자수"])   3.0    1.198611    0.399537     4.989593  6.316641e-03
Residual         30.0    2.402222    0.080074          NaN           NaN
'''

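# Model that includes the interaction between the two factors
formula = '머리둘레 ~ C(태아수) + C(관측자수) + C(태아수):C(관측자수)'
reg2 = ols(formula, data).fit()
print(reg2)
result2 = anova_lm(reg2, typ=1)  # the keyword is typ, not type; the table below is the Type I table
print(result2)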
'''
                  df      sum_sq     mean_sq            F        PR(>F)
C(태아수)           2.0  324.008889  162.004444  2113.101449  1.051039e-27
C(관측자수)          3.0    1.198611    0.399537     5.211353  6.497055e-03
C(태아수):C(관측자수)   6.0    0.562222    0.093704     1.222222  3.295509e-01
Residual        24.0    1.840000    0.076667          NaN           NaN
'''

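p-value PR(>F) for C(태아수):C(관측자수) : 3.295509e-01 > 0.05, so the interaction between the two factors is not significant.
# The main effects are significant on their own: C(태아수) p = 1.051039e-27 and C(관측자수) p = 6.497055e-03, both < 0.05, so the mean head circumference differs by fetus and by observer.

Chi-square test, t test, and ANOVA on the jikwon (staff) table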

* anova5_t.py

import MySQLdb
import ast
import pandas as pd
import numpy as np
import scipy.stats as stats
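import statsmodels.stats.api as sm
from statsmodels.formula.api import ols       # needed for the ANOVA model near the end of this script
from statsmodels.stats.anova import anova_lm  # needed for the ANOVA table near the end of this script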
import matplotlib.pyplot as plt

try:
    with open('mariadb.txt', 'r') as f:
        config = f.read()
except Exception as e:
    print('err :', e)

config = ast.literal_eval(config)

conn = MySQLdb.connect(**config)
cursor = conn.cursor()

print("=================================================================")
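print(' * Cross-tabulation (two-way chi-square test : association between department (categorical) and staff rating (categorical)) *')
# Independent variable : categorical, dependent variable : categorical
# Null hypothesis : department and staff rating are not related (independent).
# Alternative hypothesis : department and staff rating are related.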

df = pd.read_sql("select * from jikwon", conn)
print(df.head(3))
'''
   jikwon_no jikwon_name  buser_num  ... jikwon_ibsail  jikwon_gen jikwon_rating
0          1         홍길동         10  ...    2008-09-01           남             a
1          2         한송이         20  ...    2010-01-03           여             b
2          3         이순신         20  ...    2010-03-03           남             b
'''

buser = df['buser_num']
rating = df['jikwon_rating']

ctab = pd.crosstab(buser, rating) # build the contingency table
print(ctab)
'''
jikwon_rating  a  b  c
buser_num             
10             5  1  1
20             3  6  3
30             5  2  0
40             2  2  0
'''

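chi, p, dof, exp = stats.chi2_contingency(ctab)  # dof, so the DataFrame named df is not overwritten
print('chi : {}, p : {}, df : {}'.format(chi, p, dof))
# chi : 7.339285714285714, p : 0.2906064076671985, df : 6
# p-value : 0.2906 > 0.05, so we fail to reject the null hypothesis.
# Null hypothesis : department and staff rating are not related (independent).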
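The chi-square statistic above can be rebuilt from the contingency table: the expected count of each cell is (row total × column total) / grand total. A sketch using the counts printed in the crosstab:

```python
import numpy as np
from scipy import stats

# Department (rows 10, 20, 30, 40) by rating (columns a, b, c), as printed above
observed = np.array([[5, 1, 1],
                     [3, 6, 3],
                     [5, 2, 0],
                     [2, 2, 0]])

row = observed.sum(axis=1, keepdims=True)
col = observed.sum(axis=0, keepdims=True)
expected = row * col / observed.sum()

chi2 = ((observed - expected) ** 2 / expected).sum()
dof = (observed.shape[0] - 1) * (observed.shape[1] - 1)
p = stats.chi2.sf(chi2, dof)

print(chi2, p, dof)
print(expected.min())  # smallest expected count
```

Several expected counts here are well below 5, so the chi-square approximation is rough for this table and the p-value should be read with some caution.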

print("=================================================================")
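print(' * Cross-tabulation (two-way chi-square test : association between department (categorical) and job title (categorical)) *')
# Null hypothesis : department and job title are not related (independent).
# Alternative hypothesis : department and job title are related.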

df2 = pd.read_sql("select buser_num, jikwon_jik from jikwon", conn)
print(df2.head(3))
buser = df2.buser_num
jik = df2.jikwon_jik

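ctab2 = pd.crosstab(buser, jik) # build the contingency table
print(ctab2)

chi, p, dof, exp = stats.chi2_contingency(ctab2)  # dof, so the DataFrame named df is not overwritten
print('chi : {}, p : {}, df : {}'.format(chi, p, dof))
# chi : 9.620617477760335, p : 0.6492046290079438, df : 12
# p-value : 0.6492 > 0.05, so we fail to reject the null hypothesis.
# Null hypothesis : department and job title are not related (independent).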
print()

print("=================================================================")
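print(' * Difference analysis (t-test : difference in mean salary (continuous) between departments 10 and 20 (categorical)) *')
# Independent variable : categorical, dependent variable : continuous

# Null hypothesis : there is no difference in mean salary between the two departments.
# Alternative hypothesis : there is a difference in mean salary between the two departments.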

#df_10 = pd.read_sql("select buser_num, jikwon_pay from jikwon where buser_num in (10, 20)", conn)
df_10 = pd.read_sql("select buser_num, jikwon_pay from jikwon where buser_num = 10", conn)
df_20 = pd.read_sql("select buser_num, jikwon_pay from jikwon where buser_num = 20", conn)
buser_10 = df_10['jikwon_pay']
buser_20 = df_20['jikwon_pay']

print('mean :',np.mean(buser_10), ' ', np.mean(buser_20))
# mean : 5414.285714285715   4908.333333333333

t_result = stats.ttest_ind(buser_10, buser_20)
print(t_result)
# pvalue=0.6523 > 0.05, so we fail to reject the null hypothesis
# Null hypothesis : there is no difference in mean salary between the two departments.
print()

print("=================================================================")
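print(' * ANOVA (difference in mean salary (continuous) among departments; one factor, department, split into 4 groups (categorical)) *')
# Independent variable : categorical, dependent variable : continuous

# Null hypothesis : there is no difference in mean salary among the 4 departments.
# Alternative hypothesis : there is a difference in mean salary among the 4 departments.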

df3 = pd.read_sql("select buser_num, jikwon_pay from jikwon", conn)
buser = df3['buser_num']
pay = df3['jikwon_pay']

gr1 = df3[df3['buser_num'] == 10 ]['jikwon_pay']
gr2 = df3[df3['buser_num'] == 20 ]['jikwon_pay']
gr3 = df3[df3['buser_num'] == 30 ]['jikwon_pay']
gr4 = df3[df3['buser_num'] == 40 ]['jikwon_pay']
print(gr1)

# Visualization
plt.boxplot([gr1, gr2, gr3, gr4])
#plt.show()

# Method 1
f_sta, pv = stats.f_oneway(gr1, gr2, gr3, gr4)
print('f : {}, p : {}'.format(f_sta, pv))
# f : 0.41244077160708414, p : 0.7454421884076983
# p > 0.05, so we fail to reject the null hypothesis
# Null hypothesis : there is no difference in mean salary among the 4 departments.


# Method 2
lmodel = ols('jikwon_pay ~ C(buser_num)', data = df3).fit()
result = anova_lm(lmodel, typ=1)  # the keyword is typ, not type; the table below is the Type I table
print(result)
'''
                df        sum_sq       mean_sq         F    PR(>F)
C(buser_num)   3.0  5.642851e+06  1.880950e+06  0.412441  0.745442
Residual      26.0  1.185739e+08  4.560535e+06       NaN       NaN
'''
# P : 0.745442 > 0.05, so we fail to reject the null hypothesis
# Null hypothesis : there is no difference in mean salary among the 4 departments.

# Post hoc test
from statsmodels.stats.multicomp import pairwise_tukeyhsd

tukey = pairwise_tukeyhsd(df3.jikwon_pay, df3.buser_num)
print(tukey)
'''
   Multiple Comparison of Means - Tukey HSD, FWER=0.05    
==========================================================
group1 group2  meandiff p-adj    lower      upper   reject
----------------------------------------------------------
    10     20 -505.9524    0.9 -3292.2958  2280.391  False
    10     30  -85.7143    0.9 -3217.2939 3045.8654  False
    10     40  848.2143    0.9 -2823.8884 4520.3169  False
    20     30  420.2381    0.9 -2366.1053 3206.5815  False
    20     40 1354.1667 0.6754  -2028.326 4736.6593  False
    30     40  933.9286 0.8955 -2738.1741 4606.0312  False
'''
tukey.plot_simultaneous()
plt.show()

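[Deep Learning] T test

2021. 3. 5. 15:24

T test

: Tests the difference in means (or proportions) between groups.
: Tests statistically how large the difference in means is relative to the standard deviation.

Independent variable : categorical => t-test (2 or fewer groups) / ANOVA (3 or more groups)
Dependent variable : continuous

Types of t test

One sample t test
Independent t test
Paired t test

One-sample t-test

: Tests whether the sample mean of a single group equals a hypothesized (known) mean.

Exercise 1 : test the mean height of a group of men

Null hypothesis : the mean height of the group is 177. The sample mean equals the population mean.
Alternative hypothesis : the mean height of the group is not 177. The sample mean differs from the population mean.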

* t1.py

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import scipy.stats as stats

one_sample =[177.0, 182.7, 169.6, 176.8, 180.0]
print(np.array(one_sample).mean())  # mean : 177.21999999999997

one_sample2 =[167.0, 162.7, 169.6, 176.8, 170.0]
print(np.array(one_sample2).mean()) # mean : 169.21999999999997

result1 = stats.ttest_1samp(one_sample, popmean=177) # popmean : population mean
print('result1 :', result1)
# result1 : Ttest_1sampResult(statistic=0.10039070766877535, pvalue=0.9248646407498543)

p-value=0.92486 > 0.05, so we fail to reject the null hypothesis
# Null hypothesis : the mean height of the group is 177. The sample mean equals the population mean.

result2 = stats.ttest_1samp(one_sample2, popmean=177)
print('result2 :', result2)
# result2 : Ttest_1sampResult(statistic=-3.3850411682038235, pvalue=0.02765632703927135)

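p-value=0.02765 < 0.05, so the null hypothesis is rejected
# Alternative hypothesis : the mean height of the group is not 177. The sample mean differs from the population mean.

result3 = stats.ttest_1samp(one_sample, popmean=165)
print('result3 :', result3)

p-value=0.005 < 0.05, so the null hypothesis is rejected
# Alternative hypothesis : the mean height of the group is not 165 (the popmean used here). The sample mean differs from the population mean.

stats.ttest_1samp(data, popmean=population mean) : one-sample t-test function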
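The statistic returned by `stats.ttest_1samp` is the distance between the sample mean and the hypothesized mean, measured in standard errors. A sketch that reproduces result1 by hand:

```python
import numpy as np
from scipy import stats

one_sample = np.array([177.0, 182.7, 169.6, 176.8, 180.0])
mu0 = 177                 # hypothesized population mean
n = len(one_sample)

se = one_sample.std(ddof=1) / np.sqrt(n)   # standard error of the mean
t = (one_sample.mean() - mu0) / se
p = 2 * stats.t.sf(abs(t), df=n - 1)       # two-sided p-value, n - 1 degrees of freedom
print(t, p)
```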

Exercise 2 : test the mean of a data set

Null hypothesis : the mean of the data is 0.
Alternative hypothesis : the mean of the data is not 0.

np.random.seed(123)
mu = 0
n = 10
x = stats.norm(mu).rvs(n) # draw 10 random numbers from a normal distribution with mean 0
print(x)
#[-1.0856306   0.99734545  0.2829785  -1.50629471 -0.57860025  1.65143654 -2.42667924 -0.42891263  1.26593626 -0.8667404 ]
print(np.mean(x)) #-0.26951611032632805

import seaborn as sns

# Check whether x satisfies normality
sns.distplot(x, kde=False, fit=stats.norm) # plot the distribution (distplot is deprecated in recent seaborn releases)
plt.show()

import seaborn as sns

sns.distplot(x, kde=, fit=) : histogram

print(stats.shapiro(x)) # normality test
# ShapiroResult(statistic=0.9674148559570312, pvalue=0.8658965229988098)

p-value 0.865 > 0.05, so normality holds

stats.shapiro(data) : normality test function

result4 = stats.ttest_1samp(x, popmean=0) # hypothesized population mean
print('result4 :', result4)
# result4 : Ttest_1sampResult(statistic=-0.6540040368674593, pvalue=0.5294637946339893)

p-value=0.529 > 0.05, so we fail to reject the null hypothesis
# Null hypothesis : the mean of the data is 0.

# Note : suppose the population mean were 0.8
result5 = stats.ttest_1samp(x, popmean=0.8) # hypothesized population mean
print('result5 :', result5)
# result5 : Ttest_1sampResult(statistic=-2.595272886660806, pvalue=0.028961904737220684)

p-value=0.028 < 0.05, so the null hypothesis (mean = 0.8) is rejected
# Alternative hypothesis : the mean of the data is not 0.8.

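Exercise 3 : read a file of exam results for class 1, grade 1 of middle school A (test the mean Korean score against 80)

Null hypothesis : the mean Korean score of the students is 80.
Alternative hypothesis : the mean Korean score of the students is not 80.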

* t2.py

import pandas as pd
import scipy.stats as stats
import numpy as np

data = pd.read_csv("../testdata/student.csv")
print(data.head(3))
'''
    이름  국어  영어  수학
0  박치기  90  85  55
1  홍길동  70  65  80
2  김치국  92  95  76
'''
print(data.describe())

print(np.mean(data.국어)) # check whether the sample mean 72.9 differs from 80.
print(stats.shapiro(data.국어))

result = stats.ttest_1samp(data.국어, popmean=80)
print('result :',result)

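pvalue=0.19 > 0.05, so we fail to reject the null hypothesis.
# Null hypothesis : the mean Korean score of the students is 80.

result2 = stats.ttest_1samp(data.국어, popmean=60)
print('result2 :',result2)

pvalue=0.02568 < 0.05, so the null hypothesis is rejected.
# Alternative hypothesis : the mean Korean score of the students is not 60 (the popmean used here).

Exercise 4 : test the mean weight of newborn girls

The mean weight of newborn girls has been taken to be 2800 (g), but it has been claimed that it is actually higher.
Suppose 18 newborn girls were sampled and weighed; test whether the new claim holds.

Null hypothesis : the mean weight of newborn girls is 2800 g.
Alternative hypothesis : the mean weight of newborn girls is greater than 2800 g.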

data = pd.read_csv("https://raw.githubusercontent.com/pykwon/python/master/testdata_utf8/babyboom.csv")
print(data.head(3))
'''
   time  gender  weight  minutes
0     5       1    3837        5
1   104       1    3334       64
2   118       2    3554       78
'''
print(data.describe()) # gender - 1 : girl, 2 : boy

fdata = data[data.gender == 1]
print(fdata.head(3), len(fdata), np.mean(fdata.weight))
'''
   time  gender  weight  minutes
0     5       1    3837        5
1   104       1    3334       64
5   405       1    2208      245
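18 rows, mean : 3132.4444444444443
'''
# Test whether the sample mean 3132(g) differs from 2800(g).

# Check normality
# Statistical inference usually proceeds on the assumption that the population is normally distributed; for large samples the central limit theorem supports this for the sample mean.
# Central limit theorem : as the sample size grows, the distribution of the sample mean approaches a normal distribution regardless of the shape of the population distribution.
print(stats.shapiro(fdata.weight)) # p-value=0.0179 < 0.05 => normality violated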

import matplotlib.pyplot as plt
import seaborn as sns
sns.distplot(fdata.iloc[:, 2], fit=stats.norm)
plt.show()

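stats.probplot(fdata.iloc[:, 2], plot=plt) # check normality on a Q-Q plot
plt.show()

stats.probplot(data, plot=plt) : check normality on a Q-Q plot

result3 = stats.ttest_1samp(fdata.weight, popmean=2800)
print('result3 :',result3) # pvalue=0.03926844173060218 (two-sided)

pvalue=0.0392 < 0.05, so the null hypothesis is rejected. ttest_1samp is two-sided by default; because the sample mean is above 2800, the one-sided p-value for "greater" is half of this, about 0.0196.
# Alternative hypothesis : the mean weight of newborn girls is greater than 2800g.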
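Because the alternative hypothesis here is one-sided ("greater"), the one-sided p-value can be taken from the upper tail of the t distribution. A minimal sketch, assuming a hypothetical weights array rather than the babyboom data; one_sided_greater_p is an illustrative helper name.

```python
import numpy as np
from scipy import stats

def one_sided_greater_p(sample, popmean):
    """t statistic and p-value for H1: mean > popmean."""
    sample = np.asarray(sample, dtype=float)
    t_stat = stats.ttest_1samp(sample, popmean=popmean)[0]
    # Upper-tail probability of the t distribution with n - 1 degrees of freedom
    return t_stat, stats.t.sf(t_stat, df=sample.size - 1)

# Hypothetical weights in grams (not the babyboom data)
weights = np.array([3100, 2950, 3300, 2800, 3400, 3050, 2700, 3250])

t_stat, p_greater = one_sided_greater_p(weights, 2800)
p_two_sided = stats.ttest_1samp(weights, popmean=2800)[1]
print(t_stat, p_greater, p_two_sided)
```

When the t statistic is positive, the one-sided p-value is exactly half of the two-sided one; when it is negative, the one-sided p-value is above 0.5 and the claim "greater" has no support.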

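Mean difference test for two independent groups (independent samples t test)

Prerequisites : both groups should follow a normal distribution and have equal variances.

Samples drawn from two mutually independent groups, such as the grades of men and women, the heights of class A and class B, or the incomes of Gyeonggi-do and Chungcheong-do, are called independent samples (two sample).

Exercise 1 : Mean difference test of a Python exam between men and women

Null hypothesis : there is no difference in the mean Python exam score between men and women.
Alternative hypothesis : there is a difference in the mean Python exam score between men and women.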

* t3.py

import numpy as np
import scipy.stats as stats

male= [75, 85, 100, 72.5, 86.5]
female = [63.2, 76, 52, 100, 70]

print(np.mean(male))    # 83.8
print(np.mean(female))  # 72.24

# Normality / equal-variance checks are omitted for this data

two_sample = stats.ttest_ind(male, female, equal_var = True)
# equal_var=True (default) : use when the equal-variance assumption holds
print(two_sample.pvalue)
print(two_sample)
# Ttest_indResult(statistic=1.233193127514512, pvalue=0.25250768448532773)

stats.ttest_ind(x, y, equal_var =) : independent samples t test

pvalue=0.2525 > 0.05, so the null hypothesis is not rejected.
# Null hypothesis : there is no difference in the mean Python exam score between men and women.
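With equal_var=True the test uses a pooled variance. The sketch below recomputes the statistic for the male/female scores above; the helper name pooled_t is an assumption made for illustration.

```python
import numpy as np
from scipy import stats

male = [75, 85, 100, 72.5, 86.5]
female = [63.2, 76, 52, 100, 70]

def pooled_t(a, b):
    """Student's t test with a pooled variance (equal variances assumed)."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    na, nb = a.size, b.size
    # Pooled variance: weighted average of the two sample variances
    sp2 = ((na - 1) * a.var(ddof=1) + (nb - 1) * b.var(ddof=1)) / (na + nb - 2)
    t = (a.mean() - b.mean()) / np.sqrt(sp2 * (1 / na + 1 / nb))
    p = 2 * stats.t.sf(abs(t), df=na + nb - 2)
    return t, p

t_manual, p_manual = pooled_t(male, female)
t_scipy, p_scipy = stats.ttest_ind(male, female, equal_var=True)
print(t_manual, p_manual)
```

The hand computation agrees with the Ttest_indResult shown above (statistic about 1.2332, p-value about 0.2525).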

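Exercise 2 : Test the mean exam score for two teaching methods

Null hypothesis : there is no difference in mean exam score between the two teaching methods.
Alternative hypothesis : there is a difference in mean exam score between the two teaching methods.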

import pandas as pd
data = pd.read_csv('../testdata/two_sample.csv')
print(data.head(3))
'''
   no  gender  method  survey  score
0   1       1       1       1    5.1
1   2       1       2       0    NaN
2   3       1       1       1    4.7
'''

ms = data[['method', 'score']]
print(data['method'].unique()) # [1 2]

# Split the data by teaching method
m1 = ms[ms['method'] == 1]
m2 = ms[ms['method'] == 2]
print(m1[:5])
'''
    method  score
0        1    5.1
2        1    4.7
4        1    5.4
7        1    5.2
10       1    5.7
'''
print(m2[:5])
'''
   method  score
1       2    NaN
3       2    NaN
5       2    4.4
6       2    4.9
8       2    4.3
'''

sco1 = m1['score']
sco2 = m2['score']

# NaN handling options : replace with an arbitrary value, replace with the mean, or drop
#sco1 = sco1.fillna(0)
sco1 = sco1.fillna(sco1.mean())
sco2 = sco2.fillna(sco2.mean())
print()
print(sco1[:5])
'''
0     5.1
2     4.7
4     5.4
7     5.2
10    5.7
'''
print(sco2[:5])
'''
1    5.246667
3    5.246667
5    4.400000
6    4.900000
8    4.300000
'''

fillna(value) : replace missing values with the given value.

- Check normality

import matplotlib.pyplot as plt
import seaborn as sns
sns.distplot(sco1, kde = False, fit=stats.norm)
sns.distplot(sco2, kde = False, fit=stats.norm)
plt.show()

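print(stats.shapiro(sco1)) # pvalue=0.3679 > 0.05 : normality satisfied
print(stats.shapiro(sco2)) # pvalue=0.6714 > 0.05

print(stats.levene(sco1, sco2))   # pvalue=0.4568 > 0.05 : equal variances satisfied
print(stats.fligner(sco1, sco2))  # pvalue=0.4432 (non-parametric)
print(stats.bartlett(sco1, sco2)) # pvalue=0.2678 (assumes normal data)

result = stats.ttest_ind(sco1, sco2, equal_var=True) # when normality and equal variances both hold
print(result) # pvalue=0.8450

stats.levene(x, y) : equal-variance test (Levene).

stats.fligner(x, y) : equal-variance test (Fligner-Killeen).

stats.bartlett(x, y) : equal-variance test (Bartlett).

Conclusion : pvalue=0.8450 > 0.05, so the null hypothesis is not rejected.
# Null hypothesis : there is no difference in mean exam score between the two teaching methods.

# Note
result = stats.ttest_ind(sco1, sco2, equal_var=False) # normality holds but equal variances do not (Welch's t test)
result = stats.mannwhitneyu(sco1, sco2) # normality does not hold; stats.wilcoxon() is for paired samples only
print(result)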
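The checks above (normality, then equal variances, then the test) can be folded into one helper. This is a rough sketch under stated assumptions: compare_two_groups and the demo arrays are hypothetical, and the 0.05 thresholds for the pre-tests are a convention, not a rule.

```python
import numpy as np
from scipy import stats

def compare_two_groups(a, b, alpha=0.05):
    """Pick a two-sample test from the pre-checks and return (test name, p-value)."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    normal = stats.shapiro(a)[1] > alpha and stats.shapiro(b)[1] > alpha
    if not normal:
        # Non-parametric alternative for two independent samples
        return 'mannwhitneyu', stats.mannwhitneyu(a, b, alternative='two-sided')[1]
    equal_var = stats.levene(a, b)[1] > alpha
    name = 'student_t' if equal_var else 'welch_t'
    return name, stats.ttest_ind(a, b, equal_var=equal_var)[1]

rng = np.random.default_rng(1)               # illustrative data only
g1 = rng.normal(5.0, 1.0, 40)
g2 = rng.normal(5.2, 1.0, 40)
skewed = rng.exponential(1.0, 200)           # clearly non-normal sample

name, p = compare_two_groups(g1, g2)
name_skewed, p_skewed = compare_two_groups(skewed, g1)
print(name, p)
print(name_skewed, p_skewed)
```

The skewed sample fails the Shapiro-Wilk check, so the helper falls back to the Mann-Whitney U test for that pair.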

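Exercise 3 : Using a restaurant's sales data and weather data, test the difference in mean sales depending on whether it rained

Group 1 : sales on rainy days, Group 2 : sales on days without rain

Null hypothesis : there is no difference in mean sales depending on rain.
Alternative hypothesis : there is a difference in mean sales depending on rain.

* t4.py

- Sales data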

import numpy as np
import scipy.stats as stats
import pandas as pd
import matplotlib.pyplot as plt

sales_data = pd.read_csv('https://raw.githubusercontent.com/pykwon/python/master/testdata_utf8/tsales.csv', dtype={'YMD':'object'})
print(sales_data.head(3)) # 328 rows
'''
        YMD    AMT  CNT
0  20190514      0    1
1  20190519  18000    1
2  20190521  50000    4
'''
print(sales_data.info())

- Weather data

wt_data = pd.read_csv('https://raw.githubusercontent.com/pykwon/python/master/testdata_utf8/tweather.csv')
print(wt_data.head(3)) # 702 rows
'''
   stnId          tm  avgTa  minTa  maxTa  sumRn  maxWs  avgWs  ddMes
0    108  2018-06-01   23.8   17.5   30.2    0.0    4.3    1.9    0.0
1    108  2018-06-02   23.4   17.6   30.1    0.0    4.5    2.0    0.0
2    108  2018-06-03   24.0   16.9   30.8    0.0    4.2    1.6    0.0
'''
print(wt_data.info())

- Join on the date

wt_data.tm = wt_data.tm.map(lambda x : x.replace('-','')) # remove '-' from wt_data.tm
#print(wt_data.head(3))
frame = sales_data.merge(wt_data, how='left', left_on='YMD', right_on='tm') # join
print(frame.head(3), frame.shape) # (328, 12)
'''
        YMD    AMT  CNT  stnId        tm  ...  maxTa  sumRn  maxWs  avgWs  ddMes
0  20190514      0    1    108  20190514  ...   26.9    0.0    4.1    1.6    0.0
1  20190519  18000    1    108  20190519  ...   21.6   22.0    2.7    1.2    0.0
2  20190521  50000    4    108  20190521  ...   23.8    0.0    5.9    2.9    0.0
'''
print(frame.columns)

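df1.merge(df2, how='left', left_on='df1 column', right_on='df2 column') : join

- Extract only the columns used in the analysis

data = frame.iloc[:, [0,1,7,8]].copy() # copy() so that adding a column later does not trigger SettingWithCopyWarning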
print(data.head(3))
'''
        YMD    AMT  maxTa  sumRn
0  20190514      0   26.9    0.0
1  20190519  18000   21.6   22.0
2  20190521  50000   23.8    0.0
'''

# Method 1
print(data['sumRn'] > 0)
data['rain_yn'] = (data['sumRn'] > 0).astype(int)
print(data.head(3))
'''
        YMD    AMT  maxTa  sumRn  rain_yn
0  20190514      0   26.9    0.0        0
1  20190519  18000   21.6   22.0        1
2  20190521  50000   23.8    0.0        0
'''

# Method 2
print(True * 1, False * 1) # 1 0
data['rain_yn'] = (data.loc[:, ('sumRn')] > 0) * 1
print(data.head(3))

df.astype(type) : type conversion

- Visualization for comparing sales by rain

sp = np.array(data.iloc[:, [1, 4]])  #  AMT, rain_yn
tg1 = sp[sp[:, 1] == 0, 0] # sales on days without rain
tg2 = sp[sp[:, 1] == 1, 0] # sales on rainy days
print(tg1[:3]) # [     0  50000 125000]
print(tg2[:3]) # [ 18000 274000 318000]
print(np.mean(tg1), np.mean(tg2)) # 761040.2542372881 757331.5217391305

plt.plot(tg1)
plt.show()
plt.plot(tg2)
plt.show()

plt.boxplot([tg1, tg2], meanline=True, showmeans=True, notch=True)
plt.show()

plt.boxplot([data1, data2], meanline=True, showmeans=True, notch=True) : box plot

- Mean difference test between the two groups

# Normality (N > 30)
print(len(tg1), ' ', len(tg2))    # 236   92
print(stats.shapiro(tg1).pvalue) # 0.0560 > 0.05 => normality satisfied
print(stats.shapiro(tg2).pvalue) # 0.8827

# Equal variances
print(stats.levene(tg1, tg2).pvalue) # 0.7123 > 0.05 => equal variances satisfied

print(stats.ttest_ind(tg1, tg2, equal_var=True)) # pvalue=0.9195

Conclusion : pvalue=0.9195 > 0.05, so the null hypothesis is not rejected.
# Null hypothesis : there is no difference in mean sales depending on rain.

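Mean difference test for two paired groups (paired samples t test)

: When the states before and after a treatment are regarded as separate populations and the same subjects are matched 1:1 before and after the treatment, the samples from the two groups are called paired samples.
: Comparing the means of two paired groups is mainly used to show how much a treatment affected the same subjects, by comparing the observations before and after it. Since it is not a comparison between groups, no equal-variance test is needed.
: Examples : product sales before/after an advertisement, muscle mass before/after exercise

Exercise 1 : Difference in exam scores before/after a special lecture

Null hypothesis : there is no difference in exam scores before/after the special lecture.
Alternative hypothesis : there is a difference in exam scores before/after the special lecture.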

* t5.py

import numpy as np
import scipy.stats as stats

# Paired samples t test 1
np.random.seed(12)
x1 = np.random.normal(80, 10, 100)
x2 = np.random.normal(77, 10, 100)

# Normality
import matplotlib.pyplot as plt
import seaborn as sns
sns.distplot(x1, kde=False, fit=stats.norm)
sns.distplot(x2, kde=False, fit=stats.norm)
plt.show()

print(stats.shapiro(x1)) # pvalue=0.9942 > 0.05 => normality satisfied
print(stats.shapiro(x2)) # pvalue=0.7985

# The same subjects are measured twice, so no equal-variance test is performed.

print(stats.ttest_rel(x1, x2))

stats.ttest_rel(x1, x2) : mean difference test for two paired groups

Conclusion : pvalue=0.0187 < 0.05, so the null hypothesis is rejected.
# Alternative hypothesis : there is a difference in exam scores before/after the special lecture.

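Exercise 2 : Weight change of 9 patients before/after abdominal surgery

Null hypothesis : there is no weight change before/after abdominal surgery.
Alternative hypothesis : there is a weight change before/after abdominal surgery.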

baseline = [67.2, 67.4, 71.5, 77.6, 86.0, 89.1, 59.5, 81.9, 105.5]
follow_up = [62.4, 64.6, 70.4, 62.6, 80.1, 73.2, 58.2, 71.0, 101.0]

print(np.mean(baseline)) # 78.41111111111111

print(np.mean(follow_up)) # 71.5
print(np.mean(baseline) - np.mean(follow_up)) # 6.911111111111111

result = stats.ttest_rel(baseline, follow_up)
print(result)

pvalue=0.0063 < 0.05 => the null hypothesis is rejected.
# Alternative hypothesis : there is a weight change before/after abdominal surgery.
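A paired t test is the same as a one-sample t test on the per-subject differences against 0. The sketch below checks this with the baseline and follow_up values from the exercise; the variable names for the results are illustrative.

```python
import numpy as np
from scipy import stats

baseline = [67.2, 67.4, 71.5, 77.6, 86.0, 89.1, 59.5, 81.9, 105.5]
follow_up = [62.4, 64.6, 70.4, 62.6, 80.1, 73.2, 58.2, 71.0, 101.0]

# Per-patient weight change (before minus after)
diff = np.array(baseline) - np.array(follow_up)

t_rel, p_rel = stats.ttest_rel(baseline, follow_up)
t_one, p_one = stats.ttest_1samp(diff, popmean=0)
print(t_rel, p_rel)
print(t_one, p_one)
```

Both calls give the same statistic and p-value, which is why the equal-variance check is not needed: only the spread of the differences matters.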


[딥러닝] 카이제곱

2021. 3. 4. 09:46

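Null hypothesis (H0) : the existing belief; no change or no difference.
Alternative hypothesis (research hypothesis, H1) : the new claim that opposes the null hypothesis.

Point estimation : estimate a population parameter with a single value computed from the sample.
Interval estimation : estimate a population parameter with a range (confidence interval) at a stated confidence level.

Hypothesis testing

① Use a distribution table -> obtain the critical value
Compare the test statistic with the critical value : if the statistic lies beyond the critical value (inside the rejection region), reject the null hypothesis; otherwise do not reject it.

② Use the p-value (computed by the tool). Significance probability.
p-value > 0.05(a)  : do not reject the null hypothesis. The observed result could plausibly arise by chance under H0, so the research hypothesis is not supported.
p-value < 0.05(a)  : reject the null hypothesis. Good
p-value < 0.01(a)  : reject the null hypothesis. Better
p-value < 0.001(a) : reject the null hypothesis. Even better => strong evidence against H0.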

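Analysis methods by measurement scale

Independent variable x	Dependent variable y	Analysis method
Categorical	Categorical	Chi-square test (cross-tabulation analysis)
Categorical	Continuous	T test (2 groups or fewer), ANOVA (3 groups or more)
Continuous	Categorical	Logistic regression
Continuous	Continuous	Regression analysis, structural equation modeling

Type I error : the null hypothesis is true, but it is rejected.
Type II error : the null hypothesis is false, but it is not rejected.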
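The link between the significance level and the Type I error can be seen in a small simulation: when H0 is true, a test at a=0.05 still rejects in roughly 5% of repeated experiments. The setup below (2000 trials of 30 draws, seed 42) is an arbitrary assumption chosen for illustration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
alpha = 0.05
n_trials, n = 2000, 30

# H0 is true in every trial: each row is a sample from a population with mean 0
samples = rng.normal(loc=0.0, scale=1.0, size=(n_trials, n))
p_values = stats.ttest_1samp(samples, popmean=0.0, axis=1)[1]

# Fraction of trials in which a true H0 is rejected (Type I errors)
type1_rate = float(np.mean(p_values < alpha))
print(type1_rate)
```

The printed rate should sit close to 0.05, which is what choosing a=0.05 means in practice.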

Importance of the variance (standard deviation) - spread of the data

* temp.py

import scipy.stats as stats
import numpy as np
import matplotlib.pyplot as plt

centers = [1, 1.5, 2] # 3 groups
col = 'rgb'
data = []
std = 0.01 # standard deviation - the smaller it is, the more tightly the data cluster.

for i in range(len(centers)):
    data.append(stats.norm(centers[i], std).rvs(100)) # norm() : normal distribution. rvs(100) : generate 100 random values.
    plt.plot(np.arange(100) + i * 100, data[i], color=col[i])   # line plot
    plt.scatter(np.arange(100) + i * 100, data[i], color=col[i]) # scatter plot
plt.show()
print(data)

import scipy.stats as stats

stats.norm(mean, std).rvs(size=size, random_state=seed) : generate random values from a normal distribution.

import matplotlib.pyplot as plt

plt.plot(x, y) : draw a line plot.

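Cross-tabulation analysis (chi-square) hypothesis test

The chi-square distribution is also used to estimate and test a population variance; here it is used to test frequencies of categorical data
Independent and dependent variables : categorical
One-way chi-square : one variable. Goodness-of-fit (preference) test. No contingency table.
Two-way chi-square : two or more variables. Independence (homogeneity) test. Uses a contingency table.
Procedure : set hypotheses -> choose significance level -> compute test statistic -> decide whether to reject the null hypothesis -> state the result
Formula : sum((observed frequency - expected frequency)^2 / expected frequency)

Chi-square test computed from the formula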

* chi1.py

import pandas as pd

data = pd.read_csv('../testdata/pass_cross.csv', encoding='euc-kr') # read the csv file
print(data.head(3))
'''
   공부함  공부안함  합격  불합격
0    1     0   1    0
1    1     0   1    0
2    0     1   0    1
'''
print(data.shape) # (rows, columns) : (50, 4)

Null hypothesis : cramming and passing are unrelated.
Alternative hypothesis : cramming and passing are related.

print(data[(data['공부함'] == 1) & (data['합격'] == 1)].shape[0]) # 18 - studied and passed
print(data[(data['공부함'] == 1) & (data['불합격'] == 1)].shape[0]) # 7 - studied and failed

Frequency table

data2 = pd.crosstab(index=data['공부안함'], columns=data['불합격'], margins=True)
'''
불합격     0   1
공부안함        
0     18   7
1     12  13
'''
data2.columns = ['합격', '불합격', '행 합']
data2.index = ['공부함', '공부안함', '열 합']
print(data2)
'''
             합격  불합격  행 합
공부함    18    7   25
공부안함  12   13   25
열 합      30   20   50
'''

Expected frequency = row total * column total / grand total

'''
        합격  불합격  행 합
공부함    15    10   25
공부안함  15   10   25
열 합      30   20   50
'''

Chi-square

ch2 = ((18 - 15) ** 2 / 15) + ((7 - 10) ** 2 / 10) + ((12 - 15) ** 2 / 15) + ((13 - 10) ** 2 / 10)
print('카이제곱 :', ch2) # 3.0

Degrees of freedom (df) = (number of rows - 1) * (number of columns - 1) = (2-1) * (2-1) = 1

Critical value from the chi-square distribution table (df=1, significance level 0.05) : 3.84

math100.tistory.com/45

# Conclusion 1 : chi-square 3.0 < critical value 3.84, so the statistic falls inside the acceptance region and the null hypothesis is not rejected.
( Null hypothesis : cramming and passing are unrelated. )
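The expected frequencies, the statistic and the table lookup can all be computed instead of read off by hand. A minimal sketch using the 2x2 counts from the frequency table; stats.chi2.ppf replaces the printed chi-square table, and the variable names are illustrative.

```python
import numpy as np
from scipy import stats

# Observed counts: rows = studied / did not study, columns = pass / fail
observed = np.array([[18, 7],
                     [12, 13]])

row_totals = observed.sum(axis=1, keepdims=True)
col_totals = observed.sum(axis=0, keepdims=True)
expected = row_totals * col_totals / observed.sum()   # row total * column total / grand total

chi2_stat = ((observed - expected) ** 2 / expected).sum()
dof = (observed.shape[0] - 1) * (observed.shape[1] - 1)

critical = stats.chi2.ppf(0.95, dof)      # critical value at significance level 0.05
p_value = stats.chi2.sf(chi2_stat, dof)   # upper-tail probability of the statistic

print(expected)
print(chi2_stat, dof, critical, p_value)
```

The statistic 3.0 is below the critical value of about 3.84, and equivalently the p-value of about 0.083 is above 0.05.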

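Using a library function (chi2_contingency())

import scipy.stats as stats
# Pass only the 2x2 observed counts: the margin row/column added by margins=True must be excluded,
# otherwise the degrees of freedom become 4 and the p-value is wrong.
# correction=False reproduces the hand computation above (no Yates correction).
chi2, p, dof, expected = stats.chi2_contingency(data2.iloc[:2, :2], correction=False)
print('chi2\t:', chi2)
print('p\t:', p)
print('dof\t:', dof)
print('expected :\n', expected)
'''
chi2    : 3.0
p    : 0.0833 (approx.)
dof    : 1
expected :
 [[15. 10.]
 [15. 10.]]
'''

# Conclusion 2 : p-value 0.0833 > significance level 0.05 => the null hypothesis is not rejected.
( Null hypothesis : cramming and passing are unrelated. )

stats.chi2_contingency(data) : two-way chi-square function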

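One-way chi-square (chisquare())

: A method for testing whether observed frequencies agree with expected frequencies
: Types : goodness-of-fit test and preference test - with a single categorical variable, tests whether the observed frequencies match the expected frequencies

Goodness-of-fit test

: The analysis of how well frequencies observed in natural phenomena or experiments agree with the distribution under the null hypothesis (the proportion of each level of the categorical data) is called a goodness-of-fit test
: It tests which theoretical distribution the observations follow and deals with a single factor

<Exercise : goodness-of-fit test>

A die was thrown 60 times and the observed and expected frequencies below were obtained. Use a one-way chi-square test to analyze whether this die is fair.

Die face	1	2	3	4	5	6
Observed	4	6	17	16	8	9
Expected	10	10	10	10	10	10

Null hypothesis (H0) : there is no difference between expected and observed frequencies. The die is an ordinary die.
Alternative hypothesis (research hypothesis, H1) : there is a difference between expected and observed frequencies. The die is not an ordinary die.

One variable : the outcome of each throw. Check the difference between observed and expected frequencies. stats.chisquare(observed, expected)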

* chi2.py

import pandas as pd
import scipy.stats as stats

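data = [4, 6, 17, 16, 8, 9] # observed frequencies
result = stats.chisquare(data)
print(result)
# Power_divergenceResult(statistic=14.200000000000001, pvalue=0.014387678176921308)

data2 = [10, 10, 10, 10, 10, 10] # expected frequencies
result2 = stats.chisquare(data, data2)
print(result2)

print('통계량(x2) : %.5f, p-value : %.5f'%result)
# 통계량(x2) : 14.20000, p-value : 0.01439
# For a fixed df, the larger the statistic, the smaller the p-value.

# Conclusion 1 : p-value 0.01439 < 0.05, so at significance level a=0.05 the null hypothesis is rejected and the research hypothesis is adopted.
( Alternative hypothesis : there is a difference between expected and observed frequencies. The die is not an ordinary die. )

# Conclusion 2 : df 5 (N-1), critical value from the chi-square distribution table : 11.07
statistic(x2) 14.200 > critical value 11.07, so the null hypothesis is rejected and the research hypothesis is adopted.
( Alternative hypothesis : there is a difference between expected and observed frequencies. The die is not an ordinary die. )

Tests whether there is a significant difference between observed and expected frequencies using the one-way chi-square

stats.chisquare(data) : one-way chi-square function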
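The die example can be checked directly against the formula sum((observed - expected)^2 / expected). A short sketch; the critical value is taken from stats.chi2.ppf rather than a printed table, and the variable names are illustrative.

```python
import numpy as np
from scipy import stats

observed = np.array([4, 6, 17, 16, 8, 9])
expected = np.full(6, observed.sum() / 6)   # fair die: 60 throws / 6 faces = 10 each

chi2_manual = ((observed - expected) ** 2 / expected).sum()
chi2_scipy, p_scipy = stats.chisquare(observed, expected)

dof = observed.size - 1
critical = stats.chi2.ppf(0.95, dof)        # significance level 0.05

print(chi2_manual, chi2_scipy, p_scipy, critical)
```

The statistic 14.2 exceeds the critical value of about 11.07, matching both conclusions above.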

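<Exercise : preference analysis>

Test whether preferences differ among 5 sports drinks

Null hypothesis : there is no difference in preference among the sports drinks.
Alternative hypothesis : there is a difference in preference among the sports drinks.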

import numpy as np
data3 = pd.read_csv('../testdata/drinkdata.csv')
print(data3, ' ', sum(data3['관측도수']))
'''
  음료종류  관측도수
0   s1    41
1   s2    30
2   s3    51
3   s4    71
4   s5    61   254
'''
print(stats.chisquare(data3['관측도수']))
# Power_divergenceResult(statistic=20.488188976377952, pvalue=0.00039991784008227264)

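# Conclusion : pvalue 0.00039 < significance level 0.05, so the null hypothesis is rejected and the alternative hypothesis is adopted.
( Alternative hypothesis : there is a difference in preference among the sports drinks. Use as reference material for later work. )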

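Two-way chi-square - using a contingency table

: Tests two or more variables (groups or categories)
Depending on the number of groups analyzed, it is divided into the independence test and the homogeneity test

Independence test : tests whether two variables are independent
Homogeneity test : tests whether the proportions across groups are the same

Independence (association) test

- For two variables of the same group (e.g. education level and university enrollment), is there an association or not
- The independence test tests the association between two variables

<Exercise : association between education level and smoking>

Variables (2) : education level, smoking
Null hypothesis : education level and smoking are not related. (independent)
Alternative hypothesis : education level and smoking are related. (not independent)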

*chi3.py

import pandas as pd
import scipy.stats as stats
data = pd.read_csv('../testdata/smoke.csv')
print(data)
'''
     education  smoking
0            1        1
1            1        1
2            1        1
3            1        1
4            1        1
..         ...      ...
350          3        3
351          3        3
352          3        3
353          3        3
354          3        3
'''
print(data['education'].unique()) # [1 2 3]
print(data['smoking'].unique())   # [1 2 3]

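Build a contingency table from education level and smoking

# education : independent variable, smoking : dependent variable
ctab = pd.crosstab(index = data['education'], columns = data['smoking']) # counts
#ctab = pd.crosstab(index = data['education'], columns = data['smoking'], normalize=True) # normalize=True gives proportions; the chi-square test needs counts
ctab.index = ['대학원졸', '대졸', '고졸']  # graduate school / university / high school
ctab.columns = ['골초', '보통', '노담']    # heavy smoker / moderate / non-smoker
print(ctab)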
'''
             골초  보통  노담
대학원졸  51  92  68
대졸       22  21   9
고졸       43  28  21
'''

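pd.crosstab(index=, columns=) : contingency table

Function for the two-way chi-square test

chi_result = [ctab.loc['대학원졸'], ctab.loc['대졸'], ctab.loc['고졸']]
chi, p, _, _ = stats.chi2_contingency(chi_result) # pass the rows as a list
chi, p, _, _ = stats.chi2_contingency(ctab)       # or pass the table directly (same result)

print('chi :', chi)    # chi : 18.910915739853955
print('p-value :', p)  # p-value : 0.0008182572832162924

# Conclusion : p-value 0.0008 < 0.05, so the null hypothesis is rejected.
( Alternative hypothesis : education level and smoking are related. (not independent) )

stats.chi2_contingency(data) : two-way chi-square function

Yates' correction

When the contingency table has 1 degree of freedom (a 2x2 table), the x^2 value is computed slightly too high. Correcting it with the formula below is called Yates' continuity correction.
x^2 = ∑ (|O-E| - 0.5)^2 / E

=> chi2_contingency applies Yates' correction automatically when the degrees of freedom is 1 (correction=True is the default).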
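The effect of the correction is easy to see on a 2x2 table. The sketch below reuses the studied/passed counts from the earlier exercise and toggles the correction argument of chi2_contingency; the variable names are illustrative.

```python
import numpy as np
from scipy import stats

table = np.array([[18, 7],
                  [12, 13]])

# Default: Yates' continuity correction is applied because dof == 1
chi2_corr, p_corr, dof, expected = stats.chi2_contingency(table, correction=True)
# Without the correction: the plain sum((O - E)^2 / E)
chi2_raw, p_raw, _, _ = stats.chi2_contingency(table, correction=False)

# The corrected statistic by hand: sum((|O - E| - 0.5)^2 / E)
chi2_hand = ((np.abs(table - expected) - 0.5) ** 2 / expected).sum()

print(chi2_raw, p_raw)
print(chi2_corr, p_corr, chi2_hand)
```

The corrected statistic (about 2.08) is smaller than the uncorrected one (3.0), so the corrected p-value is larger and the test is more conservative.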

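<Exercise : independence test on head counts by race for the whole country and one region>

Is the distribution of head counts by race related between the two groups (whole country : national, specific region : la)?

Null hypothesis : head counts by race for the whole country and the region are not related. Independent.
Alternative hypothesis : head counts by race for the whole country and the region are related. Not independent.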

national = pd.DataFrame(["white"] * 100000 + ["hispanic"] * 60000 + 
                         ["black"] * 50000 + ["asian"] * 15000 +["other"] * 35000)
la = pd.DataFrame(["white"] * 600 + ["hispanic"] * 300 + ["black"] * 250 + 
                  ["asian"] * 75 + ["other"] * 150)
print(national) # [260000 rows x 1 columns]
print(la)       # [1375 rows x 1 columns]

na_table = pd.crosstab(index = national[0], columns = 'count')
la_table = pd.crosstab(index = la[0], columns = 'count')
print(la_table)
'''
col_0     count
0              
asian        75
black       250
hispanic    300
other       150
white       600
'''
na_table['count_la'] = la_table['count'] # add a column
print(na_table)

'''
col_0      count  count_la
0                         
asian      15000        75
black      50000       250
hispanic   60000       300
other      35000       150
white     100000       600
'''
chi, p, _, _ = stats.chi2_contingency(na_table)
print('chi :', chi)    # chi : 18.099524243141698
print('p-value :', p)  # p-value : 0.0011800326671747886

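# Conclusion : p-value : 0.0011 < 0.05, so the null hypothesis is rejected.
( Alternative hypothesis : head counts by race for the whole country and the region are related. Not independent. )

Two-way chi-square

Homogeneity test

- Tests whether the distributions of two groups are the same or different. For two or more groups, it tests whether the proportions of each category are equal across groups, that is, whether two or more sets of categorical data were drawn from populations with the same distribution.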

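<Exercise 1 : satisfaction of trainees by teaching method - homogeneity test>

Null hypothesis : there is no difference in trainee satisfaction by teaching method.
Alternative hypothesis : there is a difference in trainee satisfaction by teaching method.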

* chi4.py

import pandas as pd
import scipy.stats as stats

data = pd.read_csv("../testdata/survey_method.csv")
print(data.head(6))
'''
   no  method  survey
0   1       1       1
1   2       2       2
2   3       3       3
3   4       1       4
4   5       2       5
5   6       3       2
'''
print(data['survey'].unique()) # [1 2 3 4 5] - values of survey
print(data['method'].unique()) # [1 2 3]     - values of method

# Build the contingency table
ctab = pd.crosstab(index=data['method'], columns=data['survey'])
ctab.columns = ['매우만족', '만족', '보통', '불만족', '매우불만족']
ctab.index = ['방법1', '방법2', '방법3']
print(ctab)
'''
     매우만족  만족  보통  불만족  매우불만족
방법1     5   8  15   16      6
방법2     8  14  11   11      6
방법3     8   7  11   15      9
'''

# Chi-square
chi, p, df, ex = stats.chi2_contingency(ctab)
msg = "chi2 : {}, p-value:{}, df:{}"
print(msg.format(chi, p, df))
# chi2 : 6.544667820529891, p-value:0.5864574374550608, df:8

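# Conclusion : p-value 0.586 > 0.05, so the null hypothesis is not rejected.
( Null hypothesis : there is no difference in trainee satisfaction by teaching method. )

<Exercise : homogeneity test of SNS usage by age group>

A survey recorded which SNS services, each with slightly different characteristics, are used by people in their 20s to 40s, and the goal is to set a promotion strategy per age group. Test whether usage is the same across age groups.

Null hypothesis : there is no difference in SNS service usage across age groups (homogeneous).
Alternative hypothesis : there is a difference in SNS service usage across age groups (not homogeneous).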

import pandas as pd
import scipy.stats as stats

data2 = pd.read_csv("../testdata/snsbyage.csv")
print(data2.head(), ' ', data2.shape)
'''
   age service
0    1       F
1    1       F
2    1       F
3    1       F
4    1       F   (1439, 2)
'''
print(data2['age'].unique())     # [1 2 3]
print(data2['service'].unique()) # ['F' 'T' 'K' 'C' 'E']

# Build the contingency table
ctab2 = pd.crosstab(index = data2['age'], columns = data2['service'])
print(ctab2)
'''
service    C   E    F    K    T
age                            
1         81  16  207  111  117
2        109  15  107  236  104
3         32  17   78  133   76
'''

# Chi-square
chi, p, df, ex = stats.chi2_contingency(ctab2)
msg2 = "chi2 : {}, p-value:{}, df:{}"
print(msg2.format(chi, p, df))
# chi2 : 102.75202494484225, p-value:1.1679064204212775e-18, df:8

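# Conclusion : p-value 1.1679064204212775e-18 < 0.05, so the null hypothesis is rejected.
( Alternative hypothesis : there is a difference in SNS service usage across age groups (not homogeneous) )

# Note : if the data above were the population, draw a sample and work with the sampled data.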
sample_data = data2.sample(500, replace=True)

Chi-square + Django

= django_use02_chi(PyDev Django Project)

Create application - coffeesurvey

* settings

...

INSTALLED_APPS = [
    
    ...
    
    'coffeesurvey',
]

...

DATABASES = {
    'default': {
        'ENGINE': 'django.db.backends.mysql', 
        'NAME': 'coffeedb',            # DB name : the database must already exist.
        'USER': 'root',                # account name
        'PASSWORD': '123',             # account password
        'HOST': '127.0.0.1',           # IP of the machine where the DB is installed
        'PORT': '3306',                # DBMS port number
    }
}

* MariaDB

create database coffeedb;

use coffeedb;

create table survey(rnum int primary key auto_increment, gender varchar(4),age int(3),co_survey varchar(10));

insert into survey (gender,age,co_survey) values ('남',10,'스타벅스');
insert into survey (gender,age,co_survey) values ('여',20,'스타벅스');

* anaconda prompt

cd C:\work\psou\django_use02_chi

python manage.py inspectdb > aaa.py

* models

from django.db import models

# Create your models here.
class Survey(models.Model):
    rnum = models.AutoField(primary_key=True)
    gender = models.CharField(max_length=4, blank=True, null=True)
    age = models.IntegerField(blank=True, null=True)
    co_survey = models.CharField(max_length=10, blank=True, null=True)

    class Meta:
        managed = False
        db_table = 'survey'

Make migrations - coffeesurvey

Migrate

* urls(django_use02_chi)

from django.contrib import admin
from django.urls import path
from coffeesurvey import views
from django.urls.conf import include

urlpatterns = [
    path('admin/', admin.site.urls),
    path('', views.Main), # when no path is given
    path('coffee/', include('coffeesurvey.urls')), # delegate to the app's urls
]

* views

from django.shortcuts import render
from coffeesurvey.models import Survey

# Create your views here.
def Main(request):
    return render(request, 'main.html')

...

* main.html

<body>
<h2>메인</h2>
<ul>
	<li>메뉴</li>
	<li>신제품</li>
	<li><a href="coffee/survey">설문조사</a></li>
</ul>
</body>

* urls(coffeesurvey)

from django.urls import path
from coffeesurvey import views

urlpatterns = [
    path('survey', views.SurveyView), 
    path('surveyprocess', views.SurveyProcess),
]

* views

...

def SurveyView(request):
    return render(request, 'survey.html')

* survey.html

<body>
<h2>* 커피전문점에 대한 소비자 인식조사 *</h2>
<form action="/coffee/surveyprocess" method="post">{% csrf_token %}
<table>
	<tr>
		<td>귀하의 성별은?</td>
		<td>
			<label for="genM">남</label>
			<input type="radio" id="genM" name="gender" value="남" checked="checked">
			&nbsp
			<label for="genF">여</label>
			<input type="radio" id="genF" name="gender" value="여">
		</td>
	</tr>
	<tr>
		<td>귀하의 나이는?</td>
		<td>
			<label for="age10">10대</label>
			<input type="radio" id="age10" name="age" value="10" checked="checked">
			&nbsp
			<label for="age20">20대</label>
			<input type="radio" id="age20" name="age" value="20">
			&nbsp
			<label for="age30">30대</label>
			<input type="radio" id="age30" name="age" value="30">
			&nbsp
			<label for="age40">40대</label>
			<input type="radio" id="age40" name="age" value="40">
			&nbsp
			<label for="age50">50대</label>
			<input type="radio" id="age50" name="age" value="50">
		</td>
	</tr>
	<tr>
		<td>선호하는 커피전문점은?</td>
		<td>
			<label for="starbucks">스타벅스</label>
			<input type="radio" id="starbucks" name="co_survey" value="스타벅스" checked="checked">
			&nbsp
			<label for="coffeebean">커피빈</label>
			<input type="radio" id="coffeebean" name="co_survey" value="커피빈">
			&nbsp
			<label for="ediya">이디아</label>
			<input type="radio" id="ediya" name="co_survey" value="이디아">
			&nbsp
			<label for="tomntoms">탐앤탐스</label>
			<input type="radio" id="tomntoms" name="co_survey" value="탐앤탐스">
		</td>
	</tr>
	<tr>
		<td colspan="2">
		<br>
		<input type="submit" value="설문완료">
		<input type="reset" value="초기화">
		</td>
	</tr>
</table>
</form>
</body>

* views

...

import pandas as pd
import scipy.stats as stats
import matplotlib.pyplot as plt

plt.rc('font', family="malgun gothic")

def SurveyProcess(request):
    InsertData(request)
    rdata = list(Survey.objects.all().values())
    # print(rdata) # [{'rnum': 1, 'gender': '남', 'age': 10, 'co_survey': '스타벅스'}, ... 
    
    df, crossTab, results = Analysis(rdata)
    
    # Visualization
    fig = plt.gcf()
    gen_group = df['co_survey'].groupby(df['coNum']).count()
    gen_group.index = ['스타벅스', '커피빈', '이디아', '탐앤탐스']
    gen_group.plot.bar(subplots=True, color=['red', 'blue'], width=0.5)
    plt.xlabel('커피샵')
    plt.ylabel('선호도 건수')
    plt.title('커피샵 별 선호건수')
    fig.savefig('django_use02_chi/coffeesurvey/static/images/co.png')
    
    return render(request, 'list.html', {'df':df.to_html(index=False), \
                                         'crossTab':crossTab.to_html(), 'results':results})

def InsertData(request): # store the submitted data in the DB
    if request.method == 'POST':
        Survey(
            gender = request.POST.get('gender'),
            age = request.POST.get('age'),
            co_survey = request.POST.get('co_survey')
            ).save()
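            # Alternatively, a raw SQL statement could be used.

def Analysis(rdata): # analysis
    df = pd.DataFrame(rdata)
    
    
    df = df.dropna() # drop missing values (dropna() returns a new DataFrame); handle outliers here too if needed.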
    df['genNum'] = df['gender'].apply(lambda g:1 if g =='남' else 2)
    df['coNum'] = df['co_survey'].apply(lambda c:1 if c =='스타벅스' \
        else 2 if c =='커피빈' else 3 if c =='이디아' else 4)
    print(df)
    '''
            rnum gender  age co_survey  genNum  coNum
    0      1      남   10      스타벅스       1      1
    1      2      여   20      스타벅스       2      1
    2      3      남   10      스타벅스       1      1
    3      4      남   10      스타벅스       1      1
    4      5      남   10      탐앤탐스       1      4
    5      6      남   40       커피빈       1      2
    6      7      남   10      스타벅스       1      1
    7      8      남   10      스타벅스       1      1
    8      9      남   10      스타벅스       1      1
    9     10      남   10      스타벅스       1      1
    10    11      남   10      스타벅스       1      1
    11    12      남   10      스타벅스       1      1
    '''
    
    # Contingency table of counts
    crossTab = pd.crosstab(index=df['gender'], columns=df['co_survey'])
    #crossTab = pd.crosstab(index=df['genNum'], columns=df['coNum'])
    print(crossTab)
    
    chi, pv, _, _ = stats.chi2_contingency(crossTab)
    
    # Adjacent string literals inside parentheses are joined, then formatted once
    if pv > 0.05:
        results = ('p값이 {}이므로 유의수준 0.05 <b>이상</b>의 값을 가지므로 <br> '
                   '성별에 따라 선호하는 커피브랜드에는 <b>차이가 없다.</b> (귀무가설 채택)').format(pv)
    else:
        results = ('p값이 {}이므로 유의수준 0.05 <b>이하</b>의 값을 가지므로 <br> '
                   '성별에 따라 선호하는 커피브랜드에는 <b>차이가 있다.</b> (연구가설 채택)').format(pv)
    
    return df, crossTab, results
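The analysis step does not depend on Django and can be tried on its own. The sketch below is an assumption-laden stand-in: the records list imitates the shape of Survey.objects.all().values(), and the labels 'M', 'F', 'A', 'B' and the function name analyze are hypothetical.

```python
import pandas as pd
from scipy import stats

# Hypothetical records shaped like list(Survey.objects.all().values())
records = (
    [{'gender': 'M', 'co_survey': 'A'}] * 12 +
    [{'gender': 'M', 'co_survey': 'B'}] * 8 +
    [{'gender': 'F', 'co_survey': 'A'}] * 7 +
    [{'gender': 'F', 'co_survey': 'B'}] * 13
)

def analyze(records, alpha=0.05):
    """Cross-tabulate gender against preferred brand and run the chi-square test."""
    df = pd.DataFrame(records).dropna()
    cross_tab = pd.crosstab(index=df['gender'], columns=df['co_survey'])
    chi2, p_value, dof, _ = stats.chi2_contingency(cross_tab)
    decision = 'reject H0' if p_value < alpha else 'do not reject H0'
    return cross_tab, p_value, decision

cross_tab, p_value, decision = analyze(records)
print(cross_tab)
print(p_value, decision)
```

Keeping the statistics in a plain function like this makes it possible to test the analysis without a request object or a database.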

* list.html

<body>
<h2>* 커피전문점에 대한 소비자 인식 조사 결과 *</h2>

<a href="/">메인화면</a><br>
<a href="/coffee/survey">다시 설문조사 하기</a><br>
{% if df %}
	{{df|safe}}
{% endif %}<br>

{% if crossTab %}
	{{crossTab|safe}}
{% endif %}<br>

{% if results %}
	{{results|safe}}
{% endif %}<br>
<img src="/static/images/co.png" width="400" />
</body>


[Pandas] pandas 정리2 - db, django

2021. 3. 2. 15:23

local db

* db1.py

# sqlite : DB data -> DataFrame -> DB

import sqlite3

sql = "create table if not exists test(product varchar(10), maker varchar(10), weight real, price integer)"

conn = sqlite3.connect(':memory:')
#conn = sqlite3.connect('mydb.db')
conn.execute(sql)

data = [('mouse', 'samsung', 12.5, 6000), ('keyboard', 'lg', 502.0, 86000)]
stmt = "insert into test values(?, ?, ?, ?)"
conn.executemany(stmt, data)

data1 = ('연필', '모나미', 3.5, 500)
conn.execute(stmt, data1)
conn.commit()

cursor = conn.execute("select * from test")
rows = cursor.fetchall()
for a in rows:
    print(a)
print()

# Store in a DataFrame 1 - using cursor.fetchall()
import pandas as pd
#df1 = pd.DataFrame(rows, columns = ['product', 'maker', 'weight', 'price'])
print(*cursor.description)
df1 = pd.DataFrame(rows, columns = list(zip(*cursor.description))[0])
print(df1)

# Store in a DataFrame 2 - using pd.read_sql()
df2 = pd.read_sql("select * from test", conn)
print(df2)
print()

print(df2.to_html())
print()

# Save DataFrame data to the DB
data = {
    'irum':['신선해', '신기해', '신기한'],
    'nai':[22, 25, 27]
}
frame = pd.DataFrame(data)
print(frame)
print()

conn = sqlite3.connect('test.db')
frame.to_sql('mytable', conn, if_exists = 'append', index = False)
df3 = pd.read_sql("select * from mytable", conn)
print(df3)

cursor.close()
conn.close()
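The same round trip can be written so that the connection is always closed and the filter value is passed as a parameter. A minimal sketch with an in-memory database; the price threshold 10000 is an arbitrary example value.

```python
import sqlite3
from contextlib import closing

import pandas as pd

rows = [('mouse', 'samsung', 12.5, 6000), ('keyboard', 'lg', 502.0, 86000)]

# closing() guarantees conn.close() even if a statement raises
with closing(sqlite3.connect(':memory:')) as conn:
    conn.execute("create table test(product text, maker text, weight real, price integer)")
    conn.executemany("insert into test values(?, ?, ?, ?)", rows)
    conn.commit()
    # The ? placeholder keeps the value out of the SQL string
    df = pd.read_sql("select * from test where price >= ?", conn, params=(10000,))

print(df)
```

Only the keyboard row passes the filter, and the DataFrame stays usable after the connection is closed.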

Cross table (contingency table) - summarizes results (counts) in a table of rows and columns

* cross_test

import pandas as pd

y_true = pd.Series([2, 0, 2, 2, 0, 1, 1, 2, 2, 0, 1, 2])
y_pred = pd.Series([0, 0, 2, 1, 0, 2, 1, 0, 2, 0, 2, 2])

result = pd.crosstab(y_true, y_pred, rownames=['True'], colnames=['Predicted'], margins=True)
print(result)
'''
Predicted  0  1  2  All
True                   
0          3  0  0    3
1          0  1  2    3
2          2  1  3    6
All        5  2  5   12
'''

# Read a demographic dataset
des = pd.read_csv('https://raw.githubusercontent.com/pykwon/python/master/testdata_utf8/descriptive.csv') 
print(des.info())

# Build a data frame from 5 selected columns
data = des[['resident','gender','age','level','pass']]
print(data[:5])

# Cross table of the region and gender columns
table = pd.crosstab(data.resident, data.gender)
print(table)

# Cross table of education level by region and gender
table = pd.crosstab([data.resident, data.gender], data.level)
print(table)
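pd.crosstab can also report proportions instead of counts through its normalize argument. A short sketch on the y_true / y_pred series from the start of this script; normalize='index' is one choice among several ('columns' and 'all' also exist).

```python
import pandas as pd

y_true = pd.Series([2, 0, 2, 2, 0, 1, 1, 2, 2, 0, 1, 2])
y_pred = pd.Series([0, 0, 2, 1, 0, 2, 1, 0, 2, 0, 2, 2])

# normalize='index' divides every row by its row total
ratio = pd.crosstab(y_true, y_pred, rownames=['True'], colnames=['Predicted'],
                    normalize='index')
row_sums = ratio.sum(axis=1)
print(ratio)
```

Each row now sums to 1, so the table reads as "of the items whose true label is r, what share was predicted as c". Proportions are for reading only: the chi-square functions still need the raw counts.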

Connecting to a remote DB

* db2_remote.py

import MySQLdb
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
plt.rc('font', family='malgun gothic')

import csv
import ast
import sys

try:
    with open('mariadb.txt', 'r') as f:
        config = f.read()
except Exception as e:
    print('read err :', e)
    sys.exit()
    
config = ast.literal_eval(config)
print(config)
# {'host': '127.0.0.1', 'user': 'root', 'password': '123', 'database': 'test', 
# 'port': 3306, 'charset': 'utf8', 'use_unicode': True}

try:
    conn = MySQLdb.connect(**config)
    cursor = conn.cursor()
    sql = """
    select jikwon_no, jikwon_name, jikwon_jik, buser_name, jikwon_gen, jikwon_pay
    from jikwon inner join buser
    on jikwon.buser_num = buser.buser_no
    """
    cursor.execute(sql)
    
    rows = cursor.fetchall() # fetch once : a cursor can only be iterated a single time
    for (jikwon_no, jikwon_name, jikwon_jik, buser_name, jikwon_gen, jikwon_pay) in rows:
        print(jikwon_no, jikwon_name, jikwon_jik, buser_name, jikwon_gen, jikwon_pay)

    # save to jikwon.csv
    with open('jikwon.csv', 'w', encoding='utf-8', newline='') as fw:
        writer = csv.writer(fw)
        for row in rows:
            writer.writerow(row)
        print('저장성공')

    # read the data 1 - from the csv file
    df1 = pd.read_csv('jikwon.csv', header=None, names = ('번호', '이름', '직급', '부서', '성별', '연봉'))
    print(df1.head(3))
    print(df1.shape)  # (30, 6)

    # read the data 2 - directly from the DB with pd.read_sql()
    df2 = pd.read_sql(sql, conn)
    df2.columns = ('번호', '이름', '직급', '부서', '성별', '연봉')
    print(df2.head(3))
    '''
           번호   이름  직급   부서 성별    연봉
    0   1  홍길동  이사  총무부  남  9900
    1   2  한송이  부장  영업부  여  8800
    2   3  이순신  과장  영업부  남  7900
    '''

    print('건수 :', len(df2))
    print('건수 :', df2['이름'].count()) # 건수 : 30
    print()
    
    print('직급별 인원 수 :\n', df2['직급'].value_counts())
    print()
    
    print('연봉 평균 :\n', df2.loc[:,'연봉'].sum() / len(df2))
    print('연봉 평균 :\n', df2.loc[:,'연봉'].mean())
    print()
    
    print('연봉 요약 통계 :\n', df2.loc[:,'연봉'].describe())
    print()
    
    print('연봉이 8000이상 : \n', df2.loc[df2['연봉'] >= 8000])
    print()
    
    print('연봉이 5000이상인 영업부 : \n', df2.loc[(df2['연봉'] >= 5000) & (df2['부서'] == '영업부')])
    print()
    
    print('* crosstab')
    ctab = pd.crosstab(df2['성별'], df2['직급'], margins=True)
    print(ctab)
    print()
    
    print('* groupby')
    print(df2.groupby(['성별', '직급'])['이름'].count())
    print()
    
    print('* pivot table')
    print(df2.pivot_table(['연봉'], index=['성별'], columns=['직급'], aggfunc = np.mean))
    print()

    # Visualization - pie chart
    # mean salary by job title
    jik_ypay = df2.groupby(['직급'])['연봉'].mean()
    print(jik_ypay, type(jik_ypay)) # Series
    print(jik_ypay.index)
    print(jik_ypay.values)
    
    plt.pie(jik_ypay,
            labels=jik_ypay.index, 
            labeldistance=0.5,
            counterclock=False,
            shadow=True,
            explode=(0.2, 0, 0, 0.3, 0))
    plt.show()
    
except Exception as e:
    print('process err :', e)
    
finally:
    cursor.close()
    conn.close()

# Save the DataFrame data to the DB
data = {
    'irum':['tom', 'james', 'john'],
    'nai':[22, 25, 27]
}
frame = pd.DataFrame(data)
print(frame)
print()

# pip install sqlalchemy
# pip install pymysql
from sqlalchemy import create_engine
import pymysql # MySQL Connector using pymysql

pymysql.install_as_MySQLdb()
engine = create_engine("mysql+mysqldb://root:"+"123"+"@localhost/test?charset=utf8")  # the encoding= argument was removed in SQLAlchemy 2.0, so the charset goes in the URL
conn = engine.connect()

# Save to MySQL
# using the pandas to_sql function
frame.to_sql(name='mytable', con = engine, if_exists = 'append', index = False)
df3 = pd.read_sql("select * from mytable", conn)
print(df3)
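
The to_sql / read_sql round trip above needs a running MySQL server. As a minimal sketch, assuming an in-memory SQLite database stands in for MySQL (the `sqlite3` connection is only an illustration and is not part of the original setup), the same two calls can be tried without any server:

```python
import sqlite3

import pandas as pd

frame = pd.DataFrame({'irum': ['tom', 'james', 'john'], 'nai': [22, 25, 27]})

# Assumption for this sketch: in-memory SQLite replaces the MySQL server.
conn = sqlite3.connect(':memory:')
try:
    # if_exists='append' creates the table on first use and adds rows afterwards,
    # so calling it twice stores the three rows twice.
    frame.to_sql(name='mytable', con=conn, if_exists='append', index=False)
    frame.to_sql(name='mytable', con=conn, if_exists='append', index=False)
    df_back = pd.read_sql('select * from mytable', conn)
finally:
    conn.close()

print(df_back)
```

Because of `if_exists='append'`, rerunning the original script also keeps adding the same three rows to mytable; `if_exists='replace'` would rebuild the table instead.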

Django

Create a PyDev Django project

* Django - Create application - myjikwonapp

= django_use01

* settings

...

INSTALLED_APPS = [
    
    ...
    
    'myjikwonapp',
]

...

DATABASES = {
    'default': {
        'ENGINE': 'django.db.backends.mysql', 
        'NAME': 'test',                # DB name : the database must already exist.
        'USER': 'root',                # account name
        'PASSWORD': '123',             # account password
        'HOST': '127.0.0.1',           # IP of the machine where the DB is installed
        'PORT': '3306',                # port number of the DBMS
    }
}

* anaconda prompt

cd C:\work\psou\django_use01
python manage.py inspectdb > aaa.py

* models

from django.db import models

# Create your models here.
class Jikwon(models.Model):
    jikwon_no = models.IntegerField(primary_key=True)
    jikwon_name = models.CharField(max_length=10)
    buser_num = models.IntegerField()
    jikwon_jik = models.CharField(max_length=10, blank=True, null=True)
    jikwon_pay = models.IntegerField(blank=True, null=True)
    jikwon_ibsail = models.DateField(blank=True, null=True)
    jikwon_gen = models.CharField(max_length=4, blank=True, null=True)
    jikwon_rating = models.CharField(max_length=3, blank=True, null=True)

    class Meta:
        managed = False
        db_table = 'jikwon'

* Django - Create Migrations - myjikwonapp

* Django - Migrate

* urls

from django.contrib import admin
from django.urls import path
from myjikwonapp import views

urlpatterns = [
    path('admin/', admin.site.urls),
    path('', views.MainFunc),
    path('showdata', views.ShowFunc),
]

* views

from django.shortcuts import render
from myjikwonapp.models import Jikwon
import pandas as pd
import matplotlib.pyplot as plt
plt.rc('font', family='malgun gothic')

# Create your views here.
def MainFunc(request):
    return render(request, 'main.html')

def ShowFunc(request):
    #datas = Jikwon.objects.all()         # fetch every row of the jikwon table
    datas = Jikwon.objects.all().values() # dict type
    #print(datas)                         # <QuerySet [{'jikwon_no': 1, 'jikwon_name': '홍길동', ...
    
    pd.set_option('display.max_columns', 500) # width : 500
    df = pd.DataFrame(datas)
    df.columns = ['사번', '직원명','부서코드', '직급', '연봉', '입사일', '성별', '평점']
    #print(df)
    '''
            사번  직원명  부서코드  직급    연봉         입사일 성별 평점
    0    1  홍길동    10  이사  9900  2008-09-01  남  a
    1    2  한송이    20  부장  8800  2010-01-03  여  b
    2    3  이순신    20  과장  7900  2010-03-03  남  b
    3    4  이미라    30  대리  4500  2014-01-04  여  b
    4    5  이순라    20  사원  3000  2017-08-05  여  b
    '''
    
    # salary sum / mean by department
    buser_group = df['연봉'].groupby(df['부서코드'])
    buser_group_detail = {'sum':buser_group.sum(), 'avg':buser_group.mean()}
    #print(buser_group_detail)
    '''
    {'sum': 부서코드
    10    37900
    20    58900
    30    37300
    40    25050,
    'avg': 부서코드
    10    5414.285714
    20    4908.333333
    30    5328.571429
    40    6262.500000
    }
    '''
    
    # save the chart as an image
    bu_result = buser_group.agg(['sum', 'mean'])
    bu_result.plot.bar()
    #bu_result.plot(kind='bar')
    plt.title("부서별 급여 합/평균")
    fig = plt.gcf()
    fig.savefig('django_use01/myjikwonapp/static/images/jik.png')
    
    return render(request, 'show.html', {'msg':'직원정보', 'datas':df.to_html(), 'buser_group':buser_group_detail})
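
The view above draws on pyplot's global figure on every request. As a hedged sketch, assuming the non-interactive Agg backend is acceptable on the server, the chart can be drawn on its own figure and closed after saving. The helper name `save_bar_chart`, the sample data and the temporary output path are all made up for this illustration:

```python
import os
import tempfile

import matplotlib
matplotlib.use('Agg')  # non-interactive backend: no GUI window is required
import matplotlib.pyplot as plt
import pandas as pd


def save_bar_chart(result, path):
    """Draw `result` as a bar chart on a fresh figure and save it to `path`."""
    fig, ax = plt.subplots()
    result.plot.bar(ax=ax)
    ax.set_title('sum / mean by department')
    fig.savefig(path)
    plt.close(fig)  # release the figure so repeated requests do not accumulate figures
    return path


# Made-up sample rows standing in for the jikwon table.
df = pd.DataFrame({'dept': [10, 10, 20, 20], 'pay': [9900, 3000, 8800, 7900]})
result = df.groupby('dept')['pay'].agg(['sum', 'mean'])
out_path = save_bar_chart(result, os.path.join(tempfile.mkdtemp(), 'jik.png'))
print(os.path.exists(out_path))
```

In the Django view the path would stay the static images folder used above; only the figure handling changes.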

* main.html

<body>
 <h2>메인</h2>
 <a href="showdata">직원정보</a>
</body>

* show.html

<body>
  <h2>{{msg}} (DB -> pandas 이용)</h2>
  {% if datas %}
  {{datas|safe}}
  {% endif %}
  <hr>
  <h2>부서별 급여합</h2>
  총무부 : {{buser_group.sum.10}}<br>
  영업부 : {{buser_group.sum.20}}<br>
  전산부 : {{buser_group.sum.30}}<br>
  관리부 : {{buser_group.sum.40}}<br><br>
  <h2>부서별 급여평균</h2>
  총무부 : {{buser_group.avg.10}}<br>
  영업부 : {{buser_group.avg.20}}<br>
  전산부 : {{buser_group.avg.30}}<br>
  관리부 : {{buser_group.avg.40}}<br><br>
  <img alt="사진" src="/static/images/jik.png" title="차트 1-1">
</body>

* desc_stat

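# Descriptive statistics
'''
Descriptive statistics are statistical techniques that describe and summarize the characteristics of collected data.
Given a sample (which may also be the entire data set), they express summary information about it numerically
or visualize the data.
In other words, they are a way of understanding the features of the data. The mean, variance and standard deviation all belong to descriptive statistics.
'''

# Frequency table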
import pandas as pd

frame = pd.read_csv('../testdata/ex_studentlist.csv')
print(frame.head(2))
print(frame.info())
'''
  name sex  age  grade absence bloodtype  height  weight
0  김길동  남자   23      3       유         O   165.3    68.2
1  이미린  여자   22      2       무        AB   170.1    53.0
'''
print('age mean\t:', frame['age'].mean())
print('age variance\t:', frame['age'].var())
print('age std\t:', frame['age'].std())
print('blood types\t:', frame['bloodtype'].unique())
print(frame.describe().T)
print()

# Headcount by blood type
data1 = frame.groupby(['bloodtype'])['bloodtype'].count()
print('혈액형별 인원수 : ', data1)
'''
혈액형별 인원수 :  bloodtype
A     3
AB    3
B     4
O     5
'''
print()

data2 = pd.crosstab(index = frame['bloodtype'], columns = 'count')
print('혈액형별 인원수 : ', data2)
'''
혈액형별 인원수 :  col_0      count
bloodtype       
A              3
AB             3
B              4
O              5
'''
print()

# Headcount by sex and blood type
#data3 = pd.crosstab(index = frame['bloodtype'], columns = frame['sex'])  # without totals
data3 = pd.crosstab(index = frame['bloodtype'], columns = frame['sex'], margins=True) # with row/column totals
data3.columns = ['남', '여', '행합']
data3.index = ['A', 'AB', 'B', 'O', '열합']
print('성별, 혈액형별 인원수 : ', data3)
'''
성별, 혈액형별 인원수 :      남  여  행합
A   1  2   3
AB  2  1   3
B   3  1   4
O   2  3   5
열합  8  7  15
'''
print()

print(data3/data3.loc['열합','행합'])
print()
'''
           남         여        행합
A   0.066667  0.133333  0.200000
AB  0.133333  0.066667  0.200000
B   0.200000  0.066667  0.266667
O   0.133333  0.200000  0.333333
열합  0.533333  0.466667  1.000000
'''
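
The proportions above come from dividing the table by its grand total. A small sketch of the same idea using the `normalize` argument of `pd.crosstab` follows; the six-row frame is made-up sample data, not the contents of ex_studentlist.csv:

```python
import pandas as pd

# Made-up sample data for illustration only.
frame = pd.DataFrame({
    'bloodtype': ['A', 'A', 'B', 'O', 'O', 'AB'],
    'sex': ['남자', '여자', '남자', '여자', '여자', '남자'],
})

# normalize='all' divides every cell by the grand total;
# margins=True adds the 'All' row and column.
prop = pd.crosstab(frame['bloodtype'], frame['sex'], margins=True, normalize='all')
print(prop)
```

`normalize='index'` or `normalize='columns'` would give row-wise or column-wise proportions instead.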

[MatPlotLib] matplotlib summary

2021. 3. 2. 10:35

matplotlib : plotting library. Provides a wide range of functions for creating graphs.

* mat1.py

import numpy as np
import matplotlib.pyplot as plt

plt.rc('font', family = 'malgun gothic')   # prevents broken Korean glyphs
plt.rcParams['axes.unicode_minus'] = False # prevents a broken minus sign

x = ["서울", "인천", "수원"]
y = [5, 3, 7]
# a tuple also works; a set does not.

plt.xlabel("지역")                 # x-axis label
plt.xlim([-1, 3])                 # x-axis range
plt.ylim([0, 10])                 # y-axis range
plt.yticks(list(range(0, 11, 3))) # y-axis tick positions
plt.plot(x, y)                    # create the graph
plt.show()                        # display the graph

data = np.arange(1, 11, 2)
print(data)    # [1 3 5 7 9]
plt.plot(data) # create the graph
x = [0, 1, 2, 3, 4]
for a, b in zip(x, data):
    plt.text(a, b, str(b)) # label each point on the line with its value
plt.show()     # display the graph

plt.plot(data)
plt.plot(data, data, 'r')
plt.show()

x = np.arange(10)
y = np.sin(x)
print(x, y)
plt.plot(x, y, 'bo')  # blue circles. A style (format string) option.
plt.plot(x, y, 'r+')  # red plus markers
plt.plot(x, y, 'go--', linewidth = 2, markersize = 10) # green circles joined by a dashed line
plt.show()

# hold : overlaying plots on the same axes
x = np.arange(0, np.pi * 3, 0.1)
print(x)
y_sin = np.sin(x)
y_cos = np.cos(x)
plt.plot(x, y_sin, 'r')     # lines and curves
plt.scatter(x, y_cos)       # scatter plot
#plt.plot(x, y_cos, 'b')
plt.xlabel('x축')
plt.ylabel('y축')
plt.legend(['사인', '코사인']) # legend
plt.title('차트 제목')         # chart title
plt.show()

# subplot : splits the Figure into multiple rows and columns
x = np.arange(0, np.pi * 3, 0.1)
y_sin = np.sin(x)
y_cos = np.cos(x)

plt.subplot(2, 1, 1)
plt.plot(x, y_sin)
plt.title('사인')

plt.subplot(2, 1, 2)
plt.plot(x, y_cos)
plt.title('코사인')
plt.show()

irum = ['a', 'b', 'c', 'd', 'e']
kor = [80, 50, 70, 70, 90]
eng = [60, 70, 80, 70, 60]
plt.plot(irum, kor, 'ro-')
plt.plot(irum, eng, 'gs--')
plt.ylim([0, 100])
plt.legend(['국어', '영어'], loc = 4)
plt.grid(True)          # add a grid

fig = plt.gcf()         # grab the current figure so it can be saved later
plt.show()              # display the image
fig.savefig('test.png') # save the image

from matplotlib.pyplot import imread
img = imread('test.png') # read the image
plt.imshow(img)          # display the image
plt.show()

Choosing the chart type

* mat2.py

import numpy as np
import matplotlib.pyplot as plt

fig = plt.figure()             # explicitly create the figure object
ax1 = fig.add_subplot(1, 2, 1) # column 1 of a 1-row, 2-column grid
ax2 = fig.add_subplot(1, 2, 2) # column 2 of a 1-row, 2-column grid

ax1.hist(np.random.randn(10), bins=10, alpha=0.9) # histogram - bins : number of bins, alpha : opacity
ax2.plot(np.random.randn(10)) # line plot
plt.show()

fig, ax = plt.subplots(nrows=2, ncols=1)
ax[0].plot(np.random.randn(10))
ax[1].plot(np.random.randn(10) + 20)
plt.show()

data = [50, 80, 100, 70, 90]
plt.bar(range(len(data)), data)
plt.barh(range(len(data)), data)
plt.show()

data = [50, 80, 100, 70, 90]
err = np.random.randn(len(data))
plt.barh(range(len(data)), data, xerr=err, alpha = 0.6)
plt.show()

plt.pie(data, explode=(0, 0.2, 0, 0, 0), colors = ['yellow', 'blue', 'red'])
plt.show()

n = 30
np.random.seed(42)
x = np.random.rand(n)
y = np.random.rand(n)
color = np.random.rand(n)
scale = np.pi * (15 * np.random.rand(n)) ** 2
plt.scatter(x, y, s = scale, c = color)
plt.show()

import pandas as pd
sdata = pd.Series(np.random.rand(10).cumsum(), index = np.arange(0, 100, 10))
plt.plot(sdata)
plt.show()

# DataFrame
fdata = pd.DataFrame(np.random.randn(1000, 4), index = pd.date_range('1/1/2000', periods = 1000),\
                     columns = list('ABCD'))
print(fdata)
fdata = fdata.cumsum()
plt.plot(fdata)
plt.show()

seaborn : used to complement the features of the matplotlib library

* mat3.py

import matplotlib.pyplot as plt
import seaborn as sns

titanic = sns.load_dataset("titanic")
print(titanic.info())
print(titanic.head(3))

age = titanic['age']
sns.kdeplot(age) # density
plt.show()

sns.histplot(age, kde=True) # histogram + kde (distplot has been deprecated since seaborn 0.11)
plt.show()

sns.relplot(x = 'who', y = 'age', data=titanic)
plt.show()

sns.countplot(x = 'class', data=titanic, hue='who') # hue : the category used to split each bar
plt.show()

t_pivot = titanic.pivot_table(index='class', columns='sex', aggfunc='size') # counts per class and sex; columns='age' leaves NaN cells, which fmt='d' cannot format
print(t_pivot)
sns.heatmap(t_pivot, cmap=sns.light_palette('gray', as_cmap=True), fmt = 'd', annot=True)
plt.show()

import pandas as pd
iris_data = pd.read_csv('../testdata/iris.csv')
print(iris_data.info())
print(iris_data.head(3))
'''
   Sepal.Length  Sepal.Width  Petal.Length  Petal.Width Species
0           5.1          3.5           1.4          0.2  setosa
1           4.9          3.0           1.4          0.2  setosa
2           4.7          3.2           1.3          0.2  setosa
'''

plt.scatter(iris_data['Sepal.Length'], iris_data['Petal.Length'])
plt.xlabel('Sepal.Length')
plt.ylabel('Petal.Length')
plt.show()

cols = []
for s in iris_data['Species']:
    choice = 0
    if s == 'setosa' : choice = 1
    elif s == 'versicolor' : choice = 2
    else: choice = 3
    cols.append(choice)
    
plt.scatter(iris_data['Sepal.Length'], iris_data['Petal.Length'], c = cols)
plt.xlabel('Sepal.Length')
plt.ylabel('Petal.Length')
plt.show()

# pandas plotting features
print(type(iris_data)) # DataFrame
from pandas.plotting import scatter_matrix 
#scatter_matrix(iris_data, diagonal='hist')
scatter_matrix(iris_data, diagonal='kde')
plt.show()

# seaborn
sns.pairplot(iris_data, hue = 'Species', height = 1)
plt.show()

x = iris_data['Sepal.Length'].values
sns.rugplot(x)
plt.show()

sns.kdeplot(x)
plt.show()

# pandas plotting features
import numpy as np 

df = pd.DataFrame(np.random.randn(10, 3), index = pd.date_range('1/1/2000', periods=10), columns=['a', 'b', 'c'])
print(df)

#df.plot() # line chart
#df.plot(kind= 'bar')
df.plot(kind= 'box')
plt.xlabel('time')
plt.ylabel('data')
plt.show()

df[:5].plot.bar(rot=0)
plt.show()

* mat4.py

import pandas as pd

tips = pd.read_csv('../testdata/tips.csv')
print(tips.info())
print(tips.head(3))
'''
   total_bill   tip     sex smoker  day    time  size
0       16.99  1.01  Female     No  Sun  Dinner     2
1       10.34  1.66    Male     No  Sun  Dinner     3
2       21.01  3.50    Male     No  Sun  Dinner     3
'''

tips['gender'] = tips['sex'] # copy the sex column into a new gender column (sex is deleted on the next line)
del tips['sex']
print(tips.head(3))
'''
   total_bill   tip smoker  day    time  size  gender
0       16.99  1.01     No  Sun  Dinner     2  Female
1       10.34  1.66     No  Sun  Dinner     3    Male
2       21.01  3.50     No  Sun  Dinner     3    Male
'''

# add the tip ratio
tips['tip_pct'] = tips['tip'] / tips['total_bill']
print(tips.head(3))
'''
   total_bill   tip smoker  day    time  size  gender   tip_pct
0       16.99  1.01     No  Sun  Dinner     2  Female  0.059447
1       10.34  1.66     No  Sun  Dinner     3    Male  0.160542
2       21.01  3.50     No  Sun  Dinner     3    Male  0.166587
'''

tip_pct_group = tips['tip_pct'].groupby([tips['gender'], tips['smoker']]) # group by gender and smoker
print(tip_pct_group)
print(tip_pct_group.sum())
print(tip_pct_group.max())
print(tip_pct_group.min())

result = tip_pct_group.describe()
print(result)
'''
               count      mean       std  ...       50%       75%       max
gender smoker                             ...                              
Female No       54.0  0.156921  0.036421  ...  0.149691  0.181630  0.252672
       Yes      33.0  0.182150  0.071595  ...  0.173913  0.198216  0.416667
Male   No       97.0  0.160669  0.041849  ...  0.157604  0.186220  0.291990
       Yes      60.0  0.152771  0.090588  ...  0.141015  0.191697  0.710345
'''

print(tip_pct_group.agg('sum'))
print(tip_pct_group.agg('mean'))
print(tip_pct_group.agg('max'))
print(tip_pct_group.agg('min'))

def diffFunc(group):
    diff = group.max() - group.min()
    return diff

result2 = tip_pct_group.agg(['var', 'mean', 'max', diffFunc])
print(result2)
'''
                    var      mean       max  diffFunc
gender smoker                                        
Female No      0.001327  0.156921  0.252672  0.195876
       Yes     0.005126  0.182150  0.416667  0.360233
Male   No      0.001751  0.160669  0.291990  0.220186
       Yes     0.008206  0.152771  0.710345  0.674707
'''

import matplotlib.pyplot as plt
result2.plot(kind = 'barh', title = 'agg func', stacked = True)
plt.show()
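
The list form of agg above names the result columns after the functions. As a small sketch of pandas named aggregation, which lets each output column be named explicitly, consider the following; the four-row frame is made-up data rather than tips.csv, and `value_range` is a hypothetical helper playing the role of diffFunc:

```python
import pandas as pd

# Made-up sample rows for illustration only.
tips = pd.DataFrame({
    'gender': ['Female', 'Female', 'Male', 'Male'],
    'smoker': ['No', 'No', 'Yes', 'Yes'],
    'tip_pct': [0.10, 0.20, 0.15, 0.40],
})


def value_range(group):
    """Difference between the largest and smallest value of a group."""
    return group.max() - group.min()


# keyword = output column name, value = aggregation to apply
summary = tips.groupby(['gender', 'smoker'])['tip_pct'].agg(
    mean='mean',
    max='max',
    spread=value_range,
)
print(summary)
```

The custom function receives each group as a Series, exactly as diffFunc does in the listing above.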

[Pandas] pandas summary

2021. 2. 24. 10:00

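pandas : provides high-level data structures (Series, DataFrame)

Offers reduction operations, missing-data handling, SQL queries, data manipulation, indexing, visualization and more.

1. Series : a one-dimensional, array-like data structure that holds a sequence of values and has an explicit index.

* pdex1.py

Series (built from an ordered data type) : creating a Series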

from pandas import Series
import numpy as np
obj = Series([3, 7, -5, 4])     # int
obj = Series([3, 7, -5, '4'])  # mixed int/str - dtype becomes object (the elements themselves are not converted)
obj = Series([3, 7, -5, 4.5])  # float  - every element is upcast to float
obj = Series((3, 7, -5, 4))    # tuple  - ordered sequences are accepted
#obj = Series({3, 7, -5, 4})   # set    - unordered, so this raises TypeError
print(obj, type(obj))
'''
0    3
1    7
2   -5
3    4
dtype: int64 <class 'pandas.core.series.Series'>
'''

Series( , index = ) : specifying the index

obj2 = Series([3, 7, -5, 4], index = ['a', 'b', 'c', 'd']) # index : sets the labels
print(obj2)
'''
a    3
b    7
c   -5
d    4
dtype: int64
'''

np.sum(obj2) = obj2.sum()

print(sum(obj2), np.sum(obj2), obj2.sum()) # pandas builds on NumPy, so NumPy functions work on a Series directly.
'''
9 9 9
'''

obj.values : returns the values as a NumPy array

obj.index : returns the labels as an Index object

print(obj2.values) # values only
print(obj2.index)  # index only
'''
[ 3  7 -5  4]
Index(['a', 'b', 'c', 'd'], dtype='object')
'''

Indexing and slicing

'''
a    3
b    7
c   -5
d    4
'''
print(obj2['a'], obj2[['a']])
# 3 a    3

print(obj2[['a', 'b']])
'''
a    3
b    7
'''
print(obj2['a':'c'])
'''
a    3
b    7
c   -5
'''
print(obj2.iloc[2]) # -5 (positional access; plain obj2[2] on a label index is deprecated in pandas 2.x)
print(obj2.iloc[1:4])
'''
b    7
c   -5
d    4
'''
print(obj2.iloc[[2,1]])
'''
c   -5
b    7
'''
print(obj2 > 0)
'''
a     True
b     True
c    False
d     True
'''
print('a' in obj2)
# True
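
The examples above mix label-based and position-based access. A short sketch that makes the difference explicit with `.loc` and `.iloc` follows; the variable names `by_label`, `by_position` and `third` are chosen for this illustration:

```python
from pandas import Series

obj2 = Series([3, 7, -5, 4], index=['a', 'b', 'c', 'd'])

by_label = obj2.loc['a':'c']    # label slice: the end label 'c' is included
by_position = obj2.iloc[0:2]    # positional slice: the end position 2 is excluded
third = obj2.iloc[2]            # single element at position 2

print(by_label.tolist(), by_position.tolist(), third)
```

Using `.loc` and `.iloc` avoids any ambiguity when the index itself holds integers.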

Creating a Series from a dict

names = {'mouse':5000, 'keyboard':25000, 'monitor':55000}
print(names)
# {'mouse': 5000, 'keyboard': 25000, 'monitor': 55000}

obj3 = Series(names)
print(obj3, ' ', type(obj3))
'''
mouse        5000
keyboard    25000
monitor     55000
dtype: int64   <class 'pandas.core.series.Series'>
'''
print(obj3['mouse'])
# 5000

obj3.name = '상품가격' # give the Series object a name
print(obj3)
'''
mouse        5000
keyboard    25000
monitor     55000
Name: 상품가격, dtype: int64
'''

DataFrame : a table-shaped (2-D) data structure, essentially several Series combined. Each column can have a different type.

from pandas import DataFrame
df = DataFrame(obj3) # create a DataFrame from a Series
print(df)
'''
           상품가격
mouse      5000
keyboard  25000
monitor   55000
'''
data = {
    'irum':['홍길동', '한국인', '신기해', '공기밥', '한가해'],
    'juso':['역삼동', '신당동', '역삼동', '역삼동', '신사동'],
    'nai':[23, 25, 33, 30, 35]
    }
print(data, type(data))
'''
{'irum': ['홍길동', '한국인', '신기해', '공기밥', '한가해'],
 'juso': ['역삼동', '신당동', '역삼동', '역삼동', '신사동'],
 'nai': [23, 25, 33, 30, 35]}
 <class 'dict'>
'''
frame = DataFrame(data) # create a DataFrame from a dict
print(frame)
'''
  irum juso  nai
0  홍길동  역삼동   23
1  한국인  신당동   25
2  신기해  역삼동   33
3  공기밥  역삼동   30
4  한가해  신사동   35
'''
print(frame['irum']) # access a column dict-style
'''
0    홍길동
1    한국인
2    신기해
3    공기밥
4    한가해
'''
print(frame.irum, ' ', type(frame.irum))    # access a column attribute-style
'''
0    홍길동
1    한국인
2    신기해
3    공기밥
4    한가해
Name: irum, dtype: object
<class 'pandas.core.series.Series'>
'''
print(DataFrame(data, columns=['juso', 'irum', 'nai'])) # change the column order
'''
  juso irum  nai
0  역삼동  홍길동   23
1  신당동  한국인   25
2  역삼동  신기해   33
3  역삼동  공기밥   30
4  신사동  한가해   35
'''
print()
frame2 = DataFrame(data, columns=['juso', 'irum', 'nai', 'tel'], index = ['a', 'b', 'c', 'd', 'e'])
print(frame2)
'''
  juso irum  nai  tel
a  역삼동  홍길동   23  NaN
b  신당동  한국인   25  NaN
c  역삼동  신기해   33  NaN
d  역삼동  공기밥   30  NaN
e  신사동  한가해   35  NaN
'''
frame2['tel'] = '111-1111' # applied to every row of the tel column
print(frame2)
'''
  juso irum  nai       tel
a  역삼동  홍길동   23  111-1111
b  신당동  한국인   25  111-1111
c  역삼동  신기해   33  111-1111
d  역삼동  공기밥   30  111-1111
e  신사동  한가해   35  111-1111
'''
print()
val = Series(['222-2222', '333-2222', '444-2222'], index = ['b', 'c', 'e'])
frame2['tel'] = val
print(frame2)
'''
  juso irum  nai       tel
a  역삼동  홍길동   23       NaN
b  신당동  한국인   25  222-2222
c  역삼동  신기해   33  333-2222
d  역삼동  공기밥   30       NaN
e  신사동  한가해   35  444-2222

'''
print()
print(frame2.T) # swap rows and columns (transpose)
'''
        a         b         c    d         e
juso  역삼동       신당동       역삼동  역삼동       신사동
irum  홍길동       한국인       신기해  공기밥       한가해
nai    23        25        33   30        35
tel   NaN  222-2222  333-2222  NaN  444-2222
'''
print(frame2.values)
'''
[['역삼동' '홍길동' 23 nan]
 ['신당동' '한국인' 25 '222-2222']
 ['역삼동' '신기해' 33 '333-2222']
 ['역삼동' '공기밥' 30 nan]
 ['신사동' '한가해' 35 '444-2222']]
'''
print(frame2.values[0, 1]) # row 0, column 1 : 홍길동
print(frame2.values[0:2])  # rows 0 to 1
'''
[['역삼동' '홍길동' 23 nan]
 ['신당동' '한국인' 25 '222-2222']]
'''

Dropping rows/columns

#frame3 = frame2.drop('d')         # drop the row whose index is d
frame3 = frame2.drop('d', axis=0) # drop the row whose index is d
print(frame3)
'''
  juso irum  nai       tel
a  역삼동  홍길동   23       NaN
b  신당동  한국인   25  222-2222
c  역삼동  신기해   33  333-2222
e  신사동  한가해   35  444-2222
'''
frame3 = frame2.drop('tel', axis = 1) # drop the column named tel
print(frame3)
'''
  juso irum  nai
a  역삼동  홍길동   23
b  신당동  한국인   25
c  역삼동  신기해   33
d  역삼동  공기밥   30
e  신사동  한가해   35
'''

Sorting

print(frame2.sort_index(axis=0, ascending=False)) # sort by row index, descending
'''
  juso irum  nai       tel
e  신사동  한가해   35  444-2222
d  역삼동  공기밥   30       NaN
c  역삼동  신기해   33  333-2222
b  신당동  한국인   25  222-2222
a  역삼동  홍길동   23       NaN
'''
print(frame2.sort_index(axis=1, ascending=True)) # sort by column name, ascending
'''
  irum juso  nai       tel
a  홍길동  역삼동   23       NaN
b  한국인  신당동   25  222-2222
c  신기해  역삼동   33  333-2222
d  공기밥  역삼동   30       NaN
e  한가해  신사동   35  444-2222
'''
print(frame2.rank(axis=0)) # rank the values within each column; strings are ranked lexicographically and ties share the average rank
'''
   juso  irum  nai  tel
a   4.0   5.0  1.0  NaN
b   1.0   4.0  2.0  1.0
c   4.0   2.0  4.0  2.0
d   4.0   1.0  3.0  NaN
e   2.0   3.0  5.0  3.0
'''
print(frame2['juso'].value_counts()) # number of occurrences of each value in the column
'''
역삼동    3
신사동    1
신당동    1
'''

Splitting strings

data = {
    'juso':['강남구 역삼동', '중구 신당동', '강남구 대치동'],
    'inwon':[22,23,24]
}
fr = DataFrame(data)
print(fr)
'''
      juso  inwon
0  강남구 역삼동     22
1   중구 신당동     23
2  강남구 대치동     24
'''
result1 = Series([x.split()[0] for x in fr.juso])
result2 = Series([x.split()[1] for x in fr.juso])
print(result1)
'''
0    강남구
1     중구
2    강남구
'''
print(result2)
'''
0    역삼동
1    신당동
2    대치동
'''
print(result1.value_counts())
'''
강남구    2
중구     1
'''
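
The list comprehensions above split each address by hand. As a sketch of the same step with the vectorized string accessor, `Series.str.split(expand=True)` returns the pieces as columns; the column names `gu` and `dong` are chosen here purely for illustration:

```python
import pandas as pd

fr = pd.DataFrame({
    'juso': ['강남구 역삼동', '중구 신당동', '강남구 대치동'],
    'inwon': [22, 23, 24],
})

# Split on whitespace; expand=True spreads the pieces across columns 0 and 1.
parts = fr['juso'].str.split(expand=True)
parts.columns = ['gu', 'dong']

counts = parts['gu'].value_counts()
print(parts)
print(counts)
```

Unlike the list comprehension, the `.str` accessor passes missing values through as NaN instead of raising an error.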

Reindexing, NaN, boolean handling, slicing-related methods, arithmetic

* pdex2.py

Reindexing a Series

data = Series([1,3,2], index = (1,4,2))
print(data)
'''
1    1
4    3
2    2
'''

data2 = data.reindex((1,2,4)) # reorder the data to match the given index
print(data2)
'''
1    1
2    2
4    3
'''

Filling values when reindexing

data3 = data2.reindex([0,1,2,3,4,5]) # labels with no matching value become NaN (missing values).
print(data3)
'''
0    NaN
1    1.0
2    2.0
3    NaN
4    3.0
5    NaN
'''

data3 = data2.reindex([0,1,2,3,4,5], fill_value = 333) # labels with no matching value are given fill_value.
print(data3)
'''
0    333
1      1
2      2
3    333
4      3
5    333
'''

data3 = data2.reindex([0,1,2,3,4,5], method='ffill') # labels with no matching value take the value of the previous index.
print(data3)
data3 = data2.reindex([0,1,2,3,4,5], method='pad') # 'pad' is an alias of 'ffill'.
print(data3)
'''
0    NaN
1    1.0
2    2.0
3    2.0
4    3.0
5    3.0
'''

data3 = data2.reindex([0,1,2,3,4,5], method='bfill') # labels with no matching value take the value of the next index.
print(data3)
data3 = data2.reindex([0,1,2,3,4,5], method='backfill') # 'backfill' is an alias of 'bfill'.
print(data3)
'''
0    1.0
1    1.0
2    2.0
3    3.0
4    3.0
5    NaN
'''
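
The fill methods above are applied to data2, whose index was already put in order. A hedged sketch of why that matters: `method=` filling needs a monotonic index, so reindexing the original unsorted Series raises ValueError, while sorting first works. The flag name `unsorted_ok` is made up for this illustration:

```python
from pandas import Series

data = Series([1, 3, 2], index=[1, 4, 2])  # index is not in order

try:
    data.reindex(range(6), method='ffill')
    unsorted_ok = True
except ValueError:
    # pandas requires a monotonic index when a fill method is given
    unsorted_ok = False

sorted_data = data.sort_index()                        # index becomes 1, 2, 4
filled = sorted_data.reindex(range(6), method='ffill')

print(unsorted_ok)
print(filled)
```

Label 0 stays NaN after the forward fill because no earlier label exists to copy from.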

Boolean handling / slicing

df = DataFrame(np.arange(12).reshape(4,3), index=['1월', '2월', '3월', '4월'], columns=['강남', '강북', '서초'])
print(df)
'''
    강남  강북  서초
1월   0   1   2
2월   3   4   5
3월   6   7   8
4월   9  10  11
'''

print(df['강남'])
'''
1월    0
2월    3
3월    6
4월    9
'''

print(df['강남'] > 3 ) # returns True or False for each row
'''
1월    False
2월    False
3월     True
4월     True
'''

print(df[df['강남'] > 3] ) # rows (months) where 강남 is greater than 3
'''
    강남  강북  서초
3월   6   7   8
4월   9  10  11
'''

df[df < 3] = 0 # assign 0 wherever the value is less than 3
print(df)
'''
    강남  강북  서초
1월   0   0   0
2월   3   4   5
3월   6   7   8
4월   9  10  11
'''

print(df.loc['3월', :]) # row/column indexing. loc : label-based. iloc : integer position-based
print(df.loc['3월', ]) 
# prints every value in the 3월 row
'''
강남    6
강북    7
서초    8
'''
print(df.loc[:'2월']) # all rows up to and including 2월 (label slices include the end)
'''
    강남  강북  서초
1월   0   0   0
2월   3   4   5
'''
print(df.loc[:'2월',['서초']]) # only the 서초 column, rows up to and including 2월
'''
    서초
1월   0
2월   5
'''
print(df.iloc[2]) # every value in the row at position 2
print(df.iloc[2, :]) # every value in the row at position 2
'''
강남    6
강북    7
서초    8
'''

print(df.iloc[:3]) # rows at positions 0-2 (the end position is excluded)
'''
    강남  강북  서초
1월   0   0   0
2월   3   4   5
3월   6   7   8
'''
print(df.iloc[:3, 2]) # rows at positions 0-2, column at position 2 (서초)
'''
1월    0
2월    5
3월    8
'''
print(df.iloc[:3, 1:3]) # rows at positions 0-2, columns at positions 1-2
'''
    강북  서초
1월   0   0
2월   4   5
3월   7   8
'''
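
The boolean assignment above overwrites df in place, which is why the later slices show zeros in the first row. As a sketch of a non-mutating alternative, `DataFrame.where` keeps the values that satisfy a condition and substitutes the rest in a new frame; the name `masked` is chosen for this illustration:

```python
import numpy as np
from pandas import DataFrame

df = DataFrame(np.arange(12).reshape(4, 3),
               index=['1월', '2월', '3월', '4월'],
               columns=['강남', '강북', '서초'])

# Keep values where the condition is True, use 0 elsewhere; df itself is untouched.
masked = df.where(df >= 3, 0)

print(masked)
print(df.loc['1월'].tolist())
```

This keeps the original values available for later steps that still need them.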

Arithmetic

s1 = Series([1,2,3], index = ['a', 'b', 'c'])
s2 = Series([4,5,6,7], index = ['a', 'b', 'd', 'c'])
print(s1)
'''
a    1
b    2
c    3
'''
print(s2)
'''
a    4
b    5
d    6
c    7
'''

print(s1 + s2) # values are added where the index labels match; unmatched labels give NaN
print(s1.add(s2))
'''
a     5.0
b     7.0
c    10.0
d     NaN
'''

df1 = DataFrame(np.arange(9).reshape(3,3), columns=list('kbs'), index=['서울', '대전', '부산'])
print(df1)
'''
    k  b  s
서울  0  1  2
대전  3  4  5
부산  6  7  8
'''

df2 = DataFrame(np.arange(12).reshape(4,3), columns=list('kbs'), index=['서울', '대전', '제주', '수원'])
print(df2)
'''
    k   b   s
서울  0   1   2
대전  3   4   5
제주  6   7   8
수원  9  10  11
'''

print(df1 + df2) # only matching index labels are computed
print(df1.add(df2))
'''
      k    b     s
대전  6.0  8.0  10.0
부산  NaN  NaN   NaN
서울  0.0  2.0   4.0
수원  NaN  NaN   NaN
제주  NaN  NaN   NaN
'''
print(df1.add(df2, fill_value = 0)) # a value missing on one side is treated as 0 before adding
'''
      k     b     s
대전  6.0   8.0  10.0
부산  6.0   7.0   8.0
서울  0.0   2.0   4.0
수원  9.0  10.0  11.0
제주  6.0   7.0   8.0
'''

seri = df1.iloc[0] # extract row 0 only
print(seri)
'''
k    0
b    1
s    2
'''

print(df1)
'''
    k  b  s
서울  0  1  2
대전  3  4  5
부산  6  7  8
'''

print(df1 - seri)
'''
    k  b  s
서울  0  0  0
대전  3  3  3
부산  6  6  6
'''
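
The subtraction above matches the Series labels against the column names and repeats the operation for every row. A sketch of the other direction, subtracting a column from every column, needs `sub(..., axis=0)`; the names `col`, `by_rows` and `misaligned` are made up for this illustration:

```python
import numpy as np
from pandas import DataFrame

df1 = DataFrame(np.arange(9).reshape(3, 3), columns=list('kbs'),
                index=['서울', '대전', '부산'])

col = df1['k']                    # a Series indexed by city

by_rows = df1.sub(col, axis=0)    # align on the row index, subtract from each column
misaligned = df1 - col            # default: city labels are matched against column names

print(by_rows)
print(misaligned)
```

With the plain `-` operator the city labels never match the column names k, b, s, so every cell of the result is NaN.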

Functions

df = DataFrame([[1.4, np.nan], [7, -4.5], [np.nan, np.nan, None], [0.5, -1]]) # , index=['one', 'two', 'three', 'four']
print(df)
'''
     0    1     2
0  1.4  NaN  None
1  7.0 -4.5  None
2  NaN  NaN  None
3  0.5 -1.0  None
'''
print(df.isnull()) # check for null values
'''
       0      1     2
0  False   True  True
1  False  False  True
2   True   True  True
3  False  False  True
'''
print(df.notnull())
'''
       0      1      2
0   True  False  False
1   True   True  False
2  False  False  False
3   True   True  False
'''
print(df.drop(1)) # drop the row with index 1
'''
     0    1     2
0  1.4  NaN  None
2  NaN  NaN  None
3  0.5 -1.0  None
'''
print()
print(df.dropna()) # drop rows that contain any NA
'''
Empty DataFrame
Columns: [0, 1, 2]
Index: []
'''
print()
print(df.dropna(how='any')) # drop rows that contain any NA (the default)
'''
Empty DataFrame
Columns: [0, 1, 2]
Index: []
'''
print(df.dropna(how='all')) # drop only the rows where every value is NA
'''
     0    1     2
0  1.4  NaN  None
1  7.0 -4.5  None
3  0.5 -1.0  None
'''
print(df.dropna(axis='rows')) # drop rows that contain NA
'''
Empty DataFrame
Columns: [0, 1, 2]
Index: []
'''
print(df.dropna(axis='columns')) # drop columns that contain NA
'''
Empty DataFrame
Columns: []
Index: [0, 1, 2, 3]
'''
print(df.fillna(0)) # fill NA with 0
'''
     0    1  2
0  1.4  0.0  0
1  7.0 -4.5  0
2  0.0  0.0  0
3  0.5 -1.0  0
'''
#print(df.dropna(subset = ['one']))
print(df)
print('--------')
print(df.sum()) # sum of each column
print(df.sum(axis = 0)) # sum of each column
'''
0    8.9
1   -5.5
2    0.0
'''
print(df.sum(axis = 1)) # sum of each row
'''
0    1.4
1    2.5
2    0.0
3   -0.5
'''
print(df.mean(axis = 1)) # mean of each row (NaN is skipped by default)
'''
0    1.40
1    1.25
2     NaN
3   -0.25
'''
print(df.mean(axis = 1, skipna= False)) # mean of each row; NaN is not skipped, so a row containing NaN gives NaN
'''
0     NaN
1    1.25
2     NaN
3   -0.25
'''
print(df.mean(axis = 1, skipna= True)) # mean of each row, NaN skipped (the default)
'''
0    1.40
1    1.25
2     NaN
3   -0.25
'''
print()
print(df.describe()) # summary statistics
'''
              0         1
count  3.000000  2.000000
mean   2.966667 -2.750000
std    3.521837  2.474874
min    0.500000 -4.500000
25%    0.950000 -3.625000
50%    1.400000 -2.750000
75%    4.200000 -1.875000
max    7.000000 -1.000000
'''
print(df.info()) # inspect the structure
'''
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4 entries, 0 to 3
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   0       3 non-null      float64
 1   1       2 non-null      float64
 2   2       0 non-null      object 
dtypes: float64(2), object(1)
memory usage: 224.0+ bytes
None
'''
print()
words = Series(['봄', '여름', '가을', '봄'])
print(words.describe())
'''
count     4
unique    3
top       봄
freq      2
dtype: object
'''
#print(words.info()) #AttributeError: 'Series' object has no attribute 'info'
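
The dropna and fillna calls above use a single rule for the whole frame. A small sketch of two finer-grained options follows: `dropna(thresh=...)` keeps rows with enough real values, and `fillna(df.mean())` fills each column with its own mean. The column names `x` and `y` are made up for this illustration:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[1.4, np.nan], [7, -4.5], [np.nan, np.nan], [0.5, -1]],
                  columns=['x', 'y'])

kept = df.dropna(thresh=1)      # keep rows that have at least one non-NA value
filled = df.fillna(df.mean())   # fill each column's gaps with that column's mean

print(kept)
print(filled)
```

Passing a Series to fillna matches its labels to the column names, so x and y are filled with different values.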

Reshaping : stack, unstack, cut, merge, concat, pivot

* pdex3.py

import pandas as pd
import numpy as np

df = pd.DataFrame(1000 + np.arange(6).reshape(2, 3), index=['대전', '서울'],\
                   columns = ['2019', '2020', '2021'])
print(df)
'''
    2019  2020  2021
대전  1000  1001  1002
서울  1003  1004  1005
'''
#print(df.info()) # show info
print()

df_row = df.stack() # move the columns into an inner index level (wide to long)
print(df_row)
'''
대전  2019    1000
    2020    1001
    2021    1002
서울  2019    1003
    2020    1004
    2021    1005
'''
df_col = df_row.unstack()
print(df_col)
'''
    2019  2020  2021
대전  1000  1001  1002
서울  1003  1004  1005
'''

Binning (categorization)

price = [10.3, 5.5, 7.8, 3.6] # data
cut = [3, 7, 9, 11] # bin edges
result_cut = pd.cut(price, cut) # assign each value to the bin it falls in; intervals are right-closed, e.g. (3, 7]
print(result_cut)
'''
[(9, 11], (3, 7], (7, 9], (3, 7]]
Categories (3, interval[int64]): [(3, 7] < (7, 9] < (9, 11]]
'''

print(pd.value_counts(result_cut)) # number of values in each bin
'''
(3, 7]     2
(9, 11]    1
(7, 9]     1
'''
print()

datas = pd.Series(np.arange(1, 1001))
print(datas.head(3))
'''
0    1
1    2
2    3
'''
print(datas.tail(4))
'''
996     997
997     998
998     999
999    1000
'''
result_cut2 = pd.cut(datas, 3)
print(result_cut2)
'''
0       (0.001, 334.0]
1       (0.001, 334.0]
2       (0.001, 334.0]
3       (0.001, 334.0]
4       (0.001, 334.0]
            ...       
995    (667.0, 1000.0]
996    (667.0, 1000.0]
997    (667.0, 1000.0]
998    (667.0, 1000.0]
999    (667.0, 1000.0]
Length: 1000, dtype: category
Categories (3, interval[float64]): [(0.001, 334.0] < (334.0, 667.0] < (667.0, 1000.0]]
'''
print(pd.value_counts(result_cut2))
'''
(0.001, 334.0]     334
(667.0, 1000.0]    333
(334.0, 667.0]     333
'''
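
pd.cut above produces interval labels and, with an integer bin count, equal-width bins. As a sketch of two related options, `labels=` gives the bins readable names and `pd.qcut` makes bins with equal counts instead. The label names 'low', 'mid' and 'high' are made up for this illustration:

```python
import numpy as np
import pandas as pd

price = [10.3, 5.5, 7.8, 3.6]
# Same edges as above, but each interval gets a name.
labeled = pd.cut(price, bins=[3, 7, 9, 11], labels=['low', 'mid', 'high'])

datas = pd.Series(np.arange(1, 1001))
# qcut chooses the edges so that every bin holds the same number of values.
quartiles = pd.qcut(datas, 4)

print(list(labeled))
print(quartiles.value_counts())
```

cut suits fixed thresholds, while qcut suits rank-based groups such as quartiles.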

merge

df1 = pd.DataFrame({'data1':range(7), 'key':['b', 'b', 'a', 'c', 'a', 'a', 'b']})
print(df1)
'''
   data1 key
0      0   b
1      1   b
2      2   a
3      3   c
4      4   a
5      5   a
6      6   b
'''
df2 = pd.DataFrame({'key':['a', 'b', 'd'], 'data2':range(3)})
print(df2)
'''
  key  data2
0   a      0
1   b      1
2   d      2
'''
print()

print(pd.merge(df1, df2)) # inner join
print(pd.merge(df1, df2, on = 'key')) # inner join
print(pd.merge(df1, df2, on = 'key', how = 'inner')) # inner join
'''
   data1 key  data2
0      0   b      1
1      1   b      1
2      6   b      1
3      2   a      0
4      4   a      0
5      5   a      0
'''

print(pd.merge(df1, df2, on = 'key', how = 'outer')) # full outer join
'''
   data1 key  data2
0    0.0   b    1.0
1    1.0   b    1.0
2    6.0   b    1.0
3    2.0   a    0.0
4    4.0   a    0.0
5    5.0   a    0.0
6    3.0   c    NaN
7    NaN   d    2.0
'''
print(pd.merge(df1, df2, on = 'key', how = 'left')) # left outer join
'''
   data1 key  data2
0      0   b    1.0
1      1   b    1.0
2      2   a    0.0
3      3   c    NaN
4      4   a    0.0
5      5   a    0.0
6      6   b    1.0
'''
print(pd.merge(df1, df2, on = 'key', how = 'right')) # right outer join
'''
   data1 key  data2
0    2.0   a      0
1    4.0   a      0
2    5.0   a      0
3    0.0   b      1
4    1.0   b      1
5    6.0   b      1
6    NaN   d      2
'''

When there is no common column name

df3 = pd.DataFrame({'key2':['a','b','d'], 'data2':range(3)})
print(df3)
'''
  key2  data2
0    a      0
1    b      1
2    d      2
'''

print(df1)
'''
   data1 key
0      0   b
1      1   b
2      2   a
3      3   c
4      4   a
5      5   a
6      6   b
'''

print(pd.merge(df1, df3, left_on = 'key', right_on = 'key2'))
'''
   data1 key key2  data2
0      0   b    b      1
1      1   b    b      1
2      6   b    b      1
3      2   a    a      0
4      4   a    a      0
5      5   a    a      0
'''
print()
print(pd.concat([df1, df3]))
print(pd.concat([df1, df3], axis = 0)) # axis=0 (the default): stack the rows vertically
'''
   data1  key key2  data2
0    0.0    b  NaN    NaN
1    1.0    b  NaN    NaN
2    2.0    a  NaN    NaN
3    3.0    c  NaN    NaN
4    4.0    a  NaN    NaN
5    5.0    a  NaN    NaN
6    6.0    b  NaN    NaN
0    NaN  NaN    a    0.0
1    NaN  NaN    b    1.0
2    NaN  NaN    d    2.0
'''

print(pd.concat([df1, df3], axis = 1)) # axis=1: place the frames side by side, aligned on the index
'''
   data1 key key2  data2
0      0   b    a    0.0
1      1   b    b    1.0
2      2   a    d    2.0
3      3   c  NaN    NaN
4      4   a  NaN    NaN
5      5   a  NaN    NaN
6      6   b  NaN    NaN
'''

Pivot table: reshape the rows and columns of the data and group the values

data = {'city':['강남', '강북', '강남', '강북'],
        'year':[2000, 2001, 2002, 2002],
        'pop':[3.3, 2.5, 3.0, 2]
        }
df = pd.DataFrame(data)
print(df)
'''
  city  year  pop
0   강남  2000  3.3
1   강북  2001  2.5
2   강남  2002  3.0
3   강북  2002  2.0
'''

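print(df.pivot(index='city', columns='year', values='pop')) # reshape only: city as rows, year as columns, pop as values (no averaging)
print(df.set_index(['city', 'year'])['pop'].unstack()) # same result: index by (city, year), then move year into the columns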
'''
year  2000  2001  2002
city                  
강남     3.3   NaN   3.0
강북     NaN   2.5   2.0
'''
print(df.pivot(index='year', columns='city', values='pop')) # year as rows, city as columns (no averaging)
'''
city   강남   강북
year          
2000  3.3  NaN
2001  NaN  2.5
2002  3.0  2.0
'''
print(df['pop'].describe())
'''
count    4.000000
mean     2.700000
std      0.571548
min      2.000000
25%      2.375000
50%      2.750000
75%      3.075000
max      3.300000
'''

groupby

hap = df.groupby(['city'])
print(hap.sum()) # sum per city

print(df.groupby(['city']).sum()) # sum per city
'''
      year  pop
city           
강남    4002  6.3
강북    4003  4.5
'''
print(df.groupby(['city', 'year']).sum()) # sum per city and year
'''
           pop
city year     
강남   2000  3.3
     2002  3.0
강북   2001  2.5
     2002  2.0
'''
print(df.groupby(['city', 'year']).mean()) # mean per city and year
'''
           pop
city year     
강남   2000  3.3
     2002  3.0
강북   2001  2.5
     2002  2.0
'''

pivot_table: sits between pivot and groupby (reshapes like pivot, aggregates like groupby)

print(df)
'''
  city  year  pop
0   강남  2000  3.3
1   강북  2001  2.5
2   강남  2002  3.0
3   강북  2002  2.0
'''

print(df.pivot_table(index=['city']))
print(df.pivot_table(index=['city'], aggfunc=np.mean)) # same as the line above: the default aggfunc is the mean
'''
       pop    year
city              
강남    3.15  2001.0
강북    2.25  2001.5
'''

print(df.pivot_table(index=['city', 'year'], aggfunc=[len, np.sum]))
'''
           len  sum
           pop  pop
city year          
강남   2000  1.0  3.3
     2002  1.0  3.0
강북   2001  1.0  2.5
     2002  1.0  2.0
'''

print(df.pivot_table(values=['pop'], index = 'city')) # mean of pop per city (the default aggregation)
print(df.pivot_table(values=['pop'], index = 'city', aggfunc=np.mean))
'''
       pop
city      
강남    3.15
강북    2.25
'''

print(df.pivot_table(values=['pop'], index = 'city', aggfunc=len))
'''
      pop
city     
강남    2.0
강북    2.0
'''

print(df.pivot_table(values=['pop'], index = ['year'], columns=['city']))
'''
      pop     
city   강남   강북
year          
2000  3.3  NaN
2001  NaN  2.5
2002  3.0  2.0
'''

print(df.pivot_table(values=['pop'], index = ['year'], columns=['city'], margins=True))
'''
       pop           
city    강남    강북  All
year                 
2000  3.30   NaN  3.3
2001   NaN  2.50  2.5
2002  3.00  2.00  2.5
All   3.15  2.25  2.7
'''

print(df.pivot_table(values=['pop'], index = ['year'], columns=['city'], margins=True, fill_value=0))
'''
       pop           
city    강남    강북  All
year                 
2000  3.30  0.00  3.3
2001  0.00  2.50  2.5
2002  3.00  2.00  2.5
All   3.15  2.25  2.7
'''
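
The difference between `pivot` and `pivot_table` shows up once an (index, column) pair occurs more than once. The sketch below adds one extra, made-up row for 강북 in 2002 (a hypothetical value, not from the post) to show that `pivot` refuses to reshape while `pivot_table` aggregates the duplicates.

```python
import pandas as pd

# The sample data from above plus one hypothetical duplicate (강북, 2002).
df_dup = pd.DataFrame({
    'city': ['강남', '강북', '강남', '강북', '강북'],
    'year': [2000, 2001, 2002, 2002, 2002],
    'pop': [3.3, 2.5, 3.0, 2.0, 4.0],
})

# pivot only reshapes, so a duplicated (year, city) pair raises ValueError.
try:
    df_dup.pivot(index='year', columns='city', values='pop')
    pivot_failed = False
except ValueError as err:
    pivot_failed = True
    print('pivot failed:', err)

# pivot_table aggregates the duplicates (here with the mean).
table = df_dup.pivot_table(values='pop', index='year', columns='city', aggfunc='mean')
print(table)
```

This is why the comments on the `pivot` calls should not be read as "average": `pivot` never aggregates, it only rearranges values that are already unique per cell.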

file i/o

* pdex4_fileio.py

import pandas as pd

#local
df = pd.read_csv(r'../testdata/ex1.csv')
#web
df = pd.read_csv(r'https://raw.githubusercontent.com/pykwon/python/master/testdata_utf8/ex1.csv')

print(df, type(df))
'''
   bunho irum  kor  eng
0      1  홍길동   90   90
1      2  신기해   95   80
2      3  한국인  100   85
3      4  박치기   67   54
4      5  마당쇠   55  100 <class 'pandas.core.frame.DataFrame'>
'''

df = pd.read_csv(r'../testdata/ex2.csv', header=None)
print(df, type(df))
'''
   0   1   2   3      4
0  1   2   3   4  hello
1  5   6   7   8  world
2  9  10  11  12    foo <class 'pandas.core.frame.DataFrame'>
'''

df = pd.read_csv(r'../testdata/ex2.csv', names=['a', 'b', 'c', 'd', 'e'])
print(df, type(df))
'''
   a   b   c   d      e
0  1   2   3   4  hello
1  5   6   7   8  world
2  9  10  11  12    foo <class 'pandas.core.frame.DataFrame'>
'''

df = pd.read_csv(r'../testdata/ex2.csv', names=['a', 'b', 'c', 'd', 'e'], index_col = 'e')
print(df, type(df))
'''
       a   b   c   d
e                   
hello  1   2   3   4
world  5   6   7   8
foo    9  10  11  12 <class 'pandas.core.frame.DataFrame'>
'''

df = pd.read_csv(r'../testdata/ex3.txt')
print(df, type(df))
'''
               A         B         C
0  aaa -0.264438 -1.026059 -0.619500
1  bbb  0.927272  0.302904 -0.032399
2  ccc -0.264273 -0.386314 -0.217601
3  ddd -0.871858 -0.348382  1.100491 <class 'pandas.core.frame.DataFrame'>
'''

df = pd.read_csv(r'../testdata/ex3.txt', sep = r'\s+', skiprows=[1, 3]) # raw string so that \s is not treated as an invalid escape
print(df, type(df))
'''
            A         B         C
bbb  0.927272  0.302904 -0.032399
ddd -0.871858 -0.348382  1.100491 <class 'pandas.core.frame.DataFrame'>
'''

df = pd.read_fwf(r'../testdata/data_fwt.txt', encoding = 'utf8', widths=(10, 3, 5), names = ('date', 'name', 'price'))
print(df, type(df))
'''
         date name  price
0  2017-04-10  네이버  32000
1  2017-04-11  네이버  34000
2  2017-04-12  네이버  33000
3  2017-04-10  코리아  22000
4  2017-04-11  코리아  21000
5  2017-04-12  코리아  24000 <class 'pandas.core.frame.DataFrame'>
'''

print(df['date'])
'''
0    2017-04-10
1    2017-04-11
2    2017-04-12
3    2017-04-10
4    2017-04-11
5    2017-04-12
'''

chunksize: when a file is too large to load at once, use this option to read it in pieces

test = pd.read_csv('../testdata/data_csv2.csv', header=None, chunksize = 3)
print(test)
'''
    0      1     2
0   1     사과  3500
1   2      배  5000
2   3    바나나  1000
3   4     수박  7000
4   5     참외  2000
5   6  에스프레소  5500
6   7  아메리카노  2000
7   8   카페라떼  4000
8   9   카푸치노  4000
9  10   카페모카  5000

<pandas.io.parsers.TextFileReader object at 0x00000240E6E521C0>
'''
for p in test:
    #print(p)
    print(p.sort_values(by=2, ascending=True))
    print()
'''
   0    1     2
2  3  바나나  1000
0  1   사과  3500
1  2    배  5000

   0      1     2
4  5     참외  2000
5  6  에스프레소  5500
3  4     수박  7000

   0      1     2
6  7  아메리카노  2000
7  8   카페라떼  4000
8  9   카푸치노  4000

    0     1     2
9  10  카페모카  5000
'''
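
The point of `chunksize` is that each piece can be processed and discarded, so the whole file never has to sit in memory. As a minimal sketch, the example below keeps a running total across chunks. It reads from an in-memory string instead of `data_csv2.csv`, and the rows are illustrative values, not the file's contents.

```python
import io
import pandas as pd

# Stand-in for a large CSV file (illustrative rows only).
csv_text = "1,apple,3500\n2,pear,5000\n3,banana,1000\n4,watermelon,7000\n5,melon,2000\n"

total = 0
rows = 0
# With chunksize set, read_csv returns an iterator of DataFrames.
for chunk in pd.read_csv(io.StringIO(csv_text), header=None, chunksize=2):
    total += chunk[2].sum()   # column 2 holds the price
    rows += len(chunk)

print(rows, total)
```

Note that sorting inside the loop, as in the example above, only orders each chunk on its own; a global ordering needs the chunks combined first.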

Saving to a file

items = {'apple':{'count':10, 'price':1500},
         'orange':{'count':5, 'price':500}}
print(items)
df = pd.DataFrame(items)
print(df)
'''
count     10       5
price   1500     500
'''

csv

df.to_csv('result1.csv', sep=',')
df.to_csv('result2.csv', sep=',', index=False) # exclude the index
df.to_csv('result3.csv', sep=',', index=False, header=False) # exclude the index and the column names

html

data = df.T
'''
        count  price
apple      10   1500
orange      5    500
'''
print(data)
data.to_html('result1.html')

Saving to Excel

df2 = pd.DataFrame({'data':[1,2,3,4,5]})
wr = pd.ExcelWriter('good.xlsx', engine='xlsxwriter')
df2.to_excel(wr, sheet_name = 'Sheet1')
wr.close() # writes the file; wr.save() was removed in pandas 2.0

Reading from Excel

exf = pd.ExcelFile('good.xlsx')
print(exf.sheet_names)

dfdf = exf.parse('Sheet1')
print(dfdf)
dfdf2 = pd.read_excel(open('good.xlsx','rb'), sheet_name='Sheet1')
print(dfdf2)
'''
['Sheet1']
   Unnamed: 0  data
0           0     1
1           1     2
2           2     3
3           3     4
4           4     5
'''

ElementTree: reading XML (and well-formed HTML) documents

* pdex5_etree.py

import xml.etree.ElementTree as etree

# Read the file the ordinary way, as plain text
with open("pdex5.xml", "r", encoding="utf-8") as f:
    xml_f = f.read()
print(xml_f,"\n",type(xml_f))  # <class 'str'>, so only str methods are available
print()

root = etree.fromstring(xml_f)
print(root,'\n', type(root))   # <class 'xml.etree.ElementTree.Element'>, so Element methods are available
print(root.tag, len(root.tag)) # items 5

xmlfile = etree.parse("pdex5.xml")
print(xmlfile)
root = xmlfile.getroot()
print(root.tag)                   # items
print(root[0].tag)                # item
print(root[0][0].tag)             # name
print(root[0][0].attrib)          # {'id': 'ks1'}
print(root[0][2].attrib.keys())   # dict_keys(['kor', 'eng'])
print(root[0][2].attrib.values()) # dict_values(['90', '80'])

* pdex5.xml

<?xml version="1.0" encoding="UTF-8"?>
<!-- The XML declaration above must be the first line of the document. -->
<items>
	<item>
		<name id="ks1">홍길동</name>
		<tel>111-1111</tel>
		<exam kor="90" eng="80"/>
	</item>
	<item>
		<name id="ks2">신길동</name>
		<tel>111-2222</tel>
		<exam kor="100" eng="50"/>
	</item>
</items>
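
Index access such as `root[0][2]` depends on the exact order of the child tags. A sketch of a more robust approach is to look children up by name with `findall()`, `find()` and `findtext()`, and collect one dict per `<item>`. The XML is inlined here (without the declaration line) so the example runs on its own; the `records` name is only illustrative.

```python
import xml.etree.ElementTree as etree

# Inline copy of the pdex5.xml body so the example is self-contained.
xml_text = """<items>
    <item><name id="ks1">홍길동</name><tel>111-1111</tel><exam kor="90" eng="80"/></item>
    <item><name id="ks2">신길동</name><tel>111-2222</tel><exam kor="100" eng="50"/></item>
</items>"""

root = etree.fromstring(xml_text)

records = []
for item in root.findall('item'):      # every direct <item> child of <items>
    name = item.find('name')           # first <name> child
    exam = item.find('exam')
    records.append({
        'id': name.get('id'),          # attribute value
        'name': name.text,             # element text
        'tel': item.findtext('tel'),   # text of the first <tel> child
        'kor': int(exam.get('kor')),
        'eng': int(exam.get('eng')),
    })

print(records)
```

A list of dicts like this can be passed straight to `pd.DataFrame(records)` if a table is needed.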

Beautifulsoup: extracting parts of XML and HTML documents

* pdex6_beautifulsoup.py

# BeautifulSoup: extract parts of XML and HTML documents
# Libraries to install: beautifulsoup4, requests, lxml
# find(), select()
 
import requests  
from bs4 import BeautifulSoup
def go():
    base_url = "http://www.naver.com/index.html"
    
    #storing all the information including headers in the variable source code
    source_code = requests.get(base_url)
    
    #sort source code and store only the plaintext
    plain_text = source_code.text            # response body as a string
    #print(plain_text)
    
    #converting plain_text to Beautiful Soup object so the library can sort thru it
    convert_data = BeautifulSoup(plain_text, 'lxml') # use lxml's HTML parser
    
    # Available parsers
    # BeautifulSoup(markup, "html.parser")
    # BeautifulSoup(markup, "lxml")
    # BeautifulSoup(markup, ["lxml", "xml"])
    # BeautifulSoup(markup, "xml")
    # BeautifulSoup(markup, "html5lib")
    
    # find functions
    # find()
    # find_next()
    # find_all()
    
    for link in convert_data.find_all('a'):  # search for <a> tags (findAll is the older alias)
        href = link.get('href')              # get the href attribute
        #href = base_url + link.get('href')  #Building a clickable url
        print(href)                          #displaying href
go()

find() and select() in Beautifulsoup

* pdex7_beautifulsoup.py

from bs4 import BeautifulSoup

html_page = """
<html><body>
<h1>제목태그</h1>
<p>웹문서 읽기</p>
<p>원하는 자료 선택</p>
</body></html>
"""
print(html_page, type(html_page)) # <class str>

soup = BeautifulSoup(html_page, 'html.parser') # create a BeautifulSoup object
print(type(soup))  # bs4.BeautifulSoup, so BeautifulSoup methods are available
print()

h1 = soup.html.body.h1
print('h1 :', h1.string)  # h1 : 제목태그
p1 = soup.html.body.p     # the first <p>
print('p1 :', p1.string)  # p1 : 웹문서 읽기

p2 = p1.next_sibling      # the newline text node after </p>, not the next tag
p2 = p1.next_sibling.next_sibling
print('p2 :', p2.string)  # p2 : 원하는 자료 선택

Using find()

html_page2 = """
<html><body>
<h1 id='title'>제목태그</h1>
<p>웹문서 읽기</p>
<p id='my'>원하는 자료 선택</p>
</body></html>
"""

soup2 = BeautifulSoup(html_page2, 'html.parser')
print(soup2.p, ' ', soup2.p.string)    # attribute access selects the first matching tag
                                        # <p>웹문서 읽기</p>   웹문서 읽기

print(soup2.find('p').string)          # find('tag name')
                                        # 웹문서 읽기

print(soup2.find('p', id='my').string) # find('tag name', id='id value')
                                        # 원하는 자료 선택

print(soup2.find_all('p'))             # find_all('tag name') 
                                        # [<p>웹문서 읽기</p>, <p id="my">원하는 자료 선택</p>]

print(soup2.find(id='title').string)   # find(id='id value')
                                        # 제목태그

print(soup2.find(id='my').string)
                                        # 원하는 자료 선택

Using find_all() and findAll()

html_page3 = """
<html><body>
<h1 id='title'>제목태그</h1>
<p>웹문서 읽기</p>
<p id='my'>원하는 자료 선택</p>
<div>
    <a href="https://www.naver.com">naver</a><br/>
    <a href="https://www.daum.net">daum</a><br/>
</div>
</body></html>
"""

soup3 = BeautifulSoup(html_page3, 'html.parser')
print(soup3.find('a'))             # <a href="https://www.naver.com">naver</a>

print(soup3.find('a').string)
# naver

print(soup3.find(['a', 'i']))      # find(['tag1', 'tag2']): the first tag matching either name
                                    # <a href="https://www.naver.com">naver</a>

print(soup3.find_all(['a', 'p']))  # [<p>웹문서 읽기</p>, <p id="my">원하는 자료 선택</p>, <a href="https://www.naver.com">naver</a>, <a href="https://www.daum.net">daum</a>]
print(soup3.findAll(['a', 'p']))   # same as above

print(soup3.find_all('a'))         # <a> tags only
                                    # [<a href="https://www.naver.com">naver</a>, <a href="https://www.daum.net">daum</a>]

print(soup3)
print(soup3.prettify()) # indented output
print()

links = soup3.find_all('a')        # [<a href="https://www.naver.com">naver</a>, <a href="https://www.daum.net">daum</a>]
print(links)
for i in links:
    href = i.attrs['href']
    text = i.string
    print(href, text)
# https://www.naver.com naver
# https://www.daum.net daum

Using find_all() with a regular expression

import re
links2 = soup3.find_all(href=re.compile(r'^https://'))
print(links2)                 # [<a href="https://www.naver.com">naver</a>, <a href="https://www.daum.net">daum</a>]

for k in links2:
    print(k.attrs['href'])
# https://www.naver.com
# https://www.daum.net

Using select() (CSS selectors)

html_page4 = """
<html><body>
<div id='hello'>
    <a href="https://www.naver.com">naver</a><br/>
    <a href="https://www.daum.net">daum</a><br/>
    <ul class="world">
        <li>안녕</li>
        <li>반가워</li>
    </ul>
</div>
<div id='hi'>
    seconddiv
</div>
</body></html>
"""

soup4 = BeautifulSoup(html_page4, 'lxml')
aa = soup4.select_one('div#hello > a').string # select_one(): returns only the first match
print("aa :", aa)            # aa : naver

bb = soup4.select("div#hello ul.world > li") # '>' : direct child, space : descendant; select() returns a list of every match
print("bb :", bb) # bb : [<li>안녕</li>, <li>반가워</li>]

for i in bb:
    print("li :", i.string)
# li : 안녕
# li : 반가워
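
`select()` also accepts CSS attribute selectors, which can replace the regular-expression filter shown earlier for simple prefix matches. The sketch below uses `html.parser` so that it does not depend on lxml; the `/local/page` link is a made-up addition to show a non-matching case.

```python
from bs4 import BeautifulSoup

# Same structure as html_page4, plus one hypothetical relative link.
html = """
<div id="hello">
    <a href="https://www.naver.com">naver</a>
    <a href="/local/page">local</a>
    <ul class="world"><li>안녕</li><li>반가워</li></ul>
</div>
"""

soup = BeautifulSoup(html, 'html.parser')

# a[href^="https://"] matches <a> tags whose href starts with https://
external = [a['href'] for a in soup.select('div#hello > a[href^="https://"]')]

# get_text(strip=True) returns the text of each <li> without surrounding whitespace
items = [li.get_text(strip=True) for li in soup.select('ul.world > li')]

print(external)
print(items)
```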

Reading web documents - web scraping

* pdex8.py

import urllib.request as req
from bs4 import BeautifulSoup

url = "https://ko.wikipedia.org/wiki/%EC%9D%B4%EC%88%9C%EC%8B%A0"
wiki = req.urlopen(url)
print(wiki) # <http.client.HTTPResponse object at 0x00000267F71B1550>

soup = BeautifulSoup(wiki, 'html.parser')
# Chrome - F12 - right-click the tag - Copy - Copy selector
print(soup.select_one("#mw-content-text > div.mw-parser-output > p"))

url = "https://news.daum.net/society#1"
daum = req.urlopen(url)
soup = BeautifulSoup(daum, 'html.parser')
print(soup.select_one("div#kakaoIndex > a").string) # 본문 바로가기
datas = soup.select("div#kakaoIndex > a")

for i in datas:
    href = i.attrs['href']
    text = i.string
    print("href :{}, text:{}".format(href, text))
    #print("href :%s, text:%s"%(href, text))
# href :#kakaoBody, text:본문 바로가기
# href :#kakaoGnb, text:메뉴 바로가기
print()

datas2 = soup.findAll('a')
#print(datas2)
for i in datas2[:2]:
    href = i.attrs['href']
    text = i.string
    print("href :{}, text:{}".format(href, text))
    
# href :#kakaoBody, text:본문 바로가기
# href :#kakaoGnb, text:메뉴 바로가기

Weather forecast information

* pdex9_weather.py

#<![CDATA[ ]]>
import urllib.request
import urllib.parse
from bs4 import BeautifulSoup
import pandas as pd
url = "http://www.weather.go.kr/weather/forecast/mid-term-rss3.jsp"
data = urllib.request.urlopen(url).read()
#print(data.decode('utf-8'))
soup = BeautifulSoup(data, 'lxml') # reuse the bytes already downloaded instead of fetching the URL a second time
#print(soup)
print()

title = soup.find('title').string
print(title) # 기상청 육상 중기예보
#wf = soup.find('wf')
wf = soup.select_one('item > description > header > wf')
print(wf)


city = soup.find_all('city')
#print(city)
cityDatas = []
for c in city:
    #print(c.string)
    cityDatas.append(c.string)

df = pd.DataFrame()
df['city'] = cityDatas
print(df.head(3))
'''
  city
0   서울
1   인천
2   수원
'''

#tmEfs = soup.select_one('location data tmef')
#tmEfs = soup.select_one('location > data > tmef')
tmEfs = soup.select_one('location > province + city + data > tmef') # '+' : the immediately following sibling
print(tmEfs) # <tmef>2021-02-28 00:00</tmef>

tempMins = soup.select('location > province + city + data > tmn')
tempDatas = []
for t in tempMins:
    tempDatas.append(t.string)
    
df['temp_min'] = tempDatas
print(df.head(3), len(df))
'''
  city temp_min
0   서울        3
1   인천        3
2   수원        2 41
'''
df.columns = ['지역', '최저기온']
print(df.head(3))
'''
   지역 최저기온
0  서울    3
1  인천    3
2  수원    2
'''
print(df.describe())
print(df.info())

# Save to a file
df.to_csv('날씨정보.csv', index=False)
df2 = pd.read_csv('날씨정보.csv')

print(df2.head(2))
print(df2[0:2])

print(df2.tail(2))
print(df2[-2:len(df)])
'''
     지역  최저기온
39   제주    10
40  서귀포    10
'''

print(df.iloc[0])
'''
지역      서울
최저기온     3
'''
print(type(df.iloc[0])) # Series

print(df.iloc[0:2, :])
'''
   지역 최저기온
0  서울    3
1  인천    3
'''
print(type(df.iloc[0:2, :])) # DataFrame
print("---")
print(df.iloc[0:2, 0:2])
'''
   지역 최저기온
0  서울    3
1  인천    3
'''
print(df['지역'][0:2])
'''
0    서울
1    인천
'''
print(df['지역'][:2])
'''
0    서울
1    인천
'''
print("---")
print(df.loc[1:3])
'''
   지역 최저기온
1  인천    3
2  수원    2
3  파주   -2
'''
print(df.loc[[1, 3]])
'''
   지역 최저기온
1  인천    3
3  파주   -2
'''

print(df.loc[:, '지역'])
'''
0      서울
1      인천
2      수원
3      파주
'''
print('----------')
df = df.astype({'최저기온':'int'})
print(df.info())
print('----------')
print(df['최저기온'].mean()) # 2.1951219512195124
print(df['최저기온'].std()) # 3.034958914014504
print(df['최저기온'].describe())
#print(df['최저기온'] >= 5)
print(df.loc[df['최저기온'] >= 5])
'''
     지역  최저기온
17   여수     5
19   광양     5
27   부산     7
28   울산     5
32   통영     6
35   포항     6
39   제주    10
40  서귀포    10
'''
print('----------')
print(df.sort_values(['최저기온'], ascending=True))
'''
     지역  최저기온
31   거창    -3
6    춘천    -3
3    파주    -2
4    이천    -2
34   안동    -2
13   충주    -1
7    원주    -1
'''

Downloading a web document and saving it to a file - scheduler

* pdex10_schedule.py

# Download a web document and save it to a file - scheduler
from bs4 import BeautifulSoup
import urllib.request as req
import datetime

def working():
    url = "https://finance.naver.com/marketindex/"
    data = req.urlopen(url)
    soup = BeautifulSoup(data, 'html.parser')
    price = soup.select_one("div.head_info > span.value").string
    print("미국 USD :", price) # 미국 USD : 1,108.90
    
    t = datetime.datetime.now()
    print(t) # 2021-02-25 14:52:30.108522
    fname = './usd/' + t.strftime('%Y-%m-%d-%H-%M-%S') + ".txt" # the ./usd directory must already exist
    print(fname) # ./usd/2021-02-25-14-53-59.txt
    
    with open(fname, 'w') as f:
        f.write(price)
    
# Scheduler
import schedule   # pip install schedule
import time
 
# Run once right away
working()
# Run every 10 seconds (.second is only valid with an interval of 1)
schedule.every(10).seconds.do(working)
# Run every 10 minutes
schedule.every(10).minutes.do(working)
# Run every hour
schedule.every().hour.do(working)
# Run every day at 10:30
schedule.every().day.at("10:30").do(working)
# Run every Monday
schedule.every().monday.do(working)
# Run every Wednesday at 13:15
schedule.every().wednesday.at("13:15").do(working)

while True:
    schedule.run_pending()
    time.sleep(1)
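
`working()` writes into `./usd/`, so the job fails on every run if that directory is missing. The sketch below shows one way to make the save step safe to call repeatedly: it creates the directory on demand and builds the timestamped file name with `pathlib`. The function name `save_price` and its parameters are hypothetical, and the example writes into a temporary directory rather than `./usd/`.

```python
import datetime
import tempfile
from pathlib import Path

def save_price(price, out_dir, now=None):
    """Write price to <out_dir>/<timestamp>.txt and return the path (hypothetical helper)."""
    now = now or datetime.datetime.now()
    out_path = Path(out_dir)
    out_path.mkdir(parents=True, exist_ok=True)   # create the directory if it is missing
    fname = out_path / (now.strftime('%Y-%m-%d-%H-%M-%S') + '.txt')
    fname.write_text(price, encoding='utf-8')
    return fname

# Demo with a fixed timestamp so the file name is predictable.
with tempfile.TemporaryDirectory() as tmp:
    saved = save_price('1,108.90', Path(tmp) / 'usd',
                       now=datetime.datetime(2021, 2, 25, 14, 53, 59))
    saved_name = saved.name
    saved_text = saved.read_text(encoding='utf-8')

print(saved_name, saved_text)
```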

Reading web data with the urllib.request and requests modules

* pdex11.py

Method 1

from bs4 import BeautifulSoup

import urllib.request
url = "https://movie.naver.com/movie/sdb/rank/rmovie.nhn" # Naver movie ranking page
data = urllib.request.urlopen(url).read()
print(data)
soup = BeautifulSoup(data, 'lxml')
print(soup)
print(soup.select("div.tit3"))
print(soup.select("div[class=tit3]")) # same as above

for tag in soup.select("div[class=tit3]"):
    print(tag.text.strip())
'''
미션 파서블
극장판 귀멸의 칼날: 무한열차편
소울
퍼펙트 케어
새해전야
몬스터 헌터
'''

Method 2

import requests
data = requests.get(url)
print(data.status_code, data.encoding) # status code and encoding: 200 MS949
datas = data.text
print(datas)

datas = requests.get(url).text # chain the calls and read the text directly
soup2 = BeautifulSoup(datas, "lxml")
print(soup2)

m_list = soup2.findAll("div", "tit3")          # findAll("tag name", "CSS class")
m_list = soup2.findAll("div", {'class':'tit3'}) # findAll("tag name", {'attribute': 'value'})
print(m_list)
count = 1
for i in m_list:
    title = i.find('a')
    #print(title)
    print(str(count) + "위 : " + title.string)
    count += 1
'''
1위 : 미션 파서블
2위 : 극장판 귀멸의 칼날: 무한열차편
3위 : 소울
4위 : 퍼펙트 케어
5위 : 새해전야
6위 : 몬스터 헌터
7위 : 더블패티
'''

Naver real-time search terms

import requests
from bs4 import BeautifulSoup  # HTML parsing library

# User settings
url = 'https://datalab.naver.com/keyword/realtimeList.naver?where=main'
headers = {'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.122 Safari/537.36'}

res = requests.get(url, headers = headers)
soup = BeautifulSoup(res.content, 'html.parser')

# Select the span.item_title elements
data = soup.select('span.item_title')
i = 1

for item in data:
    print(str(i) + ')' + item.get_text())
    i += 1
'''
1)함소원 진화
2)함소원
3)오은영
4)기성용
'''

* pdex12_search.py

Google search

import requests
from bs4 import BeautifulSoup
import webbrowser

def searchFunc(search_word):
    base_url = "https://www.google.com/search?q={0}"
    #sword = base_url.format("colab")
    sword = base_url.format(search_word)
    print(sword) # https://www.google.com/search?q=colab
    
    headers = {'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.122 Safari/537.36'}
    #plain_text = requests.get(sword)
    plain_text = requests.get(sword, headers=headers)
    #print(plain_text.text)
    soup = BeautifulSoup(plain_text.text, 'lxml')
    #print(soup)
    link_data = soup.select('div.yuRUbf > a')
    #print(link_data)
    for link in link_data:
        #print(link.attrs['href'])
        #print(type(link),type(str(link)))
        print(str(link).find('https'),str(link).find('ping') - 2)
        urls = str(link)[str(link).find('https'):str(link).find('ping') -2]
        print(urls)
        webbrowser.open(urls)
        
search_word = "파이썬"
    
searchFunc(search_word)
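
`searchFunc` drops the search word into the URL with `str.format`, which only works cleanly when the word has no spaces or reserved characters. A sketch of a safer way to build the query string is `urllib.parse.urlencode`, which percent-encodes the value. The helper name `build_search_url` is hypothetical.

```python
from urllib.parse import urlencode, quote, parse_qs, urlsplit

def build_search_url(search_word):
    """Return a Google search URL with the query safely encoded (hypothetical helper)."""
    return "https://www.google.com/search?" + urlencode({'q': search_word})

print(build_search_url('colab'))
print(build_search_url('파이썬 pandas'))

# quote() does the same percent-encoding for a path segment.
print(quote('이순신'))

# Decoding the query string gives the original words back.
decoded = parse_qs(urlsplit(build_search_url('파이썬 pandas')).query)
print(decoded)
```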

XML

* pdex13_xml.py

xmlObj.select('tag name')

# Read Gangnam-gu library information provided as XML
import urllib.request as req
from bs4 import BeautifulSoup

url = "http://openapi.seoul.go.kr:8088/sample/xml/SeoulLibraryTime/1/5/"
plainText = req.urlopen(url).read().decode() # read the response body and decode it to text
#print(plainText)

xmlObj = BeautifulSoup(plainText, 'lxml') # lxml's HTML parser, which lowercases tag names
libData = xmlObj.select('row')            # collect every <row> element
#print(libData)

for data in libData:
    name = data.find('lbrry_name').text   # read the lbrry_name tag inside row
    addr = data.find('adres').text        # read the adres tag inside row
    print('도서관명\t:', name,'\n주소\t:',addr)

json

* pdex14.json

{
	"직원":{
		"이름":"홍길동",
		"직급":"대리",
		"전화":"010-222-2222"
	},
	"웹사이트":{
		"카페명":"cafe.daum.net/flowlife",
		"userid":"hell"
	}
}

* pdex14_json.py

json.loads('JSON string')

import json

json_file = "./pdex14.json" # file path
json_data = {}

def readData(filename):
    with open(filename, 'r', encoding="utf-8") as f: # open the file; it is closed automatically
        lines = f.read()                             # read the whole file
    #print(lines)
    return json.loads(lines)                  # decode str -> dict
    
def main():
    json_data = readData(json_file)           # return the JSON file contents as a dict
    print(type(json_data))                    # dict
    
    d1 = json_data['직원']['이름']              # read a value by its key
    d2 = json_data['직원']['직급']
    d3 = json_data['직원']['전화']
    print("이름 : " + d1 +" 직급 : "+d2 + " 전화 : "+d3)
    # 이름 : 홍길동 직급 : 대리 전화 : 010-222-2222

if __name__ == "__main__":
    main()
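
`json.loads` has a counterpart, `json.dumps`, for going from a dict back to text. One detail matters for Korean data: by default `dumps` escapes non-ASCII characters as `\uXXXX`, and `ensure_ascii=False` keeps them readable. The sketch below uses a cut-down version of the `pdex14.json` content.

```python
import json

# Cut-down version of the pdex14.json content.
record = {"직원": {"이름": "홍길동", "직급": "대리"}}

# ensure_ascii=False keeps the Korean text as-is; indent makes it readable.
text = json.dumps(record, ensure_ascii=False, indent=2)
print(text)

# The default escapes every non-ASCII character as \uXXXX.
escaped = json.dumps(record)
print(escaped)

# Both forms decode back to the same dict.
restored = json.loads(text)
print(restored == json.loads(escaped))
```

When writing the result to a file, open it with `encoding='utf-8'` so the unescaped text is saved correctly.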

* pdex15_json.py

jsonData.get('key')

# Read Gangnam-gu library information provided as JSON
import urllib.request as req
import json

url = "http://openapi.seoul.go.kr:8088/sample/json/SeoulLibraryTime/1/5/"
plainText = req.urlopen(url).read().decode() # read the response body and decode it to text
print(plainText)

jsonData = json.loads(plainText)
print(jsonData['SeoulLibraryTime']['row'][0]["LBRRY_NAME"]) # LH강남3단지작은도서관

# Using get()
libData = jsonData.get("SeoulLibraryTime").get("row")
print(libData)

name = libData[0].get('LBRRY_NAME') # LH강남3단지작은도서관
print(name)
print()

for msg in libData:
    name = msg.get('LBRRY_NAME')
    tel = msg.get('TEL_NO')
    addr = msg.get('ADRES')
    print('도서관명\t:', name,'\n주소\t:',addr,'\n전화\t:',tel)
'''
도서관명    : LH강남3단지작은도서관 
주소    : 서울특별시 강남구 자곡로3길 22 
전화    : 02-459-8700
도서관명    : 강남구립못골도서관 
주소    : 서울시 강남구 자곡로 116 
전화    : 02-459-5522
도서관명    : 강남역삼푸른솔도서관 
주소    : 서울특별시 강남구 테헤란로8길 36. 4층 
전화    : 02-2051-1178
도서관명    : 강남한신휴플러스8단지작은도서관 
주소    : 서울특별시 강남구 밤고개로27길 20(율현동, 강남한신휴플러스8단지) 
전화    : 
도서관명    : 강남한양수자인작은씨앗도서관 
주소    : 서울특별시 강남구 자곡로 260 
전화    : 
'''

selenium: automated web browser control

Open the Anaconda prompt
pip install selenium

Go to https://sites.google.com/a/chromium.org/chromedriver/home
Open "Latest stable release: ChromeDriver 88.0.4324.96"
Download chromedriver_win32.zip
Unzip the downloaded file and move it to a directory of your choice

Open the Anaconda prompt

python
from selenium import webdriver
browser = webdriver.Chrome('D:/1. 프로그래밍/0. 설치 Program/Python/chromedriver')
browser.implicitly_wait(5)
browser.get('https://daum.net')
browser.quit()

import time
from selenium import webdriver
from selenium.webdriver.common.by import By
browser = webdriver.Chrome('D:/1. 프로그래밍/0. 설치 Program/Python/chromedriver')
browser.get('http://www.google.com/xhtml')
time.sleep(5)
search_box = browser.find_element(By.NAME, 'q') # find_element_by_name was removed in Selenium 4.3
search_box.send_keys('파이썬')
search_box.submit()
time.sleep(5)
browser.quit()

* pdex16_selenium.py

# Capture a screenshot of a site with Selenium
from selenium import webdriver

try:
    url = "http://www.daum.net"
    browser = webdriver.Chrome('D:/1. 프로그래밍/0. 설치 Program/Python/chromedriver')
    browser.implicitly_wait(3)

    browser.get(url);
    browser.save_screenshot("daum_img.png")
    browser.quit()
    print('성공')

except Exception:
    print('에러')

Morphological analysis

* nlp01.py

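# Corpus: a sample dataset collected for natural language research
# Morphological analysis (Korean): split text into morphemes (the smallest units that carry meaning), such as roots, prefixes, suffixes, and parts of speech, and analyze the result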

from konlpy.tag import Kkma

kkma = Kkma()
#print(kkma)
phrase = "영국 제약사 아스트라제네카의 백신 1차분 접종은 오전 9시부터 전국의 요양원 직원들과 약 5200만명의 환자들에게 투여되기 시작했다. 반가워요"
print(kkma.sentences(phrase))
print(kkma.nouns(phrase)) # print the nouns
print()

from konlpy.tag import Okt
okt = Okt()
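print(okt.pos(phrase))               # (word, part of speech) pairs
print(okt.pos(phrase, stem=True))    # same, with words reduced to their base form
print(okt.nouns(phrase))             # nouns only
print(okt.morphs(phrase))            # all morphemes, without part-of-speech tags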

Word frequency

* nlp2_webscrap

# Look up a word on Korean Wikipedia, run morphological analysis on the page, and print how often each word occurs.

import urllib.request
from bs4 import BeautifulSoup
from konlpy.tag import Okt
from urllib import parse # for percent-encoding Korean text

okt = Okt()
#para = input('검색 단어 입력 : ')
para = "이순신"
para = parse.quote(para)    # percent-encode the Korean text
url = "https://ko.wikipedia.org/wiki/" + para
print(url) # https://ko.wikipedia.org/wiki/%EC%9D%B4%EC%88%9C%EC%8B%A0

page = urllib.request.urlopen(url)        # open the URL
#print(page)

soup = BeautifulSoup(page.read(), 'lxml') # read the page and parse it with lxml's HTML parser
#print(soup)

wordList = []               # nouns extracted by morphological analysis

for item in soup.select("#mw-content-text > div > p"): # read each <p> tag
    #print(item)
    if item.string is not None: # skip tags that do not contain a single text child (.string is None)
        #print(item.string)
        ss = item.string
        wordList += okt.nouns(ss) # extract nouns only

print("wordList :", wordList)
print("단어 수 :", len(wordList))
'''
wordList : ['당시', '조산', '만호', '이순신', ...]
단어 수 : 241
'''

# Count the occurrences of each word in a dict, e.g. 당시: 2, 조산: 5
word_dict = {}

for i in wordList:
    if i in word_dict: # add 1 if the word is already present, otherwise add it with a count of 1
        word_dict[i] += 1
    else:
        word_dict[i] = 1
print("word_dict :", word_dict)
'''
word_dict : {'당시': 3, '조산': 1, '만호': 1,  ...}
'''
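
The counting loop above is the manual form of what `collections.Counter` provides in the standard library. As a sketch, with a short made-up word list standing in for the scraped `wordList`:

```python
from collections import Counter

# Short stand-in for the scraped wordList (illustrative values only).
word_list = ['당시', '조산', '이순신', '당시', '이순신', '이순신']

counts = Counter(word_list)        # dict subclass: word -> number of occurrences
print(counts)

top2 = counts.most_common(2)       # (word, count) pairs, highest count first
print(top2)

unique_words = len(counts)         # same as len(set(word_list))
print(unique_words)
```

`most_common(n)` also replaces the later `sorted(..., key=lambda x: x[1], reverse=True)` step when only the top entries are needed.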

setdata = set(wordList)
print(setdata)
print("단어 수(중복 제거 후) :", len(setdata)) # 단어 수(중복 제거 후) : 169
print()

# Handle the words as a pandas Series
import pandas as pd
woList = pd.Series(wordList)
print(woList[:5])
print(woList.value_counts()[:5]) # the 5 most frequent words and their counts
'''
0     당시
1     조산
2     만호
3    이순신
4     북방
dtype: object
이순신    14
척       7
배       7
대한      5
그       4
dtype: int64

당시      3
조산      1
만호      1
이순신    14
북방      1
'''
print()

woDict = pd.Series(word_dict)
print(woDict[:5])
print(woDict.value_counts())
'''
1     133
2      23
3       7
7       2
4       2
14      1
5       1
'''
print()

# Handle the words as a DataFrame
df1 = pd.DataFrame(wordList, columns =['단어'])
print(df1.head(5))
print()
'''
    단어
0   당시
1   조산
2   만호
3  이순신
4   북방
'''

# Word / frequency
df2 = pd.DataFrame([word_dict.keys(), word_dict.values()])
df2 = df2.T
df2.columns = ['단어', '빈도수']
print(df2.head(3))
'''
   단어  빈도수
0  당시    3
1  조산    1
2  만호    1
'''
df2.to_csv("./이순신.csv", sep=',', index=False)
df3 = pd.read_csv("./이순신.csv")
print(df3.head(3))
'''
   단어  빈도수
0  당시    3
1  조산    1
2  만호    1
'''

one_hot encoding

* nlp3.py

# Turn words into numbers and store them in vectors
import numpy as np

# One-hot encoding of words
data_list = ['python', 'lan', 'program', 'computer', 'say']
print(data_list)

values = []
for x in range(len(data_list)):
    values.append(x)

print(values) # [0, 1, 2, 3, 4]

values_len = len(values) # 5

one_hot = np.eye(values_len)
print(one_hot) # one_hot encoding
'''
[[1. 0. 0. 0. 0.]
 [0. 1. 0. 0. 0.]
 [0. 0. 1. 0. 0.]
 [0. 0. 0. 1. 0.]
 [0. 0. 0. 0. 1.]]
'''
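
The identity matrix on its own does not say which row belongs to which word. A minimal sketch that ties the two together: build a word-to-index mapping and use it to pick the matching row. The names `word_to_index` and `encode` are illustrative.

```python
import numpy as np

data_list = ['python', 'lan', 'program', 'computer', 'say']

# Map each word to its position in the list.
word_to_index = {word: i for i, word in enumerate(data_list)}

# Row i of the identity matrix is the one-hot vector for word i.
one_hot = np.eye(len(data_list))

def encode(word):
    """Return the one-hot vector for a word in data_list."""
    return one_hot[word_to_index[word]]

print(encode('program'))

# argmax recovers the index, and so the word, from a one-hot vector.
print(data_list[int(np.argmax(encode('say')))])
```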

word2vec

: converts words into vectors

Open the Anaconda prompt
pip install gensim
conda remove --force scipy
pip install scipy

from gensim.models import word2vec 

sentence = [['python', 'lan', 'program', 'computer', 'say']]
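model =word2vec.Word2Vec(sentences = sentence, min_count=1, size = 50) # 50-dimensional vectors (gensim 4.x renamed size to vector_size)
print(model.wv)
word_vectors = model.wv
print("word_vectors.vocab : ", word_vectors.vocab) # vocab object made of key/value pairs (gensim 4.x replaces .vocab with .key_to_index)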
print()

vocabs = word_vectors.vocab.keys()
print("vocabs : ", vocabs)
# vocabs :  dict_keys(['python', 'lan', 'program', 'computer', 'say'])

vocab_val = word_vectors.vocab.values()
print("vocab_val : ", vocab_val)
# vocab_val :  dict_values([<gensim.models.keyedvectors.Vocab object at 0x00000234EE66F610>, <gensim.models.keyedvectors.Vocab object at 0x00000234F6674130>, <gensim.models.keyedvectors.Vocab object at 0x00000234F6674190>, <gensim.models.keyedvectors.Vocab object at 0x00000234F6674220>, <gensim.models.keyedvectors.Vocab object at 0x00000234F6809F10>])
print()

word_vectors_list = [word_vectors[v] for v in vocabs]
print(word_vectors_list)
print()
'''
[array([ 1.3842779e-03,  7.4106529e-03,  2.4765935e-03, -8.9635467e-03,
        8.0429604e-03,  7.1699792e-03, -1.5191999e-03,  3.6448328e-04,
       -1.7622416e-03, -5.8619846e-03, -5.2785235e-03,  1.9480551e-03,
'''

similarity(): prints the cosine similarity between two word vectors (a similarity score, not a distance)

print(word_vectors.similarity(w1 = 'lan', w2 = 'program')) # cosine similarity between the words: 0.20018259
print(word_vectors.similarity(w1 = 'lan', w2 = 'say'))     # cosine similarity between the words: -0.10073644
print()
print(model.wv.most_similar(positive='lan'))
# [('program', 0.20018261671066284), ('computer', 0.158003032207489), ('python', 0.11154142022132874), ('say', -0.10073643922805786)]
# The closer the value is to 1, the more similar the words
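
For reference, cosine similarity is the dot product of two vectors divided by the product of their lengths, so it ranges from -1 (opposite direction) to 1 (same direction). The sketch below computes it by hand with NumPy on small made-up vectors; it illustrates the formula and is not gensim's implementation.

```python
import numpy as np

def cosine_similarity(a, b):
    """cos(theta) = (a . b) / (|a| * |b|)"""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

same = cosine_similarity([1, 2, 3], [2, 4, 6])        # same direction
orthogonal = cosine_similarity([1, 0], [0, 1])        # perpendicular
opposite = cosine_similarity([1, 2, 3], [-1, -2, -3]) # opposite direction

print(same, orthogonal, opposite)
```

Because the value depends only on direction, scaling a vector does not change the result, which is why `[1, 2, 3]` and `[2, 4, 6]` count as maximally similar.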

* nlp4.py

# Read web news text, run morphological analysis, then print word similarities
import pandas as pd
from konlpy.tag import Okt

okt = Okt()

with open('news.txt', mode='r', encoding='utf8') as f:
    #print(f.read())
    lines = f.read().split('\n') # split into lines
    print(len(lines))
    
wordDic = {} # dict used to count the words

for line in lines:
    datas = okt.pos(line) # part-of-speech tagging
    #print(datas) # [('(', 'Punctuation'), ('경남', 'Noun'), ('=', 'Punctuation'), ...

    for word in datas:
        if word[1] == 'Noun':  # only nouns are counted
            #print(word) # ('경남', 'Noun')
            #print(word[0] in wordDic)
            if not (word[0] in wordDic): # start a new word at 0, then add 1 to its count
                wordDic[word[0]] = 0 # {word[0] : count, ... } 
            wordDic[word[0]] += 1

print(wordDic)
# {'경남': 9, '뉴스': 4, '김다솜': 1, '기자': 1, '가덕도': 14, ...

# Sort by word count in descending order
keys = sorted(wordDic.items(), key= lambda x : x[1], reverse=True) # sort by count, descending
print(keys)

# Put the results into a DataFrame - word, count
wordList = []
countList = []

for word, count in keys[:20]: # top 20 only
    wordList.append(word)
    countList.append(count)

df = pd.DataFrame()
df['word'] = wordList
df['count'] = countList
print(df) 
'''
      word  count
0       공항     19
1      가덕도     14
2       경남      9
3      특별법      9
4   시민사회단체      8
5       도내      6
'''

# word2vec
result = []
with open('news.txt', mode='r', encoding='utf8') as fr:
    lines = fr.read().split('\n') # split into lines
    for line in lines:
        datas = okt.pos(line, stem=True) # POS tagging. stem=True reduces words to their base form, e.g. 한가하고 -> 한가하다
        #print(datas) # [('(', 'Punctuation'), ('경남', 'Noun'), ('=', 'Punctuation'), ...
        temp = []
        for word in datas:
            if not word[1] in ['Punctuation', 'Suffix', 'Josa', 'Verb', 'Modifier', 'Number', 'Determiner', 'Foreign']:
                temp.append(word[0]) # words kept for this line
        temp2 = (" ".join(temp)).strip() # join the words with spaces and strip surrounding whitespace
        result.append(temp2)
        
print(result)

fileName = "news2.txt"
with open(fileName, mode="w", encoding='utf8') as fw:
    fw.write('\n'.join(result)) # join the lines with \n and save
    print('저장 완료')

* news.txt

(경남=뉴스1) 김다솜 기자 = 가덕도신공항 특별법의 국회통과를 앞둔 26일 경남 도내 시민사회단체와 정당에서 반대 의견이 나오고 있다.
경남기후위기비상행동, 민주노총 경남본부 등 도내 시민사회단체는 이날 오전 경남도청 앞에서 기자회견을 열어 관련 부처에서도 반대 의견을 표명한 가덕도신공항 특별법이 기득권 양당의 담합 행위라고 반발했다.
정의당 경남도당도 같은 입장을 담은 성명을 발표했다.
이들은 관련 부처에서도 반대하는 사업을 받아들일 수 없다고 강조했다. 국토교통부의 가덕도신공항 의견보고서에서 위험성, 효율성 등 부정적인 측면이 지적된 만큼 신중하게 판단해야 한다는 것이다.
국토교통부는 지난 24일 안정성, 시공성, 운영성 등 7가지 점검 내용이 담긴 의견보고서를 제시했다. 보고서는 가덕도신공항 건설 시 안전사고 위험성이 크게 증가하고, 환경훼손도 뒤따른다고 지적하고 있다.
법무부도 가덕도신공항 특별법이 개별적·구체적 사건만 규율하고 있어서 적법 절차 및 평등원칙에 위배될 우려가 있다는 입장을 전했다.
이번 가덕도신공항 특별법에 포함된 예비타당성 조사 면제에 대한 반응도 좋지 않다. 기획재정부는 대규모 신규 사업에서 예산 낭비를 방지하려면 타당성을 검증할 필요가 있다는 의견을 전했다.
도내 시민사회단체는 “관계 부처까지도 수용이 곤란하다고 말하고 있다”며 “설계 없이 공사를 할 수 있게 한 유례를 찾을 수 없는 이 기상천외한 가덕도신공항 특별법은 위험하기 짝이 없다”고 비판했다.
가덕도신공항 특별법 처리를 앞두고 도내 시민사회단체와 민주노총 경남본부, 정의당 경남도당은 26일 법안 폐기를 촉구했다. 이들은 가덕도신공항 건설로 얻는 피해가 막대하다는 점을 거듭 강조했다. (경남 시민사회단체 제공) © 뉴스1
가덕도신공항 특별법 처리를 앞두고 도내 시민사회단체와 민주노총 경남본부, 정의당 경남도당은 26일 법안 폐기를 촉구했다. 이들은 가덕도신공항 건설로 얻는 피해가 막대하다는 점을 거듭 강조했다. (경남 시민사회단체 제공) © 뉴스1
당초 동남권 관문공항을 세우기 위한 목적에도 어긋난다는 지적도 나온다. 부산시가 발표한 가덕도신공항 건설안은 국제선만 개항하고, 국내선은 김해공항만 개항하도록 했는데 동남권 관문공항으로 보기에는 현실적이지 못하다는 설명이다.
국제선과 국내선, 군시설 등 동남권 관문공항의 기본 요소를 갖추려면 부산시에서 추산한 7조5000억 원보다 많은 28조7000억 원이 필요하다는 점도 짚었다. 이는 이명박 정부에 시행된 4대강 사업 예산보다도 많은 액수다.
도내 시민사회단체는 “정부와 여당이 적폐라 비난했던 이명박의 4대강 살리기 사업보다 더 나갔다”며 “동네 하천 정비도 이렇게 하지 않는다”고 일갈했다.
정의당 경남도당도 “모든 분야에서 부적격하다는 가덕도신공항 특별법을 밀어붙이는 건 집권 여당의 명백한 입법권 남용”이라며 “1년 임시 부산시장 자리를 위해 백년지대계인 공항건설을 선거지대계로 전락시킨 더불어민주당과 국민의힘을 규탄한다”고 전했다.
한편 가덕도신공항 특별법은 오늘 국회 본회의를 거쳐 통과될 전망이다. 이 법안은 예비타당성조사를 면제하고, 다른 법률에 우선 적용시키자는 내용 등을 담고 있다.
allcotton@news1.kr
Copyright ⓒ 뉴스1코리아 www.news1.kr 무단복제 및 전재 – 재배포금지

* news2.txt

경남 뉴스 김다솜 기자 가덕도 공항 특별법 국회통과 경남 도내 시민사회단체 정당 반대 의견 있다
경남 기후 위기 비상 행동 민주 노총 경남 본부 등 도내 시민사회단체 날 오전 경남 도청 앞 기자회견 관련 부처 반대 의견 표명 가덕도 공항 특별법 기득권 당 담합 행위 반발
정의당 남도 같다 입장 성명 발표
이 관련 부처 반대 사업 수 없다 강조 국토교통부 가덕도 공항 의견 보고서 위험성 효율 등 부정 측면 지적 만큼 신중하다 판단 것
국토교통부 지난 안정 공성 운영 등 가지 점검 내용 의견 보고서 제시 보고서 가덕도 공항 건설 시 안전 사고 위험성 크게 증가 환경 훼손 지적 있다
법무부 가덕도 공항 특별법 개별 구체 사건 규율 있다 적법 절차 및 평등 원칙 위배 우려 있다 입장 전
이번 가덕도 공항 특별법 포함 예비 타당성 조사 면제 대한 반응 좋다 기획재정부 대규모 신규 사업 예산 낭비 방지 타당성 검증 필요 있다 의견 전
도내 시민사회단체 관계 부처 수용 곤란하다 말 있다 며 설계 없이 공사 수 있다 유례 수 없다 이 기상 외한 가덕도 공항 특별법 위험하다 짝 없다 고 비판
가덕도 공항 특별법 처리 도내 시민사회단체 민주 노총 경남 본부 정의당 남도 법안 폐기 촉구 이 가덕도 공항 건설 피해 막대 점 거듭 강조 경남 시민사회단체 제공 뉴스
가덕도 공항 특별법 처리 도내 시민사회단체 민주 노총 경남 본부 정의당 남도 법안 폐기 촉구 이 가덕도 공항 건설 피해 막대 점 거듭 강조 경남 시민사회단체 제공 뉴스
당초 동남권 관문 공항 위 목적 지적도 부산시 발표 가덕도 공항 건설 국제선 개항 국내선 김해 공항 개항 동남권 관문 공항 보기 현실 못 설명
국제선 국내선 군시설 등 동남권 관문 공항 기본 요소 부산시 추산 원보 많다 원 필요하다 점도 이명박 정부 시행 대강 사업 예산 많다 액수
도내 시민사회단체 정부 여당 적폐 비난 이명박 대강 사업 더 며 동네 하천 정비 이렇게 고 일
정의당 남도 모든 분야 부 적격하다 가덕도 공항 특별법 건 집권 여당 명백하다 입법권 남용 라며 임시 부산시 자리 위해 백년 지대 공항 건설 선거 지대 전락 민주당 국민 힘 규탄 고 전
한편 가덕도 공항 특별법 오늘 국회 본회의 통과 전망 이 법안 예비 타당성조사 면제 다른 법률 우선 적용 내용 등 있다
allcotton@news1.kr
Copyright 뉴스 코리아 www.news1.kr 무단 복제 및 재 재 배포 금지

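# Build the model
from gensim.models import word2vec

genObj = word2vec.LineSentence(fileName)  # assumed: this line is missing from the post; reads news2.txt one sentence per line
model = word2vec.Word2Vec(genObj, size=100, window=10, min_count=2, sg=1)
# size : dimensionality of the word vectors (renamed vector_size in gensim 4.0+)
# window : number of surrounding words used as context
# min_count : words that occur fewer than 2 times are left out of the model
# sg=0 : CBOW, predicts the center word from the surrounding words
# sg=1 : skip-gram, predicts the surrounding words from the center word
print(model)
model.init_sims(replace=True)  # L2-normalize the vectors in place to save memory; the model cannot be trained further afterwards (deprecated in gensim 4.0+)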
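
To see what window means for skip-gram (sg=1), the (center word, context word) pairs can be listed by hand. This is a textbook-style sketch with a hypothetical helper `skipgram_pairs`; gensim's actual training loop differs in its details, so treat it only as an illustration of the idea:

```python
def skipgram_pairs(tokens, window):
    """Return (center, context) pairs for each token and every neighbor within `window` positions."""
    pairs = []
    for i, center in enumerate(tokens):
        start = max(0, i - window)
        end = min(len(tokens), i + window + 1)
        for j in range(start, end):
            if j != i:  # a word is not its own context
                pairs.append((center, tokens[j]))
    return pairs

# First few tokens of news2.txt, shortened for the example.
sentence = ['가덕도', '공항', '특별법', '국회통과']
pairs = skipgram_pairs(sentence, window=1)
print(pairs)
```

With window=1 each word is paired only with its immediate neighbors; CBOW (sg=0) uses the same neighborhoods but predicts the center word from the context words instead.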

# A trained model can be saved and reused later
try:
    model.save('news.model')
except Exception as e:
    print('err', e)

model = word2vec.Word2Vec.load('news.model')
print(model.wv.most_similar(positive=['사업']))
print(model.wv.most_similar(positive=['사업'], topn=3))
print(model.wv.most_similar(positive=['사업', '남도'], topn=3))
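# positive : words whose vectors are added to the query (results should be similar to these)
# negative : words whose vectors are subtracted from the query (results should be dissimilar to these)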
result = model.wv.most_similar(positive=['사업', '남도'], negative=['건설'])
print(result)
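
How positive and negative combine in most_similar() can be sketched with plain Python. According to the gensim documentation, the query is a simple mean of the given word vectors (negative words counted with a minus sign) and candidates are ranked by cosine similarity. The vectors below are hypothetical 2-D values chosen for illustration, not output from news.model:

```python
import math

# Hypothetical 2-D vectors; real Word2Vec vectors are learned and have `size` dimensions.
vectors = {
    '사업': [1.0, 0.0],
    '남도': [0.0, 1.0],
    '공항': [1.0, 1.0],
    '건설': [-1.0, 0.5],
}

def unit(v):
    """Scale a vector to length 1."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def most_similar(positive, negative=(), topn=3):
    """Rank the remaining words by cosine similarity to mean(+positive, -negative)."""
    signed = [(unit(vectors[w]), 1.0) for w in positive]
    signed += [(unit(vectors[w]), -1.0) for w in negative]
    dim = len(signed[0][0])
    query = unit([sum(sign * v[i] for v, sign in signed) / len(signed)
                  for i in range(dim)])
    scores = []
    for word, vec in vectors.items():
        if word in positive or word in negative:
            continue  # the query words themselves are not returned
        cosine = sum(q * x for q, x in zip(query, unit(vec)))
        scores.append((word, cosine))
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]

ranked = most_similar(positive=['사업', '남도'])
print(ranked)
```

With these toy values 공항 ranks first because its vector points in the same direction as the mean of 사업 and 남도.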

Word cloud chart

* nlp5.py

# Run morphological analysis on the search results, compute word frequencies, and draw a word cloud chart from them
from bs4 import BeautifulSoup
import urllib.request
from urllib.parse import quote

#keyboard = input("search term")
keyboard = "주식"  # search term (Korean for "stocks")
#print(quote(keyboard)) # encoding

# Dong-A Ilbo search page
target_url ="https://www.donga.com/news/search?query=" + quote(keyboard)
print(target_url)
source_code = urllib.request.urlopen(target_url)
soup = BeautifulSoup(source_code, 'lxml', from_encoding='utf-8')

msg = ""
for title in soup.find_all("p", "tit"):
    title_link = title.select("a")
    #print(title_link)
    article_url = title_link[0]['href'] # [<a href="https://www.donga.com/news/Issue/031407" target="_blank">독감<span cl ... 
    #print(article_url) # https://bizn.donga.com/3/all/20210226/105634947/2 ..
    
    source_article = urllib.request.urlopen(article_url)  # fetch the actual article
    article_soup = BeautifulSoup(source_article, 'lxml', from_encoding='utf-8')  # separate name so the search-page soup is not overwritten
    contents = article_soup.select('div.article_txt')
    #print(contents)
    for temp in contents:
        item = str(temp.find_all(text=True))
        #print(item)
        msg = msg + item

print(msg)

from urllib.parse import quote

quote(string) : percent-encodes the string so it can be used safely in a URL
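
A short sketch of what quote() does to the search term used above. The variable names `keyword`, `encoded`, and `search_url` are only for this example:

```python
from urllib.parse import quote, unquote

keyword = "주식"
encoded = quote(keyword)  # each UTF-8 byte becomes %XX
print(encoded)            # %EC%A3%BC%EC%8B%9D
print(unquote(encoded))   # 주식

# The encoded term can then be appended to the search URL.
search_url = "https://www.donga.com/news/search?query=" + encoded
print(search_url)
```

Non-ASCII characters are not valid in a URL as-is, which is why the Korean search term is encoded before the request is sent.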

from konlpy.tag import Okt
from collections import Counter

nlp = Okt()
nouns = nlp.nouns(msg)
result = []
for temp in nouns:
    if len(temp) > 1:
        result.append(temp)
print(result)
print(len(result))
count = Counter(result)
print(count)
tag = count.most_common(50)  # keep only the 50 most frequent words

Run in Anaconda Prompt:
pip install simplejson
pip install pytagcloud

import pytagcloud

taglist = pytagcloud.make_tags(tag, maxsize=100)
print(taglist)

pytagcloud.create_tag_image(taglist, "word.png", size=(1000, 600), fontname="korean", rectangular=False)

# Read the saved image
import matplotlib.pyplot as plt
import matplotlib.image as mpimg
# %matplotlib inline # for Jupyter
img = mpimg.imread("word.png")
plt.imshow(img)
plt.show()

# Open in the browser
import webbrowser
webbrowser.open("word.png")

Copy the font (malgun.ttf) from C:\Windows\Fonts
Paste it into C:\anaconda3\Lib\site-packages\pytagcloud\fonts
Add the following entry to fonts.json in that folder:

{
        "name": "korean",
        "ttf": "malgun.ttf",
        "web": "http://fonts.googleapis.com/css?family=Nobile"
},
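
The fonts.json edit can also be scripted instead of done by hand. A minimal sketch, assuming fonts.json is a JSON array of objects with "name", "ttf", and "web" keys as in the entry above; `register_font` is a hypothetical helper, and the demo runs on a temporary stand-in file rather than the real pytagcloud folder:

```python
import json
import os
import tempfile

def register_font(fonts_json_path, name, ttf, web):
    """Append a font entry to the fonts.json list unless the name is already present."""
    with open(fonts_json_path, encoding='utf-8') as f:
        fonts = json.load(f)
    if not any(font.get('name') == name for font in fonts):
        fonts.append({'name': name, 'ttf': ttf, 'web': web})
        with open(fonts_json_path, mode='w', encoding='utf-8') as f:
            json.dump(fonts, f, indent=8)
    return fonts

# Stand-in file with one made-up entry, so the real installation is not touched.
demo_path = os.path.join(tempfile.mkdtemp(), 'fonts.json')
with open(demo_path, mode='w', encoding='utf-8') as f:
    json.dump([{'name': 'Example', 'ttf': 'example.ttf', 'web': 'http://example.com/'}], f)

fonts = register_font(demo_path, 'korean', 'malgun.ttf',
                      'http://fonts.googleapis.com/css?family=Nobile')
print([font['name'] for font in fonts])  # ['Example', 'korean']
```

To apply it for real, the path would point at fonts.json inside the pytagcloud\fonts folder mentioned above; back up the file first.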
