[R] R Notes 14 - Logistic Regression 2: ROC

2021. 2. 1. 13:10

14. ROC: showing a model's performance as a chart.

- Logistic regression model: predict from weather data whether it will rain tomorrow

weather <- read.csv("testdata/weather.csv", stringsAsFactors = FALSE)
# stringsAsFactors = FALSE: read text columns as character strings instead of factors.
dim(weather) #  366  15
head(weather)
colnames(weather)
str(weather)

weather_df <- weather[, c(-1, -6, -8, -14)] # drop a few variables for convenience
head(weather_df, 3)
# MinTemp MaxTemp Rainfall Sunshine WindGustSpeed WindSpeed Humidity Pressure Cloud Temp RainTomorrow
# 1     8.0    24.3      0.0      6.3            30        20       29   1015.0     7 23.6          Yes
# 2    14.0    26.9      3.6      9.7            39        17       36   1008.4     3 25.7          Yes
# 3    13.7    23.4      3.6      3.3            85         6       69   1007.2     7 20.2          Yes

weather_df[!complete.cases(weather_df), ]  # show the rows that contain an NA
sum(is.na(weather_df))
weather_df <- na.omit(weather_df) # remove the rows that contain an NA
sum(is.na(weather_df))
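
As a quick illustration of what these helpers return, here is a minimal sketch on a made-up data frame. The `toy` data and the name `toy_clean` are assumptions for this example only and are not part of weather.csv.

```r
# Hypothetical toy data with two missing values
toy <- data.frame(a = c(1, NA, 3), b = c("x", "y", NA))

complete.cases(toy)          # TRUE for rows that have no NA
toy[!complete.cases(toy), ]  # the rows that contain at least one NA
toy_clean <- na.omit(toy)    # drop those rows
nrow(toy_clean)              # rows left after the cleanup
sum(is.na(toy_clean))        # no NA should remain
```

Note that `complete.cases()` flags the complete rows, so it has to be negated with `!` to list the incomplete ones.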

# Recode the dependent variable RainTomorrow as a dummy variable: Yes -> 1, No -> 0 (categorical -> numeric)
weather_df$RainTomorrow[weather_df$RainTomorrow == 'Yes'] <- 1
weather_df$RainTomorrow[weather_df$RainTomorrow == 'No'] <- 0
weather_df$RainTomorrow <- as.numeric(weather_df$RainTomorrow)
head(weather_df)
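
The three recoding lines above can also be written as a single comparison. The sketch below uses a hypothetical `labels` vector in place of `weather_df$RainTomorrow`; it is one possible shortcut, not the approach used in this post.

```r
# Hypothetical label vector standing in for weather_df$RainTomorrow
labels <- c("Yes", "No", "No", "Yes")

# The comparison yields TRUE/FALSE, which as.integer() turns into 1/0
y <- as.integer(labels == "Yes")
y
```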

- Train/test split (70% / 30%)

set.seed(123)
idx <- sample(1:nrow(weather_df), nrow(weather_df) * 0.7)
train <- weather_df[idx,]
test <- weather_df[-idx,]
dim(train)
dim(test)
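
The index-based split can be sanity-checked on row numbers alone. This is a minimal sketch assuming a hypothetical data set of 100 rows; `train_rows` and `test_rows` are illustrative names.

```r
set.seed(123)
n <- 100                                           # hypothetical number of rows
idx <- sample(seq_len(n), size = floor(n * 0.7))   # floor() makes the rounding explicit

train_rows <- idx
test_rows <- setdiff(seq_len(n), idx)              # same rows that x[-idx, ] would keep

length(train_rows)                        # size of the training set
length(test_rows)                         # size of the test set
length(intersect(train_rows, test_rows))  # overlap between the two sets
```

Because `sample()` draws without replacement by default, no row can end up in both sets.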

- Build the model

weather_model <- glm(RainTomorrow ~ ., data=train, family="binomial")
weather_model
summary(weather_model)
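
The coefficients reported by `summary()` for a binomial `glm` are on the log-odds scale, so exponentiating them gives odds ratios. The sketch below shows this on simulated data; `x`, `y`, and `fit` are made up for the example and have nothing to do with the weather data.

```r
set.seed(42)
x <- rnorm(200)                                # hypothetical predictor
y <- rbinom(200, 1, plogis(-0.5 + 1.2 * x))    # simulated 0/1 outcome

fit <- glm(y ~ x, family = binomial)
coef(fit)                  # estimates on the log-odds scale
odds_ratio <- exp(coef(fit))
odds_ratio                 # multiplicative change in the odds per unit of x
```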

- predict

pred <- predict(weather_model, newdata = test, type = 'response')
head(pred, 10)

result_pred <- ifelse(pred >= 0.5, 1, 0)
head(result_pred, 10)
table(result_pred, test$RainTomorrow) # confusion matrix: rows = predicted, columns = actual
# result_pred  0  1
# 0 79 17
# 1  3 10
(79+10) / nrow(test) # accuracy = (TN + TP) / total = 0.8090909
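
Accuracy alone hides how the errors are distributed. As a worked example, the sketch below re-enters the four counts from the confusion matrix above and derives sensitivity, specificity, and precision from them; the variable names are chosen for this illustration.

```r
# Counts copied from the confusion matrix above (rows = predicted, columns = actual)
cm <- matrix(c(79, 3, 17, 10), nrow = 2,
             dimnames = list(predicted = c("0", "1"), actual = c("0", "1")))

TN <- cm["0", "0"]; FN <- cm["0", "1"]
FP <- cm["1", "0"]; TP <- cm["1", "1"]

sensitivity <- TP / (TP + FN)   # share of rainy days that were caught: 10 / 27
specificity <- TN / (TN + FP)   # share of dry days that were caught:   79 / 82
precision   <- TP / (TP + FP)   # share of "rain" predictions that were right: 10 / 13
round(c(sensitivity = sensitivity, specificity = specificity, precision = precision), 4)
```

With the 0.5 cutoff the model recognizes dry days well but catches only 10 of the 27 rainy days, which is the kind of trade-off the ROC curve makes visible across all cutoffs.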

- ROC curve: a plot for evaluating a classification model

install.packages("ROCR")
library(ROCR)
pr <- prediction(pred, test$RainTomorrow)
pr
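prf <- performance(pr, measure = "tpr", x.measure = "fpr") # measure: y-axis (true positive rate, i.e. sensitivity); x.measure: x-axis (false positive rate)
plot(prf) # The more the curve bows toward the top-left corner, the better the model separates the classes. AUC (area under the ROC curve) summarizes this as a single area.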
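
To show where the points of the curve come from, here is a minimal base-R sketch that computes the false and true positive rate at each cutoff by hand. The `scores` and `actual` vectors and the helper `roc_point()` are hypothetical; this is an illustration of the idea, not how ROCR is implemented.

```r
# Hypothetical predicted probabilities and true labels (not from weather.csv)
scores <- c(0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1)
actual <- c(1,   1,   0,   1,   0,   1,   0,   0)

# Classify with a given cutoff and return one (FPR, TPR) point
roc_point <- function(threshold) {
  predicted <- as.integer(scores >= threshold)
  tpr <- sum(predicted == 1 & actual == 1) / sum(actual == 1)
  fpr <- sum(predicted == 1 & actual == 0) / sum(actual == 0)
  c(threshold = threshold, fpr = fpr, tpr = tpr)
}

# Sweep the cutoff from "predict nothing" (Inf) down to the lowest score
thresholds <- c(Inf, sort(unique(scores), decreasing = TRUE))
roc <- t(sapply(thresholds, roc_point))
roc
```

Passing `roc[, "fpr"]` and `roc[, "tpr"]` to `plot(..., type = "s")` draws the same kind of step curve that `plot(prf)` produces.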

- AUC

auc <- performance(pr, measure = "auc")
auc
auc <- auc@y.values[[1]]
auc # 0.8844828
# Rule of thumb for interpreting AUC
# 0.90-1 = excellent
# 0.80-0.90 = good 
# 0.70-0.80 = fair 
# 0.60-0.70 = poor
# 0.50-0.60 = fail
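
AUC can also be read as the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative one. The sketch below computes it with the rank-based (Mann-Whitney) formula on hypothetical scores; `auc_rank` is an illustrative name, and this is an alternative calculation, not the ROCR code path.

```r
# Hypothetical predicted probabilities and true labels (not from weather.csv)
scores <- c(0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1)
actual <- c(1,   1,   0,   1,   0,   1,   0,   0)

n_pos <- sum(actual == 1)
n_neg <- sum(actual == 0)
r <- rank(scores)   # average ranks, so tied scores count as half a win

# Sum of positive ranks, minus its minimum possible value, over the number of pairs
auc_rank <- (sum(r[actual == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
auc_rank   # 0.8125: positives outrank negatives in 13 of the 16 pairs
```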

- Multinomial logistic regression (multiclass classification)

str(iris)

ind <- sample(1:nrow(iris), nrow(iris) * 0.7, replace = FALSE)
train <- iris[ind, ]
test <- iris[-ind, ]
dim(train)

library(nnet)
m <- multinom(Species~., data = train)
m$fitted.values

m_class <- max.col(m$fitted.values)
m_class
table(m_class)
# 1  2  3 
# 38 31 36

table(m_class, train$Species)
# m_class setosa versicolor virginica
# 1     38          0         0
# 2      0         30         1
# 3      0          1        35
(38 + 30 + 35) / nrow(train) # 0.9809524
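
`max.col()` returns, for each row, the column index of the largest value, which is why `m_class` shows 1, 2, 3 rather than species names. The sketch below maps the indices back to labels on a hypothetical probability matrix; it assumes the columns are named after the classes, as the printed `m$fitted.values` suggests.

```r
# Hypothetical fitted probabilities: one row per observation, one column per class
probs <- matrix(c(0.90, 0.05, 0.05,
                  0.10, 0.70, 0.20,
                  0.05, 0.15, 0.80),
                nrow = 3, byrow = TRUE,
                dimnames = list(NULL, c("setosa", "versicolor", "virginica")))

class_idx <- max.col(probs)               # column index of the largest probability per row
class_lab <- colnames(probs)[class_idx]   # translate the index into a class name
class_idx
class_lab
```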

pred <- predict(m, newdata = test, type='class') # type='probs' returns the class probabilities instead
pred
table(pred, test$Species)
# pred         setosa versicolor virginica
# setosa         12          0         0
# versicolor      0         17         0
# virginica       0          2        14
(12 + 17 + 14) / nrow(test) # 0.9555556

install.packages("caret")
install.packages("e1071")
library(caret)
library(e1071)
confusionMatrix(pred, test$Species)

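# Predict for new, hand-entered values
my <- test
my <- my[c(1,2,3),]   # keep the first three test rows as a template
my <- edit(my)        # open the interactive data editor and change the values by hand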
my
newpr <- predict(m, newdata = my, type ='class')
newpr
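
`edit()` needs an interactive session, so in a script the new observations can be built directly instead. This is a minimal sketch: the measurements in `new_obs` are made up, and for simplicity the model `fit` is trained on the full `iris` data rather than on the `train` subset used above.

```r
library(nnet)   # a recommended package that normally ships with R

set.seed(1)
fit <- multinom(Species ~ ., data = iris, trace = FALSE)  # trace = FALSE hides the iteration log

# Hypothetical new flowers; column names must match the predictors used in the model
new_obs <- data.frame(Sepal.Length = c(5.0, 6.0, 7.0),
                      Sepal.Width  = c(3.4, 2.8, 3.1),
                      Petal.Length = c(1.5, 4.5, 6.0),
                      Petal.Width  = c(0.2, 1.4, 2.2))

new_class <- predict(fit, newdata = new_obs, type = "class")  # predicted species
new_probs <- predict(fit, newdata = new_obs, type = "probs")  # probability per species
new_class
round(new_probs, 3)
```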
