Naver Cloud Camp Day 18 (230518)

[Naver Cloud Camp 7] Training Notes

우기37 2023. 5. 20. 15:29

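#1 Training summary

Today's class was preparation time for the project presentation.

Our team worked on getting the best accuracy we could from the brazilian_medical_noshow data.

The process is summarized below for reference.

1) Data collection and planning

2) Data preprocessing

3) Modeling

4) Model review and final performance comparison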

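#2 Data collection and planning

1. First, the team went through the structure of the medical_noshow data together, exploring and discussing columns we did not understand and values that looked questionable (e.g., negative values).

2. We used the dataset provided by our instructor and split the work among team members so that we could try the various models covered in class.

3. Before the preprocessing was settled I mainly tested deep learning; afterwards I used GridSearchCV to find hyperparameters, combined the tuned models with Voting, and compared the result against the other models.

4. In the end, the deep learning model (MLP) achieved the best accuracy, ahead of the machine learning approaches (GridSearchCV, RandomizedSearchCV, Optuna, SelectFromModel, CatBoost, LightGBM, XGBoost, Embedding, Bagging).

5. Having hit a performance ceiling, we concluded that the problem lay in the preprocessing rather than in the models, so we searched online and used preprocessing examples from Kaggle as a reference.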

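#3 Data preprocessing

Initial preprocessing)

We expected the top and bottom 5% of Age values to hurt accuracy, so we removed them as outliers (the code below uses EllipticEnvelope with contamination=.1, which flags roughly 10% of rows).

Our reasoning was that very young children (age 0 and other preschoolers) and people aged 100 or over are usually accompanied by a guardian or are more likely to be in an urgent situation, so these ages would contribute very little to predicting a no-show.

We also dropped the 'PatientId' (patient identifier), 'AppointmentID' (appointment identifier), 'ScheduledDay' (date the booking was made), 'Hipertension' (hypertension), 'Diabetes' (diabetes), and 'Alcoholism' (alcoholism) columns, expecting them to have no effect.

Our reasoning was that IDs and dates are unrelated to no-shows, and that patients with serious conditions would naturally go to the hospital rather than miss the appointment.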
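
The "5% from each tail" idea can also be written directly with quantiles. The following is a minimal sketch on synthetic ages (not the medical_noshow data); the names `toy` and `trimmed` are made up for illustration, and the listing that follows in this post uses EllipticEnvelope instead.

```python
import numpy as np
import pandas as pd

# Synthetic ages for illustration only; not the medical_noshow data.
rng = np.random.default_rng(0)
toy = pd.DataFrame({"Age": rng.integers(0, 101, size=1000)})

# 5th and 95th percentiles of Age.
low, high = toy["Age"].quantile([0.05, 0.95])

# Keep rows inside the central band (inclusive on both ends).
trimmed = toy[toy["Age"].between(low, high)]

print(len(toy), len(trimmed), low, high)
```

Because the bounds are inclusive and ages are integers, the kept share can be slightly more than 90% when many rows tie with a boundary value.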

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import time

from sklearn.model_selection import train_test_split, StratifiedKFold, GridSearchCV
from sklearn.preprocessing import MinMaxScaler, StandardScaler, PowerTransformer
from sklearn.ensemble import VotingClassifier
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import LabelEncoder

from xgboost import XGBClassifier
from lightgbm import LGBMClassifier
from catboost import CatBoostClassifier

from sklearn.covariance import EllipticEnvelope

import warnings
warnings.filterwarnings('ignore')

# 1. Data preprocessing #

path = '../medical_noshow.csv'
medical_noshow = pd.read_csv(path)

# print(medical_noshow.columns)
# print(medical_noshow.head(10))

medical_noshow.AppointmentDay = pd.to_datetime(medical_noshow.AppointmentDay).dt.date
medical_noshow.ScheduledDay = pd.to_datetime(medical_noshow.ScheduledDay).dt.date
medical_noshow['PeriodBetween'] = medical_noshow.AppointmentDay - medical_noshow.ScheduledDay
# convert derived datetime to int
medical_noshow['PeriodBetween'] = medical_noshow['PeriodBetween'].dt.days

medical_noshow['diseaseCount'] = medical_noshow.Hipertension + medical_noshow.Diabetes + medical_noshow.Alcoholism

medical_noshow = medical_noshow.dropna(axis=0).reset_index(drop=True)    # drop rows with NaN and renumber so positions match index labels

# build an outlier detector that flags about 10% of rows
outliers = EllipticEnvelope(contamination=.1)
# fit the detector on Age
outliers.fit(medical_noshow[['Age']])
# predict returns -1 for outliers and 1 for inliers
predictions = outliers.predict(medical_noshow[['Age']])
# positional indices of the rows flagged as outliers
outlier_indices = np.where(predictions == -1)[0]
# mark the flagged ages as NaN (positions equal labels after reset_index)
medical_noshow.loc[outlier_indices, 'Age'] = np.nan
# print(medical_noshow[medical_noshow['PeriodBetween'] < 0])
# a negative PeriodBetween means the appointment precedes the booking; blank out those rows
medical_noshow[medical_noshow['PeriodBetween'] < 0] = np.nan

medical_noshow = medical_noshow.dropna(axis = 0)    # drop every row marked NaN above

# print(datasets.PeriodBetween.describe())
x = medical_noshow[['PatientId', 'AppointmentID', 'Gender',   'ScheduledDay', 
              'AppointmentDay', 'PeriodBetween', 'Age', 'Neighbourhood', 
              'Scholarship', 'diseaseCount', 'Hipertension', 'Diabetes', 'Alcoholism', 
              'Handcap', 'SMS_received']]
y = medical_noshow[['No-show']]

# print(x.info())
# print(y.info())
# print(x.describe())
## 1-1. correlation heat map ##

# sns.set(font_scale = 1)
# sns.set(rc = {'figure.figsize':(12, 8)})
# sns.heatmap(data = medical_noshow.corr(), square = True, annot = True, cbar = True)
# plt.show()

## 1-2. drop useless data ##

x = x.drop(['PatientId', 'AppointmentID','ScheduledDay', 'Hipertension', 'Diabetes', 'Alcoholism'], axis=1)
# print(x.describe())
# print(x.shape)

# print("after outlier removal\n", x.describe())
# print(x.shape, y.shape)

## 1-3. encoding object to int ##

encoder = LabelEncoder()

### char -> number
ob_col = list(x.dtypes[x.dtypes=='object'].index) ### collect the names of the columns whose dtype is object into ob_col
# print(ob_col)

for col in ob_col:
    x[col] = LabelEncoder().fit_transform(x[col].values)
y = LabelEncoder().fit_transform(y.values)

# print(x.describe())
print('y(No-Show column) index 0~8:', y[0:8], '...\n')
print(x.info())

## 1-4. fill na data ##

# print(x.describe())

## 1-5. check dataset ##

# print('head : \n',x.head(7))
# print('y : ',y[0:7]) ### y : np.array

x_train, x_test, y_train, y_test = train_test_split(
    x, y, train_size=0.6, test_size=0.2, random_state=100, shuffle=True
)

scaler = PowerTransformer()
x_train = scaler.fit_transform(x_train)
x_test = scaler.transform(x_test)

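However, the machine learning models never got past roughly 0.79 accuracy and the deep learning model never got past 0.80, so we looked for a new direction for the preprocessing.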
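
When accuracy stalls like this, it can help to compare it against a majority-class baseline. The sketch below assumes a roughly 80/20 split between shows and no-shows; that ratio is an assumption for illustration (check the real one with `y.value_counts()`), and the labels here are synthetic.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

# Synthetic labels with an assumed ~80/20 split (0 = showed up, 1 = no-show).
rng = np.random.default_rng(0)
y_toy = (rng.random(5000) < 0.2).astype(int)
X_toy = np.zeros((len(y_toy), 1))  # features are ignored by the dummy model

# Always predict the most frequent class seen during fit.
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X_toy, y_toy)
baseline_acc = accuracy_score(y_toy, baseline.predict(X_toy))
print(f"majority-class accuracy: {baseline_acc:.3f}")
```

If the real class split is close to this, an accuracy near 0.80 would be about what a model gets by predicting "shows up" for every appointment.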

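Revised preprocessing)

After more searching and discussion within the team, we revised the preprocessing as follows.

Until then we had assumed that dropping unnecessary columns was the way to improve performance. Instead, we added features: 'PreviousApp' (the number of earlier appointments for each patient), 'PreviousNoShow' (the patient's no-show ratio over those appointments), and 'Num_App_Missed' (the cumulative no-show count for each patient).

We also created a 'Day_diff' column holding the difference between the appointment date and the booking date, checked its distinct values with unique(), and removed the columns we judged to have no effect: 'ScheduledDay', 'AppointmentDay', 'PatientId', and 'Neighbourhood'.

The 'Handcap' column was converted into categorical dummy variables, one per handicap level.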

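Finally, we again filtered out extreme Age values; in the code below this is done by keeping only rows with Age between 0 and 100.

See the code block below for the details.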

import numpy as np
import pandas as pd
import time
import warnings

from keras.models import Sequential
from keras.layers import Dense

from sklearn.model_selection import train_test_split, StratifiedKFold
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import MinMaxScaler
from sklearn.covariance import EllipticEnvelope
from sklearn.preprocessing import LabelEncoder
from sklearn.ensemble import VotingClassifier

from xgboost import XGBClassifier
from lightgbm import LGBMClassifier
from catboost import CatBoostClassifier

warnings.filterwarnings('ignore')

# Data preprocessing #

path = '../AI_study/'
df = pd.read_csv(path + 'medical_noshow.csv')
# read the CSV file into a DataFrame

# print(df.columns)
# print(df.head(10))

print('Count of rows', str(df.shape[0]))
print('Count of Columns', str(df.shape[1]))
# print the number of rows and columns

print('Any missing values:', df.isnull().any().any())

for i in df.columns:
    print(i+":",len(df[i].unique()))
# print the number of unique values in each column

df['PatientId'] = df['PatientId'].astype('int64')
df.set_index('AppointmentID', inplace = True)
# cast 'PatientId' to integer and use 'AppointmentID' as the index
df['No-show'] = df['No-show'].map({'No':0, 'Yes':1})
# map 'No-show' values ('No', 'Yes') to 0 and 1
df['Gender'] = df['Gender'].map({'F':0, 'M':1})
# map 'Gender' values ('F', 'M') to 0 and 1

df['PreviousApp'] = df.sort_values(by = ['PatientId','ScheduledDay']).groupby(['PatientId']).cumcount()
# 'PreviousApp': number of earlier appointments for each patient
df['PreviousNoShow'] = (df[df['PreviousApp'] > 0].sort_values(['PatientId', 'ScheduledDay']).groupby(['PatientId'])['No-show'].cumsum() / df[df['PreviousApp'] > 0]['PreviousApp'])
# 'PreviousNoShow': cumulative no-show ratio per patient
# note: this cumsum includes the current row's own 'No-show' value, so the feature carries the target label

df['PreviousNoShow'] = df['PreviousNoShow'].fillna(0)
# fill NaN in 'PreviousNoShow' (each patient's first appointment) with 0

# Number of Appointments Missed by Patient
df['Num_App_Missed'] = df.groupby('PatientId')['No-show'].cumsum()
# 'Num_App_Missed': cumulative no-show count per patient (this also includes the current row)

df['ScheduledDay'] = pd.to_datetime(df['ScheduledDay']).dt.strftime('%Y-%m-%d')
df['ScheduledDay'] = pd.to_datetime(df['ScheduledDay'])
df['AppointmentDay'] = pd.to_datetime(df['AppointmentDay']).dt.strftime('%Y-%m-%d')
df['AppointmentDay'] = pd.to_datetime(df['AppointmentDay'])
# convert 'ScheduledDay' and 'AppointmentDay' to dates with the time of day removed

df['Day_diff'] = (df['AppointmentDay'] - df['ScheduledDay']).dt.days
# 'Day_diff': days between the booking date and the appointment date
print(df['Day_diff'].unique())
# print the unique values of 'Day_diff'

df = df[(df.Age >= 0)]
# keep only rows where 'Age' is 0 or greater

df.drop(['ScheduledDay', 'AppointmentDay', 'PatientId', 'Neighbourhood'], axis=1, inplace=True)
# drop the columns that are no longer needed

#Convert to Categorical
df['Handcap'] = pd.Categorical(df['Handcap'])
#Convert to Dummy Variables
Handicap = pd.get_dummies(df['Handcap'], prefix = 'Handicap')
df = pd.concat([df, Handicap], axis=1)
df.drop(['Handcap'], axis=1, inplace = True)
# convert 'Handcap' to a categorical and expand it into dummy variables

df = df[(df.Age >= 0) & (df.Age <= 100)]
df.info()
# keep only rows where 'Age' is between 0 and 100

x = df.drop(['No-show'], axis=1)
y = df['No-show']

scaler = MinMaxScaler()
x = scaler.fit_transform(x)
# Min-Max scaling: rescale each feature to the range 0 to 1

##### Complete Data Preprocessing #####

Remarkably, with the same MLP model, accuracy went from about 0.80 (before the preprocessing change) to about 0.97 (after the change).
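
A jump this large is worth double-checking for target leakage: in the listing above, the cumsum behind 'PreviousNoShow' and 'Num_App_Missed' includes each row's own 'No-show' value. The following is a minimal sketch, on a made-up toy frame, of one way to build the same kind of history features from strictly earlier appointments only. The column names 'PriorMissed' and 'PriorNoShowRate' are hypothetical, and this is not the code that produced the numbers in this post.

```python
import pandas as pd

# Toy appointment history; column names mirror the post, values are made up.
toy = pd.DataFrame({
    "PatientId":    [1, 1, 1, 2, 2],
    "ScheduledDay": ["2016-01-01", "2016-02-01", "2016-03-01",
                     "2016-01-05", "2016-02-05"],
    "No-show":      [1, 0, 1, 0, 1],
})
toy = toy.sort_values(["PatientId", "ScheduledDay"])

# Number of earlier appointments for the same patient.
toy["PreviousApp"] = toy.groupby("PatientId").cumcount()

# Cumulative no-shows up to and including this row, minus this row's own
# label, leaves only the no-shows from strictly earlier appointments.
toy["PriorMissed"] = toy.groupby("PatientId")["No-show"].cumsum() - toy["No-show"]

# Prior no-show rate; 0 when the patient has no history yet.
toy["PriorNoShowRate"] = (toy["PriorMissed"] / toy["PreviousApp"]).fillna(0.0)

print(toy)
```

With features built this way, the value for an appointment depends only on what was known before that appointment, which is the situation a deployed model would face.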

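#4 Modeling

This section shows the code for the various models our team applied after the revised preprocessing.

Section #4 shows only the models and their results; if you want to try the modeling yourself, apply it together with the data from #3 above.

GridSearchCV - Voting - cat, xgb, lgbm)

Each boosting model was tuned with GridSearchCV, and the resulting best hyperparameters were applied to the voting ensemble.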
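
The search step itself is not shown in this post, so here is a minimal sketch of what a GridSearchCV run looks like. It uses synthetic data and scikit-learn's GradientBoostingClassifier as a stand-in so that it runs without extra packages; the grid values are hypothetical and are not the grids the team used.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold, train_test_split

# Synthetic stand-in data; replace with the preprocessed x, y.
X_toy, y_toy = make_classification(n_samples=400, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, stratify=y_toy, random_state=0
)

# Small hypothetical grid: every combination is tried (2 * 2 * 2 = 8).
param_grid = {
    "n_estimators": [50, 100],
    "max_depth": [2, 3],
    "learning_rate": [0.05, 0.1],
}

cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)
search = GridSearchCV(
    GradientBoostingClassifier(random_state=0),
    param_grid, cv=cv, scoring="accuracy", n_jobs=1,
)
search.fit(X_tr, y_tr)

print(search.best_params_)
print("cv accuracy:", search.best_score_)
print("test accuracy:", search.score(X_te, y_te))
```

The values in `search.best_params_` are what would then be passed to the model constructor, in the same way the listing below passes fixed parameters to each booster.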

#!/usr/bin/env python
# coding: utf-8
import numpy as np
import pandas as pd
from keras.models import Sequential
from keras.layers import Dense
from sklearn.model_selection import train_test_split, StratifiedKFold
from sklearn.preprocessing import MinMaxScaler
from sklearn.metrics import accuracy_score
import time

from sklearn.covariance import EllipticEnvelope
from sklearn.preprocessing import LabelEncoder

import warnings
warnings.filterwarnings('ignore')


######### voting ############
from xgboost import XGBClassifier
from lightgbm import LGBMClassifier
from catboost import CatBoostClassifier
from sklearn.ensemble import VotingClassifier

x_train, x_test, y_train, y_test = train_test_split(
    x, y, train_size=0.6, test_size=0.2, random_state=100, shuffle=True
)

scaler = MinMaxScaler()
x_train = scaler.fit_transform(x_train)
x_test = scaler.transform(x_test)

n_splits = 5
kfold = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=42)

xgb = XGBClassifier(
    colsample_bylevel=0.1, colsample_bynode=0.1, colsample_bytree=0.2,
    gamma=2, learning_rate=0.2, max_depth=3,
    min_child_weight=0.01, n_estimators=200, reg_alpha=0.1,
    reg_lambda=0.1, subsample=0.2
) # best parameters found with GridSearchCV

lgbm = LGBMClassifier(
    feature_fraction=0.2, learning_rate=0.1, max_depth=6,
    min_data_in_leaf=200, n_estimators=300, num_leaves=7,
    reg_alpha=0.1, reg_lambda=0.01
) # best parameters found with GridSearchCV

cat = CatBoostClassifier(
    colsample_bylevel=0.1,
    learning_rate=0.2, depth=3,
    n_estimators=200,
    reg_lambda=0.1, subsample=0.1
) # best parameters found with GridSearchCV

model = VotingClassifier(
    estimators=[('xgb', xgb), ('lgbm', lgbm), ('cat', cat)],
    voting='hard',
    n_jobs=-1,
    verbose=0
)

model.fit(x_train, y_train)

y_voting_predict = model.predict(x_test)
voting_score = accuracy_score(y_test, y_voting_predict)


classifiers = [cat, xgb, lgbm]
for model in classifiers:
    model.fit(x_train, y_train)
    y_predict = model.predict(x_test)
    score = accuracy_score(y_test, y_predict)
    class_name = model.__class__.__name__
    print(class_name, "'s score : ", score)

print('voting result : ', voting_score)
print('GridSearchCV -> voting')


# CatBoostClassifier 's score :  0.9644860658704307
# XGBClassifier 's score :  0.9641693811074918
# [LightGBM] [Warning] feature_fraction is set=0.2, colsample_bytree=1.0 will be ignored. Current value: feature_fraction=0.2
# [LightGBM] [Warning] min_data_in_leaf is set=200, min_child_samples=20 will be ignored. Current value: min_data_in_leaf=200
# LGBMClassifier 's score :  0.9646217879116902
# voting result :  0.9645765472312704
# GridSearchCV -> voting
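
The voting score above sits in between the three individual scores, which is what hard voting tends to give when the models mostly agree. As a minimal sketch of the rule itself, with hypothetical predictions from three models on five samples:

```python
import numpy as np

# Hypothetical 0/1 predictions from three models on five samples.
preds = np.array([
    [0, 1, 1, 0, 1],  # model A
    [0, 1, 0, 0, 1],  # model B
    [1, 1, 0, 0, 0],  # model C
])

# With an odd number of binary voters, the majority label is 1 exactly
# when more than half of the votes are 1.
majority = (preds.sum(axis=0) > preds.shape[0] / 2).astype(int)
print(majority)
```

Each column is decided by at least two of the three models, so a single model's mistake is outvoted, but an error shared by two models still goes through.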

RandomizedSearchCV - cat)

#!/usr/bin/env python
# coding: utf-8
import numpy as np
import pandas as pd
import time
import warnings

from keras.models import Sequential
from keras.layers import Dense

from sklearn.model_selection import train_test_split, StratifiedKFold
from sklearn.metrics import r2_score
from sklearn.covariance import EllipticEnvelope
from sklearn.preprocessing import LabelEncoder

warnings.filterwarnings('ignore')

#### RandomSearch ####
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.2, shuffle=True, random_state=42
)

param = {
    'learning_rate': [0.01, 0.05, 0.7, 0.1],
    'depth': [3, 5, 7, 9, 12],
    'l2_leaf_reg': [1, 3, 6, 9],
    'colsample_bylevel': [0.5, 0.6, 0.7, 0.9],
    'n_estimators': [100, 300, 500, 700, 900],
    'subsample': [0.5, 0.7, 1.0, 1.2],  # subsample is a fraction of rows, so 1.2 lies outside (0, 1]
    'random_strength': [1, 2, 4, 8],
    'bagging_temperature': [0.01, 0.1, 0.5, 1],
    'border_count': [10, 30, 60, 90]
    }

n_splits = 5
random_state = 42
kfold = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=random_state)

from catboost import CatBoostClassifier
from sklearn.model_selection import RandomizedSearchCV
search_model = CatBoostClassifier()
model = RandomizedSearchCV(search_model, param, cv = kfold, verbose = 1, refit = True, n_jobs = -1, n_iter=64, random_state=42)
# n_iter : number of random parameter combinations to sample

start_time = time.time()
model.fit(x_train, y_train)
end_time = time.time() - start_time

print('elapsed time : ', end_time, 'sec')
print('best params : ', model.best_params_)
print('best estimator : ', model.best_estimator_)
print('best_score : ', model.best_score_)
print('model_score : ', model.score(x_test, y_test))
print('RandomSearch -> catboost')

# elapsed time :  1066.063810825348 sec
# best params :  {'subsample': 0.5, 'random_strength': 2, 'n_estimators': 500, 'learning_rate': 0.01, 'l2_leaf_reg': 9, 'depth': 9, 'colsample_bylevel': 0.9, 'border_count': 10, 'bagging_temperature': 1}
# best estimator :  <catboost.core.CatBoostClassifier object at 0x7fb50d4d9220>
# best_score :  0.9724820449018832
# model_score :  0.9724936663047412
# RandomSearch -> catboost

RandomizedSearchCV - lgbm)

#!/usr/bin/env python
# coding: utf-8
import numpy as np
import pandas as pd
import time
import warnings

from keras.models import Sequential
from keras.layers import Dense

from sklearn.model_selection import train_test_split, StratifiedKFold
from sklearn.metrics import r2_score
from sklearn.covariance import EllipticEnvelope
from sklearn.preprocessing import LabelEncoder

warnings.filterwarnings('ignore')

#### RandomSearch ####
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.2, shuffle=True, random_state=42
)

param = {
    'learning_rate': [0.01, 0.1, 0.3, 0.5],
    'max_depth': [3, 7, 12, 16],
    'num_leaves': [16 ,24, 42, 64],
    'reg_alpha': [0.05, 0.1, 0.2, 0.5, 0.7],
    'reg_lambda': [0.5, 0.7, 0.9, 1.2],
    'n_estimators': [100, 300, 500, 700],
    'subsample': [0.3, 0.6, 0.9],
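    'colsample_bytree': [0.3, 0.5, 0.7, 1.0, 1.2],  # alias of feature_fraction below; LightGBM ignores this one when both are set
    'min_child_samples': [30, 45, 60, 90],  # alias of min_data_in_leaf below; LightGBM ignores this one when both are set
    'min_data_in_leaf': [1, 3, 5, 7],
    'feature_fraction': [0.1, 0.3, 0.55, 0.85, 1]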
    }

n_splits = 5
random_state = 42
kfold = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=random_state)

from lightgbm import LGBMClassifier
from sklearn.model_selection import RandomizedSearchCV
search_model = LGBMClassifier()
model = RandomizedSearchCV(search_model, param, cv = kfold, verbose = 1, refit = True, n_jobs = -1, n_iter=64, random_state=42)
# n_iter : number of random parameter combinations to sample

start_time = time.time()
model.fit(x_train, y_train)
end_time = time.time() - start_time

print('elapsed time : ', end_time, 'sec')
print('best params : ', model.best_params_)
print('best estimator : ', model.best_estimator_)
print('best_score : ', model.best_score_)
print('model_score : ', model.score(x_test, y_test))
print('RandomSearch -> lightgbm')

#elapsed time :  116.6949098110199 sec
#best params :  {'subsample': 0.3, 'reg_lambda': 1.2, 'reg_alpha': 0.05, 'num_leaves': 24, 'n_estimators': 700, 'min_data_in_leaf': 1, 'min_child_samples': 90, 'max_depth': 3, 'learning_rate': 0.01, 'feature_fraction': 0.85, 'colsample_bytree': 1.2}
#best estimator :  LGBMClassifier(colsample_bytree=1.2, feature_fraction=0.85, learning_rate=0.01,
#               max_depth=3, min_child_samples=90, min_data_in_leaf=1,
#               n_estimators=700, num_leaves=24, reg_alpha=0.05, reg_lambda=1.2,
#               subsample=0.3)
#best_score :  0.9728552847367528
#model_score :  0.9731722765110388
#RandomSearch -> lightgbm

RandomizedSearchCV - xgb)

#!/usr/bin/env python
# coding: utf-8
import numpy as np
import pandas as pd
import time
import warnings

from keras.models import Sequential
from keras.layers import Dense

from sklearn.model_selection import train_test_split, StratifiedKFold
from sklearn.metrics import r2_score
from sklearn.covariance import EllipticEnvelope
from sklearn.preprocessing import LabelEncoder

warnings.filterwarnings('ignore')

#### RandomSearch ####
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.2, shuffle=True, random_state=42
)
param = {
    'learning_rate': [0.01, 0.05, 0.075, 0.1, 0.15],
    'max_depth': [3, 5, 7, 9, 12, 15],
    'reg_alpha': [0.01, 0.05, 0.1, 0.15, 0.3],
    'reg_lambda': [0.3, 0.5, 0.75, 0.9],
    'n_estimators': [100, 150, 350, 500, 700],
    # note: XGBoost expects subsample and the colsample_* ratios in (0, 1]; the 1.2 entries below are out of range
    'subsample': [0.3, 0.5, 0.7, 0.9, 1.0, 1.2],
    'colsample_bytree': [0.9, 1.0, 1.2],
    'min_child_weight': [1, 3, 5, 7, 9],
    'gamma': [0.01, 0.05, 0.1, 0.15],
    'colsample_bylevel' : [0.3, 0.5, 0.7, 1, 1.2] ,
    'colsample_bynode' : [0.3, 0.5, 0.7, 0.9, 1.2]
    }

n_splits = 5
random_state = 42
kfold = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=random_state)

from xgboost import XGBClassifier
from sklearn.model_selection import RandomizedSearchCV
search_model = XGBClassifier()
model = RandomizedSearchCV(search_model, param, cv = kfold, verbose = 1, refit = True, n_jobs = -1, n_iter=64, random_state=42)
# n_iter : number of random parameter combinations to sample

start_time = time.time()
model.fit(x_train, y_train)
end_time = time.time() - start_time

print('elapsed time : ', end_time, 'sec')
print('best params : ', model.best_params_)
print('best estimator : ', model.best_estimator_)
print('best_score : ', model.best_score_)
print('model_score : ', model.score(x_test, y_test))
print('RandomSearch -> xgboost')

#elapsed time :  227.25962376594543 sec
#best params :  {'subsample': 1.0, 'reg_lambda': 0.9, 'reg_alpha': 0.1, 'n_estimators': 100, 'min_child_weight': 7, 'max_depth': 12, 'learning_rate': 0.05, 'gamma': 0.15, 'colsample_bytree': 1.0, 'colsample_bynode': 0.3, 'colsample_bylevel': 1}
#best estimator :  XGBClassifier(base_score=None, booster=None, callbacks=None,
#              colsample_bylevel=1, colsample_bynode=0.3, colsample_bytree=1.0,
#              early_stopping_rounds=None, enable_categorical=False,
#              eval_metric=None, feature_types=None, gamma=0.15, gpu_id=None,
#              grow_policy=None, importance_type=None,
#              interaction_constraints=None, learning_rate=0.05, max_bin=None,
#              max_cat_threshold=None, max_cat_to_onehot=None,
#              max_delta_step=None, max_depth=12, max_leaves=None,
#              min_child_weight=7, missing=nan, monotone_constraints=None,
#              n_estimators=100, n_jobs=None, num_parallel_tree=None,
#              predictor=None, random_state=None, ...)
#best_score :  0.9729570774189901
#model_score :  0.9731722765110388
#RandomSearch -> xgboost

Optuna - Voting - cat, xgb, lgbm)

#!/usr/bin/env python
# coding: utf-8
import numpy as np
import pandas as pd
import time
import warnings

from keras.models import Sequential
from keras.layers import Dense

from sklearn.model_selection import train_test_split, StratifiedKFold
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import MinMaxScaler
from sklearn.covariance import EllipticEnvelope
from sklearn.preprocessing import LabelEncoder
from sklearn.ensemble import VotingClassifier

from xgboost import XGBClassifier
from lightgbm import LGBMClassifier
from catboost import CatBoostClassifier

warnings.filterwarnings('ignore')

######### voting ############

x_train, x_test, y_train, y_test = train_test_split(
    x, y, train_size=0.6, test_size=0.2, random_state=100, shuffle=True
)

scaler = MinMaxScaler()
x_train = scaler.fit_transform(x_train)
x_test = scaler.transform(x_test)

n_splits = 5
random_state = 42
kfold = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=random_state)

xgb = XGBClassifier(
    n_estimators=3676,
    max_depth=12,
    random_state=1465
) # best parameters found with Optuna

lgbm = LGBMClassifier(
    n_estimators=3855,
    learning_rate=0.44236276056374924,
    random_state=1069
) # best parameters found with Optuna

cat = CatBoostClassifier(
    n_estimators=939,
    depth=14,
    fold_permutation_block=70,
    learning_rate=0.5787982175373018,
    od_pval=0.671062923929216,
    l2_leaf_reg=1.8747109317317627,
    random_state=618
) # best parameters found with Optuna

model = VotingClassifier(
    estimators=[('xgb', xgb), ('lgbm', lgbm), ('cat', cat)],
    voting='hard',
    n_jobs=-1,
    verbose=0
)

model.fit(x_train, y_train)

y_voting_predict = model.predict(x_test)
voting_score = accuracy_score(y_test, y_voting_predict)
print('voting result : ', voting_score)

classifiers = [cat, xgb, lgbm]
for model in classifiers:
    model.fit(x_train, y_train)
    y_predict = model.predict(x_test)
    score = accuracy_score(y_test, y_predict)
    class_name = model.__class__.__name__
    print(class_name, "'s score : ", score)
    
print('Optuna -> voting')

### result ###
# voting result :  0.956795150199059
# CatBoostClassifier 's score :  0.9577904451682954
# XGBClassifier 's score :  0.9560260586319218
# LGBMClassifier 's score :  0.9551212450235251
#Optuna -> voting

MLP - Multi-Layer Perceptron)

#!/usr/bin/env python
# coding: utf-8
import numpy as np
import pandas as pd
from keras.models import Sequential
from keras.layers import Dense, Dropout
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score
import time

from sklearn.covariance import EllipticEnvelope
from sklearn.preprocessing import LabelEncoder

import warnings
warnings.filterwarnings('ignore')

##### train/test split #####
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size = 0.2, shuffle=True, random_state=42
)
print(x_train.shape, y_train.shape)
print(x_test.shape, y_test.shape)

##### model definition #####
# as of ver 4, a linear input layer followed by two relu hidden layers worked best
# ver 5: adjust the node counts and add Dropout to optimize further
model = Sequential()
model.add(Dense(16, input_dim=16, activation='linear'))
model.add(Dropout(0.2))
model.add(Dense(64, activation='relu'))
model.add(Dropout(0.2))
model.add(Dense(64, activation='relu'))
model.add(Dense(1, activation='sigmoid')) 

# compile and train
model.compile(loss='binary_crossentropy', 
              optimizer='adam', 
              metrics=['accuracy'])

## EarlyStopping
from keras.callbacks import EarlyStopping, ModelCheckpoint
earlyStopping = EarlyStopping(monitor='val_loss', patience=32, mode='min',
                              verbose=1, restore_best_weights=True ) # restore_best_weights defaults to False, so set it to True explicitly

# Model Check point
mcp = ModelCheckpoint(
    monitor='val_loss',
    mode='auto',
    verbose=1,
    save_best_only=True,
    filepath='./medical_noshow/mcp/noshow_ver5_layer2_bat32_node64_dropout20_3.hdf5'
    ##########################################################
    # remember to change the mcp file name before training!! #
    ##########################################################
)

batch_size=32
start_time = time.time()
model.fit(x_train, y_train, epochs=500, batch_size=batch_size, 
          validation_split=0.2, 
          callbacks=[earlyStopping, mcp],
          verbose=1)
end_time = time.time() - start_time

loss, acc = model.evaluate(x_test, y_test)

# model.summary()
print('elapsed time : ', end_time)
print('batch_size : ', batch_size)
print('loss : ', loss)
print('acc : ', acc)

# model = Sequential()
# model.add(Dense(16, input_dim=16, activation='linear'))
# model.add(Dropout(0.2))
# model.add(Dense(64, activation='relu'))
# model.add(Dropout(0.2))
# model.add(Dense(64, activation='relu'))
# model.add(Dense(1, activation='sigmoid')) 
# Epoch 00157: val_loss did not improve from 0.05229
# 2211/2211 [==============================] - 10s 4ms/step - loss: 0.0531 - accuracy: 0.9715 - val_loss: 0.0530 - val_accuracy: 0.9723
# Epoch 00157: early stopping
# 691/691 [==============================] - 2s 3ms/step - loss: 0.0511 - accuracy: 0.9736
# elapsed time :  1574.3082158565521
# batch_size :  32
# loss :  0.05109154060482979
# acc :  0.9736247062683105

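#5 Model review and final performance comparison (team summary per model)

1. We learned that cleaning the data during preprocessing had the largest effect on model performance.

2. Training times differed, but accuracy was similar across models, so the choice of model made little difference to accuracy.

That said, performance varied widely with the hyperparameters, so hyperparameter tuning appears to be essential when using machine learning models.

3. Deep learning was relatively simple to optimize and gave the best performance, but it took considerable time, and it was hard to settle on a direction for the model architecture.

4. Optuna automated the parameter adjustments that we had been making by hand with grid search and random search, which was convenient and showed us the value of automating the process algorithmically.

For the full code, see the GitHub repository below!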

https://github.com/NC7-AI-Team-C/Brazilian-Medical-NoShow