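11. Processing Categorical Variables

Chapter 5 covered how to process numeric variables. This chapter covers how to transform categorical variables.

A categorical variable, such as gender, takes values that are distinct from one another and have no inherent relationship between them. To be used in model training, categorical variables must be converted to numeric or binary variables.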

11.1 Library Imports and Settings

%reload_ext autoreload
%autoreload 2
%matplotlib inline

import kaggler
from lightgbm import LGBMRegressor
from matplotlib import rcParams, pyplot as plt
import numpy as np
import pandas as pd
from pathlib import Path
from scipy.sparse import hstack
import seaborn as sns
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from warnings import simplefilter

rcParams['figure.figsize'] = (16, 8)
plt.style.use('fivethirtyeight')
simplefilter('ignore')

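11.2 Loading the Training Data

As in Chapter 5, download the data for this exercise from DACON's movie audience prediction competition page and save it in the ../data/movies/ folder. The dataset contains features such as each movie's genre, release date, and running time, and the task is to predict the movie's total audience (box_off_num).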

data_dir = Path('../data/movies/')
trn_file = data_dir / 'movies_train.csv'
seed = 42
target_col = 'box_off_num'

df = pd.read_csv(trn_file, index_col=0)
print(df.shape)
df.head()

(600, 11)

	distributor	genre	release_time	time	screening_rat	director	dir_prev_bfnum	dir_prev_num	num_staff	num_actor	box_off_num
title
개들의 전쟁	롯데엔터테인먼트	액션	2012-11-22	96	청소년 관람불가	조병옥	NaN	0	91	2	23398
내부자들	(주)쇼박스	느와르	2015-11-19	130	청소년 관람불가	우민호	1161602.50	2	387	3	7072501
은밀하게 위대하게	(주)쇼박스	액션	2013-06-05	123	15세 관람가	장철수	220775.25	4	343	4	6959083
나는 공무원이다	(주)NEW	코미디	2012-07-12	101	전체 관람가	구자홍	23894.00	2	20	6	217866
불량남녀	쇼박스(주)미디어플렉스	코미디	2010-11-04	108	15세 관람가	신근호	1.00	1	251	2	483387

Inspecting the data shows four categorical variables: distributor, genre, screening_rat, and director.

11.3 EDA (Exploratory Data Analysis)

df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 600 entries, 개들의 전쟁 to 베를린
Data columns (total 11 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   distributor     600 non-null    object 
 1   genre           600 non-null    object 
 2   release_time    600 non-null    object 
 3   time            600 non-null    int64  
 4   screening_rat   600 non-null    object 
 5   director        600 non-null    object 
 6   dir_prev_bfnum  270 non-null    float64
 7   dir_prev_num    600 non-null    int64  
 8   num_staff       600 non-null    int64  
 9   num_actor       600 non-null    int64  
 10  box_off_num     600 non-null    int64  
dtypes: float64(1), int64(5), object(5)
memory usage: 56.2+ KB

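The info() output above shows the data type of each variable. Including the dependent variable (box_off_num), there are six numeric variables in total, and among them dir_prev_bfnum has many missing values (only 270 of 600 rows are non-null). There are four categorical variables: distributor, genre, screening_rat, and director. Note that release_time is a datetime variable but is currently read as a string (object).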

df['release_time'] = pd.to_datetime(df['release_time'])
print(df['release_time'].dtype)

datetime64[ns]

df.fillna(0, inplace=True)
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 600 entries, 개들의 전쟁 to 베를린
Data columns (total 11 columns):
 #   Column          Non-Null Count  Dtype         
---  ------          --------------  -----         
 0   distributor     600 non-null    object        
 1   genre           600 non-null    object        
 2   release_time    600 non-null    datetime64[ns]
 3   time            600 non-null    int64         
 4   screening_rat   600 non-null    object        
 5   director        600 non-null    object        
 6   dir_prev_bfnum  600 non-null    float64       
 7   dir_prev_num    600 non-null    int64         
 8   num_staff       600 non-null    int64         
 9   num_actor       600 non-null    int64         
 10  box_off_num     600 non-null    int64         
dtypes: datetime64[ns](1), float64(1), int64(5), object(4)
memory usage: 56.2+ KB

We converted release_time to the datetime type and replaced missing values with 0.

num_cols = [x for x in df.columns if df[x].dtype in [np.int64, np.float64] and x != target_col]
cat_cols = ['distributor', 'genre', 'screening_rat', 'director']
print(f'    numeric ({len(num_cols)}):\t{num_cols}')
print(f'categorical ({len(cat_cols)}):\t{cat_cols}')

    numeric (5):	['time', 'dir_prev_bfnum', 'dir_prev_num', 'num_staff', 'num_actor']
categorical (4):	['distributor', 'genre', 'screening_rat', 'director']

11.3.1 EDA of Categorical Variables

print(cat_cols)

['distributor', 'genre', 'screening_rat', 'director']

pd.DataFrame(df['distributor'].value_counts())

	distributor
CJ 엔터테인먼트	54
롯데엔터테인먼트	52
(주)NEW	30
(주)마운틴픽쳐스	29
인디스토리	26
...	...
(주)팝 파트너스	1
스폰지이엔티	1
(주)JK필름	1
디 씨드	1
스튜디오 블루	1

169 rows × 1 columns

The value_counts() function used above is handy for exploring categorical variables: it returns the number of times each category appears in the data. In the example above, CJ 엔터테인먼트 distributed 54 movies. The "169 rows" note at the bottom also shows that there are 169 distinct distributors.
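As a minimal sketch of what value_counts() reports, the toy series below (made-up values, not the movie data) shows the per-category counts, the number of distinct categories, and the relative frequencies obtained with normalize=True.

```python
import pandas as pd

# Toy series standing in for a categorical column such as `genre` (values are made up).
s = pd.Series(['drama', 'comedy', 'drama', 'action', 'drama', 'comedy'])

counts = s.value_counts()                # occurrences per category, sorted in descending order
n_categories = s.nunique()               # number of distinct categories
shares = s.value_counts(normalize=True)  # relative frequencies instead of raw counts

print(counts)
print(n_categories)
print(shares)
```

nunique() gives the same number as the "rows" note under the value_counts() table, without having to display the table.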

pd.DataFrame(df['genre'].value_counts())

	genre
드라마	221
다큐멘터리	93
멜로/로맨스	78
코미디	53
공포	42
액션	28
느와르	27
애니메이션	21
미스터리	17
SF	13
뮤지컬	5
서스펜스	2

genre has 12 categories, of which 드라마 (drama) is the most common with 221 movies.

pd.DataFrame(df['screening_rat'].value_counts())

	screening_rat
청소년 관람불가	204
15세 관람가	202
12세 관람가	102
전체 관람가	92

pd.DataFrame(df['director'].value_counts())

	director
홍상수	7
노진수	4
장률	4
전규환	4
신재호	4
...	...
안상훈	1
한준희	1
이광호	1
김동빈	1
전윤수	1

472 rows × 1 columns

11.4 Processing Numeric and Datetime Variables

Among the numeric independent variables, those that follow a power-law distribution are transformed with np.log1p().

log_cols = ['dir_prev_bfnum', 'dir_prev_num', 'num_staff', 'num_actor']
df[log_cols] = df[log_cols].apply(np.log1p)
df[num_cols].describe()

	time	dir_prev_bfnum	dir_prev_num	num_staff	num_actor
count	600.000000	600.000000	600.000000	600.000000	600.000000
mean	100.863333	5.305796	0.462197	4.026352	1.446130
std	18.097528	6.254561	0.555570	1.789517	0.446256
min	45.000000	0.000000	0.000000	0.000000	0.000000
25%	89.000000	0.000000	0.000000	2.890372	1.098612
50%	100.000000	0.000000	0.000000	4.424829	1.386294
75%	114.000000	12.837611	1.098612	5.579730	1.609438
max	180.000000	16.684279	1.791759	6.768493	3.258097
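The following sketch illustrates why np.log1p() is a convenient choice here: it computes log(1 + x), so it is defined at 0 (which matters after filling missing values with 0), and np.expm1() inverts it exactly. The input values are made up for illustration.

```python
import numpy as np

# Made-up values spanning several orders of magnitude, including 0.
x = np.array([0.0, 1.0, 99.0, 9999.0])

y = np.log1p(x)        # log(1 + x): maps 0 to 0 and compresses large values
x_back = np.expm1(y)   # exp(y) - 1: the inverse transform

print(y)
print(x_back)
```

The same pair of functions is used later in this chapter for the target: the model is fit on np.log1p(box_off_num) and predictions are mapped back with np.expm1().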

df['year'] = df['release_time'].dt.year
df['month'] = df['release_time'].dt.month
df.head()

	distributor	genre	release_time	time	screening_rat	director	dir_prev_bfnum	dir_prev_num	num_staff	num_actor	box_off_num	year	month
title
개들의 전쟁	롯데엔터테인먼트	액션	2012-11-22	96	청소년 관람불가	조병옥	0.000000	0.000000	4.521789	1.098612	23398	2012	11
내부자들	(주)쇼박스	느와르	2015-11-19	130	청소년 관람불가	우민호	13.965312	1.098612	5.961005	1.386294	7072501	2015	11
은밀하게 위대하게	(주)쇼박스	액션	2013-06-05	123	15세 관람가	장철수	12.304905	1.609438	5.840642	1.609438	6959083	2013	6
나는 공무원이다	(주)NEW	코미디	2012-07-12	101	전체 관람가	구자홍	10.081425	1.098612	3.044522	1.945910	217866	2012	7
불량남녀	쇼박스(주)미디어플렉스	코미디	2010-11-04	108	15세 관람가	신근호	0.693147	0.693147	5.529429	1.098612	483387	2010	11

num_cols += ['year', 'month']
print(num_cols)

['time', 'dir_prev_bfnum', 'dir_prev_num', 'num_staff', 'num_actor', 'year', 'month']

features = num_cols + cat_cols
print(features)

['time', 'dir_prev_bfnum', 'dir_prev_num', 'num_staff', 'num_actor', 'year', 'month', 'distributor', 'genre', 'screening_rat', 'director']

11.5 Processing Categorical Variables

rmse = lambda y, p: np.sqrt(mean_squared_error(y, p))
rmsle = lambda y, p: np.sqrt(mean_squared_error(np.log1p(y), np.log1p(p)))

11.5.1 Ordinal Encoding

The first method is ordinal encoding, which converts each category to an integer from 0 to n-1. In general, ordinal encoding is meant for categories that have a natural order, and the conversion preserves that order: for example, the clothing sizes small, medium, and large become 0, 1, and 2. By default, however, scikit-learn's OrdinalEncoder assigns the integers without regard to any such order, since it simply sorts the category values. The label encoding provided by scikit-learn behaves similarly.

from sklearn.preprocessing import OrdinalEncoder
df_cat = df.copy()
oe = OrdinalEncoder()
df_cat[cat_cols] = oe.fit_transform(df[cat_cols])
df_cat[cat_cols].head()

	distributor	genre	screening_rat	director
title
개들의 전쟁	80.0	10.0	3.0	393.0
내부자들	21.0	2.0	3.0	252.0
은밀하게 위대하게	21.0	10.0	1.0	357.0
나는 공무원이다	6.0	11.0	2.0	17.0
불량남녀	96.0	11.0	1.0	218.0

After applying OrdinalEncoder, the categories have been converted to numbers.
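When the categories do have a natural order, OrdinalEncoder accepts it through its categories parameter. The sketch below uses the clothing-size example (a toy column, not part of the movie data) to contrast the default sorted order with an explicit one.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Toy column with a natural order: small < medium < large.
sizes = pd.DataFrame({'size': ['medium', 'small', 'large', 'small']})

# Default: categories are sorted, so strings are coded alphabetically
# (large=0, medium=1, small=2) and the real order is lost.
default_codes = OrdinalEncoder().fit_transform(sizes).ravel()

# Explicit order: one list of categories per column, in the desired order.
ordered = OrdinalEncoder(categories=[['small', 'medium', 'large']])
ordered_codes = ordered.fit_transform(sizes).ravel()

print(default_codes)
print(ordered_codes)
```

For the four categorical variables in this chapter only screening_rat has an arguable order, so the default behavior is used in the exercise.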

trn, tst = train_test_split(df_cat, test_size=.2, random_state=seed)
clf = LGBMRegressor(random_state=seed)
clf.fit(trn[features], np.log1p(trn[target_col]))
p = np.expm1(clf.predict(tst[features]))
print(f' RMSE:\t{rmse(tst[target_col], p):12.2f}')
print(f'RMSLE:\t{rmsle(tst[target_col], p):12.2f}')

 RMSE:	  1412591.02
RMSLE:	        1.98

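11.5.2 Label Encoding with Grouping

The encoding above assigns one number to every category, so when many categories appear only once, each of them still gets its own number. Giving a separate code to a category that appears only once can be inefficient. This section introduces an encoding that groups categories based on their frequency.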

from kaggler.preprocessing import LabelEncoder
df_cat = df.copy()
le = LabelEncoder(min_obs=2)
df_cat[cat_cols] = le.fit_transform(df[cat_cols])
df_cat[cat_cols].head()

Using TensorFlow backend.

	distributor	genre	screening_rat	director
title
개들의 전쟁	2.0	5	0	0.0
내부자들	6.0	6	0	6.0
은밀하게 위대하게	6.0	5	1	0.0
나는 공무원이다	3.0	3	3	0.0
불량남녀	66.0	3	1	0.0

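The LabelEncoder class in the preprocessing module of the kaggler library can group all categories that appear fewer than a given number of times into a single group. Because min_obs is set to 2, every category that appears fewer than 2 times (that is, only once) is converted to the same value.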
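The grouping idea can be sketched in plain pandas. The helper group_rare and the placeholder label '__rare__' below are hypothetical names introduced for illustration; they are not part of kaggler, whose internals and resulting code values may differ.

```python
import pandas as pd

def group_rare(s: pd.Series, min_obs: int = 2, other: str = '__rare__') -> pd.Series:
    """Replace categories seen fewer than `min_obs` times with one placeholder label."""
    counts = s.value_counts()
    rare = counts[counts < min_obs].index
    return s.where(~s.isin(rare), other)

# Toy column: 'C' and 'D' appear only once.
directors = pd.Series(['A', 'B', 'A', 'C', 'D', 'B'])

grouped = group_rare(directors, min_obs=2)
codes = grouped.astype('category').cat.codes  # integer label for each remaining category

print(grouped.tolist())
print(codes.tolist())
```

After grouping, 'C' and 'D' share one code, so the model sees a single "rare director" value instead of two unrelated integers.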

trn, tst = train_test_split(df_cat, test_size=.2, random_state=seed)
clf = LGBMRegressor(random_state=seed)
clf.fit(trn[features], np.log1p(trn[target_col]))
p = np.expm1(clf.predict(tst[features]))
print(f' RMSE:\t{rmse(tst[target_col], p):12.2f}')
print(f'RMSLE:\t{rmsle(tst[target_col], p):12.2f}')

 RMSE:	  1182500.97
RMSLE:	        1.92

With label encoding with grouping, both evaluation metrics fell compared with ordinal encoding: RMSE from 1,412,591.02 to 1,182,500.97 and RMSLE from 1.98 to 1.92. For both RMSE and RMSLE, lower values indicate better model performance.

11.5.3 One-Hot Encoding
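The two metrics can disagree, because RMSE is dominated by errors on large targets while RMSLE measures relative error. The sketch below re-implements both with NumPy only (the chapter's own helpers use mean_squared_error) and applies them to made-up audience numbers.

```python
import numpy as np

def rmse(y, p):
    """Root mean squared error on the original scale."""
    y, p = np.asarray(y, dtype=float), np.asarray(p, dtype=float)
    return float(np.sqrt(np.mean((y - p) ** 2)))

def rmsle(y, p):
    """Root mean squared error after log1p, i.e. roughly a relative error."""
    return rmse(np.log1p(y), np.log1p(p))

# Made-up targets: one small movie and one blockbuster.
y_true = np.array([100.0, 1_000_000.0])
pred_a = np.array([100.0, 900_000.0])      # 10% miss on the blockbuster
pred_b = np.array([1_000.0, 1_000_000.0])  # 10x miss on the small movie

print(rmse(y_true, pred_a), rmsle(y_true, pred_a))
print(rmse(y_true, pred_b), rmsle(y_true, pred_b))
```

RMSE ranks pred_b as better, while RMSLE ranks pred_a as better, which is why the chapter reports both for a target as skewed as box_off_num.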

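Ordinal/label encoding works well with tree-based models, but may not work well with linear regression, logistic regression, or deep learning (neural network) models. These algorithms are sensitive to the magnitude of numbers, and the values produced by ordinal/label encoding impose an arbitrary ordering that did not exist in the original data, which can hurt the model. For example, 조병옥 and 우민호 are different directors, but if they are encoded as 1 and 3, the numbers imply that three 조병옥s add up to one 우민호. Linear regression, logistic regression, and deep learning models try to reflect such a relationship, so training may not go well.

In such cases, one-hot encoding can be used instead of ordinal encoding. One-hot encoding represents the presence or absence of each category as a binary variable. It is available as the OneHotEncoder class in the sklearn.preprocessing module.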

from sklearn.preprocessing import OneHotEncoder
ohe = OneHotEncoder()
X = hstack((df[num_cols],
            ohe.fit_transform(df[cat_cols])))
print(X.shape)

(600, 664)

After one-hot encoding, the number of columns grows by the number of categories, which is why there are now 664 columns.
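A small sketch may make both points concrete. The first part one-hot encodes a toy rating column with pd.get_dummies (used here only for readability; the chapter itself uses OneHotEncoder). The second part checks the 664 figure from the category counts reported in the EDA section: 169 distributors, 12 genres, 4 ratings, and 472 directors, plus the 7 numeric columns.

```python
import pandas as pd

# Toy column with three categories (labels are made up).
toy = pd.DataFrame({'screening_rat': ['12', '15', 'all', '15']})
onehot = pd.get_dummies(toy['screening_rat'], dtype=int)  # one binary column per category
print(onehot)

# Column count for the movie data, using the counts shown in section 11.3.1.
n_numeric = 7
cardinalities = {'distributor': 169, 'genre': 12, 'screening_rat': 4, 'director': 472}
n_columns = n_numeric + sum(cardinalities.values())
print(n_columns)
```

Each row of the one-hot matrix contains exactly one 1, so no artificial ordering between categories is introduced.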

X_trn, X_tst, y_trn, y_tst = train_test_split(X, df[target_col], test_size=.2, random_state=seed)
clf = LGBMRegressor(random_state=seed)
clf.fit(X_trn, np.log1p(y_trn))
p = np.expm1(clf.predict(X_tst))
print(f' RMSE:\t{rmse(y_tst, p):12.2f}')
print(f'RMSLE:\t{rmsle(y_tst, p):12.2f}')

 RMSE:	  1239343.52
RMSLE:	        1.99

11.5.4 One-Hot Encoding with Grouping

Before one-hot encoding, too, categories that appear fewer than a given number of times can first be grouped into one. With the kaggler library, you can apply one-hot encoding after grouping.

from kaggler.preprocessing import OneHotEncoder
ohe = OneHotEncoder(min_obs=2)
X = hstack((df[num_cols],
            ohe.fit_transform(df[cat_cols])))
print(X.shape)

(600, 187)

Because min_obs is set to 2, all categories with a frequency below 2 are grouped into a single category. Without grouping, the encoded data had 664 columns; with grouping, it has only 187.

X_trn, X_tst, y_trn, y_tst = train_test_split(X, df[target_col], test_size=.2, random_state=seed)
clf = LGBMRegressor(random_state=seed)
clf.fit(X_trn, np.log1p(y_trn))
p = np.expm1(clf.predict(X_tst))
print(f' RMSE:\t{rmse(y_tst, p):12.2f}')
print(f'RMSLE:\t{rmsle(y_tst, p):12.2f}')

 RMSE:	  1183708.39
RMSLE:	        1.96

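RMSE and RMSLE also improved compared with one-hot encoding without grouping (RMSE from 1,239,343.52 to 1,183,708.39, RMSLE from 1.99 to 1.96).

11.5.5 Target Encoding without Cross-Validation

Target encoding converts each category to the mean of the dependent variable for that category. For example, if the mean audience of all 12세 관람가 (ages 12 and over) movies is 840,000, then 12세 관람가 is converted to 840,000.

Because target encoding uses information about the dependent variable, it is prone to overfitting. It should therefore be combined with cross-validation and smoothing to reduce overfitting. The TargetEncoder class in the kaggler library supports both.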

from kaggler.preprocessing import TargetEncoder
trn, tst = train_test_split(df, test_size=.2, random_state=seed)
te = TargetEncoder(cv=None)
trn[cat_cols] = te.fit_transform(trn[cat_cols], trn[target_col])
tst[cat_cols] = te.transform(tst[cat_cols])
trn[cat_cols].head()

	distributor	genre	screening_rat	director
title
돼지의 왕	721746.077154	1.998008e+05	3.511182e+05	723283.709843
청춘그루브	723430.057925	6.093409e+05	1.289088e+06	723430.057925
행복한 울릉인	722860.857509	7.472685e+04	1.529000e+05	723430.082604
옥희의 영화	718805.251941	6.093409e+05	3.511182e+05	711379.735399
권법형사 : 차이나타운	90921.244145	2.212935e+06	3.511182e+05	723429.982531

clf = LGBMRegressor(random_state=seed)
clf.fit(trn[features], np.log1p(trn[target_col]))
p = np.expm1(clf.predict(tst[features]))
print(f' RMSE:\t{rmse(tst[target_col], p):12.2f}')
print(f'RMSLE:\t{rmsle(tst[target_col], p):12.2f}')

 RMSE:	  1271836.20
RMSLE:	        3.40
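To illustrate what smoothing does, the sketch below implements one common form of smoothed target encoding, in which each category mean is pulled toward the global mean in proportion to how few observations the category has. The function name smoothed_target_encode and the exact formula are assumptions made for illustration; kaggler's TargetEncoder may use a different formula.

```python
import pandas as pd

def smoothed_target_encode(cat: pd.Series, y: pd.Series, smoothing: float = 1.0):
    """Encode each category as a count-weighted blend of its mean and the global mean."""
    prior = y.mean()
    stats = y.groupby(cat).agg(['mean', 'count'])
    mapping = (stats['mean'] * stats['count'] + prior * smoothing) / (stats['count'] + smoothing)
    return cat.map(mapping), mapping

# Toy data: 'a' has three observations, 'b' only one.
cat = pd.Series(['a', 'a', 'a', 'b'])
y = pd.Series([10.0, 20.0, 30.0, 100.0])

encoded, mapping = smoothed_target_encode(cat, y, smoothing=1.0)
print(mapping)
```

With a global mean of 40, category 'a' moves from its raw mean of 20 to 25, while the single-observation category 'b' moves from 100 to 70, so rare categories are shrunk more strongly.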

11.5.6 Target Encoding with Cross-Validation

trn, tst = train_test_split(df, test_size=.2, random_state=seed)
te = TargetEncoder()
trn[cat_cols] = te.fit_transform(trn[cat_cols], trn[target_col])
tst[cat_cols] = te.transform(tst[cat_cols])
trn[cat_cols].head()

	distributor	genre	screening_rat	director
title
돼지의 왕	723277.447850	267982.096480	348028.101562	723432.730281
청춘그루브	742425.489583	742425.489583	742425.489583	742425.489583
행복한 울릉인	768901.203125	10425.029522	122162.823956	768901.203125
옥희의 영화	722877.805490	813347.509259	380361.663366	721858.927057
권법형사 : 차이나타운	722864.097814	727096.433397	115692.865474	597129.065104

clf = LGBMRegressor(random_state=seed)
clf.fit(trn[features], np.log1p(trn[target_col]))
p = np.expm1(clf.predict(tst[features]))
print(f' RMSE:\t{rmse(tst[target_col], p):12.2f}')
print(f'RMSLE:\t{rmsle(tst[target_col], p):12.2f}')

 RMSE:	  1529036.45
RMSLE:	        2.12

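The dataset used in this exercise is small, with only 600 samples, so applying cross-validation did not clearly reduce overfitting: RMSLE improved from 3.40 to 2.12, but RMSE worsened from 1,271,836.20 to 1,529,036.45.

11.5.7 Frequency Encoding

Frequency encoding converts each category to the number of times it appears. For example, if 12세 관람가 appears 102 times, the category is converted to 102. Frequency encoding has an effect similar to ordinal/label encoding with grouping, because every category that appears once is converted to 1, which is equivalent to placing them all in one group. It is available as the FrequencyEncoder class in the kaggler library.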
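The cross-validated variant can be sketched as out-of-fold encoding: each row is encoded with category means computed only from the other folds, so a row's own target never leaks into its feature. The function oof_target_encode below is a hypothetical helper written for illustration (without smoothing); it is not kaggler's implementation.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold

def oof_target_encode(cat: pd.Series, y: pd.Series, n_splits: int = 5, seed: int = 42) -> pd.Series:
    """Encode each row with category means estimated on the other folds only."""
    encoded = pd.Series(np.nan, index=cat.index, dtype=float)
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    for fit_idx, enc_idx in kf.split(cat):
        fold_means = y.iloc[fit_idx].groupby(cat.iloc[fit_idx]).mean()
        prior = y.iloc[fit_idx].mean()  # fallback for categories unseen in the fitting folds
        encoded.iloc[enc_idx] = cat.iloc[enc_idx].map(fold_means).fillna(prior).to_numpy()
    return encoded

# Toy data: category 'z' appears once and has an extreme target value.
cat = pd.Series(['a'] * 9 + ['z'])
y = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 1000.0])

encoded = oof_target_encode(cat, y)
print(encoded)
```

The row with category 'z' is encoded with the mean of the other folds rather than its own target of 1000, which is exactly the leakage that plain target encoding would introduce.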

from kaggler.preprocessing import FrequencyEncoder
df_cat = df.copy()
fe = FrequencyEncoder()
df_cat[cat_cols] = fe.fit_transform(df[cat_cols])
df_cat[cat_cols].head()

	distributor	genre	screening_rat	director
title
개들의 전쟁	52	28	204	1
내부자들	26	27	204	4
은밀하게 위대하게	26	28	202	1
나는 공무원이다	30	53	92	1
불량남녀	2	53	202	1

trn, tst = train_test_split(df_cat, test_size=.2, random_state=seed)
clf = LGBMRegressor(random_state=seed)
clf.fit(trn[features], np.log1p(trn[target_col]))
p = np.expm1(clf.predict(tst[features]))
print(f' RMSE:\t{rmse(tst[target_col], p):12.2f}')
print(f'RMSLE:\t{rmsle(tst[target_col], p):12.2f}')

 RMSE:	  1258846.99
RMSLE:	        1.96

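The results show that this simple method performs reasonably well.

11.5.8 Hash Encoding

Hash encoding converts each category to the value obtained by applying a hash function to it. Because it relies on a hash function, no fit() is needed, so it is fast and uses little memory. This makes it efficient when there are many categories or the data is large. It is available as the FeatureHasher class in the sklearn.feature_extraction module.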
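Frequency encoding is simple enough to sketch in one line of pandas: map each value to its count. The toy series below is made up, and kaggler's FrequencyEncoder may differ in details such as how unseen categories are handled.

```python
import pandas as pd

# Toy rating column (labels are made up).
s = pd.Series(['12', '15', '15', 'all', '15', '12'])

# Replace each category with the number of times it appears.
freq = s.map(s.value_counts())

print(freq.tolist())
```

Categories with the same count become indistinguishable after encoding, which is the grouping effect described above.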

from sklearn.feature_extraction import FeatureHasher
fh = FeatureHasher(n_features=128, input_type='string')
X = hstack([df[num_cols]] + [fh.fit_transform(df[col]) for col in cat_cols])
print(X.shape)

(600, 519)

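The n_features parameter of FeatureHasher sets the number of output columns, that is, the size of the hash space. If it is too small, different categories can collide on the same hash value, so it should be set large enough. We use 128 here because the exercise data has few categories, but 2^10 = 1024 is a common choice.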
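The sketch below shows FeatureHasher on a toy genre column. With input_type='string', each sample is expected to be an iterable of string tokens, so each category value is wrapped in a one-element list here; depending on the scikit-learn version, a bare string sample may be treated as a sequence of characters or rejected. The small n_features=8 is chosen only to keep the output readable.

```python
from sklearn.feature_extraction import FeatureHasher

values = ['드라마', '코미디', '드라마', '액션']
samples = [[v] for v in values]  # one token per sample

fh = FeatureHasher(n_features=8, input_type='string')
X = fh.transform(samples)        # stateless: the hasher has nothing to learn, so no fit is needed
dense = X.toarray()

print(X.shape)
print(dense)
```

Each row has a single non-zero entry (+1 or -1, because alternate_sign is enabled by default), and identical categories always land in the same column. Two different categories landing in the same column is the collision that a larger n_features makes less likely.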

X_trn, X_tst, y_trn, y_tst = train_test_split(X, df[target_col], test_size=.2, random_state=seed)
clf = LGBMRegressor(random_state=seed)
clf.fit(X_trn, np.log1p(y_trn))
p = np.expm1(clf.predict(X_tst))
print(f' RMSE:\t{rmse(y_tst, p):12.2f}')
print(f'RMSLE:\t{rmsle(y_tst, p):12.2f}')

 RMSE:	  1261559.18
RMSLE:	        1.95

Dankook University 2020, 캐글 뽀개기 (Cracking Kaggle)