[Machine Learning] Printing Variable Importance (feature importance)

heedy 2022. 11. 1. 13:30

Using every variable in the training data can let noisy features degrade model performance.
To improve performance, you need to go through a variable selection step, and one way to do that is to use variable importance.

We'll train a RandomForest on the titanic data and then print the variable importances.
First, load the titanic data.

import pandas as pd

df = pd.read_csv('train.csv')
df.head()

Next, train with RandomForest.

object columns are not used for training.
NaN values are replaced with the mean.

# Exclude object columns
df = df.select_dtypes(exclude='object')

# Check the NaN count per column
df.isna().sum()

# Replace NaN values with each column's own mean
df = df.fillna(df.mean())
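One detail of mean imputation is worth seeing on a small example: when `fillna` is given a single scalar, that value is applied to every column, whereas a Series of column means fills each column with its own mean. A minimal sketch on toy data (the `toy` frame and its values are hypothetical, not taken from the titanic file):

```python
import pandas as pd

# Toy data (hypothetical values): each column has one missing entry.
toy = pd.DataFrame({"Age": [20.0, None, 40.0], "Fare": [10.0, None, 30.0]})

# A scalar fill value is applied to every column,
# so the missing Fare is filled with the mean of Age.
scalar_filled = toy.fillna(toy["Age"].mean())

# A Series of column means fills each column with its own mean.
per_column_filled = toy.fillna(toy.mean())

print(scalar_filled)
print(per_column_filled)
```

If only one numeric column contains NaN, both forms give the same result; they differ as soon as a second column has missing values.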

# Train a RandomForestClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Features are every remaining column; the target is Survived
X = df.drop('Survived', axis=1)
y = df['Survived']

# Hold out 20% for testing, stratified on the target
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y)

rf = RandomForestClassifier(random_state=10)
rf.fit(x_train, y_train)
pred = rf.predict(x_test)
print(accuracy_score(y_test, pred))  # accuracy_score expects (y_true, y_pred)

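The accuracy came out to about 0.7039. Because train_test_split is called without a random_state, the exact value will differ from run to run.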
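A single unseeded train/test split gives a score that moves between runs. One common way to get a steadier estimate is to average over cross-validation folds. The sketch below assumes synthetic data from `make_classification` as a stand-in for the titanic features; `X_demo`, `y_demo`, and `model` are illustrative names only.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the real features (an assumption for illustration).
X_demo, y_demo = make_classification(
    n_samples=300, n_features=6, n_informative=3, random_state=0
)

model = RandomForestClassifier(n_estimators=50, random_state=10)

# 5-fold cross-validation: each fold is held out once and scored on accuracy.
scores = cross_val_score(model, X_demo, y_demo, cv=5, scoring="accuracy")

print(scores)
print(scores.mean(), scores.std())
```

The mean of the fold scores is less sensitive to one lucky or unlucky split than a single accuracy value.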

After training, you can check the feature importances.
The importance scores are returned in the same order as the columns.

rf.feature_importances_
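To make the column-order correspondence concrete, here is a minimal sketch on synthetic data (the column names `f0` to `f3` and the `rf_demo` model are made up for illustration). It pairs each score with its column name and checks that the scores sum to 1.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data with hypothetical column names.
X_arr, y_demo = make_classification(
    n_samples=200, n_features=4, n_informative=2, n_redundant=0, random_state=0
)
X_demo = pd.DataFrame(X_arr, columns=["f0", "f1", "f2", "f3"])

rf_demo = RandomForestClassifier(n_estimators=50, random_state=10).fit(X_demo, y_demo)

importances = rf_demo.feature_importances_

# One score per column, in the same order as X_demo.columns.
for name, score in zip(X_demo.columns, importances):
    print(f"{name}: {score:.3f}")

# The scores are normalized, so together they add up to 1.
print(importances.sum())
```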

For readability, build a DataFrame with X's columns as the index, then sort it by score in descending order.

fi = pd.DataFrame(rf.feature_importances_, index = X.columns, columns = ['score'])
fi.sort_values(by = 'score', ascending = False)

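As a result, Fare, PassengerId, and Age are measured as the important variables. Note that PassengerId is only a row identifier; impurity-based scores such as feature_importances_ tend to favor features with many unique values, so its high score should be read with caution.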
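A ranking like this can be cross-checked with permutation importance, which measures how much the held-out score drops when one column is shuffled. This is a different technique from the `feature_importances_` attribute used in this post, shown only as a sketch: the data is synthetic, and `signal`, `row_id`, and `rf_demo` are hypothetical names. `row_id` is a plain row counter with no relationship to the label.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 600
signal = rng.normal(size=n)

# The label depends only on `signal` plus noise; `row_id` is just a counter.
y_demo = (signal + 0.5 * rng.normal(size=n) > 0).astype(int)
X_demo = pd.DataFrame({"signal": signal, "row_id": np.arange(n)})

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, stratify=y_demo, random_state=0
)
rf_demo = RandomForestClassifier(n_estimators=100, random_state=10).fit(X_tr, y_tr)

# Impurity-based scores, computed from the training data inside the trees.
impurity = pd.Series(rf_demo.feature_importances_, index=X_demo.columns)

# Permutation scores, computed on the held-out split.
perm = permutation_importance(rf_demo, X_te, y_te, n_repeats=10, random_state=0)
permuted = pd.Series(perm.importances_mean, index=X_demo.columns)

print(pd.DataFrame({"impurity": impurity, "permutation": permuted}))
```

If an identifier-like column scores well on impurity but close to zero on permutation importance, it is a reasonable candidate to drop before training.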

We'll drop SibSp and Parch, whose scores are 5% or below, retrain, and measure the accuracy again.

# Retrain without SibSp and Parch
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Drop the target and the two low-importance columns
X = df.drop(['Survived', 'SibSp', 'Parch'], axis=1)
y = df['Survived']

x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y)

rf = RandomForestClassifier(random_state=10)
rf.fit(x_train, y_train)
pred = rf.predict(x_test)
print(accuracy_score(y_test, pred))  # accuracy_score expects (y_true, y_pred)

The accuracy is 0.7039, unchanged.
This is probably because so few variables are used in training. It also means the result is effectively the same without the SibSp and Parch variables.
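The drop-and-retrain comparison can also be done programmatically, picking the columns from a threshold instead of by hand and reusing one seeded split so the two scores are comparable. A minimal sketch on synthetic data; the 0.05 threshold mirrors the 5% cutoff used here, while `fit_and_score`, the `f0` to `f7` column names, and the data itself are assumptions for illustration.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic data with hypothetical column names.
X_arr, y_demo = make_classification(
    n_samples=500, n_features=8, n_informative=3, n_redundant=0, random_state=0
)
X_demo = pd.DataFrame(X_arr, columns=[f"f{i}" for i in range(8)])

# Seeding the split keeps the two runs comparable.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=0
)


def fit_and_score(columns):
    """Train on the given columns and return the model and its test accuracy."""
    model = RandomForestClassifier(random_state=10).fit(X_tr[columns], y_tr)
    return model, accuracy_score(y_te, model.predict(X_te[columns]))


all_cols = list(X_demo.columns)
full_model, full_acc = fit_and_score(all_cols)

# Keep only the columns whose importance is above the threshold.
fi = pd.Series(full_model.feature_importances_, index=all_cols)
threshold = 0.05
kept = fi[fi > threshold].index.tolist()

_, reduced_acc = fit_and_score(kept)

print(kept)
print(full_acc, reduced_acc)
```

With the split fixed, any difference between the two scores comes from the dropped columns rather than from a different partition of the rows.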

When there are many variables, feature selection based on feature importance produces more varied results, so it can be put to good use.
