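- SMOTETomek?
Combination of over- and under-sampling methods
- Combines SMOTE with Tomek links.
- After oversampling with SMOTE, it removes majority-class samples that lie on the class boundary, i.e. samples that form a Tomek link (a pair of mutually nearest neighbors from different classes).
- This sharpens the decision boundary so that the classes are easier to separate.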
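The two ideas above can be sketched in plain Python on a toy dataset. This is only an illustration under stated assumptions, not imblearn's implementation: the helper names (`nearest_neighbor`, `smote_point`, `tomek_links`), the toy points, and the fixed interpolation factor `lam` are all made up for this example. SMOTE itself draws the factor at random and picks among several nearest neighbors.

```python
import math

def nearest_neighbor(i, points):
    """Return the index of the point closest to points[i], excluding i itself."""
    others = (j for j in range(len(points)) if j != i)
    return min(others, key=lambda j: math.dist(points[i], points[j]))

def smote_point(x, neighbor, lam):
    """Interpolate between a minority sample and a same-class neighbor.

    lam is fixed by the caller here for reproducibility (an assumption of
    this sketch); SMOTE draws it at random from [0, 1].
    """
    return tuple(a + lam * (b - a) for a, b in zip(x, neighbor))

def tomek_links(points, labels):
    """Pairs (i, j), i < j, that are mutual nearest neighbors with different labels."""
    nn = [nearest_neighbor(i, points) for i in range(len(points))]
    return [(i, j) for i, j in enumerate(nn)
            if i < j and nn[j] == i and labels[i] != labels[j]]

# Toy data: class 0 is the minority (2 samples), class 1 the majority (4 samples).
points = [(0.0, 0.0), (1.0, 0.0), (1.2, 0.0), (3.0, 0.0), (3.5, 0.0), (3.0, 1.0)]
labels = [0, 0, 1, 1, 1, 1]

# Step 1 (oversampling): add one synthetic minority sample between two minority points.
synthetic = smote_point(points[0], points[1], lam=0.25)
points.append(synthetic)
labels.append(0)

# Step 2 (cleaning): find Tomek links and drop only their majority-class member.
links = tomek_links(points, labels)
majority = 1
drop = {i for pair in links for i in pair if labels[i] == majority}
kept_labels = [lab for i, lab in enumerate(labels) if i not in drop]

print(synthetic, links, kept_labels)
```

Here the minority point at (1.0, 0.0) and the majority point at (1.2, 0.0) are each other's nearest neighbor, so they form a Tomek link, and the majority point is the one removed.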
- Import
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import make_classification
from imblearn.under_sampling import TomekLinks
from imblearn.combine import SMOTETomek
- Data generation
X, y = make_classification(n_samples=5000, n_features=2, n_informative=2,
                           n_redundant=0, n_repeated=0, n_classes=2,
                           n_clusters_per_class=1,
                           weights=[0.35, 0.65],
                           class_sep=0.8, random_state=0)
print(X.shape , y.shape)
(5000, 2) (5000,)
Convert to a DataFrame
df = pd.DataFrame(X, columns = ['f1', 'f2'])
df['target'] = y
print(df.target.value_counts())
df.head()
>>>
1 3244
0 1756
f1 f2 target
0 -0.467337 0.381432 0
1 0.682590 0.099249 1
2 0.743504 0.700580 1
3 0.043891 1.103027 0
4 0.869732 0.988004 1
Distribution of the generated data
plt.figure(figsize = (12, 8))
sns.scatterplot(data = df, x = 'f1', y = 'f2', hue = 'target')
- SMOTETomek
# SMOTE oversamples first; TomekLinks then cleans only the class it resolves as 'majority'
smt = SMOTETomek(tomek = TomekLinks(sampling_strategy = 'majority'), random_state = 10)
X_smttm, y_smttm = smt.fit_resample(X, y)
df1 = pd.DataFrame(X_smttm, columns = ['smttm1', 'smttm2'])
df1['y_smttm'] = y_smttm
print(df1.y_smttm.value_counts())
plt.figure(figsize = (12, 8))
sns.scatterplot(data = df1, x = 'smttm1', y = 'smttm2', hue = 'y_smttm')
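1 3244
0 3223
After resampling, class 0 grows from 1756 to 3223 samples while class 1 stays at 3244, so the two classes are nearly balanced. The samples removed by the Tomek-link step come from class 0, apparently because the classes are tied at 3244 after SMOTE and 'majority' is resolved on the resampled data rather than on the original class sizes.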
- Undersampling reference: https://heejins.tistory.com/31
- Oversampling reference: https://heejins.tistory.com/28
- SMOTETomek notebook ipynb file: https://github.com/heejvely/python-machine_learning-practice/blob/main/SMOTTomek.ipynb