
[Machine Learning] SMOTETomek

heedy 2022. 11. 23. 14:47

- SMOTETomek?

A combination of over- and under-sampling methods

  • Combines SMOTE with Tomek links.
  • After oversampling with SMOTE, majority-class samples lying on the class boundary are removed.
  • This sharpens the decision boundary so that the classes are easier to separate.
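To make the two steps concrete, here is a minimal pure-Python sketch on made-up toy points. It is not how imblearn implements either step: `smote_point`, `nearest_neighbor`, and `tomek_links` are hypothetical helper names, the data is invented, and real SMOTE picks among k nearest minority neighbors rather than a fixed pair. A Tomek link is taken here to be a pair of samples from different classes that are each other's nearest neighbor.

```python
import math
import random
from collections import Counter


def smote_point(x, neighbor, rng):
    """Synthesize one sample on the segment between x and a same-class neighbor."""
    gap = rng.random()  # uniform in [0, 1)
    return tuple(a + gap * (b - a) for a, b in zip(x, neighbor))


def nearest_neighbor(i, points):
    """Index of the point closest to points[i], excluding i itself."""
    return min((j for j in range(len(points)) if j != i),
               key=lambda j: math.dist(points[i], points[j]))


def tomek_links(points, labels):
    """Pairs (i, j), i < j, that are mutual nearest neighbors with different labels."""
    nn = [nearest_neighbor(i, points) for i in range(len(points))]
    return [(i, j) for i, j in enumerate(nn)
            if i < j and nn[j] == i and labels[i] != labels[j]]


# Toy data (invented for illustration): class 0 is the minority, class 1 the majority.
points = [(0.0, 0.0), (0.2, 0.0), (1.0, 0.0),
          (1.1, 0.0), (2.0, 0.0), (2.3, 0.0), (3.0, 0.0), (3.5, 0.0)]
labels = [0, 0, 0, 1, 1, 1, 1, 1]

# Step 1 (SMOTE idea): add a synthetic minority sample between two class-0 points.
rng = random.Random(0)
new_point = smote_point(points[0], points[1], rng)
points.append(new_point)
labels.append(0)

# Step 2 (Tomek links): find cross-class mutual nearest neighbors and
# drop the member that belongs to the majority class.
majority = Counter(labels).most_common(1)[0][0]
links = tomek_links(points, labels)
to_drop = {i for pair in links for i in pair if labels[i] == majority}
kept_labels = [lab for i, lab in enumerate(labels) if i not in to_drop]

print("Tomek links:", links)
print("class counts after cleaning:", Counter(kept_labels))
```

Only the boundary pair (the class-0 point at x = 1.0 and the class-1 point at x = 1.1) forms a link, so only that one majority sample is removed.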

- Import

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import make_classification
from imblearn.under_sampling import TomekLinks
from imblearn.combine import SMOTETomek

- Generating the data

X, y = make_classification(n_samples=5000, n_features=2, n_informative=2,
                           n_redundant=0, n_repeated=0, n_classes=2,
                           n_clusters_per_class=1,
                           weights=[0.35, 0.65],
                           class_sep=0.8, random_state=0)
print(X.shape, y.shape)
>>>
(5000, 2) (5000,)

Convert to a DataFrame

df = pd.DataFrame(X, columns = ['f1', 'f2'])
df['target'] = y
print(df.target.value_counts())
df.head()
>>>
1    3244
0    1756

         f1        f2  target
0 -0.467337  0.381432       0
1  0.682590  0.099249       1
2  0.743504  0.700580       1
3  0.043891  1.103027       0
4  0.869732  0.988004       1
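The class counts printed above are close to, but not exactly, the requested weights=[0.35, 0.65]. A quick check of the arithmetic, using only the counts shown in the output (the variable names are just for illustration):

```python
# Class counts copied from the value_counts() output above.
counts = {1: 3244, 0: 1756}

total = sum(counts.values())
share = {label: n / total for label, n in counts.items()}
ratio = counts[1] / counts[0]  # majority-to-minority ratio

for label in sorted(share):
    print(f"class {label}: {counts[label]} samples, {share[label]:.4f} of the data")
print(f"imbalance ratio (class 1 / class 0): {ratio:.2f}")
```

The small deviation from 0.35 / 0.65 is most likely due to make_classification's flip_y parameter, which defaults to 0.01 and randomly reassigns a small fraction of labels.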

Distribution of the generated data

plt.figure(figsize = (12, 8))
sns.scatterplot(data = df, x = 'f1', y = 'f2', hue = 'target')


- SMOTETomek

smt = SMOTETomek(tomek = TomekLinks(sampling_strategy = 'majority'), random_state = 10)
X_smttm, y_smttm = smt.fit_resample(X, y)
df1 = pd.DataFrame(X_smttm, columns = ['smttm1', 'smttm2'])
df1['y_smttm'] = y_smttm
print(df1.y_smttm.value_counts())
>>>
1    3244
0    3223

plt.figure(figsize = (12, 8))
sns.scatterplot(data = df1, x = 'smttm1', y = 'smttm2', hue = 'y_smttm')

- Under-sampling reference: https://heejins.tistory.com/31

 


- Over-sampling reference: https://heejins.tistory.com/28

 


- SMOTETomek notebook ipynb file: https://github.com/heejvely/python-machine_learning-practice/blob/main/SMOTTomek.ipynb

 
