features_selection

The goal of this notebook is to compare the feature sets selected by three methods:

- backward elimination
- forward selection
- stepwise selection

Code

from statsmodels_utils import backward_elimination, stepwise_selection, forward_selection, calculate_and_drop_vif
import statsmodels.api as sm
import pandas as pd

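# All three selection methods fit the same model class (logistic regression),
# so any difference in the selected features comes from the search strategy.
be_model = sm.Logit
ss_model = sm.Logit
fs_model = sm.Logit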

Code

df_lung = pd.read_csv('data/lung_cancer.csv')
target_label = 'lung_cancer_risk'
X = df_lung.drop([target_label], axis=1)
y = df_lung[target_label]

Code

%%time
X_clean, dropped_features = calculate_and_drop_vif(X)
print(dropped_features)

Starting VIF Multicollinearity Check...

----------------------------------------
Dropped: 'pack_years' (VIF: 35.16)
Dropped: 'smoker' (VIF: 10.29)
Dropped: 'fev1_x10' (VIF: 6.35)
----------------------------------------
VIF Check Complete!
Features dropped: 3
['pack_years', 'smoker', 'fev1_x10']
CPU times: user 6.86 s, sys: 257 ms, total: 7.12 s
Wall time: 524 ms
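
The log above suggests an iterative procedure: compute the VIF of every feature, drop the worst one, and repeat until nothing exceeds a cutoff. The source of `calculate_and_drop_vif` is not shown in this notebook, so the following is only a minimal sketch of that idea and not the helper's actual implementation. The function name `drop_high_vif`, the cutoff of 5.0, and the synthetic data are assumptions made for illustration. It relies on the identity that the VIF of feature j equals the j-th diagonal entry of the inverse correlation matrix, which breaks down for constant or perfectly collinear columns.

```python
import numpy as np
import pandas as pd


def drop_high_vif(X: pd.DataFrame, threshold: float = 5.0):
    """Hypothetical helper: drop the highest-VIF column until all VIFs <= threshold.

    The threshold of 5.0 is an assumption, not a value taken from the notebook.
    """
    X = X.copy()
    dropped = []
    while X.shape[1] > 1:
        # VIF_j is the j-th diagonal element of the inverse correlation matrix.
        corr = np.corrcoef(X.to_numpy(dtype=float), rowvar=False)
        vif = pd.Series(np.diag(np.linalg.inv(corr)), index=X.columns)
        worst = vif.idxmax()
        if vif[worst] <= threshold:
            break
        dropped.append(worst)
        X = X.drop(columns=worst)
    return X, dropped


# Synthetic demo: "a_plus_noise" is almost a copy of "a", "b" is independent.
rng = np.random.default_rng(0)
a = rng.normal(size=500)
b = rng.normal(size=500)
demo = pd.DataFrame({"a": a, "b": b, "a_plus_noise": a + 0.05 * rng.normal(size=500)})

X_demo, dropped_demo = drop_high_vif(demo)
print(dropped_demo)            # one of the two near-duplicate columns
print(list(X_demo.columns))
```

Recomputing the VIFs after every drop matters because removing one collinear feature usually lowers the VIF of its partners, which is why only one column of a near-duplicate pair is removed.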

Code

%%time
final_be_model, final_be_features = backward_elimination(be_model, X_clean, y, disp=False)
print(final_be_features)

Starting Backward Elimination using Logit...
----------------------------------------
Dropped: 'alcohol_units_per_week' (p-value: 0.9265)
Dropped: 'asthma' (p-value: 0.9053)
Dropped: 'previous_tb' (p-value: 0.8015)
Dropped: 'fatigue' (p-value: 0.7437)
Dropped: 'income_level' (p-value: 0.6612)
Dropped: 'education_years' (p-value: 0.6296)
Dropped: 'bmi' (p-value: 0.4011)
Dropped: 'shortness_of_breath' (p-value: 0.1578)
Dropped: 'gender' (p-value: 0.0561)
----------------------------------------
Backward Elimination Complete!
Remaining features: 17
['age', 'smoking_years', 'cigarettes_per_day', 'passive_smoking', 'air_pollution_index', 'occupational_exposure', 'radon_exposure', 'family_history_cancer', 'copd', 'chronic_cough', 'chest_pain', 'oxygen_saturation', 'crp_level', 'xray_abnormal', 'exercise_hours_per_week', 'diet_quality', 'healthcare_access']
CPU times: user 1.09 s, sys: 3.78 ms, total: 1.1 s
Wall time: 103 ms

Code

%%time
final_ss_model, final_ss_features = stepwise_selection(ss_model, X_clean, y, disp=False)
print(final_ss_features)

Starting Stepwise Selection using Logit...

----------------------------------------
Added:   'xray_abnormal' (p-value: 0.0000)
Added:   'cigarettes_per_day' (p-value: 0.0000)
Added:   'crp_level' (p-value: 0.0000)
Added:   'smoking_years' (p-value: 0.0000)
Added:   'age' (p-value: 0.0000)
Added:   'copd' (p-value: 0.0000)
Added:   'air_pollution_index' (p-value: 0.0000)
Added:   'chronic_cough' (p-value: 0.0000)
Added:   'family_history_cancer' (p-value: 0.0000)
Added:   'occupational_exposure' (p-value: 0.0000)
Added:   'exercise_hours_per_week' (p-value: 0.0000)
Added:   'passive_smoking' (p-value: 0.0000)
Added:   'oxygen_saturation' (p-value: 0.0000)
Added:   'healthcare_access' (p-value: 0.0000)
Added:   'diet_quality' (p-value: 0.0000)
Added:   'radon_exposure' (p-value: 0.0000)
Added:   'chest_pain' (p-value: 0.0397)
----------------------------------------
Stepwise Selection Complete!
Final number of features: 17
['xray_abnormal', 'cigarettes_per_day', 'crp_level', 'smoking_years', 'age', 'copd', 'air_pollution_index', 'chronic_cough', 'family_history_cancer', 'occupational_exposure', 'exercise_hours_per_week', 'passive_smoking', 'oxygen_saturation', 'healthcare_access', 'diet_quality', 'radon_exposure', 'chest_pain']
CPU times: user 6.27 s, sys: 2.89 ms, total: 6.27 s
Wall time: 1.77 s

Code

%%time
final_fs_model, final_fs_features = forward_selection(fs_model, X_clean, y, disp=False)
print(final_fs_features)

Starting Forward Selection using Logit...

----------------------------------------
Added: 'xray_abnormal' (p-value: 0.0000)
Added: 'cigarettes_per_day' (p-value: 0.0000)
Added: 'crp_level' (p-value: 0.0000)
Added: 'smoking_years' (p-value: 0.0000)
Added: 'age' (p-value: 0.0000)
Added: 'copd' (p-value: 0.0000)
Added: 'air_pollution_index' (p-value: 0.0000)
Added: 'chronic_cough' (p-value: 0.0000)
Added: 'family_history_cancer' (p-value: 0.0000)
Added: 'occupational_exposure' (p-value: 0.0000)
Added: 'exercise_hours_per_week' (p-value: 0.0000)
Added: 'passive_smoking' (p-value: 0.0000)
Added: 'oxygen_saturation' (p-value: 0.0000)
Added: 'healthcare_access' (p-value: 0.0000)
Added: 'diet_quality' (p-value: 0.0000)
Added: 'radon_exposure' (p-value: 0.0000)
Added: 'chest_pain' (p-value: 0.0397)
----------------------------------------
Forward Selection Complete!
Final number of features: 17
['xray_abnormal', 'cigarettes_per_day', 'crp_level', 'smoking_years', 'age', 'copd', 'air_pollution_index', 'chronic_cough', 'family_history_cancer', 'occupational_exposure', 'exercise_hours_per_week', 'passive_smoking', 'oxygen_saturation', 'healthcare_access', 'diet_quality', 'radon_exposure', 'chest_pain']
CPU times: user 5.94 s, sys: 902 μs, total: 5.94 s
Wall time: 1.7 s

Code

# Symmetric difference: features that appear in only one of the two sets
fs_difference_ss = set(final_fs_features).symmetric_difference(final_ss_features)
print('difference between forward and stepwise selection: ', len(fs_difference_ss))
be_difference_ss = set(final_be_features).symmetric_difference(final_ss_features)
print('difference between backward elimination and stepwise selection: ', len(be_difference_ss))
# List any feature on which backward elimination and stepwise selection disagree
for feature in sorted(be_difference_ss):
    print(feature)

difference between forward and stepwise selection:  0
difference between backward elimination and stepwise selection:  0

Conclusion of Heuristic Methods

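After the VIF filter, backward elimination, forward selection, and stepwise selection all converge on the same set of 17 features; only the order in which the features are listed differs. Forward and stepwise selection also add the features in the same order, and the stepwise log shows no removals. The nine features dropped by backward elimination are exactly the ones the other two methods never add.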

Code

%%time
from statsmodels_utils import best_subset_selection
subset, best_subs_features = best_subset_selection(sm.Logit, X_clean, y, max_features=4, disp=False)
print(best_subs_features)

Starting Best Subset Selection on 26 features using Logit.
Testing combinations up to 4 features at a time.
Total models to train: 17,901
----------------------------------------
Evaluating subsets of size 1...
Evaluating subsets of size 2...
Evaluating subsets of size 3...
Evaluating subsets of size 4...
----------------------------------------
Best Subset Selection Complete!
Best AIC Score: 1164.24
Optimal features (4): ['age', 'oxygen_saturation', 'crp_level', 'xray_abnormal']
['age', 'oxygen_saturation', 'crp_level', 'xray_abnormal']
CPU times: user 45.7 s, sys: 1.92 ms, total: 45.7 s
Wall time: 45.9 s

Code

%%time
from statsmodels_utils import exact_subset_selection
subset, best_subs_features = exact_subset_selection(sm.Logit, X_clean, y, 4, disp=False)

Starting Exact Subset Selection on 26 total features using Logit.
Testing combinations of EXACTLY 4 features at a time.
Total models to train: 14,950
----------------------------------------
Evaluating subsets of size 4...
----------------------------------------
Exact Subset Selection Complete!
Best AIC Score: 1164.24
Optimal features (4): ['age', 'oxygen_saturation', 'crp_level', 'xray_abnormal']
CPU times: user 39.4 s, sys: 6.89 ms, total: 39.4 s
Wall time: 39.6 s
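
The model counts reported in the two logs follow directly from binomial coefficients over the 26 features that survive the VIF filter. The short check below reproduces them and also shows how quickly an unrestricted search would grow; the variable names are illustrative only.

```python
from math import comb

n_features = 26   # features remaining after the VIF filter
max_size = 4      # largest subset size tested

# Number of candidate models for each subset size k = 1..4.
per_size = {k: comb(n_features, k) for k in range(1, max_size + 1)}
total_up_to_max = sum(per_size.values())
exact_only = per_size[max_size]
all_subsets = 2 ** n_features - 1   # every non-empty subset, no size limit

print(per_size)          # {1: 26, 2: 325, 3: 2600, 4: 14950}
print(total_up_to_max)   # 17901
print(exact_only)        # 14950
print(all_subsets)       # 67108863
```

Subsets of size 4 account for most of the 17,901 fits, which is why restricting the search to exactly four features saves only a few seconds. An exhaustive search over all sizes would need tens of millions of fits, so a size cap or a heuristic method is the practical choice here.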