import pandas as pd
df_lung = pd.read_csv('data/lung_cancer.csv')
The idea of this notebook is to explore the relationship between the behavioral features and the risk of lung cancer.
df_variables_desc = pd.read_json('data/variables_descriptions.json')
print(df_variables_desc)
behavior_features = [
'age', # Age of the individual in years
'gender', # Biological sex (0 = Female, 1 = Male)
'education_years', # Total years of formal education completed
'income_level', # Socioeconomic status on an ordinal scale (1 = lowest, 5 = highest)
'smoker', # Indicates whether the individual has a history of smoking (0 = No, 1 = Yes)
'smoking_years', # Total number of years the individual has smoked
'cigarettes_per_day', # Average number of cigarettes smoked per day
'pack_years', # Cumulative smoking exposure combining duration and intensity
'passive_smoking', # Exposure to secondhand smoke (0 = No, 1 = Yes)
'air_pollution_index', # Air quality index representing long-term pollution exposure
'occupational_exposure', # Exposure to hazardous substances at work (0 = No, 1 = Yes)
'radon_exposure', # History of radon exposure (0 = No, 1 = Yes)
'family_history_cancer', # Family history of cancer (0 = No, 1 = Yes)
'asthma', # History of asthma (0 = No, 1 = Yes)
'previous_tb', # History of tuberculosis infection (0 = No, 1 = Yes)
'chronic_cough', # Presence of long-term cough symptoms (0 = No, 1 = Yes)
'chest_pain', # Reports of chest pain (0 = No, 1 = Yes)
'shortness_of_breath', # Presence of breathing difficulty (0 = No, 1 = Yes)
'fatigue', # Persistent fatigue symptoms (0 = No, 1 = Yes)
'bmi', # Body mass index category value
'exercise_hours_per_week', # Average weekly physical activity duration
'diet_quality', # Overall dietary quality score (1 = poor, 5 = excellent)
'alcohol_units_per_week', # Average alcohol consumption per week
'healthcare_access' # Access to healthcare services (1 = poor, 5 = excellent)
]
name description
0 age Age of the individual in years
1 gender Biological sex (0 = Female, 1 = Male)
2 education_years Total years of formal education completed
3 income_level Socioeconomic status on an ordinal scale (1 = ...
4 smoker Indicates whether the individual has a history...
5 smoking_years Total number of years the individual has smoked
6 cigarettes_per_day Average number of cigarettes smoked per day
7 pack_years Cumulative smoking exposure combining duration...
8 passive_smoking Exposure to secondhand smoke (0 = No, 1 = Yes)
9 air_pollution_index Air quality index representing long-term pollu...
10 occupational_exposure Exposure to hazardous substances at work (0 = ...
11 radon_exposure History of radon exposure (0 = No, 1 = Yes)
12 family_history_cancer Family history of cancer (0 = No, 1 = Yes)
13 copd Diagnosis of chronic obstructive pulmonary dis...
14 asthma History of asthma (0 = No, 1 = Yes)
15 previous_tb History of tuberculosis infection (0 = No, 1 =...
16 chronic_cough Presence of long-term cough symptoms (0 = No, ...
17 chest_pain Reports of chest pain (0 = No, 1 = Yes)
18 shortness_of_breath Presence of breathing difficulty (0 = No, 1 = ...
19 fatigue Persistent fatigue symptoms (0 = No, 1 = Yes)
20 bmi Body mass index category value
21 oxygen_saturation Blood oxygen saturation level (%)
22 fev1_x10 Lung function measure derived from forced expi...
23 crp_level C-reactive protein level indicating inflammation
24 xray_abnormal Abnormal findings in chest imaging (0 = No, 1 ...
25 exercise_hours_per_week Average weekly physical activity duration
26 diet_quality Overall dietary quality score (1 = poor, 5 = e...
27 alcohol_units_per_week Average alcohol consumption per week
28 healthcare_access Access to healthcare services (1 = poor, 5 = e...
29 lung_cancer_risk Target variable indicating elevated lung cance...
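The description table lists 30 variables, while `behavior_features` names 24 of them. A minimal sketch of how the left-out variables could be listed, assuming the JSON loads with a `name` column as the printed output suggests. The small frame and the shortened list below are stand-ins for illustration, not the real data.

```python
import pandas as pd

# Stand-in for df_variables_desc; only a few rows are reproduced here.
df_variables_desc = pd.DataFrame({
    "name": ["age", "smoker", "copd", "crp_level", "lung_cancer_risk"],
})

# Shortened stand-in for the behavior_features list defined above.
behavior_features = ["age", "smoker"]

# Variables that are described but were left out of the behavioral list.
excluded = [n for n in df_variables_desc["name"] if n not in behavior_features]
print(excluded)
```

Run against the full table, a check like this would be expected to surface the clinical measurements and the target column, which are described but not part of the behavioral list.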
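Multicollinearity
There are some redundant variables; for example, the smoking-related features 'smoker', 'smoking_years', 'cigarettes_per_day', and 'pack_years' overlap heavily.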

Hunting for pairs with correlation > 0.8...
Tie-breaker: Kept 'smoking_years', Dropped 'smoker'
Tie-breaker: Kept 'pack_years', Dropped 'smoking_years'
Tie-breaker: Kept 'pack_years', Dropped 'cigarettes_per_day'
Tie-breaker: Kept 'pack_years', Dropped 'oxygen_saturation'
Tie-breaker: Kept 'pack_years', Dropped 'fev1_x10'
Tie-breaker: Kept 'pack_years', Dropped 'crp_level'
----------------------------------------
Dropped 6 variables due to pairwise correlation.
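The notebook does not show the code behind the pairwise step, so the following is only a sketch of one way to do it. The function name `drop_correlated_pairs` and the tie-breaker rule (keep the feature more correlated with the target) are assumptions for illustration; the rule actually used to produce the log above is not stated.

```python
import numpy as np
import pandas as pd


def drop_correlated_pairs(X, y, threshold=0.8):
    """Drop one feature from every pair whose |correlation| exceeds threshold.

    Tie-breaker (an assumption): keep the feature more correlated with y.
    """
    corr = X.corr().abs()
    target_corr = X.corrwith(y).abs()
    dropped = []
    cols = list(X.columns)
    for i, a in enumerate(cols):
        for b in cols[i + 1:]:
            # Skip pairs where one side has already been removed.
            if a in dropped or b in dropped:
                continue
            if corr.loc[a, b] > threshold:
                loser = a if target_corr[a] < target_corr[b] else b
                print(f"Kept {b if loser == a else a!r}, Dropped {loser!r}")
                dropped.append(loser)
    return X.drop(columns=dropped), dropped


# Synthetic data: 'a_copy' is nearly identical to 'a', 'noise' is unrelated.
rng = np.random.default_rng(0)
base = rng.normal(size=200)
X = pd.DataFrame({
    "a": base,
    "a_copy": base + rng.normal(scale=0.01, size=200),
    "noise": rng.normal(size=200),
})
y = pd.Series(base + rng.normal(scale=0.1, size=200))

X_reduced, dropped = drop_correlated_pairs(X, y)
```

On this toy data exactly one of the two near-duplicate columns is removed and the unrelated column survives.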
Starting VIF Multicollinearity Check...
----------------------------------------
[DROPPED] 'pack_years'
* VIF Score: 5.05 (Overlap: 80.2%)
* Primary Suspects (Highest pairwise correlations with this feature):
- xray_abnormal (0.75)
- shortness_of_breath (0.74)
- chronic_cough (0.69)
* Current Runner-Ups (Watch these scores drop in the next loop!):
- xray_abnormal: 2.30
- shortness_of_breath: 2.26
- chronic_cough: 1.91
------------------------------
----------------------------------------
VIF Check Complete!
Features dropped: 1
age gender education_years income_level pack_years passive_smoking air_pollution_index occupational_exposure radon_exposure family_history_cancer copd asthma previous_tb chronic_cough chest_pain shortness_of_breath fatigue bmi xray_abnormal exercise_hours_per_week diet_quality alcohol_units_per_week healthcare_access
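The VIF loop itself is not shown in the notebook either. The sketch below is one possible version, under two assumptions: a drop threshold of 5 (the log only shows a drop at 5.05), and VIF computed as the diagonal of the inverse correlation matrix. The "Overlap" figure in the log is consistent with R² = 1 - 1/VIF (for example, 1 - 1/5.05 ≈ 80.2%), so the sketch prints it that way. The names `vif_scores` and `drop_high_vif` are hypothetical.

```python
import numpy as np
import pandas as pd


def vif_scores(X):
    """VIF per column, read from the diagonal of the inverse correlation matrix."""
    corr = X.corr().to_numpy()
    return pd.Series(np.diag(np.linalg.inv(corr)), index=X.columns)


def drop_high_vif(X, threshold=5.0):
    """Repeatedly drop the column with the largest VIF until all are <= threshold."""
    X = X.copy()
    dropped = []
    while X.shape[1] > 1:
        scores = vif_scores(X)
        worst = scores.idxmax()
        if scores[worst] <= threshold:
            break
        # R^2 of regressing this column on the remaining ones.
        overlap = 1 - 1 / scores[worst]
        print(f"[DROPPED] {worst!r}: VIF {scores[worst]:.2f} (Overlap: {overlap:.1%})")
        dropped.append(worst)
        X = X.drop(columns=worst)
    return X, dropped


# Synthetic data: x3 is almost exactly x1 + x2, so it is highly collinear.
rng = np.random.default_rng(1)
x1 = rng.normal(size=500)
x2 = rng.normal(size=500)
X = pd.DataFrame({
    "x1": x1,
    "x2": x2,
    "x3": x1 + x2 + rng.normal(scale=0.1, size=500),
})

X_reduced, dropped = drop_high_vif(X)
```

Scores are recomputed after every drop because removing one collinear column lowers the VIF of the columns it overlapped with. A constant column would produce undefined correlations and should be removed before calling a function like this.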
Starting VIF Multicollinearity Check...
----------------------------------------
[DROPPED] 'pack_years'
* VIF Score: 35.16 (Overlap: 97.2%)
* Primary Suspects (Highest pairwise correlations with this feature):
- fev1_x10 (0.94)
- crp_level (0.92)
- oxygen_saturation (0.90)
* Current Runner-Ups (Watch these scores drop in the next loop!):
- cigarettes_per_day: 14.15
- smoking_years: 13.62
- smoker: 12.75
------------------------------
[DROPPED] 'smoker'
* VIF Score: 10.29 (Overlap: 90.3%)
* Primary Suspects (Highest pairwise correlations with this feature):
- smoking_years (0.83)
- cigarettes_per_day (0.81)
- fev1_x10 (0.63)
* Current Runner-Ups (Watch these scores drop in the next loop!):
- cigarettes_per_day: 10.17
- smoking_years: 9.95
- fev1_x10: 7.08
------------------------------
[DROPPED] 'fev1_x10'
* VIF Score: 6.35 (Overlap: 84.2%)
* Primary Suspects (Highest pairwise correlations with this feature):
- crp_level (0.87)
- oxygen_saturation (0.85)
- cigarettes_per_day (0.79)
* Current Runner-Ups (Watch these scores drop in the next loop!):
- crp_level: 5.45
- oxygen_saturation: 4.74
- cigarettes_per_day: 3.21
------------------------------
----------------------------------------
VIF Check Complete!
Features dropped: 3
Hunting for pairs with correlation > 0.8...
Tie-breaker: Kept 'crp_level', Dropped 'oxygen_saturation'
----------------------------------------
Dropped 1 variables due to pairwise correlation.
age gender education_years income_level smoking_years cigarettes_per_day passive_smoking air_pollution_index occupational_exposure radon_exposure family_history_cancer copd asthma previous_tb chronic_cough chest_pain shortness_of_breath fatigue bmi oxygen_saturation crp_level xray_abnormal exercise_hours_per_week diet_quality alcohol_units_per_week healthcare_access