Biostatistics Modeling: Lung Cancer Risk Analysis

Code
import pandas as pd
df_lung = pd.read_csv('data/lung_cancer.csv')
print(df_lung.info())
df_lung.describe()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5000 entries, 0 to 4999
Data columns (total 30 columns):
 #   Column                   Non-Null Count  Dtype
---  ------                   --------------  -----
 0   age                      5000 non-null   int64
 1   gender                   5000 non-null   int64
 2   education_years          5000 non-null   int64
 3   income_level             5000 non-null   int64
 4   smoker                   5000 non-null   int64
 5   smoking_years            5000 non-null   int64
 6   cigarettes_per_day       5000 non-null   int64
 7   pack_years               5000 non-null   int64
 8   passive_smoking          5000 non-null   int64
 9   air_pollution_index      5000 non-null   int64
 10  occupational_exposure    5000 non-null   int64
 11  radon_exposure           5000 non-null   int64
 12  family_history_cancer    5000 non-null   int64
 13  copd                     5000 non-null   int64
 14  asthma                   5000 non-null   int64
 15  previous_tb              5000 non-null   int64
 16  chronic_cough            5000 non-null   int64
 17  chest_pain               5000 non-null   int64
 18  shortness_of_breath      5000 non-null   int64
 19  fatigue                  5000 non-null   int64
 20  bmi                      5000 non-null   int64
 21  oxygen_saturation        5000 non-null   int64
 22  fev1_x10                 5000 non-null   int64
 23  crp_level                5000 non-null   int64
 24  xray_abnormal            5000 non-null   int64
 25  exercise_hours_per_week  5000 non-null   int64
 26  diet_quality             5000 non-null   int64
 27  alcohol_units_per_week   5000 non-null   int64
 28  healthcare_access        5000 non-null   int64
 29  lung_cancer_risk         5000 non-null   int64
dtypes: int64(30)
memory usage: 1.1 MB
None
age gender education_years income_level smoker smoking_years cigarettes_per_day pack_years passive_smoking air_pollution_index ... bmi oxygen_saturation fev1_x10 crp_level xray_abnormal exercise_hours_per_week diet_quality alcohol_units_per_week healthcare_access lung_cancer_risk
count 5000.000000 5000.000000 5000.000000 5000.000000 5000.000000 5000.000000 5000.000000 5000.000000 5000.000000 5000.000000 ... 5000.000000 5000.000000 5000.000000 5000.000000 5000.000000 5000.000000 5000.000000 5000.0000 5000.000000 5000.000000
mean 54.566400 0.488200 11.510400 2.554400 0.454800 8.822200 6.685600 6.249000 0.347600 64.278800 ... 23.604800 96.089200 31.475800 4.593600 0.201400 2.580400 2.534800 5.8850 2.536800 0.248800
std 11.932731 0.499911 2.953987 0.982667 0.498003 11.651598 9.032888 9.964901 0.476256 19.479809 ... 3.937222 3.412385 5.213679 5.333566 0.401086 1.834178 0.985385 4.4239 0.984093 0.432361
min 18.000000 0.000000 5.000000 1.000000 0.000000 0.000000 0.000000 0.000000 0.000000 20.000000 ... 16.000000 85.000000 5.000000 0.000000 0.000000 0.000000 1.000000 0.0000 1.000000 0.000000
25% 47.000000 0.000000 10.000000 2.000000 0.000000 0.000000 0.000000 0.000000 0.000000 51.000000 ... 21.000000 95.000000 30.000000 1.000000 0.000000 1.000000 2.000000 2.0000 2.000000 0.000000
50% 55.000000 0.000000 11.000000 3.000000 0.000000 0.000000 0.000000 0.000000 0.000000 64.000000 ... 24.000000 97.000000 33.000000 3.000000 0.000000 2.000000 3.000000 6.0000 3.000000 0.000000
75% 62.000000 1.000000 13.250000 3.000000 1.000000 18.000000 14.000000 10.000000 1.000000 78.000000 ... 26.000000 99.000000 35.000000 7.000000 0.000000 4.000000 3.000000 9.0000 3.000000 0.000000
max 90.000000 1.000000 20.000000 5.000000 1.000000 52.000000 44.000000 60.000000 1.000000 130.000000 ... 37.000000 100.000000 37.000000 33.000000 1.000000 10.000000 5.000000 23.0000 5.000000 1.000000

8 rows × 30 columns

Code
import pandas as pd
definitions = pd.read_json('data/variables_descriptions.json')
definitions.head()
name description
0 age Age of the individual in years
1 gender Biological sex (0 = Female, 1 = Male)
2 education_years Total years of formal education completed
3 income_level Socioeconomic status on an ordinal scale (1 = ...
4 smoker Indicates whether the individual has a history...

In this notebook, we will focus on predicting lung cancer risk based on patient data. Our approach includes:

1. Exploratory Data Analysis (EDA): examining data distributions and correlations among features.
2. Feature Engineering & Selection: handling multicollinearity using the Variance Inflation Factor (VIF).
3. Supervised Modeling: training Logistic Regression and Lasso models.
4. Model Evaluation & Interpretation: comparing models using precision, recall, and F1-score, and explaining model predictions using SHAP values.

1. Data Preparation

First, we separate the features from the target variable lung_cancer_risk. Later, we will split the data into training and testing sets using a stratified split on the target, ensuring both sets contain a representative proportion of the positive class.

Code

from sklearn.model_selection import train_test_split

target_label = 'lung_cancer_risk'
y = df_lung[target_label].reset_index(drop=True)
X = df_lung.drop(target_label, axis=1).reset_index(drop=True)

2. Exploratory Data Analysis (EDA)

We visualize the correlations between all features. Highly correlated features can introduce multicollinearity, which might destabilize linear models like Logistic Regression.

Code

from statsmodels_utils import plot_correlation_heatmap

plot_correlation_heatmap(X, 'Correlation Heatmap all features')
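The heatmap helper lives in the project's local statsmodels_utils module. Independently of plotting, the strongest pairwise correlations can also be listed directly with pandas; a minimal sketch (the synthetic columns below are purely illustrative):

```python
import numpy as np
import pandas as pd

def top_correlated_pairs(df: pd.DataFrame, n: int = 5) -> pd.Series:
    """Return the n strongest absolute pairwise correlations, one entry per pair."""
    corr = df.corr().abs()
    # keep only the upper triangle (k=1 excludes the diagonal) so each pair appears once
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    return upper.stack().sort_values(ascending=False).head(n)

# tiny illustration on synthetic data: pack_years is built from smoking_years
rng = np.random.default_rng(0)
years = rng.uniform(0, 40, size=200)
demo = pd.DataFrame({
    'smoking_years': years,
    'pack_years': 0.8 * years + rng.normal(scale=2.0, size=200),  # strongly related
    'bmi': rng.normal(24, 4, size=200),                           # unrelated
})
print(top_correlated_pairs(demo, n=2))
```

Ranking pairs numerically like this complements the heatmap when the feature count makes visual inspection tedious.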

Let’s take a closer look at the relationships between smoking-related variables, as they are typically highly correlated.

Code

import matplotlib.pyplot as plt
import seaborn as sns

smoking_corr_features = ['smoking_years', 'cigarettes_per_day', 'pack_years']
df_smoker = df_lung[smoking_corr_features]
g = sns.PairGrid(df_smoker)
g.map_lower(sns.scatterplot, alpha=0.5, edgecolor='k')
g.map_diag(sns.histplot, fill=True, color='gray')
g.map_upper(lambda x, y, **kwargs: plt.gca().annotate(
    f"{x.corr(y):.2f}",
    xy=(0.5, 0.5),
    xycoords='axes fraction',
    ha='center',
    va='center',
    fontsize=16,
    fontweight='bold'
))
g.add_legend()
plt.show()

Code
seed = 42
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=seed, stratify=y)

plt.hist(y_train)
plt.show()

3. Multicollinearity and VIF

As seen in the correlation heatmap and pair plot, there are strong correlations among certain features (like pack_years and smoking_years). This indicates a potential multicollinearity issue, which we can quantify using the Variance Inflation Factor (VIF). We will recursively calculate VIF and drop features that exceed a threshold to ensure model stability.
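calculate_and_drop_vif comes from the local statsmodels_utils module, so its exact behavior is project-specific. A minimal sketch of the recursive idea, assuming a simple "drop the worst offender each round" policy (the helper names and the threshold of 5.0 are illustrative, not the project's actual defaults):

```python
import numpy as np
import pandas as pd

def compute_vif(X: pd.DataFrame) -> pd.Series:
    """VIF for each column: 1 / (1 - R^2) when regressing it on all the others."""
    vifs = {}
    for col in X.columns:
        y = X[col].to_numpy(dtype=float)
        others = X.drop(columns=col).to_numpy(dtype=float)
        design = np.column_stack([np.ones(len(X)), others])  # intercept + other features
        beta, *_ = np.linalg.lstsq(design, y, rcond=None)
        resid = y - design @ beta
        r2 = 1.0 - resid.var() / y.var()
        vifs[col] = np.inf if r2 >= 1.0 else 1.0 / (1.0 - r2)
    return pd.Series(vifs).sort_values(ascending=False)

def drop_high_vif(X: pd.DataFrame, threshold: float = 5.0):
    """Iteratively drop the feature with the highest VIF until all fall below threshold."""
    X = X.copy()
    dropped = []
    while X.shape[1] >= 2:
        vifs = compute_vif(X)
        if vifs.iloc[0] <= threshold:
            break
        worst = vifs.index[0]
        print(f"[DROPPED] '{worst}' (VIF: {vifs.iloc[0]:.2f})")
        dropped.append(worst)
        X = X.drop(columns=worst)
    return X, dropped

# demonstration on synthetic collinear data: 'c' is (almost) a + b
rng = np.random.default_rng(1)
a, b = rng.normal(size=300), rng.normal(size=300)
demo = pd.DataFrame({'a': a, 'b': b, 'c': a + b + rng.normal(scale=0.01, size=300)})
cleaned, dropped = drop_high_vif(demo, threshold=5.0)
print("Dropped:", dropped)
```

Dropping one feature at a time matters: removing the worst offender typically lowers the VIF of everything it overlapped with, which is why the scores of the "runner-ups" fall in the next loop.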

Code
from statsmodels_utils import calculate_and_drop_vif, drop_highly_correlated_pairs


# X_train_clean_dropped_pairs  = drop_highly_correlated_pairs(X_train, y_train)
# X_train_clean, dropped_vif_features = calculate_and_drop_vif(X_train_clean_dropped_pairs)
X_train_clean, dropped_vif_features = calculate_and_drop_vif(X_train, verbose=True)
plot_correlation_heatmap(X_train_clean, 'Cleaned Correlation Heatmap')
Starting VIF Multicollinearity Check...

----------------------------------------

[DROPPED] 'pack_years'
  * VIF Score: 35.66 (Overlap: 97.2%)
  * Primary Suspects (Highest pairwise correlations with this feature):
      - fev1_x10 (0.95)
      - crp_level (0.92)
      - oxygen_saturation (0.91)
  * Current Runner-Ups (Watch these scores drop in the next loop!):
      - cigarettes_per_day: 14.60
      - smoking_years: 13.52
      - smoker: 12.77
------------------------------

[DROPPED] 'cigarettes_per_day'
  * VIF Score: 10.40 (Overlap: 90.4%)
  * Primary Suspects (Highest pairwise correlations with this feature):
      - smoker (0.81)
      - fev1_x10 (0.79)
      - crp_level (0.79)
  * Current Runner-Ups (Watch these scores drop in the next loop!):
      - smoker: 10.22
      - smoking_years: 9.84
      - fev1_x10: 7.19
------------------------------

[DROPPED] 'fev1_x10'
  * VIF Score: 6.33 (Overlap: 84.2%)
  * Primary Suspects (Highest pairwise correlations with this feature):
      - crp_level (0.88)
      - oxygen_saturation (0.86)
      - smoking_years (0.77)
  * Current Runner-Ups (Watch these scores drop in the next loop!):
      - crp_level: 5.36
      - smoking_years: 5.03
      - oxygen_saturation: 4.73
------------------------------
----------------------------------------
VIF Check Complete!
Features dropped: 3

4. Modeling: Logistic Regression

We will start by training a Logistic Regression model (Logit) from statsmodels. To select the most relevant features, we employ backward elimination, iteratively removing the feature with the highest p-value until every remaining feature contributes significantly to the model.
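backward_elimination is another local helper; a minimal sketch of the loop it presumably implements, assuming a statsmodels-style model class whose fitted result exposes a `pvalues` Series (the 0.05 significance level is an assumed default):

```python
import pandas as pd

def backward_elimination_sketch(model_cls, X: pd.DataFrame, y, alpha: float = 0.05, **fit_kwargs):
    """Drop the least significant feature each round until all p-values are <= alpha.

    Assumes the statsmodels convention: model_cls(y, X_with_const).fit(**fit_kwargs)
    returns a result object exposing a `pvalues` Series.
    """
    features = list(X.columns)
    while features:
        X_const = X[features].copy()
        X_const.insert(0, 'const', 1.0)            # statsmodels-style intercept column
        result = model_cls(y, X_const).fit(**fit_kwargs)
        pvalues = result.pvalues.drop('const')     # never eliminate the intercept
        worst = pvalues.idxmax()
        if pvalues[worst] <= alpha:
            return result, features                # every remaining feature is significant
        print(f"Dropped: '{worst}' (p-value: {pvalues[worst]:.4f})")
        features.remove(worst)
    raise ValueError("All features were eliminated; no significant predictors found.")
```

Because only one feature is removed per refit, the p-values of the surviving features are re-estimated at each step, which avoids discarding a feature that only looked insignificant in the presence of another.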

Code
# fit a statsmodels Logit and select features via backward elimination
from statsmodels_utils import backward_elimination
import statsmodels.api as sm

logit_model = sm.Logit
final_model_logit, features = backward_elimination(logit_model, X_train_clean, y_train, disp=False,
                                                 warn_convergence=False)
print(f"Optimal features: {features}")
final_model_logit.summary()
Starting Backward Elimination using Logit...
----------------------------------------
Dropped: 'smoker' (p-value: 0.9993)
Dropped: 'fatigue' (p-value: 0.9296)
Dropped: 'income_level' (p-value: 0.8643)
Dropped: 'chest_pain' (p-value: 0.8294)
Dropped: 'previous_tb' (p-value: 0.8061)
Dropped: 'education_years' (p-value: 0.7859)
Dropped: 'gender' (p-value: 0.6615)
Dropped: 'asthma' (p-value: 0.4131)
Dropped: 'alcohol_units_per_week' (p-value: 0.3346)
----------------------------------------
Backward Elimination Complete!
Remaining features: 17
Optimal features: ['age', 'smoking_years', 'passive_smoking', 'air_pollution_index', 'occupational_exposure', 'radon_exposure', 'family_history_cancer', 'copd', 'chronic_cough', 'shortness_of_breath', 'bmi', 'oxygen_saturation', 'crp_level', 'xray_abnormal', 'exercise_hours_per_week', 'diet_quality', 'healthcare_access']
Logit Regression Results
Dep. Variable: lung_cancer_risk No. Observations: 4000
Model: Logit Df Residuals: 3982
Method: MLE Df Model: 17
Date: Fri, 06 Mar 2026 Pseudo R-squ.: 0.9209
Time: 07:58:05 Log-Likelihood: -177.41
converged: True LL-Null: -2243.8
Covariance Type: nonrobust LLR p-value: 0.000
coef std err z P>|z| [0.025 0.975]
const 57.6956 8.755 6.590 0.000 40.536 74.855
age 0.1893 0.018 10.558 0.000 0.154 0.224
smoking_years 0.0910 0.018 4.998 0.000 0.055 0.127
passive_smoking 2.1159 0.334 6.330 0.000 1.461 2.771
air_pollution_index 0.0883 0.009 9.601 0.000 0.070 0.106
occupational_exposure 2.5526 0.375 6.800 0.000 1.817 3.288
radon_exposure 1.4291 0.434 3.293 0.001 0.579 2.280
family_history_cancer 2.8693 0.378 7.583 0.000 2.128 3.611
copd 4.4636 0.452 9.882 0.000 3.578 5.349
chronic_cough 4.1395 0.447 9.259 0.000 3.263 5.016
shortness_of_breath 0.8645 0.322 2.685 0.007 0.233 1.496
bmi -0.1079 0.037 -2.882 0.004 -0.181 -0.035
oxygen_saturation -0.8449 0.099 -8.527 0.000 -1.039 -0.651
crp_level 0.7062 0.067 10.602 0.000 0.576 0.837
xray_abnormal 6.0964 0.518 11.760 0.000 5.080 7.112
exercise_hours_per_week -0.5941 0.088 -6.735 0.000 -0.767 -0.421
diet_quality -0.7095 0.148 -4.788 0.000 -1.000 -0.419
healthcare_access -0.5990 0.142 -4.229 0.000 -0.877 -0.321


Possibly complete quasi-separation: A fraction 0.67 of observations can be
perfectly predicted. This might indicate that there is complete
quasi-separation. In this case some parameters will not be identified.
Code
y_pred_logit = final_model_logit.predict(sm.add_constant(X_test[features]))
y_pred_logit = (y_pred_logit > 0.5).astype(int)
from statsmodels_utils import plot_confusion_matrix

plot_folder = '../frontend/public/plots'
confusion_matrix_logit_image = f'{plot_folder}/confusion_logit.jpg'

plot_confusion_matrix(y_test, y_pred_logit, title="Confusion matrix Logit", plot_file=confusion_matrix_logit_image)

from sklearn.metrics import recall_score, accuracy_score, f1_score, precision_score

print("Accuracy: ", accuracy_score(y_test, y_pred_logit))
print("F1: ", f1_score(y_test, y_pred_logit))
print("Precision: ", precision_score(y_test, y_pred_logit))
print("Recall: ", recall_score(y_test, y_pred_logit))
Plot saved successfully to: ../frontend/public/plots/confusion_logit.jpg

Accuracy:  0.981
F1:  0.9620758483033932
Precision:  0.9563492063492064
Recall:  0.9678714859437751

5. Model Evaluation

The high Recall (0.97) gives us confidence that the model performs well at identifying true positive cases. In medical diagnosis such as lung cancer prediction, minimizing false negatives (i.e., missing a cancer diagnosis) is critical, making Recall a primary metric.

Overall, we have reached a good model that generalizes well to the unseen test data. Next, we will compare it against a Lasso (L1 Regularization) model to evaluate differences in metrics and feature selection.

Code
from sklearn.utils import compute_sample_weight
from sklearn.linear_model import Lasso

alpha = 1.0
weights = compute_sample_weight(class_weight='balanced', y=y_train)
clf = Lasso(alpha=alpha, random_state=seed)
clf.fit(X_train, y_train, sample_weight=weights)
clf_selected_features = X_train.columns[clf.coef_ != 0]
y_pred_lasso = clf.predict(X_test[X_train.columns])
y_pred_lasso = (y_pred_lasso > 0.5).astype(int)
print("Lasso selected features: ", clf_selected_features.tolist())
print("Stepwise selected features: ", features)
print("Common features: ", [f for f in features if f in clf_selected_features])
print("Selected by Lasso but not by Logit: ", clf_selected_features.difference(features).tolist())
print("Selected by Logit but not by Lasso: ", list(set(features).difference(clf_selected_features)))
Lasso selected features:  ['smoking_years', 'cigarettes_per_day', 'pack_years']
Stepwise selected features:  ['age', 'smoking_years', 'passive_smoking', 'air_pollution_index', 'occupational_exposure', 'radon_exposure', 'family_history_cancer', 'copd', 'chronic_cough', 'shortness_of_breath', 'bmi', 'oxygen_saturation', 'crp_level', 'xray_abnormal', 'exercise_hours_per_week', 'diet_quality', 'healthcare_access']
Common features:  ['smoking_years']
Selected by Lasso but not by Logit:  ['cigarettes_per_day', 'pack_years']
Selected by Logit but not by Lasso:  ['age', 'family_history_cancer', 'diet_quality', 'occupational_exposure', 'radon_exposure', 'copd', 'shortness_of_breath', 'oxygen_saturation', 'xray_abnormal', 'air_pollution_index', 'passive_smoking', 'bmi', 'exercise_hours_per_week', 'chronic_cough', 'crp_level', 'healthcare_access']

Lasso is highly aggressive with its L1 penalty and selected only 3 features: smoking_years, cigarettes_per_day, and pack_years. Note that Lasso was fit on the full feature set, so it retained pack_years, which we had dropped earlier due to high VIF; backward elimination, run on the VIF-cleaned data, retained 17 features instead.
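One caveat worth making explicit: sklearn's Lasso is a regression estimator, so thresholding its continuous output at 0.5 gives only a crude classifier. The more natural analogue for a binary target is an L1-penalized LogisticRegression. A sketch on synthetic data (feature indices and hyperparameters are illustrative, not tuned for our dataset):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# synthetic binary data: only the first 2 of 6 features carry signal
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(500, 6))
log_odds = 2.0 * X_demo[:, 0] - 1.5 * X_demo[:, 1]
y_demo = (log_odds + rng.normal(scale=0.5, size=500) > 0).astype(int)

# L1-penalized logistic regression: the classification analogue of Lasso
clf_l1 = LogisticRegression(penalty='l1', solver='liblinear', C=0.1,
                            class_weight='balanced', random_state=42)
clf_l1.fit(X_demo, y_demo)
selected = np.flatnonzero(clf_l1.coef_[0])
print("Non-zero coefficient indices:", selected)
```

Like Lasso, the L1 penalty zeroes out weak coefficients, but the loss is the proper log-loss for a binary outcome rather than squared error.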

Code
from statsmodels_utils import plot_confusion_matrix


confusion_matrix_lasso_image = f'{plot_folder}/confusion_lasso.jpg'
plot_confusion_matrix(y_test, y_pred_lasso, title="Confusion matrix Lasso", plot_file=confusion_matrix_lasso_image)
print("Accuracy: ", accuracy_score(y_test, y_pred_lasso))
print("F1: ", f1_score(y_test, y_pred_lasso))
print("Precision: ", precision_score(y_test, y_pred_lasso))
print("Recall: ", recall_score(y_test, y_pred_lasso))
Plot saved successfully to: ../frontend/public/plots/confusion_lasso.jpg

Accuracy:  0.912
F1:  0.8422939068100358
Precision:  0.7605177993527508
Recall:  0.9437751004016064

Model Comparison Conclusion

The Lasso model performed notably worse at detecting positive cases: it missed more lung cancer cases (lower Recall) and produced far more false positives, with Precision falling from 0.96 to 0.76.

In a life-threatening context, achieving high Recall is crucial, and sacrificing a small amount of Precision is an acceptable trade-off. Lasso did not outperform the Logit model (with backward elimination) on Recall or F1-score. Therefore, we will proceed with our Logistic Regression model.
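If still higher Recall were required, the 0.5 decision threshold itself could be lowered explicitly. A sketch on synthetic scores, using sklearn's precision_recall_curve to find the highest threshold that still meets a recall target (the score distributions here are illustrative):

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

# synthetic scores: positives tend to score higher than negatives
rng = np.random.default_rng(0)
y_true = np.concatenate([np.ones(200, dtype=int), np.zeros(800, dtype=int)])
scores = np.concatenate([rng.normal(0.7, 0.15, 200), rng.normal(0.3, 0.15, 800)])

precision, recall, thresholds = precision_recall_curve(y_true, scores)
# pick the highest threshold that still achieves the target recall
target_recall = 0.95
ok = recall[:-1] >= target_recall          # the last curve point has no threshold
best_threshold = thresholds[ok].max()
print(f"Threshold for >= {target_recall:.0%} recall: {best_threshold:.3f}")
```

Choosing the highest qualifying threshold preserves as much Precision as possible while guaranteeing the recall floor.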

6. Saving Model Artifacts

We will serialize and save the coefficients of our chosen Logistic Regression model, along with the list of selected features. This will allow us to reproduce the scoring and deploy the model for future predictions.
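Given the saved format (a flat list of coefficients with the intercept first, plus the feature order), scoring can be reproduced without statsmodels. A minimal sketch, assuming the JSON layout written below; load_and_score is a hypothetical helper name, not part of the project:

```python
import json
import numpy as np
import pandas as pd

def load_and_score(weights_path: str, features_path: str, X: pd.DataFrame) -> np.ndarray:
    """Reproduce the Logit scoring from the saved coefficients and feature order."""
    with open(weights_path) as f:
        coeffs = np.array(json.load(f))            # 'const' coefficient comes first
    with open(features_path) as f:
        feature_order = json.load(f)               # ['const', feature_1, ...]
    design = X[feature_order[1:]].to_numpy(dtype=float)
    design = np.column_stack([np.ones(len(design)), design])   # prepend the intercept
    log_odds = design @ coeffs
    return 1.0 / (1.0 + np.exp(-log_odds))         # logistic link -> probability
```

Saving the feature order alongside the weights is what makes this reproducible: the dot product is only meaningful if the columns line up with the coefficients.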

Code
# we save the coefficients for further predictions
import json
import os
out_dir = '../frontend/public'
ols_filename = 'ols_weights'
coeffs = final_model_logit.params.tolist()
os.makedirs(out_dir, exist_ok=True)

weights_path = f"{out_dir}/{ols_filename}.json"
with open(weights_path, 'w') as f:
    json.dump(coeffs, f, indent=4)
print(f"Model coefficients saved to '{weights_path}'")

# we save the features and their order for reference, with 'const' as the first entry
features_path = f"{out_dir}/{ols_filename}_features.json"
with open(features_path, 'w') as f:
    json.dump(['const', *features], f, indent=4)
print(f"Features saved to '{features_path}'")
Model coefficients saved to '../frontend/public/ols_weights.json'
Features saved to '../frontend/public/ols_weights_features.json'

7. Model Interpretability with SHAP

To understand how our Logistic Regression model makes its predictions, we will analyze the impact of individual features using SHAP (SHapley Additive exPlanations). This allows for both global and local interpretability.
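For a linear model such as our Logit, SHAP values have a closed form under the assumption of independent features: each contribution is the coefficient times the feature's deviation from its mean, expressed in log-odds space, and the contributions sum to the prediction minus the baseline. A small self-contained sketch (the coefficients and data are illustrative, not our fitted model):

```python
import numpy as np
import pandas as pd

def linear_shap_values(coefs: pd.Series, X: pd.DataFrame) -> pd.DataFrame:
    """Exact SHAP values for a linear model with independent features:
    phi_ij = coef_j * (x_ij - mean(x_j)), here in log-odds space for a Logit."""
    return (X - X.mean()) * coefs      # broadcasts coefficients across rows by label

# tiny illustration with hypothetical coefficients
X_demo = pd.DataFrame({'age': [40.0, 60.0], 'copd': [0.0, 1.0]})
coefs = pd.Series({'age': 0.2, 'copd': 4.5})
phi = linear_shap_values(coefs, X_demo)
print(phi)
```

This additivity (baseline plus per-feature contributions equals the model output) is exactly what the waterfall and summary plots below visualize.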

Code
from statsmodels_utils import plot_coef


features_impact_image = f'{plot_folder}/features_impact.jpg'
plot_coef(final_model_logit, plot_file=features_impact_image)
Plot saved successfully to: ../frontend/public/plots/features_impact.jpg

Global Explanations: SHAP Summary

The SHAP summary plot gives us a bird’s-eye view of feature importance and the direction of the relationship between feature values and the prediction.

Code
from statsmodels_utils import plot_shap_summary

X_test_shap = X_test[features]

shap_summary_image = f'{plot_folder}/shap_summary.jpg'
with_constant = sm.add_constant(X_test_shap)
plot_shap_summary(
    final_model_logit,
    with_constant,
    features_to_drop=['const'],
    plot_type='violin',
    show=False,
    plot_file=shap_summary_image
)
Plot saved successfully to: ../frontend/public/plots/shap_summary.jpg

As seen in the plots above, the top three most influential features (crp_level, oxygen_saturation, and xray_abnormal) are all derived from clinical laboratory tests and imaging. This reinforces the clinical validity of the model, as these objective medical measurements strongly drive the predictions.

The SHAP bar plot aggregates the mean absolute SHAP values, providing a clear ranking of overall feature importance across the entire dataset.

Code
from statsmodels_utils import plot_shap_bar

shap_feature_importance_image = f'{plot_folder}/shap_feature_importance.jpg'
plot_shap_bar(final_model_logit, with_constant, plot_file=shap_feature_importance_image)
Plot saved successfully to: ../frontend/public/plots/shap_feature_importance.jpg

Local Explanations: Single Prediction

We can also use SHAP to explain individual predictions. Let’s look at a specific patient (e.g., index 99) to see how their specific feature values pushed the model’s prediction probability above or below the base value.

Code
from statsmodels_utils import plot_shap_waterfall_single

index_to_plot = 99
predicted_probability = final_model_logit.predict(with_constant.iloc[[index_to_plot]])
label = f"This case {'has' if predicted_probability.iloc[0] > 0.5 else 'does not have'} cancer."

single_explanation_image = f'{plot_folder}/single_explanation.jpg'
plot_shap_waterfall_single(
    final_model_logit,
    with_constant,
    row_index=index_to_plot,
    title=f"{label} Single explanation",
    plot_file=single_explanation_image
)
Plot saved successfully to: ../frontend/public/plots/single_explanation.jpg

For the patient at index 99, we can see exactly how the model arrived at its positive prediction. Notably, having an abnormal X-ray (xray_abnormal = 1) is a primary driver, increasing the log-odds of the prediction. Other clinical features and medical history also significantly push the risk higher.

Feature Dependence

SHAP dependence plots show the marginal effect of a feature on the prediction. Let’s visualize how the model’s output depends on specific variables like age.

Code

from statsmodels_utils import plot_shap_dependence

shap_dependence_image = f'{plot_folder}/shap_dependence.jpg'
plot_shap_dependence(final_model_logit, with_constant, 'age', plot_file=shap_dependence_image)
Plot saved successfully to: ../frontend/public/plots/shap_dependence.jpg

Code
from statsmodels_utils import plot_top_shap_dependence

plot_shap_dependence_image = f'{plot_folder}/shap_dependence_top.jpg'
plot_top_shap_dependence(final_model_logit, with_constant, top_n=3, plot_file=plot_shap_dependence_image)
Automatically plotting 3 features: ['crp_level', 'oxygen_saturation', 'xray_abnormal']
Plot saved successfully to: ../frontend/public/plots/shap_dependence_top.jpg

Code
print(features)

plot_shap_dependence_label_image = f'{plot_folder}/shap_dependence_labels.jpg'
plot_top_shap_dependence(
    final_model_logit,
    with_constant,
    features=['air_pollution_index', 'passive_smoking', 'healthcare_access', 'family_history_cancer', 'radon_exposure'],
    plot_file=plot_shap_dependence_label_image
)
['age', 'smoking_years', 'passive_smoking', 'air_pollution_index', 'occupational_exposure', 'radon_exposure', 'family_history_cancer', 'copd', 'chronic_cough', 'shortness_of_breath', 'bmi', 'oxygen_saturation', 'crp_level', 'xray_abnormal', 'exercise_hours_per_week', 'diet_quality', 'healthcare_access']
Automatically plotting 5 features: ['air_pollution_index', 'passive_smoking', 'healthcare_access', 'family_history_cancer', 'radon_exposure']
Plot saved successfully to: ../frontend/public/plots/shap_dependence_labels.jpg