Biostatistics Modeling: Lung Cancer Risk Analysis

In this notebook, we will focus on predicting lung cancer risk based on patient data. Our approach includes: 1. Exploratory Data Analysis (EDA): Examining data distributions and correlations among features. 2. Feature Engineering & Selection: Handling multicollinearity using Variance Inflation Factor (VIF). 3. Supervised Modeling: Training Logistic Regression and Lasso models. 4. Model Evaluation & Interpretation: Comparing models using precision, recall, and F1-score, and explaining model predictions using SHAP values.

0. Data Loading

We load the lung cancer dataset (5,000 patients, 30 variables) together with its variable descriptions. The features span demographics (age, education_years), smoking behavior (smoking_years, cigarettes_per_day, pack_years), environmental exposures (air_pollution_index, radon_exposure), medical history (copd, family_history_cancer), and clinical measurements (oxygen_saturation, crp_level, xray_abnormal).

Note on the data: this is a synthetic educational dataset from Kaggle. Effect sizes are considerably stronger than in real clinical data, which is why the model separates classes more cleanly than any real-world screening tool would. The methodology is the point here, not the medical conclusions.

Code
import pandas as pd
df_lung = pd.read_csv('data/lung_cancer.csv')
print(df_lung.info())
df_lung.describe()
<class 'pandas.DataFrame'>
RangeIndex: 5000 entries, 0 to 4999
Data columns (total 30 columns):
 #   Column                   Non-Null Count  Dtype
---  ------                   --------------  -----
 0   age                      5000 non-null   int64
 1   gender                   5000 non-null   int64
 2   education_years          5000 non-null   int64
 3   income_level             5000 non-null   int64
 4   smoker                   5000 non-null   int64
 5   smoking_years            5000 non-null   int64
 6   cigarettes_per_day       5000 non-null   int64
 7   pack_years               5000 non-null   int64
 8   passive_smoking          5000 non-null   int64
 9   air_pollution_index      5000 non-null   int64
 10  occupational_exposure    5000 non-null   int64
 11  radon_exposure           5000 non-null   int64
 12  family_history_cancer    5000 non-null   int64
 13  copd                     5000 non-null   int64
 14  asthma                   5000 non-null   int64
 15  previous_tb              5000 non-null   int64
 16  chronic_cough            5000 non-null   int64
 17  chest_pain               5000 non-null   int64
 18  shortness_of_breath      5000 non-null   int64
 19  fatigue                  5000 non-null   int64
 20  bmi                      5000 non-null   int64
 21  oxygen_saturation        5000 non-null   int64
 22  fev1_x10                 5000 non-null   int64
 23  crp_level                5000 non-null   int64
 24  xray_abnormal            5000 non-null   int64
 25  exercise_hours_per_week  5000 non-null   int64
 26  diet_quality             5000 non-null   int64
 27  alcohol_units_per_week   5000 non-null   int64
 28  healthcare_access        5000 non-null   int64
 29  lung_cancer_risk         5000 non-null   int64
dtypes: int64(30)
memory usage: 1.1 MB
None
age gender education_years income_level smoker smoking_years cigarettes_per_day pack_years passive_smoking air_pollution_index ... bmi oxygen_saturation fev1_x10 crp_level xray_abnormal exercise_hours_per_week diet_quality alcohol_units_per_week healthcare_access lung_cancer_risk
count 5000.000000 5000.000000 5000.000000 5000.000000 5000.000000 5000.000000 5000.000000 5000.000000 5000.000000 5000.000000 ... 5000.000000 5000.000000 5000.000000 5000.000000 5000.000000 5000.000000 5000.000000 5000.0000 5000.000000 5000.000000
mean 54.566400 0.488200 11.510400 2.554400 0.454800 8.822200 6.685600 6.249000 0.347600 64.278800 ... 23.604800 96.089200 31.475800 4.593600 0.201400 2.580400 2.534800 5.8850 2.536800 0.248800
std 11.932731 0.499911 2.953987 0.982667 0.498003 11.651598 9.032888 9.964901 0.476256 19.479809 ... 3.937222 3.412385 5.213679 5.333566 0.401086 1.834178 0.985385 4.4239 0.984093 0.432361
min 18.000000 0.000000 5.000000 1.000000 0.000000 0.000000 0.000000 0.000000 0.000000 20.000000 ... 16.000000 85.000000 5.000000 0.000000 0.000000 0.000000 1.000000 0.0000 1.000000 0.000000
25% 47.000000 0.000000 10.000000 2.000000 0.000000 0.000000 0.000000 0.000000 0.000000 51.000000 ... 21.000000 95.000000 30.000000 1.000000 0.000000 1.000000 2.000000 2.0000 2.000000 0.000000
50% 55.000000 0.000000 11.000000 3.000000 0.000000 0.000000 0.000000 0.000000 0.000000 64.000000 ... 24.000000 97.000000 33.000000 3.000000 0.000000 2.000000 3.000000 6.0000 3.000000 0.000000
75% 62.000000 1.000000 13.250000 3.000000 1.000000 18.000000 14.000000 10.000000 1.000000 78.000000 ... 26.000000 99.000000 35.000000 7.000000 0.000000 4.000000 3.000000 9.0000 3.000000 0.000000
max 90.000000 1.000000 20.000000 5.000000 1.000000 52.000000 44.000000 60.000000 1.000000 130.000000 ... 37.000000 100.000000 37.000000 33.000000 1.000000 10.000000 5.000000 23.0000 5.000000 1.000000

8 rows × 30 columns

Code
import pandas as pd
definitions = pd.read_json('data/variables_descriptions.json')
definitions.head()
name description
0 age Age of the individual in years
1 gender Biological sex (0 = Female, 1 = Male)
2 education_years Total years of formal education completed
3 income_level Socioeconomic status on an ordinal scale (1 = ...
4 smoker Indicates whether the individual has a history...

1. Data Preparation

First, we split our dataset into training and testing sets. We use a stratified split on the target variable lung_cancer_risk to ensure both sets have a representative proportion of the positive class.

Code

from sklearn.model_selection import train_test_split

target_label = 'lung_cancer_risk'
y = df_lung[target_label].reset_index(drop=True)
X = df_lung.drop(target_label, axis=1).reset_index(drop=True)

2. Exploratory Data Analysis (EDA)

We visualize the correlations between all features. Highly correlated features can introduce multicollinearity, which might destabilize linear models like Logistic Regression.

Code

from statsmodels_utils import plot_correlation_heatmap

plot_correlation_heatmap(X, 'Correlation Heatmap all features')

Let’s take a closer look at the relationships between smoking-related variables, as they are typically highly correlated.

Code

import matplotlib.pyplot as plt
import seaborn as sns

smoking_corr_features = ['smoking_years', 'cigarettes_per_day', 'pack_years']
df_smoker = df_lung[smoking_corr_features]
g = sns.PairGrid(df_smoker)
g.map_lower(sns.scatterplot, alpha=0.5, edgecolor='k')
g.map_diag(sns.histplot, fill=True, color='gray')
g.map_upper(lambda x, y, **kwargs: plt.gca().annotate(
    f"{x.corr(y):.2f}",
    xy=(0.5, 0.5),
    xycoords='axes fraction',
    ha='center',
    va='center',
    fontsize=16,
    fontweight='bold'
))
g.add_legend()
plt.show()

We hold out 20% of the data for testing, stratifying on the target so both splits keep the same class mix. The histogram below confirms the class imbalance: roughly 3 in 4 patients are negative. This matters later — it is why we look beyond accuracy when evaluating, and why the regularized benchmark uses class weighting.

Code
seed = 42
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=seed, stratify=y)

plt.hist(y_train)
plt.show()

3. Multicollinearity and VIF

As seen in the correlation heatmap and pair plot, there are strong correlations among certain features (like pack_years and smoking_years). This indicates a potential multicollinearity issue, which we can quantify using the Variance Inflation Factor (VIF). We will recursively calculate VIF and drop features that exceed a threshold to ensure model stability.

Code
from statsmodels_utils import calculate_and_drop_vif, drop_highly_correlated_pairs


# X_train_clean_dropped_pairs  = drop_highly_correlated_pairs(X_train, y_train)
# X_train_clean, dropped_vif_features = calculate_and_drop_vif(X_train_clean_dropped_pairs)
X_train_clean, dropped_vif_features = calculate_and_drop_vif(X_train, verbose=True)
plot_correlation_heatmap(X_train_clean, 'Cleaned Correlation Heatmap')
Starting VIF Multicollinearity Check...

----------------------------------------

[DROPPED] 'pack_years'
  * VIF Score: 35.66 (Overlap: 97.2%)
  * Primary Suspects (Highest pairwise correlations with this feature):
      - fev1_x10 (0.95)
      - crp_level (0.92)
      - oxygen_saturation (0.91)
  * Current Runner-Ups (Watch these scores drop in the next loop!):
      - cigarettes_per_day: 14.60
      - smoking_years: 13.52
      - smoker: 12.77
------------------------------

[DROPPED] 'cigarettes_per_day'
  * VIF Score: 10.40 (Overlap: 90.4%)
  * Primary Suspects (Highest pairwise correlations with this feature):
      - smoker (0.81)
      - fev1_x10 (0.79)
      - crp_level (0.79)
  * Current Runner-Ups (Watch these scores drop in the next loop!):
      - smoker: 10.22
      - smoking_years: 9.84
      - fev1_x10: 7.19
------------------------------

[DROPPED] 'fev1_x10'
  * VIF Score: 6.33 (Overlap: 84.2%)
  * Primary Suspects (Highest pairwise correlations with this feature):
      - crp_level (0.88)
      - oxygen_saturation (0.86)
      - smoking_years (0.77)
  * Current Runner-Ups (Watch these scores drop in the next loop!):
      - crp_level: 5.36
      - smoking_years: 5.03
      - oxygen_saturation: 4.73
------------------------------
----------------------------------------
VIF Check Complete!
Features dropped: 3

4. Modeling: Logistic Regression

We train a Logistic Regression model (statsmodels Logit) and select features via backward elimination: starting from the full cleaned feature set, we iteratively drop the least significant feature (highest p-value above 0.05) and refit, until every remaining coefficient is statistically significant. This yields a compact, interpretable model where every retained feature earns its place.

Code
# go for stats model logit
from statsmodels_utils import stepwise_selection, backward_elimination
import statsmodels.api as sm

logit_model = sm.Logit
final_model_logit, features = backward_elimination(logit_model, X_train_clean, y_train, disp=False,
                                                 warn_convergence=False)
print(f"Optimal features: {features}")
final_model_logit.summary()
Starting Backward Elimination using Logit...
----------------------------------------
Dropped: 'smoker' (p-value: 0.9993)
Dropped: 'fatigue' (p-value: 0.9296)
Dropped: 'income_level' (p-value: 0.8643)
Dropped: 'chest_pain' (p-value: 0.8294)
Dropped: 'previous_tb' (p-value: 0.8061)
Dropped: 'education_years' (p-value: 0.7859)
Dropped: 'gender' (p-value: 0.6615)
Dropped: 'asthma' (p-value: 0.4131)
Dropped: 'alcohol_units_per_week' (p-value: 0.3346)
----------------------------------------
Backward Elimination Complete!
Remaining features: 17
Optimal features: ['age', 'smoking_years', 'passive_smoking', 'air_pollution_index', 'occupational_exposure', 'radon_exposure', 'family_history_cancer', 'copd', 'chronic_cough', 'shortness_of_breath', 'bmi', 'oxygen_saturation', 'crp_level', 'xray_abnormal', 'exercise_hours_per_week', 'diet_quality', 'healthcare_access']
Logit Regression Results
Dep. Variable: lung_cancer_risk No. Observations: 4000
Model: Logit Df Residuals: 3982
Method: MLE Df Model: 17
Date: Fri, 06 Mar 2026 Pseudo R-squ.: 0.9209
Time: 07:58:05 Log-Likelihood: -177.41
converged: True LL-Null: -2243.8
Covariance Type: nonrobust LLR p-value: 0.000
coef std err z P>|z| [0.025 0.975]
const 57.6956 8.755 6.590 0.000 40.536 74.855
age 0.1893 0.018 10.558 0.000 0.154 0.224
smoking_years 0.0910 0.018 4.998 0.000 0.055 0.127
passive_smoking 2.1159 0.334 6.330 0.000 1.461 2.771
air_pollution_index 0.0883 0.009 9.601 0.000 0.070 0.106
occupational_exposure 2.5526 0.375 6.800 0.000 1.817 3.288
radon_exposure 1.4291 0.434 3.293 0.001 0.579 2.280
family_history_cancer 2.8693 0.378 7.583 0.000 2.128 3.611
copd 4.4636 0.452 9.882 0.000 3.578 5.349
chronic_cough 4.1395 0.447 9.259 0.000 3.263 5.016
shortness_of_breath 0.8645 0.322 2.685 0.007 0.233 1.496
bmi -0.1079 0.037 -2.882 0.004 -0.181 -0.035
oxygen_saturation -0.8449 0.099 -8.527 0.000 -1.039 -0.651
crp_level 0.7062 0.067 10.602 0.000 0.576 0.837
xray_abnormal 6.0964 0.518 11.760 0.000 5.080 7.112
exercise_hours_per_week -0.5941 0.088 -6.735 0.000 -0.767 -0.421
diet_quality -0.7095 0.148 -4.788 0.000 -1.000 -0.419
healthcare_access -0.5990 0.142 -4.229 0.000 -0.877 -0.321


Possibly complete quasi-separation: A fraction 0.67 of observations can be
perfectly predicted. This might indicate that there is complete
quasi-separation. In this case some parameters will not be identified.

5. Model Evaluation

We evaluate on the held-out test set using a 0.5 probability threshold. Accuracy alone is misleading here — only ~25% of cases are positive, so a model that always predicts “no cancer” would score ~75% while being clinically useless.

The metric that matters most in this context is Recall (sensitivity): of the patients who actually have cancer, how many did we catch? In a screening setting, a false negative (a missed cancer) is far more costly than a false positive (an unnecessary follow-up test).

Code
y_pred_logit = final_model_logit.predict(sm.add_constant(X_test[features]))
y_pred_logit = y_pred_logit.apply(lambda x: 1 if x > 0.5 else 0)
from statsmodels_utils import plot_confusion_matrix

plot_folder = '../frontend/public/plots'
confusion_matrix_logit_image = f'{plot_folder}/confusion_logit.jpg'

plot_confusion_matrix(y_test, y_pred_logit, title="Confusion matrix Logit", plot_file=confusion_matrix_logit_image)

from sklearn.metrics import recall_score, accuracy_score, f1_score, precision_score

print("Accuracy: ", accuracy_score(y_test, y_pred_logit))
print("F1: ", f1_score(y_test, y_pred_logit))
print("Precision: ", precision_score(y_test, y_pred_logit))
print("Recall: ", recall_score(y_test, y_pred_logit))
Plot saved successfully to: ../frontend/public/plots/confusion_logit.jpg

Accuracy:  0.981
F1:  0.9620758483033932
Precision:  0.9563492063492064
Recall:  0.9678714859437751

5b. Benchmark: L1-Regularized Logistic Regression

As a robustness check we fit a second logistic regression, this time with an L1 (lasso) penalty via scikit-learn. Instead of dropping features by p-value, the L1 penalty performs feature selection implicitly by shrinking weak coefficients to exactly zero. class_weight='balanced' compensates for the class imbalance seen above.

Code
from sklearn.linear_model import LogisticRegression

# L1-regularized *logistic* regression (a true apples-to-apples comparison
# with the statsmodels Logit above; sklearn's Lasso is linear regression).
clf = LogisticRegression(penalty='l1', solver='liblinear', C=1.0,
                         class_weight='balanced', random_state=seed)
clf.fit(X_train, y_train)
clf_selected_features = X_train.columns[clf.coef_[0] != 0]
y_pred_lasso = clf.predict(X_test[X_train.columns])
print("Lasso selected features: ", clf_selected_features.tolist())
print("Stepwise selected features: ", features)
print("Common features: ", [features[i] for i in range(len(features)) if features[i] in clf_selected_features])
print("Logit Excluded features from Lasso: ", clf_selected_features.difference(features).tolist())
print("Lasso Excluded features from Logit: ", list(set(features).difference(clf_selected_features)))

Both approaches perform feature selection, but by different mechanisms: the L1 penalty shrinks uninformative coefficients exactly to zero during optimization, while backward elimination drops features by statistical significance. Comparing the two selected feature sets is a useful robustness check — features chosen by both methods are the ones to trust most.

Code
from statsmodels_utils import plot_confusion_matrix


confusion_matrix_lasso_image = f'{plot_folder}/confusion_lasso.jpg'
plot_confusion_matrix(y_test, y_pred_lasso, title="Confusion matrix Lasso", plot_file=confusion_matrix_lasso_image)
print("Accuracy: ", accuracy_score(y_test, y_pred_lasso))
print("F1: ", f1_score(y_test, y_pred_lasso))
print("Precision: ", precision_score(y_test, y_pred_lasso))
print("Recall: ", recall_score(y_test, y_pred_lasso))

Model Comparison Conclusion

When comparing the two models, we weigh Recall above raw accuracy for the clinical reasons outlined above. The backward-elimination Logit remains our deployment choice: its coefficients map one-to-one to interpretable log-odds contributions per feature, its feature set was vetted for multicollinearity beforehand (VIF), and its statistical summary (p-values, confidence intervals) supports the kind of scrutiny expected in a biostatistics setting.

The regularized model serves as a sanity check that a different selection mechanism lands on a similar signal — not as a competing deployment candidate.

6. Saving Model Artifacts

We will serialize and save the coefficients of our chosen Logistic Regression model, along with the list of selected features. This will allow us to reproduce the scoring and deploy the model for future predictions.

Code
# we save the coefficients for further predictions
import json
import os
out_dir = '../frontend/public'
logit_filename = 'logit_weights'
coeffs = final_model_logit.params.tolist()
os.makedirs(out_dir, exist_ok=True)
with open(f"{out_dir}/{logit_filename}.json", 'w') as f:
    json.dump(coeffs, f, indent=4)
    print(f"Model coefficients saved to '{out_dir}/{logit_filename}.json'")
    features_filename = f"{logit_filename}_features.json"
    # we save the features and their order for reference
    with open(f"{out_dir}/{features_filename}", 'w') as f:
        # the add the const as the first feature
        json.dump(['const',*features], f, indent=4)
    print(f"Features saved to '{out_dir}/{logit_filename}_features.json'")

7. Model Interpretability with SHAP

To understand how our Logistic Regression model makes its predictions, we will analyze the impact of individual features using SHAP (SHapley Additive exPlanations). This allows for both global and local interpretability.

Code
from statsmodels_utils import plot_coef


features_impact_image = f'{plot_folder}/features_impact.jpg'
plot_coef(final_model_logit, plot_file=features_impact_image)
Plot saved successfully to: ../frontend/public/plots/features_impact.jpg

Global Explanations: SHAP Summary

The SHAP summary plot gives us a bird’s-eye view of feature importance and the direction of the relationship between feature values and the prediction.

Code
from statsmodels_utils import plot_shap_summary

X_test_shap = X_test[features]

shap_summary_image = f'{plot_folder}/shap_summary.jpg'
with_constant = sm.add_constant(X_test_shap)
plot_shap_summary(
    final_model_logit,
    with_constant,
    features_to_drop=['const'],
    plot_type='violin',
    show=False,
    plot_file=shap_summary_image
)
Plot saved successfully to: ../frontend/public/plots/shap_summary.jpg

As seen in the plots above, the top three most influential features (crp_level, oxygen_saturation, and xray_abnormal) are all derived from clinical laboratory tests and imaging. This reinforces the clinical validity of the model, as these objective medical measurements strongly drive the predictions.

The SHAP bar plot aggregates the mean absolute SHAP values, providing a clear ranking of overall feature importance across the entire dataset.

Code
from statsmodels_utils import plot_shap_bar

shap_feature_importance_image = f'{plot_folder}/shap_feature_importance.jpg'
plot_shap_bar(final_model_logit, with_constant, plot_file=shap_feature_importance_image)
Plot saved successfully to: ../frontend/public/plots/shap_feature_importance.jpg

Local Explanations: Single Prediction

We can also use SHAP to explain individual predictions. Let’s look at a specific patient (e.g., index 99) to see how their specific feature values pushed the model’s prediction probability above or below the base value.

Code
from statsmodels_utils import plot_shap_waterfall_single

index_to_plot = 99
predicted_probability = final_model_logit.predict(with_constant.iloc[[index_to_plot]])
label = f"This case {"has" if predicted_probability.iloc[0] > 0.5 else "does not have"} cancer."

single_explanation_image = f'{plot_folder}/single_explanation.jpg'
plot_shap_waterfall_single(
    final_model_logit,
    with_constant,
    row_index=index_to_plot,
    title=f"{label} Single explanation",
    plot_file=single_explanation_image
)
Plot saved successfully to: ../frontend/public/plots/single_explanation.jpg

For the patient at index 99, we can see exactly how the model arrived at its positive prediction. Notably, having an abnormal X-ray (xray_abnormal = 1) is a primary driver, increasing the log-odds of the prediction. Other clinical features and medical history also significantly push the risk higher.

Feature Dependence

SHAP dependence plots show the marginal effect of a feature on the prediction. Let’s visualize how the model’s output depends on specific variables like age.

Code

from statsmodels_utils import plot_shap_dependence

shap_dependence_image = f'{plot_folder}/shap_dependence.jpg'
plot_shap_dependence(final_model_logit, with_constant, 'age', plot_file=shap_dependence_image)
Plot saved successfully to: ../frontend/public/plots/shap_feature_importance.jpg

Code
from statsmodels_utils import plot_top_shap_dependence

plot_shap_dependence_image = f'{plot_folder}/shap_dependence_top.jpg'
plot_top_shap_dependence(final_model_logit, with_constant, top_n=3, plot_file=plot_shap_dependence_image)
Automatically plotting 3 features: ['crp_level', 'oxygen_saturation', 'xray_abnormal']
Plot saved successfully to: ../frontend/public/plots/shap_dependence_top.jpg

Code
print(features)

plot_shap_dependence_label_image = f'{plot_folder}/shap_dependence_labels.jpg'
plot_top_shap_dependence(
    final_model_logit,
    with_constant,
    features=['air_pollution_index', 'passive_smoking', 'healthcare_access', 'family_history_cancer', 'radon_exposure'],
    plot_file=plot_shap_dependence_label_image
)
['age', 'smoking_years', 'passive_smoking', 'air_pollution_index', 'occupational_exposure', 'radon_exposure', 'family_history_cancer', 'copd', 'chronic_cough', 'shortness_of_breath', 'bmi', 'oxygen_saturation', 'crp_level', 'xray_abnormal', 'exercise_hours_per_week', 'diet_quality', 'healthcare_access']
Automatically plotting 5 features: ['air_pollution_index', 'passive_smoking', 'healthcare_access', 'family_history_cancer', 'radon_exposure']
Plot saved successfully to: ../frontend/public/plots/shap_dependence_labels.jpg

8. Conclusions

What we built. An end-to-end, interpretable classifier for lung cancer risk: multicollinearity pruning (VIF) → backward-elimination Logit → benchmark against an L1-regularized logistic regression → SHAP interpretability → coefficient export for in-browser inference.

Key findings.

  1. Clinical measurements dominate. The three most influential features — crp_level, oxygen_saturation, and xray_abnormal — are all objective lab/imaging results rather than self-reported behaviors. That pattern is what you would hope a credible medical model learns.
  2. Behavioral and environmental factors contribute coherent second-order signal. Smoking history, passive smoking, radon and occupational exposures all push risk in the medically expected direction (as seen in the SHAP summary and dependence plots).
  3. Feature selection was robust. Multicollinear duplicates (e.g. pack_years vs smoking_years) were removed by VIF before fitting, and the surviving feature set is statistically significant end to end.
  4. Interpretability comes free. Because the deployed model is a plain Logit, the browser app reproduces its exact probabilities with a dot product and a sigmoid — no server, no approximation, and every slider’s effect on the score is directly traceable to a coefficient.

Limitations & future work.

  • The dataset is synthetic; coefficients (some odds ratios are implausibly large) and metrics should not be read as clinical findings.
  • Single train/test split — cross-validated metrics with confidence intervals would harden the evaluation.
  • The 0.5 decision threshold is arbitrary for a screening context; threshold tuning against a recall target (or a cost-sensitive analysis) is the natural next step.
  • No probability calibration check yet (reliability curve / Brier score) — worth adding since the app surfaces raw probabilities to users.