Lung Cancer Risk Predictor

Data Science

Healthcare

An interactive, end-to-end data science project demonstrating the lifecycle of a machine learning model from exploratory data analysis (EDA) to a fully deployed frontend application.

Author

Miguel R.

Published

July 4, 2026

An interactive, end-to-end data science project demonstrating the lifecycle of a machine learning model from exploratory data analysis (EDA) to a fully deployed frontend application.

Live Demo: Lung Cancer Risk Predictor

This project predicts the estimated probability of lung cancer based on various health, behavioral, and environmental features using a Logistic Regression model.

** Medical Disclaimer:** This application and its underlying model are for educational and demonstrative purposes only. It is not intended to be a substitute for professional medical advice, diagnosis, or treatment. Always seek the advice of your physician or other qualified health provider with any questions you may have regarding a medical condition.

Project Goals

The objective of this repository is to serve as a comprehensive portfolio piece bridging the gap between Data Science and Web Development. It demonstrates: 1. Data Science & Machine Learning: Rigorous feature selection, model training, and interpretability using Python (Pandas, Scikit-Learn, SHAP). 2. Frontend Engineering: Translating static model weights into a live, interactive React application using TypeScript and Vite.

The Data Science Process

All research, data preprocessing, and modeling were conducted in Jupyter Notebooks (/notebooks).

1. Exploratory Data Analysis & Feature Selection

We began by analyzing the dataset to understand the relationships between different variables and the target (lung_cancer_risk).

View the EDA Notebook View the Feature Selection Notebook

2. Model Training (Logistic Regression vs L1-Regularized Logistic)

We evaluated multiple models. Logistic Regression was ultimately chosen for its interpretability and ease of extraction (weights and intercept) for the frontend application.

View the Modeling Notebook

Logit Confusion Matrix Lasso Confusion Matrix

3. Model Interpretability (SHAP Values)

To ensure the model makes decisions based on logical medical grounds, we used SHAP (SHapley Additive exPlanations) to explain individual predictions and global feature importance.

Global Feature Importance: SHAP Feature Importance

SHAP Summary Plot:

Single Prediction Explanation: Single Explanation

The Frontend Application

The frontend is a React application built with Vite and TypeScript. It does not rely on a Python backend API. Instead, it computes the logistic regression probability entirely in the browser using the extracted logit_weights.json and logit_weights_features.json.

Key Features

Real-time Inference: The risk score updates instantly as sliders and checkboxes are manipulated.
Dynamic Color Scaling: The risk indicator transitions smoothly from green (low risk) to yellow to red (high risk).
Data Context Integration: The custom slider component overlays the actual histogram distribution of the dataset (via feature_distributions.json), allowing the user to see where their selected value falls relative to the population.

Repository Structure

├── notebooks/                   # Data Science Research
│   ├── data/                    # Raw datasets
│   ├── behavioral.ipynb         # EDA and Behavioral Analysis
│   ├── features_selection.ipynb # Feature Engineering
│   └── modeling.ipynb           # Model Training and Evaluation
├── public/                      # Static Assets
│   ├── plots/                   # Exported charts from notebooks
│   ├── feature_distributions.json # Data histograms for the UI
│   ├── logit_weights.json         # Extracted model weights
│   └── logit_weights_features.json# Feature names mapping
├── src/                         # React Frontend Code
│   ├── App.tsx                  # Main application logic
│   └── Slider.tsx               # Custom slider with integrated histogram
└── ...