A Dual Approach to Predict & Profile High-Risk Diabetic Patients

Data Mining Project Presentation

By Aavash Lamichhane & Prayash Shakya

1. The Challenge of Diabetic Readmissions

  • High Stakes: Hospital readmissions lead to high costs & poor patient outcomes.
  • Elevated Risk: Diabetic patients face nearly double the 30-day readmission risk of non-diabetic patients (15.3% vs. 8.4%).
  • Our Solution: A dual Machine Learning approach for robust prediction & actionable patient profiling.

2. Project Objectives

Predictive Modeling

Accurately classify high-risk patients using supervised learning.

Patient Profiling

Identify distinct patient subgroups via unsupervised clustering.

Actionable Insights

Enable targeted, data-driven interventions to reduce readmissions.

3. Methodology Workflow

  1. Data Preprocessing: Cleaning, handling missing values, and deduplication.
  2. Feature Engineering: Creating new, informative features like `prior_utilization`.
  3. Exploratory Analysis (EDA): Visualizing relationships using plots to find key trends.
  4. Unsupervised Learning: Applying K-Means clustering to profile patients.
  5. Supervised Learning: Training, tuning, and evaluating multiple classification models.
  6. Ensemble Modeling: Combining top models for superior predictive accuracy.

4. Data Preprocessing

  • Initial Data: 101k records, 50 features.
  • Challenges: High missing data, low-variance features, duplicates.
  • Result: A clean, robust dataset of 67k unique patient records.
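A minimal sketch of this cleaning step, assuming pandas. The column names (`patient_id`, `encounter_id`), the 40% missing-value threshold, and the choice to keep each patient's earliest encounter are illustrative assumptions, not details taken from the slides. Constant columns stand in for "low-variance" features here.

```python
import numpy as np
import pandas as pd

def clean_encounters(df, id_col="patient_id", order_col="encounter_id",
                     max_missing=0.4):
    """Drop sparse and constant columns, then keep one record per patient."""
    # Drop columns whose share of missing values exceeds the threshold.
    df = df[df.columns[df.isna().mean() <= max_missing]]
    # Drop columns with a single distinct value (nothing to learn from).
    df = df.loc[:, df.nunique(dropna=False) > 1]
    # Keep each patient's earliest encounter (assumed rule) to remove duplicates.
    return (df.sort_values(order_col)
              .drop_duplicates(subset=id_col, keep="first")
              .reset_index(drop=True))

# Toy data, not the project's dataset.
raw = pd.DataFrame({
    "encounter_id": [3, 1, 2, 4],
    "patient_id": [10, 10, 20, 30],
    "sparse_col": [np.nan, np.nan, np.nan, 80.0],   # 75% missing
    "constant_col": ["No", "No", "No", "No"],       # single value
    "time_in_hospital": [5, 2, 7, 1],
})
clean = clean_encounters(raw)
```

On the toy frame, the sparse and constant columns are dropped and patient 10 keeps only encounter 1.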

5. Feature Engineering & Encoding

We created new features and transformed existing ones to capture deeper clinical insights.

New Features Created

  • `numchange` (medication changes)
  • `comorbidity_score` (other diagnoses)
  • `prior_utilization` (previous visits)
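One way the three engineered features could be derived, sketched with pandas. The slides give only the feature names and short descriptions, so the source columns (`MED_COLS`, `DIAG_COLS`, `VISIT_COLS`) and the exact counting rules below are assumptions made for illustration.

```python
import pandas as pd

# Hypothetical source columns; the real dataset's names may differ.
MED_COLS = ["metformin", "insulin"]
DIAG_COLS = ["diag_2", "diag_3"]
VISIT_COLS = ["n_outpatient", "n_emergency", "n_inpatient"]

def add_features(df):
    out = df.copy()
    # numchange: number of medications whose dose went up or down.
    out["numchange"] = out[MED_COLS].isin(["Up", "Down"]).sum(axis=1)
    # comorbidity_score: number of recorded secondary diagnoses.
    out["comorbidity_score"] = out[DIAG_COLS].notna().sum(axis=1)
    # prior_utilization: total prior visits across care settings.
    out["prior_utilization"] = out[VISIT_COLS].sum(axis=1)
    return out

toy = pd.DataFrame({
    "metformin": ["Up", "No"],
    "insulin": ["Down", "Steady"],
    "diag_2": ["428", None],
    "diag_3": ["250", None],
    "n_outpatient": [1, 0],
    "n_emergency": [0, 0],
    "n_inpatient": [2, 0],
})
features = add_features(toy)
```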

Smart Encoding

  • Grouped high-cardinality diagnosis codes into 9 clinical categories.
  • Encoded lab results based on clinical meaning (e.g., 'Normal', 'High').
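The diagnosis grouping can be sketched as a lookup over ICD-9 code ranges. The slides do not list the nine categories; the ranges below follow a grouping commonly applied to ICD-9 diagnosis codes in readmission studies and should be read as an assumption, as should the choice to send missing, V, and E codes to "Other".

```python
# Assumed category ranges (inclusive) over the integer part of an ICD-9 code.
ICD9_RANGES = [
    ("Circulatory", [(390, 459), (785, 785)]),
    ("Respiratory", [(460, 519), (786, 786)]),
    ("Digestive", [(520, 579), (787, 787)]),
    ("Injury", [(800, 999)]),
    ("Musculoskeletal", [(710, 739)]),
    ("Genitourinary", [(580, 629), (788, 788)]),
    ("Neoplasms", [(140, 239)]),
]

def group_icd9(code):
    """Map a raw ICD-9 code string to one of nine broad categories."""
    # Missing values and supplementary V/E codes fall into "Other".
    if not isinstance(code, str) or not code or code[0] in "VE?":
        return "Other"
    try:
        value = float(code)
    except ValueError:
        return "Other"
    # All 250.xx codes are diabetes.
    if 250 <= value < 251:
        return "Diabetes"
    whole = int(value)
    for name, ranges in ICD9_RANGES:
        if any(lo <= whole <= hi for lo, hi in ranges):
            return name
    return "Other"

examples = ["250.83", "428", "486", "574", "820", "715", "599", "162", "V57"]
grouped = {code: group_icd9(code) for code in examples}
```

Collapsing hundreds of raw codes into a handful of categories keeps one-hot encoding manageable and gives each category enough patients to learn from.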

6. EDA: Readmission Distribution

Readmission Distribution Plot

7. EDA: Demographic Insights

Race vs Readmission Plot
Age vs Readmission Rate Plot

8. Unsupervised Learning: Finding the Profiles

Elbow Plot

Elbow plot indicating optimal k=4.

PCA Cluster Plot

PCA visualization of the 4 clusters.
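A sketch of how an elbow curve and a 2-D PCA projection are typically produced with scikit-learn. Synthetic blobs stand in for the patient features, and the scaling step and range of k values are assumptions rather than the project's actual settings.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the patient feature matrix.
X, _ = make_blobs(n_samples=400, centers=4, n_features=6, random_state=0)
X_scaled = StandardScaler().fit_transform(X)

# Elbow method: record within-cluster sum of squares (inertia) for each k.
inertias = {}
for k in range(1, 9):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X_scaled)
    inertias[k] = km.inertia_

# Fit the chosen model (k=4, as on the slide) and project to 2-D for plotting.
labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X_scaled)
coords = PCA(n_components=2).fit_transform(X_scaled)
```

Plotting `inertias` against k gives the elbow curve; a scatter of `coords` colored by `labels` gives the cluster view.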

9. Unsupervised Learning: Profile Comparison

Radar Plot Profile 0
Radar Plot Profile 1
Radar Plot Profile 2
Radar Plot Profile 3
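Radar plots like these are usually drawn from a table of per-cluster feature averages. The sketch below, assuming pandas and hypothetical column names (`cluster`, `readmitted_30d`), builds that table on toy data and rescales each feature to 0-1 so the axes are comparable.

```python
import pandas as pd

# Toy data: one row per patient with an assigned cluster label.
patients = pd.DataFrame({
    "cluster": [0, 0, 1, 1, 2, 2, 3, 3],
    "prior_utilization": [1, 3, 0, 2, 0, 0, 6, 8],
    "comorbidity_score": [2, 2, 2, 2, 0, 1, 1, 2],
    "readmitted_30d": [0, 1, 0, 0, 0, 0, 1, 1],
})

# Mean of every column per cluster; the mean of the 0/1 outcome
# is that cluster's readmission rate.
profile_table = patients.groupby("cluster").mean()

# Min-max scale the feature columns so each radar axis spans 0 to 1.
feature_means = profile_table.drop(columns="readmitted_30d")
radar_axes = (feature_means - feature_means.min()) / (
    feature_means.max() - feature_means.min()
)
```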

10. The Four Patient Profiles

Clustering revealed four clinically meaningful profiles with vastly different readmission risks.

  • Profile 3: High-Risk ("Frequent Flyers"), 18.9% readmission risk
  • Profile 0: High-Acuity ("Medically Complex"), 10.9% readmission risk
  • Profile 1: High Comorbidity ("Stable Chronic"), 9.3% readmission risk
  • Profile 2: Low-Risk ("Healthiest Group"), 6.3% readmission risk

11. Supervised Learning: The Prediction Pipeline

  1. Problem: Binary classification (readmitted within 30 days or not).
  2. Imbalance Handling: Used SMOTE to oversample the minority class.
  3. Feature Selection: Filtered predictors using `ExtraTreesClassifier`.
  4. Model Training: Evaluated 8 models (RF, KNN, Boosting, etc.).
  5. Optimization: Tuned hyperparameters with 5-fold `GridSearchCV`.
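The steps above can be sketched end to end with scikit-learn on synthetic data. Several pieces are assumptions: `smote_like` is a simplified NumPy illustration of SMOTE's interpolation idea, not the library implementation the project used; importance-based selection via `SelectFromModel` is one plausible reading of "filtered predictors using `ExtraTreesClassifier`"; and the parameter grid is invented for the example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import NearestNeighbors
from sklearn.pipeline import Pipeline

def smote_like(X, y, minority=1, k=5, seed=0):
    """Balance classes by interpolating between minority-class neighbors."""
    rng = np.random.default_rng(seed)
    X_min = X[y == minority]
    n_new = int((y != minority).sum() - len(X_min))
    if n_new <= 0:
        return X, y
    # k nearest minority neighbors of each minority point (self excluded).
    neigh = NearestNeighbors(n_neighbors=k).fit(X_min).kneighbors(
        return_distance=False)
    base = rng.integers(0, len(X_min), size=n_new)
    partner = neigh[base, rng.integers(0, k, size=n_new)]
    gap = rng.random((n_new, 1))
    # Each synthetic point lies on the segment between a point and a neighbor.
    synthetic = X_min[base] + gap * (X_min[partner] - X_min[base])
    return np.vstack([X, synthetic]), np.concatenate(
        [y, np.full(n_new, minority)])

# Imbalanced synthetic stand-in for the readmission data.
X, y = make_classification(n_samples=600, n_features=20, n_informative=5,
                           weights=[0.9, 0.1], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, stratify=y, test_size=0.25, random_state=0)

# Oversample the training split only, so the test set keeps real proportions.
X_bal, y_bal = smote_like(X_tr, y_tr)

pipe = Pipeline([
    ("select", SelectFromModel(
        ExtraTreesClassifier(n_estimators=50, random_state=0))),
    ("clf", RandomForestClassifier(random_state=0)),
])
param_grid = {"clf__n_estimators": [50, 100], "clf__max_depth": [5, None]}
grid = GridSearchCV(pipe, param_grid, cv=5, scoring="roc_auc")
grid.fit(X_bal, y_bal)
test_auc = grid.score(X_te, y_te)
```

Scoring on the untouched `X_te` keeps synthetic samples out of the evaluation, so `test_auc` reflects performance at the original class balance.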

12. Supervised Learning: Model Performance

Random Forest ROC Curve

13. Key Predictors of Readmission

Tree-based models agreed on the most important predictive features.

Feature Importance Plot
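A sketch of how agreement between tree-based models can be checked: fit two models, rank features by `feature_importances_`, and compare the top-ranked sets. The data is synthetic, and the two model types and the cutoff of three features are illustrative assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier

# With shuffle=False the three informative features are f0, f1, f2.
X, y = make_classification(n_samples=500, n_features=8, n_informative=3,
                           n_redundant=0, shuffle=False, random_state=0)
names = [f"f{i}" for i in range(X.shape[1])]

def top_features(model, k=3):
    """Fit the model and return the k features with highest importance."""
    model.fit(X, y)
    order = np.argsort(model.feature_importances_)[::-1][:k]
    return [names[i] for i in order]

rf_top = top_features(RandomForestClassifier(n_estimators=100, random_state=0))
et_top = top_features(ExtraTreesClassifier(n_estimators=100, random_state=0))

# Features that both models place in their top k.
shared = set(rf_top) & set(et_top)
```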

14. Pushing for Perfection: Advanced Ensembles

We combined our top models (RF, KNN, LightGBM) to create a "super model."

Voting Classifier

A democratic approach that averages the models' 'soft' probability scores.

Stacking Classifier

A sophisticated model where a meta-learner optimally combines the base model predictions.
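Both ensembles are available in scikit-learn; the sketch below shows their general shape on synthetic data. `GradientBoostingClassifier` is swapped in for LightGBM to keep the example dependency-free, and the logistic-regression meta-learner is an assumption, since the slides do not name one.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (GradientBoostingClassifier,
                              RandomForestClassifier, StackingClassifier,
                              VotingClassifier)
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=500, n_features=12, n_informative=5,
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

def base_models():
    """Fresh, unfitted copies of the three base learners."""
    return [
        ("rf", RandomForestClassifier(n_estimators=50, random_state=0)),
        ("knn", KNeighborsClassifier(n_neighbors=5)),
        ("gb", GradientBoostingClassifier(random_state=0)),  # LightGBM stand-in
    ]

# Soft voting: average the base models' predicted class probabilities.
voting = VotingClassifier(estimators=base_models(), voting="soft")
voting.fit(X_tr, y_tr)

# Stacking: a meta-learner is trained on cross-validated base predictions.
stacking = StackingClassifier(estimators=base_models(),
                              final_estimator=LogisticRegression(), cv=5)
stacking.fit(X_tr, y_tr)

voting_proba = voting.predict_proba(X_te)
stacking_acc = stacking.score(X_te, y_te)
```

Soft voting weights every base model equally, while stacking lets the meta-learner decide how much to trust each one.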

15. Ensemble Performance: The Winning Models

Voting Ensemble ROC Curve

Stacking Ensemble ROC Curve

16. Final Results: The Stacking Ensemble

The Stacking Classifier achieved exceptional, balanced performance.

  • Accuracy: 98%
  • Precision: 98%
  • Recall: 98%
  • F1-Score: 98%
  • ROC-AUC: 0.9985
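For reference, the five metrics can be computed with `sklearn.metrics` as sketched below. The labels and scores are a six-patient toy example, not the project's predictions, and the 0.5 decision threshold is an assumption.

```python
from sklearn.metrics import (accuracy_score, f1_score, precision_score,
                             recall_score, roc_auc_score)

def summarize(y_true, y_pred, y_score):
    """Return the five headline metrics for a binary classifier."""
    return {
        "accuracy": accuracy_score(y_true, y_pred),
        "precision": precision_score(y_true, y_pred),
        "recall": recall_score(y_true, y_pred),
        "f1": f1_score(y_true, y_pred),
        # ROC-AUC uses the predicted probabilities, not the hard labels.
        "roc_auc": roc_auc_score(y_true, y_score),
    }

y_true = [0, 0, 1, 1, 1, 0]
y_score = [0.1, 0.6, 0.8, 0.7, 0.4, 0.2]    # predicted P(readmitted)
y_pred = [int(s >= 0.5) for s in y_score]   # threshold at 0.5
report = summarize(y_true, y_pred, y_score)
```

Here two true positives, one false positive, and one false negative give precision, recall, and F1 of 2/3 each, accuracy of 4/6, and a ROC-AUC of 8/9.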

17. Limitations & Future Directions

Limitations

  • Data Age: Dataset is from 1999-2008.
  • Generalizability: Needs validation on other datasets.
  • Feature Gaps: Lacks socioeconomic data.

Future Research

  • Validate on contemporary data.
  • Incorporate social determinants of health.
  • Conduct clinical intervention trials.

18. Conclusion

  • Achieved 98% accuracy with a Stacking Ensemble model.
  • Identified 4 actionable patient profiles with risks from 6.3% to 18.9%.
  • Confirmed prior healthcare utilization as the strongest predictor.
  • Provided a framework that pairs risk prediction with patient profiling to target interventions and improve patient outcomes.

Thank You

Questions?
