A Dual Approach to Predict & Profile High-Risk Diabetic Patients

Data Mining Project Presentation

By Aavash Lamichhane & Prayash Shakya

1. The Challenge of Diabetic Readmissions

➤ High Stakes: Hospital readmissions lead to high costs & poor patient outcomes.
➤ Elevated Risk: Diabetic patients face nearly double the 30-day readmission risk (15.3% vs 8.4%).
➤ Our Solution: A dual Machine Learning approach for robust prediction & actionable patient profiling.

2. Project Objectives

Predictive Modeling

Accurately classify high-risk patients using supervised learning.

Patient Profiling

Identify distinct patient subgroups via unsupervised clustering.

Actionable Insights

Enable targeted, data-driven interventions to reduce readmissions.

3. Methodology Workflow

1. Data Preprocessing: Cleaning, handling missing values, and deduplication.
2. Feature Engineering: Creating new, informative features like `prior_utilization`.
3. Exploratory Analysis (EDA): Visualizing relationships using plots to find key trends.
4. Unsupervised Learning: Applying K-Means clustering to profile patients.
5. Supervised Learning: Training, tuning, and evaluating multiple classification models.
6. Ensemble Modeling: Combining top models for superior predictive accuracy.

4. Data Preprocessing

➤ Initial Data: 101k records, 50 features.
➤ Challenges: High missing data, low-variance features, duplicates.
➤ Result: A clean, robust dataset of 67k unique patient records.
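
The deduplication step can be sketched as follows. This is a minimal illustration, not the project's actual code: the field names `patient_nbr` and `encounter_id` are assumed, and keeping the earliest encounter per patient is one plausible rule that the slides do not confirm.

```python
def first_encounter_per_patient(records):
    """Keep one row per patient: the row with the smallest encounter_id.

    Assumption: `patient_nbr` identifies the patient and `encounter_id`
    grows over time. Both names are hypothetical here.
    """
    kept = {}
    for row in records:
        pid = row["patient_nbr"]
        if pid not in kept or row["encounter_id"] < kept[pid]["encounter_id"]:
            kept[pid] = row
    return list(kept.values())


# Toy rows, made up for illustration only.
records = [
    {"patient_nbr": 1, "encounter_id": 20, "readmitted": "NO"},
    {"patient_nbr": 1, "encounter_id": 10, "readmitted": "<30"},
    {"patient_nbr": 2, "encounter_id": 30, "readmitted": ">30"},
]
unique_records = first_encounter_per_patient(records)
```

Keeping a single encounter per patient makes the rows statistically independent, which is the usual reason for reducing encounter-level data to unique patients.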

5. Feature Engineering & Encoding

We created new features and transformed existing ones to capture deeper clinical insights.

New Features Created

`numchange` (medication changes)
`comorbidity_score` (other diagnoses)
`prior_utilization` (previous visits)

Smart Encoding

Grouped high-cardinality diagnosis codes into 9 clinical categories.
Encoded lab results based on clinical meaning (e.g., 'Normal', 'High').
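
A rough sketch of how such features might be derived is shown below. Everything beyond the feature name `prior_utilization` is an assumption: the input column names, the ICD-9 ranges (one commonly used nine-way grouping, which the slides do not spell out), and the lab-result labels are illustrative only.

```python
def prior_utilization(row):
    """Sum of previous visits. The three input column names are assumed."""
    return (row["number_outpatient"]
            + row["number_emergency"]
            + row["number_inpatient"])


# Assumed ICD-9 ranges for a nine-way grouping (7 ranges + Diabetes + Other).
_ICD9_RANGES = [
    ("Circulatory", 390, 459),
    ("Respiratory", 460, 519),
    ("Digestive", 520, 579),
    ("Genitourinary", 580, 629),
    ("Musculoskeletal", 710, 739),
    ("Injury", 800, 999),
    ("Neoplasms", 140, 239),
]
# Symptom codes that are often folded into the matching body system.
_ICD9_SINGLE = {785: "Circulatory", 786: "Respiratory",
                787: "Digestive", 788: "Genitourinary"}


def diagnosis_group(code):
    """Map an ICD-9 code string to a coarse clinical category."""
    try:
        n = int(float(code))
    except (TypeError, ValueError, OverflowError):
        return "Other"  # V/E codes, missing values, placeholders
    if n == 250:
        return "Diabetes"
    if n in _ICD9_SINGLE:
        return _ICD9_SINGLE[n]
    for name, low, high in _ICD9_RANGES:
        if low <= n <= high:
            return name
    return "Other"


# Hypothetical raw lab values mapped to clinically meaningful labels.
LAB_RESULT_LABELS = {"Norm": "Normal", ">7": "High", ">8": "High",
                     "None": "Not measured"}
```

For example, `diagnosis_group("250.83")` falls into the diabetes category, while a code such as `"V45"` is not numeric and ends up in the catch-all group.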

6. EDA: Readmission Distribution

7. EDA: Demographic Insights

8. Unsupervised Learning: Finding the Profiles

Elbow plot indicating optimal k=4.

PCA visualization of the 4 clusters.
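
The elbow search and the 2-D projection can be sketched with scikit-learn as below. The data is synthetic (`make_blobs`), and the scaling step, the range of k, and all parameter values are assumptions rather than the project's settings.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the patient feature matrix (not project data).
X, _ = make_blobs(n_samples=400, centers=4, n_features=6, random_state=0)
X_scaled = StandardScaler().fit_transform(X)

# Elbow method: record within-cluster sum of squares (inertia) for each k.
inertias = {}
for k in range(1, 9):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X_scaled)
    inertias[k] = model.inertia_

# Fit the chosen k, then project to two principal components for plotting.
labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X_scaled)
coords = PCA(n_components=2).fit_transform(X_scaled)
```

Plotting `inertias` against k gives the elbow curve, and a scatter of `coords` colored by `labels` gives the cluster view.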

9. Unsupervised Learning: Profile Comparison

10. The Four Patient Profiles

Clustering revealed four clinically meaningful profiles with vastly different readmission risks.

Profile 3: High-Risk

18.9%

"Frequent Flyers"

Profile 0: High-Acuity

10.9%

"Medically Complex"

Profile 1: High Comorbidity

9.3%

"Stable Chronic"

Profile 2: Low-Risk

6.3%

"Healthiest Group"

11. Supervised Learning: The Prediction Pipeline

➤ Problem: Binary classification (readmitted ≤30 days).
➤ Imbalance Handling: Used SMOTE to oversample the minority class.
➤ Feature Selection: Filtered predictors using `ExtraTreesClassifier`.
➤ Model Training: Evaluated 8 models (RF, KNN, Boosting, etc.).
➤ Optimization: Tuned hyperparameters with 5-fold `GridSearchCV`.
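
To make the imbalance-handling step concrete, here is a simplified sketch of the idea behind SMOTE: each synthetic minority sample is placed on the line segment between a real minority point and one of its nearest minority neighbours. This is a teaching sketch using only the standard library, not the implementation the project used; the function name `smote_like` and its parameters are hypothetical.

```python
import math
import random


def smote_like(minority, n_new, k=3, seed=0):
    """Create n_new synthetic points by interpolating between a minority
    point and one of its k nearest minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest neighbours of `base` among the other minority points.
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: math.dist(base, p),
        )[:k]
        other = rng.choice(neighbours)
        gap = rng.random()  # position along the segment, in [0, 1)
        synthetic.append(
            tuple(b + gap * (o - b) for b, o in zip(base, other))
        )
    return synthetic


# Toy minority class (made up): four corners of the unit square.
minority = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
new_points = smote_like(minority, n_new=5)
```

Oversampling of this kind is usually applied to the training split only, so that the evaluation data contains no synthetic rows.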

12. Supervised Learning: Model Performance

13. Key Predictors of Readmission

Tree-based models agreed on the most important predictive features.

14. Pushing for Perfection: Advanced Ensembles

We combined our top models (RF, KNN, LightGBM) to create a "super model."

Voting Classifier

A democratic approach that averages the models' 'soft' probability scores.

Stacking Classifier

A sophisticated model where a meta-learner optimally combines the base model predictions.
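
Both ensembles can be sketched with scikit-learn as below. The data is synthetic, `GradientBoostingClassifier` is swapped in as a stand-in for LightGBM so the example needs only scikit-learn, and the logistic-regression meta-learner is an assumption because the slides do not name one.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (
    GradientBoostingClassifier,
    RandomForestClassifier,
    StackingClassifier,
    VotingClassifier,
)
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic binary problem (not project data).
X, y = make_classification(n_samples=300, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

base_models = [
    ("rf", RandomForestClassifier(n_estimators=50, random_state=0)),
    ("knn", KNeighborsClassifier(n_neighbors=5)),
    ("gb", GradientBoostingClassifier(random_state=0)),  # LightGBM stand-in
]

# Soft voting: average the predicted class probabilities of the base models.
voting = VotingClassifier(estimators=base_models, voting="soft")
voting.fit(X_train, y_train)

# Stacking: a meta-learner is trained on cross-validated base predictions.
stacking = StackingClassifier(
    estimators=base_models,
    final_estimator=LogisticRegression(),  # assumed meta-learner
    cv=5,
)
stacking.fit(X_train, y_train)

voting_proba = voting.predict_proba(X_test)
stacking_proba = stacking.predict_proba(X_test)
```

The main design difference is that voting weights every base model equally, while stacking lets the meta-learner decide how much to trust each one.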

15. Ensemble Performance: The Winning Models

Voting Ensemble ROC Curve

Stacking Ensemble ROC Curve

16. Final Results: The Stacking Ensemble

The Stacking Classifier achieved exceptional, balanced performance.

Accuracy: 98%

Precision: 98%

Recall: 98%

F1-Score: 98%

ROC-AUC: 0.9985
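
For reference, the threshold-based metrics above follow directly from confusion-matrix counts, as in this short sketch. The counts used here are made up for illustration and are not the project's results; the function name is hypothetical.

```python
def classification_metrics(tp, fp, fn, tn):
    """Accuracy, precision, recall, and F1 from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp)   # share of predicted positives that are real
    recall = tp / (tp + fn)      # share of real positives that are found
    f1 = 2 * precision * recall / (precision + recall)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Illustrative counts only.
metrics = classification_metrics(tp=40, fp=10, fn=10, tn=40)
```

ROC-AUC is different in kind: it is computed from the predicted probabilities across all thresholds rather than from a single confusion matrix.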

17. Limitations & Future Directions

Limitations

Data Age: Dataset is from 1999-2008.
Generalizability: Needs validation on other datasets.
Feature Gaps: Lacks socioeconomic data.

Future Research

Validate on contemporary data.
Incorporate social determinants of health.
Conduct clinical intervention trials.

18. Conclusion

✔ Achieved 98% accuracy with a Stacking Ensemble model.
✔ Identified 4 actionable patient profiles with risks from 6.3% to 18.9%.
✔ Confirmed prior healthcare utilization as the strongest predictor.
✔ Provided a powerful framework to improve patient outcomes.

Thank You

Questions?
