A Machine Learning Approach to Predicting Future Onset of Type II Diabetes

Authors

  • Preston Badger, Concordia International School Hanoi
  • Hamid Abuwarda, Yale University

Abstract

Type 2 Diabetes is a critical global health concern. This study aims to improve prediction of its onset using the National Health and Nutrition Examination Survey (NHANES) dataset. It assesses how accurately machine learning models predict Type 2 Diabetes onset in the U.S., using NHANES data from 1988 to 2018 and a broad range of examination, dietary, questionnaire, and demographic variables. The models compared are Logistic Regression, Support Vector Machines (SVM), Random Forest, XGBoost, and an ensemble model that combines them, with critical variables incorporated through feature selection.
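
The abstract does not state how the ensemble combines the four base models. One common option is soft voting, which averages each model's predicted probability for the positive class. The sketch below is an illustration under that assumption only, not the study's implementation; the function name `soft_vote` and all probability values are hypothetical.

```python
from typing import Dict, List


def soft_vote(model_probs: Dict[str, List[float]]) -> List[float]:
    """Average the positive-class probabilities of several models, per patient."""
    if not model_probs:
        raise ValueError("at least one model is required")
    lengths = {len(p) for p in model_probs.values()}
    if len(lengths) != 1:
        raise ValueError("every model must score the same number of patients")
    n_models = len(model_probs)
    # zip(*...) walks the lists in parallel, yielding one tuple of scores per patient.
    return [sum(scores) / n_models for scores in zip(*model_probs.values())]


# Hypothetical probabilities for three patients (not taken from the study).
probs = {
    "logistic_regression": [0.2, 0.6, 0.7],
    "svm": [0.3, 0.5, 0.9],
    "random_forest": [0.1, 0.8, 0.8],
    "xgboost": [0.2, 0.7, 0.8],
}
ensemble_probs = soft_vote(probs)
```

Averaging probabilities rather than hard labels lets a confident model outweigh several uncertain ones, which is one reason soft voting is often preferred when the base models produce usable probability estimates.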

The models were evaluated on ROC-AUC, Precision, Recall, and F1 Score. In Case I, targeting Diabetic and Non-Diabetic patients, Logistic Regression achieved an AUC of 0.662649, SVM 0.739073, Random Forest 0.865298, XGBoost 0.856807, and the Ensemble model 0.856879. In Case II, emphasizing Undiagnosed Diabetic and Pre-Diabetic patients, Logistic Regression achieved an AUC of 0.837121, SVM 0.851885, Random Forest 0.891081, XGBoost 0.892435, and the Ensemble model 0.885736. When evaluated on the 20% of data held out for testing in Cases I and II, the models demonstrated high efficacy, particularly the Random Forest and XGBoost models, which exhibited nearly perfect ROC-AUC scores in Case I.
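
For readers unfamiliar with the four metrics named above, the following minimal sketch computes them from scratch for a binary classifier. It uses the standard definitions; the labels, scores, and the 0.5 decision threshold are hypothetical toy values, not data or settings from the study.

```python
from typing import Sequence, Tuple


def precision_recall_f1(y_true: Sequence[int], y_pred: Sequence[int]) -> Tuple[float, float, float]:
    """Precision, recall, and F1 for the positive class (label 1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


def roc_auc(y_true: Sequence[int], y_score: Sequence[float]) -> float:
    """ROC-AUC as the chance that a random positive outscores a random negative (ties count half)."""
    pos = [s for t, s in zip(y_true, y_score) if t == 1]
    neg = [s for t, s in zip(y_true, y_score) if t == 0]
    if not pos or not neg:
        raise ValueError("ROC-AUC needs at least one positive and one negative example")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical labels (1 = diabetic) and model scores for six patients.
y_true = [0, 0, 1, 1, 0, 1]
y_score = [0.1, 0.4, 0.35, 0.8, 0.2, 0.9]
y_pred = [1 if s >= 0.5 else 0 for s in y_score]  # assumed 0.5 threshold

precision, recall, f1 = precision_recall_f1(y_true, y_pred)
auc = roc_auc(y_true, y_score)
```

ROC-AUC is computed from the raw scores and does not depend on the threshold, whereas Precision, Recall, and F1 all change when the threshold moves. That difference is why the two kinds of metric are usually reported together.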

These results underscore the potential of machine learning in accurately predicting Type 2 Diabetes onset. The developed models, particularly the ensemble model, show high accuracy and offer a comprehensive view of risk factors. The study highlights the ongoing need for research in this area to refine predictive models and improve their applicability in real-world healthcare settings.

Published

2024-10-02

Data Availability Statement

Source of data used in this research paper: https://www.kaggle.com/datasets/nguyenvy/nhanes-19882018

Section

Research Articles