Chinese Journal of Stroke ›› 2021, Vol. 16 ›› Issue (08): 779-786. DOI: 10.3969/j.issn.1673-5765.2021.08.005


Machine Learning Models for Predicting In-hospital Death in Patients with Acute New Ischemic Stroke Based on Unbalanced Data

  

  • Received: 2021-03-18 Online: 2021-08-20 Published: 2021-08-20


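Chen Siding (陈思玎), Gu Hongqiu (谷鸿秋), Huang Xinying (黄馨莹), Liu Huan (刘欢), Jiang Yong (姜勇), Wang Yongjun (王拥军)

  1. National Clinical Research Center for Neurological Diseases, Beijing 100070
  2. Beijing Advanced Innovation Center for Big Data-Based Precision Medicine (Beihang University & Capital Medical University), Beijing
  3. Center of Neurology, Beijing Tiantan Hospital, Capital Medical University, Beijing
  4. Center of Stroke, Beijing Institute for Brain Disorders, Beijing
  • Supported by:
    National Key Research and Development Program of China during the 13th Five-Year Plan (2016YFC0901001)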

Abstract: Objective To evaluate machine learning on unbalanced data for predicting in-hospital death in patients with acute new ischemic stroke, and to compare the predictive performance of machine learning models with that of the traditional logistic model. Methods Patients with acute new ischemic stroke were selected from the multi-center registry database of the Chinese Stroke Center Alliance (CSCA). Prediction models of in-hospital death were constructed with machine learning methods [XGBoost, CatBoost, random forest and support vector machine (SVM)] and with the traditional logistic method. The data were randomly divided at a ratio of 7:3 into a training set (to construct the prediction models) and a test set (to evaluate them). The imbalance in the death outcome was handled with undersampling and balanced class weights. Discrimination was evaluated with the AUC and calibration with the Brier score. Results A total of 601 466 eligible patients were included, of whom 231 235 were female (38.45%) and 2206 died in hospital (0.37%). The AUCs of the logistic, XGBoost, CatBoost, random forest and SVM models for predicting in-hospital death were 0.913±0.000, 0.921±0.000, 0.919±0.001, 0.925±0.000 and 0.900±0.001, respectively. The XGBoost model (P =0.0002), CatBoost model (P =0.0094) and random forest model (P <0.0001) discriminated better than the logistic model, and the logistic model discriminated better than the SVM model (P =0.0029). The Brier scores of the logistic, XGBoost, CatBoost, random forest and SVM models were 0.115±0.001, 0.096±0.001, 0.093±0.001, 0.084±0.000 and 0.045±0.001, respectively. All machine learning models were better calibrated than the logistic model, and all of these differences were statistically significant. Conclusions After the data were balanced, the machine learning models and the traditional logistic model all performed well and stably in predicting the risk of in-hospital death in patients with acute new ischemic stroke. Among them, the random forest model had the best discrimination and the SVM model had the best calibration.
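
The imbalance handling and the two evaluation metrics named in the abstract can be sketched in a few lines of standard-library Python. This is only an illustration under stated assumptions, not the authors' pipeline: the abstract does not give the undersampling ratio or the weighting formula, so the 1:1 ratio, the weight heuristic n_samples / (n_classes × class_count) (the rule scikit-learn applies for `class_weight="balanced"`), all function names, and the toy data are hypothetical.

```python
import random
from collections import Counter


def undersample(labels, ratio=1.0, seed=0):
    """Keep every positive (1) and a random subset of negatives (0).

    Assumption: about `ratio` negatives are kept per positive; the abstract
    does not state the ratio actually used.
    """
    rng = random.Random(seed)
    pos = [i for i, y in enumerate(labels) if y == 1]
    neg = [i for i, y in enumerate(labels) if y == 0]
    n_keep = min(len(neg), int(round(ratio * len(pos))))
    return sorted(pos + rng.sample(neg, n_keep))


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n = len(labels)
    return {cls: n / (len(counts) * c) for cls, c in counts.items()}


def auc(labels, scores):
    """AUC via the rank-sum (Mann-Whitney) identity, ties get average ranks."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1  # 1-based average rank of the tie group
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    rank_sum = sum(r for r, y in zip(ranks, labels) if y == 1)
    return (rank_sum - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)


def brier_score(labels, probs):
    """Mean squared difference between predicted probability and outcome."""
    return sum((p - y) ** 2 for y, p in zip(labels, probs)) / len(labels)


# Toy data (hypothetical): 4 deaths among 1000 patients.
toy_labels = [0] * 996 + [1] * 4
kept = undersample(toy_labels)                 # 4 positives + 4 negatives
weights = balanced_class_weights(toy_labels)   # rare class weighted 125.0

# Toy predictions (hypothetical) for four test patients.
y_true = [0, 0, 1, 1]
y_prob = [0.1, 0.4, 0.35, 0.8]
print(auc(y_true, y_prob), brier_score(y_true, y_prob))
```

The AUC depends only on how the models rank patients, whereas the Brier score depends on the predicted probabilities themselves and on outcome prevalence, so Brier scores are comparable only between models evaluated on the same data.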

Key words: Ischemic stroke; In-hospital death; Prediction model; Machine learning
