Machine Learning Models for Predicting In-hospital Death in Patients with Acute New Ischemic Stroke Based on Unbalanced Data

doi:10.3969/j.issn.1673-5765.2021.08.005

Abstract

Abstract: Objective To explore the use of machine learning on unbalanced data to predict the risk of in-hospital death in patients with new-onset acute ischemic stroke, and to compare the predictive performance of machine learning models with that of the traditional logistic model. Methods Patients with new-onset acute ischemic stroke in the multi-center registry database of the Chinese Stroke Center Alliance (CSCA) were studied. Prediction models of in-hospital death were built with machine learning [XGBoost, CatBoost, random forest and support vector machine (SVM)] and with the traditional logistic method. The data were randomly divided at a ratio of 7:3 into a training set, used to build the models, and a test set, used to evaluate them. The imbalance in the death outcome was handled with undersampling and balanced weighting. Models were evaluated by discrimination, measured as the area under the ROC curve (AUC), and by calibration, measured as the Brier score. Results A total of 601 466 eligible patients were included, of whom 231 235 (38.45%) were female and 2206 (0.37%) died in hospital. The AUCs of the logistic, XGBoost, CatBoost, random forest and SVM models for predicting in-hospital death were 0.913±0.000, 0.921±0.000, 0.919±0.001, 0.925±0.000 and 0.900±0.001, respectively. The XGBoost model (P = 0.0002), CatBoost model (P = 0.0094) and random forest model (P < 0.0001) discriminated better than the logistic model, and the logistic model discriminated better than the SVM model (P = 0.0029). The Brier scores of the logistic, XGBoost, CatBoost, random forest and SVM models were 0.115±0.001, 0.096±0.001, 0.093±0.001, 0.084±0.000 and 0.045±0.001, respectively. All machine learning models were better calibrated than the logistic model, and all differences were statistically significant. Conclusions After the data were balanced, both the machine learning models and the traditional logistic model performed well and stably in predicting the risk of in-hospital death in patients with new-onset acute ischemic stroke. Among them, the random forest model had the best predictive performance and the SVM model had the best calibration.

Article highlights: Using data on patients with new-onset acute ischemic stroke from the large-sample CSCA database, this study applied undersampling and balanced weighting to handle the unbalanced in-hospital death outcome, and on that basis compared four machine learning models (XGBoost, CatBoost, random forest and SVM) with the logistic model in predicting in-hospital death. Overall, the machine learning models predicted better than the logistic model.

Key words: Ischemic stroke; In-hospital death; Prediction model; Machine learning
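
The data-handling and evaluation steps named in the abstract (a 7:3 random split, undersampling, balanced weighting, AUC and Brier score) can be illustrated with a minimal standard-library sketch. It is not the authors' implementation, which the abstract does not specify. The function names, the 1:1 undersampling ratio, the weight formula n / (number of classes × class count) and the synthetic cohort are all assumptions made for illustration, and the model-fitting step is omitted.

```python
import random


def split_train_test(rows, train_fraction=0.7, seed=0):
    """Shuffle rows and split them into training and test sets (7:3 by default)."""
    rng = random.Random(seed)
    shuffled = list(rows)
    rng.shuffle(shuffled)
    cut = round(train_fraction * len(shuffled))
    return shuffled[:cut], shuffled[cut:]


def undersample(rows, seed=0):
    """Random undersampling (assumed 1:1): keep every death (label 1) and draw
    an equal number of survivors (label 0). Rows are (features, label) pairs."""
    rng = random.Random(seed)
    positives = [r for r in rows if r[1] == 1]
    negatives = [r for r in rows if r[1] == 0]
    kept = rng.sample(negatives, min(len(positives), len(negatives)))
    balanced = positives + kept
    rng.shuffle(balanced)
    return balanced


def balanced_class_weights(labels):
    """Assumed convention: weight = n / (n_classes * n_in_class)."""
    n = len(labels)
    classes = sorted(set(labels))
    return {c: n / (len(classes) * labels.count(c)) for c in classes}


def auc(labels, scores):
    """AUC via the rank-sum (Mann-Whitney) identity; ties get average ranks."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("AUC needs at least one positive and one negative")
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1  # 1-based average rank of the tie group
        for k in range(i, j + 1):
            ranks[order[k]] = average_rank
        i = j + 1
    rank_sum_pos = sum(r for r, y in zip(ranks, labels) if y == 1)
    return (rank_sum_pos - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)


def brier_score(labels, probabilities):
    """Mean squared difference between predicted probability and outcome."""
    return sum((p - y) ** 2 for y, p in zip(labels, probabilities)) / len(labels)


if __name__ == "__main__":
    rng = random.Random(42)
    # Synthetic cohort with about 2% positives (hypothetical, not CSCA data).
    rows = [((rng.random(),), 1 if rng.random() < 0.02 else 0) for _ in range(5000)]
    train, test = split_train_test(rows)
    balanced = undersample(train)
    print(len(train), len(test), len(balanced))
    print(balanced_class_weights([label for _, label in train]))
```

In this sketch a classifier would be fitted on `balanced` (or on `train` with the class weights), and `auc` and `brier_score` would then be computed on its predicted probabilities for `test`.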

CHEN Si-Ding, GU Hong-Qiu, HUANG Xin-Ying, LIU Huan, JIANG Yong, WANG Yong-Jun. Machine Learning Models for Predicting In-hospital Death in Patients with Acute New Ischemic Stroke Based on Unbalanced Data[J]. Chinese Journal of Stroke, 2021, 16(08): 779-786.
