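The Effect of Machine Learning Models for Predicting Stroke Risk in the General Population Based on Structured Data: A Systematic Review and Meta-Analysis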

doi:10.3969/j.issn.1673-5765.2022.11.006

Abstract

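Abstract: Objective To systematically evaluate the performance and predictive ability of models that use structured data and machine learning methods to predict the risk of stroke onset in the general population, so that study quality and model predictive performance can be improved in a targeted way.
Methods Four databases (PubMed, Web of Science, Scopus and Embase) were systematically searched for all studies on machine learning prediction of stroke onset risk before June 21, 2021. Two researchers independently performed literature screening, data extraction and risk-of-bias assessment. Using MedCalc software, a random-effects meta-analysis was performed on the measures of model discrimination. Subgroup analyses were performed by sample size, number of predictor variables, algorithm type and prediction time interval, and publication bias assessment and sensitivity analysis were also carried out.
Results Eleven studies were included: 3 at high risk of bias, 6 at unclear risk of bias and 2 at low risk of bias. Data sources included electronic health records and health insurance databases. The median prediction time interval was 3 years, the median number of predictor variables was 26 and the median sample size was 8175. The most commonly applied machine learning models were neural networks, random forests and support vector machines. The meta-analysis gave a pooled AUC of 0.745 (95%CI 0.712-0.778, P<0.001). In the subgroup analyses, AUC differed significantly across sample sizes and across numbers of predictor variables (non-overlapping 95%CIs), whereas differences across algorithm types and prediction time intervals were small (overlapping 95%CIs). The funnel plot and the statistical test both indicated publication bias (P=0.050). In the sensitivity analysis, after models with extreme AUC values were excluded, the pooled AUC was 0.746 (95%CI 0.714-0.777, P<0.001), showing that the result was insensitive to extreme AUC values (P<0.001).
Conclusions Predicting stroke onset risk in the general population with structured data and machine learning methods achieves only moderate performance, and the quality of the relevant studies is not high. For practical application, targeted improvements are needed to raise the predictive ability of the models.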
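
As a rough illustration of the pooling step described in the Methods, the sketch below combines study-level AUCs with a random-effects model. The abstract does not state which estimator MedCalc applied or how each AUC's standard error was obtained, so the DerSimonian-Laird estimator and the Hanley-McNeil standard error are assumptions here. The function names and the example numbers are hypothetical and are not data from the included studies.

```python
import math


def hanley_mcneil_se(auc, n_pos, n_neg):
    """Approximate standard error of an AUC (Hanley and McNeil, 1982).

    n_pos: number of subjects with the event, n_neg: number without it.
    """
    q1 = auc / (2 - auc)
    q2 = 2 * auc ** 2 / (1 + auc)
    var = (auc * (1 - auc)
           + (n_pos - 1) * (q1 - auc ** 2)
           + (n_neg - 1) * (q2 - auc ** 2)) / (n_pos * n_neg)
    return math.sqrt(var)


def pool_random_effects(estimates, std_errors, z=1.96):
    """DerSimonian-Laird random-effects pooling (assumed estimator).

    Returns (pooled estimate, CI lower bound, CI upper bound, tau squared).
    """
    k = len(estimates)
    # Fixed-effect (inverse-variance) weights and pooled value.
    w = [1 / se ** 2 for se in std_errors]
    sw = sum(w)
    fixed = sum(wi * yi for wi, yi in zip(w, estimates)) / sw
    # Cochran's Q and the between-study variance tau squared.
    q = sum(wi * (yi - fixed) ** 2 for wi, yi in zip(w, estimates))
    c = sw - sum(wi ** 2 for wi in w) / sw
    tau2 = max(0.0, (q - (k - 1)) / c) if c > 0 else 0.0
    # Random-effects weights add tau squared to each within-study variance.
    w_star = [1 / (se ** 2 + tau2) for se in std_errors]
    pooled = sum(wi * yi for wi, yi in zip(w_star, estimates)) / sum(w_star)
    se_pooled = math.sqrt(1 / sum(w_star))
    return pooled, pooled - z * se_pooled, pooled + z * se_pooled, tau2


# Hypothetical inputs for illustration only.
example_aucs = [0.70, 0.75, 0.80]
example_ses = [hanley_mcneil_se(a, 200, 1800) for a in example_aucs]
print(pool_random_effects(example_aucs, example_ses))
```

A sensitivity analysis of the kind reported in the Results could reuse `pool_random_effects` on the same lists after dropping the entries with the most extreme AUC values and comparing the two pooled estimates.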

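Article highlights: This article identifies the quality problems and risk of bias in articles indexed in the databases that predict stroke onset risk in the general population from structured data with machine learning models, and pools the metrics used to evaluate model performance.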

Keywords: Machine learning; Stroke risk; Prognosis prediction; Meta-analysis; Systematic review

Abstract: Objective To evaluate, through systematic review and meta-analysis, the performance of machine learning algorithms in predicting the risk of stroke in the general population based on structured data.
Methods Relevant studies on stroke risk prediction by machine learning before June 21, 2021 were retrieved from four databases: PubMed, Web of Science, Scopus and Embase. Two researchers independently screened the studies, extracted the data and assessed the risk of bias. MedCalc software and a random-effects model were used for the meta-analysis of model discrimination, and subgroup analyses were performed by sample size, number of predictor variables, machine learning algorithm type and prediction time interval. Publication bias assessment and sensitivity analysis were also conducted.
Results A total of 11 studies were included, with 3 at high risk of bias, 6 at unclear risk of bias and 2 at low risk of bias. The data sources included electronic health records and health insurance databases. The median prediction time interval was 3 years, the median number of predictor variables was 26 and the median sample size was 8175. The most frequently used machine learning models were neural networks, random forests and support vector machines. The meta-analysis showed a pooled AUC of 0.745 (95%CI 0.712-0.778, P<0.001). Subgroup analysis showed statistically significant differences in AUC across sample sizes and numbers of predictor variables (95%CIs not overlapping), while the differences in AUC across algorithm types and prediction time intervals were small (95%CIs overlapping). The funnel plot and statistical testing both indicated publication bias (P=0.050), and the sensitivity analysis gave a pooled AUC of 0.746 (95%CI 0.714-0.777, P<0.001) after models with extreme AUC values were excluded.
Conclusions The performance of machine learning algorithms in predicting stroke risk in the general population based on structured data was moderate, and the quality of the relevant studies was not high. The prediction models need targeted improvement to enhance their predictive ability in practical application.
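
The Results mention a statistical test for publication bias without naming it. One widely used option is Egger's regression, which regresses each study's standardized effect (estimate divided by its standard error) on its precision (one over the standard error); an intercept far from zero suggests funnel-plot asymmetry. The sketch below assumes that test, which may not be the one the authors ran. The function name and the inputs are hypothetical, and the t statistic is returned without a P value because the Python standard library has no t distribution.

```python
import math


def egger_regression(estimates, std_errors):
    """Egger's regression for funnel-plot asymmetry (assumed test).

    Fits y/se = intercept + slope * (1/se) by ordinary least squares.
    Returns (intercept, slope, t statistic of the intercept, degrees of freedom).
    """
    n = len(estimates)
    if n < 3:
        raise ValueError("at least three studies are needed")
    x = [1 / se for se in std_errors]                      # precision
    y = [e / se for e, se in zip(estimates, std_errors)]   # standardized effect
    mx = sum(x) / n
    my = sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    if sxx == 0:
        raise ValueError("standard errors must not all be equal")
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = my - slope * mx
    # Residual variance and the standard error of the intercept.
    rss = sum((yi - intercept - slope * xi) ** 2 for xi, yi in zip(x, y))
    df = n - 2
    se_intercept = math.sqrt(rss / df * (1 / n + mx ** 2 / sxx))
    t_stat = intercept / se_intercept if se_intercept > 0 else float("nan")
    return intercept, slope, t_stat, df


# Hypothetical AUCs and standard errors for illustration only.
print(egger_regression([0.70, 0.72, 0.78, 0.85], [0.01, 0.02, 0.04, 0.08]))
```

To turn the t statistic into a P value, it would be compared against a t distribution with the returned degrees of freedom, for example with a statistics package.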

Key words: Machine learning; Stroke risk; Prognosis prediction; Meta-analysis; Systematic review

邓宇含, 刘爽, 王子尧, 汪雨欣, 刘宝花. 基于结构化数据和机器学习模型预测普通人群卒中发病风险的系统评价和meta分析 [J]. 中国卒中杂志, 2022, 17(11): 1189-1197.

DENG Yuhan, LIU Shuang, WANG Ziyao, WANG Yuxin, LIU Baohua. The Effect of Machine Learning Model for Predicting Stroke Risk in the General Population Based on Structured Data: A Systematic Review and Meta-Analysis[J]. Chinese Journal of Stroke, 2022, 17(11): 1189-1197.
