Chinese Journal of Stroke ›› 2022, Vol. 17 ›› Issue (11): 1189-1197. DOI: 10.3969/j.issn.1673-5765.2022.11.006

• Original Article •

Predicting Stroke Risk in the General Population with Structured Data and Machine Learning Models: A Systematic Review and Meta-Analysis

邓宇含, 刘爽, 王子尧, 汪雨欣, 刘宝花   

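  1. School of Public Health, Peking University, Beijing 100191, China
  • Received: 2022-02-18 Published: 2022-11-20 Online: 2022-11-20
  • Corresponding author: LIU Baohua baohualiu@bjmu.edu.cn
  • Funding:
    National Key Research and Development Program of China (2018YFC1311700; 2018YFC1311703)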

The Effect of Machine Learning Model for Predicting Stroke Risk in the General Population Based on Structured Data: A Systematic Review and Meta-Analysis

DENG Yuhan, LIU Shuang, WANG Ziyao, WANG Yuxin, LIU Baohua   

  • Received: 2022-02-18 Online: 2022-11-20 Published: 2022-11-20

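Abstract: Objective  To systematically evaluate the performance and predictive ability of models that use structured data and machine learning methods to predict stroke risk in the general population, so that study quality and model performance can be improved in a targeted way.
Methods  Four databases (PubMed, Web of Science, Scopus and Embase) were systematically searched for all studies on machine learning prediction of stroke risk published before June 21, 2021. Two researchers independently performed literature screening, data extraction and risk-of-bias assessment. Using MedCalc software, a random-effects meta-analysis was performed on the measures of model discrimination. Subgroup analyses were conducted by sample size, number of predictor variables, algorithm type and prediction time interval, and publication bias assessment and sensitivity analysis were also carried out.
Results  Eleven studies were included, of which 3 were at high risk of bias, 6 at unclear risk of bias and 2 at low risk of bias. Data sources included electronic health records and health insurance databases. The median prediction time interval was 3 years, the median number of predictor variables was 26, and the median sample size was 8175. The most commonly used machine learning models were neural networks, random forests and support vector machines. The meta-analysis gave a pooled AUC of 0.745 (95%CI 0.712-0.778, P<0.001). In the subgroup analyses, AUC differed significantly across sample sizes and across numbers of predictor variables (95%CIs did not overlap), whereas differences across algorithm types and prediction time intervals were small (95%CIs overlapped). Both the funnel plot and the statistical test indicated publication bias (P=0.050). In the sensitivity analysis, after models with extreme AUC values were excluded, the pooled AUC was 0.746 (95%CI 0.714-0.777, P<0.001), showing that the result was not sensitive to extreme AUC values (P<0.001).
Conclusions  Structured data combined with machine learning methods predicted stroke risk in the general population only moderately well, and none of the relevant studies was of high quality. For practical application, targeted improvements are needed to raise the predictive ability of the models.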

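Article highlights: This article identifies the quality problems and risk of bias in database-indexed studies that use structured data and machine learning models to predict stroke risk in the general population, and pools the metrics used to evaluate model performance.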

Keywords: Machine learning; Stroke risk; Prognosis prediction; Meta-analysis; Systematic review

Abstract: Objective  To evaluate, through systematic review and meta-analysis, how well machine learning algorithms based on structured data predict the risk of stroke in the general population.
Methods  Studies on stroke risk prediction by machine learning published before June 21, 2021 were retrieved from four databases: PubMed, Web of Science, Scopus and Embase. Two researchers independently screened the literature, extracted the data and assessed the risk of bias. MedCalc software and a random-effects model were used for the meta-analysis of model discrimination, and subgroup analyses were performed by sample size, number of variables, machine learning algorithm type and prediction time interval. Publication bias assessment and sensitivity analysis were also conducted.
Results  A total of 11 studies were included: 3 with high risk of bias, 6 with unclear risk of bias and 2 with low risk of bias. The data sources included electronic health records and health insurance databases. The median prediction time interval was 3 years, the median number of variables was 26 and the median sample size was 8175. The most frequently used machine learning models were neural networks, random forests and support vector machines. The meta-analysis showed a pooled AUC of 0.745 (95%CI 0.712-0.778, P<0.001). Subgroup analysis showed that AUC differed significantly across sample sizes and numbers of variables (95%CIs not overlapping), while it differed little across algorithm types and prediction time intervals (95%CIs overlapping). The funnel plot and statistical testing both indicated publication bias among the included studies (P=0.050). In the sensitivity analysis, the pooled AUC was 0.746 (95%CI 0.714-0.777, P<0.001) after models with extreme AUC values were excluded.
Conclusions  Machine learning algorithms based on structured data predicted stroke risk in the general population only moderately well, and the quality of the relevant studies was not high. The prediction models therefore need targeted improvement before practical application.
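
The random-effects pooling and the confidence-interval overlap check described in the Methods and Results can be sketched as follows. This is a minimal illustration, not the MedCalc procedure itself: it assumes the DerSimonian-Laird estimator for between-study variance (the abstract does not say which estimator was used), the function names `pool_random_effects` and `intervals_overlap` are hypothetical, and the input AUCs and standard errors are made-up values, not data from the included studies.

```python
import math


def pool_random_effects(estimates, std_errors, z=1.96):
    """Pool study estimates with a DerSimonian-Laird random-effects model.

    Returns (pooled, ci_low, ci_high, tau2), where tau2 is the estimated
    between-study variance.
    """
    k = len(estimates)
    if k < 2 or k != len(std_errors):
        raise ValueError("need at least two studies with matching standard errors")

    # Fixed-effect (inverse-variance) weights and pooled estimate.
    w = [1.0 / se ** 2 for se in std_errors]
    sum_w = sum(w)
    fixed = sum(wi * yi for wi, yi in zip(w, estimates)) / sum_w

    # Cochran's Q and the DerSimonian-Laird between-study variance.
    q = sum(wi * (yi - fixed) ** 2 for wi, yi in zip(w, estimates))
    c = sum_w - sum(wi ** 2 for wi in w) / sum_w
    tau2 = max(0.0, (q - (k - 1)) / c)

    # Random-effects weights add tau2 to each within-study variance.
    w_star = [1.0 / (se ** 2 + tau2) for se in std_errors]
    sum_w_star = sum(w_star)
    pooled = sum(wi * yi for wi, yi in zip(w_star, estimates)) / sum_w_star
    se_pooled = math.sqrt(1.0 / sum_w_star)
    return pooled, pooled - z * se_pooled, pooled + z * se_pooled, tau2


def intervals_overlap(ci_a, ci_b):
    """Return True if two (low, high) confidence intervals overlap."""
    return ci_a[0] <= ci_b[1] and ci_b[0] <= ci_a[1]


# Hypothetical AUCs and standard errors, NOT taken from the included studies.
aucs = [0.70, 0.74, 0.78]
ses = [0.02, 0.03, 0.02]
pooled, low, high, tau2 = pool_random_effects(aucs, ses)
print(f"pooled AUC = {pooled:.3f} (95%CI {low:.3f}-{high:.3f}), tau^2 = {tau2:.5f}")
```

Comparing subgroups by whether their 95%CIs overlap, as the abstract does, is a conservative heuristic: non-overlapping intervals imply a difference, but overlapping intervals do not by themselves rule one out.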

Key words: Machine learning; Stroke risk; Prognosis prediction; Meta-analysis; Systematic review