Chinese Journal of Stroke ›› 2022, Vol. 17 ›› Issue (03): 217-226. DOI: 10.3969/j.issn.1673-5765.2022.03.002

• Special Topic Forum •

Construction of a Genomics Data Analysis Pipeline for Cerebrovascular Disease

许喆, 程丝, 刘阳, 石延枫, 李昊   

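  1. Center of Neurology, Beijing Tiantan Hospital, Capital Medical University, Beijing 100070; Stroke Multi-omics Innovation Center, China National Clinical Research Center for Neurological Diseases
  • Received: 2021-12-03 Published: 2022-03-20 Released: 2022-03-20
  • Funding:
    Capital's Funds for Health Improvement and Research, 2020 (首发2020-1-2041)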

Construction of a Bioinformatics Pipeline for Genomics Data in Cerebrovascular Disease Research

  • Received: 2021-12-03 Online: 2022-03-20 Published: 2022-03-20

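Abstract:

Objective To establish and optimize a bioinformatics pipeline suited to the analysis of cerebrovascular disease genomics data, and to promote multi-omics and precision medicine research in cerebrovascular disease.

Methods Clinical research needs were surveyed and organized, and commonly used analysis methods were summarized with reference to genomic and genetic studies in cerebrovascular disease and population genetics. The bioinformatics pipeline was given a modular design according to research objectives and analysis content. Using genomics data generated by the China National Stroke Registry-Ⅲ (CNSR-Ⅲ) study, the pipeline was built, tested, and optimized on a high-performance computing cluster (floating-point capacity of 375 trillion operations per second).

Results The pipeline built in this study comprises multiple modules, including data quality control, association analysis, linkage analysis, genetic variant annotation, and cross-omics analysis. The corresponding modules were used for quality control and analysis of genomics data from more than ten thousand CNSR-Ⅲ samples. In the end, 10 241 whole-genome sequenced samples that passed quality control and had no relatives within the third degree were confirmed for use in genome-wide association analysis.

Conclusions Optimizing the bioinformatics pipeline in light of the characteristics of cerebrovascular disease can provide data assurance for multi-omics research, improve research efficiency, and provide a basis for risk assessment, diagnosis, and individualized treatment of cerebrovascular disease.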

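Article highlights: The establishment of a standardized, modular bioinformatics analysis pipeline provides a systematic solution for comprehensively mining cerebrovascular disease genomics data and revealing the pathogenic information they contain, and will advance multi-omics research in cerebrovascular disease.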

Keywords: Cerebrovascular disease; Genomics; Genetics; Bioinformatics analysis; Big data

Abstract: Objective To construct and optimize a bioinformatics analysis pipeline suitable for genomics research in cerebrovascular disease (CVD), and to promote multi-omics and precision medicine studies of CVD. Methods Clinical research needs were surveyed, and commonly used analysis methods from genomic and genetic studies in CVD and population genetics were summarized. The pipeline was designed in modules according to research objectives and analysis content. Using genomics data from the China National Stroke Registry-Ⅲ (CNSR-Ⅲ), the pipeline was constructed, tested, and optimized on a high-performance computing cluster (floating-point capacity of 375 trillion operations per second). Results The pipeline includes modules for data quality control, association analysis, linkage analysis, genetic variant annotation, and cross-omics analysis. These modules were used for quality control and analysis of the CNSR-Ⅲ genomics data. A total of 10 241 whole-genome sequenced samples passed data quality control and had no relatives within the third degree; these samples will be used in genome-wide association studies. Conclusions Optimizing the bioinformatics analysis pipeline for the characteristics of CVD can support multi-omics research, improve research efficiency, and provide a basis for CVD risk assessment, diagnosis, and personalized treatment.

Key words: Cerebrovascular disease; Genomics; Genetics; Bioinformatics analysis; Big data