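Abstract:
Genome annotation is the process of identifying the functional components in a genome sequence; it assigns biological meaning directly to the sequence and so makes it easier for researchers to explore and analyze genome function. Annotation helps researchers understand the genome at three levels. The first is nucleotide-level annotation, which determines the physical positions of genes, RNAs, repetitive sequences, and other components in the DNA sequence, including specific positions such as transcription start sites, translation start sites, and exon boundaries. It can also yield the differences in variant frequency across populations, which form the map-level basis for interpreting phenotypic differences between populations. The second is protein-level annotation, which interprets the possible functional abnormalities caused by genes or variants and evaluates how a variant's position in the gene and its type affect the protein. The third is biological function/process annotation, which interprets how interactions among genes affect biological processes and pathways, and can explain, from a systems biology perspective, how genes or regulatory elements affect biochemical processes or functions. Since the completion of the Human Genome Project, many countries have launched genome sequencing projects, produced detailed maps of human genetic polymorphism, and recorded annotation information such as the distribution and frequency differences of variants across groups with different phenotypes. Drawing on existing annotation databases, we developed a highly accurate and efficient genome annotation system for large-scale populations, which performs fully automated functional annotation of large-scale population variant data and supports future research on the distribution of genetic variation in populations.
Keywords: population genome; genome annotation; genome variation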
DOI: 10.12113/202106002
CLC number: Q343.1
Document code: A
Funding:

Workflow of large-scale population genome annotation
YAN Zhenlei, GUO Hongzhe

(Faculty of Computing, Harbin Institute of Technology, Harbin 150001, China)

Abstract:
Genome annotation is the process of identifying functional components in a genome sequence. It assigns biological significance directly to the sequence and thus facilitates the exploration and analysis of genome function. Genome annotation helps researchers understand the genome at three levels. The first is annotation at the nucleotide level, which determines the physical locations of genes, RNAs, repetitive sequences, and other components in the DNA sequence, including specific positions such as transcription start sites, translation start sites, and exon boundaries. Annotation at this level can also reveal how the frequency of a variant differs among populations, which is the basis for interpreting phenotypic differences between populations. The second is annotation at the protein level, which interprets the possible functional abnormalities caused by genes or variants and evaluates how the location of a variant within a gene and the type of variant affect the protein. The third is biological function/process annotation, which interprets how interactions among genes affect biological processes and pathways and can explain, from the perspective of systems biology, how genes or regulatory elements affect biochemical processes or functions. Since the completion of the Human Genome Project, countries around the world have launched genome sequencing projects, produced detailed maps of human genetic polymorphism, and recorded annotation information such as the distribution and frequency differences of variants across groups with different phenotypes.
Based on existing annotation databases, a highly accurate and efficient genome annotation system for large-scale populations was developed. The system performs fully automated functional annotation of large-scale population variation data and is intended to support future research on the distribution of genetic variation in populations.
Key words: Population genomes; Genome annotation; Genome variation