Cite this article: AI Liang (艾亮), FENG Jie (冯杰). A fast alignment-free method for protein sequence similarity and evolution analysis[J]. Chinese Journal of Bioinformatics (生物信息学), 2023, 21(3): 179-186.
A fast alignment-free method for protein sequence similarity and evolution analysis
AI Liang, FENG Jie
(School of Science, Minzu University of China, Beijing 100081, China)

DOI: 10.12113/202209010
CLC number: Q516
Document code: A
Funding:

Abstract:
This paper proposes a fast alignment-free method for analyzing protein sequence similarity and evolutionary relationships. To characterize a protein sequence, 10 physicochemical properties of amino acids are first condensed into 6 principal components by principal component analysis, and the principal component scores are averaged using the amino acid counts in each protein sequence as weights. Amino acid position information is then fused in to form a 26-dimensional feature vector for each protein sequence. Finally, the Euclidean distance between feature vectors is used to measure the similarity and evolutionary relationships between protein sequences. Tests on three protein sequence datasets show that the method clusters every protein sequence correctly and is simple and fast, which demonstrates its effectiveness.
Key words: protein sequences; principal component analysis; similarity; phylogenetic trees
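
The pipeline described in the abstract can be illustrated with a minimal Python sketch. The abstract does not say how the 26 dimensions are composed, so the sketch assumes 6 count-weighted principal component averages plus 20 position features (one per standard amino acid, taken here as the mean position of that residue scaled by sequence length). The names `feature_vector`, `euclidean`, and `pc_scores` are hypothetical, and the scores below are random placeholders rather than the output of a real principal component analysis on physicochemical properties.

```python
import math
import random
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def feature_vector(seq, pc_scores):
    """Build a 26-dim vector: 6 weighted PC averages + 20 position features.

    The 6 + 20 split and the position feature are assumptions of this
    sketch; the abstract only states the total dimension.
    """
    residues = [aa for aa in seq.upper() if aa in AMINO_ACIDS]
    n = len(residues)
    if n == 0:
        raise ValueError("sequence contains no standard amino acids")
    counts = Counter(residues)
    n_pc = len(pc_scores[AMINO_ACIDS[0]])

    # Average each principal component score, weighted by residue counts.
    weighted = [
        sum(counts[aa] * pc_scores[aa][k] for aa in counts) / n
        for k in range(n_pc)
    ]

    # Assumed position feature: mean 1-based position of each residue,
    # divided by the sequence length; 0.0 if the residue is absent.
    pos_sum = dict.fromkeys(AMINO_ACIDS, 0)
    for i, aa in enumerate(residues, start=1):
        pos_sum[aa] += i
    position = [
        pos_sum[aa] / (counts[aa] * n) if counts[aa] else 0.0
        for aa in AMINO_ACIDS
    ]
    return weighted + position


def euclidean(u, v):
    """Euclidean distance between two feature vectors of equal length."""
    return math.dist(u, v)


# Placeholder scores only: real values would come from PCA on the
# amino acid property table, which is not reproduced here.
rng = random.Random(0)
pc_scores = {aa: [rng.uniform(-1, 1) for _ in range(6)] for aa in AMINO_ACIDS}

a = feature_vector("MKVLAAGIV", pc_scores)
b = feature_vector("MKVLSAGIV", pc_scores)
print(len(a), euclidean(a, b))
```

A pairwise distance matrix built with `euclidean` over all sequences could then be passed to any standard tree-building or clustering routine; that step is left out because the abstract does not name the one used.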