Cite this article: | BAO Xiaona, HE Lili, CUI Jingan. Identification of protein coding region based on machine learning[J]. Chinese Journal of Bioinformatics, 2023, 21(4): 270-276. |
|
DOI:10.12113/202206004 |
CLC number: TP181 |
Document code: A |
Funding: |
|
Identification of protein coding region based on machine learning |
BAO Xiaona, HE Lili, CUI Jingan
|
(School of Science, Beijing University of Civil Engineering and Architecture, Beijing 102616, China)
|
Abstract: |
To identify protein coding regions in DNA sequences, this article proposes a model that combines a feature vector with logistic regression. First, each DNA sequence is converted numerically into a feature vector, and the elements of that vector are extracted using the k-character relative frequency technique. A binary logistic regression classifier then distinguishes coding regions from non-coding regions. Two benchmark data sets, HMR195 and BG570, were used for five-fold cross-validation. The average AUC (Area Under Curve) values were 0.9813 and 0.9874, respectively, clearly better than those of the traditional Bayesian discriminant method, VOSSDFT, and other methods. In addition, the feature vector proposed here has a very low dimension, which improves computational efficiency. The combined model can therefore identify protein coding regions fairly efficiently and accurately. |
Key words: protein coding region; feature vector; logistic regression; machine learning |
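
The pipeline outlined in the abstract (k-character relative frequencies as features, a binary logistic regression classifier, AUC as the score) can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify how the low-dimensional feature vector is built, so plain k-character frequencies stand in for it. The function names, the learning rate, the epoch count, and the synthetic GC-rich versus AT-rich "sequences" are all assumptions made for illustration, and a single held-out split replaces five-fold cross-validation on HMR195 or BG570.

```python
import math
import random
from itertools import product


def kmer_relative_frequencies(seq, k=1):
    """Relative frequency of every length-k word over A, C, G, T (4**k values)."""
    seq = seq.upper()
    words = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = dict.fromkeys(words, 0)
    total = 0
    for i in range(len(seq) - k + 1):
        word = seq[i:i + k]
        if word in counts:  # windows containing other symbols (e.g. N) are skipped
            counts[word] += 1
            total += 1
    return [counts[w] / total if total else 0.0 for w in words]


def sigmoid(z):
    # Two branches avoid overflow in math.exp for large |z|.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def predict_proba(x, w, b):
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)


def fit_logistic_regression(X, y, lr=1.0, epochs=500):
    """Batch gradient descent on the logistic loss; lr and epochs are assumed values."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for xi, yi in zip(X, y):
            err = predict_proba(xi, w, b) - yi
            for j in range(d):
                grad_w[j] += err * xi[j]
            grad_b += err
        w = [wj - lr * gj / n for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def auc(scores, labels):
    """Fraction of (positive, negative) pairs ranked correctly; ties count as 0.5."""
    pos = [s for s, l in zip(scores, labels) if l == 1]
    neg = [s for s, l in zip(scores, labels) if l == 0]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


def toy_sequence(rng, gc_fraction, length=120):
    """Synthetic stand-in for real data: each base is G/C with probability gc_fraction."""
    return "".join(
        rng.choice("GC") if rng.random() < gc_fraction else rng.choice("AT")
        for _ in range(length)
    )


rng = random.Random(0)
# Label 1 = GC-rich toy class, label 0 = AT-rich toy class (not real coding/non-coding data).
train_seqs = [toy_sequence(rng, 0.7) for _ in range(20)] + [toy_sequence(rng, 0.3) for _ in range(20)]
train_labels = [1] * 20 + [0] * 20
test_seqs = [toy_sequence(rng, 0.7) for _ in range(10)] + [toy_sequence(rng, 0.3) for _ in range(10)]
test_labels = [1] * 10 + [0] * 10

X_train = [kmer_relative_frequencies(s, k=2) for s in train_seqs]
X_test = [kmer_relative_frequencies(s, k=2) for s in test_seqs]
w, b = fit_logistic_regression(X_train, train_labels)
held_out_auc = auc([predict_proba(x, w, b) for x in X_test], test_labels)
print(f"held-out AUC on toy data: {held_out_auc:.3f}")
```

With k = 2 the feature vector has 16 entries, which keeps the classifier small. The toy classes are deliberately easy to separate, so the printed score says nothing about performance on real genomic data.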