Identification of protein coding regions based on machine learning

BAO Xiaona; HE Lili; CUI Jingan


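Supervised by: Ministry of Industry and Information Technology; Sponsored by: Harbin Institute of Technology; Editor-in-Chief: REN Nanqi; ISSN: 1672-5565; CN: 23-1513/Q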


Cite this article: BAO Xiaona, HE Lili, CUI Jingan. Identification of protein coding region based on machine learning[J]. Chinese Journal of Bioinformatics, 2023, 21(4): 270-276.

(School of Science, Beijing University of Civil Engineering and Architecture, Beijing 102616, China)

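Abstract:

To address the problem of identifying coding regions in DNA sequences, this study proposes a model that combines a feature vector with logistic regression. Each DNA sequence is first converted numerically into a feature vector, whose elements are extracted with the k-character relative frequency technique. A binary logistic regression classifier then distinguishes coding regions from non-coding regions. Five-fold cross-validation on two benchmark data sets, HMR195 and BG570, gave average AUC (area under the curve) values of 0.9813 and 0.9874, respectively, markedly better than traditional methods such as Bayesian discriminant analysis and VOSSDFT. In addition, the proposed feature vector has a very low dimension, which improves computational efficiency. The combined model can therefore identify protein coding regions fairly efficiently and accurately.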
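The pipeline summarized in the abstract (sequence to frequency features, then binary logistic regression, then AUC) can be illustrated with a minimal, standard-library-only sketch. It is not the authors' implementation: the abstract does not specify how the low-dimensional feature vector is built, so plain k-mer relative frequencies are used here as a stand-in. The function names (`kmer_relative_frequencies`, `fit_logistic`, `predict_proba`, `auc`), the gradient-descent settings, and the toy sequences are all assumptions made for illustration.

```python
import itertools
import math
from collections import Counter


def kmer_relative_frequencies(seq, k=2):
    """Relative frequency of every k-mer over ACGT, in fixed lexicographic order."""
    seq = seq.upper()
    kmers = ["".join(p) for p in itertools.product("ACGT", repeat=k)]
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    # Windows containing other symbols (e.g. N) are left out of the normalisation.
    total = sum(counts[m] for m in kmers)
    if total == 0:
        return [0.0] * len(kmers)
    return [counts[m] / total for m in kmers]


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def fit_logistic(X, y, lr=0.5, epochs=500):
    """Binary logistic regression fitted by batch gradient descent on the log loss."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            for j, xj in enumerate(xi):
                grad_w[j] += err * xj
            grad_b += err
        w = [wj - lr * g / n for wj, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def predict_proba(X, w, b):
    """Predicted probability of the positive (coding) class for each row of X."""
    return [sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) for xi in X]


def auc(labels, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for lab, s in zip(labels, scores) if lab == 1]
    neg = [s for lab, s in zip(labels, scores) if lab == 0]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


if __name__ == "__main__":
    # Made-up toy sequences, NOT taken from HMR195 or BG570.
    seqs = ["ATGGCCAAGGTGCTGACC", "ATGGAGCTGAAGGCCTGA",
            "TTTTATATAAATTTTAAT", "ATATTTTAAAATATTTAT"]
    labels = [1, 1, 0, 0]  # 1 = coding, 0 = non-coding (hypothetical labels)
    X = [kmer_relative_frequencies(s, k=2) for s in seqs]
    w, b = fit_logistic(X, labels)
    print("training AUC on toy data:", auc(labels, predict_proba(X, w, b)))
```

The demo scores the same sequences it was trained on, so it only shows how the pieces fit together. The AUC values reported in the abstract come from five-fold cross-validation, which would require splitting the data into five folds and averaging the held-out AUC across them.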

Keywords: coding region; feature vector; logistic regression; machine learning

DOI: 10.12113/202206004

CLC number: TP181

Document code: A


