Cite this article: ZHOU Jing, XIE Xueying, GU Wanjun. Identification of circular RNAs using genomic sequence features[J]. Chinese Journal of Bioinformatics, 2018, 16(2): 113-118.

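Abstract:
Circular RNAs are a recently discovered class of RNAs with important biological functions. Existing circular RNA identification tools depend on high-throughput sequencing data and, owing to limitations of the data and of the identification strategies, commonly suffer from insufficient accuracy, low reproducibility across methods, and high false positive and false negative rates. To address this problem, we built a model for de novo prediction of circular RNAs that relies on intrinsic sequence features rather than on sequencing data. We selected 100 sequence features related to RNA circularization, including the lengths of the introns upstream and downstream of the splice sites, A-to-I density, and Alu repeats; built machine learning models; identified circular RNAs in the human genome; and compared the classification performance of two machine learning methods, random forest (RF) and support vector machine (SVM). The results show that the selected sequence features can effectively discriminate whether an RNA can circularize, and that different sequence features contribute differently to the model's predictive ability. RF classified better than SVM.
Keywords: circular RNA; sequence features; machine learning; random forest; support vector machine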
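DOI: 10.3969/j.issn.1672-5565.201709002
CLC number: Q522+.6
Document code: A
Funding: National Natural Science Foundation of China (4,2 , 61571109); Jiangsu Provincial Key Research and Development Program (BE2016002-3); Fundamental Research Funds for the Central Universities (2242017K3DN04).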
|
Identification of circular RNAs using genomic sequence features |
ZHOU Jing 1, XIE Xueying 2,3, GU Wanjun 2,3

(1. Research Center for Learning Sciences, Southeast University, Nanjing 210096, China; 2. State Key Laboratory of Bio-electronics, School of Biological Sciences and Medical Engineering, Southeast University, Nanjing 210096, China; 3. National Demonstration Center for Experimental Biomedical Engineering Education (Southeast University), Nanjing 210096, China)

Abstract:
Circular RNAs (circRNAs) are a recently discovered class of RNAs with important biological functions. Current circRNA identification tools depend on high-throughput sequencing data. Owing to limitations of the data and of the identification strategies, they generally suffer from low accuracy, low overlap between the results of different methods, and high false positive and false negative rates. To address this problem, we built a model that predicts circRNAs de novo from intrinsic features of the genomic sequence rather than from sequencing data. We selected 100 genomic sequence features related to circRNA formation, including the length of flanking introns, the density of A-to-I RNA editing sites, and the pairing score of Alu elements in the flanking introns; built machine learning models; identified circRNAs in the human genome; and compared the classification results of two machine learning algorithms, random forest (RF) and support vector machine (SVM). The results showed that the selected features could effectively identify circRNAs and that different sequence features contributed differently to the identification. In addition, the RF model performed better than the SVM model in identifying circRNAs.
Key words: Circular RNAs; Sequence feature; Machine learning; Random forest; Support vector machines
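
The abstract names three of the sequence features: flanking intron length, A-to-I editing site density, and Alu pairing in the flanking introns. The Python below is a minimal sketch of how such features might be assembled into a vector for one candidate back-splice junction. All names (`Interval`, `site_density`, `inverted_alu_pairs`, `feature_vector`) are hypothetical, and the count of oppositely oriented Alu pairs is only an assumed stand-in for the pairing score, whose definition the abstract does not give. None of this is taken from the authors' implementation.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass(frozen=True)
class Interval:
    """Half-open genomic interval [start, end) on one chromosome (hypothetical type)."""
    start: int
    end: int

    @property
    def length(self) -> int:
        return self.end - self.start


def site_density(intron: Interval, sites: Sequence[int]) -> float:
    """Annotated sites falling inside the intron, per base; 0.0 for an empty intron."""
    if intron.length <= 0:
        return 0.0
    inside = sum(1 for pos in sites if intron.start <= pos < intron.end)
    return inside / intron.length


def inverted_alu_pairs(up_strands: Sequence[str], down_strands: Sequence[str]) -> int:
    """Count Alu pairs with opposite orientation across the two flanking introns.

    Assumption: this simple count stands in for the paper's pairing score.
    """
    up_plus = sum(1 for s in up_strands if s == "+")
    up_minus = sum(1 for s in up_strands if s == "-")
    down_plus = sum(1 for s in down_strands if s == "+")
    down_minus = sum(1 for s in down_strands if s == "-")
    return up_plus * down_minus + up_minus * down_plus


def feature_vector(
    upstream: Interval,
    downstream: Interval,
    editing_sites: Sequence[int],
    up_alu_strands: Sequence[str],
    down_alu_strands: Sequence[str],
) -> List[float]:
    """Assemble five illustrative features for one candidate junction."""
    return [
        float(upstream.length),
        float(downstream.length),
        site_density(upstream, editing_sites),
        site_density(downstream, editing_sites),
        float(inverted_alu_pairs(up_alu_strands, down_alu_strands)),
    ]


if __name__ == "__main__":
    # Toy coordinates and annotations, invented for illustration only.
    vec = feature_vector(
        upstream=Interval(1000, 3000),
        downstream=Interval(5000, 6000),
        editing_sites=[1500, 2500, 5500, 7000],
        up_alu_strands=["+", "+", "-"],
        down_alu_strands=["-"],
    )
    print(vec)
```

Vectors of this kind, one per candidate junction, are the sort of input a random forest or support vector machine classifier would be trained on. The full set of 100 features is not listed in the abstract.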