Supervised by: Ministry of Industry and Information Technology; Sponsored by: Harbin Institute of Technology; Editor-in-Chief: REN Nanqi; ISSN 1672-5565; CN 23-1513/Q

ZHOU Jing, XIE Xueying, GU Wanjun. Identification of circular RNAs using genomic sequence features[J]. Chinese Journal of Bioinformatics, 2018, 16(2): 113-118.
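Funding: National Natural Science Foundation of China (4,2 , 61571109); Key Research and Development Program of Jiangsu Province (BE2016002-3); Fundamental Research Funds for the Central Universities (2242017K3DN04).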
Identification of circular RNAs using genomic sequence features
ZHOU Jing 1, XIE Xueying 2,3, GU Wanjun 2,3
(1. Research Center for Learning Sciences, Southeast University, Nanjing 210096, China; 2. State Key Laboratory of Bio-electronics, School of Biological Sciences and Medical Engineering, Southeast University, Nanjing 210096, China; 3. National Demonstration Center for Experimental Biomedical Engineering Education (Southeast University), Nanjing 210096, China)
Circular RNAs (circRNAs) are a novel class of RNAs with important biological functions. Current circRNA identification tools depend on high-throughput sequencing data. However, owing to limitations in the data and in the identification strategies, these tools generally suffer from low accuracy, low overlap between the results of different methods, and high false-positive and false-negative rates. To address this problem, we built a model that identifies circRNAs de novo from inherent features of the genomic sequence rather than from sequencing data. We selected 100 genomic sequence features related to circRNAs, including the length of the flanking introns, the density of A-to-I RNA editing sites, and the pairing score of Alu elements in the flanking introns; built machine learning models; identified circRNAs in the human genome; and compared the classification results of two machine learning algorithms, random forest (RF) and support vector machine (SVM). The results showed that the selected features could effectively identify circRNAs and that different sequence features contributed differently to the identification. In addition, the RF model performed better than the SVM model in identifying circRNAs.
Key words:  Circular RNAs  Sequence feature  Machine learning  Random forest  Support vector machines
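
To make two of the features named in the abstract concrete, the following is a minimal sketch of how flanking-intron length and A-to-I editing-site density might be computed for one candidate locus. The function name, the half-open coordinate convention, and the per-nucleotide density definition are illustrative assumptions; the abstract does not specify how the authors computed these features.

```python
from bisect import bisect_left


def flanking_intron_features(upstream_intron, downstream_intron, editing_sites):
    """Return illustrative sequence features for one candidate circRNA locus.

    upstream_intron, downstream_intron: (start, end) half-open intervals
        on the same chromosome (assumed convention).
    editing_sites: sorted list of A-to-I editing site positions (hypothetical input).
    """
    features = {}
    for name, (start, end) in (("upstream", upstream_intron),
                               ("downstream", downstream_intron)):
        length = end - start
        # Count sites with start <= position < end via binary search.
        n_sites = bisect_left(editing_sites, end) - bisect_left(editing_sites, start)
        features[f"{name}_intron_length"] = length
        # Density is defined here as sites per nucleotide (an assumption).
        features[f"{name}_editing_density"] = n_sites / length if length > 0 else 0.0
    return features


# Toy coordinates, not real genomic data.
sites = [120, 450, 980, 5200, 5310, 5900]
feats = flanking_intron_features((100, 1100), (5000, 7000), sites)
```

A vector of such per-locus values, extended to the full feature set, is the kind of input a random forest or support vector machine classifier would be trained on.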

