Cite this article: | LI Ling-Bo, ZHANG Jing, CHEN Dan. Selection of human tumor information genes based on the support vector machine and mean impact value[J]. Chinese Journal of Bioinformatics, 2013, 11(1): 72-78. |
|
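Abstract: |
Selecting information genes for tumor classification from gene expression profiles is an important means of discovering genes specifically expressed in tumors and of exploring tumor gene expression patterns. Tumor diagnosis using classification information obtained from gene expression profiles is an important research direction in bioinformatics today and is expected to become a fast and effective method of molecular tumor diagnosis in clinical medicine. Given that tumor gene expression profile data are high-dimensional, have small sample sizes, and are noisy, an algorithm is proposed that applies the mean impact value in combination with a support vector machine to find tumor information genes. Its advantage is that it can find multiple information gene subsets with as few genes and as much classification ability as possible. A binary-class tumor dataset is used to verify the feasibility and effectiveness of the algorithm: for the colon cancer sample set, only 3 genes are needed to achieve 100% recognition accuracy under leave-one-out cross-validation. To avoid the effect of different partitions of the sample set on classification performance, full-fold cross-validation is further used to evaluate the classification performance of each information gene subset and to select more reliable subsets. Compared with other tumor classification methods, the experimental results show clear advantages in both the number of information genes and classification performance. |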
Keywords: gene expression profile, rank-sum test, support vector machine, mean impact value, full-fold cross-validation |
DOI:10.3969/j.issn.1672-5565.2013-01.20130112 |
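Classification number: |
Funding: National Natural Science Foundation of China (11261066), Yunnan Province Applied Basic Research Project (2007A023M), Scientific Research Project of the Yunnan Provincial Department of Education (2012Y497). |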
|
Selection of human tumor information genes based on the support vector machine and mean impact value |
LI Ling-Bo, ZHANG Jing, CHEN Dan
|
(School of Mathematics and Statistics, Yunnan University, Kunming 650091, China)
|
Abstract: |
Selecting information genes for tumor classification from gene expression profiles is an important means of finding genes specifically expressed in tumors and of studying their expression patterns. Tumor diagnosis using classification information obtained from gene expression profiles is an important research field in bioinformatics and is expected to become a fast and effective method for the molecular diagnosis of tumors in clinical medicine. Because tumor gene expression profile data are high-dimensional, small in sample size, and noisy, an algorithm for finding information genes is proposed that combines the support vector machine (SVM) with the mean impact value (MIV). The advantage of this algorithm is that it can find multiple information gene subsets with as few genes as possible and as strong a classification capacity as possible. A binary-class tumor dataset is used to test the algorithm, and the results show that it is feasible and effective for tumor classification. For the colon cancer sample set, only 3 genes are needed to reach 100% accuracy under leave-one-out cross-validation (LOOCV). To prevent different partitions of the sample set from affecting classification performance, full-fold cross-validation is further used to assess the classification performance of the information gene subsets, and more reliable subsets are selected. Compared with other tumor classification methods, the results are superior in both the number of information genes and classification capacity. |
Key words: Gene Expression Profile, Rank-sum Test, Support Vector Machine, Mean Impact Value, Full-fold Cross Validation |
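
The abstract names the mean impact value (MIV) and leave-one-out cross-validation (LOOCV) without defining them, so a small illustration may help. The sketch below is not the authors' implementation. It assumes the commonly used MIV convention of scaling each feature up and down by 10% and averaging the change in model output over all samples. The fixed linear `decision` function stands in for a trained SVM decision function, the nearest-centroid classifier stands in for the SVM inside LOOCV, and the toy data and all function names are hypothetical.

```python
from statistics import mean

def mean_impact_values(decision, samples, delta=0.10):
    """MIV per feature: mean change in model output when that feature is
    scaled by (1 + delta) versus (1 - delta). delta=0.10 is an assumption."""
    mivs = []
    for j in range(len(samples[0])):
        diffs = []
        for x in samples:
            up, down = list(x), list(x)
            up[j] = x[j] * (1 + delta)
            down[j] = x[j] * (1 - delta)
            diffs.append(decision(up) - decision(down))
        mivs.append(mean(diffs))
    return mivs

def rank_by_miv(mivs):
    """Feature indices ordered by decreasing absolute MIV."""
    return sorted(range(len(mivs)), key=lambda j: abs(mivs[j]), reverse=True)

def fit_nearest_centroid(xs, ys):
    """Stand-in classifier (not an SVM): predict the class of the closest centroid."""
    centroids = {}
    for label in set(ys):
        rows = [x for x, y in zip(xs, ys) if y == label]
        centroids[label] = [mean(col) for col in zip(*rows)]
    def predict(x):
        return min(centroids, key=lambda c: sum((a - b) ** 2
                                                for a, b in zip(x, centroids[c])))
    return predict

def loocv_accuracy(fit, samples, labels):
    """Leave-one-out cross-validation: train on n-1 samples, test on the held-out one."""
    correct = 0
    for i in range(len(samples)):
        predict = fit(samples[:i] + samples[i + 1:], labels[:i] + labels[i + 1:])
        correct += predict(samples[i]) == labels[i]
    return correct / len(samples)

# Hypothetical toy data: 6 samples, 3 "genes"; only gene 0 separates the classes.
samples = [[1.0, 5.0, 2.0], [1.2, 4.0, 2.1], [0.9, 6.0, 1.9],
           [3.0, 5.0, 2.0], [3.2, 4.5, 2.1], [2.9, 5.5, 1.9]]
labels = [0, 0, 0, 1, 1, 1]

# Hypothetical linear decision function standing in for a trained SVM.
weights, bias = [2.0, 0.1, -0.5], -4.0
def decision(x):
    return sum(w * v for w, v in zip(weights, x)) + bias

mivs = mean_impact_values(decision, samples)
ranking = rank_by_miv(mivs)

# Keep only the top-ranked gene and estimate accuracy with LOOCV.
top = ranking[:1]
reduced = [[x[j] for j in top] for x in samples]
acc = loocv_accuracy(fit_nearest_centroid, reduced, labels)
print(ranking, acc)
```

In practice the decision function would come from an SVM trained on the expression data, and the ranking would be used to grow or prune candidate gene subsets before each subset is scored by cross-validation.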