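Abstract:
UniProt (https://www.uniprot.org/) is an internationally known protein database consisting of three main parts: the UniProtKB knowledgebase, the UniParc archive, and the UniRef reference sequence clusters. UniProtKB is the core of UniProt; besides protein sequence data, it contains extensive annotation. UniProtKB is divided into two sections, Swiss-Prot and TrEMBL. The more than 500 thousand sequences in Swiss-Prot are all manually reviewed and annotated, whereas the more than 140 million sequences in TrEMBL are translated from the protein-coding sequences in the nucleotide sequence database EMBL and annotated computationally according to defined rules. UniParc merges the same protein stored in different databases into a single record to avoid redundancy and assigns each sequence a unique identifier. UniRef groups the sequences in UniProtKB and UniParc into three datasets, UniRef100, UniRef90, and UniRef50, according to their degree of similarity. The UniProt website provides users with an efficient and practical advanced search system and extensive help documentation. Each UniProt release, issued every 4 weeks, is accompanied by statistical reports from which users can learn basic information such as data volume and updates, data categories, and species distribution, and can view statistics on general annotation, sequence feature annotation, and database cross-references. UniProt is currently the non-redundant protein sequence database with the most complete sequence data and the richest annotation in the world, and it has provided a valuable resource for the life sciences since its creation at the beginning of this century.
Keywords: database; protein sequence; protein function; database annotation; database cross-reference; advanced database search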
DOI: 10.12113/j.issn.1672-5565.201903005
CLC number: Q51; TP392
Document code: A
Funding:

A brief introduction to UniProt
LUO Jingchu

(College of Life Sciences, Peking University, Beijing 100871, China)

Abstract:
The Universal Protein Resource (UniProt, https://www.uniprot.org/) is a well-known protein database consisting of the UniProt knowledgebase (UniProtKB), the UniProt unique protein identifier archive (UniParc), and the UniProt reference sequence clusters (UniRef). UniProtKB is the core of the database and provides comprehensive annotations in addition to protein sequence data. UniProtKB/Swiss-Prot, the manually reviewed and annotated section of UniProtKB, has more than 500 thousand entries, while UniProtKB/TrEMBL contains more than 140 million unreviewed sequences that are translated from the coding sequences in the nucleotide database EMBL and annotated computationally according to defined rules. UniParc merges identical sequences stored in UniProtKB and other available protein sequence databases into a single record to avoid redundancy and gives each record a permanent, unique identifier. UniRef clusters the UniProtKB sequences and selected UniParc sequences into three sets, UniRef100, UniRef90, and UniRef50, according to their sequence identity. The UniProt website provides an easy-to-use, efficient interface for advanced search, together with extensive help documents. With each release, issued every four weeks, UniProt publishes statistics online that list the number of newly added and updated entries, the sequence types and their taxonomic sources, and counts of general annotations, sequence features, and database cross-references. Since it was established at the beginning of this century, UniProt has served the life science community as the most comprehensive, well-annotated, non-redundant, and freely accessible resource of protein sequence and function.
Key words: Database; Protein sequence; Protein function; Database annotation; Database cross-reference; Database query