EMBL Outstation, European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SD, UK
* To whom correspondence should be addressed. Tel: +44 1223 494453; Fax: +44 1223 494468; Email: ckanz{at}ebi.ac.uk
Received September 14, 2004; Revised October 6, 2004; Accepted October 14, 2004
ABSTRACT |
---|
TOP ABSTRACT INTRODUCTION SUBMISSIONS TO THE EMBL... DATA IN THE EMBL... ACCESSING THE EMBL NUCLEOTIDE... NEW DEVELOPMENTS CITING THE EMBL NUCLEOTIDE... CONTACTING THE EMBL DATABASE REFERENCES |
---|
INTRODUCTION |
---|
The mission of the Service Programme at the EBI is the building, maintenance and provision of biological databases and other information services to support data deposition and free access by the scientific community (1).
The EMBL Nucleotide Sequence Database (http://www.ebi.ac.uk/embl/) is Europe's primary nucleotide sequence resource. This database is the European part of an international collaboration with DDBJ (Japan) (2) and GenBank (USA) (3), the International Nucleotide Sequence Database Collaboration (INSDC). Data are exchanged on a daily basis between the collaborating institutes. The data in the EMBL Nucleotide Sequence Database originate from a combination of large-scale genome sequencing projects, direct submissions from individual scientists and the European Patent Office. The whole database is released quarterly, and new and updated records are distributed daily.
Over the last year, the size of the EMBL Nucleotide Sequence Database has increased from 27.2 million entries in Release 76, September 2003 to 42.3 million entries in Release 80, September 2004, of which 4.4 million entries are WGS (Whole Genome Shotgun) data. There are now over 185 000 organisms represented in the database.
In 2004, the limit on sequence length was dropped, the EMBLCDSs dataset containing all coding sequences annotated in the EMBL Nucleotide Sequence Database was launched, the data collection rules for Third Party Annotation (TPA) data were revised and the functionality of the Sequence Version Archive was extended further.
Other databases provided by the EBI include the protein resource UniProt (4), InterPro, a database of protein families, domains and functional sites (5), the Macromolecular Structure Database E-MSD (6), the automatic genome annotation database Ensembl (7), Genome Reviews, curated versions of complete genomes from the EMBL Database, the enzyme database IntEnz (8) and the database for protein interaction data, IntAct (9).
SUBMISSIONS TO THE EMBL NUCLEOTIDE SEQUENCE DATABASE |
---|
How to submit new sequences to the EMBL Nucleotide Sequence Database?
The primary tool for submission of nucleotide sequence data is Webin. For alignment data, it is Webin-Align. Projects with large-scale submissions can open a project account allowing direct updates.
Information for submitters can be found here: http://www.ebi.ac.uk/embl/Documentation/information_for_submitters.html. For submission guidelines please see http://www.ebi.ac.uk/embl/Submission/.
Webin
Webin is the preferred submission tool for nucleotide sequences and biological information. It should also be used for TPA submissions. Webin allows fast submissions of single, multiple and very large numbers of sequences (bulk submissions) and is available at http://www.ebi.ac.uk/embl/Submission/webin.html.
Genome project submissions
Large-scale sequencing projects can open a project account to deposit and update data directly using email or ftp. Groups producing large volumes of sequence data are advised to contact the database at datasubs{at}ebi.ac.uk. More information is available at http://www.ebi.ac.uk/embl/Submission/genomes.html.
Alignment submissions
Webin-Align (10) is the dedicated submission tool for multiple nucleotide and protein alignments. It accepts all common alignment formats and is available at http://www.ebi.ac.uk/embl/Submission/align_top.html.
WGS submissions
WGS data submission is not a continuous process: WGS datasets are normally not updated more often than once every few months. Email or ftp accounts are therefore not opened for the submission of WGS data; instead, submissions are handled case by case. Potential submitters are advised to contact the EMBL database at datasubs{at}ebi.ac.uk.
How to update entries in the EMBL Nucleotide Sequence Database?
The editorial rights to an entry in the EMBL Nucleotide Sequence Database remain with the original submitter(s). The EBI team adds value to entries, e.g. via cross-references, but the data itself is archival and is not updated by the EBI. Submitters are advised to update their own entries via the update form (http://www.ebi.ac.uk/embl/webin/update.html).
DATA IN THE EMBL NUCLEOTIDE SEQUENCE DATABASE |
---|
Whole Genome Shotgun (WGS) data
WGS methods are used to obtain a large amount of genome coverage for an organism. The sequences of all contigs originating from one experiment are grouped in a set. WGS entries have the standard EMBL format, with accession numbers clearly distinct from those of non-WGS entries. The accession numbers of all entries in each WGS set share the same prefix.
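Because every entry in a WGS set shares one accession prefix, entries can be sorted into sets by inspecting the accession number alone. The following is a minimal Python sketch of that idea. It assumes a WGS accession layout of a four-letter project prefix, a two-digit version and six or more digits, which is not specified in this article; the function name and the WGS accessions used in the usage note are illustrative only.

```python
import re
from collections import defaultdict

# Assumed layout (not stated in the article): four-letter project prefix,
# two-digit assembly version, then six or more digits, e.g. AACL01000001.
WGS_ACCESSION = re.compile(r"^([A-Z]{4})(\d{2})(\d{6,})$")


def group_by_wgs_prefix(accessions):
    """Split accessions into WGS sets keyed by prefix, plus non-WGS leftovers."""
    wgs_sets = defaultdict(list)
    others = []
    for accession in accessions:
        match = WGS_ACCESSION.match(accession)
        if match:
            # All members of one WGS set share the same four-letter prefix.
            wgs_sets[match.group(1)].append(accession)
        else:
            others.append(accession)
    return dict(wgs_sets), others
```

For example, `group_by_wgs_prefix(["AACL01000001", "AACL01000002", "AE017199"])` places the two illustrative AACL accessions in one set and leaves AE017199, an entry in the main section, among the non-WGS accessions.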
Third Party Annotation (TPA) data
The Third Party Annotation data set was launched in response to requests from the research community to submit entries that include either re-annotation of existing data, or combinations of novel sequence, existing primary sequence, trace archive and WGS data.
To distinguish TPA entries from primary data, the abbreviation TPA appears at the beginning of each description (DE) line and in the keyword list. The link to the primary data information is given in the linetypes AH and AS that have been created for TPA entries. The following flatfile extract is taken from entry BN000024:
|
The format of a CON entry is similar to that of a standard entry, with the additional CO linetype to accommodate the assembly information. A CON entry does not have any annotation apart from source features.
The following example of an assembly is taken from entry BX470249:
EMBLCDSs dataset
Following requests from database users, a new subset of EMBL data, the EMBLCDSs database, has been created during the last year. Every CDS (coding sequence) feature annotated in EMBL entries is displayed as a single entry.
More details are provided in the New Developments section below.
ACCESSING THE EMBL NUCLEOTIDE SEQUENCE DATABASE |
---|
Sequence Retrieval System (SRS)
The EMBL Nucleotide Sequence Database can be accessed via the EBI SRS server (11,12) at http://srs.ebi.ac.uk/. In SRS, the data are available in the libraries shown in Table 1.
SRS also links to other databases, with cross-references to UniProt and publications available online, for example.
FTP Server
Release data, daily updates and cumulative files of all data types can be freely obtained from the ftp server at ftp://ftp.ebi.ac.uk/pub/databases/embl/. Please see the README file for further information.
To create and maintain a local copy of the cumulative file, the syncron tool (ftp://ftp.ebi.ac.uk/pub/software/unix/listtools/) can be used to automatically download newly available incremental data files from the ftp site and merge them locally.
Dbfetch
Dbfetch (database fetch) is a tool for simple sequence retrieval via http. It can be used to retrieve up to 50 entries from various databases. Dbfetch can be found at http://www.ebi.ac.uk/cgi-bin/dbfetch.
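Since Dbfetch works over plain http and accepts at most 50 entries per request, longer accession lists have to be split into batches. The Python sketch below builds one request URL per batch without contacting the server. The base address and the 50-entry limit come from the text above; the query parameter names (`db`, `id`, `format`, `style`) and the default values are assumptions that should be checked against the Dbfetch documentation, and the function name is hypothetical.

```python
from urllib.parse import urlencode

DBFETCH_URL = "http://www.ebi.ac.uk/cgi-bin/dbfetch"  # address given in the text
MAX_ENTRIES = 50  # per-request limit given in the text


def dbfetch_urls(ids, db="embl", fmt="fasta"):
    """Return one Dbfetch URL per batch of at most MAX_ENTRIES identifiers.

    The parameter names db, id, format and style are assumed, not confirmed
    by the article.
    """
    ids = list(ids)
    if not ids:
        raise ValueError("at least one identifier is required")
    urls = []
    for start in range(0, len(ids), MAX_ENTRIES):
        batch = ids[start:start + MAX_ENTRIES]
        query = urlencode(
            {"db": db, "id": ",".join(batch), "format": fmt, "style": "raw"},
            safe=",",  # keep the comma-separated identifier list readable
        )
        urls.append(f"{DBFETCH_URL}?{query}")
    return urls
```

Each returned URL can then be fetched with any http client; a list of 120 accession numbers, for instance, yields three URLs covering 50, 50 and 20 entries.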
Wsdbfetch provides programmatic access to the Dbfetch functionality. The service is described using Web Services Description Language (WSDL) and uses the Simple Object Access Protocol (SOAP) to communicate with other systems. For further information on Wsdbfetch please see http://www.ebi.ac.uk/Tools/webservices/WSDbfetch.html.
EMBL Sequence Version Archive
The EMBL Sequence Version Archive (SVA) (13) is a repository of all versions of any entry that have been distributed to the public from the EMBL Nucleotide Sequence Database. An interactive web-based interface to the SVA can be accessed at http://www.ebi.ac.uk/cgi-bin/sva/sva.pl.
Entries from the SVA can also be retrieved using Dbfetch.
Completed genome sequences
Direct access to completely sequenced genomic components is available via the EBI Genomes server at http://www.ebi.ac.uk/genomes/. At the time of writing (September 2004) there are 162 completed genomes of bacteria, 19 archaea, 36 eukaryota, 540 organelles, 136 phages, 204 plasmids, 903 viruses and 36 viroids available.
Sequence searching
A comprehensive set of sequence analysis and database search algorithms is available at http://www.ebi.ac.uk/Tools/. The most commonly used algorithms available are FASTA (14) and WU-BLAST (15), permitting comparisons between query sequences and the nucleotide, translated nucleotide and protein databases.
Sequence similarity searches are available interactively over the WWW as well as by email. Instructions for email searches can be obtained by sending a message with the word HELP in its body to gpfasta{at}ebi.ac.uk.
Access via email
Data can also be retrieved by email using netserv (netserv{at}ebi.ac.uk). To get started send an email to netserv{at}ebi.ac.uk with HELP in the message body.
NEW DEVELOPMENTS |
---|
Third Party Annotations: new rules
Following a decision taken at the 2004 Collaborative Meeting, the INSD Collaboration has increased the stringency for acceptance of data into the TPA dataset. The aim is to ensure that the TPA dataset includes the highest quality sequence and biological annotation.
To achieve this aim, the similarity between the TPA sequence and the contributing primary sequences is checked at the time of submission. We aim to achieve a similarity of at least 90%. In addition, there can be no more than 50 bp of the TPA sequence that does not correspond to primary entry(ies). All TPA records are manually curated and checked prior to public release.
To be released into the public TPA dataset, entries must also meet the following requirements:
Further details may be found at http://www.ebi.ac.uk/embl/Documentation/third_party_annotation_dataset.html and http://www.ebi.ac.uk/webin/webin_help.html.
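The two submission-time thresholds described above (similarity of at least 90% and no more than 50 bp of TPA sequence without a corresponding primary entry) can be expressed as a simple check. The Python sketch below is an illustration under stated assumptions, not the procedure used by the database: the article does not say how similarity is computed, so it is taken as an input, and primary-entry coverage is assumed to be given as 1-based inclusive spans on the TPA sequence. Both function names are hypothetical.

```python
def uncovered_bases(tpa_length, primary_spans):
    """Count TPA positions not covered by any primary span.

    Spans are assumed to be (start, end) pairs, 1-based and inclusive,
    in TPA sequence coordinates; overlapping spans are counted once.
    """
    covered = 0
    last_end = 0
    for start, end in sorted(primary_spans):
        start = max(start, last_end + 1)  # skip positions already counted
        end = min(end, tpa_length)        # ignore anything past the sequence end
        if end >= start:
            covered += end - start + 1
            last_end = end
    return tpa_length - covered


def meets_tpa_thresholds(tpa_length, primary_spans, similarity_percent,
                         min_similarity=90.0, max_uncovered=50):
    """Apply the 90% similarity and 50 bp uncovered limits from the text."""
    return (similarity_percent >= min_similarity
            and uncovered_bases(tpa_length, primary_spans) <= max_uncovered)
```

A 1000 bp TPA sequence covered by spans (1, 400) and (380, 960) has 40 uncovered bases and would pass at 95% similarity, whereas spans (1, 400) and (480, 1000) leave 79 bases uncovered and would fail regardless of similarity.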
EMBL Sequence Version Archive: extended functionality
In February 2004, batch retrieval functionality was added to the SVA. Multiple entries can now be retrieved by supplying a list of accession numbers with either entry version number, sequence version number (user-indicated in the interface) or no version details for the most recent entry.
By the end of 2004, expanded CON entries will be included in the SVA.
A warning has been added to report the suppression date for entries that have been suppressed in the database.
EMBLCDSs dataset
Following requests from database users, a new subset of EMBL data, EMBLCDSs database, has been created during the year. Every CDS (coding sequence) feature annotated in EMBL entries is displayed as a single entry.
Entries are presented in an EMBL-like flatfile format, with addition of new line types (Figure 1).
The EMBLCDSs dataset is available via SRS [library: EMBL (Coding Sequences)] and ftp (ftp://ftp.ebi.ac.uk/pub/databases/embl/cds).
Finishing whole genome shotgun sets
Data from the WGS projects where the sequencing and assembling process is finished are moved into the main section of the database. At the time of writing only 5 out of 120 relatively small projects have been finished (example: Nanoarchaeum equitans Kin4-M, WGS project prefix: AACL, newly created entry in the main section: AE017199). In all cases, accession numbers of the WGS entries are added as secondary accession numbers to newly created entries in the main section to help track the data.
XML format
The International Nucleotide Sequence Database Collaboration (INSDC) has adopted a first draft of a common XML format for nucleotide data. The DTD can be found at http://www.ebi.ac.uk/embl/Documentation/DTD/INSDSeq_v1.dtd.txt.
CITING THE EMBL NUCLEOTIDE SEQUENCE DATABASE |
---|
CONTACTING THE EMBL DATABASE |
---|
Postal address: EMBL Nucleotide Sequence Database, European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SD, UK.
Telephone: data submissions, +44 1223 494499; general, +44 1223 494444.
Fax: general, +44 1223 494468.
Notes |
---|
REFERENCES |
---|