The RCSB Protein Data Bank: a redesigned query system and relational database based on the mmCIF schema -- Deshpande et al. 33 (Supplement 1): D233 -- Nucleic Acids Research

The RCSB Protein Data Bank: a redesigned query system and relational database based on the mmCIF schema

Nita Deshpande¹, Kenneth J. Addess¹, Wolfgang F. Bluhm¹, Jeffrey C. Merino-Ott¹, Wayne Townsend-Merino¹, Qing Zhang¹, Charlie Knezevich¹, Lie Xie¹, Li Chen³, Zukang Feng³, Rachel Kramer Green³, Judith L. Flippen-Anderson³, John Westbrook³, Helen M. Berman³ and Philip E. Bourne¹,²,*

¹ San Diego Supercomputer Center and ² Department of Pharmacology, University of California San Diego, 9500 Gilman Drive, La Jolla, CA 92093, USA and ³ Department of Chemistry and Chemical Biology, Rutgers, The State University of New Jersey, 610 Taylor Road, Piscataway, NJ 08854-8087, USA

* To whom correspondence should be addressed. Tel: +1 858 534 8301; Fax: +1 858 822 0873; Email: bourne@sdsc.edu

Received September 15, 2004; Revised and Accepted October 1, 2004

ABSTRACT

The Protein Data Bank (PDB) is the central worldwide repository for three-dimensional (3D) structure data of biological macromolecules. The Research Collaboratory for Structural Bioinformatics (RCSB) has completely redesigned its resource for the distribution and query of 3D structure data. The re-engineered site is currently in public beta test at http://pdbbeta.rcsb.org. The new site expands the functionality of the existing site by providing structure data in greater detail and uniformity, improved query and enhanced analysis tools. A new key feature is the integration and searchability of data from over 20 other sources covering genomic, proteomic and disease relationships. The current capabilities of the re-engineered site, which will become the RCSB production site at http://www.pdb.org in late 2005, are described.

INTRODUCTION

The production version of the Research Collaboratory for Structural Bioinformatics Protein Data Bank (RCSB PDB) (http://www.pdb.org) is mirrored at seven sites around the world and has been described previously (1,2). In order to improve the accessibility of the PDB's structure data, display the increased level of detail and improved consistency resulting from the RCSB PDB data uniformity project (3,4), and take advantage of advances in database and Web/Internet technologies, the RCSB PDB has re-engineered its database and redesigned the associated website. This site is now available for beta testing at http://pdbbeta.rcsb.org and will henceforth be referred to as PDB Beta. The following features of PDB Beta are introduced here: software architecture, database content and schema, data integration from other sources, and query and analysis capabilities expanded from those reported previously (5).

CONTENT

Software architecture, database content and schema
The PDB Beta has been redesigned using an Enterprise Java framework and is composed of three tiers: an underlying relational database, a presentation tier designed in collaboration with users and an object-relational J2EE middle tier based on Hibernate. The PDB Beta has been tested with both MySQL and IBM DB2 relational database tiers. The current system uses MySQL, which enables unlimited distribution of the primary and secondary data in the RCSB PDB database. Moreover, the current system makes extensive use of freely distributable Java components (Table 1).

Table 1. Software components used by PDB Beta

The database uses an mmCIF-based (6) schema derived from the PDB Exchange Dictionary (7). Data are loaded from either XML (8) or mmCIF data files with their associated remediated and extended content. Data file parser/structure loaders are SQL-92 compliant and therefore independent of the backend database. Data files are parsed and loaded weekly at the same time as the current production site is updated. The design of this system along with its local weekly data updates will facilitate local distribution and use.
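
As a rough illustration of how an mmCIF-based schema can map onto relational tables, the following Python sketch turns each mmCIF category into a table and each data item into a column. It is not the PDB Beta loader, which is written in Java; the entry `0XYZ`, its values, and the helper names `parse_items` and `load` are hypothetical, and only simple single-line data items are handled (no `loop_` blocks).

```python
import re
import shlex
import sqlite3

# Made-up entry "0XYZ": the values are illustrative, not real PDB data.
SAMPLE_MMCIF = """\
data_0XYZ
_entry.id 0XYZ
_exptl.entry_id 0XYZ
_exptl.method 'X-RAY DIFFRACTION'
_refine.entry_id 0XYZ
_refine.ls_d_res_high 2.50
"""

# Only plain identifiers are accepted as table or column names.
IDENTIFIER = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def parse_items(text):
    """Yield (category, item, value) for single-line '_category.item value' lines."""
    for line in text.splitlines():
        line = line.strip()
        if not line.startswith("_"):
            continue  # skip the data_ header and anything that is not a data item
        tokens = shlex.split(line)  # shlex keeps quoted values together
        if len(tokens) != 2:
            continue  # multi-line or malformed items are ignored in this sketch
        name, value = tokens
        category, _, item = name[1:].partition(".")
        yield category, item, value


def load(conn, text):
    """Create one table per category and insert one row per category."""
    rows = {}
    for category, item, value in parse_items(text):
        rows.setdefault(category, {})[item] = value
    for category, items in rows.items():
        columns = list(items)
        for identifier in [category, *columns]:
            if not IDENTIFIER.fullmatch(identifier):
                raise ValueError(f"unsupported identifier: {identifier!r}")
        column_defs = ", ".join(f'"{c}" TEXT' for c in columns)
        column_list = ", ".join(f'"{c}"' for c in columns)
        placeholders = ", ".join("?" for _ in columns)
        conn.execute(f'CREATE TABLE IF NOT EXISTS "{category}" ({column_defs})')
        conn.execute(
            f'INSERT INTO "{category}" ({column_list}) VALUES ({placeholders})',
            [items[c] for c in columns],
        )


if __name__ == "__main__":
    connection = sqlite3.connect(":memory:")
    load(connection, SAMPLE_MMCIF)
    print(connection.execute("SELECT entry_id, method FROM exptl").fetchall())
```

The sketch uses SQLite only because it ships with Python; the statements are kept to basic `CREATE TABLE` and `INSERT` forms in the spirit of the backend independence described above.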

Data integration from other sources
From a user's perspective, macromolecular structure does not exist in isolation, but is generally associated with an inquiry that might include relationships to genomic and proteomic sequence, biological function, cellular location and disease. While the focus of the RCSB PDB remains on fully exposing the features of macromolecular structures, a wider spectrum of inquiry is now possible. This is achieved through the weekly collection and integration (warehousing) of external data. For example, data from the Gene Ontology (GO) (9), Enzyme Commission (http://www.chem.qmul.ac.uk/iubmb/enzyme/), KEGG Pathways (10,11) and NCBI resources (including LocusLink, OMIM, SNP and BookShelf) (12) are mapped onto structures and loaded into the database. This is achieved as follows.

Our ongoing data uniformity efforts enable accurate assignment of external database references to structures in the RCSB PDB database; these include identifiers from Swiss-Prot (13), GenBank, PubMed, EC numbers and the taxonomy of the source organism. These references are used to locate information in a further set of databases. For example, Swiss-Prot identifiers were used to assign GO terms from the Gene Ontology Consortium to structures. Swiss-Prot and GenBank identifiers were also used to obtain genome information: gene name, chromosome location, structural genomics targets (14) and OMIM numbers for structures. Figure 1 summarizes the rich and varied linkages that have been established between structure data in the RCSB PDB database and data from external biological databases. Note that many of these data are related to structure through a one-to-many relationship since a structure consists of one or more components, such as multiple polypeptide chains. The representation of structures as a number of constituent components, each with external data assignments, is an ongoing effort at the RCSB PDB.

Figure 1. Primary and secondary references assigned to structures. The primary references are assigned during structure annotation/data curation. Secondary references are collected from external databases using the primary reference identifiers and accession numbers. This is rerun on a weekly basis for new and all existing structures and stored in the database.

Data loaders written in Java access the external databases, parse the files and load relevant derived information into the database. In some instances additional external information is retrieved at query run time. For example, KEGG pathways associated with a given EC number are retrieved by issuing a Web service call to the KEGG database at query run time. Under an agreement with the US National Library of Medicine, PubMed identifiers for the primary citation associated with a structure are used to load the PubMed abstracts into the RCSB PDB database. In doing so, abstracts can be searched by keyword(s) as an alternative means to find structures of interest.

As a final example of how linkages between incorporated data were generated, consider the relationship between structure and disease (Figure 2). The OMIM text was searched for disease terms obtained from the chapter and section headings in the online book Genes and Disease from the NCBI Bookshelf (http://www.ncbi.nlm.nih.gov/books/bv.fcgi?call=bv.View..ShowSection&rid=gnd.preface.91). The disease term and the OMIM numbers returned for the term were loaded into a relational table, as was a mapping of structures to OMIM numbers from the Swiss-Prot site. The two tables were joined using OMIM numbers and the disease term was thus mapped onto structures. These relationships have enabled the implementation of a hierarchical disease tree suitable for browsing. For example, users can locate all the PDB structures identified as being associated with cancer and drill down to find only those associated with breast cancer.

Figure 2. Procedure to create structure-OMIM-disease mapping. Disease terms listed in the NCBI book Genes and Disease are used to mine the OMIM text for OMIM numbers associated with the terms. This Disease-OMIM mapping is loaded into the database. Performing a join on the Disease-OMIM table created above and the OMIM-Structure mapping table results in Disease-Structure mapping.
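
The join described in Figure 2 can be sketched with two small relational tables. This is a minimal illustration, not the production schema: the table and column names (`disease_omim`, `omim_structure`, `disease_term`, `omim_id`, `pdb_id`) are assumptions, and the OMIM numbers and structure identifiers are placeholders rather than real records.

```python
import sqlite3


def build_demo_db():
    """Create the two hypothetical mapping tables and fill them with placeholder rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE disease_omim (disease_term TEXT, omim_id INTEGER);
        CREATE TABLE omim_structure (omim_id INTEGER, pdb_id TEXT);
        """
    )
    # Disease term -> OMIM numbers, as mined from the OMIM text (placeholder values).
    conn.executemany(
        "INSERT INTO disease_omim VALUES (?, ?)",
        [("breast cancer", 900001), ("breast cancer", 900002), ("asthma", 900003)],
    )
    # OMIM number -> structure, as taken from the Swiss-Prot mapping (placeholder values).
    conn.executemany(
        "INSERT INTO omim_structure VALUES (?, ?)",
        [
            (900001, "XAAA"),
            (900002, "XBBB"),
            (900002, "XAAA"),
            (900003, "XCCC"),
            (900004, "XDDD"),
        ],
    )
    return conn


def structures_for(conn, disease_term):
    """Join the two tables on the OMIM number to map a disease term onto structures."""
    rows = conn.execute(
        """
        SELECT DISTINCT s.pdb_id
        FROM disease_omim AS d
        JOIN omim_structure AS s ON s.omim_id = d.omim_id
        WHERE d.disease_term = ?
        ORDER BY s.pdb_id
        """,
        (disease_term,),
    )
    return [pdb_id for (pdb_id,) in rows]


if __name__ == "__main__":
    demo = build_demo_db()
    print(structures_for(demo, "breast cancer"))
```

`DISTINCT` matters here because one structure can be reached through several OMIM numbers for the same disease term.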

Query and analysis
Browsing
A major feature of PDB Beta is the ability to browse database content. Much of the data now integrated with structure is hierarchical and lends itself to display via tree browsers. When each node in a browser tree is moused over, the number of associated structures at that branch is displayed. This gives the user a sense of the size of the result set, even before a query is made. As some data that are browsed are not strictly hierarchical, concessions are made. An example of this concession is a protein chain that has been associated with multiple GO molecular functions. This protein chain would therefore appear multiple times in the browser tree even though it is only associated with a single structure. Browsers are also useful in reverse; knowing the location of a structure in a tree reveals its place in the hierarchy. For example, entering Homo sapiens in the taxonomy browser will locate the structures for which this has been identified as a source organism, and highlight where humans are located in the tree of life, at least according to the NCBI classification scheme. The results of browsing can be used as a starting point for query refinement using the SearchFields interface outlined below.
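
The per-node structure counts shown in a tree browser can be approximated as follows. This is a sketch under stated assumptions, not the PDB Beta implementation: `node_counts` is a hypothetical helper, the paths and structure identifiers are placeholders, and counting each structure once per branch (by collecting identifiers in sets) is a design choice made here for illustration.

```python
from collections import defaultdict


def node_counts(assignments):
    """Return {node path: number of distinct structures at or below that node}.

    `assignments` is an iterable of (path, structure_id) pairs, where `path`
    is a tuple of node names from the root down to the assigned node.
    """
    members = defaultdict(set)
    for path, structure_id in assignments:
        # A structure assigned to a node also belongs to every ancestor of that node.
        for depth in range(1, len(path) + 1):
            members[path[:depth]].add(structure_id)
    return {node: len(ids) for node, ids in members.items()}


# Placeholder assignments; "XBBB" sits under two branches of the same parent.
DEMO_ASSIGNMENTS = [
    (("cancer", "breast cancer"), "XAAA"),
    (("cancer", "breast cancer"), "XBBB"),
    (("cancer", "colon cancer"), "XBBB"),
    (("cancer",), "XCCC"),
]

if __name__ == "__main__":
    for node, count in sorted(node_counts(DEMO_ASSIGNMENTS).items()):
        print(" > ".join(node), count)
```

Because identifiers are stored in sets, a structure that appears under several child nodes is still counted once at their shared parent.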

Searching
The PDB Beta has retained the search capability of the RCSB PDB production site but, based on user feedback, provides more intuitive interfaces and additional Website navigation tools. Sequence searching functionality via BLAST (15) has also been added.

SearchLite is based on the Lucene text indexing and search engine, and uses the content of the mmCIF versions of the PDB structures, which provide indexes of terms not available in the original PDB files. For example, at the time of submission it may not have been known that the structure was associated with apoptosis. When the term is later added to the list of Swiss-Prot keywords, it will be accessible through SearchLite even though there was no reference in the original PDB file. SearchLite can be considered as an inclusive rather than an exclusive search engine, since it produces results that may require further query refinement. This feature is also available on the current production site.
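
The effect of indexing later-added annotation can be illustrated with a toy inverted index. This plain-Python stand-in is not Lucene and does not reflect how SearchLite is configured; the class `TermIndex`, the structure identifier and the indexed text are all hypothetical.

```python
import re
from collections import defaultdict

TOKEN = re.compile(r"[a-z0-9]+")


class TermIndex:
    """Toy inverted index: term -> set of structure identifiers."""

    def __init__(self):
        self._postings = defaultdict(set)

    def add(self, structure_id, text):
        """Index every lowercased word of `text` under `structure_id`."""
        for term in TOKEN.findall(text.lower()):
            self._postings[term].add(structure_id)

    def search(self, query):
        """Return the structures that contain every term of the query."""
        terms = TOKEN.findall(query.lower())
        if not terms:
            return set()
        result = set(self._postings.get(terms[0], set()))
        for term in terms[1:]:
            result &= self._postings.get(term, set())
        return result


if __name__ == "__main__":
    index = TermIndex()
    # Text available at deposition time (placeholder content).
    index.add("XAAA", "hypothetical protease structure")
    print(index.search("apoptosis"))  # nothing indexed under this term yet
    # A keyword attached to the entry later is indexed under the same identifier.
    index.add("XAAA", "apoptosis")
    print(index.search("apoptosis"))
```

Re-indexing an entry whenever its associated keywords change is what makes the later term searchable without any change to the originally deposited file.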

StatusSearch continues to provide information on deposited but unreleased structures. Sequence data are available ahead of a structure's release for some entries, which is useful for theoretical modeling and avoiding duplication of effort.

SearchFields, an interface for performing advanced queries, has been enhanced to include the full extent of the experimental information that is collected and much of the integrated information outlined above. Particular attention has been paid to NMR structures. A new SearchFields option is to search for structures with specific NMR experimental parameters such as refinement method, selection criteria, spectrometer details and sample conditions.

Searches performed during a session are recorded and can be recalled and rerun or modified and rerun. Since the result of one query may trigger a new line of inquiry, we have extended the notion of query by example in which a result from one query can be used as a search term in a subsequent query. For example, a search for the structure with the PDB identifier 1AEW displays the Structure Explorer page for the iron storage molecule ferritin (16). According to the GO term for the single polypeptide chain in the asymmetric unit of this structure, it is assigned a molecular function of ‘ferric iron binding’. Clicking on this term on the Structure Explorer page will reveal all other structures in the PDB associated with the same molecular function as defined by the same GO term.

A histogram feature in the early stages of development is applied to quantitative data and is accessible from the results display. For example, suppose an X-ray structure is reported with a resolution of 2.5 Å. A novice user may be interested in how that resolution compares with the contents of the complete database. An icon next to the resolution on the structure page can be selected to present a graphical distribution of the resolutions of all the structures in the PDB. Specific ranges can be expanded and the structures in a selected resolution range displayed in a report format.
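
A histogram of this kind amounts to binning one quantitative field and then listing the entries that fall in a chosen bin. The sketch below assumes a bin width of 0.5 Å and uses placeholder identifiers and resolutions; `resolution_histogram` and `in_range` are hypothetical helpers, not part of PDB Beta.

```python
import math
from collections import Counter

# Placeholder entries: structure identifier -> resolution in angstroms.
DEMO_RESOLUTIONS = {
    "XAAA": 1.2,
    "XBBB": 1.8,
    "XCCC": 2.0,
    "XDDD": 2.1,
    "XEEE": 2.5,
    "XFFF": 2.5,
    "XGGG": 2.9,
    "XHHH": 3.4,
}


def resolution_histogram(resolutions, bin_width=0.5):
    """Return {lower bin edge: count} for bins [edge, edge + bin_width)."""
    counts = Counter(math.floor(r / bin_width) for r in resolutions)
    return {round(i * bin_width, 10): counts[i] for i in sorted(counts)}


def in_range(entries, low, high):
    """List the identifiers whose resolution lies in [low, high)."""
    return sorted(pdb_id for pdb_id, res in entries.items() if low <= res < high)


if __name__ == "__main__":
    histogram = resolution_histogram(DEMO_RESOLUTIONS.values())
    for edge, count in histogram.items():
        print(f"{edge:.1f} to {edge + 0.5:.1f}: {'#' * count}")
    # Expanding one bar of the histogram into a report of its structures.
    print(in_range(DEMO_RESOLUTIONS, 2.5, 3.0))
```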

The PDB Beta also includes ~1000 curated Web pages, which have been made more accessible through site searching and indexing. All features are in the process of being better documented and made accessible through a context-sensitive help system based on the RoboHelp tool.

Results reporting
Query results are presented on the Structure Explorer page for a single structure or in the Query Results Browser for a set of structures. The Structure Explorer page presents data in a format similar to that of a scientific paper. This page can be printed as a PDF. Table 2 highlights the new features that are included beyond those available from the current production site. Of the new reports, the Materials and Methods section is customized based on the experiment type. Structure Explorer pages list crystallization, diffraction and refinement information for X-ray structures; and NMR experiment, refinement and ensemble information for NMR entries.

Table 2. New results reported from PDB Beta

Molecular viewers
Recognizing the difficulties that users may have installing the existing molecular viewers, four new, general purpose molecular viewers have been added to the PDB Beta: KiNG, Jmol, SimpleViewer and WebMol applets. KiNG, Jmol and WebMol require Java-enabled browsers; SimpleViewer requires the installation of Java3D to provide high-quality rendering. SimpleViewer is an example of an application built from the Molecular Biology Toolkit (MBT) (http://mbt.sdsc.edu). The concept behind MBT is to deliver simple context-sensitive molecular graphics applications at the appropriate point in a query. For example, a ligand viewer has been implemented that provides a detailed view of the interaction between a macromolecule and a ligand (HET group in original PDB terminology), but is not designed to be a general-purpose molecular viewer. A SNPviewer, indicating where non-synonymous single nucleotide polymorphisms (SNPs) are mapped onto structures, is another example of the use of a context-sensitive viewer that has been added to the PDB Beta.

Distribution system
The RCSB production and PDB Beta sites distribute data files in the PDB, mmCIF and XML formats. Data for sequences and complete structural descriptions are available in uncompressed as well as various compressed formats. The PDB data can also be obtained using CORBA and Web services. A CORBA server may be established using C++ (http://deposit.pdb.org/mmcif/FILM/) or the Java OpenMMS software (http://openmms.sdsc.edu) (17). Web services, which are currently in the early implementation stage, will allow users to use XML and SOAP to perform queries and retrieve results programmatically from the PDB Beta. The PDB Beta Web Services Description Language (WSDL) file is available at http://pdbbeta.rcsb.org/jboss-net/services/pdbWebService?wsdl.

CONCLUSION

While the RCSB PDB's primary mandate continues to be the delivery of high-quality structure data in a timely manner, our services are being expanded. Recognizing that structure exists as a point on a spectrum of biological inquiry, more integrated access to structure data is being provided. Forthcoming deliverables include the ability to use PDB Beta locally to manage private copies of the PDB data on a laptop or larger computer, and a web interface that can be customized for individual user preferences, either locally or on the RCSB's PDB servers. Comments and suggestions are always welcome and can be sent by email to betafeedback@rcsb.org. Upon completion of the beta testing, anticipated to be in late 2005, the re-engineered site will be available as the PDB production site (http://www.pdb.org).

ACKNOWLEDGEMENTS

The RCSB PDB is operated by Rutgers, The State University of New Jersey; the San Diego Supercomputer Center (SDSC) at the University of California San Diego (UCSD); and the Center for Advanced Research in Biotechnology (CARB/UMBI/NIST), three members of the Research Collaboratory for Structural Bioinformatics. This work is supported by grants from National Science Foundation (NSF), National Institute of General Medical Sciences (NIGMS), Office of Science, Department of Energy (DOE), National Library of Medicine (NLM), National Cancer Institute (NCI), National Center for Research Resources (NCRR), National Institute of Biomedical Imaging and Bioengineering (NIBIB) and National Institute of Neurological Disorders and Stroke (NINDS). The RCSB PDB is a member of wwPDB. We wish to thank the many users who have provided input into the development of the re-engineered RCSB PDB.

Notes

The online version of this article has been published under an open access model. Users are entitled to use, reproduce, disseminate, or display the open access version of this article for non-commercial purposes provided that: the original authorship is properly and fully attributed; the Journal and Oxford University Press are attributed as the original place of publication with the correct citation details given; if an article is subsequently reproduced or disseminated not in its entirety but only in part or as a derivative work this must be clearly indicated. For commercial re-use permissions, please contact journals.permissions@oupjournals.org.

REFERENCES

1. Berman,H.M., Westbrook,J., Feng,Z., Gilliland,G., Bhat,T.N., Weissig,H., Shindyalov,I.N. and Bourne,P.E. (2000) The Protein Data Bank. Nucleic Acids Res., 28, 235–242.
2. Bernstein,F.C., Koetzle,T.F., Williams,G.J.B., Meyer,E.F.,Jr, Brice,M.D., Rodgers,J.R., Kennard,O., Shimanouchi,T. and Tasumi,M. (1977) Protein Data Bank: a computer-based archival file for macromolecular structures. J. Mol. Biol., 112, 535–542.
3. Westbrook,J., Feng,Z., Jain,S., Bhat,T.N., Thanki,N., Ravichandran,V., Gilliland,G.L., Bluhm,W., Weissig,H., Greer,D.S. et al. (2002) The Protein Data Bank: unifying the archive. Nucleic Acids Res., 30, 245–248.
4. Bhat,T.N., Bourne,P., Feng,Z., Gilliland,G., Jain,S., Ravichandran,V., Schneider,B., Schneider,K., Thanki,N., Weissig,H. et al. (2001) The PDB data uniformity project. Nucleic Acids Res., 29, 214–218.
5. Bourne,P.E., Addess,K.J., Bluhm,W.F., Chen,L., Deshpande,N., Feng,Z., Fleri,W., Green,R., Merino-Ott,J.C., Townsend-Merino,W., Weissig,H., Westbrook,J. and Berman,H.M. (2004) The distribution and query systems of the RCSB Protein Data Bank. Nucleic Acids Res., 32, D223–D225.
6. Bourne,P.E., Berman,H.M., Watenpaugh,K., Westbrook,J.D. and Fitzgerald,P.M.D. (1997) The macromolecular Crystallographic Information File (mmCIF). Methods Enzymol., 277, 571–590.
7. Westbrook,J., Henrick,K., Ulrich,E.L. and Berman,H.M. (2004) Definition and exchange of crystallographic data. In International Tables for Crystallography. Kluwer Academic Publishers, Dordrecht, The Netherlands, Vol. G (in press).
8. Westbrook,J., Ito,N., Nakamura,H., Henrick,K. and Berman,H.M. (2004) PDBML: the representation of archival macromolecular structure data in XML. Bioinformatics, doi:10.1093/bioinformatics/bti082.
9. The Gene Ontology Consortium (2000) Gene Ontology: tool for the unification of biology. Nature Genet., 25, 25–29.
10. Kanehisa,M. (1997) A database for post-genome analysis. Trends Genet., 13, 375–376.
11. Kanehisa,M. and Goto,S. (2000) KEGG: Kyoto Encyclopedia of Genes and Genomes. Nucleic Acids Res., 28, 27–30.
12. Wheeler,D.L., Church,D.M., Edgar,R., Federhen,S., Helmberg,W., Madden,T.L., Pontius,J.U., Schuler,G.D., Schriml,L.M., Sequeira,E. et al. (2004) Database resources of the National Center for Biotechnology Information: update. Nucleic Acids Res., 32, D35–D40.
13. Bairoch,A. and Apweiler,R. (2000) The SWISS-PROT protein sequence database and its supplement TrEMBL in 2000. Nucleic Acids Res., 28, 45–48.
14. Chen,L., Oughtred,R., Berman,H.M. and Westbrook,J. (2004) TargetDB: a target registration database for structural genomics projects. Bioinformatics, 20, 2860–2862.
15. Altschul,S.F., Gish,W., Miller,W., Myers,E.W. and Lipman,D.J. (1990) Basic local alignment search tool. J. Mol. Biol., 215, 403–410.
16. Hempstead,P.D., Yewdall,S.J., Fernie,A.R., Lawson,D.M., Artymiuk,P.J., Rice,D.W., Ford,G.C. and Harrison,P.M. (1997) Comparison of the three-dimensional structures of recombinant human H and horse L ferritins at high resolution. J. Mol. Biol., 268, 424–448.
17. Greer,D.S., Westbrook,J.D. and Bourne,P.E. (2002) An ontology driven architecture for derived representations of macromolecular structure. Bioinformatics, 18, 1280–1281.
18. Walther,D. (1997) WebMol – a Java-based PDB viewer. Trends Biochem. Sci., 22, 274–275.
19. Conte,L.L., Brenner,S.E., Hubbard,T.J., Chothia,C. and Murzin,A.G. (2002) SCOP database in 2002: refinements accommodate structural genomics. Nucleic Acids Res., 30, 264–267.
