In Memory of Frank H. Allen

 

Dr Frank H. Allen passed away on November 10, 2014, aged 70. Colin Groom of the Cambridge Crystallographic Data Centre (CCDC) reported: “Frank joined the Chemical Crystallography Group at the University of Cambridge in 1970 and played a pivotal role in the establishment of the Cambridge Structural Database. He went on to become the Scientific Director and then the Executive Director of the Cambridge Crystallographic Data Centre. Following his retirement in 2008, Frank remained with the CCDC as an Emeritus Research Fellow, enabling him to continue to indulge his passion for structural chemistry. Frank’s research involved collaboration with many scientists around the world, resulting in over 200 papers. He was also a wonderful teacher, supervising more than 20 doctoral students and introducing many more to structural chemistry through workshops over many years. His contributions to other influential organizations, his vigorous editorship of Acta Crystallographica, the numerous conferences he organized and presentations he made meant Frank was known to and respected by crystallographers the world over. Frank has long been a leading figure in international crystallography, and was a wonderful colleague, becoming a friend to all those who worked with him. He will be sadly missed.”

An obituary has been published: Taylor, R. Acta Cryst. 2014, B70, 1035-1036 doi:10.1107/S2052520614026201 http://scripts.iucr.org/cgi-bin/paper?S2052520614026201. Frank was the ACS CINF Herman Skolnik Awardee in 2003. A detailed biography, written in that year, and thus rather out of date, appears at http://www.acscinf.org/content/2003-herman-skolnik-award-memoriam-frank-allen. As a tribute to Frank, I have reproduced a section of my report on the relevant ACS meeting for this issue of the Chemical Information Bulletin.

 

Wendy A. Warr

 

 

RE-PRINTED EXTRACT FROM
CHEMICAL INFORMATION AND COMPUTATION 2003, NUMBER TWO

 

 

226TH ACS NATIONAL MEETING AND EXPOSITION
NEW YORK, NEW YORK, SEPTEMBER 7-11, 2003

A report by Dr. Wendy A. Warr
Wendy Warr & Associates
February 2004

Dr Wendy A. Warr
Wendy Warr & Associates,
6 Berwick Court Holmes Chapel,
Cheshire CW4 7HZ,
England Tel/fax +44 (0)1477 533837
wendy@warr.com  
http://www.warr.com

  American Chemical Society
Division of Chemical Information

Herman Skolnik Award Symposium

Crystallographic Databases
and their Applications

Tuesday 9 September 2003

In recognition of the presentation of the Herman Skolnik Award for 2003 to

Frank H. Allen

Cambridge Crystallographic Data Centre, Cambridge, UK
 


 

 

226th ACS National Meeting
Jacob Javits Convention Center, New York, NY

 

The Herman Skolnik Awardee 2003

Dr. Frank H. Allen
Cambridge Crystallographic Data Centre, Cambridge, UK

Frank Allen is Executive Director of the Cambridge Crystallographic Data Centre (CCDC) and is responsible to the Board of Governors for the overall operation of the CCDC. He has been with CCDC since 1970, following undergraduate and graduate studies (BSc, ARCS, DIC, PhD) at Imperial College, London, UK and postdoctoral work at the University of British Columbia, Vancouver, Canada.

He has been involved in most major developments at the CCDC, including creation of the Cambridge Structural Database (CSD) of organic and metal-organic crystal structures, and software development for structure validation, chemical indexing, database searching and numerical data analysis. A particular interest has been the application of the accumulated CSD data for research purposes. He has published more than 200 papers in crystallography, chemistry and chemical informatics, and has edited 15 reference books and conference proceedings volumes.

Honours and professional activities include: Fellow of the Royal Society of Chemistry (FRSC), 1992; RSC Silver Medal and Prize for Structural Chemistry, 1994; Vice-President, British Crystallographic Association, 1997-2001; Council Member, European Crystallographic Association, 1997-2001; Editor, Acta Crystallographica, Section B, 1994-2002 (IUCr); Chair, IUCr Committee on Crystallographic Databases, 1999-. Editorial Boards: Chemical Communications, Structural Chemistry, Croatica Chimica Acta, Crystallography Reviews. He was appointed Visiting Professor of Chemistry at the University of Bristol in 2002.

Symposium Programme

Crystallographic Databases and their Applications

8:30               Introductory Remarks
Frank H. Allen (CCDC, Cambridge, UK)

8:40               The Cambridge Structural Database (CSD) and its research applications in structural chemistry
Frank H. Allen (CCDC, Cambridge, UK)

9:20               Data mining of crystallographic databases as an aid to drug design

Robin Taylor (CCDC, Cambridge, UK)

10:00              Intermission

10:20             The evolution of the Protein Data Bank
Helen M. Berman, J.D. Westbrook, P.E. Bourne, G.L. Gilliland,
J.L. Flippen-Anderson and the PDB Team (Rutgers U., NJ, and SDSC, San Diego, CA, and NIST, Washington DC, USA)

11:00              The Protein Data Bank (PDB) as a research tool
Philip E. Bourne, J.D. Westbrook, Helen M. Berman, G.L. Gilliland,
J.L. Flippen-Anderson and the PDB Team (SDSC, San Diego, CA, and Rutgers U., NJ, and NIST, Washington DC, USA)

11:40              Lunch

2:00               When can fractional crystallisation be expected to fail?
Information from the Cambridge Structural Database
Carolyn P. Brock (University of Kentucky, Lexington, KY, USA)

2:40               Applications of the Cambridge Structural Database to molecular inorganic chemistry
A. Guy Orpen (University of Bristol, Bristol, UK)

3:20               Intermission

4:00               Materials informatics: knowledge acquisition for materials design
John R. Rodgers (Toth Information Systems Inc., Ottawa, Canada)

4:40                First principles calculated databases for the prediction of intermetallic structures
Gerd Ceder, S. Curtarolo, D. Morgan, J.R. Rodgers (MIT, Cambridge, MA, and Toth Information Systems Inc., Ottawa, Canada)

5:20               Close

 


 

Applications. Chemistry, Biology and Drug Design

 

The Cambridge Structural Database (CSD) and its research applications in structural chemistry. Frank H. Allen, Cambridge Crystallographic Data Centre, 12 Union Road, Cambridge CB2 1EZ, United Kingdom, Fax: 44-1223-336033, allen@ccdc.cam.ac.uk

The Cambridge Structural Database (CSD) contains X-ray and neutron diffraction analyses, from single crystal and refined powder studies, of organic and organometallic compounds. The data come mostly from the open literature, although about 1% are from private communications. Each structure forms a CSD entry, identified by a Reference Code. The contents of CSD are text and numerical data, 2D chemical structures and experimentally determined 3D structures. The 2D structure is mapped onto the 3D one. CSD grew enormously from 1970 to 2000. As of August 14, 2003, it contained 297,507 structures. The prediction is that it will contain more than 500,000 structures by the end of 2010. Allen gave a diagram of the software supplied with CSD for converting data into knowledge.

 

 

ConQuest searches text and numerical data, structures in 2D and 3D, and intermolecular, non-bonded contacts. It retrieves a database subset and a user-defined set of geometric parameters for each structure located. Structures are visualized in Mercury. ConQuest, Mercury and VISTA permit structural chemistry in the CSD to be mined from the raw data. This is knowledge mining, not data mining. Crystallographic knowledge, and intramolecular and intermolecular structural knowledge, are mined.

 

Allen discussed intramolecular structural knowledge first. The Cambridge Crystallographic Data Centre (CCDC) published tables of standard bond lengths in J. Chem. Soc. Perkin Trans. 1987, S10-S19 and J. Chem. Soc. Dalton Trans. 1989, S1-S83. These form the standard “bible” of bond lengths. Conformational preferences can be determined by computational methods, which give energies for model compounds in vacuo but can give multiple minima and are less well developed for some compound classes, for example metal complexes. Condensed phase crystal data give no direct energy estimates, but they are quick to use, high quality experimental data that can be used to validate computational results. There remains, however, the question of whether conformations are affected by crystal packing. Histograms, scattergrams and multivariate methods can be employed to analyze the data.

Allen’s first example was cyclopropyl carbonyls. This substructure is a key one in pyrethroid insecticides. What is the conformational relationship of the carbonyl group to the ring? It is found by searching CSD for the structure, calculating the O1-C9-C1-X (2,3) torsion to describe the conformer, and displaying the results in VISTA and Mercury. Allen showed a polar histogram in VISTA and a Mercury plot (optionally, ball and stick).
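The torsion calculation at the heart of this kind of conformer analysis can be sketched in a few lines of Python. This is only an illustration, not the ConQuest or VISTA implementation: the function names are hypothetical, coordinates are assumed to be Cartesian (x, y, z) tuples in Å, and the sign is assumed to follow the usual convention in which a clockwise rotation, viewed along the central bond, is positive.

```python
import math


def _sub(a, b):
    return tuple(x - y for x, y in zip(a, b))


def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def torsion_angle(p1, p2, p3, p4):
    """Signed torsion angle p1-p2-p3-p4 in degrees, in the range -180 to 180."""
    b1, b2, b3 = _sub(p2, p1), _sub(p3, p2), _sub(p4, p3)
    n1 = _cross(b1, b2)  # normal to the plane p1-p2-p3
    n2 = _cross(b2, b3)  # normal to the plane p2-p3-p4
    x = _dot(n1, n2)
    y = math.sqrt(_dot(b2, b2)) * _dot(b1, n2)
    return math.degrees(math.atan2(y, x))


def torsion_histogram(angles, bin_width=30):
    """Count torsion angles per bin; bin_width is assumed to divide 360."""
    counts = {start: 0 for start in range(-180, 180, bin_width)}
    for a in angles:
        a = ((a + 180.0) % 360.0) - 180.0  # wrap into [-180, 180)
        start = -180 + int((a + 180.0) // bin_width) * bin_width
        counts[start] += 1
    return counts
```

Applying torsion_angle to each database hit and passing the results to torsion_histogram gives the counts that a polar histogram would display.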

 

Conformer mapping of benzophenones was published by Rappoport in J. Am. Chem. Soc. 1990, 112, 7742. The crystal conformers populate low energy regions of the 2D potential energy hypersurface. A low energy valley in the hypersurface shows the conformational preferences of the rings. 1’-aminoribofuranoside has five torsion angles. The main conformers are C1 endo and C3 endo. A general definition of ring puckering co-ordinates was published in Cremer, D.; Pople, J. A. J. Am. Chem. Soc. 1975, 97(6), 1354-1358. Mapping of crystal structure data using Cremer-Pople (CP) phase angles for cyclooctane rings (Acta Cryst. 1996, B55, 882-889) showed that the twist-boat-chair is the preferred conformation.
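For readers unfamiliar with ring puckering co-ordinates, the following is a minimal Python sketch of the Cremer-Pople construction as it is usually stated: the ring is centred, a mean plane is defined, and the out-of-plane displacements are decomposed into amplitudes and phase angles. The function name and return format are assumptions made for illustration, and this is not the code used in the cited studies.

```python
import math


def cremer_pople(ring):
    """Puckering amplitudes and phases for ring atoms given in bonded order.

    Returns (amplitudes, phases): dicts keyed by m, with phases in degrees.
    """
    n = len(ring)
    centre = [sum(p[k] for p in ring) / n for k in range(3)]
    pts = [[p[k] - centre[k] for k in range(3)] for p in ring]
    # Two vectors spanning the mean plane of the ring.
    r1 = [sum(p[k] * math.sin(2 * math.pi * j / n) for j, p in enumerate(pts))
          for k in range(3)]
    r2 = [sum(p[k] * math.cos(2 * math.pi * j / n) for j, p in enumerate(pts))
          for k in range(3)]
    normal = [r1[1] * r2[2] - r1[2] * r2[1],
              r1[2] * r2[0] - r1[0] * r2[2],
              r1[0] * r2[1] - r1[1] * r2[0]]
    length = math.sqrt(sum(c * c for c in normal))
    normal = [c / length for c in normal]
    # Displacement of each atom from the mean plane.
    z = [sum(p[k] * normal[k] for k in range(3)) for p in pts]
    amplitudes, phases = {}, {}
    for m in range(2, (n - 1) // 2 + 1):
        c = math.sqrt(2 / n) * sum(
            zj * math.cos(2 * math.pi * m * j / n) for j, zj in enumerate(z))
        s = -math.sqrt(2 / n) * sum(
            zj * math.sin(2 * math.pi * m * j / n) for j, zj in enumerate(z))
        amplitudes[m] = math.hypot(c, s)
        phases[m] = math.degrees(math.atan2(s, c)) % 360.0
    if n % 2 == 0:
        # Even-membered rings have one extra amplitude with no phase angle.
        amplitudes[n // 2] = math.sqrt(1 / n) * sum(
            (-1) ** j * zj for j, zj in enumerate(z))
    return amplitudes, phases
```

For an eight-membered ring this yields amplitudes for m = 2, 3 and 4 and phase angles for m = 2 and 3, which are the kind of quantities on which conformational maps of cyclooctane rings can be plotted.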

 

A study on the effects of crystal packing on conformation was reported in J. Comput.-Aided Mol. Des. 1996, 10, 247-254. CSD torsion distributions were generated for 12 common fragments. Energy profiles were calculated at 6-31G* level. Allen showed the CSD torsion distributions versus potential energy curves and indicated the anti and gauche conformations. It was concluded that torsion angles with strain energies greater than 1 kcal/mole are rare in crystal structures. Taken over many structures, the effects of crystal packing on conformation seem to be the exception rather than the rule. Thus, crystal structure observations are good guides to the conformational preferences of isolated molecules.

 

Next, Allen discussed intermolecular structural knowledge. Such knowledge is useful in supramolecular synthesis, crystal engineering, crystal growth, structure determination, drug design, drug delivery, ab initio crystal structure polymorph prediction, and protein folding. The types of intermolecular interaction that occur can be studied by crystallography or spectroscopy. Their geometric characteristics are studied by crystallography. Crystallography can also be used to find out if the interactions are directional. Ab initio calculations are needed to find out how strong they are.

 

An example is N-H…O (amide) hydrogen bonding. Extended crystal structures are searched in ConQuest for:

C=O…H distance < sum of van der Waals radii + 0.4 Å
O…H-N angle 90.0-180.0°
crystal R factor < 5%.

The following geometry is calculated:

O…H distance
O…H-N angle
C=O…H angle
angle between N-H vector and amide plane.
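The distance and angle criteria above can be sketched in Python as follows. This is a hedged illustration rather than ConQuest's own logic: the function names are hypothetical, the van der Waals radii are Bondi's values (1.52 Å for O and 1.20 Å for H), which may differ from those ConQuest uses, and the R factor filter and the angle to the amide plane are left out for brevity.

```python
import math

# Bondi van der Waals radii in angstroms (an assumption for this sketch).
VDW_RADIUS = {"O": 1.52, "H": 1.20}


def distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def angle_deg(a, b, c):
    """Angle a-b-c in degrees, measured at atom b."""
    v1 = [x - y for x, y in zip(a, b)]
    v2 = [x - y for x, y in zip(c, b)]
    cos_t = sum(x * y for x, y in zip(v1, v2)) / (
        distance(a, b) * distance(c, b))
    cos_t = max(-1.0, min(1.0, cos_t))  # guard against rounding error
    return math.degrees(math.acos(cos_t))


def hbond_geometry(c, o, h, n, tolerance=0.4):
    """Geometry of a C=O...H-N contact and whether it passes the search limits."""
    d_oh = distance(o, h)
    a_nho = angle_deg(n, h, o)
    a_coh = angle_deg(c, o, h)
    limit = VDW_RADIUS["O"] + VDW_RADIUS["H"] + tolerance
    return {
        "O...H": d_oh,
        "N-H...O": a_nho,
        "C=O...H": a_coh,
        "accepted": d_oh < limit and 90.0 <= a_nho <= 180.0,
    }
```

Collecting these values over all hits gives the columns of a parameter spreadsheet from which histograms and scatterplots can be drawn.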

 

Allen showed a typical CSD hit in Mercury and a parameter spreadsheet in VISTA. He also showed VISTA histograms for the O…H distance and a scatterplot for N-H…O versus O…H. Shorter hydrogen bonds correspond to the more linear N-H…O angles and tend to approach the O-acceptor in the plane of the >C=O system. The use of CSD in conjunction with ab initio calculations known as intermolecular perturbation theory (IMPT) is reported by Hayes and Stone in Mol. Phys. 1984, 53, 84-98. The interaction energy for fixed mutual orientations of small (model) molecules was calculated. CSD was used to indicate the preferred mutual orientations for exploration of the energy hypersurface using 6-31G** basis sets. The total energy was calculated as a sum of individual components calculated by the IMPT procedure.

 

Hydrogen bonding at the C=S acceptor is reported in Acta Cryst. 1997, B53, 680. The electronegativities of C and S are both about 2.5 but thiourea is dominated by C=S…H bonding. Why is this? The dipole moment of C=S in CH2=S is small and in the opposite direction to that of C=O in CH2=O. However, when H is changed to NH2 in urea, the dipole moment of C=S is reversed, and S becomes a medium strength H-bond acceptor. Allen compared energy versus angle curves for IMPT energies of >C=S…H-O and >C=O…H-O.

 

Next, Allen outlined some highlights of hydrogen bond research. CSD has been used in establishing lone pair directionality and in studying resonance-assisted and resonance-induced hydrogen bonding, hydrogen bonding motifs and probabilities of their formation, and weak hydrogen bonds. Weak hydrogen bonds were a new subject in the 1980s. A book on the subject by Desiraju and Steiner was published by Oxford University Press in 1999. Examples of weak hydrogen bonds are:

 

-C-H…O,N,Cl
-C≡C-H, C=C-H and C-C-H as hydrogen bond donors
O,N-H…π and C-H…π bonds.

 

The C-H…O saga is illustrated by two publications: Sutor, D. J. J. Chem. Soc. 1963, 1105, which showed that short C-H…O bonds can be described as hydrogen bonds, and Donohue, J. Structural Chemistry and Molecular Biology, published by Freeman in 1968, where it is said that the C-H…O “hydrogen bond” is a close contact. The years 1968-1982 were the dark ages of weak hydrogen bonds. In J. Am. Chem. Soc. 1982, 104, 5063, Taylor and Kennard proved that these bonds could be called hydrogen bonds. This is the 60th most highly cited paper in J. Am. Chem. Soc.

 

Allen briefly discussed some interactions not mediated by hydrogen, namely CO-CO interactions. He mentioned the anti-parallel motifs in the structures of BUCHAI and BAGTIM (the codes for two structures in CSD), the two dipoles and IMPT results for the anti-parallel motifs. It has been shown that when this bond can form it is quite a strong interaction. It can have a significant effect on protein secondary structure.

In conclusion, Allen quoted from Poincaré: “Science is built up of facts, as a house is built of stones; but an accumulation of facts is no more a science than a heap of stones is a house”. Informatics converts an accumulation of facts into fundamental structural knowledge with myriad applications. Every crystal structure is valuable and contributes to the creation of this knowledge. Unfortunately, the ConQuest-VISTA-Mercury process takes time and is not well integrated with the crystallographic or modelling software. In future there will be improved integration. Structural knowledge must be rapidly accessible and readily available to other groups. Many thousands of crystal structures are not being published; something must be done about this.

 

In a separate paper in the technical program of the Division of Inorganic Chemistry, Allen enlarged upon this problem. CCDC is developing rapid routes for placing new crystal structures into the public domain. Every crystallography machine can produce 700 structures a year. By the end of 2003 there may be 300,000 structures in CSD but there should be more than this in the high throughput era. Increasing numbers of new structures are never published in journals. For some laboratories the proportion may be as high as 75%.

 

The situation can only get worse. The logjam has moved to the publication process. The scientific community is losing valuable data resources. The accountability of crystallographers is compromised. The instruments are provided with public money and the data should be made public. Allen gave a 2003 variation on a quotation of Bernal’s: the growing abundance of crystal structure data and the time required to place them into the public domain act as a brake, or an element of friction, to the progress of science.

 

He listed some of the “brakes”. First there is the pressure of time: the process is labor-intensive. Who owns the data: the chemist who made the compound or the crystallographer who determined the structure? Who is responsible for publication? When the structure turns out to be not as expected there is often a lack of interest in publication. If the chemistry is rejected by referees some good crystal structures that go with it may never get published. There is a need for academic recognition for publication of structures.

 

Submission of pre-publication electronic data to CCDC is now required by an increasing number of journals: about 70 to date. The CCDC deposition number is printed in the published paper. CCDC has an archive of about 120,000 Crystallographic Information Files (CIFs) which are freely available at the CCDC Web site. CSD has been opened up to private communications; there are 2054 so far, 1648 of them (80%) submitted since 1997. Electronic journals are another possible solution to the problem of unpublished data: Acta Cryst. E was started in 2001 and published about 1000 structures in 2003.

 

Allen suggests that for high speed publication, structures should be directly deposited with CCDC for immediate entry to the distributed version of CSD, or for holding in the secure archive of CIFs with automatic publication to be allowed after, say, three years. In the future, Allen foresees automatic data harvesting by CCDC via a GRID route, accessing files placed in specific locations. He closed with some questions. What is a “publication”? Do CSD entries constitute sufficient recognition in their own right? Can data be separated from words and pictures? What is a database? Is it a secondary source of information or is it a primary one? What about quality control and refereeing, and adding a validation report to the CIF?

 

Data mining of crystallographic databases as an aid to drug design
Robin Taylor, Cambridge Crystallographic Data Centre, 12, Union Road, Cambridge CB2 1EZ, United Kingdom, Fax: 44 1223 336033, taylor@ccdc.cam.ac.uk

 

The requirements for crystallographic databases have changed over the last 5-10 years. User expectations have vastly increased: people expect easy answers. ConQuest search is not a one-step process but people want the information more easily. This is a challenge. User needs have changed vastly. Virtual high throughput screening has raised the bar: many, many more molecules are being studied. Crystal data continues to be valuable, because of the exquisite knowledge it gives us, but it must be instantly accessible. Taylor considered four levels of sophistication: processing raw data to make it easier to assimilate; coupling processed data to an application program; processing raw data into a knowledge base that can be coupled to any third party application; and processing raw data into objects for manipulation in an object-oriented script language.

 

The IsoStar database of intermolecular interactions can be used at the first level. It has information about non-bonded contacts, coming mainly from the Cambridge Structural Database (CSD) but to some extent from the Protein Data Bank (PDB). CSD (or PDB) is searched for structures containing the desired contact and the hits are superimposed. Taylor used an example of least squares overlay of ketone groups. The results can be displayed as a scatter plot. Taylor showed such plots for ketone…OH and ether…OH. Because the ketone has mm symmetry, all the contacts are in one quadrant. The plot tells the user how frequently the contact occurs and what geometry it has. Ketone is quite common and forms many contacts to OH, implying (of course) that the contact is energetically favorable. There is a lack of hydrogen bonds along the C=O direction but there are more such bonds in the sp2 lone pair direction. Taylor showed how the plots can be rotated in 3D, indicating the hydrogen bonds in the lone pair plane for ether…OH.

 

He gave three examples of contacts to phenyl rings: Cl-, O and CH. He showed that the electronegative chloride ions and oxygen atoms tend to cluster around the edges of the ring with the weakly electropositive CH groups sitting above the pi electron density. IsoStar can be used to give quick answers to straightforward questions; to identify which groups will hydrogen bond and what directional preferences they show; to establish precedents for an interaction; and to generate ideas, suggesting novel ways to achieve non-covalent bonding. An example of this last use is suggesting ways in which bonding might be achieved to the indole ring of a tryptophan ring system. Hydrogen bonding to an NH is obvious but there are other strategies, for example as in the NH…π hydrogen bond in the CSD structure coded FIZWOA01. This could be used in a ligand design strategy.

 

At the second level of sophistication, the program SuperStar (the name of which indicates “IsoStar plus superimposition”) is used to find binding points on proteins using IsoStar data. Goodford’s GRID program does this based on molecular mechanics calculations but the SuperStar approach is based on experimental information. SuperStar calculates maps that depict the propensity for a functional group (probe group) to bind at different positions around a protein binding site or small molecule. SuperStar allows users to calculate interaction maps as 3D distributions (contour surfaces). IsoStar plots can be congested. The scatter plots are embedded in a grid, and after counting and contouring, an interaction surface is produced. Dividing the observed density by the expected density puts the plots on a meaningful scale.

 

For example, the stoichiometry of the crystal structure is examined and this information is used to see how many OH groups are expected at random. The actual number observed at a specific volume element can be divided by the random expectation value to give a propensity for contacts to occur at that position in space. A propensity less than one means a non-favorable interaction; a propensity greater than, or equal to, one means a favorable interaction. Taylor gave an example of ionized carboxylate with amino groups around it, contoured at a contour level of six, i.e., six times more than expected.
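The propensity calculation can be illustrated with a short Python sketch. It assumes, purely for illustration, that contact positions are binned onto a cubic grid and that the random expectation has already been worked out as a single expected count per volume element; the function names are hypothetical and this is not SuperStar's code.

```python
import math
from collections import Counter


def voxel_counts(points, spacing):
    """Count contact positions falling in each cubic volume element."""
    return Counter(
        tuple(math.floor(c / spacing) for c in p) for p in points)


def propensity_map(points, spacing, expected_per_voxel):
    """Observed count divided by the count expected at random, per voxel."""
    counts = voxel_counts(points, spacing)
    return {voxel: n / expected_per_voxel for voxel, n in counts.items()}
```

A voxel holding six contacts where one would be expected at random gets a propensity of 6, which corresponds to the contour level of six in Taylor's carboxylate example.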

 

The procedure is as follows. Prepare the template molecule (e.g., protein binding site). Select a probe atom. The probe atom is a specific atom in the probe group for which the propensity will be calculated. It is usually an atom in an IsoStar contact group, e.g. carbonyl oxygen. Place the template molecule on a suitable three-dimensional grid. Analyze the template molecule and break it into fragments for which data are available in IsoStar. (The fragments correspond to IsoStar central groups). Overlay the IsoStar scatter plots onto the corresponding parts of the template molecule. In this way, all IsoStar information is projected onto the template molecule. Convert each transformed scatter plot to a density map, and scale the density to propensity; all maps are on the same propensity scale after performing this step. Combine overlapping maps by multiplication. Contour the final map and display.
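The last two steps of the procedure, combining overlapping maps by multiplication and then contouring, might look like this in Python. The sketch assumes each map is a dictionary from grid voxel to propensity, and that a voxel missing from one fragment's map carries no information from that fragment and so contributes a neutral factor of 1. Both are assumptions made for illustration, not details given in the talk.

```python
def combine_maps(maps):
    """Multiply propensities voxel by voxel across several fragment maps."""
    combined = {}
    for fragment_map in maps:
        for voxel, value in fragment_map.items():
            # A voxel absent from earlier maps starts from the neutral value 1.
            combined[voxel] = combined.get(voxel, 1.0) * value
    return combined


def voxels_at_or_above(propensities, level):
    """Voxels that would lie inside a contour drawn at the given level."""
    return {voxel for voxel, value in propensities.items() if value >= level}
```

Because every map is on the same propensity scale, a product above one still indicates a position that is favorable overall.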

 

Taylor displayed a SuperStar map for glutathione transferase (1glp) and results for CSD and PDB. The answers were much the same. Intra- and inter-molecular data can be combined. Taylor showed an IsoStar display of the distribution of carbonyl oxygens around an OH group, and a H-C(C)-O-H histogram, where SuperStar shows the OH in a secondary alcohol spinning. These two collections of information can be combined to indicate the preferred positions of carbonyl oxygen around a secondary alcohol, taking into account that the hydroxyl group can rotate to optimize its hydrogen-bonding interaction.

 

At the third level of sophistication, such distributions can be made available to other programs. Mogul provides intramolecular geometry data to people or client programs. It gives extremely rapid access to information on the preferred values of bond lengths, valence angles and acyclic torsion angles, using data derived from the CSD. Input to Mogul is a complete molecule, not a substructure. Given the instruction to retrieve data for a particular feature in that structure, e.g. a valence angle, Mogul will automatically derive a search query and use it to find the relevant CSD entries. The resulting statistics, such as the mean and median valence-angle values can then be passed via an ASCII file interface to other programs.

 

There are three libraries under Mogul, containing bond length, valence angle and torsion angle fragments generated from every entry in CSD. Every fragment is classified by evaluation of keys. Fragments are grouped together so that all fragments with the same set of key values are assigned to the same distribution. The distributions are accessible by searching a tree indexed on key values. Thus, evaluation of key values for a query fragment, followed by traversal of the tree, will find the distribution containing CSD fragments with the same key values. Taylor showed the keys used (about 20 of them) for a typical angle fragment.
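The idea of a tree indexed on key values can be sketched in Python with nested dictionaries. The class name, the example keys and the angle values below are all hypothetical, since the talk did not list Mogul's actual keys in a form that can be reproduced here; the sketch only shows why lookup is fast: it costs one dictionary step per key, however many fragments are stored.

```python
import statistics

_LEAF = object()  # sentinel under which a node stores its distribution


class KeyTree:
    """Fragments with identical key values share one distribution."""

    def __init__(self):
        self.root = {}

    def add(self, keys, value):
        node = self.root
        for key in keys:
            node = node.setdefault(key, {})
        node.setdefault(_LEAF, []).append(value)

    def lookup(self, keys):
        node = self.root
        for key in keys:
            node = node.get(key)
            if node is None:
                return []  # no stored fragment has these key values
        return node.get(_LEAF, [])


def summarize(distribution):
    """Mean and median of a distribution, the kind of statistics passed on."""
    return statistics.mean(distribution), statistics.median(distribution)
```

For example, after adding angle values under the hypothetical keys ("C", "C", "O", "acyclic"), a query fragment that evaluates to the same keys retrieves that distribution directly, and summarize returns its mean and median.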


Tree search (traversing the tree based on keys) takes less than 1 second. Mogul is interactive and easy to use. For example, the user drags in a structure, clicks on three atoms and displays a histogram for that type of bond angle. The human element can even be removed and third party software can interact directly with Mogul. The program can be used in modelling for conformation validation (e.g., for filtering docking solutions) or in conformation generation. It can be used in crystallography for geometry validation and creation of restraint data and ligand dictionaries.

 

Finally, Taylor considered the fourth level of sophistication and Reliscript for manipulation of protein and ligand objects in the Python scripting language. Taylor presented a diagram in which all the items are objects.


He showed a dubious docking where carboxylate formed some ugly looking contacts. Suppose that the user wants to access information from PDB to see if this carboxylate contact is likely. The procedure is as follows. Find all ligand carboxylates. Find all carboxylate-protein contacts. Determine the percentage of hydrophobic and polar contacts in PDB which are in the same or a worse environment than in the docking solution. Taylor showed some sample code with a SMILES search object in it. The result was that about 6% of carboxylates were classified as in unfavorable environments. The docking pose was in the worst 3%. This docking solution could thus be excluded. There are many ways in which the script could be extended, e.g., resolution limit, including crystal packing etc.
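The final ranking step can be sketched in plain Python. This does not use or imitate the Reliscript API, and it omits the SMILES search that finds the carboxylates. It assumes that each carboxylate's environment has already been reduced to counts of hydrophobic and polar contacts, and that a higher hydrophobic fraction means a worse environment for a charged carboxylate; the function names and the scoring are assumptions made for illustration.

```python
def hydrophobic_fraction(n_hydrophobic, n_polar):
    """Fraction of contacts that are hydrophobic (higher is assumed worse)."""
    return n_hydrophobic / (n_hydrophobic + n_polar)


def percent_same_or_worse(reference_fractions, pose_fraction):
    """Percentage of reference environments at least as bad as the pose."""
    worse = sum(1 for f in reference_fractions if f >= pose_fraction)
    return 100.0 * worse / len(reference_fractions)
```

A pose for which only a few percent of the reference carboxylates are in the same or a worse environment would be a candidate for exclusion, as in Taylor's example.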

 

Crystallographic database providers must respond to the changing needs of users and they are doing so. This involves a change away from traditional structure search towards pre-processed data, application programming interfaces and the use of scripting languages.

 

The evolution of the Protein Data Bank
Helen M. Berman (1), John D. Westbrook (1), Philip E. Bourne (2), Gary L. Gilliland (3), Judith L. Flippen-Anderson (1), and PDB Team (4). (1) Department of Chemistry and Chemical Biology, Rutgers, The State University of New Jersey, 610 Taylor Road, Piscataway, NJ 08854, berman@rcsb.rutgers.edu, (2) San Diego Supercomputer Center, University of California, San Diego, (3) National Institute of Standards and Technology, Center for Advanced Research in Biotechnology, (4) Rutgers, SDSC/UCSD, CARB/NIST

 

Community discussions leading up to the formation of the Protein Data Bank (PDB) began in the late 1960s and early 1970s. A meeting of protein crystallographers at Cold Spring Harbor Laboratory preceded the establishment of the PDB at Brookhaven in October 1971. At that date, it contained seven structures. During the 1980s the number of structures increased. There were discussions about requiring depositions, following which the International Union of Crystallography (IUCr) guidelines were established. Thereafter, the number of structures deposited increased. Independent biological databases such as the Nucleic Acid Database (NDB) were established.

 

In the 1990s, the macromolecular Crystallographic Information File (mmCIF) project (another IUCr initiative) was completed. The mmCIF format expands the CIF dictionary by including data items relevant to the macromolecular crystallographic experiment. Eight years of work culminated in the establishment of a data dictionary and an ontology. During the same period, structural genomics was born. The PDB moved to RCSB (Research Collaboratory for Structural Bioinformatics) and is now managed by Rutgers, the State University of New Jersey, the San Diego Supercomputer Center at the University of California, San Diego, and the Center for Advanced Research in Biotechnology of the National Institute of Standards and Technology. International participants in data deposition and processing include the European Bioinformatics Institute Macromolecular Structure Database group (UK) and the Institute for Protein Research at Osaka University (Japan).

 

The mission of the PDB is to provide the most accurate, well-annotated data, in the most timely and efficient way possible to facilitate new discoveries and advances in science. Challenges are the growth in the number of structures and the increase in their complexity. There are new methods for structure determination such as NMR and cryoelectron microscopy. Users are demanding more complex queries: they do not just request co-ordinates but expect analysis. They are also requiring more annotation and integration with other genomic and proteomic information. The community of users is much larger and more diverse.

David Goodsell at The Scripps Research Institute has developed some lovely graphics for the types of structures in PDB. The database has a rich assortment of molecules. In 1995 there were about 5000 structures; now there are more than 24,000. Berman showed a growth curve. Types range from myoglobin in 1972 to the ribosome in the 1990s. Berman displayed a cityscape showing the growth in complexity. Structures had 1-2 chains in the 1970s; in 2003, some structures had 30-50 chains. Berman also illustrated the change in the number of new folds as a percentage of total PDB depositions. In 1980, 60% were unique folds; the percentage was less than 10% in 2001. Only 14% used to have less than 30% similarity; this number is now 5%. Berman tabulated some statistics, and then gave a data processing workflow diagram.

 

 

                                            1993      1998      2003
Total structures                            1727      8942    23,792
Number of structures deposited per year      792      2178     4,831
Average number of Web hits per day           N/A    57,000   188,000


The data processing system is based upon the CIF editor ADIT.  Different dictionaries can be put underneath ADIT without any software changes. Both functionality and content of ADIT can be simply customized. The data processing system automatically scales with changes in content. The data can be distributed to multiple deposition sites. There were many more items of data content in the 1990s than there were in the 1970s. Nowadays there are 350 data items per structure on average; in the early days there were only 200.

 

Berman gave a schematic diagram of the current query system and showed the structure explorer page from the Web. There are links to the CATH structure classification; to PDBsum, a summary of the PDB structure; and to SCOP, the structural classification of proteins. The site acts as a portal to other databases.

 

In order to achieve data uniformity, all the data files had to be reprocessed, and the data had to be validated and corrected, before integrated mmCIF files could be produced and loaded into a relational database management system. High quality data is needed for reliable query results. The processing has led to a greatly enhanced search capability extending from the biological assemblies down to atom level, and improved portability to other database efforts. Work is well advanced on the design of a new PDB with a three-tier architecture. Berman gave a diagram of the new query functionality, and a data flow diagram.


(See also Greer, D. S.; Westbrook, J. D.; Bourne, P. E. An Ontology Driven Architecture for Derived Representations of Macromolecular Structure. Bioinformatics 2002, 18, 1280-1281.) A more recent publication is Bourne, P. E.; Addess, K. J.; Bluhm, W. F.; Chen, L.; Deshpande, N.; Feng, Z.; Kramer Green, R.; Merino-Ott, J. C.; Townsend-Merino, W.; Weissig, H.; Westbrook, J.; Berman, H. M. The Distribution and Query Systems of the RCSB Protein Data Bank. Nucleic Acids Research 2004, 32, D223-225. In future the structure should be the user interface so the user can find information such as what other structures contain the same ligand, or what other structures have chains with >90% sequence identity directly from looking at a particular entry.
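A question such as "what other structures contain the same ligand?" maps naturally onto a relational self-join. The sketch below uses an in-memory SQLite database; the two-table schema, the entry identifiers and the entry titles are all hypothetical and are not taken from the PDB's actual database design.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE entry  (entry_id TEXT PRIMARY KEY, title TEXT);
    CREATE TABLE ligand (entry_id TEXT REFERENCES entry(entry_id),
                         ligand_code TEXT);
""")

# Made-up entries purely for illustration.
conn.executemany("INSERT INTO entry VALUES (?, ?)", [
    ("E001", "hypothetical kinase"),
    ("E002", "hypothetical oxidase"),
    ("E003", "hypothetical globin"),
])
conn.executemany("INSERT INTO ligand VALUES (?, ?)", [
    ("E001", "ATP"),
    ("E002", "ATP"),
    ("E002", "HEM"),
    ("E003", "HEM"),
])


def entries_sharing_ligand(conn, entry_id):
    """Return the other entries that contain at least one ligand of entry_id."""
    rows = conn.execute("""
        SELECT DISTINCT other.entry_id
        FROM ligand AS mine
        JOIN ligand AS other ON other.ligand_code = mine.ligand_code
        WHERE mine.entry_id = ? AND other.entry_id <> ?
        ORDER BY other.entry_id
    """, (entry_id, entry_id))
    return [row[0] for row in rows]


print(entries_sharing_ligand(conn, "E002"))
```

Starting from a single entry and following such joins is one way the structure itself could act as the starting point for a query.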

 

Berman next talked about the new challenges of structural genomics. She showed a flowchart:

 

Target selection → crystallomics → data collection → structure solution → structure refinement → functional annotation → publication

 

A target registration database has been constructed (http://targetdb.pdb.org), containing 49,000 sequences, all downloadable in XML. The scope of TargetDB is to provide timely status and tracking information on the progress of the production and solution of structures. The targets are downloaded from 16 centers weekly. PDB entry sequences have been integrated. Targets can be searched by sequence (with FASTA), project target ID, project site, status (selected, cloned, expressed, ... in PDB etc.), update date, protein name and source organism. Reports of results can be constructed in HTML, FASTA and XML formats. Almost 600 structures in the PDB are from the structural genomics projects. The next stage beyond this will be a protein expression, purification and crystallization database, PEPCdb. It will include all the information about targets including the protocols for protein production.
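The kind of search and report described here can be sketched in a few lines. The record layout, field names, center names and sequences below are invented for illustration and do not reflect TargetDB's real schema or contents; only the search criteria (site, status and so on) and the FASTA report format come from the description above.

```python
from dataclasses import dataclass


@dataclass
class Target:
    target_id: str      # project target ID
    site: str           # project site (center)
    status: str         # e.g. "selected", "cloned", "expressed", "in PDB"
    protein_name: str
    sequence: str


# Made-up records purely for illustration.
targets = [
    Target("T0001", "CenterA", "cloned", "hypothetical protein 1", "MKTAYIAKQR"),
    Target("T0002", "CenterB", "in PDB", "hypothetical protein 2", "MSDNELLKGA"),
    Target("T0003", "CenterA", "in PDB", "hypothetical protein 3", "MAHHHHHHGS"),
]


def search(records, **criteria):
    """Keep the records whose fields equal every supplied criterion."""
    return [t for t in records
            if all(getattr(t, field) == value
                   for field, value in criteria.items())]


def to_fasta(records):
    """Render records as a FASTA report: one header line, one sequence line."""
    return "\n".join(
        f">{t.target_id} {t.protein_name} [{t.status}]\n{t.sequence}"
        for t in records
    )


hits = search(targets, site="CenterA", status="in PDB")
print(to_fasta(hits))
```

An HTML or XML report would differ only in the rendering function; the search step stays the same.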

 

 

The PDB (Berman, H. M.; Westbrook, J.; Feng, Z.; Gilliland, G.; Bhat, N.; Weissig, H.; Shindyalov, I. N.; Bourne, P. E. The Protein Data Bank. Nucleic Acids Research, 2000, 28, 235-242) has world-wide mirror sites, which means that users get the same structural information from anywhere in the world.

 

The Protein Data Bank (PDB) as a research tool

Philip E. Bourne (1), John D. Westbrook (2), Helen M. Berman (2), Gary L. Gilliland (3), Judith L. Flippen-Anderson (2), and PDB Team (4). (1) San Diego Supercomputer Center, University of California, San Diego, 9500 Gilman Drive, La Jolla, CA 92093, bourne@sdsc.edu, (2) Department of Chemistry and Chemical Biology, Rutgers, The State University of New Jersey, (3) National Institute of Standards and Technology, Center for Advanced Research in Biotechnology, (4) Rutgers, SDSC/UCSD, CARB/NIST

 

One of the PDB’s stated goals is “through timely distribution to enable complete analysis of macromolecular structure data”. One way of doing this is to ensure Web delivery, which gives the user access regardless of geographic location. Integration of crystallographic databases with applications is now critical. There are three classes of users: educators and students; structural biologists and chemists; and computational chemists and biologists. The PDB is already an important research tool, containing 22,333 structures (as of September 2, 2003) comprising 765 discrete folds, 2,164 protein families, and 20,907 structures containing proteins (6,647 at the level of 90% sequence identity). Many structures have post-translational modifications. Access 24 hours a day, seven days a week, is critical to a world-wide audience.

 

In future there will be further structure diversification. Bourne gave a graph illustrating the impact of structural genomics.


He also showed a cityscape of fold distribution: the number of folds versus SCOP fold id. The graph illustrated the major types of folds as found in the PDB and as predicted for all structural genomics targets and for certain model organisms, including H. sapiens and E. coli. Some folds are over-represented in the PDB, which represents a bias, but this is compensated for by an under-representation in the targets, that is, the structures being attempted. PDB is the primary source of information from which secondary and tertiary sources, and value-added services, are made. By application of reductionism to PDB followed by further action, secondary sources (protein families, genomics, protein-protein interactions, dynamics, modeling) are made. How is the PDB facilitating this? In 1998 there was little software; in 2003 there were standard toolkits; in 2008 there will be extensive software. In 2000, PDB offered links; in 2003, Web services. In 2008, it plans to offer “PDB-in-a-box” and “MyPDB”.

New query and related features added recently are a sequence homology filter, and XML (enabled by the macromolecular Crystallographic Information File, mmCIF). The “biological unit” is now handled. Features in alpha test include: query on PubMed abstracts; integration of data from other sources (e.g., SwissProt); improved ligand descriptions; review of biologically active molecules; further experimental detail; relationships to disease; better classification of structures (compound/chain/ligand); and relationship to cellular location, molecular function and biochemical process.

 

Bourne commented on the importance of usability. Biology suffers from the “high noon syndrome”, named after the “12:00” symbol left flashing on the video machine because people cannot be bothered to program it: the barrier to entry is too high. People will only input data if the system is easy to use. This has prompted PDB to offer better navigation of site content; more dynamic and intuitive access; better keyword access; query by example; better molecular visualization using a new toolkit; “MyPDB”; and use of Web services and CORBA.

 

Bourne gave an example of the research impact of all this. A search for “apoptosis” gave 103 hits recently, but the same search on the new site gives 168 hits (SwissProt has been added), and annotation is better. Searching and displaying results takes less than half the time it used to take. A new visualization toolkit is being developed: see http://mbt.sdsc.edu. Local file and remote data loaders for PDB, mmCIF, and FASTA have been developed. 2D and 3D views are coupled. A rich and extensible API is being developed. Portable, Web-deployable Java is used. Bourne also mentioned the Molecular Biology Toolkit (MBT) structure, sequence and tree viewers. MBT is a flexible toolkit from which a variety of applications can be built. Applications are delivered via the Web as Java applets or run as stand-alone programs, and allow integration and visualization of a variety of biological data types, most notably sequence and structure.

 

Thus, the human user has been considered but applications must also be enhanced. There has been a paradigm shift in the way that people work. Web services are finding favor. Nowadays people download the data and do the analysis locally but Web services will overcome this problem and allow use of up-to-date data, with applications described as “even I can do it”. Bourne gave an example of a Perl Web services client, showing some code from a small Perl program to access all PubMed abstracts containing the word “ferritin”. Each month, the query is updated automatically. This is also easy to do in Java.
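The flavor of such a small client can be suggested with a present-day sketch. It does not reproduce the Perl program Bourne showed, nor the PDB's own Web services; as a stand-in it uses the NCBI E-utilities `esearch` endpoint for PubMed, and the sample response and its identifiers are placeholders so that the example runs without network access.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

# NCBI E-utilities search endpoint (used here as a stand-in service).
ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_query(term, retmax=20):
    """Build the URL that asks PubMed for records matching term."""
    params = {"db": "pubmed", "term": term, "retmax": retmax}
    return ESEARCH + "?" + urlencode(params)


def parse_ids(xml_text):
    """Extract the record identifiers from an esearch XML response."""
    root = ET.fromstring(xml_text)
    return [element.text for element in root.findall("./IdList/Id")]


# Abbreviated response of the shape esearch returns; the IDs are placeholders.
SAMPLE = ("<eSearchResult><Count>2</Count>"
          "<IdList><Id>111</Id><Id>222</Id></IdList></eSearchResult>")

print(build_query("ferritin"))
print(parse_ids(SAMPLE))

# To query the live service instead of the sample:
#   from urllib.request import urlopen
#   ids = parse_ids(urlopen(build_query("ferritin")).read())
```

Rerunning such a script on a schedule is one simple way to keep a query result up to date, which is the behavior described for the monthly update.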

 

Bourne discussed two scenarios for each of the three classes of users: educators and students, structural biologists, and chemists. The first concerned the educator searching for “ferritin”. The aim is to offer him new keyword search techniques: a “PDB Google”; new navigation tools; and the facility to search the database and Web content. Bourne showed literature references with a hyperlink to display, then a 4-helical bundle as the basic unit, and finally the biological molecule consisting of 24 such units.

 

The second scenario concerned the structural biologist or chemist asking the following questions. What are the components of this quaternary complex of protein kinase A? What else can I learn about protein kinase A? With what diseases is it associated? In the new system there is a ligand viewer. Components of the structure are now well described. Users can tabulate and search ligand/chain/residue. Again, this is an effort to tackle the “high noon effect”. Current links are maintained to over 60 Web sites. In the disease browser, the user can browse by disease name, view the numbers of associated PDB structures, search for structures, and search for a disease name.

 

In five years’ time it is hoped that PDB will offer detailed descriptions of macromolecule-ligand interactions; a better description of stereochemistry; a better description of the overall contents of the database; relationships to genomic sequence descriptions; query by user type; and visual queries, for example, by molecule.