Public Databases Serving the Chemistry Community

How do you find a reaction of interest? How do you find a molecule that may have activity against a target you are working on? How do you synthesize that molecule? How do you find the sequence of that target? How do you find a crystal structure or NMR spectrum of a compound? Biologists and chemists increasingly rely on public resources on the Internet to answer these and many more questions, and these resources are likely to be predominantly databases. How, you might add, would we survive (or at least do our jobs) if we did not have access to these resources? At the ACS Meeting in New Orleans we had the opportunity to meet many of the people involved in developing and maintaining such databases, as well as those thinking beyond what we have now and addressing such topics as quality, the future, and new technologies.

The morning session was opened by Evan Bolton (National Center for Biotechnology Information, NIH, United States) presenting “PubChem: A community driven resource.” He described PubChem as an open archive to which people around the world push data, whether for small molecules or, increasingly, RNAi’s. To date there are 47 million CIDs (PubChem Compound Identifiers) and 650,000 assays, and 1.8 million molecules have bioactivity results. PubChem is also accessible via the PUG (Power User Gateway) and has various widgets for mining data.

Markus Sitzmann (National Cancer Institute, NIH, United States) then presented on “NCI/CADD chemical structure Web services.” He described the chemical identifier resolver, launched in 2009, which is most widely used by Eli Lilly. The NCI chemical structure database has 84.6 million unique structures, and the group is working on a new database with 141.7 million unique structures. In addition, the group is working on several web apps that will be accessible by iPad.
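For readers who have not used a resolver of this kind, the following is a minimal sketch of how a request URL might be assembled. It assumes the publicly documented URL pattern of the NCI/CADD resolver (base address, then identifier, then requested representation); the function name `resolver_url` and the constant `RESOLVER_BASE` are hypothetical and were not part of the talk. The sketch only builds the URL and does not contact the service.

```python
from urllib.parse import quote

# Assumed base address and path layout of the NCI/CADD resolver:
#   <base>/<identifier>/<representation>
RESOLVER_BASE = "https://cactus.nci.nih.gov/chemical/structure"


def resolver_url(identifier: str, representation: str = "smiles") -> str:
    """Build a resolver request URL for a name, SMILES, InChIKey, etc.

    `resolver_url` is a hypothetical helper written for this illustration.
    """
    # Percent-encode the identifier so that characters such as '#', '/'
    # or spaces in a SMILES string or chemical name survive in the URL path.
    encoded = quote(identifier, safe="")
    return f"{RESOLVER_BASE}/{encoded}/{representation}"


if __name__ == "__main__":
    print(resolver_url("aspirin"))
    print(resolver_url("C#N", "stdinchikey"))
```

The resulting URL can be pasted into a browser or passed to any HTTP client; the identifier is encoded because SMILES strings routinely contain characters that are reserved in URLs.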

In the talk titled “ChemSpider: Disseminating data and enabling an abundance of chemistry platforms,” Antony Williams (Royal Society of Chemistry, United States) described the many projects in which his group is involved, ranging from mobile apps to consortia projects providing resources for the chemistry community both in the UK and elsewhere. These initiatives include Open PHACTS, which is a triple store registering public and private facts; PharmaSEA, which is a project to de-replicate natural products; and the National Chemical Database Service, which is a UK project providing access to a series of commercial databases and prediction services and will ultimately deliver a repository for data generated by the UK academic community. (slideshare).

Yanli Wang (National Center for Biotechnology Information, NIH, United States) then provided an overview of “PubChem BioAssay: A public database for chemical biology data.” She showed the growth in records and reported over 40,000 compounds with bioactivity <1 uM. To date 177 chemical probes have been identified. The database covers 8000 targets and 2000 organisms.

Gary Battle (European Bioinformatics Institute, United Kingdom) then presented “Chemistry-related resources at the Protein Data Bank in Europe.” He described the Protein Data Bank in Europe and how they have a strong focus on ligands and tools for chemists. He gave examples of molecules with incorrect ligand geometry and also cited a recent paper that described 20% of structures as having geometric errors (The good, the bad and the twisted: a survey of ligand geometry in protein crystal structures. J. Comput.-Aided Mol. Des. 2012, 26, 169-183).

Egon Willighagen (Maastricht University, The Netherlands) then presented “Architecture for an open science molecular compound database.” He described Open Notebook science, RDF graphs and (slideshare).

The afternoon session began with Julien Thibault (University of Utah, United States) speaking on “Local and remote tracking of molecular dynamics data for global dissemination,” which described the iBiomes integrated biomolecular simulator and the iRODS rule-oriented data management system.

In the next presentation titled “Chemical science that underpins the Reaxys database” Juergen Swienty-Busch (Elsevier Information Systems, Switzerland) discussed the recent advances they have made in order to support the daily workflow of a research chemist. The database started from the early publication of Beilstein in 1881 which collated 1,500 compounds over 2,200 pages and is now the Reaxys database covering chemical reactions from over 16,000 periodicals.

Valentina Eigner-Pitto (InfoChem, Germany) then described “ChemReact: A free database containing more than 524,000 reactions available at your fingertips.” She explained how this represents the most comprehensive free resource available today, and on a Mobile App! Interestingly, a plot of reaction type against frequency appeared to follow a power law. During discussions it emerged that other groups had noticed the same behavior, but it has not been widely disseminated.
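As a rough illustration of how such an observation can be checked, a power law appears as a straight line when frequency is plotted against rank on log-log axes, and the slope of that line gives the exponent. The sketch below fits the line by ordinary least squares. The counts are synthetic and generated for this example only; they are not ChemReact data, and the function name `fit_power_law` is hypothetical.

```python
import math


def fit_power_law(frequencies):
    """Fit frequency = c * rank**(-alpha) by least squares on log-log axes.

    All frequencies must be positive. Returns the pair (alpha, c).
    """
    ordered = sorted(frequencies, reverse=True)  # rank 1 = most frequent
    xs = [math.log(rank) for rank in range(1, len(ordered) + 1)]
    ys = [math.log(freq) for freq in ordered]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # On log-log axes the slope is -alpha and the intercept is log(c).
    return -slope, math.exp(intercept)


# Synthetic counts for 50 hypothetical reaction types (NOT real data),
# constructed to follow an exact power law with alpha = 1.5.
synthetic_counts = [1000.0 * rank ** -1.5 for rank in range(1, 51)]
alpha, c = fit_power_law(synthetic_counts)

if __name__ == "__main__":
    print(f"alpha = {alpha:.2f}, c = {c:.0f}")
```

A least-squares fit on log-log axes is the simplest diagnostic rather than a rigorous test; real counts would also need a goodness-of-fit check before claiming power-law behavior.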

Sean Ekins (Collaborations in Chemistry, United States) then delivered a lecture on behalf of Christopher Southan (TW2Informatics, Sweden), “Navigating between patents, papers, abstracts, and databases using public sources and tools.” He described how such navigation was possible due to ChEMBL’s capture of SAR from journals; the deposition of three major automated patent extractions (SureChem, IBM and SCRIPDB) in PubChem for over 15 million structures; open tools such as OPSIN and OSCAR, which enable the conversion of IUPAC names or images to structures; and the indexing of chemical terms (e.g. InChIKeys) that turns Google searches into a merged global repository of 40 to 50 million structures. (slideshare).

Colin Batchelor (Royal Society of Chemistry, United Kingdom) then described “ChemSpider reactions: Delivering a free community resource of chemical syntheses.” This was a progress report on the Royal Society of Chemistry’s effort to create an online resource of hundreds of thousands of reactions. The original source data that is to be unveiled will result from the PhD research of Daniel Lowe (originally at the University of Cambridge, Unilever School of Informatics and now at NextMove Software). (slideshare).

Michael Kappler (Roche, United States) presented the final talk of the day on “Intuitive and integrated browsing of reactions, structures, and citations: The Roche experience.” He described how they could not get data out of their CambridgeSoft ELN and created instead a unified data model leveraging Pipeline Pilot and Reaxys. He mentioned how they had 27 informaticians at Roche working on the project, and it took seven months to migrate 99.5% of all reactions.

Day 2 began with Noel O’Boyle (NextMove Software, United Kingdom) presenting “Universal SMILES: Finally, a canonical SMILES string?” He discussed how to use the InChI’s canonical labels to derive a canonical SMILES string in a straightforward way, and the performance of these methods. (slideshare).

Next, Laura Guasch (National Cancer Institute, NIH, United States) talked about “Analysis of tautomerism in databases of commercially available compounds.” She reported on the tautomerism analysis in a large database of commercially available compounds to investigate how many cases there are of the same chemical being sold as different products (at possibly different prices), and to test the tautomerism definition of the widely used chemoinformatics toolkit, CACTVS. She reported on thousands of cases where at least two products are listed as different compounds in the Aldrich Market Select (AMS) database from ChemNavigator/Sigma-Aldrich.

Colin Batchelor (Royal Society of Chemistry, United Kingdom) reported on the RSC’s Chemical Validation and Standardization Platform (CVSP) and their efforts to use algorithmic checking on chemical compound representations to try and provide a potential path to quality-conscious databases. The CVSP platform checks chemicals using a set of rules such as hypervalency, charge-imbalance, absent stereo, etc. and uses algorithms to convert submitted structure representations into standardized representations such as those expected by the FDA for their substance registry system. The system, when released, will be available for the community to use. (slideshare).
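To give a flavor of what rule-based checking of this kind involves, the following toy sketch applies two of the rule types named above, charge imbalance and hypervalency, to a deliberately simplified molecule representation. It is not the CVSP implementation: the `Atom` class, the `validate` function, and the `MAX_VALENCE` table are all hypothetical, only explicitly listed bonds are counted, and only neutral atoms are tested for hypervalency.

```python
from dataclasses import dataclass

# Hypothetical table of maximum valences for neutral atoms (illustration only).
MAX_VALENCE = {"H": 1, "C": 4, "N": 3, "O": 2}


@dataclass
class Atom:
    element: str
    charge: int = 0


def validate(atoms, bonds):
    """Return a list of issue strings for a toy molecule.

    atoms: list of Atom
    bonds: list of (atom_index, atom_index, bond_order) tuples
    """
    issues = []

    # Rule 1: charge imbalance (net formal charge is not zero).
    net_charge = sum(atom.charge for atom in atoms)
    if net_charge != 0:
        issues.append(f"charge imbalance: net charge {net_charge:+d}")

    # Rule 2: hypervalency (explicit bond orders exceed the assumed maximum).
    bond_sum = [0] * len(atoms)
    for i, j, order in bonds:
        bond_sum[i] += order
        bond_sum[j] += order
    for index, atom in enumerate(atoms):
        limit = MAX_VALENCE.get(atom.element)
        if atom.charge == 0 and limit is not None and bond_sum[index] > limit:
            issues.append(
                f"hypervalent {atom.element} at atom {index}: "
                f"{bond_sum[index]} bonds, max {limit}"
            )
    return issues


# A carbon drawn with five hydrogens, and an ammonium cation without counterion.
five_h_carbon = validate(
    [Atom("C")] + [Atom("H") for _ in range(5)],
    [(0, k, 1) for k in range(1, 6)],
)
ammonium = validate(
    [Atom("N", charge=1)] + [Atom("H") for _ in range(4)],
    [(0, k, 1) for k in range(1, 5)],
)

if __name__ == "__main__":
    print(five_h_carbon)
    print(ammonium)
```

A production system would work from a full cheminformatics toolkit and a far larger rule set; the point here is only that each rule is a small, independent check whose findings are collected into a report.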

Sean Ekins (Collaborations in Chemistry, United States) then elaborated on “Challenges and recommendations for obtaining chemical structures of industry-provided repurposing candidates.” He described recently published efforts (Drug Discovery Today 2013, 18, 58-70) to find the structures for repurposing candidates provided by the pharmaceutical industry to the National Center for Advancing Translational Sciences (US) and Medical Research Council (UK) initiatives. He also described efforts to make the identified structures publicly available and to analyze them in silico to identify new uses. (slideshare).

Juergen Swienty-Busch (Elsevier Information Systems, Germany) then reviewed “One size fits all or how to find the needle in the haystack?” He described a workflow that used Pathway Studio, Reaxys Medicinal Chemistry to design molecules with good ADME properties, and PharmaPendium to prioritize the drug pipeline. He also mentioned that Reaxys and PubChem overlap by 20%.

Alex Clark (Molecular Materials Informatics, Canada) presented the final talk entitled “Pistoia Alliance AppStore: Apps for life sciences R&D.” The app strategy of the Pistoia Alliance was introduced: the precompetitive organization is exploring ways to encourage the uptake of mobile apps for life sciences R&D, and has released its own catalog of relevant apps. Future directions for the project were discussed. (figshare).

In summing up, the presentations described a broad array of databases and efforts that are enriching the chemistry community and will likely be a starting point for future ACS presentations and research.

Sean Ekins and Antony Williams, Symposium Organizers