Exchangeable Molecular and Analytical Data Formats

The importance of facilitating data exchange

During the morning session on molecular data formats, Keith Taylor (Accelrys) and Roger Sayle (NextMove Software) both noted that although a small number of molecular structure formats were in common use (such as the ubiquitous molfile), some users did not conform to either of the published standards, Mol V2000 or V3000. Roger reported that when a data set covering a range of element and charge types was tested with 24 different “mol” file reader packages, the number of failures and errors was disturbingly large.
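To illustrate the kind of non-conformance the speakers described, here is a minimal sketch of a check on the counts line of a V2000 molfile. It is not any speaker's test suite. It assumes only the fixed-width layout of the published V2000 format: three header lines, then a counts line with the atom count in columns 1-3, the bond count in columns 4-6 and the version tag in columns 34-39. The function name check_v2000_counts and both sample files are hypothetical.

```python
def check_v2000_counts(molfile_text):
    """Return a list of problems found in a V2000 molfile (empty if none)."""
    lines = molfile_text.splitlines()
    if len(lines) < 4:
        return ["fewer than 4 lines: header block or counts line missing"]

    problems = []
    counts = lines[3]  # the counts line follows the 3-line header block

    # The version tag occupies columns 34-39 (0-based slice 33:39).
    if counts[33:39].strip() != "V2000":
        problems.append("version tag 'V2000' not found in columns 34-39")

    # Atom and bond counts are fixed-width fields, not whitespace-delimited.
    try:
        n_atoms = int(counts[0:3])
        n_bonds = int(counts[3:6])
    except ValueError:
        problems.append("atom/bond counts are not fixed-width integers")
        return problems

    body = lines[4:]
    if len(body) < n_atoms + n_bonds:
        problems.append("fewer atom/bond lines than the counts line declares")
    if not any(line.rstrip() == "M  END" for line in body):
        problems.append("terminating 'M  END' line not found")
    return problems


# Hypothetical two-atom example (C-O), built field by field so that the
# fixed-width columns are explicit.
counts_line = f"{2:3d}{1:3d}" + "  0" * 8 + "999 V2000"
atom_tail = "  0" * 12
good = "\n".join([
    "",                      # header line 1: molecule name (may be blank)
    "  example",             # header line 2: program/timestamp line
    "",                      # header line 3: comment
    counts_line,
    f"{0.0:10.4f}{0.0:10.4f}{0.0:10.4f} C  " + atom_tail,
    f"{1.4:10.4f}{0.0:10.4f}{0.0:10.4f} O  " + atom_tail,
    f"{1:3d}{2:3d}{1:3d}" + "  0" * 4,
    "M  END",
])

# A writer that emits whitespace-delimited counts breaks fixed-width readers.
bad = good.replace(counts_line, "2 1 V2000")

print(check_v2000_counts(good))
print(check_v2000_counts(bad))
```

A real reader has to validate far more than this (atom symbols, charges, property lines), which is where differences between packages tend to appear.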

Geoffrey Hutchison (University of Pittsburgh) then described the Open Babel project (http://openbabel.org), which has produced a toolbox to read, write and convert over 110 chemical file formats, and the difficulties created by files that do not conform to their format specifications.

[Image: slide courtesy of Phil McHale]

In his presentation, Phil McHale noted that PerkinElmer was working on an Open XML format for exporting data from Electronic Laboratory Notebooks (ELNs); at the time, such export formats were generally proprietary. He also reviewed the CDX and CDXML formats, both of which have been widely accepted and used.

[Image: slide courtesy of Stephen Heller]

Stephen Heller gave an update on the InChI representation and the InChI Trust. Like barcodes and QR codes, InChIs are not designed to be interpreted by humans; they are generated by computer from structures drawn on-screen with existing structure drawing software. The original structure can be regenerated from an InChI with appropriate software. Steve noted that a number of videos have been produced to explain how InChIs are applied.

Evan Bolton (National Center for Biotechnology Information, National Institutes of Health) spoke about the new features in the PubChem data submission portal that support a wide range of user-defined data, and about the need for data standards.

Barry Bunin (Collaborative Drug Discovery) noted that there was no standard computer-based approach to managing large molecules such as peptides, antibodies, therapeutic proteins or vaccines. HELM (Hierarchical Editing Language for Macromolecules), which was released into production at Pfizer in 2008, was being introduced as an Open Source approach. He then introduced the CDD (Collaborative Drug Discovery) vault as a hosted database solution for secure management and sharing of chemical and biological data.

For the afternoon session on spectroscopic data, the first presentation was a joint paper from Tony Davies (AkzoNobel Chemicals) and Robert Lancashire (University of the West Indies) who gave some history on the JCAMP-DX data formats. Recognition was given to Paul Wilks, Bob McDonald and Jeannette Grasselli-Brown as pioneers in the publication of JCAMP-DX standards. Since 1988 the standards for a wide range of techniques have been published and in 1995 they became the responsibility of IUPAC.
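For readers unfamiliar with the format, a JCAMP-DX file is plain text built from labeled data records of the form ##LABEL=value, with $$ starting a comment. The sketch below shows one way to collect those records into a dictionary. It assumes the label normalization rule from the published specification (labels are upper-cased and spaces, hyphens, slashes and underscores are discarded) and does not decode compressed data tables. The function name parse_jcamp_records and the sample spectrum are hypothetical.

```python
import re


def parse_jcamp_records(text):
    """Collect JCAMP-DX labeled data records (##LABEL=value) into a dict.

    Lines that do not start with '##' are treated as continuations of the
    most recent record, which is how data tables such as XYDATA are captured.
    """
    records = {}
    label = None
    for raw in text.splitlines():
        line = raw.split("$$", 1)[0].rstrip()  # drop comments
        if line.startswith("##"):
            name, _, value = line[2:].partition("=")
            # Normalize the label: remove spaces, '-', '/', '_' and upper-case.
            label = re.sub(r"[\s\-/_]", "", name).upper()
            records[label] = value.strip()
        elif label is not None and line:
            records[label] += "\n" + line
    return records


# Hypothetical minimal file with an uncompressed three-point data table.
sample = """\
##TITLE=Example spectrum
##JCAMP-DX=4.24 $$ version of the standard the file claims to follow
##DATA TYPE=INFRARED SPECTRUM
##XUNITS=1/CM
##YUNITS=TRANSMITTANCE
##NPOINTS=3
##XYDATA=(X++(Y..Y))
4000 0.95 0.94 0.93
##END=
"""

records = parse_jcamp_records(sample)
print(records["DATATYPE"])
print(records["XYDATA"].splitlines()[1])
```

Because every record is self-describing text, the format is easy to extend with new labels, which is also why vendor-specific and draft-standard records can end up in circulation.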

Michael Boruta (Advanced Chemistry Development) followed by showing the transition from handwritten annotations on chart paper copies of spectra to electronic equivalents that could be stored in “knowledgebases.” For example, ACD/Labs Spectrus Process includes separate knowledgebases for IR and Raman. The assignments can be exported as part of JCAMP-DX files, but no standard exists for doing so.

[Image: slide courtesy of Clemens Anklin]

Clemens Anklin (Bruker Biospin) identified the common data formats used for various techniques. In the case of NMR this was predominantly JCAMP-DX. He lamented the fact that, whilst 2D NMR had existed before any JCAMP-DX standards were published, the latest accepted standard for NMR was version 5.01, published in 1999, and this covered only 1D data. The version 6 format for 2D has been in draft since 2002 and has been implemented by vendors who could not wait any longer.

Stuart Chalk (University of North Florida) introduced the AnIML specification and highlighted the features and benefits of using an XML protocol that could be fully validated. He noted that since 2003 it has been designed to be a (backwards compatible) replacement for JCAMP-DX. The task group guiding the process set its charter as "to develop an analytical data standard that can be used to store data from any analytical instrument" and holds virtual meetings on a monthly basis to develop the specification. The first set of specifications is targeted to go through ASTM balloting in early 2014.

Bob Hanson (St. Olaf College) finished this session with a proposal to extend the JCAMP-DX standard so that a single file could contain the molecular structure data as well as the spectrum, together with annotations linking the two. This would allow interaction with cloud services: a molfile could be passed to a server and a simulated spectrum returned with sufficient information to apply all the annotations required to identify the peaks.

The full symposium program is listed in Chemical Information Bulletin, 2013, 65(3) at: http://bulletin.acscinf.org/node/486#THa.

Robert Lancashire and Antony Williams, Symposium Organizers