Development & Use of Data Format Standards for Cheminformatics

The symposium on the “Development and Use of Data Format Standards for Cheminformatics” included speakers discussing standards already in existence, as well as standards still under development. Following the symposium on “Reproducibility, Reporting, Sharing & Plagiarism” earlier in the week, this symposium provided a perspective on what might be done to enable that reporting and sharing.

Stuart Chalk, from the University of North Florida, described three standards projects in which he is involved. The first, the Analytical Information Markup Language (AnIML), has been in progress since 2003 and is nearing completion. Chalk showed some of the details of the standard, some examples of the markup, and examples of the display of spectral data marked up according to the standard. Whether this is enough to establish uptake of the standard remains to be seen.

Chalk also described ChAMP (the Chemical Analysis Metadata Project), which is creating a standard for standards. The goal of the project is to define a minimum set of metadata to be used for all chemical standards development projects. While JCAMP-DX (Joint Committee on Atomic and Molecular Physical Data Exchange), AnIML, CSX (Chemical Semantics XML), and others have all had to develop metadata for their standards, those metadata efforts are duplicative. It would be much more efficient to settle on a standard set of metadata, extend that set where necessary, and allow the standards efforts to focus on the data unique to their discipline.

Chalk and Mirek Sopek, Chemical Semantics, Inc., discussed the details of an ontology standard for computational chemistry called CSX, Common Standard of eXchange. Chemical Semantics, who developed the standard with Department of Energy funding, are seeking to establish adoption by the major computational chemistry software packages. The goal of the project is to collect and organize the methodology as well as the results of computational chemistry calculations, accepting input from the major computational packages. This is accomplished in either of two ways: adding a capability to create an XML output file in addition to the normal output file, or converting the standard output to the CSX format. The CSX ontology is expressed as semantic web RDF (Resource Description Framework) triples, with mechanisms for representing data for both humans and machines.
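To make the idea of RDF triples concrete, the following is a minimal sketch in plain Python of how the methodology and results of a calculation might be stored as subject-predicate-object statements and then queried. Every identifier in it (the `ex:` prefix, the predicate names, the `match` helper) and the energy value are placeholders invented for illustration; none of them is taken from the CSX ontology, and a real implementation would more likely use an RDF library than tuples.

```python
from typing import List, NamedTuple, Optional


class Triple(NamedTuple):
    """One RDF-style statement: subject, predicate, object."""
    subject: str
    predicate: str
    obj: str


# Hypothetical identifiers and placeholder values, NOT the real CSX vocabulary.
TRIPLES = [
    Triple("ex:calc1", "ex:usesMethod", "ex:B3LYP"),
    Triple("ex:calc1", "ex:usesBasisSet", "ex:6-31G*"),
    Triple("ex:calc1", "ex:hasTotalEnergy", "-1.0"),  # placeholder number
    Triple("ex:calc2", "ex:usesMethod", "ex:MP2"),
]


def match(triples: List[Triple],
          subject: Optional[str] = None,
          predicate: Optional[str] = None,
          obj: Optional[str] = None) -> List[Triple]:
    """Return the triples that fit a pattern; None acts as a wildcard."""
    return [
        t for t in triples
        if (subject is None or t.subject == subject)
        and (predicate is None or t.predicate == predicate)
        and (obj is None or t.obj == obj)
    ]


# Which method did each calculation use?
methods = {t.subject: t.obj for t in match(TRIPLES, predicate="ex:usesMethod")}

# Everything recorded about one calculation.
calc1_facts = match(TRIPLES, subject="ex:calc1")
```

Because every statement has the same three-part shape, the same pattern-matching query works whether the object is a method, a basis set, or a numerical result, which is part of what makes the representation usable by machines as well as people.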

Ian Bruno, Cambridge Crystallographic Data Centre, discussed the development of the CIF (Crystallographic Information Framework) format. While the JCAMP-DX format served as a starting point for AnIML, the STAR format (Self-defining Text Archive and Retrieval) was the basis for CIF and for PDB (Protein Data Bank). Although it is possible to represent other types of data using STAR and, in fact, a STAR-based chemical substance representation was developed, Bruno noted that the STAR format found application only in the structural chemistry and structural biology communities. An effort from PDB is now underway to convert to macromolecular CIF. The format has been used internally by PDB, and will become the standard PDB format in 2016.

Steve Heller, InChI Trust Project Director, provided an update on InChI (IUPAC International Chemical Identifier). This standard has already been incorporated into a number of databases, as well as into some journal publications. Heller noted that InChI is well-defined for small organic molecules, and several working groups have completed the specification for Markush structures, polymers and mixtures, and reactions. The software needs to be modified to handle these new specifications. Future plans include inorganics, organometallics, large molecules (such as biopolymers, macromolecules, proteins, and enzymes), and crystal structures. Finally, regarding the speed of adoption of standards, Heller noted that the metric system was adopted as the measurement system of the United States in 1866, and that this was amended to the SI system in 2007, yet it still has not taken hold as a standard in the United States.
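For readers who have not looked closely at an identifier, an InChI string is a sequence of layers separated by slashes. The sketch below splits a simple standard InChI into its version, formula, and prefixed layers. The function name `split_inchi` and the returned dictionary layout are assumptions made for this illustration. It is not a validator: it does not handle identifiers that lack a formula layer, and if a layer prefix repeats, the later layer overwrites the earlier one.

```python
def split_inchi(inchi: str) -> dict:
    """Split a simple InChI string into version, formula, and prefixed layers."""
    prefix = "InChI="
    if not inchi.startswith(prefix):
        raise ValueError("not an InChI string")

    # The first slash-separated field is the version, e.g. "1S" for standard InChI.
    version, *rest = inchi[len(prefix):].split("/")

    result = {"version": version, "formula": None, "layers": {}}
    if rest:
        # Assumption of this sketch: the first layer is the chemical formula.
        result["formula"] = rest[0]
        for layer in rest[1:]:
            if layer:
                # Each remaining layer starts with a one-letter prefix,
                # such as "c" (connections) or "h" (hydrogens).
                result["layers"][layer[0]] = layer[1:]
    return result


# Standard InChI for ethanol.
ethanol = split_inchi("InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3")
```

Here `ethanol["formula"]` is `"C2H6O"`, and the `"c"` and `"h"` entries hold the connection and hydrogen layers. The layered design is what allows new layers for other classes of compound to be specified without disturbing the existing ones.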

Ken Kroenlein, from NIST (the National Institute of Standards and Technology) in Boulder, described the workflow for incorporating thermodynamic data from a number of journals, including the Journal of Chemical and Engineering Data and the Journal of Chemical Thermodynamics, into the NIST ThermoML database. Initially, it was anticipated that authors would use the ThermoML tools to submit their data to NIST. This did not work out as expected and, at the current time, undergraduate students are employed to extract the data from journal articles, after acceptance, but before publication. Errors are found in about 30% of the articles and these are corrected prior to publication. The data files are then made available after publication. NIST is currently working on automated procedures to better identify those articles containing thermodynamic data, to reduce the number of papers for human curation, and more quickly and efficiently process the data for publication.

Demonstrating some of the flexibility of JCAMP, Bob Hanson described JCAMP-MOL, an extension to JCAMP that allows a structure, in any format that the software Jmol can read, to be represented alongside the JCAMP spectral data. With this extension, JSmol and JSpecView can display structural and spectral data together and highlight spectral features along with the corresponding structural features. Hanson also demonstrated real-time NMR prediction, using APIs to transmit molfiles via the cloud to servers in remote locations and return the data necessary to build the JCAMP-MOL file with the spectrum/structure correlations.
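The flexibility of JCAMP comes largely from its simple labeled-record layout, in which each record is written as `##LABEL=value` and labels beginning with `$` are reserved for private, user-defined use. The following minimal sketch reads those header records from a small hand-written sample and separates standard labels from private ones. The sample file, the `$VENDOR NOTE` label, and the `read_labels` helper are all hypothetical. The sketch does not decode the spectral data tables, and it does not implement the JCAMP-MOL extension itself.

```python
# A tiny hand-written sample; the private "$VENDOR NOTE" label is invented.
SAMPLE = """\
##TITLE=Example spectrum
##JCAMP-DX=4.24
##DATA TYPE=INFRARED SPECTRUM
##XUNITS=1/CM
##YUNITS=TRANSMITTANCE
##$VENDOR NOTE=hypothetical private label
##END=
"""


def read_labels(text: str):
    """Return (standard, private) dictionaries of JCAMP-DX labeled records."""
    standard, private = {}, {}
    for line in text.splitlines():
        # Data rows and continuation lines are ignored in this sketch.
        if not line.startswith("##") or "=" not in line:
            continue
        label, value = line[2:].split("=", 1)
        label = label.strip().upper()
        if label.startswith("$"):
            # A "$" prefix marks a private (user-defined) label.
            private[label[1:]] = value.strip()
        else:
            standard[label] = value.strip()
    return standard, private


standard, private = read_labels(SAMPLE)
```

Software that does not recognize a private label can ignore it and still read the standard records, which is the property that lets an extension add structure information to a file without breaking existing readers.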

While not giving the last presentation of the day, Tony Williams, from the Royal Society of Chemistry, posed a question that perhaps provides as good a summary as any of the current state of affairs regarding cheminformatics standards: “How good is good enough?” While much effort has been expended on standards, both for spectral data and for structure representation, Williams proposed that two standards, JCAMP and molfile, could suffice to enable data sharing at much greater levels than is done now. These standards aren’t perfect. Vendors have added custom tagging in order to store metadata not captured by the formal standard. The formats could certainly be improved, but they are good enough to use and to share data, and they can be used now. Better standards could be developed, and as we learned, many standards are under development. However, standards that are not yet available are certainly not any better than no standard at all. While JCAMP may be imperfect, it is available now. When (and if?) the new standards are ready, the data in the current formats can be converted.

David Martinsen, Symposium Organizer