Using InChI to manage data

Peter Linstrom

To explain the usefulness of InChI, Peter Linstrom of NIST started by defining a problem as follows. “I have data about a substance and my colleague has data about a substance. Are these substances the same so that we can combine the data about them? Are we talking about well-defined molecular species?” The term “well-defined” can mean different things to different people. A well-drawn structure can precisely identify a molecule, but there are issues with formats and drawing conventions. Drawing a structure from a name by itself does not improve identification because additional information is often required to improve specificity. Moreover, sometimes we do not have a “well-defined” molecular structure. This is a general problem which cannot be solved for a significant portion of historical data.

InChI can help because it identifies a molecule based on its structure, and it allows us to ask whether two “well-defined” structures are the same. InChI also has a layered design that allows matches to related compounds such as stereoisomers, geometric isomers, and “isotopologues” (compounds that differ only in isotopic composition). In addition, with a little string manipulation we can ask even more questions.

An InChI is hierarchically layered. There are several InChI layer types, each representing a different class of structural information. These include formula, connectivity, geometric isomerism and stereoisomerism, isotopic composition, charge, and protonation state layers. Layers are separated by a forward slash. Consider the two isomers of carvone, the InChIs of which differ only in the stereochemical layer (emboldened in the following). One isomer smells of spearmint and has


The other smells of caraway and has


(The “1S” at the beginning of each string indicates a standard InChI.)
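The layer comparison and the "little string manipulation" described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not something shown in the talk: the function names `split_layers`, `differing_layers`, and `same_skeleton` are hypothetical, the two InChI strings are the carvone enantiomers as commonly listed in public databases (they should be verified before reuse), and the parser assumes that each layer prefix occurs at most once, which is not true of every InChI (for example, those with fixed-hydrogen or isotopic stereo layers).

```python
def split_layers(inchi: str) -> dict:
    """Split an InChI into layers keyed by prefix letter (simplified sketch)."""
    prefix = "InChI="
    if not inchi.startswith(prefix):
        raise ValueError("not an InChI string: %r" % inchi)
    version, *parts = inchi[len(prefix):].split("/")
    layers = {"version": version}
    for index, part in enumerate(parts):
        if not part:
            continue
        # The formula layer has no prefix letter; every other layer starts
        # with a lowercase letter such as c, h, q, p, b, t, m, s or i.
        if index == 0 and not part[0].islower():
            layers["formula"] = part
        else:
            layers[part[0]] = part[1:]
    return layers


def differing_layers(inchi_a: str, inchi_b: str) -> list:
    """Return the sorted keys of the layers in which two InChIs differ."""
    a, b = split_layers(inchi_a), split_layers(inchi_b)
    return sorted(key for key in a.keys() | b.keys() if a.get(key) != b.get(key))


def same_skeleton(inchi_a: str, inchi_b: str) -> bool:
    """True if formula, connectivity (c) and hydrogen (h) layers all match."""
    a, b = split_layers(inchi_a), split_layers(inchi_b)
    return all(a.get(key) == b.get(key) for key in ("formula", "c", "h"))


# Assumed example data: the two carvone enantiomers (verify before reuse).
CARVONE_A = ("InChI=1S/C10H14O/c1-7(2)9-5-4-8(3)10(11)6-9"
             "/h4,9H,1,5-6H2,2-3H3/t9-/m1/s1")
CARVONE_B = ("InChI=1S/C10H14O/c1-7(2)9-5-4-8(3)10(11)6-9"
             "/h4,9H,1,5-6H2,2-3H3/t9-/m0/s1")

if __name__ == "__main__":
    print(differing_layers(CARVONE_A, CARVONE_B))  # ['m']
    print(same_skeleton(CARVONE_A, CARVONE_B))     # True
```

Under these assumptions the two strings differ only in the `/m` stereo sublayer, so a question such as "are these the same compound if stereochemistry is ignored?" reduces to comparing the formula, connectivity, and hydrogen layers.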

The NIST Chemistry WebBook provides an example of the use of InChI. It combines data from many sources. It is over 19 years old, and the identifiers in its older datasets have many problems. Historically, CAS Registry Numbers and other accession numbers were used in matching species, but there were many problems (even the checksums in CAS Registry Numbers were wrong in one case out of ten). Newer data often come with structures, so InChI can be used. Moreover, drawing structures can force additional analysis. Nevertheless, there are still legacy data with incomplete identifiers (e.g., for stereoisomers and isotopologues). An example is the species labeled as “gamma-elemene,” where 81 chromatographic retention values in the literature were analyzed and found to correspond to five different chemical species (with similar mass spectra).9
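The checksum problem mentioned above is easy to screen for, because the last digit of a CAS Registry Number is a check digit computed from the others. The following is a minimal sketch of that check, not the procedure used for the WebBook; the function name `cas_check_digit_ok` is hypothetical. The digits before the check digit are weighted 1, 2, 3, ... starting from the rightmost, and the weighted sum modulo 10 must equal the check digit.

```python
import re


def cas_check_digit_ok(cas_rn: str) -> bool:
    """Return True if a CAS Registry Number has a valid format and check digit."""
    # Format: 2 to 7 digits, hyphen, 2 digits, hyphen, 1 check digit.
    match = re.fullmatch(r"(\d{2,7})-(\d{2})-(\d)", cas_rn.strip())
    if match is None:
        return False
    body = match.group(1) + match.group(2)
    check_digit = int(match.group(3))
    # Weight the digits 1, 2, 3, ... starting from the rightmost body digit.
    total = sum(weight * int(digit)
                for weight, digit in enumerate(reversed(body), start=1))
    return total % 10 == check_digit


if __name__ == "__main__":
    print(cas_check_digit_ok("7732-18-5"))  # water: True
    print(cas_check_digit_ok("7732-18-4"))  # corrupted check digit: False
```

A passing check digit only shows that the number is internally consistent; it does not show that the number has been attached to the right substance.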

PubChem is a great resource. Apart from the features that we all know and love, there are lesser-known features that help disambiguate species. The substance database, which is separate from the compound database, records the mapping of names to structures as supplied by the various people who submitted the data. Partial InChIKey search retrieves compounds that have the same composition and connectivity but different information in further InChI layers.
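The idea behind a partial InChIKey search can be illustrated locally, without calling PubChem. An InChIKey has three hyphen-separated blocks, and the first block of 14 letters is a hash of the basic composition and connectivity, so records that share it are candidates for being stereoisomers or isotopologues of one another. The sketch below is an assumption-laden illustration: the function names and record labels are hypothetical, and the example keys (three carvone keys and one for water) are given as commonly listed in public databases and should be verified before reuse.

```python
from collections import defaultdict


def connectivity_block(inchikey: str) -> str:
    """Return the first (14-letter) block of an InChIKey."""
    blocks = inchikey.strip().upper().split("-")
    if len(blocks) != 3 or len(blocks[0]) != 14:
        raise ValueError("not an InChIKey: %r" % inchikey)
    return blocks[0]


def group_by_connectivity(records):
    """Group (label, InChIKey) pairs by the first block of the key."""
    groups = defaultdict(list)
    for label, inchikey in records:
        groups[connectivity_block(inchikey)].append(label)
    return dict(groups)


# Hypothetical records; keys are assumed example data (verify before reuse).
RECORDS = [
    ("record 1", "ULDHMXUKGWMISQ-SECBINFHSA-N"),  # carvone, one enantiomer
    ("record 2", "ULDHMXUKGWMISQ-VIFPVBQESA-N"),  # carvone, other enantiomer
    ("record 3", "ULDHMXUKGWMISQ-UHFFFAOYSA-N"),  # carvone, no stereo given
    ("record 4", "XLYOFNOQVPJJNP-UHFFFAOYSA-N"),  # water
]

if __name__ == "__main__":
    for block, labels in group_by_connectivity(RECORDS).items():
        print(block, labels)
```

Because the key is a hash, a shared first block is strong evidence of shared connectivity rather than proof; the full InChI strings remain the authoritative comparison.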

Voltaire said that perfect is the enemy of good. We cannot fix all chemical structure errors without abandoning valuable historical data, and newer data are not immune to identification problems either, but we can make progress where resources permit. Tools such as InChI and PubChem can help, but they cannot solve the entire problem. “Zero Defects” was an industrial quality management approach championed in the 1960s and 1970s, which was criticized as an exhortation to do something that may not be possible. Total Quality Management, the approach championed by W. Edwards Deming, is based on continuous improvement of systems, driven by measurement. It has been dramatically successful and has succeeded where “Zero Defects” failed. The transition from “econoboxes” in the early 1970s to modern, reliable compact cars did not happen overnight. Similarly, our chemical structure tools are getting better, but we still have a long way to go.