Developing databases and standards in chemistry

Steve Heller

Steve Heller was the first speaker, with an amusing scene-setting talk. He admitted that his secret in getting to where he is now was “luck, luck, luck”. He disliked chemistry lab work; he was at the right place at the right time with the right people; he worked with supportive people; and he planned for who would take over the work next. If the problem were just technology, someone would have solved it already. The real problem is always cultural and political, not technical. Steve had the good luck to be at NIH to collaborate with Hank Fales and Bill Milne; at the Environmental Protection Agency (EPA) with Morris Yaguda, when EPA started using mass spectrometry to identify pollutants; at the National Institute of Standards and Technology (NIST) with Steve Stein, when CAS stopped providing Registry Numbers to the NIST Mass Spectrometry database; and to be retiring just when Ted Becker and Alan McNaught thought that the International Union of Pure and Applied Chemistry (IUPAC) needed to move into the 21st century of chemical structure representation.

The NIH/EPA/NIST mass spectrometry database [1,2] originated at MIT (with Klaus Biemann), and was run at NIH in the 1970s using a modification of Richard Feldmann’s search software. Control moved to EPA, and eventually to NIST in the 1980s. NIST was the right home for the database: NIST now collects a few million dollars a year in mass spectrometry database royalties. The NIH/EPA Chemical Information System (CIS) [3] was a collection of chemical structures with links to various databases supporting environmental and scientific needs. It also had a number of analysis and prediction programs. All the databases had CAS Registry Numbers [4] as their link. The CIS worked for a number of years, but never had the full support of the government or of ACS. It died in the mid-1980s; it was a bit ahead of its time.

Steve’s next example of luck dates back to November 1999, when he and Steve Stein seeded the idea of a chemical identifier. The right people in this case were the IUPAC International Chemical Identifier (InChI) team: Steve himself, Alan McNaught, Igor Pletnev, Steve Stein, and Dmitrii Tchekhovskoi. InChI [5] was developed as a freely available, non-proprietary identifier for chemical substances that can be used in printed and electronic data sources, thus enabling easier linking of data compilations and unambiguous identification of chemical substances. It is a machine-readable string of symbols that enables a computer to represent a compound in a completely unequivocal manner. The InChI algorithm normalizes chemical structures and includes a “standardized” InChI, and the hashed form called the InChIKey. InChI is easy to generate, expressive, unambiguous, and unique, and it does not require a centralized operation. It enables structures to be searched by Internet search engines using the InChIKey.
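To make "machine-readable string of symbols" concrete, here is a minimal Python sketch that takes an InChI string apart into its slash-separated layers and checks whether a string has the fixed-length shape of an InChIKey. It does not generate or validate identifiers; that requires the InChI software itself. The function names `split_inchi_layers` and `looks_like_inchikey` are hypothetical, chosen for this illustration, and the sample strings are the widely published standard InChI and InChIKey for ethanol.

```python
import re

# Widely published standard InChI and InChIKey for ethanol (sample input only).
ETHANOL_INCHI = "InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3"
ETHANOL_INCHIKEY = "LFQSCWFLJHTTHZ-UHFFFAOYSA-N"

# Shape of an InChIKey: 14 letters, hyphen, 10 letters, hyphen, 1 letter.
# This checks the layout only; it cannot tell whether the hash is genuine.
INCHIKEY_PATTERN = re.compile(r"^[A-Z]{14}-[A-Z]{10}-[A-Z]$")


def split_inchi_layers(inchi: str) -> dict:
    """Split an InChI string into version, formula and prefixed layers.

    Simplification (assumption of this sketch): each layer prefix letter is
    expected at most once, so a repeated prefix overwrites the earlier one.
    """
    if not inchi.startswith("InChI="):
        raise ValueError("not an InChI string")
    version, *layers = inchi[len("InChI="):].split("/")
    result = {"version": version}
    for index, layer in enumerate(layers):
        if not layer:
            continue  # ignore empty segments
        if index == 0 and layer[0].isupper():
            # The formula layer comes first and carries no prefix letter.
            result["formula"] = layer
        else:
            # Other layers start with a one-letter prefix, e.g. c, h, q.
            result[layer[0]] = layer[1:]
    return result


def looks_like_inchikey(text: str) -> bool:
    """Return True if the text has the fixed-length layout of an InChIKey."""
    return INCHIKEY_PATTERN.match(text) is not None


if __name__ == "__main__":
    print(split_inchi_layers(ETHANOL_INCHI))
    print(looks_like_inchikey(ETHANOL_INCHIKEY))
```

The layered string carries the structure itself, whereas the InChIKey is a fixed-length hash with no characters that confuse text indexing, which is why it suits Internet search engines.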

InChI is not a replacement for any existing internal structure representations, but an addition to them. Its value is in finding and linking information. The proof of its success is in its widespread adoption [6]. All the major structure drawing programs have incorporated the InChI algorithm in their products. There are millions of InChIs in large chemical databases. Regardless of controversies and differing opinions, InChI has been more widely adopted than SMILES. Currently, the InChI algorithm can handle neutral and ionic organic molecules, radicals, and some inorganic, organometallic, and coordination compounds. Steps to expand it to handle more complex chemical structures are underway, under the auspices of the InChI Trust.

Finally, Steve had the luck to join the PubChem Advisory Board, where he worked with the right people, Steve Bryant and Evan Bolton. The database now contains nearly 92 million compounds, 223 million substances, and 1.2 million bioassays, together with related data and publications. More than 100,000 searches are carried out every day, and the database has 1.6 million unique users a month. The success of PubChem, like that of InChI, is measured by its widespread use.