Chemical structure representation in PubChem

Roger Sayle

A unique and invaluable feature of the architecture of PubChem is the distinction between the deposited structures (substances) and the normalized structures (compounds), and the retention of both. This feature allowed PubChem to avoid the early mistakes of CAS, said Roger Sayle of NextMove Software. PubChem Substance contains about 209.6 million structures; PubChem Compound contains about 91.7 million structures. The PubChem standardization service aims to determine when two chemical structures are the same.
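The substance versus compound split can be pictured as a grouping step: every deposited record keeps its own identifier, and records that normalize to the same key share one compound entry. The following is a minimal sketch only. The function names, the SID values, and the toy string-based normalizer are all hypothetical and stand in for PubChem's real standardization service, which is far more involved.

```python
from collections import defaultdict

def toy_normalize(smiles):
    """Toy stand-in for standardization: drop explicitly drawn hydrogens.

    This is string surgery for illustration only, not real SMILES handling.
    """
    return smiles.replace("([H])", "").replace("[H]", "")

def group_substances(depositions, normalize):
    """Map each normalized key to the list of substance IDs that share it."""
    compounds = defaultdict(list)
    for sid, structure in depositions:
        compounds[normalize(structure)].append(sid)
    return dict(compounds)

# Hypothetical deposited records: (substance ID, structure as deposited).
depositions = [
    (101, "CCO"),                              # ethanol, implicit hydrogens
    (102, "C([H])([H])([H])C([H])([H])O[H]"),  # ethanol, explicit hydrogens
    (103, "CC(=O)O"),                          # acetic acid
]

compounds = group_substances(depositions, toy_normalize)
print(compounds)
```

Both deposited forms of ethanol survive unchanged in `depositions`, while `compounds` records that they describe the same structure. Retaining both layers is the design decision the talk highlights.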

Consider, for example, implicit and explicit hydrogens. Ethanol (PubChem CID 702) has been deposited 1569 times with six different explicit atom counts, and thus has six different SIDs. All have the same SMILES and InChI. Nitrobenzene (PubChem CID 7416) has 164 substance depositions with five SIDs: two with molecular formula C6H5NO2, and the others with extra hydrogens (C6H6NO2+, C6H6NO2-, and C6H7NO2). To complicate matters, BIOVIA 2017 changed the interpretation of CTfiles (the default valences of some neutral main-group elements have changed); this affects 342,689 SIDs and 213,097 CIDs. PubChem is not fully consistent about protonation, but it generally preserves the deposited protonation state.
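The hydrogen-count problem can be sketched with a default-valence model: any valence not used by drawn bonds or explicit hydrogens is filled with implicit hydrogens. The valence table and function names below are illustrative assumptions. They are not the table used by PubChem, BIOVIA, or any particular toolkit, and charged atoms are not handled.

```python
# Illustrative default valences for neutral atoms (an assumption for this sketch).
DEFAULT_VALENCE = {"C": 4, "N": 3, "O": 2}

def implicit_hydrogens(element, bond_order_sum, explicit_h=0, valences=DEFAULT_VALENCE):
    """Hydrogens needed to fill the default valence beyond those drawn explicitly."""
    return max(valences[element] - bond_order_sum - explicit_h, 0)

def total_hydrogens(atoms, valences=DEFAULT_VALENCE):
    """Total hydrogen count for a list of heavy atoms.

    Each atom is (element, sum of bond orders to heavy atoms, explicit H count).
    """
    return sum(
        h + implicit_hydrogens(element, bonds, h, valences)
        for element, bonds, h in atoms
    )

# Three ways of depositing ethanol, differing only in how many H are explicit.
ethanol_implicit = [("C", 1, 0), ("C", 2, 0), ("O", 1, 0)]
ethanol_partial = [("C", 1, 3), ("C", 2, 0), ("O", 1, 0)]
ethanol_explicit = [("C", 1, 3), ("C", 2, 2), ("O", 1, 1)]

for record in (ethanol_implicit, ethanol_partial, ethanol_explicit):
    print(total_hydrogens(record))
```

All three records resolve to six hydrogens, which is why depositions with different explicit atom counts can share one SMILES and InChI. Because the result depends on the `valences` table, a change to default valences changes how existing files are interpreted.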

A major challenge in chemical databases is aromaticity: two structures that differ only in their Kekulé forms are the same molecule. A significant innovation in cheminformatics was Evan Bolton’s development of a “canonical” Kekulé SMILES form of a molecule. This enabled PubChem to avoid the early mistakes of Daylight Chemical Information Systems. Chemistry toolkits (and chemists) disagree on which ring systems are aromatic and which are not, hence PubChem’s wish to remain “neutral” by providing only non-aromatic SMILES. Unfortunately, Evan’s algorithm aromatizes all conjugated cycles, and not just those associated with the smallest set of smallest rings, a computationally demanding requirement. PubChem does not restrict aromaticity to the Hückel 4n+2 rule; thus conjugated ring systems such as pentalene are deemed aromatic.
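For contrast with PubChem's broader definition, the Hückel 4n+2 criterion can be written as a one-line test on the pi-electron count. This sketch shows only the electron-counting rule. The function name is hypothetical, the pi-electron counts are supplied by hand, and it says nothing about how PubChem or any toolkit perceives rings.

```python
def is_huckel_count(pi_electrons):
    """True when the count equals 4n + 2 for some integer n >= 0."""
    return pi_electrons >= 2 and (pi_electrons - 2) % 4 == 0

# Pi-electron counts entered by hand for well-known ring systems.
examples = {
    "benzene": 6,
    "naphthalene": 10,
    "cyclobutadiene": 4,
    "pentalene": 8,
}

for name, count in examples.items():
    print(name, count, is_huckel_count(count))
```

Pentalene, with eight pi electrons, fails this test, yet it is a fully conjugated ring system and so is treated as aromatic under the approach described above.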

Tautomers are normalized. Thus 4-(phenylazo)-1-naphthalenol (CAS RN 3651-02-3), a case of classic tautomerism, has only one CID (5355205), but there are two InChIs, one for each tautomer. Unfortunately, not all tautomers are handled so well: four tautomers of this molecule are recorded:


PubChem follows InChI in breaking bonds to metals. It currently handles 109 of the 118 elements in the periodic table. PubChem registration checks that any specified isotope has been observed experimentally. Hence methane labeled with carbon-7 (7CH4) is rejected, but the carbon-8 analogue (8CH4), which has an exceptionally short half-life, is allowed. Another quirk is that PubChem does not normalize mononuclidic isotopes, that is, isotope labels on elements that have only one naturally occurring isotope. Hence fluoromethane has CID 11638, while fluoromethane explicitly labeled with fluorine-19 (19F) has CID 58338844. PubChem rejects chlorine dioxide and carbide anions, but it accepts disulfur dioxide (O=S=S=O), which is stable for only a few seconds.
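The isotope check described above amounts to a lookup against a table of experimentally observed nuclides. The sketch below assumes a tiny, deliberately incomplete table and hypothetical function names. It is not PubChem's registration code, and a real table would cover every handled element.

```python
# Deliberately incomplete, illustrative table of observed mass numbers.
OBSERVED_MASS_NUMBERS = {
    "C": {8, 12, 13, 14},
    "F": {18, 19},
}

def isotope_is_accepted(element, mass_number):
    """Accept an isotope label only if that nuclide appears in the table."""
    return mass_number in OBSERVED_MASS_NUMBERS.get(element, set())

def atom_key(element, mass_number=None):
    """Identity key that keeps the isotope label as part of the atom.

    Because the label is kept, unlabeled F and explicit 19F remain distinct.
    """
    return (element, mass_number)

print(isotope_is_accepted("C", 7))           # carbon-7: not in the table
print(isotope_is_accepted("C", 8))           # carbon-8: in the table
print(atom_key("F") == atom_key("F", 19))    # unlabeled vs. labeled fluorine
```

Keeping the mass number in `atom_key` mirrors the quirk noted above: without a normalization step that drops the label for mononuclidic elements, fluoromethane and its 19F-labeled form end up as separate compounds.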

One of PubChem's innovations is that it explicitly stores relationships (such as similar 3D shape) in the database. Given a CID, you can find all similar CIDs based on Tanimoto similarity, for example, but you can also find all the tautomeric forms provided by depositors by following the links from CID to SID. Likewise, there are internal links (backwards and forwards) between mixtures and their components, between isotopes of a compound, and between enantiomers of a compound.
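Tanimoto similarity between two fingerprints is the number of shared "on" bits divided by the number of bits set in either. The sketch below represents fingerprints as sets of bit positions. The fingerprints, CID labels, threshold, and function names are made up for illustration and do not reflect PubChem's actual fingerprint or cutoff.

```python
def tanimoto(bits_a, bits_b):
    """Tanimoto coefficient |A & B| / |A | B| for two sets of 'on' bit positions."""
    a, b = set(bits_a), set(bits_b)
    if not a and not b:
        return 0.0  # convention chosen for this sketch when both are empty
    return len(a & b) / len(a | b)

def similar_ids(query_bits, fingerprints, threshold):
    """Return IDs whose fingerprint meets the similarity threshold, best first."""
    scored = [(tanimoto(query_bits, bits), cid) for cid, bits in fingerprints.items()]
    return [cid for score, cid in sorted(scored, reverse=True) if score >= threshold]

# Hypothetical fingerprints keyed by made-up compound IDs.
fingerprints = {
    1: {1, 2, 3, 4},
    2: {3, 4, 5, 6},
    3: {1, 2, 3, 5},
}

print(tanimoto(fingerprints[1], fingerprints[2]))
print(similar_ids({1, 2, 3, 4}, fingerprints, threshold=0.5))
```

Storing the outcome of such comparisons as explicit links, rather than recomputing them for every query, is what lets a user move directly from one CID to its neighbors.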

PubChem allows depositors to specify advanced representations of molecular structures such as inorganics and organometallics via SD tags. Quadruple, dative, complex, and ionic bonds can be specified with the non-standard bond option; hydrogen, resonance, bold, and Fischer bonds, and close contacts can be specified with the bond annotations option. Relatively few depositors make use of these options.

Roger concluded by saying that PubChem represents the current state of the art in chemical structure representation [33,34,35]. Under the surface, unseen by most users, are many technical and scientific innovations that have enabled PubChem to scale to nearly 100 million compounds. From simple design decisions such as the substance versus compound distinction, to breakthroughs such as canonical Kekulé SMILES, the architecture of PubChem contains a treasure trove of cheminformatics innovations, covering normalization, tautomers, mixtures, 2D fingerprints and similarity, substructure search, biopolymers, text mining, and much more.