PubChem: a resource for cognitive computing

Stephen Boyer

Stephen Boyer of the IBM Almaden Research Center has collaborated with OntoChem, the University of Alberta, NIH, EMBL-EBI, and others on a chemical ontology approach to drug discovery. Their work identifies a family of molecular attributes that define a molecule and explores how those attributes might be used to infer functional attributes from molecules with similar structure-activity relationships. As an example, the target molecule (Azulfidine) is assigned the molecular attributes benzoic acid, carboxylic acid, carbonyl compound, phenol, azobenzene, azo compound, sulfone, sulfonamide, pyridine, benzene, and hydroxyl group:


For Azulfidine, assignments are also made for functional attributes such as “it is used for” the treatment of Crohn’s disease, rheumatoid arthritis, and ulcerative colitis.

The process begins by converting a compound name to SMILES. From the SMILES, molecular attributes (also known as molecular descriptors or chemical labels) such as “hydroxy” or “benzoic” or “phenyl” are generated. Steve’s team submitted about 1.4 million SMILES strings from ChEMBL to two different auto-classification systems to make a ChEMBL ontology database with two computer-generated chemical ontologies: ClassyFire (written by David Wishart of the University of Alberta and Ph.D. student Yannick Djoumbou Feunang) and OntoChem (Lutz Weber).

Steve then used this database in a multi-step process. He queried it for a gene or target of interest (“XYZ”); created a set of candidate compounds with reported activity for XYZ; refined the candidate set to create a training set of compounds (e.g., with EC50 < 30); scored and ranked the molecular attributes; and then used those results to query the ChEMBL database minus the candidate set and the training set. In this way he identified 100 compounds with potential activity that were in neither the candidate set nor the training set.
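The steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the team's implementation: the corpus layout (a dictionary mapping compound identifiers to sets of ontology labels), the function name `rank_candidates`, and the enrichment ratio used as the label score are all hypothetical choices made for the example.

```python
from collections import Counter


def rank_candidates(corpus, training_ids, candidate_ids=(), top_labels=10, top_compounds=100):
    """Rank compounds outside the candidate and training sets by label score.

    corpus: dict mapping compound id -> set of ontology labels (hypothetical layout).
    training_ids: ids of the potent compounds used as the training set.
    candidate_ids: ids of all compounds with reported activity for the target.
    """
    n_train = len(training_ids)
    n_corpus = len(corpus)

    # Count how often each label occurs in the whole corpus and in the training set.
    corpus_counts = Counter(label for labels in corpus.values() for label in labels)
    train_counts = Counter(label for cid in training_ids for label in corpus[cid])

    # Assumed score: enrichment of a label in the training set relative to the corpus.
    scores = {
        label: (train_counts[label] * n_corpus) / (n_train * corpus_counts[label])
        for label in train_counts
    }
    best = sorted(scores, key=lambda label: (-scores[label], label))[:top_labels]

    # Query the corpus minus the candidate set and the training set.
    excluded = set(training_ids) | set(candidate_ids)
    ranked = []
    for cid, labels in corpus.items():
        if cid in excluded:
            continue
        total = sum(scores[label] for label in best if label in labels)
        if total > 0:
            ranked.append((total, cid))
    ranked.sort(key=lambda pair: (-pair[0], pair[1]))
    return best, [cid for _, cid in ranked[:top_compounds]]
```

Calling `rank_candidates(corpus, training_ids, candidate_ids)` returns the highest-scoring labels and up to 100 compound identifiers that carry them. Any other label statistic could be substituted for the enrichment ratio without changing the rest of the workflow.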

Steve reported two experiments. The first concerned MDM2 (mouse double minute 2 homologue), a protein that in humans is encoded by the MDM2 gene. The key target of MDM2 is the p53 tumor suppressor. Steve carried out a sample analysis, using the two chemical ontologies, to predict compounds that may have MDM2 activity, scored with a chi-squared test. In ChEMBL, 20,558 molecules have activity for MDM2, but only 27 of these have IC50 < 30 nM. He compared the top 100 compounds identified by ClassyFire with the top 100 compounds identified by OntoChem, generated with the following parameters: top 10 labels, assay minimum = 30, and corpus count cutoff = 300,000. He found 57 predicted compounds in common between the two ontologies. Not having a laboratory, he was unable to test any of these compounds, but he did find structure-activity data in numerous patents covering 26 compounds with reported assay data for MDM2, and some of them matched compounds in his set of 57 potential actives.
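As a rough illustration of the two calculations mentioned here, the sketch below computes a chi-squared statistic for one label and the overlap between two ranked prediction lists. The talk does not say how the contingency table was built, so the 2x2 layout (label present or absent, training set versus the rest of the corpus) is an assumption, and the function names `chi_squared_2x2` and `shared_predictions` are hypothetical.

```python
def chi_squared_2x2(a, b, c, d):
    """Pearson chi-squared statistic for a 2x2 contingency table.

    Assumed layout for one ontology label:
      a = training compounds with the label,   b = training compounds without it,
      c = other corpus compounds with it,      d = other corpus compounds without it.
    """
    n = a + b + c + d
    denominator = (a + b) * (c + d) * (a + c) * (b + d)
    if denominator == 0:
        # A row or column is empty, so the label carries no information.
        return 0.0
    return n * (a * d - b * c) ** 2 / denominator


def shared_predictions(ranked_a, ranked_b, top_n=100):
    """Return the compounds that appear in the top_n of both ranked lists."""
    return set(ranked_a[:top_n]) & set(ranked_b[:top_n])
```

With the ranked outputs of the two ontologies as `ranked_a` and `ranked_b`, `len(shared_predictions(ranked_a, ranked_b))` gives the count of compounds predicted by both, which is the kind of figure reported as 57 above.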

Steve’s second example concerned inhibitors of SGLT2 (sodium/glucose cotransporter 2), which reduce blood glucose levels and have potential use in the treatment of type II diabetes. Thirty compounds with assay data for SGLT2 were derived from the ChEMBL database, but only 12 had EC50 < 10 nM. Using these 12 molecules as a training set, the team identified several new molecules as possibly having SGLT2 activity. A search of patents and the scientific literature confirmed that several of the identified compounds had reported significant activity as SGLT2 inhibitors.

Steve closed with some final thoughts on innovation. Steven Johnson [25] coined the term “hummingbird effect” to describe how an innovation in one field ends up triggering changes that seem to belong to a different domain altogether. Innovations arise from the “adjacent possible” (a term Johnson borrows from the theoretical biologist Stuart Kauffman): you get railroads when it is railroading time, and not before, even if some prescient inventor sketches them out far in advance, and they open up all kinds of new possibilities.