Open chemistry resources provided by the NCI computer-aided drug design (CADD) group

Marc Nicklaus

NCI has a 60-year history of cheminformatics, starting with the drug development program authorized by Congress in 1955, said Marc Nicklaus, the leader of the NCI CADD group. By 1963, “it became clear the system must track not just individual chemical compounds, but distinct samples of chemical compounds…magnifying the data management problem considerably” [10]. This was a direct antecedent of the concept of separate PubChem Substance and Compound databases. The open NCI structure database was made publicly available in 1994 (see the talk by Daniel Zaharevitz, summarized above). In 1998, the NCI Database Browser became the first public Web GUI for a large small-molecule database, with advanced capabilities such as full substructure search. It arose from a collaboration between NCI and Wolf-Dietrich Ihlenfeldt at the University of Erlangen-Nürnberg. The Enhanced NCI Database Browser has 250,250 structure records and about 60 million data points, mostly Prediction of Activity Spectra for Substances (PASS) [11] predictions. Sophisticated search and output options are available.

The CACTUS Web Server offers many services, tools, and downloadable datasets centered on small molecules. Apart from the database browser, Marc singled out the Chemical Structure Lookup Service (CSLS, pronounced “sizzles”), the Optical Structure Recognition Application (OSRA), and the Chemical Identifier Resolver (CIR). Developed by Igor Filippov in 2006, CSLS is a “phone book for chemical structures”, linking 74 million indexed structures (46 million unique structures) to over 100 databases. OSRA, developed by Igor Filippov in 2007, converts graphical representations of chemical structures in journal articles, patents, or other documents into SMILES. CIR, developed by Markus Sitzmann in 2009, converts one structure identifier or representation into another. Its workflow involves lookups in the CADD group’s chemical structure database (CSDB). CSDB contains about 121 million structure records for 85 million unique structures, drawn from 140 databases, including PubChem and the Sigma-Aldrich iResearch Library.
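To make the idea of identifier conversion concrete, here is a minimal sketch of how a CIR request might be built and sent. The talk summary does not give the URL scheme; the pattern used below (base address, then identifier, then requested representation) and the representation names "smiles" and "stdinchikey" are assumptions based on the public CIR service, and the helper names `cir_url` and `resolve` are hypothetical.

```python
from urllib.parse import quote
from urllib.request import urlopen

# Assumed base address of the public CIR service (not stated in the talk).
CIR_BASE = "https://cactus.nci.nih.gov/chemical/structure"


def cir_url(identifier: str, representation: str) -> str:
    """Build a CIR-style URL of the form <base>/<identifier>/<representation>.

    The identifier is percent-encoded so that SMILES characters such as
    '(', ')' and '=' survive as a single URL path segment.
    """
    return f"{CIR_BASE}/{quote(identifier, safe='')}/{quote(representation, safe='')}"


def resolve(identifier: str, representation: str, timeout: float = 10.0) -> str:
    """Send the request and return the plain-text answer (needs network access)."""
    with urlopen(cir_url(identifier, representation), timeout=timeout) as response:
        return response.read().decode("utf-8").strip()


# Building URLs requires no network access:
name_to_smiles = cir_url("aspirin", "smiles")
smiles_to_key = cir_url("c1ccccc1C(=O)O", "stdinchikey")
```

Calling `resolve("aspirin", "smiles")` would then ask the service to turn a name into a SMILES string; the exact response format should be checked against the CIR documentation before relying on it.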

It might be thought that the many large databases now available for CADD are enough, but perhaps we need a new approach. Perhaps we should not design a new molecule, and then ask how it can be made. Instead, we could look into what can be made reliably and cheaply, and then search only among those molecules for new, potentially bioactive compounds, using the usual CADD approaches.

Therefore, Marc’s team has begun building the Synthetically Accessible Virtual Inventory (SAVI), using a set of predictive and richly annotated rules (transforms) from Lhasa Limited and Lhasa LLC, a set of reliably available and inexpensive starting materials from MilliporeSigma, and the cheminformatics engine CACTVS from Xemistry GmbH.

A parser has been implemented in CACTVS for the CHMTRN/PATRAN retrosynthetic transforms (of which there are more than 2,300), and it has been adapted for the forward-synthetic SAVI approach. Fourteen transforms have been implemented and used in production runs so far. Among the 3.3 million building blocks in sets from Sigma-Aldrich and other catalogs, 377,484 compounds were identified as highly available, most of them annotated with pricing and availability data.
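The forward-synthetic idea can be illustrated with a toy sketch: start from the available building blocks, and for each transform pair up the blocks that carry the reactive groups it requires. This is not the CACTVS or CHMTRN/PATRAN implementation, which works on real structures with richly annotated rules; all class names, group labels, and example compounds below are hypothetical and chosen only to show the shape of a one-step enumeration.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class BuildingBlock:
    name: str
    groups: frozenset  # hypothetical reactive-group labels


@dataclass(frozen=True)
class Transform:
    name: str
    needs_a: str  # group required on the first reactant
    needs_b: str  # group required on the second reactant


def enumerate_products(blocks, transforms):
    """Return (transform, reactant A, reactant B) for every matching pair."""
    results = []
    for t in transforms:
        a_side = [b for b in blocks if t.needs_a in b.groups]
        b_side = [b for b in blocks if t.needs_b in b.groups]
        for a, b in product(a_side, b_side):
            if a != b:  # skip pairing a block with itself
                results.append((t.name, a.name, b.name))
    return results


# Toy inventory and rule set (illustrative only).
blocks = [
    BuildingBlock("acid_1", frozenset({"carboxylic_acid"})),
    BuildingBlock("amine_1", frozenset({"amine"})),
    BuildingBlock("amine_2", frozenset({"amine"})),
    BuildingBlock("boronic_1", frozenset({"boronic_acid"})),
    BuildingBlock("halide_1", frozenset({"aryl_halide"})),
]
transforms = [
    Transform("amide_coupling", "carboxylic_acid", "amine"),
    Transform("suzuki_coupling", "boronic_acid", "aryl_halide"),
]

virtual_products = enumerate_products(blocks, transforms)
```

Even in this toy form, the number of products grows with the product of the matching reactant counts per transform, which is consistent with a few hundred thousand building blocks yielding hundreds of millions of one-step products.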

Using 11 “productive” transforms in one-step reactions, a sample subset of about 610,000 compounds was generated in summer 2015, and made available for download. It is annotated with (but not yet filtered by) 54 compound, reaction, and typical drug design properties. As of August 2016, 238 million products have been generated; it is estimated that there might be 280 million when the runs are completed. Overlap with PubChem is minimal: more than 99% of the compounds appear to be novel.

Eleven new transforms are being added, and in future, products will be steered toward interesting novel rings and scaffolds. The product files will be offered for download. Multi-step reactions will be investigated in future, and a Web GUI with extensive search capabilities will be developed. Topics of the ongoing work are how the predicted synthetic routes will work in actual syntheses, what filter rate will be needed for truly “interesting” compounds, and how the editing and adding of transforms can be made as easy as possible.