Retrosynthesis, Synthesis Planning, Reaction Prediction: When Will Computers Meet the Needs of the Synthetic Chemist?

On a hot and humid, yet sunny, Monday in Boston we were treated to a tour de force of the current thoughts on how the machine may aid (or even eclipse) the synthetic chemist. This full-day symposium covered the full spectrum of past, current and hopefully future masters of the art. A wide variety of papers was given, each taking a slightly different view on the topic.

Juergen Swienty-Busch (Elsevier) spoke about planning the next steps of a synthesis. He described the problems of searching for substances and for reactions, and the issues surrounding atom mapping, before outlining Reaxys’ approach to reaction similarity and reaction classification. He went on to describe how the Reaxys taxonomy was developed and how it is applied in Ask Reaxys. Finally, he pulled all of this together to describe how Reaxys uses these underpinning technologies to solve the synthetic chemist’s problems.

Peter Johnson (University of Leeds) described some new advances from Leeds, in conjunction with the Chem21 IMI project, aimed at developing new methods for addressing key bottlenecks in synthetic processes. He is particularly involved with work package 5: assessing the greenness of chemistry. He discussed the work that has gone into the creation and development of the Chem21 reaction database. Synthetic chemists are working on data entry, appropriate atom mapping, and, crucially, entry of all the important synthetic data; the system then provides a green score for the reaction. Peter ended by describing work on a companion database of biocatalysed reactions.
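The talk summary does not spell out the Chem21 scoring formula. As a rough illustration of one ingredient that commonly feeds into such green metrics, the sketch below computes atom economy (the fraction of reactant mass incorporated into the desired product) using RDKit; the function and the example reaction are illustrative, not the Chem21 scheme.

# Illustrative only: atom economy, a simple green-chemistry metric.
# This is NOT the Chem21 scoring scheme, whose formula is not given above.
from rdkit import Chem
from rdkit.Chem import Descriptors

def atom_economy(reactant_smiles, product_smiles):
    """Percentage of total reactant mass that ends up in the product."""
    reactant_mass = sum(Descriptors.MolWt(Chem.MolFromSmiles(s))
                        for s in reactant_smiles)
    product_mass = Descriptors.MolWt(Chem.MolFromSmiles(product_smiles))
    return 100.0 * product_mass / reactant_mass

# Fischer esterification: acetic acid + ethanol -> ethyl acetate (+ water)
print(atom_economy(["CC(=O)O", "CCO"], "CC(=O)OCC"))  # ~83%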

Marc Nicklaus (National Institutes of Health) began by asking the question that defines this work: “What can I make reliably and cheaply?” Having answered that question, you can then search among those compounds for the ones that should be good for testing. To answer the first question, Marc suggests we need good rules and predictions, plus available and inexpensive materials. His team needed to create forward synthetic routes: they have examined some 1,500 transforms from LHASA and assessed their robustness (yield, reliability, thermodynamics, etc.), resulting so far in 13 validated transforms. They have also gathered building-block information from Sigma-Aldrich, looking for materials with high availability, low cost, and so on. The initial work has resulted in a significant number of new compounds, with a low overlap with PubChem. Marc expects that scaling this work up to all 1,500 LHASA transforms and all of Sigma-Aldrich’s 3 million building blocks will result in an enormous number of new, potentially interesting compounds.
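The talk did not detail the transform machinery itself. As a minimal sketch of the forward-enumeration idea, assuming the common SMARTS-based approach available in RDKit, a single generic transform (an amide coupling, chosen purely for illustration and not one of Marc’s 13) can be applied to small stand-in lists of building blocks:

# Sketch of forward enumeration: apply one generic transform, written as
# reaction SMARTS, to pairs of building blocks. The transform and the
# building-block lists are illustrative stand-ins, not the LHASA set.
from rdkit import Chem
from rdkit.Chem import AllChem

amide_coupling = AllChem.ReactionFromSmarts(
    "[C:1](=[O:2])[OX2H1].[NX3;H2:3]>>[C:1](=[O:2])[N:3]")

acids = [Chem.MolFromSmiles(s) for s in ("CC(=O)O", "c1ccccc1C(=O)O")]
amines = [Chem.MolFromSmiles(s) for s in ("CCN", "c1ccccc1N")]

products = set()
for acid in acids:
    for amine in amines:
        for prods in amide_coupling.RunReactants((acid, amine)):
            mol = prods[0]
            Chem.SanitizeMol(mol)
            products.add(Chem.MolToSmiles(mol))

print(len(products), sorted(products))  # 4 virtual amides

Scaled to 13 (and eventually 1,500) transforms over millions of building blocks, the same loop is what generates the very large virtual libraries described above.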

Roger Sayle (NextMove Software) has worked with a number of large pharmaceutical companies, and gave us some insights into the gold mine of information contained in these companies’ electronic lab notebooks (ELNs). A significant number of reactions fail, and, by and large, these reactions never make it into the published (journal or patent) literature. Roger gave the example of a 50 by 50 library made by GSK (GlaxoSmithKline): 566 of the 2,500 possible compounds were not synthesized, and assay results were reported for only 1,706, so, despite the use of reliable, predictable chemistry, a number of compounds were not made and did not deliver results. Roger went on to describe an example of computational forensic chemistry whereby his team had examined over 11 million patents and extracted approximately 2 million reactions. He noted that the sixth most common reaction was the transformation of NO2 to NH2, but that the introduction of NO2 itself was very rare; he concluded that NO2 is delivered into most compounds via a building block. In 1990, Suzuki coupling appeared in almost no patents, but by 2013 it accounted for 7% of reactions.

Jonathan Goodman (University of Cambridge) delivered his usual amusing and animated insight into synthesis questions. While aiming for a world in which computers and machines could happily replace the bench chemist, he concluded that such a world is a long way off. He stressed that (fortunately) there are lots of complex pieces and parts that make up a chemist. He described the various data that the chemist might take from systems to help in the decision process for synthesis design (literature analysis, spectral analysis, model design, property prediction, etc.). His group recently published an in silico-inspired synthesis of Dolabriferol (2012). He concluded by discussing the future needs of the synthetic chemist, and how we might go about designing synthesis systems and machines. He noted that there are many obstacles: the current state of the art is uncertain; each step is uncertain (purification, selectivity, etc.); even reliable reactions can fail; and, ultimately, there is the question of whether a reaction is intended for discovery, process, or “just” publication. Jonathan felt that a successful retrosynthesis tool would need many things, including data on performance; use in industry and publishing; experimental verification of new examples; and discussion of the generality of reactions.

Brian Masek (Certara Inc.) introduced the scale of the problem: a simple analysis of the compound space for a set of 80 generic reactions, a database of 1,000 reactants per reaction class, and schemes of 5 steps generates a space of at least 3 × 10²⁷ compounds, so the problem becomes one of deciding what is interesting and what can be made. Brian then described the process Certara has developed to perform de novo design, followed by retrosynthetic analysis. The system is “biased” towards pharmaceutical chemistry, and hence typically involves short(er) routes; Certara has focused on high-probability, generic reactions. Brian examined a set of predicted analogues of Abilify and their predicted syntheses. The best predicted routes and several published routes were given in a blind test to practicing synthetic chemists: the published routes scored better (8.2 versus 7.5 out of 10), but the results give rise to hope.
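One plausible reading of the arithmetic behind the 3 × 10²⁷ figure, offered as an assumption since the decomposition was not spelled out in the talk: pick one of 1,000 starting materials, then at each of 5 steps choose one of the 80 generic reactions and one of its 1,000 reactants.

# Assumed decomposition of the 3 x 10^27 estimate (not stated in the talk):
# one starting material, then (reaction x reactant) choices at each step.
starting_materials = 1_000
choices_per_step = 80 * 1_000   # 80 reactions x 1,000 reactants each
steps = 5

space = starting_materials * choices_per_step ** steps
print(f"{space:.1e}")  # 3.3e+27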

John Figueras (retired) described the SynTree application he has developed. It runs on a simple MacBook and can be downloaded from his website (http://www.jfigueras.com/COMPUTER%20Progs/Chemistry.html). The program is intuitive and seemingly gives the chemist complete control, by enabling selection of precursors at each branch of the tree. The system comprises various modules which together enable the synthetic analysis; there are two main ones. The first handles transforms: it currently contains 240 transforms, and a variety of different atom types to enable appropriate mapping and definition. The underlying tools are basic operations that add or remove atoms, change the order of bonds, and alter the hydrogen count at an atom. The second module is IPLists, a list of interference groups, that is, those groups that should not be allowed in a particular reaction. The program was demonstrated to show its ease of use.
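SynTree’s internals are not documented in the talk summary; the following is a hypothetical sketch, not John’s actual code, of how a transform built from those basic operations, guarded by an interference list, might be represented.

# Hypothetical illustration (not SynTree's implementation) of a transform
# expressed as the basic operations described above, plus an interference
# list of groups that must not be present for the transform to apply.
from dataclasses import dataclass, field

@dataclass
class Transform:
    name: str
    add_atoms: list = field(default_factory=list)            # e.g. [("O", "sp3")]
    remove_atoms: list = field(default_factory=list)          # atom indices to delete
    bond_order_changes: dict = field(default_factory=dict)    # (i, j) -> new order
    hydrogen_changes: dict = field(default_factory=dict)      # atom index -> delta H
    interference_groups: list = field(default_factory=list)   # disallowed groups

    def is_applicable(self, present_groups):
        """Reject the transform if any interfering group is present."""
        return not any(g in present_groups for g in self.interference_groups)

reduce_ketone = Transform(
    name="ketone -> secondary alcohol",
    bond_order_changes={(0, 1): 1},   # C=O becomes C-O
    hydrogen_changes={0: +1, 1: +1},  # one extra H each on C and O
    interference_groups=["aldehyde", "nitro"],
)
print(reduce_ketone.is_applicable({"ester"}))  # True: no interfering group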

Orr Ravitz (John Wiley & Sons) gave a good overview of the field. He said that systems should be productive, efficient, and creative, and help identify opportunities. He outlined a number of the key terms and nomenclature around retrosynthesis. He reviewed a number of initiatives that came before Automated Reasoning in Chemistry (ARChem), and then proceeded to describe the ARChem system in detail, including the generation and curation of rules, and the use of differing reaction databases to provide background and credibility to each prediction. He ended by announcing the launch of a new service from John Wiley & Sons which is underpinned by the ARChem software.

Valentina Eigner-Pitto (InfoChem GmbH) gave an overview of the development of various tools from InfoChem. She thanked the previous speaker, who had already covered a number of the key definitions. She described some of the work with which InfoChem has been involved, especially with the process chemistry department at AstraZeneca. The main tools she described were ICSYNTH, ICFRP, and the SPRESI database: ICSYNTH is a retrosynthetic planning tool, while ICFRP is a forward reaction planner. The advantage of InfoChem’s approach is the automatic generation of transform libraries from any reaction database, using ICMAP and CLASSIFY. The SPRESI data have been a long-standing advantage, providing the large quantities of data with which InfoChem’s algorithms have been developed and optimized.

Bartosz Grzybowski (Ulsan National Institute of Science and Technology) discussed the evolution of his Chematica system. He drew an analogy with expert chess systems: chess has six different pieces and, on average, 10 rules defining how they move, whereas the rules of organic synthesis are many, many times more complex; Chematica currently has in excess of 20,000 rules defined. Using logic similar to that of chess computers, the system does not commit to a single path, but checks back to earlier, “less satisfactory” paths, since these may lead to a superior position that would be missed by always following the best path at each stage. Bartosz followed this up with a video demonstration of the Chematica software.
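A minimal sketch of that search behaviour, with placeholder scoring and expansion functions rather than Chematica’s own: a global priority queue lets the search hop back to an earlier, locally less satisfactory branch whenever it becomes the most promising one.

# Sketch of best-first search over a synthesis tree. Instead of greedily
# extending one branch, every open position sits in a global priority
# queue, so the search can revisit an earlier "less satisfactory" branch.
# The score/expand/is_solved functions are placeholders, not Chematica's.
import heapq

def best_first_search(root, expand, score, is_solved, max_nodes=100_000):
    # heapq is a min-heap, so push negated scores to pop the best first.
    frontier = [(-score(root), 0, root)]
    counter = 0  # tie-breaker so equal-scored positions are never compared
    while frontier and counter < max_nodes:
        _, _, node = heapq.heappop(frontier)
        if is_solved(node):
            return node
        for child in expand(node):
            counter += 1
            heapq.heappush(frontier, (-score(child), counter, child))
    return None  # search budget exhausted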

Alexandre Varnek (University of Strasbourg) started by explaining why chemical reactions are difficult objects (many species; two types of species, namely reactants and products; multiple steps; dependence upon reaction conditions, etc.). He described his team’s work on the condensed graph of a reaction (CGR). CGRs may be used in a variety of ways, including reaction searching, reaction data curation, reaction classification, analysis and visualization of reaction data, predictive models for reaction conditions, and models for kinetic and thermodynamic properties; Alexandre showed examples of each. Atom-atom mapping and the quality of the underlying data were flagged as bottlenecks in the CGR generation process. Alexandre then described approaches to structure-reactivity modelling, with many of the data coming from PhD and habilitation theses. He finished by showing the web page (http://infochim.u-strasbg.fr/webserv/VSEngine.html) where a number of his tools may be accessed.
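As a bare-bones illustration of the CGR idea, not Alexandre’s code: superimpose the atom-mapped reactant and product graphs into a single graph whose “dynamic” bonds record both the before and after bond orders.

# Toy CGR construction. Bonds are dicts keyed by atom-map-number pairs,
# with values giving bond order; 0 means the bond is absent on that side.
def condense(reactant_bonds, product_bonds):
    cgr = {}
    for pair in set(reactant_bonds) | set(product_bonds):
        before = reactant_bonds.get(pair, 0)
        after = product_bonds.get(pair, 0)
        cgr[pair] = (before, after)   # (1, 0) = broken, (0, 1) = formed
    return cgr

# Esterification core (water formation omitted for brevity):
# the acid's C1-O2 bond and the alcohol's O3-H4 bond break; C1-O3 forms.
reactants = {(1, 2): 1, (3, 4): 1}
products = {(1, 3): 1}
for bond, (before, after) in sorted(condense(reactants, products).items()):
    status = "unchanged" if before == after else f"{before}>{after}"
    print(bond, status)

The point of the construction is that a reaction becomes a single graph object, so ordinary graph-based tools (searching, classification, QSAR-style modelling) can be applied to reactions directly, which is what enables the list of applications above.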

Timur Madzhidov (Kazan Federal University) has been studying protecting-group chemistry. He referred to Greene’s Protective Groups in Organic Synthesis, noting that its reactivity charts result from manual analysis of relatively small datasets. He proceeded to describe how he and his colleagues have analyzed approximately 142,000 reactions from Reaxys using the CGR approach; their initial analysis showed some disagreements with Greene’s standard text. Timur finished by describing a prototype expert system to provide synthetic chemists with detailed recommendations of the experimental conditions needed to achieve a desired transformation.

Lee-Ping Wang (University of California, Davis) described his ab initio nano-reactor. He outlined the difficulty of sampling events that occur only rarely on the time scales accessible to the calculations. To force such events, his algorithm periodically squeezes the system, driving the temperature up and increasing the likelihood of reactions occurring. The resulting pathway information is then corrected with periodic minimizations, and he finished by describing the application of the nudged elastic band method for refining reaction pathways.
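As a toy sketch of the textbook nudged elastic band method on a model two-dimensional surface, not the nanoreactor implementation: a chain of images between two minima is relaxed under the perpendicular component of the true force plus the parallel component of a spring force.

# Toy NEB on a 2D double-well surface (generic textbook method, not the
# nanoreactor code). Minima sit at (-1, 0) and (1, 0); the barrier is
# at the origin with height 1.
import numpy as np

def energy(p):
    x, y = p
    return (x**2 - 1)**2 + 2.0 * y**2

def gradient(p):
    x, y = p
    return np.array([4.0 * x * (x**2 - 1), 4.0 * y])

def neb(start, end, n_images=9, k=5.0, step=0.01, n_iter=2000):
    path = np.linspace(start, end, n_images)   # straight-line initial guess
    for _ in range(n_iter):
        new_path = path.copy()
        for i in range(1, n_images - 1):
            tangent = path[i + 1] - path[i - 1]
            tangent /= np.linalg.norm(tangent)
            g = gradient(path[i])
            # True force: keep only the component perpendicular to the path.
            f_perp = -(g - g.dot(tangent) * tangent)
            # Spring force: keep only the component along the path.
            f_spring = k * (np.linalg.norm(path[i + 1] - path[i])
                            - np.linalg.norm(path[i] - path[i - 1])) * tangent
            new_path[i] = path[i] + step * (f_perp + f_spring)
        path = new_path
    return path

path = neb(np.array([-1.0, 0.0]), np.array([1.0, 0.0]))
print("barrier estimate:", max(energy(p) for p in path))  # ~1.0 at (0, 0)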

Acknowledgment

Many thanks to my co-chair Wendy Warr (Wendy Warr & Associates) for her great assistance in preparing this fine symposium, encouraging submissions, and working with me through the pains of the new MAPS abstract submission system.

David Evans, Symposium Organizer

Conflict of Interest: David Evans is an employee of Reed Elsevier Properties SA, a member of the RELX Group. All comments herein are David’s own and do not necessarily reflect the views of the RELX Group.