What do synthetic chemists want from their reaction systems?

CINF symposium at the Fall 2017 ACS Meeting in Washington, DC
Wendy Warr, symposium organizer

David Evans and I organized a CINF symposium at the fall 2017 ACS national meeting. We had sought talks on progress in reaction searching, reaction planning, synthesis design, retrosynthesis, and reaction prediction. We would really have liked contributions from practicing synthetic chemists on their current needs, both met and unmet, and their frustrations with current systems, but no end users volunteered. Nevertheless, it was an interesting symposium and I have received positive feedback.

Academic research

Connor Coley of MIT was the first speaker. A critical challenge for computer-assisted synthesis design is that the reaction steps proposed may fail when attempted in the laboratory. The true measure of success for any synthesis program is whether the predicted outcome matches what is observed experimentally. Connor and his co-workers have trained a neural network model on experimental data from the USPTO and Reaxys to provide qualitative predictions of organic reaction outcomes in silico. In this method reaction databases are supplemented with chemically plausible negative reaction examples to overcome the literature bias towards successful reactions. Traditional reaction templates are used to generate a list of candidate outcomes for the machine learning model to score, so reactivity rules are implicitly learned rather than encoded. A new, edit-based reaction representation has been developed to focus on the fundamental transformation at the reaction site. In a 5-fold cross-validation, the trained model assigns the major product rank 1 in 71.8% of cases, rank ≤ 3 in 86.7% of cases, and rank ≤ 5 in 90.8% of cases.1 Connor presented some correct and incorrect predictions. Mispredictions are often chemically reasonable or attributable to data quality issues. Extension of the method to condition-dependent predictions achieves similar performance, but conditions are rarely necessary to make the prediction. Multi-step pathway planning remains challenging.
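The rank-based figures quoted above are top-k accuracies: the fraction of test reactions for which the recorded major product appears among the k highest-scoring candidates. A minimal sketch of that metric follows; the function name and the ten example ranks are invented for illustration and are not data from the talk.

```python
from typing import Sequence


def top_k_accuracy(ranks: Sequence[int], k: int) -> float:
    """Fraction of test reactions whose recorded major product is ranked <= k."""
    if not ranks:
        raise ValueError("ranks must not be empty")
    return sum(1 for r in ranks if r <= k) / len(ranks)


# Hypothetical ranks (1 = best) given by a model to the recorded major
# product of ten test reactions. Illustrative values only.
example_ranks = [1, 1, 2, 1, 4, 1, 7, 1, 3, 1]

for k in (1, 3, 5):
    print(f"rank <= {k}: {top_k_accuracy(example_ranks, k):.1%}")
```

In a 5-fold cross-validation, a figure of this kind would be computed on each held-out fold and then combined.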

Mark Waller of the University of Muenster and Shanghai University has also used neural networks, in this case deep neural networks, for both retrosynthesis and reaction prediction.2,3 The machine is trained with essentially the complete published knowledge of organic chemistry (more than 3.5 million reactions acquired from the Reaxys database). Circular fingerprints are used to represent the structures. Training can be carried out overnight with GPUs, and retraining can be carried out weekly. On a test set of around one million reactions, the approach has an accuracy higher than 95% when allowed to suggest up to 10 different routes for a target molecule. Deep learning is 150 times faster than a rule-based approach, so handling multistep syntheses becomes feasible. Furthermore, preliminary studies indicate that coupling the neural networks with Monte Carlo tree search techniques outperforms traditional computational synthesis planning with hand-coded transformations.4,5

The international chemical identifier for reactions

The next two talks concerned the International Chemical Identifier for Reactions (RInChI). Gerd Blanke of StructurePendium Technologies explained that RInChI is a single string providing a unique representation of a reaction, independent of how the reaction has been drawn. The Long-RInChIKey is calculated from the IUPAC International Chemical Identifiers (InChIs) of each reactant, product and reagent. The Short-RInChIKey is a fixed-length hash over all reactants, products and agents. The Web-RInChIKey is a fixed-length hash derived from the reaction components that ignores the specific role of each component within the reaction.

Long-RInChIKeys are valuable for the storage of reactions. They allow uniqueness checks, and the identification of each reaction component by simple text searches based on Standard InChIKeys, but they do not have a fixed length. Short-RInChIKey has a fixed length of 55 letters, plus 8 hyphens as separators. The fixed length of Short-RInChIKey makes it suitable for exact searches of reactions in databases (and on the Web), indexing reactions in databases, and linking identical reactions in different databases. Web-RInChIKey allows for the fact that the depiction of a chemical reaction is not uniquely defined. For Web-RInChIKey, all InChIs of the reaction components are ordered alphabetically. Roles of the components are ignored. The Web-RInChIKey has a fixed length of 47 characters, with 17 letters in the major layer, and 15 letters in the minor layer. It is used for searches over reaction databases with an unknown drawing model, and comparison of reaction databases with different data models. The longer string sets for the major and minor layers make searches over the Web more precise. The first RInChI release was in March 2017. The InChI and RInChI formats and algorithms are non-proprietary, and the software is open source. RInChIs for 4.5 million reactions from the SPRESI database have been generated by InfoChem: only 239 reactions could not be converted.
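The idea behind a role-independent key (order the component InChIs alphabetically, ignore their roles, and hash to a fixed length) can be sketched in a few lines. This is a toy illustration only: it is not the RInChI algorithm, it does not produce real Web-RInChIKeys, and the function name, the use of SHA-256, and the 32-character length are all assumptions made for the example. The four InChIs are those of acetic acid, ethanol, ethyl acetate, and water.

```python
import hashlib


def role_independent_key(inchis, length=32):
    """Toy fixed-length key: sort component InChIs, ignore roles, then hash.

    NOT the RInChI algorithm; SHA-256 and the key length are assumptions.
    """
    ordered = sorted(inchis)  # alphabetical order removes drawing-order dependence
    digest = hashlib.sha256("\n".join(ordered).encode("utf-8")).hexdigest()
    return digest[:length].upper()


ACETIC_ACID = "InChI=1S/C2H4O2/c1-2(3)4/h1H3,(H,3,4)"
ETHANOL = "InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3"
ETHYL_ACETATE = "InChI=1S/C4H8O2/c1-3-6-4(2)5/h3H2,1-2H3"
WATER = "InChI=1S/H2O/h1H2"

# The same esterification, with its components listed in two different orders
# (for example, because two databases assign roles or draw it differently).
drawn_one_way = [ACETIC_ACID, ETHANOL, ETHYL_ACETATE, WATER]
drawn_another_way = [WATER, ETHYL_ACETATE, ETHANOL, ACETIC_ACID]

key_a = role_independent_key(drawn_one_way)
key_b = role_independent_key(drawn_another_way)
print(key_a == key_b)  # both orderings give the same key
print(len(key_a))      # the key length does not depend on the reaction
```

Because roles are discarded, a key of this kind cannot distinguish a reaction from its reverse, which is the trade-off accepted for searching databases with an unknown drawing model.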

Jonathan Goodman of the University of Cambridge started his talk with an example of an in silico inspired6 total synthesis of (-)-Dolabriferol.7 Synthetic chemists want data that are accessible, comprehensive, and reliable. InChIs are successful because people use them. Can RInChI be useful too? A good synthesis uses cheap, sustainable, and reproducible starting materials; has low hazards; produces little waste; uses familiar reactions and chemists’ expertise; has no inseparable by-products; gives high yields and high stereoselectivity; uses convenient processes; makes a product quickly, cheaply, and reproducibly; and is suitable for making analogues.

Jonathan believes that to achieve a good synthesis, we need to understand our reactions, to make best use of our analytical data, to search the literature effectively, and to store our results, so we, and others, can make best use of this knowledge for the next project and the next molecule. The contributions of Jonathan’s team to experimental chemistry, computational chemistry, and chemical informatics have helped advance all of these areas. Jonathan presented some examples of work that his team has done on the automatic generation of diastereomers using InChI strings,8 prediction of stereochemistry,9 the conformational properties of a polypeptide,10 and the risk assessment of chemicals.11 It is desirable to bring these disparate fields together, so that a single reaction system can enable users to benefit from them all. Using RInChI, we can connect diverse data to individual reactions. Jonathan concluded with an amusing vision of the future synthesis machine.

Search and faceting of large reaction databases

The next talk was by John Mayfield of NextMove Software. Synthetic chemists want data, diagrams, classification and search for their reaction systems. Workers at NextMove have previously described the extraction of reactions from patents. LeadMine and Chemical Tagger convert unstructured text to a structured reaction table. NextMove has also assembled over six million extracted reaction details consisting of the connection tables, procedure, quantities, solvents, catalysts and yields into a searchable ELN for multiple pharmaceutical companies. Good reaction diagrams are essential in communicating synthetic chemistry: NextMove has also done work in this field. In the area of classification, NameRXN software allows the recognition and categorization of reactions from their connection tables. Using a large rule-base of known reaction mechanisms and transformations, NameRXN assigns each reaction a NameRXN code.12 Reactions are classified and assigned to leaves in the RXNO ontology. The ontologies are used to provide organization, faceting, and filtering of results. Pistachio is a reaction dataset interface providing loading, querying, and analytics of chemical reactions. NextMove’s Arthor technology is reportedly up to 100 times faster than other “fast search” systems.

The history of chemical reactivity

Guillermo Restrepo of the University of Leipzig showed that a computational approach to the history of chemical reactions sheds light on the patterns behind the development and use of substances and reaction conditions over two centuries. He and his co-workers have explored more than 45 million reactions in Reaxys and revealed historical patterns for substances, types of substances, catalysts, solvents, temperatures, and pressures of those reactions. Reaxys was treated as a graph database. Despite the exponential growth of substances and reactions, little variation of catalysts, solvents, and reactants is observed over time. The vast majority of reactions fall into a narrow domain of temperature and pressure. World wars caused a drop in chemical novelty for substances and reactions. The First World War took production back around 30 years and the Second around 15. After the Second World War, the use of organic solvents skyrocketed. Guillermo anticipates that this study, and especially its methodological approach, will be the starting point for the history of chemical reactivity, where social and economic contexts are integrated.

SciFinderⁿ and ChemPlanner

The next two papers concerned work that CAS is doing to enhance Wiley’s ChemPlanner13,14 with additional reaction content and associated references, including reactions from patents. A new version of ChemPlanner, including stereoselective retrosynthetic prediction and customizable relevance ranking, will be delivered exclusively in SciFinderⁿ. Orr Ravitz spoke first, largely concentrating on ChemPlanner itself. Chemists use ChemPlanner to boost creativity, overcome biases, and cover more options. Retrosynthesis tools have previously met with skepticism, fear of information overload, and concerns about the coverage and currency of the reaction database, and about accuracy and selectivity. Orr discussed automatic rule generation. Deriving selectivity from data requires statistical power, which is not always sufficient with a database such as CIRX. Literature examples, sorted by similarity to predictions, provide insight into experimental conditions, and enhance user confidence. Greater coverage is expected by using Chemical Abstracts data instead of CIRX. A nearly exhaustive reaction source will have many variations on the same reaction, or the same reaction with very similar reactants and products. Growth of the rule set will be significantly sublinear. Adding examples to existing rules will address functional group tolerance, give more statistical power for regioselectivity calculations, more automation for stereoselective rules, and improved yield prediction. There will be some consolidation of rules.

Jonathan Taylor of CAS started his talk with an introduction to SciFinderⁿ. Everything about SciFinderⁿ is new: the interface, the application, the search architecture and the data model. User feedback and usability testing were critical in the design. Layout and first surface information were users’ main priorities. The final design balances surfaced information, aesthetics, browsability, and filter options. In the past, synthetic chemists wanted reaction-finding tools; today they have synthetic planning tools; and in future they will have help with predictive synthetic routes: SciFinderⁿ will deliver new predictive synthesis planning capabilities by integration of an enhanced ChemPlanner. Having ten times the reaction content will provide ChemPlanner with more synthetic options to build pathways and improve prediction quality. Jonathan concluded with some screen mockups of user input, of how SciFinderⁿ will propose potential synthetic routes, and of how users will know how the prediction was constructed.

Reaction classification

Next, Valentina Eigner-Pitto of InfoChem spoke about the renaissance of reaction classification and visualization. InfoChem’s ICMAP reaction mapping software identifies reaction centers. The CLASSIFY15 software automatically categorizes a reaction according to the type of chemical transformation, and it can be used for organization of large reaction databases and hit lists. It provides unique identifiers (ClassCodes) that can be used in reaction database analysis. This allows companies to study the kind of chemistry performed in-house, to examine the evolution of chemistry over time, and to compare in-house content with other repositories. Classification can also be used in network graphs, which can serve as visualization tools for reaction content. Workers at Merck KGaA, in collaboration with BioSolveIT and InfoChem, have demonstrated a workflow which exploits the chemist’s electronic laboratory notebook (ELN) in order to obtain and refine transforms for existing and novel chemical transformations,16 which in turn are used to enrich existing virtual libraries. The novelty of the added chemical space is assessed through a multitude of descriptors with a particular focus on three-dimensionality, scaffold diversity, and fingerprint enrichment. Additionally, each added transform is evaluated for its propensity to reconstitute known drugs and chemical probes. Computer-aided synthesis design programs include ChemPlanner and InfoChem’s17 ICSYNTH. Prediction of chemical space (forward reaction prediction) is also illustrated in the Merck poster.16

Use of Reaxys and ReaxysTree

Two papers followed from experts at Elsevier. Juergen Swienty Busch discussed ReaxysTree and the taxonomies used in Reaxys. He began with an exposition of new Reaxys, before turning to the taxonomies. Reaxys has information on documents, substances, reactions, and substance properties, and on bioactivities and targets in Reaxys Medicinal Chemistry Index. For documents, terms from ReaxysTree, Embase, Compendex, and Geobase make search and analysis possible on ReaxysTree. For substances, analysis is possible on substance classes and available properties. For reactions, search and analysis are possible on reaction classes, catalyst classes, and solvent classes. For targets, search and analysis are possible on gene and protein taxonomy, organisms, cell lines, and route of administration. Substances have been curated by Richter classes, rings, and functional groups. Solvents, reagents, and catalysts have been curated for reactions. ReaxysTree allows concepts and synonyms to be used for search, filtering, analysis, and indexing. ReaxysTree concepts for reactions include name reactions, and classes and types such as cyclization, condensation, and addition. Juergen next outlined how reaction mapping is carried out, a transition state is assigned, and the transform is coded. In searching reactions with ReaxysTree, taxonomy terms are connected with actual Reaxys queries using transforms, and other appropriate search terms such as product substructures.

Matt Clark of Elsevier thinks that medicinal chemists themselves only want to find transformation details for chosen steps in synthesis, while management wants to lower the cost of making compounds, and wants reliable reaction schemes that can be sent to a contract research organization (CRO) for fast turnaround. Reaxys is a treasury of reported chemistry, with a built-in synthesis planning tool and display of experimental procedures. The API allows you to use similarity for compounds and reactions, access some data elements not visible in the user interface, and create your own analytics and reaction networks. Pipeline Pilot and KNIME offer an easy way to use the API, and offer interoperability with other software products.

Matt discussed a reaction graph analysis application to address questions around a specific potential CDK8 inhibitor. What chemistries are known about compounds like this? What conditions and solvents were used by different chemists? Where is this chemistry reported? Ultimately, what are the most efficient and flexible methods to make compounds like this? The application searches for reactions that have the target compound as product, runs a similarity search for very similar compounds, then searches for reactions that have each reactant as product, and repeats this to the desired graph depth. An interesting finding was that for very similar compounds, different chemistries and starting materials have been used. One tree showed a set of compounds that used a common set of starting materials. Using Cytoscape you can drill down to references for each edge. You can compare intermediates for similar compounds made by different groups and, by accessing Scopus, examine a network of institutions publishing a specific chemistry.
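The iterative search described here is essentially a breadth-first expansion from the target, stopped at a chosen depth. The sketch below shows that traversal under stated assumptions: the dictionary stands in for database queries, the compound names are invented, the similarity-search step is left out, and no Reaxys API call is shown or implied.

```python
from collections import deque

# Hypothetical lookup standing in for "search for reactions with this product":
# product -> list of reactions, each given as a tuple of reactant names.
REACTIONS_BY_PRODUCT = {
    "target": [("amine_A", "aryl_halide_B"), ("acid_C", "amine_D")],
    "amine_A": [("nitro_E",)],
    "aryl_halide_B": [("arene_F",)],
    "nitro_E": [("arene_G",)],
}


def build_reaction_graph(target, lookup, max_depth):
    """Return (reactant, product) edges found within max_depth steps of target."""
    edges = []
    seen = {target}
    queue = deque([(target, 0)])
    while queue:
        product, depth = queue.popleft()
        if depth == max_depth:
            continue  # do not expand beyond the requested graph depth
        for reactants in lookup.get(product, []):
            for reactant in reactants:
                edges.append((reactant, product))
                if reactant not in seen:  # expand each compound only once
                    seen.add(reactant)
                    queue.append((reactant, depth + 1))
    return edges


for depth in (1, 2, 3):
    graph = build_reaction_graph("target", REACTIONS_BY_PRODUCT, depth)
    print(depth, len(graph))
```

An edge list of this shape can be written out as a two-column table and loaded into a network viewer such as Cytoscape, with literature references attached to each edge as attributes.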

Using Reaxys you can also analyze reaction conditions, grouping known transformations at different levels of detail to get the best conditions. Grouping uses reaction similarity, based on Reaxys transformation codes. Searching for “Buchwald-Hartwig Aminations” by keyword produced 4,179 results. These were grouped by transformation codes, from general to specific: level 0 had one group with 4,179 members, level 1 had 99 groups, level 2 had 160 groups, and so on. A summary of solvent and conditions for level 0 showed that toluene is a popular solvent, a temperature of around 110°C is common, reaction time is not very long, and inert atmosphere and microwave use were mentioned. These conditions can be selected based on membership in one of the other groupings.
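Grouping from general to specific can be pictured as grouping on a progressively longer prefix of a hierarchical code: level 0 uses an empty prefix, so every hit falls into one group, and each further level splits the groups more finely. The sketch below assumes a made-up tuple encoding and four invented records; it does not reproduce Reaxys transformation codes or the result counts quoted above.

```python
from collections import Counter, defaultdict

# Invented hit list. "code" is a hypothetical hierarchical code whose
# components run from general to specific.
records = [
    {"code": ("ArBr", "primary amine"), "solvent": "toluene", "temp_c": 110},
    {"code": ("ArBr", "secondary amine"), "solvent": "toluene", "temp_c": 100},
    {"code": ("ArCl", "primary amine"), "solvent": "1,4-dioxane", "temp_c": 110},
    {"code": ("ArBr", "primary amine"), "solvent": "toluene", "temp_c": 110},
]


def group_by_level(hits, level):
    """Group hits on the first `level` components of their code."""
    groups = defaultdict(list)
    for hit in hits:
        groups[hit["code"][:level]].append(hit)
    return dict(groups)


def most_common_solvent(hits):
    """Return the solvent reported most often in a group of hits."""
    return Counter(hit["solvent"] for hit in hits).most_common(1)[0][0]


for level in (0, 1, 2):
    groups = group_by_level(records, level)
    sizes = sorted(len(members) for members in groups.values())
    print(f"level {level}: {len(groups)} group(s), sizes {sizes}")

print(most_common_solvent(records))
```

Summarizing conditions within one of the finer groups, rather than across the whole hit list, is what allows conditions to be chosen for a specific kind of substrate.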

An expert searcher’s viewpoint

The final speaker was Judith Currano of the University of Pennsylvania. Introducing variable substituents during a reaction search is challenging. A researcher may not have a definite substituent in mind, instead suggesting that a site can be occupied by “any aryl group” or, still worse, “any electron withdrawing group”. Even a researcher who generates an R-group and populates it with specific substituents can run into problems because atom mapping from reactant to product is prohibited within R-group fragments. Judith used the term “specific ambiguity” to talk about a type of attachment without specifying exactly what it is. This includes general classes of attachments, user-defined groups of attachments (variables or R-groups), and stereocenters where you do not care about the identity of all of the attachments. She presented case studies based on troublesome requests from synthetic chemists.

The first examples concerned functional group transformations (plus mapping from reactant to product), and sensitive functional groups. Searchers should understand that sometimes a review source is worth a thousand searches. (Science of Synthesis was good for one example.) Searchers should also use caution when employing mapping. Database vendors should perhaps give users the ability to make mapping less atom-specific and more atom-type-specific. Structure search algorithms should have a way of manually grouping fragments that appear in the same reactant or product, allowing the searcher to specify multiple fragments in one substance while allowing additional substances on that side of the equation. (Old Beilstein Crossfire worked well in one of Judith’s examples.)

The second set of examples involved specific ambiguity of stereocenters or variables. Judith recommends searchers make use of system-defined generics whenever possible. In the case of user-defined generics, it may be necessary to run multiple searches if your generic does not exist. Vendors should note that all structure search algorithms should permit stereocenters containing system- or user-defined variables, and all search algorithms should permit stereo-specific reaction searches.

Finally, Judith discussed reaction searches involving both specific transformations and specific ambiguity (mapping R-groups, mapping variables, and including the elusive electron-withdrawing group). She warns users that if it is essential that they map a user-defined R-group from reactant to product, they should be prepared to do multiple searches for the various substances represented. Database vendors should note that adding generics like electron-withdrawing groups would make users very, very happy.
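Running "multiple searches for the various substances represented" amounts to enumerating a user-defined R-group into one fully specified query per substituent. The sketch below does this by plain string substitution into a reaction SMILES template. The template, the placeholder names, and the short hand-picked list of electron-withdrawing groups are all assumptions for illustration; no chemistry toolkit, atom mapping, or database query syntax is involved.

```python
from itertools import product


def enumerate_queries(template, rgroups):
    """Expand every combination of R-group values into a concrete query string."""
    names = sorted(rgroups)
    queries = []
    for combo in product(*(rgroups[name] for name in names)):
        queries.append(template.format(**dict(zip(names, combo))))
    return queries


# Hypothetical template: a para-substituted aryl halide converted to an aniline.
template = "{EWG}c1ccc({X})cc1>>{EWG}c1ccc(N)cc1"

# Hand-picked SMILES fragments: cyano, nitro, and trifluoromethyl as examples
# of electron-withdrawing groups; bromide and chloride as leaving groups.
rgroups = {
    "EWG": ["N#C", "O=[N+]([O-])", "FC(F)(F)"],
    "X": ["Br", "Cl"],
}

queries = enumerate_queries(template, rgroups)
for query in queries:
    print(query)
```

The number of searches is the product of the list lengths, so it grows quickly, which is one reason a system-defined generic is preferable whenever a suitable one exists.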

Acknowledgments

My thanks to all the speakers for their interesting contributions, and for providing me with copies of their slides, allowing me to study the talks in more depth and, ultimately, to include more detailed summaries in my meeting report. My thanks also to Matt Clark for handling all the PC and projector issues so that I could concentrate on introducing speakers, on handling questions, and, above all, on being stimulated by the interesting science.

References

  1. Coley, C. W.; Barzilay, R.; Jaakkola, T. S.; Green, W. H.; Jensen, K. F. Prediction of Organic Reaction Outcomes Using Machine Learning. ACS Cent. Sci. 2017, 3 (5), 434-443.
  2. Segler, M. H. S.; Waller, M. P. Neural-Symbolic Machine Learning for Retrosynthesis and Reaction Prediction. Chem. - Eur. J. 2017, 23 (25), 5966-5971.
  3. Segler, M. H. S.; Waller, M. P. Modelling Chemical Reasoning to Predict and Invent Reactions. Chem. - Eur. J. 2017, 23 (25), 6118-6128.
  4. Segler, M. H. S.; Preuss, M.; Waller, M. P. Learning to Plan Chemical Syntheses. 2017, arXiv.org e-Print archive. https://arxiv.org/abs/1708.04202 (accessed September 22, 2017).
  5. Segler, M. H. S.; Preuss, M.; Waller, M. P. Towards "AlphaChem": Chemical Synthesis Planning with Tree Search and Deep Neural Network Policies. 2017, arXiv.org e-Print archive. https://arxiv.org/abs/1702.00020 (accessed September 22, 2017).
  6. Socorro, I. M.; Goodman, J. M. The ROBIA Program for Predicting Organic Reactivity. J. Chem. Inf. Model. 2006, 46 (2), 606-614.
  7. Currie, R. H.; Goodman, J. M. In Silico Inspired Total Synthesis of (-)-Dolabriferol. Angew. Chem., Int. Ed. 2012, 51 (19), 4695-4697.
  8. Ermanis, K.; Parkes, K. E. B.; Agback, T.; Goodman, J. M. Expanding DP4: application to drug compounds and automation. Org. Biomol. Chem. 2016, 14 (16), 3943-3949.
  9. Reid, J. P.; Simon, L.; Goodman, J. M. A Practical Guide for Predicting the Stereochemistry of Bifunctional Phosphoric Acid Catalyzed Reactions of Imines. Acc. Chem. Res. 2016, 49 (5), 1029-1041.
  10. Fedorov, M. V.; Goodman, J. M.; Schumm, S. To Switch or Not To Switch: The Effects of Potassium and Sodium Ions on α-Poly-L-glutamate Conformations in Aqueous Solutions. J. Am. Chem. Soc. 2009, 131 (31), 10854-10856.
  11. Allen, T. E. H.; Goodman, J. M.; Gutsell, S.; Russell, P. J. A History of the Molecular Initiating Event. Chem. Res. Toxicol. 2016, 29 (12), 2060-2070.
  12. Schneider, N.; Lowe, D. M.; Sayle, R. A.; Landrum, G. A. Development of a Novel Fingerprint for Chemical Reactions and Its Application to Large-Scale Reaction Classification and Similarity. J. Chem. Inf. Model. 2015, 55 (1), 39-53.
  13. Law, J.; Zsoldos, Z.; Simon, A.; Reid, D.; Liu, Y.; Khew, S. Y.; Johnson, A. P.; Major, S.; Wade, R. A.; Ando, H. Y. Route Designer: A Retrosynthetic Analysis Tool Utilizing Automated Retrosynthetic Rule Generation. J. Chem. Inf. Model. 2009, 49 (3), 593-602.
  14. Cook, A.; Johnson, A. P.; Law, J.; Mirzazadeh, M.; Ravitz, O.; Simon, A. Computer-aided synthesis design. 40 years on. Wiley Interdiscip. Rev.: Comput. Mol. Sci. 2012, 2 (1), 79-107.
  15. Kraut, H.; Eiblmaier, J.; Grethe, G.; Loew, P.; Matuszczyk, H.; Saller, H. Algorithm for reaction classification. J. Chem. Inf. Model. 2013, 53 (11), 2884-2895.
  16. Knehans, T.; Klingler, F.-M.; Kraut, H.; Saller, H.; Herrmann, A.; Rippmann, F.; Eiblmaier, J.; Lemmen, C.; Krier, M. Merck AcceSSible InVentory (MASSIV): In silico synthesis guided by chemical transforms obtained through bootstrapping reaction databases, Abstracts of Papers, 254th ACS National Meeting & Exposition, Washington, DC, USA, August 20-24, 2017; American Chemical Society: Washington, DC, 2017; COMP 283.
  17. Bøgevig, A.; Federsel, H.-J.; Huerta, F.; Hutchings, M. G.; Kraut, H.; Langer, T.; Löw, P.; Oppawsky, C.; Rein, T.; Saller, H. Route Design in the 21st Century: The ICSYNTH Software Tool as an Idea Generator for Synthesis Prediction. Org. Process Res. Dev. 2015, 19 (2), 357-368.