David Winkler: Sparse QSAR modeling methods for therapeutic and regenerative medicine

David Winkler’s award address was co-authored by his colleague Frank Burden, now retired from CSIRO, and by co-workers at Imperial College London, King’s College London, and the University of Nottingham, whose work is acknowledged in the literature references.

David’s research concerns computational chemistry applied to a molecular-level understanding of the interactions of molecules and materials with biology. He has a strong interdisciplinary, translational research focus. His modeling, design, and optimization of bioactive materials focus on testing model predictions by subsequent experiments. He employs a range of computational tools including quantum chemistry, molecular dynamics and mechanics, molecular graphics, pharmacophore models, protein docking, and, in the case of this talk, quantitative structure-property relationship modeling. He is interested in the design of drugs and materials for therapeutic and regenerative medicine, especially control of stem cell fate, with a particular focus on the application of artificial intelligence (AI), machine learning, pattern recognition, complex systems science, evolutionary algorithms, and adaptive learning.

His work has had commercial impact, including the transfer of neural network modeling technology to BioRAD Corporation; several field trial candidates with Du Pont and Schering Plough; and clinical trials of a radioprotectant drug for cancer radiotherapy patients (with Sirtex and the Peter Mac Cancer Institute). He developed core intellectual property (a novel antibacterial target in the bacterial replisome) for the Betabiotics company spin-off, and discovered a new mechanism for strontium biomaterial-induced differentiation of mesenchymal stem cells to bone. He carried out a large project with Air Liquide Santé on using in silico methods to understand the surprisingly rich biological properties of noble gases. He discovered new antifibrotic and antihypertensive agents for Vectus Biosystems (allowing them to float on the stock market) and a first-in-class drug lead for myelofibrosis, which will soon be further developed by a new spin-off company.

Winkler’s research thinking was greatly influenced by complex systems science, which finds deep mechanistic similarities between areas of science that appear to have nothing in common. Concepts include nonlinear dynamical behavior, networks and their attractor states, self-organized criticality, chaos, and emergent properties. Complex systems science stimulates substantial lateral thinking and novel problem solving. Methods from other areas of science can provide novel solutions to problems in drug discovery; and methods developed for drug discovery can provide novel solutions to problems in other areas of science, such as biomaterials, gene expression, non-biological materials, and regenerative medicine.

QSAR was invented by Toshio Fujita (very recently deceased) and Corwin Hansch, and rapidly evolved into a method for optimization of drugs and agrochemicals. David and Toshio published a recent paper [43] on the two forms of QSAR: “explain” and “predict”. Graham Richards’ and Peter Andrews’ seminal commercialization ventures influenced David to make translation a strong focus in his research.

The research for which David received the Skolnik award involved the application of modern computational and mathematical methods to optimizing the QSAR modeling process [44]. The first operation is to generate descriptors. Model quality is critically dependent on descriptors. Descriptors with low or no relevance to the property modeled degrade the model. Bad descriptors were a problem in early QSAR work, and there is still a major research need for good descriptors for materials. Next, a subset of descriptors is chosen for the model in a context-dependent way. Choosing too many subsets can give chance correlations. In generating the relationship between the descriptors and the target property, model quality is less dependent on the modeling algorithm than on the descriptors, but there can be issues with overfitting, overtraining, ambiguity in network architecture, and subjective choices. The next operation is validating the performance of the model in predicting properties of new data. Here, cross validation and bootstrapping generate optimistic measures of performance, and an independent test set not used in training is best. The final operation is making new predictions from the model and synthesizing and testing new materials.

Descriptors are the last major research problem for QSAR. Many (such as DRAGON descriptors) are arcane; efficient, interpretable descriptors are needed. Descriptors specific to complex materials are essential, but the field is embryonic. High throughput characterization data can augment computed descriptors.

There are advantages in removing irrelevant features. Least squares in multiple linear regression (MLR) has a Gaussian prior. This can be replaced with a Laplacian prior, which removes uninformative weights by driving them to zero. Sparse Bayesian feature selection methods (feature selection using expectation maximization) identify a small number of relevant features very efficiently [45].
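The effect of swapping the prior can be seen in a small numerical sketch. The code below is an illustration under stated assumptions, not the expectation maximization algorithm used in the work described: it compares the maximum a posteriori weights under a Gaussian prior (ridge regression, solved in closed form) with those under a Laplacian prior (an L1 penalty, solved here by iterative soft thresholding). The synthetic data, the penalty strength `lam`, and all function names are hypothetical.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """MAP estimate under a Gaussian prior: closed-form ridge regression."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(n_features), X.T @ y)

def lasso_fit(X, y, lam, n_iter=2000):
    """MAP estimate under a Laplacian prior, minimizing
    0.5 * ||y - Xw||^2 + lam * ||w||_1 by iterative soft thresholding (ISTA)."""
    w = np.zeros(X.shape[1])
    step = 1.0 / np.linalg.norm(X, 2) ** 2  # 1 / Lipschitz constant of the gradient
    for _ in range(n_iter):
        z = w - step * (X.T @ (X @ w - y))  # gradient step on the squared error
        w = np.sign(z) * np.maximum(np.abs(z) - step * lam, 0.0)  # soft threshold
    return w

# Synthetic data (an assumption of this sketch): 20 descriptors, 3 relevant.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 20))
true_w = np.zeros(20)
true_w[:3] = [2.0, -1.5, 1.0]
y = X @ true_w + 0.1 * rng.normal(size=100)

w_ridge = ridge_fit(X, y, lam=10.0)
w_lasso = lasso_fit(X, y, lam=10.0)
n_zero_ridge = int(np.sum(w_ridge == 0.0))  # Gaussian prior shrinks but keeps all weights
n_zero_lasso = int(np.sum(w_lasso == 0.0))  # Laplacian prior sets many weights exactly to zero
```

The Gaussian prior only shrinks weights toward zero, whereas the soft-threshold step produced by the Laplacian prior sets small weights exactly to zero, which is what makes feature removal automatic.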

There are many methods of varying sophistication for finding structure-activity relationships [44], including simple linear statistical regression methods such as multiple linear regression; nonlinear regression methods using polynomials or nonlinear kernels, and nonlinear machine learning; bioinspired methods such as neural nets; support vector machines; and random forests. These have new applications in materials, nanotechnology, and regenerative medicine.

The universal approximation theorem states that neural networks can model any complex relationship given sufficient training data. Neural networks are very well suited to modeling of complex data, but they have problems such as overfitting and overtraining. They raise an ill-posed problem in statistics (instability), and optimum network architecture is ambiguous. The contribution of David and his co-workers has been to develop very robust, self-optimizing sparse feature selection and neural network methods that overcome all these problems [46]. These methods have been shown to have performance similar to that of deep neural networks.

Sparse Bayesian modeling and feature selection, replacing the Gaussian prior with the Laplacian prior, is a general nonlinear modeling method [45, 47-49] that automatically optimizes model complexity, prunes neural network weights to avoid overfitting, and prunes irrelevant descriptors to optimize the predictivity of a model. A sparsity-inducing Laplacian prior (LP) was introduced into Winkler’s Bayesian Regularized Artificial Neural Network algorithm (BRANN), creating BRANNLP [47, 49]. Low relevance weights are set to zero, and descriptors are also pruned from the model if all their weights are zero.
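The descriptor pruning rule in the last sentence can be written down in a few lines. This is a minimal sketch of that rule only, not the BRANNLP implementation: the weight matrix layout (hidden nodes by descriptors), the function name, and the descriptor names are all assumptions made for illustration.

```python
import numpy as np

def prune_descriptors(first_layer_weights, descriptor_names):
    """Keep only descriptors with at least one nonzero weight into the hidden layer.

    first_layer_weights has shape (n_hidden, n_descriptors); this layout is an
    assumption of the sketch, not a detail taken from the BRANNLP papers.
    """
    keep = np.any(first_layer_weights != 0.0, axis=0)
    kept_names = [name for name, k in zip(descriptor_names, keep) if k]
    return first_layer_weights[:, keep], kept_names

# Hypothetical weights after sparse training: the "logP" column is all zeros.
W = np.array([[0.8, 0.0, 0.0],
              [-0.3, 0.0, 1.2]])
W_pruned, kept = prune_descriptors(W, ["mw", "logP", "tpsa"])
```

Here the third descriptor survives because one of its two weights is nonzero, while the second is dropped because every weight attached to it has been driven to zero.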

From selection and mapping, David turned to validation. Cross validation, bootstrapping, and other methods give an overly optimistic estimate of predictive power because the test set is not independent of the training set. An independent test set never seen by the model is the gold standard. Many measures of predictivity have been proposed. Test set validation is actually a simple problem in statistics; the standard error of prediction on the test set (SEP) is preferred over r² as it is less dependent on dataset size and model complexity [46, 50].
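As a rough illustration of the two measures, the sketch below computes both for a held-out test set. It assumes SEP is the root-mean-square error of the test set predictions and that r² is the coefficient of determination; the talk summary does not give the exact formulas, so both definitions and the toy numbers are assumptions. The example shows one way r² can mislead: identical prediction errors give different r² values when the measured values span a wider range.

```python
import numpy as np

def standard_error_of_prediction(y_true, y_pred):
    """Root-mean-square error over an independent test set (assumed SEP definition)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def r_squared(y_true, y_pred):
    """Coefficient of determination, 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return float(1.0 - ss_res / ss_tot)

# Same prediction errors, different spread of measured values (toy data).
errors = np.array([0.5, -0.5, 0.5, -0.5])
y_narrow = np.array([1.0, 2.0, 3.0, 4.0])
y_wide = np.array([1.0, 4.0, 7.0, 10.0])

sep_narrow = standard_error_of_prediction(y_narrow, y_narrow + errors)
sep_wide = standard_error_of_prediction(y_wide, y_wide + errors)
r2_narrow = r_squared(y_narrow, y_narrow + errors)
r2_wide = r_squared(y_wide, y_wide + errors)
```

Both test sets have the same SEP because the errors are identical, but the wider set scores a higher r² purely because its total variance is larger.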

Methods from other areas of science can provide novel solutions to problems in drug discovery, and methods developed for drug discovery can provide novel solutions to problems in other areas of science. Implantable medical devices are an example. Bacterial adhesion and growth on biomaterial surfaces of joint prostheses, heart valves, shunts, vascular and urinary catheters, and intraocular lenses are serious problems in health care. There is a major unmet medical need for new coating materials for implantable and indwelling medical devices. David and his co-workers from Morgan Alexander’s research team at the University of Nottingham have used machine learning methods to derive quantitative models relating the molecular structure of a polymer to the attachment of the bacteria to that polymer surface. These models can be used to screen large databases of new materials for those with low pathogen attachment.

Hook et al. have detected the attachment of selected bacterial species to 576 polymeric materials in a high-throughput microarray format [51]. In work by David and his colleagues, data from a large polymer microarray exposed to three clinical pathogens were used to derive robust and predictive machine learning models of pathogen attachment [52]. The BRANN models can predict pathogen attachment for the polymer library quantitatively. The models also successfully predict pathogen attachment for a second-generation library, and identify polymer surface chemistries that enhance or diminish pathogen attachment. A manuscript on work on multiple pathogen attachment models has been submitted.

Sparse feature selection methods have also identified a new mechanism for strontium biomaterial-induced differentiation of mesenchymal stem cells to bone. Strontium ranelate (Protelos) is a drug approved in the European Union for the treatment and prevention of osteoporosis. It reduces the risk of vertebral and non-vertebral fractures in post-menopausal women. Although controversial, it is reported to have an anabolic and anti-catabolic effect on bone. The strontium ion’s mechanism of action is not fully understood, but it is thought to up-regulate differentiation of osteoprogenitors or stimulate bone formation [53-55].

David and his Imperial College co-workers [56], Molly Stevens, Eileen Gentleman, and Hélène Autefage, have evaluated the global response of human mesenchymal stem cells to strontium-substituted bioactive glasses using a combination of unsupervised biological and physical science techniques. Their objective analyses of whole gene-expression profiles, confirmed by standard molecular biology techniques, revealed that strontium-substituted bioactive glasses up-regulated the isoprenoid pathway, suggesting an influence on both sterol metabolite synthesis and protein prenylation processes.

In future, David hopes to see exploitation of new AI methods such as deep learning; improved descriptors for molecules that are effective and interpretable; exploitation of evolutionary methods of discovery aided by robotics; synergy of AI and evolutionary methods for adaptive evolution; adoption of in silico methods from drug discovery for materials and regeneration; development of autonomous or semiautonomous “closed loop” design methods; and more effective exploration of vast molecular or materials spaces.

In 2013, deep learning was predicted to be a breakthrough technology, but deep neural networks are not necessarily magic. According to the universal approximation theorem, a feed-forward network with a single hidden layer containing a finite number of neurons can approximate any continuous function, under mild assumptions on the activation function. This was first proved by Cybenko in 1989 for sigmoid activation functions. Hornik showed in 1991 that it is not the specific choice of activation function, but the multilayer feed-forward architecture itself, that gives neural networks the potential to be universal approximators [46].
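In one common form, the theorem can be stated as follows. The symbols are notation chosen here for illustration and do not come from the talk: f is a continuous function on a compact set K in n-dimensional real space, σ is a continuous sigmoidal activation function, and ε is any positive tolerance. There then exist a finite number of hidden neurons N and parameters α_i, w_i, and b_i such that

```latex
\[
\sup_{x \in K} \left| f(x) - \sum_{i=1}^{N} \alpha_i \, \sigma\!\left(w_i^{\top} x + b_i\right) \right| < \varepsilon
\]
```

The result guarantees only that such a network exists. It does not say how many neurons are needed or whether a given training procedure will find suitable weights.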

Deep learning methods have generated impressive improvements in image and voice recognition, and are now being applied to QSAR and QSPR modeling. A recent publication [46] describes the differences in approach between deep and shallow neural networks, compares their abilities to predict the properties of test sets for 15 large drug datasets, discusses the results in terms of the universal approximation theorem for neural networks, and describes how deep neural networks may ameliorate or remove troublesome “activity cliffs” in QSAR datasets. Materials space is vast and, at least in some of its many dimensions, the fitness landscape is smooth. This allows adaptation, one step (one mutation) at a time. Evolution and machine learning can be combined in adaptive learning (the Baldwin effect).

A recent review discusses the problems of large materials spaces, the types of evolutionary algorithms employed to identify or optimize materials, and how materials can be represented mathematically as genomes [57]. It describes fitness landscapes and mutation operators commonly employed in materials evolution, and provides a comprehensive summary of published research on the use of evolutionary methods to generate new catalysts, phosphors, and a range of other materials. Another recent paper describes the materials genome in action [58].
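The ingredients named above (a genome, a fitness landscape, and a mutation operator) can be put together in a toy evolutionary loop. Everything in this sketch is hypothetical: the genome is a list of numbers between 0 and 1 standing in for composition variables, the fitness function is an invented smooth landscape, and truncation selection is just one simple choice among many. In practice the fitness would come from an experiment or a machine learning model.

```python
import random

def fitness(genome):
    """Hypothetical smooth landscape with a single optimum at 0.3 for every gene."""
    return -sum((g - 0.3) ** 2 for g in genome)

def mutate(genome, rng, scale=0.05):
    """Mutation operator: perturb one randomly chosen gene, clipped to [0, 1]."""
    child = list(genome)
    i = rng.randrange(len(child))
    child[i] = min(1.0, max(0.0, child[i] + rng.gauss(0.0, scale)))
    return child

def evolve(n_genes=5, pop_size=20, n_generations=100, seed=0):
    """Evolve a population and return the best genome and the initial best fitness."""
    rng = random.Random(seed)
    population = [[rng.random() for _ in range(n_genes)] for _ in range(pop_size)]
    initial_best = max(fitness(g) for g in population)
    for _ in range(n_generations):
        # Truncation selection: the fitter half survives and produces mutated offspring.
        population.sort(key=fitness, reverse=True)
        parents = population[: pop_size // 2]
        population = parents + [mutate(p, rng) for p in parents]
    best = max(population, key=fitness)
    return best, initial_best

best_genome, initial_best_fitness = evolve()
final_best_fitness = fitness(best_genome)
```

Because the surviving parents are carried over unchanged, the best fitness in the population can never decrease from one generation to the next, and on a smooth landscape single small mutations are enough to climb toward the optimum.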

Machine learning methods have been applied widely, for example to the aqueous solubility of drugs [59]; polymers for stem cell growth [60]; cubane as a benzene bioisostere [61]; benign organic corrosion inhibitors [62]; markers for stem cell division [63]; materials for stem cell factories [64]; adverse effects of nanomaterials [65]; anticancer farnesyltransferase inhibitors [66]; and prediction of materials properties [44].

In summary, AI tools developed for therapeutic medicine also work well for regenerative medicine. Neural networks are machine learning methods that are very applicable to (bio)materials design. The universal approximation theorem means that deep learning methods should not be superior to shallow neural networks for molecular design. Bayesian regularized neural networks can generate robust, predictive models of many types of materials and properties. Sparse Bayesian feature selection methods can reduce the dimensionality of problems, improve interpretability, and generate robust models with better predictivity. Evolutionary methods, combined with machine learning (adaptive evolution) can find effective materials quickly and efficiently.

Conclusion

Erin Davis, chair of the ACS Division of Chemical Information, formally presented the Herman Skolnik Award to David Winkler at a reception held in honor of David, following the symposium.

David Winkler receives award from Erin Davis

Erin Davis and David Winkler

References

  43. Fujita, T.; Winkler, D. A. Understanding the Roles of the "Two QSARs". J. Chem. Inf. Model. 2016, 56 (2), 269-274.
  44. Le, T.; Epa, V. C.; Burden, F. R.; Winkler, D. A. Quantitative Structure-Property Relationship Modeling of Diverse Materials Properties. Chem. Rev. (Washington, DC, U. S.) 2012, 112 (5), 2889-2919.
  45. Figueiredo, M. A. T. Adaptive sparseness for supervised learning. IEEE Trans. Pattern Anal. Mach. Intell. 2003, 25 (9), 1150-1159.
  46. Winkler, D. A.; Le, T. C. Performance of Deep and Shallow Neural Networks, the Universal Approximation Theorem, Activity Cliffs, and QSAR. Mol. Inf. 2017, 36 (1-2), 1600118.
  47. Burden, F. R.; Winkler, D. A. Robust QSAR models using Bayesian regularized neural networks. J. Med. Chem. 1999, 42 (16), 3183-3187.
  48. Burden, F. R.; Winkler, D. A. An Optimal Self-Pruning Neural Network and Nonlinear Descriptor Selection in QSAR. QSAR Comb. Sci. 2009, 28 (10), 1092-1097.
  49. Burden, F. R.; Winkler, D. A. Optimal sparse descriptor selection for QSAR using Bayesian methods. QSAR Comb. Sci. 2009, 28 (6-7), 645-653.
  50. Alexander, D. L. J.; Tropsha, A.; Winkler, D. A. Beware of R²: Simple, Unambiguous Assessment of the Prediction Accuracy of QSAR and QSPR Models. J. Chem. Inf. Model. 2015, 55 (7), 1316-1322.
  51. Hook, A. L.; Chang, C.-Y.; Yang, J.; Luckett, J.; Cockayne, A.; Atkinson, S.; Mei, Y.; Bayston, R.; Irvine, D. J.; Langer, R.; Anderson, D. G.; Williams, P.; Davies, M. C.; Alexander, M. R. Combinatorial discovery of polymers resistant to bacterial attachment. Nat. Biotechnol. 2012, 30 (9), 868-875.
  52. Epa, V. C.; Hook, A. L.; Chang, C.; Yang, J.; Langer, R.; Anderson, D. G.; Williams, P.; Davies, M. C.; Alexander, M. R.; Winkler, D. A. Modeling and prediction of bacterial attachment to polymers. Adv. Funct. Mater. 2014, 24 (14), 2085-2093.
  53. Reginster, J. Y.; Seeman, E.; De Vernejoul, M. C.; Adami, S.; Compston, J.; Phenekos, C.; Devogelaer, J. P.; Curiel, M. D.; Sawicki, A.; Goemaere, S.; Sorensen, O. H.; Felsenberg, D.; Meunier, P. J. Strontium Ranelate Reduces the Risk of Nonvertebral Fractures in Postmenopausal Women with Osteoporosis: Treatment of Peripheral Osteoporosis (TROPOS) Study. J. Clin. Endocrinol. Metab. 2005, 90 (5), 2816-2822.
  54. Meunier, P. J.; Roux, C.; Seeman, E.; Ortolani, S.; Badurski, J. E.; Spector, T. D.; Cannata, J.; Balogh, A.; Lemmel, E.-M.; Pors-Nielsen, S.; Rizzoli, R.; Genant, H. K.; Reginster, J.-Y.; Graham, J.; Ng, K. W.; Prince, R.; Prins, J.; Seeman, E.; Wark, J.; Reginster, J. Y.; Devogelaer, J. P.; Kaufman, J. M.; Raeman, F.; Ziekenhuis, J. P.; Walravens, M.; Pors-Nielson, S.; Beck-Nielsen, H.; Charles, P.; Sorensen, O. H.; Meunier, P. J.; Aquino, J. P.; Benhamou, C.; Blotman, F.; Bonidan, O.; Bourgeois, P.; De Vernejoul, M. C.; Dehais, J.; Fardellone, P.; Kahan, A.; Kuntz, J. L.; Marcelli, C.; Prost, A.; Vellas, B.; Weryha, G.; Lemmel, E. M.; Felsenberg, D.; Hensen, J.; Kruse, H. P.; Schmidt, W.; Semler, J.; Strucki, G.; Phenekos, C.; Balogh, A.; De Chatel, R.; Ortolani, S.; Adami, S.; Bianchi, G.; Brandi, M. L.; Cucinotta, D.; Fiore, C.; Gennari, C.; Isaia, G.; Luisetto, G.; Passariello, R.; Passeri, M.; Rovetta, G.; Tessari, L.; Badurski, J. E.; Hoszowski, K.; Lorenc, R. S.; Sawicki, A.; Diez, A.; Cannata, J. B.; Diaz Curiel, M.; Rapado, A.; Gijon, J.; Torrijos, A.; Padrino, J. M.; Roces Varela, A.; Bonjour, J. P.; Rizzoli, R.; Spector, T. D.; Clements, M.; Doyle, D. V.; Ryan, P.; Smith, I. G.; Smith, R. The effects of strontium ranelate on the risk of vertebral fracture in women with postmenopausal osteoporosis. N. Engl. J. Med. 2004, 350 (5), 459-468.
  55. Meunier, P. J. Postmenopausal osteoporosis and strontium ranelate. Reply. N. Engl. J. Med. 2004, 350 (19), 2002-2003.
  56. Autefage, H.; Gentleman, E.; Littmann, E.; Hedegaard, M. A. B.; Von Erlach, T.; O'Donnell, M.; Burden, F. R.; Winkler, D. A.; Stevens, M. M. Sparse feature selection methods identify unexpected global cellular response to strontium-containing materials. Proc. Natl. Acad. Sci. U. S. A. 2015, 112 (14), 4280-4285.
  57. Le, T. C.; Winkler, D. A. Discovery and Optimization of Materials Using Evolutionary Approaches. Chem. Rev. (Washington, DC, U. S.) 2016, 116 (10), 6107-6132.
  58. Thornton, A. W.; Simon, C. M.; Kim, J.; Kwon, O.; Deeg, K. S.; Konstas, K.; Pas, S. J.; Hill, M. R.; Winkler, D. A.; Haranczyk, M.; Smit, B. Materials Genome in Action: Identifying the Performance Limits of Physical Hydrogen Storage. Chem. Mater. 2017, 29 (7), 2844-2854.
  59. Salahinejad, M.; Le, T. C.; Winkler, D. A. Aqueous Solubility Prediction: Do Crystal Lattice Interactions Help? Mol. Pharm. 2013, 10 (7), 2757-2766.
  60. Epa, V. C.; Yang, J.; Mei, Y.; Hook, A. L.; Langer, R.; Anderson, D. G.; Davies, M. C.; Alexander, M. R.; Winkler, D. A. Modelling human embryoid body cell adhesion to a combinatorial library of polymer surfaces. J. Mater. Chem. 2012, 22 (39), 20902-20906.
  61. Chalmers, B. A.; Xing, H.; Houston, S.; Clark, C.; Ghassabian, S.; Kuo, A.; Cao, B.; Reitsma, A.; Murray, C.-E. P.; Stok, J. E.; Boyle, G. M.; Pierce, C. J.; Littler, S. W.; Winkler, D. A.; Bernhardt, P. V.; Pasay, C.; De Voss, J. J.; McCarthy, J.; Parsons, P. G.; Walter, G. H.; Smith, M. T.; Cooper, H. M.; Nilsson, S. K.; Tsanaktsidis, J.; Savage, G. P.; Williams, C. M. Validating Eaton's Hypothesis: Cubane as a Benzene Bioisostere. Angew. Chem., Int. Ed. 2016, 55 (11), 3580-3585.
  62. Winkler, D. A.; Breedon, M.; Hughes, A. E.; Burden, F. R.; Barnard, A. S.; Harvey, T. G.; Cole, I. Towards chromate-free corrosion inhibitors: structure-property models for organic alternatives. Green Chem. 2014, 16 (6), 3349-3357.
  63. Huh, Y. H.; Noh, M.; Burden, F. R.; Chen, J. C.; Winkler, D. A.; Sherley, J. L. Sparse feature selection identifies H2A.Z as a novel, pattern-specific biomarker for asymmetrically self-renewing distributed stem cells. Stem Cell Res. 2015, 14 (2), 144-154.
  64. Celiz, A. D.; Smith, J. G. W.; Langer, R.; Anderson, D. G.; Winkler, D. A.; Barrett, D. A.; Davies, M. C.; Young, L. E.; Denning, C.; Alexander, M. R. Materials for stem cell factories of the future. Nat. Mater. 2014, 13 (6), 570-579.
  65. Epa, V. C.; Burden, F. R.; Tassa, C.; Weissleder, R.; Shaw, S.; Winkler, D. A. Modeling Biological Activities of Nanoparticles. Nano Lett. 2012, 12 (11), 5808-5812.
  66. Polley, M. J.; Winkler, D. A.; Burden, F. R. Broad-Based Quantitative Structure-Activity Relationship Modeling of Potency and Selectivity of Farnesyltransferase Inhibitors Using a Bayesian Regularized Neural Network. J. Med. Chem. 2004, 47 (25), 6230-6238.