Ceyda Oksel: Accurate and interpretable nano-QSAR models from genetic programming-based decision tree construction approaches

Ceyda Oksel of Imperial College London reported on the PhD work [27] she had done at the University of Leeds in collaboration with Xue Wang and David Winkler. Given the ever-increasing use of ENMs, it is essential to assess properly all the risks that may arise from exposure to them. The distinctive characteristics that make ENMs superior to bulk materials for particular applications might also have a substantial impact on the level of risk they pose. Despite the clear benefits that nanotechnology can bring, there are serious concerns about the potential health risks associated with the production and use of ENMs, intensified by the limited understanding of what makes ENMs toxic and how to make them safe.

The involvement of computational specialists in nano-safety research has become more prominent since Registration, Evaluation, Authorization and restriction of CHemicals (the European Union’s REACH regulation) promoted the use of in silico techniques such as QSAR for toxicity assessment. Data-driven models that decode the relationships between the biological activities of ENMs and their physicochemical characteristics provide an attractive means of maximizing the value of scarce and expensive experimental nanotoxicity data.

Nano-QSAR models can be used to predict the properties of new materials and to design safer materials. The genetic programming-based decision tree (GPTree) approach developed at Leeds [27] applies decision tree learning algorithms to identify the best combination of physicochemical properties for predicting the biological activity of ENMs. The trees are constructed automatically from the data. Decision trees have several advantages: they can deal with small, large and noisy datasets; they can detect nonlinear relationships as well as linear ones; they allow input variables to be selected automatically; and they are transparent and represent knowledge clearly (i.e., the models are interpretable).

GPTree begins with a random population of solutions and repeatedly attempts to find better solutions by applying genetic operators such as mutation and crossover. The first step is to construct a user-specified number of trees (usually a large number), each starting from a random compound and a randomly chosen descriptor. Once the initial population has been generated, tournament selection is performed: a subset of trees is drawn from the population, and the tree in that subset with the best fitness (e.g., accuracy) is chosen as a parent. Genetic operators such as crossover and mutation are then applied to the parents to form the next generation of trees, which are added to, or replace, the current generation. These steps are repeated until the user-specified number of generations has been created. The decision tree with the highest classification accuracy on the training set is selected as the optimal model.
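The loop described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the GPTree implementation: the tuple-based tree representation, the function names, the parameter defaults (`pop_size`, `k`, `max_depth`, the mutation rate of 0.2), the depth cap, and the choice to carry the best tree into each new generation (elitism) are all hypothetical choices made for the example.

```python
import random

# A tree is ("leaf", label) or ("split", descriptor_index, threshold, left, right).

def random_tree(X, y, max_depth, rng):
    """Grow a random tree; each split takes its threshold from a random sample."""
    if max_depth == 0 or rng.random() < 0.3:
        return ("leaf", rng.choice(y))
    row = rng.choice(X)              # random compound
    j = rng.randrange(len(row))      # random descriptor
    return ("split", j, row[j],
            random_tree(X, y, max_depth - 1, rng),
            random_tree(X, y, max_depth - 1, rng))

def predict(tree, row):
    while tree[0] == "split":
        _, j, threshold, left, right = tree
        tree = left if row[j] <= threshold else right
    return tree[1]

def accuracy(tree, X, y):
    return sum(predict(tree, r) == c for r, c in zip(X, y)) / len(y)

def depth(tree):
    return 0 if tree[0] == "leaf" else 1 + max(depth(tree[3]), depth(tree[4]))

def node_paths(tree, path=()):
    """Yield the path (sequence of child indices) to every node."""
    yield path
    if tree[0] == "split":
        yield from node_paths(tree[3], path + (3,))
        yield from node_paths(tree[4], path + (4,))

def get_node(tree, path):
    for i in path:
        tree = tree[i]
    return tree

def set_node(tree, path, new):
    if not path:
        return new
    node = list(tree)
    node[path[0]] = set_node(tree[path[0]], path[1:], new)
    return tuple(node)

def tournament(pop, fitness, k, rng):
    """Pick k random trees and return the fittest of them."""
    contenders = rng.sample(range(len(pop)), k)
    return pop[max(contenders, key=lambda i: fitness[i])]

def crossover(a, b, rng):
    """Replace a random subtree of a with a random subtree of b."""
    donor = get_node(b, rng.choice(list(node_paths(b))))
    return set_node(a, rng.choice(list(node_paths(a))), donor)

def mutate(tree, X, y, rng):
    """Replace a random subtree with a freshly grown one."""
    return set_node(tree, rng.choice(list(node_paths(tree))), random_tree(X, y, 2, rng))

def evolve(X, y, pop_size=60, generations=20, k=4, max_depth=6, seed=0):
    rng = random.Random(seed)
    pop = [random_tree(X, y, 3, rng) for _ in range(pop_size)]
    for _ in range(generations):
        fitness = [accuracy(t, X, y) for t in pop]
        elite = pop[max(range(pop_size), key=lambda i: fitness[i])]
        nxt = [elite]                                # keep the best tree (elitism)
        while len(nxt) < pop_size:
            parent = tournament(pop, fitness, k, rng)
            child = crossover(parent, tournament(pop, fitness, k, rng), rng)
            if rng.random() < 0.2:
                child = mutate(child, X, y, rng)
            # Assumed safeguard: fall back to the parent if the child grows too deep.
            nxt.append(child if depth(child) <= max_depth else parent)
        pop = nxt
    return max(pop, key=lambda t: accuracy(t, X, y))
```

Calling `evolve(X, y)` with `X` as a list of descriptor rows and `y` as a list of class labels returns the tree with the highest training accuracy in the final generation. Because the best tree is always carried forward in this sketch, that accuracy can never be lower than the best accuracy in the initial random population.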

Ceyda demonstrated the application of genetic-programming-based decision tree construction algorithms to QSAR modeling of ENM toxicity through five case studies. The accuracy of the model predictions was high and statistically significant relative to the classification rate expected by chance.

In the first case study, a large set of in-house in vitro data (obtained in collaboration with Edinburgh University) was used. The dataset included a panel of 18 ENMs with varying structures (e.g., carbon-based materials and metal oxides), a set of in vitro cytotoxicity assays (e.g., LDH release, apoptosis, necrosis, viability, MTT and hemolytic effects), and several experimentally measured physicochemical properties (e.g., particle size and size distribution, surface area, morphology, metal content, reactivity and free radical generation). After a series of data preparation and scaling steps, a heat map of the toxicity data combined with hierarchical clustering was constructed. As a second step, C-Visual Explorer (CVE) was used to create a parallel coordinate plot of the multivariate toxicity data. Consistent with the heat map, the parallel coordinate plot showed that the aminated polystyrene latex beads and zinc oxide had the highest toxicity values in nearly all assays, followed by nanotubes, which had medium to high toxicity values in the viability and MTT assays.

Then, a dimensionality reduction technique, principal component analysis, was performed on all the toxicity data and the ENMs were divided into five categories according to their toxicity values. GPTree was used to identify potential descriptors contributing to the toxicity of four particular ENMs that were clearly separated from the main cluster formed by low-toxicity ENMs. It was concluded that high aspect ratio contributed to the toxicity of nanotubes, while the most likely factor driving the toxicity of zinc oxide was its high zinc content.


In the second case study, on the cellular uptake of nanoparticles, 13 descriptors representing hydrogen-bonding characteristics, functional group counts, molecular shape, composition and polarizability were found to be significant among a larger set of 147 chemically interpretable descriptors. The GPTree finding that lipophilicity, hydrogen bonding and molecular shape descriptors contribute strongly to the cellular uptake behavior of nanoparticles is consistent with earlier studies.


For a dataset on cytotoxicity to human keratinocytes (the third case study) [28], the descriptors selected by GPTree were the enthalpy of formation of a metal oxide nanocluster representing a fragment of the surface, the Mulliken electronegativity of the cluster, Xc, and the chemical hardness, η. The first two descriptors are consistent with properties previously reported to be important for the cytotoxicity of metal oxide nanoparticles. In addition, the chemical hardness, which reflects reactivity, was found to influence the cytotoxicity of nanoparticles.


The descriptors selected by GPTree were used to develop a regression model that was statistically significant and had good predictivity (R² = 0.92, Q² = 0.72). A variable importance plot showed that Xc was twice as important as the enthalpy of formation, which was a little more important than η.
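For readers less familiar with the two statistics, the sketch below shows how a goodness-of-fit value and a cross-validated value can be computed for a simple regression model. It rests on assumptions that the talk did not confirm: Q² is taken here to be the leave-one-out cross-validated coefficient, the model is a one-descriptor least-squares line rather than the three-descriptor model reported, and the function names and data are hypothetical.

```python
def fit_line(x, y):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    slope = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sxx
    return slope, my - slope * mx

def r2(x, y):
    """Coefficient of determination on the data used for fitting."""
    slope, intercept = fit_line(x, y)
    my = sum(y) / len(y)
    ss_res = sum((b - (slope * a + intercept)) ** 2 for a, b in zip(x, y))
    ss_tot = sum((b - my) ** 2 for b in y)
    return 1 - ss_res / ss_tot

def q2_loo(x, y):
    """Leave-one-out Q2: each point is predicted by a model fitted without it."""
    my = sum(y) / len(y)
    press = 0.0
    for i in range(len(x)):
        slope, intercept = fit_line(x[:i] + x[i + 1:], y[:i] + y[i + 1:])
        press += (y[i] - (slope * x[i] + intercept)) ** 2
    return 1 - press / sum((b - my) ** 2 for b in y)

# Made-up illustrative data, not taken from the study.
descriptor = [1.0, 2.0, 3.0, 4.0, 5.0]
activity = [1.1, 1.9, 3.2, 3.9, 5.1]
fit_quality = r2(descriptor, activity)
cv_quality = q2_loo(descriptor, activity)
```

For a least-squares model, the leave-one-out value computed this way cannot exceed the fitted R², so a gap between the two, as in the reported 0.92 and 0.72, is expected; a very large gap would suggest overfitting.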

The data used in the fourth case study included a set of 27 descriptors, 23 ENMs, and a set of multi- and single-parameter toxicity screening assays. The descriptors selected by the GPTree model included the nanoparticle conduction band energy, E_C, and the ionic index of the metal cation, Z²/r. This finding is consistent with past studies that identified these two descriptors as being important for the toxicity of metal oxide nanoparticles.


In the last case study, on the exocytosis of gold nanoparticles in macrophages, the optimal descriptors for predicting exocytosis were charge accumulation, zeta potential and charge density. These findings are in line with previous studies revealing an association between the surface characteristics of gold nanoparticles, especially high positive surface charge, and their exocytosis patterns in macrophages.


Ceyda concluded that the genetic-programming-based decision tree construction algorithm shows considerable promise in its ability to identify the relationship between molecular descriptors and the biological effects of ENMs. The selected decision tree models yielded external prediction accuracies of 86-100%. Y-randomization was also performed as an additional statistical test to demonstrate the robustness of the selected models. This work is a first step in applying a genetic-programming-based decision tree construction algorithm to nano-QSAR studies.
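The idea behind Y-randomization can be illustrated with a short Python sketch: the activity labels are shuffled many times, the model is refitted on each shuffled set, and the accuracy on the real labels is compared with the accuracies obtained by chance. This is an assumption-laden illustration rather than the procedure used in the study: the model is a hypothetical one-descriptor threshold classifier (`fit_stump`), the number of rounds and the empirical p-value formula are choices made for the example, and the data are made up.

```python
import random

def fit_stump(x, y):
    """Best single-threshold classifier on one descriptor.

    Returns (threshold, label_at_or_below_threshold, training_accuracy).
    """
    best = (None, None, -1.0)
    for t in sorted(set(x)):
        for below in (0, 1):
            acc = sum((below if a <= t else 1 - below) == b
                      for a, b in zip(x, y)) / len(y)
            if acc > best[2]:
                best = (t, below, acc)
    return best

def y_randomization(x, y, n_rounds=200, seed=0):
    """Compare accuracy on the true labels with accuracy on shuffled labels."""
    rng = random.Random(seed)
    true_acc = fit_stump(x, y)[2]
    shuffled_accs = []
    for _ in range(n_rounds):
        y_perm = y[:]
        rng.shuffle(y_perm)          # break the descriptor-activity link
        shuffled_accs.append(fit_stump(x, y_perm)[2])
    # Empirical p-value: share of shuffled models at least as good as the real one.
    p_value = (1 + sum(a >= true_acc for a in shuffled_accs)) / (n_rounds + 1)
    return true_acc, shuffled_accs, p_value

# Made-up data: one descriptor that separates the two classes cleanly.
descriptor = [float(i) for i in range(20)]
toxic = [0] * 10 + [1] * 10
true_acc, shuffled_accs, p_value = y_randomization(descriptor, toxic)
```

A model that captures a real relationship should score clearly higher on the true labels than on almost all of the shuffled sets, which shows up as a small empirical p-value.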


  27. Oksel, C.; Winkler, D. A.; Ma, C. Y.; Wilkins, T.; Wang, X. Z. Accurate and interpretable nanoSAR models from genetic programming-based decision tree construction approaches. Nanotoxicology 2016, 10 (7), 1001-1012.
  28. Gajewicz, A.; Schaeublin, N.; Rasulev, B.; Hussain, S.; Leszczynska, D.; Puzyn, T.; Leszczynski, J. Towards understanding mechanisms governing cytotoxicity of metal oxides nanoparticles: Hints from nano-QSAR studies. Nanotoxicology 2015, 9 (3), 313-325.