Tudor Oprea: Understudied proteins. Time to shift the paradigm

Tudor Oprea of the University of New Mexico believes that identifying novel targets as a precompetitive endeavor can lead to new therapeutic opportunities if academia and industry work together. Most protein classification schemes are based on structural and functional criteria. For therapeutic development, it is useful to understand how many data and what types of data are available for a given protein, thereby highlighting well-studied and understudied targets. Tudor and his co-workers classify proteins annotated as drug targets as “Tclin”; proteins for which potent small molecules are known as “Tchem”; proteins for which biology is better understood as “Tbio”; and proteins that lack antibodies, publications or National Center for Biotechnology Information (NCBI) Gene References Into Function (GeneRIFs) as “Tdark”.

Tclin proteins are associated with drug mechanism of action (MoA). Tchem proteins have bioactivities in ChEMBL and DrugCentral, plus human curation for some targets. A Tbio protein lacks small molecule annotation, and either is above the cutoff criteria for Tdark, is annotated with a Gene Ontology (GO) molecular function or biological process leaf term(s) with an experimental evidence code, or has confirmed Online Mendelian Inheritance in Man (OMIM) phenotype(s). Tudor and his colleagues used named entity recognition software [35] from L. J. Jensen’s lab to evaluate nearly 27 million abstracts to derive a publication score per protein. Tdark proteins (“understudied proteins”) have little information available, and meet two of the following three criteria: a PubMed text mining score of less than five, three or fewer GeneRIFs, and 50 or fewer antibodies available according to Antibodypedia. As external validation, Tdark proteins have statistically significantly fewer GO terms, patents, National Institutes of Health (NIH) R01 grants, and searches of the STRING-db database than proteins in the other three target development levels (TDLs).
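The classification rules above can be sketched in a few lines of Python. This is only an illustration of the logic as described in the talk, not the code used by Tudor's team: the `ProteinEvidence` class, its field names, and the function names are all hypothetical, the order in which the rules are tested is an assumption, and "two of three criteria" is read here as "at least two".

```python
from dataclasses import dataclass


@dataclass
class ProteinEvidence:
    """Evidence available for one protein (all field names are hypothetical)."""
    has_drug_moa: bool = False              # annotated with a drug mechanism of action
    has_potent_compound: bool = False       # qualifying small-molecule bioactivity
    pubmed_score: float = 0.0               # text-mining publication score
    generif_count: int = 0                  # number of NCBI GeneRIFs
    antibody_count: int = 0                 # antibodies listed for the protein
    has_experimental_go_leaf: bool = False  # GO leaf term with experimental evidence
    has_omim_phenotype: bool = False        # confirmed OMIM phenotype


def dark_criteria_met(p: ProteinEvidence) -> int:
    """Count how many of the three Tdark criteria the protein meets."""
    return sum([
        p.pubmed_score < 5,       # PubMed text mining score of less than five
        p.generif_count <= 3,     # three or fewer GeneRIFs
        p.antibody_count <= 50,   # 50 or fewer antibodies
    ])


def classify_tdl(p: ProteinEvidence) -> str:
    """Assign a target development level (assumed precedence: Tclin, Tchem, Tbio, Tdark)."""
    if p.has_drug_moa:
        return "Tclin"
    if p.has_potent_compound:
        return "Tchem"
    # GO or OMIM evidence makes a protein Tbio regardless of the Tdark cutoffs.
    if p.has_experimental_go_leaf or p.has_omim_phenotype:
        return "Tbio"
    # Assumption: "two of three" means at least two criteria are met.
    if dark_criteria_met(p) >= 2:
        return "Tdark"
    return "Tbio"
```

For example, a protein with a publication score of 2, one GeneRIF and 200 antibodies meets two of the three criteria and would be labeled Tdark under these assumptions.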

Tudor’s first “take home message” was that there is a knowledge deficit: over 37% of the proteins remain understudied (the Tdark ones) and only about 10% of the proteome (Tclin and Tchem) can be targeted by potent small molecules. Are Tdark proteins underfunded because there is no scientific interest in this category, or is the lack of knowledge perpetuated by lack of funding? It is possible that the absence of high-quality, well-characterized molecular tools (i.e., antibodies or chemical probes) may be a root cause for this situation, but lack of tools leads to lack of interest, and lack of interest diminishes the probability of such tools being developed.

The patent literature is also of interest. Almost half of patent bioactivity data are never published elsewhere, and compounds may appear in patents two to four years before they appear in the literature. The SureChEMBL team has annotated the SureChEMBL patent corpus with gene and disease terms. Looking at patents between 2001 and 2013, they processed a set of 99 approved patents of interest to the Illuminating the Druggable Genome (IDG) consortium. The bioactivity data from these 99 patents were manually extracted: 20,941 activity measurements for 11,358 compounds, and 1,134 assays. These data are already uploaded into ChEMBL 23. The patent data extraction exercise uncovered data for seven IDG Phase 2 targets, and these data advance the TDLs of two targets (GPR6 and HCAR1) from Tbio to Tchem.

Anne Hersey of ChEMBL has estimated that more than 50% of the data from patents do not end up in peer-reviewed papers. IDG, Open Targets, BindingDB, and others could collectively, in a precompetitive manner, mine data from patents (if necessary, for only terminated projects, or out-of-patent drugs) and upload these data into ChEMBL and Pharos. Pharos [36] is the user interface to the Knowledge Management Center (KMC) for the IDG program funded by the NIH.

Approximately one-third of all mammalian genes are essential for life. Phenotypes resulting from knockouts of these genes in mice have provided insight into gene function and congenital disorders. The International Mouse Phenotyping Consortium (IMPC) has published research on the high-throughput discovery of novel developmental phenotypes [37]. They identified 2,788 genes with 8,241 significant phenotype calls in 25 major categories. The promise of the IMPC annotations is illustrated by examining the definite and clear links between human neurological and behavioral disorders (191 human genes) and the corresponding gene knockout mouse neurological and behavioral phenotypes. The majority of these links are for schizophrenia, Alzheimer’s disease, epilepsy, and amyotrophic lateral sclerosis. Several rare diseases are also associated with these genes.

Of 119 Tdark genes that KMC prioritized for IMPC, 45 mouse lines were produced, with 41 phenotypes observed. Knockouts of the Tdark kinase Alpk3 have increased embryonic and perinatal lethality, with the surviving adults displaying severe heart defects. Of 482 Tbio genes submitted by KMC, 184 mouse lines were produced, with 145 phenotypes observed. Knockouts of the Tbio GPCR Adgrd1 display reproductive defects. (These are Tdark and Tbio statistics as of April 2017.) Tudor commented: “If you don’t know very much to begin with, don’t expect to learn a lot quickly.”

Data from Cristian Bologa suggest that on average it takes 15-20 years for a Tdark protein to bear fruit. The leptin receptor was Tdark in 1995, but led to an approved drug in 2014. The smoothened receptor was Tdark in 1997, and a drug was launched in 2012. Tudor gave several other examples. There is room for improvement in research funding. Text mining of all NIH grants for the period 2000-2015 suggests that 8,858 proteins received zero NIH funding. Of these, 6,051 are Tdark, and 2,616 are Tbio. This is to be expected, but 119 are Tchem and 72 are Tclin. Possible explanations are that these are old drug targets or that the research was funded elsewhere. (Data from funding sources other than NIH are not available.) Pharma and academia could pay more attention to these 8,858 underfunded proteins.

Tudor’s second take home message was that just because something is ignored it does not mean it lacks importance. Understudied proteins need funding and patience. Based on current evidence, IMPC has the most concerted Tdark exploration approach.

DrugCentral (http://drugcentral.org) is an open access online drug compendium [38] integrating structure, bioactivity, regulatory information, pharmacologic actions, and indications for active pharmaceutical ingredients approved by regulatory agencies. It integrates content for active ingredients with pharmaceutical formulations, indexing drugs and drug label annotations, and complementing similar resources available online. Tudor’s team used it initially to find how many drugs there are, but they also wanted to know how many drug targets there are. They have studied innovation patterns per therapeutic area [39]:

Drugs distributed by Anatomical Therapeutic Chemical (ATC) codes (levels 1-2). Concentric rings indicate ATC levels. Histograms represent the number of drugs distributed per year of first approval.

They have also examined the commercial impact of target classes by evaluating data from IMS Health on drug sales from 75 countries, aggregated over a five-year period (2011–2015). After excluding categories such as homeopathic medicines, they identified 51,095 unique products, and mapped them to 1,069 active pharmaceutical ingredients (APIs) from DrugCentral, with sales corrected by the number of APIs per product, then by the number of Tclin targets per API. The most lucrative target class from a therapeutic perspective was G-protein coupled receptors (GPCR, 27.42% market share). Tudor also tabulated the top 20 targets by revenue. His third take home message was that there are many unexplored opportunities. By his conservative estimate (about 15,000 disease concepts, and about 2,500 unique drug indications), we address about 15% of human diseases with therapeutic agents.
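The two-step sales correction described above (first by the number of APIs per product, then by the number of Tclin targets per API) might look something like the following sketch. It assumes an equal split at each step, which the talk did not spell out, and every name and number in it is hypothetical; it is not the analysis run on the IMS Health data.

```python
from collections import defaultdict


def apportion_sales(products, api_targets, target_class):
    """Split each product's sales equally across its APIs, then equally
    across each API's Tclin targets, and total the result per target class.

    products:     list of (sales, [api, ...]) tuples
    api_targets:  dict mapping api -> list of Tclin targets
    target_class: dict mapping target -> target class name
    """
    totals = defaultdict(float)
    for sales, apis in products:
        if not apis:
            continue
        per_api = sales / len(apis)          # correction 1: APIs per product
        for api in apis:
            targets = api_targets.get(api, [])
            if not targets:
                continue                     # no mapped Tclin target: skipped here
            per_target = per_api / len(targets)  # correction 2: targets per API
            for target in targets:
                totals[target_class[target]] += per_target
    return dict(totals)


def market_share(totals):
    """Convert per-class totals into percentage shares."""
    grand_total = sum(totals.values())
    if grand_total == 0:
        return {}
    return {cls: 100.0 * value / grand_total for cls, value in totals.items()}


# Toy data, invented purely for illustration.
products = [(100.0, ["api_a"]), (60.0, ["api_a", "api_b"])]
api_targets = {"api_a": ["T1", "T2"], "api_b": ["T3"]}
target_class = {"T1": "GPCR", "T2": "Kinase", "T3": "GPCR"}

totals = apportion_sales(products, api_targets, target_class)
shares = market_share(totals)
```

In this sketch, sales for an API with no mapped Tclin target are dropped rather than redistributed; a real analysis would have to decide how to treat that case.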

It has been said that the absence of a quantitative language is the flaw of biological research [40] or “the more facts we learn the less we understand”. Again, when little is known, we should not expect knowledge to accumulate quickly. Separation by organ and cell is a conceptual fallacy. Medicine maintains this separation out of necessity: by organ (e.g., cardiology or ophthalmology), and by disease category (e.g., oncology or infection). NIH Institutes are organized in a similar way. Many pharmaceutical companies are organized by therapeutic area. Yet genes, proteins and pathways do not observe such separation. The impact of this “mental divide” in science has yet to be understood.

A. B. Jensen et al. have studied disease correlations and temporal disease progression (trajectories) [41] on a large scale over 15 years, and grouped 1,171 significant trajectories into temporal patterns centered on a small number of early diagnoses that are central to disease progression. Hence it is important to focus on early diagnoses in order to mitigate the risk of adverse patient outcomes. The authors suggest such trajectory analyses may be useful for predicting and preventing future diseases of individual patients. Using data from the Cerner HealthFacts database, Tudor’s team has found that the top diseases preceding Alzheimer’s (by five years or more) are essential hypertension, hyperlipidemia, Type 2 diabetes mellitus, hypercholesterolemia, and coronary atherosclerosis. For renal failure, the top diseases over the previous five years are essential hypertension, heart failure, angina pectoris, chronic heart disease, and diabetes mellitus.
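As a rough illustration of how "top diseases prior to" an index diagnosis could be tallied from patient records, the sketch below counts, for each patient who has the index disease, the other diagnoses recorded at least a given number of years before the first index diagnosis. The record layout, function name, and toy data are assumptions made for this example; it does not reflect the Cerner HealthFacts schema, the trajectory method of Jensen et al., or the statistical testing a real study would need.

```python
from collections import Counter
from datetime import date, timedelta


def top_prior_diagnoses(records, index_disease, min_years=5, top_n=5):
    """Rank diagnoses seen at least `min_years` before the index disease.

    records: dict mapping patient id -> list of (diagnosis, date) tuples
    Returns a list of (diagnosis, patient_count) pairs, most frequent first.
    """
    window = timedelta(days=365 * min_years)  # approximate years as 365 days
    counts = Counter()
    for diagnoses in records.values():
        index_dates = [d for name, d in diagnoses if name == index_disease]
        if not index_dates:
            continue  # patient never received the index diagnosis
        first_index = min(index_dates)
        # Use a set so each diagnosis is counted once per patient.
        prior = {
            name
            for name, d in diagnoses
            if name != index_disease and first_index - d >= window
        }
        counts.update(prior)
    return counts.most_common(top_n)


# Toy records, invented purely for illustration.
records = {
    "p1": [
        ("essential hypertension", date(2005, 1, 1)),
        ("hyperlipidemia", date(2006, 1, 1)),
        ("Alzheimer's disease", date(2012, 1, 1)),
    ],
    "p2": [
        ("essential hypertension", date(2004, 6, 1)),
        ("influenza", date(2010, 1, 1)),
        ("Alzheimer's disease", date(2011, 6, 1)),
    ],
    "p3": [("essential hypertension", date(2001, 1, 1))],
}

ranking = top_prior_diagnoses(records, "Alzheimer's disease")
```

A frequency count like this only shows which diagnoses commonly come first; it says nothing about causality, which is why the text goes on to argue for following up with gene-level investigation.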

Diseases are concepts. They lack physical manifestation outside patients, so the search for cures has to be patient-centered [42]. Animal models should be combined with mining of patient data. We ought to use electronic health record data to prioritize targets for further drug discovery. For example, we should identify genes associated with diseases that precede Alzheimer’s and investigate possible causality. Such priorities could be disease-specific, or phenotype-specific.

It is time to acknowledge that target prioritization for drug discovery is precompetitive knowledge. The pharmaceutical industry reward system is based on patents, which are awarded for drugs, not targets. Finding a good target leads to the “me-too” phenomenon. It is time to pool resources on targets, team up with Open Targets and create a Target Selection Consortium, partnering industry with academia. “Double blind” studies could be cosponsored, to avoid the reproducibility crisis. IDG KMC is seeking new knowledge.


  35. Pletscher-Frankild, S.; Palleja, A.; Tsafou, K.; Binder, J. X.; Jensen, L. J. DISEASES: Text mining and data integration of disease-gene associations. Methods (Amsterdam, Neth.) 2015, 74, 83-89.
  36. Nguyen, D.-T.; Mandava, G.; Sheils, T.; Simeonov, A.; Southall, N.; Jadhav, A.; Guha, R.; Mathias, S.; Bologa, C.; Holmes, J.; Liu, G.; Mani, S.; Patel, J.; Sklar, L. A.; Ursu, O.; Waller, A.; Yang, J.; Oprea, T. I.; Brunak, S.; Jensen, L. J.; Fernandez, N.; Ma'ayan, A.; Rouillard, A. D.; Gaulton, A.; Hersey, A.; Karlsson, A.; Overington, J.; Liu, G.; Mehta, S.; Schurer, S.; Vidovic, D.; Mehta, S.; Patel, J.; Schurer, S.; Vidovic, D.; Sklar, L. A.; Waller, A. Pharos: Collating protein information to shed light on the druggable genome. Nucleic Acids Res. 2017, 45 (D1), D995-D1002.
  37. Dickinson, M. E.; Flenniken, A. M.; Ji, X.; Teboul, L.; Wong, M. D.; White, J. K.; Meehan, T. F.; Weninger, W. J.; Westerberg, H.; Adissu, H.; Baker, C. N.; Bower, L.; Brown, J. M.; Caddle, L. B.; Chiani, F.; Clary, D.; Cleak, J.; Daly, M. J.; Denegre, J. M.; Doe, B.; Dolan, M. E.; Edie, S. M.; Fuchs, H.; Gailus-Durner, V.; Galli, A.; Gambadoro, A.; Gallegos, J.; Guo, S.; Horner, N. R.; Hsu, C.-W.; Johnson, S. J.; Kalaga, S.; Keith, L. C.; Lanoue, L.; Lawson, T. N.; Lek, M.; Mark, M.; Marschall, S.; Mason, J.; McElwee, M. L.; Newbigging, S.; Nutter, L. M. J.; Peterson, K. A.; Ramirez-Solis, R.; Rowland, D. J.; Ryder, E.; Samocha, K. E.; Seavitt, J. R.; Selloum, M.; Szoke-Kovacs, Z.; Tamura, M.; Trainor, A. G.; Tudose, I.; Wakana, S.; Warren, J.; Wendling, O.; West, D. B.; Wong, L.; Yoshiki, A.; McKay, M.; Urban, B.; Lund, C.; Froeter, E.; LaCasse, T.; Mehalow, A.; Gordon, E.; Donahue, L. R.; Taft, R.; Kutney, P.; Dion, S.; Goodwin, L.; Kales, S.; Urban, R.; Palmer, K.; Pertuy, F.; Bitz, D.; Weber, B.; Goetz-Reiner, P.; Jacobs, H.; Le Marchand, E.; El Amri, A.; El Fertak, L.; Ennah, H.; Ali-Hadji, D.; Ayadi, A.; Wattenhofer-Donze, M.; Jacquot, S.; Andre, P.; Birling, M.-C.; Pavlovic, G.; Sorg, T.; Morse, I.; Benso, F.; Stewart, M. E.; Copley, C.; Harrison, J.; Joynson, S.; Guo, R.; Qu, D.; Spring, S.; Yu, L.; Ellegood, J.; Morikawa, L.; Shang, X.; Feugas, P.; Creighton, A.; Castellanos Penton, P.; Danisment, O.; Griggs, N.; Tudor, C. L.; Green, A. L.; Icoresi Mazzeo, C.; Siragher, E.; Lillistone, C.; Tuck, E.; Gleeson, D.; Sethi, D.; Bayzetinova, T.; Burvill, J.; Habib, B.; Weavers, L.; Maswood, R.; Miklejewska, E.; Woods, M.; Grau, E.; Newman, S.; Sinclair, C.; Brown, E.; Ayabe, S.; Iwama, M.; Murakami, A.; MacArthur, D. G.; Tocchini-Valentini, G. P.; Gao, X.; Flicek, P.; Bradley, A.; Skarnes, W. C.; Justice, M. J.; Parkinson, H. E.; Moore, M.; Wells, S.; Braun, R. E.; Svenson, K. L.; de Angelis, M. H.; Herault, Y.; Mohun, T.; Mallon, A.-M.; Henkelman, R. M.; Brown, S. D. M.; Adams, D. J.; et al. High-throughput discovery of novel developmental phenotypes. Nature (London, U. K.) 2016, 537 (7621), 508-514.
  38. Ursu, O.; Holmes, J.; Bologa, C. G.; Yang, J. J.; Mathias, S. L.; Nelson, S. J.; Oprea, T. I.; Knockel, J. DrugCentral: online drug compendium. Nucleic Acids Res. 2017, 45 (D1), D932-D939.
  39. Santos, R.; Ursu, O.; Gaulton, A.; Bento, A. P.; Donadi, R. S.; Bologa, C. G.; Karlsson, A.; Al-Lazikani, B.; Hersey, A.; Oprea, T. I.; Overington, J. P. A comprehensive map of molecular drug targets. Nat. Rev. Drug Discovery 2017, 16 (1), 19-34.
  40. Lazebnik, Y. Can a biologist fix a radio? Or, what I learned while studying apoptosis. Cancer Cell 2002, 2 (3), 179-182.
  41. Jensen, A. B.; Moseley, P. L.; Oprea, T. I.; Ellesoee, S. G.; Eriksson, R.; Schmock, H.; Jensen, P. B.; Jensen, L. J.; Brunak, S. Temporal disease trajectories condensed from population-wide registry data covering 6.2 million patients. Nat. Commun. 2014, 5, 4022.
  42. Horrobin, D. F. Opinion: Modern biomedical research: an internally self-consistent universe with little contact with medical reality? Nat. Rev. Drug Discovery 2003, 2 (2), 151-154.