Shedding Light on the Dark Genome: Methods, Tools & Case Studies

In 2014 the NIH initiated a program entitled “Illuminating the Druggable Genome” (IDG) with the goal of improving our understanding of the properties and functions of proteins that are currently unannotated within the four most commonly drug-targeted protein families: GPCRs, ion channels, nuclear receptors, and kinases. The symposium entitled “Shedding Light on the Dark Genome: Methods, Tools & Case Studies” was organized by Rajarshi Guha (NIH National Center for Advancing Translational Sciences (NCATS)) and Tudor Oprea (U. New Mexico). It was designed to highlight recent work on data resources and methods that can provide insight into dark targets: protein targets that are unstudied or understudied in the public literature. The existence of these dark targets represents a knowledge deficit and, together with the fact that only a small fraction of the proteome is currently therapeutically targeted, highlights the need for resources that can shed light on them.

Tudor Oprea (U. New Mexico) was the first speaker and presented an overview of the IDG Knowledge Management Center (KMC), highlighting the diverse data sources and data types that have been integrated to construct the Target Central Resource Database (TCRD). In addition to collating data sources, the TCRD includes the results of text mining of drug labels, patents, and medical literature. Oprea went on to describe the Target Development Level (TDL), which classifies protein targets according to the level of knowledge available about them. He also described the front-end to the KMC, namely Pharos, a Web portal that gives users access to the TCRD data. Based on an analysis of TCRD data, Oprea concluded that only 38% of the human proteome is currently functionally annotated, that less than 3% of the proteome is therapeutically targeted, and that only a quarter of all diseases are targeted by therapeutic agents.
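
The TDL idea can be pictured as an ordered series of evidence checks, from the best-studied targets down to the dark ones. The sketch below is a deliberately simplified illustration in Python. The four level names (Tclin, Tchem, Tbio, Tdark) are those used by the IDG program, but the boolean fields and the one-flag-per-level rules are assumptions made for illustration and are much coarser than the criteria applied in the TCRD.

```python
from dataclasses import dataclass


@dataclass
class TargetEvidence:
    """Hypothetical summary of what is known about one protein target."""
    has_approved_drug: bool = False          # assumed flag: an approved drug acts on it
    has_potent_compound: bool = False        # assumed flag: a potent small molecule is known
    has_functional_annotation: bool = False  # assumed flag: some biology is characterized


def development_level(evidence: TargetEvidence) -> str:
    """Return the highest level supported by the evidence (simplified rules)."""
    if evidence.has_approved_drug:
        return "Tclin"
    if evidence.has_potent_compound:
        return "Tchem"
    if evidence.has_functional_annotation:
        return "Tbio"
    return "Tdark"  # nothing above applies: an understudied, "dark" target


if __name__ == "__main__":
    print(development_level(TargetEvidence(has_potent_compound=True)))
    print(development_level(TargetEvidence()))
```

The checks are ordered so that a target with several kinds of evidence is assigned the most developed level it qualifies for.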

The next speaker, Prudence Mutowo (European Bioinformatics Institute (EBI)), spoke about the use of the ChEMBL and SureChEMBL resources to track drug targets. The EBI is a collaborator in the IDG program, and these resources are key components of the TCRD data source developed at the University of New Mexico. They provide curated information on small molecule bioactivity mined from the medicinal chemistry literature (ChEMBL) and on small molecules extracted from the patent literature (SureChEMBL). Mutowo highlighted the challenges involved in the curation process, especially for understudied targets, which are the most relevant to the IDG. She concluded with specific examples of how the curation has proven useful in shedding light on dark targets.

Stephan Schürer (U. Miami) was the next speaker, and his talk addressed the development of the BioAssay Ontology (BAO) and the Drug Target Ontology (DTO) and their role in providing a framework to support data integration, clarification, and mining. He described how these ontologies have been employed in multiple large-scale projects such as the BioAssay Research Database (BARD) and, most recently, the Library of Integrated Network-Based Cellular Signatures (LINCS). He stressed the importance of standards in data integration pipelines and gave specific examples of challenges faced in the LINCS and IDG projects. He concluded with examples of how the DTO was used in the IDG project and pointed out that these ontologies provide a robust framework to represent, integrate, model, and query diverse drug discovery data generated in different projects.

The next speaker was Anders Dohlman (Mt. Sinai School of Medicine), who described the development of classification models to predict adverse cardiovascular events caused by tyrosine kinase inhibitors (TKIs). After providing an overview of the cardiovascular effects of TKIs, he described the role of the FDA Adverse Event Reporting System (FAERS) and drug labels as sources of adverse event information on known TKIs. He then described a random forest model developed to predict adverse events based on structural features of the TKIs, and highlighted how c-Kit mutants and their distinct binding patterns correlated with the occurrence of hypertension. Finally, he discussed the use of a method recently developed by his group, the characteristic direction, applied to the LINCS L1000 dataset to generate biomarker sets for the identification of cardiac adverse events.

Following Dohlman, Rajarshi Guha (NIH NCATS) talked about Pharos (https://pharos.nih.gov/idg/index), the front-end for the KMC. Building on the presentation by Oprea, Guha described the architecture of the application, specifically pointing out the design decisions that were taken to address specific classes of users: biologists, computational scientists, and funders. He described how the underlying REST API provides direct, programmatic access to the TCRD data and also serves as the basis for the graphical interface. He also highlighted specific features such as the target dossier, which enables users to collect information on targets, diseases, and compounds as they browse and to store it for later analysis. Given the diverse data sources and types, he then presented the various visualization methods implemented in Pharos to enable efficient summary and drill-down when required.

Next, Meir Glick (Merck) described the strategy at Merck to enable target identification and validation, based on integrated screening, synthesis, and informatics. He highlighted the role of informatics in bridging multiple data types, whose integration is vital for linking small molecules to phenotypes. He then described examples of the harmonization of small molecules, targets, and activities. He concluded his presentation by describing the concept of dark chemical matter, that is, small molecules that are inactive in multiple assays, and how they could represent possible tool compounds against the right system.

The penultimate speaker was Olexander Isayev (U. North Carolina), who described an approach to predicting kinase activity profiles using deep convolutional neural networks. He pointed out that traditional activity profile models are built separately for each kinase, with the individual predictions then concatenated into a profile. In contrast, the work he presented involved developing a multi-task learning model that uses data on multiple kinases simultaneously during training. He highlighted how his model exhibits very good training statistics compared to a random forest model.
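
The contrast between per-kinase models and a multi-task model can be illustrated with a minimal sketch. The Python code below is not the architecture from the talk: it is a tiny fully connected network (no convolutions, no bias terms) written with the standard library only, and all names, sizes, and the toy data are assumptions. What it does show is the core of the multi-task idea: one hidden layer is shared by every task, each task has its own output weights, and compounds with a missing measurement for a given task (marked None) are skipped for that task but still contribute to the others.

```python
import math
import random


def hidden_layer(W1, x):
    """Shared representation: one tanh unit per row of W1."""
    return [math.tanh(sum(w * xi for w, xi in zip(row, x))) for row in W1]


def predict(params, x):
    """Predict all tasks at once from the shared hidden layer."""
    W1, W2 = params
    h = hidden_layer(W1, x)
    return [sum(w * hj for w, hj in zip(row, h)) for row in W2]


def train_multitask(X, Y, hidden=8, lr=0.02, epochs=100, seed=0):
    """Stochastic gradient descent on squared error over observed labels.

    X: list of feature vectors; Y: list of per-task targets, None if missing.
    Returns (W1, W2) and the mean squared error recorded in each epoch.
    """
    rnd = random.Random(seed)
    d, t = len(X[0]), len(Y[0])
    W1 = [[rnd.gauss(0, 0.1) for _ in range(d)] for _ in range(hidden)]
    W2 = [[rnd.gauss(0, 0.1) for _ in range(hidden)] for _ in range(t)]
    losses = []
    for _ in range(epochs):
        total, count = 0.0, 0
        for x, y in zip(X, Y):
            h = hidden_layer(W1, x)
            grad_h = [0.0] * hidden
            for k, target in enumerate(y):
                if target is None:      # missing label: task k skips this sample
                    continue
                err = sum(w * hj for w, hj in zip(W2[k], h)) - target
                total += err * err
                count += 1
                for j in range(hidden):
                    grad_h[j] += err * W2[k][j]   # every task feeds the shared layer
                    W2[k][j] -= lr * err * h[j]   # task-specific output weights
            for j in range(hidden):
                g = grad_h[j] * (1.0 - h[j] ** 2)  # derivative of tanh
                for i in range(d):
                    W1[j][i] -= lr * g * x[i]
        losses.append(total / count)
    return (W1, W2), losses
```

A per-kinase baseline would instead train one such network per column of Y, so that nothing learned for one kinase could inform another. In the shared version, the hidden layer is updated by the errors of every task that has a label for the current compound.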

The final speaker was Haobo Gu (U. Tennessee, Knoxville), who described a study of intrinsically disordered regions of proteins, designating them as the dark matter of the proteome. He presented an approach that uses sequence length and intrinsic disorder to cluster sequences, and went on to show how this clustering distinguished eukaryotes from prokaryotes, along with various other groupings. He concluded that the proposed method is capable of clearly identifying the evolutionary status of the organisms.

Rajarshi Guha, National Center for Advancing Translational Sciences