Integrative Chemogenomics Knowledge Mining using NIH Open Access Resources

The symposium took place on Monday, September 9, 2013, from 8:50 AM until approximately noon in the Indiana Convention Center in downtown Indianapolis. Five speakers, primarily from member Centers of the NIH Molecular Libraries Program (MLP), made presentations to 40-60 attendees. The topic of the session was “Open Access resources in chemogenomics,” with a particular emphasis on the BioAssay Research Database (BARD), a new software project that aims to integrate and contextualize 10 years of MLP data resident in PubChem.

Brief opening remarks by Tudor Oprea (University of New Mexico) introduced the BARD project and the outreach goal of the symposium: raising awareness of BARD in the cheminformatics and chemogenomics communities.

Rajarshi Guha (National Institutes of Health), one of the primary leaders of BARD development at the NIH, presented a technical perspective on the application programming interface (API) underlying BARD. After outlining the high-level architecture of BARD and its main components, he went into the details of the RESTful API that BARD provides to scientists and developers. The API gives programmatic access to all the entities stored in the BARD warehouse, such as assays, projects, and experiments. Currently the API serves JSON and offers a variety of system-level resources that describe the available resources, the schema, and so on. He then showed how the REST resource hierarchy can be extended by user-contributed plug-ins. After describing the workflow for plug-in development, he presented a few of the plug-ins that are currently available, including the BADAPPLE promiscuity method (University of New Mexico) and the SMARTCyp prediction tool (Technical University of Denmark). He stressed the flexibility of the plug-in architecture, which allows a plug-in to accept any data type (strings, files) and to output arbitrary data types and formats (plain text, HTML, SVG, and so on). As a result, plug-in functionality can range from a simple descriptor plug-in (taking a SMILES and returning a number) to a fully fledged, HTML5-rich, interactive interface to the API and database. Guha ended the presentation by stressing that BARD is more than just a data store: it is a platform that co-locates data with the methods to analyze, annotate, and interpret the data. Combined with the extensibility features built into the platform, BARD represents a hub for collaborations between experimentalists and computational scientists.
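The idea of extending a resource hierarchy with plug-ins can be illustrated with a minimal Python sketch. This is not BARD's API or plug-in interface: the registry, the `register` decorator, the `/plugins/charcount` path, and the toy descriptor are all hypothetical names chosen for illustration. It only shows the simplest case mentioned in the talk, a descriptor plug-in that takes a SMILES string and returns a number, served as JSON.

```python
import json
from typing import Callable, Dict

# Hypothetical registry: maps a resource path to the function that serves it.
# None of these names come from BARD; they are assumptions for illustration.
PLUGINS: Dict[str, Callable[[str], str]] = {}


def register(path: str) -> Callable:
    """Decorator that mounts a handler under a (hypothetical) plug-in path."""
    def wrap(handler: Callable[[str], str]) -> Callable[[str], str]:
        PLUGINS[path] = handler
        return handler
    return wrap


@register("/plugins/charcount")
def char_count(smiles: str) -> str:
    """Toy descriptor: takes a SMILES string and returns a number as JSON.

    It only counts alphabetic characters, which is NOT a chemically
    meaningful descriptor; a real plug-in would call a cheminformatics toolkit.
    """
    value = sum(1 for ch in smiles if ch.isalpha())
    return json.dumps({"smiles": smiles, "value": value})


def dispatch(path: str, payload: str) -> str:
    """Route a request to the plug-in registered at `path`."""
    if path not in PLUGINS:
        return json.dumps({"error": "unknown resource: " + path})
    return PLUGINS[path](payload)
```

Calling `dispatch("/plugins/charcount", "CCO")` returns a JSON string carrying the input and the computed value. Because handlers simply map a payload to a response, a richer plug-in could return HTML or SVG text through the same mechanism.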

Alexander Tropsha (University of North Carolina) spoke about how his group developed a BARD plug-in that connects BARD to the Chembench online QSAR modeling system. He described the QSAR modeling workflow that the group has settled upon and highlighted key aspects of the workflow (including data cleaning and model validation) that are well defined in the Chembench suite of tools and will be made accessible via the BARD plug-in.

Tudor Oprea then presented a case study in which external data were highly curated and annotated with targets and other descriptors from the BioAssay Ontology (BAO), developed at the University of Miami. DrugMatrix, an open-access dataset available through the National Toxicology Program at NIEHS, was originally downloaded from ChEMBL in December 2012. This dataset required significant manual curation: assay details from the Eurofins Panlabs Assay Catalog needed to be matched with DrugMatrix data on record; targets needed further data mining (e.g., species, exact target annotation); and substrate/reference compound information needed completion for each biochemical and pharmacological screen. For example, two receptors (“Imidazole I2” and “Sigma 2”) and one enzyme (“phorbol ester”) needed re-mapping, the exact chemical structure for 11 compounds remains undetermined, and a total of 37 targets required additional curation. Attempts to compare DrugMatrix with another matrix-style dataset (CEREP Bioprint) illustrate why BARD needs assay ontologies: although a number of target-chemical pairs can be identified (e.g., by target UniProt ID), comparisons of numerical bioactivity values remain meaningless in the absence of assay similarity information (e.g., agonist vs. antagonist, radio-ligand binding vs. functional assay, etc.). A standardized research data format (RDF), as implemented in BARD, provides contextual information across assays using language familiar to research scientists and links back to established ontologies (e.g., BAO); it therefore offers a potential platform for a formal definition of assay similarity.
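The point about assay similarity can be made concrete with a small Python sketch. It is only an illustration under stated assumptions: the `Measurement` fields, the two annotation fields used as the similarity criterion, and the exact-match rule are hypothetical, and are not the BARD or BAO data model. The sketch matches records from two datasets on target and compound, then keeps only the pairs whose assay annotations agree, since only those values are meaningful to compare.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Measurement:
    """One bioactivity record; all field names are illustrative assumptions."""
    uniprot_id: str    # target identifier
    compound: str      # compound identifier
    assay_format: str  # e.g. "radioligand binding" or "functional"
    mode: str          # e.g. "agonist" or "antagonist"
    value: float       # reported activity; units assumed already harmonized


def comparable_pairs(
    a: List[Measurement], b: List[Measurement]
) -> List[Tuple[Measurement, Measurement]]:
    """Pair records on (target, compound), keeping only matching assay context."""
    # Index the second dataset by (target, compound) for quick lookup.
    index: Dict[Tuple[str, str], List[Measurement]] = {}
    for m in b:
        index.setdefault((m.uniprot_id, m.compound), []).append(m)

    pairs = []
    for m in a:
        for other in index.get((m.uniprot_id, m.compound), []):
            # Crude stand-in for assay similarity: exact annotation match.
            if (m.assay_format, m.mode) == (other.assay_format, other.mode):
                pairs.append((m, other))
    return pairs
```

An exact match on two fields is the simplest possible criterion. An ontology-backed annotation would allow graded similarity instead, for example by treating two assays as related when their terms share a parent class.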

Eric Dawson (Vanderbilt University), a key outreach coordinator for the BARD project, described how actively engaging end users (medicinal chemists and biologists) in BARD development has improved requirements gathering and user-interface elements. He emphasized the collaborative nature of the effort: MLP Centers are working together to combine research data management (RDM, Broad Institute) with the application programming interface (API) and database warehouse architecture (NIH Chemical Genomics Center, NCGC), while coordinating with an engaged user base of experienced scientists at participating Centers who bring industrial backgrounds to their current work in academia. He also described the development of a potential local, private installation of BARD behind an organization’s firewall, with deployment targeted at Vanderbilt’s High Throughput Screening (HTS) Center core and St. Jude Children’s Research Hospital (Kip Guy laboratory). Dawson argued that such a version of the database and tools would promote novel development of intellectual property and seed new collaborations between academic medical centers and the pharmaceutical industry.

Jeremy Yang (University of New Mexico) presented the BADAPPLE (BioActivity Data Associative Promiscuity Pattern Learning Engine) plug-in for BARD, an evidence-based estimator of scaffold promiscuity that relies on historical screening data to assign promiscuity to compound scaffolds on the basis of the performance of compounds containing those scaffolds. Importantly, as the first plug-in written for BARD, BADAPPLE provides a pathway for future plug-in developers to emulate. The BADAPPLE algorithm generates a score based on scaffold-family membership, which is derived solely from empirical BARD activity data. This score reflects both a pan-assay “batting average” and weighted evidence, with high scores indicative of highly promiscuous patterns. The score is “evidence-based,” meaning that the algorithm evaluates data “as is,” and score values are subject to change as new information becomes available. The BARD annotations and bioassay ontology enable improvements, extensions, and customizations for BADAPPLE. Somewhat surprisingly, 1.4% of the scaffolds (i.e., 1,979 scaffolds out of over 146,000, extracted from nearly 374,000 compounds) capture 50% of the bioactivity observed in 528 assays (over 30 million bioactivity observations or wells). BADAPPLE is available both as a BARD plug-in and as a web-based tool, and can be used to identify suspicious screening results.
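The notion of a “batting average” tempered by the weight of evidence can be sketched in a few lines of Python. This is not the BADAPPLE formula, which the talk summary does not give; the damping term `prior_weight` and both function names are assumptions made for illustration. The sketch scores each scaffold by the fraction of its tests that were active, and adds a constant to the denominator so that scaffolds with very few tests cannot reach a high score on thin evidence.

```python
from typing import Dict, List, Tuple


def batting_average(active: int, tested: int, prior_weight: float = 0.0) -> float:
    """Fraction of tests that were active, damped by `prior_weight`.

    `prior_weight` is a hypothetical evidence-weighting term: it is added to
    the denominator, so sparsely tested scaffolds score lower.
    """
    if active < 0 or tested < 0 or active > tested:
        raise ValueError("need 0 <= active <= tested")
    denominator = tested + prior_weight
    return active / denominator if denominator > 0 else 0.0


def rank_scaffolds(
    counts: Dict[str, Tuple[int, int]], prior_weight: float = 0.0
) -> List[str]:
    """Order scaffold IDs from most to least promiscuous under this toy score.

    `counts` maps a scaffold ID to (active, tested) totals across assays.
    """
    scores = {
        scaffold: batting_average(active, tested, prior_weight)
        for scaffold, (active, tested) in counts.items()
    }
    return sorted(scores, key=lambda s: scores[s], reverse=True)
```

With no damping, a scaffold that was active in its single test outranks one that was active in 50 of 100 tests; with a modest `prior_weight`, the well-tested scaffold ranks first. Scores of this kind also shift as new screening results are added, which matches the “evidence-based” behavior described above.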

Following the formal presentations, Paul Clemons (Broad Institute) walked through recent demonstration screenshots of BARD web-client development, highlighting features that will be available on BARD's public release later this fall. BARD web query provides a simple search with auto-complete that guides users toward controlled vocabulary terms and yields tabbed search results for Projects, Assays, and Compounds. Facet-based browsing allows rapid filtration of results based on additional controlled vocabulary terms. Projects and Assays can be navigated to the level of individual compound results, and search results can be saved to a Query Cart for further analyses, including Molecular Spreadsheet views and linked hierarchy visualizations that permit rapid assessment of compound performance across target classes, phenotypes or assay types.

Following the presentations, the organizers and speakers formed a panel to engage the audience in Q&A and discussion. Much discussion was directed at how to sustain BARD as a community resource into the future, both from the standpoint of continued funding of the project beyond its initial two-year timeframe, and in terms of community adoption of BARD as a useful tool that will promote deposit of non-MLP data to BARD in the future.

Paul Clemons, Eric Dawson, Rajarshi Guha, Tudor Oprea, Symposium Organizers and Participants

Slide presentations are at

BARD architecture