Technical Program with Abstracts

ACS Chemical Information Division (CINF)
251st ACS National Meeting, Spring 2016
San Diego, CA (March 13-17, 2016)

CINF Symposia

Elsa Alvaro, Erin Davis, Program Chairs

[Created Fri Feb 19 2016, Subject to Change; Check ACS Online Program for Latest Changes]

CINF: Tomayto vs. Tomahto: Overcoming Incompatibilities in Scientific Data 8:30am - 12:00pm
Sunday, March 13
Room 25B - San Diego Convention Center
David Deng, Organizing
David Deng, Presiding
8:30am-8:35am Introductory Remarks
8:35am-9:05am CINF 1: Relational database file can take us beyond the plain text file format
T O'Donnell

gNova, San Diego, California, United States

I propose a relational database file and associated table schema that can replace plain text chemical file formats for sharing chemical structures and data. This proposal uses the open-source relational database engine SQLite. I will argue that a computer program reading such a file should use the Structured Query Language (SQL) to select data from database tables as needed, rather than reading it all at once into the program as is typically done when reading formatted text files. I will show a diagram of the database table schema. Example computer code for selecting from, and inserting into the tables from scratch or from existing files will be presented. I will discuss issues of file size and access speed.

Using a relational database allows extensions without alterations to a text file format. There are over 110 file formats used to store chemical structures, properties and data. SDF and PDB formats are very common and could be considered a standard. However, ad-hoc variants of these formats, intended to implement new features, cause errors in programs that rely on a standard format. The proposed SQLite database file could replace SDF, PDB and perhaps all those 110 file formats. I will discuss how a relational database maintains data integrity in a way that is more robust than a text-based file format. The relational database approach will be compared to the use of XML/CML formats and the Resource Description Framework (RDF) in the semantic web.

The database schema contains several basic tables describing molecules, atoms, bonds and properties. These tables reside within a single file that may contain multiple molecules. The tables may be fully or sparsely populated, depending on how much information is present or relevant for each molecule. Updates to these tables can be made as more data becomes available. Extensions to the database schema are possible in order to accommodate new types of data, for example internal/z-matrix coordinates, spectroscopic data or biological assay data. These are implemented not as new or modified basic table columns, but as additional tables with foreign key references to the basic tables.
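To make the schema idea concrete, here is a minimal sketch using Python's standard sqlite3 module. It is not the schema of the project described in this abstract: every table and column name is a hypothetical stand-in, and the assay table is only an illustration of how an extension could be added as a separate table with a foreign key rather than as new columns.

```python
import sqlite3

# Hypothetical schema: names are illustrative assumptions, not the
# table definitions published by the project in the abstract.
SCHEMA = """
CREATE TABLE molecule (
    mol_id INTEGER PRIMARY KEY,
    name   TEXT
);
CREATE TABLE atom (
    mol_id   INTEGER NOT NULL REFERENCES molecule(mol_id),
    atom_idx INTEGER NOT NULL,
    element  TEXT NOT NULL,
    x REAL, y REAL, z REAL,  -- may stay NULL (sparsely populated)
    PRIMARY KEY (mol_id, atom_idx)
);
CREATE TABLE bond (
    mol_id     INTEGER NOT NULL,
    atom_a     INTEGER NOT NULL,
    atom_b     INTEGER NOT NULL,
    bond_order INTEGER NOT NULL,
    FOREIGN KEY (mol_id, atom_a) REFERENCES atom(mol_id, atom_idx),
    FOREIGN KEY (mol_id, atom_b) REFERENCES atom(mol_id, atom_idx)
);
CREATE TABLE property (
    mol_id INTEGER NOT NULL REFERENCES molecule(mol_id),
    name   TEXT NOT NULL,
    value  TEXT
);
-- Extension: a new kind of data gets its own table, not new columns.
CREATE TABLE assay_result (
    mol_id  INTEGER NOT NULL REFERENCES molecule(mol_id),
    assay   TEXT NOT NULL,
    ic50_nm REAL
);
"""


def build_example(path=":memory:"):
    """Create the tables and store one molecule (water) as a demonstration."""
    conn = sqlite3.connect(path)
    conn.execute("PRAGMA foreign_keys = ON")  # enforce the references above
    conn.executescript(SCHEMA)
    with conn:  # commit on success, roll back on error
        conn.execute("INSERT INTO molecule VALUES (1, 'water')")
        conn.executemany(
            "INSERT INTO atom (mol_id, atom_idx, element) VALUES (?, ?, ?)",
            [(1, 0, "O"), (1, 1, "H"), (1, 2, "H")],
        )
        conn.executemany(
            "INSERT INTO bond VALUES (?, ?, ?, ?)",
            [(1, 0, 1, 1), (1, 0, 2, 1)],
        )
    return conn


def elements(conn, mol_id):
    """Select only what is needed instead of reading the whole file."""
    rows = conn.execute(
        "SELECT element FROM atom WHERE mol_id = ? ORDER BY atom_idx",
        (mol_id,),
    )
    return [row[0] for row in rows]
```

With foreign keys enabled, an attempt to insert a bond that points at a missing atom is rejected by the database itself, which is one form of the integrity checking a plain text format cannot offer.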

The cross-platform format of SQLite, which is portable across 32-bit, 64-bit, big-endian and little-endian architectures, obviates any data incompatibilities caused by differences in hardware and software.

An open-source project provides a wiki, table definitions and example computer code.

9:05am-9:35am CINF 2: Standard JSON molecule, a solution to a cross-vendor molecule file format?

Brian Cole

OpenEye Scientific Software, Santa Fe, New Mexico, United States
Sharing information in cheminformatics and molecular modeling is still trickier than it needs to be. There are numerous file formats to navigate, each with its own pros and cons that take years to master. And with the advent of cloud computing and service-based architectures, interoperability between various software packages is becoming more important.

OpenEye has developed a JSON representation of molecules to be able to seamlessly integrate with web technologies. We have also written a specification of that format and would like to work with the community to gather feedback and make it a standard others can rely on. That said, we are not blind to history, file formats, and standardization processes, so we are proposing the following guidelines for the process:

- Minimal: no cheminformatics necessary to produce/consume, and thus also human editable, though storing that information is encouraged to enable the next item.
- Interoperability: the extensibility of JSON should really encourage the use for painful modeling tasks like sharing charges and radii between packages.
- Tested: reference implementations are necessary to prove utility, but are a bad idea as a specification. A proper document and most importantly, a test suite, are how successful standards are created. Any implementation that claims to be compliant must pass the test suite.
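As a rough illustration of the three guidelines, the sketch below builds a molecule record with nothing but Python's json module. The key names (`atoms`, `bonds`, `extensions`) and the charge values are invented placeholders for this example and are not the OpenEye specification; the checks stand in for the kind of assertions a compliance test suite might contain.

```python
import json

# Hypothetical layout: these keys are NOT the proposed specification,
# only an example of a minimal, hand-editable molecule record.
ethanol = {
    "name": "ethanol",
    "atoms": [{"element": "C"}, {"element": "C"}, {"element": "O"}],
    "bonds": [
        {"begin": 0, "end": 1, "order": 1},
        {"begin": 1, "end": 2, "order": 1},
    ],
    # Extension data shared between packages; the numbers are made up.
    "extensions": {"partial_charges": [-0.1, 0.1, -0.6]},
}


def validate(mol):
    """Structural checks only; no cheminformatics toolkit is required."""
    n_atoms = len(mol["atoms"])
    for bond in mol["bonds"]:
        for end in (bond["begin"], bond["end"]):
            if not 0 <= end < n_atoms:
                raise ValueError(f"bond refers to missing atom {end}")
    for key, values in mol.get("extensions", {}).items():
        if len(values) != n_atoms:
            raise ValueError(f"extension {key!r} must have one value per atom")
    return True


def round_trip(mol):
    """Serialize to JSON text and parse it back."""
    return json.loads(json.dumps(mol))
```

A test suite in this spirit would assert that any compliant reader and writer pair preserves the record across a round trip and rejects malformed input.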

We are seeking more feedback on who would be interested in contributing to such a process.

9:35am-10:05am CINF 3: Rule-based capture/storage of scientific data from PDF files and export using a generic scientific data model
Stuart Chalk, Audrey Bartholomew, Bashar Baraz, John Turner

Department of Chemistry, University of North Florida, Jacksonville, Florida, United States
Recently, the US government has mandated that publicly funded scientific research data be made freely available in a usable form, allowing integration of the data into other systems. Although this mandate has been articulated, existing publications and new papers (PDF) still do not provide accessible data, so their usefulness is limited without human intervention.

This presentation outlines our efforts to extract scientific data from PDF files, using the PDFToText software and regular expressions (regex), and process it into a form that structures the data and its context (metadata). Extracted data is processed (cleaned, normalized), organized, and inserted into a contextually developed MySQL database. The data and metadata can then be output using a generic JSON-LD based scientific data model (SDM) under development in our laboratory.
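The workflow described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the sample sentences stand in for text already produced by a PDF-to-text step, the regular expression handles one invented phrasing, the MySQL stage is omitted, and the `@context` vocabulary URL and record keys are placeholders rather than the scientific data model under development by the authors.

```python
import json
import re

# Invented sample standing in for text extracted from a PDF.
SAMPLE_TEXT = (
    "The melting point of compound 3a was 121.5 \u00b0C.\n"
    "The melting point of compound 3b was 98 \u00b0C.\n"
)

# One rule: capture the compound label and the numeric value with its unit.
PATTERN = re.compile(
    r"melting point of compound (?P<compound>\w+) was "
    r"(?P<value>\d+(?:\.\d+)?)\s*\u00b0C"
)


def extract(text):
    """Apply the rule and return cleaned, normalized records."""
    records = []
    for match in PATTERN.finditer(text):
        records.append(
            {
                "@type": "Measurement",          # hypothetical type name
                "compound": match.group("compound"),
                "property": "melting point",
                "value": float(match.group("value")),
                "unit": "degC",
            }
        )
    return records


def to_jsonld(records):
    """Wrap records in a JSON-LD style envelope (placeholder vocabulary)."""
    return {
        "@context": {"@vocab": "http://example.org/sdm#"},
        "@graph": records,
    }


if __name__ == "__main__":
    print(json.dumps(to_jsonld(extract(SAMPLE_TEXT)), indent=2))
```

Keeping the value, unit and property name as separate fields is what lets the context (metadata) travel with the number once it leaves the PDF.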

10:05am-10:25am Intermission
10:25am-10:55am CINF 4: Building linked-data, large-scale chemistry platform: Challenges, lessons, and solutions
Valery Tkachenko, Alexey Pshenichnov, Aileen Day, Colin Batchelor, Peter Corbett

Royal Society of Chemistry, Rockville, Maryland, United States
Chemical databases have been around for decades, but in recent years we have observed a qualitative change from rather small, in-house, proprietary databases to large-scale, open and increasingly complex chemistry knowledgebases. This tectonic shift has imposed new requirements for database design and system architecture, as well as the implementation of completely new components and workflows which did not exist in chemical databases before. Probably the most profound change is being caused by the linked nature of modern resources: individual databases become nodes and hubs of a huge and truly distributed web of knowledge. This change brings to the fore such important aspects as data and format standards, interoperability, provenance, security, quality control and metainformation standards.

ChemSpider at the Royal Society of Chemistry was the first public chemical database to incorporate rigorous quality control by introducing both community curation and automated quality checks at the scale of tens of millions of records. Yet we have come to the realization that this approach may now be incomplete in a quickly changing world of linked data. In this presentation we will talk about the challenges associated with building modern public and private chemical databases, as well as the lessons that we have learned from our past and present experience. We will also talk about solutions to some common problems.

10:55am-11:25am CINF 5: Towards a functional database for enzyme data: STRENDA DB
Carsten Kettner, Martin Hicks

Beilstein Institut, Frankfurt/Main, Germany
Scientific research has reached a stage in which the rapid improvement of technologies and methodologies has contributed to the accumulation of a vast amount of data in the published literature. However, since scientific publishing serves an important role in communicating new data, both journal editors and readers are challenged to identify novel findings. In addition, mainstream publication practices often have a number of deficiencies in the way that data are reported, resulting in the publication of incomplete, irreproducible and even unusable data sets. Several years ago, a group of biochemists working in the field of enzymology opened the debate that reliable data are a basic requirement for subsequent research and knowledge generation for all “-omics” sciences, in particular for systems biology. Under the auspices of the Beilstein-Institut, this group formed the STRENDA Commission and developed community-based recommendations for authors reporting enzymological data – the STRENDA guidelines.
The submission of this data to a public database is essential to ensure maximal accuracy and accessibility of the experimental kinetic enzymatic data. The direct, electronic submission by authors prior to publication has proven to be essential for comprehensive data acquisition in macromolecular sequencing and structural biology, for example, the Protein Data Bank (PDB). The development of robust, web-based software tools and the implementation of experimental and informatics standards to assess the experimental data in a manuscript for compliance with the STRENDA guidelines resulted in STRENDA DB. This open access database is intended to provide a knowledge base for researchers and publishers; the former can use this data for reproduction, interpretation and additional experiments and the latter for a quick assessment of the degree of innovation and novelty of the data submitted. Here, with the presentation of STRENDA DB, we propose a change in the current publication workflow: from manuscript to database to publication, rather than from manuscript to publication to database.

11:25am-11:55am CINF 6: Virtues and vicissitudes of curatorial data wrangling: The guide to pharmacology experience
Christopher Southan

Guide to PHARMACOLOGY, University of Edinburgh, Göteborg, Sweden

A wide range of valuable databases, both academic and commercial, use the curation model to extract and standardise selected result sets from the literature. This is the classic unstructured-to-structured transformation, predominantly of target binding data (e.g. IC50, Ki or Kd) between ligands and targets. Since 2009 the Guide to PHARMACOLOGY (GtoPdb) has curated quantitative interactions between 1300 protein targets and 6000 ligands, covering a substantial proportion of the druggable proteome. The team thus has considerable expertise in the challenges of standardisation. This needs to be interposed between not only the primary literature but also the other databases that have extracted data. The wide range of compatibility and other issues associated with selecting new content for GtoPdb will be outlined. These will include the problem of equivocal measurement units as well as non-standard chemical IDs and protein IDs used by authors. The issue of data gaps will also be expanded on. The presentation will conclude with an assessment of new initiatives from several publishers for authors to mark up their chemical compounds before publication.

11:55am-12:00pm Concluding Remarks
CINF: From Data to Prediction: Applying Structural Knowledge in Drug Discovery & Development 8:40am - 12:00pm
Sunday, March 13
Room 25A - San Diego Convention Center
Jason Cole, Organizing
Jason Cole, Presiding
8:40am-8:45am Introductory Remarks
8:45am-9:15am CINF 7: Finding better aim at a moving target by exploiting structural data
Marcel Verdonk

Astex Pharmaceuticals, Cambridge, United Kingdom
Structural databases like the Protein Data Bank (PDB) contain a wealth of information that is widely used in the structure-based drug discovery community. A range of applications in this area have been reported, including knowledge-based scoring functions, interaction fields and pocket similarity searching. In general, when PDB data is used for such applications, the structures are treated as static, and all atoms are considered “equal”. However, flexible, solvent exposed protein atoms are significantly less likely to be involved in ligand binding than more buried, tightly packed atoms. Here, we show that this effect can be clearly observed in a statistical analysis of protein-ligand interactions in the PDB. Furthermore, we will illustrate how such analyses can be used to improve structure-based design applications like pocket finding algorithms, interaction fields and knowledge-based scoring.

9:15am-9:45am CINF 8: Bridging the dimensions: Seamless integration of 3D structure-based design and 2D structure-activity relationships to guide medicinal chemistry
Marcus Gastreich1, Matthew Segall3, Carsten Detering2, Edmund Champness3, Christian Lemmen1

1 BioSolveIT, Sankt Augustin, Germany; 2 BioSolveIT Inc, Bellevue, Washington, United States; 3 Optibrium Ltd, Cambridge, United Kingdom
The effective use of software can have a major impact on timelines and innovation in drug discovery. However, the traditional split between computational modellers and synthetic chemists has been blurred and software must be accessible across disciplines to quickly understand and predict structure-activity relationships (SAR). There has been a similar divide between tools for three-dimensional (3D) structure-based design and those for analysis of SAR based on a two-dimensional (2D) compound structure. Seamless integration between these approaches would enable all of the available structural knowledge to be used to guide the efficient design of high quality, active compounds.

In this talk we will illustrate how information from 2D models of key physicochemical and absorption, distribution, metabolism, elimination and toxicity (ADMET) properties can be superimposed on 3D views of protein-ligand complexes. The influence of each atom or functional group on these properties can be highlighted and combined with visualization of the atomistic contributions to binding affinity, enabling development of optimization strategies that balance potency with the ADMET properties required in a safe and efficacious drug.

Furthermore, 2D analyses, such as activity cliff detection and matched molecular pair analyses, are commonly used to explore compound data sets and quickly identify important SAR within a chemical series or library. We will demonstrate how a seamless, highly visual link between the results of these analyses and related 3D structural information helps to understand and rationalize this SAR. This enables the efficient design of compounds with improved target affinity in a truly multi-parameter optimization environment.

[Figure: Linking activity cliffs with 3D structural information to rationalise SAR]

9:45am-10:15am CINF 9: Predicting binding affinity doesn't work, or does it?
Christian Lemmen

BioSolveIT, Sankt Augustin, Germany
Predicting binding affinity remains the holy grail in computational drug discovery. It may work on some targets but not on others. Even among extremely compute-intensive approaches, none seems to work consistently well. One reason is that we are overestimating the quality of the data (the PDB structures) we are working with. Some flaws in this data are well known: low resolution or high temperature factors could be taken into account, but on checking the electron density we find many more issues. After all, the crystal structure is also just a model that more or less fits the primary experimental data. Next, the path from a raw PDB file to the input necessary for detailed molecular modelling is not always obvious. We have analyzed this carefully and provide a novel solution. Finally, water poses a particular problem in the modeling process. Some water molecules are crucial; others should be replaced to gain affinity. However, careful analysis again shows that the water molecules in the crystal structure model are also not all 'well defined'. Therefore we will have a look at water molecules and how we can measure their experimental support. In summary, in many cases a quantitative assessment is next to impossible, but it is good to 'know your enemies', and a qualitative assessment may still be very helpful. For example, the prioritization of compounds requires not absolute predictions but structure-activity relations and trends in the affinity data. We will show how the Hyde scoring pinpoints issues in the data and provides at least a qualitative assessment that is extremely helpful for the everyday tasks of SAR analysis and compound prioritization.

10:15am-10:30am Intermission
10:30am-11:00am CINF 10: Structural knowledge by prediction: Crystal structure prediction tests and progress
Colin Groom, Jason Cole, Anthony Reilly

Cambridge Crystallographic Data Centre, Cambridge, United Kingdom
Wouldn’t it be nice to be able to predict the crystal structure of organic molecules?

To encourage progress towards this goal, the CCDC set up a blind test of crystal structure prediction methods back in 1999. By the time of this symposium, the sixth blind test will have concluded. Over twenty research groups will have applied their methodology to rigid molecules, flexible molecules, salts and other multicomponent systems.

This presentation, timed to coincide with the end of the 50th anniversary year of the Cambridge Structural Database, will review progress in the field. Moreover, it will discuss whether purely informatics-based approaches, trained using over 800,000 known structures, can successfully predict the structures of unknown systems.

11:00am-11:30am CINF 11: Using physicochemical data and predictions in the risk assessment of mutagenic impurities
Susanne Stalford

Lhasa Limited, Leeds, United Kingdom
ICH M7 guidance on mutagenic impurities (MIs) supports the control of potential MIs based on a sufficient understanding of the manufacturing process. This strategy would reduce the need to perform testing on MIs predicted to be purged during synthesis of an active ingredient.
A concept was brought forward in which semi-quantitative “purge factors” are calculated, based on physicochemical properties such as reactivity, solubility and volatility, to give confidence that an MI is likely to be absent from an end-product. This approach is used by several organisations within the pharmaceutical industry to support regulatory submissions, e.g. for late-phase development. Our goal is to expand its use and standardise the approach through a consortium in order to establish a framework which would estimate “purge factors” and provide sufficient support for regulatory submissions. The key aims are to 1) standardise how calculations are performed throughout industry, 2) collate existing data and promote cross-industry data sharing to facilitate supported and accurate decision making, and 3) provide an automated in silico system which predicts purge factors based on experimental data and expert knowledge.
A successful international collaboration has been established, with a number of pharmaceutical companies guiding the development of the software and models for the prediction of physicochemical properties. This work has the potential to save both time and money with regard to analytical testing and also to ensure effort is focussed correctly on those impurities that present a substantive risk. This communication describes our scientific approach and recent progress.

11:30am-12:00pm CINF 12: Profile-QSAR generation 2: Perfection, the enemy of the good?
Valery Polyakov1, Eric Martin2, Li Tian1

1 GDC, NIBR, Lafayette, California, United States; 2 Computational Chemistry, Novartis, El Cerrito, California, United States
Profile-QSAR achieves unprecedented accuracy and a broad domain of application by augmenting a few hundred IC50s for a target of interest with millions of additional IC50s from 100,000s of compounds tested across many hundreds of historical assays from the same protein family. The accuracy and domain of application have now been dramatically further improved by replacing the original Bayesian formulation with a random forest formulation. Highlights of this presentation include:
- A new random forest-based algorithm yields unprecedented accuracy and domain of application,
- A new challenging “Novelty” test set for evaluating virtual screening models, which mirrors the great diversity of actual historical compound selections,
- A demonstration that linear PLS models can extrapolate to novel chemistry better than non-linear random forests,
- A surprising demonstration that excluding the seemingly most relevant training data can prevent overfitting and greatly improve extrapolation, and
- A comparison and analysis of models built on internal and public data sets.


CINF: Data Mining: Searching Non-covalent Interactions in Chemical Databases 1:00pm - 4:45pm
Sunday, March 13
Room 24C - San Diego Convention Center
Suman Sirimulla, Organizing
Suman Sirimulla, Presiding
Cosponsored by: COMP
1:00pm-1:05pm Introductory Remarks
1:05pm-1:30pm CINF 20: Sigma-hole interactions for rational drug design
Suman Sirimulla

Basic Sciences, St. Louis College of Pharmacy, St. Louis, Missouri, United States
Sigma hole interactions are gaining increased attention in medicinal chemistry. The halogen bond is now widely accepted by the medicinal chemistry community as an important non-covalent interaction for rational drug design. Bivalent sulfur atoms are also known to have an electron deficiency in their outer lobes, exhibiting a sigma hole. In this study we discuss the importance of sigma hole interactions in protein-ligand complexes and present scoring functions to score sigma hole interactions for molecular docking purposes. Insights from analyses of sigma hole interactions obtained by data mining of the Protein Data Bank are also presented.

1:30pm-1:55pm CINF 21: Deep convolutional neural networks for autonomous discovery of molecular interactions

Abraham Heifets, Izhar Wallach, Michael Dzamba

Atomwise, Inc., San Francisco, California, United States
Deep convolutional neural networks (neural nets with a constrained architecture that leverages the spatial and temporal structure of the domain they model) achieve the best predictive performance in areas such as speech and image recognition. Such deep convolutional neural networks autonomously discover and hierarchically compose simple local features into complex models. We demonstrate that biochemical interactions, being similarly local, are amenable to automatic discovery and modeling by similarly-constrained machine learning architectures. We describe the training of AtomNet, the first structure-based, deep convolutional neural network designed to predict the bioactivity of small molecules for drug discovery applications, on millions of training examples derived from ChEMBL and the PDB. We visualize the automatically-derived convolutional filters and demonstrate that the system is discovering chemically sensible interactions. Finally, we demonstrate the utility of autonomously-discovered filters by outperforming previous docking approaches and achieving an AUC greater than 0.9 on 57.8% of the targets in the DUDE benchmark.

[Figure: Sulfonyl detection with autonomously-trained convolutional filters.]

1:55pm-2:20pm CINF 22: Crystallographic informatics: Similarity and statistics
Simon Coles2, Graham Tizzard2, Philip Adler1

1 Chemistry, Haverford College, Haverford, Pennsylvania, United States; 2 University of Southampton, Hampshire, United Kingdom
For several years we have been systematically synthesizing and crystallizing families of related compounds, providing insights into polymorphism, similarity and crystal structure formation. These families can get rather large, which has necessitated developing computerized approaches to searching for patterns and packing similarities. Whilst many of the properties under investigation can be driven by conventional or primary intermolecular interactions there is an increasing awareness that weak or secondary interactions (in addition of course to shape and steric factors) can have considerable influence on the arrangement of the solid state. Accordingly we use two approaches that are not reliant purely on characterizing and understanding primary, hydrogen bonded, interactions.
Quantification of similarity in terms of dimensionality is important in understanding polymorphism, phase transitions and crystal growth. The XPac program disregards the standard descriptors of intermolecular interactions and is a geometrical analysis that assigns vectors between equivalent points in adjacent molecules in order to try to capture the effects of diffuse Van der Waals interactions. Pairs, or indeed large families, of vector representations can then be used to index the degree of similarity between all the members of a collection. This has the advantage of not only being agnostic to intermolecular interactions but also to an extent molecular similarity – it is indeed possible to compare oranges and lemons!
Our second method takes quite the opposite approach. Using molecular, interaction and topological statistical descriptors gleaned from all possible characteristics of a crystal structure (we have used around 4000!) it is possible to build statistical models. Again when looking at related sets of structures, one can compare models and also look for correlations between descriptors. In this way it is not only possible to characterize similarity, but also relate physical properties to structural characteristics and answer questions such as “will it crystallize?” or “what coformers will produce a cocrystal?”.
We are lacking tools to mine really big and diverse collections in this way. Our methods will be illustrated through studies on our own compound/structure libraries; however, we also demonstrate the potential, and shortcomings, of mining much larger databases by these approaches.

2:20pm-2:45pm CINF 23: Chemical fragment analysis of halogen bonds in protein binding sites
AhWing Chan

UCL, London, United Kingdom
We present an analysis of chemical fragments from halogen-containing, lead-like molecules in the Protein Data Bank (PDB) forming halogen bonds to protein main chain or side chain atoms. A fragment is defined as the largest ring assembly containing the halogen atoms involved in the halogen bond(s). Linear chains with halogens are also examined. Although there are about 2000 halogen-containing ligands in the PDB, only around 250 form a halogen bond with the protein. The halogen atoms are most often attached to an aromatic ring, with only very few linear motifs found. Our findings are useful for drug design, especially in the areas of fragment-based design, chemical library design, and selection of screening compounds.

2:45pm-3:00pm Intermission
3:00pm-3:25pm CINF 24: Mining interaction data in the Cambridge structural database: Getting the rewards and removing the risks!
Jason Cole, Peter Wood, Neil Feeder, Robin Taylor, Colin Groom

CCDC, Cambridge, United Kingdom
The Cambridge Structural Database is the biggest single resource describing non-covalent interactions. Think of every atom in 800,000 molecules interacting with all the molecules surrounding it in a lattice. These tens of millions of interactions are astonishingly informative – but only if we have tools to analyse them.

This presentation will describe the new Full Interaction Mapping functionality developed as part of the CSD System – a technology that reveals the energy yielding interactions holding molecules together and highlights the compromises that are sometimes made.

We will also see what a pure statistical analysis of every atom-atom interaction in over 800,000 crystals reveals. Just which interactions do appear more often than random? Are some interactions seen just because ‘everything has to be somewhere’? Are some of the interactions claimed to stabilise protein-ligand complexes observed less than one might expect and actually unfavourable? What really makes a halogen bond favourable?

In this presentation, timed to coincide with the end of the 50th anniversary year of the Cambridge Structural Database, we will showcase examples of applying Full Interaction Maps to real-life problems, such as understanding the relative stability of structural polymorphs, and look at how interaction propensity gives us insight into the true worth of non-covalent interactions.

3:25pm-3:50pm CINF 25: Fast mining of adaptable interaction patterns in protein-ligand interface
Therese Inhester2, Matthias Rarey1

1 University of Hamburg, Hamburg, Germany; 2 Center for Bioinformatics, University of Hamburg, Hamburg, Germany

Improving the specificity or the affinity of a small molecule to its target protein are just two typical tasks in structure-based drug design and structure-driven developments in biotechnology. In both scenarios, a profound understanding of interatomic interactions on a geometric and an energetic level is required. The ever-increasing number of high-quality protein-ligand structures provides the opportunity to gather detailed knowledge about preferred geometrical arrangements of atoms (interaction patterns). Yet, advanced data-mining methods are needed to deduce such spatial information from the large amount of data. Ideally, these methods have to be highly adaptable with regard to the possibilities of the query design. Moreover, they need to be efficient in order to enable near-interactive use. Equally important is a result browser which presents the resulting hits and the query in a comprehensible way.
In an attempt to address all these requirements we developed a new retrieval system to mine large sets of protein-ligand complexes for specific interaction patterns. Highly efficient indexing techniques and different graph-based algorithms are used to rapidly detect all occurrences of a spatial query. The queries represent a 3D constellation of atom descriptions within a protein-ligand interface. Additionally, constraints for physico-chemical properties of the protein and the ligand, e.g. the resolution, are combined with the spatial query.
We are going to present an application of our new approach in a detailed analysis of different interaction patterns including specific interactions and molecular substructures demonstrating its value for various molecular design tasks.

3:50pm-4:15pm CINF 26: Dual nature of a halogen atom
Mahesh Narayan

Chemistry, University of Texas at El Paso, El Paso, Texas, United States
Halogen bonding interactions between halogenated ligands and proteins were examined using the crystal structures deposited to date in the PDB. The data was analyzed as a function of halogen bonding to main chain Lewis bases, viz. oxygen of backbone carbonyl and backbone amide nitrogen. This analysis also examined halogen bonding to side-chain Lewis bases (O, N, and S) and to the electron-rich aromatic amino acids. The data reveals that while fluorine and chlorine have strong tendencies favoring interactions with the backbone Lewis bases at glycine, the trend is not restricted to the achiral amino acid backbone for larger halogens. Halogen side-chain interactions are not restricted to amino acids containing O, N, and S as Lewis bases. Electron-rich aromatic amino acids host a high frequency of halogen bonds as does Leu. A closer examination of the latter hydrophobic side chain reveals that the 'propensity of interactions' of halogen ligands at this oily residue is an outcome of strong classical halogen bonds with Lewis bases in the vicinity. Furthermore, an examination of Theta 1 (C, X, O and C, X, N) and Theta 2 (X, O, Z and X, N, Z) angles reveals that very few ligands adopt classical halogen bonding angles, suggesting that steric and other factors may influence these angles.
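For readers unfamiliar with the two angles, the following sketch shows how Theta 1 (C, X, acceptor) and Theta 2 (X, acceptor, Z) could be computed from atomic coordinates. It is not the authors' analysis code; the function names are hypothetical, and the 150 degree cutoff for a "classical" near-linear geometry is an illustrative assumption rather than a value taken from the abstract.

```python
import math


def angle_deg(a, b, c):
    """Angle in degrees at vertex b formed by the points a-b-c."""
    v1 = [a[i] - b[i] for i in range(3)]
    v2 = [c[i] - b[i] for i in range(3)]
    dot = sum(p * q for p, q in zip(v1, v2))
    n1 = math.sqrt(sum(p * p for p in v1))
    n2 = math.sqrt(sum(q * q for q in v2))
    cos_theta = max(-1.0, min(1.0, dot / (n1 * n2)))  # guard rounding error
    return math.degrees(math.acos(cos_theta))


def halogen_bond_geometry(c, x, acceptor, z):
    """Return the X...acceptor distance plus Theta 1 and Theta 2.

    c        -- carbon bonded to the halogen
    x        -- halogen atom
    acceptor -- Lewis base atom (O or N in the abstract)
    z        -- atom bonded to the acceptor
    """
    return {
        "distance": math.dist(x, acceptor),
        "theta1": angle_deg(c, x, acceptor),  # C-X...acceptor
        "theta2": angle_deg(x, acceptor, z),  # X...acceptor-Z
    }


def looks_classical(theta1, min_theta1=150.0):
    """Assumed cutoff: treat near-linear C-X...acceptor as 'classical'."""
    return theta1 >= min_theta1
```

For a perfectly linear arrangement such as C at (0, 0, 0), X at (1.9, 0, 0) and O at (4.9, 0, 0), Theta 1 comes out at 180 degrees; contacts in real structures would be compared against whatever cutoff the analyst chooses.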
Outcomes from understanding halogen bonding trends in the PDB were applied towards rational drug design. Currently there are no specific pharmacological therapies (drugs) to treat post-traumatic stress disorder (PTSD). Emerging data reveal that the opioid receptor-like 1 gene (Oprl1), encoding the nociceptin (NOP)/orphanin FQ receptor, is involved in stress-mediated enhancement of amygdala-dependent fear in PTSD. Here, we attempted to design and develop specific novel nociceptin receptor agonists with the objective of achieving better receptor specificity. A structure-based drug design method was used, making use of the crystal structure of the nociceptin receptor in the PDB. Halogen bonding was fully explored in designing the drugs. The results presented through our study illustrate the application of halogen bonding in rational drug design.

4:15pm-4:40pm CINF 27: Crystal clear: Using statistical descriptions and analysis to understand crystallisation
Philip Adler2,, Simon Coles4, Alex Norquist1, Joshua Schrier2, Dave Woods4, Sorelle Friedler1, Lucy Mapp3

1 Haverford College, Bryn Mawr, Pennsylvania, United States; 2 Chemistry, Haverford College, Haverford, Pennsylvania, United States; 3 Chemistry, University of Southampton, Southampton, United Kingdom; 4 University of Southampton, Hampshire, United Kingdom
Statistical methods can be applied to understanding and predicting crystallisation. Methods have been deduced to apply these techniques to problematic systems such as co-crystals and inorganic-organic hybrid materials.
To use statistical approaches rigorously generally requires quite strictly defined experimental designs. While these designs are well understood, they are not necessarily practicable in terms of chemical properties for which we do not already have a solid understanding, for instance the problem of predicting outcomes in crystallisation experiments. One of the largest problems in this field is parameterising the chemical space: describing the systems such that statistical methods can apply, and in particular keeping such descriptors invariant and mutually orthogonal. In addition, choosing the correct aspects of the system, that is, those that demonstrate potential causal relationships, can prove to be challenging. This is because there is a plethora of means available to chemists to describe their systems. The application of statistical techniques such as decision trees and support vector machines has permitted the derivation of hypotheses and helped enhance understanding of crystallisation experiments, and this forms the presented work.
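To make the decision-tree idea concrete, here is a minimal sketch of a single-split tree (a decision stump) selected by Gini impurity. The descriptor names, values, and outcomes are invented toy data, not results from this work, and a real study would likely use an established library rather than this hand-rolled version.

```python
def gini(labels):
    """Gini impurity of a list of 0/1 labels."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2.0 * p * (1.0 - p)

def best_split(rows, labels):
    """Return (feature, threshold, weighted impurity) for the best single split."""
    best = None
    n = len(rows)
    for feature in rows[0]:
        for threshold in sorted({row[feature] for row in rows}):
            left = [y for row, y in zip(rows, labels) if row[feature] <= threshold]
            right = [y for row, y in zip(rows, labels) if row[feature] > threshold]
            if not left or not right:
                continue  # a split must leave samples on both sides
            score = (len(left) * gini(left) + len(right) * gini(right)) / n
            if best is None or score < best[2]:
                best = (feature, threshold, score)
    return best

# Toy data (hypothetical): two descriptors per experiment, 1 = crystals formed.
rows = [
    {"temperature_C": 90, "pH": 2.0},
    {"temperature_C": 110, "pH": 6.0},
    {"temperature_C": 150, "pH": 3.0},
    {"temperature_C": 160, "pH": 7.0},
]
labels = [0, 0, 1, 1]

feature, threshold, impurity = best_split(rows, labels)
```

On this toy set the split falls on the temperature descriptor, which is the kind of readable hypothesis ("outcome changes above a given temperature") that a tree-based model can suggest.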

4:40pm-4:45pm Concluding Remarks
CINF: Global Initiatives in Research Data Management & Discovery 1:00pm - 5:00pm
Sunday, March 13
Room 25B - San Diego Convention Center
Ian Bruno, Leah McEwen, Organizing
Ian Bruno, Presiding
Cosponsored by: ANYL, COMP, MEDI and PHYS
1:00pm-1:15pm Introductory Remarks
1:15pm-1:45pm CINF 13: Open data is not enough: A look at the Research Data Alliance

Mark Parsons,

Research Data Alliance, Boulder, Colorado, United States
In recent years governments and research institutions have emphasized the need for open data as a fundamental component of open science. But we need much more than the data themselves for them to be reusable and useful. We need descriptive and machine-readable metadata, of course, but we also need the software and the algorithms necessary to fully understand the data. We need the standards and protocols that allow us to easily read and analyze the data with the tools of our choice. We need to be able to trust the source and derivation of the data. In short, we need an interoperable data infrastructure, but it must be a flexible infrastructure able to work across myriad cultures, scales, and technologies. This talk will present a concept of infrastructure as a body of human, organisational, and machine relationships built around data. It will illustrate how a new organization, the Research Data Alliance, is working to build those relationships to enable functional data sharing and reuse.

1:45pm-2:15pm CINF 14: Responses to the data revolution: CODATA on policy, data science, and capacity building
Simon Hodson1, John Rumble2,

1 CODATA, Paris, France; 2 R&R Data Services, Gaithersburg, Maryland, United States
Talk of a data revolution is not hyperbole. The ease with which data relating to human behaviour and transactions can be gathered has led to new industries driven by their ability to elicit predictive and commercially advantageous information from masses of data. In academia, new data science courses and multidisciplinary centres are springing up to feed the demand for these analytical and data management skills. The data revolution, the phenomenon of Big Data and advances in data science are everywhere impacting both scientific research and industry.

Exploiting the opportunities and addressing the challenges afforded by the data revolution and using them to generate wider societal benefit will fundamentally depend on the creation of a complementary ‘Open Data’ environment. Open data is crucial to the maintenance of scientific ‘self-correction’ whereby the data underlying published concepts are open to scrutiny, replication or invalidation. The rapid growth of data makes this crucial principle of research ever more difficult to sustain, and increasingly requires both the data and the code used in data analysis to be open, accessible and useable.

Our response must include new technical solutions for presenting, sharing and analysing data; capacity building in “data science”; and changes to the habits and norms of researchers and their institutions to create a culture of openness and data sharing. Science is an international activity, done in a national cultural setting, thereby requiring national strategies to fit within a common international frame. The role of international bodies such as CODATA and the International Council for Science is to facilitate the fit between national priorities and processes and rapidly developing international norms.

To help address these issues, CODATA promotes Open Data and Open Science through three strategic priorities:

1) Supporting implementation of data principles, policies and practices
2) Addressing the frontiers of data science and its adaptation to scientific research
3) Capacity building for data science (particularly in low and middle income countries - LMICs)

This presentation will examine the context of the ‘Data Revolution’ and provide an introduction to CODATA’s analysis of these issues and activities in the priority areas identified. Particular emphasis will be placed on the policy environment, a holistic approach to capacity building and the research data skills that benefit various roles in science systems.

2:15pm-2:45pm CINF 15: Moving research forward with persistent identifiers and services
Patricia Cruse,

DataCite, Berkeley, California, United States
The presenter will provide information on DataCite, an international consortium which aims to increase the acceptance of research data as a legitimate, citable contribution to the corpus of scholarly communication. DataCite has developed many services that directly support data sharing, management and attribution. To enable these activities DataCite assigns persistent identifiers to research datasets and manages the infrastructure that supports simple and effective methods of data citation, discovery and access. DataCite leverages the DOI infrastructure, which is already well-established. DOI names are the most widely used identifier for scientific journal articles, so researchers, authors, and publishers are familiar with their use. DataCite actively works with other identifier services such as ORCID and Crossref to deliver services that support researchers. In addition, the presenter will discuss the THOR project, a 30-month project funded by the European Commission under the Horizon 2020 program. DataCite is an active participant in THOR and is working collaboratively to establish seamless integration between articles, data, and researchers across the research lifecycle. Please join the conversation and learn how your research can benefit from DataCite and THOR.

2:45pm-3:15pm CINF 16: Discoverability and reusability of FAIR chemistry research data as a key outcome of registering persistent identifiers and standardised metadata with DataCite
Henry Rzepa1,, Matthew Harvey2, Andrew Mclean3

1 Chemistry, Imperial College London, London, United Kingdom; 2 HPC division, Imperial College London, London, United Kingdom; 3 ICT Division, Imperial College London, London, United Kingdom
A research data management (RDM) system for computational and other chemical data is described using DataCite for registration of persistent (DOI) identifiers along with standardised metadata, including ORCID identifiers. Examples of the benefits of using such a FAIR model (Findable, Accessible, Inter-operable and Re-usable) will include automated repository retrieval and display based only on the assigned DOI and the media type required, the standards-based curation of a ten-year-old repository dataset, and illustrations of how the metadata associated with the assigned DOIs can be used to enhance research data discoverability and impact.
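Retrieval based only on a DOI and a media type is commonly done through HTTP content negotiation: the client sends the DOI to the resolver and names the desired representation in the Accept header. The sketch below builds such a request in Python without sending it. The DOI is a placeholder, the helper name `build_doi_request` is hypothetical, and the media type string is an assumption about what the registration agency supports; check the current DataCite documentation before relying on it.

```python
import urllib.request

DOI_RESOLVER = "https://doi.org/"

def build_doi_request(doi, media_type):
    """Build (but do not send) a content-negotiation request for a DOI."""
    request = urllib.request.Request(DOI_RESOLVER + doi)
    # The Accept header names the representation the client wants back.
    request.add_header("Accept", media_type)
    return request

# Placeholder DOI for illustration only; substitute a registered DOI.
request = build_doi_request(
    "10.1234/example-dataset",
    "application/vnd.datacite.datacite+xml",  # assumed metadata media type
)

# To actually retrieve the metadata (requires network access):
# with urllib.request.urlopen(request) as response:
#     metadata = response.read().decode("utf-8")
```

Keeping the request construction separate from the network call makes it easy to swap in a different media type for the same DOI.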


3:15pm-3:30pm Intermission
3:30pm-4:00pm CINF 17: Surveying and tracking the biomedical data landscape
Maryann Martone,

Neurosciences, University of California, San Diego, San Diego, California, United States
In the past few years, data and data science have exploded into the academic and public consciousness, as “Big data” and “Data science” have taken hold. In this presentation, I will present an overview of projects and initiatives across biomedicine that are working to make data FAIR: Findable, Accessible, Interoperable and Re-usable in the age of global search. I will present projects such as the Neuroscience Information Framework, the NIDDK Information Network (dkNET) and bioCADDIE, the NIH Data Discovery Index, which are working to index data available across distributed resources. Such projects are confronting first-hand the heterogeneity and dynamism of the current biomedical data landscape. I will also highlight some of the community efforts underway to update and expand our current citation system to make it easier to track usage of research resources such as reagents, organisms and data. In particular, I will highlight the Resource Identification Initiative and Data Citation projects underway at FORCE11: the Future of Research Communications and e-Scholarship. FORCE11 is a grassroots organization dedicated to transforming scholarship through technology. The Resource Identification Initiative is working with authors and journals on a simple method for supplying unique identifiers for key research resources in the literature. Through efforts such as the Joint Declaration of Data Citation Principles and follow-on projects to implement the principles in a formal system of data citation, organizations around the world are working to develop a machine-actionable system for data citation as well, so that data is properly handled and cited as a primary product of scholarship. These projects are starting to point to key infrastructure requirements to transition our current paper-based system into a system designed for networks and global search.

4:00pm-4:30pm CINF 18: Data Observation Network for Earth: Earth and environmental science data management and discovery
Amber Budden1,, William Michener1, Dave Vieglais2, Rebecca Koskela1, Heather Soyka1

1 University of New Mexico, Albuquerque, New Mexico, United States; 2 University of Kansas, Lawrence, Kansas, United States
Data Observation Network for Earth (DataONE) is the foundation of new innovative environmental science through a distributed framework and sustainable cyberinfrastructure that meets the needs of science and society for open, persistent, robust, and secure access to well-described and easily discovered Earth observational data. In this overview we will introduce the guiding principles of DataONE and the primary components of the DataONE cyberinfrastructure, provide a brief demonstration of DataONE search and discovery, and discuss community perspectives and outreach opportunities.

[Figure: Data Observation Network for Earth (DataONE)]

4:30pm-5:00pm CINF 19: California Digital Library: Advancing the digital transition of scholarly information
John Chodacki,

California Digital Library, University of California, Oakland, California, United States
Researchers are increasingly being asked to ensure that all products of research activity – not just traditional publications – are preserved and made widely available. Adoption of good data curation practices is critical to open scientific inquiry, discourse, and advancement. With their long history in the management and dissemination of multifarious information resources, libraries play a key role in providing scholars with the tools and services necessary for the effective long-term curation of research data, encompassing data lifecycle management, preservation, sharing, dissemination, and reuse. This presentation will provide an overview of efforts the California Digital Library’s University of California Curation Center (UC3) group and their partners have undertaken to develop services that help researchers improve their handling, sharing, and archiving of datasets. Services include the DMPTool (Data Management Planning Tool), the Merritt Repository, the Dash Repository, and other services that help researchers manage and get credit for their data.

CINF: From Data to Prediction: Applying Structural Knowledge in Drug Discovery & Development 1:30pm - 4:50pm
Sunday, March 13
Room 25A - San Diego Convention Center
Jason Cole, Organizing
Jason Cole, Presiding
1:30pm-1:35pm Introductory Remarks
1:35pm-2:05pm CINF 28: Towards a fully automated creation of large protein structure ensembles
Stefan Bietz, Matthias Rarey,

University of Hamburg, Hamburg, Germany
With the continuously growing number of available crystal structures, a great wealth of information related to molecular interactions and macromolecular conformational flexibility becomes available for drug discovery. Collecting and aligning all structures available for a certain protein target of interest is therefore a central problem to be addressed prior to structure analysis and knowledge exploitation. Classical methods include sequence and structure alignments followed by a superimposition of the extracted structures. Since these methods are mostly developed for protein structure analysis, they do not focus on the protein’s binding site, resulting in major deficiencies. Not surprisingly, the construction of protein structure ensembles is mostly hand-curated work. While this is acceptable for individual cases, the construction of protein structure ensembles for large-scale analysis or for the evaluation of ensemble-based computational techniques such as molecular docking is prohibitively labor-intensive.

In this talk, we present a series of algorithms especially tailored for a binding-site focused, fully-automated construction of protein structure ensembles. The key element is a novel alignment algorithm for binding sites named ASCONA[1]. By taking sequence and structure information into account, ASCONA is able to calculate correct amino acid alignments of binding sites even in complicated scenarios like homo-dimer interfaces and binding sites consisting of patches from multiple protein domains. ASCONA allows specific control of the variability in the binding site, such as the gap and mutation rate. The new alignment approach is embedded into a workflow for an automatic construction of protein structure ensembles named SIENA[2]. The process includes the extraction of binding sites from the PDB, their alignment, a reasonable reduction of structures, and the superimposition. In summary, SIENA enables the construction of arbitrary protein structure ensembles with typical computing times of 5-20 seconds.

[1] Bietz, S.; Rarey, M. (2015). ASCONA: Rapid Detection and Alignment of Protein Binding Site Conformations. Journal of Chemical Information and Modeling, 55(8):1747–1756.
[2] Bietz, S.; Rarey, M. (2015). SIENA: Efficient Compilation of Selective Protein Binding Site Ensembles. Journal of Chemical Information and Modeling, submitted for publication

2:05pm-2:35pm CINF 29: On our way to the automated search for ligand-sensing cores

Tobias Brinkjost1,2,, Christiane Ehrt2, Petra Mutzel1, Oliver Koch2

1 Faculty of computer science, TU Dortmund University, Dortmund, Germany; 2 Faculty of chemistry and chemical biology, TU Dortmund University, Dortmund, Germany
The investigation of protein-ligand interactions is one of the prerequisites for structure-based design of small molecule modulators of protein function. These interactions can be regarded based on structural similarity of secondary structure elements with impact on rational drug design [1]. The basic idea of the presented approach is the fact that a similar spatial arrangement of secondary structure elements around the binding site (‘ligand-sensing cores’) can recognize similar scaffolds independent of the overall fold [2]. The discovery of Namoline as a lysine-specific demethylase inhibitor, which impairs the growth of prostate cancer cells, by Willmann et al. demonstrated the pharmaceutical relevance of this concept [3]. However, to date there is no automated procedure available to compare 'ligand-sensing cores' of various proteins.

We will present the results of our ongoing progress to develop an automated computational method to identify similar 'ligand-sensing cores' in binding pockets of otherwise unrelated proteins for all known protein structures. Our approach is based on labelled graphs generated from the secondary structure information provided by Secbase [4], an extension module of Relibase. Calculations on available test datasets reveal a robust and very fast implementation, so that an all-against-all comparison of the whole PDB should be possible within one or two months of calculation time on a recent workstation. We are currently optimizing our approach on several datasets and have also generated very promising results on ATP and other target-specific datasets recently. We will also present the results of the all-against-all comparison.

In the end, this information of all similar ligand-sensing cores within all known protein structures will provide access to previously unused data to predict polypharmacology and to identify new lead structures. Therefore, this development leads to a valuable tool for rational structure-based drug design.

[1] Koch O; In Future Medicinal Chemistry; 2011:699-708.
[2] Koch M A, Waldmann H; In Drug Discovery Today; 2005:471-483.
[3] Willmann D, Lim S, Wetzel S, Metzger E, Jandrausch A, Wilk W, Jung M, Forne I, Imhof A, Janzer A, Kirfel J, Waldmann H, Schüle R, Buettner R; In International Journal of Cancer; 2012:2704-2709
[4] Koch O, Cole J, Block P, Klebe G; J. Chem. Inf. Model; 2009:2388-2402

2:35pm-3:05pm CINF 30: Deep learning in the 3rd dimension: Structure-based bioactivity prediction on novel targets
Abraham Heifets,, Izhar Wallach, Michael Dzamba

Atomwise, Inc., San Francisco, California, United States
Existing deep learning techniques for bioactivity prediction require significant prior knowledge of known active molecules for each protein target against which they predict, limiting their applicability in practice. We describe how to incorporate information about the structure of target proteins into the predictions made by deep learning neural networks. We discuss data cleaning, collation, and scaling techniques that are necessary to integrate large structural databases, such as the PDB, with large bioactivity databases, such as ChEMBL and PubChem. Finally, we present case studies where structural target information allowed the successful prediction of hits for targets with no known binders.

3:05pm-3:20pm Intermission
3:20pm-3:50pm CINF 31: CDD vision: Advanced analytics, calculations, and visualization live in CDD vault
Barry Bunin,

CDD, Belmont, California, United States
Drug Discovery Collaborations have been securely hosted in the CDD Vault for over a decade. We present a new web-based data mining and visualization module for high-throughput drug discovery data that makes use of a novel technology stack following modern reactive web design principles. CDD Vision allows researchers to simultaneously visualize, manipulate, and create publication-quality graphics for hundreds of thousands of data points. Scientists can now perform complex, multidimensional analysis of experimental, calculated, and predicted properties to optimize activity, selectivity, and drug-like properties. The advanced analytics, calculations, and visualization suite is conveniently integrated within the CDD Vault for facile registration, structure-activity relationships and secure collaboration. The synergy between these two systems allows users to quickly and graphically move through, sift and test stages of the drug discovery process to re-focus on advancing the science via intuitive data manipulation. Innovative capabilities for bio-computational across data sets leveraging the bioassay ontology, as well as newly shared open source technologies highlight the conversion of collaborative innovation into professional, useful products.

3:50pm-4:20pm CINF 32: Advances in data provisioning

Marian Brodney1,, Jacquelyn Klug-McLeod2, Gregory Bakken2, Robert Stanton1

1 Computational Sciences Center of Excellence, Pfizer, Cambridge, Massachusetts, United States; 2 Computational Sciences Center of Excellence, Pfizer, Groton, Connecticut, United States
Discovery project teams depend heavily on data to drive decision making. This information usually comes from various data sources (external, internal, in silico, etc.) in numerous formats (chemistry, in vitro, in vivo, pharmacology, ADME). It can be a difficult and time-consuming process to access, collate and manage the amount and the variety of information needed to support teams, and to then keep this data current and presented in a reusable format for multiple disciplines. Our Computational Sciences (CSCoE) group has developed a data provisioning platform to aggregate project information, allowing teams to focus on analyzing and visualizing their data to advance towards their goal(s).

4:20pm-4:50pm CINF 33: Chemical information on the web: Find and be found

Asta Gindulyte,

National Center for Biotechnology Information, U.S. National Library of Medicine, Bethesda, Maryland, United States
While numerous chemical data repositories are available to chemists these days, the task of finding and collating all the data relevant to the problem of interest can be arduous and time-consuming. A lot of effort may be required in compiling a list of trusted resources and learning how to use them, their scope, quirks, etc. And even then, it is a laborious process to use multiple data sources – whether manually searching each database or writing custom code to perform the task. Thus, it is not surprising that many chemists are turning to generic web search engines such as Google and Bing for their preliminary research. However, just because a database is “on the web”, it does not necessarily mean that it can be searched effectively using Google. For instance, data on the melting point of aspirin might be in the database, but it won’t necessarily show up in Google search results. That can happen if the resource requires a login to access the data. But it can also happen if the web pages being served by the resource are not optimized for web search engines.

This talk will discuss the strategies for effective chemical information searching on the web from two perspectives: the owners of the chemical information databases and the users of such databases. This includes the steps that resource providers can take to “expose” their data to web search engines whether they are behind a login wall or not, and will give examples of PubChem’s experience in this endeavor. For the researchers using the web to search for chemical information, this talk will share ideas on how to streamline the user experience and show how to build their own custom Google chemical search engine (no programming experience required).

CINF: CINF Scholarships for Scientific Excellence: Student Poster Competition 6:30pm - 8:30pm
Sunday, March 13
Room 3 - San Diego Convention Center
6:30pm-8:30pm CINF 34: Quantifying the effect that chemical environment exerts upon changes in property in matched molecular pairs analysis
Iva Lukac1,, Andrew Leach1,3, Edward Griffen3, Alexander Dossetter2

1 School of Pharmacy and Biomolecular Sciences, Liverpool John Moores University, Liverpool, United Kingdom; 2 MedChemica Limited, Macclesfield, United Kingdom; 3 Medchemica Ltd, Macclesfield, United Kingdom
Matched Molecular Pairs Analysis (MMPA) is a technique that links differences in molecular structure with changes in properties and so allows useful information to be extracted and shared without disclosing full structures. By encoding the output as SMIRKS, it can then be used to suggest new molecules. It also provides the probability of each property changing in the desired direction. Like picking the winning horse by studying its previous results and current race conditions, the decision can be logical and data driven but success cannot be guaranteed.

MMPA has recently been applied to the ADMET databases of Roche, AstraZeneca and Genentech and the output combined to create what we believe is the world’s largest repository of medicinal chemistry knowledge, akin to a textbook. A by-product of this merging is that a large set of MMPA data are available that can answer fundamental questions about the technique: i) how many pairs are needed in order to be confident that a particular change in structure actually causes an increase (or decrease) in property? ii) for small sets of molecules, how large must the average change in property be in order to be confident that a particular change in structure actually causes an increase (or decrease) in property? iii) is there an upper limit to the number of pairs needed? iv) do chemically specific changes need less data than general changes? Each of these questions will be addressed.

A starting point for the analysis is that each pair can be viewed as a coin-flip experiment where an increase in a property corresponds to “heads” and a decrease to “tails”. This reduces the challenge to detecting biased coins. The coin flip analogy reduces the likelihood of misassigning the direction of change, but greatly reduces the number of structural transformations that are identified as having a significant effect upon properties. The large database available has permitted us to probe the link between the coin flip approach and the mean changes in property. The encoding of the chemical environment has further permitted us to analyse its impact which is to reduce the number of pairs required as the chemical environment becomes more specific.
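The coin-flip view can be illustrated with a two-sided sign test against a fair coin. The sketch below is a generic binomial calculation, not the authors' implementation; the function names and the 0.05 significance level are assumptions for illustration, and pairs showing no change are assumed to have been dropped beforehand.

```python
from math import comb

def two_sided_sign_test(increases, decreases):
    """Two-sided binomial p-value against a fair coin (p = 0.5)."""
    n = increases + decreases
    k = max(increases, decreases)
    # Probability of a result at least this lopsided in one direction.
    tail = sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n
    return min(1.0, 2.0 * tail)

def min_unanimous_pairs(alpha=0.05):
    """Smallest n for which n pairs all changing the same way gives p < alpha."""
    n = 1
    while two_sided_sign_test(n, 0) >= alpha:
        n += 1
    return n

p_value = two_sided_sign_test(9, 1)    # 9 increases, 1 decrease
pairs_needed = min_unanimous_pairs()   # at the assumed 0.05 level
```

Under these assumptions, even a transformation whose pairs all change in the same direction needs at least six pairs before the coin can be called biased, which shows why a sign-only test discards many small sets.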


6:30pm-8:30pm CINF 35: CSNAP: A new chemoinformatics approach for target identification using chemical similarity networks
Yu-Chen Lo1,, Silvia Senese1, Chien-Ming Li3, Qiyang Hu2, Yong Huang3, Robert Damoiseaux4, Jorge Torres1

1 Chemistry and Biochemistry, University of California, Los Angeles, Los Angeles, California, United States; 2 Institute for Digital Research and Education, University of California, Los Angeles, Los Angeles, California, United States; 3 Drug Study Units, University of California, San Francisco, San Francisco, California, United States; 4 Molecular Shared Screening Resource, University of California, Los Angeles, Los Angeles, California, United States

Target identification is one of the most critical steps following cell-based phenotypic chemical screens to determine the molecular mechanism of drugs and remains the major bottleneck of many drug discovery programs for developing novel therapies. Traditional in-silico target identification methods, including chemical similarity database searches, predict drug targets by single or sequential ligand similarity comparison, which have limited capabilities for accurate deconvolution of a large number of hits with diverse chemical structures. Here, we present CSNAP (Chemical Similarity Network Analysis Pulldown), a new computational target identification method that utilizes chemical similarity networks for large-scale chemotype recognition and drug target profiling from chemical screens. CSNAP orders query and annotated compounds into chemical similarity networks for rapid chemotype classification followed by consensus target scoring to identify the most probable target for query compounds in the network. Our benchmark study showed that CSNAP achieved higher target prediction accuracy than traditional target identification approaches. Additionally, CSNAP is capable of integrating with biological knowledge-based databases and high-throughput biology platforms for system-wise drug target validation. To demonstrate the utility of the CSNAP approach, we combined CSNAP's target prediction with experimental ligand evaluation to identify the major mitotic targets of hit compounds from a cell-based chemical screen and we highlight novel compounds targeting microtubules, an important cancer therapeutic target. The CSNAP method is freely available and can be accessed from the CSNAP web server.
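The general idea of a chemical similarity network with consensus target scoring can be sketched in a few lines of Python. This is a simplified illustration and not the CSNAP algorithm itself: it assumes fingerprints are given as sets of on-bits, links compounds by a Tanimoto threshold, and takes a majority vote over annotated neighbours. All compound names, fingerprints, targets, and the 0.5 threshold are hypothetical.

```python
from collections import Counter

def tanimoto(a, b):
    """Tanimoto coefficient between two sets of fingerprint bits."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0

def build_network(fingerprints, threshold):
    """Link every pair of compounds whose similarity meets the threshold."""
    names = list(fingerprints)
    edges = {name: set() for name in names}
    for i, first in enumerate(names):
        for second in names[i + 1:]:
            if tanimoto(fingerprints[first], fingerprints[second]) >= threshold:
                edges[first].add(second)
                edges[second].add(first)
    return edges

def consensus_target(query, edges, annotations):
    """Most frequent target among the annotated neighbours of a query."""
    votes = Counter(
        target
        for neighbour in edges[query]
        for target in annotations.get(neighbour, ())
    )
    return votes.most_common(1)[0][0] if votes else None

# Hypothetical fingerprints (sets of on-bits) and target annotations.
fingerprints = {
    "query": {1, 2, 3, 4},
    "ref_a": {1, 2, 3, 5},
    "ref_b": {1, 2, 3, 4, 6},
    "ref_c": {7, 8, 9},
}
annotations = {"ref_a": ["tubulin"], "ref_b": ["tubulin"], "ref_c": ["kinase_x"]}

edges = build_network(fingerprints, threshold=0.5)
predicted = consensus_target("query", edges, annotations)
```

Because the vote is taken over all network neighbours rather than the single nearest reference compound, one spurious high-similarity match has less influence on the prediction.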

6:30pm-8:30pm CINF 36: Prediction and quantification of cation-π interactions in ligand-bromodomain binding: Using quantum chemistry to capture electronic effects
Wilian Augusto Cortopassi,, Robert Paton

Chemistry Research Laboratory, University of Oxford, Oxford, United Kingdom
CREBBP bromodomains are protein modules that recognize acetylated lysine residues and their selective inhibition shows potential in the design of more effective molecules to treat cancer. Recently, combining experimental and computational studies [1], we have discovered that the ability of a series of dihydroquinoxalinone (DHQ) derivatives to bind to the CREBBP receptor is strongly influenced by the potential to form a cation-π interaction with an active site arginine residue. To gain more insight into the importance of the cation-π interaction for the design of other CREBBP inhibitors, we took the systematic modification of aromatic substituents for a series of 5-isoxazolyl-benzimidazole CREBBP inhibitors and constructed a quantitative structure-activity relationship (QSAR) based on the computed electrostatic potential of each π-system. Our model has been trained against literature data (R2 = 0.88, n=15) of binding affinities of 5-isoxazolyl-benzimidazoles and shows promise in testing against newly synthesized DHQ derivatives. Our work shows that a quantum chemical consideration of the electrostatic potential at a point remote from the ring center is a necessary condition to obtain good quantitative agreement, and leads to an improved qualitative understanding of binding and recognition involving inhibitors in the CREBBP active site.

[1] Rooney, T. P. C. et al. Angew. Chem. Int. Ed. 2014, 126(24), 6240-6244.

6:30pm-8:30pm CINF 37: 3Dmol.js: Chemical structure visualization for the modern web
Jasmine Collins1,, Matthew Ragoza3, Justin Jensen4, David Koes2

1 Computer Science/Neuroscience, University Of Pittsburgh, Pittsburgh, Pennsylvania, United States; 2 Computational and Systems Biology, University of Pittsburgh, Pittsburgh, Pennsylvania, United States; 3 University Of Pittsburgh, Pittsburgh, Pennsylvania, United States; 4 Pittsburgh Science & Technology Academy, Pittsburgh, Pennsylvania, United States

3Dmol.js is an object-oriented JavaScript library for visualizing 3D molecular data in the browser that does not require Java or plugins and provides interactive performance comparable to desktop applications. We will review the essential features of 3Dmol.js as well as describe several recent improvements, including improved rendering styles, support for animation, and improved crystallographic features.

[Figure: Outlined cartoon style]

6:30pm-8:30pm CINF 38: General purpose 2D and 3D similarity approach to identify hERG blockers
Patric Schyman,, Ruifeng Liu, Anders Wallqvist

DoD Biotechnology High Performance Computing Software Applications Institute, Frederick, Maryland, United States
Screening compounds for human ether-à-go-go-related gene (hERG) channel inhibition is an important component of early-stage drug development and assessment. In this study, we developed a high-confidence hERG prediction model based on a combined two-dimensional (2D) and three-dimensional (3D) modeling approach. We developed a 3D Similarity Conformation Approach (SCA) based on examining a limited fixed number of pairwise 3D similarity scores between a query molecule and a set of known hERG blockers. By combining 3D SCA with 2D Similarity Ensemble Approach (SEA) methods, we achieved a maximum sensitivity in hERG inhibition prediction with an accuracy not achieved by either method separately. The combined model achieved 69% sensitivity and 95% specificity on an independent external data set. Further validation showed that the model correctly picked up documented hERG inhibition or interactions among the Food and Drug Administration-approved drugs with the highest similarity scores, with 18 of 20 correctly identified. The combination of ascertaining 2D and 3D similarity of compounds allowed us to synergistically use 2D fingerprint matching with 3D shape and chemical complementarity matching.
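For readers less familiar with the two reported metrics, the short Python sketch below shows how sensitivity and specificity are derived from a confusion matrix. The counts are hypothetical and are not the study's data.

```python
def sensitivity(true_pos, false_neg):
    """Fraction of actual blockers that the model flags."""
    return true_pos / (true_pos + false_neg)

def specificity(true_neg, false_pos):
    """Fraction of actual non-blockers that the model clears."""
    return true_neg / (true_neg + false_pos)

# Hypothetical confusion-matrix counts, not the study's data.
sens = sensitivity(true_pos=40, false_neg=10)
spec = specificity(true_neg=90, false_pos=10)
```

A model tuned for high specificity, as here, raises few false alarms on non-blockers at the cost of missing some true blockers.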

6:30pm-8:30pm CINF 39: Indexing techniques and algorithms to efficiently mine interaction patterns in large sets of protein-ligand-complexes
Therese Inhester2,, Matthias Rarey1

1 University of Hamburg, Hamburg, Germany; 2 Center for Bioinformatics, University of Hamburg, Hamburg, Germany
The number of high-quality protein-ligand structures is increasing every year and opens the route for the detailed knowledge discovery needed for various structure-based molecular design tasks. In particular, the analysis and comparison of spatial arrangements of atoms (interaction patterns) in large sets of protein-ligand interfaces can help deepen the understanding of molecular recognition and improve the design of high-affinity ligands. The deduction of such information, however, requires efficient database mining systems that are capable of dealing with the complexity of varying spatial queries in reasonable time.
To address this need we developed a new database and connected retrieval system which is able to mine large sets of protein-ligand complexes for specific interaction patterns. Highly efficient indexing techniques and different graph-based algorithms are used to rapidly detect all occurrences of a spatial query. The query consists of several atom descriptions which are connected by distance intervals or precompiled interactions. An atom description can be further refined by a molecule type and a substructure it belongs to. Additionally, constraints for physico-chemical properties can be combined with the spatial query. A graphical user interface allows an intuitive definition of queries starting from scratch or from a known binding site. Results can easily be analyzed and compared in a 3D viewer showing the structures aligned to the query.
In this contribution we will present the structure of the spatial queries used in our approach and explain in detail which indexing techniques and algorithms we used to efficiently mine large sets of protein-ligand structures.
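
The query structure described above (atom descriptions connected by distance intervals) can be sketched as a small backtracking search. This is only a toy illustration: the per-element lookup table stands in for the much more elaborate indexing techniques of the actual system, and every name here is hypothetical.

```python
from collections import defaultdict
from math import dist


def build_index(atoms):
    """Map each element symbol to the indices of the atoms carrying it."""
    index = defaultdict(list)
    for i, (element, _xyz) in enumerate(atoms):
        index[element].append(i)
    return index


def find_matches(atoms, query_elements, constraints):
    """Return all assignments of distinct atoms to query positions.

    atoms:          list of (element, (x, y, z))
    query_elements: element required at each query position
    constraints:    list of (qi, qj, dmin, dmax) distance intervals
    """
    index = build_index(atoms)
    matches = []

    def consistent(assignment, q, atom_i):
        # Check every constraint linking position q to an already placed position.
        for qi, qj, lo, hi in constraints:
            other = qj if qi == q else (qi if qj == q else None)
            if other is None or other >= len(assignment):
                continue
            d = dist(atoms[atom_i][1], atoms[assignment[other]][1])
            if not lo <= d <= hi:
                return False
        return True

    def extend(assignment):
        q = len(assignment)
        if q == len(query_elements):
            matches.append(tuple(assignment))
            return
        for atom_i in index.get(query_elements[q], []):
            if atom_i not in assignment and consistent(assignment, q, atom_i):
                extend(assignment + [atom_i])

    extend([])
    return matches
```

For example, a query for an oxygen and a nitrogen 2.5 to 3.5 Å apart would be written as `find_matches(atoms, ["O", "N"], [(0, 1, 2.5, 3.5)])`.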

6:30pm-8:30pm CINF 40: Development and application of multiclass QSAR models for predicting human skin sensitization
Vinicius Alves3,2,, Alexey Zakharov1, Eugene Muratov3, Denis Fourches5, Nicole Kleinstreuer4, Judy Strickland4, Carolina Andrade2, Alexander Tropsha3

1 CADD Group, Chemical Biology Laboratory, Center for Cancer Research, National Cancer Institute, Frederick, Maryland, United States; 2 Faculty of Pharmacy, Federal University of Goias, Goiania, Goias, Brazil; 3 UNC Eshelman School of Pharmacy, University of North Carolina at Chapel Hill, Chapel Hill, North Carolina, United States; 4 Contractor supporting the NTP Interagency Center for the Evaluation of Alternative Toxicological Methods (NICEATM), ILS, Inc., Research Triangle Park, North Carolina, United States; 5 Department of Chemistry and Bioinformatics Research Center, North Carolina State University, Chapel Hill, North Carolina, United States
We have developed classification QSAR models to predict chemical skin sensitization potential (non-sensitizers, weak, moderate, and strong/extreme sensitizers). We have (i) compiled, curated, and integrated the largest publicly available dataset comprising 515 chemicals tested in the LLNA assay; (ii) used this data to generate and validate multi-class QSAR models; (iii) established SAR rules for skin sensitizers; and (iv) employed QSAR models for virtual screening of the COSMOS database of cosmetics inventory. Models developed with SiRMS and PASS/QNA descriptors using random forest and deep learning techniques showed a total average accuracy of 63% and coverage of 59% as applied to several external validation sets. Virtual screening of the COSMOS inventory yielded 250 putative strong skin sensitizers. Structural fragments promoting or interfering with skin sensitization potential as well as structural modifications required to decrease chemical toxicity were identified. Models developed herein could be used to guide the rational design of safe cosmetic products.

6:30pm-8:30pm CINF 41: Virtual screening in the cloud computing environment
Aaron Cooper1,, Mathew Koebel3, Grant Schmadeke1, Suman Sirimulla2

1 Basic Sciences, St. Louis College of Pharmacy, St. Louis, Missouri, United States; 2 Basic Sciences, St. Louis College of Pharmacy, St. Louis, Missouri, United States
Cloud computing is becoming increasingly popular because of its ease, convenience and on-demand access to a shared pool of configurable computing resources. Several industries are turning to cloud technology as an efficient way to improve service quality because it can reduce overhead costs and downtime and automate infrastructure development. Here we present an application that uses the Amazon Web Services platform to run virtual screening in a cloud environment. Virtual screening (VS) is a widely used computational technique in the drug discovery process. It is used to search libraries of small molecules in order to identify the compounds that are most likely to bind the drug target (usually a protein or DNA). It involves running molecular docking calculations on a defined macromolecule against a library of small molecules. Currently there are several publicly available and downloadable chemical libraries (Chembridge, ZINC etc.). These chemical libraries range from a few million to a couple of billion ligand structures. Screening such big libraries is a computationally expensive task. This process can be expedited by running the calculations in a massively parallel fashion on computer clusters containing several nodes. Here we present “VSCloud,” an application that is optimized to run virtual screening on Amazon Web Services (AWS) and is available to users on the AWS Marketplace. It would be an on-demand service for which users pay per use. Users do not have to worry about maintaining the computer hardware, cloud infrastructure or the software.
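
The parallelization idea can be sketched as follows: split the ligand library into chunks, score each chunk independently, then merge and rank the results. This is not the VSCloud implementation. The `dock_score` function is a hypothetical stand-in for a real docking engine, and a local thread pool stands in for cluster nodes.

```python
from concurrent.futures import ThreadPoolExecutor


def dock_score(ligand: str) -> float:
    """Hypothetical placeholder for a docking calculation (lower is better)."""
    return -float(len(set(ligand)))


def chunked(items, size):
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def screen_chunk(chunk):
    """Score every ligand in one chunk; on a cluster this would run on one node."""
    return [(ligand, dock_score(ligand)) for ligand in chunk]


def virtual_screen(library, chunk_size=2, workers=4, top_n=3):
    """Score all chunks in parallel and return the top_n best-scoring ligands."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        parts = pool.map(screen_chunk, chunked(library, chunk_size))
        results = [hit for part in parts for hit in part]
    return sorted(results, key=lambda pair: pair[1])[:top_n]
```

Because each ligand is scored independently of the others, the work divides cleanly across however many workers are available.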

6:30pm-8:30pm CINF 42: Structural evolution of Tcn (n = 4–20) clusters from first-principles global minimization
Chad Priest1,, De-en Jiang2

1 Chemistry, University of California, Riverside, Riverside, California, United States; 2 Department of Chemistry, University of California, Riverside, Riverside, California, United States
We study the structural evolution of Tcn (n = 4–20) clusters using a first-principles global minimization technique, namely, basin-hopping from density functional theory geometry optimization (BH-DFT). BH-DFT permits the exploration of a large configurational landscape for the evolution of Tc nanostructures. The method yielded significantly more stable structures than previous models, indicating the power of DFT-based basin hopping in finding new structures for clusters. The growth sequence and pattern for n from 4 to 20 are analyzed from the perspective of geometric shell formation. The binding energy per atom, relative stability, and magnetic moments are examined as a function of the cluster size. Several magic sizes of higher stability and symmetry are discovered for cluster sizes 6, 10, 12, 14, and 18. In particular, we find that Tc19 prefers an Oh symmetry structure, resembling a piece of a face-centered-cubic metal, and its electrostatic potential map shows interesting features that indicate special reactivity of the corner atoms. Furthermore, when comparing the structural evolution with the neighboring elements manganese and ruthenium, a strikingly similar cubic-like structural pathway is observed with ruthenium, while a disparate icosahedral growth pattern is noticed for the same-group element manganese.
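
The basin-hopping loop itself (random jump, local minimization, Metropolis acceptance) can be sketched on a toy problem. The one-dimensional `energy` function below is an assumed stand-in for a DFT energy, and the crude gradient descent replaces a DFT geometry optimization; none of the parameter values come from the abstract.

```python
import math
import random


def energy(x: float) -> float:
    """Toy multi-well surface standing in for a DFT energy (assumption)."""
    return 0.1 * x * x + math.sin(3.0 * x)


def local_minimize(x: float, step: float = 0.01, iters: int = 500) -> float:
    """Crude gradient descent using a central-difference derivative."""
    h = 1e-6
    for _ in range(iters):
        grad = (energy(x + h) - energy(x - h)) / (2.0 * h)
        x -= step * grad
    return x


def basin_hopping(x0: float, n_hops: int = 100, jump: float = 1.5,
                  temperature: float = 0.5, seed: int = 0):
    """Return (best_x, best_energy) found by hopping between local minima."""
    rng = random.Random(seed)
    current = local_minimize(x0)
    best = current
    for _ in range(n_hops):
        # Perturb the current minimum, then relax into the nearest basin.
        trial = local_minimize(current + rng.uniform(-jump, jump))
        delta = energy(trial) - energy(current)
        # Metropolis criterion: always accept downhill, sometimes accept uphill.
        if delta <= 0 or rng.random() < math.exp(-delta / temperature):
            current = trial
        if energy(current) < energy(best):
            best = current
    return best, energy(best)
```

Comparing only relaxed minima is what lets the search step over barriers that would trap a plain local optimizer.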

CINF: Beyond Digitized Paper: The Next Generation of ELNs 8:15am - 12:00pm
Monday, March 14
Room 24C - San Diego Convention Center
Erin Davis, David Deng, Organizing
Erin Davis, David Deng, Presiding
8:15am-8:20am Introductory Remarks
8:20am-8:45am CINF 50: Toward semantic representation of science in electronic laboratory notebooks (ELNs)
Stuart Chalk,

Department of Chemistry, University of North Florida, Jacksonville, Florida, United States
An electronic laboratory notebook (ELN) can be characterized as a system that allows scientists to capture the data and resources used in performing scientific experiments. This allows users to easily organize and find their data; however, little information about the scientific process is recorded.

In this paper we present initial attempts to integrate an Electronic Notebook Ontology (ENO) into the Eureka! Research Workbench, an open source semantic ELN. A discussion of the ENO, integration into the backend data store of Eureka!, and possibilities using this approach will be presented.

8:45am-9:10am CINF 51: New cloud-based ELN with built-in raw analytical data support and automatic structure confirmation capabilities
Santiago Dominguez Vivero1,, Juan Cobas Gomez1, Santiago Fraga Castro1, Francisco Javier Sardina2

1 Mestrelab Research SL, Hereford, Herefordshire, United Kingdom; 2 Chemistry, University of Santiago de Compostela, Santiago De Compostela, A Coruña, Spain
Laboratory notebooks represent a critical component of the research and development workflow of many companies and academic and research groups. Whilst many research organizations are still using handwritten recording procedures, Electronic Laboratory Notebooks (ELNs) are progressively replacing traditional paper books in both commercial research establishments and academic institutions.
A plethora of new ELN products have been rolled out across the pharmaceutical and chemical industry by a rising legion of vendors in recent years. The result is a motley crew of ELNs, ranging from generic authoring tools to custom solutions dedicated to very specific scientific disciplines.

In this talk we present Mbook, a new cloud-based ELN (in-house or client-server versions are also supported) specially designed for the field of organic synthesis, which implements unique features for analytical data handling that make it stand apart from other ELN solutions. These include:

- Ability to automatically process, analyse, store and report raw NMR data recorded on instruments from all NMR manufacturers (both high-field and benchtop NMR instruments are supported) as well as LC/GC/MS data acquired in many different formats.
- A new powerful fully automatic structure verification (ASV) system of small molecules using NMR, LC/GC/MS or both jointly. More specifically, this ASV module can be used with raw samples before purification to get a quick assessment about whether or not the expected product has been successfully produced as the result of a reaction. In addition, it can also be used with purified samples to get a higher degree of confidence of the proposed molecular structure, for example for registration, which is also supported in one single click by the integration between Mbook and a cloud-based Mestrelab registration system.

NMR data uploaded to Mbook and viewed within its web interface

9:10am-9:35am CINF 52: Mobile interfaces for a digital research notebook
Jeremy Frey2,, Cerys Willoughby2, Simon Coles1, Richard Whitby3, Colin Bird2

1 University of Southampton, Hampshire, United Kingdom; 2 University of Southampton, Southampton, United Kingdom; 3 University of Southampton, Southampton, Hants, United Kingdom
Despite the clear advantages of digital notebooks over the traditional paper log, laboratory researchers continue to show reluctance to make the transition. Some of the rationale for their inhibitions arises from the ease with which they can note their observations on paper, whereas their desktop computer will almost certainly be located outside the lab. It is an undeniable fact that capturing experiment metadata at source will produce a more reliable record than relying on subsequent recall.

Using mobile devices to curate experimental observations in the laboratory can overcome this difficulty. In this paper we describe the evolution of mobile experiment recording at Southampton, from an early prototype developed during the CombeChem project through to Notelus, a native iPad application that can interface directly to an ELN and also has a linked experiment planning application. We project forward to the vision of a generic Digital Research Notebook (DRN) that has all the required functionality, implemented with an Application Programming Interface (API) that enables the DRN to be interface-agnostic. Experiments would be recorded in the lab using a mobile (e.g., tablet, phone, camera) interface, using cues to encourage metadata capture at source. Subsequent data analysis would be done out of the lab, using a desktop or web interface to the same API as used by the mobile device.

9:35am-10:00am CINF 53: Not just another reaction database
Aileen Day2, Valery Tkachenko2,, Alexey Pshenichnov2, Leah McEwen1, Simon Coles3, Richard Whitby3

1 Clark Library, Cornell University, Ithaca, New York, United States; 2 Royal Society of Chemistry, Rockville, Maryland, United States; 3 University of Southampton, Hampshire, United Kingdom
The need for a high quality reaction database underpins synthetic reaction planning, as highlighted by the roadmap of the Dial-a-Molecule grand challenge [1] (the aims of which are to be able to predict the outcome of a reaction a priori and therefore generate products on demand, and also to optimise a reaction).
A number of reaction databases are available [2]; most of these focus on storing basic reaction schemas and details, and link to publications for more information. However, their main limitation is that, because their major source is abstraction of the published literature, they record insufficient structured reaction detail:
- for someone else to reproduce the reaction
- to fully record all reaction products (not just the target product)
- to capture previous attempts to reach the optimised reaction route, so that this 'work-up' can be correlated to allow better prediction of reaction outcomes.
As a result, the reactions domain of the chemical data repository that the Royal Society of Chemistry is developing will capture:
- reactions and processes directly from Electronic Lab Notebooks
- reactions which gave low yields or unintended products
- processes, parameters and equipment in S88 process recipe [3] style for maximum reproducibility
- multistep reactions
- reactants, products etc. not just as small organic molecules
- raw characterisation data linked to products
We will demonstrate a first version, populated with reactions text-mined from RSC articles and examples of notebook reactions and processes as recorded by an academic research group at Cornell University.
[1] Dial a Molecule Grand Challenge, (accessed Oct 8, 2015)
[2] Organic Chemistry Resources Worldwide, (accessed Oct 8, 2015)
[3] ISA, 'Batch Control Part 1: Model and Terminology,' The International Society for Measurement and Control, ISA Press, ISA - S88.01-1995

10:00am-10:15am Intermission
10:15am-10:40am CINF 54: Directly upload data from an ELN into PubChem
Ben Shoemaker,, Asta Gindulyte, Evan Bolton, Steve Bryant

NCBI / NLM / NIH, Bethesda, Maryland, United States
Managing data in a laboratory across multiple projects, instruments, collaborators and people coming and going requires the organization and flexibility that an electronic laboratory notebook (ELN) can provide. Such automation at publication time, however, falls short when it comes to submitting data to a public database. Public accessibility of data is often a required step in journal publication or grant renewal. Public repositories offer a stable, well-publicized platform in which to house such data. During the time crunch of publishing or an administrative deadline, an additional data upload procedure must be initiated to load and often reformat the data to maximize its exposure and usability.
PubChem is providing the PubChem Upload RESTful programming interface (API) to allow ELN providers to directly incorporate data publishing into PubChem as a built-in tool or button. A suite of features lets a submitter upload data while scheduling a delayed release. Such flexibility provides the submitter with immediate, uniquely-generated PubChem data identifiers to prove public submission while timing the actual release with a publication or other future date. In addition, semi-private view codes can be generated to share access with collaborators and administrators. Simultaneous access to the PubChem Upload web interface is available throughout the process.
The PubChem Upload API offers seamless integration of public data reporting via existing ELN systems.

10:40am-11:05am CINF 55: Intuitive collaboration platform: A Scilligence story
Rajeev Hotchandani1,, Jinbo Lee2

1 Scilligence, Watertown, Massachusetts, United States; 2 Scilligence Corporation, Burlington, Massachusetts, United States
Scilligence is a leading innovator of cross-platform, mobile cheminformatics and bioinformatics solutions. Scilligence's proprietary technologies address three main areas of R&D informatics needs: knowledge management and collaboration; knowledge mining of unstructured data; project and material management.
Scilligence’s enterprise platforms, such as its ELN, enhance knowledge sharing and the productivity of researchers in the discovery and development of small-molecule and biologics therapeutics. Scilligence ELN is a cross-platform Electronic Laboratory Notebook for research and education. It has been widely adopted across industry and academic institutions handling a multitude of internal and external collaborations. What’s unique about Scilligence ELN is its powerful data mining technology and advanced informatics for biologics and conjugates such as ADCs (antibody-drug conjugates). Scilligence ELN supports all disciplines of research including medicinal chemistry, process chemistry, bioassays, HTS, in vivo pharmacology and toxicology.
Scilligence’s applications are specifically designed to minimize IT footprints and require no client-side installation.




11:05am-11:30am CINF 56: ACAS LIMS simplifies diverse data loading, management, and querying
John McNeil,, Guy Oshiro,, Brian Fielder,, Eva Gao, Samuel Meyer, Brian Bolt, Fiona McNeil, Matthew Shaw, Kelley Carr

John McNeil & Co., San Diego, California, United States
Traditional ELNs have historically struggled with the integration of disparate experimental data into the digital notebook. Limitations on what types of data can be loaded and the complexity of reformatting data are barriers to effective usage. Traditional ELNs capture the information about the experiment but do not facilitate querying and reporting of the experimental data. We developed ACAS LIMS (Assay Capture & Analysis System) to provide a streamlined approach to load disparate assay types, and built a powerful querying and reporting engine to transform the data into knowledge. The cornerstone of ACAS is the Experiment Loader module, which enables the rapid entry of experimental and computational data into the system while capturing context about the protocol and experiment. ACAS LIMS also includes a Compound Registration module that tracks parent compounds, lots, and synthesis reactions. The ACAS Data Viewer module facilitates easy retrieval of all of the linked data by structure search or simple text searches. ACAS is an integral part of the experimental data processing workflow.

11:30am-11:55am CINF 57: ChemEngine: An automated chemical data harvesting tool for molecular inventory and chemical computing from scientific literature

Muthukumarasamy Karthikeyan1,, Renu Vyas2

1 Digital Information Resource Centre, CSIR National Chemical Laboratory, Pune, India; 2 Chemical Engineering and Process Development, CSIR-National Chemical Laboratory, Pune, MH, India
There is an urgent need for scientific data sharing and data standardization in the public domain, as required for research supported by government grants. Chemical data published in the scientific literature is usually very hard to retrieve, recover and re-use. Chemoinformatics tools are mature enough to harvest chemical data to a certain extent, especially for converting names to structures and vice versa. Tools have also been developed to convert molecular data in image format into the corresponding connection tables and chemical names, with moderate success. Recently we developed a chemoinformatics application for harvesting chemical data from scientific literature in plain text and PDF formats that goes beyond simple name-to-structure conversion. The methodology involves parsing raw data from scientific literature in PDF format, recognizing the chemical data, and converting it efficiently into molecular data for reuse in inventory (ELN applications) and chemical computing (using QM/QC tools). The truly computable molecules generated using this approach were directly subjected to further atomic-level energy calculations. The challenges of selectively recognizing chemical data within a large amount of non-chemical data and applying it to inventory applications and further computing will be discussed. ChemEngine, the chemoinformatics tool developed for this approach, will be presented with case studies, especially on performance and optimization.


11:55am-12:00pm Concluding Remarks
CINF: Global Initiatives in Research Data Management & Discovery 8:15am - 11:55am
Monday, March 14
Room 25B - San Diego Convention Center
Ian Bruno, Leah McEwen, Organizing
Leah McEwen, Presiding
Cosponsored by: ANYL, COMP, MEDI and PHYS
8:15am-8:20am Introductory Remarks
8:20am-8:45am CINF 43: PubChem BioAssay: A decade’s practice for managing chemistry research data
Yanli Wang,

NCBI, NLM, NIH, Building 38A, Room 5S506, 8600 Rockville Pike, Bethesda, Maryland, United States
The PubChem BioAssay database was created in 2004 by the National Center for Biotechnology Information (NCBI) as a public repository for biochemical biology and medicinal chemistry data on small molecules. The database now contains over 1,000,000 bioassay records (BioAssay accession, AID), 200 million bioactivity outcomes, and tens of thousands of protein and gene targets. Building this public information system has been an effort with challenges on many fronts. This presentation will describe the project history, the ten-year, multiple-cycle, and still continuing development, and the current functionalities provided at PubChem. BioAssay data may be freely accessed and downloaded using the NCBI information retrieval system Entrez at A suite of services provided by PubChem are available at, including the most recent development for the BioAssay record page at Chemical structures and assay results may be deposited via the submission system at: PubChem welcomes feedback and contribution from the community.

8:45am-9:15am CINF 44: Data infrastructural design for informing critical evaluation
Kenneth Kroenlein,

Thermodynamics Research Center, National Institute of Standards and Technology, Boulder, Colorado, United States
Exponential growth rates in data generation combined with non-negligible error rates in the scientific literature [1] have conspired to make critical evaluations impracticable in many scenarios for the average scientist. Data volumes have grown to such a degree that many traditional data collection and interpretation approaches cannot scale sufficiently to remain comprehensive and current, or to effectively track shifting interests within research and industrial communities. It is thus necessary to strongly rely on a substantially increased role for digital archives, automated analysis and machine learning approaches.

The approach adopted at the Thermodynamics Research Center (TRC) at the National Institute of Standards and Technology (NIST) is dynamic data evaluation, whereby a reliable and comprehensive data archive is used in conjunction with an algorithmically encoded expert analysis in order to generate up-to-date property recommendations. In developing an infrastructure to support this high-throughput critical analysis for thermophysical properties, TRC staff have created technological solutions for data collection, curation and communication to quickly meet the challenges encountered under real-world conditions. These include user experience-driven data entry tools, and international data standards and ad hoc derivatives of them. The particulars of these tools will be discussed, as well as the importance of being both proactive and reactive in the information technology development process.

[1] Chirico, R.D.; Frenkel, M.; Magee, J.W.; Diky, V.; Muzny, C.D.; Kazakov, A.; Kroenlein, K.; Abdulagatov, I.; Hardin, G.; Acree, W.E.; Brennecke, J.F.; Brown, P.L.; Cummings, P.T.; de Loos, T.W.; Friend, D.G.; Goodwin, A.R.H.; Hansen, L.D.; Haynes, W.M.; Koga, N.; Mandelis, A.; Marsh, K.N.; Mathias, P.M.; McCabe, C.; O’Connell, J.P.; Pádua, A.; Rives, V.; Schick, C.; Trusler, J.P.M.; Vyazovkin, S.; Weir, R.D.; Wu, J. “Improvement of Quality in Publication of Experimental Thermophysical Property Data: Challenges, Assessment Tools, Global Implementation, and Online Support” J. Chem. Eng. Data 2013, 58, 2699-2716.

9:15am-9:40am CINF 45: Community-driven disciplinary data repositories: A case study
Ian Bruno,, Colin Groom

Cambridge Crystallographic Data Centre, Cambridge, United Kingdom
In 2015 we celebrated the fiftieth anniversary of the Cambridge Structural Database, the world’s primary repository for small molecule crystal structure data. It has come a long way since its origins as a series of printed indices first published in 1965. Today crystallographers across the globe deposit over 60,000 datasets per year and researchers across disciplines apply the knowledge derived from over 800,000 structures to a range of scientific challenges.

A key driver in the development of the Cambridge Structural Database has been the input and support of the wider research community. This has come from academia and industry, publishers and researchers, scientific unions and individuals. This presentation, timed to coincide with the end of the 50th anniversary year of the Cambridge Structural Database, will provide a brief reflection on the last half century. We’ll look at the key moments where community engagement was pivotal to the establishment and evolution of a widely respected disciplinary data repository.

9:40am-10:10am CINF 46: ICSU World Data System: Trusted data services for global science
Mustapha Mokrane1, Jean-Bernard Minster2,, Rorie Edmunds1

1 International Programme Office, ICSU World Data System, Koganei, Tokyo, Japan; 2 Institute of Geophysics and Planetary Physics, Scripps Institution of Oceanography, La Jolla, California, United States
Today’s research is international, transdisciplinary, and data-enabled, which requires scrupulous data stewardship, full and open access to data, and efficient collaboration and coordination. New expectations on researchers based on policies from governments and funders to share data fully, openly, and in a timely manner present significant challenges but are also opportunities to improve the quality and efficiency of research and its accountability to society. Researchers should be able to archive and disseminate data as required by many institutions or funders, and civil society to scrutinize datasets underlying public policies. Thus, the trustworthiness of data services must be verifiable. In addition, the need to integrate large and complex datasets across disciplines and domains with variable levels of maturity calls for greater coordination to achieve sufficient interoperability and sustainability.

The World Data System (WDS) of the International Council for Science (ICSU) promotes long-term stewardship of, and universal and equitable access to, quality-assured scientific data and services across a range of disciplines in the natural and social sciences. WDS aims at coordinating and supporting trusted scientific data services for the provision, use, and preservation of relevant datasets to facilitate scientific research, in particular under the ICSU umbrella, while strengthening their links with the research community. WDS certifies its Members, holders and providers of data or data products, using internationally recognized standards, thus providing the building blocks of a searchable common infrastructure from which a data system that is both interoperable and distributed can be formed.

This presentation will describe the coordination role of WDS and more specifically activities developed by its Scientific Committee to:
– Improve and stimulate basic level Certification for Scientific Data Services, in particular through collaboration with the Data Seal of Approval.
– Identify and define best practices for Publishing Data and to test their implementation by involving the core stakeholders, namely, researchers, institutions, data centres, scholarly publishers, and funders.
– Establish an open WDS Metadata Catalogue, Knowledge Network, and Global Registry of Trusted Data Services.

10:10am-10:25am Intermission
10:25am-10:55am CINF 47: STRENDA and MIRAGE: Examples of community-based data reporting standardization initiatives
Martin Hicks,, Carsten Kettner,

Beilstein Institut, Frankfurt, Germany
An essential requirement for scientific progress is unrestricted access to research results in a form that is directly usable by researchers. However, there are many deficiencies in the way that data are currently reported, resulting often in incomplete and even unusable data sets that are not suitable for subsequent research and knowledge generation. There are various reasons for this, ranging from the lack of a framework for structured and standardized data reporting to the largely outdated infrastructure for reporting and publishing scientific research results. The diverse data requirements in scientific research mean that no one-size-fits-all solution would be applicable; thus domain-specific guidelines and infrastructure for reporting and data management are required.

The Beilstein-Institut has initiated and runs two data standards projects: STRENDA, which is concerned with the standardization of reporting enzymology data, and MIRAGE, with the reporting of glycomics experimental results. Each project is made possible and advanced by a commission of experts in these fields, who work together in defining the reporting standards. The STRENDA reporting guidelines were published in 2010 and have been recently implemented as a web-based front-end to the STRENDA-DB, enabling an open access database of validated enzymatic experimental data to be built up. This talk will address the issues that have had to be overcome in setting up an effective bottom-up mechanism to create workable solutions – from scientists for scientists – in these projects.

10:55am-11:25am CINF 48: Standardizing the description of nanomaterials: The CODATA uniform description system
John Rumble1,, Steven Freiman2, Clayton Teague3

1 R&R Data Services, Gaithersburg, Maryland, United States; 2 Freiman Consulting, Potomac, Maryland, United States; 3 Teague Consulting, Gaithersburg, Maryland, United States
The complexity and newness of nanomaterials have made describing them accurately a challenge. Traditional nomenclature systems for chemicals and bulk engineering materials do not capture the nanoscale details and features that give nanomaterials their interesting properties. CODATA has established an international, multi-disciplinary working group to develop a uniform description system for materials on the nanoscale (UDS). The UDS has been designed to meet the needs of diverse user communities, including researchers, regulators, and database managers in disciplines ranging from chemistry and materials science to food science, nutrition science, and toxicology. The UDS has identified the information categories and descriptors useful for describing individual nano-objects and collections of nano-objects, including those in various media such as biological and environmental fluids, as well as nano-objects embedded in bulk materials. The CODATA UDS is freely available at for use in designing database schemas, developing ontologies for nanomaterials applications, reporting experimental and computational results in the literature, and depositing results into nanomaterials repositories. This presentation will briefly review the history of the CODATA UDS as well as describe the UDS itself.

11:25am-11:55am CINF 49: Scientific units in the electronic age
Stuart Chalk,

Department of Chemistry, University of North Florida, Jacksonville, Florida, United States
Scientists have standardized on the SI unit system since the late 1700s. While much work has been done over the years to refine and redefine the system, little has formally been done to standardize the representation of the SI units in electronic systems.

This paper will present a summary of current efforts toward electronic representation of scientific units, an analysis of needs for current computer/network systems, and an outline of future work.

CINF: Informatics & Quantum Mechanics: Combining Big Data & DFT in Pharma & Materials 8:40am - 12:00pm
Monday, March 14
Room 25A - San Diego Convention Center
Art Cho, Organizing
Art Cho, Presiding
8:40am-8:45am Introductory Remarks
8:45am-9:15am CINF 58: Screening of materials for energy applications based on transport properties: Methods and data automation tools

Boris Kozinsky,

Bosch Research, Waban, Massachusetts, United States
Design of new functional materials relying on transport phenomena is complicated by the highly nonlinear sensitivity of conductivity to structural and composition changes. This makes brute-force computational screening impossible and requires the development of descriptors and efficient approximations to narrow down the space of possibilities. I will briefly present our recent efforts on developing practical methods and data-driven approaches for the discovery and design of battery and thermoelectric materials. In each case there is a need to automate computational tools, organize and analyze the data, and preserve a full record of data flow for reproducibility, while allowing for data sharing. The resulting workflows and data formats are heterogeneous, and an automation platform is needed that is flexible enough to cover the common requirements and to leave the API interfaces open for implementation of specific scientific plug-ins by the users. The necessary features include tight coupling of data capture with automation, connecting computational engines in a high-level working environment, recording complete provenance information, and organizing data in an efficiently queryable form. Finally, data science tools are also needed for analyzing transport data and for extracting and validating trends to be used in iterative screening. I will highlight our current efforts to implement an open-source platform aimed at satisfying these requirements.

9:15am-9:45am CINF 59: High-throughput chemical simulations and virtual screening for materials discovery
Mathew Halls,, David Giesen, Thomas Hughes, Shaun Kwak, Thomas Mustard, Jacob Gavartin, Alexander Goldberg, Yixiang Cao

Schrodinger Inc., San Diego, California, United States
Virtual screening is an approach first developed and applied in the pharmaceutical industry to identify leads in the drug discovery process. The process involves the automated computational analysis and subsequent filtering of chemical structure libraries based on predicted properties to identify promising systems for further investigation. Virtual screening for materials solutions is a promising new development. Advances in the efficiency of simulation codes and the significantly improved performance of commodity computing resources have dramatically reduced the time required for analysis, pushing the applicability from small molecules to extensive surface models and bulk systems. Moreover, electronic structure and molecular dynamics packages are extremely robust for routine analysis and property prediction, usually requiring no user intervention once the chemical models and parameters have been decided. This makes automated property prediction possible for candidate systems with varying structure and composition. The structure library can then be sorted and ranked to identify lead systems and estimate critical structure-property limits across a target chemical design space. An alternative approach to exhaustive screening involves the automated evolution of a set of input structures toward target property characteristics using a simulation-informed genetic algorithm. An evolutionary approach dramatically reduces the number of simulations needed to identify chemical systems having the desired property profile, and samples chemical space not covered by deterministic library generation. In this presentation, examples of the use of high-throughput chemical simulation for materials discovery are presented.
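
The filter-then-rank step described in this abstract can be illustrated with a minimal Python sketch. It is not the authors' workflow: the candidate names, the two properties, the numerical values and the thresholds are all invented placeholders, and the predicted properties are assumed to have been computed already by some simulation package.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    name: str            # hypothetical identifier
    band_gap_ev: float   # hypothetical predicted property
    stability: float     # hypothetical predicted score, higher is better


def screen(library, min_gap, max_gap, top_n):
    """Keep candidates whose predicted gap lies in [min_gap, max_gap],
    then rank the survivors by stability, best first."""
    passed = [c for c in library if min_gap <= c.band_gap_ev <= max_gap]
    ranked = sorted(passed, key=lambda c: c.stability, reverse=True)
    return ranked[:top_n]


# Toy library with made-up numbers, for illustration only.
LIBRARY = [
    Candidate("A", 1.2, 0.40),
    Candidate("B", 2.1, 0.90),
    Candidate("C", 2.4, 0.75),
    Candidate("D", 3.6, 0.95),
]

leads = screen(LIBRARY, min_gap=2.0, max_gap=3.0, top_n=2)
print([c.name for c in leads])
```

Candidate D has the best stability score but is removed by the filter before ranking, which is the point of filtering first: ranking only ever sees systems that already satisfy the hard constraints.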

9:45am-10:15am CINF 60: Machine learning and high-throughput quantum chemistry methods for the discovery of organic materials
Alan Aspuru-Guzik,

Harvard University, Cambridge, Massachusetts, United States
In this talk, I will overview my group's efforts towards the discovery of organic materials. I will focus on the methods that we employ such as neural fingerprints, deep neural networks, Gaussian processes and even simple linear regressions to correlate theory and experiment. I will describe ways of accelerating the exploration of the chemical space and strategies for close collaboration with experimental partners. I will overview briefly the functionality of our software stack. Applications include organic light-emitting diodes, organic flow battery materials and organic photovoltaics. I will talk about at least one of these, again, focusing on the computational methodologies and lessons learned.

10:15am-10:30am Intermission
10:30am-11:00am CINF 61: Using drug discovery methods to accelerate the search for better battery materials
Joshua Schrier,

Chemistry, Haverford College, Haverford, Pennsylvania, United States
Redox flow batteries (RFB) using water-soluble organic redox couples are a new strategy for low-cost, eco-friendly, and durable stationary electrical energy storage. To be useful, these molecules must have extreme (either high or low) oxidation/reduction potentials and high aqueous solubility. Like many molecular design problems, the search space of possible functional derivatives is too large to be completely explored with ab initio calculation, but cheminformatics methods can make the search tractable.

In this talk, I'll describe our exploration of 10^5 possible thiophenoquinone derivatives. By using existing cheminformatics tools from the drug-discovery community to eliminate insufficiently soluble compounds from our pipeline, we reduced the space to a more tractable 10^3 compounds. Using exhaustive B3LYP/6-311+G(d,p) thermochemical calculations with the SMD solvation model (which reproduce experimental reduction potentials to within ±0.04), we computed redox voltages and free energies of solvation for all of these compounds, resulting in 51 new candidates with the high solubility and wide voltage range needed for high performance aqueous RFB applications.

This ab initio data-set provided us with the opportunity to develop and test cheminformatics and lead-screening strategies for finding high-performing battery materials. A group-additivity model predicts the redox potential to within ±0.09 V, and can be trained with as few as 200 examples. Surprisingly, the 'quantum-free' group-additivity model was more accurate than models that used the ab initio LUMO energy or information from semiempirical Hückel calculations as descriptors. Using these models to perform simulated screening experiments, we found that 'active' (high voltage or low voltage species) candidates could be identified with an enrichment factor of 2-6, depending on the model and framework type. Having validated this drug-discovery inspired approach with an exhaustive dataset (where we know the 'right' answer), we are now applying this to the much larger space of phenazine derivatives for aqueous redox batteries, and will describe preliminary results on that effort.
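
The enrichment factor mentioned above can be made concrete with a short Python sketch. The abstract does not say which definition was used, so this assumes the common one: the hit rate among the top-scoring fraction of a library divided by the hit rate in the whole library. The function name, the scores and the activity labels are illustrative assumptions, not data from the talk.

```python
def enrichment_factor(scores, is_active, fraction):
    """EF = (hit rate in the top-scoring fraction) / (hit rate overall).

    scores    -- model scores, higher means "more likely active"
    is_active -- booleans of the same length, the known right answers
    fraction  -- share of the library that is selected, in (0, 1]
    """
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    if len(scores) != len(is_active):
        raise ValueError("scores and is_active must have the same length")
    n_total = len(scores)
    n_actives = sum(is_active)
    if n_total == 0 or n_actives == 0:
        raise ValueError("need a non-empty library with at least one active")

    # Select the top-scoring compounds (at least one).
    n_selected = max(1, round(fraction * n_total))
    order = sorted(range(n_total), key=lambda i: scores[i], reverse=True)
    hits = sum(is_active[i] for i in order[:n_selected])

    return (hits / n_selected) / (n_actives / n_total)


# Toy example with made-up values: 10 compounds, 2 of them active.
toy_scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.0]
toy_active = [True, False, False, True, False,
              False, False, False, False, False]
print(f"EF at 20%: {enrichment_factor(toy_scores, toy_active, 0.2):.2f}")
```

An enrichment factor of 1 means the model does no better than picking compounds at random; values above 1 mean the selected subset is richer in actives than the library as a whole.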


11:00am-11:30am CINF 62: Combining density functional theory with cheminformatics for development of a new-paradigm ligand screening method in computational drug discovery
Art Cho1,2,

1 Korea University, Seoul, Korea (the Republic of); 2 Quantum Bio Solutions, Seoul, Korea (the Republic of)
Density functional theory (DFT) has been successfully applied to many fields for its efficiency and versatility. Materials science and quantum chemistry are examples of fields in which DFT methods are inseparable from current research. On the other hand, it has been rare to utilize DFT for biological problems until recently. Computational drug discovery, in which protein docking is the central methodology, is one such field. This is due to two main reasons: 1. biomolecules are large compared to other chemical molecules and therefore it would be prohibitively time-consuming to run DFT calculations on the whole systems, and 2. most interactions within biomolecular systems are non-quantum in nature. However, it turns out that there are cases in computational drug discovery for which a molecular-mechanical-level description is not enough. For this reason, we have developed a series of methods incorporating DFT calculations for computational drug discovery, which proved to be effective for a number of different classes of problems. In order to take advantage of the power of DFT in virtual ligand screening, however, something must be done on the other side of the process because of the time-consuming nature of quantum-level calculations. For this, we have envisioned a paradigm-shifting ligand screening method, in which the ligand libraries to be screened contain far fewer compounds than current drug-like libraries, yet effectively span a much larger compound space.
In this talk, I will briefly summarize our previous efforts in the development of protein docking methods using DFT calculations and then present a new cheminformatic method that will enable their use in an industrial pharmaceutical environment.

11:30am-12:00pm CINF 63: Discovery through deterministic optimization: Navigating chemical space for effective material design

Jennifer Elward,, Christopher Rinderspacher

Army Research Laboratory, Aberdeen Proving Ground, Maryland, United States
Computational molecular design and optimization increasingly play a critical role in the design of novel materials. Part of the desirability of optimization lies in the ability to traverse and explore the largely untapped potential of chemical space in a manner that will most benefit the materials discovery process. In the present work, we have developed a deterministic, constrained optimization method that combines the satisfaction of multiple constraints with efficient navigation of a large optimization space. Density functional theory provides the computational backbone to the optimization and was chosen due to the tradeoff between speed and accuracy and its wide application base. We compared a variety of breadth-first search and gradient-analogous local search algorithms for the optimization process. Each of the algorithms has been benchmarked with respect to efficiency and performance. One of the key benefits of utilizing deterministic techniques in this work is the ability to retain chemical and structural information at each step of the optimization procedure. This feature has allowed for detailed visual analysis of the algorithmic path from input structure to final candidate material and has been beneficial in both the structure search and further algorithm development. In addition, large data libraries can be created and examined to produce qualitative structure-property relationships based on the optimization constraints. At present, this method has been successfully applied to a number of materials science systems of interest, including high-hyperpolarizability materials for optics applications and energetic materials. It was found in each case that it was only necessary to explore a small fraction of chemical space (< 1%) to find performant candidates which satisfy the optimization constraints.

CINF: Chemical Information for Small Businesses & Startups 1:00pm - 4:55pm
Monday, March 14
Room 24C - San Diego Convention Center
Edlyn Simmons, Organizing
Edlyn Simmons, Presiding
Cosponsored by: CPRM and SCHB
1:00pm-1:15pm Introductory Remarks
1:15pm-1:40pm CINF 72: Building a business with and without scientific computing: The five W's and one H
Steven Muskal,

Suite 103-475, Eidogen, Oceanside, California, United States
Startups and small businesses today have a clear and distinct advantage over their predecessors. Coupling very experienced resource pools of contracted and/or outsourced staff with utility (i.e. cloud-based) and mobile computing can equip such businesses with an unprecedented level of capability. Open source and multiple options for lower cost technologies and content also represent exciting new opportunities or a quagmire, depending on your background and experience. Having lived through both the sell- and buy-sides of scientific research computing over the last 25 years, we will discuss the why's, when's, who's, what's, where's and how's for building a scientific research computing effort.

1:40pm-2:05pm CINF 73: Interactive cheminformatics for occasional use in SMEs
Therese Inhester1,, Matthias Hilbig3, Matthias Rarey2

1 Center for Bioinformatics, University of Hamburg, Hamburg, Germany; 2 University of Hamburg, Hamburg, Germany

In the past decade, more and more data sets of chemical compounds became freely available, due to large chemical databases such as ChEMBL or PubChem as well as to vendors that provide their catalogs electronically. This large number of freely available small-molecule data sets opens the route for a large community of life scientists, including small businesses and start-ups, to profit from this data wealth. For this purpose, easy-to-use but precise cheminformatics software tools are required to perform elementary tasks. These tasks can span from chemical library browsing to removal of duplicates and filtering by physicochemical properties, which is often required prior to virtual screening. Moreover, compound collections need to be compared and merged according to different annotations, e.g. to find all compounds which are in a vendor catalog and also listed in ChEMBL.
We propose a new intuitive approach to interactively manipulate compound collections. Our software MONA [1] is able to rapidly perform different set operations on molecule sets, as well as filtering and visualizing large compound collections. Using the recent second release of MONA [2], arbitrary molecule properties can be added to the molecules and can be used for filtering, too. Furthermore, similarity clustering and structure depiction alignments strongly improve the visual comparison of molecules. With the help of MONA, standard processes on annotated molecule collections can easily be performed on a regular personal computer by scientists with occasional cheminformatics needs, such as those that arise in biotechnology SMEs. We will present different application scenarios in order to demonstrate the utility of our approach.
[1] Hilbig M. et al., MONA – Interactive manipulation of molecule collections. Journal of Cheminformatics 2013, 5:38.
[2] Hilbig M. and Rarey M., MONA 2: A Light Cheminformatics Platform for Interactive Compound Library Processing. J. Chem. Inf. Model. 2015.
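
The set operations on compound collections described in this abstract (duplicate removal, "in both", "only in one", merging annotations) can be sketched in plain Python. This is not MONA's implementation or API. The sketch assumes each record already carries a canonical structure identifier under the hypothetical key `id` (in practice something like an InChIKey computed by a cheminformatics toolkit); the identifiers and annotation values below are made up.

```python
def to_index(records, key="id"):
    """Map structure identifier -> record; duplicates are dropped,
    keeping the first occurrence."""
    index = {}
    for rec in records:
        index.setdefault(rec[key], rec)
    return index


def in_both(a, b):
    """Identifiers present in both collections (set intersection)."""
    return sorted(a.keys() & b.keys())


def only_in(a, b):
    """Identifiers present in a but not in b (set difference)."""
    return sorted(a.keys() - b.keys())


def merged(a, b):
    """Union of both collections. Annotations from b are added to those
    from a and override them where the same field appears in both."""
    out = {k: dict(v) for k, v in a.items()}
    for k, v in b.items():
        out.setdefault(k, {}).update(v)
    return out


# Made-up example data: a vendor catalog and a reference database extract.
vendor = to_index([
    {"id": "mol-1", "price": 10.0},
    {"id": "mol-2", "price": 12.5},
    {"id": "mol-2", "price": 99.0},   # duplicate structure, ignored
    {"id": "mol-3", "price": 7.0},
])
reference = to_index([
    {"id": "mol-2", "activity": "active"},
    {"id": "mol-4", "activity": "inactive"},
])

print(in_both(vendor, reference))
print(only_in(vendor, reference))
print(merged(vendor, reference)["mol-2"])
```

Keying every collection by one canonical identifier is the design choice that makes the rest trivial: once two collections agree on what counts as "the same molecule", comparison and merging reduce to ordinary dictionary and set operations.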

2:05pm-2:30pm CINF 74: Playing by the rules: Knowing what applies and what information you have to maintain regarding your chemical inventory
Frankie Wood-Black,

Ag., Science and Engineering, Northern Oklahoma College, Ponca City, Oklahoma, United States
Think about your business. Do you know exactly what chemicals you have on hand? Do you know if you are subject to various regulations such as SARA, TSCA, or DHS? How can you tell if they apply to you? When working in a small chemical business or starting up a business, you have a number of things that you have to manage. Knowing and understanding what you have, what rules apply and what documentation you need to maintain is just one more thing that you need to consider in how you manage your business. This paper focuses on some of the key environmental regulations that you may or may not be aware of when dealing with a business that uses and maintains a chemical inventory.

2:30pm-2:55pm CINF 75: ChemSpider: Search and share chemistry… for free
Serin Dabb,

The Royal Society of Chemistry, Cambridge, United Kingdom
ChemSpider is a free chemical structure database providing fast text and structure search access to over 35 million structures from hundreds of data sources. This presentation will demonstrate the content of ChemSpider, where we get our data from and how we aggregate it into one interface. ChemSpider is provided for the chemistry community, and we encourage researchers to curate and add more content. It provides chemical structural information, physical and chemical properties, in addition to literature references including patents. The Royal Society of Chemistry is committed to working with the community to improve access to free tools and services around data; other examples of these will be discussed.

2:55pm-3:10pm Intermission
3:10pm-3:35pm CINF 76: What chemists and other scientists need to know about their duty of disclosure under the new law governing the patenting process in the US
Xavier Pillai,

Leydig Voit Mayer Ltd, Chicago, Illinois, United States
Scientists and inventors are aware that most patent applications filed after March 16, 2013 will be examined under the Leahy-Smith America Invents Act (AIA), which was enacted into law on September 16, 2011. That is, the patent applications will be examined on the basis of the first inventor to file the patent application as opposed to the first one to invent. The law brought about many changes to patent practice. In particular, the law expanded the scope of prior patents and publications (prior art) that can be applied by an examiner against a patent application, which in turn expanded the scope of the duty of disclosure by the patent applicants. This talk will address the expanded scope of the applicable prior art and the attendant duty of disclosure.

3:35pm-4:00pm CINF 77: Monitoring the minnows: Using IP information to understand what small businesses are doing
Stephen Adams,

Magister Ltd, Roche, Cornwall, United Kingdom
The intellectual property system can sometimes be regarded as a tool which is optimised for the multi-national corporation, exclusively for protecting blockbuster inventions, and not very relevant to small businesses. However, many developed economies, and almost all developing countries, rely upon the small to medium-sized enterprise (SME) sector for a substantial proportion of their innovative capacity. Paradoxically, it can be more difficult to identify research originating from these small companies than from larger ones. This in turn can hinder other negotiations such as establishing joint ventures, valuing intangible assets before a formal takeover bid, or head-hunting key personnel. This session will consider some of the challenges inherent in locating and using records of the IP generated and owned by small businesses.

4:00pm-4:25pm CINF 78: Patent information in PubChem for small businesses and startups
Sunghwan Kim,, Paul Thiessen, Evan Bolton, Steve Bryant

National Library of Medicine, National Institutes of Health, Rockville, Maryland, United States
PubChem is a public chemical information resource, developed and maintained by the U.S. National Institutes of Health (NIH). It contains more than 157 million chemical substance descriptions, 60 million unique compounds, and 229 million bioactivities determined from one million assay experiments. Importantly, data contribution from a growing number of organizations, including IBM and SureChEMBL (formerly known as SureChem), allows PubChem to provide links to patent information for chemicals. Currently, PubChem offers links between about 6 million patent documents and more than 16 million unique chemical structures, with over 336 million chemical substance-patent links covering U.S., European, and World Intellectual Property Organization patent documents published since 1800. This presentation will provide an overview of the patent information in PubChem as well as the best practice for using it.

4:25pm-4:50pm CINF 79: Open patent chemistry “big bang” presents large opportunities for small enterprises
Christopher Southan,

Guide to PHARMACOLOGY, University of Edinburgh, Göteborg, Sweden
In 2012, after the first IBM open deposition of 2.5 million structures, few would have predicted that PubChem compounds that include patent-extracted submissions would approach 20 million by 2015 (PMID 26194581). The current major open patent chemistry feeds (in size order) are NextMove, SCRIPDB, Thomson Pharma, IBM and SureChEMBL. The comparative statistics of sources and the arguments that the coverage probability of lead compound prior-art structures is now very high will be presented. The consequences are that the academic community and small companies can now patent-mine extensively in PubChem and SureChEMBL, possibly even without needing commercial sources to support their own filings. Other recent major enabling aspects for small institutions include a) the open availability of patent full-text for querying, b) a range of free tools for DIY chemistry extraction (PMID 23618056) and c) automatic bioentity mark-up in patent text (e.g. protein names) from the SureChEMBL/SciBite collaboration. Examples of DIY analysis of newly published patents will be shown. Even for small enterprises not filing directly, open patent chemistry presents a big expansion in accessible SAR space, and aspects of mining this will be exemplified. However, open chemistry extraction does bring in a variety of artefacts that add confounding structural “noise”. These include a) permutations of mixtures and chiral exemplifications, b) virtual structures, c) extractions from documents that cannot directly indicate IP status and d) “common chemistry” swamping. These problems and some partial solutions using PubChem filters will be discussed.

4:50pm-4:55pm Concluding Remarks

CINF: Global Initiatives in Research Data Management & Discovery 1:00pm - 5:00pm
Monday, March 14
Room 25B - San Diego Convention Center
Ian Bruno, Leah McEwen, Organizing
Ian Bruno, Leah McEwen, Presiding
Cosponsored by: ANYL, COMP, MEDI and PHYS
1:00pm-1:05pm Introductory Remarks
1:05pm-1:35pm CINF 64: Authoring tools to automate data sharing in scientific publishing
John Kitchin,

Chemical Engineering, Carnegie Mellon University, Pittsburgh, Pennsylvania, United States
Data sharing and reproducibility of research are increasingly important issues. Funding agencies are mandating data sharing in calls for proposals, and journals are increasingly requiring data sharing as a condition of publication. Scientists are increasingly interested in open access to data. A requirement, or even a desire, to share is not sufficient, however, if sharing is difficult or tedious. We believe that new authoring tools are needed that will integrate data and analysis into the research and publishing processes. These tools will reduce the difficulty of sharing and reusing data.

We have developed a new approach to writing scientific documents that enables the direct inclusion of human-readable, and machine-addressable data and code. In this talk we will illustrate the approach by example from papers we have recently published using the approach. We show that the combination of an extensible editor (Emacs) with a lightweight markup language (org-mode) provides a remarkable solution to data sharing and research reproducibility issues. This combination enables the documentation of experimental setup, data generation, and analysis in a single document, and subsequent export of a scientific manuscript that is suitable for submission to most journals. When coupled with external data repositories, the approach enables sharing of large or complex data sets that cannot easily be captured in a manuscript. We will conclude with an outlook on the approach and where we see it going.

1:35pm-2:00pm CINF 65: Facilitating the inclusion of analytical raw data in the submission and review process
Santiago Dominguez Vivero1,, Juan Cobas Gomez1, Felipe Seoane1, Jose Garcia Pulido1, Agustin Barba1, Jesus Varela Carrete2

1 Mestrelab Research SL, Hereford, Herefordshire, United Kingdom; 2 Chemistry, University of Santiago de Compostela, Santiago de Compostela, A Coruña, Spain
Scientific articles have traditionally been supported by the inclusion of supplementary data, also referred to as supporting information, designed to provide additional information necessary for understanding the principal points of the publication and support the conclusions presented in the article but which cannot be included in the paper itself due to space reasons or technical limitations.
The inclusion and review of supplementary data is not without challenges, particularly when papers are supported by a large volume of analytical information. Preparation and formatting of the data ready for upload is a time-consuming, error-prone process. Also, the data uploaded has traditionally been ‘dead’ data (PDF, images or Word), which allows visualization but no interaction, interrogation or re-processing. Validation of results achieved would therefore require repetition of the processes described in the paper and reacquisition of the analytical data. Whilst this would be highly desirable, it is not often feasible.
The last few years have seen a significant increase in calls for the inclusion, by the authors, of raw analytical data which support the paper’s conclusion, to allow reviewers and readers to verify the work of the authors, prevent fraud and identify errors.
This talk aims to present an integrated system developed by the authors to facilitate the submission and review of supplementary information. The system includes:
- Software automations which prepare and format supplementary information commonly provided.
- Software automations which prepare the raw data for one-click submission.
- A software tool, freely accessible to all reviewers and readers, which allows the reprocessing and reanalysis of raw data and includes the visualization, reprocessing and analysis capabilities implemented in the Mnova software.
- Searching capabilities to leverage, in future research, the body of knowledge built by the submission of publications and supporting information.

2:00pm-2:30pm CINF 66: Crystallography: A domain exemplar for chemistry data management
Simon Coles,

University of Southampton, Hampshire, United Kingdom
Crystallographers have been generating data for over a century. It wasn’t until about halfway along this timeline that the case was made for a coherent community approach to managing the data outputs. This resulted in the Cambridge Structural Database, with the main driver for collecting data being that it could be reused and new science could be driven from what we can learn by having a collection. Data management was an added bonus, but quickly the community seized on this opportunity and aligned the database to publishing procedures.

The great leap forward however was the introduction of the Crystallographic Information Framework (CIF) in the early 1990s. The power behind CIF is that it is driven by a structured dictionary of terms managed by a committee convened by a learned society (IUCr). With such a comprehensive dictionary many things are possible – it can be expressed as a common file format and from this numerous applications can be driven. Not only does it provide a curation format that stands the test of time, but it can be rendered in many ways, it drives all data exchange, it can be automatically validated and it even underpins several forms of publication.
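
The idea that a dictionary of terms makes files automatically checkable can be illustrated with a toy Python sketch. It is not an IUCr tool and covers only a tiny subset of CIF: single-line `_name value` items, with loops and multi-line text fields ignored. `MINI_DICTIONARY` is a hypothetical stand-in for the real CIF dictionaries, which are far richer, and the sample values are illustrative only.

```python
import re

# Hypothetical mini-dictionary: data name -> expected Python type.
MINI_DICTIONARY = {
    "_cell_length_a": float,
    "_cell_length_b": float,
    "_cell_length_c": float,
    "_chemical_formula_sum": str,
}

# Trailing standard uncertainty in parentheses, e.g. 5.431(2)
_SU = re.compile(r"\(\d+\)$")


def parse_items(text):
    """Collect single-line `_name value` items from CIF-like text."""
    items = {}
    for line in text.splitlines():
        line = line.strip()
        if not line.startswith("_"):
            continue  # skips data_ block headers, comments, blank lines
        parts = line.split(None, 1)
        if len(parts) == 2:
            items[parts[0]] = parts[1].strip().strip("'\"")
    return items


def validate(items):
    """Return a list of problems: unknown names or non-numeric numbers."""
    problems = []
    for name, raw in items.items():
        expected = MINI_DICTIONARY.get(name)
        if expected is None:
            problems.append(f"unknown data name: {name}")
        elif expected is float:
            try:
                float(_SU.sub("", raw))
            except ValueError:
                problems.append(f"not a number: {name} = {raw}")
    return problems


SAMPLE = """\
data_example
_cell_length_a 5.431(2)
_cell_length_b 5.431(2)
_cell_length_c abc
_chemical_formula_sum 'Si'
_made_up_item 1
"""

for problem in validate(parse_items(SAMPLE)):
    print(problem)
```

Because the checks are driven entirely by the dictionary, extending validation to new terms means adding dictionary entries rather than changing the parser, which is the property the abstract attributes to CIF.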

Having built such a culture and toolset, the arrival of the internet and the ability to relatively easily build software systems provides a rich environment for development. This is timely because these technological innovations have also resulted in an exponential increase in the amount of data generated. This talk will outline not only what we can achieve as a global body, but also how individuals or organisations can now act globally. Originally CIF only covered results data, but this was soon extended to raw data and now the community, through the Diffraction Data Deposition Working Group, is moving forward with archival and publication of ALL of its data. Furthermore, crystallographers are now integrating with the wider chemistry community through the development of common standards. On a more local level much is also possible – the UK National Crystallography Service operates an archival policy that must be in line with numerous funder mandates but at the same time it has been possible to set up an open data repository that integrates with community processes.

2:30pm-2:55pm CINF 67: Are data management solutions developed for commercial organizations suitable for academic research?
Mariana Vaschetto,, Tom Oldfield, Michael Hartshorn

Dotmatics, Bishops Stortford, United Kingdom
Research data management solutions such as Electronic Laboratory Notebooks (ELNs) have been standard in Pharma and Biotech for quite some time. In addition, the widespread acceptance of web-based technologies facilitated the adoption of hosted solutions in cloud environments by commercial organizations. Moreover, commercial organizations are shifting their focus towards collaborative research with academic groups and not-for-profit institutes. This leads to the questions: Are cloud data management solutions developed with commercial R&D groups in mind transferable to Academia? And can these solutions facilitate collaborative research across both worlds?
In this talk we will discuss how Dotmatics has worked with its customers to develop a cloud-based configurable infrastructure that is equally useful in commercial or academic environments. This web-based solution enables scientists to customize the read and write access to their data, making the real-time transfer and sharing of information across different teams seamless. The tools also provide secure access for all partners with restricted views that allow sharing of knowledge while protecting IP for the commercial sector. The ideas presented include not only spreadsheets built on database technology but also interactive dashboards to share documents and applications and facilitate the collaboration of ideas globally.

2:55pm-3:10pm Intermission
3:10pm-3:30pm CINF 68: Data sharing in life sciences R&D: Pre-competitive collaboration through the Pistoia Alliance
Carmen Nitsche,

Pistoia Alliance, San Antonio, Texas, United States
The Pistoia Alliance is a group of life sciences industry experts. We use pre-competitive collaboration to address issues around aggregating, accessing, and sharing data that are essential to innovation, but provide little competitive advantage. We have a strong track record in delivering value from our projects, providing our membership with perspective on current problems, and being a source of impartial opinion. We were established in 2009 by representatives of AstraZeneca, GSK, Novartis and Pfizer who met at a conference in Pistoia, Italy.

3:30pm-3:50pm CINF 69: The Royal Society of Chemistry and the data publication landscape
Serin Dabb,

The Royal Society of Chemistry, Cambridge, United Kingdom
Like many global funding agencies, the UK’s Engineering and Physical Sciences Research Council (EPSRC) has mandated research data preservation and put the responsibility on institutions to comply. A key component of this preservation is ensuring accessibility and discoverability. The Royal Society of Chemistry is building a research data repository to hold different types of derived chemical information, as part of our EPSRC-funded National Chemical Database Service, to address some of these needs.
We believe a number of steps need to take place before the chemistry community incorporates data management into their routine workflow. One is encouraging the development of community data standards. Another is showing the utility of wider information sharing, and our Compound Collection pilot for extracting compounds from theses has had great buy-in from academia, libraries and pharma companies. As a publisher we are investigating overlap between research data availability and our journal publication processes. As a learned society we feel our role is to aid our community by encouraging skills development and a wider understanding of the issues involved, and appreciating the potential opportunities.

3:50pm-4:10pm CINF 70: Digital IUPAC: The need for global representation of chemistry and chemical information in the digital age
Jeremy Frey,

University of Southampton, Southampton, United Kingdom
The growing importance of chemical information challenges all international bodies and especially IUPAC to fulfill its mission to address global issues involving the chemical sciences, given that in the modern digital world all manufacture, research, teaching and learning is now assisted by computer systems. I wish to argue the case for “Digital IUPAC”: in this increasingly digital age, IUPAC will, and must, take a lead in providing machine-readable (i.e. computable and understandable) representations of chemical information as well as structure, using standards that IUPAC defines and standards that other international authorities agree to use.
Looking at the wider data and information agenda, the Royal Society (of London) report “Science as an open enterprise” argues the absolute necessity for intelligent access to the data on which scientific conclusions are based. Intelligent openness is fundamental to the whole progress of science. In the modern digital world intelligent access really requires that this access can be mediated by computers. The comprehensive conversion of IUPAC’s knowledge base of standards and definitions from human-readable to computer-readable form is essential. It is vital that this conversion be done now as a matter of extreme urgency, if IUPAC is to maintain its role as the international authority for the chemical sciences. If computers cannot find and use the information provided by IUPAC, that information will effectively cease to exist for the “Wikipedia generation”.

4:10pm-4:30pm CINF 71: DIG chemistry: Establishing a research data interest group to address the many faces of chemical data management
Leah McEwen,

Clark Library, Cornell University, Ithaca, New York, United States
Chemistry is a central science with a long history of rich information and data resources traditionally compiled from articles. Is this really the best way of managing research data to ensure reproducibility, reuse, efficiency, and application? What are the compelling use cases for chemical data and what practices are missing to support these? The international chemical information community is examining current practices and connecting with other disciplines and data initiatives to explore how we can collectively and effectively fill the gaps. Experimental and theoretical researchers, educators, data and information scientists, librarians, publishers, database providers, and colleagues across the academic, industrial, private and public sectors are forming a Chemistry Interest Group within the Research Data Alliance (RDA). This presentation will highlight specific challenges we could address now with a view to discussing how these can best be tackled.

4:30pm-5:00pm Panel Discussion
CINF: Informatics & Quantum Mechanics: Combining Big Data & DFT in Pharma & Materials 1:30pm - 4:45pm
Monday, March 14
Room 25A - San Diego Convention Center
Art Cho, Organizing
Art Cho, Presiding
1:30pm-2:00pm CINF 80: In silico, high-throughput screening of non-fullerene acceptor materials for applications of organic photovoltaic devices: A Harvard clean energy project study
Steven Lopez, Edward Pyzer-Knapp, Alan Aspuru-Guzik

Harvard University, Cambridge, Massachusetts, United States
Organic photovoltaics (OPVs) have shown steady growth in efficiencies since the 1980s, and power conversion efficiencies (PCEs) of up to 12% have been reported in multi-junction cells. OPVs are lightweight, easy to produce, and feature chemically diverse components. While PCBM is the standard fullerene n-type (acceptor) material, it is not without limitations, which include limited spectral breadth, a small range of LUMO energies, and relatively high costs of industrial production. We have undertaken an in silico high-throughput screening utilizing the Harvard Clean Energy Project to explore the chemical space associated with non-fullerene acceptor materials. The library comprises 100,000 n-type materials, including perylene diimides, tetraazabenzodifluoranthenes, diketopyrrolopyrroles, and fluoranthene-fused imides. This work is carried out through a tight feedback loop with experimental colleagues who synthesize target materials and create OPV devices.

2:00pm-2:30pm CINF 81: Regioselectivity prediction of metabolic reactions based on ab initio derived descriptors

Arndt Finkelmann2, Andreas Göller1, Gisbert Schneider2

1 Global Drug Discovery, Bayer Pharma AG, Wuppertal, Germany; 2 Department of Chemistry and Applied Biosciences, ETH Zurich, Zurich, Switzerland
The complexity and diversity of chemical transformations involved in drug metabolism impose a hurdle on computational models for Site of Metabolism (SoM) prediction [1]. Ligand-based machine learning models are general and potentially useful tools for SoM prediction. However, they require suitable atom descriptors that ideally capture the reactivity-determining features. We approached the SoM prediction problem by developing descriptors that characterize an atom's steric and electronic environment. The electronic environment is approximated by the partial charge distribution in the atom's proximity. The partial charges are obtained from quantum chemistry calculations and represent the electron distribution in a molecule. To identify the partial charge scheme that is best suited for descriptor construction, we investigated the dependence of different partial charges on molecular conformation and calculation method. NPA and CM5 charges turned out to have a low dependence on conformation and allow for a one-conformer approach. We demonstrate that our descriptors enable the construction of accurate and robust cytochrome SoM prediction models.

[1] Kirchmair J.; Goeller A.H.; Lang D.; Kunze J.; Testa B.; Wilson I.D.; Glen R.C.; Schneider G. Predicting drug metabolism: experiment and/or computation? Nat. Rev. Drug Discov. 2015, 14, 387-404.
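
The idea of describing an atom's electronic environment by the partial charges in its proximity can be illustrated with a minimal sketch. This is not the authors' descriptor: the shell boundaries, the function name `charge_shell_descriptor`, and the toy coordinates and charges are all assumptions made for illustration. The sketch sums the partial charges of neighbouring atoms in concentric distance shells around a chosen atom.

```python
import math

def charge_shell_descriptor(coords, charges, center,
                            shell_edges=(0.0, 2.0, 4.0, 6.0)):
    """Sum partial charges of neighbouring atoms in concentric distance shells.

    coords  : list of (x, y, z) tuples in angstroms
    charges : list of partial charges (e.g. NPA or CM5), same order as coords
    center  : index of the atom being described
    The shell boundaries are an illustrative assumption, not values
    taken from the abstract.
    """
    shells = [0.0] * (len(shell_edges) - 1)
    for j, (xyz, q) in enumerate(zip(coords, charges)):
        if j == center:
            continue
        d = math.dist(coords[center], xyz)
        for k in range(len(shells)):
            if shell_edges[k] <= d < shell_edges[k + 1]:
                shells[k] += q
                break
    # The atom's own charge comes first, followed by one sum per shell.
    return [charges[center]] + shells

# Toy three-atom example with made-up coordinates and charges.
coords = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 3.0, 0.0)]
charges = [-0.4, 0.3, 0.1]
descriptor = charge_shell_descriptor(coords, charges, center=0)
print(descriptor)
```

A vector of this kind could be computed for every candidate atom and passed to a machine learning model; atoms beyond the outermost shell are simply ignored in this sketch.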

2:30pm-3:00pm CINF 82: COSMO-based approach for the design of solvents to optimize reaction rates
Nicholas Austin1, Nikolaos Sahinidis2, Daniel Trahan3

1 Chemical Engineering, Carnegie Mellon University, Bowling Green, Kentucky, United States; 2 Dept Chemical Engineering, Carnegie Mellon University, Pittsburgh, Pennsylvania, United States; 3 The Dow Chemical Company, Freeport, Texas, United States
The reaction medium plays a critical role in determining the success of a particular reaction, the rate at which it proceeds, and whether any undesirable side-products are formed. The selection of a solvent is often empirical or based on somewhat rudimentary properties (H-bond donor/acceptor abilities, dielectric constant, solubility parameters, etc.). Furthermore, there is limited customization in solvent choice as many reactions are performed in one of a handful of common laboratory solvents or a simple blend of these solvents. For this reason, choosing an optimal (or simply improved) solvent for a particular reaction has significant application potential in liquid-phase chemistry.
From an optimization point of view, this problem is challenging for two main reasons: (1) there is a virtually infinite number of potential structures to choose from and (2) any blend of solvents requires the additional determination of mole fractions. In this work, we propose the use of the COSMO solvation model and its –RS and –SAC post-processing steps to calculate solvation free energies and thereby estimate reaction rates. The use of COSMO presents a distinct advantage over other approaches, as COSMO-RS and –SAC estimates of any composition require only a single calculation for each component of a mixture. Incorporating these methods into an efficient optimization framework necessitates the development of semi-empirical methods (specifically, group contribution methods). These provide estimates of sigma profiles, which are averaged representations of charge density versus surface area and are key in calculating mixture properties.
The design space of the optimization (molecular structures and mole fractions) is projected into a much lower-dimensional space, specifically that of the statistical moments of the sigma profiles of each component of the solvent mixture. This enables the use of efficient derivative-free optimization methods to quickly design solvents for specific reaction-rate applications. This approach will be discussed in detail and applied to several solvent design problems.
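
As a rough illustration of projecting a sigma profile onto a few statistical moments, the sketch below assumes one common definition, M_k = sum_i A_i * sigma_i**k, where A_i is the surface area assigned to charge-density bin sigma_i. The abstract does not state which moments the authors use, so the definition, the function name `sigma_moments`, and the toy profile are all assumptions.

```python
def sigma_moments(sigma, area, orders=(0, 1, 2, 3)):
    """Return M_k = sum_i area_i * sigma_i**k for each requested order k.

    sigma : charge-density bin centres (e/A^2)
    area  : surface area assigned to each bin (A^2)
    """
    return [sum(a * s ** k for s, a in zip(sigma, area)) for k in orders]

# Toy symmetric profile (made-up numbers, for illustration only).
sigma = [-0.01, 0.0, 0.01]
area = [10.0, 30.0, 10.0]
moments = sigma_moments(sigma, area)
print(moments)
```

In this convention the zeroth moment is the total surface area and the first moment is the net surface charge, so a handful of numbers per solvent component stands in for the full profile during optimization.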

3:00pm-3:15pm Intermission
3:15pm-3:45pm CINF 83: Efficient, first-principles-based screening for high-charge carrier mobility in organic crystals
Christoph Schober, Karsten Reuter, Harald Oberhofer

Chair of Theoretical Chemistry, Technical University Munich, Garching, Germany
In organic electronics, charge carrier mobility is a key performance parameter. Due to the complex manufacturing processes of, e.g., organic field effect transistors (OFETs), measured mobilities are often heavily affected by the device preparation. This masks the intrinsic materials properties and thereby hampers the decision whether further device optimization for a given organic molecule is worthwhile or not. We developed a fast and efficient protocol with a descriptor based on electronic coupling values to assess the expected performance of organic materials for application in organic electronic devices. Applying this protocol to experimental structures of organic crystals obtained from the Cambridge Structural Database (CSD), we screen about 40,000 structures employing only first-principles methods. Out of the 28,000 successfully calculated structures we select 2,000 candidates with above-average electronic couplings for additional calculations and in-depth analysis using statistical methods and automated classification based on chemical structure. This allows us to identify not only a number of specific crystals with exceptionally high electronic coupling values and therefore promising properties, but also possible lead structures which can be the basis for in-depth theoretical and experimental studies of new classes of materials for organic electronics.

3:45pm-4:15pm CINF 84: Data-driven chemistry: From small molecules to discovery of new functional materials
Olexandr Isayev2, Alexander Tropsha1

1 Univ of North Carolina, Chapel Hill, North Carolina, United States; 2 UNC Eshelman School of Pharmacy, University of North Carolina at Chapel Hill, Chapel Hill, North Carolina, United States
The Materials Genome Initiative is transforming Materials Science into a data-rich discipline. These developments open exciting opportunities for knowledge discovery in materials databases using informatics approaches to inform the rational design of novel materials with the desired physical and chemical properties. Statistical and data mining approaches have been successfully employed in both chemistry and biology, leading to the development of cheminformatics and bioinformatics, respectively. However, until recently their application in materials science has been limited due to the lack of a sufficient body of data.
In this work we showcase pilot materials informatics applications capable of (i) instantaneously querying and retrieving the necessary material information in the desired form, (ii) identifying, visualizing and studying important data patterns, and (iii) generating experimentally testable hypotheses by building predictive Machine Learning (ML) models based on materials’ characteristics. Specifically, we posit that materials with similar structural, topological, and electronic characteristics are expected to have similar physical and chemical properties irrespective of their formal composition. To enable uniform comparison of materials by their intrinsic properties, we will represent all materials uniquely by multiple numerical descriptors, or fingerprints. This representation will enable the use of classical cheminformatics and ML approaches to mine, visualize, and model any set of materials, as we demonstrated in our recent pioneering studies on Materials Cartography [1].

[1] Isayev, O., Fourches, D., Muratov, E.N., Oses, C., Rasch, K.M., Tropsha, A., and Curtarolo, S. Chem. Mater. 2015, 27, 735–743. DOI: 10.1021/cm503507h
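
The premise that materials with similar fingerprints should have similar properties can be sketched as a nearest-neighbour ranking over descriptor vectors. This is only an illustrative sketch: the cosine similarity measure, the three-component fingerprints, and the names `material_A` to `material_C` are hypothetical and are not taken from the cited work.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two numerical fingerprint vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def rank_by_similarity(query, library):
    """Return library keys ordered from most to least similar to the query."""
    return sorted(library,
                  key=lambda name: cosine_similarity(query, library[name]),
                  reverse=True)

# Hypothetical fingerprints; real descriptors would have many more components.
library = {
    "material_A": [1.0, 0.0, 2.0],
    "material_B": [0.9, 0.1, 2.1],
    "material_C": [0.0, 3.0, 0.0],
}
ranking = rank_by_similarity(library["material_A"], library)
print(ranking)
```

Any other similarity or distance measure could be substituted for the cosine; the point is only that once materials are encoded as vectors, standard cheminformatics-style similarity searches apply.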

4:15pm-4:45pm CINF 85: Multi-agent approach for molecular modeling in chemical vapor deposition
Luke Achenie,

Virginia Tech, Blacksburg, Virginia, United States
Zinc sulfide continues to generate interest because, compared to other semiconductors, it has a large direct band gap, making it desirable in optical applications. In these applications only high-quality zinc sulfide films can be employed. The default approach for producing zinc sulfide films is chemical vapor deposition (CVD). However, there is a large potential for defects in the deposited film, primarily because the morphology of adducts in the gas phase differs from that in the solid phase (i.e. the deposited film). The impact of adducts depends on the size distribution of clusters, which create large distorted grains on the substrate that do not normally have the same morphology as the deposited film. The main issue is how big these clusters grow; larger clusters make defects in the deposited film more likely.

In our previous research we employed a computational approach to predict the size distribution and morphology of the clusters. With this information we were able to explain the link between the cluster size and the morphological defects on the deposited film. In our approach we coupled macro-scale computational fluid dynamics with molecular-scale molecular dynamics and nano-scale ab initio calculations in order to estimate the nucleation, growth, dynamics, and size distribution of the particles inside the CVD reactor. This presentation shows advances we have made using new modeling modalities; specifically, we will discuss a multi-agent approach for coupling molecular modeling (molecular dynamics in which the force fields are periodically updated with DFT calculations) with macro-level (i.e. continuum-based) computational fluid dynamics and/or general particle dynamics. In summary, we use a multi-agent approach to bridge the time and space scales of molecular-level calculations and continuum-scale modeling.

CINF: Sci-Mix 8:00pm - 10:00pm
Monday, March 14
Hall D/E - San Diego Convention Center
8:00pm-10:00pm CINF 105: Supporting openness and reproducibility in scientific research: The Center for Open Science

Sara Bowman,

Center for Open Science, Charlottesville, Virginia, United States

8:00pm-10:00pm CINF 110: Building a better materials science database: Challenges and opportunities

Robin Padilla, Michael Klinge,

Corporate Markets & Databases, Springer Nature, Heidelberg, Germany

8:00pm-10:00pm CINF 116: Competitive intelligence workbench: Getting access to information for decision making

Huijun Wang,

Merck, Kenilworth, New Jersey, United States

8:00pm-10:00pm CINF 117: Using systems biology in computational drug design workflows

George Nicola, Bruce Kovacs

Afecta Pharmaceuticals, Irvine, California, United States

8:00pm-10:00pm CINF 131: Comparative toxicogenomics database: Advancing understanding of molecular connections among chemicals, genes, and diseases

Cynthia Grondin, Allan Davis, Thomas Weigers, Carolyn Mattingly

Biology, North Carolina State University, Raleigh, North Carolina, United States

8:00pm-10:00pm CINF 139: Enhanced chemical understanding through 3D-printed models

Amy Sarjeant1, Peter Wood4, Ian Bruno1, Ye Li2, Vincent Scalfani3, Shawn O'Grady2

1 Cambridge Crystallographic Data Centre, Cambridge, United Kingdom; 2 University of Michigan, Ann Arbor, Michigan, United States; 3 University Libraries, University of Alabama, Tuscaloosa, Alabama, United States; 4 CCDC, Cambridge, United Kingdom

8:00pm-10:00pm CINF 13: Open data is not enough: A look at the Research Data Alliance

Mark Parsons,

Research Data Alliance, Boulder, Colorado, United States

8:00pm-10:00pm CINF 143: Chemical knowledge representation and access in Wolfram|Alpha and Mathematica

Eric Weisstein,

Scientific Content, Wolfram|Alpha, Champaign, Illinois, United States

8:00pm-10:00pm CINF 147: Leveraging the VIVO research networking system to facilitate collaboration and data visualization

Michaeleen Trimarchi, Danielle Bodrero Hoggan,

Kresge Library, The Scripps Research Institute, La Jolla, California, United States

8:00pm-10:00pm CINF 165: Predicting drug-induced hepatic systems' toxicity by integrating transporter interaction profiles

Eleni Kotsampasakou, Gerhard Ecker

Department of Pharmaceutical Chemistry, University of Vienna, Vienna, Austria

8:00pm-10:00pm CINF 21: Deep convolutional neural networks for autonomous discovery of molecular interactions

Abraham Heifets, Izhar Wallach, Michael Dzamba,

Atomwise, Inc., San Francisco, California, United States

8:00pm-10:00pm CINF 29: On our way to the automated search for ligand-sensing cores

Tobias Brinkjost1,2, Christiane Ehrt2, Petra Mutzel1, Oliver Koch2

1 Faculty of computer science, TU Dortmund University, Dortmund, Germany; 2 Faculty of chemistry and chemical biology, TU Dortmund University, Dortmund, Germany

8:00pm-10:00pm CINF 2: Standard JSON molecule, a solution to a cross-vendor molecule file format?

Brian Cole,

OpenEye Scientific Software, Santa Fe, New Mexico, United States

8:00pm-10:00pm CINF 32: Advances in data provisioning

Marian Brodney1, Jacquelyn Klug-McLeod2, Gregory Bakken2, Robert Stanton1

1 Computational Sciences Center of Excellence, Pfizer, Cambridge, Massachusetts, United States; 2 Computational Sciences Center of Excellence, Pfizer, Groton, Connecticut, United States

8:00pm-10:00pm CINF 33: Chemical information on the web: Find and be found

Asta Gindulyte,

National Center for Biotechnology Information, U.S. National Library of Medicine, Bethesda, Maryland, United States

8:00pm-10:00pm CINF 57: ChemEngine: An automated chemical data harvesting tool for molecular inventory and chemical computing from scientific literature

Muthukumarasamy Karthikeyan1, Renu Vyas2

1 Digital Information Resource Centre, CSIR National Chemical Laboratory, Pune, India; 2 Chemical Engineering and Process Development, CSIR-National Chemical Laboratory, Pune, MH, India

8:00pm-10:00pm CINF 58: Screening of materials for energy applications based on transport properties: Methods and data automation tools

Boris Kozinsky,

Bosch Research, Waban, Massachusetts, United States

8:00pm-10:00pm CINF 63: Discovery through deterministic optimization: Navigating chemical space for effective material design

Jennifer Elward, Christopher Rinderspacher

Army Research Laboratory, Aberdeen Proving Ground, Maryland, United States

8:00pm-10:00pm CINF 81: Regioselectivity prediction of metabolic reactions based on ab initio derived descriptors

Arndt Finkelmann2, Andreas Göller1, Gisbert Schneider2

1 Global Drug Discovery, Bayer Pharma AG, Wuppertal, Germany; 2 Department of Chemistry and Applied Biosciences, ETH Zurich, Zurich, Switzerland

8:00pm-10:00pm CINF 99: Applications of drug-target data in translating genomic variation into drug discovery opportunities

Anna Gaulton,

Chemogenomics Team, European Molecular Biology Laboratory - European Bioinformatics Institute, Cambridge, United Kingdom

CINF: Chemistry, Data & the Semantic Web: An Important Triple to Advance Science 8:15am - 11:55am
Tuesday, March 15
Room 25B - San Diego Convention Center
Evan Bolton, Stuart Chalk, Organizing
Evan Bolton, Stuart Chalk, Presiding
8:15am-8:20am Introductory Remarks
8:20am-8:45am CINF 86: Towards knowledge representation improvements in chemistry
Evan Bolton,

NCBI / NLM / NIH, Warrenton, Virginia, United States
Scientific knowledge is vast and nuanced. Summarizing countless pieces of information (numbering in the billions and trillions) is not straightforward. There are many opportunities to improve the quality and navigability of data. This talk will highlight efforts to handle the open scientific corpus as it pertains to chemical information.

8:45am-9:10am CINF 87: Chemical classifications for biology and medicine
Minoru Kanehisa,

Institute for Chemical Research, Kyoto University, Uji Kyoto, Japan
Life would not exist without chemical substances. For the purpose of developing bioinformatics methods, they are divided into two categories: metabolic substances and regulatory substances. Metabolic substances are interconverted as substrates and products of enzyme-catalyzed reactions. Regulatory substances interact with proteins, DNA, RNA, and other endogenous molecules in many ways. The chemical space of metabolic substances is determined by the universe of enzyme-catalyzed reactions, which in turn is determined by the genomic space of enzyme genes [1]. Here we focus on regulatory substances, including xenobiotic compounds and drugs, and present how knowledge is organized in KEGG [2], which is an integrated resource of sixteen main databases. There are two relevant databases. One is KEGG BRITE, which contains hierarchical classifications of various biological objects that are linked to both internal and outside databases. The other is KEGG DGROUP for drug grouping where individual instances of drugs are grouped into classes of functionally identical or similar drugs. KEGG DGROUP can be compared to KEGG ORTHOLOGY (KO) for genes and proteins, where generalization of instances to classes is the basis for interpretation and prediction of molecular interaction networks and associated high-level functions.

[1] Kanehisa, M.; Chemical and genomic evolution of enzyme-catalyzed reaction networks. FEBS Lett. 587, 2731-2737 (2013).
[2] Kanehisa, M., Sato, Y., Kawashima, M., Furumichi, M., and Tanabe, M.; KEGG as a reference resource for gene and protein annotation. Nucleic Acids Res. 44, in press (2016).

9:10am-9:35am CINF 88: Withdrawn
9:35am-10:00am CINF 89: ChEBI database and ontology: A key resource for chemical biology and metabolomics
Gareth Owen,

EMBL-EBI, Ely, United Kingdom
ChEBI, a manually curated database and ontology of chemical entities of biological interest, is widely recognized as a key player in the chemical ontology arena. The ChEBI Ontology includes both chemical and functional descriptors of the molecules and groups in the database. This is of particular importance in enabling the incorporation of chemical concepts of importance to chemical biology and drug discovery into resources that have different foci. Thus the ChEBI Ontology is widely used by other biomedical ontologies, most notably the Gene Ontology, for handling their chemistry-containing terms.

After reviewing the motivation behind the development of the ChEBI Ontology and its three component sub-ontologies, the various relationships used to link entities within the ontology and the methods used to classify new entries in the ChEBI database will be described. Finally, plans for future developments of the ChEBI Ontology will be discussed, and possible applications and uses of the ontology by other resources will be outlined.

10:00am-10:15am Intermission
10:15am-10:40am CINF 90: Classifying chemistry: Current efforts in Canada
David Wishart,

Biological Sciences, University of Alberta, Edmonton, Alberta, Canada
Our group has been actively involved in developing databases for metabolomics (HMDB), exposomics (T3DB), food chemistry (FooDB) and medicinal chemistry (DrugBank). This work has given us a unique perspective on chemical information and how it can connect to biological or biomedical information. It has also highlighted the need to develop better methods to “harvest” chemical and biochemical data for our databases as well as better methods to “structure” the data within these databases. Over the past 3 years we have developed several publicly accessible tools to facilitate chemical data harvesting, text mining and feature prediction. We have also started developing novel tools for classifying chemical structures, storing biochemical processes and providing more structured ontologies to describe the chemical and biological data in these databases. In this presentation I will describe some of these tools in more detail and highlight some useful applications that are enabled by these tools.

10:40am-11:05am CINF 91: Classifying compounds in public databases
Lutz Weber,

IT, OntoChem, Germering, Germany
OntoChem is engaged in developing novel tools and algorithms that enable the interconnection of chemical and biological ontologies with use cases focused on knowledge generation in drug discovery. For example, OntoChem ontologies provide not only a computational classification of chemical compounds but also chemical fragments, substituents, scaffolds and other chemical terms directed towards substances and materials like drugs, vitamins, polymers or alloys.

OntoChem integrates semantic text mining and annotation tools to extract knowledge and factual data of compound properties. A modular UIMA pipeline is used to annotate any document type with a range of Life Science technologies. These technologies have been optimized and are ideally suited for the high-speed and high-quality annotation necessary to handle searching large data volumes.

As an example, we will demonstrate the classification of PubChem compounds using an open access chemistry classification system derived from ChEBI, authoritative chemistry textbooks and other sources. Such chemistry classifications can be used to enhance search engines, improve knowledge extraction technologies and support higher-level abstraction in Life Sciences. Using PubChem compound classifications we will demonstrate their utility as a basis for judging novelty and for trend analytics.


11:05am-11:30am CINF 92: Automated structural and functional annotation of small molecules using integrated chemical ontologies: ClassyFire, ChemOnt, and downstream applications
Yannick Djoumbou Feunang,

Biological Sciences, University Of Alberta, Edmonton, Alberta, Canada
Centuries of discoveries in chemistry and biology have produced a large amount of knowledge about chemicals and their interactions with the environment. Recent efforts to harvest and store this data into electronic warehouses have emphasized two issues: 1) the scarcity and incompleteness of the data, and 2) the need to organize it into more comprehensible and exchangeable formats. Moreover, organizing chemical information in a structured way could not only improve our understanding of chemistry and related sciences, but also facilitate new discoveries. In this spirit, we have developed ClassyFire and ChemOnt. ClassyFire is a computational tool for rapid, consistent, dataset-independent, automated structure-based classification of chemical compounds. It relies on the structure-based sub-ontology of ChemOnt, a well-defined chemical ontology, which covers a wide spectrum of compounds and roles (applications, health effects, etc.). In a joint effort, ChemOnt has been mapped to other ontologies for the sake of interoperability. ClassyFire was used to classify major databases, including PubChem, DrugBank, HMDB, and ChEBI. Additionally, we have developed other tools and frameworks to integrate more concepts (proteins, pathways, phenotypes) in order to represent or study their interactions with small molecules. In this presentation, we will describe these tools, some of their applications, and how they could be combined with semantic technologies to infer knowledge, suggest new hypotheses, and make new discoveries.

11:30am-11:55am CINF 93: Evaluation of machine-generated chemical ontologies for molecular information
Stephen Boyer, Thomas Griffin, Eric Louie

IBM Research, San Jose, California, United States
With today's proliferation of the scientific literature and the massive databases resulting from computer curation, it is imperative to automate the classification processes. Several programs have been developed to ingest machine-readable forms of molecules and to generate a set of molecular attributes (descriptors). One example is ingesting a SMILES string and correlating it with the context in which it occurred. We have evaluated several of these programs for classification purposes and for input into downstream operations such as knowledge graphs, data mining, regulatory compliance, and cognitive computing.


CINF: Driving Change: Impact of Funders on the Research Data & Publications Landscape 8:35am - 12:00pm
Tuesday, March 15
Room 25A - San Diego Convention Center
Elsa Alvaro, Andrea Twiss-Brooks, Organizing
Elsa Alvaro, Presiding
Cosponsored by: MEDI and ORGN
8:35am-8:40am Introductory Remarks
8:40am-8:50am Update on NSF MPS Open Data Policies
8:50am-9:15am CINF 100: NIH public access policy
Neil Thakur,

NIH, Rockville, Maryland, United States
The NIH public access policy has been in place since 2005, and mandatory since 2008. It requires all papers arising from NIH funds to be made public on PubMed Central within 12 months of publication. Since PubMed Central is an XML archive, papers need to be in XML format before they can be posted. There are four ways in which papers can be posted on PubMed Central, and they vary in the level of effort that an author must undertake. The submission method is determined by author and publisher preference. We will describe these methods and their implications for authors. We will also discuss the various strategies NIH uses to monitor compliance. The American Chemical Society has been using different submission methods, and we will explore how these various approaches have impacted compliance.

9:15am-9:40am CINF 101: U.S. Department of Energy public access plan
Laura Biven,

US Department of Energy, Washington, D.C., District of Columbia, United States
The Department of Energy’s (DOE) Public Access Plan aims to increase access to data and publications resulting from DOE-funded research. As part of the implementation of the Public Access Plan, the Department has developed PAGES, the Public Access Gateway for Energy & Science, to provide public access to full-text versions of peer-reviewed publications, and now requires data management plans for DOE-funded research. This presentation will discuss the history and philosophy of the new activities and requirements as well as some thoughts for future work.

9:40am-10:05am CINF 102: Helping authors and funders achieve open access goals at ACS Publications
Darla Henderson,

Publications Division, American Chemical Society, Washington, District of Columbia, United States
During 2014-2015, in response to increasing funder mandates imposed on authors and several far-reaching trends in scholarly publishing and open access, ACS Publications implemented a significant expansion of its open access publishing program. This expansion included:

- The launch of ACS Central Science, the Society’s first fully open access journal (with no author publishing charges) aiming to publish the most impactful multidisciplinary research and showcasing the centrality of chemistry;
- A new ACS-sponsored program making one noteworthy new article from an ACS journal open access each day (ACS Editors’ Choice);
- New license types for authors choosing to publish open access (ACS AuthorChoice options); and
- A $60-million stimulus program to support authors selecting to publish their work as open access across ACS journals (ACS Author Rewards).

ACS works closely with funding organizations to comply with various new mandates and to communicate these policy requirements to authors. Engaging directly with funders is one way we are seeking to simplify the process for authors. In addition, ACS serves as a founding member of CHORUS, a suite of services and best practices for sustainable public access to published articles reporting on funded research in the US.

This session will address how the American Chemical Society’s Publications Division is working with funders as they develop new mandates for publications.

10:05am-10:30am CINF 103: Libraries at the hub as the federally funded research wheel turns to open
Shannon Kipphut-Smith1,, Betty Rozum2,, Becky Thoms3,

1 Rice University, Houston, Texas, United States; 2 Utah State University, Logan, Utah, United States
Academic libraries are strong partners in supporting researcher compliance with both funder public access policies and institutional open access policies, and are increasingly involved in research data management activities. The 2008 National Institutes of Health (NIH) Public Access Policy, requiring researchers to deposit copies of all NIH-funded publications in PubMed Central, provided an opportunity for academic librarians to use their expertise in education and training, copyright, and author rights issues to assist with policy compliance. At the same time, many institutions began conversations about management of research data and adoption of institutional open access (OA) policies, requiring faculty to place copies of their scholarship in institutional repositories (IRs). Academic libraries play an important role in these policies, promoting the benefits of OA, managing IRs, and facilitating article deposit.

Naturally, many of those already engaged with services and resources related to public access, OA policies, and research data welcomed the 2013 White House Office of Science and Technology Policy (OSTP) memo, calling for increased public access to the results of federally-funded publications and research data. This presentation shares the results of a study conducted to better understand how academic libraries are leveraging existing services and resources when addressing the new public access policies. The researchers will survey libraries and research offices at the Carnegie very high and high research activity universities regarding OA policies, and services and collaborations that have been developed to assist faculty in meeting the new federal mandates. Using the results of the survey, and case studies from Rice University and Utah State University, we will offer a detailed snapshot of the role of academic libraries and research offices in addressing these funder policies as well as identify opportunities for more collaborative efforts.

10:30am-10:45am Intermission
10:45am-11:10am CINF 104: SHARE phase II: Enhancing the dataset and engaging the community
Judy Ruttenberg,

Association of Research Libraries, Washington, District of Columbia, United States
SHARE is building a free, open data set about scholarly research activities across their lifecycle. Stakeholders across the scholarly research ecosystem (funders, institutions, researchers, libraries) can both participate in and benefit from an open data set about research activity, especially with an increasing trend toward public and open access to the results of that activity. This session will update the community on the progress of SHARE Notify, currently processing and freely distributing millions of research release events from sources including arXiv, Figshare, PLOS, PubMed Central, and a number of institutional repositories. We will share the objectives and progress of Phase II of SHARE: expanding the number of data providers, enhancing the aggregated metadata, and looking for opportunities for institutional integration of SHARE's dataset. The expansion, enhancement, and integration of the dataset will ensure that SHARE is a timely and reliable source of data for universities about their own research output, and for funders about their investments. SHARE is an open source development project led by three higher education associations (ARL, AAU, and APLU) in partnership with the Center for Open Science, a nonprofit technology start-up.

11:10am-11:35am CINF 105: Supporting openness and reproducibility in scientific research: The Center for Open Science

Sara Bowman,

Center for Open Science, Charlottesville, Virginia, United States
New policies by funding agencies require researchers to make their data and other research outputs publicly available. Evolving journal policies increasingly require more data and materials sharing by authors. Researchers must learn to navigate these ever-changing policies, often with little infrastructure support. The non-profit Center for Open Science (COS) seeks to provide researchers with both the infrastructure tools and training to meet these needs.
COS builds the Open Science Framework (OSF), a free and open-source web application designed to manage the entire research lifecycle, from project inception and planning through data archiving and dissemination. The OSF connects tools researchers already use to increase efficiency and streamline workflows. Features like automatic file versioning and logging of actions make the research process more transparent. The OSF can be used privately, among collaborators, or opened to the general public with just the click of a button. Every resource, project, and contributor is given a persistent, unique identifier, which allows work to be cited and researchers to earn credit for contributions. The OSF represents a technical solution for researchers wishing to increase the openness of their work and meet funder mandates regarding data access.
The COS Community team focuses its efforts on building communities of researchers, funders, librarians, journal editors, and other stakeholders around open and reproducible practices in science. Two full-time staff members support researchers with free statistical and methodological consulting services, providing guidance to help researchers both meet funder mandates and make their work more open and reproducible. In another major initiative, the Community team seeks to support journal editors and funders with templates of guidelines that can be adopted to increase transparency of the research process and product. In collaboration with the Berkeley Initiative for Transparency in the Social Sciences and Science magazine, COS convened a meeting of stakeholders to write the Transparency and Openness Promotion (TOP) Guidelines. This talk will provide an overview of the guidelines, an update on the adopting journals, and more information on how journals in the chemical sciences can participate to enhance their own transparency standards.
This talk will highlight initiatives COS has undertaken to improve the openness, integrity and reproducibility of science.

11:35am-12:00pm CINF 106: Impact of open publishing: Scalability, sustainability, and success
Ann Gabriel,

Elsevier, New York, New York, United States
New policies concerning dissemination of funded research are influencing traditional modes of scholarly communication. This segment will explore how publishers are working to comply with and enhance a range of mandates from global interests, as well as streamline publication workflow for both institutions and end users. We will examine paths to compliance, including new business models and content types. We will also discuss sharing across scholarly collaboration networks, with a specific focus on Open Data.

CINF: Linking Big Data with Chemistry: Databases Connecting Genomics, Biological Pathways & Targets to Chemistry 9:30am - 11:50am
Tuesday, March 15
Room 24C - San Diego Convention Center
Rachelle Bienstock, Organizing
Rachelle Bienstock, Presiding
9:30am-9:35am Introductory Remarks
9:35am-9:55am CINF 94: Connecting 3D chemical data with biological information
Ian Bruno,, Suzanna Ward, Elizabeth Thomas, Colin Groom

Cambridge Crystallographic Data Centre, Cambridge, United Kingdom
Understanding the 3D structure of molecules and their interactions with biological systems is a crucial element of successful drug design. A vital resource in this is the world’s collection of over 800,000 crystal structures of organic and metal-organic compounds. Many of these are directly biologically relevant. Even those that aren’t contain conformational and interaction data explaining molecular properties and interactions.

Sophisticated software is available in the Cambridge Structural Database System to release this knowledge, but until now, it has been designed for human consumption. This presentation, timed to coincide with the end of the 50th anniversary year of the Cambridge Structural Database (CSD), will describe the development of an Application Programming Interface (API) that enables the linking of the CSD to other resources as well as interoperability with other suites of software.

We will see how the 3D structures of small molecules can be linked to the 3D structures of equivalent protein ligands; how the search and analysis tools previously the domain of expert structural chemists can be accessed through Pipeline Pilot and KNIME; and how we might generate streamlined workflows to link structural information in the CSD with target, pathway and disease information in resources such as Open PHACTS. Finally, we will look at the insights we can gain from linking 3D structural chemistry to biological data and the challenges involved in bridging these domains.

9:55am-10:15am CINF 95: PubChem BioAssay: Link chemical research to GenBank and beyond
Yanli Wang,

Building 38a, Room 5s506, Bethesda, Maryland, United States
The PubChem BioAssay database hosted by the National Center for Biotechnology Information (NCBI) at NIH serves as a public repository for biological results from chemogenomic research and RNAi screenings, with the former conducting systematic screening of small molecule libraries against disease targets and pathways, and the latter aiming to gain insights into biological processes and facilitate therapeutic target discovery. In particular, advanced technology in RNAi research enables genome-wide functional screens, and that in small molecule high-throughput screening (HTS) enables testing large compound libraries across wide assay target panels. PubChem BioAssay has grown rapidly in the past ten years, with over 200 million bioactivity outcomes currently in its database. It devises multiple mechanisms in its data model for recording molecular information for the corresponding protein and nucleotide assay targets, and represents an important information resource for mining chemical modulators for over nine thousand protein targets that are associated with small molecule data, and for mining significance of biological relevance for over 30,000 genes provided by RNAi research. PubChem BioAssay links chemical research data to GenBank and related genomic resources through multiple tools and annotations. This integration helps to close the gap between genomic and chemical biology research, and provides a unique annotation service for the genomic information, which enables the retrieval of drug and chemical modulators for a particular protein in GenBank, as well as searching for biological and therapeutic relevance suggested by RNAi research for many gene records.

10:15am-10:35am CINF 96: Withdrawn
10:35am-10:50am Intermission
10:50am-11:10am CINF 97: Predicting adverse drug events using literature-based pathway analysis
James Rinker,, Timothy Hoctor

R & D Solutions, Elsevier Inc., Philadelphia, Pennsylvania, United States
Unexpected drug safety issues in clinical development can lead to suspending or ending the development of a clinical candidate. The cost of failed drug candidates in both time and money can greatly hinder the development of other promising candidates due to lost development resources. The ability to more accurately predict potential adverse drug events for pre-clinical candidates would greatly help in the process of deciding to move forward or suspend the development of candidates. One potential method for the prediction of adverse drug events would employ pathway analysis of adverse event regulators. Evidence of regulators implicated in specific adverse events can be mined from the literature and mapped to known drugs or potential drug candidates based on their target profile. The target profile for a drug candidate can then be used to mine through hundreds of potential adverse events and their regulators. Statistical, pathway, and subnetwork analysis can then be used to score and predict the likelihood of a specific adverse event for a drug based on either direct or indirect target modulation.

11:10am-11:30am CINF 98: Intersecting different databases to define the inner and outer limits of the data-supported druggable proteome
Christopher Southan,

Guide to PHARMACOLOGY, University of Edinburgh, Göteborg, Sweden
Hopkins and Groom coined the term “druggable genome” in 2002 for the extrapolated total of ~ 10% of the human proteome likely to bind small molecules with lead-like chemical properties and sufficient binding affinity for activity modulation. Fast-forward to 2015 and the UniProtKB website now includes four database cross-references in the new Chemistry section. These provide a more detailed picture, based largely on chemistry-to-protein mapping data curated from the literature. They are thus evidence-supported statistics rather than homology-based transitive estimates. These included (Sept 2015) human protein links to 2927 target entries from ChEMBL, 2191 from BindingDB, 1563 from DrugBank and 1340 from the IUPHAR/BPS Guide to PHARMACOLOGY (GtoPdb). Statistical comparisons between these will be presented here, defining different levels of evidence support and following their continued expansion. The union of all four sets, 3603, encompasses ~ 18% of the proteome. However, the proportion that would match the most stringently curated of these, GtoPdb, for chemistry-to-protein mapping is lower, and comparisons indicate that curation strategies and source selections for each database diverge considerably (PMID 24533037). This is manifest in the relatively high unique content of 1147 (31% of the union) for the sources. However, they converge as a 4-way intersect for 490 proteins (13% of the union). Concordance between at least two independent sources (i.e. the non-unique proportion) expands to 2456 or 12% of the proteome. This represents the most precise data-supported druggable proteome snapshot for each UniProtKB release. Orthogonal comparative analyses of these intersecting sets will be presented, including by Gene Ontology functional categories, target class content, secreted vs. non-secreted, and disease gene links.
The utility of this druggable proteome assessment is very high in pharmacology and drug discovery, especially in terms of being able to data mine leads as chemical starting points for target validation experiments.
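
The percentages quoted in this abstract follow from simple arithmetic on the stated counts. The snippet below is a minimal sketch that reproduces them in Python. It uses only the counts given above; the proteome size of 20,000 proteins is an assumption, inferred from the statement that 3603 targets is about 18% of the proteome, and is not a figure supplied by the author.

```python
# Counts quoted in the abstract (UniProtKB cross-references, Sept 2015).
UNION = 3603      # proteins linked from at least one of the four sources
UNIQUE = 1147     # proteins linked from exactly one source
FOUR_WAY = 490    # proteins linked from all four sources

# ASSUMPTION: approximate human proteome size, inferred from "3603 is ~18%".
PROTEOME = 20_000


def pct(part: int, whole: int) -> float:
    """Return part / whole expressed as a percentage."""
    return 100.0 * part / whole


# Proteins supported by at least two sources are the union minus the
# single-source (unique) entries.
at_least_two = UNION - UNIQUE

print(f"union as share of proteome:        {pct(UNION, PROTEOME):.1f}%")
print(f"unique as share of union:          {pct(UNIQUE, UNION):.1f}%")
print(f"4-way intersect as share of union: {pct(FOUR_WAY, UNION):.1f}%")
print(f"at least two sources:              {at_least_two} proteins")
print(f"at least two as share of proteome: {pct(at_least_two, PROTEOME):.1f}%")
```

With the real protein identifiers from each database, the same quantities would come from set operations (union, intersection, and per-protein source counts) rather than from the summary figures used here.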

11:30am-11:50am CINF 99: Applications of drug-target data in translating genomic variation into drug discovery opportunities

Anna Gaulton,

Chemogenomics Team, European Molecular Biology Laboratory - European Bioinformatics Institute, Cambridge, United Kingdom
Advances in sequencing and genotyping technologies offer opportunities for large-scale target identification and validation through genetic association studies [1, 2]. However, successfully translating genotype-phenotype relationships into new therapeutics necessitates understanding of the associated biological pathways and the chemical tractability of the implicated proteins.

The ChEMBL [3] database collates and organizes drug, target and bioactivity data, with the aim of tracking the drug discovery process from target and lead identification through to drug approval. This talk will present examples of the integration of ChEMBL druggability and drug-target data with results of genome-wide association studies to facilitate the identification of novel drug discovery and drug repurposing opportunities.


1. Hingorani, A. & Humphries, S. Nature’s randomised trials. Lancet 366, 1906–8 (2005).
2. Plenge, R. M., Scolnick, E. M. & Altshuler, D. Validating therapeutic targets through human genetics. Nature Reviews Drug Discovery 12, 581–94 (2013).
3. Bento, A.P., Gaulton, A., Hersey, A., Bellis, L.J., Chambers, J., Davies, M., Krüger, F.A., Light, Y., Mak, L., McGlinchey, S., Nowotka, M., Papadatos, G., Santos, S., Overington, J.P. The ChEMBL bioactivity database: an update. Nucleic Acids Research 42, D1083-D1090 (2014).

CINF: Chemistry, Data & the Semantic Web: An Important Triple to Advance Science 1:30pm - 4:45pm
Tuesday, March 15
Room 25B - San Diego Convention Center
Evan Bolton, Stuart Chalk, Organizing
Evan Bolton, Stuart Chalk, Presiding
1:30pm-1:35pm Introductory Remarks
1:35pm-2:00pm CINF 107: Representing the chemistry of 800,000 crystal structures
Suzanna Ward,, Ian Bruno, Colin Groom

Cambridge Crystallographic Data Centre, Cambridge, United Kingdom
For over 50 years the crystallographic community has used the Cambridge Structural Database (CSD) as the worldwide repository to share over 800,000 experimentally determined 3D crystal structures with the broader chemistry community. But these structures are typically represented as ‘just’ the coordinates of atoms in space. In order to be of use to other scientists this data must be enriched with both a chemical representation and the associated metadata necessary to contextualize an entry. Moreover, the structures must also be understandable by computer software.
This presentation, timed to coincide with the end of the 50th anniversary year of the Cambridge Structural Database, will look at how the existing chemical knowledge in 800,000 crystal structures can be used to generate representations of new structures. It will look at how these representations are used in validation and standardization and in linking crystal data with other resources.
We will look at how we can make structures more discoverable and more useful, before addressing what the broader chemistry and informatics communities can do to improve scientific knowledge representation.

2:00pm-2:25pm CINF 108: CHEMnetBASE and beyond: CRC handbooks and dictionaries in today's world
Fiona Macdonald1,, Megan Eisenbraun2

1 Taylor and Francis, Boca Raton, Florida, United States; 2 Taylor & Francis, London, United Kingdom
While the CRC Handbook of Chemistry & Physics has been a mainstay for scientists since 1913, its utility is no longer restricted to the printed page. Since 1999 it has been available online in one form or another, and in the summer of 2016 the next incarnation will make its debut.

Along with the Chapman & Hall Chemical Dictionaries (Combined Chemical Dictionary, Dictionary of Natural Products, Dictionary of Organic Compounds) it makes up CHEMnetBASE, a suite of fully searchable databases containing physical properties, structures and chemical names. All of these products will be redesigned to align with the new and improved online Handbook, providing consistent search functionality, indexing protocols, and display of search results.

We will present the motivation behind the development of these resources, outline plans for integrating the search systems and showcase our vision for the future of CHEMnetBASE. Previews of the new online Handbook will also be presented.

2:25pm-2:50pm CINF 109: Collection, curation, and communication of thermophysical and thermochemical property data at the NIST Thermodynamics Research Center
Andrei Kazakov1,, Robert Chirico3, Chris Muzny4, Vladimir Diky5, Eugene Paulechka1, Ala Bazyleva1, Joseph Magee2, Scott Townsend1, Kenneth Kroenlein2

1 NIST, Boulder, Colorado, United States; 2 Thermodynamics Research Center, National Institute of Standards and Technology, Boulder, Colorado, United States; 3 National Institute of Standards and Technology, Boulder, Colorado, United States
Exponential growth in publication rates and data generation has yielded tremendous challenges as well as potential rewards for data analysis groups. Data volumes have grown to such a degree that many traditional data collection and interpretation approaches cannot scale sufficiently to remain comprehensive and current, or to effectively track shifting interests within research and industrial communities. It is thus necessary to strongly rely on a substantially increased role for digital archives, automated analysis, and machine learning approaches.

The Thermodynamics Research Center (TRC) at the National Institute of Standards and Technology (NIST) maintains an extensive database of published experimental thermophysical and thermochemical properties for pure compounds, binary and ternary mixtures, and chemical reactions. All stored experimental data are associated with estimated combined experimental uncertainties. The large-scale data collection effort is complemented by the Guided Data Capture (GDC) software developed at TRC. GDC is designed to enforce the completeness of the information extracted, validate the information through data definition, range checks, etc., and guide the uncertainty assessment to ensure consistency between compilers with diverse levels of experience. The resulting database, in combination with expert system software (ThermoData Engine, TDE), allows on-demand (i.e., dynamic) critical evaluation of thermophysical and thermochemical property data.

While the challenges in implementing such a system are significant, the potential benefits are substantial. The large, well-vetted data sets generated in this way can then be used as inputs for large-scale efforts in chemical modeling, such as chemical candidate screening or development and optimization of property estimation methods. Dynamic access to large validated data sets such as these can also be used to very quickly compare data in submitted manuscripts to a nearly comprehensive set of existing published data, as well as facilitate robust, property-based literature searches, improving the quality of published information and preventing the propagation of erroneous data. These efforts have facilitated a decade-long collaboration with key journals in the field, in which reported data are vetted for consistency by TRC before publication. The published data are disseminated in a free and open context via ThermoML, an XML-based file format and IUPAC standard.

2:50pm-3:15pm CINF 110: Building a better materials science database: Challenges and opportunities

Robin Padilla,, Michael Klinge,

Corporate Markets & Databases, Springer Nature, Heidelberg, Germany
SpringerMaterials presents large amounts of data from materials science, chemistry, and physics. The database draws on the Landolt-Börnstein Series and other specialized databases. Recent development is focused on adding new data sources, digitizing and enriching existing data, enhancing search algorithms, linking diverse content collections, and optimizing user experience design.


3:15pm-3:30pm Intermission
3:30pm-3:55pm CINF 111: TCI’s approaches to chemical information for researchers
Haruhiko Taguchi1, Tracey Barber2,

1 RD (Information Management) Department, Tokyo Chemical Industry Co Ltd, Chuo-ku Tokyo, Japan; 2 Marketing, TCI America, Cambridge, Massachusetts, United States
TCI manufactures and provides organic reagents to researchers around the world to support the advancement of chemistry. TCI also supplies chemical information to its customers in various ways, including its website, on which each product has its own dedicated page. Each product page contains links to Reaxys, PubChem, and the Spectral Database for Organic Compounds (SDBS) to help researchers quickly collect chemical information. In addition, TCI’s website product pages provide reagent applications and links to related academic journals and articles. TCI provides original chemical information too, including MSDSs that are available in multiple languages for safe use, physical properties and regulations for each product. To further aid researchers in finding the reagents they need quickly, TCI offers searching in various ways, including by CAS number, keywords, category, structure and more.

Providing researchers the reagents they need when they need them, with the information required to keep their research moving forward quickly, is the challenge of all chemical suppliers today. TCI must ensure that it offers all of the technical information needed to support the research. TCI will show how it supplies chemical information through its website.

3:55pm-4:20pm CINF 112: Presenting the latest scientific knowledge on an e-commerce website
Jonathan Stephan,

Sigma Aldrich, Saint Louis, Missouri, United States
Sigma-Aldrich has always strived to deliver the latest information to scientists. As chemical, biological, and overall scientific information has increased, Sigma-Aldrich has built a strong content backbone using automation and informatics. The process starts at product attributes and descriptions and moves to the more complex safety data sheets, technical bulletins and peer-reviewed papers. This presentation will describe how a catalog-based company has used automation to successfully transition to a leading provider of chemical and biological information to the scientific community.

4:20pm-4:45pm CINF 113: Beyond chemistry: Collect, organize, and visualize scientific data on the web
David Deng,, Rajeev Hotchandani, Jinbo Lee

Scilligence, Burlington, Massachusetts, United States
We live in a time when technological advancement makes the amount of scientific data grow exponentially. For instance, improvements in laboratory technologies allow us to explore new chemical spaces and expedite data generation, and scientific literature is being digitized for easier access. All these developments have resulted in greater scientific data availability. However, how to collect, organize, analyze and visualize this large amount of scientific data remains challenging.

In this presentation, a case study of managing chemical and biological data within Scilligence’s web-based systems will be introduced. A typical workflow runs from synthesis planning through product registration and assay data analysis to sample management. The information related to small molecules or biologics can be scattered around in the document repository system. It is, however, fully recorded and searchable with Scilligence’s knowledge-mining tools.

CINF: Driving Change: Impact of Funders on the Research Data & Publications Landscape 2:00pm - 4:50pm
Tuesday, March 15
Room 25A - San Diego Convention Center
Elsa Alvaro, Andrea Twiss-Brooks, Organizing
Andrea Twiss-Brooks, Presiding
Cosponsored by: MEDI and ORGN
2:00pm-2:25pm CINF 119: Are we ready to define the scholarly commons?
Maryann Martone1,2,

1 Neurosciences, University of California, San Diego, San Diego, California, United States; 2, San Francisco, California, United States
The question of open access must be considered through the duality of modern scholarship: access to research products involves both human and machine. FORCE11, the Future of Research Communications and e-Scholarship, is a grass roots community that arose to address the question of how scholarship needs to adapt to maximize machine-based access in the age of networks and global search. On the flip side, technology must adapt to the requirements and reality of scholarship and its need for persistence and chains of evidence.

FORCE11 is a broad tent, welcoming those across all scholarly disciplines within academia, industry, government and at large. These diverse stakeholder groups allow insight into different practices and cultures and also efforts underway around the globe to provide new platforms and services for scholarly communication. It is clear that even within a single domain, e.g., biomedicine, access to scholarship is fragmented for machines and humans alike. It is also clear that different communities, even within the same domain, are at vastly different stages in transitioning their scholarship to e-scholarship platforms.

Through projects like the Neuroscience Information Framework and the NIDDK Information Network, I have been involved first hand in cataloging the thousands of databases, tools, and materials produced by the biomedical community. There has been a huge investment in the creation of these resources, but less in long-term sustainability or interoperability. Part of the reason for this is that we really didn’t know how to do either. Sustainability is still challenging, but I believe we are making headway on the latter.

What is emerging from discussions around the globe is a better sense of the principles, best practices, interfaces and minimal standards that should govern information flow across the scholarly ecosystem to maximize machine and human access. At FORCE11, we are calling this the Scholarly Commons. We are considering not just what practices govern digital objects, but how researchers must handle physical and conceptual entities as they transition into the digital realm.

FORCE11 will be hosting a series of workshops that will explore defining the scholarly commons. The outcomes of these workshops will not be an endorsement of a particular platform or technology, but rather a statement of what any stakeholder in modern scholarship should aim to achieve to create a vibrant, dynamic ecosystem that maximizes access for both machine and human.

2:25pm-2:50pm CINF 120: Research data curation services at UC San Diego library
Ho Jung Yoo,, David Minor

Library, UC San Diego, San Diego, California, United States
In 2008, the heads of the major campus service providers at UC San Diego recognized the need to streamline and enhance access to technology services on campus. With strong resource support from the Chancellor’s office, the team of service providers formed the Research Cyberinfrastructure Initiative, a campus program designed to centralize access for faculty to the abundance of technology services on campus, which included storage, networking, and high performance computing. One of the major new thrusts of this initiative was to commission the Library to develop a Research Data Curation Program (RDCP). The RDCP was formed at the end of 2013 to support the data management, publishing, and preservation needs that faculty would imminently need to address as a part of their research activities. The program now has a staff of 10 librarians and analysts, in partnership with other Library programs, to support data curation services on campus for faculty, staff, and students. These services include administration of online tools for writing data management plans and minting persistent identifiers, management of a data repository for sharing research data publicly, long term digital preservation, training classes, and consultation services.

2:50pm-3:15pm CINF 121: Is open science an inevitable outcome of e-science?
Jeremy Frey,

University of Southampton, Southampton, United Kingdom
The advent of e-Science, building on the digital revolution in information production, exchange, and consumption, has created new ways of interacting with colleagues and disseminating discoveries. It also opened up radically new possibilities for regulation and governance of the research process and therefore unsurprisingly attracted the interest of the funders of science and the professional bodies as guardians of professional research practice. The players in the research life-cycle are still exploring and exploiting these opportunities, which are having major consequences for the securing of funding and the obligations placed on researchers, but are also creating new opportunities for different types of research. I will attempt to address some of these aspects in the context of the research landscape in the UK.

3:15pm-3:40pm CINF 122: Navigating the research data ecosystem
Dan Valen,

figshare, Brooklyn, New York, United States
Financial, social, and ethical pressures are increasingly requiring grantees to make their research results accessible in order to validate findings and spur scientific discovery. Collaboration around research data and the development of scholarly communication initiatives is fast becoming a requirement at institutions as more and more funding bodies mandate research data sharing. With the rise in funder mandates and public access policies around funded research, researchers, as well as publishers and institutions, are faced with a compliance puzzle.

This puzzle is one of the main drivers for the continuing evolution of Figshare. At Figshare, we build tools that support researchers, publishers, and institutions in storing, sharing, and discovering both positive and negative research outputs. Our ultimate goal is to aid in the reproducibility, replication, and reuse of research data and to help the research community realize this goal.

Good data management and infrastructure is at the foundation of reproducible research. This talk will touch on the evidence and challenges for reproducibility we’ve seen at Figshare and will delve deeper into incentives to motivate different stakeholders and communities toward best practices and workflows to achieve transparency in scientific research.

3:40pm-3:55pm Intermission
3:55pm-4:20pm CINF 123: Funding mandates and policies: A database provider's response
Ian Bruno1, Colin Groom2, Amy Sarjeant1,

1 Cambridge Crystallographic Data Centre, Cambridge, United Kingdom; 2 CCDC, Cambridge, United Kingdom
From the very start of the Cambridge Structural Database (CSD) to its current state as the repository for the world’s crystal structures, those who have curated these data have striven to make them available to all researchers, everywhere. After all, what’s the point of having 800,000 crystal structures if no one can make use of them? The mandates from research funding agencies that all scientific results should be publicly available dovetail with the mission of the Cambridge Crystallographic Data Centre (CCDC) to provide access to crystal structure data for anyone who requires it. How do the services that the CCDC provides match up to funder expectations and how have they evolved in response to these? What can a database provider do to ensure the quality of data is maintained while public access is guaranteed not just today but for future generations? How should it be paid for?

This presentation, timed to coincide with the end of the 50th anniversary year of the Cambridge Structural Database, explores the influence of funding agencies on data providers and the services they provide. It will also take a look at what remains to be done in order to meaningfully realise the benefits that funding agencies seek to achieve.

4:20pm-4:45pm CINF 124: Quest to find 'broader impact': How funding bodies are using altmetrics to evaluate funded research and grant applications
Sara Rouhi,

Altmetric, Washington, DC, District of Columbia, United States
As funding bodies both public and private evolve to accommodate a soaring number of applicants and diminishing pools of funds, they are increasingly looking beyond traditional metrics to evaluate new applicants and past award recipients. While traditional metrics like H-index, citations, journal impact factor, and journal prestige all speak to the scholarly impact of an applicant, they cannot indicate impact across broader audiences like practitioners (educators, doctors, lawyers, legislators -- non-scholars who use peer-reviewed research in their work), the general public, interested parties, and research communicators (like journalists). Traditional metrics also take months or years to accrue, making them lagging indicators of impact in the scholarly space. They also do not serve early career researchers or researchers working in niche fields with non-traditional research outputs. Altmetrics begin to solve some of these issues by serving as qualitative, attention and immediacy indicators. Private and public funders alike are increasingly using these indices not only to measure grants they have funded -- are they in keeping with the funder mission? Are they reaching key audiences? Are they engaging new/emerging communities of interest? -- but also to evaluate potential grant applicants and existing applications. This presentation will walk through changes in the grant funding process at public and private funders, a case study outlining why funders are using altmetrics in this way, why they have pivoted to add these new metrics in their evaluation process, and what tools you can bring to your libraries to help support your researchers' grant application efforts.

4:45pm-4:50pm Concluding Remarks
CINF: Linking Big Data with Chemistry: Databases Connecting Genomics, Biological Pathways & Targets to Chemistry 2:00pm - 4:05pm
Tuesday, March 15
Room 24C - San Diego Convention Center
Rachelle Bienstock, Organizing
Rachelle Bienstock, Presiding
2:00pm-2:05pm Introductory Remarks
2:05pm-2:25pm CINF 114: How can genomic databases be linked to chemical structural information?
Rachelle Bienstock,

RJB Computational Modeling LLC, Chapel Hill, North Carolina, United States
There are more and more databases containing genomic, biological assay and pathway data. The new Nucleic Acids Research Database issue (NAR, 2015, 43, D1-D5) contains 177 databases including genomic, RNA, protein structure, toxicity and metabolic information. However, small ligand and chemical structure compound data is not linked in an efficient way to biological assay, biological pathway and protein target information. How can ligand and structural information successfully be combined and used with biological pathway, toxicity and target pathway information in the most efficient and coherent way for drug discovery? Methods for connecting disparate database information and linking database information will be discussed.

2:25pm-2:45pm CINF 115: Reactome pathway knowledgebase: Connecting pathways, networks, and disease
Robin Haw,

Informatics and Bio-computing, OICR, Toronto, Ontario, Canada
Modern health initiatives and drug discovery are focused increasingly on targeting diseases that arise from perturbations in complex cellular events. Consequently, there has been a tremendous effort in biological research to elucidate the molecular mechanisms that underpin normal cellular processes. A reaction-network pathway knowledgebase is the tool of choice for assembling and visualizing the “parts list” of proteins and functional RNAs, as a foundation for understanding cellular processes, function and disease. The Reactome Knowledgebase is a publicly accessible, open access bioinformatics resource that stores full descriptions of human biological reactions, pathways and processes. Curated pathway knowledgebases, like Reactome, are uniquely powerful and flexible tools for extracting biologically and clinically useful information from the flood of genomic data. Our data model accommodates the annotation of disease processes, allowing us to represent the altered biological behaviour of mutant variants frequently found in cancer, and to describe the mode of action and specificity of drugs and therapeutics. Bio- and chemoinformaticians use Reactome to interpret high-throughput experimental datasets, to develop novel algorithms for data mining and visualization, and to build predictive models of normal and abnormal pathways. Specific features of Reactome support the visualization of interactions of many gene products in a complex biological process, and the application of bioinformatics tools to find causal patterns in genomic data sets. To maximize Reactome’s coverage of the genome, we have supplemented curated data with a conservative set of predicted functional interactions (FI), roughly doubling our coverage of the translated genome.
We have developed a Cytoscape app called “ReactomeFIViz”, which utilizes this FI network to assist biologists to perform pathway and network analysis to search for gene signatures from within gene expression data sets or identify significant genes within a list. Pathway and network-based tools for building and validating interaction networks derived from multiple data sets will give researchers substantial power to screen intrinsically noisy experimental data in order to uncover biologically relevant information.

2:45pm-3:05pm CINF 116: Competitive intelligence workbench: Getting access to information for decision making

Huijun Wang,

Merck, Kenilworth, New Jersey, United States
Pharmaceutical companies hold large collections of previously generated data that continue to grow. Meanwhile, a wealth of information is available externally thanks to new techniques. Information is vital for identifying innovative new drugs and drug targets. However, it remains a challenge for research scientists to quickly and easily obtain information and use it to make informed decisions. Our competitive intelligence workbench aims to provide a self-service platform that enables scientists to access the latest information from both internal and external sources and make decisions with strong supporting data. In this project, we integrated multiple sources using a big data approach and built various reusable components and services to find associations among compounds, targets, and clinical phenotypes, which is useful for novel repurposing opportunities, MOA elucidation, etc. We also developed project dashboards that provide a comprehensive knowledge overview of projects in an easy-to-navigate interface. Scientists were able to access the most recent advances in their chosen fields to support decision-making. More importantly, the change in information access methods will reduce the data bottleneck for new medicine innovation in the ever-changing landscape of research.


3:05pm-3:15pm Intermission
3:15pm-3:35pm CINF 117: Using systems biology in computational drug design workflows

George Nicola,, Bruce Kovacs

Afecta Pharmaceuticals, Irvine, California, United States
We have built an automated, workflow-based system that predicts mechanism of action for new indications of safe, off-patent drugs. The platform technology can also design new molecules for a known target or an active drug program. We do this through a combination of enumerating derivatives from a patent, generating a combinatorial library of analogues around a Markush scaffold, chemical fingerprint searches, 3D similarity (shape, pharmacophores, electrostatics), ADMET descriptor matching, gene expression profiling, and protein docking.

The platform is built in the KNIME workflow environment, and uses both open source as well as proprietary software. The prediction algorithm is custom designed using machine learning models that have been trained on large data sets. We connect and make use of multiple web-accessible databases including those for binding activity, chemical and protein structures, biological pathways, and gene expression.

To feed compounds into the workflow, we have also built a comprehensive compound registration system that analyses, isomerizes, de-duplicates, and uploads compounds to an Instant JChem-enabled MySQL database server. Our base library consists of 10,000 commercially available drug compounds, as well as several hundred hand-picked compounds with known activities.

Our workflow-based platform technology has proven especially useful when partnering with small and mid-size pharmaceutical companies seeking to address an unmet medical need by redesigning an existing product, and where regulatory approval is likely to be achieved rapidly. We provide an example of this platform being used successfully to repurpose an antipsychotic molecule into a drug candidate currently in Phase III clinical trials. We are currently in the process of designing better molecular analogues for this project.

3:35pm-3:55pm CINF 118: Combining semantic triples across domains to identify new and novel relationships and knowledge
Matthew Clark,, Frederik van den Broek, Anton Yuryev, Maria Shkrob, Sherri Matis-Mitchell, Timothy Hoctor

R & D Solutions, Elsevier Inc., Philadelphia, Pennsylvania, United States
The focus on methods to analyze large databases, ‘big data’, continues to increase as the collections of scientific observations accumulate. Elsevier has collected tens of millions of facts from scientific literature in the form of semantic triples. In biology an example triple is “A regulates/causes/changes B” where A and B can be compounds, diseases, drugs, or other entity types. The relationship is also qualified by species, tissues, and other variables. In chemistry the triples are similar, e.g. “compound C inhibits protein A” and are also qualified by variables such as potency, assay type, species, and variant. The possible combinations increase factorially with the number of facts joined together by disease, target, or chemical compound.
By combining these observations in biology and chemistry we can explore questions such as “based on the known targets drug A inhibits, what other diseases might it treat, based on disease pathways reported for all other diseases?” and “given the proteins related to a disease, and compounds known to inhibit those proteins what known compounds/structure scaffolds could be tested to treat the disease?” We will present examples of using data frameworks that combine Elsevier and open source pathway and biological activity databases to explore these questions with the broadest available knowledge base.
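The two questions above can be illustrated with a minimal sketch of joining triples across domains. The entity names, the two predicates, and the tiny in-memory lists are hypothetical placeholders, not Elsevier data or the authors' framework; a real system would also carry the qualifiers (species, potency, assay type) mentioned in the abstract.

```python
from collections import defaultdict

# Hypothetical (subject, predicate, object) triples; illustrative only.
CHEMISTRY = [
    ("drug_A", "inhibits", "protein_1"),
    ("drug_A", "inhibits", "protein_2"),
    ("drug_B", "inhibits", "protein_3"),
]
BIOLOGY = [
    ("protein_1", "regulates", "disease_X"),
    ("protein_2", "regulates", "disease_Y"),
    ("protein_3", "regulates", "disease_X"),
]


def index(triples, predicate, reverse=False):
    """Map each subject to its objects (or object to subjects if reverse)."""
    table = defaultdict(set)
    for subj, pred, obj in triples:
        if pred == predicate:
            key, value = (obj, subj) if reverse else (subj, obj)
            table[key].add(value)
    return table


def candidate_diseases(drug):
    """Diseases linked to any protein the drug is reported to inhibit."""
    targets = index(CHEMISTRY, "inhibits").get(drug, set())
    diseases = index(BIOLOGY, "regulates")
    return sorted({d for t in targets for d in diseases.get(t, set())})


def candidate_compounds(disease):
    """Compounds reported to inhibit any protein linked to the disease."""
    proteins = index(BIOLOGY, "regulates", reverse=True).get(disease, set())
    inhibitors = index(CHEMISTRY, "inhibits", reverse=True)
    return sorted({c for p in proteins for c in inhibitors.get(p, set())})


print(candidate_diseases("drug_A"))
print(candidate_compounds("disease_X"))
```

Each additional hop in such a join multiplies the number of candidate paths, which is one way to read the abstract's remark about rapidly growing combinations.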

3:55pm-4:05pm Concluding Remarks
CINF: Chemistry, Data & the Semantic Web: An Important Triple to Advance Science 8:15am - 11:55am
Wednesday, March 16
Room 25B - San Diego Convention Center
Evan Bolton, Stuart Chalk, Organizing
Evan Bolton, Stuart Chalk, Presiding
8:15am-8:20am Introductory Remarks
8:20am-8:45am CINF 125: Analytical data, the web, and standards for unified laboratory informatics databases
Graham Mc Gibbon1,, Patrick Wheeler2,

1 Advanced Chemistry Development (ACD/Labs), Toronto, Ontario, Canada; 2 Product Development, Advanced Chemistry Development, Encinitas, California, United States
For knowledge management solutions to be widely embraced by the chemical community there must be standards for handling not just chemical structures but also analytical data and metadata. This includes dealing with different data sources, types and formats. More importantly, platforms must support this from experiment inception through data acquisition and interpretation, then eventually to presentation, including appropriate storage and querying capabilities. Technology integration also gains importance considering the modern laboratory informatics environment and increasing externalization.
We present here how our organization has applied 20 years of experience in chemistry and informatics to develop technologies that unify data from distinct formats and types, and use common exchange protocols and compatibility with web based presentation layers. At the heart of this is chemical nomenclature, molecular structure, spectral and chromatographic information, and databases that store, relate and allow access to these elements and their associated relationships. Further, we illustrate such a technology, namely a platform for live data and unified laboratory intelligence, and how it is utilized. We will also look toward future application of this knowledge management representation.

8:45am-9:10am CINF 126: From molecular formulas to Markush structures: Different levels of knowledge representation in chemistry
Michael Braden,

ChemAxon, Cambridge, Massachusetts, United States
Chemical compounds can be characterized in many different ways. Depending on the level of detail we have as to the composition, we can easily end up with very general or very specific descriptions. The representation of the available information is crucial in lots of use cases, where the actual chemical knowledge drives further important decisions. These use cases include quite 'simple' ones, like the checking of compounds' uniqueness, but also very complex ones, like the coverage of the patent space by a certain Markush structure. The presentation will provide a review of the existing solutions for these problems within a suite of informatics tools, a comprehensive knowledge management solution for chemical sciences. The motivation behind the development of these resources will be described, along with a vision of how they can be used by others and successful user stories with the exciting science behind them.

9:10am-9:35am CINF 127: Strategies for creating knowledge from chemistry and text data
Tom Oldfield1,, Mariana Vaschetto1,, Jeff Nauss2,

1 Dotmatics, Bishops Stortford, United Kingdom; 2 Linguamatics, San Diego, California, United States
Chemical data representation is a challenge that has been addressed using different methodologies. Representation includes not only a set of unique chemical descriptors for the molecules themselves, but also the linking process (reactions) that they belong to in the form of metadata. The structured nature of this data makes it easy to store in structured databases. However, one common issue remains: the low quality of metadata associated with each chemical entity. This could hinder the extraction of meaningful knowledge from the stored information without time consuming human intervention. Efforts have been made in a) the optimization of chemical and reaction representation in order to achieve real-time text and data mining and b) the integration of chemical information with semantic analysis of surrounding text generated by researchers. In this talk we will focus on addressing the first issue in detail and discuss strategies for the second part.
We will provide the background on the chemical and reaction representations used by Dotmatics and on the tools that enable chemists to generate them, which together form a comprehensive chemistry toolkit. Additionally, this talk will cover how chemistry descriptors can be converted into computer fingerprints or bit-strings, allowing high performance searching (super- and sub-structure searching) and ranking of chemistry data. These solutions also take advantage of advanced memory mapping and threading to provide interactive capabilities in addition to those available on standard laptop computers. This enables data discovery to be done at the application level instantly and without recourse to large scale server infrastructure.
Finally, we will explore how all Dotmatics technologies can make use of standardized ontology dictionaries and other commercially available natural language processing (NLP) based text mining tools, providing added value in the knowledge discovery process.
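As a general illustration of the fingerprint idea mentioned above (not the Dotmatics implementation, whose details the abstract does not give), the sketch below encodes a hypothetical set of structural features as an integer bit-string. A sub-structure screen then reduces to a bitwise test, and ranking can use the Tanimoto coefficient. The feature list and molecule feature sets are assumptions chosen for illustration.

```python
# Hypothetical feature dictionary: each feature owns one bit position.
FEATURES = ["C-C", "C-O", "C=O", "O-H", "C-N", "aromatic ring"]
BIT = {name: 1 << i for i, name in enumerate(FEATURES)}


def fingerprint(features):
    """Encode a set of feature labels as an integer bit-string."""
    bits = 0
    for name in features:
        bits |= BIT[name]
    return bits


def passes_substructure_screen(query_fp, target_fp):
    """True if every bit set in the query is also set in the target.

    This is a necessary condition only: molecules that pass still need
    an atom-by-atom match to confirm the sub-structure is present.
    """
    return query_fp & target_fp == query_fp


def tanimoto(fp_a, fp_b):
    """Shared bits divided by bits set in either fingerprint."""
    union = bin(fp_a | fp_b).count("1")
    return bin(fp_a & fp_b).count("1") / union if union else 0.0


# Illustrative, hand-assigned feature sets (assumptions, not computed).
ethanol = fingerprint({"C-C", "C-O", "O-H"})
acetic_acid = fingerprint({"C-C", "C-O", "C=O", "O-H"})
carbonyl_query = fingerprint({"C=O"})

print(passes_substructure_screen(carbonyl_query, acetic_acid))
print(passes_substructure_screen(carbonyl_query, ethanol))
print(tanimoto(ethanol, acetic_acid))
```

A super-structure screen is the same test with the arguments swapped. Because the screen uses only bitwise operations, it can discard most of a large collection before any expensive graph matching is attempted.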

9:35am-10:00am CINF 128: Combined structure and reaction retrieval in scientific content: What satisfied users in the past and what they demand for the future
Guido Herrmann1,, Josef Eiblmaier1, Valentina Eigner-Pitto1

1 Georg Thieme Verlag KG, Stuttgart, Germany; 1 InfoChem GmbH, Munich, Germany
Thieme has been a chemistry publisher since 1909. We publish scientific information in various formats: journals, reference works, encyclopaedias, monographs and textbooks. Together with InfoChem GmbH, a software company focusing on the production and marketing of new products for chemical information, advanced solutions have been developed to handle, store and retrieve chemical structures and reactions.

In our talk we present the motivation behind the development of these resources and a vision of how they can be used by others. We will highlight, for reference works, journals and encyclopaedias, how a combination of semantic technologies and advanced text, structure and data representation, together with sophisticated search technologies, leads to a greatly enhanced user experience and discoverability.

10:00am-10:15am Intermission
10:15am-10:40am CINF 129: Harnessing chemical and toxicological data for the evaluation of food ingredients and packaging
Diane Schmit,, Tammy Page, Kirk Arvidson, Patra Volarath, Leighna Holt

US Food and Drug Administration, College Park, Maryland, United States

The U.S. Food and Drug Administration’s (FDA’s) primary mission is to promote and protect public health. FDA's Center for Food Safety and Applied Nutrition (CFSAN) is one of six product-oriented centers within the FDA that carries out the mission of FDA to enforce the Federal Food, Drug, and Cosmetic (FD&C) Act and other laws that are designed to protect consumers' health and safety. The Office of Food Additive Safety (OFAS) within CFSAN manages FDA's pre- and post-market safety review of food additives, color additives, food contact substances, and generally recognized as safe (GRAS) substances. As a result of its responsibilities, OFAS has amassed a very large volume of chemical, toxicological and regulatory data on chemicals under its purview. As such, OFAS has developed a number of web-based informatics tools that link regulatory submissions, regulations, chemical data and toxicological data to facilitate the identification of the regulatory history of a particular chemical as well as the chemical and toxicological data available within our internal administrative files. STARI is an ontology of scientific and food terminology and regulatory data, organized in a multi-hierarchical structure, and cross-linked to CERES and other data resources. CERES is OFAS’ chemical-centric knowledgebase that links regulatory history with human intake estimates and toxicological data in one resource. CERES also provides informatics tools to probe potential toxicity as well as identify potential structural analogs for read-across approaches to assist in the safety evaluation of new and previously regulated food additives and ingredients.

10:40am-11:05am CINF 130: Expansion of DSSTox: Leveraging public data to create a semantic cheminformatics resource with quality annotations for support of U.S. EPA applications
Christopher Grulke2, Inthirany Thillainadarajah1, Antony Williams1, David Lyons1, Jeff Edwards1, Ann Richard1,

1 National Center for Computational Toxicology, US EPA, Research Triangle Park, North Carolina, United States; 2 Zachary Piper Solutions, New Hill, North Carolina, United States
The expansion of chemical-bioassay data in the public domain is a boon to science; however, the difficulty in establishing accurate linkages from CAS registry number (CASRN) to structure, or in properly annotating names and synonyms for a particular structure, is well known. DSSTox has long been considered a trusted source for highly curated CASRN to name to structure relationships within the environmental toxicology community. DSSTox recently expanded to include accurate annotation of the more than 8000 chemical substances being tested in the ToxCast and Tox21 programs. To extend cheminformatics integrity beyond DSSTox’s initial 25K substances, we collected data from various public sources and performed a series of checks to evaluate the consistency of chemical information within and across these public repositories. Incoming data were constrained by strictly enforcing a 1:1 mapping of CASRN to structure, and each substance was assigned to one of six “QCLevels” to capture the level of confidence in CASRN to name to structure associations. The number of chemicals now supported in DSSTox has expanded to over 750k with over 150k curated to be higher quality than public resources. This expanded version of DSSTox is available to the public in legacy DSSTox flat file and SDF formats, through web interfaces supporting EPA’s Chemical Safety and Sustainability (CSS) projects (including ToxCast and Tox21), and in RDF graph format to facilitate semantic data efforts. Our efforts have quantified a high degree of inconsistency in publicly available chemical annotations, as well as highlighted the challenges caused by limited adoption of semantic data in chemistry to date. This abstract does not reflect U.S. EPA policy.
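The kind of consistency check described above can be sketched as follows. This is not the DSSTox pipeline: the record layout and the structure keys (plain strings standing in for a canonical structure identifier) are assumptions, and the check-digit test is the standard CAS checksum added here for illustration rather than a step the abstract names.

```python
import re
from collections import defaultdict


def valid_casrn(casrn):
    """Check the CASRN layout and its check digit.

    The check digit equals the sum of the other digits, each weighted by
    its position counted from the right (1, 2, 3, ...), modulo 10.
    """
    match = re.fullmatch(r"(\d{2,7})-(\d{2})-(\d)", casrn)
    if not match:
        return False
    digits = match.group(1) + match.group(2)
    total = sum(w * int(d) for w, d in enumerate(reversed(digits), start=1))
    return total % 10 == int(match.group(3))


def one_to_one_conflicts(records):
    """Return (casrn -> structures, structure -> casrns) that break 1:1."""
    by_casrn, by_structure = defaultdict(set), defaultdict(set)
    for casrn, structure in records:
        by_casrn[casrn].add(structure)
        by_structure[structure].add(casrn)
    many_structures = {c: sorted(s) for c, s in by_casrn.items() if len(s) > 1}
    many_casrns = {s: sorted(c) for s, c in by_structure.items() if len(c) > 1}
    return many_structures, many_casrns


# Hypothetical incoming records: (CASRN, structure key).
records = [
    ("7732-18-5", "O"),         # water
    ("64-17-5", "CCO"),         # ethanol
    ("71-43-2", "c1ccccc1"),    # benzene
    ("71-43-2", "CCO"),         # deliberately conflicting record
]

print([valid_casrn(c) for c, _ in records])
print(one_to_one_conflicts(records))
```

Records that appear in either conflict table would be the ones held back for curation before a confidence level is assigned.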

11:05am-11:30am CINF 131: Comparative toxicogenomics database: Advancing understanding of molecular connections among chemicals, genes, and diseases

Cynthia Grondin,, Allan Davis, Thomas Weigers, Carolyn Mattingly

Biology, North Carolina State University, Raleigh, North Carolina, United States
Exposure to chemicals in the environment plays a key role in the etiology of many human diseases and phenotypes. Chemicals influence genes and proteins, molecular pathways, and disease susceptibility, yet a clear understanding of their direct role in human disease is lacking. The Comparative Toxicogenomics Database (CTD) promotes understanding about the effects of environmental chemicals on human health by manually curating and presenting data from scientifically reviewed literature on the interactions between chemicals, genes, and diseases in vertebrates and invertebrates. In our curation paradigm, CTD scientists use controlled vocabularies, ontologies, mnemonic codes, symbols, and structured notation to transform the scientific literature into a semantic, computable structure. This information is integrated with gene attributes (including Gene Ontology annotations), molecular pathways, species, and general toxicology information to provide a free knowledgebase of over 28 million toxicogenomic relationships that can inform user hypotheses. CTD chemicals align with MeSH chemical terms and link to CCRIS, ChEBI, ChemIDplus, GENE-TOX, Household Products Database, Hazardous Substances Data Bank, PubChem and TOXLINE. Numerous CTD tools enable novel enrichment and comparative analyses of user-defined or CTD-based data sets. In addition, the structured information is made available for computational analysis in the form of XML, BEL, and other formats. Here, we present an overview of CTD functionality with emphasis on chemical representation and its integration with molecular and disease data.

11:30am-11:55am CINF 132: Wikidata: Advancing science through semantic integration of genes, diseases, and drugs
Benjamin Good1,, Elvira Mitraka2, Andra Waagmeester1,3, Sebastian Burgstaller-Muehlbacher1, Timothy Putman1, Andrew Su1, Lynn Schriml4

1 Department of Molecular and Experimental Medicine, Scripps Research Institute, La Jolla, California, United States; 2 Institute for Genome Sciences, University of Maryland School of Medicine, Baltimore, Maryland, United States; 3 Micelio, Antwerp, Belgium; 4 Epidemiology and Public Health, Institute for Genome Sciences, University of Maryland School of Medicine, Baltimore, Maryland, United States
Wikidata is an openly accessible and editable, Semantic Web-compatible knowledge base that now underlies Wikipedia as a “knowledge commons”. It is a full-fledged member of the linked data cloud, with a SPARQL endpoint available. Wikipedia articles can now render content queried directly from Wikidata and each Wikipedia article is hyperlinked to a corresponding data item in Wikidata.
Our team is addressing the ongoing challenge of biomedical knowledge dissemination and integration by populating Wikidata with the seeds of a semantic network linking genes, drugs and diseases. Nodes and edges in this network are populated automatically by ‘bots’ that integrate data from trusted authorities such as NCBI’s Entrez Gene, DrugBank, and the Human Disease Ontology. Using this content, we are automatically enhancing the number, content and semantic inter-relations of Wikipedia articles about genes, diseases and drugs.
Outside of Wikipedia, the open API of Wikidata provides the capacity to generate or enhance many other applications. For example, useful queries of chemical data such as “what clinically relevant drug-drug interactions are known for the drug methadone” are already possible with Wikidata’s SPARQL endpoint. Supporting APIs also provide access to the edit history of all items in the graph, providing programmatic capabilities to detect and correct vandalism and to reward individual contributors.
Wikidata is unique among biomedical Semantic Web resources in that it is editable by anyone and is embedded directly in the context of all other human knowledge. This openness and centrality make it the ideal foundation upon which to build the next generation of Web-scale semantic data. We encourage the chemical informatics and chemical biology community to join us in expanding Wikidata’s coverage of the chemical universe, in particular, the development of drug-gene and drug-disease semantic relations.
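The methadone question quoted above could be posed roughly as in the sketch below, which only builds the query text and request URL and does not contact any server. Several details are assumptions not stated in the abstract: the endpoint address, the use of property P769 ("significant drug interaction") for drug-drug interactions, and matching the drug by its English label instead of by item identifier.

```python
from urllib.parse import urlencode

# Assumed address of the public Wikidata query service.
ENDPOINT = "https://query.wikidata.org/sparql"


def interaction_query(drug_label):
    """Build a SPARQL query for interaction partners of a labelled drug.

    Assumes wdt:P769 is the drug-interaction property and relies on the
    prefixes (rdfs, wdt, wikibase, bd) that the query service predefines.
    The label is inserted verbatim, so it must not contain quote marks.
    """
    return f'''
SELECT ?drug ?partner ?partnerLabel WHERE {{
  ?drug rdfs:label "{drug_label}"@en ;
        wdt:P769 ?partner .
  SERVICE wikibase:label {{ bd:serviceParam wikibase:language "en". }}
}}
'''.strip()


def request_url(drug_label):
    """URL that would return the query results as JSON."""
    params = {"query": interaction_query(drug_label), "format": "json"}
    return ENDPOINT + "?" + urlencode(params)


print(interaction_query("methadone"))
print(request_url("methadone"))
```

Fetching that URL with any HTTP client should return bindings for `?partner` and `?partnerLabel`, provided the assumed property and endpoint are still current.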

CINF: Reimagining Libraries as Innovation Centers: Enabling, Facilitating & Collaborating throughout the Research Life Cycle 8:45am - 12:00pm
Wednesday, March 16
Room 24C - San Diego Convention Center
Ye Li, Vincent Scalfani, Organizing
Ye Li, Presiding
8:45am-8:50am Introductory Remarks
8:50am-9:15am CINF 133: From dusty stacks to an information hub: Reimagining the UF libraries
Neelam Bharti1,, Sara Gonzalez2

1 Marston Science Library, University of Florida, Gainesville, Florida, United States; 2 Marston Science Library, Gainesville, Florida, United States
In the last several years, the University of Florida Libraries has been working on redesigning the Marston Science Library, rising above expectations and collaborating actively on a program of strategic planning and innovation. Once a library that contained mostly books and journals with very little study space and few electrical outlets, the science library was transformed into an innovative collaboration center by providing modern technologies and study space for students and researchers. The science library quickly became a center point for the university with the inclusion of 3D printing, the MADE@UF lab, a visualization and conference room, and open floor seating in the new Collaboration Commons. Considering that most of the research resources and journals are online, results have been very impressive, with our user counts doubling in the last year. The redesign has transformed not just the library space and workflow but also the library’s organizational culture and the responsibilities of librarians. Marston’s transformation has been so successful that other UF libraries are following with similar renovations (known as “Marstonization”). This transition represents a major step in rethinking and redesigning a traditional library space on the way towards the inventive libraries of the future.

9:15am-9:40am CINF 134: Expanding the research commons model into disciplinary instances
Jeremy Garritano,

University Libraries, University of Maryland, College Park, Maryland, United States
In a distributed library environment, providing services to faculty, staff and students can be complicated by concerns such as dispersed library staff, properly targeting services to appropriate campus segments, and leveraging the various infrastructures of individual libraries. At the University of Maryland, a Research Commons model was first developed at the “main library,” with a focus on both virtual and physical services. After an initial academic year, the development of a disciplinary Commons was considered to complement the Research Commons and the previously established Learning Commons. In the summer of 2015, a taskforce was created to outline the creation of a Science Commons that would be connected to the Research Commons. This talk will present the general philosophy of the Commons model as interpreted at the University of Maryland as well as discuss the administrative and organizational evolution of the Commons. Preliminary services of the Science Commons and their assessment will also be discussed.

9:40am-10:05am CINF 135: Libraries for the future: A digital economy perspective
Jeremy Frey,, Steven Brewer

University of Southampton, Southampton, United Kingdom
The discussion of Libraries for (and of) the future formed a major theme for the IT as a Utility (ITaaU) Challenge area network of the Research Councils UK (RCUK) Digital Economy theme. We present a summary of the discussion and conclusions of the workshops and meetings examining the future role of research and community libraries that have taken place under the auspices of the ITaaU Network. The concept of a library in the digital age was informed by considering the origins and uses of research libraries over time as not only a repository but also an active research space. The University library has played a key role as a meeting point between disciplines, enabling and informing interdisciplinary discourse long before it became necessary to formally acknowledge this need. New ways of interacting with the research and wider community will be discussed, along with the way in which the digital presence (“digital aura”) of people and “books” alters the information flow between organisations and people.

10:05am-10:20am Intermission
10:20am-10:45am CINF 136: Leveraging the interdisciplinarity of chemistry: Building interdisciplinary collaborations
Kiyomi Deards

Research and Instructional Services, University of Nebraska-Lincoln, Lincoln, Nebraska, United States
An outreach event started by three chemists in Nebraska has spawned several collaborations, both nationally and statewide. Learn how they are leveraging the interdisciplinary nature of chemistry and chemical information to create outreach and scholarly collaborations within STEM (Science, Technology, Engineering, Math) and with the social sciences.

10:45am-11:10am CINF 137: Predicting local trends in scholarly communication for decision-making in collection development: An exploration beyond citation analysis
Ye Li

University of Michigan, Ann Arbor, Michigan, United States
Data-driven collection development has been one of the means employed by academic librarians to revolutionize library collections and services for many years. Citation analysis of scholarly publications from researchers of an institution, in combination with resource usage data and interlibrary loan data, can often generate a baseline of needed resources during a given time period. However, most analyses have focused on counting the frequency of current or past citations or usage, and few studies have used current data to predict future trends in scholarly communication and demand for new resources. In this study, we explore the possibility of using basic regression models and machine learning tools from the emerging data science field to analyze citation, usage and other library transaction data in Chemistry and related research fields. This analysis will identify useful features, such as citation counts, subjects, and keywords, and their corresponding weights for prediction of future trends and potential research directions in a specific institution. Other features like costs, budgets, and usage statistics could be included in the model to predict the importance and the likelihood of keeping a title or a group of titles in the next few years. One focus of the study is to make a useful model for revealing the trends of publishing open access articles among chemists. A successful data model has the potential to help librarians select open access journals for recommendation and decide whether to pay the member institution fees. By applying the model, we hope to tie our decision-making in collection development more closely to the local trends of research and scholarly communication through an evidence-based approach.
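
As a rough illustration of the "basic regression" idea in the abstract above, the following Python sketch fits a straight line to yearly counts and extrapolates one year ahead. It is not the authors' model: the function name `fit_line`, the variable names, and the yearly counts are all invented for illustration, and a real analysis would use many more features than a single count per year.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance and variance terms (unnormalized; the 1/n factors cancel).
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope

# Hypothetical data: open access articles per year from one department.
years = [2011, 2012, 2013, 2014, 2015]
oa_articles = [2, 4, 6, 8, 10]

intercept, slope = fit_line(years, oa_articles)
forecast_2016 = intercept + slope * 2016  # extrapolate one year ahead
print(round(forecast_2016, 1))
```

A linear trend is the simplest possible choice here; it only shows how past counts could feed a forecast, and says nothing about which model the study ultimately uses.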

11:10am-11:35am CINF 138: Academic technologies: A new library service to offer advanced software training
Vincent Scalfani, Melissa Green

University Libraries, University of Alabama, Tuscaloosa, Alabama, United States
Libraries have started to offer access to a tremendous amount of advanced academic software such as graphing, 3D design, and technical computing programs. Traditionally, students are expected to learn much of this software on their own or within their courses. We have found a great need to go beyond simply offering access to software in our libraries. As such, there is a tremendous opportunity for new collaborations and teaching initiatives with software applications among librarians, students, and faculty in their coursework and research. This presentation will cover what libraries are doing to meet software training needs as well as our own experience teaching workshops and offering consultations to support various software applications such as Adobe Creative Suite, ChemDraw, IBM SPSS Statistics, MathWorks Matlab, MS Office, QtiPlot, and Trimble SketchUp Pro. We will conclude this presentation with our ideas for the future role of libraries with advanced software training and collaborations.

11:35am-12:00pm CINF 139: Enhanced chemical understanding through 3D-printed models

Amy Sarjeant1, Peter Wood4, Ian Bruno1, Ye Li2, Vincent Scalfani3, Shawn O'Grady2

1 Cambridge Crystallographic Data Centre, Cambridge, United Kingdom; 2 University of Michigan, Ann Arbor, Michigan, United States; 3 University Libraries, University of Alabama, Tuscaloosa, Alabama, United States; 4 CCDC, Cambridge, United Kingdom
With the advent of affordable 3D printing technology, including in-house printers and web-based commercial enterprises, what had long been a novelty is rapidly becoming a useful tool in the education process. Students have long used chemical model kits to create tactile molecules which help elucidate principles of bonding, VSEPR theory and other three-dimensional properties which are difficult to understand from the two-dimensional world of textbooks and slide presentations. The ability to print copies of common molecules, as they appear in the solid state, can not only bring a sharper understanding of these “static” concepts but can also shed light on dynamic processes such as those involved in molecular machines, host-guest chemistry, protein docking phenomena, and molecular motions. The inherent difficulty in creating models which can demonstrate these dynamic behaviors is finding the correct parameters and materials which will produce an interlocking, flexible model which remains robust. Many academic libraries have created 3D labs providing 3D visualization and printing services. These 3D labs enable researchers, educators, digital fabrication specialists and librarians to develop these models collaboratively.

Using data available from the nearly 800,000 structures in the Cambridge Structural Database (CSD) and software embedded in the visualization and exploration program Mercury, we explore the procedures needed to produce such classroom aids. This presentation, timed to coincide with the end of the 50th anniversary year of the Cambridge Structural Database, will describe our attempts to create dynamic 3D models as well as several educational modules which can be used in conjunction with them.

CINF: Chemistry, Data & the Semantic Web: An Important Triple to Advance Science 1:30pm - 4:45pm
Wednesday, March 16
Room 25B - San Diego Convention Center
Evan Bolton, Stuart Chalk, Organizing
Evan Bolton, Stuart Chalk, Presiding
1:30pm-1:35pm Introductory Remarks
1:35pm-2:00pm CINF 140: IUPHAR/BPS guide to pharmacology (GtoPdb): Concise mapping for the triples of chemistry, data, and protein target classifications
Christopher Southan, Joanna Sharman, Adam Pawson, Elena Faccenda, Jamie Davies

Guide to PHARMACOLOGY, University of Edinburgh, Göteborg, Sweden
The International Union of Basic and Clinical Pharmacology Committee on Receptor Nomenclature and Drug Classification (NC-IUPHAR) provides authoritative reports on G protein-coupled receptors (GPCRs), Nuclear Hormone Receptors and Ion Channels as pharmacology-based classifications. While these recommendations surfaced as Pharmacological Review papers (i.e. unstructured) since the 1990s, they were already underpinning the protein tables in GtoPdb's predecessor, IUPHAR-DB, by 2003. By 2012 this hierarchical data structure had expanded into the GtoPdb schema covering essentially all target classes for pharmacology, drug discovery and chemical biology. As of August 2015 the expert-curated relationship capture from the literature covers 1505 target-to-ligand mappings, of which 1228 human protein IDs have quantitative interaction data recorded against 5860 chemical structures. The motivation, evolutionary trajectory, the need for community engagement to fill data gaps and future directions of the resource will be outlined. Descriptions will cover the challenges of cross-referencing alternative gene/protein hierarchies, each of which has different navigational utilities and linkages to chemistry in GtoPdb. These now extend beyond receptors to enzymes and include NC-IUPHAR, HGNC, UniProt, Ensembl, InterPro, Gene Ontology and E.C. numbers. The adaptation of our classifications to encompass a new immunopharmacology project will also be discussed.

2:00pm-2:25pm CINF 141: Open PHACTS: Semantic interoperability for drug discovery
Herman Van Vlijmen1, Open PHACTS Consortium2

1 Computational Chemistry, Discovery Sciences EU, Janssen, Beerse, Belgium; 2, Vienna, Austria
The Open PHACTS project has built a semantic platform for drug discovery that integrates data over diverse sets of public chemistry and biological data. It currently connects linked open data from 12 different data sources, including chemical compounds, protein targets, biological pathways and tissues, and diseases. The diversity and size of the Open PHACTS data are growing rapidly, and it currently contains more than 3 billion triples. The Open PHACTS project is a unique collaboration between European academic groups, small businesses and large pharmaceutical companies, partially funded by the EU. The driver for the project is to enable scientists to easily access and process data from multiple sources to solve real-world drug discovery problems that were very difficult to solve before. These drug discovery problems formed the basis for selecting which public data sources were integrated in the Open PHACTS project. Anyone can freely access the Open PHACTS data through a well-documented interface (API), and numerous workflows to answer specific biomedical questions have been developed and published using the KNIME and Pipeline Pilot pipelining tools. In addition, several custom applications have been built using the API. Open PHACTS has shown that Linked Open Data in the form of RDF triples can be used effectively by the scientific community, and allows queries that were previously very difficult or impossible to run. Future directions include the integration of additional public and commercial data sources, integration of internal company data with Open PHACTS data, and the continued development of workflows for scientific questions that can only be answered using linked data.
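
To make the idea of linked data as triples more concrete, here is a minimal Python sketch of how facts from separate sources might be chained once they share identifiers. It does not use the Open PHACTS API or any real RDF library: the identifiers, predicate names, and the helper functions `match` and `diseases_for` are all hypothetical.

```python
# Each fact is a (subject, predicate, object) triple.
# All identifiers and predicates below are invented for illustration.
TRIPLES = [
    ("compound:X", "inhibits", "target:T1"),         # e.g. from a chemistry source
    ("target:T1", "participatesIn", "pathway:P1"),   # e.g. from a pathway source
    ("pathway:P1", "associatedWith", "disease:D1"),  # e.g. from a disease source
    ("compound:Y", "inhibits", "target:T2"),
]

def match(s=None, p=None, o=None):
    """Return all triples matching the pattern; None acts as a wildcard."""
    return [
        t for t in TRIPLES
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    ]

def diseases_for(compound):
    """Follow compound -> target -> pathway -> disease links."""
    found = set()
    for _, _, target in match(s=compound, p="inhibits"):
        for _, _, pathway in match(s=target, p="participatesIn"):
            for _, _, disease in match(s=pathway, p="associatedWith"):
                found.add(disease)
    return found

print(diseases_for("compound:X"))
```

The point of the sketch is only that a question spanning several sources becomes a chain of simple pattern matches once the data share a common triple form; a production system would express the same chain as a query against a triple store.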

2:25pm-2:50pm CINF 142: Representation of drug discovery knowledge in the ChEMBL and SureChEMBL databases
Anna Gaulton

Chemogenomics Team, European Molecular Biology Laboratory - European Bioinformatics Institute, Cambridge, United Kingdom
The ChEMBL [1] bioactivity database and the SureChEMBL patent resource were both originally developed as commercial products but have been transferred to EMBL-EBI and are now freely available to academic and industrial researchers. The increasing availability of such open chemistry data has had a dramatic impact on the field of cheminformatics. This talk will address some of the issues and complexity involved in curating and maintaining these large-scale chemistry resources and the strategies employed to facilitate integration and mining of these data [2, 3].


1. Bento, A.P., Gaulton, A., Hersey, A., Bellis, L.J., Chambers, J., Davies, M., Krüger, F.A., Light, Y., Mak, L., McGlinchey, S., Nowotka, M., Papadatos, G., Santos, S., Overington, J.P. The ChEMBL bioactivity database: an update. Nucleic Acids Res. 42, D1083-D1090 (2014).

2. Papadatos, G., Gaulton, A., Hersey, A., Overington, J.P. Activity, assay and target data curation and quality in the ChEMBL database. J. Comput. Aided Mol. Des. DOI:10.1007/s10822-015-9860-5 (2015).

3. Hersey, A., Chambers, J., Bellis, L., Bento, A.P., Gaulton, A., Overington, J.P. Chemical databases: curation or integration by user-defined equivalence? Drug Discov. Today Technol. 14, 17-24 (2015).

2:50pm-3:15pm CINF 143: Chemical knowledge representation and access in Wolfram|Alpha and Mathematica

Eric Weisstein

Scientific Content, Wolfram|Alpha, Champaign, Illinois, United States
Wolfram|Alpha is a freely available website that contains and exposes curated data sets taken from hundreds of technological, scientific, sociological, and other domains--including a substantial and growing set of chemical data. This data is accessible directly via the website, through its API, and through a number of other specialized sources (such as various apps and Siri). More recently, it is also available in the Wolfram Language and Mathematica as a set of built-in functions centered around an entity-property approach to information representation.

This talk will focus on the infrastructure developed for representing and accessing data (especially chemical data) in Wolfram|Alpha and on the Wolfram Language functionality for making this data even more computationally accessible within Mathematica. The talk will also touch on the extensive unit system now built into Mathematica through the unit and physical quantity infrastructure backend developed for Wolfram|Alpha.

The introduction of entity, entity class, property, property qualifier, and related Wolfram Language symbols provides a flexible way to represent, access, and compute with data. At the same time, Wolfram|Alpha and Mathematica implement a natural language encoding and processing system for easily accessing information and automatically converting plain text inputs and computational queries into the Wolfram Language. The resulting synthesis of data representation, exposure, and access provides a powerful and extensible framework that is practically applicable to virtually any domain of interest (including chemistry).

Wolfram|Alpha's knowledge comes from a combination of Mathematica computations, roughly 1000 curated data sets, and links to a number of real-time data sources. Additional chemistry-specific functionality currently under development in the Wolfram Language includes service connections to the Open PHACTS, ChemSpider, and PubChem databases (which will allow computations using chemical databases more extensive than those directly built in to the Wolfram Language), computational encoding of functional groups (for further graph related exploration), and improved support for pharmaceuticals and chemical compounds relevant in medical sciences.

3:15pm-3:30pm Intermission
3:30pm-3:55pm CINF 144: Helping people navigate the changing seas of scientific information
David Evans1, Pieder Caduff1, Thibault Geoui2, Juergen Swienty-Busch2

1 Reed Elsevier Properties SA, Neuchatel, Switzerland; 2 Elsevier Information Systems, GmbH, Frankfurt, Germany
Many people suggest that chemistry is the central science. It certainly underpins much of our modern life, and has a central role in delivering many of the solutions to key problems facing mankind today. At RELX Group we provide high quality content, data, and analytics tools to help scientists make these new discoveries. In order to provide the systems that will meet the demands of tomorrow's scientists, we are rebuilding our infrastructures today, including different classifications and linking technologies. In this presentation we will discuss the ongoing requirements for the production of major online research resources, in particular ontologies and taxonomies in the chemical, biological and biomedical areas, automatic indexing of content, and simplification of search strategies. We will also provide an insight into how our strategies for the future are being influenced by changes in user behaviour and demands, and by technology on the web.

3:55pm-4:20pm CINF 145: Characterization and categorization of novel knowns, unknowns, and the interface between physical and digital
Graeme Whitley1, Bernd Berger2, Timothy Adams2

1 Wiley, Hoboken, New Jersey, United States; 2 Wiley-VCH, Weinheim, Germany
We present our experience as a publisher in categorizing novel compounds, partially characterized metabolites, mixtures, and other edge cases in the interface between lab instrument, literature, and the chemical knowledge space. Examples and solutions from a variety of domains, including toxicology and clinical, will be provided, with an emphasis on spectroscopic data.

4:20pm-4:45pm CINF 146: Semantic approaches for biochemical knowledge discovery
Michel Dumontier

Medicine, Stanford University, Stanford, California, United States
With its focus on investigating the basis for the sustained existence of living systems, biochemistry has always been a fertile, if challenging, domain for formal knowledge representation and automated reasoning. Thousands of databases and hundreds of ontologies are publicly available, and there is a salient opportunity to mine these for discovery. In this talk, I will discuss our efforts to build a rich foundational network of ontology-annotated linked data, develop methods to intelligently retrieve content of interest, uncover significant biochemical associations, and pursue new avenues for drug repositioning. As the portfolio of semantic technologies continues to mature in terms of functionality, scalability, and an understanding of how to maximize their value, biochemical researchers will be strategically poised to pursue increasingly sophisticated projects aimed at improving our overall understanding of human health and disease.

CINF: Reimagining Libraries as Innovation Centers: Enabling, Facilitating & Collaborating throughout the Research Life Cycle 1:30pm - 4:45pm
Wednesday, March 16
Room 24C - San Diego Convention Center
Ye Li, Vincent Scalfani, Organizing
Vincent Scalfani, Presiding
1:30pm-1:35pm Introductory Remarks
1:35pm-2:00pm CINF 147: Leveraging the VIVO research networking system to facilitate collaboration and data visualization

Michaeleen Trimarchi, Danielle Bodrero Hoggan

Kresge Library, The Scripps Research Institute, La Jolla, California, United States
VIVO is a Research Networking System (RNS) based on open source software originally developed at Cornell. The Scripps Research Institute's Kresge Library staff created the Scripps VIVO Scientific Profiles RNS with NIH grant support in 2009-2011 and have continued to enhance this Linked Open Data resource. At the start of the research life cycle, faculty can search VIVO to identify potential collaborators. When they are preparing grant proposals and submitting renewals, they can include VIVO's NIH Biosketch Lists and PubMed Papers links. VIVO's metadata allows for the automated creation of Map of Science and Co-author Network data visualizations based on journal articles in a faculty member's profile. In addition, Library staff reuse the data generated for VIVO publication ingest to create custom research collaboration network visualizations to support NIH training grant applications.

2:00pm-2:25pm CINF 148: Stanford profiles created to support the university’s scholarly community
Grace Baysinger

Swain Chem & Chem Eng Library, Stanford University Libraries, San Jose, California, United States
In 2014, Stanford created the Stanford Profiles website to support faculty and to facilitate their research activities by extending to other schools, institutes and administrative units on campus the Community Academic Profiles (CAP) system that has been available to School of Medicine faculty since 2004. Currently, there are more than 18,000 profiles of faculty, graduate students, postdocs and staff in Stanford Profiles. The profiles are available through both a public and a Stanford-only view. A faculty profile may include biographical information, research interests, publications, courses taught, names of graduate and postdoctoral advisees, doctoral programs the faculty member is associated with as a PhD advisor, faculty collaborators, plus cross-references for faculty members by keywords. Data is pulled into profiles from the unit that generates the data (e.g. University Registrar for courses taught). A user is able to download a curriculum vitae created from profile data. Under a system developed by Stanford University Libraries, new publications are 'exported' to Stanford Profiles, where the citations and other relevant data are displayed in the profile owner's inbox for review. Once an individual has approved a publication, it will appear on his or her profile. Salesforce Chatter, a leading social-networking platform designed for the business context, is integrated into the Stanford-only view, making it easy to work closely with colleagues in a private, secure environment. The CAP Working Group, a collaborative group of representatives from participating units, has overseen the development of Stanford Profiles. Data from Stanford Profiles can be shared with other Drupal-based websites on campus via APIs, thus saving time and duplication of effort.

2:25pm-2:50pm CINF 149: Managing researchers' reputations throughout the research life cycle
Linda Galloway, Anne Rauh

Syracuse University Libraries, Syracuse, New York, United States
Publicly documenting research impact using professional, academic, and social networks has become an increasingly important component of the research life cycle. At Syracuse University Libraries, STEM Librarians assist researchers in developing and managing their online portfolios. Tools like figshare, GitHub, SlideShare, ResearchGate, Google Scholar, and more can be used in building one’s online reputation. From data to peer-reviewed journal articles, teaching researchers how to best promote their work will highlight their accomplishments and create opportunities for researcher and librarian interactions. This presentation will give an overview of networking tools and include descriptions of recommended services and outreach strategies. Attendees will learn the best tools and resources for managing their professional reputation and for helping researchers to do the same.

2:50pm-3:05pm Intermission
3:05pm-3:30pm CINF 150: Anatomy of the chemistry research enterprise in the academic sector: Serving the underserved in a large research institution
Leah McEwen

Clark Library, Cornell University, Ithaca, New York, United States
The Research Life Cycle (RLC) at any research institution involves a myriad of scientific and technical support roles, including instrumentation, data management, information access, and environmental health and safety. Researchers engage with many of these services, and these providers in turn liaise across numerous disciplines and departments. All of these functions involve the use of technical information for analysis, interpretation and documentation. In supporting these other research support groups, libraries contribute more fully to the RLC and engage more broadly across the research community. This talk will outline outreach services developed for a variety of service groups on an academic university campus, including chemical analysis labs, chemistry IT services, Environmental Health & Safety and Occupational Medicine.

3:30pm-3:55pm CINF 151: Safety use case for chemical safety information
Ralph Stuart

Dept of Env Hlth Safety, Keene State College, Keene, New Hampshire, United States
Since 2010, increasing interest in chemical safety in general and laboratory safety in particular has led to the development of new tools for risk assessment of chemical use in the laboratory. In 2015, the NFPA issued a new standard for chemical safety in the teaching setting. This presentation will describe how these tools can be used to support prudent planning of laboratory research and teaching. The safety professional's and librarian's role in using these tools will be described and sources of chemical safety information highlighted.


3:55pm-4:20pm CINF 152: PubChem BioAssay: Grow with the community
Yanli Wang

Building 38a, Room 5s506, Bethesda, Maryland, United States
The PubChem BioAssay repository was set up in 2004 by the National Center for Biotechnology Information (NCBI). While initially serving as an archival information system for small molecule bioactivity data from HTS experiments, the BioAssay database was further developed over the years to support depositions of small molecule and RNAi research results that are associated with publications. The data content in PubChem BioAssay is contributed by world-wide screening facilities, research laboratories, as well as literature curation projects. The database has now received over 1,000,000 bioassay record submissions (BioAssay accession, AID), containing 200 million bioactivity outcomes against tens of thousands of protein and gene targets. Created to meet the community’s need for data sharing, PubChem has been supported, driven, and stimulated through more than a decade of tireless development by the participation and enthusiasm of the community. This presentation will describe the efforts of the community and PubChem in working together to support and advocate data sharing and open access. Additional retrieval and data analysis tools are available, and bioassay data may be submitted using the PubChem Upload tool. PubChem provides an embargo mechanism to assist data deposition associated with manuscript submission.

4:20pm-4:40pm Discussion
4:40pm-4:45pm Concluding Remarks

CINF: Chemistry, Data & the Semantic Web: An Important Triple to Advance Science 8:15am - 11:55am
Thursday, March 17
Room 25B - San Diego Convention Center
Evan Bolton, Stuart Chalk, Organizing
Evan Bolton, Stuart Chalk, Presiding
8:15am-8:20am Introductory Remarks
8:20am-8:45am CINF 153: Linking chemical and non-chemical data in structured product labeling
Yulia Borodina, Bill Hess, CoCo Tsai, Pete Phong, Lonnie Smith

FDA, Catonsville, Maryland, United States
Structured Product Labeling (SPL) is a document markup standard approved by Health Level Seven (HL7) and adopted by FDA as a mechanism for exchanging product and facility information. Product information provided by companies in SPL format may be accessed from the FDA Online Label Repository and the National Library of Medicine DailyMed web site. The product information indexing initiative has the goal of enhancing access to the electronic product information provided by the companies. Indexing refers to the creation by FDA of one or more files with machine-readable annotations that can be linked to the product SPL provided by the company. FDA maintains and publishes SPL Indexing Files for Pharmacologic Class, Substance, Product Concept, Biological Drug Substance, and Billing Units. Data from the Indexing Files can be linked to data in both SPL resources and external resources via chemical and non-chemical identifiers.

8:45am-9:10am CINF 154: Ginas: A global effort to define and index substances in medical products
Tyler Peryea1, Lawrence Callahan2

1 Informatics, NIH NCATS, North Bethesda, Maryland, United States; 2 FDA, Silver Spring, Maryland, United States
Chemical databases have a rich history in the recent past. The development of systematic nomenclature, chemical data formats, and identity standards has allowed chemical data to become increasingly definable and searchable. However, the scope of definitional chemical databases, to date, has remained largely limited to small well-defined organic molecules. Large molecules and complex poly-disperse substances are often neglected, under-standardized, or entirely ignored. Historically, the scope of chemical databases has been slow to expand into other substance classes due to a lack of available standards, a lack of software tools, and a lack of motivating use cases for deeply describing such materials. The increasing need to track and monitor medical products of all forms on the global marketplace, however, has motivated the creation of the ISO IDMP standards, with ISO 11238 describing a strategy for encoding substance information of diverse forms and origins: from simple chemicals, to complex polymers, extending even to plant and animal material. ginas is a global effort to implement the ISO IDMP substance standard with useful and distributable software tools, with the aim of facilitating the global interchange of well-defined substance information from 'lithium' to 'leeches'.

9:10am-9:35am CINF 155: TranSMART Foundation: An open-data and open-science platform to integrate molecular and clinical data in translational research and precision medicine
Rudolph Potenzone

tranSMART Foundation, Redmond, Washington, United States
The tranSMART Foundation is a not-for-profit organization that fosters the evolution of the open source tranSMART Platform in support of translational research. With active research in over 100 labs worldwide, the Platform is used daily by scientists in industry, research foundations, academic labs and medical schools. Molecular data, genomics information and proteomics experiments are stored together with anonymized patient data, outcomes, time series and wearable sensor data, and our system allows for subsetting, querying and routine analyses. Advanced and complex analytics and visualization are possible through our API, and many examples of interconnectivity are available, such as with R, Spotfire and Genome browser.

In this talk, we will also cover an interesting approach to advancing our understanding of disease and of diagnostic and treatment options. We have held our first Datathon and others are in planning. A Datathon brings together multiple data repositories, often ones that have never been used in concert previously, within the tranSMART Platform. Key analytical tools and extensions that could be suited to the particular topic of the Datathon are also gathered into a single platform instance. Finally, a team of human experts is assembled that includes data scientists and machine learning practitioners along with experienced researchers from the particular disease area of interest. Over the course of three days, this team works with the Platform and the assembled data to attempt to learn new relationships and form novel hypotheses that can form the basis of future research efforts. Results from these Datathon sessions will be shared.

9:35am-10:00am CINF 156: Leveraging RxNorm and drug classifications for analyzing prescription datasets
Olivier Bodenreider

Lister Hill National Center for Biomedical Communications, National Library of Medicine, Bethesda, Maryland, United States
Prescription datasets (e.g., claims data obtained from Medicare Part D) represent a rich source of information for studying frequencies of prescription and co-prescription (i.e., concomitant medications). We demonstrate that RxNorm supports the conversion of various kinds of identifiers for clinical drugs (e.g., National Drug Code and First DataBank) to RxCUIs, the identifiers required for exchanging drug information as part of the Meaningful Use incentive program. Moreover, drug classes provide a convenient way of analyzing prescription datasets at a higher level (e.g., by aggregating specific medications, such as Lipitor 10 MG oral tablet, into the class statins). RxNorm is well integrated with many drug classification systems, such as the Anatomical Therapeutic Chemical (ATC) classes, and contributes to the class-level analysis of prescription datasets.
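
The two steps described above (normalizing source identifiers, then aggregating by drug class) can be sketched in Python as follows. This is only an illustration under stated assumptions: the lookup tables, the placeholder identifiers, and the function `class_counts` are hypothetical, and real mappings would be obtained from RxNorm and a classification system such as ATC rather than hard-coded.

```python
from collections import Counter

# Hypothetical lookup tables with placeholder identifiers (not real codes).
SOURCE_ID_TO_RXCUI = {
    "ndc-example-1": "rxcui-example-A",
    "ndc-example-2": "rxcui-example-B",
    "ndc-example-3": "rxcui-example-C",
}
RXCUI_TO_CLASS = {
    "rxcui-example-A": "statins",
    "rxcui-example-B": "statins",
    "rxcui-example-C": "beta blockers",
}

def class_counts(prescriptions):
    """Count prescriptions per drug class after normalizing identifiers."""
    counts = Counter()
    for source_id in prescriptions:
        # Step 1: convert the source identifier to a normalized drug identifier.
        rxcui = SOURCE_ID_TO_RXCUI.get(source_id)
        # Step 2: roll the specific drug up to its class; unmapped drugs
        # are kept in a separate bucket rather than silently dropped.
        drug_class = RXCUI_TO_CLASS.get(rxcui, "unclassified")
        counts[drug_class] += 1
    return counts

claims = ["ndc-example-1", "ndc-example-2", "ndc-example-3", "ndc-unknown"]
print(class_counts(claims))
```

Keeping an explicit "unclassified" bucket is a design choice of this sketch; it makes gaps in the identifier mapping visible in the class-level totals.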

10:00am-10:15am Intermission
10:15am-10:40am CINF 157: Evolution of digital and semantic chemistry at Southampton
Jeremy Frey1, Simon Coles2, Colin Bird1

1 University of Southampton, Southampton, United Kingdom; 2 University of Southampton, Hampshire, United Kingdom
We take a historical view of e-Science and e-Research (alternatively called Cyber-Infrastructure) developments within the range of Chemical Sciences at the University of Southampton (UK). We discuss the development of several stages of the evolving data ecosystem as Chemistry moves into the digital age of the 21st Century. We cover our research on aspects of the representation of chemical information in the context of the world wide web (WWW) and its semantic enhancement (the Semantic Web) and illustrate this with the example of the representation of quantities and units within the Semantic Web. We explore the changing nature of laboratories as computing power becomes increasingly powerful and pervasive and specifically look at the function and role of electronic or digital research notebooks. Having focussed on the creation of chemical data and information in context, we highlight the use and reuse of this data as facilitated by the features provided by digital repositories and their importance in facilitating the exchange of chemical information, touching on the issues of open and/or intelligent access to the data.

10:40am-11:05am CINF 158: Implementing chemistry platform for OpenPHACTS: Lessons learned
Colin Batchelor, Alexey Pshenichnov, Jon Steele, Valery Tkachenko,

Royal Society of Chemistry, Rockville, Maryland, United States
The Open PHACTS project delivers an online platform integrating a wide variety of data from across chemistry and the life sciences and an ecosystem of tools and services to query this data in support of pharmacological research, turning the semantic web from a research project into something that can be used by practising medicinal chemists in both academia and industry. In the summer of 2015 it was the first winner of the European Linked Data Award. At the Royal Society of Chemistry we have provided the chemical underpinnings to this system and in this talk we review its development over the past five years. We cover both our early work on semantic modelling of chemistry data for the Open PHACTS triplestore and more recent work building an all-purpose data platform, for which the Open PHACTS data has been an important test case. We discuss what has worked well, what is missing and where this is likely to go in future.

11:05am-11:30am CINF 159: Representation of molecular structures and related computations on the semantic web: A universal data model and its ontology
Mirek Sopek2,, Stuart Chalk1, Neil Ostlund2, Jacob Bloom2

1 Department of Chemistry, University of North Florida, Jacksonville, Florida, United States; 2 Chemical Semantics, Inc., Gainesville, Florida, United States
Chemical Semantics, Inc. is a company with a mission of bringing the semantic web to computational chemistry, with future goals covering chemical results from other areas. The company has built a universal portal that enables computational chemists to publish results of their computations on semantic web servers (powered by semantic triple stores) holding RDF data.

This presentation will report on the definition, implementation and evaluation of a new data model based on semantic web standards. This new model further exploits the RDF data model for efficient encoding of molecular structures and basic results of computational chemistry experiments. Various serialization methods were tested, including Turtle and JSON-LD.

The model is exceptionally flexible and allows for various types of chemical structure representation (e.g. Cartesian, fractional or based on Z-matrix). It enables the encoding of various structural units such as groups and residues for polymers and biopolymers. Efficient encoding of bonding information enables fast substructure searches using standard tools like SPARQL and application in the domain of cheminformatics. The model offers maximum possible flexibility, allowing users to add their own data without destroying the readability of the core elements.
The model enables practitioners to interact with the data in a much more flexible way using a variety of current programming tools and languages.
The most important aspect of the model is its fully semantic character, i.e. encoding the meaning of the data in the data itself through reference to the new edition of the Gainesville Core ontology [1].

[1] Neil Ostlund, Mirek Sopek, Proceedings of the 6th International Workshop on Semantic Web Applications and Tools for Life Sciences, Edinburgh, UK, December 10, 2013.
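To make the idea of bond-level triples and pattern-based substructure search concrete, the following sketch stores a small C-O-H fragment as subject-predicate-object triples and matches a bonded pair of elements. The `ex:` terms are hypothetical and are not taken from the Gainesville Core ontology, and the plain-Python matcher only imitates what a SPARQL basic graph pattern would do on a real triple store.

```python
# Hypothetical vocabulary: "ex:element" and "ex:bondedTo" are illustrative
# names, not terms from the Gainesville Core ontology.
TRIPLES = {
    ("ex:a1", "ex:element", "C"),
    ("ex:a2", "ex:element", "O"),
    ("ex:a3", "ex:element", "H"),
    ("ex:a1", "ex:bondedTo", "ex:a2"),
    ("ex:a2", "ex:bondedTo", "ex:a3"),
}


def find_bonded_pairs(triples, elem_a, elem_b):
    """Return sorted (atom_a, atom_b) pairs where an elem_a atom is bonded to an elem_b atom.

    Roughly analogous to the SPARQL pattern:
        ?a ex:element "C" . ?a ex:bondedTo ?b . ?b ex:element "O" .
    except that bonds are treated as undirected here.
    """
    elements = {s: o for s, p, o in triples if p == "ex:element"}
    hits = []
    for s, p, o in triples:
        if p != "ex:bondedTo":
            continue
        # Check both orientations because a bond has no inherent direction.
        for x, y in ((s, o), (o, s)):
            if elements.get(x) == elem_a and elements.get(y) == elem_b:
                hits.append((x, y))
    return sorted(hits)
```

Because user-added triples simply use other predicates, they are ignored by this query, which mirrors the claim that extensions need not disturb the core elements.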

11:30am-11:55am CINF 160: GlyTouCan international glycan structure repository using semantic web technologies
Issaku Yamada1,, Kiyoko Aoki-Kinoshita2,3, Nobuyuki Aoki2, Daisuke Shinmachi2, Masaaki Matsubara1, Akihiro Fujita2, Shinichiro Tsuchiya2, Shujiro Okuda4, Noriaki Fujita3, Hisashi Narimatsu3

1 The Noguchi Institute, Tokyo, Japan; 2 Graduate School of Engineering, Soka University, Tokyo, Japan; 3 Research Center for Medical Glycoscience, AIST, Tsukuba, Japan; 4 Graduate School of Medical and Dental Sciences, Niigata University, Niigata, Japan
Glycans are known as the third major class of biomolecules, next to DNA and proteins, and they have been found to be involved in various important biological functions. The structure of glycans, however, differs greatly from that of DNA and proteins in that they are branched, as opposed to linear sequences of amino acids or nucleotides. Therefore, the storage of glycan information in databases, let alone their curation, has been a difficult problem.
This has made efforts to integrate glycan data between different databases difficult, making an international repository for glycan structures, where unique accession numbers are assigned to every identified glycan structure, necessary. As such, an international team of developers and glycobiologists have collaborated to develop this repository, called GlyTouCan, which has been released this year and is freely available at, to provide a centralized resource for depositing glycan structures, compositions and topologies, and to retrieve accession numbers for each of these registered entries.
GlyTouCan has been developed based on Semantic Web technologies, providing links to other major glycan databases such as GlycomeDB and BCSDB, using RDF. The RDF data of linked resources in GlyTouCan use GlycoRDF, an ontology to represent glycomics data. Moreover, the glycan structure representation called WURCS is used as the main format for storing glycans, thus ensuring uniqueness of even ambiguous glycan structures while representing them as linear strings. This allows for efficient searching of the repository for existing structures because a simple text comparison can be used. In addition, an enhancement of WURCS as an RDF representation allows a glycan structure to be searched using a SPARQL query.
As a result, GlyTouCan enables researchers to reference glycan structures simply by accession number, as opposed to by chemical structure or text string, which has been a burden when integrating glycomics databases in the past. Moreover, GlyTouCan is being supported by the MIRAGE initiative, which recommends that its accession numbers be used when reporting glycomics experiments in publications that include identified glycan structures. This will also allow easier identification of glycan structures in publications.
Thus, in the future, not only can GlyTouCan serve as a central registry, but it can serve as a portal to search for glycan-related publications as well as other biological information.
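The point that a unique linear string reduces duplicate detection to a text comparison can be illustrated with a toy registry. Everything here is hypothetical: the `Registry` class, the `ACC0001`-style accession format and the placeholder strings are illustrative and do not reflect GlyTouCan's actual accession scheme or real WURCS strings.

```python
class Registry:
    """Toy structure registry keyed on a canonical linear string."""

    def __init__(self):
        self._by_string = {}

    def register(self, linear_string):
        """Return the accession for a structure, assigning a new one if unseen.

        Assumes the input is already a canonical (unique) linear encoding,
        so exact string equality is enough to detect an existing entry.
        """
        if linear_string not in self._by_string:
            # Hypothetical accession format: a running counter with a prefix.
            self._by_string[linear_string] = f"ACC{len(self._by_string) + 1:04d}"
        return self._by_string[linear_string]
```

The dictionary lookup is the "simple text comparison": no graph matching is needed as long as each structure has exactly one canonical string.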

CINF: General Papers 9:00am - 11:50am
Thursday, March 17
Room 24C - San Diego Convention Center
Elsa Alvaro, Erin Davis, Organizing
Elsa Alvaro, Erin Davis, Presiding
9:00am-9:05am Introductory Remarks
9:05am-9:35am CINF 161: Progress toward a conformational database for sesquiterpene reaction pathways
Jordan Zehr2,, Dean Tantillo1, Christian Hamann3,

1 Dept Chemistry, UC Davis, Davis, California, United States; 2 Chemistry & Biochemistry, Albright College, Reading, Pennsylvania, United States
The transformation of the bisabolyl cation into a range of sesquiterpene natural products has been described in the literature (Hong, YJ; Tantillo, DJ. J. Am. Chem. Soc. 2014, 136, 2450−2463). Hong and Tantillo proposed unifying mechanistic pathways by which the monocyclic bisabolyl cation is converted into mono-, di- and tricyclic molecules containing an array of interesting structural features including 3-to-7-membered rings, fused rings, spiro centers, geometric and stereoisomers, and conjugated dienes. The sesquiterpene products of these pathways include barbatene, bazzanene, chamigrene, chamipinene, cumacrene, cuprenene, dunniene, isobazzanene, iso-γ-bisabolene, isochamigrene, laurene, microbiotene, sesquithujene, sesquisabinene, thujopsene, trichodiene, and widdradiene. Now that the chemical steps for the pathways leading to these products have been established, we are focused on establishing a conformational library, in database format, of sesquiterpene carbocation intermediates and products. We propose that analysis of this database will provide insight into the detailed stereoelectronic requirements of these chemically complex carbocation cascades.

9:35am-10:05am CINF 162: OMPOL: Visualization of large chemical spaces
Peter Corbett, Colin Batchelor, Alexey Pshenichnov, Valery Tkachenko,

Royal Society of Chemistry, Rockville, Maryland, United States
In the last few years the number and size of chemical databases have been steadily increasing, as has the complexity of the information residing in those databases, creating truly multidimensional chemical spaces. Yet the most common user interface approach is still based on a search-and-browse workflow, which essentially prevents proper navigation through such databases and hides data patterns that may belong to other dimensions. As we at the Royal Society of Chemistry are building a chemical database service, it is potentially useful to be able to visualize large chemical spaces, ranging in size from tens of thousands to tens of millions of compounds. Dimensionality reduction techniques such as PCA have been used to produce two-dimensional displays of large chemical spaces, via the production of scatterplots. Standard chart-plotting libraries allow interactive scatterplots to be produced, but do not scale well to large numbers of data points. Our new visualisation tool, OMPOL, is a browser-based tool for displaying and interacting with these data sets, allowing people to smoothly and responsively pan and zoom these plots, view the names and structures associated with the data points, select regions of chemical space and find typical and atypical members of those regions.

10:05am-10:35am CINF 163: Comparison of machine learning algorithms for the prediction of critical values and acentric factors for pure compounds
Wendy Carande,, Andrei Kazakov, Kenneth Kroenlein

NIST, Boulder, Colorado, United States
Speed and accuracy are primary factors to consider when choosing a machine learning algorithm for prediction of thermophysical properties. Individually, swift computational methods often incur large deviations between predicted and experimental values, but ensemble methods can make up for this shortcoming. We propose a boosting method in which multiple “weak learners” are combined to create a stronger predictive algorithm, and we present predictions for critical temperature, critical pressure, and acentric factor. Our training set for a given compound consists of the 15 most structurally similar (as determined by the Tanimoto metric) compounds for which we have experimental data. 19 predictive models, each with automated feature selection, are combined to construct our ensemble. These methods include multivariate adaptive regression spline models, linear models (using ridge regression, lasso, elastic net, and partial least squares strategies), rule-based model trees with nearest-neighbor corrections, and single-variable quadratic models. The median of the prediction pool provides a property estimate for the target compound and the median absolute deviation (MAD) of the predictions provides an uncertainty measure. We find that combining these methods performs favorably against any individual method in the prediction algorithm pool.
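The neighbor selection and pooling steps described above can be sketched as follows. This is a minimal illustration under stated assumptions: fingerprints are represented as plain Python sets of "on" bits, the function names are hypothetical, and the individual predictive models are not implemented; only the Tanimoto ranking and the median/MAD combination are shown.

```python
from statistics import median


def tanimoto(a, b):
    """Tanimoto similarity between two fingerprints given as sets of on-bits."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def nearest_neighbors(target, library, k=15):
    """Names of the k library compounds most similar to the target fingerprint.

    `library` maps compound name -> fingerprint set (hypothetical layout).
    """
    ranked = sorted(library, key=lambda name: tanimoto(target, library[name]),
                    reverse=True)
    return ranked[:k]


def ensemble_estimate(predictions):
    """Combine a pool of model predictions into (estimate, uncertainty).

    The estimate is the median of the pool; the uncertainty is the median
    absolute deviation (MAD) of the predictions about that median.
    """
    center = median(predictions)
    mad = median([abs(p - center) for p in predictions])
    return center, mad
```

For example, a pool of critical-temperature predictions `[500.0, 510.0, 505.0, 650.0, 495.0]` gives an estimate of 505.0 with a MAD of 5.0, so the single outlying model at 650.0 has no influence on either number.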

10:35am-10:50am Intermission
10:50am-11:20am CINF 164: Optimal superposition of arbitrarily ordered molecules using the Kuhn-Munkres algorithm
Berhane Temelso1,, Joel Mabey1, Toshiro Kubota3, George Shields2

1 701 Moore Avenue, Bucknell University, Lewisburg, Pennsylvania, United States; 2 Deans Office, 113 Marts Hall, Bucknell University, Lewisburg, Pennsylvania, United States; 3 Mathematical Sciences, Susquehanna University, Selinsgrove, Pennsylvania, United States
When assessing the similarity between two isomers whose atoms are ordered identically, one typically translates and rotates their Cartesian coordinates for best alignment and computes the pairwise root mean square distance (RMSD). However, if the atoms are ordered differently, it is necessary to find the best ordering of the atoms and check for chirality before calculating a meaningful pairwise RMSD. The exponential scaling of the computational cost of finding the best ordering makes it too expensive for any system with more than ten atoms. We report the use of the Kuhn-Munkres matching algorithm to reduce the cost of finding the best ordering from exponential to polynomial scaling. This allows the scheme to be applied to any arbitrary system in a reasonably short time. The implementation of this approach and its application to systems ranging from molecular clusters to large peptides will be demonstrated.
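The reordering step can be illustrated with a compact pure-Python version of the Kuhn-Munkres (Hungarian) algorithm applied to a matrix of squared interatomic distances. This is a sketch, not the authors' implementation: it assumes the two structures are already superimposed, ignores element types and chirality checks, and all function names are hypothetical.

```python
import math


def kuhn_munkres(cost):
    """Minimum-cost assignment for a square cost matrix, O(n^3).

    Returns a list where entry i is the column assigned to row i.
    """
    n = len(cost)
    inf = float("inf")
    u = [0.0] * (n + 1)      # row potentials
    v = [0.0] * (n + 1)      # column potentials
    p = [0] * (n + 1)        # p[j]: row matched to column j (1-based, 0 = free)
    way = [0] * (n + 1)      # back-pointers along the augmenting path
    for i in range(1, n + 1):
        p[0] = i
        j0 = 0
        minv = [inf] * (n + 1)
        used = [False] * (n + 1)
        while True:
            used[j0] = True
            i0, delta, j1 = p[j0], inf, 0
            for j in range(1, n + 1):
                if not used[j]:
                    cur = cost[i0 - 1][j - 1] - u[i0] - v[j]
                    if cur < minv[j]:
                        minv[j], way[j] = cur, j0
                    if minv[j] < delta:
                        delta, j1 = minv[j], j
            for j in range(n + 1):
                if used[j]:
                    u[p[j]] += delta
                    v[j] -= delta
                else:
                    minv[j] -= delta
            j0 = j1
            if p[j0] == 0:
                break
        while j0:            # flip matches along the augmenting path
            j1 = way[j0]
            p[j0] = p[j1]
            j0 = j1
    assignment = [0] * n
    for j in range(1, n + 1):
        assignment[p[j] - 1] = j - 1
    return assignment


def reordered_rmsd(a, b):
    """Match each atom of `a` to an atom of `b`, then compute the RMSD."""
    cost = [[sum((x - y) ** 2 for x, y in zip(pa, pb)) for pb in b] for pa in a]
    match = kuhn_munkres(cost)
    total = sum(cost[i][j] for i, j in enumerate(match))
    return match, math.sqrt(total / len(a))
```

In a fuller treatment the cost matrix would be restricted so that only atoms of the same element can be matched, and the assignment would be alternated with re-alignment of the coordinates.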

11:20am-11:50am CINF 165: Predicting drug-induced hepatic systems' toxicity by integrating transporter interaction profiles

Eleni Kotsampasakou,, Gerhard Ecker

Department of Pharmaceutical Chemistry, University of Vienna, Vienna, Austria
Systems pharmacology studies that utilize large data sets, such as protein–protein interaction networks and the FDA adverse event reports, can enhance the understanding of drug adverse events and pinpoint off-targets [1]. In this context, drug-induced liver injury (DILI) is a major challenge for drug development, as it comprises one of the main causes of attrition [2]. There are several indications in literature associating hepatic transporter inhibition with manifestations of DILI, such as OATP1B1 and 1B3 with hyperbilirubinemia [3] and BSEP with cholestasis [4].
To this end, we developed statistical classification models for predicting hepatotoxic endpoints, such as hyperbilirubinemia, cholestasis and DILI, by combining the physicochemical and structural properties of compounds with their hepatic transporter inhibition profiles. For the latter task, we used our in-house transporter inhibition models for BSEP, P-glycoprotein, BCRP, OATP1B1 and OATP1B3. Several meta- and base-classifiers were investigated, and the classification models obtained showed reasonable performance for all three endpoints.
For the case of hyperbilirubinemia, OATP1B1 and 1B3 inhibition profiles are evaluated as important descriptors, even though there is no significant improvement of the statistical performance of the model when using transporters’ information. In contrast, for cholestasis the use of transporter inhibition profiles significantly improves the model’s performance, although the individual transporters are not ranked high in comparison to other physicochemical descriptors. Finally for general DILI prediction, descriptors annotating transporter inhibition do not influence the model’s performance. In addition, their importance is low compared to other physicochemical descriptors, such as lipophilicity. This suggests that for the entire liver system, there is no clear association pattern with transporters - at least not for the particular ones investigated.

The research leading to these results has received support from the Innovative Medicines Initiative Joint Undertaking under grant agreement n°115002 (eTOX), as well as from the Austrian Science Fund, grant F3502.

1. Berger, S.I. et al., Interdiscip Rev Syst Biol Med 2011, 3, (2), 129–135
2. O’Brien, P.J. et al., Arch Toxicol 2006, 80, 580–604
3. Chang, J.H. et al., Mol Pharm 2013, 10, (8), 3067–75
4. Vinken, M. et al., Toxicol Sci 2013, 136, (1), 97–106

CINF: Chemistry, Data & the Semantic Web: An Important Triple to Advance Science 1:30pm - 4:20pm
Thursday, March 17
Room 25B - San Diego Convention Center
Evan Bolton, Stuart Chalk, Organizing
Evan Bolton, Stuart Chalk, Presiding
1:30pm-1:35pm Introductory Remarks
1:35pm-2:00pm CINF 166: Ontology for biomedical investigations (OBI)
Bjoern Peters,, James Overton, Randi Vita, OBI consortium

Division of Vaccine Discovery, La Jolla Institute for Allergy & Immunology, La Jolla, California, United States
The Ontology for Biomedical Investigations (OBI) provides terms with precisely defined meaning to describe all aspects of how biomedical investigations are conducted. OBI re-uses ontologies that provide a representation of biomedical knowledge from the Open Biological and Biomedical Ontologies (OBO) project and adds the ability to describe how this knowledge was derived. OBI covers all phases of the investigation process, such as planning, execution and reporting. It represents information and material entities that participate in these processes, as well as roles and functions. Prior to OBI, it was not possible to use a single internally consistent resource that could be applied to multiple types of experiments for these applications. OBI has made this possible by creating terms for entities involved in biological and medical investigations and by importing parts of other biomedical ontologies such as GO, ChEBI and PATO without altering their meaning. OBI is being used in a wide range of projects covering genomics, multi-omics, immunology, and catalogs of services. The OBI project is an open cross-disciplinary collaborative effort, encompassing multiple research communities from around the globe. The OBI Consortium maintains a web resource ( providing details on the people, policies, and issues being addressed in association with OBI. The current release of OBI is available at

2:00pm-2:25pm CINF 167: Protein ontology: Fostering connections in chemical biology
Darren Natale1,2,

1 Georgetown University Medical Center, Washington, District of Columbia, United States; 2 PRO Consortium, Washington, District of Columbia, United States
Our understanding of The Way Things Work advances when we are able to make connections between individual observations and the entities such observations are about. Such understanding is especially facilitated when we have the ability to say precisely what is known, without overstating or understating, about precisely what entity. With this notion in mind, the Protein Ontology (PRO) was developed to provide protein entity representation at key levels of abstraction, ranging from general protein families down to specific protein PTM forms. Here, we describe PRO, its development, and its place in the network of knowledge using examples from the fields of proteomics, glycobiology, and pharmacology.

2:25pm-2:50pm CINF 168: Ontologies for classifying and modeling drug discovery data
Stephan Schuerer1,3,, Asiyah Yu Lin1, Saurabh Mehta1, Hande Kücük McGinty2, Qiong Cheng3, Amar Koleti3, Nooshin Zadeh1, Dusica Vidovic1,3

1 Pharmacology, University of Miami, Miami, Florida, United States; 2 Computer Science, University of Miami, Miami, Florida, United States; 3 Center for Computational Science, University of Miami, Miami, Florida, United States
Several research consortia and countless projects in pharmaceutical companies generate, organize, and analyze small molecule drug screening data. Such consortia supported by the NIH Common Fund include the (now past) Molecular Libraries Program (MLP), and currently the Illuminating the Druggable Genome (IDG) and the Library of Integrated Network-based Cellular Signatures (LINCS) projects. A large component of the MLP program was the development of chemical probes to study a wide variety of biological questions. This program generated new assay technologies, huge amounts of chemical biology screening data and over 350 chemical probes. The observation of an apparent strong bias of drug discovery research and development efforts towards targets that are already well studied motivated the IDG program to prioritize novel drug targets and catalyze the development of chemical entities that target understudied proteins in these families. The LINCS program has a systems biology focus. The project creates a reference 'library' of molecular signatures, such as changes in gene expression and other cellular phenotypes that occur when cells are exposed to a variety of perturbing agents, and computational tools for data integration, access, and analysis. Dimensions of LINCS signatures include the biological model system (cell type), the perturbation (e.g. small molecules) and the assays that generate diverse phenotypic profiles.
Data integration is a common and critical challenge in these and other projects, and it requires common metadata standards and conventions for data representation and exchange. Towards the goal of creating common data standards to represent data in these and other projects that produce data relevant for drug discovery, and to support software tools that we and others have been building as part of these projects, we have been developing ontologies including the BioAssay Ontology (BAO) and the Drug Target Ontology (DTO). The goal of these ontologies is to enable the knowledge-based classification of diverse datasets into categories that facilitate re-use and context-specific integration of these data, for example to develop predictive models or to quickly explore and correlate different datasets.
BAO, DTO and other ontologies provide a robust framework to represent, integrate, model, and query diverse drug discovery data generated in different projects.

2:50pm-3:05pm Intermission
3:05pm-3:30pm CINF 169: Immune Epitope Database (IEDB) and its use of formal ontologies
Randi Vita,, James Overton, Bjoern Peters

Division of Vaccine Discovery, La Jolla Institute for Allergy & Immunology, La Jolla, California, United States
The Immune Epitope Database (IEDB) is a resource provided by the NIH/NIAID to make all published experimental data regarding immune epitopes freely available to the scientific public. Immune epitopes are the specific portions of a pathogen, allergen or autoantigen that are recognized by antibodies or T cells of the immune system. They are most often linear peptides, but can also be carbohydrates, lipids, metals, or other structures. The IEDB represents experimental assays demonstrating the binding of an epitope-specific adaptive immune receptor (TCR, antibody, or MHC molecule) to an antigen in a consistent and easily searchable manner by harnessing established biological ontologies for its data representation and creating new ontologies when needed. Formal ontologies provide standardized nomenclature, hierarchical relationships, and logical definitions. They also provide a simple mechanism to link disparate resources and allow sophisticated queries across these resources.

3:30pm-3:55pm CINF 170: PubChemRDF: Semantic annotation and search
Gang Fu1,, Evan Bolton2

1 NCBI, NIH, Rockville, Maryland, United States; 2 NCBI, NIH, Bethesda, Maryland, United States
PubChem is an open repository for chemical substance descriptions, biological activities and biomedical annotations. PubChem databases have been cross-referenced with other National Center for Biotechnology Information (NCBI) resources, such as PubMed, Gene, and Biosystems. Semantic Web standards offer a well-defined syntax for the formal representation of the PubChem knowledgebase, and Semantic Web technologies facilitate querying of and reasoning over PubChem data. The PubChemRDF project focused on the semantic annotation of PubChem databases, which was accomplished by using standardized ontologies that promise high compatibility and consistency with existing cheminformatics and bioinformatics resources. Semantic annotations may help PubChem data to be shared, reused, and analyzed across chemical, biological, and life science domains. PubChemRDF provides a new ability for researchers to use a schema-less database with a rule-based reasoner to search and analyze data. We will demonstrate how to combine SPARQL queries and Description Logic (DL) queries for question answering using PubChemRDF data.

3:55pm-4:20pm CINF 171: Generic scientific data model and ontology for representation of chemical data
Stuart Chalk,

Department of Chemistry, University of North Florida, Jacksonville, Florida, United States
The current movement toward openness and sharing of data is likely to have a profound effect on the speed of scientific research and the complexity of questions we can answer. However, a fundamental problem with currently available datasets (and their metadata) is heterogeneity in terms of implementation, organization, and representation.

To address this issue we have developed a generic scientific data model (SDM) to organize and annotate raw and processed data, and the associated metadata. This paper will present the current status of the SDM, implementation of the SDM in JSON-LD, and the associated scientific data model ontology (SDMO). Example usage of the SDM to store data from a variety of sources will be discussed, along with initial efforts to develop SPARQL queries, based on the SDMO, that allow federated search across different datasets.