Technical Program with Abstracts

ACS Chemical Information Division (CINF)
247th ACS National Meeting, Spring 2014
Dallas, TX (March 16-20, 2014)

CINF Symposia

E. Bolstad, Program Chair

[Created Fri Feb 14 2014, Subject to Change]

Sunday, March 16, 2014

Joint CINF-RSC CICAG Symposium: Chemical Schemas, Taxonomies and Ontologies - AM Session
Ontologies and Substances

Omni Dallas Hotel
Room: Deep Ellum A
Cosponsored by COMP, MEDI, ORGN, PHYS
Antony Williams, Jeremy Frey, Simon Coles, Leah McEwen, Organizers
Jeremy Frey, Presiding
8:30 am - 11:55 am
8:30 Introductory Remarks
8:45 1 Bridging worlds: Speaking multiple scientific languages

Jessica Peterson1, J.Peterson@elsevier.com, Pieder Caduff2, David Evans2, Juergen Swienty-Busch3. (1) Elsevier Inc, Philadelphia, United States, (2) Reed Elsevier Properties SA, Neuchâtel, Switzerland, (3) Elsevier Information Systems GmbH, Frankfurt, Germany

Cross-scientific communication and actionable information are among the biggest hurdles when research teams need to make critical decisions. Each scientific sub-discipline uses its own specific and idiosyncratic vocabularies and nomenclatures. Enabling effective interdisciplinary communication, whether between applications or within a single framework, requires extensive cross-mapping of taxonomies, data normalization, and harmonization of ontological relationships. Better decisions, fewer errors, and quicker results are what interdisciplinary teams can achieve when such methods and networks are put in place. We present our experiences leading up to interoperable functionality across two disparate research tools through small-molecule taxonomy mapping.

9:10 2 Ontology-driven information system for chemical and materials science

Nico Adams1, nico.adams@csiro.au, Murray Jensen1, Danielle Kennedy1, Cornelius Kloppers2, Yanfeng Shu2, Claire D'Este2, Craig Lindley2. (1) Materials Science and Engineering, CSIRO, Clayton, Victoria 3150, Australia, (2) Computational Informatics, CSIRO, Hobart, Tasmania 7001, Australia

Standard information systems for the chemical and materials sciences are almost entirely predicated on the notion of chemical structure and composition as the unique criterion of identity, which also encodes properties. However, the properties of many extended materials, such as polymers (including coordination polymers) are not comprehensible on the basis of their chemical structure and composition alone – rather they are often determined by the provenance and processing history of a material. The melting points of polymers, for example, can be significantly changed through a simple post-processing step such as compounding – this changes property-determining intermolecular interactions rather than fundamental chemistry and as such is not encoded by simple structure representations. This, in turn, means that chemical and process descriptions need to be interoperable and associable with chemical objects. Ontologies map particularly well onto the problem of developing integrated chemical and process representation of entities and can be used to develop complete provenance traces of materials. In this talk, we will discuss our approach to developing such representations and how they can be leveraged for multiple purposes in materials information systems.

9:35 3 Open ontologies and chemical characterization

Colin Batchelor1, Leah R McEwen2, lrm1@cornell.edu. (1) Royal Society of Chemistry, Cambridge, United Kingdom, (2) Cornell University, Ithaca, NY 14850, United States

Cheminformatics techniques afford us the opportunity to create tools and protocols that allow chemists to ask sophisticated questions of the expanding corpus of chemical literature and data and to return relevant, well-organized results. One of the most formulaic, and therefore most tractable, parts of the chemistry article is the characterization data for organic syntheses. We provide a worked example of how open ontologies from the RSC, EBI, and elsewhere can be used to describe characterization data for a given synthesis, and discuss advantages for elucidating and amassing reported data from the literature. We consider options and outstanding challenges in engaging bench chemists to extend this work to less formulaic areas of the literature, such as full reaction schemas and materials specifications, mining the primary literature in such a way as to suggest recognizable patterns to chemists.

10:00 Intermission
10:15 4 FDA terminology for substances

Yulia Borodina, yulia.borodina@fda.hhs.gov, Larry Callahan, Frank Switzer, Bill Hess, Randy Levin. Office of the Commissioner, FDA, Silver Spring, MD 20993, United States

FDA has started publishing a controlled terminology for substances used in medicinal products. The terminology supports types of substances described in the ISO IDMP 11238 standard. It is implemented as an extension of the Structured Product Labeling (SPL) markup standard and is internally known as “Substance Indexing.” Indexing refers to the creation by FDA of one or more files with machine-readable annotations that can be linked to the product SPL provided by the company. These machine-readable tags in SPL format allow the information to be easily incorporated, based on assigned codes, into electronic health records, e-prescribing systems, pharmacovigilance and clinical decision support systems for rapid searching, sorting of, and access to, relevant product information needed to make critical health care decisions and enhance patient care. A Substance Index File contains information about one substance concept. It includes the substance definition and the unique identifier (UNII) assigned by the FDA Substance Registration System. For most substances, the definition includes a representation of their molecular structure. Chemical substances and mixtures of chemical substances are defined by one or more chemical structures representing structural units or moieties that constitute the substance. Each chemical structure is accompanied by the IUPAC International Chemical Identifier (InChI). Definitions of proteins and nucleic acids include structural units represented using a single letter notation. These and other defining characteristics and their relationships will be discussed.

10:40 5 Pipeline for automated structure-based classification in the ChEBI ontology

Janna Hastings, hastings@ebi.ac.uk, Venkatesh Muthukrishnan, John W May, Gareth Owen, Christoph Steinbeck. Cheminformatics and Metabolism, European Bioinformatics Institute, Cambridge, Cambridgeshire CB10 1SD, United Kingdom

ChEBI is a database and ontology of chemical entities of biological interest. As of October 2013, it contains more than 35,500 entries, organised into structure-based and role-based classification hierarchies. Each entry is extensively annotated with a name, definition and synonyms, other metadata such as cross-references, and chemical structure information where appropriate. In addition to the classification hierarchy, the ontology also contains diverse chemical and ontological relationships. While ChEBI is primarily manually maintained, recent developments have focused on improvements in curation through partial automation of common tasks. We will describe a pipeline we have developed for structure-based classification of chemicals into the ChEBI structural classification. The pipeline connects class-level structural knowledge encoded in Web Ontology Language (OWL) axioms as an extension to the ontology, and structural information specified in standard representations. We make use of the Chemistry Development Kit, the OWL API and the OWLTools library. Harnessing the pipeline, we are able to suggest the best structural classes for the classification of novel structures within the ChEBI ontology.
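The classification step described above can be caricatured in a few lines. This is a toy sketch only, not the ChEBI pipeline: the real system uses the Chemistry Development Kit, the OWL API, and class-level OWL axioms, whereas here the class criteria are hypothetical predicates over SMILES strings, checked from most to least specific.

```python
# Toy structure-based classification: assign a structure to the most specific
# matching class, where each class is defined by a structural criterion.
# The criteria below are naive SMILES substring tests, for illustration only.
CLASS_CRITERIA = [
    # (class name, parent class, predicate over a SMILES string) -- hypothetical
    ("carboxylic acid",  "organic acid",     lambda s: "C(=O)O" in s),
    ("organic acid",     "organic molecule", lambda s: "O" in s),
    ("organic molecule", "chemical entity",  lambda s: "C" in s or "c" in s),
]

def classify(smiles):
    """Return the first (most specific) class whose criterion matches."""
    for name, parent, pred in CLASS_CRITERIA:
        if pred(smiles):
            return name, parent
    return "chemical entity", None

print(classify("CC(=O)O"))  # acetic acid
```

A real pipeline would replace the substring predicates with proper substructure (SMARTS) matching and reason over the ontology to place the class suggestion in the hierarchy.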

11:05 6 Accessing Open PHACTS: Interactive exploration of compounds and targets from the semantic web

Katrin Stierand1, stierand@zbh.uni-hamburg.de, Tim Harder3, Lothar Wissler2, Christian Lemmen2, Matthias Rarey1. (1) ZBH Center for Bioinformatics, University of Hamburg, Hamburg, 20146 Hamburg, Germany, (2) BioSolveIT GmbH, St. Augustin, Germany, (3) Philips Medical Systems DMC GmbH, Hamburg, Germany

Pharmacological research is hampered by scattered data that must be retrieved by varying methods and in different data formats. This heterogeneity increases research costs and limits throughput. Over the last two years, the Open PHACTS Discovery Platform has been developed as a centralized repository, integrating pharmacological data from a variety of information resources and providing tools and services to query these integrated data in pharmacological research. Here, we present the ChemBioNavigator (CBN), a web application for navigating the Open PHACTS chem-bio space with a focus on small molecules and their targets. CBN comprises a large visualization area with different view modes and two information panels, allowing deeper insight into information for compounds and targets. It allows interactive exploration of compound sets through sorting and subset selection, as well as extension of sets by substructure or similarity search. The relation between compounds and targets is defined by assay data from the Discovery Platform. Each compound and each target is annotated with information from multiple data sources, provided together with the provenance of each data point. In this contribution we outline the CBN technology and highlight the advantages of exploring the integrated data through CBN's intuitive interface.

11:30 7 Machine-processable representation and applications of the Globally Harmonized System

Mark I Borkum, m.i.borkum@soton.ac.uk, Department of Chemistry, University of Southampton, Southampton, Hampshire SO17 1BJ, United Kingdom

The Globally Harmonized System of Classification and Labelling of Chemicals (GHS) is an international standard, created by the United Nations, designed to supersede the many classification and labelling systems currently in use around the world. In the European Union, the GHS is implemented as the CLP Regulation. One of the barriers to the adoption of the GHS is the requirement for organisations to map the classification and labelling elements of their existing systems onto those of the GHS, a requirement that, from June 2015, will be mandatory for any organisation operating within the European Union. We present a software system for the semi-automatic extraction and enrichment of a machine-processable controlled vocabulary of classification and labelling elements from the information content of the CLP Regulation, along with an exemplar dataset of descriptions of substances and mixtures. As our software system uses Semantic Web technologies, the resulting data are highly amenable to further manipulation and integration with data from other sources.

Sunday, March 16, 2014

Joint CINF-CSA Trust Symposium: Energy Information Resources to Help Catalyze Your Research - AM Session

Omni Dallas Hotel
Room: Deep Ellum B

Grace Baysinger, Organizer
Grace Baysinger, Presiding
8:30 am - 11:55 am
8:30 Introductory Remarks
8:35 8 Trends in bio-based chemicals: Business intelligence from published literature

Steve M Watson, s.watson@elsevier.com, Alternative Energy, Elsevier, Amsterdam, North Holland 1043NX, The Netherlands

A new information solution, Elsevier Biofuel, has been used to review the literature landscape, revealing trends in R&D towards the production of valuable chemicals from biomass. Elsevier Biofuel comprises advanced search and analysis tools that use a domain-specific taxonomy to automatically classify over 21 million documents, many in full text, covering relevant journal publications, patents, technical reports, conference proceedings, and trade publications. The analysis highlights emerging technology areas, commercial opportunities, and the leading companies staking a claim to this space.

9:05 9 Chemistry databases and alerting services for finding the best energy research content

Serin Dabb1, dabbs@rsc.org, Richard Kidd2. (1) Royal Society of Chemistry, Cambridge, Cambridgeshire CB4 0WF, United Kingdom, (2) Royal Society of Chemistry, Cambridge, Cambridgeshire CB4 0WF, United Kingdom

Chemistry research underpins the pursuit of sustainable and renewable materials for energy generation and storage. Due to the interdisciplinary nature of energy research, tools are needed to provide relevant chemically related content to researchers in this field, whether they have a chemistry background or not. As in all fields of research, services which condense, validate, or filter information are all the more valuable given the sheer quantity of information available online. Current-literature updating services (such as Catalysts and Catalysed Reactions), highly curated information (such as The Merck Index Online), and information aggregators (such as ChemSpider) are all useful portals for energy researchers seeking chemical information. This talk will outline the different types of content and subject coverage of these databases and tools, and how they relate to researchers in the field of energy generation and storage.

9:35 10 Sustainable chemistry in the CAS databases

Cristian Dumitrescu1, cdumistrescu@cas.org, Roger Schenck2. (1) Department of Marketing, Chemical Abstracts Service, Columbus, OH 43202, United States, (2) Department of Marketing, Chemical Abstracts Service, Columbus, OH 43202, United States

Covering chemistry in its broadest sense, the CAS databases reflect research efforts in the area of chemistry for environmental sustainability. From greener syntheses of pharmaceutical candidates, to more efficient conversion of biomass, to new energy-storage materials, this presentation will show how to discover insights into these technologies. In the area of planning greener synthetic procedures, a focus will be on finding more environmentally friendly catalysts and solvents in CAS's collection of chemical reactions.

10:05 Intermission
10:20 11 Fading shades of gray? ACS Meeting preprints past, present, and future

David Flaxbart, flaxbart@austin.utexas.edu, Mallet Chemistry Library, University of Texas at Austin, Austin, TX 78713, United States

Meeting preprints from the energy-related divisions of ACS have been a part of the disciplinary literature since the 1930s. But the march of time seems to be leaving them behind, and even access to the legacy content is under threat from several quarters. This presentation will offer a quick history of this unique format of gray literature, the present state of affairs, and some advice for the future.

10:50 12 On the fly collection development to support emergent energy research initiatives

Donna T. Wrublewski, dtwrublewski@library.caltech.edu, George Porter, Dana Roth. Sherman Fairchild Library, California Institute of Technology, Pasadena, CA 91125, United States

Within the last few years, two major energy research initiatives have been undertaken on the Caltech campus: the Resnick Sustainability Institute (RSI) and the Joint Center for Artificial Photosynthesis (JCAP). RSI's research includes the production of electricity and fuels from renewable sources, the distribution and storage of energy, and other sustainability projects. JCAP's research is focused on achieving a system of artificial photosynthesis for utilizing solar energy. Both of these projects require resources from a broad range of fields, including chemistry, physics, biology, environmental & systems engineering, and more. This talk will give a brief description of some of the research being done in these projects, along with an overview of where energy research has traditionally been categorized in terms of collection development. A publication analysis will be presented to elucidate where information is being referenced from, where affiliated researchers are publishing, and how this correlates with the current collection holdings. Other considerations, such as identifying resources, outreach to groups about availability, and re-use of the collections, will also be discussed.

11:20 13 X marks the spot: Using xSearch for discovering energy information

Grace Baysinger, graceb@stanford.edu, Swain Chemistry & Chemical Engineering Library, Stanford University, Stanford, CA 94305-5081, United States

Online collections in academic research libraries are growing at a rapid pace. To help users discover relevant information in the "digital stacks", robust yet intuitive discovery tools are needed. Multidisciplinary researchers in areas such as energy and the environment need a broad array of resources to meet their information needs. xSearch is a locally customized federated search service that lets users search 250+ sources (databases, data & statistics, full-text books, full-text journals, grant & funding sources, government documents, images, streaming media, news, patents, reference materials, reports, and theses & dissertations). This presentation will cover energy-related resources selected for xSearch and will summarize search features available for finding relevant information.

11:50 Concluding Remarks

Sunday, March 16, 2014

Joint CINF-RSC CICAG Symposium: Chemical Schemas, Taxonomies and Ontologies - PM Session
Platforms and Processes

Omni Dallas Hotel
Room: Deep Ellum A
Cosponsored by COMP, MEDI, ORGN, PHYS
Leah McEwen, Antony Williams, Jeremy Frey, Simon Coles, Organizers
Jeremy Frey, Presiding
1:30 pm - 5:00 pm
1:30 Introductory Remarks
1:35 14 Building a semantic chemistry platform with the Royal Society of Chemistry

Valery Tkachenko, tkachenkov@rsc.org, Colin Batchelor, Peter Corbett, Antony Williams. eScience, Royal Society of Chemistry, Cambridge, Cambridgeshire CB4 0WF, United Kingdom

We live in an exponentially expanding world of “big data”. Social networks, global portals and other distributed systems have been attempting to deal with the problem for a few years now. Scientific applications are commonly lagging behind the mainstream trends due to the complexity of the scientific domain. The Royal Society of Chemistry is building the Global Chemistry Network connecting a variety of resources both in-house and external, bridging gaps and advancing the chemical sciences. One of the main issues connected to the world of big data is the ease of navigation and comprehensiveness of the search capabilities. This is where the approach of the semantic web meets the world of big data. We will present our approaches in building a global federated chemistry platform connecting multiple domains of chemistry using semantic web technologies.

2:00 15 Ontology work at the Royal Society of Chemistry

Antony J. Williams, williamsa@rsc.org, Colin Batchelor, Peter Corbett, Jon Steele, Valery Tkachenko. eScience, Royal Society of Chemistry, Cambridge, Cambridgeshire CB4 0WF, United Kingdom

We provide an overview of the use we make of ontologies at the Royal Society of Chemistry. Our engagement with the ontology community began in 2006 with preparations for Project Prospect, which used ChEBI and other Open Biomedical Ontologies to mark up journal articles. Project Prospect has since evolved into DERA (Digitally Enhancing the RSC Archive), and we have developed further ontologies for text markup, covering analytical methods and name reactions. Most recently we have been contributing to CHEMINF, an open-source cheminformatics ontology, as part of our work on disseminating calculated physicochemical properties of molecules via Open PHACTS. We show how we represent these properties and how this representation can serve as a template for disseminating other sorts of chemical information.

2:25 16 PubChem: Data access, navigation, and integration by means of classifiers and ontologies

Evan Bolton, bolton@ncbi.nlm.nih.gov, National Center for Biotechnology Information, Bethesda, MD 20894, United States

PubChem is a sizeable archive of chemical biology information, with over 115 million sample descriptions, 46 million unique small molecules, and 200 million biological activity results. Navigation of such a large corpus of information is made easier by means of classification systems (such as NLM's MeSH and WIPO's IPC) and ontologies (such as EBI's ChEBI and the Gene Ontology Consortium's GO). Coupled with the semantic annotation provided by the PubChemRDF project, data integration is facilitated across scientific domains. This talk will demonstrate not only how classifiers and ontologies are leveraged within the PubChem resource to improve data navigation, but also how chemical identity groups and similarity play a key role in locating related chemical information.
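The kind of triple-based navigation that RDF annotation enables can be sketched with a toy in-memory triple store. All identifiers and predicate names below are illustrative inventions, not actual PubChemRDF vocabulary; a real client would issue SPARQL queries against the published RDF.

```python
# Toy triple store: each entry is a (subject, predicate, object) statement.
# The compound/assay/target names are hypothetical, for illustration only.
triples = {
    ("CID-A", "activeIn", "AID-1"),
    ("CID-B", "activeIn", "AID-1"),
    ("CID-C", "activeIn", "AID-2"),
    ("AID-1", "targets", "TargetX"),
    ("AID-2", "targets", "TargetY"),
}

def objects(subject, predicate):
    """All objects linked from `subject` via `predicate`."""
    return {o for s, p, o in triples if s == subject and p == predicate}

# Cross-domain question: which compounds are active in an assay
# that targets TargetX?
hits = {s for s, p, o in triples
        if p == "activeIn" and "TargetX" in objects(o, "targets")}
print(sorted(hits))
```

The same join, expressed over the real graph, is what lets chemical records integrate with assay and target annotations from other domains.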

2:50 17 AMI2: High-throughput extraction of semantic chemistry from the scientific literature

Andy Howlett, aph36@cam.ac.uk, Mark Williamson, Peter Murray-Rust. Department of Chemistry, Unilever Centre for Molecular Informatics, University of Cambridge, Cambridge, United Kingdom

The open-source AMI2-chem system ingests the scientific literature and extracts chemical facts on a high-throughput basis. It can take a variety of inputs, most importantly PDF and images (e.g. PNG), and transform them successively into SVG, XHTML, and finally CML (Chemical Markup Language). Chemistry can be extracted from text, tables, and diagrams and represented as molecules, reactions, recipes/procedures (ChemicalTagger), spectra, classification trees, and, more generally, plots and data tables. The text and tables are created in XHTML and SVG. Other AMI2 systems extract biological sequences, species, and phylogenetics (NexML), thus creating a multidisciplinary semantic resource.

3:15 Intermission
3:30 18 Creating context for the experiment record: User-defined metadata

Cerys Willoughby, cerys.willoughby@me.com, Jeremy G Frey, Simon J Coles, Colin L Bird. Department of Chemistry, University of Southampton, Southampton, Hampshire SO17 1BJ, United Kingdom

The drive towards more transparency in research and open data increases the importance of being able to find information and make links to the data. Metadata is an essential ingredient for facilitating discovery and is used in Electronic Laboratory Notebooks to curate experiment data and associated entries with descriptive information and classification labels that can be used for aggregation and identification. Machine-generated metadata helps with facilitating metadata exchange and enabling interoperability, but such metadata is not necessarily in a form friendly for the humans that also need it. A survey of metadata usage in an ELN developed at the University of Southampton indicates many users do not use metadata effectively. Whilst some groups are comfortable with metadata and are able to design a metadata structure that works effectively, many users have no knowledge of where to start to define metadata or even an understanding of what it is and why it is useful. The metadata used within the notebooks is dominated by a few categories, in particular materials, data formats, and instruments. Further investigation is under way to determine whether this pattern of metadata use is common in other online environments, whether users are more likely to create certain types of metadata, and whether lessons can be learned from other environments to encourage metadata use. These findings will contribute to strategies for encouraging and improving metadata use in ELNs such as improved interface designs, user education, standard schema designs, and encouraging collaboration between same-discipline groups to promote consistency and best practices.

3:55 19 Experiment markup language: A combined markup language and ontology to represent science

Stuart Chalk, schalk@unf.edu, Department of Chemistry, University of North Florida, Jacksonville, FL 32224, United States

In its basic essence, when we perform science we create a detailed journal of what we do. In order to capture this in the context of the semantic web it is necessary to break down the scientific process and look at the types of information that we record. The Experiment Markup Language (ExptML) is proposed as a framework for capturing the content and context of science experiments. This is coupled with an ontology that represents the relationships between content elements, thus creating a graph of science experiments. A discussion of the current version of this system will be presented with examples along with a roadmap of future development.

4:20 20 Development of formal representations of the synthesis and processing histories of metal-organic frameworks (MOFs) using the ChemAxiom, ChEBI, CMO, and CHEMINF ontologies

Nico Adams1, Nico.Adams@csiro.au, Danielle Kennedy1, Murray Jensen1, Cornelius Kloppers2, Yanfeng Shu2, Claire D'Este2, Craig Lindley2. (1) Materials Science and Engineering, CSIRO, Clayton, Victoria 3168, Australia, (2) Computational Informatics, CSIRO, Hobart, Tasmania 7001, Australia

Metal-organic frameworks (MOFs), a subset of coordination polymers, are currently a hot topic in chemical and materials research. They have some of the highest recorded surface areas and are finding applications in gas storage, catalysis, and biosensing. Significant repositories of both physical MOF collections and data about MOFs (whether experimentally determined or computed) are currently being assembled by the research community, and informatics systems for the retrieval of appropriate information associated with these compounds are required. However, informatics issues associated with MOFs remain to be resolved: for example, there are currently no internationally accepted naming conventions or topology descriptions, despite a recently published first recommendation by IUPAC. The formal, ontology-driven representation of MOFs, their structural features, and their synthetic histories will go some way towards overcoming challenges in information retrieval associated with these compounds and may also facilitate greater interoperability and re-use of MOF data. This talk will illustrate our approach to the development of such representations and show how they can be leveraged in a number of application contexts.

4:45 Panel Discussion

Sunday, March 16, 2014

Translational Cancer Bioinformatics: Data, Methods and Applications - PM Session

Omni Dallas Hotel
Room: Deep Ellum B
Cosponsored by COMP

Rachelle Bienstock, Shuxing Zhang, Organizers
Shuxing Zhang, Presiding
1:00 pm - 3:00 pm
1:00 21 New chemistry and powerful interactive technologies to discover PPI antagonists

Carlos J. Camacho, ccamacho@pitt.edu, Computational and Systems Biology, University of Pittsburgh, Pittsburgh, PA 15260, United States

Although there is no shortage of potential protein-protein interaction (PPI) drug targets, only a handful of known low-molecular-weight inhibitors exist. One problem is that current efforts are dominated by low-yield high-throughput screening, whose rigid framework is not suitable for the diverse chemotypes present in PPIs. Here, I will describe recent progress in our efforts to develop open-access, interactive (i.e., real-time) web-based drug discovery technologies. Our goal is to bring knowledge into the virtual screening pipeline by developing tools that create synergy between chemists, biologists, and other experts to deliver (ant)agonists for hard targets. The pharmacophore-based technologies build on the role that anchor residues, or deeply buried hot spots, play in the molecular recognition of PPIs. Novel tera-chemistry that redesigns these entry points with anchor-biased virtual multicomponent reactions delivers tens of millions of readily synthesizable compounds. Application of this approach to the MDM2/p53 cancer target led to high hit rates, resulting in a large and diverse set of confirmed inhibitors, and co-crystal structures validate the design strategy. Our unique technologies promise to expand the development of novel chemical probes for cancer research and the exploration of the human interactome by leveraging in-house small-scale assays and user-friendly chemistry to rationally design ligands for PPIs with known structure.

1:30 22 Integrative analysis of multidimensional cancer genomics data

Shihua Zhang, Wenyuan Li, Chun-Chi Liu, X. Jasmine Zhou, xjzhouo@usc.edu. Molecular and Computational Biology, University of Southern California, Los Angeles, CA 90068, United States

Recent technology has made it possible to simultaneously perform multi-platform genomic profiling of biological samples, resulting in so-called 'multi-dimensional genomic data'. Such data provide unique opportunities to study the coordination between regulatory mechanisms on multiple levels. However, integrative analysis of multi-dimensional genomic data for the discovery of combinatorial patterns is currently lacking. Here, we develop several methods to address this challenge. We applied our methods to the multi-dimensional ovarian cancer data from The Cancer Genome Atlas project, including copy number variation, DNA methylation, gene expression, and microRNA expression data. We successfully identified perturbed pathways that would have been overlooked with only a single type of data, uncovered associations between different layers of cellular activity, and identified clinically distinct patient subgroups. Our study provides useful protocols for uncovering hidden patterns and their biological implications in multi-dimensional 'omics' data.

2:00 23 New application to estimate the diversity of molecular databases

Iwona Weidlich1, iweidlic@coddes.com, Igor Filippov2. (1) CODDES LLC, Rockville, MD 20852, United States, (2) VIF Innovations LLC, Rockville, MD 20852, United States

To build a SAR/QSAR model applicable to a wide variety of chemotypes, the training set should be diverse; but exactly how diverse, and what the appropriate measure of diversity is, remains an unresolved problem. Frequently used ways to measure diversity include: (a) bio-activity profiles or distributions of physico-chemical properties; (b) the distribution, or central moments, of pairwise similarity; and (c) clustering-based approaches and dimension reduction. Bio-activity data and physico-chemical properties are not always available for all chemical structures of interest. Methods relying on mean pairwise similarity make an a priori assumption that the underlying distribution has a specific shape (e.g., Gaussian), while methods using either clustering or a distribution of properties make it difficult to compare two very different sets of molecules and claim that one is more diverse than the other. We therefore explored the possibility of creating a diversity measure satisfying the following criteria: it is trivially computed from the structure itself for any structure; it is applicable to any set of molecules without a priori assumptions about its composition; and it is easy to use to compare two different sets. With the increasing availability of large datasets numbering in the tens of millions of structures, there is enough publicly accessible data to investigate the diversity distribution over sets of randomly selected compounds, which has not been possible before. We present a new application to estimate the diversity of a chemical set of molecules.
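The mean-pairwise-similarity approach criticized above, criterion (b), can be sketched in a few lines, assuming molecules have already been reduced to fingerprint bit sets. The fingerprints below are invented for illustration; this is not the authors' proposed measure.

```python
from itertools import combinations

def tanimoto(a, b):
    """Tanimoto similarity of two fingerprints, each a set of 'on' bit indices."""
    union = len(a | b)
    return len(a & b) / union if union else 1.0

def mean_pairwise_similarity(fps):
    """Average Tanimoto similarity over all unordered pairs of fingerprints."""
    sims = [tanimoto(a, b) for a, b in combinations(fps, 2)]
    return sum(sims) / len(sims)

# Hypothetical fingerprints for four molecules.
fps = [{1, 2, 3}, {1, 2, 4}, {5, 6}, {1, 5, 7}]
print(round(1.0 - mean_pairwise_similarity(fps), 3))  # a crude diversity score
```

As the abstract notes, summarizing the set by this single mean implicitly assumes something about the shape of the similarity distribution, which is one motivation for seeking a better-behaved measure.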

2:30 24 Computational analysis of pleckstrin homology (PH) domains for cancer drug development

Shuxing Zhang, shuxing@gmail.com, Department of Experimental Therapeutics, MD Anderson Cancer Center, Houston, TX 77054, United States

The pleckstrin homology (PH) domain is a protein domain of approximately 120 residues that occurs in a variety of proteins involved in intracellular signaling or as constituents of the cytoskeleton. Through interactions with phosphatidylinositol lipids and other proteins, the PH domain plays a critical role in recruiting oncogene proteins (e.g., Akt) to membranes for activation, thus contributing to cancer growth. In the present study, we attempt to understand PH domain structures and their genomic signatures, which may help us design specific inhibitors for targeted cancer therapies. PH domains have low sequence identity, usually below 30%. To date, over 30 PH domain structures have been determined. Analysis of these structures has demonstrated that the 3D fold of PH domains is highly conserved, with an N-terminal α-helix followed by seven β-strands. The loops connecting these β-strands can differ significantly in length, and this may provide the source of the domain's specificity; indeed, individual PH domains possess affinities and specificities for different phosphoinositides. Based on our structural analysis, we implemented position-specific scoring matrices (PSSMs), which have been used to determine the secondary structures of query sequences. Additionally, we employed a phylogenetic tree of PH domains, constructed from the PSSMs, to classify the PH domains; our observations agree with their known binding specificities for different phosphoinositides, thereby enabling the design of selective inhibitors. This strategy has been rigorously cross-validated with our PH domain structure data set, and it was also successfully applied to the prediction of the GAB1 PH domain structure, followed by the discovery of potent GAB1 inhibitors that kill cancer cells.
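A PSSM of the kind mentioned can be sketched in a few lines. This toy version assumes an ungapped alignment, a uniform background distribution, and a simple pseudocount, all simplifications of what tools like PSI-BLAST do; the sequences are invented.

```python
import math

def pssm(alignment, alphabet="ACDEFGHIKLMNPQRSTVWY", pseudocount=1.0):
    """Log-odds position-specific scoring matrix from an ungapped alignment.

    Returns one {residue: score} dict per alignment column. The background
    is uniform over the alphabet -- an illustrative simplification.
    """
    n_seqs = len(alignment)
    bg = 1.0 / len(alphabet)
    matrix = []
    for pos in range(len(alignment[0])):
        col = [seq[pos] for seq in alignment]
        denom = n_seqs + pseudocount * len(alphabet)
        matrix.append({
            aa: math.log2(((col.count(aa) + pseudocount) / denom) / bg)
            for aa in alphabet
        })
    return matrix

aln = ["ACD", "ACE", "GCD"]   # hypothetical three-residue alignment
m = pssm(aln)
# A fully conserved C at position 2 scores higher than an unobserved residue.
print(m[1]["C"] > m[1]["A"])
```

Scoring a query sequence against such a matrix (summing the per-position scores) is what lets low-identity PH domain sequences still be recognized and classified.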

Sunday, March 16, 2014

Neglected and Rare Disease Drug Discovery Needs Open Data - PM Session

Omni Dallas Hotel
Room: Deep Ellum B

Antony Williams, Joel Freundlich, Sean Ekins, Organizers
Sean Ekins, Presiding
3:15 pm - 5:15 pm
3:15 25 Looking back at Mycobacterium tuberculosis mouse efficacy testing to move new drugs forward

Sean Ekins1,2, ekinssean@yahoo.com, Robert C Reynolds3, Antony J Williams4, Alex M Clark5, Joel S Freundlich6. (1) Collaborations in Chemistry, United States, (2) Collaborative Drug Discovery, Inc., United States, (3) University of Alabama at Birmingham, United States, (4) Royal Society of Chemistry, United States, (5) Molecular Materials Informatics, Canada, (6) Rutgers University, United States

An urgent need exists to find new therapeutic options to address drug resistance and shorten therapy for those infected with Mycobacterium tuberculosis (Mtb). Whole-cell phenotypic screens have yielded well over 1500 in vitro inhibitors of Mtb, which must be prioritized for in vivo testing. The mouse in vivo model of Mtb infection is a validated model commonly used to select compounds for later studies. We have examined the data on this in vivo model from the 1940s to the present day, which had not previously been curated. We have performed extensive structure-activity analysis and machine learning modeling that highlight new insights that could aid Mtb drug discovery. For example, a Bayesian model (N = 338) had statistics similar to previously published in vitro models (ROC score = 0.77, concordance = 70.57, specificity = 68.81, and sensitivity = 72.14). SVM and tree-based models were also generated. We also describe how the SAR Table mobile app can be employed to mine local structure-activity relationships, and the steps needed to make this open data.
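Model statistics of the kind reported can be computed from a confusion matrix; the counts below are illustrative only, not the abstract's N = 338 dataset.

```python
def classification_stats(tp, tn, fp, fn):
    """Sensitivity, specificity, and concordance (overall accuracy), in percent."""
    sensitivity = 100 * tp / (tp + fn)
    specificity = 100 * tn / (tn + fp)
    concordance = 100 * (tp + tn) / (tp + tn + fp + fn)
    return sensitivity, specificity, concordance

# Made-up counts for a two-class (active/inactive) model evaluation.
sens, spec, conc = classification_stats(tp=72, tn=69, fp=31, fn=28)
```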

3:45 26 Sharing methods to build predictive machine learning models for neglected and rare disease drug discovery

Paul J Kowalczyk, pauljkowalczyk@gmail.com, Department of Computational Chemistry, SCYNEXIS, Research Triangle Park, NC 27709-2878, United States

We present, and share, sets of 'better practices' for eight machine learning methods, focusing on classification: recursive partitioning, random forests, neural networks, partial least squares, support vector machines, k-nearest neighbors, self-organizing maps, and naïve Bayes. Various tuning parameters for each method are analyzed, e.g., the dimensions and topology of a self-organizing map, the depth of trees for partitioning methods, the number of nodes/hidden nodes in a neural network, and the choice of kernel. The choice of molecular descriptors has also been studied, including topological descriptors (e.g., atom pairs), circular fingerprints (e.g., ECFP/FCFP), and constitutive fingerprints (e.g., MDL keys). Metrics for model performance include AUC, sensitivity, specificity, and Cohen's kappa. The success and utility of these 'better practices' are demonstrated using publicly available antimalarial datasets. Each data mining effort is collected into a compendium: an interactive document that bundles primary data, statistical methods, figures, and derived data together with textual documentation and conclusions. This interactivity allows one to reproduce the research and to modify and extend the various components. We show how the compendia might be used for neglected and rare disease drug discovery and, additionally, how they might serve as tutorials for data mining. Specifically, we demonstrate how one might use the compendia to build predictive machine learning models for any dataset. The open-source R software environment is used for all data mining tasks. All text, code, data, and auxiliary content will be made freely available.
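As an example of one of the performance metrics listed, Cohen's kappa for a binary classifier can be computed as follows; the label vectors are made-up toy data, not results from the antimalarial datasets.

```python
def cohens_kappa(y_true, y_pred):
    """Cohen's kappa: observed agreement corrected for chance agreement,
    for binary (0/1) labels."""
    n = len(y_true)
    observed = sum(t == p for t, p in zip(y_true, y_pred)) / n
    p_true = sum(y_true) / n          # fraction of positives in the truth
    p_pred = sum(y_pred) / n          # fraction of positives predicted
    expected = p_true * p_pred + (1 - p_true) * (1 - p_pred)
    return (observed - expected) / (1 - expected)

# Toy labels: 6 of 8 predictions agree with the truth.
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 1, 0, 0, 0, 1, 1, 0]
kappa = cohens_kappa(y_true, y_pred)
```

Kappa is preferred over raw accuracy for imbalanced screening data because it discounts the agreement a trivial majority-class predictor would achieve by chance.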

4:15 27 Royal Society of Chemistry developments to support open drug discovery

Antony J. Williams, williamsa@rsc.org, Alexey Pshenichnov, Jon Steele, Ken Karapetyan, Richard Gay, Valery Tkachenko. eScience, Royal Society of Chemistry, Cambridge, Cambridgeshire CB4 0WF, United Kingdom

In recent years the Royal Society of Chemistry has become known for our development of freely accessible data platforms, including ChemSpider, ChemSpider Reactions, and our new chemistry data repository. To support drug discovery, the RSC participates in a number of projects, including the Open PHACTS semantic web project, the PharmaSea natural products discovery project, and the Open Source Drug Discovery project in collaboration with a team in India. Our most recent developments include: extending our efforts to support neglected diseases by providing high-quality, curated datasets to support modeling; delivering enhanced application programming interfaces that allow open source drug discovery teams both to source data from and deposit data into our chemistry databases; and providing a micropublishing platform to report on work supporting neglected disease drug discovery. This presentation will review our existing efforts and our plans for further development.

4:45 28 How can PubChem be leveraged for neglected and rare disease drug discovery?

Evan Bolton, bolton@ncbi.nlm.nih.gov, National Center for Biotechnology Information, Bethesda, MD 20894, United States

PubChem is a free resource serving the chemical biology community. PubChem contains more than 215 million biological test results from over 715 thousand biological experiments, covering nearly two million unique small molecules and six thousand unique defined protein targets. PubChem is an open archive and provides a platform for scientists to share data. PubChem integrates this information with other biomedical resources at the U.S. National Center for Biotechnology Information, such as PubMed and GenBank. This talk will show how PubChem can be harnessed by drug discovery researchers in their quest for cures for tropical, rare, and neglected diseases (TRND).
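As one concrete route into this data, PubChem's PUG REST web service can be queried by constructing URLs of the following form; the compound name here is only an example, and the sketch builds the URL without making a network request.

```python
from urllib.parse import quote

PUG_REST = "https://pubchem.ncbi.nlm.nih.gov/rest/pug"

def property_url(name, properties):
    """Build a PUG REST URL that looks up compound properties by name."""
    return "{}/compound/name/{}/property/{}/JSON".format(
        PUG_REST, quote(name), ",".join(properties))

# Example lookup; fetching this URL (e.g. with urllib.request) returns
# a JSON property table for the named compound.
url = property_url("isoniazid", ["MolecularFormula", "MolecularWeight"])
```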

Sunday, March 16, 2014

CINF Scholarship for Scientific Excellence - EVE Session

Omni Dallas Hotel
Room: Dallas C

Guenter Grethe, Organizer
6:30 pm - 8:30 pm

29 Efficacy of chemical hyperstructures in similarity searching and virtual screening

Edmund Duesbury, lip12ed@sheffield.ac.uk, John Holliday, Peter Willett. Information School, University of Sheffield, Sheffield, South Yorkshire S1 4DP, United Kingdom

Two techniques exist in data fusion which have been proven to work in various forms in chemoinformatics: similarity fusion, where different similarity measures are combined; and group fusion, where similarities are combined from multiple reference molecules. The hyperstructure concept, however, is another form of data fusion, being a hypothetical molecule constructed from the overlap of a set of existing molecules. Initially proposed to reduce the time of database searching, it has also been used directly for virtual screening on two occasions since its inception [1,2], the latter of which showed it to be useful as a 2-dimensional QSAR method. The concept's performance in 2-dimensional similarity searching has, however, not been shown to be effective to date, and it has not been evaluated thoroughly on large sets of compounds. The work being carried out in this project aims to evaluate hyperstructures as an alternative (if not superior) method for fusion-based similarity searching, with an emphasis on virtual screening. Current progress on the project will be discussed, including a brief overview of how hyperstructures are constructed, evaluated for virtual screening, and compared with existing search methods. Of particular interest will be a comparison with existing data fusion methods. Results in this work show that the hyperstructure concept is not as effective for virtual screening as group fusion using ECFP4 fingerprints in terms of the number of actives retrieved, but it retrieves a greater diversity of molecules. This suggests that the two approaches are complementary, and that it may be beneficial to apply similarity fusion across the two techniques to improve virtual screening. References: [1] Brown, N. PhD Thesis, University of Sheffield, 2002. [2] Palyulin, Radchenko, & Zefirov. Molecular field topology analysis method in QSAR studies of organic compounds. J. Chem. Inf. Comput. Sci. 2000, 40, 659-667.
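The group-fusion baseline mentioned above is commonly implemented with the MAX rule over Tanimoto similarities; a minimal sketch follows, in which the fingerprints are tiny hypothetical bit sets rather than real ECFP4 fingerprints.

```python
def tanimoto(a, b):
    """Tanimoto coefficient between two fingerprints stored as sets of on-bits."""
    inter = len(a & b)
    return inter / (len(a) + len(b) - inter)

def group_fusion(references, candidate):
    """MAX group fusion: a candidate's score is its best similarity to
    any of the active reference molecules."""
    return max(tanimoto(ref, candidate) for ref in references)

# Hypothetical fingerprints (sets of on-bit indices), not real ECFP4 data.
refs = [{1, 2, 3, 8}, {2, 4, 5, 9}]
near = {1, 2, 3, 7}    # resembles the first reference
far = {10, 11, 12}     # shares no bits with either reference
```

Ranking a database by this fused score retrieves compounds similar to any one of the actives, which is why group fusion tends to find many actives but from a narrower structural neighborhood than a hyperstructure search.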


30 3D-QSAR using quantum-mechanics-based molecular interaction fields

Ahmed El Kerdawy1, Ahmed.Elkerdawy@fau.de, Stefan Güssregen2, Hans Matter2, Matthias Hennemann1,3, Timothy Clark1,3,4. (1) Computer-Chemistry-Center, Friedrich-Alexander University Erlangen-Nürnberg, Erlangen, Germany, (2) R&D, LGCR, Structure, Design and Informatics, Sanofi-Aventis Deutschland GmbH, Frankfurt am Main, Germany, (3) Interdisciplinary Center for Molecular Materials, Friedrich-Alexander University Erlangen-Nürnberg, Erlangen, Germany, (4) Centre for Molecular Design, University of Portsmouth, Portsmouth, United Kingdom

The natural evolution of computer-aided drug design (CADD) methods involves a shift toward quantum-mechanics (QM)-based approaches. This shift is not only the result of ever-growing computational power but also of the need for more accurate and more informative approaches to describing molecular properties and binding characteristics than those currently available. QM approaches do not suffer from the limitations inherent to the ball-and-spring description and the fixed atom-centered charge approximation of the classical force fields mostly used by CADD methods. In this project we introduce a protocol for shifting 3D-QSAR, one of the most widely used ligand-based drug design approaches, to QM-based molecular interaction fields (MIFs), namely the electron density (ρ), hydrogen-bond donor field (HDF), hydrogen-bond acceptor field (HAF), and molecular lipophilicity potential (MLP), to overcome the limitations of current force-field-based MIFs. The average performance of the QM-MIF (QMFA) models across nine data sets was better than that of the conventional force-field-based MIF models. In the individual data sets, the QMFA models always perform better than, or as well as, the conventional approaches. It is particularly encouraging that the relative performance of the QMFA models improves in external validation.


31 PoseView: Visualization of protein-ligand interactions in 2D

Katrin Stierand, stierand@zbh.uni-hamburg.de, Matthias Rarey. ZBH Center for Bioinformatics, University of Hamburg, Hamburg, Germany

Computational methods for drawing small molecules are well established in chemistry and the associated scientific fields. For the 2D visualization of protein-ligand complexes there is considerably less choice. We developed PoseView, which generates two-dimensional diagrams of protein-ligand complexes, showing the ligand, the interactions, and the interacting residues. All depicted molecules are drawn at the atomic level as structure diagrams; thus, the output plots are clearly structured and easily readable. The algorithm is based on a combinatorial optimization strategy which solves parts of the layout problem non-heuristically. PoseView is widely accepted in the community: it is incorporated in the PDB, providing a diagram for each protein with a co-crystallized druglike ligand. Large-scale application studies on approximately 200,000 complexes from the PDB reached success rates of over 92%. Beyond the depiction of single complex diagrams, the comparative two-dimensional graphical representation of protein-ligand complex series, featuring different ligands bound to the same active site, offers quick insight into differences in their binding modes. Compared to arbitrary orientations of the residue molecules in individual complex depictions, a consistent placement improves the legibility and comparability within the series. The automatic generation of such consistent layouts makes it possible to apply the method to large data sets originating from computer-aided drug design. We developed a new approach which automatically generates a consistent layout of interacting residues for a given series of complexes. The method generates high-quality layouts, showing mostly overlap-free solutions in which molecules are displayed as structure diagrams providing interaction information in atomic detail. Compared to existing approaches, these drawings substantially simplify the visual analysis of large compound series.
In our contribution we will present an overview of the algorithms and an insight into the application of PoseView in different scenarios.


32 Metal template approach towards efficiency enhancement in hydrogen-bond promoted enantioselective organocatalysis

Tathagata Mukherjee, tathagata.mukherjee@chem.tamu.edu, John A. Gladysz. Department of Chemistry, Texas A&M University, College Station, Texas 77840, United States

The concept of preorganization involves engineering a receptor to be complementary to a guest prior to a binding event. This can render the host-guest interaction entropically and enthalpically more favorable. This notion can be extended to chiral hydrogen-bond donors, which are immensely popular in enantioselective organocatalysis. To test the effect of "metal-templated efficiency enhancement in organocatalysis" we have chosen a 2-guanidinobenzimidazole derivative (GBI) as a simple hydrogen-bond donor. Upon complexation to transition metals, GBI becomes preorganized for several hydrogen-bonding motifs. With preorganization, the rotational degrees of freedom intrinsic to the ligand are greatly reduced, and both the reactivity and the enantioselectivity are enhanced significantly.

The generality of the catalytic behavior was established by extending the Michael addition reaction to various dicarbonyls and different nitroolefins. Exceeding expectations, the "ruthenium-templated" catalyst showed good enantioselectivity even with alkyl nitroolefins, which are notoriously difficult substrates for enantioselective catalysis. These results support combining "organocatalysis" with "metal-templated efficiency enhancement", as the title suggests. This new class of hybrid (organic-inorganic) hydrogen-bond donors is unique and has excellent potential to open a new branch of catalysis.

Monday, March 17, 2014

Keeping the Thrill Alive: Research Data and Electronic Notebooks - AM Session
Data Curation

Omni Dallas Hotel
Room: Deep Ellum A
Cosponsored by COMP, MEDI, ORGN, PHYS
Antony Williams, Jeremy Frey, Simon Coles, Leah McEwen, Organizers
Antony Williams, Leah McEwen, Presiding
8:30 am - 11:50 am
8:30 Introductory Remarks
8:35 33 Profiling common types of research data produced by chemists at the University of Michigan

Ye Li, liye@umich.edu, Shapiro Science Library, University of Michigan, Ann Arbor, Michigan 48109, United States

Best practices in data management and metadata standards for chemistry data are still mostly under development, with the exception of areas related to mature "big data" fields such as geochemistry. To support data sharing in chemistry research communities, our first step is to identify the common types of research data produced by chemists. These types of data can be grouped according to research themes meaningful to a specific research community, forming the Data Type Profile (DTP) of that community. For example, the DTP for organic synthesis can include chemical structures, reaction schemes, physical and chemical properties, spectral data, thermodynamic data of reactions, chromatographic data, and crystallographic data, along with the specific types of data under these categories. The basic DTP can be enriched with relevant data lifecycle stories to form a comprehensive DTP, which can then be used to propose metadata standards and best practices for managing and sharing data. Here, as a case study, publications authored by principal investigators (PIs) in the Department of Chemistry at the University of Michigan were retrieved from Web of Science and grouped by PI. Journal articles published during the past two years were selected as the main reference set. The data originating from these labs and appearing in the texts, tables, figures, captions, and supplements of the articles were identified, described, and categorized. The descriptions were collected in a FileMaker database and grouped under different research themes, so that we could construct a DTP for each research theme rather than for traditional sub-disciplines. By completing DTPs for the research themes in the Chemistry Department, we will obtain a representative scope of the common types of data of interest to chemists and use it as a foundation to facilitate data sharing in chemistry.

8:55 34 Distributing, managing, and updating cheminformatics experiments

Paul J Kowalczyk, pauljkowalczyk@gmail.com, Department of Computational Chemistry, SCYNEXIS, Research Triangle Park, NC 27709-2878, United States

We demonstrate how one might report cheminformatics experiments as instances of reproducible research, i.e., how one might author and distribute integrated dynamic documents that contain the text, code, data and any auxiliary content needed to recreate the computational results. We show how the contents of these documents, including figures and tables, can be recalculated each time the document is generated. This integration of computational code, method description and data allows researchers the opportunity to both verify the published research and adapt/extend the methods presented. Open-source tools are used for all document generation: the R software environment is used to process chemical structures and mine and analyze biological and chemical data; the knitr package is used to generate reports (PDF); the markdown package is used to generate valid (X)HTML content; and the beamer package is used to create slides for presentation. Specific examples are presented for the visualization, analysis and mining of publicly available antimalarial datasets, with particular attention paid to automatically generating PDF reports, slides for presentations and valid (X)HTML content. All text, code, data and auxiliary content will be made freely available.

9:15 35 Dark reaction project: Archiving and deriving value from unreported "failed" hydrothermal synthesis reactions

Joshua Schrier1, jschrier@haverford.edu, Sorelle Friedler2, Alexander Norquist1. (1) Department of Chemistry, Haverford College, Haverford, PA 19041, United States, (2) Department of Computer Science, Haverford College, Haverford, PA 19041, United States

Most chemical reactions that have been performed are deemed "unsuccessful" and are never reported in the literature. There is no forum for collecting these "dark reactions", nor a means for deriving value from them, yet they are valuable because they define the bounds on the reaction conditions needed to successfully synthesize a product. Moreover, this vast dataset is currently languishing in old laboratory notebooks. In this talk, we will describe our work on creating a searchable public online repository for reaction data that enables better management, sharing, and utilization of these dark reactions. Our initial efforts focus on the hydrothermal synthesis of organically-templated inorganic solids, as just a few reactants (one or two inorganic components, one or two organic components, and solvent) and a few reaction conditions (pH, temperature, reaction time) yield a diversity of products. We will describe the types of data we collect and the overall architecture of our database software. We will demonstrate the user interface for entering new reactions and searching existing reactions. We will discuss the potential for our (open-source) software to be adapted for other types of chemical synthesis data-management applications. Finally, we will describe our progress in using the "dark reaction" dataset to accelerate exploratory synthesis through machine learning. We used cheminformatics software to compute derived properties of the reagents and trained a decision-tree algorithm to predict the success of new reactions. We then used this model to perform virtual screening of commercially available reagents. We will discuss our preliminary results in experimentally validating these predictions.
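As an illustration of the modeling step, a depth-1 decision tree (a decision stump) over reaction descriptors can be trained as below; the descriptor names, data values, and exhaustive threshold search are invented for this sketch and are not the project's actual model or dataset.

```python
def train_stump(X, y):
    """Fit a one-feature threshold rule (a depth-1 decision tree) by
    exhaustively searching features and split points for best accuracy."""
    best = None
    n = len(y)
    for f in range(len(X[0])):
        for threshold in sorted({row[f] for row in X}):
            for positive_side in (True, False):
                preds = [(row[f] >= threshold) == positive_side for row in X]
                acc = sum(p == bool(t) for p, t in zip(preds, y)) / n
                if best is None or acc > best[0]:
                    best = (acc, f, threshold, positive_side)
    return best[1:]

def predict(model, row):
    """Apply the learned threshold rule to a new descriptor vector."""
    f, threshold, positive_side = model
    return (row[f] >= threshold) == positive_side

# Hypothetical descriptors [pH, temperature_C]; 1 = product formed.
X = [[2.0, 90], [3.0, 120], [8.0, 95], [9.0, 130], [2.5, 110], [8.5, 100]]
y = [1, 1, 0, 0, 1, 0]
model = train_stump(X, y)
```

On this toy data the stump learns that low pH predicts success; a real model would use richer cheminformatics descriptors and deeper trees.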

9:35 36 Factors to consider when choosing the right ELN for capturing and collaborating with your research data

Philip Mounteney1, pm@dotmatics.com, Berkley A Lynch2, Tamsin E Mansley2, Sharang Phatak1, Jess W Sager1. (1) Dotmatics, Inc., San Diego, CA 92121, United States, (2) Dotmatics, Inc., Woburn, MA 01801, United States

Chemistry ELNs (electronic lab notebooks) have been evolving for years, and are becoming widely deployed. However, the structured format of the chemistry ELN does not satisfy the needs of other research groups. We will present a simple, easy-to-use, web-based electronic laboratory notebook that contains interfaces for chemists, biologists and other scientific disciplines in a fully compliant environment with audit trails. This is coupled with robust and flexible data searching capabilities that promote data sharing, collaboration, easy report generation and knowledge transfer. The web-based interface and cloud-based deployment described facilitate secure global and mobile access from a variety of devices in the lab, at the desk, across research/partner sites or on-the-go, particularly as the paradigm of BYOD (Bring Your Own Device) becomes more prevalent.

9:55 Intermission
10:10 37 Royal Society of Chemistry activities to develop a data repository for chemistry-specific data

Aileen Day, Alexey Pshenichnov, Ken Karapetyan, Colin Batchelor, Peter Corbett, Jon Steele, Valery Tkachenko, Antony J Williams, williamsa@rsc.org. eScience, Royal Society of Chemistry, Cambridge, United Kingdom

The Royal Society of Chemistry publishes many thousands of articles per year, the majority containing rich chemistry data that, in general, is limited in its value when confined to the HTML or PDF forms of the articles commonly consumed by readers. The RSC also has an archive of over 300,000 articles containing rich chemistry data, especially in the form of chemicals, reactions, property data, and analytical spectra. The RSC is developing a platform integrating these various forms of chemistry data. The data will be aggregated both during the manuscript deposition process and as a result of text-mining and extraction of data from across the RSC archive. This presentation will report on the development of the platform, including our success in extracting compounds, reactions, and spectral data from articles. We will also discuss our developing process for handling data at manuscript deposition and the integration and support of electronic lab notebooks (ELNs) for facilitating data deposition and sourcing data. Each of these processes is intended to ensure long-term access to research data, with the intention of facilitating improved discovery.

10:30 38 Eureka research workbench: An open source eScience laboratory notebook

Stuart Chalk, schalk@unf.edu, Department of Chemistry, University of North Florida, Jacksonville, FL 32224, United States

Scientists are looking for ways to leverage web 2.0 technologies in the research laboratory and as a consequence a number of approaches to web-based electronic notebooks are being evaluated. In this presentation we discuss the Eureka Research Workbench, an electronic laboratory notebook built on semantic technology and XML. Using this approach the context of the information recorded in the laboratory can be captured and searched along with the data itself. A discussion of the current system is presented along with the development roadmap and long-term plans relative to linked open data.

10:50 39 PubChem: A platform to archive and share scientific information

Evan Bolton, bolton@ncbi.nlm.nih.gov, National Center for Biotechnology Information, Bethesda, MD 20894, United States

As an open archive, PubChem accepts substance sample descriptions, experimental results, textual annotations, and links to other on-line resources. PubChem incorporates and integrates data not only between contributors but also to the information content available at the US National Center for Biotechnology (NCBI), which includes PubMed and GenBank. In addition, PubChem provides a variety of tools to help researchers explore and analyze relevant information. This talk will help demonstrate how chemists can leverage the PubChem resource to archive and share their data with the community at large.

11:10 40 Keeping the thrill alive: Data on demand

Berkley A Lynch1, berkley.lynch@dotmatics.com, Tamsin E Mansley1, Philip Mounteney2, Sharang Phatak2, Jess W Sagar2. (1) Dotmatics, Inc., Woburn, MA 01801, United States, (2) Dotmatics, Inc., San Diego, CA 92121, United States

Continued access to research data after the original scientist has moved on, or after a project has wrapped up, is vital for any organization. This data is critical for publications or IP applications and yet it is often stored in disparate, sometimes inaccessible, places: e.g. a paper notebook on a PI's bookshelf, or multiple folders in scattered PC or network locations. Without guidance from the original scientist it is difficult to locate a specific piece of data. This situation is most prevalent in academic institutions where there is constant turnover of research staff, as students graduate and move on, as well as in short term industry partnerships with academia and CROs. We will present solutions that enable the secure and audited storage, retrieval, and sharing of data of all types between research partners, collaborators and customers, locally or across the globe. These solutions provide instant communication through the cloud or local network, enabling and enriching collaborative research, and providing continued access to original data long after the original researcher has moved on.

11:30 Panel Discussion

Monday, March 17, 2014

Keeping the Thrill Alive: Research Data and Electronic Notebooks - PM Session
eLab Notebooks

Omni Dallas Hotel
Room: Deep Ellum A
Cosponsored by COMP, MEDI, ORGN, PHYS
Leah McEwen, Antony Williams, Jeremy Frey, Simon Coles, Organizers
Simon Coles, Presiding
1:30 pm - 4:50 pm
1:30 Introductory Remarks
1:35 41 Building a mobile reaction lab notebook

Alex M Clark, aclark@molmatinf.com, R&D, Molecular Materials Informatics, Montreal, QC H3J2S1, Canada

Electronic Lab Notebook (ELN) software is a burgeoning field, with applicability to both industrial and academic research environments. Numerous products have been launched, tailored for large enterprises or small research groups. Many of these products have an interface based on web technologies or native mobile apps. This presentation will describe the creation of a mobile interface specific to one major group of lab notebook users: synthetic chemists. A reaction-centric app can be designed to provide an effective user experience for specifying the scheme and experimental details of chemical reactions, and making them digitally available. By combining the user-facing mobile app with cloud-hosted algorithms and centralized databases, a large number of supporting features and reference data can be made available; special attention will be paid to green chemistry metrics. The core technologies for this product have already been demonstrated in apps such as the Mobile Molecular DataSheet (MMDS), Reaction101, Yield101, MolSync and Green Solvents. The means by which these are assembled into a fully functional reaction lab notebook will be described.

1:55 42 Generating metadata for an experiment: Using a tablet ELN

Cerys Willoughby, cerys.willoughby@me.com, Jeremy G Frey, Simon J Coles, Susanne Coles. Department of Chemistry, University of Southampton, Southampton, Hampshire SO17 1BJ, United Kingdom

The drive towards more transparency in research and open data increases the importance of being able to find information and make links to the data. Metadata is an essential ingredient for facilitating this discovery. We have found from our analysis of metadata usage in the LabTrove ELN that many users do not use metadata effectively, and in many cases do not understand what it is or why it is useful. LabTrove provides a flexible metadata framework enabling the creation of two kinds of user-defined metadata: sections, used to describe the content of entries, and key-value pairs, useful for describing specific elements of an experiment. When we developed Notelus, a prototype tablet ELN able to integrate with LabTrove, we wanted to make the best use of LabTrove's metadata capabilities by giving users the opportunity to add useful metadata simply and easily. Part of the design process for Notelus was defining structures to represent an experiment and an associated plan, based upon our experiences of how researchers plan, record, and organize their experiment data. Metadata is then generated automatically from information the user provides about the experiment, from the experiment plan if one is used, and from basic metadata such as author, date and time of creation, and the name of the notebook. The result is metadata associated with the experiment record in LabTrove that represents the kinds of information researchers use and search for, including information they have typically not considered when creating their own metadata manually.
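Automatic metadata generation of the kind described might be sketched as follows; the schema, field names, and `plan_` prefix are hypothetical illustrations, not LabTrove's or Notelus's actual formats.

```python
from datetime import datetime, timezone

def experiment_metadata(notebook, author, plan):
    """Assemble key-value metadata for an experiment record automatically,
    combining basic record metadata with fields promoted from the plan."""
    meta = {
        "notebook": notebook,
        "author": author,
        "created": datetime.now(timezone.utc).isoformat(),
    }
    # Promote fields from the experiment plan into searchable key-value pairs.
    meta.update({"plan_" + key: value for key, value in plan.items()})
    return meta

meta = experiment_metadata(
    notebook="Synthesis notebook",
    author="A. Researcher",
    plan={"technique": "NMR", "solvent": "CDCl3"})
```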

2:15 43 Mining ELN based structured test and chemical property data to optimize catalyst development

Philip J Skinner, philip.skinner@perkinelmer.com, Joshua A Bishop, Josh.Bishop@PERKINELMER.COM, Rudy Potenzone, Megean Schoenberg. PerkinElmer, Waltham, Massachusetts 02451, United States

Understanding the relationship between chemical structure, chemical properties, and assays was once the purview of the life sciences, where systems and workflows have been developed to manage the flow of information and to inform decisions. Industrial chemists deal with fewer unique materials but are now equally interested in correlating materials with the structured numerical results of tests and assays. Optimizing the physical performance of the materials is achieved through composition changes and measured through particle size distributions, rheological assays, and other tests.
This presentation will describe the outcome of a series of proofs of concept for the optimization of chemical catalysts. In these examples the ELN was used, above and beyond its traditional record-keeping role, to standardize the collection of structured numerical test data. In each case the correlated multi-test data were available for interrogation through Spotfire in order to optimize the physical characteristics of the material.

2:35 44 Sample management with the LabTrove ELN

Jeremy G Frey, j.g.frey@soton.ac.uk, Simon J Coles, Andrew J Milsted, Cerys Willoughby, Colin L Bird. Chemistry, University of Southampton, Southampton, Hampshire SO17 1BJ, United Kingdom

A significant advantage of the Electronic Laboratory Notebook (ELN) over its more traditional rival, the paper notebook, is the ability of the ELN to retain the journal characteristics of traditional notebooks while exploiting the potential for linking together procedures, materials, samples, observations, data, and analysis reports. The ability to link individual experiment records with the materials and samples used is particularly important for establishing unequivocally the provenance of a given procedure. Similarly, information relating to an instrument used to make critical measurements might also be a significant aspect of the provenance chain. LabTrove, a web-based ELN developed at the University of Southampton, provides a range of higher-level linking options that maintain the relationships between notebook entries and data objects. We report case studies that illustrate the application of the high-level linking features of LabTrove to sample management:
- Keeping track of separate samples and demonstrating reproducibility between batches of sample;
- Maintaining links between samples, experiment data, derived data, and subsequent publications;
- Using a combination of URIs and barcodes to provide both ready identification of specific items and a simple sample management system, with no additional effort.

2:55 Intermission
3:10 45 Digital data repositories in chemistry and their integration with journals and electronic laboratory notebooks

Henry S Rzepa, rzepa@ic.ac.uk, Matt S Harvey, Nick Mason. Chemistry, Imperial College London, London, London SW 2AZ, United Kingdom

A web-based electronic laboratory notebook system for organising and conducting computational experiments and managing the associated experimental data is described. This system manages the lifecycle of an experiment from instantiation, through execution on high-performance computing resources, to data retrieval and deposition of final results and metadata in a digital repository for long-term curation. We demonstrate integration with three different third-party repositories (SPECTRa-DSpace, Figshare, and Chempound). Each of these provides citable access to its contents using persistent identifiers, following the principles of the Amsterdam Manifesto on data citation. Examples illustrating the two-publisher model of the scientific article are presented, in which the data object is cited via its persistent identifier (DOI) and presented as either a fileset or an interactive rendering of the data and transcluded into the narrative component of a publication. We demonstrate recent advances in our use of Handle identifiers (the underlying infrastructure of the DOI system) to expose dataset metadata using specific handle types, in conjunction with web service resolution of Handles/DOIs to structured data. This allows automated discovery and mining of datasets and thus enables, for example, the dynamic generation of complex, interactive views that could be integrated within e-lab notebooks.

3:30 46 Data exchange between electronic lab notebooks and data repositories

Rory Macneil, rmacneil@researchspace.com, Research Space, Edinburgh, United Kingdom

3:50 47 Standardized representations of ELN reactions for categorization and duplicate/variation identification

Roger A Sayle, roger@nextmovesoftware.com, Daniel M Lowe. NextMove Software, Cambridge, United Kingdom

Electronic lab notebooks (ELNs) can be considered experiment- or document-centric databases, where each record corresponds to a unique experiment. In an ELN there is no intrinsic notion of a "duplicate"; each experiment is represented by its own notebook page. Registration systems and more traditional reaction databases, on the other hand, are connection-table-centric, with duplicates identified and unique items assigned their own identifier or key. Populating reaction databases from (pharmaceutical) ELNs therefore requires an operational definition of "same reaction". Here we describe approaches and challenges to identifying reaction "variations" in large reaction databases. Amongst the techniques considered is the use of the proposed IUPAC reaction InChI identifier, or RInChI.
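The operational definition of "same reaction" discussed above can be sketched as an order-insensitive canonical key per reaction. This is a simplified stand-in for a real RInChI implementation, not NextMove's actual method: the `reaction_key` and `deduplicate` helpers and the SMILES strings are illustrative assumptions, and a production system would canonicalize each component structure before keying.

```python
def reaction_key(reactants, products, agents=()):
    """Build an order-insensitive key for a reaction.

    Each argument is an iterable of canonical molecule strings
    (e.g. canonical SMILES or InChI). Sorting makes the key
    independent of the order in which components were recorded.
    """
    return (tuple(sorted(reactants)),
            tuple(sorted(products)),
            tuple(sorted(agents)))

def deduplicate(reactions):
    """Group ELN notebook pages that describe the same transformation."""
    groups = {}
    for rec in reactions:
        key = reaction_key(rec["reactants"], rec["products"],
                           rec.get("agents", ()))
        groups.setdefault(key, []).append(rec["page"])
    return groups

# Two notebook pages recording the same esterification (components listed
# in different order) collapse to one entry; a variation with a different
# agent stays separate.
eln = [
    {"page": "NB1-001", "reactants": ["CCO", "CC(=O)O"],
     "products": ["CC(=O)OCC"], "agents": ["OS(=O)(=O)O"]},
    {"page": "NB1-014", "reactants": ["CC(=O)O", "CCO"],
     "products": ["CC(=O)OCC"], "agents": ["OS(=O)(=O)O"]},
    {"page": "NB2-007", "reactants": ["CCO", "CC(=O)O"],
     "products": ["CC(=O)OCC"], "agents": ["Cl"]},
]
```

Grouping rather than discarding duplicates preserves the ELN's page-per-experiment record while giving the reaction database a single entry per transformation.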

4:10 48 Extracting data, information, and knowledge from an ELN

Colin L Bird1, colinl.bird@soton.ac.uk, Simon J Coles1, Jeremy G Frey1, Richard J Whitby1, Aileen E Day2. (1) Chemistry, University of Southampton, Southampton, Hampshire SO17 1BJ, United Kingdom, (2) Royal Society of Chemistry, Cambridge, Cambridgeshire CB4 0WF, United Kingdom

Electronic Laboratory Notebooks (ELNs) are used routinely to capture and preserve procedures, materials, samples, observations, data, and analysis reports. ELNs also retain the journal characteristics of traditional paper notebooks, thus enabling the capture of thoughts and deliberations. ELNs contain a wealth of data, information, and knowledge that is organised internally but is not necessarily readily accessible for external reuse. Dial-a-Molecule is a UK Grand Challenge Network that promotes research aimed at bringing about a step change in our ability to deliver molecules quickly and efficiently: How can we make molecules in days not years? To realise this vision, it will be essential to exploit the vast body of currently inaccessible chemical data and information held in ELNs, not only to make that data and information available but also to develop protocols for discovery, access, and ultimately automatic processing. The data-information-knowledge-wisdom (DIKW) hierarchy is often represented as a pyramid, with data at the base. Identifying relationships places data “in formation”, and patterns lead to “actionable information”, which we usually think of as knowledge. We report research that uses the knowledge layer as the entry point to the DIKW pyramid as embodied within ELNs. Core metadata identifying the material being made available and how it might be obtained is published as an elnItemManifest, which is an XML file. If the material is potentially useful, another team can use the contact information in the elnItemManifest to obtain fuller contextual information – data “in formation” – held within the ELN. To reuse the material, the other team would request the detailed metadata that describes the data itself. The Dial-a-Molecule approach regards this detail as the third layer of a three-tier model that corresponds to the DIKW hierarchy. If the vision is realised, chemists will have acquired the wisdom to make molecules in days not years.
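The elnItemManifest described above is an XML file carrying core metadata about a material and how to obtain it. A rough sketch of the idea using Python's standard ElementTree follows; the element names and example values are illustrative guesses, not the published elnItemManifest schema.

```python
import xml.etree.ElementTree as ET

def build_manifest(title, maker, contact, eln_uri):
    """Assemble a minimal manifest: what the material is, who made it,
    whom to contact, and where the fuller ELN record lives.

    All element names here are hypothetical placeholders.
    """
    root = ET.Element("elnItemManifest")
    ET.SubElement(root, "title").text = title
    ET.SubElement(root, "maker").text = maker
    ET.SubElement(root, "contact").text = contact
    ET.SubElement(root, "source").text = eln_uri
    return ET.tostring(root, encoding="unicode")

xml = build_manifest(
    "4-methylumbelliferone, batch 3",
    "Example Lab, University of Southampton",
    "researcher@example.ac.uk",
    "http://eln.example.org/notebook/1234",
)
```

Publishing only this thin manifest, with contextual and detailed metadata held back in the ELN until requested, mirrors the three-tier model described in the abstract.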

4:30 Panel Discussion

Monday, March 17, 2014

Sci-Mix - EVE Session

Dallas Convention Center
Room: Hall F

Jeremy Garritano, Erin Bolstad, Organizers
8:00 pm - 10:00 pm

2 Ontology-driven information system for chemical and materials science

Nico Adams1, nico.adams@csiro.au, Murray Jensen1, Danielle Kennedy1, Cornelius Kloppers2, Yanfeng Shu2, Claire D'Este2, Craig Lindley2. (1) Materials Science and Engineering, CSIRO, Clayton, Victoria 3150, Australia, (2) Computational Informatics, CSIRO, Hobart, Tasmania 7001, Australia

Standard information systems for the chemical and materials sciences are almost entirely predicated on the notion of chemical structure and composition as the unique criterion of identity, which also encodes properties. However, the properties of many extended materials, such as polymers (including coordination polymers) are not comprehensible on the basis of their chemical structure and composition alone – rather they are often determined by the provenance and processing history of a material. The melting points of polymers, for example, can be significantly changed through a simple post-processing step such as compounding – this changes property-determining intermolecular interactions rather than fundamental chemistry and as such is not encoded by simple structure representations. This, in turn, means that chemical and process descriptions need to be interoperable and associable with chemical objects. Ontologies map particularly well onto the problem of developing integrated chemical and process representation of entities and can be used to develop complete provenance traces of materials. In this talk, we will discuss our approach to developing such representations and how they can be leveraged for multiple purposes in materials information systems.


8 Trends in bio-based chemicals: Business intelligence from published literature

Steve M Watson, s.watson@elsevier.com, Alternative Energy, Elsevier, Amsterdam, North Holland 1043NX, The Netherlands

A new information solution, Elsevier Biofuel, has been used to review the literature landscape, revealing trends in R&D towards the production of valuable chemicals from biomass. Elsevier Biofuel comprises advanced search and analysis tools, using a domain-specific taxonomy to automatically classify over 21 million documents, many in full text, covering relevant journal publications, patents, technical reports, conference proceedings, and trade publications. The analysis highlights emerging technology areas, commercial opportunities, and the leading companies staking a claim to this space.


18 Creating context for the experiment record: User-defined metadata

Cerys Willoughby, cerys.willoughby@me.com, Jeremy G Frey, Simon J Coles, Colin L Bird. Department of Chemistry, University of Southampton, Southampton, Hampshire SO17 1BJ, United Kingdom

The drive towards more transparency in research and open data increases the importance of being able to find information and make links to the data. Metadata is an essential ingredient for facilitating discovery and is used in Electronic Laboratory Notebooks to curate experiment data and associated entries with descriptive information and classification labels that can be used for aggregation and identification. Machine-generated metadata helps with facilitating metadata exchange and enabling interoperability, but such metadata is not necessarily in a form friendly for the humans that also need it. A survey of metadata usage in an ELN developed at the University of Southampton indicates many users do not use metadata effectively. Whilst some groups are comfortable with metadata and are able to design a metadata structure that works effectively, many users have no knowledge of where to start to define metadata or even an understanding of what it is and why it is useful. The metadata used within the notebooks is dominated by a few categories, in particular materials, data formats, and instruments. Further investigation is under way to determine whether this pattern of metadata use is common in other online environments, whether users are more likely to create certain types of metadata, and whether lessons can be learned from other environments to encourage metadata use. These findings will contribute to strategies for encouraging and improving metadata use in ELNs such as improved interface designs, user education, standard schema designs, and encouraging collaboration between same-discipline groups to promote consistency and best practices.


20 Development of formal representations of the synthesis and processing histories of metal-organic frameworks (MOFs) using the ChemAxiom, ChEBI, CMO, and CHEMINF ontologies

Nico Adams1, Nico.Adams@csiro.au, Danielle Kennedy1, Murray Jensen1, Cornelius Kloppers2, Yanfeng Shu2, Claire D'Este2, Craig Lindley2. (1) Materials Science and Engineering, CSIRO, Clayton, Victoria 3168, Australia, (2) Computational Informatics, CSIRO, Hobart, Tasmania 7001, Australia

Metal-organic frameworks – a subset of coordination polymers – are currently a hot topic in chemical and materials research. They have some of the highest recorded surface areas and are finding applications in gas storage, catalysis, and biosensing. Significant repositories of both physical MOF collections and data about MOFs (whether experimentally determined or computed) are currently being assembled by the research community, and informatics systems for the retrieval of appropriate information associated with these compounds are required. However, informatics issues associated with MOFs remain to be resolved. For example, there are currently no internationally accepted naming conventions or topology descriptions, despite a recently published first recommendation by IUPAC. The formal and ontology-driven representation of MOFs, their structural features, and their synthetic histories will go some way towards overcoming challenges in information retrieval associated with these compounds and may also facilitate greater MOF data interoperability and re-use. This talk will describe our approach to the development of such representations and will illustrate how these representations can be leveraged in a number of application contexts.


21 New chemistry and powerful interactive technologies to discover PPI antagonists

Carlos J. Camacho, ccamacho@pitt.edu, Computational and Systems Biology, University of Pittsburgh, Pittsburgh, PA 15260, United States

Although there is no shortage of potential protein-protein interaction (PPI) drug targets, only a handful of known low-molecular-weight inhibitors exist. One problem is that current efforts are dominated by low-yield high-throughput screening, whose rigid framework is not suitable for the diverse chemotypes present in PPIs. Here, I will describe recent progress in our efforts to develop open-access, interactive (i.e., real-time) web-based drug discovery technologies. Our goal is to bring knowledge into the virtual screening pipeline by developing tools that create synergy between chemists, biologists, and other experts to deliver (ant)agonists for hard targets. The pharmacophore-based technologies build on the role that anchor residues, or deeply buried hot spots, play in the molecular recognition of PPIs. Novel tera-chemistry that redesigns these entry points with anchor-biased virtual multicomponent reactions delivers tens of millions of readily synthesizable compounds. Application of this approach to the MDM2/p53 cancer target led to high hit rates, resulting in a large and diverse set of confirmed inhibitors, and co-crystal structures validate the design strategy. Our unique technologies promise to expand the development of novel chemical probes for cancer research and the exploration of the human interactome by leveraging in-house small-scale assays and user-friendly chemistry to rationally design ligands for PPIs with known structure.


24 Computational analysis of pleckstrin homology (PH) domains for cancer drug development

Shuxing Zhang, shuxing@gmail.com, Department of Experimental Therapeutics, MD Anderson Cancer Center, Houston, TX 77054, United States

The pleckstrin homology (PH) domain is a protein domain of approximately 120 residues that occurs in a variety of proteins involved in intracellular signaling or as constituents of the cytoskeleton. Through interactions with phosphatidylinositol lipids and other proteins, the PH domain plays a critical role in recruiting oncogene proteins (e.g., Akt) to membranes for activation, thus contributing to cancer growth. In the present study, we attempt to understand PH domain structures and their genomic signatures, which may help us design specific inhibitors for targeted cancer therapies. PH domains have low sequence identity, usually below 30%. To date, over 30 PH domain structures have been determined. Analysis of these structures has demonstrated that the 3D fold of PH domains is highly conserved, with an N-terminal α-helix followed by seven β-strands. The loops connecting these β-strands can differ significantly in length, and this may provide the source of the domain's specificity. Indeed, individual PH domains possess affinities and specificities for different phosphoinositides. Based on our structural analysis, we implemented position-specific scoring matrices (PSSMs), which have been used to determine the secondary structures of query sequences. Additionally, we employed a phylogenetic tree of PH domains, constructed from the PSSMs, to classify the PH domains; our observations agree with their known binding specificities for different phosphoinositides, thereby enabling the design of selective inhibitors. This strategy has been rigorously cross-validated with our PH domain structure data set, and it was also successfully applied to the prediction of the GAB1 PH domain structure, followed by the discovery of potent GAB1 inhibitors that kill cancer cells.


26 Sharing methods to build predictive machine learning models for neglected and rare disease drug discovery

Paul J Kowalczyk, pauljkowalczyk@gmail.com, Department of Computational Chemistry, SCYNEXIS, Research Triangle Park, NC 27709-2878, United States

We present – and share – sets of 'better practices' for eight machine learning methods, focusing on classification: recursive partitioning, random forests, neural networks, partial least squares, support vector machines, k-nearest neighbors, self-organizing maps, and naïve Bayes. Various tuning parameters for each method are analyzed, e.g. the dimensions and topology of a self-organizing map; the depth of trees for partitioning methods; the number of nodes/hidden nodes in a neural network; the choice of kernel. The choice of molecular descriptors has also been studied, including topological descriptors (e.g., atom pairs), circular fingerprints (e.g., ECFP/FCFP), and constitutive fingerprints (e.g., MDL keys). Metrics for model performance include AUC, sensitivity, specificity, and Cohen's kappa. The success and utility of these 'better practices' are demonstrated using publicly available antimalarial datasets. Each data mining effort is collected into a compendium – an interactive document that bundles primary data, statistical methods, figures and derived data together with textual documentation and conclusions. This interactivity allows one to reproduce the research, and modify and extend the various components. We show how the compendia might be used for neglected and rare disease drug discovery and, additionally, how they might serve as tutorials for data mining. Specifically, we demonstrate how one might use the compendia to build predictive machine learning models for any dataset. The open-source R software environment is used for all data mining tasks. All text, code, data and auxiliary content will be made freely available.
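Among the performance metrics listed above, Cohen's kappa corrects the observed agreement between predicted and true classes for the agreement expected by chance. A minimal worked example in pure Python; the confusion-matrix counts are invented for illustration.

```python
def cohens_kappa(tp, fp, fn, tn):
    """Cohen's kappa for a binary classifier from confusion-matrix counts."""
    n = tp + fp + fn + tn
    observed = (tp + tn) / n                       # observed agreement
    # Expected agreement by chance, from the marginal totals.
    p_yes = ((tp + fn) / n) * ((tp + fp) / n)
    p_no = ((fp + tn) / n) * ((fn + tn) / n)
    expected = p_yes + p_no
    return (observed - expected) / (1 - expected)

def sensitivity(tp, fn):
    """Fraction of true actives recovered."""
    return tp / (tp + fn)

def specificity(tn, fp):
    """Fraction of true inactives correctly rejected."""
    return tn / (tn + fp)

# Hypothetical antimalarial screen: 80 actives found, 20 missed,
# 30 false alarms, 870 true inactives.
kappa = cohens_kappa(tp=80, fp=30, fn=20, tn=870)  # ≈ 0.734
```

Kappa is useful here precisely because screening datasets are imbalanced: raw accuracy for this example is 95%, but much of that agreement would occur by chance alone.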


29 Efficacy of chemical hyperstructures in similarity searching and virtual screening

Edmund Duesbury, lip12ed@sheffield.ac.uk, John Holliday, Peter Willett. Information School, University of Sheffield, Sheffield, South Yorkshire S1 4DP, United Kingdom

Two techniques exist in data fusion which have been proven to work in various forms in chemoinformatics: similarity fusion, where different similarity measures are combined; and group fusion, where similarities are combined from multiple reference molecules. The hyperstructure concept, however, is another form of data fusion, being a hypothetical molecule that is constructed from the overlap of a set of existing molecules. Initially proposed to reduce the time of database searching, it has also been used directly for virtual screening on two occasions since its inception [1,2], the latter of which showed it to be useful as a 2-dimensional QSAR method. The concept's performance in 2-dimensional similarity searching has to date not been shown to be effective, however, and has not been evaluated thoroughly on large sets of compounds. The work being carried out in this project aims to evaluate hyperstructures as an alternative (if not superior) method for fusion-based similarity searching, with an emphasis on virtual screening. Current progress on the project will be discussed, including a brief overview of how hyperstructures are constructed, evaluated for virtual screening, and compared with existing search methods. Of particular interest will be a comparison with existing data fusion methods. Results in this work show that the hyperstructure concept is not as effective for virtual screening as group fusion using ECFP4 fingerprints in terms of numbers of actives retrieved, but retrieves a greater diversity of molecules. This suggests that the two approaches are complementary, and that it may be beneficial to apply similarity fusion to the two techniques to improve virtual screening. References: [1] Brown, N. PhD Thesis, University of Sheffield, 2002. [2] Palyulin, Radchenko, & Zefirov (2000). Molecular field topology analysis method in QSAR studies of organic compounds. J. Chem. Inf. Comput. Sci., 40, 659–667.
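Group fusion, against which the hyperstructures above are compared, scores each database molecule by combining its similarities to several reference actives, typically keeping the maximum. A minimal sketch with fingerprints modelled as Python sets of on-bit indices; the toy bit patterns stand in for real ECFP4 features and are invented for illustration.

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto coefficient between two fingerprints (sets of on-bits)."""
    if not fp_a and not fp_b:
        return 0.0
    return len(fp_a & fp_b) / len(fp_a | fp_b)

def group_fusion_max(reference_fps, database_fps):
    """MAX group fusion: score each database molecule by its best
    similarity to any reference, then rank high to low."""
    scores = [max(tanimoto(db, ref) for ref in reference_fps)
              for db in database_fps]
    ranked = sorted(range(len(database_fps)),
                    key=lambda i: scores[i], reverse=True)
    return ranked, scores

# Two reference actives and three database molecules.
refs = [{1, 2, 3, 4}, {10, 11, 12}]
db = [{1, 2, 3, 9},       # resembles the first reference
      {10, 11, 12, 13},   # resembles the second reference
      {20, 21}]           # resembles neither
ranked, scores = group_fusion_max(refs, db)
```

Because each molecule only needs to match one reference well, MAX fusion rewards actives from any of the chemotypes represented in the reference set, which is why it is a strong baseline for the comparison described above.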


30 3D-QSAR using quantum-mechanics-based molecular interaction fields

Ahmed El Kerdawy1, Ahmed.Elkerdawy@fau.de, Stefan Güssregen2, Hans Matter2, Matthias Hennemann1,3, Timothy Clark1,3,4. (1) Computer-Chemistry-Center, Friedrich-Alexander University Erlangen-Nürnberg, Erlangen, Germany, (2) R&D, LGCR, Structure, Design and Informatics, Sanofi-Aventis Deutschland GmbH, Frankfurt am Main, Germany, (3) Interdisciplinary Center for Molecular Materials, Friedrich-Alexander University Erlangen-Nürnberg, Erlangen, Germany, (4) Centre for Molecular Design, University of Portsmouth, Portsmouth, United Kingdom

The natural evolution of the different computer-aided drug design (CADD) methods involves a shift toward using quantum-mechanics (QM)-based approaches. This shift is a result not only of ever-growing computational power but also of the need for approaches that describe molecular properties and binding characteristics more accurately and more informatively than those currently available. QM approaches do not suffer from the limitations inherent in the ball-and-spring description and the fixed atom-centered charge approximation of the classical force fields mostly used by CADD methods. In this project we introduce a protocol for shifting 3D-QSAR, one of the most widely used ligand-based drug design approaches, to QM-based molecular interaction fields (MIFs) – the electron density (ρ), hydrogen-bond donor field (HDF), hydrogen-bond acceptor field (HAF), and molecular lipophilicity potential (MLP) – to overcome the limitations of the current force-field-based MIFs. The average performance of the QM-MIF (QMFA) models for nine data sets was found to be better than that of the conventional force-field-based MIF models. In the individual data sets, the QMFA models always perform better than, or as well as, the conventional approaches. It is particularly encouraging that the relative performance of the QMFA models improves in external validation.


32 Metal template approach towards efficiency enhancement in hydrogen-bond promoted enantioselective organocatalysis

Tathagata Mukherjee, tathagata.mukherjee@chem.tamu.edu, John A. Gladysz. Department of Chemistry, Texas A&M University, College Station, Texas 77840, United States

The concept of preorganization involves engineering a receptor to be complementary to a guest prior to a binding event. This can render the host-guest interaction entropically and enthalpically more favorable. This notion can be extended to chiral hydrogen bond donors, which are immensely popular in enantioselective organocatalysis. To test the effect of "metal-templated efficiency enhancement in organocatalysis", we have chosen a 2-guanidinobenzimidazole derivative (GBI) as a simple hydrogen bond donor. Upon complexation to transition metals, GBI becomes preorganized for several hydrogen bonding motifs. With preorganization, the rotational degrees of freedom intrinsic to the ligand are greatly reduced, and both the reactivity and the enantioselectivity are enhanced significantly.

The generality of the catalytic behavior was established by extending the Michael addition reaction to various dicarbonyls and different nitroolefins. Exceeding expectations, the "ruthenium-templated" catalyst showed good enantioselectivity even with alkyl nitroolefins, which are notorious for poor enantioselectivity. These results support combining "organocatalysis" with "metal-templated efficiency enhancement", as the title suggests. This new class of hybrid (organic-inorganic) hydrogen bond donors is unique and has excellent potential to open up a new branch of catalysis.


33 Profiling common types of research data produced by chemists at the University of Michigan

Ye Li, liye@umich.edu, Shapiro Science Library, University of Michigan, Ann Arbor, Michigan 48109, United States

Best practices in data management and metadata standards for chemistry data are still mostly under development, with the exception of areas related to mature “big data” fields such as geochemistry. In order to support data sharing in chemistry research communities, our first step is to identify the common types of research data produced by chemists. These types of data can be grouped according to research themes meaningful to a specific research community to form the Data Type Profile (DTP) of that community. For example, for organic synthesis, the DTP can include chemical structures, reaction schemes, physical and chemical properties, spectral data, thermodynamic data of reactions, chromatographic data, and crystallographic data, along with the specific types of data under these categories. The basic DTP can be enriched with relevant data lifecycle stories to form a comprehensive DTP, which can then be used to propose metadata standards and best practices for managing and sharing data. Here, as a case study, publications authored by principal investigators (PIs) in the Department of Chemistry at the University of Michigan were retrieved from Web of Science and grouped by PI. Journal articles published during the past two years were selected as the main reference set. The data originating from these labs and appearing in the texts, tables, figures, captions, and supplements of the articles were identified, described, and categorized. The descriptions were collected in a FileMaker database and grouped under different research themes, so that we could construct a DTP for each research theme rather than for traditional sub-disciplines. By completing DTPs for the research themes in the Chemistry Department, we will obtain a representative scope of the common types of data of interest to chemists and use it as the foundation for facilitating data sharing in chemistry.


34 Distributing, managing, and updating cheminformatics experiments

Paul J Kowalczyk, pauljkowalczyk@gmail.com, Department of Computational Chemistry, SCYNEXIS, Research Triangle Park, NC 27709-2878, United States

We demonstrate how one might report cheminformatics experiments as instances of reproducible research, i.e., how one might author and distribute integrated dynamic documents that contain the text, code, data and any auxiliary content needed to recreate the computational results. We show how the contents of these documents, including figures and tables, can be recalculated each time the document is generated. This integration of computational code, method description and data allows researchers the opportunity to both verify the published research and adapt/extend the methods presented. Open-source tools are used for all document generation: the R software environment is used to process chemical structures and mine and analyze biological and chemical data; the knitr package is used to generate reports (PDF); the markdown package is used to generate valid (X)HTML content; and the beamer package is used to create slides for presentation. Specific examples are presented for the visualization, analysis and mining of publicly available antimalarial datasets, with particular attention paid to automatically generating PDF reports, slides for presentations and valid (X)HTML content. All text, code, data and auxiliary content will be made freely available.


35 Dark reaction project: Archiving and deriving value from unreported "failed" hydrothermal synthesis reactions

Joshua Schrier1, jschrier@haverford.edu, Sorelle Friedler2, Alexander Norquist1. (1) Department of Chemistry, Haverford College, Haverford, PA 19041, United States, (2) Department of Computer Science, Haverford College, Haverford, PA 19041, United States

Most chemical reactions that have been performed are deemed "unsuccessful" and are never reported in the literature. There is no forum for collecting these "dark reactions", nor a means for deriving value from them, but they are nevertheless valuable because they define the bounds on the reaction conditions needed to successfully synthesize a product. Moreover, this vast dataset is currently languishing in old laboratory notebooks. In this talk, we will describe our work on creating a searchable public online repository for reaction data that enables better management, sharing, and utilization of these dark reactions. Our initial efforts focus on the hydrothermal synthesis of organically-templated inorganic solids, as just a few reactants (one or two inorganic components, one or two organic components, and solvent) and a few reaction conditions (pH, temperature, reaction time) yield a diversity of products. We will describe the types of data we collect and the overall architecture of our database software. We will demonstrate the user interface for entering new reactions and searching existing reactions. We will discuss the potential for our (open-source) software to be adapted and utilized for other types of chemical synthesis data-management applications. Finally, we will describe our progress in using the "dark reaction" dataset to accelerate exploratory synthesis by using machine learning. We used cheminformatics software to compute derived properties of the reagents and trained a decision-tree algorithm to predict the success of new reactions. We then used this model to perform virtual screening of commercially available reagents. We will discuss our preliminary results in experimentally validating these predictions.
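The machine-learning step described above, training a decision tree on reagent properties to predict reaction success, can be illustrated with a single-split decision "stump" in pure Python. The temperature/pH features, thresholds, and success labels are invented for illustration; the actual project uses richer cheminformatics descriptors and a full decision-tree learner.

```python
def best_stump(samples):
    """Pick the (feature, threshold) split minimizing misclassifications.

    samples: list of (features_dict, succeeded: bool) pairs.
    The stump predicts success when feature value >= threshold.
    """
    best = None
    for feature in samples[0][0]:
        candidates = sorted({feats[feature] for feats, _ in samples})
        for threshold in candidates:
            errors = sum((feats[feature] >= threshold) != ok
                         for feats, ok in samples)
            if best is None or errors < best[0]:
                best = (errors, feature, threshold)
    return best[1], best[2]

def predict(stump, features):
    """Predict success for a candidate reaction."""
    feature, threshold = stump
    return features[feature] >= threshold

# Toy hydrothermal-synthesis records: temperature (deg C), pH, outcome.
history = [
    ({"temp": 90, "pH": 4}, False),
    ({"temp": 110, "pH": 7}, False),
    ({"temp": 150, "pH": 6}, True),
    ({"temp": 170, "pH": 8}, True),
]
stump = best_stump(history)   # learns the temp >= 150 boundary
```

Even this toy shows why "failed" reactions matter: the unsuccessful records are exactly what pins down the lower bound of the successful condition region.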


36 Factors to consider when choosing the right ELN for capturing and collaborating with your research data

Philip Mounteney1, pm@dotmatics.com, Berkley A Lynch2, Tamsin E Mansley2, Sharang Phatak1, Jess W Sager1. (1) Dotmatics, Inc., San Diego, CA 92121, United States, (2) Dotmatics, Inc., Woburn, MA 01801, United States

Chemistry ELNs (electronic lab notebooks) have been evolving for years, and are becoming widely deployed. However, the structured format of the chemistry ELN does not satisfy the needs of other research groups. We will present a simple, easy-to-use, web-based electronic laboratory notebook that contains interfaces for chemists, biologists and other scientific disciplines in a fully compliant environment with audit trails. This is coupled with robust and flexible data searching capabilities that promote data sharing, collaboration, easy report generation and knowledge transfer. The web-based interface and cloud-based deployment described facilitate secure global and mobile access from a variety of devices in the lab, at the desk, across research/partner sites or on-the-go, particularly as the paradigm of BYOD (Bring Your Own Device) becomes more prevalent.


40 Keeping the thrill alive: Data on demand

Berkley A Lynch1, berkley.lynch@dotmatics.com, Tamsin E Mansley1, Philip Mounteney2, Sharang Phatak2, Jess W Sagar2. (1) Dotmatics, Inc., Woburn, MA 01801, United States, (2) Dotmatics, Inc., San Diego, CA 92121, United States

Continued access to research data after the original scientist has moved on, or after a project has wrapped up, is vital for any organization. This data is critical for publications or IP applications and yet it is often stored in disparate, sometimes inaccessible, places: e.g. a paper notebook on a PI's bookshelf, or multiple folders in scattered PC or network locations. Without guidance from the original scientist it is difficult to locate a specific piece of data. This situation is most prevalent in academic institutions where there is constant turnover of research staff, as students graduate and move on, as well as in short term industry partnerships with academia and CROs. We will present solutions that enable the secure and audited storage, retrieval, and sharing of data of all types between research partners, collaborators and customers, locally or across the globe. These solutions provide instant communication through the cloud or local network, enabling and enriching collaborative research, and providing continued access to original data long after the original researcher has moved on.


42 Generating metadata for an experiment: Using a tablet ELN

Cerys Willoughby, cerys.willoughby@me.com, Jeremy G Frey, Simon J Coles, Susanne Coles. Department of Chemistry, University of Southampton, Southampton, Hampshire SO17 1BJ, United Kingdom

The drive towards more transparency in research and open data increases the importance of being able to find information and make links to the data. Metadata is an essential ingredient for facilitating this discovery. We have discovered from our analysis of metadata usage in the LabTrove ELN that many users do not use metadata effectively and in many cases do not understand what it is or why it is useful. LabTrove provides a flexible metadata framework, enabling the creation of two different kinds of user-defined metadata: sections used to describe the content of the entries, and key-value pair data useful for describing specific elements of the experiment. When we developed a prototype tablet ELN called Notelus, able to integrate with LabTrove, we wanted to ensure that we could make the best use of the metadata capabilities of LabTrove by providing users with the opportunity to add useful metadata simply and easily. Part of the decision making process for the development of Notelus was defining structures to represent an experiment and an associated plan, based upon our experiences of how researchers planned, recorded, and organized their experiment data. Metadata is then automatically generated based on information provided by the user about the experiment and from the experiment plan, if one is used, in addition to basic metadata such as author, date and time of creation, and name of the notebook. The result is metadata associated with the experiment record in LabTrove that represents the kinds of information that researchers use and search for, including information that they typically have not considered when creating their own metadata manually.
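The automatic metadata generation described above amounts to combining machine-captured basics with key-value pairs drawn from the experiment plan. A minimal sketch of that idea; the field names and `generate_metadata` helper are illustrative assumptions, not the Notelus implementation.

```python
from datetime import datetime, timezone

def generate_metadata(author, notebook, plan=None):
    """Merge automatically captured basic metadata with key-value
    pairs taken from an (optional) experiment plan."""
    metadata = {
        "author": author,                                  # captured, not typed
        "notebook": notebook,
        "created": datetime.now(timezone.utc).isoformat(),
    }
    if plan:
        # Key-value pairs describing specific elements of the experiment.
        metadata.update({f"plan:{key}": value
                         for key, value in plan.items()})
    return metadata

record = generate_metadata(
    "A. Researcher", "Synthesis Notebook 3",
    plan={"material": "caffeine", "instrument": "NMR-400"},
)
```

Generating these fields from the plan, rather than asking users to invent them, sidesteps the finding that many users do not know where to start with metadata.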


46 Data exchange between electronic lab notebooks and data repositories

Rory Macneil, rmacneil@researchspace.com, Research Space, Edinburgh, United Kingdom


49 Stepping through virtual communication into Virtmon

Gwendolyn Tennell1,2, gtennell3669@skymail.susla.edu, Feng Li2. (1) Computer Science, Southern University Shreveport, Shreveport, Louisiana 71107, United States, (2) Computer Information and Graphics Technology, Indiana University Purdue University Indianapolis (IUPUI), Indianapolis, Indiana, United States

While isolation is an important property from a security perspective, virtual machines (VMs) often need to communicate and exchange a considerable amount of data. Research in virtualization technology has focused mainly on increasing the isolation of co-resident virtual machines. The isolation properties of virtualization demand that shared resources are strictly separated. The machine registers are also restricted; virtual machines are therefore forced to fall back on inefficient network emulation for communication. This research is based upon a stealthy way to communicate between virtual machines and virtual machine managers (VMMs) running on the Linux operating system. Virtmon is a paravirtualized virtual machine introspection (PVMI) platform upon which users install and load a group of kernel modules. The Virtmon project uses the intra-to-exo channel to communicate stealthily from the virtual machine to its virtual machine manager, and the exo-to-intra channel to communicate stealthily from the virtual machine manager to the virtual machine, using a shadow region. The shadow region hides any activity between the machines and the monitor, which keeps malware from detecting and hijacking the communication between the two. The unrestricted PVMI framework shifts the challenge from bridging the semantic gap to protecting and hiding the PVMI mechanism. Communication is therefore secure, allowing undetected assistance from a privileged VMM to a VM. The Virtmon project not only allows the VMM to cross communication barriers undetected, but also allows for unrestricted registers, through which more data can be exchanged.


63 Your data in the cloud: Facts and fears

Sharang Phatak1, sharang.phatak@dotmatics.com, Berkley A Lynch2, Tamsin E Mansley2, Philip Mounteney1, Jess W Sager1. (1) Dotmatics, Inc., San Diego, CA 92121, United States, (2) Dotmatics, Inc., Woburn, MA 01801, United States

Modern research teams are often decentralized, consisting of internal and external partnerships across the globe. Ensuring all the scientists involved have up-to-the-minute access to the project data they need is a challenge that can be met by secure cloud-based software deployments. This allows any scientist with an internet connection and appropriate access control to log in to their project web interface to upload new data or review and analyze the current data. We will present solutions and case studies where cloud software deployments are facilitating research and collaboration through the secure exchange and sharing of data of all types between research partners, collaborators and customers, locally or across the globe. Instant communication through the cloud or local network enables and enriches collaborative research. We will also discuss how cloud-based solutions support the increasing requirements for mobile data access with BYOD (Bring Your Own Device).


74 Cheminformatics for dye chemistry research: Bringing online an unprecedented 100,000 sample dye library

David Hinks1, dhinks@ncsu.edu, Nelson Vinueza-Benitez1, David C Muddiman2, Antony J Williams3. (1) Department of Textile Engineering, Chemistry & Science, North Carolina State University, Raleigh, North Carolina 27695, United States, (2) Department of Chemistry, North Carolina State University, Raleigh, North Carolina 27695, United States, (3) Department of eScience, Royal Society of Chemistry, Wake Forest, North Carolina 27587, United States

The synthetic organic chemistry industry arguably began with the commercialization of the first synthetic dye, Mauveine, by Sir William Henry Perkin in 1856. Throughout the next 150 years, research and development of dyes exploded in response to the growing demand for high performance colored products for multiple major industries, including textiles, plastics, coatings, cosmetics, and printing. While many thousands of prototype dyes have been designed, synthesized, characterized, and tested, most of the structural and property data have been kept from the open literature, even though large segments of the colorant industry have matured and many high volume dyes are now off patent. This is unfortunate considering that dyes are of fundamental importance to a number of growing areas of science and technology, including solar energy capture, medicinal chemistry (e.g. photodynamic therapy for cancer treatment), biomarkers, environmental monitoring, security printing, and camouflage. The ability for all scientists to observe comprehensive dye structure-property relationship data could help advance the theoretical and practical understanding of the role of dyes in various complex systems. NC State University's recently formed Forensic Sciences Institute is building a dye library that will enable establishment of the first comprehensive cheminformatics system for forensic trace evidence analysis of dyed materials, as well as a broad range of dye discovery projects. As part of this effort, NC State recently secured a remarkable donation of approximately 100,000 dye samples, spectra and performance data made by a leading chemical manufacturer over a period of more than 50 years. Significant parts of the library will be made available online for free. The scope and challenges in developing a digitized structural database will be reviewed. Once completed, the new library will provide all scientists with a powerful tool for dye discovery and knowledge.

Tuesday, March 18, 2014

Ethical Considerations in Digital Scientific Communication and Publishing - AM Session

Omni Dallas Hotel
Room: Deep Ellum A
Cosponsored by CHAL, CHED, ETHC, PROF, YCC

Leah McEwen, Barbara Moriarty, Edward Mottel, Heather Tierney, Organizers
Heather Tierney, Presiding
8:00 am - 12:00 pm
8:00 Introductory Remarks
8:05 50 Ethical dilemmas in the creation and sharing of a crystallographic database system

Suzanna Ward, ward@ccdc.cam.ac.uk, Colin R Groom. Cambridge Crystallographic Data Centre, Cambridge, United Kingdom

The Cambridge Crystallographic Data Centre is one of the oldest digital scientific publishers; since 1965 almost 700,000 small molecule crystal structures have been curated and entered into the Cambridge Structural Database. The CCDC continues to process around 200 small molecule crystal structures every day. Advanced software tools are used to determine the 'chemistry' of a molecule represented by the atomic coordinates we receive. In addition, structures are curated by our expert scientific editors. In these processes we find errors, cases of mis-(or over-)interpretation, unintentional redeterminations and the occasional fraudulent structure. All such instances require careful handling and can present ethical dilemmas. To exacerbate difficulties, deposition of experimental data (in the form of structure factors) is not universal in the small molecule crystallographic community. This is unlike the situation with protein structures and one might argue that it is time to mandate experimental data deposition. But who can issue such a mandate and will raising the hurdle for deposition reduce the number of structures deposited and therefore available to others? A further ethical dilemma is presented by the need to fund our efforts. The CCDC receives no direct grant aid for its curation activities. The Centre therefore relies on financial contributions from the users of its systems, often the very same people from whom we have received the data. Although the individual structures we hold are available free at the point of use, in order to ensure financial sustainability, restrictions are placed on the sharing of the contents of the database system. Is it ethically acceptable to have these restrictions?

8:30 51 Image manipulation in scholarly publishing: Setting standards and promoting best practice

Christina Bennett, cbennett@the-aps.org, American Physiological Society, Bethesda, MD 20814, United States

Over the past few years, scholarly publishers, including the American Physiological Society (APS), have experienced a dramatic increase in the number of publication ethics issues. In response, APS has taken several steps to revise the publication process so that articles are reviewed for potential ethical concerns prior to publication. In particular, figures within accepted manuscripts are now screened for image manipulation. Manuscripts with images that appear to contain splicing, extreme contrast adjustment, selective editing, etc. are now placed on hold until the author corrects the figures and/or discloses the modifications. While the vast majority of the image modifications that are identified are considered to be minor presentation errors, a handful of manuscripts per year contain scientifically inappropriate image manipulation and are not published. Thus, the image review process serves to educate authors on best practice for image preparation and to protect the journal, its authors, and readers from publishing manuscripts that contain inappropriately prepared data.

8:55 52 Tools for identifying potential misconduct: The CrossCheck service from CrossRef

Rachael Lammey, rlammey@crossref.org, CrossRef, Oxford, United Kingdom

CrossRef is an independent membership association, founded and directed by publishers. The organisation was formed in 2000 and initially focused on digital object identifier (DOI) registration and reference linking using the DOI to create a sustainable infrastructure and interoperability in online scholarly publishing. CrossRef's role in the industry has expanded, however, and the organisation now provides a wider range of services that aim to serve the collective needs of the industry. In 2007, the CrossRef board identified plagiarism as an area of concern, and CrossCheck was launched in 2008. CrossCheck uses the iThenticate tool from iParadigms to compare the text of manuscripts to a unique database of content which contains the full text of articles, books and conference proceedings from over 500 publishers. When a manuscript is uploaded, a report is prepared and presented to an editor so that they can gauge the originality of a piece of text and see where it overlaps with other papers. CrossCheck is now being widely adopted by journals as part of their peer review process, and over 100,000 manuscripts were uploaded to the iThenticate system to be checked in August 2013. This raises a new set of questions, however: how to interpret the reports, how to follow up, and what limitations still exist in text-comparison software. Where do we go from here?

9:20 53 Mapping the terrain of publication ethics

Charon A Pierson, cpierson@aanp.org, American Association of Nurse Practitioners, Gilbert, AZ 85298, United States

The Committee on Publication Ethics (COPE) has amassed a database of more than 400 cases of ethical misconduct since its inception in 1997. These cases have been presented and discussed at quarterly Fora and were recently analyzed and categorized (Hames, Pierson, Ridgeway, & Barbour, 2013). Case presentation at Fora provided the impetus for the development of many of the COPE resources such as Flowcharts, Guidelines, and Discussion Papers. This presentation will focus on the most common issues arising from the COPE cases: plagiarism, duplicate publication, reviewer misconduct, fabrication or falsification of data, unethical research, and authorship issues. Using case examples, we will explore the resources available from COPE to resolve ethical dilemmas in publishing.

9:45 Intermission
9:55 54 Publication ethics in ACS journals: Education and verification

Anne Coghill, a_coghill@acs.org, Editorial Office Operations, ACS Publications Division, Washington, DC 20036, United States

ACS Publications has a number of programs to educate authors about topics in publication ethics. Additionally, a number of tools are used during manuscript submission and peer review to verify that authors behave in an ethical manner. This presentation will review ACS Publications efforts to ensure the scientific integrity of material published in our journals.

10:20 55 Ethics in scientific publication: Observations of an editor and recommended best practices for authors

Kirk S Schanze, kschanze@chem.ufl.edu, Department of Chemistry, University of Florida, Gainesville, FL 32611, United States

The scientific publishing enterprise is strongly reliant on the ethics of the scientific community. Editors and reviewers do their best to identify areas of concern when a paper is subjected to peer review, but often those best able to identify ethical problems are the senior and contributing authors. The most common area of ethical concern is authorship, but prior publication and plagiarism also occur with some frequency. The talk will highlight common problems encountered by authors, editors and reviewers related to ethical issues. In addition, best practices will be discussed from the viewpoint of the submitting author to alleviate possible ethical problems that may arise upon submission of a paper.

10:45 56 Dealing with scientific misconduct: Part of an editor's day-to-day work

Haymo Ross, hross@wiley.com, Wiley-VCH, Weinheim, Germany

Editors of scholarly journals are very often confronted with different types of scientific misconduct ranging from inadequate citing to outright fraud. In order to deal with this unpleasant aspect of their work they can seek advice from guidelines such as the “Ethical Guidelines for Publication in Journals and Reviews” issued by the EuCheMS. In this lecture, the aforementioned guidelines are used as a common thread to define and explain various categories of scientific misconduct, e.g., plagiarism and self-plagiarism, with a focus on the speaker's own experiences. The responsibility of authors but also reviewers and editors is reviewed and real (though anonymized) cases are given as examples.

11:10 57 Role of the journal editor in maintaining ethical standards in the changing publishing environment

Jamie Humphrey, ruthvens@rsc.org, Sarah Ruthven. Royal Society of Chemistry, Cambridge, United Kingdom

The Royal Society of Chemistry is a leading publisher of chemistry journals. Ethical issues are becoming increasingly easy to detect with new technologies, but are presenting more challenges to resolve. This talk describes how Editors at the Royal Society of Chemistry are meeting the challenge of handling the growing number of ethical issues, with details of the complexities these present to publishers, authors, reviewers and editors. Ethical issues covered in this presentation will include authorship disputes, plagiarism of content and fabrication of results, with the Editors explaining how they approach these situations.

11:35 Panel Discussion moderated by Gregory Ferrence

Tuesday, March 18, 2014

Cloud Computing in Cheminformatics - PM Session

Omni Dallas Hotel
Room: Deep Ellum A
Cosponsored by COMP

Rudolph Potenzone, Organizer
Rudolph Potenzone, Presiding
1:40 pm - 5:30 pm
1:40 Introductory Remarks
1:45 58 10 Years of collaborative drug discovery in the cloud

Barry A. Bunin, bbunin@collaborativedrug.com, Science, Collaborative Drug Discovery, Burlingame, CA 94010, United States

The CDD Vault is a secure, hosted, full-featured cloud solution for drug discovery, including chemical registration, dose response, SAR tools, and collaborative sharing. Layering unique collaborative capabilities upon requisite drug discovery database functionality unlocks and amplifies latent synergy between biologists and chemists. The application of collaborative technologies to interrogate potency, selectivity, and therapeutic windows of small molecule enzyme, cell, and animal study data will be presented. An example combining integrated bioinformatics and chemoinformatics with in vitro experimental validation to identify two leads against putative new Tuberculosis targets (in collaboration with SRI's Computer Science + Bioscience Divisions), a second example to overcome malaria chloroquine resistance (a USA - South Africa four-team collaboration), a third example with the TB Drug Accelerator Consortium (involving multiple collaborators plus selective disclosure between seven big pharmas with the BMGF), and a fourth example broadly across CNS therapeutic areas (in collaboration with the NIH Neuroscience Blueprint) demonstrate the general concept that a more effective collaborative model is possible today using secure, web-based collaborative technologies to bring together complementary, specialized expertise. Finally, a commercial use case working with Acetylon Pharmaceuticals, from academic spinout (of Harvard) of the initial IP, to working with CROs (in China), to bringing small molecules against epigenetic targets (HDACs) into clinical trials, will be shared. Securely integrating private with external data is a key capability which allows collaborative technologies like the CDD Vault to scalably handle IP-sensitive data.

2:15 59 Cloud-hosted APIs for cheminformatics designed for real time user interfaces

Alex M Clark, aclark@molmatinf.com, R&D, Molecular Materials Informatics, Montreal, QC H3J2S1, Canada

The modern trend toward cloud computing coincides with a new generation of user interface paradigms that are based on web apps or native mobile apps. Both of these platforms have enormous advantages in terms of deployment methods and the ability to provide an omnipresent user experience, but unlike traditional applications from the desktop era, they are not appropriate for intensive calculations or storage of large quantities of data. In order to make them as useful as possible, it is often necessary to outsource computation and storage, which is typically done by using a web API. The process of decoupling user interface and computation/storage is often not a simple matter, and considerable foresight and careful design are necessary when designing cheminformatics functionality that can be integrated into a seamless and responsive user experience. Particular examples will be given using mobile apps, with attention given to the partitioning between the computational features that are best localized on the device and those that are best performed on a powerful server. Considerations include deciding how much functionality to offload, balancing latency against responsiveness, and the issues involved in creating asynchronous tasks, providing progress indicators, and allowing a task to be cancelled. The strategies used by apps such as the Mobile Molecular DataSheet (MMDS), SAR Table and MolPrime+ will be discussed in detail.
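The submit/poll/cancel pattern for offloaded calculations can be sketched generically. This is a minimal illustration, not code from any of the apps named above: the `submit`, `poll`, and `cancel` callables stand in for requests to a hypothetical web API, stubbed out here so the loop is self-contained.

```python
import time

def run_remote_task(submit, poll, cancel, timeout=30.0, interval=0.01):
    """Client-side loop for an offloaded calculation: submit the job,
    poll until it finishes, and cancel it if the deadline expires.
    The three callables stand in for calls to a hypothetical web API."""
    task_id = submit()
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        status, result = poll(task_id)
        if status == "done":
            return result
        time.sleep(interval)   # poll interval trades latency for server load
    cancel(task_id)            # keep the UI responsive: give up cleanly
    return None

# Stubbed transport standing in for real HTTP calls: the "server"
# reports the task done on the third poll.
state = {"calls": 0}
def submit():
    return "task-1"
def poll(task_id):
    state["calls"] += 1
    return ("done", "C6H6") if state["calls"] >= 3 else ("running", None)
def cancel(task_id):
    pass

result = run_remote_task(submit, poll, cancel)
```

The design choice the abstract alludes to lives in the `interval` and `timeout` parameters: a short interval keeps the interface feeling responsive, while the cancel path is what lets a mobile app abandon a server-side task the user no longer wants.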

2:45 pm 60 WITHDRAWN
3:15 Intermission
3:30 61 Application of cloud computing to Royal Society of Chemistry data platforms

Valery Tkachenko, tkachenkov@rsc.org, Ken Karapetyan, Jon Steele, Alexey Pshenichnov, Antony J. Williams. eScience, Royal Society of Chemistry, Cambridge, Cambridgeshire CB4 0WF, United Kingdom

Cloud computing offers significant advantages for the hosting of RSC chemistry databases in terms of reliability, performance and access to large scale computational power. The ChemSpider database contains almost 30 million unique chemical compounds, and access to compute power to regenerate properties and add new properties is essential for efficient delivery on a manageable timescale. The use of cloud-based facilities reduces the need for internal infrastructure and generally enhances performance, at the cost of significant recoding of the platforms. This presentation will review our move of our ChemSpider-related projects to the cloud, the associated challenges, and both the obvious and unforeseen benefits. We will also discuss our use of parallelization technologies for mass calculation using Hadoop.

4:00 62 PubChem in the cloud

Paul Thiessen, Bo Yu, Gang Fu, Evan Bolton, bolton@ncbi.nlm.nih.gov. National Center for Biotechnology Information, Bethesda, MD 20894, United States

What is 'cloud computing'? Depending on who you ask, you get the notion that information is stored, managed, and processed somewhere 'out there' in the wild yonder of the Internet. PubChem is an online archive of chemical substances and their biological activities. Depending on your perspective, PubChem can be used as a 'cloud' in that you can access, search, and analyze scientific information available in PubChem using various programmatic interfaces. PubChem information can also be put into a 'cloud' and made available. This talk will demonstrate ways one can use PubChem as a 'cloud' or from a 'cloud'.
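One of the programmatic interfaces mentioned above is PubChem's PUG REST service, whose requests follow an input/operation/output URL pattern. The sketch below only constructs such a URL; fetching it with an HTTP client is left to the reader, so the example works offline.

```python
from urllib.parse import quote

# Base of PubChem's PUG REST programmatic interface
PUG_REST = "https://pubchem.ncbi.nlm.nih.gov/rest/pug"

def compound_property_url(name, properties, fmt="JSON"):
    """Build a PUG REST URL that looks up compound properties by name,
    following the service's <input>/<operation>/<output> path pattern."""
    props = ",".join(properties)
    return f"{PUG_REST}/compound/name/{quote(name)}/property/{props}/{fmt}"

url = compound_property_url("aspirin", ["MolecularFormula", "MolecularWeight"])
# The resulting URL can then be fetched with any HTTP client,
# e.g. urllib.request.urlopen(url), subject to PubChem's usage limits.
```

This name-to-property lookup is only one of many operations the service exposes; similarity and substructure searches follow the same URL grammar with different input and operation segments.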

4:30 63 Your data in the cloud: Facts and fears

Sharang Phatak1, sharang.phatak@dotmatics.com, Berkley A Lynch2, Tamsin E Mansley2, Philip Mounteney1, Jess W Sager1. (1) Dotmatics, Inc., San Diego, CA 92121, United States, (2) Dotmatics, Inc., Woburn, MA 01801, United States

Modern research teams are often decentralized, consisting of internal and external partnerships across the globe. Ensuring all the scientists involved have up-to-the-minute access to the project data they need is a challenge that can be met by secure cloud-based software deployments. This allows any scientist with an internet connection and appropriate access control to log in to their project web interface to upload new data or review and analyze the current data. We will present solutions and case studies where cloud software deployments are facilitating research and collaboration through the secure exchange and sharing of data of all types between research partners, collaborators and customers, locally or across the globe. Instant communication through the cloud or local network enables and enriches collaborative research. We will also discuss how cloud-based solutions support the increasing requirements for mobile data access with BYOD (Bring Your Own Device).

5:00 64 Moving mainstream chemical research to the cloud

Philip J. Skinner, Philip.skinner@perkinelmer.com, Joshua Bishop, Phil McHale, Rudy Potenzone. PerkinElmer, United States

Despite the advent of cloud computing, and the prevalence of applications such as GoogleDocs and Gmail, few laboratory-specific applications have appeared. We will present a new platform that promises to rethink the ways scientists collect, store, share and learn from scientific observations by fully embracing the cloud philosophy. Reimagining ChemDraw for a new platform led to new ways of driving scientific collaboration. In a similar vein, we have assembled a novel environment that allows scientists to gather information about their experiments, document the results, share knowledge and discuss the implications of their knowledge. Through annotations and comments within a project, scientists can collaborate in real time to try out new ideas while they share suggestions and explore new directions. By leveraging the power of crowd-sourcing we are able to rapidly adapt to changing needs and environments and optimize the full user experience from novice through user to advocate.

Wednesday, March 19, 2014

New Models in Substance Discovery - AM Session

Omni Dallas Hotel
Room: Deep Ellum A

Roger Schenck, Organizer
Roger Schenck, Presiding
8:10 am - 12:00 pm
8:10 Introductory Remarks
8:15 65 Functional requirements for chemical information retrieval for intellectual property professionals

Matthew McBride, mmcbride@cas.org, Science IP, Chemical Abstracts Service, Columbus, Ohio 43202, United States

Chemical information retrieval by intellectual property (IP) professionals requires functionality similar to that needed by end-user scientists: quality content and advanced search functionality. However, IP professionals are additionally concerned with comprehensive retrieval from worldwide sources (patent authorities and non-patent literature), exemplified substance collections and Markush disclosures, association with the original citations or sources, and post-processing results from multiple search platforms. Therefore the required features of their search tools are often more demanding, requiring additional functionality to manage workflow in navigating and presenting results to a wide array of technical users, from scientists to patent attorneys. This session focuses on providing insight from search professionals on key functional requirements for chemical information systems used in IP work, along with search examples demonstrating these features from databases on STN, Questel MMS and other search platforms.

8:45 66 New approaches to search interfaces in PubChem

Asta Gindulyte, Lianyi Han, Paul Thiessen, Bo Yu, Lewis Geer, Evan Bolton, bolton@ncbi.nlm.nih.gov. National Center for Biotechnology Information, Bethesda, MD 20894, United States

Providing easy to use web-based scientific search interfaces is a challenge. This is true for several reasons. Firstly, the devices accessing scientific information are diverse, with potentially big differences in screen display size, network speed, network latency, and interface (e.g., touch-based vs. mouse-based). Secondly, the amount of publicly available scientific data is growing markedly. For example, the open and free PubChem resource (http://pubchem.ncbi.nlm.nih.gov) currently contains more than 47 million unique small molecules derived from 120 million substance sample descriptions and 217 million bioactivity results from 717 thousand biological experiments. Thirdly, the availability and diversity of scientific information is very uneven between substances, with some chemicals being well studied and others being almost completely unknown. This presentation will explore ways in which PubChem is adapting, including: making search interfaces adapt to the device accessing it, developing new algorithms that determine result relevancy, and harnessing new technologies to improve the speed and scalability of substance searching.

9:15 67 From searching to finding: New developments for managing large data sets

Juergen Swienty-Busch1, j.swienty-busch@elsevier.com, David Evans2. (1) Elsevier Information Systems GmbH, Frankfurt, Hessen 60486, Germany, (2) Reed Elsevier Properties SA, Neuchatel, Switzerland

A challenge for a system that is designed to support chemistry researchers across all industries and academic environments is usability and relevancy. A chemical engineer in a chemical company has different expectations for a system serving his information needs when compared to a medicinal chemist in a pharma company or a PhD student in an inorganic lab at a university. We will describe recent developments in the Reaxys product family that address these challenges with a new and highly customizable user interface combined with an extended data set and will present use cases and applications that demonstrate its applicability in various chemistry workflows.

9:45 68 Search and navigation functionality for a major reference work online: SOS 4.0

Fiona Shortt de Hernandez1, fiona.shortt@thieme.de, Guido F. Herrmann1, Peter Loew2. (1) Thieme Publishers, Stuttgart, Germany, (2) InfoChem GmbH, Munich, Germany

There are significant challenges associated with making content from a multiauthor reference work searchable, retrievable and readable online. This task was made more complicated in the case of Science of Synthesis because of the scale and scope of the project. The online product, which includes the Houben–Weyl archive (going back 100 years), covers hundreds of volumes and thousands of pages with contributions from over 1,500 authors. Science of Synthesis is now a unique, structure-searchable, full-text resource in organic synthesis that provides the user with expert-evaluated methods. The online product combines full-text browsing functionality together with InfoChem's modern structure and reaction search capabilities. This talk describes how we overcame the various technical hurdles in order to achieve this. We also give specific examples of the features and functionality included in the product in order to help the researcher gain access to relevant content quickly. For example: A clean and simple user-interface design makes comprehensive searching straightforward. Useful navigation aids such as breadcrumbs and an interactive table of contents facilitate browsing. Intuitive text search options allow quick access to relevant content. Advanced text searching options help the user to define more complicated queries. The manual preparation of a name reaction index and the use of deep indexing enable the association of transformations with specific named reactions even if they are not mentioned as such in the full text. The “All-in-One” structure and reaction search from InfoChem provides a broader spectrum of weighted hits. The relevance ranking of hits, using the InfoChem algorithm for structure and reaction searching as well as the MarkLogic algorithm for text searching, helps the user to rate the quality of their results quickly and determine which information is of interest to them. Faceted navigation of the hit list helps the user to further refine their results.
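The faceted navigation mentioned at the end of the abstract can be sketched in a few lines. This is an illustrative toy, not the SOS implementation: the hit records, the "volume" facet, and the function names are all assumptions made for the example.

```python
def facet_counts(hits, facet):
    """Count how many hits fall under each value of a facet field,
    as shown alongside a hit list to guide refinement."""
    counts = {}
    for hit in hits:
        value = hit.get(facet)
        counts[value] = counts.get(value, 0) + 1
    return counts

def refine(hits, facet, value):
    """Narrow a hit list to the hits matching one facet value."""
    return [hit for hit in hits if hit.get(facet) == value]

# Hypothetical hit records with a single "volume" facet
hits = [
    {"title": "Wittig olefination", "volume": "47"},
    {"title": "Heck reaction", "volume": "48"},
    {"title": "Julia olefination", "volume": "47"},
]
counts = facet_counts(hits, "volume")
narrowed = refine(hits, "volume", "47")
```

In a real system the facets (volume, reaction class, author, and so on) are computed over the full result set by the search engine; the refinement step is the same idea applied server-side.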

10:15 Intermission
10:30 69 Chemical most common denominator: Use of chemical structures for semantic enrichment and interlinking of scientific information

Valentina Eigner-Pitto, ve@infochem.de, Josef Eiblmaier, Hans Kraut, Larisa Isenko, Heinz Saller, Peter Loew. InfoChem GmbH, Munich, Germany

Academic and industrial researchers nowadays expect interactive, easy to use access to the scientific literature and supplementary information in order to satisfy their information needs quickly and comprehensively. To address these challenges, major scientific publishers have begun to chemically enhance and enrich their scientific articles, reference works and eBooks with information based on chemical structures and reactions. Compound indexing, highlighting of chemical terms, one-click display of additional compound attributes and, most importantly, semantic linking to other relevant resources or information have been implemented. First and foremost, enabling chemical structure searching in text sources offers unprecedented benefits to the user by providing an additional, precise entry point to the full text. Sophisticated cutting-edge technologies such as chemical named entity extraction, chemical image recognition and automatic work-up of ChemDraw files are applied to fulfill these requirements. This lecture will give an overview of current implementations of the concepts listed above and will illustrate the benefits as well as the challenges and pitfalls through one concrete example project.

11:00 70 Representation and display of non-standard peptides using semi-systematic amino acid monomer naming

Roger A Sayle, roger@nextmovesoftware.com, Noel M O'Boyle. NextMove Software, Cambridge, CAMBS CB4 0EY, United Kingdom

The registration of peptide and peptide-like structures in chemical databases poses a number of technical challenges, as do other biological oligomers. Primary among these are the size and complexity of these compounds, often making it difficult or impossible for a biochemist to identify differences or similarities between compounds stored as all-atom representations. For biological (proteinogenic) sequences the solution is simple: use strings of the one- (or three-) letter codes to represent the biopolymer. However, for post-translationally modified, D-, cyclic and non-standard peptides the way forward is less clear. One approach is to use an ever-larger dictionary of three-letter codes to encode common non-standard amino acids. Unfortunately, this method quickly becomes unwieldy once the set of abbreviations exceeds those frequently encountered in the literature, requiring a chemist to consult a key or dictionary for entry or interpretation of structures. In this presentation, we propose the use of semi-systematic monomer names, based upon readily recognizable chemical line formulae, for the encoding and display of traditionally difficult to handle peptides. These rules lead to names such as N(Me)Ser(tBuOH) that are similar to those seen and used in scientific publications, though not formally ratified by IUPAC or CAS.
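The baseline approach the abstract starts from, strings of three-letter codes for proteinogenic sequences, can be shown in a few lines. The sketch below uses the standard IUPAC codes; its fixed dictionary is exactly what breaks down for non-standard monomers such as N(Me)Ser(tBuOH), which is the gap the proposed semi-systematic names address.

```python
# Standard IUPAC three-letter to one-letter amino acid codes
THREE_TO_ONE = {
    "Ala": "A", "Arg": "R", "Asn": "N", "Asp": "D", "Cys": "C",
    "Gln": "Q", "Glu": "E", "Gly": "G", "His": "H", "Ile": "I",
    "Leu": "L", "Lys": "K", "Met": "M", "Phe": "F", "Pro": "P",
    "Ser": "S", "Thr": "T", "Trp": "W", "Tyr": "Y", "Val": "V",
}

def to_one_letter(sequence):
    """Convert a dash-separated three-letter peptide sequence to
    one-letter codes. Any monomer outside the fixed dictionary
    (e.g. a non-standard residue) raises KeyError, illustrating
    why the dictionary approach does not scale."""
    return "".join(THREE_TO_ONE[res] for res in sequence.split("-"))

seq = to_one_letter("Ala-Gly-Ser")
```

A non-standard residue in the input immediately falls outside the dictionary, which is the unwieldiness the abstract describes: every new monomer needs a new agreed-upon code.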

11:30 71 New structure search capabilities for accessing CAS content

Kurt Zielenbach1, kzielenbach@cas.org, Bryan Harkleroad2. (1) Marketing, CAS, Columbus, Ohio 43202, United States, (2) Product Development, CAS, Columbus, Ohio 43202, United States

Access by research scientists to comprehensive, high-quality chemical information is essential for ensuring successful research projects. As the number of known substances, reactions, and related literature continues to grow exponentially, chemists require fast, intuitive, and relevant access to this information through various access paths, particularly chemical structure techniques. This session focuses on new technologies in development at CAS that will help chemists improve efficiency in searching and decision making with SciFinder®, the choice for chemistry research™. Chemical Abstracts Service (CAS), a division of the American Chemical Society, is the world's authority for chemical information and the only organization in the world whose objective is to find, collect, and organize all publicly disclosed chemical substance information.

Wednesday, March 19, 2014

General Papers - PM Session

Omni Dallas Hotel
Room: Deep Ellum A

Jeremy Garritano, Erin Bolstad, Organizers
Erin Bolstad, Presiding
1:30 pm - 3:30 pm
1:30 72 UK National Chemical Database Service: An integration of commercial and public chemistry services to support chemists in the United Kingdom

Antony J. Williams, williamsa@rsc.org, Valery Tkachenko, Richard Kidd. eScience, Royal Society of Chemistry, Cambridge, Cambridgeshire CB4 0WF, United Kingdom

At a time when the data explosion has simply been redefined as “Big”, the hurdles associated with building a subject-specific data repository for chemistry are daunting. The challenge is significant: combining a multitude of non-standard data formats for chemicals, related properties, reactions, spectra, etc., navigating the confusion of licensing and embargoing, and providing for data exchange and integration with services and platforms external to the repository. All this comes at a time when semantic technologies are touted as the fundamental technology for enhancing integration and discoverability. Funding agencies are demanding change, especially a shift toward access to open data to parallel their expectations around Open Access publishing. The Royal Society of Chemistry has been funded by the Engineering and Physical Sciences Research Council (EPSRC) of the UK to deliver a “chemical database service” for UK scientists. This presentation will provide an overview of the challenges associated with this project and our progress in delivering a chemistry repository capable of handling the complex data types associated with chemistry. The benefits of such a repository for providing data to develop prediction models that further enable scientific discovery will be discussed, and the potential impact on the future of scientific publishing will also be examined.

2:00 73 Data enhancing the Royal Society of Chemistry publication archive

Antony J. Williams, williamsa@rsc.org, Colin Batchelor, Peter Corbett, Ken Karapetyan, Valery Tkachenko. eScience, Royal Society of Chemistry, Cambridge, Cambridgeshire CB4 0WF, United Kingdom

The Royal Society of Chemistry has an archive of hundreds of thousands of published articles containing various types of chemistry-related data: compounds, reactions, property data, spectral data, etc. RSC has a vision of extracting as much of these data as possible and providing access via ChemSpider and its related projects. To this end we have applied a combination of text-mining extraction, image conversion, and chemical validation and standardization approaches. This project will result in new chemistry-related data being added to our chemical and reaction databases, and in the ability to couple web-based versions of the articles more tightly with these extracted data. The ability to search across the archive will be enhanced as a result. This presentation will report on our progress in this data extraction project and discuss how we will ultimately use similar approaches in our publishing pipeline to enhance article markup for new publications.
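One step of such a pipeline, dictionary-based recognition of compound names in article text, can be sketched as follows. This is a hypothetical simplification: the RSC pipeline combines text mining, image conversion, and chemical standardization, none of which is reproduced here, and a real system would use a far larger lexicon and a proper named-entity recognizer.

```python
import re

# Stand-in for a real chemical lexicon (hypothetical, three entries only).
LEXICON = {"benzene", "toluene", "aspirin"}

def extract_compounds(text: str) -> list[str]:
    """Return lexicon terms found in the text, lowercased and deduplicated,
    in order of first appearance."""
    tokens = re.findall(r"[A-Za-z][A-Za-z0-9\-]*", text.lower())
    seen, hits = set(), []
    for token in tokens:
        if token in LEXICON and token not in seen:
            seen.add(token)
            hits.append(token)
    return hits
```

Extracted terms would then be resolved to structures and standardized before being deposited in a compound database such as ChemSpider.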

2:30 74 Cheminformatics for dye chemistry research: Bringing online an unprecedented 100,000-sample dye library

David Hinks1, dhinks@ncsu.edu, Nelson Vinueza-Benitez1, David C Muddiman2, Antony J Williams3. (1) Department of Textile Engineering, Chemistry and Science, North Carolina State University, Raleigh, North Carolina 27695, United States, (2) Department of Chemistry, North Carolina State University, Raleigh, North Carolina 27695, United States, (3) Department of eScience, Royal Society of Chemistry, Wake Forest, North Carolina 27587, United States

The synthetic organic chemistry industry arguably began with the commercialization of the first synthetic dye, Mauveine, by Sir William Henry Perkin in 1856. Over the next 150 years, research and development of dyes exploded in response to the growing demand for high-performance colored products across multiple major industries, including textiles, plastics, coatings, cosmetics, and printing. While many thousands of prototype dyes have been designed, synthesized, characterized, and tested, most of the structural and property data have been kept from the open literature, even though large segments of the colorant industry have matured and many high-volume dyes are now off patent. This is unfortunate considering that dyes are of fundamental importance to a number of growing areas of science and technology, including solar energy capture, medicinal chemistry (e.g., photodynamic therapy for cancer treatment), biomarkers, environmental monitoring, security printing, and camouflage. The ability for all scientists to examine comprehensive dye structure-property relationship data could help advance the theoretical and practical understanding of the role of dyes in various complex systems. NC State University's recently formed Forensic Sciences Institute is building a dye library that will enable establishment of the first comprehensive cheminformatics system for forensic trace evidence analysis of dyed materials, as well as a broad range of dye discovery projects. As part of this effort, NC State recently secured a remarkable donation of approximately 100,000 dye samples, together with spectra and performance data, produced by a leading chemical manufacturer over a period of more than 50 years. Significant parts of the library will be made available online for free. The scope and challenges of developing a digitized structural database will be reviewed. Once completed, the new library will provide all scientists with a powerful tool for dye discovery and knowledge.

3:00 75 QM/MM docking for GPCR targets

Art E Cho, artcho@korea.ac.kr, Minsup Kim. Department of Bioinformatics, Korea University, Seoul, Republic of Korea

The study of GPCRs as drug targets has intensified in recent years as more and more X-ray structures of GPCRs are solved. In fact, the awarding of the 2012 Nobel Prize in Chemistry to Lefkowitz and Kobilka was just the beginning of the race to discover GPCR-targeted drugs. Many research groups have been trying to use current docking programs to screen compound libraries against GPCR targets, with varying degrees of success. As many people in the computational drug discovery community know by now, there is even a CASP-style competition dedicated to GPCR-targeted docking. Docking to GPCR targets is in many ways quite different from docking to other targets: the binding sites of GPCRs are generally open and flexible, and these peculiarities must be considered in GPCR docking. Over the years we have developed docking methodologies utilizing QM/MM calculations in order to take into account phenomena that cannot be described by conventional force fields. Recently we devised a docking protocol based on these QM/MM docking methods, combined with induced-fit docking and solvent calculations, with GPCR targets in mind. Application of this protocol to a set of known GPCR cocrystal structures reveals that when the particularities of GPCR binding sites are carefully treated, the performance of docking to GPCRs can be greatly enhanced.