- Message from the Chair
- Letter from the Editor
- Awards and Scholarships
- Interview with David Evans
- Technical Program
- Herman Skolnik Award Symposium 2016
- Developing databases and standards in chemistry
- Two decades of open chemical data at the Developmental Therapeutics Program (DTP) at the National Cancer Institute (NCI)
- Using InChI to manage data
- Open chemistry resources provided by the NCI computer-aided drug design (CADD) group
- Evolution of open chemical information
- Open chemical information at the European Bioinformatics Institute
- History and the future of tools and software components for working with public chemistry data
- PubChem: a resource for cognitive computing
- SPL and openFDA resources of open substance data
- Building a network of interoperable and independently produced linked and open biomedical data
- Chemical structure representation in PubChem
- iRAMP and PubChem: of the people, for the people
- Open chemical information: where now and how?
- ANYL: New Directions in Chemometrics: Making Sense of Big & Small Chemical Data Sets
- Shedding Light on the Dark Genome: Methods, Tools & Case Studies
- Bringing Cheminformatics into the College Chemistry Classroom
- Committee Reports
- Sponsor Announcements
- CINF Officers and Functionaries
Building a network of interoperable and independently produced linked and open biomedical data
Michel Dumontier of Stanford University and his co-workers develop tools and methods to represent, store, publish, integrate, query, and reuse biomedical data, software, and ontologies, with an emphasis on reproducible discovery. Reproducibility demands data science tools and methods, and community standards: data need to be “FAIR”,26 that is, findable, accessible, interoperable, and reusable.
The Semantic Web is the new global web of knowledge: it has standards for publishing, sharing, and querying facts, expert knowledge, and services, and a scalable approach for the discovery of independently formulated and distributed knowledge. Linked Data offers a solid foundation for FAIR data: entities are identified using globally unique identifiers (URIs); entity descriptions are represented with a standardized language, the Resource Description Framework (RDF); data can be retrieved using a universal protocol (HTTP); and entities can be linked together to increase interoperability.
Bio2RDF is an open source project to unify the representation and interlinking of biological data using RDF: it transforms silos of life science data into a globally distributed network of linked data for biological knowledge discovery, and shows how datasets are connected. Queries can be federated across private and public SPARQL (SPARQL Protocol and RDF Query Language) endpoints. A graph-like representation is amenable to finding mismatches and discovering new links.27 EbolaKB28 is an example using linked data and software.
In current, unpublished research on network analysis and discovery, Michel’s team is examining whether they can implement an open version of PREDICT29 using linked data. HyQue,30,31 a platform for hypothesis validation and knowledge discovery, couples data retrieval with automated reasoning to validate scientific hypotheses. It builds on semantic technologies to provide access to linked data, ontologies, and Semantic Web services; uses positive and negative findings; captures provenance; and weighs evidence according to context. It has been used to find aging genes in nematodes, and to assess cardiotoxicity of tyrosine kinase inhibitors.
The network of linked data goes beyond biology. Michel displayed a network from about 2007, and the linking open data cloud diagram as of August 2014, to show how rapidly the network has expanded across domains.
EMBL-EBI has been producing RDF for two years, PubChemRDF was released more than two years ago, and the National Library of Medicine (NLM) has released a beta version of Medical Subject Headings (MeSH) RDF linked data, but lack of coordination makes Linked Open Data chaotic and unwieldy. There is no shortage of vocabularies, ontologies, and community-based standards. The National Center for Biomedical Ontology (NCBO) manages a repository of all publicly available biomedical ontologies and terminologies. The NCBO BioPortal resource makes these ontologies and terminologies available via a Web browser and Web services. The NCBO Annotator service takes natural-language text as input and returns the ontology terms to which the text refers. The Center for Expanded Data Annotation and Retrieval (CEDAR) project relies on the BioPortal ontology repository and the NCBO Annotator. CEDAR is making data submission smarter and faster, so that biomedical researchers and analysts create and use better metadata. Through better interfaces, terminology, metadata practices, and analytics, CEDAR optimizes the metadata pathway from provider to end user.
PubChem engaged the community to reuse and extend existing vocabularies. The Semanticscience Integrated Ontology (SIO) is an effective upper-level ontology, with over 1,500 classes and 207 object properties. The Chemical Information Ontology (CHEMINF)32 is a collaborative ontology that distinguishes algorithmic, or procedural, information from declarative, or factual, information, which makes the annotation of calculated data with provenance particularly important.
Large-scale publishing on the Web across biomedical datatypes is possible. Hubs such as NCBI and EMBL-EBI now integrate data, but there is a need for global coordination on all data types. Standard vocabularies must be open, freely accessible, and demonstrably reused. Worldwide data integration formats such as RDF can improve the linking of data, and easier-to-deploy toolkits will provide standards-compliant linked data. The development and use of standards by PubChem and others brings us closer to an interoperability ideal, but much more work is needed to support computational discovery in a reproducible manner.