Building a network of interoperable and independently produced linked and open biomedical data

Michel Dumontier

Michel Dumontier of Stanford University, and his co-workers, develop tools and methods to represent, store, publish, integrate, query, and reuse biomedical data, software, and ontologies, with an emphasis on reproducible discovery, which necessitates data science tools and methods, and community standards. Data need to be “FAIR”,26 that is, findable, accessible, interoperable, and reusable.

The Semantic Web is the new global web of knowledge: it has standards for publishing, sharing and querying facts, expert knowledge and services, and a scalable approach for the discovery of independently formulated and distributed knowledge. Linked Data offers a solid foundation for FAIR data: entities are identified using globally unique identifiers (URIs); entity descriptions are represented with a standardized language (resource description framework, RDF); data can be retrieved using a universal protocol (HTTP); and entities can be linked together to increase interoperability.
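The four Linked Data ingredients listed above can be illustrated with a minimal sketch that uses only the Python standard library. It is not taken from the talk: every `example.org` URI is a hypothetical placeholder, and the helper names (`uri`, `literal`, `to_ntriples`) are invented for illustration. The sketch shows an entity named by a URI, described by RDF statements, and linked to a record in another dataset.

```python
# Minimal sketch of Linked Data statements (triples), standard library only.
# All example.org URIs are hypothetical placeholders, not real identifiers.

RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
RDFS_LABEL = "http://www.w3.org/2000/01/rdf-schema#label"
OWL_SAME_AS = "http://www.w3.org/2002/07/owl#sameAs"


def uri(value):
    """Format a URI as an N-Triples term."""
    return f"<{value}>"


def literal(value):
    """Format a plain string as an N-Triples literal.

    Only backslashes and double quotes are escaped; newlines, language
    tags and datatypes are not handled in this sketch.
    """
    escaped = value.replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'


def to_ntriples(triples):
    """Serialize (subject, predicate, object) terms, one statement per line."""
    return "\n".join(f"{s} {p} {o} ." for s, p, o in triples)


drug = "http://example.org/drug/D001"  # globally unique identifier (hypothetical)
triples = [
    # Entity description in RDF: a type and a human-readable label.
    (uri(drug), uri(RDF_TYPE), uri("http://example.org/vocab/Drug")),
    (uri(drug), uri(RDFS_LABEL), literal("example drug")),
    # Link to the same entity in another (hypothetical) dataset.
    (uri(drug), uri(OWL_SAME_AS), uri("http://example.org/other-dataset/compound/42")),
]

print(to_ntriples(triples))
```

In practice the URIs would be HTTP-resolvable, so that a client retrieving one of them receives the description of that entity.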

Bio2RDF is an open source project to unify the representation and interlinking of biological data using RDF: it transforms silos of life science data into a globally distributed network of linked data for biological knowledge discovery. It shows how datasets are connected together. Queries can be federated across private and public SPARQL (SPARQL Protocol and RDF Query Language) endpoints. A graph-like representation is amenable to finding mismatches and discovering new links.27 EbolaKB28 is an example using linked data and software.

In current, unpublished research on network analysis and discovery, Michel’s team is examining whether they can implement an open version of PREDICT29 using linked data. HyQue,30,31 a platform for hypothesis validation and knowledge discovery, couples data retrieval with automated reasoning to validate scientific hypotheses. It builds on semantic technologies to provide access to linked data, ontologies, and Semantic Web services; it uses positive and negative findings, captures provenance, and weighs evidence according to context. It has been used to find aging genes in nematodes, and to assess the cardiotoxicity of tyrosine kinase inhibitors.

The network of linked data goes beyond biology. Michel displayed a network from about 2007, and the linking open data cloud diagram as of August 2014, to show how rapidly the network has expanded across domains:

[Figures: Linked Data RDF; Linked Data Cloud]

EMBL-EBI have been producing RDF for two years, PubChemRDF was released more than two years ago, and NLM has released a beta version of Medical Subject Headings (MeSH) RDF linked data, but lack of coordination makes Linked Open Data chaotic and unwieldy.

There is no shortage of vocabularies, ontologies and community-based standards. The National Center for Biomedical Ontology (NCBO) manages a repository of all publicly available biomedical ontologies and terminologies. The NCBO BioPortal resource makes these ontologies and terminologies available via a Web browser and Web Services. The NCBO Annotator service takes natural-language text as input and returns the ontology terms to which the text refers. The Center for Extended Data Annotation and Retrieval (CEDAR) project relies on the BioPortal ontology repository and the NCBO Annotator. CEDAR is making data submission smarter and faster, so that biomedical researchers and analysts create and use better metadata. Through better interfaces, terminology, metadata practices, and analytics, CEDAR optimizes the metadata pathway from provider to end user.
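The text-in, ontology-terms-out behaviour described for the NCBO Annotator can be illustrated with a toy dictionary lookup. This is not the Annotator's actual algorithm or API: the vocabulary, the term identifiers (all under `example.org`) and the `annotate` function are hypothetical, and matching is a plain case-insensitive string search.

```python
import re

# Hypothetical mini-vocabulary: label -> term identifier (not real ontology IDs).
VOCABULARY = {
    "tyrosine kinase inhibitor": "http://example.org/onto/T0001",
    "cardiotoxicity": "http://example.org/onto/T0002",
    "nematode": "http://example.org/onto/T0003",
}


def annotate(text, vocabulary=VOCABULARY):
    """Return (start, end, matched_text, term_id) for each vocabulary label found in text."""
    annotations = []
    for label, term_id in vocabulary.items():
        # Case-insensitive whole-word match, allowing a simple plural "s".
        pattern = re.compile(r"\b" + re.escape(label) + r"s?\b", re.IGNORECASE)
        for match in pattern.finditer(text):
            annotations.append((match.start(), match.end(), match.group(0), term_id))
    # Sort by position in the text.
    return sorted(annotations)


text = "Assessing cardiotoxicity of tyrosine kinase inhibitors in nematodes."
for start, end, matched, term_id in annotate(text):
    print(start, end, matched, term_id)
```

Returning character offsets alongside term identifiers keeps each annotation traceable to the exact span of text that produced it, which is useful when the resulting metadata is reviewed or corrected.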

PubChem engaged the community to reuse and extend existing vocabularies. Semanticscience Ontology (SIO) is an effective upper-level ontology, with over 1,500 classes and 207 object properties. Chemical Information Ontology (CHEMINF)32 is a collaborative ontology that distinguishes algorithmic, or procedural, information from declarative, or factual, information, and places particular importance on annotating calculated data with their provenance.

Large-scale publishing on the Web across biomedical data types is possible. Hubs such as NCBI and EMBL-EBI now integrate data, but there is a need for global coordination on all data types. Standard vocabularies must be open, freely accessible, and demonstrably reused. Worldwide data integration formats such as RDF can improve the linking of data, and toolkits that are easier to deploy will help providers publish standards-compliant linked data. The development and use of standards by PubChem, and others, brings us closer to an interoperability ideal, but much more work is needed to support computational discovery in a reproducible manner.