Open chemical information at the European Bioinformatics Institute

Christoph Steinbeck

Christoph Steinbeck of EMBL-EBI looked back on his early years as a natural products chemist, and recounted what has happened since the old days of access to Beilstein and CAS in 1992. There were no open source software libraries for cheminformatics in those days, but there were computer-assisted structure elucidation (CASE) systems.12,13 Christoph sold his CASE software to Bruker and it got buried. He learned that successful science requires data and software to be free and open.

So in 2000 he and his co-workers began work on an open source library for bioinformatics, cheminformatics, and computational chemistry written in Java: the Chemistry Development Kit (CDK).14,15,16 Sixteen years later, it is a well-established, mature code base (564,171 lines of code), maintained by a large development team; 16,521 commits have been made by 115 contributors.

Christoph’s database years really began when he moved to EMBL-EBI, although his open database NMRShiftDB17,18 was written earlier. It contains 50,000 compounds and their spectra. Christoph’s current research interest is documenting the metabolomes of all species on the planet. To borrow Donald Rumsfeld’s phraseology, “known knowns” can be found in databases, “known unknowns” can be found using NMRShiftDB, but “unknown unknowns” are dark matter. Too many metabolomes are not known.

EMBL-EBI has many important databases, Chemical Entities of Biological Interest (ChEBI) and ChEMBL being just two of them. ChEBI is a freely available dictionary of molecular entities focused on small chemical compounds. The molecular entities are either products of nature or synthetic products used to intervene in the processes of living organisms. ChEBI incorporates an ontological classification, whereby the relationships between molecular entities or classes of entities and their parents or children are specified. ChEMBL is an open data resource of binding, functional, and ADMET bioactivity data for a large number of druglike compounds.19 The types of data reported in PubChem and ChEMBL are distinct and complementary. To maximize the utility of the two datasets, EMBL-EBI has worked with the PubChem group to develop a data exchange mechanism.
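
The parent and child relationships in an ontological classification such as ChEBI's can be pictured as a directed graph of "is a" links. The following is only a minimal sketch of that idea in Python: the class name `Ontology`, its methods, and the carbohydrate example labels are all hypothetical illustrations. They are not taken from ChEBI's actual records, data model, or software.

```python
from collections import defaultdict


class Ontology:
    """Toy 'is a' hierarchy; a hypothetical structure, not ChEBI's data model."""

    def __init__(self):
        self._parents = defaultdict(set)   # entity -> direct parents
        self._children = defaultdict(set)  # entity -> direct children

    def add_is_a(self, child, parent):
        """Record that `child` is a kind of `parent`."""
        self._parents[child].add(parent)
        self._children[parent].add(child)

    def _walk(self, start, links):
        """Collect everything reachable from `start` by following `links`."""
        seen = set()
        stack = [start]
        while stack:
            node = stack.pop()
            for neighbour in links.get(node, ()):
                if neighbour not in seen:
                    seen.add(neighbour)
                    stack.append(neighbour)
        return seen

    def ancestors(self, entity):
        """All direct and indirect parents of `entity`."""
        return self._walk(entity, self._parents)

    def descendants(self, entity):
        """All direct and indirect children of `entity`."""
        return self._walk(entity, self._children)


# Illustrative labels only; these are assumptions, not real ChEBI entries.
onto = Ontology()
onto.add_is_a("glucose", "hexose")
onto.add_is_a("fructose", "hexose")
onto.add_is_a("hexose", "monosaccharide")
onto.add_is_a("monosaccharide", "carbohydrate")

glucose_ancestors = onto.ancestors("glucose")
carbohydrate_descendants = onto.descendants("carbohydrate")
```

In this sketch, asking for the ancestors of an entity walks upward through its classes, and asking for the descendants of a class walks downward to its members. That two-way navigation is the kind of query a parent and child classification makes possible.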

It is estimated that there are about 8.7 million eukaryotic species on Earth, of which 1.2 million have been identified and classified. Three or four thousand complete species genomes have been sequenced. What about completed metabolomes? Steinbeck’s team has argued that the time is now right to focus intensively on model organism metabolomes.20 They have proposed a grand challenge to identify and map all metabolites onto metabolic pathways, to develop quantitative metabolic models for model organisms, and to relate organism metabolic pathways within the context of evolutionary metabolomics.

Species metabolomes are now being assembled through data sharing in metabolomics. MetaboLights21,22,23 is an EMBL-EBI database for metabolomics experiments and derived information. It is cross-species and cross-technique, and covers metabolite structures and their reference spectra as well as their biological roles, locations and concentrations, and experimental data from metabolic experiments. Christoph’s team has reported one dataset24 in the data publication Scientific Data.