Evolution of open chemical information

Valery Tkachenko

Valery Tkachenko of the Royal Society of Chemistry (RSC) continued the theme of open data in chemistry. Everything changed in 1992 with the arrival of the World Wide Web. Later, PubChem changed the world of chemical information. ChemSpider, a structure-centric hub for Web searching, now contains 57 million chemical compounds from over 500 different sources, and deposition of data is ongoing. It differs from PubChem in that curation and annotation are crowdsourced. ChemSpider has analytical data, text and literature references, and data on compounds and reactions. NextMove Software’s text-mining software has been used to analyze reactions from the RSC archive of journal articles, output CML, and break down each procedure summary into steps.

We are moving into the world of the Internet of Things and phones with modular, replaceable parts. Gartner has identified the Top 10 Strategic Technology Trends for 2016. Our world is hyperconnected, and connections require standards. The IUPAC “color books” took years to write, and thus data quality issues arose. Evan Bolton has referred to the proliferation of errors in public and private databases as “robochemistry”. Manual curation of huge databases is not feasible, but automatic quality control systems such as RSC’s Chemical Validation and Standardization Platform (CVSP) can be developed. CVSP allows users to upload chemical structure files, which are then validated, and optionally standardized, in preparation for publication or submission to a chemical database. About 200 rules have been encoded and expressed as XML to check for errors in, for example, the depiction of stereochemistry. The community can amend these rules. The structure’s relationship to names, SMILES, and other identifiers also needs checking.
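
The idea of validation rules held as XML data, rather than hard-coded, can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the XML layout, the rule identifiers, and the function names load_rules and validate are hypothetical and are not the actual CVSP schema or API. The three rules are simple regular-expression checks on SMILES strings, far cruder than real structure validation, which would need a cheminformatics toolkit.

```python
import re
import xml.etree.ElementTree as ET
from dataclasses import dataclass

# Hypothetical rule file: the element and attribute names are invented
# for this sketch and do not reflect the real CVSP rule format.
RULES_XML = r"""
<rules>
  <rule id="empty-structure" severity="error" pattern="^\s*$">
    <message>Structure is empty</message>
  </rule>
  <rule id="unclosed-bracket-atom" severity="error" pattern="\[[^\]]*$">
    <message>Bracket atom is opened but never closed</message>
  </rule>
  <rule id="wildcard-atom" severity="warning" pattern="\*">
    <message>Structure contains a wildcard atom</message>
  </rule>
</rules>
"""


@dataclass
class Rule:
    rule_id: str
    severity: str
    pattern: re.Pattern
    message: str


def load_rules(xml_text):
    """Parse the (hypothetical) XML rule set into Rule objects."""
    root = ET.fromstring(xml_text.strip())
    rules = []
    for node in root.findall("rule"):
        rules.append(
            Rule(
                rule_id=node.get("id"),
                severity=node.get("severity"),
                pattern=re.compile(node.get("pattern")),
                message=node.findtext("message").strip(),
            )
        )
    return rules


def validate(smiles, rules):
    """Return (rule_id, severity, message) for every rule the string triggers."""
    return [
        (rule.rule_id, rule.severity, rule.message)
        for rule in rules
        if rule.pattern.search(smiles)
    ]


if __name__ == "__main__":
    rules = load_rules(RULES_XML)
    for smiles in ["CCO", "C[C@H", "C*"]:
        print(smiles, validate(smiles, rules))
```

Because the rules live in data, amending one means editing the XML and not the program, which is the property that would let a community maintain such a rule set.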

Knowledge from the past is used to derive wisdom. The Open PHACTS discovery platform has been developed to reduce barriers to drug discovery in businesses and academia. It contains multiple data sources, integrated and linked together so that users can easily see the relationships between compounds, targets, pathways, diseases, and tissues. The platform has been used to answer complex questions in drug discovery. It was built in collaboration with a large consortium of organizations involved in drug discovery, and is founded on Semantic Web and linked data principles. RSC developed the chemical data handling software for Open PHACTS.

A high percentage of raw data is lost in the science data publishing workflow. Horizon 2020 is a very large EU research and innovation program. It already mandates open access to all scientific publications; from 2017, research data are open by default, with possibilities to opt out. In the era of Uber, transportation is now a commodity. Will scientific data become a commodity by 2020? How will publishers cope? Authorities have moved from centralized to decentralized to distributed, as we have moved into the hyperconnected world. We are on the verge of a new technical revolution; RSC is excited, and is ready to ride high on the wave of data science developments.