Open chemical information: where now and how?

Evan Bolton

Evan Bolton gave the award address on behalf of both awardees. Many people think that cheminformatics is a solved problem. “Open” is now a popular adjective: open learning, open access, open data, open government, open source, and so on. “Open” was much less of an “in” word when PubChem was conceived. Even so, there is still little openness when it comes to scientific data, and much remains to be done in the open space; openness is not widespread in drug discovery, for example. We have to empower researchers with ready access to information so that they do not repeat work that has already been done.

PubChem is an open archive; the data are free, accessible, and downloadable. Information is uploaded by depositors, normalized, and displayed, and it can then be downloaded by other researchers. Algorithms carry out the normalization, but sometimes they go wrong and introduce ambiguity; later processing of these ambiguous data can result in data corruption or error. For example, chemical file format interconversion can be “lossy”: when converting from SDF to SMILES, the coordinates are lost and stereochemistry must be perceived by algorithms. Different software packages may also “normalize” or convert a chemical structure in different ways. This variation has produced tens of different representations of nitro groups and azides in PubChem.
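
The nitro group case can be illustrated with a minimal sketch. It is not PubChem's normalization code: it is a toy, text-level rewrite that assumes only two variant spellings of the nitro group in SMILES and an arbitrarily chosen preferred form. Real normalizers work on the molecular graph, and the names `NITRO_VARIANTS`, `PREFERRED_NITRO`, and `normalize_nitro` are hypothetical.

```python
# Toy illustration only: real structure normalizers operate on the molecular
# graph, not on SMILES text. The variant list is an assumption for this sketch.
NITRO_VARIANTS = (
    "N(=O)=O",       # pentavalent nitrogen drawn with two double bonds
    "[N+]([O-])=O",  # charge-separated form, oxide written first
)
PREFERRED_NITRO = "[N+](=O)[O-]"  # arbitrarily chosen preferred spelling


def normalize_nitro(smiles: str) -> str:
    """Rewrite the known nitro spellings to the single preferred one."""
    for variant in NITRO_VARIANTS:
        smiles = smiles.replace(variant, PREFERRED_NITRO)
    return smiles


# Three depositor spellings of nitrobenzene.
deposited = [
    "c1ccccc1N(=O)=O",
    "c1ccccc1[N+]([O-])=O",
    "c1ccccc1[N+](=O)[O-]",
]

# After normalization the three records collapse to one representation.
normalized = {normalize_nitro(s) for s in deposited}
print(normalized)
```

Without a shared preferred form, the three strings would be treated as three different text records even though they depict the same compound.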

Atom environments have to be standardized. Data clean-up approaches include structure standardization; consistency filtering (name-structure matching, use of authoritative sources, and hand-curated black, gray, and white lists); chemical concepts (groupings of chemical names, setting a preferred concept for a given structure, and a preferred structure for a given concept); and cross-validation via text mining (to gather evidence to support the reported association of a chemical with other entities). A chemical structure may be represented in many different ways (tautomer and salt-form drawing variations are common, for example), and the chemical meaning of a substance may change with context (e.g., the solid form may be a hydrate, which affects the molecular weight used when weighing out a substance to make a solution). The boiling point of benzene appears as both 176.2°F and 200-500°F in PubChem Compound; the first value comes from a record for benzene itself, but the second comes from a record for coal tar oil (a crude form of benzene). There are many-to-many relationships between chemical concepts and chemical structures.
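
The hydrate point can be made concrete with a short worked calculation. The compound chosen here, copper(II) sulfate and its pentahydrate, is an assumption for illustration and was not named in the talk; the atomic masses are rounded standard values, and the function names are hypothetical.

```python
# Rounded standard atomic masses in g/mol; adequate for bench-scale arithmetic.
ATOMIC_MASS = {"H": 1.008, "O": 15.999, "S": 32.06, "Cu": 63.546}


def molar_mass(composition: dict) -> float:
    """Sum atomic masses over an element -> count mapping."""
    return sum(ATOMIC_MASS[element] * count for element, count in composition.items())


def grams_needed(molar_mass_g_mol: float, molarity: float, litres: float) -> float:
    """Mass of solid to weigh out for a solution of the given concentration."""
    return molar_mass_g_mol * molarity * litres


anhydrous = {"Cu": 1, "S": 1, "O": 4}              # CuSO4
pentahydrate = {"Cu": 1, "S": 1, "O": 9, "H": 10}  # CuSO4 with five waters

mm_anhydrous = molar_mass(anhydrous)
mm_hydrate = molar_mass(pentahydrate)

# Mass required for 1 L of a 0.1 mol/L solution, for each solid form.
grams_anhydrous = grams_needed(mm_anhydrous, 0.1, 1.0)
grams_hydrate = grams_needed(mm_hydrate, 0.1, 1.0)

print(f"anhydrous:    {mm_anhydrous:.2f} g/mol -> {grams_anhydrous:.2f} g")
print(f"pentahydrate: {mm_hydrate:.2f} g/mol -> {grams_hydrate:.2f} g")
```

The two forms differ by roughly 90 g/mol, so a record that does not say which form was meant leads to a solution that is off by more than a third.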

PubChem is successful because it is inclusive, free, robust, innovative, and helpful. If a chemical exists, you can often find it there. Evan singled out a few features of PubChem for particular mention. Substances are converted to compounds, but the original information is kept. There is clear provenance, so users can trace the depositor from whom the data came. Information is downloadable, and there are extensive programmatic interfaces. PubChem is constantly improved, can handle a lot of abuse, and is sustainable. The PubChem synonym classification was available first in RDF. It indicates the chemical name type and allows grouping of names, although it can involve guesswork. More authoritative name sources have been added. Most non-classified names are unhelpful (perhaps because of chemical name corruption, or because they are chemical name fragments).
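
As one illustration of programmatic access, the sketch below builds a request URL for PubChem's PUG REST service, which the talk summary does not name explicitly. The helper names `property_url` and `fetch_properties` are hypothetical, and the fetch function is defined but not called here, since it needs network access.

```python
import json
from urllib.parse import quote
from urllib.request import urlopen

PUG_REST_BASE = "https://pubchem.ncbi.nlm.nih.gov/rest/pug"


def property_url(name: str, properties: list) -> str:
    """Build a PUG REST URL asking for computed properties of a named compound."""
    encoded_name = quote(name, safe="")  # percent-encode spaces and slashes
    return (
        f"{PUG_REST_BASE}/compound/name/{encoded_name}"
        f"/property/{','.join(properties)}/JSON"
    )


def fetch_properties(name: str, properties: list) -> dict:
    """Fetch and decode the JSON response (requires network access)."""
    with urlopen(property_url(name, properties), timeout=30) as response:
        return json.load(response)


url = property_url("benzene", ["MolecularFormula", "MolecularWeight"])
print(url)
```

Calling `fetch_properties("benzene", ["MolecularWeight"])` would issue the request; anyone running such calls in bulk should first check PubChem's current usage policy.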

As more data are added, the scalability of PubChem becomes difficult to maintain. It is not uncommon to reach the limits of the technology. For example, PubChem could no longer use SQL databases for some queries because of performance bottlenecks. The team examined NoSQL technologies such as Solr/Lucene and identified better approaches. One example is PubChem’s structured data query (SDQ), which uses the Sphinx search engine to perform the query, but then fetches the data from an SQL database. SDQ is a query language that expresses clear logic in a concise format and communicates by means of a JSON object. It features a powerful search ability, a URL-accessible Common Gateway Interface (CGI), and easy application integration.
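
The two-stage pattern described for SDQ (a search engine resolves the query to identifiers, and an SQL database then supplies the records) can be sketched as follows. This is not the SDQ API: a plain Python dictionary stands in for Sphinx, an in-memory SQLite table stands in for the production database, and the JSON query shape (`any_of`, `limit`), the table layout, and the record identifiers are all assumptions made for illustration.

```python
import json
import sqlite3

# Stand-in for the SQL system of record (record_id values are made up).
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE compound (record_id INTEGER PRIMARY KEY, name TEXT, formula TEXT)")
db.executemany(
    "INSERT INTO compound VALUES (?, ?, ?)",
    [(1, "benzene", "C6H6"), (2, "nitrobenzene", "C6H5NO2"), (3, "toluene", "C7H8")],
)

# Stand-in for the search engine: an inverted index from token to record ids.
index = {}
for record_id, name, formula in db.execute("SELECT record_id, name, formula FROM compound"):
    for token in (name.lower(), formula.lower()):
        index.setdefault(token, set()).add(record_id)


def run_query(query_json: str) -> list:
    """Stage 1: resolve terms to ids in the index. Stage 2: fetch rows via SQL."""
    query = json.loads(query_json)
    ids = set()
    for term in query["any_of"]:
        ids |= index.get(term.lower(), set())
    ids = sorted(ids)[: query.get("limit", 10)]
    if not ids:
        return []
    placeholders = ",".join("?" * len(ids))
    sql = (
        "SELECT record_id, name, formula FROM compound "
        f"WHERE record_id IN ({placeholders}) ORDER BY record_id"
    )
    return db.execute(sql, ids).fetchall()


results = run_query('{"any_of": ["benzene", "toluene"], "limit": 10}')
print(results)
```

The design choice is that the expensive matching happens in an index built for search, while the SQL database only performs primary-key lookups, which it does cheaply.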

PubChem faces many challenges. One is growth: 50% of the resources of the project are needed just to keep scaling the system. Government mandates (like the current HTTPS-only edict) necessitate regular migrations. Data clean-up and the prevention of error proliferation require constant vigilance. The team uses existing technology where possible, but solutions do not always exist; where they do not, new ones must be developed for PubChem to remain scalable.

Chemical structure databases have come a long way since the origins of computerization in the 1960s, and the rise of databases such as CAS REGISTRY and Beilstein in the 1970s. The 2010s are the era of large, open chemical databases of aggregated content, with RESTful programmatic access. These large open collections of tens of millions of chemical structures need methods to lock down the data without curation; otherwise, the lack of curation combined with open exchange of data leads to error proliferation. Digital standards are needed to improve chemical data exchange, and chemical data clean-up methods are needed to prevent error proliferation. Close attention to provenance, and a set of clear definitions for chemical concepts, are also needed.

ACS CINF held a data summit at the spring 2016 meeting in San Diego. Ten half-day symposia were held over five days, with over 70 speakers, including experts from different related domains. The summit helped to identify informatics “pain points” for which we need to find solutions. The Research Data Alliance and IUPAC held a follow-up workshop in July at EPA, where a number of projects were discussed. One, on chemical structure standardization education and outreach, aims to help chemists and other stakeholders to understand the issues of chemical structure standardization. Another, updating IUPAC’s graphical representation guidelines, addresses the standardization issues that are often apparent in chemical depiction. Other recommendations concern open chemical structure file formats, and best practices in normalizing chemical structures. There are plans to develop a small-scale ontology of chemical terms, using terms in the IUPAC Orange Book as a case study. A project on the IUPAC Gold Book data structure is related to a current effort to extract the content and term identifiers, and convert them into a more accessible and machine-digestible format for increased usability. Finally, a scoping project on use cases for semantic chemical terminology applications will research the current chemical data transfer and communication landscape for potential applications of semantic terminology.

We are entering a new era: in the 2020s we will have large, extensively machine-curated, open collections, with clear provenance, and standard approaches to file formats and normalization, where errors do not proliferate, and links are cross-validated. Open knowledge bases will emerge that contain all open scientific knowledge that is computable (i.e., inferences can be drawn using natural language questions). By the 2030s machine-based inference will drive the majority of scientific questions, and efficiency of research will grow exponentially by harnessing “full” scientific knowledge.

In all, accurate computer interpretation of scientific information content is paramount. It needs to be at or above the level of the human scientist for this vision of the future to occur. It will be the great achievement of our generation to make this leap forward. Improved chemical information standards and uniform approaches will be critical for it to occur.