Enabling Machines to “Read” the Chemical Literature: Techniques, Case Studies and Opportunities

This symposium covered many themes: text-mining for chemicals, genes, and proteins; relating chemical entities to ontologies; extraction of chemical properties (especially from tables) and their association with compounds; interpreting structures in the CHEMKIN format; and chemical image-to-structure conversion. Broadly, the symposium was organized to flow from text-mining of patents, through general text-mining, to mining chemical structure information from images.

Obdulia Rabal (University of Navarra) gave a talk entitled “CHEMDNER-Patents: automatic recognition of chemical and biological entities in patents.” She described the upcoming CHEMDNER-Patents challenge, which comprises three tasks for participating systems: chemical entity recognition, chemical passage detection, and gene and protein recognition. For the challenge, a corpus of 21,000 patent abstracts was assembled from various patent offices (WIPO, EPO, USPTO, CIPO, DPMA, SIPO) and manually annotated for chemicals, genes, and proteins. The entities are further classified by mention type into seven classes for chemicals and eight for genes and proteins, for example FAMILY for a family of compounds. The results will appear in the proceedings of the upcoming BioCreative V workshop. The corpus and further information are available from http://www.biocreative.org.

George Papadatos (European Bioinformatics Institute, EBI) gave a talk entitled “SureChEMBL: an open patent chemistry resource,” in which he gave an overview of SureChEMBL’s functionality. Through a collaboration with Open PHACTS, SciBite is now providing biochemical annotations, for example for genes, proteins, and diseases. Currently these annotations are generated on demand, but it is hoped that they can be integrated into the SureChEMBL database and made available via the Open PHACTS API later this year. About 80,000 novel compounds are being added to SureChEMBL each month. George finished with an overview of EBI’s work using the RDKit to analyze the chemical space of a patent and identify its key compounds. SureChEMBL is available from www.surechembl.org.
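The analysis itself was not presented as code; the following is a minimal sketch of one way a “key compound” could be picked out of a patent’s chemical space with the RDKit, assuming the patent’s structures are available as SMILES (the patent_smiles list is a hypothetical stand-in). It simply selects the medoid: the compound most similar, on average, to all the others.

```python
from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem

# Hypothetical input: SMILES extracted from a single patent.
patent_smiles = ["CCOC(=O)c1ccccc1", "CCOC(=O)c1ccccc1C", "CC(=O)O", "CCN"]

mols = [m for m in (Chem.MolFromSmiles(s) for s in patent_smiles) if m is not None]
fps = [AllChem.GetMorganFingerprintAsBitVect(m, 2, nBits=2048) for m in mols]

# Score each compound by its mean Tanimoto similarity to the rest;
# the medoid (highest mean similarity) is a crude "key compound" proxy.
best_idx, best_score = 0, -1.0
for i, fp in enumerate(fps):
    sims = DataStructs.BulkTanimotoSimilarity(fp, fps)
    mean_sim = (sum(sims) - 1.0) / (len(sims) - 1)  # exclude self-similarity
    if mean_sim > best_score:
        best_idx, best_score = i, mean_sim

print("Candidate key compound:", Chem.MolToSmiles(mols[best_idx]))
```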

Christopher Southan (University of Edinburgh) gave a talk entitled “Deuterogate: causes and consequences of automated extraction of patent-specified virtual deuterated drugs feeding into PubChem.” Chris discussed the issues raised by the large number of deuterated compounds being added to PubChem, primarily extracted from the USPTO’s Complex Work Units (CWUs), the ChemDraw files associated with each structure image in a patent. Many of these are simply deuterated versions of existing drugs, and it is highly unlikely that the vast majority have ever been synthesized, so they represent a growing source of virtual compounds. While there was a surge in deuterated compound disclosures five years ago, the number disclosed in recent years has declined, though it remains significant.
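No code was shown in the talk; purely as an illustration, here is a sketch of how such virtual deuterated analogues could be flagged with the RDKit: erase all isotope labels and check whether the skeleton collapses onto a known drug (the known_drugs dictionary is hypothetical).

```python
from rdkit import Chem

# Hypothetical reference set of known, non-deuterated drugs (canonical SMILES -> name).
known_drugs = {Chem.CanonSmiles("CC(=O)Oc1ccccc1C(=O)O"): "aspirin"}

def parent_of_deuterated(smiles):
    """Return the known drug this structure collapses onto once isotope labels are erased."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None or not any(a.GetIsotope() for a in mol.GetAtoms()):
        return None  # unparsable, or carries no isotope labels
    for atom in mol.GetAtoms():
        atom.SetIsotope(0)  # [2H] becomes an ordinary hydrogen
    mol = Chem.RemoveHs(mol)  # drop the now-ordinary explicit hydrogens
    return known_drugs.get(Chem.MolToSmiles(mol))

# A d3-acetyl analogue collapses onto aspirin.
print(parent_of_deuterated("[2H]C([2H])([2H])C(=O)Oc1ccccc1C(=O)O"))
```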

Lutz Weber (OntoChem) gave a talk entitled “Evaluating U.S. patent full-text documents with chemical ontologies.” Lutz described OntoChem’s UIMA-based OCMiner pipeline for document annotation. The system supports a wide variety of entity types, for example chemistry, proteins, anatomy, species, diseases, and cell lines. Lutz presented benchmarks on the ChEBI patent set, showing high precision (96%) on long names, but he cautioned against uncritical use of the corpus: for shorter entities, 65% of the system’s apparent false positives were annotation omissions in the corpus rather than true errors. He also presented advances in OCMiner’s formula detector. He then described some of the uses of chemical ontologies, both structure- and usage-based; for example, knowing that a compound is an anti-infective implies that it is an antibacterial or an antiviral. Looking specifically at the challenges of structure-based ontology classification, he presented a comparison between OntoChem’s SODIAC system and the similar software ClassyFire (from the University of Alberta), highlighting some of the challenges in harmonizing ontology representation. Finally, Lutz presented some use cases for chemical ontologies, for example homonym resolution, document classification, and anaphora resolution.
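Neither SODIAC nor ClassyFire was shown in code; the sketch below illustrates the general idea of structure-based ontology assignment, using the RDKit with invented SMARTS class definitions and a toy is-a hierarchy.

```python
from rdkit import Chem

# Invented ontology: class name -> defining SMARTS pattern.
CLASS_SMARTS = {
    "carboxylic acid": "[CX3](=O)[OX2H1]",
    "primary amine": "[NX3;H2][#6]",
}
# Invented is-a hierarchy: child -> parent.
PARENT = {"carboxylic acid": "organic acid", "organic acid": "chemical entity",
          "primary amine": "amine", "amine": "chemical entity"}

def classify(smiles):
    """Assign ontology classes by substructure match, then walk up the is-a links."""
    mol = Chem.MolFromSmiles(smiles)
    classes = set()
    for name, smarts in CLASS_SMARTS.items():
        if mol.HasSubstructMatch(Chem.MolFromSmarts(smarts)):
            classes.add(name)
            parent = PARENT.get(name)
            while parent:  # propagate membership to ancestor classes
                classes.add(parent)
                parent = PARENT.get(parent)
    return classes

print(classify("NCCC(=O)O"))  # beta-alanine: both an amine and an acid
```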

Valery Tkachenko (Royal Society of Chemistry, RSC) gave a talk entitled “Text-mining to produce large chemistry datasets for community access.” He covered the RSC’s recent collaboration with NextMove Software on text-mining melting points and NMR data from the U.S. patent literature. Over 200,000 melting points were extracted and used to build a model to predict melting points. This model’s errors were comparable to the observed experimental error in the patent data, and its predictions were more accurate than those of a previous model built from smaller, more highly curated data sets. Valery demonstrated trends over time in 1H NMR spectrometer frequency and in the number of NMR spectra extracted. Finally, he described the RSC’s upcoming experimental data repository, which will have specific support for various experimental properties, for example chemical reactions, measured properties (melting points, logP, etc.), and spectra.
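The RSC/NextMove model itself was not presented; the following generic sketch shows how a melting point regressor could be trained on such extracted data, using RDKit descriptors and scikit-learn (the five training pairs are stand-ins for the 200,000 extracted values).

```python
from rdkit import Chem
from rdkit.Chem import Descriptors
from sklearn.ensemble import RandomForestRegressor

def features(smiles):
    """A tiny, illustrative descriptor set; a real model would use many more."""
    m = Chem.MolFromSmiles(smiles)
    return [Descriptors.MolWt(m), Descriptors.MolLogP(m),
            Descriptors.TPSA(m), Descriptors.NumRotatableBonds(m)]

# Stand-in training data: (SMILES, melting point in degrees C) pairs.
train = [("c1ccccc1", 5.5), ("CC(=O)O", 16.6), ("O=C(O)c1ccccc1", 122.4),
         ("CCO", -114.1), ("NC(N)=O", 133.0)]

X = [features(s) for s, _ in train]
y = [mp for _, mp in train]
model = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)

print("Predicted mp:", model.predict([features("CC(=O)Oc1ccccc1C(=O)O")])[0])
```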

Richard West (Northeastern University) gave a talk entitled “Identifying chemical species in combustion models.” He talked about the challenges of interpreting the CHEMKIN format, which is prevalent in combustion research. The format is fixed-width and relies on nicknames to identify the chemical species involved. Unfortunately, these nicknames are frequently ambiguous, and different models may use different nicknames for the same species. Richard has developed a system for deducing the intended structure of a chemical species from its nickname, by solving constraints on what the compound can be based on the reactions it takes part in (he used the analogy of solving a Sudoku puzzle). Using this technique, Richard has identified cases where the same species has erroneously been included in a model twice, and cases where the same species has vastly different predicted combustion energies in different models.
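Richard’s system was not shown in code; the toy example below captures the Sudoku-like idea: candidate elemental formulas for each nickname are pruned by requiring every reaction in the mechanism to balance elementally (the nicknames and candidates are invented).

```python
from itertools import product
from collections import Counter

# Invented candidate elemental formulas for each species nickname.
candidates = {
    "C3H6": [{"C": 3, "H": 6}],                     # unambiguous formula...
    "ALLYL": [{"C": 3, "H": 5}, {"C": 3, "H": 7}],  # ...ambiguous nickname
    "H": [{"H": 1}],
}
# One reaction read from the mechanism: C3H6 = ALLYL + H
reactions = [({"C3H6": 1}, {"ALLYL": 1, "H": 1})]

def balances(assignment, lhs, rhs):
    """Check that a candidate assignment conserves every element across a reaction."""
    def side(spec):
        total = Counter()
        for name, coeff in spec.items():
            for element, n in assignment[name].items():
                total[element] += coeff * n
        return total
    return side(lhs) == side(rhs)

names = list(candidates)
for combo in product(*(candidates[n] for n in names)):
    assignment = dict(zip(names, combo))
    if all(balances(assignment, lhs, rhs) for lhs, rhs in reactions):
        print(assignment)  # only ALLYL = C3H5 survives the constraint
```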

Tong-Ying (Tony) Wu (Linguamatics) gave a talk entitled “Text mining the chemical literature to find chemicals in context.” Tony talked about using Linguamatics I2E to extract data from tables. This included associating compound numbers with compounds and resolving compounds referenced by number in tables. Trickier cases (e.g., resolving the meaning of “+++” when used as a measure of activity) are also supported. Linguamatics has identified patterns allowing high-precision identification of chemical compounds and plans to use them to construct annotated corpora from English and Chinese text.
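I2E is proprietary and its internals were not shown; as a generic illustration, a resolver might first harvest definitions such as “compound 12 (imatinib)” from the running text and then substitute names for bare numbers in table rows (the pattern and data below are hypothetical).

```python
import re

text = "Compound 12 (imatinib) was prepared as above. Compound 7 (gefitinib) was used as a control."
table_rows = [("12", "+++"), ("7", "+")]

# Harvest "compound <number> (<name>)" definitions from the running text.
definitions = dict(re.findall(r"[Cc]ompound\s+(\d+)\s+\(([^)]+)\)", text))

# Resolve bare compound numbers in table rows to names.
for number, activity in table_rows:
    name = definitions.get(number, "unknown")
    print(f"compound {number} = {name}, activity = {activity}")
```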

Daniel Lowe (NextMove Software) gave a talk entitled “Unlocking chemical information from tables and legacy articles.” He talked about how he uses grammars to describe chemical properties, such as melting points and NMR spectra; these grammars are then used to generate finite-state machines that recognize and parse the properties efficiently. He also covered some of the challenges of table extraction from U.S. patents. For example, the XML provided by the USPTO describes the appearance of tables rather than their semantics, so a multi-line row is presented as multiple rows, requiring heuristics to recover the intended structure. Finally, Daniel talked about promising work on extracting melting points and NMR data from post-2000 Royal Society of Chemistry journal articles, and the difficulties of extracting data from older articles: headings and paragraphs must be perceived from PDF, and OCR errors are common (especially in compound names and in important symbols such as the degree symbol).
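The grammar-to-state-machine machinery itself was not presented; as a stand-in, the regular expression below (regexes themselves compile to finite automata) recognizes a few simple melting point reports. The real grammars cover far more variation.

```python
import re

# A toy melting point "grammar" as a regex; matches forms like "mp 123-125 degC".
MP = re.compile(
    r"m\.?\s?p\.?\s*:?\s*(\d+(?:\.\d+)?)(?:\s*[-–]\s*(\d+(?:\.\d+)?))?\s*°?\s*C",
    re.IGNORECASE)

for line in ["mp 123-125 °C", "M.p.: 56 °C", "white solid, mp 98.5°C"]:
    match = MP.search(line)
    if match:
        low, high = match.group(1), match.group(2) or match.group(1)
        print(f"{line!r} -> {low}..{high} °C")
```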

Igor Filippov (VIF Innovations) gave a talk entitled “Chemical structure identification and retrieval with OSRA.” OSRA is a tool for converting chemical structure diagrams to computer-readable structure formats. Igor discussed the segmentation procedure used to detect which parts of an image contain a chemical diagram. The precision of this process has been significantly improved in the most recent version (2.0.1) with minimal loss of recall. Notably, OSRA also supports the recognition of chemical reactions from diagrams. Future improvements to OSRA will be driven by feedback from the user community.
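OSRA’s actual segmentation procedure was not described in code; the sketch below conveys the flavour of such a step: label the connected components of a binarized page and keep those whose size and ink density look diagram-like (all thresholds here are invented).

```python
import numpy as np
from scipy import ndimage

def diagram_like_regions(binary_page):
    """Return bounding boxes of connected components that look like structure diagrams."""
    labels, _ = ndimage.label(binary_page)  # 1 = ink, 0 = background
    boxes = []
    for box in ndimage.find_objects(labels):
        height = box[0].stop - box[0].start
        width = box[1].stop - box[1].start
        density = binary_page[box].mean()  # fraction of ink within the bounding box
        # Invented heuristics: diagrams tend to be large but sparsely inked,
        # whereas text lines are short and dense.
        if height > 50 and width > 50 and density < 0.2:
            boxes.append(box)
    return boxes

# A sparse, diagram-sized hollow rectangle of "ink" on an otherwise empty page.
page = np.zeros((300, 300), dtype=np.uint8)
page[100, 80:230] = page[219, 80:230] = 1
page[100:220, 80] = page[100:220, 229] = 1
print(len(diagram_like_regions(page)), "candidate diagram region(s)")
```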

Bryn Reinstadler (IBM Almaden) gave a talk entitled “P-OSRA: translating polymer images to text using extensions of open source software.” Bryn discussed the need for building databases of polymer structures. IBM is tackling this in multiple ways, with one group working on polymer name-to-structure conversion, while Bryn’s work covers the conversion of polymers represented as images. P-OSRA is an extension of OSRA that allows the repeat brackets frequently drawn around repeat units to be recognized. The system works by removing the brackets and passing the bracket-less structure to OSRA; after recognition, the positions of the brackets are used to infer which atoms they enclose, and an extension of SMILES is used to capture this information. Multiple repeat brackets, as found in block copolymers for example, are also supported.
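P-OSRA’s geometry step was not shown; the toy function below illustrates the idea: after recognition, atoms whose image coordinates fall between a matched pair of vertical repeat brackets are taken as the repeat unit (the coordinates and example are invented).

```python
def atoms_in_repeat_unit(atom_coords, left_bracket_x, right_bracket_x):
    """Return indices of atoms lying between a pair of vertical repeat brackets."""
    return [i for i, (x, y) in enumerate(atom_coords)
            if left_bracket_x <= x <= right_bracket_x]

# Invented example: five atoms along a chain, brackets drawn around atoms 1-3.
coords = [(10, 50), (30, 50), (50, 50), (70, 50), (90, 50)]
print(atoms_in_repeat_unit(coords, left_bracket_x=25, right_bracket_x=75))  # -> [1, 2, 3]
```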

Aniko Valko (Keymodule) gave a talk entitled “Practical case studies of the application of CLiDE for the efficient extraction of chemical structures from documents.” CLiDE is a tool for extracting structures from chemical structure diagrams presented as images. This can either be done semi-automatically, in which case results are manually checked and can be edited (CLiDE Standard or Professional), or fully automatically for bulk extraction (CLiDE Batch). Aniko presented four case studies of CLiDE’s use: a WO patent, a Japanese patent, a European patent, and a journal article. These highlighted some of the more challenging cases that CLiDE supports, for example the identification and extraction of structures from tables. A common cause of problems was poor source image quality, for example unclean lines or even rows of missing pixels. CLiDE uses a filter to ignore structures that appear to have been generated from non-structure sources, for example graphs and text. This filter was repeatedly shown to be highly discriminating, often removing 90% of the “garbage” structures. While R-group labels are supported, more complex Markush features, such as positional variation, frequency variation, and attachment-point indicators, are not yet supported.
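The details of CLiDE’s filter were not given; a crude analogue is sketched below, rejecting recognized structures that fail RDKit sanitization or look implausibly small or fragmentary (the thresholds are invented).

```python
from rdkit import Chem

def plausible_structure(smiles, min_atoms=5, max_fragments=2):
    """Crude sanity filter for structures recognized from images."""
    mol = Chem.MolFromSmiles(smiles)  # None if the SMILES fails sanitization
    if mol is None:
        return False
    if mol.GetNumAtoms() < min_atoms:
        return False  # likely stray text or an axis label
    if len(Chem.GetMolFrags(mol)) > max_fragments:
        return False  # heavily fragmented output suggests a non-structure source
    return True

for s in ["c1ccccc1C(=O)O", "C", "C.C.C.C.C", "c1ccc1"]:
    print(s, plausible_structure(s))
```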

Daniel Lowe, Symposium Organizer