Hunting for Hidden Treasures: Chemical Information in Patents and Other Documents

Scientific and legal documents, such as chemical patents, journal articles, internal documents, and other publications, contain a huge chemical space that is an important resource of intellectual property. For historical reasons and because of technical limitations, however, much of this space is not indexed or digitized, and extracting and using this information has long been a challenging task. This symposium included a series of discussions of current developments in analyzing the chemical space in documents, which can benefit not only scientists in the pharmaceutical industry and academia, but also individuals in cheminformatics, publishing, patent law, and government agencies.

Since there had not been a similar symposium before, interest in this one was overwhelming, with twenty-two abstract submissions. All talks were grouped into three half-day sessions. The Sunday morning session focused on Markush structure analysis in chemical patents, Sunday afternoon on exemplified structure analysis in patents, and Monday afternoon on chemical information in non-patent documents. All sessions were organized and chaired by David Deng (ChemAxon). The agenda and abstracts of all sessions are available here. (With the permission of the authors, some presentation slides are available online. The links are inserted under the author names.)

Although all CINF meeting rooms were far away from the convention center and the COMP sessions this time, the distance did not deter attendees. All three sessions were well attended, with 40-50 participants.

Sunday Morning: Markush Structures in Patents

Markush structures are widely used in chemical patents to define large chemical spaces, and they contain essential chemical information for patent analysis. However, the flexibility and complexity of Markush structures preclude easy transformation from patent document to digital format. Currently, two organizations have systematically indexed most chemical patents: Thomson Reuters and Chemical Abstracts Service. After the opening remarks, the symposium started with talks by representatives from both organizations.
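
To give a feel for how quickly a Markush claim grows, the following minimal Python sketch enumerates a toy scaffold with two R-group positions. The scaffold template, the R-group lists, and the function names are hypothetical and chosen only for illustration. Real Markush structures also involve variable attachment points, repeating units, and nested definitions that this simple string substitution does not model, and the sketch does not describe how any of the indexing systems discussed here work.

```python
from itertools import product
from math import prod

# Hypothetical toy Markush claim: a benzene ring with two variable positions.
# The SMILES-like strings are illustrative placeholders only.
SCAFFOLD = "c1cc({R1})ccc1{R2}"
R_GROUPS = {
    "R1": ["C", "CC", "OC", "Cl"],
    "R2": ["N", "O", "F"],
}

def library_size(r_groups):
    """Number of specific structures covered: product of the R-group list sizes."""
    return prod(len(options) for options in r_groups.values())

def enumerate_markush(scaffold, r_groups):
    """Yield every specific structure by substituting each R-group combination."""
    names = list(r_groups)
    for combo in product(*(r_groups[name] for name in names)):
        yield scaffold.format(**dict(zip(names, combo)))

structures = list(enumerate_markush(SCAFFOLD, R_GROUPS))
print(library_size(R_GROUPS), "structures, e.g.", structures[0])
```

Because the library size is the product of the list sizes, a claim with ten positions and twenty alternatives at each already covers 20**10 (about 10^13) structures, which is why exhaustive enumeration is rarely practical.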

Donald Walter (Thomson Reuters) talked about how Thomson Reuters indexes Markush structures and about the coverage of its Markush database. He also demonstrated how one can enumerate, filter, and search this database using ChemAxon's Markush technology. Roger Schenck (Chemical Abstracts Service) described how CAS builds its content from patents and literature, and gave illustrative examples of how CAS treats inconsistencies in the documents and translated literature.

In addition to the two giants in patent Markush indexing, there are also smaller, independent organizations that index Markush structures on their own. Jayaraman Packirisamy (Sristi Biosciences) was scheduled to report how his company built Markush databases of natural products, for example a cancer database of over 1500 Markush scaffolds from patents covering almost all cancer targets. The curation is also done with ChemAxon's Markush technology. Unfortunately, Packirisamy could not come to the conference to deliver the presentation in person.

After the first three presentations had introduced the complex nature of Markush structures and the tedious process of indexing them, it was natural to ask whether the indexing could be done automatically. In this context, Josef Eiblmaier (InfoChem) talked about ChemProspector, a five-year project to automatically extract Markush structures from patent documents. ChemProspector uses image recognition technology to extract the Markush scaffold, then scans the text to extract chemical named entities as R-group definitions and retrieve Markush structure variations. For nested R-group recognition, ChemProspector obtains satisfactory results for level-1 R-groups and reasonable results for level-2 R-groups. However, deeply nested R-groups (level 3 and beyond) are still very challenging to retrieve accurately.

After four talks on Markush curation, the next three presentations dealt with patent analysis.

Daniel Lowe (NextMove Software) described a system for automatically downloading patent applications from various sources, correcting and extracting relevant chemical information, and indexing and storing the results in a searchable database. These structures can be used to identify novel scaffolds or as keys to cluster patents.

David Cosgrove (AstraZeneca) gave an overview of a new system for encoding and searching Markush structures and of a structure-activity relationship analysis of chemical patents. The Periscope system uses a new language (MIL) to describe a Markush structure and has a graphical interface to display Markush structures. After exemplified structures and activity values are extracted, the structures go through R-group decomposition. The R-group fragments and the activities are then used for Free-Wilson analysis, yielding an improved result.
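
The idea behind Free-Wilson analysis is that each R-group fragment contributes additively to the activity. The Python sketch below illustrates this for a deliberately simplified case and is not the Periscope implementation: the fragments, the activity values, and the function names are invented, and the averaging shortcut is valid only because the toy data set is complete and balanced (every R1/R2 combination appears exactly once). Real patent data are sparse and unbalanced and would require a proper least-squares regression on fragment indicator variables.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical data: (R1 fragment, R2 fragment) -> activity (e.g. pIC50).
# Every R1/R2 combination is present exactly once (a balanced design).
ACTIVITIES = {
    ("H", "OMe"): 5.0, ("H", "Cl"): 5.6,
    ("Me", "OMe"): 5.4, ("Me", "Cl"): 6.0,
    ("F", "OMe"): 6.1, ("F", "Cl"): 6.7,
}

def free_wilson_balanced(activities):
    """Estimate additive fragment contributions for a balanced data set.

    For a complete, balanced design the least-squares solution of the additive
    model (contributions at each position summing to zero) is the mean activity
    of the compounds carrying a fragment minus the overall mean.
    """
    baseline = mean(activities.values())
    by_fragment = defaultdict(list)
    for fragments, activity in activities.items():
        for position, fragment in enumerate(fragments, start=1):
            by_fragment[(f"R{position}", fragment)].append(activity)
    contributions = {key: mean(vals) - baseline for key, vals in by_fragment.items()}
    return baseline, contributions

def predict(baseline, contributions, fragments):
    """Predicted activity = baseline + sum of the fragment contributions."""
    return baseline + sum(
        contributions[(f"R{i}", frag)] for i, frag in enumerate(fragments, start=1)
    )

baseline, contributions = free_wilson_balanced(ACTIVITIES)
for key, value in sorted(contributions.items()):
    print(key, round(value, 2))
```

Once the contributions are known, `predict` can score R-group combinations that are covered by the Markush claim but were never exemplified, which is what makes the approach attractive for patent analysis.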

Christopher Kibbey (Pfizer) discussed his research on patent structure analysis at Pfizer. His team uses reduced graphs and generates fragment fingerprints to represent a structure. These reduced graphs are compatible with Markush structure variations. They can be used to overlay structures, provide "similarity-like" scores, and perform "substructure-like" matches. To generate a representative subset of a Markush library, his group chooses "level enumeration," which uses only the first instance of each R-group definition during enumeration. Combined with reduced graphs, this allows a Markush library to be compared easily to a query structure, which provides a valuable IP assessment.

Sunday Afternoon: Exemplified Structures in Patents

Besides Markush structures, a patent also contains many exemplified structures and prophetic structures. They are often scattered throughout the documents as images or text. The Sunday afternoon session discussed developments in technologies, such as OSR (image to structure), OCR (text to structure), text mining, and others, to extract and analyze these structures from patents. Interestingly, seven of the eight speakers in this half-day session represented European companies. Does this mean Europe is leading in patent analysis?

The first two talks discussed OSR technology. Rostislav Chutkov (GGA) presented Imago, an open-source OSR toolkit. Advanced structure features, such as crossing bonds, abbreviated groups, and R-groups, are supported. Results from Imago can also be improved by tuning the method with a training set of images and structures. Aniko Valko (Keymodule) introduced the latest developments in CLiDE. From version 3.2.0 to 5.5.4, major improvements have been achieved along with shorter run times. CLiDE is now better at retrieving atom labels, functional groups written as formulas, stereochemistry, and structures in tables, and at removing noise.

The next three talks were about OCR technology and name-to-structure conversion.

Roger Sayle (NextMove Software) talked about automatic spelling correction after OCR. Because of the limitations of OCR technology, text converted from non-text documents often contains errors. Effective automatic “spelling” correction can significantly improve chemical entity extraction. The same technology can also be applied to protein target names or even non-alphabetic entities such as CAS Registry Numbers.
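
One reason CAS Registry Numbers lend themselves to automatic correction is that the last digit is a checksum over the preceding digits. The Python sketch below validates a candidate number and tries single-character substitutions for a few common OCR confusions. The confusion table and the function names are assumptions made for illustration and do not describe the NextMove implementation.

```python
import re

# Hypothetical table of OCR look-alikes (a letter read in place of a digit).
OCR_CONFUSIONS = {"O": "0", "o": "0", "l": "1", "I": "1", "S": "5", "B": "8"}

# Two to seven digits, two digits, and a single check digit.
CAS_PATTERN = re.compile(r"(\d{2,7})-(\d{2})-(\d)")

def is_valid_cas(candidate):
    """Check the format and the check digit of a CAS Registry Number."""
    match = CAS_PATTERN.fullmatch(candidate)
    if not match:
        return False
    body = match.group(1) + match.group(2)
    # Weight the digits 1, 2, 3, ... from right to left, excluding the check digit.
    total = sum(weight * int(d) for weight, d in enumerate(reversed(body), start=1))
    return total % 10 == int(match.group(3))

def correct_cas(candidate):
    """Return valid CAS numbers reachable by fixing one confused character."""
    if is_valid_cas(candidate):
        return [candidate]
    fixes = []
    for i, ch in enumerate(candidate):
        if ch in OCR_CONFUSIONS:
            fixed = candidate[:i] + OCR_CONFUSIONS[ch] + candidate[i + 1:]
            if is_valid_cas(fixed) and fixed not in fixes:
                fixes.append(fixed)
    return fixes

print(is_valid_cas("7732-18-5"))  # water
print(correct_cas("5O-00-0"))     # letter 'O' misread for '0' (formaldehyde)
```

The sketch repairs only one character at a time. A production system would also need to handle multiple errors, dropped hyphens, and digits misread as other digits, where the checksum can flag the problem but cannot by itself identify the intended number.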

Lutz Weber (OntoChem) spoke about automated SAR extraction from patents. First, chemical information, including structures, compound classes, and biological effects, is extracted from patent texts. Second, relationships between the compounds and effects are analyzed for their syntax with an automated tool. Finally, normalized relationship n-tuples are generated, and a structure-activity relationship can be derived for search engines.

Daniel Bonniot (ChemAxon) provided an update on ChemAxon’s patent mining technology. Based on ChemAxon's Naming technology, Daniel and his colleagues have developed “Document to Structure,” a tool to extract all chemical information from images and text in documents. As a powerful tool for patent mining, it works with non-searchable PDFs, and all converted structures are returned with their locations in the document. Another tool, “Document to Database,” can pull documents from file systems and extract all chemical and biographical information. A free website has been set up to demonstrate extracting information from web pages and documents.

Alex Klenner (Fraunhofer SCAI) presented his research on the exploration and visualization of chemical information in patents. After pre-processing documents with ChemoCR and Tesseract, images and text are converted into structures. All structures are “stamped” into the original PDF as “pop-up” displays along with hyperlinks to public web services. Additionally, all retrieved structures are stored in a ChemAxon JChem database, enabling structure search and filtering options. This workflow can access grid resources for parallel processing.

Nicko Goncharoff (SureChem) presented the SureChem database of 12 million unique structures from US, EP, WO and JP patents. These structures are automatically extracted from patent images using CLiDE, and from text using OPSIN and ChemAxon’s Naming. The system also uses ChemAxon’s Structure Checker and Standardizer for structure validation, and is hosted on Amazon Cloud with ChemAxon’s JChem Cartridge for searching. All structures have been made publicly available in PubChem.

Amy Kallmerten (PerkinElmer) presented Structure Genius, a system that extracts structures from images in documents. All structures are indexed and stored in a centralized database for search and analysis.

Monday Afternoon: Chemical Information in Non-Patent Documents

Patent mining can be quite challenging, but extracting chemical information from other scientific documents, e.g., the internal document database of a global corporation, is not any easier. The last half-day session was dedicated to analyzing chemical information in all documents.

The session started with an overview of the challenges in chemical literature mining by Vidyendra Sadanandan (Molecular Connections). Different chemical entity recognition applications were summarized, and challenges in chemical text mining were outlined. Typical challenges include typographical errors, image formats, inconsistent terminology, legal uncertainty, and access costs.

The two major players in literature indexing, Thomson Reuters and Chemical Abstracts Service, both offer comprehensive literature searching. Robert Stembridge (Thomson Reuters) talked about the challenges of collaboration between information scientists and chemists, and about visualizing search results from Thomson Reuters databases. Jim Brown (FIZ Karlsruhe) spoke about numeric property searching in STN databases.

David Sharpe (Royal Society of Chemistry) spoke about extracting information from literature and correcting the errors therein. Two use cases were presented: the first was Project Prospect, which processes literature documents and generates enhanced HTML, and the second was fixing the chair forms of sugars and cyclohexanes.

Abraham Heifets (University of Toronto) presented SCRIPDB, a publicly accessible database of chemical structures and reactions. It contains over 10 million compounds found in over 100,000 patents granted since 2001. A case study using this database for synthetic accessibility analysis was discussed.

Guenter Grethe (for Akos GmbH) introduced CWM Global Search, which is a single user interface allowing for federated search over more than 60 scientific databases and drug discovery data sources publicly available on the internet. The search query can be chemical structures or names, CAS Registry Numbers, or free text.

SharePoint has been widely adopted as a repository for unstructured data within the enterprise. However, it lacks chemistry storage and search features. The last two talks in this symposium were about enabling chemical information extraction and searching in SharePoint.

Tamas Pelcz (ChemAxon) presented JChem for SharePoint (JC4SP), which allows many ChemAxon applications to be used in SharePoint. The user may import and view structures, and calculate properties in SharePoint lists and blogs. Powered by ChemAxon’s Document to Structure, JC4SP can also extract chemical information (names, SMILES, InChIs, CAS Registry Numbers, structure images, embedded structure objects, and even corporate IDs) from various document types. The extracted structures are indexed and searchable.

Rudy Potenzone (PerkinElmer Informatics) presented Search Genius, which can be used with SharePoint for chemical searching. It uses Microsoft FAST Search to identify and index embedded structures in documents. Search Genius can also be inserted into a SharePoint or E-Notebook front end for federated searching.


Various approaches to automating chemical information extraction and analysis were reported, and the challenges were thoroughly discussed at this symposium. There is no doubt that chemical information in documents is well hidden and that a treasure hunt faces many challenges. Sometimes satisfactory or even acceptable results cannot be obtained, particularly when dealing with chemical patents and Markush structures. However, a great many people have been working hard to build comprehensive databases and to develop powerful tools in this field, and more will certainly become available.

The symposium will probably be reconvened in a couple of years. Hopefully, with improvements in computing power and algorithms, we will hear more success stories from users.

David Deng, Symposium Organizer

Recorded content from six CINF symposia and poster sessions held at the Fall 2012 ACS National Meeting is at:
Free access for the ACS Members registered for the 2012 Fall National Meeting,
Paid access for ACS Members not registered for the Meeting and non-ACS Members.