Global Opportunities in Chemical Information

Rachelle Bienstock kicked off the session by asking whether emerging markets will really save pharma. She cited statistics that emerging markets, currently $154B or 18% of worldwide revenue, are forecast to rise to $487B or a 37% share by 2020. JACS spotlights are now being translated into five languages, with Chinese at the top of the list.

Roger Sayle (NextMove Software) described his work building automatic translation of Chinese chemical names. Non-English chemistry is showing up frequently. Even some large pharmaceutical companies whose ELNs are supposed to be written in English are finding non-English pages in their archives. A Google search for “benzoic acid” returns only slightly more pages than a search for the equivalent Chinese name. Patent applications now often appear first in non-English-speaking countries for business or processing reasons.

Automated translation of Chinese is possible because IUPAC’s strong morphological structuring is preserved across languages. Software can identify the subparts of a name, translate each one, and then reassemble them according to the IUPAC structuring.
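The idea of splitting a name into subparts and reassembling the translations can be illustrated with a minimal sketch. This is not NextMove's implementation: the nine-entry lexicon, the greedy longest-match segmentation, and the function names are all assumptions chosen for illustration, and the simple concatenation only works here because the morphemes of these example names appear in the same order in Chinese and English.

```python
# Toy lexicon mapping Chinese name morphemes to English ones.
# Illustrative assumption only; a real system needs a far larger
# dictionary plus rules for locants, ordering, and punctuation.
LEXICON = {
    "二": "di",
    "三": "tri",
    "氯": "chloro",
    "甲": "meth",
    "乙": "eth",
    "丙": "prop",
    "丁": "but",
    "烷": "ane",
    "醇": "anol",
}


def segment(name):
    """Split a name into known morphemes, trying the longest match first."""
    longest = max(len(key) for key in LEXICON)
    parts = []
    i = 0
    while i < len(name):
        for size in range(min(longest, len(name) - i), 0, -1):
            piece = name[i:i + size]
            if piece in LEXICON:
                parts.append(piece)
                i += size
                break
        else:
            # No lexicon entry starts at this position.
            raise ValueError(f"unknown morpheme at position {i}: {name[i]!r}")
    return parts


def translate_name(name):
    """Translate each morpheme and join the results in the original order."""
    return "".join(LEXICON[part] for part in segment(name))


print(translate_name("二氯甲烷"))  # dichloromethane
print(translate_name("乙醇"))      # ethanol
```

A name containing a character outside the lexicon raises `ValueError` rather than producing a partial translation, which is one conservative way to avoid emitting a wrong structure.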

In text mining the challenge is to find the beginning and the end of a chemical name. In the latest version of LeadMine using NextMove’s software, 42% of simple patents written in Chinese were recognized and converted (vs. a benchmark of 86% for recognizing the original English). Image documents, however, are still not scanned successfully in most cases.

Tom Blackadar (Binocular Vision) shared his experience of living in Asia for the past 5 years. Tom began studying Chinese intensively about 2 years ago when he moved to Shanghai. He told an engaging personal story of the challenges and rewards of starting a small consulting company (operating as a U.S. company) to bring expert informatics practices to the developing Chinese market and to link pharmaceutical companies with partners. Tom is focusing largely on Western companies and Contract Research Organizations in China. He discussed many of the legal hurdles he needed to overcome. People were very impressed by his slide collection of the necessary red-stamped legal documents! Tom also emphasized the need for data management and the gaps in IP, which together define the valuable niche that his company can fill in China in the future.

Brian Hitson (U.S. Department of Energy, Office of Scientific and Technical Information, OSTI) talked about the efforts of worldwidescience.org to build a multilingual search system for chemistry and other sciences. OSTI provides public access to the Department of Energy’s unclassified information, as well as restricted access to classified and sensitive information for appropriate people. OSTI has been a pioneer in creating “aggregators” for federated search of multiple sites. Science.gov, launched in the early 2000s, integrates information from twelve federal agencies. Worldwidescience.org takes this to the international level, searching databases in many different countries. Started in 2007 as a partnership between the U.S. DOE and the British Library, it moved in 2008 to multilateral governance. The system’s goal is to run true searches of the “deep web,” the content that ordinary search engines do not index, where most of the science can actually be found.

Recent developments include multilingual translation, the first to work both one-to-many and many-to-one. One search query fires off ten different searches based on Microsoft Translator machine translations. “Science Cinema” uses the Microsoft Research Audio Video Indexing System (MAVIS) to recognize and index audio content. Once a hit is found, the user can jump directly to the place in the video where the relevant passage occurs. The next step is to attack “big data.” The system will search the metadata and then connect the user to the landing page to explore the data in its own format.
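The one-to-many half of this pattern, a single query translated and fanned out to several language-specific searches, can be sketched as follows. This is only a toy under stated assumptions: the lookup table stands in for a machine translation service, the in-memory title lists stand in for remote databases, and none of the names reflect the actual worldwidescience.org or Microsoft Translator interfaces. Translating the hits back into the user's language (the many-to-one half) is omitted.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-in for a machine translation service.
TOY_TRANSLATIONS = {
    ("en", "fr"): {"benzoic acid": "acide benzoïque"},
    ("en", "de"): {"benzoic acid": "Benzoesäure"},
}

# Hypothetical per-language collections: language code -> record titles.
TOY_SOURCES = {
    "en": ["Benzoic acid solubility data"],
    "fr": ["Synthèse de l'acide benzoïque"],
    "de": ["Benzoesäure in Lebensmitteln"],
}


def translate(text, src, dst):
    """Return the translated query, or None when the toy table has no entry."""
    if src == dst:
        return text
    return TOY_TRANSLATIONS.get((src, dst), {}).get(text)


def search_one(lang, query):
    """Case-insensitive substring search over one language's collection."""
    if query is None:
        return []
    needle = query.casefold()
    return [(lang, title) for title in TOY_SOURCES.get(lang, [])
            if needle in title.casefold()]


def federated_search(query, query_lang="en"):
    """Translate the query once per target language and search all in parallel."""
    langs = sorted(TOY_SOURCES)
    with ThreadPoolExecutor() as pool:
        futures = [
            pool.submit(search_one, lang, translate(query, query_lang, lang))
            for lang in langs
        ]
        hits = []
        for future in futures:  # collect in language order, not completion order
            hits.extend(future.result())
    return hits


for lang, title in federated_search("benzoic acid"):
    print(lang, title)
```

Collecting the futures in submission order keeps the merged result list deterministic; a production aggregator would more likely rank and deduplicate hits across sources.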

Jignesh Bhate (Molecular Connections) talked about business opportunities and challenges in India. Molecular Connections is India’s largest informatics company with over 900 employees located in Bangalore and Chennai. It focuses on indexing, abstracting, and text mining.

India is a big consumer of content, with a 17% annual growth rate. The country has a huge business of service providers and multinational company sites. Indian private industry R&D spending is still only 25%, but growing rapidly. Indian research output is significant and growing, whereas U.S. output is shrinking. Medicine and pharma contribute over 25% of the total research output. India dominates offshoring of content production, with over 84% of the world’s total. This business generates $800M per year and is growing at 20%. The predictions are that value-add will be added to cost, with quality and TOT (terms of trade) as key metrics.

Jignesh pointed out several business challenges: India has many different cultures and languages; bureaucracy and corruption are significant obstacles; Indians are very sensitive to hierarchy; and they focus on relationships and face-to-face contact, so phone calls are used more than email. Despite these challenges, the macro story is so compelling that you cannot go wrong. It sometimes feels like a “drunken man’s stupor,” but you can get to the goal.

Andy McFarlane (Thomson Reuters) noted that in 2011 China became the #1 country in patents, with over half a million and 23% year-on-year growth. Commercial providers historically added value on top of information coming directly from the patent office. Now the information often comes through a translator or intermediary. There are challenges in scrubbing data, such as rationalizing different translations and spellings of names. India has four patent offices that issue overlapping patent numbers. Derwent World Patents Index has comprehensive English-language coverage, including Asia, with normalized company names. Thomson Reuters tries to focus on consistency of terms. A breakdown by technology focus shows that India is particularly oriented towards chemistry patents.

Tom Blackadar, Jignesh Bhate and Rachelle Bienstock, Symposium Organizers

In 2009, CAS, the world’s authority for chemical information, reported that China was, for the first time, leading all nations in publication of chemical patent applications (http://www.cas.org/news/media-releases/china-leads-patents). Three years later, chemical patent information from Asia continues to be a significant source of disclosed chemistry, with patent applications from China’s State Intellectual Property Office (SIPO) still increasing. This is important to the chemical information and research communities, as CAS is reporting that in 2012, more than 70% of new substances in the literature are from patents.