Technical Program

CINF Technical Program Highlights: An Interview with Rachelle Bienstock

Rachelle Bienstock

Dr. Rachelle J. Bienstock received her undergraduate degree in Chemical Engineering from The Cooper Union in New York City and her PhD in Chemistry from The University of Michigan in Ann Arbor, Michigan. Following postdoctoral studies at the University of Texas Southwestern Medical Center (Dallas), involving NMR and molecular modeling of constrained peptide analogs and peptidomimetics, she joined the National Institute of Environmental Health Sciences (NIEHS), Research Triangle Park, NC, as a molecular modeler and computational chemist. Her main research interests are protein structure and protein complex prediction methodologies, computational and structure-based ligand design methods, and protein-protein and protein-ligand docking studies.

Svetlana Korolev: Rachelle, could you tell us about your professional service interests that brought you to CINF? What were your motivations to volunteer for the CINF Program Chair 2011-12? Are you a member of any other ACS divisions or other professional societies?

Rachelle Bienstock: Rajarshi Guha had posted a message on the Computational Chemistry List (CCL) server indicating that he was looking for people who would like to organize and suggest interesting symposium topics for the CINF Division program for National ACS Meetings. So I suggested a symposium on Computational Methods for Fragment Based Ligand Design for the Salt Lake City Meeting in 2009, which was so popular and well-attended that it was followed by a second part symposium in 2010 at the San Francisco meeting. Rajarshi then suggested that I might like to assume the Program Chair position at the end of his term. It seemed like a good idea as I had participated in some Program Committee meetings and enjoyed contributing topics to the CINF program at National Meetings. I am also a member of the COMP, MEDI and BIOL Divisions of ACS and the American Association for the Advancement of Science (AAAS).

SK: The Fall 2012 ACS National Meeting was the last meeting for you as the CINF Program Chair and, coincidentally, the CINF program was organized at its fullest length, from Sunday morning to Thursday afternoon inclusive. Would you assess the last meeting as the most successful program during your tenure? What were the CINF technical program highlights in Philadelphia?

RB: The CINF program at the Philadelphia meeting was a bit of a challenge because not only did we have a substantial program, but we were blessed with two renowned Skolnik winners: Dr. Peter Murray-Rust and Prof. Henry Rzepa. Both Peter and Henry wanted to honor many of their collaborators and colleagues with the opportunity to speak, so we had a rather extensive and packed Skolnik symposium all in a single day! Additionally, some speakers gave their addresses via the Internet, and a real-time Twitter feed about the meeting ran simultaneously. While it was a challenge, it was really pioneering for an ACS Meeting.

Some of the other programming highlights from the Philadelphia Meeting were: the one-and-a-half day session on patents held jointly with the Division of Chemistry and the Law and the Division of Small Chemical Businesses; the sessions on drug repurposing featuring Chris Lipinski as a speaker; on the history of chemical information; and on the new developments in electronic lab notebooks. Our program was really interesting to members, and most of our sessions had standing-room-only attendance packed into tiny rooms in the Marriott. Even the Thursday morning session, traditionally the “General papers” but subtitled “chemical databases, drug discovery and chemical structure representation,” had a substantial audience.

SK: In the registration statistics of the ACS National Meetings over the past five years 2008-12, we see spikes in the numbers of attendees at the Spring meetings in San Diego (16,859 in 2012) and San Francisco (18,064 in 2010) and that correlates with the peaks of the CINF abstract submissions for the meetings. Could you comment on a magic formula of “Spring + San city = Success” and on the factors that the CINF Program Committee considers when programming for spring versus fall meetings, and for specific locations?

RB: One of the significant problems due to the current economic downturn is that many companies are no longer sponsoring employees for travel to conferences. Because San Francisco and San Diego have significant numbers of local attendees the travel problem is circumvented. At the recent meetings in Denver, Anaheim and Philadelphia we saw significant numbers of speakers asking for sponsored travel funds and then withdrawing their talks if the funds are not received. CINF is now frequently looking at the local venue and trying to involve individuals in the local area in CINF programming so that travel will be less of an issue. For the upcoming Indianapolis meeting, we are trying to involve David Wild and faculty members at Indiana University and people at Eli Lilly and Company so that travel funds will be less of an obstacle. However, we try not to let location impinge too heavily on the quality and content of our programming.

SK: Rachelle, in your highlights of the CINF Program at the 2012 Spring ACS Meeting in San Diego you observed that “no conference is now going without mobile chemistry.” The first InChI Symposium took place in San Diego, too. Could you mention some other emerging themes for CINF programming? How successful has the ACS thematic programming been in influencing the CINF programming during your tenure?

RB: As you mentioned, the development of mobile applications and their usefulness in chemistry has had a significant impact on CINF programming. Additionally, recent ACS thematic programming has focused heavily on medical and health themes, as in Philadelphia. We had symposia on “Science and the Law” with emphasis on regulation in health aspects and “cheminformatics in the hands of the medicinal chemist” as well as “cheminformatics opportunities in personalized medicine and chemogenomics.” At the San Diego meeting, one of our CINF symposia was featured in the LIFE thematic flyer included in the registration packets mailed to all attendees. Aside from mobile chemistry, open publishing, and medicinal and biological cheminformatic applications, I expect to see themes focused on new materials/nanomaterials and cheminformatics, semantic web and chemical database linking in the future.

SK: Rachelle, you organized two notable CINF symposia on a topic of fragment-based ligand design at Spring ACS National Meetings in 2009 and 2010, which resulted in publishing of the ACS Symposium Book Series Library Design, Search Methods, and Applications of Fragment-Based Drug Design in September, 2011. Could you share with us your experience of the symposium book publishing? Were there any difficulties? Are you planning to continue programming on this topic at future ACS Meetings?

RB: Certainly publishing a book with ACS was a rewarding experience, especially since it was published and made available so quickly as an e-book. However, the greatest difficulty, which I had not anticipated, was persuading participants in the symposia to write chapters for the book! Because of the demands for the material in chapters to be novel and peer-reviewed, like a research article, many researchers preferred to publish in a journal rather than in an ACS symposium series volume. As fragment-based design and the computer methodologies associated with growing, linking, and developing novel fragments evolve, CINF will revisit this topic. It certainly was a popular topic and the symposium sessions were well attended.

SK: Rachelle, let’s review some challenges and support venues for Program Chairs. (There used to be a Planning for Program Chair Conference organized by ACS). How do Program Chairs collaborate with other divisions and with MPPG? How important is the role of the CINF fundraising efforts for financial support of speakers in order to put together high quality programs?

RB: Other than from Rajarshi Guha, the previous CINF Program Chair, I had little assistance. The PACS system is not too user-friendly for organizing the program, although Robin Green and Farai Tsokodayi at ACS provided excellent support. Collaborating with other divisions really involves personally reaching out to the other program chairs. We have been fortunate that other divisions, particularly MEDI, COMP, CHAL, CHED, and SCHB, wanted to build bridges and co-sponsor programming with CINF. Fundraising and financial support has been a challenging issue in the current economic climate. We also had some difficulties with speakers, who are accustomed to having all their expenses paid when they come to present at a conference. Graham Douglas has worked very hard in securing sponsorship for events.

SK: As Program Chair have you been getting any data about CINF programming from ACS? Have you heard any other feedback on the program? Would you like to share your recommendations with CINF members considering their involvement in CINF programming?  

RB: I have not received any feedback from ACS regarding CINF programming. However, I have been told informally from our members and other attendees that the program was interesting and of high quality. Many people commented to me on the diversity and the breadth of the program. They liked the fact that we did not repeat the same topical sessions at every meeting. The topics are really influenced by Committee members and Division members in general. Our programming is only as good as the varied input we receive from our members and attendees. We try to cast a broad net and invite all CINF members to participate in suggesting topics or organizing symposia. We welcome all suggestions and participants to the Program Committee either by attending the Committee meetings or emailing your suggestions to the Program Chair.

SK: Rachelle, who is going to be your successor as the CINF Program Chair 2013-14? Could you give us a sneak preview of the CINF technical program planned for the 2013 National Meetings?

RB: Jeremy Garritano, Chemistry and Chemical Engineering Librarian at Purdue University, will be my successor. He is already busily organizing the Spring 2013 New Orleans Meeting. Since he is from the “librarian” side of the CINF membership as opposed to the “computational” side of CINF membership, I’m sure that Jeremy will bring a new flavor to CINF programming. We already have an exciting meeting planned with sessions on scholarly communication, advances in virtual high-throughput screening, public databases, library cafes, challenges for libraries in global universities, multiparameter optimization, linking bioinformatic and cheminformatic data, foodinformatics (in line with the ACS theme), as well as on finding information about food chemistry and safety. I am even organizing a symposium myself on computational de novo “rational” design of proteins and peptides.

SK: Are you planning on contributing to the CINF Division in any other role after completing your term as CINF Program Chair?

RB: Well, I am planning on organizing and chairing a symposium in New Orleans, and I hope to continue to be a participant in CINF and a contributor to CINF programming without the responsibilities of being Program Chair.

SK: Rachelle, thank you very much for sharing with us your experiences as the CINF Program Chair 2011-12.

Proposed CINF Program for the Spring 2013 ACS National Meeting

Advances in Virtual High-Throughput Screening

Joel Freundlich,
Sean Ekins

Advances in Visualizing and Analyzing Biomolecular Screening Data

Deepak Bandyopadhyay

Balancing Chemistry on the Head of a Pin: Multi-Parameter Optimization

Edmund Champness,
Matthew Segall,

CINF Scholarship for Scientific Excellence (poster)

Guenter Grethe

Computational de novo ("rational") design of proteins/peptides

Rachelle Bienstock

Food for Thought: Alternative Careers in Chemistry

Donna Wrublewski,
Patricia Meindl

FoodInformatics: Applications of Chemical Information to Food Chemistry

Jose Medina-Franco,
Karina Martinez Mayorga

General Papers; Sci-Mix (poster)

Jeremy Garritano

Going Global: Challenges for Libraries in Global Universities

Andrea Twiss-Brooks,
David Martinsen

Library Cafes, Intellectual Commons and Virtual Services

Leah McEwen, Norah Xiao,
Olivia Bautista Sparks, Teri Vogel

Linking Bioinformatic Data and Cheminformatic Data

Ian Bruno,
John Overington

Public Databases Serving the Chemistry Community

Antony Williams,
Sean Ekins

Scholarly Communication: New Models New Media, New Metrics

David Martinsen,
William Town

 

Science and the Law

Analytical Data in Support of Regulation in Health, Food and the Environment

This one-day symposium examined the interaction between legislation and the underlying science which supports legislation, both in the development and application/compliance phases. In particular, the use of analytical methods and data in the regulation of health, food, and the environment has a major impact on the drafting of new legislation and on the public debate that typically precedes any new legislation. Existing databases used by legislators and those responsible for implementing legislation were considered in each sector.  In addition, consideration was given to the impact of science on the regulation of new areas, such as functional foods, and the appropriate fora for the regulator and regulated industry to discuss technical issues.

Consumers face a barrage of product claims each day. These claims create consumer expectation of safety and product performance and, assuming they are accurate, facilitate well informed choice.  But increased scrutiny of claims, especially where the claim involves potential health outcomes, means that claim substantiation and the science behind it are more important than ever.

Speakers in the symposium said that greater collaboration is needed to ensure that product claims are based on the best available scientific evidence. “What we need is not science for science’s sake, but science for society’s sake,” said David Richardson. He also said that “regulators must ensure that any claims are based on the best available scientific evidence and using the best tools and methods available in order to ensure the highest standards for consumers, while at the same time fostering and advancing innovation in the products they regulate. This can only be achieved if all interested parties, whether they be NGOs, academics, regulators or industry scientists, are brought together to advance regulatory science and leverage its potential to promote and protect public health.”

Other speakers in the symposium discussed claims made in the realm of regulated products, ranging from the very familiar health and nutritional claims on food to the newer, less familiar territory of potential future claims around “modified risk” tobacco products, which, under the Family Smoking Prevention and Tobacco Control Act of 2009 (FSPTCA), also come under the remit of the U.S. Food and Drug Administration (FDA). The FDA is currently the only regulator with a mandate to evaluate submissions to place modified risk tobacco products on the U.S. market.

David Richardson, a food scientist at Reading University in the UK, says that the use of claims in the food and dietary supplement market is widespread. But supermarket shelves may look very different as regulators crack down on false and misleading claims. In the European Union (EU) for example, regulation published in May this year (Regulation EC No 432/2012) concerning the well-established nutrient functions of vitamins and minerals could see hundreds of nutritional and health claims become illegal. The European Commission has published a list of more than 1600 unauthorized claims on the EU Register. All labels and commercial communications must comply with the regulation by 14th December 2012 following a six-month transitional period. The result, he says, is that we must achieve an important balancing act between overcoming challenges in generating and presenting scientific data to justify a claim and achieve conclusive evidence of cause and effect, while not reducing a company’s willingness to invest in new research and innovation or impacting international trade.

But whereas health claims on food and nutritional supplements are commonplace, the possibility of making reduced-risk claims is a new prospect for modified risk tobacco products which manufacturers might seek to place on the U.S. market. Under the FSPTCA, companies can apply to the FDA to market lower-risk tobacco products in the U.S., provided they supply scientific proof that marketing of the product will not only reduce harm to individual users, but also benefit the population as a whole. The FDA has started to draft guidance on the kind of scientific and other data needed to forecast and monitor a proposed modified risk tobacco product’s potential impact on public health, including: product characterization; the amount of human exposure to harmful constituents; perceptions about the product; and effects on human health. “Right now, this is a very new area of science and there is a shortage of established regulatory science to help assess the health risks of modified-risk tobacco products,” says Christopher Proctor, Chief Scientific Officer at British American Tobacco. “We need reliable credible evidence to further this area of science and fill the gaps identified by the regulators. The FDA has set out a large number of research questions that they want answered in order to help them create the scientific underpinning to regulations, and it is going to take a considerable multi-disciplinary research effort, involving a range of research contributors, to complete the science they need.”

The sentiment was echoed by Rodger Curren, of the Institute for In Vitro Sciences in the US, who said that transparency, data sharing, and active communication between scientists, industry and regulators are the best way to ensure the intelligent application of science to regulatory policy. Rodger described how such an approach was key to the successful use and acceptance of new in vitro ocular models for hazard testing of anti-microbial cleaning products. He said that this approach could serve as a model for the safety assessment of other consumer products, thus supporting the 3Rs (replacement, reduction, and refinement of animal use) and avoiding new animal experimentation.

Among the other speakers, Istvan Pelczer (Frick Chemistry Laboratory, Princeton) discussed “Honey analysis by high-sensitivity cryo-13C-NMR,” a technique that makes it possible to detect the fraudulent production of honey.

Judith Currano (Chemistry Library, University of Pennsylvania) shared her thoughts on “Hunting and gathering: locating information on the cusp between science and legislation.” Her talk used a case study approach to examine methods of finding information on the science and legislation dealing with food, drugs, and the environment.

K. Scott Phillips (Division of Chemistry and Materials Science, FDA) presented research on contact lenses at FDA in a talk entitled “Contact lens materials and multipurpose solutions: lessons learned from laboratory research.” This talk discussed research efforts in the areas of materials chemistry and bioanalytical chemistry, the project's contribution to current regulatory science knowledge, and potential implications that the data have for public health.

Lucinda Buhse (Division of Pharmaceutical Analysis, FDA) enlightened us on “Rapid screening methods for pharmaceutical surveillance.” An ever-increasing percentage of products and ingredients is now coming from overseas, potentially increasing consumer exposure to poor quality, counterfeit, and adulterated pharmaceutical products. In response to this situation, the FDA has developed rapid and portable screening methods to assess the quality and safety of pharmaceutical products at ports of entry.  

George Lunn (Office of New Drug Quality Assessment, FDA) informed us about “Analytical procedures and the regulation of new drug development.” The information that should be submitted to the FDA is governed by the Food, Drug, and Cosmetic Act, Title 21 of the Code of Federal Regulations, and various guidances. This talk focused on these requirements and recommendations.

Thomas Hartung (School of Public Health, Johns Hopkins University) described some groundbreaking work on “Mapping the human toxome for new regulatory tools.” The lecture summarized the lessons learned from the development, validation and acceptance of alternative methods for the creation of a new approach for regulatory toxicology. Besides the technical development of new approaches, a case was made that we need conceptual steering, an objective assessment of current practices by evidence-based toxicology (modeled on evidence-based medicine), and implementation into legislation.

The session closed with a presentation from Frederick Stoss (Silverman Library, University at Buffalo - The State University of New York) entitled “Environmental databases: a trip down memory lane and new journeys in the 21st century.” This presentation compared the “environmental” content of several STEM bibliographic databases.

Bill Town, Symposium Organizer

Hunting for Hidden Treasures: Chemical Information in Patents and Other Documents

There is a huge chemical space in scientific and legal documents, such as chemical patents, journal articles, internal documents, and other publications, that is an important resource of intellectual property, but for historical reasons and because of technical limitations, much of this space is not indexed or digitized. How to extract this information and make use of it has long been a challenging task. This symposium included a series of discussions of current developments in analyzing chemical space in documents, which can benefit not only scientists in the pharmaceutical industry and academia, but also individuals in cheminformatics, publishing, patent law and government agencies.

Because no similar symposium had been held before, participation in this one was overwhelming, with twenty-two abstract submissions. All talks were grouped into three half-day sessions. The Sunday morning session focused on Markush structure analysis in chemical patents, Sunday afternoon focused on exemplified structure analysis in patents, and Monday afternoon focused on chemical information in non-patent documents. All sessions were organized and chaired by David Deng (ChemAxon). The agenda and abstracts of all sessions are available here. (With the authors' permission, some presentation slides are available online. The links are inserted under the author names.)

Although this time all CINF meeting rooms were far away from the convention center and the COMP sessions, it did not deter attendees. All three sessions were well attended with 40-50 participants.

Sunday Morning: Markush Structures in Patents

Markush structures are widely used in chemical patents to define large chemical spaces, and they contain essential chemical information for patent analysis. However, the flexibility and complexity of Markush structures preclude easy transformations from patent document to digital format. Currently, two organizations have systematically indexed most chemical patents: Thomson Reuters and Chemical Abstracts Service. After the opening remarks, the symposium started with talks of representatives from both organizations.

Donald Walter (Thomson Reuters) talked about how Thomson Reuters indexes Markush structures, and the coverage of their Markush database. Also, he demonstrated how one can enumerate, filter and search this database using ChemAxon's Markush technology. Roger Schenck (Chemical Abstracts Service) described how CAS builds its contents from patents and literature, and gave illustrative examples on how CAS treats inconsistencies in the documents and translated literature.

In addition to the two giants in patent Markush indexing, there are also smaller, independent organizations that index Markush structures on their own. Jayaraman Packirisamy (Sristi Biosciences) had been scheduled to report how his company indexed Markush databases of natural products, e.g., a cancer database of over 1500 Markush scaffolds for almost all cancer targets from patents. The curation is also done with ChemAxon's Markush technology. Unfortunately, Packirisamy could not come to the conference to deliver the presentation in person.

After the first three presentations had introduced the complex nature of Markush structures and the tedious process of indexing them, the natural question was whether the indexing can be done automatically. In this context, Josef Eiblmaier (InfoChem) talked about ChemProspector, a five-year project to automatically extract Markush structures from patent documents. ChemProspector uses image recognition technology to extract the Markush scaffold, then scans the text to extract chemical name entities as R-group definitions and retrieve Markush structure variations. For nested R-group recognition, ChemProspector obtains satisfactory results for level-1 R-groups and reasonable results for level-2 R-groups. However, deeply nested R-groups (level 3 and beyond) are still very challenging to retrieve accurately.
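For readers less familiar with nested R-groups, the following is a minimal sketch of what "level-1" and "level-2" mean and of how a Markush definition expands into individual structures. It is not ChemProspector's method. The `{R1}` placeholder syntax, the function names, and the toy scaffold are assumptions made for illustration, and the expansion is plain string substitution with no chemical validation. It also assumes the R-group definitions contain no cycles.

```python
import re

# Hypothetical placeholder syntax for this sketch: "{R1}", "{R2}", ...
PLACEHOLDER = re.compile(r"\{(R\d+)\}")

def enumerate_markush(template, rgroups):
    """Expand every placeholder by plain string substitution (no chemistry checks)."""
    match = PLACEHOLDER.search(template)
    if match is None:
        return [template]
    results = []
    for alternative in rgroups[match.group(1)]:
        expanded = template[:match.start()] + alternative + template[match.end():]
        results.extend(enumerate_markush(expanded, rgroups))
    return results

def nesting_levels(scaffold, rgroups):
    """Level 1 = R-groups on the scaffold, level 2 = R-groups inside level-1 definitions, etc."""
    levels = {}
    frontier = PLACEHOLDER.findall(scaffold)
    level = 1
    while frontier:
        next_frontier = []
        for name in frontier:
            if name not in levels:
                levels[name] = level
                for alternative in rgroups[name]:
                    next_frontier.extend(PLACEHOLDER.findall(alternative))
        frontier = next_frontier
        level += 1
    return levels

# Toy example: a benzene scaffold with one R-group whose definition nests a second one.
scaffold = "c1ccccc1{R1}"
rgroups = {"R1": ["Cl", "O{R2}"], "R2": ["C", "CC"]}

print(nesting_levels(scaffold, rgroups))
print(enumerate_markush(scaffold, rgroups))
```

Even in this toy case the number of enumerated structures depends on every nested definition being read correctly, which suggests why extraction errors at deeper levels are costly.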

After four talks on Markush curation, the next three presentations dealt with patent analysis.

Daniel Lowe (NextMove Software) described a system for automatically downloading patent applications from various sources, correcting and extracting relevant chemical information, and indexing and storing the results in a searchable database. These structures can be used to identify novel scaffolds or as keys to cluster patents.

David Cosgrove (AstraZeneca) gave an overview of a new system for encoding and searching Markush structures and a structure activity relationship analysis of chemical patents. The Periscope system uses a new language (MIL) to describe a Markush structure and has a graphic interface to display Markush structures. After exemplified structures and activity values are extracted, structures go through R-group decomposition. The R-group fragments and the activities are then used for Free-Wilson analysis yielding an improved result.
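As background on Free-Wilson analysis, here is a minimal sketch of the additive model it fits: activity is treated as a baseline value plus one contribution per substituent. This is not the Periscope implementation. It uses the reference-compound (Fujita-Ban) form of the model, solved by ordinary least squares, and the compounds and activity values are made up for illustration. All function names are hypothetical.

```python
def solve_linear(a, b):
    """Solve a square linear system by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("design matrix is singular; substituents are confounded")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [x - factor * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]

def free_wilson(compounds, reference):
    """Fit activity = baseline + sum of contributions of non-reference substituents."""
    features = sorted({(pos, grp) for subs, _ in compounds
                       for pos, grp in subs.items() if grp != reference[pos]})
    # One indicator column per (position, substituent), plus a leading intercept column.
    rows = [[1.0] + [1.0 if subs[pos] == grp else 0.0 for pos, grp in features]
            for subs, _ in compounds]
    y = [activity for _, activity in compounds]
    k = len(features) + 1
    # Normal equations: (X^T X) beta = X^T y
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * v for r, v in zip(rows, y)) for i in range(k)]
    beta = solve_linear(xtx, xty)
    return beta[0], dict(zip(features, beta[1:]))

# Made-up, exactly additive toy data: (R-group decomposition, activity).
toy_compounds = [
    ({"R1": "H", "R2": "H"}, 5.0),
    ({"R1": "Cl", "R2": "H"}, 5.5),
    ({"R1": "H", "R2": "OMe"}, 6.0),
    ({"R1": "Cl", "R2": "OMe"}, 6.5),
]
baseline, contributions = free_wilson(toy_compounds, reference={"R1": "H", "R2": "H"})
print(baseline, contributions)
```

The R-group decomposition step mentioned in the talk is what would produce the substituent dictionaries used as input here. Real patent data is rarely this balanced, so the fit would be approximate and some contributions may not be separable.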

Christopher Kibbey (Pfizer) discussed his research on patent structure analysis at Pfizer. His team uses reduced graphs and generates fragment fingerprints to represent a structure. These reduced graphs are compatible with Markush structure variations. They can be used to overlay structures, provide "similarity-like" scores and do "substructure-like" matches. To generate a representative subset of a Markush library, his group chooses "level enumeration," which uses only the first instance of each R-group definition during enumeration. Combined with reduced graphs, a Markush library can be easily compared to a query structure, which provides valuable IP assessment.

Sunday Afternoon: Exemplified Structures in Patents

Besides Markush structures, a patent also contains many exemplified structures and prophetic structures. They are often scattered through the documents as images or text. The Sunday afternoon session discussed developments in technologies, such as OSR (image to structure), OCR (scanned text to characters, followed by name-to-structure conversion), text mining, and others, to extract and analyze these structures from patents. An interesting observation was that seven of the eight speakers in this half-day session represented European companies. Does this mean Europe is leading in patent analysis?

The first two talks discussed OSR technology. Rostislav Chutkov (GGA) presented Imago, the open-source OSR toolkit. Advanced structure features, such as crossing bonds, abbreviated groups, and R-groups, are supported. Also, results from Imago can be improved by tuning the method with a training set of images and structures. Aniko Valko (Keymodule) introduced the latest development in CLiDE. From version 3.2.0 to 5.5.4 major improvements have been achieved with less run time. Now CLiDE is better at retrieving atom labels, functional groups as formula, stereochemistry, and structures in tables, and at removing noise.

The next three talks were about OCR technology and name-to-structure conversion.

Roger Sayle (NextMove Software) talked about automatic spelling correction after OCR. Because of the limitations of OCR technology, text converted from non-text documents often contains errors. Effective automatic “spelling” correction can significantly improve chemical entity extraction. The same technology can also be applied to protein target names or even non-alphabetic entities such as CAS Registry Numbers.
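CAS Registry Numbers lend themselves to this kind of correction because the final digit is a checksum over the preceding digits. The sketch below is not the NextMove approach. It is a simple illustration that swaps a few characters OCR commonly confuses with digits and accepts the result only if the check digit then validates. The confusion table and function names are assumptions chosen for the example.

```python
import re

# Illustrative, non-exhaustive table of letters that OCR often confuses with digits.
OCR_DIGIT_FIXES = str.maketrans(
    {"O": "0", "o": "0", "l": "1", "I": "1", "S": "5", "B": "8"}
)

# A CAS Registry Number has 2 to 7 digits, then 2 digits, then 1 check digit.
CAS_SHAPE = re.compile(r"^(\d{2,7})-(\d{2})-(\d)$")

def is_valid_cas(rn):
    """Check the format and the check digit of a CAS Registry Number."""
    match = CAS_SHAPE.match(rn)
    if match is None:
        return False
    body = match.group(1) + match.group(2)
    # Weight the digits 1, 2, 3, ... starting from the digit next to the check digit.
    total = sum(int(digit) * weight
                for weight, digit in enumerate(reversed(body), start=1))
    return total % 10 == int(match.group(3))

def repair_cas(token):
    """Return a checksum-valid CAS RN recovered from an OCR token, or None."""
    if is_valid_cas(token):
        return token
    candidate = token.translate(OCR_DIGIT_FIXES)
    return candidate if is_valid_cas(candidate) else None

print(repair_cas("7732-l8-5"))   # water, with "1" misread as lowercase "l"
print(repair_cas("77S2-18-5"))   # substitution gives a number that fails the checksum
```

A checksum only catches some errors, so a production system would presumably combine it with dictionary lookups or context, as the talk's broader discussion of entity extraction implies.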

Lutz Weber (OntoChem) spoke about automated SAR extraction from patents. First, chemical information, including structures, compound classes, and biological effects, is extracted from patent texts. Second, relationships about the compounds and effects are analyzed for their syntax with an automated tool. Last, the normalized relationship n-tuples are generated, and a structure activity relationship can be derived for search engines.

Daniel Bonniot (ChemAxon) provided an update on ChemAxon’s patent mining technology. Based on ChemAxon's Naming technology, Daniel and his colleagues have developed “Document to Structure,” a tool to extract all chemical information from images and text in documents. As a powerful tool for patent mining, it works with non-searchable PDFs, and all converted structures are returned with their locations in the document. Another tool, “Document to Database,” can pull documents from file systems and extract all chemical and biographical information. A free website, Chemicalize.org, has been set up to demonstrate extracting information from web pages and documents.

Alex Klenner (Fraunhofer SCAI) presented his research on the exploration and visualization of chemical information in patents. After pre-processing documents with ChemoCR and Tesseract, images and text are converted into structures. All structures are “stamped” into the original PDF as “pop-up” displays along with hyperlinks to public web services. Additionally, all retrieved structures are stored in a ChemAxon JChem database, enabling structure search and filtering options. This workflow can access grid resources for parallel processing.

Nicko Goncharoff (SureChem) presented the SureChem database of 12 million unique structures from US, EP, WO and JP patents. These structures are automatically extracted from patent images using CLiDE, and from text using OPSIN and ChemAxon’s Naming. The system also uses ChemAxon’s Structure Checker and Standardizer for structure validation, and is hosted on Amazon Cloud with ChemAxon’s JChem Cartridge for searching. All structures have been made publicly available in PubChem.

Amy Kallmerten (PerkinElmer) presented Structure Genius, a system that extracts structures from images in documents. All structures are indexed and stored in the centralized database for search and analysis.

Monday Afternoon: Chemical Information in Non-Patent Documents

Patent mining can be quite challenging, but extracting chemical information from other scientific documents, e.g., the internal document databases of a global corporation, is not any easier. The last half-day session was dedicated to analyzing chemical information in all documents.

The session started with an overview of the challenges in chemical literature mining by Vidyendra Sadanandan (Molecular Connections). Different chemical entity recognition applications were summarized, and challenges in chemical text mining were outlined. Typical challenges include typographical errors, image format, terminology inconsistency, legal uncertainty, access costs, etc.

As the two major players in literature indexing, Thomson Reuters and Chemical Abstracts Service both offer comprehensive literature searching. Robert Stembridge (Thomson Reuters) talked about the challenges of collaboration between information scientists and chemists, and about visualizing Thomson Reuters database search results. Jim Brown (FIZ Karlsruhe) spoke about numeric property searching in STN databases.

David Sharpe (Royal Society of Chemistry) spoke about extracting information from literature and correcting the errors therein. Two use cases were presented: the first, Project Prospect, which processes literature documents and generates enhanced HTML; and the second, fixing the chair forms of sugars and cyclohexanes.

Abraham Heifets (University of Toronto) presented SCRIPDB, a publicly-accessible database of chemical structures and reactions. It contains over 10 million compounds found in over 100,000 patents granted since 2001. A case study of using this database for synthetic accessibility analysis was discussed.

Guenter Grethe (for Akos GmbH) introduced CWM Global Search, which is a single user interface allowing for federated search over more than 60 scientific databases and drug discovery data sources publicly available on the internet. The search query can be chemical structures or names, CAS Registry Numbers, or free text.

SharePoint has been widely adopted as a repository for unstructured data within the enterprise. However, it lacks chemistry storage and search features. The last two talks in this symposium were about enabling chemical information extraction and searching in SharePoint.

Tamas Pelcz (ChemAxon) presented JChem for SharePoint (JC4SP), which allows many ChemAxon applications to be used in SharePoint. The user may import and view structures, and calculate properties, in SharePoint lists and blogs. Powered by ChemAxon’s Document to Structure, JC4SP can also extract chemical information (names, SMILES, InChIs, CAS Registry Numbers, structure images, embedded structure objects, and even corporate IDs) from various document types. The extracted structures are indexed and searchable.

Rudy Potenzone (PerkinElmer Informatics) presented Search Genius, which can be used with SharePoint for chemical searching. It uses Microsoft FAST Search to identify and index embedded structures in documents. Search Genius can also be inserted into a SharePoint or E-Notebook front end for federated searching.

Summary

Various approaches to automating chemical information extraction and analysis were reported, and the challenges were discussed in depth at this symposium. There is no doubt that chemical information in documents is well hidden, and that the treasure hunt faces many challenges. Sometimes satisfactory or even acceptable results cannot be obtained, particularly when dealing with chemical patents and/or Markush structures. However, a great number of minds have been working very hard to build comprehensive databases and to develop powerful tools in this field, and more will certainly become available.

The symposium will probably be reconvened in a couple of years. Hopefully, with the improvement of computing power and algorithms, we will hear more successful user stories.

David Deng, Symposium Organizer

Recorded content from six CINF symposia and poster sessions held at the Fall 2012 ACS National Meeting is at:
http://presentations.acs.org/common/sessions.aspx/Fall2012/CINF
Free access for the ACS Members registered for the 2012 Fall National Meeting,
Paid access for ACS Members not registered for the Meeting and non-ACS Members. 

 

Cheminformatics and Drug Repurposing

The symposium took place on Sunday, August 19, 2012 from 1:30 PM until approximately 5:30 PM in the Philadelphia Marriott Downtown hotel. Seven speakers from industry, academia and other research labs shared their expertise in the area with over 50 attendees. The speakers presented opinions, case studies, and perspectives on this topic, which is increasingly attractive to the Chemical Information community and other research areas.

Chris Lipinski (Consultant for Melior Discovery, USA) opened the symposium providing a broad perspective of the current status of the success of drug discovery efforts. Chris raised the question: “What is wrong with drug discovery?” He then reflected that one of the failures of drug discovery efforts is the current single-target approach. In this context, drug repurposing or drug repositioning is based on an alternative multi-targeted approach. Chris pointed out that a vast number of resources that are available in the public domain are promising for drug repurposing efforts in industry and other research organizations.

Thomas Freeman (Boehringer-Ingelheim, USA) spoke about the current areas of improvement of drug discovery from an industry point of view. He emphasized that the biology involved in the drug development efforts is highly complex and then discussed three major approaches to mine the vast amount of accumulated data for drug repurposing: biological, chemical, and textual data mining. Freeman presented case studies that exemplify the success of these approaches.

Iwona Weidlich (University of Maryland, USA) discussed a comprehensive study conducted in an academic setting to identify approved drugs with HCV RNA polymerase activity. She described a general computational approach to develop predictive QSAR models for molecules in a training set and then apply such models to mine databases of approved drugs. Iwona also covered key aspects of database preparation and the generation of predictive models, and shared her personal experience of the challenges that academia faces in conducting drug repurposing projects.

Joshua Swamidass (Washington University in St. Louis, USA), also from an academic point of view, talked about opportunities and challenges that drug repurposing faces. He presented a rigorous statistical-based approach to predict potential biological activities starting from sparse and incomplete data matrices of chemical databases annotated with biological activity across different targets.

Antony Williams of the Royal Society of Chemistry, one of the lead developers of ChemSpider, discussed in detail the Open PHACTS project, a large initiative to mine publicly available databases such as PubChem, ChemSpider, and BindingDB for drug repurposing using cheminformatics tools. He also discussed standards and data validation in available databases, a significant issue that Open PHACTS will address. The slides from his presentation are available on his SlideShare page.

Dongsup Kim (Department of Bio and Brain Engineering, KAIST, Republic of Korea) presented the development and applications of the drug-drug relationship score (DRS), which is used to predict drug pairs that share common targets. The score can be applied as a predictive method for new target identification and has successfully predicted pharmacological effects.

Mohamed Abdul Hameed (Biotechnology High Performance Computing Software Applications Institute, USA) closed the symposium presenting the results of a general target fishing strategy based on 3D similarity searching. He described both the validation of the approach using a well-known data set of decoy compounds and then the application of the validated approach for drug repurposing.

Two papers could not be presented due to unavoidable travel conflicts for the presenters, Richard Cramer (Tripos, USA) and Chenglong Li (The Ohio State University, USA).

José Medina-Franco and Rachelle Bienstock, Symposium Organizers

Future of the History of Chemical Information

Yes, you read it correctly; we are wondering where the venerable story of chemical information is bound. Consider the impact on chemical research of machine-readable documentation over the past 50+ years, and systematic chemical nomenclature the 100+ years before that. Consider the generations of chemists who built this discipline through their scholarly exchange and navigating the politics of their time. Bend those lenses around to look forward and consider what of the current day will most influence the progress of the chemical enterprise and its information in 50 years. What can we learn from our history to help us focus our endeavors to make future history? As we chart our way forward, what are the important principles for chemistry and chemical information, in particular, that we all in the information profession need to keep clear, front and center? These questions were the drivers of a CINF symposium at the recent ACS Meeting in Philadelphia.

We heard from a diverse panel of knowledgeable information professionals what the landscape of today could lead and distill to, based on what we have learned from various perspectives over 100+ years, about chemistry, information, and most importantly, the people involved in it all. Twelve speakers gave reflective analyses based on their respective areas of expertise, tying it to essential issues for CINF with implications for the fellow Divisions of Chemical Education (CHED) and History of Chemistry (HIST) as well. Links to the presentation slides for most talks are included in this report and also available at: /node/347 (abstract numbers 47-51 & 59-65). My impressions and reflections on the impact of the future of the history of our chemical information are represented below.

Peter Rusch, currently the Chair of, and the CINF Liaison to, the ACS Committee on Nomenclature, Symbols and Terminology, set the tone of the day by “cantilevering history.” He aptly illustrated how cantilevering, much like in bridge building, is a critical aspect of the work of information professionals, and is never really done. His on-point, prescriptive “prospective retrospection” suggested that those practicing the unrecognized “central science” with its “unobvious” information will need to stay vigilant about the integrity of the science. Important principles to consider are seemingly self-evident, but not to be overlooked in any scenario: price/performance, chemical integrity, personal contact and conversation, and of course, good information habits.

Delving into the long history of chemical nomenclature and structure representation were two talks based on a symposium held at the Royal Society of Chemistry in London in November of 2010 (http://www.rsc.org/Membership/Networking/InterestGroups/CICAG/meetings.asp, scroll down to “Celebrating the History of Chemical Information”).

Bill Town gave a thoughtful walk through the histories of confusing nomenclature and eventually more specified compound classification. Early alchemical history was fraught with persecution, resulting in layers of confusion between warring desires of useful classification and secrecy. It took several hundred years to work through multiple systems until the atomic theory and more accurate analysis pulled together understanding. As the need for granularity increased, different nomenclatures and classifications appeared appropriate for organic compounds, inorganic compounds and the elements. Scientists finally started grappling with standardization in the 19th century.

Phil McHale delivered an entertaining evolution of structure representations, from early recognition of atoms and aromatics, through complexities of stereochemistry and delocalized bonds, to implications of Markush generics. Computerized systems depend on clear notation to support robust compound RSVP (register, search, view, print/publish) and have served up a variety of coding schemas based on fragments for substructure searching or linear notation for unambiguous identification. Current structure representation techniques focus on informatics applications, including calculation, prediction, analysis, and leveraging the networked environment through enhancing traditional information formats, linking diverse information streams, and pushing molecular manipulation potential into a variety of social communication venues.

Steve Heller picked up the story of structure representation with a primer on the emerging InChI standard, IUPAC’s algorithm-based, open source International Chemical Identifier system. The idea of the InChI is to enable linking across the very diverse landscape of chemical notation, and definitely gives a twist on future thinking, pushing information publishers and vendors into thinking beyond their current systems and focus on transferrable deliverables. This approach is compatible with any registry or indexing system, but the challenge for InChI will be encouraging support and cooperation across the information industry to implement and develop further specifications as the chemical and computational landscapes continue to evolve.
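The linking idea can be illustrated with a minimal Python sketch that joins two hypothetical record sets, each with its own local identifiers, by keying both on the InChI string. The source names, record IDs, and notebook fields below are invented for the example; only the InChI strings themselves (ethanol, methane, water) are real standard InChIs. A real workflow would generate the InChIs with InChI software rather than typing them in.

```python
# Hypothetical records from two independent sources, each with its own local ID.
VENDOR_CATALOG = [
    {"catalog_no": "V-001", "inchi": "InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3", "name": "ethanol"},
    {"catalog_no": "V-002", "inchi": "InChI=1S/CH4/h1H4", "name": "methane"},
]
IN_HOUSE_REGISTRY = [
    {"reg_id": "REG-42", "inchi": "InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3", "notebook_page": 17},
    {"reg_id": "REG-43", "inchi": "InChI=1S/H2O/h1H2", "notebook_page": 18},
]


def link_by_inchi(left, right):
    """Return merged records for compounds whose InChI appears in both sources."""
    right_index = {rec["inchi"]: rec for rec in right}
    linked = []
    for rec in left:
        match = right_index.get(rec["inchi"])
        if match is not None:
            # Neither source needs to know the other's local identifier scheme.
            linked.append({**rec, **match})
    return linked


def inchi_layers(inchi):
    """Split an InChI into its version prefix and its slash-separated layers."""
    prefix, _, rest = inchi.partition("/")
    return prefix, rest.split("/")


if __name__ == "__main__":
    for record in link_by_inchi(VENDOR_CATALOG, IN_HOUSE_REGISTRY):
        print(record["catalog_no"], "<->", record["reg_id"])
    print(inchi_layers(VENDOR_CATALOG[0]["inchi"]))
```

Because the identifier is computed from the structure itself, the join works without either source adopting the other's registry numbers, which is the sense in which the approach is compatible with any registry or indexing system.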

Guenter Grethe traced the evolution of chemical reaction information from early alchemy focusing heavily on methodology. Desire for control brought on more scientific-like approaches to experimentation and the need for more systematic explanation. Printed sources were characterized by complex indexes and vetted methodology. The diversity of information related to reactions lends itself to endless creativity in computational approaches, including synthesis design, which predated reaction information retrieval. Early synthesis design programs used a variety of algebraic, knowledge-based or numeric approaches; later algorithms relied on reaction information. The real challenge with any reaction tool is interacting with the chemists using the systems, and classification remains an important mental indexing tool for chemists. RInChI is currently under development and may help navigate some of the many wrinkles that still persist across systems. Guenter’s call to honor “the intelligence and creativity of…chemists” is a good aspiration as we hurtle into the future.

The afternoon session started off with two information services having long histories of innovation in chemical searching, Web of Science and Chemical Abstracts. Vijay Bhatia and Roger Schenck both focused on the future of evaluation and analysis in information systems at the chemical level. Current trends indicate increasing abundance of chemical information of diverse types and sources and chemically robust systems will need to enable scientists across disciplines to sift through the cornucopia more actively and intellectually, and reach decisions. Search and delivery have vastly improved in quality and efficiency over decades and scientists now need sophisticated tools supporting various informatics techniques. Not all information is created equivalent in content or quality and not in all contexts, especially in such intertwining, cross-disciplinary areas as chemical biology.

The next two talks considered the role of chemical information incorporating basic knowledge into learning. Through a historical tour of chemical information education, Adrienne Kozlowski delivered a strong sentiment to revive the focus on information skills in education, reminding us that CINF originated in CHED. Bruce Lewenstein focused on the central role of textbooks in chemical education. With this form in particular there are warring factors under the hood that influence what is presented to students, including considerations of economy, education as industry, adoption-rejection, and different takes on basic subjects by different types of scientists. A lively audience discussion considered Internet-based tools and data flows for chemical education, trending towards increased availability of materials, a divergence of large one-stop tools and many specialized approaches, and the mobile environment that lends itself to smaller discrete steps, or “apps.” A general concern emerged throughout the day that with less tedious activities required to search, find and work with chemical information, there is in effect less practice and less re-enforcement with students about this important aspect of chemistry research.

Engelbert Zass delivered a rigorous retrospective of the interaction of chemists and their information in tandem with the technical developments of access and use over time. We are at a unique point in this history where career information specialists have directly experienced many approaches to stitching together the pieces necessary for robust chemical searching. Some interesting patterns emerge when considering the long view: there are many core fundamental steps that the tools of any day need to address and the data sources need to be well-structured to support this retrieval; chemists themselves need to weigh in scientifically at many of these steps, the searching process is as unique and critical to chemical research as the individual scientists; and this intellectual engagement has ironically been most often accomplished through usually tedious “work-arounds.” Engelbert gave a passionate call that the vigilance of information professionals today needs to be no less; there are as many dangers in today’s searching systems demanding multi-step complicated “work-arounds” and the primary responsibility for searching has again shifted back to chemists themselves as in the previous era of printed sources.

A unique and thought-provoking contribution to the consideration of the future of the history of chemical information was provided by Jeff Seeman’s focus on chemists’ information. As a chemist-historian interested in the unfolding of chemistry through the people who practice and produce it, Jeff seeks information from archival sources as well as the published literature and searching tools. A series of powerful stories around some of the classic discoveries in chemistry gleaned from “primary data” sources illustrated the ongoing importance of considering the past in light of the present and future, for practicing chemists and historians alike. The past is a moving target depending on the vagaries of technology, economics, politics and how researchers choose to build on it; continued access to this past is a concern for all involved. Chemists themselves should be aware of and engage in thoughtful record keeping of their correspondence, data and other aspects of their research process, especially as the daily interactions around research become increasingly ephemeral in the digital environment.

Robert Buntrock brought the symposium together completing the bridge analogy connecting seekers and information. Through a whirlwind tour of the diverse variety of information sources and a dizzying array of print and early machine “interfaces,” the core principles of good information seeking remain the same, from keeping current to experimental design to comprehensive literature reviews and competitive analysis. With the advent of greater access and options for searching online, it is more critical than ever before for information professionals to support chemists. While the construction techniques need updating to meet the technologies, information professionals continue to bridge the same abyss between practicing chemists and the information they need.

Overall it was a great team perspective on how we’ve arrived at the present day; and how even less well prepared I feel than ever before...but inspired. I don’t have any answers. I am still deep in the middle of it all, not quite long enough to fully appreciate where we have been with the intersection of computers, and not quite naive enough to jump into every idea that washes through. I am especially interested in the players: amid international and government players how much of a role will the industry continue to have in shaping information? Is there really a future for the academic side and is this best focused through computer science and information theory approaches, or do we need to bring in an ethnographic approach, or just more chemists? With enhanced data access, linking, parsing and re-mixing just on the horizon, what new complexities and abilities will chemists and their science encounter? The impression is a perfect storm of centripetal forces; and I am looking forward to pushing this momentum into the murky landscape rich in potential for high-value information.

Leah McEwen, Symposium Co-Organizer

Herman Skolnik Award Symposium

Honoring Henry Rzepa and Peter Murray-Rust

Introduction

This one-day symposium was remarkable for its record number of speakers (23 in all, plus one withdrawn and one replaced by a demonstration). Despite the number of performers, and some unfortunate technical faults, the whole event proceeded on schedule and without serious mishap. Henry Rzepa’s own talk was an opening scene-setter. He told a 1992 tale of some molecular orbitals explaining the course of a chemical reaction. The color diagram of these lacked semantics, and when it had been sent by fax to Bangor, it even lost its color. Months later the work was published,1 but the supporting information (SI) is not available for this article, and even if it were available electronically, would it be usable? So, how can it be mined for useful data or used as the starting point for further investigation?

By 1994 Henry and his colleagues had recognized the opportunities presented by the World Wide Web.2,3 The data for a later article4 do survive in the form of Quicktime and MPEG animations on the Imperial College Gopher+ server, but they are semantically poor, i.e., they are interpretable by humans, but not by computer. The X-ray crystallography data are locatable using the proprietary identifier HEHXIB allocated by the Cambridge Crystallographic Data Centre. Open identifiers such as the IUPAC International Chemical Identifier, InChI, are preferable. It would be better if we had access to semantically rich data that allow reanalysis of the key intermolecular interactions (described in Henry’s blog entry of July 5, 2012, http://www.ch.imperial.ac.uk/rzepa/blog/?p=7027). The answer is a hand-crafted XML document with the SI as a “datument”: a superset of the main article.5 Molecules and spectra are expressed in Chemical Markup Language (CML)6 and presented using a Java applet and scalable vector graphics (SVG). The underlying data for the article are still semantically alive today.

More recently, Henry has used electronic SI as a data repository for the main article.7 The molecules are expressed in CML and a Jmol applet is used as the presentation layer in the style of an explorable story-board. Quick-response (QR) access to the data in 2011 allowed a re-investigation, with revised conclusions. Datasets should be deposited in digital repositories,8 using CML where possible, and assigned a handle (equivalent to a digital object identifier, DOI). Metadata can be generated from automated scripts and can be harvested for re-injection into other repositories. In Peter Murray-Rust’s Chempound (https://journals.tdl.org/jodi/article/view/5873/5879), the Resource Description Framework (RDF) allows SPARQL (http://www.w3.org/TR/rdf-sparql-query/) semantic queries of data. The repository figshare (http://www.figshare.com) allows users to upload any file format so that figures, datasets and media can be disseminated in a way that the current scholarly publishing model does not allow. Most journals treat such data-rich objects as “gold” Open Access, but there are not yet many articles with such data and you may not have permission to mine them, or even know how to find them. Perhaps gold data need their own DOIs in figshare, SPECTRa8 etc.

Steve Bachrach’s Computational Organic Chemistry blog (http://comporgchem.com/blog/) is data-rich, discussion-rich, and archivable. In other work, device-agnostic HTML5 components have been rendered natively in a browser or the epub3 Reader (the new shrink-wrapper), enabling a mobile ecosystem. Talks later in the symposium enlarged on the topics introduced by Henry.

Visualization

The first invited talk was by Bob Hanson of St. Olaf College who described two open source Java applets, Jmol and JSpecView, that are used for interactive access to molecules and spectra. Jmol is a viewer for chemical structures in 3D. JSpecView, a viewer for spectral data in the JCAMP-DX format, reads a variety of spectral data types, and has recently been integrated into Jmol. Bob also discussed a proposal for a JCAMP file extension, JCAMP-MOL (http://chemapps.stolaf.edu/jmol/docs/misc/Jmol-JSpecView-specs.pdf), that allows Jmol and JSpecView to read molecular structures, spectra and associated correlation data all from the same file. Two new user-defined data labels add 3D Jmol-readable models to the file and also associate spectral bands with specific IR and Raman vibrations, MS fragments, and NMR signals. The purpose of JCAMP-MOL is to allow for a single file that can be read either by the standalone Jmol application or by twin Jmol and JSpecView applets on a Web page. Clicking on an atom or selecting an IR/Raman vibration in Jmol highlights a band or peak or fragment on the spectrum. Clicking on the spectrum highlights one or more atoms, starts an IR vibration, or displays an MS fragment in Jmol. The specification was implemented successfully in Jmol 12.2.18 early in 2012.
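To make the idea of user-defined data labels concrete, here is a simplified Python sketch of reading JCAMP-DX style labeled data records, which take the form "##LABEL=value" and mark user-defined labels with a leading "$". The sample file contents are invented placeholders, and the label names `$MODELS` and `$PEAKS` are assumptions standing in for the two new labels; the actual names and payload formats should be taken from the JCAMP-MOL specification linked above. This reader also ignores JCAMP-DX comments and compressed data forms, so it is not a full parser.

```python
SAMPLE = """\
##TITLE=hypothetical example
##JCAMP-DX=5.01
##DATA TYPE=INFRARED SPECTRUM
##$MODELS=
<placeholder for embedded 3D model data>
##$PEAKS=
<placeholder for band-to-vibration assignments>
##END=
"""


def read_records(text):
    """Parse '##LABEL=value' records; lines without '##' continue the previous value."""
    records = []
    for line in text.splitlines():
        if line.startswith("##"):
            label, _, value = line[2:].partition("=")
            records.append([label.strip().upper(), value.strip()])
        elif records and line.strip():
            # Continuation line: append it to the value of the most recent record.
            records[-1][1] = (records[-1][1] + "\n" + line).strip()
    return [(label, value) for label, value in records]


def user_defined(records):
    """Keep only user-defined records, i.e. those whose label starts with '$'."""
    return {label: value for label, value in records if label.startswith("$")}


if __name__ == "__main__":
    for label, value in user_defined(read_records(SAMPLE)).items():
        print(label, "->", value)
```

A reader that does not understand the user-defined labels can skip them and still display the spectrum, which is what lets one file serve both the standalone application and the paired applets.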

The next speaker was Josef Polak of iChemLabs, the company that produces the ChemDoodle chemical structure environment (http://www.ichemlabs.com/products) focusing on 2D graphics and publishing (a product which, incidentally, was used to create all of the posters, pamphlets and conference books at this ACS National Meeting). Josef described how HTML5 adds new functionality in the browser. Java applets and third-party plug-ins such as Flash are being replaced by HTML5 and WebGL, not least in the open source ChemDoodle Web Components, a JavaScript chemical graphics and cheminformatics library allowing users to present publication quality 2D and 3D graphics and animations for chemical structures, reactions and spectra. Beyond graphics, this tool provides a framework for user interaction to create dynamic applications through Web browsers, desktop platforms and mobile devices such as the iPhone, iPad and Android devices. The power of mobile technologies was well demonstrated in Josef’s presentation when both projectors failed simultaneously: Josef continued, unfazed, while Kevin Theisen of iChemLabs walked around the room showing the slides on his iPad. The ChemDoodle Web Components library is being used by Henry Rzepa in datuments,5 in the user interface to Jmol, Open Babel (http://openbabel.org/wiki/Main_Page) and ChemSpotlight (http://chemspotlight.openmolecules.net/), and in various educational applications.

Authoring and ELNs

Alex Wade of Microsoft Research talked about the Chemistry Add-in for Word, “Chem4Word” (http://research.microsoft.com/en-us/projects/chem4word/), a joint initiative of Microsoft Research and the University of Cambridge, the goals of which are to simplify the task of authoring a chemical document and to do so in such a way that the document is semantically meaningful, facilitating downstream tasks such as publisher’s workflow, entity extraction and semantic applications. Chem4Word is an open source tool that chemically enables Word, allowing direct search of structural repositories and insertion of structures directly into documents. Structures can be locally manipulated within Word and are stored in CML format. Alex explained the nature of Office Open XML files, and demonstrated the chemical editing and re-use cycle: loading structures into Word, from a gallery in Chem4Word itself or from PubChem (http://pubchem.ncbi.nlm.nih.gov/), editing structures, getting CML data back out of a document, and using and sharing the data in Chemistry for SharePoint.

A talk by Jeremy Frey of the University of Southampton also concerned the sharing of data. His team’s first approach to the semantic electronic laboratory notebook (ELN) was the Smart Tea project,9 so-called because, in order to gain a better understanding of the chemist’s experimental design and execution process, the team made tea as a chemistry experiment. This early work, at the start of the e-science revolution, pushed the boundaries of the use of RDF, schemas and ontologies. “More Tea” used a tablet interface and RDF World, but these hardware and software technologies still did not have the necessary power. LabTrove (http://www.labtrove.org/) is a more flexible ELN and data management system facilitating the capture of information and the use of this information in a collaborative environment. Jeremy’s team has implemented a system (“Blogjects”) to “blog” information from instruments: the Smart Research Framework (SRF) LabBroker middleware gets the data into the trove before the users even look. “Tweetjects” is another option. The ELN pages can now be read by both humans and computers, using XHTML (http://www.w3.org/TR/xhtml1/) and RDFa (http://www.w3.org/TR/xhtml-rdfa-primer/). Barcodes can be incorporated, too, and LabTrove can be linked to SharePoint, using RSS, Atom, and the Open Data Protocol (OData, http://www.odata.org/).

The difference between Jeremy’s system and other approaches is that the data are associated with the proposed scientific endeavor prior to or at the point of creation rather than by annotating the data with commentary after the experiment has taken place. This means that scientists and their peers can recreate and adapt the experiment repeatedly having already automated the processes and instrument settings. Prospective provenance describes a scientific experiment that will be enacted; retrospective provenance describes the scientific experiment that was enacted. Recording provenance allows the experiment itself to be embedded within the literature.

One weakness of the current system is the lack of support for existing external vocabularies and data models. Blog³ (and TeaTrove³) will have even greater user focus and semantic rigor. Blog³ provides an extensible plug-in architecture that enables authentication and authorization; in-line preview and search-engine indexing for all data; an integrated vocabulary and schema-editing environment; and export of all data in a variety of formats.

Simon Coles, also of Southampton University, continued the theme, talking about the ELN in academia. The Dial-a-Molecule Grand Challenge (https://connect.innovateuk.org/web/dial-a-molecule1/) addresses the problem of efficiently making molecules in days, not years. ELNs could be a response to this challenge. Other drivers are information overload, and government and funding agency initiatives to encourage researchers to share data openly. Repositories such as Dryad (http://datadryad.org/) and figshare (http://www.figshare.com) allow data to be published in their own right. Citation of data through DataCite (http://www.datacite.org), for example, promises attribution and recognition for data publication.

An academic ELN should support a range of data acquisition techniques at different scales; promote access to data, sharing and reuse; enable discovery of results in related disciplines; facilitate access to data underpinning publications; enhance communication across the community; and support long-term preservation. ELNs currently on the market are primarily concerned with the protection of intellectual property and are very poor at supporting academic practice. The solution is to turn the ELN into a publishing platform in its own right with a protocol by which a range of existing platforms and resources can make the content available, based on simple, structured metadata. A number of repositories and alliances already exist and a number of people involved in them got together to produce a “lowest common denominator” solution, easy to implement on any platform, that can nevertheless be made more sophisticated at a later stage.

The multi-layered approach included a knowledge layer, with “core” metadata, an information layer, with “contextual” metadata, and a computation layer with “detail” metadata. Through the knowledge layer, users can discover what is being made available, whether it is of interest, and whether it can be accessed. The information layer determines the granularity at which data should be made available, and the computation layer determines whether the information can be processed automatically. Two case studies illustrate the entry point for layers two and three. One is LabTrove (described by Jeremy Frey earlier). The other is an extension to the IDBS e-Workbook plug-in that enables deposition of 2D structures directly into the Royal Society of Chemistry (RSC) database ChemSpider (http://www.chemspider.com). This could be extended to more content, such as spectra, reactions and properties. Simon’s team is developing examples of automatic accessing and processing of data in ELN layers two and three, and is encouraging wider academic use of ELNs. They will also mine theses and patents and investigate getting data out of old notebooks. The semantic ELN, Blog³, described by Jeremy Frey, and “iPad in the Lab” are other works in progress.
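One way to picture the three layers is as nested, optional metadata records, where only the "core" record is required. The Python sketch below is an illustration under that assumption, not the protocol itself: every class name, field name, and vocabulary value is hypothetical and chosen only to mirror the questions each layer is said to answer.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class CoreMetadata:
    """Knowledge layer: what is available, is it of interest, can it be accessed."""
    title: str
    creator: str
    keywords: List[str]
    access: str  # e.g. "open" or "restricted" (hypothetical vocabulary)


@dataclass
class ContextualMetadata:
    """Information layer: the granularity at which the data are made available."""
    granularity: str  # e.g. "notebook", "experiment", "measurement"
    related_entries: List[str] = field(default_factory=list)


@dataclass
class DetailMetadata:
    """Computation layer: whether the data can be processed automatically."""
    media_type: str
    machine_readable: bool


@dataclass
class NotebookEntry:
    core: CoreMetadata
    contextual: Optional[ContextualMetadata] = None
    detail: Optional[DetailMetadata] = None

    def deepest_layer(self) -> int:
        """Return 1, 2 or 3: the deepest layer reached without skipping one."""
        if self.contextual is None:
            return 1
        return 2 if self.detail is None else 3

    def can_auto_process(self) -> bool:
        """True only if a computation layer exists and marks the data machine readable."""
        return self.detail is not None and self.detail.machine_readable


if __name__ == "__main__":
    entry = NotebookEntry(
        core=CoreMetadata("Example synthesis", "A. Chemist", ["amide"], "open")
    )
    print(entry.deepest_layer(), entry.can_auto_process())
```

Making the second and third layers optional reflects the "lowest common denominator" intent: a platform can start by publishing core metadata alone and add the richer layers later.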

Blogs

Continuing the blog theme, Steven Bachrach of Trinity University listed a number of examples. Peter Murray-Rust’s blog (http://blogs.ch.cam.ac.uk/pmr/), Derek Lowe’s In The Pipeline (http://pipeline.corante.com/), Paul Bracher’s ChemBark (http://blog.chembark.com/) and The Chemistry Blog (http://www.chemistry-blog.com) provide opinion and news. Some blogs, such as James Ashenhurst’s Master Organic Chemistry (http://masterorganicchemistry.com), are for teaching. Paul Docherty’s Totally Synthetic (http://totallysynthetic.com/blog/) and Steve Bachrach’s own Computational Organic Chemistry (http://comporgchem.com/blog/) publish article reviews. Henry Rzepa’s blog (http://www.ch.ic.ac.uk/rzepa/blog/) features original research. Blog aggregators include Egon Willighagen and Peter Maas’ Chemical Blogspace (http://cb.openmolecules.net/) and Jan Jensen’s Computational Chemistry Highlights (http://www.compchemhighlights.org/).

Two recent examples illustrate post-publication peer review by blog. Following criticism that began on Totally Synthetic, a paper on reduction by sodium hydride10 was withdrawn for scientific reasons; and a paper speculating about dinosaurs in space11 was criticized in several blogs for self-plagiarism and exaggerated claims before being withdrawn by the author on the grounds of similarity to his earlier publications. Steve himself has good reasons other than altruism for blogging. He surveys the literature to keep his book current and to assist in writing the second edition. His blog also forms the basis of a series of review articles for the RSC and demonstrates the use of blogging in chemical communication. Blogging faces pressure from other social media, but it is hard to envisage Twitter as an effective chemical communication medium. Altmetrics (an alternative to journal Impact Factors) and journal review overlay may establish a professional benefit to blogging in future.

Statistics and Property Prediction

Egon Willighagen at Maastricht University gave his presentation remotely. His take-home message was that you can improve your property prediction, training, and validation by adopting semantic pipelines.12 This means using open look-up lists, dictionaries, and ontologies; removing format limitations; linking to data from other domains; and using calculation provenance. CML is semantic, flexible, and embeddable in HTML and RSS, but it is limited to XML. JavaScript Object Notation (JSON, http://www.json.org/) and Terse RDF Triple Language (Turtle, http://www.w3.org/TeamSubmission/turtle/) are alternative formats to XML for transmitting data between a server and a Web application. They enable linked data. RDF is an open standard, independent of format and database technology, and embeddable in HTML. It can be queried using SPARQL (http://www.w3.org/TR/rdf-sparql-query/). A federated query extension allows execution of queries distributed over different SPARQL endpoints.
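To make the format independence of RDF concrete, the following sketch holds a few subject-predicate-object triples in memory, writes them out as Turtle-style statements and as JSON, and answers a single triple pattern in the way a SPARQL variable would. It uses only the Python standard library and is not a real RDF or SPARQL engine; the `ex:` prefix and the property names are invented for illustration, and a complete Turtle document would also need `@prefix` declarations.

```python
import json

# Hypothetical triples; the "ex:" names are invented for this example.
TRIPLES = [
    ("ex:caffeine", "rdf:type", "ex:Molecule"),
    ("ex:caffeine", "ex:molecularWeight", "194.19"),
    ("ex:ethanol", "rdf:type", "ex:Molecule"),
]


def to_turtle(triples):
    """Write one 'subject predicate object .' statement per line."""
    return "\n".join(f"{s} {p} {o} ." for s, p, o in triples)


def to_json(triples):
    """Serialize the same triples as a JSON array of objects."""
    return json.dumps([{"s": s, "p": p, "o": o} for s, p, o in triples])


def match(triples, s=None, p=None, o=None):
    """Return triples matching a pattern; None is a wildcard, like ?x in SPARQL."""
    return [
        t for t in triples
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    ]


print(to_turtle(TRIPLES))
# Roughly "SELECT ?s WHERE { ?s rdf:type ex:Molecule }"
print([t[0] for t in match(TRIPLES, p="rdf:type", o="ex:Molecule")])
```

The same three statements survive either serialization unchanged, which is the sense in which the data model is independent of the format.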

One application is a computational toxicity assessment platform13 generated from integration of two open science platforms related to toxicology: Bioclipse, which combines a scriptable, graphical workbench environment for integration of diverse sets of information sources, and OpenTox, a platform for interoperable toxicology data and computational services. A second application (unpublished) is Egon’s work on nanotoxicity carried out in Stockholm last year, using SPARQL to link a wiki to the R statistics environment. Another project in progress is the Open Pharmacological Concepts Triple Store (Open PHACTS, http://www.openphacts.org),14 a knowledge management project of the Innovative Medicines Initiative (IMI, http://www.imi.europa.eu/).

Rajarshi Guha of NIH discussed the benefits of integrating cheminformatics with statistical software, specifically the Chemistry Development Kit (CDK, http://sourceforge.net/apps/mediawiki/cdk/index.php?title=Main_Page) and R. R is an environment for modeling that contains many prepackaged statistical and mathematical functions. It is also a matrix programming language that is good for statistical computing. Cheminformatics capabilities include statistics and machine learning and R is well suited to these. There is thus a case for “cheminformatics in R.”

CDK provides chemical and more complex objects, input and output of various molecular file formats, fingerprint and fragment generation, rigid alignments, pharmacophore searching, substructure searching, SMARTS support, and molecular descriptors. Rajarshi has implemented CDK (http://github.com/rajarshi/cdk; http://sourceforge.net/projects/cdk/) in R using the rJava package, providing access to a variety of CDK classes and methods, and idiomatic R. Currently in rcdk you can access atoms and bonds and get certain properties and 2D and 3D coordinates, but since rcdk does not cover the whole CDK API you might need to drop down to rJava level, and make calls to the Java code, in some cases.

Rajarshi outlined some applications. The fingerprint package implements 28 similarity and dissimilarity metrics, allowing enrichment studies and comparison of datasets.15 2D structure images can be visualized. A typical QSAR workflow can be followed.
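As an illustration of the kind of metric such a package provides, here is a minimal Python sketch of the Tanimoto coefficient, one widely used fingerprint similarity measure. It is not the R fingerprint package's API: fingerprints are represented simply as sets of "on" bit positions, and the example bit sets are made up.

```python
def tanimoto(a, b):
    """Tanimoto (Jaccard) similarity of two fingerprints given as sets of on-bits."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0  # convention chosen here for two empty fingerprints
    return len(a & b) / len(a | b)


def tanimoto_distance(a, b):
    """Corresponding dissimilarity: 1 minus the similarity."""
    return 1.0 - tanimoto(a, b)


fp1 = {1, 4, 7, 9}  # made-up bit positions
fp2 = {1, 4, 8}
print(tanimoto(fp1, fp2))  # 2 shared bits out of 5 distinct bits
```

Other metrics differ mainly in how they weight shared bits against bits set in only one fingerprint.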

The PubChem (http://pubchem.ncbi.nlm.nih.gov/) and ChEMBL (https://www.ebi.ac.uk/chembl/) databases can also be accessed directly within R using their public APIs. Published QSAR models may even become reusable: reproducible data mining is encouraged because DB and HTTP access ensures that an analysis can always be up to date if required.

Open Chemistry

In the final talk of the morning session, Marcus Hanwell of Kitware criticized the proliferation of black box, proprietary codes in chemistry. There is a need for open tools and open standards, and more papers should include data. The Open Chemistry project (http://www.openchemistry.org/) is a collection of open source, cross platform libraries and applications for the exploration, analysis and generation of chemical data. Kitware is developing three independent applications: the Avogadro2 structure editor; MoleQueue, for running local and remote jobs; and ChemData, for storing, annotating and searching data. Avogadro (http://avogadro.openmolecules.net/)16 is an open source molecule editor and visualizer designed for cross-platform use in computational chemistry, molecular modeling, bioinformatics, materials science, and related areas. The Avogadro library is a framework providing a code library and application programming interface (API) with 3D visualization capabilities. The Avogadro application provides a rich graphical interface using dynamically loaded plug-ins through the library itself. The application and library can each be extended by implementing a plug-in module in C++ or Python. By using the CML file format as its native document type, Avogadro seeks to enhance the semantic accessibility of chemical data types. HDF5 (http://www.hdfgroup.org/HDF5/) will be used to store “heavy data” (e.g., for quantum mechanics). Kitware distributes its products using the very open Berkeley Software Distribution (BSD) license.

Artificially Intelligent Chemists

Peter Murray-Rust opened the afternoon session with some thoughts on building artificially intelligent chemists. He was helping to build a knowledge base for the Dial-a-Molecule Grand Challenge (https://connect.innovateuk.org/web/dial-a-molecule1/), but found that many publishers were unwilling to allow him to mine their content. There was interest in artificial intelligence (AI) in the 1970s, but over the next 35 years little progress was made. Some early examples are Ralph Christoffersen’s work on quantum pharmacology17 and Malcolm Bersohn’s work on retrosynthesis.18 In those days knowledge bases depended on look-up, heuristics, rules, logic, brute force, tree pruning and computing chemical reality. Nowadays most of the tools we need are available but the will to use them is not there. Peter presented a diagram of the 2012 knowledge base, and perception and communication of the transformed knowledge. Knowledge is represented in CML, ontologies and other domains. AI means putting all the components together.

Peter discussed a chemical application of John Searle’s Chinese room thought experiment (http://en.wikipedia.org/wiki/Chinese_room). The experiment supposes that there is a program that gives a computer the ability to carry on an intelligent conversation in written Chinese. If the program is given to someone who speaks only English to execute the instructions of the program by hand, then in theory, the English speaker would also be able to carry on a conversation in written Chinese. However, the English speaker would not be able to understand the conversation. Here are Frog and Zog asking Magic Chemical Panda a chemical question and getting an answer:

[Image: Frog and Zog asking Magic Chemical Panda a chemical question and getting an answer]

There is no “Magic Chemical Panda” in Peter’s box (http://vimeo.com/48280639). Chemical names are found by look-up and if the precise name is not found, the rule book is used to manipulate symbols and relate ethanoic to ethanoate, say. The Open Parser for Systematic IUPAC Nomenclature (OPSIN) name-to-structure software19 is a symbol manipulation system with a rule base.
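The look-up-then-rule idea can be shown with a toy example. The sketch below is not OPSIN and makes no attempt at real nomenclature parsing: the two-entry table, the single suffix rule and the function name `resolve` are all assumptions made for illustration.

```python
# Toy look-up table (illustrative). Each acid SMILES is written so that the
# hydroxyl oxygen is the final character, which the rule below relies on.
NAME_TO_SMILES = {
    "ethanoic acid": "CC(=O)O",
    "methanoic acid": "C(=O)O",
}


def resolve(name):
    """Return a SMILES string for a name, or None if nothing applies."""
    name = name.strip().lower()
    # Step 1: exact look-up.
    if name in NAME_TO_SMILES:
        return NAME_TO_SMILES[name]
    # Step 2: rule book. "<stem>oate" is the anion of "<stem>oic acid".
    if name.endswith("oate"):
        acid = name[:-3] + "ic acid"  # "ethanoate" -> "ethanoic acid"
        acid_smiles = NAME_TO_SMILES.get(acid)
        if acid_smiles is not None:
            # Pure symbol manipulation: deprotonate the trailing hydroxyl.
            return acid_smiles[:-1] + "[O-]"
    return None


print(resolve("ethanoic acid"))  # found by look-up
print(resolve("ethanoate"))      # found by applying the rule
```

As in the Chinese room, the function produces a sensible answer for "ethanoate" without any chemical understanding: it only rewrites strings.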

Peter’s team has also worked on CML, on Chem4Word, and on the intelligent laboratory: Ami20 uses image recognition, voice recognition, sensors and RFID tags. Peter continues to capture semantics “by stealth” and he uses patents because publishers have prevented him from mining the journal literature. “Open” means really open and not pretending that your API is open. It is possible to make revenues from open source software: Kitware, ChemDoodle and GGA Software have proved this.

Computational Chemistry and NMR

Peter has been working with the Environmental Molecular Sciences Laboratory (EMSL) at Pacific Northwest National Laboratory (PNNL) on enriching the NWChem open source computational chemistry software (http://www.nwchem-sw.org/) with CML. Wibe (“Bert”) de Jong was unable to present a talk about this in person, but Marcus Hanwell deputized. NWChem now generates semantic data, enabling Avogadro to extract and visualize NWChem semantic output. The team has completed a CML generator for Gaussian basis function based quantum methods based on the FOX library (http://fox-toolkit.org/), using an infrastructure based on PNNL’s Extensible Computational Chemistry Environment data generator (http://blogs.ch.cam.ac.uk/pmr/2011/11/02/searchable-semantic-compchem-data-quixote-chempound-fox-and-jumbo/). Work currently in progress aims to get all NWChem data stored into a CML output file, to reduce the CML data by avoiding replication, and to integrate CML with the appropriate format for bigger data blocks. Then plane wave capability will be made semantically rich. Another goal is to use Peter’s JumboConverter to convert old NWChem output files into CML, and store them in MyEMSL. The CML CompChem dictionary and conventions are being extended to enable integration of NWChem and NMR data which can be accessed and visualized in MyEMSL through EMSLHub.

In another EMSL talk, Karl Mueller addressed the subject of NMR data. EMSL is collaborating with the Australian Commonwealth Scientific and Industrial Research Organization (CSIRO) in an NMR project. PNNL has about 12 very large NMR instruments, but the data have not been captured well in the past. Karl gave one example of an experiment in which he was involved.21 He showed diagrams of the workflows for translating and processing raw data from an experiment and for simulating and processing raw data from calculations in Gaussian, NWChem, etc. He also showed some screenshots from a potential MyEMSL Workbook for NMR. The team initially planned to continue updating the JCAMP NMR dictionary with relevant terms and definitions, to update the JCAMP parser, to test the output and to begin working on code to extract binary data for Agilent and Varian.

To make further progress, the development of a repository for NMR data must address three important issues: the large number of different NMR experiments in existence, many with multiple versions and variations; the intricate processing steps often required to convert raw time domain data into usable spectra (and the need for a detailed record); and the large number of divergent NMR data formats. A proper record of an NMR experiment must contain original digitized numerical values, information about the source instrument, and saved instrument parameters, all in a standardized file format. The processed spectrum (as saved by the experimentalist) should include software version and processing parameters, in a standardized file format. The standardized file for a high-level experiment description should include sample, pulse sequence, magnetic field, detected isotope, decoupled and undetected isotopes, pulse times, delays, phase cycles, temperature, etc., together with the interpretation and the instrument-parameter to experiment-parameter translation.

New approaches such as blogging are also of interest, so Karl has been collaborating with Jeremy Frey in the use of LabTrove. To put all these approaches together, community buy-in is being sought, and partnerships are being developed with other national facilities in multiple countries, other NMR data model efforts, and NMR spectrometer companies.

Natural Language Processing and the Semantic Web

Lezan Hawizy was indisposed on the day of the symposium and a video was shown of her presentation about natural language parsing for semantic science. ChemicalTagger is an open source package for “understanding” organic chemistry experiments, developed by Peter Murray-Rust’s group, using natural language processing (NLP) approaches. Tools available include Open Source Chemistry Analysis Routines (OSCAR)22 and OPSIN.19 ChemicalTagger converts flowing text into structured text. Processes such as dissolve phase, purify phase and yield phase are marked up in the chemical procedure. Components of ChemicalTagger include tokenizers, which split a sequence of text into individual tokens; taggers, which assign parts of speech to each token; a parser which groups tagged tokens into phrases; and a role identifier which assigns roles to the parsed phrases. Taggers include OSCAR for chemical entities, RegEx for chemistry-related entities, and OpenNLP (http://opennlp.apache.org/) for English entities. The parser has a rule-based grammar for molecules, amounts etc. The role identifier assigns action roles (e.g., “dissolving”) to phrases, and roles such as “solvent” to molecules. The role identifier was evaluated using 50 experimental paragraphs by comparing the effort of four annotators with each other and with ChemicalTagger, using the Dice coefficient to measure similarity. There was about 90% agreement between human and machine tagging.
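The Dice coefficient used in that evaluation is simple to compute for two sets of annotations. The sketch below is a generic implementation; how the study represented annotations is not stated here, so treating each annotation as a (phrase, role) pair, and the example tags themselves, are assumptions.

```python
def dice(a, b):
    """Dice coefficient: 2|A & B| / (|A| + |B|) for two collections of items."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0  # convention chosen here for two empty annotation sets
    return 2 * len(a & b) / (len(a) + len(b))


# Hypothetical annotations of one paragraph as (phrase, role) pairs.
human = {("methanol", "solvent"), ("dissolved", "dissolve"), ("yield", "yield")}
machine = {("methanol", "solvent"), ("dissolved", "dissolve"), ("stirred", "stir")}

print(round(dice(human, machine), 3))  # two shared tags out of three each
```

A score of 1 means the two annotators agree on every tag, and 0 means they share none.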

Daniel Lowe has expanded the work to chemical reactions. The software identifies experimental sections, uses ChemicalTagger with an additional OPSIN tagger to produce structured data, associates chemical entities with quantities, assigns chemical roles, and carries out atom-atom mapping. Daniel extracted 424,621 reactions from 65,034 patent documents. Hannah Barjat has developed an additional tagger, ACPTagger, for use with the open access journal Atmospheric Chemistry and Physics. Lezan showed some visualization features of the resulting system, including geolocations mapped onto a map of the world.

Materials informatics requirements are substantially different from those of small molecule informatics: while structural representations of small molecules often contain enough information for the development of structure-property relationships, this is frequently not the case for complex materials. Often an account of the provenance of a material must be added to its chemical representation. Additionally, materials data are usually generated in “native vernaculars”: non-portable formats, which do not easily allow for data exchange. To make these data widely accessible, they must be converted to formats with standard syntax and semantics that are both human- and machine-comprehensible.

Nico Adams of CSIRO has used a complete Semantic Web toolstack, from XML dialects to axiomatically rich ontological models in Web Ontology Language (OWL, http://www.w3.org/standards/techs/owl#w3c_all), in the development of modern materials information systems. Nico showed an example of Polymer Markup Language (PML) and the ChemAxiom ontology for polymerization and he produced graphical representations that describe a chemical procedure. Synthetic robots produce a log file that can be decoded by the manufacturer, but Nico had to put in some effort to convert the information into an ontology and graph. Unfortunately the robot does not know what chemistry went into it. This has to be caught elsewhere. Nico uses ChemicalTagger.

Janna Hastings of EBI started her talk with her conclusions: classification conveys the type for data; the Semantic Web makes data of all types available, open and interlinked; and classification using OWL ontologies dramatically enhances the potential of the chemical Semantic Web. The subject and object in an RDF triple are types. Molecules are small and three-dimensional. Their structures can vary according to their environment. We say they have the same type when they share important properties. All caffeine molecules have type caffeine. There are many different ways to represent a molecule: by InChI, by a reference number, by a ball and stick model, and so on. None of these is, in itself, a molecule; all these describe and approximate. All data are representations. Science aims to make discoveries of general rules about the things that the data are about. Classification puts the scientific knowledge into the data. RDF is a technology for data representation and OWL is a technology for classification.

Ontologies encode expert domain knowledge in a hierarchically-organized format that a machine can process. One such ontology for the chemical domain is ChEBI.23 ChEBI provides a classification of chemicals based on their structural features and a role- or activity-based classification. An example of a structure-based class is “pentacyclic compound” (compounds containing five-ring structures), while an example of a role-based class is “analgesic”, since many different chemicals can act as analgesics without sharing structural features.

ChEBI has been applied to annotation of chemicals in biological contexts and for diverse tasks of chemical discovery including metabolic network gap prediction, but its growth has been limited to the throughput of manual annotation. A recent publication23 describes the requirements for structure-based, automated classification; the analysis of structure-based features of chemical classes in ChEBI; and mapping to existing OWL-based technology and cheminformatics-based approaches. Another publication24 describes feature and maximum common substructure detection for a group of chemicals, asserts class definitions logically using OWL and SMARTS, and demonstrates automated classification using OWL reasoning.

Exploration and Analysis

In the pre-Google era, Henry’s team wrote an indexing and search engine called ChemDig;25 in the post-Google era, Geoffrey Hutchison at the University of Pittsburgh has built ChemSpotlight (http://chemspotlight.openmolecules.net), using Spotlight (the desktop search feature of Apple’s OS X operating system) plus Open Babel26 and about 300 lines of code. ChemSpotlight is a metadata importer plug-in for Mac OS X, which reads common chemical file formats using the Open Babel chemistry library. Spotlight can then index and search chemical data: molecular weights, formulas, SMILES, InChI, fingerprints, etc. The data are kept as native files with a separate index. The current version (with about 800 more lines of code) allows freely rotatable 3D views of molecules and 2D views of ChemDraw and molfile formats, thanks to the ChemDoodle WebComponents. Geoffrey refers to ChemSpotlight as an “undatabase” because it has no (visible) database or SQL. It stores fingerprints, and number of atoms, bonds, and residues, PDB and SDfile keywords and properties, calculation keywords, and calculation results. Geoffrey presented a new genetic algorithm approach with Spotlight for designing new molecules for organic heterojunction solar cells, by calculating electronic and optical properties, and a synthetic score, for virtual libraries of more than a million compounds. His take-home message was that “undatabases” and ChemSpotlight, integrated into user-friendly tools, work well for big data.

Brian McMahon of the International Union of Crystallography (IUCr) talked about crystallographic publishing in the semantic age. The Semantic Web adds value (and meaning) to data in IUCr journals online through linking, allowing navigation, search, provenance, accreditation and access to related data and literature. Dynamic textual annotation of IUCr article content currently gives links to the Online Dictionary of Crystallography and the IUPAC Gold Book. The layout in HTML tables implies some semantics and can communicate meaning to another application (e.g., Jmol to highlight a selected bond).

The Crystallographic Information File (CIF)27 information interchange standard has informed the structural content of CML. CIF was designed from the outset as an extensible standard, and now covers many areas of crystallography. It forms the basis for integrated data and publishing workflows linking laboratories, data repositories, publishers and databases, and has been an important factor in improving the quality of published crystal structures. The CIF publishing editor publCIF (http://journals.iucr.org/services/cif/publcif/) is a desktop application for formatting and validating CIFs. CIF acts as a vehicle for article submission; checkCIF (http://checkcif.iucr.org) can be used to validate the structural model. An enhanced figures toolkit (http://submission.iucr.org/jtkt) brings an article alive by creating Jmol enhanced figures. The CIFs in the supplementary information (SI) for non-IUCr articles on the Web can be loaded into the IUCr visualization tool. The metadata about instrument, refinement etc. are available, and checkCIF can be run on the SI. Brian concluded his presentation with some charts showing where CIF sits in the data flow in crystallography and the publication flow in IUCr journals.

Kitware has developed a new open-source application, ChemData (part of the Open Chemistry project), to facilitate the exploration and analysis of large chemical datasets. Kyle Lutz described the program, whose features include a variety of 2D plotting techniques, such as traditional scatter plots, parallel coordinates charts, and scatter plot matrices. Similarity relations between molecules can be explored using a range of graph-based visualization methods. Multiple querying and filtering functions allow users to locate molecular data relevant to their work.

ChemData is a native C++ application built with the user interface framework Qt (http://qt-project.org/). It uses the NoSQL database MongoDB (http://www.mongodb.org/) as a semantic data store, focusing on cheminformatics and assessment of chemical properties such as QSAR data. Computational chemistry data are stored directly in the file store, and semantic data are extracted to facilitate search and analysis. ChemData uses the Visualization Toolkit (VTK, http://www.vtk.org/) for 2D and 3D dataset visualization. Molecular structure, geometry, identifiers and descriptors are stored as a single “ChemicalJSON” object. JSON is used as the data interchange format, rather than XML/CML, because it is more compact, it is the native language of MongoDB, and it is easily converted to a binary representation. Initial work is in progress for using Web-based visualization and analysis tools. ParaViewWeb (http://paraviewweb.kitware.com/PW/) accesses the MongoDB database and will provide a collaborative remote Web interface for 3D visualization with ParaView as a server. ParaView (http://paraview.org/) is an open-source, multi-platform data analysis and visualization application.
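To give a feel for storing structure, identifiers and descriptors as one JSON object, here is a small Python sketch for water. The key names (`identifiers`, `atoms`, `descriptors` and so on) are assumptions chosen for readability and should not be read as the actual ChemicalJSON schema used by ChemData.

```python
import json

# Hypothetical layout; key names are NOT taken from the ChemicalJSON format.
water = {
    "name": "water",
    "formula": "H2O",
    "identifiers": {"inchi": "InChI=1S/H2O/h1H2", "smiles": "O"},
    "atoms": {"elements": ["O", "H", "H"]},
    "bonds": {"connections": [[0, 1], [0, 2]], "order": [1, 1]},
    "descriptors": {"molecular_weight": 18.015},
}

# Compact form for storage or transmission, indented form for reading.
compact = json.dumps(water, separators=(",", ":"))
pretty = json.dumps(water, indent=2)

# The document survives a round trip unchanged.
restored = json.loads(compact)
print(restored == water)
print(len(compact), "characters in compact form")
```

Because the whole record is a single nested object, it maps directly onto a document store without a separate relational schema.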

InChI and Databases on the Web

Stephen Heller, the project manager for InChI (http://www.iupac.org/home/publications/e-resources/inchi.html), outlined the significance of this standard. InChI is a non-proprietary, machine-readable string of symbols which enables a computer to represent a compound in a completely unequivocal manner. InChIs are produced by computer from structures drawn on screen with existing structure drawing software, and the original structure can be regenerated from an InChI with appropriate software. InChI is not a registry system. It is not a replacement for any existing internal structure representations; it is in addition to what one uses internally. Its main value to most organizations is in linking information. Like a barcode, it is not designed to be read by humans. The InChIKey has been designed so that Internet search engines can search and find the links to a given InChI. To make the InChIKey the InChI string is subjected to a compression algorithm to create a fixed-length string of upper-case characters. Steve showed examples of Google searches for an InChI and an InChIKey, and of Henry Rzepa’s QR smartphone app for InChI.
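The idea of reducing a variable-length identifier to a fixed-length, upper-case, search-friendly key can be sketched as follows. This is explicitly not the InChIKey algorithm: the official key is produced by the InChI software (from a SHA-256 hash, with its own block layout), whereas the function `toy_key`, its 14-character length and the byte-to-letter mapping below are simplifications assumed for illustration.

```python
import hashlib
import string


def toy_key(inchi, length=14):
    """Map an InChI string to a fixed-length upper-case key (toy scheme only)."""
    digest = hashlib.sha256(inchi.encode("utf-8")).digest()
    # Turn each of the first `length` bytes into one letter A-Z.
    return "".join(string.ascii_uppercase[b % 26] for b in digest[:length])


water = "InChI=1S/H2O/h1H2"
ethanol = "InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3"

print(toy_key(water))    # always 14 upper-case letters
print(toy_key(ethanol))  # same length, however long the InChI is
```

Unlike the InChI itself, a hashed key cannot be converted back into a structure; it is only useful for look-up and linking.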

The InChI Trust (http://www.inchi-trust.org/), a UK charity, was formed to develop and improve on the current InChI standard, further enabling the interlinking of chemistry and chemical structures on the Web. InChI is a truly international project with programming in Moscow, computers in Germany, incorporation in the UK, and a project director in the United States. Collaborators from over a dozen countries, from academia, pharma, publishing, and the chemical information industry, have all offered senior scientific staff to develop the InChI standard. InChI is a success because organizations need a structure representation for their content so that it can be linked to and combined with other content on the Internet. InChI provides an excellent return on investment. It is a public domain algorithm that anyone, anywhere, can freely use.

ChemSpider (http://www.chemspider.com/) would not have been possible without InChI. Valery Tkachenko of the RSC put it into perspective. We live in the world of Web 2.0; a connected world of social networks, mobile communications and Internet TV; a big data world with semantic content and new interfaces. Data is king and NoSQL is the new data model approach. Data flows in and can be structured, searched, linked and navigated. Data and code are distributed and self-sustained in the cloud. Federated systems take precedence over standalone solutions. Sophisticated human computer interfaces and pervasive machine-to-machine interfaces prevail. Yahoo, Google, Facebook and YouTube are huge islands on the Internet map; why are chemical domains so insignificant?

ChemSpider is a database and search engine for small organic molecules, their properties, names and synonyms, and spectra. It is an aggregator of information from online resources as well as a host of data extracted from RSC scientific articles. Over the past five years more than 26 million chemicals, together with a diverse array of associated data, have been deposited. The online database is open to community deposition, annotation, and curation and, as a result, has expanded into a rich resource to contribute to a Semantic Web of chemistry. ChemSpider provides access to its data via Web Services and as RDF. There is an extensive infrastructure: a computer farm and components. Standard interfaces such as Simple Object Access Protocol (SOAP), Representational State Transfer (REST), JSON, RDF and SPARQL are used. Automated validation and standardization procedures are now being developed. ChemSpider provides the chemistry services supporting the Open PHACTS project (http://www.openphacts.org/),14 a semantic project serving the life sciences community to facilitate the linking of chemical and biology data and enable drug discovery.

Chemistry is also available in Wikipedia. Martin Walker of the State University of New York at Potsdam described DBpedia, a project to extract data from Wikipedia, such as the substance information in a ChemBox or DrugBox. Traditionally these boxes were used simply for cutting and pasting, but the Wikipedia team has made a machine-friendly version using formats such as SMILES and InChI. Now ChemBoxes are more like a database, and it is easier to pull data out. The InChIs for complex molecules can be very long, and this was a hindrance to their use in Wikipedia until “show/hide” became available. “Table creep” could be a problem in data pages; the answer is to put data on a supplementary data page. 

Data validation lets the user know if the data are correct. Curation is the ongoing process of fixing errors. In 2008 a validation exercise was initiated and, in collaboration with CAS, 3,500 substances have been validated as having the same name, structure, and CAS Registry Number (CAS RN). Validated entries carry a green check mark. Every old version of an article, with a RevID, is preserved for posterity and can potentially serve as a permanent record of a validated version. To protect validated fields, a bot patrols the pages and logs dubious CAS RN edits, in a system developed by Dick Beetstra of Eindhoven University. Structures present more of a problem since they are loaded from an external file on Wikimedia Commons which can be “invisibly” changed, but, since fall 2010, a modified bot has been looking out for such changes.
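One automated test that a patrol bot could apply to an edited CAS RN is the check digit built into every Registry Number. Whether the Wikipedia bot actually does this is not stated here, so the sketch below is only an example of the kind of check that is possible; the function name is invented. The final digit must equal the weighted sum of the other digits, with weights 1, 2, 3, ... counted from the right, taken modulo 10.

```python
import re


def cas_rn_is_valid(cas_rn):
    """Check the format and check digit of a CAS Registry Number string."""
    m = re.fullmatch(r"(\d{2,7})-(\d{2})-(\d)", cas_rn)
    if m is None:
        return False
    digits = m.group(1) + m.group(2)  # every digit except the check digit
    check = int(m.group(3))
    # Weight the rightmost digit by 1, the next by 2, and so on.
    total = sum(w * int(d) for w, d in enumerate(reversed(digits), start=1))
    return total % 10 == check


print(cas_rn_is_valid("58-08-2"))    # caffeine
print(cas_rn_is_valid("7732-18-5"))  # water
print(cas_rn_is_valid("58-08-3"))    # altered check digit
```

A passing check digit only shows that the number is well formed; confirming that it belongs to the right substance still needs the kind of curated validation described above.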

Another example of data-rich chemistry in a wiki is RSC’s LearnChemistry wiki which aims to enrich RSC educational content with data from ChemSpider, and then make it open for educators to contribute their own content. ChemSpider provides data on structures, physical properties, spectra, etc. Martin and his colleagues wanted to make the data presentation more suitable for students, including high school students, and cut out all the content that beginner students would not use. LearnChemistry includes laboratory experiments, tutorials and guides, substance pages, quizzes, and project and collaboration pages. Users can share their own educational materials such as homework problems and laboratory procedures.

Conclusion

Bobby Glen of the University of Cambridge summed up Peter and Henry’s contributions to the Semantic Web of chemistry. Traditionally, science involved two main pillars: theory to generate hypotheses and experimentation to test them. In modern science, theories are complex, data volumes are large, and experimental teams are often international collaborations. We can add a third pillar, e-Science, to manage these new realities of science.28 For e-Science we need open data and standards; glue ware for computation and analysis; interfaces that encompass the “system”; access control to data and intellectual property; collaboration methods that allow analysis, dialogue and data exchange; data and data analysis tools for “big data”; scalable, physically realistic algorithms; infrastructure (networks, high performance computing, and data storage); and metadata and semantics to put it all in context. Biology, chemistry and patents have “big data,” e.g., 429,512,389,024 nucleotide bases, 60,475,000 chemical substances and 150,000,000 pages of European patents. The connections present big opportunities for innovation, but also great challenges. Navigation through all this information is not easy.

Most real chemicals do not exist as connection tables; the sticky, brown stuff in the reaction vessel is not a SMILES. The next generation of chemical information tools should capture the history of the materials and the manufacturing process which went to make up the substance, as well as measured and predicted properties, and that is just a beginning. Peter and Henry’s work with CML6,29 opens up opportunities to do just this, once we capture the data.30 The first step is the automated lab: data capture using the human senses integrated into robotic data capture. Everything should be stored (minor omissions often mean an unrepeatable experiment) and a knowledge framework is needed (semantics) that gives meaning to the data: any result has to be put in the context of the experiment.

Bobby gave a few examples. The solubility of caffeine varies by orders of magnitude in the literature. Single values tell nothing useful: we need the metadata to tell us what the material was and how the solubility was measured. How flufenamic acid is made determines the aqueous solubility because there are two polymorphs, made under different conditions, with different solubilities. 6,6’-Dinitro-2,2’-diphenic acid exhibits atropisomerism (the conformation is twisted to reveal an enantiomeric structure), so how the material was synthesized needs to be included in the data. Different atropisomers of a compound have different biological activities. Some bicyclo[3.2.0]heptan-6-one derivatives have two forms in a single crystal and in solution because of transannular interactions: how should this “dynamic” molecular structure be represented? Chemistry is not best served by 20th century descriptions of molecules and materials. CML allows the addition of vital metadata within a semantic framework, which adds context, reproducibility and knowledge.
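The point about solubility values needing their context can be illustrated with a small sketch. This is not CML and not anything presented in the talk; the class name, field names, polymorph labels and numeric values below are all placeholders chosen for illustration. The sketch only shows the idea that a measured value should travel with the metadata that says what the material was and how it was measured.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SolubilityRecord:
    """A measured value bundled with its context (hypothetical field set)."""
    substance: str
    solid_form: str       # e.g. which polymorph; labels below are placeholders
    solvent: str
    temperature_c: float
    method: str           # how the value was obtained
    value: float          # placeholder numbers, not literature data
    unit: str

    def context(self):
        """Everything except the number itself: the metadata that gives it meaning."""
        return (self.substance, self.solid_form, self.solvent,
                self.temperature_c, self.method, self.unit)


def comparable(a, b):
    """Two values may be compared directly only when their context matches."""
    return a.context() == b.context()


# Placeholder records: same substance, different (hypothetical) polymorph labels.
form_a = SolubilityRecord("flufenamic acid", "polymorph A", "water", 25.0, "shake-flask", 1.0, "mg/L")
form_b = SolubilityRecord("flufenamic acid", "polymorph B", "water", 25.0, "shake-flask", 2.0, "mg/L")
repeat_a = SolubilityRecord("flufenamic acid", "polymorph A", "water", 25.0, "shake-flask", 1.1, "mg/L")

print(comparable(form_a, form_b))    # False
print(comparable(form_a, repeat_a))  # True
```

A bare number would make the first two records look like conflicting measurements; with the context attached, they are simply measurements of different solid forms.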

References

  1. Baird, M. S.; Al Dulayymi, J. R.; Rzepa, H. S.; Thoss, V. An Unusual Example of Stereoelectronic and Entropic Control in the Ring Opening of 3,3 Disubstituted-1,2-Dichloro-Cyclopropenes. J. Chem. Soc., Chem. Commun. 1992, 1323-1325.
  2. Rzepa, H. S.; Whitaker, B. J.; Winter, M. J. Chemical Applications of the World-Wide-Web. J. Chem. Soc., Chem. Commun. 1994, 1907-1910.
  3. Casher, O.; Chandramohan, G.; Hargreaves, M.; Leach, C.; Murray-Rust, P.; Sayle, R.; Rzepa, H. S.; Whitaker, B. J. Hyperactive Molecules and the World-Wide-Web Information System. J. Chem. Soc., Perkin Trans. 2 1995, 7-11.
  4. Camilleri, P.; Eggleston, D. S.; Rzepa, H. S.; Webb, M. L. Intermolecular interactions responsible for the absence of chiral recognition: aromatic C–H⋯O hydrogen bonding in the crystal structure of 3-chloro-9,13-dibutylamino-1-hydroxypropyl-6-trifluomethylphenanthrene propan-2-ol solvate hydrochloride. J. Chem. Soc., Chem. Commun. 1994, 1135-1137.
  5. Rzepa, H. S. Chemical datuments as scientific enablers. J. Cheminf. 2012, 4, in press. http://www.ch.ic.ac.uk/rzepa/datument/.
  6. Murray-Rust, P.; Rzepa, H. S.; Wright, M. Development of chemical markup language (CML) as a system for handling complex chemical content. New J. Chem. 2001, 25, 618-634.
  7. Marshall, E. L.; Gibson, V. C.; Rzepa, H. S. A Computational Analysis of the Ring-Opening Polymerization of rac-Lactide Initiated by Single-Site β-Diketiminate Metal Complexes:  Defining the Mechanistic Pathway and the Origin of Stereocontrol. J. Am. Chem. Soc. 2005, 127, 6048–6051.
  8. Downing, J.; Murray-Rust, P.; Tonge, A. P.; Morgan, P.; Rzepa, H. S.; Cotterill, F.; Day, N.; Harvey, M. J. SPECTRa: the Deposition and Validation of Primary Chemistry Research Data in Digital Repositories. J. Chem. Inf. Model. 2008, 48, 1571-1581.
  9. Hughes, G.; Mills, H.; de Roure, D.; Frey, J.; Moreau, L.; Schraefel, m. c. [sic]; Smith, G.; Zaluska, E. The semantic smart laboratory: a system for supporting the chemical eScientist. Org. Biomol. Chem. 2004, 2, 1-10.
  10. Wang, X.; Zhang, B.; Wang, D. Z. Reductive and Transition-Metal-Free: Oxidation of Secondary Alcohols by Sodium Hydride. J. Am. Chem. Soc. 2011, 133, 5160–5160.
  11. Breslow, R. Evidence for the Likely Origin of Homochirality in Amino Acids, Sugars, and Nucleosides on Prebiotic Earth. J. Am. Chem. Soc. 2012, 134, 6887-6892.
  12. Willighagen, E. L.; Wehrens, R.; Buydens, L. M. C. Molecular chemometrics. Crit. Rev. Anal. Chem. 2006, 36(3-4), 189-198.
  13. Willighagen, E. L.; Jeliazkova, N.; Hardy, B.; Grafstrom, R. C.; Spjuth, O. Computational toxicology using the OpenTox application programming interface and Bioclipse. BMC Research Notes 2011, 4, 487.
  14. Williams, A. J.; Harland, L.; Groth, P.; Pettifer, S.; Chichester, C.; Willighagen, E. L.; Evelo, C. T.; Blomberg, N.; Ecker, G.; Goble, C.; et al. Open PHACTS: Semantic interoperability for drug discovery. Drug Discovery Today 2012. Available online June 6, 2012.
  15. Guha, R.; Schürer, S. C. Utilizing high throughput screening data for predictive toxicology models: protocols and application to MLSCN assays. J. Comput.-Aided Mol. Des. 2008, 22, 367–384.
  16. Hanwell, M. D.; Curtis, D. E.; Lonie, D. C.; Vandermeersch, T.; Zurek, E.; Hutchison, G. R. Avogadro: an advanced semantic chemical editor, visualization, and analysis platform. J. Cheminf. 2012, 4, 17.
  17. Christoffersen, R. E.; Angeli, R. P. Quantum Pharmacology. In New World Quantum Chem., Proc. 2nd Int. Congr.; Pullman, B., Parr, R., Eds.; Reidel: Dordrecht, The Netherlands, 1976; pp 189-210.
  18. Bersohn, M. Syntheses of drugs proposed by a computer program. In Computer-Assisted Drug Design; ACS Symposium Series 112; American Chemical Society: Washington, DC, 1979; pp 341-352.
  19. Lowe, D. M.; Corbett, P. T.; Murray-Rust, P.; Glen, R. C. Chemical name to structure: OPSIN, an open source solution. J. Chem. Inf. Model. 2011, 51, 739-753.
  20. Brooks, B. J.; Thorn, A. L.; Smith, M.; Matthews, P.; Chen, S.; O’Steen, B.; Adams, S. E.; Townsend, J. A.; Murray-Rust, P. Ami - the chemist’s amanuensis. J. Cheminf. 2011, 3, 45.
  21. Bowers, G. M.; Ravella, R.; Komarneni, S.; Mueller, K. T. NMR Study of Strontium Binding by a Micaceous Mineral. J. Phys. Chem. B 2006, 110, 7159-7164.
  22. Jessop, D. M.; Adams, S.; Willighagen, E. L.; Hawizy, L.; Murray-Rust, P. OSCAR4: a flexible architecture for chemical text-mining. J. Cheminf. 2011, 3, 41.
  23. Hastings, J.; Magka, D.; Batchelor, C.; Duan, L.; Stevens, R.; Ennis, M.; Steinbeck, C. Structure-based classification and ontology in chemistry. J. Cheminf. 2012, 4, 8.
  24. Chepelev, L. L.; Hastings, J.; Ennis, M.; Steinbeck, C.; Dumontier, M. Self-organizing ontology of biochemically relevant small molecules. BMC Bioinformatics 2012, 13, 3.
  25. Gkoutos, G. V.; Leach, C.; Rzepa, H. S. ChemDig: new approaches to chemically significant indexing and searching of distributed web collections. New J. Chem. 2002, 26, 656-666.
  26. O'Boyle, N. M.; Banck, M.; James, C. A.; Morley, C.; Vandermeersch, T.; Hutchison, G. R. Open Babel: an open chemical toolbox. J. Cheminf. 2011, 3, 33.
  27. Hall, S. R.; Allen, F. H.; Brown, I. D. The Crystallographic Information File (CIF): a New Standard Archive File for Crystallography. Acta Crystallogr. 1991, A47, 655-685.
  28. The Fourth Paradigm. Data-Intensive Scientific Discovery. Hey, T., Tansley, S., Tolle, K., Eds.; Microsoft Research: Redmond, WA; 2009.
  29. Murray-Rust, P.; Rzepa, H. S. CML: Evolution and Design. J. Cheminf. 2011, 3, 44.
  30. Glen, R. C. Computational chemistry and cheminformatics: an essay on the future. J. Comput.-Aided Mol. Des. 2012, 26, 47-49.

 

Wendy Warr, Symposium Reporter

 

Global Opportunities in Chemical Information

Rachelle Bienstock kicked off the session by asking whether emerging markets will really save pharma. She cited statistics that emerging markets, currently $154B or 18% of worldwide revenue, are forecast to rise to $487B or a 37% share by 2020. JACS spotlights are now being translated into five languages, with Chinese at the top of the list.

Roger Sayle (NextMove Software) described his work building automatic translation of Chinese chemical names. Non-English chemistry is showing up frequently. Even some large pharmaceutical companies with ELNs that are supposed to be written in English are finding non-English pages in their archives. A Google search for “benzoic acid” hits only a few more pages than a search for the equivalent Chinese name. Patent applications now often appear first in non-English-speaking countries for business or processing reasons.

Automated translation of Chinese chemical names is possible because IUPAC’s strong morphological structuring is preserved across languages. Software can identify the subparts of a name, translate them, and then put them back together based on the IUPAC structuring.
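The idea of splitting a name into subparts and translating each one can be sketched in a few lines. This is a toy illustration only, not NextMove’s method: the lexicon, the function names and the greedy longest-match segmentation are all assumptions made for the example, and real names need reordering and grammar rules that are ignored here.

```python
# Toy lexicon (an assumption for illustration): Chinese morpheme -> English fragment.
LEXICON = {
    "甲": "meth", "乙": "eth", "丙": "prop", "丁": "but",
    "烷": "ane", "烯": "ene", "醇": "anol", "酮": "anone",
    "氯": "chloro", "溴": "bromo",
    "苯": "benzene",
    "苯甲酸": "benzoic acid",  # multi-character entry, matched as a whole
}


def split_morphemes(name):
    """Greedy longest-match segmentation; unknown characters pass through unchanged."""
    max_len = max(len(key) for key in LEXICON)
    parts = []
    i = 0
    while i < len(name):
        for size in range(min(max_len, len(name) - i), 0, -1):
            chunk = name[i:i + size]
            if chunk in LEXICON:
                parts.append(chunk)
                i += size
                break
        else:
            # No lexicon entry starts here (e.g. a locant digit or a hyphen).
            parts.append(name[i])
            i += 1
    return parts


def translate_name(name):
    """Translate each subpart and reassemble in the original order."""
    return "".join(LEXICON.get(part, part) for part in split_morphemes(name))


print(translate_name("乙醇"))      # ethanol
print(translate_name("2-氯丙烷"))  # 2-chloropropane
print(translate_name("苯甲酸"))    # benzoic acid
```

Locants and hyphens pass through untouched, which is why the structure of “2-氯丙烷” survives as “2-chloropropane”. The longest-match rule is what keeps “苯甲酸” from being split into its individual characters.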

In text mining the challenge is to find the beginning and the end of a chemical name. In the latest version of LeadMine using NextMove’s software, 42% of simple patents written in Chinese were recognized and converted (vs. a benchmark of 86% for recognizing the original English). Image documents, however, are still not scanned successfully in most cases.

Tom Blackadar (Binocular Vision) shared his experience of living in Asia for the past 5 years. Tom began studying Chinese intensively about 2 years ago when he moved to Shanghai. He told the personal and engaging story of the challenges and rewards of starting a small consulting company (operating as a U.S. company) to bring expert informatics practices to the developing Chinese market and link pharmaceutical companies with partners. Tom is focusing largely on western companies and Contract Research Organizations in China. He discussed many of the legal hurdles he needed to overcome. People were very impressed by his slide collection of necessary red stamped legal documents! Tom also emphasized the need for data management and the gaps in IP, and therefore the valuable niche that his company can fill in China in the future.

Brian Hitson (U.S. Department of Energy, Office of Scientific and Technical Information) talked about the efforts of worldwidescience.org to build a multilingual search system for chemistry and other sciences. OSTI provides public access to the Department of Energy’s unclassified information, as well as restricted access to classified and sensitive information for appropriate people. OSTI has been a pioneer in creating “aggregators” for federated search of multiple sites. Science.gov, launched in the early 2000s, integrates information from twelve federal agencies. Worldwidescience.org takes this to the international level, searching databases in many different countries. Started in 2007 as a partnership between the U.S. DOE and the British Library, it moved in 2008 to multilateral governance. The system’s goal is to run true searches of the “deep web,” the content that other search engines do not index and where most of the science is actually found.

Recent developments include multilingual translation, the first to work both one-to-many and many-to-one. One search query fires off ten different searches based on Microsoft Translator machine translations. “Science Cinema” uses Microsoft Research Audio Visual Indexing System (MAVIS) to recognize and index audio content. Once a hit is found, the user can go directly to the place in the video where the interesting part occurred. The next step is to attack “big data.” They will search the metadata and then connect the user to the landing page to explore the data in its own format.
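The one-to-many fan-out described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the WorldWideScience.org implementation: the phrase table stands in for a machine-translation service, the per-language collections stand in for national databases, and every name in the code is hypothetical. The many-to-one step (translating hits back into the user’s language) is left out.

```python
# Hypothetical phrase table standing in for a machine-translation service.
PHRASES = {
    "benzoic acid": {"en": "benzoic acid", "zh": "苯甲酸", "de": "Benzoesäure"},
}

# Hypothetical per-language collections standing in for national databases.
COLLECTIONS = {
    "en": ["Solubility of benzoic acid", "Caffeine polymorphs"],
    "zh": ["苯甲酸的合成"],
    "de": ["Benzoesäure in Lebensmitteln"],
}


def translate(query, lang):
    """Look the query up in the toy table; fall back to the original text."""
    return PHRASES.get(query, {}).get(lang, query)


def search(collection, term):
    """Case-insensitive substring match against one collection."""
    return [title for title in collection if term.lower() in title.lower()]


def multilingual_search(query):
    """Fan one query out to every language and merge the language-tagged hits."""
    hits = []
    for lang, collection in COLLECTIONS.items():
        term = translate(query, lang)
        hits.extend((lang, title) for title in search(collection, term))
    return hits


for lang, title in multilingual_search("benzoic acid"):
    print(lang, title)
```

A production system would issue the per-language searches concurrently and rank the merged results; the sequential loop here only shows the shape of the fan-out.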

Jignesh Bhate (Molecular Connections) talked about business opportunities and challenges in India. Molecular Connections is India’s largest informatics company with over 900 employees located in Bangalore and Chennai. It focuses on indexing, abstracting, and text mining.

India is a big consumer of content, with a 17% annual growth rate. The country has a huge business of service providers and multinational company sites. Indian private industry R&D spending is still only 25%, but growing rapidly. Indian research output is significant and growing, whereas US output is shrinking. Medicine and pharma are contributing over 25% of the total research output. India dominates offshoring of content production, with over 84% of the world’s total. This business generates $800M per year and is growing at 20%. The predictions are that value-add will be added to cost, with quality and TOT (terms of trade) as key metrics.

Jignesh pointed out several business challenges: India has many different cultures and languages; bureaucracy and corruption are significant obstacles; Indians are very sensitive to hierarchy; and they focus on relationships and face-to-face contact, so phone calls are used more than email. Despite these challenges, the macro story is so compelling that you cannot go wrong. It sometimes feels like a “drunken man's stupor,” but you can get to the goal.

Andy McFarlane (Thomson Reuters) noted that in 2011 China became the #1 country in patents, with over half a million and 23% year-on-year growth. Commercial providers historically added value on top of information coming directly from the patent office. Now the information often comes through a translator or intermediary. There are challenges in scrubbing data, such as rationalizing different translations and spellings of names. India has four patent offices that issue overlapping patent numbers. Derwent World Patents Index has comprehensive English-language coverage, including Asia, with normalized company names. Thomson Reuters tries to focus on consistency of terms. Technology focus shows that India is particularly oriented towards chemistry patents.

Tom Blackadar, Jignesh Bhate and Rachelle Bienstock, Symposium Organizers

In 2009, CAS, the world’s authority for chemical information, reported that China was, for the first time, leading all nations in publication of chemical patent applications (http://www.cas.org/news/media-releases/china-leads-patents). Three years later, chemical patent information from Asia continues to be a significant source of disclosed chemistry, with patent applications from China’s State Intellectual Property Office (SIPO) still increasing. This is important to the chemical information and research communities, as CAS is reporting that in 2012, more than 70% of new substances in the literature are from patents.

Legal, Patent, and Digital Rights Management in Publishing

The symposium took place on Thursday afternoon and featured five presentations. David Gange (Altimedia) gave a talk entitled “How to find references that inherently anticipate pharmaceutical patents.” References that “inherently anticipate” a patent can invalidate a patent on the basis of novelty, so they are of great interest to pharmaceutical companies, both offensively and defensively. The speaker gave examples from real world patent law cases of how broad compound claims can be affected by questions of metabolites, crystal polymorphs, hydrates, optical isomers, and other intermediates.

In her talk “Digital rights drain? Implications for library services,” Leah McEwen (Cornell University) enumerated some of the many digital rights issues that impinge on libraries, including problems of identifying the owners of copyright and authentication of users. One case she highlighted was theses: at many institutions, thesis authors traditionally signed over reproduction rights to UMI (now ProQuest). But electronic distribution is much more “public” than microform distribution, with an impact on republication as journal articles or books. The rise of entrepreneurship among faculty and students has also complicated intellectual property questions.

Donna Wrublewski (University of Florida) in her talk “Digital rights management and e-books: Perspectives from a research library” presented on other key areas where digital rights management affects library services, including conservation and preservation of materials (can you legally copy a digital object for preservation and can you copy it into a new format?), interlibrary loan and consortial lending, and discovery services (can a library create a full-text index to a copyrighted work?). With formats evolving (and in some cases, becoming obsolete), libraries are forced into a “tech support” role. In many cases, the applicable law is too new for its interpretations as they affect libraries to be clear.

Judith Currano (University of Pennsylvania) pointed to the “Problems of preserving digital content.” Judith summarized the legal status of digital content preservation as “This is a gray area.” Digital rights management is an addition to intellectual property law that seeks to prevent piracy and illegal copying…but it doesn’t work. The combinations of hardware and software that have been tried inhibit legitimate users without stopping piracy. In the meantime, libraries are faced with changing formats, with old formats and the hardware that reads them becoming obsolete, and with license agreements that are ambiguous or over-restrictive as to what can be done.

David Parker (Momentum Press) concluded with “Finding an alternative to restrictive digital rights management: The Momentum Press approach.” David reviewed some of the business models which have evolved in the transition from print publishing to electronic publishing, including various forms of open access publishing. The Momentum Press model makes e-books available for a one-time fee, with perpetual access. No third party aggregators are involved and there is no subscription or maintenance fee.

Charles Huber, Symposium Co-Organizer 

Before and After Lab: Instructing Students in ‘Non-Chemical’ Research Skills

A Symposium at the 2012 Biennial Conference on Chemical Education

Organized by: Judith Currano (University of Pennsylvania), Andrea Twiss-Brooks (University of Chicago), Grace Baysinger (Stanford University)

When students think about chemistry, one of the first things that comes to mind is the laboratory, but while the experiments done in the lab are the “meat” of the chemical research process, they are sandwiched between literature searching, reading, and acquisition of funding on one side, and publication and presentation on the other. These highly nuanced topics are frequently overlooked by educators and students alike, although they are crucial for success in both academic and industrial research. The “Before and After Lab” session at the Biennial Conference on Chemical Education, held on the campus of The Pennsylvania State University, July 29-August 2, 2012, brought together chemistry instructors to share their best practices in teaching students what to do before they enter and after they leave the laboratory. Presenters discussed how to teach students to search and read the scientific literature, to consider ethical issues inherent in scientific research, to understand the peer review process, to effectively manage literature references, and to present their work. The program with abstracts was posted: AM session, PM session. The following talks were presented during the full day symposium [links to slides are included below where available]:

Andrea Twiss-Brooks, Symposium Co-Organizer

Multidisciplinary Program Planning Group


The Executive and Full Committee of MPPG, a subcommittee of the ACS Committee on Divisional Activities, met in Philadelphia during the recent National ACS Meeting to discuss the thematic programming activities of present and future ACS Meetings. For Philadelphia, the thematic organizer Xinqiao Jia, University of Delaware, did an extraordinary job in putting together an excellent program and the ACS staff again created and widely published an eye-catching logo symbolizing the meeting’s theme “Materials for Health and Medicine.” Divisional symposia related to the theme were advertised on flyers and in the meeting program, as were the “Kavli Foundation Innovation in Chemistry Lecture” and the Plenary Session. The Kavli lecture on “Chemistry in medicine: From the discovery of angiogenesis to the development of controlled drug delivery systems and the foundation of tissue engineering” was presented to a full house by Robert Langer of MIT. The plenary session, again to a standing room only audience, consisted of four presentations by eminent scientists addressing the theme of the Meeting: Jacqueline K. Barton, California Institute of Technology; Chad A. Mirkin, Northwestern University; Buddy D. Ratner, University of Washington; and John T. Santini, Jr., On Demand Therapeutics, Inc.

The future of thematic programming at ACS meetings looks bright. More and more technical divisions organize symposia related to the theme of a meeting, often co-sponsored by other divisions indicating the interdisciplinary nature of chemistry. We definitely have seen a strong upwards trend in the last few meetings. As per charter, themes for the next three years have been approved and organizers are in place for 2013 and 2014. The CINF Program Committee should look closely at the themes and available synopses to work together with the thematic program chairs to organize companion symposia. Any symposium within a given theme will provide valuable publicity to the division.

Here are the themes for future meetings:

Meeting | Dates | Theme | Program Chair
Spring 2013, New Orleans | April 7-11 | Chemistry of Energy and Food | James Seiber, UC Davis
Fall 2013, Indianapolis | September 8-12 | Chemistry in Motion | Robert Weiss, University of Akron
Spring 2014, Dallas | March 16-20 | Chemistry of Energy/Advanced Materials for New Opportunities | Michelle Buchanan, Oak Ridge National Lab; Nitash Balsara, UC Berkeley
Fall 2014, San Francisco | August 10-14 | Chemistry and Stewardship of the World | Robin Rogers, University of Alabama
Spring 2015, Denver | March 22-26 | Chemical Resources: Extraction, Refining and Conservation | TBD
Fall 2015, Boston | August 16-20 | History of Innovations: From Discovery to Application | TBD
Spring 2016, San Diego | March 13-17 | Computers in Chemistry (tentative) |
Fall 2016, Philadelphia | August 21-25 | Chemistry and Education (tentative) |

Important News: Recently, the Kavli Foundation signed an agreement with ACS to sponsor a second lecture series, “The Kavli Foundation Emerging Leader in Chemistry Lecture,” at future ACS National Meetings for the period 2013–2015. Divisions will be asked to nominate up to two candidates. The Kavli Emerging Leader Lecturer must be a distinguished younger scientist who is highly regarded by his or her peers for significant contributions to an area of chemistry or a related multidisciplinary area. At the time of nomination, nominees must be 40 years of age or younger and fewer than 10 years past completion of their PhD. The nominees do not have to be members of the nominating division. Division secretaries will send out the “Call for Nomination” and submit the nominations to MPPG, which will manage the selection process. A template for nominations will be sent out by the Chair of MPPG.

Feel free to contact me at ggrethe@att.net if you have any questions regarding MPPG.

Guenter Grethe, Member, Executive Committee Multidisciplinary Program Planning Group