Substance Identifiers, Addressing the Challenges Presented by Chemically Modified Biologics: The Role of InChI & Related Technologies

This symposium was organized by Keith Taylor and Steve Heller and held on Sunday, August 16, 2015, at the ACS National Meeting in Boston, MA. Three papers were presented:

CINF 1. Generating canonical identifiers for glycoproteins and other chemically modified biopolymers by R. Sayle, J. May, N. O’Boyle, and E. Bolton

CINF 2. Towards addressing informatics challenges presented by antibody drug conjugates by S. C. Sukuru, T. Zhang, L. Tumey, E. Muszynska, M. Tran, and F. Loganzo

CINF 3. Representation of chemically modified proteins in the Substance Index SPL files by Y. Borodina and G. Schadow.

These three papers dealt with three distinct phases in the development of chemically modified biologics (CMBs).

Chemically modified biologics are very important therapeutic agents; five of the top 10 drugs by sales value are chemically modified biologics. They bring unique opportunities and unique risks. It is important, as it is with all drugs, that they are identified uniquely and that associated data are made reliably accessible.

The first paper was presented by Roger Sayle (NextMove Software) and co-authored by John May, Noel O’Boyle, and Evan Bolton. In this paper Roger explored the options currently available for identifying chemically modified biologics and discussed two complementary approaches to biologics registration: one based on expressive all-atom representations, the other on tracking deltas to a reference database of protein sequences. The main characteristic of chemically modified biologics compared with conventional drug entities is their size; Humira, the top-selling CMB drug, consists of 1,330 amino acids and has a molecular weight of approximately 148 kDa. This alone challenges many conventional cheminformatics tools; InChI in its standard form is limited to 999 atoms and clearly cannot handle a typical CMB.

Chemical modifications can take many forms. They may be made deliberately by chemists in a laboratory, or occur passively during storage, where methylation reactions take place. Roger gave a striking example of why glycosylation in a biologic must be tracked. Clinical trials for Erbitux (cetuximab) were conducted in California. Adverse reactions were observed in 1% of patients. Fortunately (in one respect) this percentage was high enough to require that the first dose be given in a controlled, clinical environment. Much higher numbers of reactions, and more serious ones, including anaphylactic shock, were recorded in a number of U.S. states, particularly in the South West. Many of these patients had been exposed to tick bites and had developed an immune response that reacted to the glycosylation pattern on Erbitux, with serious consequences. This demonstrates the importance of tracking all aspects of a CMB’s make-up.

Roger then discussed the existing technologies for identifying CMBs. Many, such as Hierarchical Editing Language for Macromolecules (HELM) and PDB, are based on standardized monomer dictionaries that then prove difficult to maintain. At first sight, the limited number of building blocks available in a natural biologic makes the task seem simple, but once chemical modification is included, and especially when skilled chemists get involved, the number of building blocks and sidechain modifications becomes limitless. Roger next raised the subject of canonicalization, the process by which many varied inputs are converted to the same unique representation. Canonicalization is important in the naming of entities. A simple example is ethanol, which can be written as CH3CH2OH or HOCH2CH3. A chemist would normally use the first format, but when 1,330 residues (and approximately 12,500 heavy atoms) are involved, there is scope for creativity.
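The ethanol example can be made concrete with a deliberately tiny sketch. It is not the InChI or OEChem algorithm: it assumes an unbranched chain written as single-letter heavy atoms (so CH3CH2OH becomes "CCO"), and the function name canonical_chain is invented for illustration. Real canonicalizers must handle arbitrary molecular graphs, which is where the run-time cost arises.

```python
def canonical_chain(chain: str) -> str:
    """Return one fixed spelling for an unbranched chain of single-letter atoms.

    Toy assumption: the molecule is a simple linear chain, so the only
    ambiguity is which end the writer started from.
    """
    if not (chain.isalpha() and chain.isupper()):
        raise ValueError("expected single-letter, upper-case atom symbols only")
    # Both reading directions describe the same chain; always keep the
    # lexicographically smaller one so every input maps to the same output.
    return min(chain, chain[::-1])


# Ethanol written from either end gives the same canonical string.
print(canonical_chain("CCO"))
print(canonical_chain("OCC"))
```

Whichever way the chain is typed, the same string comes out, which is the property a registration system relies on when it compares identifiers.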

In Roger’s opinion, a chemical identifier should be independent of the input representation or file format, and there should be equivalence between small molecules, peptides, and proteins, all of which are best identified by a single identifier, preferably the existing standard InChI. Currently, InChI has an atom limit, but Roger was able to use a modified version of the algorithm without the atom limit.

He provided data for the protein UTP10_KLULA, which has a sequence of 1,774 amino acids and 28,509 atoms. An InChI and an InChIKey could be generated in 73.2 seconds. Roger then converted the sequence to a SMILES string and, using OEChem’s SMILES canonicalization, generated the canonical SMILES in 0.4 seconds. This demonstrates that there is much scope to improve the performance of the InChI algorithm.
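As a rough back-of-the-envelope check on these figures, the short calculation below uses only the numbers reported in the talk. The variable names are illustrative, and the two timings come from different software producing different identifiers, so the ratio is indicative only.

```python
# Figures reported for UTP10_KLULA in the talk.
n_atoms = 28_509
inchi_seconds = 73.2    # InChI and InChIKey generation
smiles_seconds = 0.4    # canonical SMILES generation

speedup = inchi_seconds / smiles_seconds    # how many times faster
inchi_rate = n_atoms / inchi_seconds        # atoms processed per second
smiles_rate = n_atoms / smiles_seconds      # atoms processed per second

print(f"speed ratio: about {speedup:.0f}x")
print(f"InChI throughput: about {inchi_rate:.0f} atoms/s")
print(f"SMILES throughput: about {smiles_rate:.0f} atoms/s")
```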

Roger then moved on to discuss a new database structure based on a Directed Acyclic Graph (DAG) for the characterization and searching of biologics. This approach enabled the building of a DAG for all 540,546 protein sequences in uniprot_sprot, which contains over 192 million amino acids. This data structure allows close analogues to be identified much faster than with NCBI blastp. For example, all 540,546 sequences can be queried against this database (i.e., all-against-all) in about 9 minutes 30 seconds on a single core of a laptop, and the sequence from PDB 1CRN (crambin, 46 amino acids) is canonically named as [L25I]P01542 in 0.002 seconds.
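The delta-style name [L25I]P01542 reads as "reference entry P01542 with leucine at position 25 replaced by isoleucine". The sketch below shows only how such a name could be assembled once the closest reference is known; it does not implement the DAG or the search. The function name, the comma separator for multiple substitutions, and the toy reference "REF1" with its sequence are all assumptions made for illustration.

```python
def name_by_delta(ref_id: str, ref_seq: str, query_seq: str) -> str:
    """Name a query sequence as point substitutions relative to a reference.

    Sketch only: insertions and deletions are not handled, so both
    sequences must have the same length.
    """
    if len(ref_seq) != len(query_seq):
        raise ValueError("this sketch handles equal-length sequences only")
    # Positions are 1-based, as in the [L25I] notation.
    deltas = [
        f"{ref_aa}{pos}{query_aa}"
        for pos, (ref_aa, query_aa) in enumerate(zip(ref_seq, query_seq), start=1)
        if ref_aa != query_aa
    ]
    if not deltas:
        return ref_id  # identical to the reference
    return "[" + ",".join(deltas) + "]" + ref_id


# Hypothetical 7-residue reference with one L -> I substitution at position 4.
print(name_by_delta("REF1", "MKTLLVA", "MKTILVA"))
```

A name of this kind stays short however long the protein is, which is the appeal of tracking deltas rather than storing every atom.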

Roger concluded with the statement that “InChI for large molecules” can be achieved, and remain compatible with small molecule InChI identifiers, through the evolution of ever better canonicalization algorithms. In addition, he directed a jab at journal reviewers who claim that the run-time of canonicalization algorithms is a non-issue, and not an area ripe for improvement; these reviewers are very mistaken.

The second paper was presented by Chetan Sukuru (Pfizer), and co-authored by Tianhong Zhang, Lawrence Tumey, Elwira Muszynska, Megan Tran, and Frank Loganzo. The team has developed a novel in silico tool called Antibody Conjugate Tracker (ACT). ACT is designed to characterize efficiently each Antibody Drug Conjugate (ADC) and its molecular components, namely the antibody, linker-payload, and payload. ACT provides a unique in silico environment with structured metadata that enables comprehensive data analytics on ADCs. Based on their experiences with ACT, the authors proposed novel descriptors to parse and analyze ADC data that could improve our understanding and accelerate the discovery of potential therapeutic ADCs. ADCs at Pfizer are given corporate IDs based on the parent antibody, the linker technology, and the drug payload.

Using the ACT data, it is trivial to visualize the distribution of ADCs quickly for select conjugation chemistries and purposes.


Structured metadata in ACT enables in vivo data comparison of a given linker-payload across different antibodies (program/antigen specific).
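A minimal sketch of how a component-based record supports this kind of comparison is given below. It is not the ACT schema or Pfizer's ID format, neither of which was described in detail; the class, the field names, the ID layout, and the example values are all hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class ADC:
    """Hypothetical ADC record keyed by its three components."""
    antibody: str
    linker: str
    payload: str

    @property
    def composite_id(self) -> str:
        # Illustrative ID layout only; the real corporate format is not public here.
        return f"{self.antibody}-{self.linker}-{self.payload}"


def antibodies_by_linker_payload(adcs):
    """Group antibodies that carry the same linker-payload combination."""
    groups = defaultdict(list)
    for adc in adcs:
        groups[(adc.linker, adc.payload)].append(adc.antibody)
    return dict(groups)


adcs = [
    ADC("mAb-A", "LNK1", "PAY1"),
    ADC("mAb-B", "LNK1", "PAY1"),
    ADC("mAb-A", "LNK2", "PAY1"),
]
print(adcs[0].composite_id)
print(antibodies_by_linker_payload(adcs))
```

Because each component is a separate field, the same linker-payload can be pulled out across antibodies without parsing free-text names.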

In the concluding remarks, Chetan stated that despite the rising interest in ADCs, there is still a gap in the informatics infrastructure/technology to support their discovery and development. With the ACT, Chetan and his colleagues have attempted to bridge the gap and address some of the informatics challenges presented by ADCs. The deliverable is that the structured metadata incorporated in the Antibody Conjugate Tracker not only keep track of all the ADCs, but also enhance in silico data analytics and visualization. Finally, a similar approach with standardized descriptors could help develop substance identifiers for other chemically modified biologics, too.

The final paper was presented by Yulia Borodina (FDA) and co-authored by Gunther Schadow. Yulia described the ongoing work and challenges involved in incorporating CMBs into the FDA’s Substance Index Structured Product Labeling (SPL) files. Currently the FDA’s Substance Registration System (SRS) contains information on 98,000 substances. The following entities are represented: small molecules, polymers, biopolymers, plant parts, tissue parts, vaccines, etc. The information (chemical structures, names, protein and nucleic acid sequences, taxonomic information) is highly curated. Each substance has a Unique Ingredient Identifier (UNII).

The SRS contains over 1,500 proteins (2% of all registered substances). Some are considered to be confidential, but over 1,100 are in the public domain and are targeted for public release.

Yulia provided an informative bar chart that showed the increasing role of proteins in marketed drugs.

The challenges that need to be addressed are: reliable electronic exchange of protein information, and the unique identification of protein substances. The following information has to be captured for each CMB:

  • Amino acid sequences of chains
  • Covalent connections between/within chains
  • Modifications of natural amino acids
  • If amino acid is modified by a synthetic polymer, structure and characteristics of that polymer
  • Sites and type of glycosylation
  • Structure of glycan
  • The frequency of modification (which may be an average)
  • Co-factor enzyme interactions.

Dinutuximab and dinutuximab beta differ only by their glycan composition, as do erythropoietin (EPO) alpha, beta, delta and omega.
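The consequence for identification can be sketched with a simplified record. This is not the SPL model: the class and field names are invented, the sequences and glycan labels are placeholders, and several items from the list above (polymer modifications, modification frequency, co-factors) are omitted for brevity.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Glycosylation:
    chain: int       # index of the chain carrying the glycan
    position: int    # 1-based residue position
    kind: str        # glycosylation type, e.g. "N" or "O"
    glycan: str      # placeholder label for the glycan structure


@dataclass(frozen=True)
class ProteinSubstance:
    chains: tuple                           # amino acid sequence of each chain
    links: frozenset = frozenset()          # covalent connections between/within chains
    glycosylations: frozenset = frozenset()

    def same_protein_part(self, other: "ProteinSubstance") -> bool:
        """True when sequences and covalent links match, ignoring glycans."""
        return self.chains == other.chains and self.links == other.links


# Two hypothetical substances: same sequence and disulfide link, different glycan.
links = frozenset({((0, 3), (0, 9))})
form_a = ProteinSubstance(("MKTCLLVACN",), links,
                          frozenset({Glycosylation(0, 10, "N", "glycan-X")}))
form_b = ProteinSubstance(("MKTCLLVACN",), links,
                          frozenset({Glycosylation(0, 10, "N", "glycan-Y")}))

print(form_a.same_protein_part(form_b))  # protein parts agree
print(form_a == form_b)                  # full records differ
```

An identifier derived from the sequence alone would merge the two forms; including the glycosylation fields keeps them distinct.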

The information will be published in the Substance Index SPL files as an XML document that is Health Level 7 compliant. SPL has been adopted by the FDA as a mechanism for exchanging product information electronically, and it has also been adopted by ISO 11238 as an exchange standard for medicinal substances. The information is available through the FDA Online Label Repository and DailyMed. It is up to date (new information or changes are added daily), and it is free.

Yulia then described the XML markup that is being used for CMBs. Currently, about 800 files have been generated for proteins that do not have polymeric modifications. After review, these files will be posted on the SPL website.

An SPL Implementation Guide with Validation Procedures is being updated and will be made available. Further work is underway to update the SPL model to handle synthetic polymers and protein-polymeric conjugates.


The three papers covered many of the challenges of handling CMBs, from the fundamental cheminformatics, through managing them in the laboratory, to finally registering and publishing the information with the FDA. The session was well attended, especially considering that it started on Sunday at 8:30 am. Attendance varied during the session, but averaged 20-25.

Keith Taylor, Symposium Organizer