Two decades of open chemical data at the Developmental Therapeutics Program (DTP) at the National Cancer Institute (NCI)

Daniel Zaharevitz

The talk by Daniel Zaharevitz of NCI also covered freely available chemical and biological data. A history of DTP/NCI was posted on the Web on the 50th anniversary of the Cancer Chemotherapy National Service Center (CCNSC), which was set up in 1955. Until 1990, transplantable mouse tumors were used and gram quantities of test substances were needed. After that, human tumor cell lines in culture (the “NCI-60” cell lines) were used and only milligram quantities of test substances were needed.

The philosophy behind the National Chemotherapy Program [7] envisaged hundreds of independent investigators who were not required to collaborate. Indeed, over the last ten years, 42,301 compounds have been submitted from 1,477 different groups. Consequently, data and decision making have been compartmentalized, and data systems development has reflected this compartmentalization. There was little pressure to standardize.

From the 1970s until 2000 the Drug Information System was part of the CIS Structure and Nomenclature Search System (SANSS). Since 2000 there has been a Web interface for compound submission, which accepts structures only in molfile format. Before 1994 there was no policy for making chemical structures publicly accessible. Data release was avoided if possible because of the costs and difficulties involved, and because there was no perceived advantage. In 1994, the 127,000 structures that had a CAS REGISTRY Number were made available via FTP, after SANSS connection tables had been converted to molfiles and CORINA had been used to generate 3D coordinates. Since 2000, molfiles have been extracted from a newer internal system, and structures are released about once a year on a Web page. In June 2016 there were 284,176 open NCI structures, but many versions of “NCI structures” are in circulation, including multiple depositions in PubChem.

DTP compound submissions are now performed online. The submitter must register as a user and the submission must include structures, which are subjected to consistency checks (with the Chemistry Development Kit, CDK), and stereochemistry consistency checks (with InChI). A material transfer and screening agreement is signed electronically, and, nowadays, the confidentiality period is limited to three years. Submitters are given access to screening results and to COMPARE analysis. Researchers can request samples or plated sets from a collection of about 100,000 compounds, if they submit a material transfer agreement electronically, and pay for shipping.
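As an illustration of what a structure consistency check on a submitted molfile can involve, the following is a minimal sketch in plain Python. It is not the DTP pipeline, which the talk says relies on CDK and InChI; it only assumes the V2000 molfile layout (three header lines, a counts line, an atom block, a bond block, and an `M  END` terminator). The function name `check_molfile` and the sample record `SAMPLE` are hypothetical.

```python
def check_molfile(text):
    """Return a list of problems found in a V2000 molfile (empty if none).

    Illustrative only: checks that declared counts match the blocks and that
    every bond refers to existing, distinct atoms. No chemistry is validated.
    """
    lines = text.splitlines()
    if len(lines) < 4:
        return ["fewer than four header lines"]
    counts = lines[3]
    if "V2000" not in counts:
        return ["counts line is not marked V2000"]
    try:
        # Columns 1-3 hold the atom count, columns 4-6 the bond count.
        n_atoms, n_bonds = int(counts[0:3]), int(counts[3:6])
    except ValueError:
        return ["atom or bond count is not an integer"]
    body = lines[4:]
    if len(body) < n_atoms + n_bonds:
        return ["fewer atom/bond lines than the counts line declares"]
    problems = []
    for i, line in enumerate(body[n_atoms:n_atoms + n_bonds], start=1):
        try:
            a, b = int(line[0:3]), int(line[3:6])
        except ValueError:
            problems.append(f"bond {i}: atom indices are not integers")
            continue
        if not (1 <= a <= n_atoms and 1 <= b <= n_atoms):
            problems.append(f"bond {i}: atom index out of range")
        elif a == b:
            problems.append(f"bond {i}: atom bonded to itself")
    if not any(l.startswith("M  END") for l in body[n_atoms + n_bonds:]):
        problems.append("missing 'M  END' terminator")
    return problems


# Hypothetical two-atom record (the heavy atoms of methanol), for demonstration.
SAMPLE = "\n".join([
    "methanol",
    "  sketch",
    "",
    "  2  1  0  0  0  0  0  0  0  0999 V2000",
    "    0.0000    0.0000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0",
    "    1.4000    0.0000    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0",
    "  1  2  1  0",
    "M  END",
])

if __name__ == "__main__":
    print(check_molfile(SAMPLE))
```

A production check would go further, for example by comparing valences or by regenerating an identifier such as an InChI and looking for stereochemistry mismatches, which is the role the talk attributes to CDK and InChI.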

There is no science without communication, including communication with a more general audience, as well as with those immediately involved. Despite the barriers to widespread communication, it is important to do something. Note also that good communication of data is hard work, and attention to detail is critical.

The earliest plans [8] for PubChem recognized the need for significant resources to store and disseminate data. NLM was a natural choice for this function, and Steve Bryant was brought in early in the implementation process. Evan Bolton came in when the nuts-and-bolts implementation started. When PubChem went live, about a third of the structures and all of the biological data were from DTP. In less than 20 years the world of open chemical structures has gone from about 100,000 compounds in a single file to millions of structures being freely available in a searchable database.

In the future, more applications will be built on PubChem data. “Chemical awareness” should be integrated into the publication process, especially peer review. Data consistency will improve, and researchers will be better able to understand the context for structures and data, and to find out which similar structures are known and which assays have been run on them. Researchers will use predictive tools more as a measure of surprise than as a substitute for measurements.