SPL and openFDA resources of open substance data

Yulia Borodina

Yulia Borodina is in the Office of Health Informatics at the U.S. Food and Drug Administration (FDA/OHI). Her talk concerned “bulk” open data. Machine-readable data are extracted from text or legacy databases, harmonized, and coded in a machine-readable format. To provide data interoperability you need a data standard; you then harmonize the data according to the standard and ensure that the standard is publicly (and, ideally, freely) available. Unfortunately, you may have to wait 50 years until the community adopts the standard. To support data reuse you can provide direct downloads and Application Programming Interfaces (APIs), and let the user decide how to select and analyze the data.

Structured Product Labeling (SPL) is a document markup standard approved by Health Level Seven (HL7) and adopted by FDA as a mechanism for exchanging product and facility information. It covers health informatics, cheminformatics, and bioinformatics. It has many applications; Yulia concentrated on substances. SPL is a universal (not data-specific) exchange standard, with reusable data types, coded data elements, and data-specific validation procedures. Drug manufacturers and distributors submit SPL to FDA, and FDA produces a product SPL file together with substance, pharm class, billing unit, and product concept index files. Data are output to the FDA Online Label Repository, the National Library of Medicine’s DailyMed website, and the public data warehouse, openFDA.

Substances in products can be small molecules, proteins, nucleic acids, polymers, organisms, parts of organisms, or mixtures. Definitions of non-confidential substances from the FDA Substance Registration System are available in SPL format, with unique ingredient identifiers (UNII). The data for over 50,000 chemical substances, and over 5,000 biological ones, are compliant with the Identification of Medicinal Products (ISO IDMP 11238) standard, and are available from DailyMed and openFDA. The IDMP standard defines “what” (e.g., proteins are to be defined by sequence) and the SPL standard defines “how” (e.g., UNII, molfile, InChI, and InChIKey for small molecules). Yulia showed the content of some SPL Substance Index Files for various types of substance. SPL data have been integrated into PubChem.

The concept of openFDA is to index high-value, high-priority, and scalable public datasets (e.g., medical device reports, drug adverse events, and food recall enforcement reports), to format and document the data in developer- and consumer-friendly standards, and to make those data available via a public-access portal that enables developers to use them in applications quickly and easily. openFDA offers both direct downloads and APIs. Substance and Pharm Class SPL index files can be downloaded, and some substance SPL fields associated with a product label are available in JavaScript Object Notation (JSON) format via the API. openFDA also lets users carry out statistical analyses of adverse events, such as the likelihood ratio test-based method for signal detection in drug classes. Interactive open-source applications available on https://open.fda.gov/analytics/ demonstrate how openFDA APIs, combined with powerful statistical tools built by the openFDA community, can be used for epidemiological research.
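As a rough illustration of retrieving substance fields in JSON via the API, the following Python sketch builds a query against the openFDA drug label endpoint and pulls the substance-related fields out of the `openfda` section of each result. It was not shown in the talk. The helper names, the choice of fields, and the sample response are assumptions made for illustration; the sample response is hand-written and is not real API output.

```python
import json
import urllib.request
from urllib.parse import urlencode

# openFDA drug labeling endpoint.
LABEL_ENDPOINT = "https://api.fda.gov/drug/label.json"


def build_label_query(unii: str, limit: int = 1) -> str:
    """Build a URL that searches product labels by substance UNII."""
    params = {"search": f'openfda.unii:"{unii}"', "limit": limit}
    return f"{LABEL_ENDPOINT}?{urlencode(params)}"


def extract_substance_fields(response: dict) -> list:
    """Collect the substance-related fields from each label's 'openfda' section.

    The fields picked here are an illustrative choice; missing fields
    are returned as empty lists.
    """
    rows = []
    for label in response.get("results", []):
        openfda = label.get("openfda", {})
        rows.append({
            "unii": openfda.get("unii", []),
            "substance_name": openfda.get("substance_name", []),
            "pharm_class_epc": openfda.get("pharm_class_epc", []),
        })
    return rows


def fetch_substance_fields(unii: str, limit: int = 1) -> list:
    """Hypothetical convenience wrapper: query the live API (needs network)."""
    url = build_label_query(unii, limit)
    with urllib.request.urlopen(url, timeout=30) as resp:
        return extract_substance_fields(json.load(resp))


# Hand-written stand-in for an API response (NOT real output), so the
# parsing step can be tried without network access.
SAMPLE_RESPONSE = {
    "results": [
        {
            "openfda": {
                "unii": ["R16CO5Y76E"],
                "substance_name": ["ASPIRIN"],
            }
        }
    ]
}

sample_rows = extract_substance_fields(SAMPLE_RESPONSE)
```

Calling `fetch_substance_fields("R16CO5Y76E")` would run the same extraction against the live service. Keeping URL construction and parsing separate from the network call makes each step easy to inspect or test on its own.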