Adoption and Impact of an Identifier Policy – AstraZeneca

Find out how AstraZeneca deployed an identifier policy to build a FAIR infrastructure across the enterprise.

  • A Uniform Resource Identifier (URI) policy for the enterprise
  • A pilot server for persistent URIs

Overview

In 2012, AstraZeneca implemented a Uniform Resource Identifier (URI) policy that describes how URIs must be constructed to facilitate cross-enterprise Findability, Interoperability and Reuse of digital objects. The benefits of adoption are greatest in information domains where data must be used across multiple sources and where one may not control the information architecture of those sources. Focus areas taking advantage of this approach include clinical studies, translational medicine and competitive intelligence programs.

Process

AstraZeneca developed the URI policy in 2012 with the help of 3 Round Stones Inc. Implementing a URI policy has a number of benefits. It helps prevent ambiguity, because URIs are used to name things and can be globally unique. It facilitates decentralized management, because URIs are unique within a domain. It enables global accessibility, because HTTP URIs support lookup (an HTTP URI is not just an identifier but can also resolve to a location, as a Uniform Resource Locator) and can be dereferenced to provide a machine-readable representation.

The goals of the policy are to:

  • Determine the objects within AZ data space that require persistent identifiers.
  • Determine AZ URI Scheme/s and related governance structure.
  • Provide guidance on Best Practices for minting URIs for AZ objects.
  • Determine how persistent objects should be served across the organization in support of maximum re-use.

Using previous work of industry domain leaders and the World Wide Web Consortium [1, 2, 3], the policy describes a base URI structure, a variation of the base structure to be used for vocabularies, schemata and ontologies, the URI structure for instance identification, the (optional) URI structure for data source identification, the URI structure for syntactical variation requests and the URI structure for the identification of concepts.

For URI types, AstraZeneca has chosen to use 303 URIs (also called “slash” URIs). The policy also follows a URI opacity approach, under which agents accessing URIs should not parse or otherwise read meaning into them.
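The 303 pattern can be illustrated with a short Python sketch. It is not AstraZeneca's server code: the function name `handle_get`, the `example.org` URIs and the rule that the describing document lives at the identifier plus an extension are all assumptions made for illustration. The idea is that an identifier names a thing (such as a study) rather than a document, so a request for it is answered with a 303 See Other redirect to a document about that thing.

```python
from typing import Dict, Tuple

# Extensions listed under the EXT term of the policy.
KNOWN_EXT = (".rdf", ".n3", ".json")


def handle_get(uri: str, default_ext: str = ".rdf") -> Tuple[int, Dict[str, str]]:
    """Return (status, headers) for a GET request on `uri`."""
    if uri.endswith(KNOWN_EXT):
        # The URI already addresses a document, so it can be served directly.
        return 200, {"Content-Location": uri}
    # The URI names a thing, not a document: redirect with 303 See Other.
    # ASSUMPTION: the describing document is at the same URI plus an extension.
    return 303, {"Location": uri + default_ext}


if __name__ == "__main__":
    # Hypothetical identifier, used only to exercise the function.
    print(handle_get("http://example.org/study/X"))
```

The client treats the identifier as an opaque string throughout; only the server decides where the describing document lives.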

The basic URI structure follows a template in Backus-Naur Form (BNF):

SCHEME PURPOSE FUNCTION DOMAIN ‘/’ ( SCHEMA ‘/’ CONCEPT (EXT)* | CATEGORY (‘/’ SOURCE)* ( ‘/’ TOKEN )+ (EXT)+ )

With the following terms and definitions:

Term      Definition
SCHEME    ‘http://’ or ‘https://’
PURPOSE   A business purpose, e.g. ‘data’, ‘vocab’
FUNCTION  A business function name, e.g. ‘research’ or ‘rd’
DOMAIN    ‘.astrazeneca.net’
SCHEMA    A schema/model for a domain of knowledge
CONCEPT   A particular concept in a schema, e.g. ‘indication’
CATEGORY  FUNCTION-specific term to identify a category, e.g. ‘study’
SOURCE    A source system identifier, e.g. ‘MLCS’
TOKEN     A particular entity or instance, e.g. D5890C00003 or DOID/2841
EXT       “” | ‘.’ (rdf | n3 | json)
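
As an illustration, the instance (CATEGORY) branch of the template can be turned into a small URI builder. The Python sketch below is not AstraZeneca's implementation: the function name `build_data_uri` is hypothetical, and it assumes that PURPOSE and FUNCTION are joined to the domain with ‘.’ separators, which the template itself does not spell out.

```python
from typing import Optional, Sequence

DOMAIN = ".astrazeneca.net"                  # DOMAIN term from the table
ALLOWED_SCHEMES = ("http://", "https://")    # SCHEME term from the table
ALLOWED_EXT = ("", ".rdf", ".n3", ".json")   # EXT term from the table


def build_data_uri(purpose: str, function: str, category: str,
                   tokens: Sequence[str], source: Optional[str] = None,
                   ext: str = "", scheme: str = "http://") -> str:
    """Assemble an instance URI: CATEGORY, optional SOURCE, one or more TOKENs."""
    if scheme not in ALLOWED_SCHEMES:
        raise ValueError(f"unsupported scheme: {scheme!r}")
    if ext not in ALLOWED_EXT:
        raise ValueError(f"unsupported extension: {ext!r}")
    if not tokens:
        raise ValueError("at least one TOKEN is required")
    # ASSUMPTION: PURPOSE and FUNCTION are dot-separated host name labels.
    host = f"{purpose}.{function}{DOMAIN}"
    segments = [category] + ([source] if source else []) + list(tokens)
    return f"{scheme}{host}/" + "/".join(segments) + ext


if __name__ == "__main__":
    # Term values taken from the examples in the table above.
    print(build_data_uri("data", "rd", "study", ["D5890C00003"]))
```

Keeping the construction in one function means minting rules such as the allowed schemes and extensions are enforced in a single place, while consumers continue to treat the resulting URIs as opaque.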

For example, an identifier for a clinical study with the short title CP200, originating from the Biomarker Data Mining (BDM) system within the former Medimmune Division would be constructed as below:

Here the scheme is “http”, but it could also be “https”. The purpose is “data”, denoting that this identifier is for an instance of this resource. The function is “rd”, denoting that this resource belongs to research and development. The domain is “astrazeneca.net”, denoting that this resource is expected to be resolved within the AstraZeneca intranet. The category is “study”, which, in practice, is also defined as a vocabulary term. The optional source is the BDM system, and the token “CP200” is the local identifier within that source. Of note, when the source term is present, only the attributes of that resource held within that source system are expected to be part of the digital object, not attributes held in other systems. Lastly, the optional ext term is not used, meaning the agent using the URI will rely on content negotiation to determine the serialization of the data returned upon resolution.
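
The interplay between the optional ext term and content negotiation can be sketched in Python as follows. This is a simplified illustration, not the behaviour of AstraZeneca's server: the function names, the `example.org` URIs, the extension-to-media-type mapping and the RDF/XML fallback are assumptions, and the Accept header parsing deliberately ignores q-values.

```python
from typing import Optional

# ASSUMPTION: illustrative mapping from the policy's extensions to media types.
EXT_TO_MEDIA = {
    ".rdf": "application/rdf+xml",
    ".n3": "text/n3",
    ".json": "application/json",
}
DEFAULT_MEDIA = "application/rdf+xml"  # ASSUMPTION: hypothetical fallback


def choose_serialization(uri: str, accept: Optional[str] = None) -> str:
    """Pick a media type: an explicit extension wins, else the Accept header."""
    for ext, media in EXT_TO_MEDIA.items():
        if uri.endswith(ext):
            return media
    # Naive negotiation: first supported type in listed order, q-values ignored.
    for part in (accept or "").split(","):
        media = part.split(";")[0].strip()
        if media in EXT_TO_MEDIA.values():
            return media
    return DEFAULT_MEDIA


if __name__ == "__main__":
    # Hypothetical URI; without an extension the Accept header decides.
    print(choose_serialization("http://example.org/study/X", "text/n3"))
```

A production resolver would honour q-values and answer 406 Not Acceptable where appropriate; the sketch only shows the order of precedence between the extension and the Accept header.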

Within AstraZeneca, when clinical studies reach a maturation point, they are assigned a D-Code, which becomes the authoritative study identifier.  In this example, harmonization from a subsidiary system (e.g. Medimmune’s CP200 in the BDM system) would create an identifier D2830C00001 that would then “mint” the new authoritative URI:

Upon resolving this URI and passing through authentication and authorization gates, a user (or agent) would expect to receive all attributes from all subsystems mapping to this authoritative URI, including the BDM and ClinicalTrials.gov URIs (here NCT00946699).
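
The difference between resolving a source-scoped identifier and the authoritative one can be sketched as a small in-memory lookup in Python. The identifiers come from the example above; the data structures, function names and attribute keys are hypothetical and stand in for whatever store actually holds the mappings.

```python
from typing import Dict, List

# Subsystem identifier -> authoritative D-Code (identifiers from the text above).
MAPS_TO: Dict[str, str] = {
    "BDM/CP200": "D2830C00001",
    "ClinicalTrials.gov/NCT00946699": "D2830C00001",
}

# HYPOTHETICAL attribute records held by each subsystem.
SOURCE_RECORDS: Dict[str, Dict[str, str]] = {
    "BDM/CP200": {"short_title": "CP200"},
    "ClinicalTrials.gov/NCT00946699": {"registry_id": "NCT00946699"},
}


def resolve_source(source_id: str) -> Dict[str, str]:
    """A source-scoped identifier returns only that source's attributes."""
    return dict(SOURCE_RECORDS[source_id])


def resolve_authoritative(d_code: str) -> Dict[str, object]:
    """The authoritative identifier merges attributes from every mapped source."""
    sources: List[str] = [s for s, target in MAPS_TO.items() if target == d_code]
    merged: Dict[str, object] = {"d_code": d_code, "sources": sources}
    for source_id in sources:
        merged.update(SOURCE_RECORDS.get(source_id, {}))
    return merged


if __name__ == "__main__":
    print(resolve_source("BDM/CP200"))
    print(resolve_authoritative("D2830C00001"))
```

In a linked data setting the same mapping would more likely be expressed as triples linking the subsystem URIs to the authoritative one, but the resolution logic is the same.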

As Clinical Study is also considered a Master Data Entity, a shortened version of the URI is allowed that will direct users to the AstraZeneca Clinical Trial application [4]:

As noted previously, the purpose term of the URI policy allows it to cover concepts in vocabularies as well as data resources.  The category study in these examples has the URI:

The main differences between vocabulary URIs and data URIs lie in the purpose, schema and concept terms. As with data URIs, multiple subsystem URIs can point to the authoritative vocabulary URI.

Outcomes

The URI policy has been implemented to support two projects in AstraZeneca: Integrative Informatics, a translational science knowledge graph; and AZ Clinical Trials, an internal application offering a simple way to find information on past and current studies.

Integrative Informatics uses the policy to mint URIs for data catalog records and for digital objects using our core and translational science ontologies. All ontologies also follow the URI policy. When research and clinical study metadata is created using our Smart Form applications, it not only follows the policy but is also “born FAIR”, using our own ontologies and aligning with community best-practice ontologies in the relevant domains. This enables the creation of FAIR-compliant datasets that are part of a growing, federated knowledge graph.

AZ Clinical Trials uses the URI policy to eliminate redundancy in trial identifiers, promoting a one-stop-shop approach for clinical trial metadata. Considering that many of our study and investigational product assets are of external origin, and that biopharma companies constantly need to acquire, merge and divest assets with multiple partners, a common identifier system is foundational to data stewardship efforts in this space.


References and Resources

  1. 3 Round Stones and Callimachus on GitHub https://github.com/3-Round-Stones/callimachus
  2. Tim Berners-Lee (2006 July 27). Linked Data – Design Issues, W3C Design Note, http://www.w3.org/DesignIssues/LinkedData.html
  3. Paul Davidson (ed.) (2009 October 9). Designing URI Sets for the UK Public Sector, A report from the Public Sector Information Domain of the CTO Council’s cross-Government Enterprise Architecture, Interim paper, Version 1.0, https://assets.publishing.service.gov.uk/government/uploads/system/uploads/attachment_data/file/60975/designing-URI-sets-uk-public-sector.pdf
  4. Leo Sauermann and Richard Cyganiak (eds.) (2008 December 3). Cool URIs for the Semantic Web, W3C Semantic Web Education and Outreach Interest Group Note, http://w3.org/TR/cooluris/
  5. Kerstin Forsberg and Daniel Goude. Study URI.  PhUSE EU Connect 2018 https://www.lexjansen.com/phuse/2018/tt/TT09.pdf

At a Glance

Team
  • Internal Business Analyst
  • Information Architect
  • Data Engineer
  • SME from partnering firm
Timeline
  • Roughly three months
  • Part of a larger linked data architecture pilot
Deliverables & benefits
  • A URI policy
  • A pilot server for persistent URIs
Authors
  • Tom Plasterer, Data Science & AI, BioPharmaceuticals R&D, AstraZeneca, Boston, Massachusetts, USA
  • Kerstin Forsberg, Data & Analytics, R&D IT, AstraZeneca, Gothenburg, Sweden