Unified Biomedical Knowledge Graph (UBKG)

Data Model


Purpose of the UBKG

A Unified Biomedical Knowledge Graph (UBKG) is a neo4j knowledge graph database that combines information from the vocabularies managed by the Unified Medical Language System (UMLS) with information from other data sources.

Information from the UMLS includes:

Examples of data sources that are not in the UMLS include:

The purpose of the UBKG is to organize and to consolidate the assertions that data sources state for biomedical concepts.

An assertion is a statement of a relationship between two concepts. Examples of assertions include:

Concepts and synonymy

Like the UMLS Metathesaurus, the UBKG is structured around concepts. A concept is a discrete codification of an idea–e.g., the idea of a cell.

In general, an idea will be represented in more than one data source using codes: for example, the idea of cell is associated with code 4421005 in SNOMEDCT_US and C12508 in NCI.

The UBKG establishes synonymy between data sources in the representation of concepts. If the code for an entity in one data source can be considered to encode the same entity as a code in another data source, then the codes in the different data sources are synonyms. Assertions involving the codes that are synonyms can be considered as assertions involving their shared concept. Assertions can thus be consolidated across data sources.

The UBKG extends the UMLS Metathesaurus’s synonymy architecture by integrating information from sources outside the UMLS.

Example

Two data sources assert relationships about cats. One source uses the code cat to represent the concept cat; the other data source uses the code feline.

The data sources assert relationships between their respective codes and other codes.

img.png

Because cat and feline can be considered code synonyms for the concept cat, it is possible to consolidate assertions from the two sources using concepts instead of codes. (The animal and mouse codes are also associated with concepts; however, these codes are not synonyms for a shared concept, and so have separate concept associations.)

The assertions of the two data sources can be consolidated at the concept level.

Because of synonymy, either of the codes cat or feline can be used as the subject for an assertion. The assertion is actually between the associated concepts.

img_1.png

The concept subgraph

An idea in the UBKG can be considered on two levels of knowledge:

The UBKG supports both levels of knowledge by representing each concept with a concept subgraph that associates the concept with the synonymous encodings of different data sources. In other words, a concept subgraph represents both a concept and all the ways that the concept can be represented in the UBKG.

The following image illustrates the general form of a concept subgraph. In general, a concept will be encoded in multiple data sources; however, the image illustrates the encoding of a concept in a single data source for clarity. img_2.png

A concept subgraph features both nodes and edges (relationships).

Nodes (entities) of a concept subgraph

Concept node

A Concept node is the originating node of a concept’s subgraph.

Each Concept node has a property representing the Concept Unique Identifier (CUI) for the concept. CUIs for concepts imported from the UMLS are alphanumeric strings starting with “C”–e.g., C007634. If a code from a non-UMLS data source cannot be associated with a UMLS concept, the UBKG generation framework will create a Concept node for the code with a CUI that has format Code in vocabulary CUI–e.g., SNOMEDCT_US:4421005 CUI.

Code node

A Code node represents the codification of a concept in a data source. For example, the SNOMEDCT_US vocabulary uses code 4421005 to represent the concept of cell; the associated concept has CUI C007634.

Each Code node has the following properties:

Term node

A Term node represents a term that can be associated with either a code or concept. Both Code nodes and Concept nodes can have multiple terms associated with them–- e.g., preferred terms, synonyms, etc.

(Term synonymy is different from the code synonymy for concepts discussed previously.)

The only property of a Term node is name.

Definition node

A Definition node represents a definition of a concept. Although a source (e.g, a vocabulary) assigns a definition to a code, the UBKG links the Definition node to the Concept node associated with the Code node from the source.

The Definition node has three properties:

Semantic node

Concepts from the UMLS are associated with Semantic nodes. Semantic nodes are translations of the Semantic Network types for concepts.

The Semantic node has three properties:

Relationships in the Concept subgraph

A concept subgraph contains relationships; however, the relationships of a concept subgraph are different from relationships that are in assertions in that they link concepts with the metadata nodes of type Code, Term, Definition, and Semantic.

In each concept subgraph,

Term type relationships

In the UMLS, terms are categorized by term type: e.g., PT (for preferred term); SY (for synonym); etc.

The UBKG translates term types into relationships (edges). For example, in the UMLS, the preferred term for a code is the term with term type PT; in the UBKG, the preferred term is the Term node that has a relationship of PT with the code node.

PT and PREF_TERM relationships

The preferred term for a Code has a term type of PT. The preferred term for a Concept has a term type of PREF_TERM. A term can be both the PT for a Code and the PREF_TERM for a Concept.

Business rules of the concept subgraph

PT_SAB

A number of non-UMLS sources define relationships between entities in another source–e.g., UBERON defines relationships between entities in PATO, CL, etc.

A source may define a preferred term for a code in another source that is different from the preferred term used by the source that is the steward of the code: e.g., UBERON might define the preferred term for a PATO code as “cells” instead of “cell”.

The UBKG reserves the PT (preferred term) relationship for the preferred term for a code defined by the steward of the code: e.g., only UBERON can define the preferred term for a UBERON code.

If a source defines a preferred term for a code from another source and either the other source had not been ingested into the UBKG prior to the ingestion of the source or the preferred term for the code in the source is different than that defined by the other source, the UBKG assigns that Term a relationship of PT_SAB, where SAB is for the source that is not the steward of the code.

Preferred Concept

Ideally, each Code node would associate with only one Concept node in the UBKG. However, some codes are associated with multiple concepts–both in the data exported from the UMLS and in non-UMLS sources.

The UBKG associates each Code with all Concepts that are specified by data sources.

However, to mitigate duplicate references, the UBKG attempts to find a single preferred concept for each code. The UBKG identifies the preferred concept for a code by means of the CUI property of the relationships between the Code node and its preferred term node: the CUI property of the Term with relationship PT is the CUI of the preferred concept.

Assertion (Concept-Concept Relationship) model

In the UBKG, the assertions that data sources state between codes are represented in the UBKG as relationships between the concepts with which codes associate.

The UBKG represents each assertion that a data source makes involving a code with a relationship that involves the Concept node of the code’s associated concept subgraph. The original code from the data source can be resolved from the code’s concept subgraph.

Inverse assertions and the Relations Ontology (RO)

The UBKG represents each assertion from a data source as a pair of bi-directional relationships between two concepts:

The UBKG derives inverse assertions when importing from data sources; in fact, the edge file from a data source should only contain forward assertions.

When deriving inverse assertions, the UBKG refers to the Relations Ontology, which defines a large number of bidirectional pairs of assertion predicates.

In a previous example, the concept of cat was the subject of two assertions–isa and chases. The UBKG might represent the bidirectional sets of assertions as follows:

img_3.png

Properties in the UBKG

Discussions of graph databases often distinguish between

The UBKG data model combines elements of a knowledge graph with those of a property graph. The UBKG is essentially a knowledge graph with a small, constrained set of properties.

Concept subgraph properties

The majority of the properties in the UBKG support concept subgraphs. These node and edge properties were discussed previously:

Element Property Description
Concept node CUI Concept Unique Identifier
Code node SAB Source ABbreviation
  CODE Code in the SAB
  CodeID Concatenation of SAB and CODE
Term node name Text of term
Definition node SAB Source ABbreviation
  DEF Text of definition
  ATUI Attribute Unique Identifier
Semantic node STN Semantic Tree Number
  TUI Type Unique Identifier
  name Text for semantic type
Code:Term relationships CUI Concept Unique Identifier

Node properties

The UBKG generation framework allows data sources to define the following properties for a node in the node file:

Edge properties

The UBKG generation framework allows data sources to define a property named evidence_class for an assertion in the edge file.

Assertion relationship properties

When the UBKG creates an assertion relationship between concepts, it assigns to the SAB property of the relationship the SAB for the data source. For example, the isa relationships asserted by the UBERON ontology have the SAB value UBERON.