Unified Biomedical Knowledge Graph (UBKG)


The Unified Biomedical Knowledge Graph (UBKG) is a knowledge graph infrastructure that represents a set of interrelated concepts from biomedical ontologies and vocabularies.

The UBKG infrastructure involves…

The UBKG combines information from the National Library of Medicine’s Unified Medical Language System (UMLS) with sets of assertions (also known as triples, or subject-predicate-object relationships) from ontologies or vocabularies outside the UMLS, such as:

An important goal of the UBKG is to establish connections between sets of assertions. For example, if information on the relationships between proteins and genes described in one set of assertions can be connected to information on the relationships between genes and diseases described in another set of assertions, it may be possible to identify previously unknown relationships between proteins and diseases.
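The idea of connecting two sets of assertions through a shared entity can be sketched in a few lines of Python. This is only an illustration and does not reflect how the UBKG stores or queries data: the triples, identifiers, predicate names, and the function `link_through_shared_entity` are all hypothetical.

```python
from collections import defaultdict

# Hypothetical assertions as (subject, predicate, object) triples.
# The identifiers are placeholders, not real UBKG or UMLS codes.
protein_gene = [
    ("PROTEIN:P1", "encoded_by", "GENE:G1"),
    ("PROTEIN:P2", "encoded_by", "GENE:G2"),
]
gene_disease = [
    ("GENE:G1", "associated_with", "DISEASE:D1"),
    ("GENE:G1", "associated_with", "DISEASE:D2"),
]


def link_through_shared_entity(first, second):
    """Join two sets of triples where the object of a triple in `first`
    is the subject of a triple in `second`.

    Returns (subject, shared_entity, object) tuples.
    """
    # Index the second set by subject for constant-time lookup.
    objects_by_subject = defaultdict(list)
    for subj, _pred, obj in second:
        objects_by_subject[subj].append(obj)

    linked = []
    for subj, _pred, shared in first:
        for target in objects_by_subject.get(shared, []):
            linked.append((subj, shared, target))
    return linked


candidate_links = link_through_shared_entity(protein_gene, gene_disease)
for protein, gene, disease in candidate_links:
    print(f"{protein} may relate to {disease} via {gene}")
```

In this toy data only `PROTEIN:P1` is linked to diseases, because `GENE:G2` does not appear in the second set of assertions. A link found this way is a candidate relationship to investigate, not an established one.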

Introduction to assertions

An assertion is the statement of a relationship between two entities. If you are new to assertions, consult the Basics page.

Components of the UBKG

The primary components of an instance of the UBKG are:

Sources and Source Abbreviations (SABs)

The publisher (also known as the owner or steward) of a set of assertions is identified with a Source Abbreviation (SAB).

Examples of SABs include:

Source Contexts

The content of an instance of the UBKG depends directly on the assertion information that is imported into that instance's knowledge graph database. The source context (or just context) for an instance of the UBKG describes a collection of sets of assertions, each of which is identified by an SAB.

The Contexts page describes a number of UBKG contexts.

UBKG Source framework

The UBKG source framework is a combination of manual and automated processes that obtain the base set of nodes (entities) and edges (relationships) that comprise the UMLS CSVs. The UMLS CSVs can be loaded into neo4j to populate a UMLS context of the UBKG (UMLS-Graph).

The result of the Metathesaurus and Semantic Network downloads is a set of files in Rich Release Format (RRF). The RRF files contain information on source vocabularies or ontologies, codes, terms, and relationships both with other codes in the same vocabularies and with UMLS concepts.

The RRF files can be loaded into tables in a data mart. (The University of Pittsburgh manages its UMLS content in its Neptune data mart.)

A Python script then executes SQL scripts that perform Extraction, Transformation, and Loading (ETL) of the RRF data into a set of twelve temporary tables. These tables are exported to CSV format in files that become the UMLS CSVs.
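To give a feel for the RRF-to-CSV step, the following sketch parses a single pipe-delimited RRF line in plain Python and writes selected fields as CSV. It is not the SQL-based ETL that the source framework uses. The column order is that of MRCONSO.RRF in the UMLS documentation; the sample line is invented, and the choice of output columns is an assumption that does not correspond to the actual layout of the UMLS CSVs.

```python
import csv
import io

# Column order of MRCONSO.RRF, per the UMLS reference documentation.
MRCONSO_COLUMNS = [
    "CUI", "LAT", "TS", "LUI", "STT", "SUI", "ISPREF", "AUI", "SAUI",
    "SCUI", "SDUI", "SAB", "TTY", "CODE", "STR", "SRL", "SUPPRESS", "CVF",
]


def parse_rrf_line(line, columns):
    """Split one pipe-delimited RRF line into a dict keyed by column name."""
    fields = line.rstrip("\n").split("|")
    # RRF lines end with a trailing pipe, which produces one extra empty field.
    if fields and fields[-1] == "":
        fields = fields[:-1]
    if len(fields) != len(columns):
        raise ValueError(f"expected {len(columns)} fields, got {len(fields)}")
    return dict(zip(columns, fields))


def rows_to_csv(rows, out_columns):
    """Write the selected columns of each row to a CSV string."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer,
        fieldnames=out_columns,
        extrasaction="ignore",  # drop RRF columns that are not exported
        lineterminator="\n",
    )
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Invented sample line; not real UMLS content.
sample_line = (
    "C0000000|ENG|P|L0000000|PF|S0000000|Y|A0000000||||"
    "EXAMPLE|PT|X1|Example term|0|N||"
)
row = parse_rrf_line(sample_line, MRCONSO_COLUMNS)

# Hypothetical output layout chosen for illustration only.
csv_text = rows_to_csv([row], ["CUI", "SAB", "CODE", "STR"])
print(csv_text)
```

Doing the transformation in SQL against a data mart, as the source framework does, scales better to the full Metathesaurus than line-by-line parsing; the sketch only shows the shape of the input and output.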

Solution Architecture

The following diagram illustrates the source framework workflow.

[Diagram: source framework workflow]

UBKG Generation framework

The UBKG generation framework extends the UMLS context by integrating additional assertions from sources outside the UMLS.

Scripts in the generation framework:

Once the generation framework compiles a set of ontology CSVs for a UBKG context, the CSVs can be imported into a neo4j database to populate an instance of the UBKG.

The generation framework accepts assertion files in a number of formats. The Formats page specifies formats that the generation scripts accept.

PheKnowLator and OWLNETS

When the assertion data source is an OWL file, the generation framework uses the Phenotype Knowledge Translator (PheKnowLator) package. PheKnowLator converts information from an OWL file into the OWL-NETS (OWL NEtwork Transformation for Statistical learning) format.

Solution Architecture

The generation framework is a parameterized ETL script that:

The following diagram illustrates the basic workflow, showing four cases:

  1. The OWLNETS script that uses PheKnowLator to work with OWL files
  2. A custom script (shown here is a script that works with UniProtKB)
  3. The SKOWLNETS script that works with SimpleKnowledge data sources
  4. Files in the UBKG edges/nodes format

[Diagram: generation framework workflow]

UBKG API

It is possible to obtain information from an instance of the UBKG ontology graph database by executing Cypher queries. However, the complexity and likely size of a UBKG neo4j instance require both knowledge of the UBKG schema and caution to avoid problems such as runaway queries.
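One common guard against runaway queries is to parameterize the query and always bound the number of rows returned. The sketch below builds such a query as a string plus a parameter dictionary, without connecting to a database. The node label `Code` and the properties `SAB` and `CodeID` are assumptions about the UBKG schema made for illustration, and the function name and the row cap of 1000 are hypothetical.

```python
MAX_LIMIT = 1000  # hypothetical upper bound on rows per query


def build_bounded_query(sab, limit=25):
    """Return (cypher, parameters) for a bounded lookup of codes in one SAB.

    The label and property names in the query are assumed, not confirmed
    against a UBKG instance.
    """
    # Reject booleans explicitly, since bool is a subclass of int in Python.
    if isinstance(limit, bool) or not isinstance(limit, int):
        raise ValueError("limit must be an integer")
    if limit < 1 or limit > MAX_LIMIT:
        raise ValueError(f"limit must be between 1 and {MAX_LIMIT}")

    cypher = (
        "MATCH (c:Code {SAB: $sab}) "
        "RETURN c.CodeID AS code_id "
        "LIMIT $limit"
    )
    return cypher, {"sab": sab, "limit": limit}


cypher, parameters = build_bounded_query("EXAMPLE_SAB", limit=10)
print(cypher)
print(parameters)
```

Passing values as parameters rather than interpolating them into the query text also avoids injection problems. The resulting pair could be handed to a Cypher client, for example the `run` method of a session in the official neo4j Python driver.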

The UBKG API is a REST API with endpoints that abstract common types of queries that can be executed against an instance of the UBKG neo4j knowledge graph database.

A description of the UBKG API can be found here.

Source repositories

The source for the infrastructure components of the UBKG (i.e., everything except assertion data from the UMLS and other SABs) is stored in public GitHub repositories in the x-atlas-consortia organization.

| Repository | Components | Description |
| --- | --- | --- |
| ubkg-etl | generation framework; source framework | ETL scripts to build sources that are imported into a UBKG neo4j instance |
| ubkg-neo4j | UBKG neo4j | Tools to build and deploy a Docker container hosting a neo4j instance populated with a UBKG |
| ubkg-api | UBKG API | REST API to query a UBKG instance |
| ubkg-docs | documentation | Source used to generate the content on this GitHub Docs site |