Unified Biomedical Knowledge Graph

Accepted Assertion File Formats


Objectives

This guide describes how to format a set of assertions so that it can be added to the Unified Biomedical Knowledge Graph (UBKG) database.

The guide includes recommendations for optimizing and deepening the integration of a set of assertions into the UBKG to establish new relationships among entities and cross-references among ontologies.

Audience

This guide is intended for users who are subject-matter experts in the biomedical domain that they want to represent with a set of assertions (e.g., genes and their products), but who are not necessarily conversant with ontological concepts or knowledge graphs.

Glossary

The Glossary page describes terms that this guide uses that are relevant to assertions or knowledge graphs.

Guiding Principles for Integrating a Set of Assertions

If the UBKG is to connect sets of assertions from different SABs, the sets of assertions should, as much as possible, represent entities and relationships similarly.

To the degree possible,

  1. Entities should be encoded with codes from published biomedical ontologies and vocabularies. For example, genes should be encoded in a standard vocabulary such as HGNC.
  2. Relationships should be represented with properties from the Relations Ontology.
  3. Codes for entities should be cross-referenced to UMLS CUIs (via the node_dbxrefs column, described below).

Format Options

The UBKG generation framework can work with files from a variety of sources, including:

OWL files

The generation framework can work with OWL files in serializations including:

Files can be online or stored locally.

Custom data formats

Some data sources are available only in custom formats, such as files on FTP sites. The UBKG development team can build custom scripts that extract and translate assertion content.

UBKG Edges/Nodes format

The rest of this document will discuss the UBKG Edges/Nodes format.

PubChem

Additional information to guide ingestion of PubChem information, a type of UBKG Edge/Node file, is here.

UBKG Edge/Node Format

Files in UBKG Edge/Node format describe the entities and relationships of a set of assertions that is to be integrated into the UBKG.

A set of UBKG Edge/Node files consists of two tab-separated values (TSV) files, edges.tsv and nodes.tsv, each described below.

Compression

UBKG Edge/Node files can be compressed in GZIP:

  1. Each file should be in its own GZIP file.
  2. The filename of the GZIP should correspond to the file name of the contents.
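
A minimal sketch of this compression step in Python. It assumes that the archive name is the content file name with a ".gz" suffix appended; that suffix and the helper name gzip_each are assumptions made for illustration, not part of the UBKG generation framework.

```python
import gzip
import shutil
from pathlib import Path


def gzip_each(paths):
    """Compress each file into its own GZIP archive next to the original."""
    archives = []
    for path in map(Path, paths):
        # Assumption: the archive is named after its contents, plus ".gz".
        archive = path.with_name(path.name + ".gz")
        with open(path, "rb") as src, gzip.open(archive, "wb") as dst:
            shutil.copyfileobj(src, dst)
        archives.append(archive)
    return archives
```

A call such as gzip_each(["edges.tsv", "nodes.tsv"]) would leave the original files in place and write one archive per file.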

In other words, if files are compressed, there should be files named

Source Abbreviations (SABs)

A Source Abbreviation (SAB) is an uppercase acronym that identifies a set of assertions. The SAB is based on the UMLS SAB, which the UMLS uses to identify a vocabulary or domain. The UBKG extends the SAB identifier to include sets of assertions that are not in the UMLS.

The SAB is synonymous with an ontological namespace.

In general, a set of assertions can employ multiple SABs, of two types:

The steward of a set of assertions is, in general, not the steward of all the codes involved in the assertions.

Example

Consider the following subset of assertions from the Phenotypic Quality Ontology (PATO). (Predicates have been translated from their labels in Relations Ontology.)

| subject | predicate | object |
|---|---|---|
| UBERON:0000457 | branching part of | UBERON:0001532 |
| PATO:0001776 | subClassOf | PATO:0001544 |
| CL:0000101 | capable of | GO:0050906 |
| PATO:0001894 | subClassOf | UBERON:0000061 |

This set contains three types of assertions:

The types of SABs in this set of assertions:

In other words, PATO is the steward of a set of assertions that involve codes maintained by other stewards (e.g., UBERON).

When ingesting a set of assertions, it is necessary to define a preferred SAB to represent the steward of the set (e.g., an initiative, project, or institution).

edges.tsv

The edges file lists the triples (subject node - predicate - object node) that constitute a set of assertions.

Fields

| Field | Corresponding element in UBKG | Accepted formats | Examples |
|---|---|---|---|
| subject | Code node | OBO Principle 3 conforming (see note below) IRI for a concept in a published ontology | http://purl.obolibrary.org/obo/UBERON_0004086 |
| | | Code for the concept in the format SAB {space} code in ontology OR SAB {underscore} code in ontology | UBERON 0004086, UBERON_0004086 |
| predicate | relationships | For hierarchical relationships, the IRI http://www.w3.org/2000/01/rdf-schema#subClassOf OR the string “isa” | http://www.w3.org/2000/01/rdf-schema#subClassOf |
| | | For non-hierarchical relationships, an IRI for a relationship property in RO | http://purl.obolibrary.org/obo/RO_0002292 |
| | | Custom string that complies with neo4j relationship naming rules (see relationship labels below) | drinks milkshake of |
| object | Code node | same as for subject | |
| evidence_class (optional) | string | Statement specific to an SAB to classify evidence | -0.016084092 |
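
As an illustration, the following sketch writes a small edges file with Python's csv module. The rows reuse two triples from the PATO example above. The column names come from the Fields table; the column order and the presence of a header row are assumptions of this sketch, as is the helper name write_edges.

```python
import csv
import io

# Column names from the Fields table. The order and the header row are
# assumptions of this sketch.
EDGE_FIELDS = ["subject", "predicate", "object"]

# Two triples from the PATO example: one hierarchical ("isa") and one that
# uses a relationship label.
edges = [
    ("PATO:0001776", "isa", "PATO:0001544"),
    ("CL:0000101", "capable of", "GO:0050906"),
]


def write_edges(rows, handle):
    """Write a header row and the triples as tab-separated values."""
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    writer.writerow(EDGE_FIELDS)
    writer.writerows(rows)


buffer = io.StringIO()
write_edges(edges, buffer)
print(buffer.getvalue())
```

To write to disk instead of an in-memory buffer, pass a file opened with open("edges.tsv", "w", newline="", encoding="utf-8").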

Requirements for nodes

The identifier (code) for a node must include the SAB for the code’s steward.

OBO Principle 3

The preferred form of identifier for a node is an IRI that conforms to Principle 3 for URI/Identifier Space. UBKG recognizes IRIs in this format implicitly.

An example of an IRI with the preferred format is: http://purl.obolibrary.org/obo/UBERON_0004086

In this example, UBERON is the SAB for the code.

For codes in a set of assertions that correspond to entities that have already been identified in a standard biomedical ontology, the IRI is preferred. For new codes that have not already been encoded in an ontology, the identifier must include the SAB that represents the code’s steward.

Special characters and delimiters

UBKG applies complex rules to parse the SAB from a node identifier. The following characters are reserved for use as delimiters:

  1. space
  2. colon
  3. underscore

For best results, use a single colon, underscore, or space to separate the SAB and the code. For example, the following are all acceptable representations:

  1. UBERON:0004086
  2. UBERON_0004086
  3. UBERON 0004086

Other than as this delimiter, do not use the reserved characters in node IDs.
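
The delimiter convention can be sketched as a small parser. This is not the UBKG's own parsing logic, which the guide describes as complex; it is a simplified illustration that splits on the first reserved character and strips the OBO IRI prefix. The function name split_node_id is hypothetical, and IRIs outside the purl.obolibrary.org/obo/ space are not handled.

```python
import re

OBO_PREFIX = "http://purl.obolibrary.org/obo/"


def split_node_id(node_id):
    """Return (SAB, code) for a node ID, or raise ValueError.

    Simplified assumption: the SAB is everything before the first colon,
    underscore, or space; the code is everything after it.
    """
    text = node_id.strip()
    if text.startswith(OBO_PREFIX):
        text = text[len(OBO_PREFIX):]
    # Split only once, on the first reserved delimiter.
    parts = re.split(r"[:_ ]", text, maxsplit=1)
    if len(parts) != 2 or not parts[0] or not parts[1]:
        raise ValueError(f"no SAB delimiter found in {node_id!r}")
    sab, code = parts
    return sab, code
```

Under these assumptions, "UBERON:0004086", "UBERON_0004086", "UBERON 0004086", and the IRI http://purl.obolibrary.org/obo/UBERON_0004086 all yield the same SAB and code.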

Recommendations for edges

The preferred source of relationship (predicate) information is the Relations Ontology (RO). Reasons for this include:

  1. RO is a general reference for relationships, so it is likely to already define a suitable standard relationship.
  2. RO defines inverse relationships, especially those that may not be obvious.

It is possible, nevertheless, that RO does not contain a relationship that is specific enough for an assertion, so a custom relationship will be needed. When defining a custom relationship, we recommend that the label be short and concise. (This is easier said than done, of course. Defining concise relationships is the hard part of modelling assertions.)

Special characters and delimiters

The preferred format for a predicate is an IRI with format

http://purl.obolibrary.org/obo/RO_code

The format RO:code is accepted, but not preferred.

Relationship label format

The neo4j naming rules specify that relationship names:

  1. Contain only alphanumeric characters or the underscore.
  2. Start with an alphabetic character.

Although a relationship can include special characters or start with a number, Cypher queries that refer to the relationship must escape the relationship name with backticks. To avoid the need for backticks, relationships in the UBKG are reformatted as follows:

| Characters | Change | Example |
|---|---|---|
| hyphen | replaced with underscore | abc-def to abc_def |
| other non-alphanumeric characters | replaced with underscores | abc (def) to abc_def |
| leading numeric character | preceded with ‘REL_’ | 1abc to REL_1abc |
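
A rough sketch of this reformatting in Python, useful for previewing how a custom relationship label might appear in the UBKG. It is not the UBKG's implementation. The function name normalize_relationship is hypothetical, and two behaviors are assumptions inferred from the "abc (def)" example: runs of special characters collapse to a single underscore, and leading or trailing underscores are dropped. Only ASCII letters and digits are treated as alphanumeric.

```python
import re


def normalize_relationship(label):
    """Approximate the relationship reformatting rules in the table above."""
    # Replace each run of non-alphanumeric characters (hyphens, spaces,
    # parentheses, ...) with one underscore. Collapsing runs is an assumption.
    name = re.sub(r"[^0-9A-Za-z]+", "_", label.strip())
    # Assumption: underscores at either end are dropped, as implied by the
    # "abc (def)" to "abc_def" example.
    name = name.strip("_")
    # A leading digit is not allowed, so prefix the name with "REL_".
    if name and name[0].isdigit():
        name = "REL_" + name
    return name
```

With these assumptions, the custom predicate "drinks milkshake of" from the edges table would become drinks_milkshake_of.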

nodes.tsv

The nodes.tsv file provides metadata on entities.

Fields

| Field | Corresponding element in UBKG | Accepted formats | Examples |
|---|---|---|---|
| node_id | Code node | OBO Principle 3 conforming IRI for a concept in a published ontology | http://purl.obolibrary.org/obo/UBERON_0004086 |
| | | Code for the concept in the format SAB {space} code in ontology | UBERON 0004086 |
| node_label | Term node, Preferred Term (PT) relationship | Text string | Ventricles of hindbrain |
| node_definition (optional) | Definition node, DEF relationship | Text string | One of the system of communicating cavities in the brain …. |
| node_synonyms (optional) | Term node; Synonym (SYN) relationship | Pipe-delimited list of synonyms | See Example for synonyms below |
| node_dbxrefs (optional) | Cross-references | Pipe-delimited list of references to cross-referenced concepts. Each cross-reference should be in format SAB:code or UMLS:CUI | See Example for dbxrefs below |
| value (optional) | Numeric value | numeric | 20 |
| lowerbound (optional) | Lower bound of range for values | numeric | 5 |
| upperbound (optional) | Upper bound of range for values | numeric | 100 |
| unit (optional) | Unit of measure for value. This is currently not encoded to a concept. | string | mm |

Example for synonyms

region of ventricular system of brain|brain ventricles|cerebral ventricle

Example for dbxrefs

umls:c0007799|fma:78447

The UMLS cross-reference in the example is to a CUI; the fma cross-reference is to a code.
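
The two pipe-delimited fields can be read with a few lines of Python. The helper names split_pipe and parse_dbxrefs are hypothetical, and upper-casing the SAB is an assumption based on the description of SABs as uppercase acronyms; the code portion is left unchanged.

```python
def split_pipe(value):
    """Split a pipe-delimited field into a list, dropping empty items."""
    return [item.strip() for item in value.split("|") if item.strip()]


def parse_dbxrefs(value):
    """Parse a node_dbxrefs field into (SAB, code) pairs."""
    refs = []
    for item in split_pipe(value):
        # Each cross-reference has the form SAB:code or UMLS:CUI.
        sab, sep, code = item.partition(":")
        if not sep or not sab or not code:
            raise ValueError(f"expected SAB:code, got {item!r}")
        # Assumption: SABs are normalized to uppercase.
        refs.append((sab.upper(), code))
    return refs


synonyms = split_pipe(
    "region of ventricular system of brain|brain ventricles|cerebral ventricle"
)
dbxrefs = parse_dbxrefs("umls:c0007799|fma:78447")
```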

Requirements for nodes

The IRI requirements for nodes in the nodes file are identical to those for nodes in the edges file.

Requirements (business rules) for a set of assertions

  1. A node identified in edges.tsv must satisfy at least one of the following criteria:
    • The node is defined in nodes.tsv. (Generally, this is for codes that are maintained by the steward of the set of assertions; that is, the steward maintains both nodes and edges.)
    • The node already exists in the UBKG. These nodes can be part of the UMLS or from a previously ingested set of assertions.

The UBKG generation framework will ignore nodes that do not satisfy at least one of these criteria.

  2. If a triple in edges.tsv refers to a node from a non-UMLS ontology, the non-UMLS ontology will need to be ingested into the UBKG first. For example, because the Mammalian Phenotype Ontology (MP) includes nodes from the Cell Ontology (CL), CL should be integrated into the UBKG before MP. This often improves the cross-referencing because the general ontologies have deeper cross-references to UMLS and other OBO sources.
  3. This spreadsheet lists the SABs and example codes for the ontologies that are currently represented in the UBKG. It should be used as the reference for formatting existing source abbreviations (SABs) and their codes. In other words, if a SAB is already part of the UBKG, it should be sufficient to refer to the node by code.

For additional details regarding the UMLS SABs, please consult the UMLS reference.

  4. Some ontologies (including HGNC, GO, and HPC) include the SAB in their codes (e.g., HGNC:9999). Nodes for concepts from these ontologies should be formatted as the SAB followed by the complete code, e.g., “HGNC HGNC:9999”.
  5. The UBKG ingestion will reformat predicates so that strings are delimited with underscores.
  6. The UBKG ultimately requires that two nodes be linked with both a relationship and its inverse. However, in an edges file each relationship (predicate) should be represented only once (NOT with the original and inverse). The UBKG identifies and adds the inverse relationships using the RO as follows:
| predicate | form of inverse relationship | comment |
|---|---|---|
| IRI of relationship property in RO | Resolution via RO, using the IRI | |
| String that corresponds to the relation label (e.g., “has gene product”) | Resolution via RO, using relationship labels | not as precise as IRI |
| Custom string | Appends “inverse_” to the string | If the custom string actually corresponds to a relationship in RO with an ambiguous inverse, UBKG will create a new inverse relationship. |
| IRI of relationship property from ontology other than RO | none | The UBKG will ignore these relationships. |
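
The first business rule above (every node in edges.tsv is either defined in nodes.tsv or already present in the UBKG) can be checked before submission. The sketch below assumes that the caller supplies the set of node IDs already in the UBKG, for example compiled by hand from the SAB reference; the function name find_unresolved_nodes is hypothetical. IDs are compared as exact strings, so it also assumes that both files format node IDs identically.

```python
def find_unresolved_nodes(edges, defined_ids, existing_ids):
    """Return node IDs used in edges that are neither defined nor existing.

    edges: iterable of (subject, predicate, object) triples.
    defined_ids: set of node_id values from nodes.tsv.
    existing_ids: set of node IDs assumed to be in the UBKG already.
    """
    unresolved = []
    seen = set()
    for subject, _predicate, obj in edges:
        for node_id in (subject, obj):
            if node_id in seen:
                continue
            seen.add(node_id)
            if node_id not in defined_ids and node_id not in existing_ids:
                unresolved.append(node_id)
    return unresolved


triples = [
    ("PATO:0001776", "isa", "PATO:0001544"),
    ("CL:0000101", "capable of", "GO:0050906"),
]
missing = find_unresolved_nodes(
    triples,
    defined_ids={"PATO:0001776", "PATO:0001544"},
    existing_ids={"GO:0050906"},
)
```

Here missing contains only CL:0000101, the one node that the generation framework would ignore under this rule.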

Best Practices for cross-references

The items in the node_dbxrefs field of nodes.tsv establish cross-references. The degree to which an ontology integrates with the UBKG depends directly on cross-references: an ontology whose nodes cross-reference nodes in other ontologies is more likely to contribute to new relationships, while an ontology with few cross-references will essentially exist independently of the rest of the UBKG.

A best practice for cross-references is: the closer a cross-reference is to a UMLS CUI, the better.

The UBKG ingestion preferentially selects UMLS CUIs from the list in node_dbxrefs. It is not necessary to list both a UMLS CUI and a code from an ontology that is already in the UMLS. If it is not feasible to cross-reference a UMLS CUI, then try to cross-reference a code from a published, preferably OBO-compliant ontology.
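
The preference for UMLS CUIs can be pictured with the following sketch. The guide does not describe the ingestion's selection logic beyond "preferentially selects", so this is one plausible reading: use the UMLS cross-references when any are present, and fall back to the others when none are. The function name preferred_dbxrefs is hypothetical.

```python
def preferred_dbxrefs(dbxrefs):
    """Return the UMLS cross-references if any exist, else all of them.

    dbxrefs: list of strings in the form SAB:code or UMLS:CUI.
    """
    # Case-insensitive match, because the example in this guide is lowercase.
    umls = [ref for ref in dbxrefs if ref.upper().startswith("UMLS:")]
    return umls if umls else list(dbxrefs)
```

Under this reading, the list ["umls:c0007799", "fma:78447"] reduces to the UMLS entry alone, which is consistent with the advice that listing both is unnecessary.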

Cross-references vs. isa

A cross-reference (dbxref) is not the same thing as an isa relationship.

isa

An isa assertion between two nodes indicates that the subject node relates to the object node hierarchically.

For example, the Mammalian Phenotype Ontology (MP) asserts that MP:0009920 (abnormal T2 stage B cell morphology) isa MP:0008188 (abnormal transitional stage B cell morphology).

In the UBKG, the Code node for MP:0009920 has a path to the Code node for MP:0008188 through concepts that link via an isa relationship. Each Code node (orange) associates with a separate Concept node (blue). In other words, the UBKG considers the two Codes to represent different concepts.

img.png

dbxref

A dbxref establishes an equivalence between two nodes.

For example, MP cross-references MP:0009920 to CL:0000959 (T2 B cell).

In the UBKG, the Code nodes share the same Concept node. The two Code nodes represent the same Concept.

img_1.png

It is not necessary to link two concepts with both an isa relationship in the edges file and a node_dbxrefs entry in the nodes file.

Another way to think of this is: