Linking to External Resources (HERD)

The HERD (HDMF External Resources Data Structure) class lets you map terms used in your data to entities defined in external, web-accessible resources such as ontologies. For example, you may store a species name "Mus musculus" on a Subject and want to link it to the corresponding NCBI Taxonomy term so that the value is standardized and easy to query.

From a user’s perspective, a HERD can be treated as a single table that associates a key (a term used on an object, i.e. a dataset or attribute in the file) with an entity (a term in an external resource, identified by an entity_id and an entity_uri). Internally, HERD stores this in six interlinked tables (keys, files, entities, entity_keys, objects, and object_keys) and provides convenience methods so you rarely need to interact with those tables directly.
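
The single-table view described above can be pictured with a small, library-free sketch. This is a toy model and not HERD's implementation: the names `ToyEntity`, `flatten`, and the list-based tables are hypothetical, and only three of the six tables are modeled. It shows how a link table of index pairs can be joined into one flat row per key and entity association.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToyEntity:
    """Hypothetical stand-in for one row of an entities table."""
    entity_id: str
    entity_uri: str


# Toy stand-ins for the keys and entities tables; a row is addressed by its list index.
keys = ["Mus musculus", "VISp"]
entities = [
    ToyEntity("NCBITaxon:10090", "http://purl.obolibrary.org/obo/NCBITaxon_10090"),
    ToyEntity("MBA:385", "https://purl.brain-bican.org/ontology/mbao/MBA_385"),
]
# Toy link table: each row pairs an entity index with a key index.
entity_keys = [(0, 0), (1, 1)]


def flatten(keys, entities, entity_keys):
    """Join the link table against keys and entities into flat rows."""
    return [
        {
            "key": keys[key_idx],
            "entity_id": entities[entity_idx].entity_id,
            "entity_uri": entities[entity_idx].entity_uri,
        }
        for entity_idx, key_idx in entity_keys
    ]


rows = flatten(keys, entities, entity_keys)
for row in rows:
    print(row["key"], "->", row["entity_id"])
```

Keeping keys and entities in separate tables means the same entity can be linked to several keys without repeating its URI, which is one plausible reason for the interlinked layout.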

This tutorial shows how to create a HERD, annotate objects in an NWB file, store the HERD in the file, and inspect the annotations after reading the file back. For the full HERD API (including add_ref_termset for validating terms against a TermSet, get_key, and compound-data references), see the HDMF HERD tutorial.

from datetime import datetime
from uuid import uuid4

from dateutil.tz import tzlocal

from pynwb import NWBHDF5IO, NWBFile
from pynwb.file import Subject

Create an NWB file

Start with an NWBFile that has a Subject. The subject’s species is the value we will annotate with an external resource.

nwbfile = NWBFile(
    session_description="a demonstration of external resources",
    identifier=str(uuid4()),
    session_start_time=datetime(2018, 4, 25, 2, 30, 3, tzinfo=tzlocal()),
    subject=Subject(subject_id="001", species="Mus musculus"),
)

Get the file’s HERD

Use get_external_resources to get the file’s HERD. A file has at most one HERD, so this returns the existing HERD if the file already has one (for example, when the file was read from disk) and creates and attaches a new empty HERD otherwise. The external_resources attribute returns the HERD without creating one, returning None when the file has no external resources.

herd = nwbfile.get_external_resources()

Add references with add_ref

Use add_ref to add a row that links a key on an object to an external entity. Here we link the subject’s species to the NCBI Taxonomy entry for Mus musculus. The subject must be part of a file before a reference is added to it.

An entity is identified by an entity_id and an entity_uri. The entity_id is a compact URI (CURIE) of the form prefix:identifier whose prefix is registered with bioregistry.io, such as NCBITaxon for the NCBI Taxonomy. The entity_uri is the persistent URL the CURIE resolves to, which you can look up at https://bioregistry.io/<entity_id>.
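
As a small illustration of the CURIE form described above, the following sketch splits an entity_id into its prefix and identifier and builds the bioregistry.io lookup address. The helper names `split_curie` and `bioregistry_lookup_url` are hypothetical and not part of PyNWB or HDMF; the sketch only checks the shape of the string and does not verify that the prefix is actually registered.

```python
def split_curie(entity_id: str) -> tuple[str, str]:
    """Split a CURIE of the form prefix:identifier into its two parts."""
    prefix, sep, identifier = entity_id.partition(":")
    if not sep or not prefix or not identifier:
        raise ValueError(f"not a CURIE of the form prefix:identifier: {entity_id!r}")
    return prefix, identifier


def bioregistry_lookup_url(entity_id: str) -> str:
    """Return the bioregistry.io address where the CURIE can be looked up."""
    split_curie(entity_id)  # raises ValueError if the string is not CURIE-shaped
    return f"https://bioregistry.io/{entity_id}"


print(split_curie("NCBITaxon:10090"))             # ('NCBITaxon', '10090')
print(bioregistry_lookup_url("NCBITaxon:10090"))  # https://bioregistry.io/NCBITaxon:10090
```

A check like this can catch a plain term such as "Mus musculus" being passed where an entity_id is expected before it reaches add_ref.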

herd.add_ref(
    container=nwbfile.subject,
    key=nwbfile.subject.species,
    entity_id="NCBITaxon:10090",
    entity_uri="http://purl.obolibrary.org/obo/NCBITaxon_10090",
)

References can also point to an attribute of an object, such as a column of a table. Here we record the brain region of a set of electrodes in the electrodes table and link the region to the corresponding structure in the Allen Mouse Brain Atlas. When the target is a column, pass the table as the container and the column name as the attribute; HERD resolves the reference to the column object itself.

Note

This same container plus attribute form also works for ragged columns (those backed by a VectorIndex): add_ref(container=table, attribute="col", ...) annotates the column’s VectorData, which holds the actual values used as keys. Do not annotate the column with add_ref(container=table["col"], attribute=None, ...): for a ragged column, table["col"] is the VectorIndex (the integer offsets into the VectorData), so HERD would annotate the index instead of the values.
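
To see why the index is the wrong target, consider this library-free sketch of a ragged column. The lists `values` and `offsets` are toy stand-ins for a VectorData and its VectorIndex, with made-up contents, and `ragged_rows` is a hypothetical helper. The sketch assumes each offset marks the end of a row in the flat values.

```python
# Flat values, as the VectorData of a ragged column would hold them.
# These strings are the terms that serve as keys.
values = ["VISp", "VISl", "CA1", "VISp"]

# End offset of each row into `values`, as the VectorIndex would hold them.
# These integers contain no terms at all.
offsets = [2, 3, 4]


def ragged_rows(values, offsets):
    """Rebuild the per-row lists from flat values and end offsets."""
    rows, start = [], 0
    for end in offsets:
        rows.append(values[start:end])
        start = end
    return rows


rows = ragged_rows(values, offsets)
print(rows)                 # [['VISp', 'VISl'], ['CA1'], ['VISp']]
print(sorted(set(values)))  # ['CA1', 'VISl', 'VISp']
```

Every candidate key lives in the flat values, so an annotation attached to the offsets would describe an object that holds none of the annotated terms.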

device = nwbfile.create_device(name="probe")
electrode_group = nwbfile.create_electrode_group(
    name="shank0",
    description="a shank of the recording probe",
    location="VISp",
    device=device,
)
for _ in range(4):
    nwbfile.add_electrode(location="VISp", group=electrode_group)

herd.add_ref(
    container=nwbfile.electrodes,
    attribute="location",
    key="VISp",
    entity_id="MBA:385",
    entity_uri="https://purl.brain-bican.org/ontology/mbao/MBA_385",
)

Inspect the HERD

to_dataframe flattens the interlinked tables into a single DataFrame, with one row per (object, key, entity) association.

herd.to_dataframe()
   file_object_id                        objects_idx  object_id                             files_idx  object_type  relative_path  field  keys_idx  key           entities_idx  entity_id        entity_uri
0  7b57d1b9-79a0-467f-972e-dec54780330e  0            cdd979e3-132c-45de-a128-aa1c4b5c221a  0          Subject                            0         Mus musculus  0             NCBITaxon:10090  http://purl.obolibrary.org/obo/NCBITaxon_10090
1  7b57d1b9-79a0-467f-972e-dec54780330e  1            3cea8c50-d9ca-497d-a43e-891de7676ae1  0          VectorData                         1         VISp          1             MBA:385          https://purl.brain-bican.org/ontology/mbao/MBA...


You can also view the individual tables. Each is a DynamicTable and has its own to_dataframe method.

herd.keys.to_dataframe()
   key
0  Mus musculus
1  VISp


herd.entities.to_dataframe()
   entity_id        entity_uri
0  NCBITaxon:10090  http://purl.obolibrary.org/obo/NCBITaxon_10090
1  MBA:385          https://purl.brain-bican.org/ontology/mbao/MBA...


get_object_type returns all annotations for objects of a given type, for example every annotated Subject.

herd.get_object_type(object_type="Subject")
   file_object_id                        objects_idx  object_id                             files_idx  object_type  relative_path  field  keys_idx  key           entities_idx  entity_id        entity_uri
0  7b57d1b9-79a0-467f-972e-dec54780330e  0            cdd979e3-132c-45de-a128-aa1c4b5c221a  0          Subject                            0         Mus musculus  0             NCBITaxon:10090  http://purl.obolibrary.org/obo/NCBITaxon_10090


Write and read the NWB file

Writing the file stores the HERD inside it. Reading the file back makes the HERD available again through the external_resources attribute.

filename = "external_resources_tutorial.nwb"
with NWBHDF5IO(filename, mode="w") as io:
    io.write(nwbfile)

read_io = NWBHDF5IO(filename, mode="r")
read_nwbfile = read_io.read()
read_herd = read_nwbfile.external_resources

Access the loaded data

The loaded HERD provides the same accessors as before. In a Jupyter notebook, displaying the HERD renders the flattened references as a table, and to_dataframe returns that same table as a DataFrame. The individual tables give a more focused view.

read_herd.to_dataframe()
   file_object_id                        objects_idx  object_id                             files_idx  object_type  relative_path  field  keys_idx  key           entities_idx  entity_id        entity_uri
0  7b57d1b9-79a0-467f-972e-dec54780330e  0            cdd979e3-132c-45de-a128-aa1c4b5c221a  0          Subject                            0         Mus musculus  0             NCBITaxon:10090  http://purl.obolibrary.org/obo/NCBITaxon_10090
1  7b57d1b9-79a0-467f-972e-dec54780330e  1            3cea8c50-d9ca-497d-a43e-891de7676ae1  0          VectorData                         1         VISp          1             MBA:385          https://purl.brain-bican.org/ontology/mbao/MBA...


View the individual tables, for example:

read_herd.keys.to_dataframe()
   key
0  Mus musculus
1  VISp


get_object_entities returns the entities annotated on a single object as a DataFrame. Here we view the species annotation stored for the subject:

read_herd.get_object_entities(container=read_nwbfile.subject)
   entity_id        entity_uri
0  NCBITaxon:10090  http://purl.obolibrary.org/obo/NCBITaxon_10090


Close the file once you are done reading from it.

read_io.close()

Alternative: store a HERD outside an NWB file

A HERD can also be saved independently of an NWB file as a zip archive of the underlying tables using to_zip, and read back with from_zip. This is useful when external resources span multiple files; see Annotating Multiple Streamed NWB Files with a Single HERD for an example that annotates many NWB files with a single HERD. For the full HERD API, see the HDMF HERD tutorial.
