Manage biological registries¶
This guide shows how to manage metadata for basic biological entities with the bionty plugin.
# pip install 'lamindb[bionty]'
!lamin init --storage ./test-registries --modules bionty
import lamindb as ln
import bionty as bt
→ connected lamindb: testuser1/test-registries
Import records from public ontologies¶
Let’s first populate our CellType registry with the default public ontology (Cell Ontology).
# [optional] inspect the available public ontology versions
bt.Source.df()
# [optional] inspect which version we're about to import
bt.Source.get(entity="bionty.CellType", currently_used=True)
# populate the database with the public ontology
bt.CellType.import_source()
This is now your in-house cell type registry in which you can add & modify records as you like.
# all public cell types are now available in LaminDB
bt.CellType.df()
# let's also populate the Gene registry with human and mouse genes
bt.Gene.import_source(organism="human")
bt.Gene.import_source(organism="mouse")
! Starting bulk_create for 75829 Gene records in batches of 10000
! Starting bulk_create for 57510 Gene records in batches of 10000
Access records in in-house registries¶
Search by keyword:
bt.CellType.search("gamma-delta T").df().head(2)
id | uid | name | ontology_id | abbr | synonyms | description | space_id | source_id | run_id | created_at | created_by_id | _aux | branch_id
---|---|---|---|---|---|---|---|---|---|---|---|---|---
780 | 1HuNn2EP | gamma-delta T cell | CL:0000798 | None | gamma-delta T-cell|gamma-delta T lymphocyte|ga... | A T Cell That Expresses A Gamma-Delta T Cell R... | 1 | 16 | None | 2025-07-15 14:30:03.681000+00:00 | 1 | None | 1
781 | 70lHcCNw | immature gamma-delta T cell | CL:0000799 | None | immature gamma-delta T lymphocyte|immature gam... | A Gamma-Delta T Cell That Has An Immature Phen... | 1 | 16 | None | 2025-07-15 14:30:03.681000+00:00 | 1 | None | 1
Or look up with auto-complete:
cell_types = bt.CellType.lookup()
hsc_record = cell_types.hematopoietic_stem_cell
hsc_record
CellType(uid='2U8xapxu', name='hematopoietic stem cell', ontology_id='CL:0000037', synonyms='hemopoietic stem cell|blood forming stem cell', description='A Stem Cell From Which All Cells Of The Lymphoid And Myeloid Lineages Develop, Including Blood Cells And Cells Of The Immune System. Hematopoietic Stem Cells Lack Cell Markers Of Effector Cells (Lin-Negative). Lin-Negative Is Defined By Lacking One Or More Of The Following Cell Surface Markers: Cd2, Cd3 Epsilon, Cd4, Cd5 ,Cd8 Alpha Chain, Cd11B, Cd14, Cd19, Cd20, Cd56, Ly6G, Ter119.', branch_id=1, space_id=1, created_by_id=1, source_id=16, created_at=2025-07-15 14:30:03 UTC)
Filter by fields and relationships:
gdt_cell = bt.CellType.get(ontology_id="CL:0000798", created_by__handle="testuser1")
gdt_cell
CellType(uid='1HuNn2EP', name='gamma-delta T cell', ontology_id='CL:0000798', synonyms='gamma-delta T-cell|gamma-delta T lymphocyte|gammadelta T cell|gamma-delta T-lymphocyte', description='A T Cell That Expresses A Gamma-Delta T Cell Receptor Complex.', branch_id=1, space_id=1, created_by_id=1, source_id=16, created_at=2025-07-15 14:30:03 UTC)
View the ontological hierarchy:
gdt_cell.view_parents() # pass with_children=True to also view children
Or access the parents and children directly:
gdt_cell.parents.df()
gdt_cell.children.df()
You can also query parents or children recursively, which returns the direct parents (or children), their parents (or children), and so forth.
gdt_cell.query_parents().df()
gdt_cell.query_children().df()
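To make the difference between direct and recursive queries concrete, here is a minimal sketch of a recursive parent walk over a toy graph. It does not use lamindb; the `DIRECT_PARENTS` mapping and the `all_ancestors` helper are hypothetical names, and the edges are illustrative rather than copied from the Cell Ontology.

```python
# Toy parent graph: term name -> names of its direct parents.
# Illustrative edges only; not taken from a real ontology release.
DIRECT_PARENTS = {
    "gamma-delta T cell": ["T cell"],
    "T cell": ["lymphocyte"],
    "lymphocyte": ["leukocyte"],
    "leukocyte": [],
}


def all_ancestors(name, direct_parents):
    """Collect every ancestor of `name` by following parent links breadth-first."""
    seen = []
    queue = list(direct_parents.get(name, []))
    while queue:
        current = queue.pop(0)
        if current in seen:
            continue  # a term can be reachable via more than one path
        seen.append(current)
        queue.extend(direct_parents.get(current, []))
    return seen


direct = DIRECT_PARENTS["gamma-delta T cell"]
ancestors = all_ancestors("gamma-delta T cell", DIRECT_PARENTS)
print(direct)     # direct parents only
print(ancestors)  # direct parents, their parents, and so forth
```

In this toy, `direct` plays the role of `.parents` and `ancestors` the role of `.query_parents()`; the same walk over a child mapping would give the recursive children.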
You can construct custom hierarchies of records:
# register a new cell type
my_celltype = bt.CellType(name="my new T-cell subtype").save()
# specify "gamma-delta T cell" as a parent
my_celltype.parents.add(gdt_cell)
# visualize hierarchy
gdt_cell.view_parents(distance=2, with_children=True)
Create records from values¶
When accessing datasets, one often encounters bulk references to entities that might be corrupted or standardized using different standardization schemes.
Let’s consider an example based on an AnnData object. In the cell_type annotations of this AnnData object, we find 4 references to cell types:
adata = ln.core.datasets.anndata_with_obs()
adata.obs.cell_type.value_counts()
We’d like to load the corresponding records in our in-house registry to annotate a dataset.
To this end, you’ll typically use from_values, which will both validate & retrieve records that match the values.
cell_types = bt.CellType.from_values(adata.obs.cell_type)
cell_types
The log reports that 3 cell types were validated. Because from_values loaded the matching records at the same time, we can use them right away to annotate a dataset.
What happened under-the-hood?
.from_values() performs the following lookups:
1. If registry records match the values, load these records
2. If values match synonyms of registry records, load these records
3. If no record in the registry matches, attempt to load records from a public ontology
4. Same as 3. but based on synonyms
No records will be returned if all 4 lookups are unsuccessful.
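The lookup order above can be sketched with plain dictionaries. This is a toy model under stated assumptions, not bionty's implementation: `REGISTRY`, `PUBLIC`, `find`, and `resolve` are hypothetical names, and the terms and synonyms are made-up sample data.

```python
# Hypothetical stand-ins for the in-house registry and a public ontology:
# each maps a standardized name to a list of synonyms (sample data only).
REGISTRY = {"hematopoietic stem cell": ["blood forming stem cell"]}
PUBLIC = {
    "hematopoietic stem cell": ["blood forming stem cell"],
    "adipocyte": ["fat cell"],
    "hepatocyte": [],
}


def find(value, terms):
    """Return the name matching `value`: exact name first, then synonym."""
    if value in terms:
        return value
    for name, synonyms in terms.items():
        if value in synonyms:
            return name
    return None


def resolve(values, registry, public):
    """Map each value to (name, origin); values without any match are left out."""
    resolved = {}
    for value in values:
        # Registry first (lookups 1 and 2), then the public ontology (3 and 4).
        for origin, terms in (("registry", registry), ("public", public)):
            name = find(value, terms)
            if name is not None:
                resolved[value] = (name, origin)
                break
    return resolved


values = ["blood forming stem cell", "hepatocyte", "fat cell", "my cell"]
result = resolve(values, REGISTRY, PUBLIC)
for value, (name, origin) in result.items():
    print(f"{value!r} -> {name!r} ({origin})")
```

Here "my cell" fails all four lookups, so it is absent from `result`, which mirrors the statement that no record is returned for such a value.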
Sometimes, it’s useful to treat validated records differently from non-validated records. Here is a way:
original_values = ["gut", "gut2"]
inspector = bt.Tissue.inspect(original_values)
records_from_validated_values = bt.Tissue.from_values(inspector.validated)
Alternatively, we can retrieve records based on ontology ids:
adata.obs.cell_type_id.unique().tolist()
bt.CellType.from_values(adata.obs.cell_type_id, field=bt.CellType.ontology_id)
Validate & standardize¶
Simple validation of an iterable of values works like so:
bt.CellType.validate(["fat cell", "blood forming stem cell"])
Because these values don’t comply with the registry, they’re not validated!
You can easily convert these values to validated standardized names based on synonyms like so:
bt.CellType.standardize(["fat cell", "blood forming stem cell"])
Alternatively, you can use .from_values(), which will only ever return validated records and automatically standardize under-the-hood:
bt.CellType.from_values(["fat cell", "blood forming stem cell"])
If you are not sure what to do, use .inspect() to get instructions:
bt.CellType.inspect(["fat cell", "blood forming stem cell"]);
We can also add new synonyms to a record:
hsc_record.add_synonym("HSC")
When we now encounter this synonym as a value, it is standardized via synonym lookup and mapped to the correct registry record:
bt.CellType.standardize(["HSC"])
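As a rough illustration of synonym-based standardization, the sketch below builds a synonym-to-name mapping from pipe-delimited synonym strings like the ones in the record outputs above. It is a toy, not the library's code: `RECORDS`, `build_synonym_mapper`, and `standardize` are hypothetical, and the choice to leave unmatched values unchanged is an assumption of this sketch.

```python
# (name, pipe-delimited synonyms), mirroring the `synonyms` field shown above.
# The trailing "HSC" stands for the synonym added in this section.
RECORDS = [
    ("hematopoietic stem cell", "hemopoietic stem cell|blood forming stem cell|HSC"),
    ("gamma-delta T cell", "gamma-delta T-cell|gamma-delta T lymphocyte"),
]


def build_synonym_mapper(records):
    """Map every synonym to the standardized name of its record."""
    mapper = {}
    for name, synonyms in records:
        for synonym in synonyms.split("|"):
            if synonym:  # skip empty entries from empty synonym strings
                mapper[synonym] = name
    return mapper


def standardize(values, records):
    """Replace synonyms by names; names and unmatched values pass through."""
    names = {name for name, _ in records}
    mapper = build_synonym_mapper(records)
    return [v if v in names else mapper.get(v, v) for v in values]


standardized = standardize(["HSC", "gamma-delta T cell", "unknown cell"], RECORDS)
print(standardized)
```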
A special synonym is .abbr (short for abbreviation), which has its own field and can be assigned via:
hsc_record.set_abbr("HSC")
You can create a lookup object from the .abbr field:
cell_types = bt.CellType.lookup("abbr")
cell_types.hsc
The same workflow works for all of bionty’s registries.
Manage registries across organisms¶
Several registries are organism-aware (they have an .organism field), for instance, Gene.
In this case, API calls that interact with multi-organism registries require an organism argument when there’s ambiguity.
For instance, when validating gene symbols:
bt.Gene.validate(["TCF7", "ABC1"], organism="human")
In contrast, working with Ensembl Gene IDs doesn’t require passing organism, as there’s no ambiguity:
bt.Gene.validate(
    ["ENSG00000000419", "ENSMUSG00002076988"], field=bt.Gene.ensembl_gene_id
)
! 1 unique term (50.00%) is not validated for ensembl_gene_id: 'ENSMUSG00002076988'
array([ True, False])
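One way to see why Ensembl gene IDs are unambiguous: the stable ID itself carries a species prefix (`ENSG` for human, `ENSMUSG` for mouse). The sketch below only illustrates that property; `SPECIES_PREFIXES` and `organism_from_ensembl_id` are hypothetical names, and this is not a claim about how bionty resolves the organism internally.

```python
# Species prefixes of Ensembl gene stable IDs for the two organisms used here.
SPECIES_PREFIXES = {"ENSG": "human", "ENSMUSG": "mouse"}


def organism_from_ensembl_id(gene_id):
    """Return the organism implied by the ID prefix, or None if unknown."""
    for prefix, organism in SPECIES_PREFIXES.items():
        if gene_id.startswith(prefix):
            return organism
    return None


gene_ids = ["ENSG00000000419", "ENSMUSG00002076988"]
organisms = [organism_from_ensembl_id(gene_id) for gene_id in gene_ids]
print(organisms)
```

The prefix identifies the organism but says nothing about whether the ID exists in the registry, which is why the second ID above can still fail validation.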
When working with the same organism throughout your analysis/workflow, you can omit the organism argument by configuring it globally:
bt.settings.organism = "mouse"
bt.Gene.from_source(symbol="Ap5b1")
! using default organism = mouse
Gene(uid='3b8mHb0MRal4', symbol='Ap5b1', ensembl_gene_id='ENSMUSG00000049562', biotype='protein_coding', synonyms='Gm962', description='adaptor-related protein complex 5, beta 1 subunit ', branch_id=1, space_id=1, created_by_id=1, source_id=8, organism_id=2, created_at=2025-07-15 14:33:56 UTC)
Track underlying ontology source versions¶
Under-the-hood, source ontology versions are automatically tracked for each registry:
bt.Source.filter(currently_used=True).df()
Each record is linked to a versioned public source (if it was created from a public source):
hepatocyte = bt.CellType.get(name="hepatocyte")
hepatocyte.source
Create records from specific source¶
By default, new records are imported or created from the "currently_used" public sources, which are configured during instance initialization, e.g.:
bt.Source.filter(entity="bionty.Phenotype", currently_used=True).df()
Sometimes, the default source doesn’t contain the ontology term you are looking for.
In that case, you can create the record from a non-default source. For instance, we can use the ncbitaxon ontology:
source = bt.Source.get(entity="bionty.Organism", name="ncbitaxon")
source
Source(uid='4tsksCMX', entity='bionty.Organism', organism='all', name='ncbitaxon', version='2023-06-20', in_db=False, currently_used=True, description='NCBItaxon Ontology', url='http://purl.obolibrary.org/obo/ncbitaxon/2023-06-20/ncbitaxon.owl', source_website='https://github.com/obophenotype/ncbitaxon', branch_id=1, space_id=1, created_by_id=1, created_at=2025-07-15 14:29:55 UTC)
# validate against the NCBI Taxonomy
bt.Organism.validate(
    ["iris setosa", "iris versicolor", "iris virginica"], source=source
)
# since we didn't seed the Organism registry with the NCBITaxon public ontology,
# we need to save the records to the database
records = bt.Organism.from_values(
    ["iris setosa", "iris versicolor", "iris virginica"], source=source
).save()
# now we can query an iris organism and view its parents and children
bt.Organism.get(name="iris").view_parents(with_children=True)
Access any Ensembl genes¶
Genes from all Ensembl versions and organisms can be accessed, even though they are not yet present in the bt.Source registry.
For instance, if you want to use rabbit genes from Ensembl version release-103:
# pip install pymysql
import bionty as bt
# automatically download genes for a new organism
gene_ontology = bt.base.Gene(source="ensembl", organism="rabbit", version="release-103")
# register the new source in lamindb
gene_ontology.register_source_in_lamindb()
# now you can start using this source
# import all genes from this source to your Gene registry
source = bt.Source.get(entity="bionty.Gene", name="ensembl", organism="rabbit", version="release-103")
bt.Gene.import_source(source=source)