Curate datasets¶
Curating a dataset with LaminDB means three things:
1. Validate that the dataset matches a desired schema
2. If validation fails, standardize the dataset (e.g., by fixing typos, mapping synonyms) or update registries
3. Annotate the dataset by linking it against metadata entities so that it becomes queryable
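In code, this loop typically looks like the following sketch. It reuses the curator calls shown later in this guide and assumes a dataframe df and a schema like the ones defined below; the column names "cell_type_by_expert" and "perturbation" come from the mini immuno example.
import lamindb as ln
# assumes `df` and `schema` are defined as in the sections below
curator = ln.curators.DataFrameCurator(df, schema)
try:
    curator.validate()  # 1. validate against the schema
except ln.errors.ValidationError:
    curator.cat.standardize("cell_type_by_expert")  # 2. standardize, e.g., map synonyms
    curator.cat.add_new_from("perturbation")  #    or register new terms
    curator.validate()
artifact = curator.save_artifact(key="examples/my_curated_dataset.parquet")  # 3. annotate & save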
In this guide we’ll curate common data structures. Here is a guide for the underlying low-level API.
Note: If you know either pydantic or pandera, here is an FAQ that compares LaminDB with both of these tools.
# pip install 'lamindb[bionty]'
!lamin init --storage ./test-curate --modules bionty
import lamindb as ln
ln.track("MCeA3reqZG2e")
DataFrame¶
Allow a flexible schema¶
We’ll be working with the mini immuno dataset:
df = ln.core.datasets.mini_immuno.get_dataset1()
df
This is how we curate it in a script.
import lamindb as ln
ln.core.datasets.mini_immuno.define_features_labels()
schema = ln.examples.schemas.valid_features()
df = ln.core.datasets.small_dataset1(otype="DataFrame")
artifact = ln.Artifact.from_df(
    df, key="examples/dataset1.parquet", schema=schema
).save()
artifact.describe()
Let’s run the script.
!python scripts/curate_dataframe_flexible.py
The script defined the following features & labels through define_features_labels():
import lamindb as ln
import bionty as bt
# define valid labels
perturbation_type = ln.ULabel(name="Perturbation", is_type=True).save()
ln.ULabel(name="DMSO", type=perturbation_type).save()
ln.ULabel(name="IFNG", type=perturbation_type).save()
bt.CellType.from_source(name="B cell").save()
bt.CellType.from_source(name="T cell").save()
# define valid features
ln.Feature(name="perturbation", dtype=perturbation_type).save()
ln.Feature(name="cell_type_by_expert", dtype=bt.CellType).save()
ln.Feature(name="cell_type_by_model", dtype=bt.CellType).save()
ln.Feature(name="assay_oid", dtype=bt.ExperimentalFactor.ontology_id).save()
ln.Feature(name="concentration", dtype=str).save()
ln.Feature(name="treatment_time_h", dtype="num", coerce_dtype=True).save()
ln.Feature(name="donor", dtype=str, nullable=True).save()
ln.Feature(name="donor_ethnicity", dtype=list[bt.Ethnicity]).save()
And the following schema through valid_features():
import lamindb as ln
schema = ln.Schema(name="valid_features", itype=ln.Feature).save()
Require a minimal set of columns¶
If we’d like to curate the dataframe with a minimal set of required columns, we can use the following schema.
import lamindb as ln
schema = ln.Schema(
    name="Mini immuno schema",
    features=[
        ln.Feature.get(name="perturbation"),
        ln.Feature.get(name="cell_type_by_model"),
        ln.Feature.get(name="assay_oid"),
        ln.Feature.get(name="donor"),
        ln.Feature.get(name="concentration"),
        ln.Feature.get(name="treatment_time_h"),
    ],
    flexible=True,  # _additional_ columns in a dataframe are validated & annotated
).save()
If the dataframe lacks one of the required columns, we’ll get a validation error.
import lamindb as ln
schema = ln.core.datasets.mini_immuno.define_mini_immuno_schema_flexible()
df = ln.core.datasets.small_dataset1(otype="DataFrame")
df.pop("donor") # remove donor column to trigger validation error
try:
    artifact = ln.Artifact.from_df(
        df, key="examples/dataset1.parquet", schema=schema
    ).save()
except ln.errors.ValidationError as error:
    print(error)
Let’s run the script.
!python scripts/curate_dataframe_minimal_errors.py
Resolve synonyms and typos¶
Let’s now look at the same dataset but assume there are synonyms and typos.
df = ln.core.datasets.mini_immuno.get_dataset1(
    with_cell_type_synonym=True, with_cell_type_typo=True
)
df
Let’s reuse the schema that defines a minimal set of columns we expect in the dataframe.
schema = ln.core.datasets.mini_immuno.define_mini_immuno_schema_flexible()
schema.describe()
Create a curator object using the dataset & the schema.
curator = ln.curators.DataFrameCurator(df, schema)
The validate() method validates that your dataset adheres to the criteria defined by the schema. It identifies which values are already validated (exist in the registries) and which are potentially problematic (do not yet exist in our registries).
try:
    curator.validate()
except ln.errors.ValidationError as error:
    print(error)
# check the non-validated terms
curator.cat.non_validated
For cell_type_by_expert, we saw that “CD8-pos alpha-beta T cell” and “astrocytic glia” are not validated.
First, let’s standardize the synonym “astrocytic glia” as suggested.
curator.cat.standardize("cell_type_by_expert")
# now we have only one non-validated cell type left
curator.cat.non_validated
For “CD8-pos alpha-beta T cell”, let’s understand which cell type in the public ontology might be the actual match.
# to check the correct spelling of categories, pass `public=True` to get a lookup object from public ontologies
# use `lookup = curator.cat.lookup()` to get a lookup object of existing records in your instance
lookup = curator.cat.lookup(public=True)
lookup
# here is an example for the "cell_type_by_expert" column
cell_types = lookup["cell_type_by_expert"]
cell_types.cd8_positive_alpha_beta_t_cell
# fix the cell type name
df["cell_type_by_expert"] = df["cell_type_by_expert"].cat.rename_categories(
{"CD8-pos alpha-beta T cell": cell_types.cd8_positive_alpha_beta_t_cell.name}
)
For perturbation, we want to add the new values “DMSO” and “IFNG”.
# this adds perturbations that were _not_ validated
curator.cat.add_new_from("perturbation")
# validate again
curator.validate()
Save a curated artifact.
artifact = curator.save_artifact(key="examples/my_curated_dataset.parquet")
artifact.describe()
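Because the artifact is now annotated with these features and labels, it can be retrieved via metadata queries. A minimal sketch, assuming the feature-based filtering of ln.Artifact.filter():
import lamindb as ln
# query artifacts by an annotated feature value (sketch)
ln.Artifact.filter(perturbation="IFNG").df()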
AnnData¶
AnnData, like all other data structures that follow, is a composite structure that stores different arrays in different slots.
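For orientation, here is a minimal sketch of such a composite structure built directly with anndata (the gene IDs are the ones used throughout this guide); anndata, numpy, and pandas are assumed to be installed.
import anndata as ad
import numpy as np
import pandas as pd
# a minimal composite structure: an X matrix plus the obs and var slots
adata = ad.AnnData(
    X=np.zeros((2, 3)),
    obs=pd.DataFrame({"perturbation": ["DMSO", "IFNG"]}, index=["sample1", "sample2"]),
    var=pd.DataFrame(index=["ENSG00000153563", "ENSG00000010610", "ENSG00000170458"]),
)
adata.obs  # the slot that a schema's "obs" entry validates
adata.var  # the slot whose transpose a "var.T" entry validates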
Allow a flexible schema¶
We can also allow a flexible schema for an AnnData and only require that it’s indexed with Ensembl gene IDs.
import lamindb as ln
ln.core.datasets.mini_immuno.define_features_labels()
adata = ln.core.datasets.mini_immuno.get_dataset1(otype="AnnData")
schema = ln.examples.schemas.anndata_ensembl_gene_ids_and_valid_features_in_obs()
artifact = ln.Artifact.from_anndata(
    adata, key="examples/mini_immuno.h5ad", schema=schema
).save()
artifact.describe()
Let’s run the script.
!python scripts/curate_anndata_flexible.py
Under the hood, this used the following schema:
import lamindb as ln
import bionty as bt
obs_schema = ln.examples.schemas.valid_features()
varT_schema = ln.Schema(
    name="valid_ensembl_gene_ids", itype=bt.Gene.ensembl_gene_id
).save()
schema = ln.Schema(
    name="anndata_ensembl_gene_ids_and_valid_features_in_obs",
    otype="AnnData",
    slots={"obs": obs_schema, "var.T": varT_schema},
).save()
This schema transposes the var DataFrame during curation, so that one validates and annotates the columns of var.T, i.e., the Ensembl gene IDs [ENSG00000153563, ENSG00000010610, ENSG00000170458]. If one doesn’t transpose, one would annotate with the columns of var, i.e., [gene_symbol, gene_type].
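Concretely, transposing changes which values become the columns that get validated; a sketch, assuming adata is the mini immuno AnnData loaded above:
adata.var.columns  # without transposing, e.g.: ['gene_symbol', 'gene_type']
adata.var.T.columns  # after transposing, e.g.: ['ENSG00000153563', 'ENSG00000010610', 'ENSG00000170458']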

Resolve typos¶
import lamindb as ln
adata = ln.core.datasets.mini_immuno.get_dataset1(
    with_gene_typo=True, with_cell_type_typo=True, otype="AnnData"
)
adata
Check the slots of a schema:
schema.slots
curator = ln.curators.AnnDataCurator(adata, schema)
try:
    curator.validate()
except ln.errors.ValidationError as error:
    print(error)
As above, we leverage a lookup object with valid cell types to find the correct name.
valid_cell_types = curator.slots["obs"].cat.lookup()["cell_type_by_expert"]
adata.obs["cell_type_by_expert"] = adata.obs[
    "cell_type_by_expert"
].cat.rename_categories(
    {"CD8-pos alpha-beta T cell": valid_cell_types.cd8_positive_alpha_beta_t_cell.name}
)
The validated AnnData can be subsequently saved as an Artifact:
adata.obs.columns
Index(['perturbation', 'sample_note', 'cell_type_by_expert',
       'cell_type_by_model', 'assay_oid', 'concentration', 'treatment_time_h',
       'donor'],
      dtype='object')
curator.slots["var.T"].cat.add_new_from("columns")
! using default organism = human
! 1 term not validated in feature 'columns' in slot 'var.T': 'GeneTypo'
→ fix typos, remove non-existent values, or save terms via: curator.slots['var.T'].cat.add_new_from('columns')
curator.validate()
! 1 term not validated in feature 'columns' in slot 'obs': 'sample_note'
→ fix typos, remove non-existent values, or save terms via: curator.slots['obs'].cat.add_new_from('columns')
artifact = curator.save_artifact(key="examples/my_curated_anndata.h5ad")
Access the schema for each slot:
artifact.features.slots
The saved artifact has been annotated with validated features and labels:
artifact.describe()
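To access these annotations programmatically rather than as a printed report, you can pull them into a dictionary; a sketch, assuming the get_values() accessor of the artifact's feature manager:
# retrieve the annotated feature values as a dictionary (sketch)
artifact.features.get_values()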
MuData¶
import lamindb as ln
import bionty as bt
# define the global obs schema
obs_schema = ln.Schema(
    name="mudata_papalexi21_subset_obs_schema",
    features=[
        ln.Feature(name="perturbation", dtype="cat[ULabel[Perturbation]]").save(),
        ln.Feature(name="replicate", dtype="cat[ULabel[Replicate]]").save(),
    ],
).save()
# define the ['rna'].obs schema
obs_schema_rna = ln.Schema(
    name="mudata_papalexi21_subset_rna_obs_schema",
    features=[
        ln.Feature(name="nCount_RNA", dtype=int).save(),
        ln.Feature(name="nFeature_RNA", dtype=int).save(),
        ln.Feature(name="percent.mito", dtype=float).save(),
    ],
).save()
# define the ['hto'].obs schema
obs_schema_hto = ln.Schema(
    name="mudata_papalexi21_subset_hto_obs_schema",
    features=[
        ln.Feature(name="nCount_HTO", dtype=int).save(),
        ln.Feature(name="nFeature_HTO", dtype=int).save(),
        ln.Feature(name="technique", dtype=bt.ExperimentalFactor).save(),
    ],
).save()
# define ['rna'].var schema
var_schema_rna = ln.Schema(
    name="mudata_papalexi21_subset_rna_var_schema",
    itype=bt.Gene.symbol,
    dtype=float,
).save()
# define composite schema
mudata_schema = ln.Schema(
    name="mudata_papalexi21_subset_mudata_schema",
    otype="MuData",
    slots={
        "obs": obs_schema,
        "rna:obs": obs_schema_rna,
        "hto:obs": obs_schema_hto,
        "rna:var": var_schema_rna,
    },
).save()
# curate a MuData
mdata = ln.core.datasets.mudata_papalexi21_subset()
bt.settings.organism = "human" # set the organism to map gene symbols
curator = ln.curators.MuDataCurator(mdata, mudata_schema)
artifact = curator.save_artifact(key="examples/mudata_papalexi21_subset.h5mu")
assert artifact.schema == mudata_schema
!python scripts/curate_mudata.py
SpatialData¶
import lamindb as ln
import bionty as bt
attrs_schema = ln.Schema(
    features=[
        ln.Feature(name="bio", dtype=dict).save(),
        ln.Feature(name="tech", dtype=dict).save(),
    ],
).save()
sample_schema = ln.Schema(
    features=[
        ln.Feature(name="disease", dtype=bt.Disease, coerce_dtype=True).save(),
        ln.Feature(
            name="developmental_stage",
            dtype=bt.DevelopmentalStage,
            coerce_dtype=True,
        ).save(),
    ],
).save()
tech_schema = ln.Schema(
    features=[
        ln.Feature(name="assay", dtype=bt.ExperimentalFactor, coerce_dtype=True).save(),
    ],
).save()
obs_schema = ln.Schema(
    features=[
        ln.Feature(name="sample_region", dtype="str").save(),
    ],
).save()
# Schema enforces only registered Ensembl gene IDs are valid (maximal_set=True)
varT_schema = ln.Schema(itype=bt.Gene.ensembl_gene_id, maximal_set=True).save()
sdata_schema = ln.Schema(
    name="spatialdata_blobs_schema",
    otype="SpatialData",
    slots={
        "attrs:bio": sample_schema,
        "attrs:tech": tech_schema,
        "attrs": attrs_schema,
        "tables:table:obs": obs_schema,
        "tables:table:var.T": varT_schema,
    },
).save()
!python scripts/define_schema_spatialdata.py
import lamindb as ln
spatialdata = ln.core.datasets.spatialdata_blobs()
sdata_schema = ln.Schema.get(name="spatialdata_blobs_schema")
curator = ln.curators.SpatialDataCurator(spatialdata, sdata_schema)
try:
    curator.validate()
except ln.errors.ValidationError:
    pass
spatialdata.tables["table"].var.drop(index="ENSG00000999999", inplace=True)
# validate again (must pass now) and save artifact
artifact = ln.Artifact.from_spatialdata(
    spatialdata, key="examples/spatialdata1.zarr", schema=sdata_schema
).save()
artifact.describe()
!python scripts/curate_spatialdata.py
TiledbsomaExperiment¶
import lamindb as ln
import bionty as bt
import tiledbsoma as soma
import tiledbsoma.io
adata = ln.core.datasets.mini_immuno.get_dataset1(otype="AnnData")
tiledbsoma.io.from_anndata("small_dataset.tiledbsoma", adata, measurement_name="RNA")
obs_schema = ln.Schema(
    name="soma_obs_schema",
    features=[
        ln.Feature(name="cell_type_by_expert", dtype=bt.CellType).save(),
        ln.Feature(name="cell_type_by_model", dtype=bt.CellType).save(),
    ],
).save()
var_schema = ln.Schema(
    name="soma_var_schema",
    features=[
        ln.Feature(name="var_id", dtype=bt.Gene.ensembl_gene_id).save(),
    ],
    coerce_dtype=True,
).save()
soma_schema = ln.Schema(
    name="soma_experiment_schema",
    otype="tiledbsoma",
    slots={
        "obs": obs_schema,
        "ms:RNA.T": var_schema,
    },
).save()
with soma.Experiment.open("small_dataset.tiledbsoma") as experiment:
    curator = ln.curators.TiledbsomaExperimentCurator(experiment, soma_schema)
    curator.validate()
    artifact = curator.save_artifact(
        key="examples/soma_experiment.tiledbsoma",
        description="SOMA experiment with schema validation",
    )
assert artifact.schema == soma_schema
artifact.describe()
!python scripts/curate_soma_experiment.py
Other data structures¶
If you have other data structures, read: How do I validate & annotate arbitrary data structures?.