Pydantic & Pandera vs. LaminDB¶
This doc explains the conceptual differences between data validation with pydantic, pandera, and lamindb.
!lamin init --storage test-pydantic-pandera --modules bionty
Let us work with a test dataframe.
import pandas as pd
import pydantic
from typing import Literal
import lamindb as ln
import bionty as bt
import pandera
df = ln.core.datasets.small_dataset1()
df
→ connected lamindb: testuser1/test-pydantic-pandera
| | ENSG00000153563 | ENSG00000010610 | ENSG00000170458 | perturbation | sample_note | cell_type_by_expert | cell_type_by_model | assay_oid | concentration | treatment_time_h | donor | donor_ethnicity |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| sample1 | 1 | 3 | 5 | DMSO | was ok | B cell | B cell | EFO:0008913 | 0.1% | 24 | D0001 | [Chinese, Singaporean Chinese] |
| sample2 | 2 | 4 | 6 | IFNG | looks naah | CD8-positive, alpha-beta T cell | T cell | EFO:0008913 | 200 nM | 24 | D0002 | [Chinese, Han Chinese] |
| sample3 | 3 | 5 | 7 | DMSO | pretty! 🤩 | CD8-positive, alpha-beta T cell | T cell | EFO:0008913 | 0.1% | 6 | None | [Chinese] |
Define a schema¶
pydantic¶
Perturbation = Literal["DMSO", "IFNG"]
CellType = Literal["T cell", "B cell"]
OntologyID = Literal["EFO:0008913"]


class ImmunoSchema(pydantic.BaseModel):
    perturbation: Perturbation
    cell_type_by_model: CellType
    cell_type_by_expert: CellType
    assay_oid: OntologyID
    concentration: str
    treatment_time_h: int
    donor: str | None

    class Config:
        title = "My immuno schema"
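A pydantic model validates one record at a time, so a single row is checked by constructing the model from keyword arguments. A minimal sketch, assuming pydantic v2; the two-field `RowSchema` below is a trimmed stand-in for `ImmunoSchema`:

```python
import pydantic
from typing import Literal

Perturbation = Literal["DMSO", "IFNG"]


class RowSchema(pydantic.BaseModel):
    perturbation: Perturbation
    treatment_time_h: int


# a row matching the Literal vocabulary parses fine
row = RowSchema(perturbation="DMSO", treatment_time_h=24)

# a value outside the Literal vocabulary raises a ValidationError
try:
    RowSchema(perturbation="TNFA", treatment_time_h=24)
except pydantic.ValidationError:
    print("rejected")
```

Because the unit of validation is a single record, validating a dataframe requires iterating over its rows, as done further below.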
pandera¶
# Define the Pandera schema using DataFrameSchema
pandera_schema = pandera.DataFrameSchema(
    {
        "perturbation": pandera.Column(
            str, checks=pandera.Check.isin(["DMSO", "IFNG"])
        ),
        "cell_type_by_model": pandera.Column(
            str, checks=pandera.Check.isin(["T cell", "B cell"])
        ),
        "cell_type_by_expert": pandera.Column(
            str, checks=pandera.Check.isin(["T cell", "B cell"])
        ),
        "assay_oid": pandera.Column(str, checks=pandera.Check.isin(["EFO:0008913"])),
        "concentration": pandera.Column(str),
        "treatment_time_h": pandera.Column(int),
        "donor": pandera.Column(str, nullable=True),
    },
    name="My immuno schema",
)
lamindb¶
Features & labels are defined at the level of the database instance. You can either define a schema with required (and optional) columns:
ln.ULabel(name="DMSO").save() # define a DMSO label
ln.ULabel(name="IFNG").save() # define an IFNG label
# leverage ontologies through types ln.ULabel, bt.CellType, bt.ExperimentalFactor
lamindb_schema = ln.Schema(
    name="My immuno schema",
    features=[
        ln.Feature(name="perturbation", dtype=ln.ULabel).save(),
        ln.Feature(name="cell_type_by_model", dtype=bt.CellType).save(),
        ln.Feature(name="cell_type_by_expert", dtype=bt.CellType).save(),
        ln.Feature(name="assay_oid", dtype=bt.ExperimentalFactor.ontology_id).save(),
        ln.Feature(name="concentration", dtype=str).save(),
        ln.Feature(name="treatment_time_h", dtype=int).save(),
        ln.Feature(name="donor", dtype=str, nullable=True).save(),
    ],
).save()
Or merely define a constraint on the feature identifier.
lamindb_schema_only_itype = ln.Schema(
    name="Allow any valid features & labels", itype=ln.Feature
)
Validate a dataframe¶
pydantic¶
class DataFrameValidationError(Exception):
    pass


def validate_dataframe(df: pd.DataFrame, model: type[pydantic.BaseModel]):
    errors = []
    for i, row in enumerate(df.to_dict(orient="records")):
        try:
            model(**row)
        except pydantic.ValidationError as e:
            errors.append(f"row {i} failed validation: {e}")

    if errors:
        error_message = "\n".join(errors)
        raise DataFrameValidationError(
            f"DataFrame validation failed with the following errors:\n{error_message}"
        )
try:
    validate_dataframe(df, ImmunoSchema)
except DataFrameValidationError as e:
    print(e)
To fix the validation error, we need to update the Literal and re-run the model definition.
Perturbation = Literal["DMSO", "IFNG"]
CellType = Literal[
    "T cell", "B cell", "CD8-positive, alpha-beta T cell"  # <-- updated
]
OntologyID = Literal["EFO:0008913"]


class ImmunoSchema(pydantic.BaseModel):
    perturbation: Perturbation
    cell_type_by_model: CellType
    cell_type_by_expert: CellType
    assay_oid: OntologyID
    concentration: str
    treatment_time_h: int
    donor: str | None

    class Config:
        title = "My immuno schema"
validate_dataframe(df, ImmunoSchema)
pandera¶
try:
    pandera_schema.validate(df)
except pandera.errors.SchemaError as e:
    print(e)
lamindb¶
Because the term "CD8-positive, alpha-beta T cell" is part of the public CellType ontology, validation passes the first time.
Had validation not passed, we could have resolved the issue simply by adding the new term to the CellType registry (e.g., via `bt.CellType.from_source(name="...").save()`) rather than editing code. This also puts downstream data scientists in a position to update ontologies themselves.
curator = ln.curators.DataFrameCurator(df, lamindb_schema)
curator.validate()
What was the cell type validation based on? Let’s inspect the CellType registry.
bt.CellType.df()
The CellType registry is hierarchical as it contains the Cell Ontology.
bt.CellType.get(name="CD8-positive, alpha-beta T cell").view_parents()
Overview of validation properties¶
Importantly, LaminDB offers not only a DataFrameCurator, but also an AnnDataCurator, MuDataCurator, SpatialDataCurator, and TiledbsomaCurator.
The below overview only concerns validating dataframes.
Experience of data engineer¶
| property | pydantic | pandera | lamindb |
|---|---|---|---|
| define schema as code | yes, in form of a `BaseModel` | yes, in form of a `DataFrameSchema` | yes, in form of a `Schema` |
| define schema as a set of constraints without the need of listing fields/columns/features; e.g. useful if validating 60k genes | no | no | yes |
| update labels independent of code | not possible because labels are enums/literals | not possible because labels are hard-coded in `Check` objects | possible by adding new terms to a registry |
| built-in validation from public ontologies | no | no | yes |
| sync labels with ELN/LIMS registries without code change | no | no | yes |
| can re-use fields/columns/features across schemas | limited, via subclassing | only in the same Python session | yes, because they are persisted in the database |
| schema modifications can invalidate previously validated datasets | yes | yes | no, because LaminDB allows querying datasets that were validated with a given schema version |
| can use columnar organization of dataframe | no, needs to iterate over potentially millions of rows | yes | yes |
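To illustrate the "limited, via subclassing" row: with pydantic, fields can be shared across schemas by inheriting from a base model. A minimal sketch; the model names are made up for illustration:

```python
import pydantic
from typing import Literal

Perturbation = Literal["DMSO", "IFNG"]


class CoreSchema(pydantic.BaseModel):
    perturbation: Perturbation
    concentration: str


# inherits perturbation & concentration from CoreSchema, adds one field
class ExtendedSchema(CoreSchema):
    treatment_time_h: int


row = ExtendedSchema(perturbation="IFNG", concentration="200 nM", treatment_time_h=24)
```

The re-use is limited because it only works within one codebase: the shared fields live in Python code, not in a database that other schemas and teams can reference.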
Experience of data consumer¶
| property | pydantic | pandera | lamindb |
|---|---|---|---|
| dataset is queryable / findable | no | no | yes, by querying for labels & features |
| dataset is annotated | no | no | yes |
| user knows what validation constraints were applied | no, because they might not have access to the code and don’t know which code was run | no (same as pydantic) | yes, via `artifact.schema` |
Annotation & queryability¶
Engineer: annotate the dataset¶
Either use the Curator object:
artifact = curator.save_artifact(key="our_datasets/dataset1.parquet")
If you don’t expect a need for Curator functionality for updating ontologies and standardization, you can also use the Artifact constructor.
artifact = ln.Artifact.from_df(
    df, key="our_datasets/dataset1.parquet", schema=lamindb_schema
).save()
Consumer: see annotations¶
artifact.describe()
Consumer: query the dataset¶
ln.Artifact.filter(perturbation="IFNG").df()
| id | uid | key | description | suffix | kind | otype | size | hash | n_files | n_observations | _hash_type | _key_is_virtual | _overwrite_versions | space_id | storage_id | schema_id | version | is_latest | run_id | created_at | created_by_id | _aux | branch_id |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | LNxjGL6ifTvFjeqq0000 | our_datasets/dataset1.parquet | None | .parquet | dataset | DataFrame | 9868 | kQSstgz6tk5ug4-rq8yz0A | None | 3 | md5 | True | False | 1 | 1 | 1 | None | True | None | 2025-07-15 14:34:35.394000+00:00 | 1 | {'af': {'0': True}} | 1 |
Consumer: understand validation¶
By accessing artifact.schema, the consumer can understand how the dataset was validated.
artifact.schema
artifact.schema.features.df()