Pydantic & Pandera vs. LaminDB¶
This doc explains the conceptual differences between data validation with pydantic, pandera, and lamindb.
!lamin init --storage test-pydantic-pandera --modules bionty
Let us work with a test dataframe.
import pandas as pd
import pydantic
from typing import Literal
import lamindb as ln
import bionty as bt
import pandera
df = ln.core.datasets.small_dataset1()
df
→ connected lamindb: testuser1/test-pydantic-pandera
| | ENSG00000153563 | ENSG00000010610 | ENSG00000170458 | perturbation | sample_note | cell_type_by_expert | cell_type_by_model | assay_oid | concentration | treatment_time_h | donor | donor_ethnicity |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| sample1 | 1 | 3 | 5 | DMSO | was ok | B cell | B cell | EFO:0008913 | 0.1% | 24 | D0001 | [Chinese, Singaporean Chinese] |
| sample2 | 2 | 4 | 6 | IFNG | looks naah | CD8-positive, alpha-beta T cell | T cell | EFO:0008913 | 200 nM | 24 | D0002 | [Chinese, Han Chinese] |
| sample3 | 3 | 5 | 7 | DMSO | pretty! 🤩 | CD8-positive, alpha-beta T cell | T cell | EFO:0008913 | 0.1% | 6 | None | [Chinese] |
Define a schema¶
pydantic¶
Perturbation = Literal["DMSO", "IFNG"]
CellType = Literal["T cell", "B cell"]
OntologyID = Literal["EFO:0008913"]


class ImmunoSchema(pydantic.BaseModel):
    perturbation: Perturbation
    cell_type_by_model: CellType
    cell_type_by_expert: CellType
    assay_oid: OntologyID
    concentration: str
    treatment_time_h: int
    donor: str | None

    class Config:
        title = "My immuno schema"
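A pydantic model validates one record at a time, so a single row is checked by constructing the model from keyword arguments. A minimal sketch, assuming pydantic v2; the two-field `RowSchema` below is a trimmed stand-in for `ImmunoSchema`:

```python
import pydantic
from typing import Literal

Perturbation = Literal["DMSO", "IFNG"]


class RowSchema(pydantic.BaseModel):
    perturbation: Perturbation
    treatment_time_h: int


# a row matching the Literal vocabulary parses fine
row = RowSchema(perturbation="DMSO", treatment_time_h=24)

# a value outside the Literal vocabulary raises a ValidationError
try:
    RowSchema(perturbation="TNFA", treatment_time_h=24)
except pydantic.ValidationError:
    print("rejected")
```

Because the unit of validation is a single record, validating a dataframe requires iterating over its rows, as done further below.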
pandera¶
# Define the Pandera schema using DataFrameSchema
pandera_schema = pandera.DataFrameSchema(
    {
        "perturbation": pandera.Column(
            str, checks=pandera.Check.isin(["DMSO", "IFNG"])
        ),
        "cell_type_by_model": pandera.Column(
            str, checks=pandera.Check.isin(["T cell", "B cell"])
        ),
        "cell_type_by_expert": pandera.Column(
            str, checks=pandera.Check.isin(["T cell", "B cell"])
        ),
        "assay_oid": pandera.Column(str, checks=pandera.Check.isin(["EFO:0008913"])),
        "concentration": pandera.Column(str),
        "treatment_time_h": pandera.Column(int),
        "donor": pandera.Column(str, nullable=True),
    },
    name="My immuno schema",
)
lamindb¶
Features & labels are defined at the level of the database instance. You can either define a schema with required (and optional) columns:
ln.ULabel(name="DMSO").save() # define a DMSO label
ln.ULabel(name="IFNG").save() # define an IFNG label
# leverage ontologies through types ln.ULabel, bt.CellType, bt.ExperimentalFactor
lamindb_schema = ln.Schema(
    name="My immuno schema",
    features=[
        ln.Feature(name="perturbation", dtype=ln.ULabel).save(),
        ln.Feature(name="cell_type_by_model", dtype=bt.CellType).save(),
        ln.Feature(name="cell_type_by_expert", dtype=bt.CellType).save(),
        ln.Feature(name="assay_oid", dtype=bt.ExperimentalFactor.ontology_id).save(),
        ln.Feature(name="concentration", dtype=str).save(),
        ln.Feature(name="treatment_time_h", dtype=int).save(),
        ln.Feature(name="donor", dtype=str, nullable=True).save(),
    ],
).save()
Or merely define a constraint on the feature identifier.
lamindb_schema_only_itype = ln.Schema(
    name="Allow any valid features & labels", itype=ln.Feature
)
Validate a dataframe¶
pydantic¶
class DataFrameValidationError(Exception):
    pass


def validate_dataframe(df: pd.DataFrame, model: type[pydantic.BaseModel]):
    errors = []
    for i, row in enumerate(df.to_dict(orient="records")):
        try:
            model(**row)
        except pydantic.ValidationError as e:
            errors.append(f"row {i} failed validation: {e}")

    if errors:
        error_message = "\n".join(errors)
        raise DataFrameValidationError(
            f"DataFrame validation failed with the following errors:\n{error_message}"
        )
try:
    validate_dataframe(df, ImmunoSchema)
except DataFrameValidationError as e:
    print(e)
To fix the validation error, we need to update the Literal and re-run the model definition.
Perturbation = Literal["DMSO", "IFNG"]
CellType = Literal[
    "T cell", "B cell", "CD8-positive, alpha-beta T cell"  # <-- updated
]
OntologyID = Literal["EFO:0008913"]


class ImmunoSchema(pydantic.BaseModel):
    perturbation: Perturbation
    cell_type_by_model: CellType
    cell_type_by_expert: CellType
    assay_oid: OntologyID
    concentration: str
    treatment_time_h: int
    donor: str | None

    class Config:
        title = "My immuno schema"
validate_dataframe(df, ImmunoSchema)
pandera¶
try:
    pandera_schema.validate(df)
except pandera.errors.SchemaError as e:
    print(e)
lamindb¶
Because the term "CD8-positive, alpha-beta T cell" is part of the public CellType ontology, validation passes the first time.
Had validation not passed, we could have resolved the issue simply by adding the new term to the CellType registry (e.g., via `bt.CellType.from_source(name="...").save()`) rather than editing code. This also puts downstream data scientists in a position to update ontologies themselves.
curator = ln.curators.DataFrameCurator(df, lamindb_schema)
curator.validate()
What was the cell type validation based on? Let’s inspect the CellType registry.
bt.CellType.df()
The CellType registry is hierarchical as it contains the Cell Ontology.
bt.CellType.get(name="CD8-positive, alpha-beta T cell").view_parents()
Overview of validation properties¶
Importantly, LaminDB offers not only a DataFrameCurator, but also an AnnDataCurator, MuDataCurator, SpatialDataCurator, and TiledbsomaCurator.
The below overview only concerns validating dataframes.
Experience of data engineer¶
| property | pydantic | pandera | lamindb |
|---|---|---|---|
| define schema as code | yes, in form of a `BaseModel` | yes, in form of a `DataFrameSchema` | yes, in form of a `Schema` |
| define schema as a set of constraints without the need of listing fields/columns/features; e.g. useful if validating 60k genes | no | no | yes |
| update labels independent of code | not possible because labels are enums/literals | not possible because labels are hard-coded in `Check` objects | possible by adding new terms to a registry |
| built-in validation from public ontologies | no | no | yes |
| sync labels with ELN/LIMS registries without code change | no | no | yes |
| can re-use fields/columns/features across schemas | limited, via subclassing | only in the same Python session | yes, because they are persisted in the database |
| schema modifications can invalidate previously validated datasets | yes | yes | no, because LaminDB allows querying datasets that were validated with a given schema version |
| can use columnar organization of dataframe | no, needs to iterate over potentially millions of rows | yes | yes |
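To illustrate the "limited, via subclassing" row: with pydantic, fields can be shared across schemas by inheriting from a base model. A minimal sketch; the model names are made up for illustration:

```python
import pydantic
from typing import Literal

Perturbation = Literal["DMSO", "IFNG"]


class CoreSchema(pydantic.BaseModel):
    perturbation: Perturbation
    concentration: str


# inherits perturbation & concentration from CoreSchema, adds one field
class ExtendedSchema(CoreSchema):
    treatment_time_h: int


row = ExtendedSchema(perturbation="IFNG", concentration="200 nM", treatment_time_h=24)
```

The re-use is limited because it only works within one codebase: the shared fields live in Python code, not in a database that other schemas and teams can reference.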
Experience of data consumer¶
| property | pydantic | pandera | lamindb |
|---|---|---|---|
| dataset is queryable / findable | no | no | yes, by querying for labels & features |
| dataset is annotated | no | no | yes |
| user knows what validation constraints were applied | no, because they might not have access to the code and don’t know which code was run | no (same as pydantic) | yes, via `artifact.schema` |
Annotation & queryability¶
Engineer: annotate the dataset¶
Either use the Curator object:
artifact = curator.save_artifact(key="our_datasets/dataset1.parquet")
If you don’t expect a need for Curator functionality for updating ontologies and standardization, you can also use the Artifact constructor.
artifact = ln.Artifact.from_df(
    df, key="our_datasets/dataset1.parquet", schema=lamindb_schema
).save()
Consumer: see annotations¶
artifact.describe()
Consumer: query the dataset¶
ln.Artifact.filter(perturbation="IFNG").df()
| id | uid | key | description | suffix | kind | otype | size | hash | n_files | n_observations | _hash_type | _key_is_virtual | _overwrite_versions | space_id | storage_id | schema_id | version | is_latest | run_id | created_at | created_by_id | _aux | branch_id |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | LNxjGL6ifTvFjeqq0000 | our_datasets/dataset1.parquet | None | .parquet | dataset | DataFrame | 9868 | kQSstgz6tk5ug4-rq8yz0A | None | 3 | md5 | True | False | 1 | 1 | 1 | None | True | None | 2025-07-15 14:34:35.394000+00:00 | 1 | {'af': {'0': True}} | 1 |
Consumer: understand validation¶
By accessing artifact.schema, the consumer can understand how the dataset was validated.
artifact.schema
artifact.schema.features.df()