Slice arrays
We saw how LaminDB lets you query & search across artifacts & collections using registries (see Query & search registries).
Let us now look at the following case:
# get a lookup for labels
ulabels = ln.ULabel.lookup()
# query a parquet file matching "setosa"
df = ln.Artifact.filter(ulabels=ulabels.setosa, suffix=".parquet").first().load()
# query all observations in the DataFrame matching "setosa"
df_setosa = df.loc[df.iris_organism_name == ulabels.setosa.name]
Because the artifact was validated, querying the DataFrame is guaranteed to succeed!
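As a minimal local sketch of this kind of row selection, the snippet below builds a toy DataFrame in place of the loaded artifact. The column values and the `sepal_length` column are assumptions made for illustration; only the `iris_organism_name` column name comes from the example above.

```python
import pandas as pd

# Toy stand-in for the loaded artifact; the values are illustrative only.
df = pd.DataFrame(
    {
        "sepal_length": [5.1, 7.0, 4.9, 6.3],
        "iris_organism_name": ["setosa", "versicolor", "setosa", "virginica"],
    }
)

# A boolean mask aligned with the rows goes in the row position of .loc.
mask = df["iris_organism_name"] == "setosa"
df_setosa = df.loc[mask]

print(df_setosa.shape)
```

The mask keeps all columns and only the matching observations, which is the behavior the comment "query all observations" describes.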
Such within-collection queries are also possible for cloud-backed collections using DuckDB, TileDB, zarr, HDF5, parquet, and other storage backends.
In this notebook, we show how to subset AnnData objects and generic HDF5 and zarr collections accessed in the cloud.
Let us create a remote instance for testing.
✓ logged in with email testuser1@lamin.ai (uid: DzTjkKse)
! updating cloud SQLite 's3://lamindb-ci/test-arrays/.lamindb/lamin.db' of instance 'testuser1/test-arrays'
! locked instance (to unlock and push changes to the cloud SQLite file, call: lamin disconnect)
→ initialized lamindb: testuser1/test-arrays
Import lamindb and track this notebook.
→ connected lamindb: testuser1/test-arrays
→ created Transform('hsRyWJggf2Ca0000'), started new Run('sdYkBm8d...') at 2025-07-15 14:31:41 UTC
→ notebook imports: lamindb==1.8.0
We’ll need some test data:
Artifact(uid='Aw5ozFMggCLAXyqC0000', is_latest=True, key='testfile.hdf5', suffix='.hdf5', size=1400, hash='UCWPjJkhzBjO97rtuo_8Yg', branch_id=1, space_id=1, storage_id=1, run_id=1, created_by_id=1, created_at=2025-07-15 14:31:41 UTC)
Note that it is also possible to register Hugging Face paths. For this, the huggingface_hub package must be installed.
We register a folder of parquet files as a single artifact.
/opt/hostedtoolcache/Python/3.13.5/x64/lib/python3.13/site-packages/tqdm/auto.py:21: TqdmWarning: IProgress not found. Please update jupyter and ipywidgets. See https://ipywidgets.readthedocs.io/en/stable/user_install.html
from .autonotebook import tqdm as notebook_tqdm
→ due to lack of write access, LaminDB won't manage this storage location: hf://datasets/Koncopd/lamindb-test
→ deleted storage record on hub e82908a3045a5fecadfe01b36107a2e4 | hf://datasets/Koncopd/lamindb-test
→ referenced read-only storage location at hf://datasets/Koncopd/lamindb-test
Artifact(uid='UQQuP68mAHNik2BT0000', is_latest=True, key='sharded_parquet', suffix='', size=42767, hash='oj6I3nNKj_eiX2I1q26qaw', n_files=11, branch_id=1, space_id=1, storage_id=2, run_id=1, created_by_id=1, created_at=2025-07-15 14:31:44 UTC)
We also register a collection of individual parquet files.
Collection(uid='LORPEhE3Uqx0qYhB0000', is_latest=True, key='sharded_parquet_collection', hash='XavO_EEZSi-shT6uJGFHHA', branch_id=1, space_id=1, created_by_id=1, run_id=1, created_at=2025-07-15 14:31:44 UTC)
AnnData
An h5ad artifact stored on s3:
S3QueryPath('s3://lamindb-ci/test-arrays/pbmc68k.h5ad')
Opening the artifact with .open() returns an AnnDataAccessor object, an AnnData object backed in the cloud:
AnnDataAccessor object with n_obs × n_vars = 70 × 765
constructed for the AnnData object pbmc68k.h5ad
obs: ['cell_type', 'index', 'louvain', 'n_genes', 'percent_mito']
obsm: ['X_pca', 'X_umap']
obsp: ['connectivities', 'distances']
uns: ['louvain', 'louvain_colors', 'neighbors', 'pca']
var: ['highly_variable', 'index', 'n_counts']
varm: ['PCs']
Without subsetting, the AnnDataAccessor object references underlying lazy h5 or zarr arrays:
<HDF5 dataset "X": shape (70, 765), type "<f4">
You can subset it like a normal AnnData object:
AnnDataAccessorSubset object with n_obs × n_vars = 35 × 765
obs: ['cell_type', 'index', 'louvain', 'n_genes', 'percent_mito']
obsm: ['X_pca', 'X_umap']
obsp: ['connectivities', 'distances']
uns: ['louvain', 'louvain_colors', 'neighbors', 'pca']
var: ['highly_variable', 'index', 'n_counts']
varm: ['PCs']
Subsets load arrays into memory upon direct access:
array([[-0.326, -0.191, 0.499, ..., -0.21 , -0.636, -0.49 ],
[ 0.811, -0.191, -0.728, ..., -0.21 , 0.604, -0.49 ],
[-0.326, -0.191, 0.643, ..., -0.21 , 2.303, -0.49 ],
...,
[-0.326, -0.191, -0.728, ..., -0.21 , 0.626, -0.49 ],
[-0.326, -0.191, -0.728, ..., -0.21 , -0.636, -0.49 ],
[-0.326, -0.191, -0.728, ..., -0.21 , -0.636, -0.49 ]],
shape=(35, 765), dtype=float32)
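The difference between a lazy reference and a subset loaded into memory can be illustrated locally. The sketch below is not the AnnDataAccessor API: it swaps in `numpy.memmap` as a stand-in for a disk-backed h5 or zarr array, and the file name, random data, and every-other-row mask are assumptions chosen only to mirror the 70 × 765 and 35 × 765 shapes shown above.

```python
import os
import tempfile

import numpy as np

# Write a small float32 matrix to disk (random data, shape mirrors the example).
rng = np.random.default_rng(0)
path = os.path.join(tempfile.mkdtemp(), "X.dat")
rng.standard_normal((70, 765)).astype("float32").tofile(path)

# Opening the memmap does not load the matrix; it references the file on disk.
X = np.memmap(path, dtype="float32", mode="r", shape=(70, 765))

# Select every other observation; np.array(...) copies just those rows into memory.
obs_idx = np.arange(70) % 2 == 0
X_subset = np.array(X[obs_idx])

print(type(X).__name__, X.shape)
print(type(X_subset).__name__, X_subset.shape)
```

Only the selected rows end up as an in-memory array, which is the same pattern as accessing an array on a subset of a backed object.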
To load the entire subset into memory as an actual AnnData object, use to_memory():
AnnData object with n_obs × n_vars = 35 × 765
obs: 'cell_type', 'n_genes', 'percent_mito', 'louvain'
var: 'n_counts', 'highly_variable'
uns: 'louvain', 'louvain_colors', 'neighbors', 'pca'
obsm: 'X_pca', 'X_umap'
varm: 'PCs'
obsp: 'connectivities', 'distances'
Generic HDF5
Let us query a generic HDF5 artifact:
And get a backed accessor:
The returned object holds the open file connection in .connection and the h5py.File or zarr.Group in .storage:
BackedAccessor(connection=<File-like object S3FileSystem, lamindb-ci/test-arrays/testfile.hdf5>, storage=<HDF5 file "testfile.hdf5>" (mode r)>)
<HDF5 file "testfile.hdf5>" (mode r)>
Parquet
A dataframe stored as sharded parquet.
11 sub-directories & 11 files with suffixes '.parquet'
hf://datasets/Koncopd/lamindb-test/sharded_parquet
├── louvain=0/
│ └── 947eee0b064440c9b9910ca2eb89e608-0.parquet
├── louvain=1/
│ └── 947eee0b064440c9b9910ca2eb89e608-0.parquet
├── louvain=10/
│ └── 947eee0b064440c9b9910ca2eb89e608-0.parquet
├── louvain=2/
│ └── 947eee0b064440c9b9910ca2eb89e608-0.parquet
├── louvain=3/
│ └── 947eee0b064440c9b9910ca2eb89e608-0.parquet
├── louvain=4/
│ └── 947eee0b064440c9b9910ca2eb89e608-0.parquet
├── louvain=5/
│ └── 947eee0b064440c9b9910ca2eb89e608-0.parquet
├── louvain=6/
│ └── 947eee0b064440c9b9910ca2eb89e608-0.parquet
├── louvain=7/
│ └── 947eee0b064440c9b9910ca2eb89e608-0.parquet
├── louvain=8/
│ └── 947eee0b064440c9b9910ca2eb89e608-0.parquet
└── louvain=9/
└── 947eee0b064440c9b9910ca2eb89e608-0.parquet
This returns a pyarrow dataset.
<pyarrow._dataset.FileSystemDataset at 0x7f2f6c6860e0>
| index | cell_type | n_genes | percent_mito |
| --- | --- | --- | --- |
| CGTTATACAGTACC-8 | CD4+/CD45RO+ Memory | 1034 | 0.010163 |
| AGATATTGACCACA-1 | CD4+/CD45RO+ Memory | 1078 | 0.012831 |
| GCAGGGCTGTATGC-8 | CD8+/CD45RA+ Naive Cytotoxic | 1055 | 0.012287 |
| TTATGGCTGGCAAG-2 | CD4+/CD25 T Reg | 1236 | 0.023963 |
| CACGACCTGGGAGT-7 | CD4+/CD25 T Reg | 1010 | 0.016620 |
It is also possible to open a collection of cloud artifacts.
<pyarrow._dataset.FileSystemDataset at 0x7f2f6555fca0>
| index | cell_type | n_genes | percent_mito |
| --- | --- | --- | --- |
| CGTTATACAGTACC-8 | CD4+/CD45RO+ Memory | 1034 | 0.010163 |
| AGATATTGACCACA-1 | CD4+/CD45RO+ Memory | 1078 | 0.012831 |
| GCAGGGCTGTATGC-8 | CD8+/CD45RA+ Naive Cytotoxic | 1055 | 0.012287 |
| TTATGGCTGGCAAG-2 | CD4+/CD25 T Reg | 1236 | 0.023963 |
| CACGACCTGGGAGT-7 | CD4+/CD25 T Reg | 1010 | 0.016620 |
| AATCTCACTCAGTG-3 | CD4+/CD45RO+ Memory | 1183 | 0.016056 |
| CTAGTTTGGCTTAG-4 | CD4+/CD45RO+ Memory | 1002 | 0.018922 |
| ACGCCGGAAGCCTA-6 | CD8+/CD45RA+ Naive Cytotoxic | 1292 | 0.018315 |
| CTGACCACCATGGT-4 | CD8+/CD45RA+ Naive Cytotoxic | 1559 | 0.024427 |
| AGTTAAACAAACAG-1 | CD19+ B | 1005 | 0.019806 |
| CTACGCACAGGGTG-3 | CD4+/CD45RO+ Memory | 1053 | 0.012073 |
| CAGACAACAAAACG-7 | CD4+/CD25 T Reg | 1109 | 0.012702 |
| GAGGGTGACCTATT-1 | CD4+/CD25 T Reg | 1003 | 0.012971 |
| TGACTGGAACCATG-7 | Dendritic cells | 1277 | 0.012961 |
| ACGACCCTGTCTGA-3 | Dendritic cells | 1074 | 0.017466 |
| GTTATGCTACCTCC-3 | CD14+ Monocytes | 1201 | 0.016839 |
| GTGTCAGATCTACT-6 | CD14+ Monocytes | 1014 | 0.025417 |
| AAGAACGAACTCTT-6 | CD14+ Monocytes | 1067 | 0.019530 |
| TACTCTGACGTAGT-1 | Dendritic cells | 1118 | 0.012069 |
| TAAGCTCTTCTGGA-4 | CD14+ Monocytes | 1059 | 0.021497 |
By default, Artifact.open() and Collection.open() use pyarrow to lazily open dataframes. polars can also be used by passing engine="polars". Note that .open(engine="polars") returns a context manager that yields a LazyFrame.
<sys>:0: CategoricalRemappingWarning: Local categoricals have different encodings, expensive re-encoding is done to perform this merge operation. Consider using a StringCache or an Enum type if the categories are known in advance
| | cell_type | n_genes | percent_mito | index |
| --- | --- | --- | --- | --- |
| 0 | CD4+/CD45RO+ Memory | 1034 | 0.010163 | CGTTATACAGTACC-8 |
| 1 | CD4+/CD45RO+ Memory | 1078 | 0.012831 | AGATATTGACCACA-1 |
| 2 | CD8+/CD45RA+ Naive Cytotoxic | 1055 | 0.012287 | GCAGGGCTGTATGC-8 |
| 3 | CD4+/CD25 T Reg | 1236 | 0.023963 | TTATGGCTGGCAAG-2 |
| 4 | CD4+/CD25 T Reg | 1010 | 0.016620 | CACGACCTGGGAGT-7 |
| 5 | CD4+/CD45RO+ Memory | 1183 | 0.016056 | AATCTCACTCAGTG-3 |
| 6 | CD4+/CD45RO+ Memory | 1002 | 0.018922 | CTAGTTTGGCTTAG-4 |
| 7 | CD8+/CD45RA+ Naive Cytotoxic | 1292 | 0.018315 | ACGCCGGAAGCCTA-6 |
| 8 | CD8+/CD45RA+ Naive Cytotoxic | 1559 | 0.024427 | CTGACCACCATGGT-4 |
| 9 | CD19+ B | 1005 | 0.019806 | AGTTAAACAAACAG-1 |
| 10 | CD4+/CD45RO+ Memory | 1053 | 0.012073 | CTACGCACAGGGTG-3 |
| 11 | CD4+/CD25 T Reg | 1109 | 0.012702 | CAGACAACAAAACG-7 |
| 12 | CD4+/CD25 T Reg | 1003 | 0.012971 | GAGGGTGACCTATT-1 |
| 13 | Dendritic cells | 1277 | 0.012961 | TGACTGGAACCATG-7 |
| 14 | Dendritic cells | 1074 | 0.017466 | ACGACCCTGTCTGA-3 |
| 15 | CD14+ Monocytes | 1201 | 0.016839 | GTTATGCTACCTCC-3 |
| 16 | CD14+ Monocytes | 1014 | 0.025417 | GTGTCAGATCTACT-6 |
| 17 | CD14+ Monocytes | 1067 | 0.019530 | AAGAACGAACTCTT-6 |
| 18 | Dendritic cells | 1118 | 0.012069 | TACTCTGACGTAGT-1 |
| 19 | CD14+ Monocytes | 1059 | 0.021497 | TAAGCTCTTCTGGA-4 |
Yet another way to open several parquet files as a single dataset is to call .open() directly on a query set.
! this query set is unordered, consider using `.order_by()` first to avoid opening the artifacts in an arbitrary order
<pyarrow._dataset.FileSystemDataset at 0x7f2f4b7c6740>
• deleting instance testuser1/test-arrays
→ deleted storage record on hub 76e5f3b018085f52bcd5ca9b4d7e0ce5 | s3://lamindb-ci/test-arrays
→ deleted instance record on hub 587a82023ecb5ea28b3a448cb8240f7f