Track data using bio-registries & provenance

# a lamindb instance containing Bionty schema
!lamin init --storage ./analysis-usecase --schema bionty
💡 creating schemas: core==0.45.0 bionty==0.29.2
🌱 saved: User(id='DzTjkKse', handle='testuser1', email='testuser1@lamin.ai', name='Test User1', updated_at=2023-08-11 19:27:21)
🌱 saved: Storage(id='tC7ncbg2', root='/home/runner/work/lamin-usecases/lamin-usecases/docs/analysis-usecase', type='local', updated_at=2023-08-11 19:27:21, created_by_id='DzTjkKse')
✅ loaded instance: testuser1/analysis-usecase
💡 did not register local instance on hub (if you want, call `lamin register`)

import lamindb as ln
import lnschema_bionty as lb

lb.settings.species = "human"  # globally set species
lb.settings.auto_save_parents = False
✅ loaded instance: testuser1/analysis-usecase (lamindb 0.50.2)
🌱 set species: Species(id='uHJU', name='human', taxon_id=9606, scientific_name='homo_sapiens', updated_at=2023-08-11 19:27:23, bionty_source_id='Xo0f', created_by_id='DzTjkKse')
ln.track()
💡 notebook imports: lamindb==0.50.2 lnschema_bionty==0.29.2
🌱 saved: Transform(id='eNef4Arw8nNMz8', name='Track data using bio-registries & provenance', short_name='analysis-flow', stem_id='eNef4Arw8nNM', version='0', type=notebook, updated_at=2023-08-11 19:27:23, created_by_id='DzTjkKse')
🌱 saved: Run(id='dMQlQyJpPCpZEE5RxWJd', run_at=2023-08-11 19:27:23, transform_id='eNef4Arw8nNMz8', created_by_id='DzTjkKse')

Track cell types, tissues and diseases

We fetch an example dataset from LaminDB that has a few cell type, tissue and disease annotations:

adata = ln.dev.datasets.anndata_with_obs()
adata
AnnData object with n_obs × n_vars = 40 × 100
    obs: 'cell_type', 'cell_type_id', 'tissue', 'disease'
adata.var_names[:5]
Index(['ENSG00000000003', 'ENSG00000000005', 'ENSG00000000419',
       'ENSG00000000457', 'ENSG00000000460'],
      dtype='object')
adata.obs[["tissue", "cell_type", "disease"]].value_counts()
tissue  cell_type                disease                   
brain   my new cell type         Alzheimer disease             10
heart   hepatocyte               cardiac ventricle disorder    10
kidney  T cell                   chronic kidney disease        10
liver   hematopoietic stem cell  liver lymphoma                10
Name: count, dtype: int64

Processing the dataset

To track our data transformation, we create a new Transform of type "pipeline":

transform = ln.Transform(
    name="Subset to T-cells and liver lymphoma", version="0.1.0", type="pipeline"
)

Switch tracking to the new transform, so that subsequent runs and files are attributed to it:

ln.track(transform)
🌱 saved: Transform(id='evzXQzN3EMp25d', name='Subset to T-cells and liver lymphoma', stem_id='evzXQzN3EMp2', version='0.1.0', type='pipeline', updated_at=2023-08-11 19:27:27, created_by_id='DzTjkKse')
🌱 saved: Run(id='pxW55vAzm35ws7jBBhA1', run_at=2023-08-11 19:27:27, transform_id='evzXQzN3EMp25d', created_by_id='DzTjkKse')

Get a backed AnnData object

file = ln.File.filter(key="mini_anndata_with_obs.h5ad").one()
adata = file.backed()
adata
💡 adding file AeLDDYOct4OVgLKTBmRo as input for run pxW55vAzm35ws7jBBhA1, adding parent transform eNef4Arw8nNMz8
AnnDataAccessor object with n_obs × n_vars = 40 × 100
  constructed for the AnnData object mini_anndata_with_obs.h5ad
    obs: ['_index', 'cell_type', 'cell_type_id', 'disease', 'tissue']
    var: ['_index']
adata.obs[["cell_type", "disease"]].value_counts()
cell_type                disease                   
T cell                   chronic kidney disease        10
hematopoietic stem cell  liver lymphoma                10
hepatocyte               cardiac ventricle disorder    10
my new cell type         Alzheimer disease             10
Name: count, dtype: int64

Subset dataset to specific cell types and diseases

Create the subset:

subset_obs = adata.obs.cell_type.isin(["T cell", "hematopoietic stem cell"]) & (
    adata.obs.disease.isin(["liver lymphoma", "chronic kidney disease"])
)
adata_subset = adata[subset_obs]
adata_subset
AnnDataAccessorSubset object with n_obs × n_vars = 20 × 100
  obs: ['_index', 'cell_type', 'cell_type_id', 'disease', 'tissue']
  var: ['_index']
adata_subset.obs[["cell_type", "disease"]].value_counts()
cell_type                disease               
T cell                   chronic kidney disease    10
hematopoietic stem cell  liver lymphoma            10
Name: count, dtype: int64
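
The mask above is a row-wise AND of two membership tests. As a rough illustration of that logic only, here is a sketch in plain Python rather than pandas. The `obs` list is a hypothetical stand-in for `adata.obs`, rebuilt from the value counts printed earlier:

```python
from collections import Counter

# Hypothetical stand-in for adata.obs: four groups of 10 observations each,
# with values copied from the value_counts() output shown earlier.
groups = [
    ("brain", "my new cell type", "Alzheimer disease"),
    ("heart", "hepatocyte", "cardiac ventricle disorder"),
    ("kidney", "T cell", "chronic kidney disease"),
    ("liver", "hematopoietic stem cell", "liver lymphoma"),
]
obs = [
    {"tissue": tissue, "cell_type": cell_type, "disease": disease}
    for (tissue, cell_type, disease) in groups
    for _ in range(10)
]

keep_cell_types = {"T cell", "hematopoietic stem cell"}
keep_diseases = {"liver lymphoma", "chronic kidney disease"}

# One boolean per row: True only if BOTH membership tests pass,
# mirroring `isin(...) & isin(...)`.
mask = [
    row["cell_type"] in keep_cell_types and row["disease"] in keep_diseases
    for row in obs
]
subset = [row for row, keep in zip(obs, mask) if keep]

# Rough equivalent of value_counts() on the two columns.
counts = Counter((row["cell_type"], row["disease"]) for row in subset)
for key, n in sorted(counts.items()):
    print(key, n)
```

Because a row must pass both tests, a cell of a kept cell type with a disease outside the list would still be dropped.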

This subset can now be registered:

file_subset = ln.File.from_anndata(
    adata_subset.to_memory(),
    key="subset/mini_anndata_with_obs.h5ad",
    var_ref=lb.Gene.ensembl_gene_id,
)
/opt/hostedtoolcache/Python/3.9.17/x64/lib/python3.9/site-packages/anndata/_core/anndata.py:1840: UserWarning: Variable names are not unique. To make them unique, call `.var_names_make_unique`.
  utils.warn_names_duplicates("var")
💡 file will be copied to default storage upon `save()` with key 'subset/mini_anndata_with_obs.h5ad'
💡 parsing feature names of X stored in slot 'var'
💡    using global setting species = human
✅    validated 99 Gene records on ensembl_gene_id: ENSG00000000003, ENSG00000000005, ENSG00000000419, ENSG00000000457, ENSG00000000460, ENSG00000000938, ENSG00000000971, ENSG00000001036, ENSG00000001084, ENSG00000001167, ENSG00000001460, ENSG00000001461, ENSG00000001497, ENSG00000001561, ENSG00000001617, ENSG00000001626, ENSG00000001629, ENSG00000001630, ENSG00000001631, ENSG00000002016, ...
✅    loaded FeatureSet(id='i68sBk48EYWCEUw578CS', n=99, type='float', registry='bionty.Gene', hash='fHbDaAAmJse48vnUQh9C', updated_at=2023-08-11 19:27:26, created_by_id='DzTjkKse')
🌱    linked: FeatureSet(id='i68sBk48EYWCEUw578CS', n=99, type='float', registry='bionty.Gene', hash='fHbDaAAmJse48vnUQh9C', updated_at=2023-08-11 19:27:26, created_by_id='DzTjkKse')
💡 parsing feature names of slot 'obs'
✅    validated 3 Feature records on name: cell_type, disease, tissue
🔶    did not validate 1 Feature record for name: cell_type_id
🔶    ignoring non-validated features: cell_type_id
🌱    linked: FeatureSet(id='nfhtld5EfMgTOHtVEeS9', n=3, registry='core.Feature', hash='vosymFs2FjQ1FaJpSkSP', created_by_id='DzTjkKse')
file_subset.save()
🌱 saved 2 feature sets for slots: ['var', 'obs']
🌱 storing file '0rCATzM4ofqMUs17yJCp' with key 'subset/mini_anndata_with_obs.h5ad'
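
The log above reports that three `obs` columns validated against the Feature registry while `cell_type_id` did not and was ignored. Conceptually, validation splits names by whether they are present in a registry. The sketch below shows only that idea: `validate` and `feature_registry` are hypothetical names, not LaminDB's implementation, and the registry contents are assumed from the log.

```python
def validate(values, registry):
    """Split unique values into those found in the registry and those that are not.

    Hypothetical helper for illustration; preserves first-seen order.
    """
    validated, not_validated = [], []
    seen = set()
    for value in values:
        if value in seen:
            continue  # count each name once, even if it repeats
        seen.add(value)
        if value in registry:
            validated.append(value)
        else:
            not_validated.append(value)
    return validated, not_validated


# Assumed registry contents, inferred from the "validated 3 Feature records" log line.
feature_registry = {"cell_type", "disease", "tissue"}
obs_columns = ["cell_type", "cell_type_id", "tissue", "disease"]

validated, not_validated = validate(obs_columns, feature_registry)
print("validated:", validated)
print("ignored:", not_validated)
```

Only the validated names end up in the linked feature set, which is why the `obs` FeatureSet in the log has `n=3`.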

Add labels to the features; all of them validate:

cell_types = lb.CellType.from_values(adata.obs.cell_type, lb.CellType.name)
tissues = lb.Tissue.from_values(adata.obs.tissue, lb.Tissue.name)
diseases = lb.Disease.from_values(adata.obs.disease, lb.Disease.name)

file_subset.add_labels(cell_types)
file_subset.add_labels(tissues)
file_subset.add_labels(diseases)
✅ validated 4 CellType records on name: T cell, hematopoietic stem cell, hepatocyte, my new cell type
✅ validated 4 Tissue records on name: brain, heart, kidney, liver
✅ validated 4 Disease records on name: Alzheimer disease, cardiac ventricle disorder, chronic kidney disease, liver lymphoma
🌱 linked labels 'T cell', 'hematopoietic stem cell', 'hepatocyte', 'my new cell type' to feature 'cell_type'
🌱 linked labels 'brain', 'heart', 'kidney', 'liver' to feature 'tissue'
🌱 linked labels 'Alzheimer disease', 'cardiac ventricle disorder', 'chronic kidney disease', 'liver lymphoma' to feature 'disease'
file_subset.describe()
💡 File(id=0rCATzM4ofqMUs17yJCp, key=subset/mini_anndata_with_obs.h5ad, suffix=.h5ad, accessor=AnnData, description=None, version=None, size=38992, hash=RgGUx7ndRplZZSmalTAWiw, hash_type=md5, created_at=2023-08-11 19:27:27.916727+00:00, updated_at=2023-08-11 19:27:27.916749+00:00)

Provenance:
    🗃️ storage: Storage(id='tC7ncbg2', root='/home/runner/work/lamin-usecases/lamin-usecases/docs/analysis-usecase', type='local', updated_at=2023-08-11 19:27:21, created_by_id='DzTjkKse')
    📎 initial_version: None
    🧩 transform: Transform(id='evzXQzN3EMp25d', name='Subset to T-cells and liver lymphoma', stem_id='evzXQzN3EMp2', version='0.1.0', type='pipeline', updated_at=2023-08-11 19:27:27, created_by_id='DzTjkKse')
    🚗 run: Run(id='pxW55vAzm35ws7jBBhA1', run_at=2023-08-11 19:27:27, transform_id='evzXQzN3EMp25d', created_by_id='DzTjkKse')
    👤 created_by: User(id='DzTjkKse', handle='testuser1', email='testuser1@lamin.ai', name='Test User1', updated_at=2023-08-11 19:27:21)
Features:
  🗺️ var (X):
    🔗 index (99, bionty.Gene.id): ['1UmZaXkjxuaq', 'Iv22z0sNjRxW', 'N0SyJm0035cr', '0WhF30UcYwMC', 'HAJMISs6IDzW'...]
  🗺️ obs (metadata):
    🔗 cell_type (4, bionty.CellType): ['hematopoietic stem cell', 'T cell', 'hepatocyte', 'my new cell type']
    🔗 disease (4, bionty.Disease): ['Alzheimer disease', 'cardiac ventricle disorder', 'liver lymphoma', 'chronic kidney disease']
    🔗 tissue (4, bionty.Tissue): ['liver', 'kidney', 'brain', 'heart']
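
The `describe()` output lists `hash_type=md5` together with a 22-character hash. That length is what a 16-byte MD5 digest gives when base64-encoded with the padding stripped. The sketch below shows how such a content hash could be computed. The encoding choice (URL-safe base64, `=` removed) is an assumption based on the hash's appearance, not something the notebook confirms, and `md5_b64` is a hypothetical helper name.

```python
import base64
import hashlib


def md5_b64(data: bytes) -> str:
    """Return the MD5 digest of `data` as URL-safe base64 without '=' padding.

    Assumed encoding for illustration only.
    """
    digest = hashlib.md5(data).digest()  # 16 raw bytes
    return base64.urlsafe_b64encode(digest).decode("ascii").rstrip("=")


content_hash = md5_b64(b"example file content")
print(content_hash, len(content_hash))
```

Hashing file content this way lets two files with identical bytes be recognized as the same data regardless of their keys.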

Examine data lineage

Common questions that might arise are:

  • Which h5ad file is in the subset subfolder?

  • Which notebook ingested this file?

  • By whom?

  • And which file is its parent?

Let’s answer this using LaminDB:

Query for an .h5ad file under the subset prefix that is labeled with "hematopoietic stem cell" and "T cell", then view its lineage:

cell_types_bt_lookup = lb.CellType.lookup()
my_subset = ln.File.filter(
    suffix=".h5ad",
    key__startswith="subset",
    cell_types__in=[
        cell_types_bt_lookup.hematopoietic_stem_cell,
        cell_types_bt_lookup.t_cell,
    ],
).first()
my_subset.view_lineage()
https://d33wubrfki0l68.cloudfront.net/d237277aa5658ab2c42480c65ead903f99bcaada/99ee3/_images/5ebc58354d8264db59d3640f84a7709725e439da74cf0c5722aa97a1b2a96245.svg
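
The lineage graph above is only available as an image. As a rough textual stand-in, the sketch below models the provenance relations reported in the logs (file, run, transform, parent transform, input file) with plain dataclasses and walks them to answer the questions listed earlier. The classes and `describe_lineage` are hypothetical and are not LaminDB's ORM; the IDs, names and keys are copied from the output above.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Transform:
    id: str
    name: str
    type: str
    parents: List["Transform"] = field(default_factory=list)


@dataclass
class Run:
    id: str
    transform: Transform
    created_by: str
    inputs: List["File"] = field(default_factory=list)


@dataclass
class File:
    id: str
    key: Optional[str]
    run: Optional[Run] = None  # run that created the file, if known


def describe_lineage(file: File) -> List[str]:
    """Collect one line per provenance fact reachable from `file`."""
    lines = [f"file {file.id} ({file.key})"]
    run = file.run
    if run is None:
        return lines
    lines.append(f"  created in run {run.id} of {run.transform.type} '{run.transform.name}'")
    lines.append(f"  by user {run.created_by}")
    for parent in run.transform.parents:
        lines.append(f"  parent transform: {parent.id} ({parent.type})")
    for inp in run.inputs:
        lines.append(f"  input file: {inp.id} ({inp.key})")
    return lines


notebook = Transform(
    "eNef4Arw8nNMz8", "Track data using bio-registries & provenance", "notebook"
)
pipeline = Transform(
    "evzXQzN3EMp25d", "Subset to T-cells and liver lymphoma", "pipeline",
    parents=[notebook],
)
# The run that created the source file is not shown in the logs, so it is left unset.
source_file = File("AeLDDYOct4OVgLKTBmRo", "mini_anndata_with_obs.h5ad")
run = Run("pxW55vAzm35ws7jBBhA1", pipeline, "testuser1", inputs=[source_file])
subset_file = File("0rCATzM4ofqMUs17yJCp", "subset/mini_anndata_with_obs.h5ad", run=run)

lineage = describe_lineage(subset_file)
for line in lineage:
    print(line)
```

Each printed line corresponds to one of the questions: which file, which transform produced it, by whom, and which file it was derived from.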
!lamin delete analysis-usecase
!rm -r ./analysis-usecase
💡 deleting instance testuser1/analysis-usecase
✅     deleted instance settings file: /home/runner/.lamin/instance--testuser1--analysis-usecase.env
✅     instance cache deleted
✅     deleted '.lndb' sqlite file
🔶     consider manually delete your stored data: /home/runner/work/lamin-usecases/lamin-usecases/docs/analysis-usecase