Bird’s eye view#

You typically want to know where your files & datasets came from.

In this guide, you’ll learn how to backtrace file transformations through notebooks, pipelines & app uploads.

But should I really care about data lineage?

Capturing and documenting the origin and flow of biological data throughout its lifecycle is important: it enables traceability of biological data & insights, helps verify experimental outcomes, supports meeting stringent regulatory standards, and fosters the reproducibility of scientific discoveries.

While tracking data lineage is comparatively easy when data flow is governed by deterministic pipelines, it becomes hard when it's governed by interactive, human-driven analyses.

This notebook walks through how LaminDB helps with both, enabling you to ln.track() data flow through notebooks & teams of analysts.

# initialize a test instance for this notebook
# this should be run before importing lamindb in Python
!lamin login testuser1
!lamin delete mydata
!lamin init --storage ./mydata
!lamin login testuser2
# load testuser1's instance using testuser2
!lamin load testuser1/mydata
✅ logged in with email testuser1@lamin.ai and id DzTjkKse
💡 deleting instance testuser1/mydata
🔶 could not delete as instance settings do not exist locally. did you provide a wrong instance name? could you try loading it?
💡 creating schemas: core==0.45.0 
🌱 saved: User(id='DzTjkKse', handle='testuser1', email='testuser1@lamin.ai', name='Test User1', updated_at=2023-08-11 19:27:36)
🌱 saved: Storage(id='42h8wfpM', root='/home/runner/work/lamin-usecases/lamin-usecases/docs/mydata', type='local', updated_at=2023-08-11 19:27:36, created_by_id='DzTjkKse')
✅ loaded instance: testuser1/mydata
💡 did not register local instance on hub (if you want, call `lamin register`)

✅ logged in with email testuser2@lamin.ai and id bKeW4T6E
💡 found cached instance metadata: /home/runner/.lamin/instance--testuser1--mydata.env
🌱 saved: User(id='bKeW4T6E', handle='testuser2', email='testuser2@lamin.ai', name='Test User2', updated_at=2023-08-11 19:27:39)
✅ loaded instance: testuser1/mydata

import lamindb as ln
✅ loaded instance: testuser1/mydata (lamindb 0.50.2)
# To make the example of this guide richer, let's create data registered in uploads and pipeline runs by testuser1:
bfx_run_output = ln.dev.datasets.dir_scrnaseq_cellranger(
    "perturbseq", basedir=ln.settings.storage, output_only=False
)
ln.setup.login("testuser1")
transform = ln.Transform(name="Chromium 10x upload", type="pipeline")
ln.track(transform)
file1 = ln.File(bfx_run_output.parent / "fastq/perturbseq_R1_001.fastq.gz")
file1.save()
file2 = ln.File(bfx_run_output.parent / "fastq/perturbseq_R2_001.fastq.gz")
file2.save()
# now log in as testuser2 to start the guide
ln.setup.login("testuser2")
✅ logged in with email testuser1@lamin.ai and id DzTjkKse
🌱 saved: Transform(id='QAvEbyFGu1Diz8', name='Chromium 10x upload', stem_id='QAvEbyFGu1Di', version='0', type='pipeline', updated_at=2023-08-11 19:27:41, created_by_id='DzTjkKse')
🌱 saved: Run(id='pVs6CrtjyZGBhk9ORAhX', run_at=2023-08-11 19:27:41, transform_id='QAvEbyFGu1Diz8', created_by_id='DzTjkKse')
💡 file in storage '/home/runner/work/lamin-usecases/lamin-usecases/docs/mydata' with key 'fastq/perturbseq_R1_001.fastq.gz'
💡 file in storage '/home/runner/work/lamin-usecases/lamin-usecases/docs/mydata' with key 'fastq/perturbseq_R2_001.fastq.gz'
✅ logged in with email testuser2@lamin.ai and id bKeW4T6E

Track a bioinformatics pipeline#

When working with a pipeline, we’ll register it before running it.

This only happens once and could be done by anyone on your team.

ln.Transform(name="Cell Ranger", version="7.2.0", type="pipeline").save()

Before running the pipeline, query or search for the corresponding transform record:

transform = ln.Transform.filter(name="Cell Ranger", version="7.2.0").one()

Pass the record to track() to set a global run_context:

ln.track(transform)
✅ loaded: Transform(id='TMiTPuhNmVNDsM', name='Cell Ranger', stem_id='TMiTPuhNmVND', version='7.2.0', type='pipeline', updated_at=2023-08-11 19:27:41, created_by_id='bKeW4T6E')
🌱 saved: Run(id='b9qsjzeoFJM1fKmzpc5Z', run_at=2023-08-11 19:27:42, transform_id='TMiTPuhNmVNDsM', created_by_id='bKeW4T6E')

Now, let’s stage (download) a few files from an instrument upload:

files = ln.File.filter(key__startswith="fastq/perturbseq").all()
filepaths = [file.stage() for file in files]
💡 adding file j6lqMHdqFiLC6RogQG5f as input for run b9qsjzeoFJM1fKmzpc5Z, adding parent transform QAvEbyFGu1Diz8
💡 adding file DmRDf1414CVge87NB5mE as input for run b9qsjzeoFJM1fKmzpc5Z, adding parent transform QAvEbyFGu1Diz8

Assume we processed them and obtained 3 output files in a folder 'filtered_feature_bc_matrix':

ln.File.tree("./mydata/perturbseq/filtered_feature_bc_matrix/")
filtered_feature_bc_matrix (0 sub-directories & 3 files): 
├── features.tsv.gz
├── matrix.mtx.gz
└── barcodes.tsv.gz

output_files = ln.File.from_dir("./mydata/perturbseq/filtered_feature_bc_matrix/")
ln.save(output_files)
✅ created 3 files from directory using storage /home/runner/work/lamin-usecases/lamin-usecases/docs/mydata and key = perturbseq/filtered_feature_bc_matrix/
🌱 storing file 'jrTKqw0BBHztnEwsmc3P' with key 'perturbseq/filtered_feature_bc_matrix/matrix.mtx.gz'
🌱 storing file 'DWi2Nc3Shn9EdkUNczjK' with key 'perturbseq/filtered_feature_bc_matrix/features.tsv.gz'
🌱 storing file 'nriiOWq70kQqYK2uJokd' with key 'perturbseq/filtered_feature_bc_matrix/barcodes.tsv.gz'

Each of these files now has transform and run records. For instance:

output_files[0].transform
Transform(id='TMiTPuhNmVNDsM', name='Cell Ranger', stem_id='TMiTPuhNmVND', version='7.2.0', type='pipeline', updated_at=2023-08-11 19:27:42, created_by_id='bKeW4T6E')
output_files[0].run
Run(id='b9qsjzeoFJM1fKmzpc5Z', run_at=2023-08-11 19:27:42, transform_id='TMiTPuhNmVNDsM', created_by_id='bKeW4T6E')

Let’s look at the data lineage at this stage:

output_files[0].view_lineage()
[figure: data lineage graph of output_files[0]]

And let’s keep running the Cell Ranger pipeline in the background:

# continue with more processing steps on the Cell Ranger output data
transform = ln.Transform(
    name="Preprocess Cell Ranger outputs", version="2.0", type="pipeline"
)
ln.track(transform)

[f.stage() for f in output_files]
filepath = ln.dev.datasets.schmidt22_perturbseq(basedir=ln.settings.storage)
file = ln.File(filepath, description="perturbseq counts")
file.save()
🌱 saved: Transform(id='WwwqvcwrXoqA0b', name='Preprocess Cell Ranger outputs', stem_id='WwwqvcwrXoqA', version='2.0', type='pipeline', updated_at=2023-08-11 19:27:42, created_by_id='bKeW4T6E')
🌱 saved: Run(id='9HtSWlTYrHTMwE35Fu9W', run_at=2023-08-11 19:27:42, transform_id='WwwqvcwrXoqA0b', created_by_id='bKeW4T6E')
💡 adding file DWi2Nc3Shn9EdkUNczjK as input for run 9HtSWlTYrHTMwE35Fu9W, adding parent transform TMiTPuhNmVNDsM
💡 adding file jrTKqw0BBHztnEwsmc3P as input for run 9HtSWlTYrHTMwE35Fu9W, adding parent transform TMiTPuhNmVNDsM
💡 adding file nriiOWq70kQqYK2uJokd as input for run 9HtSWlTYrHTMwE35Fu9W, adding parent transform TMiTPuhNmVNDsM
💡 file in storage '/home/runner/work/lamin-usecases/lamin-usecases/docs/mydata' with key 'schmidt22_perturbseq.h5ad'
💡 file is AnnDataLike, consider using File.from_anndata() to link var_names and obs.columns as features

Track app upload & analytics#

The hidden cell below simulates additional analytic steps including:

  • uploading phenotypic screen data

  • scRNA-seq analysis

  • analyses of the integrated datasets

# app upload
ln.setup.login("testuser1")
transform = ln.Transform(name="Upload GWS CRISPRa result", type="app")
ln.track(transform)

# upload and analyze the GWS data
filepath = ln.dev.datasets.schmidt22_crispra_gws_IFNG(ln.settings.storage)
file = ln.File(filepath, description="Raw data of schmidt22 crispra GWS")
file.save()
ln.setup.login("testuser2")
transform = ln.Transform(name="GWS CRIPSRa analysis", type="notebook")
ln.track(transform)

file_wgs = ln.File.filter(key="schmidt22-crispra-gws-IFNG.csv").one()
df = file_wgs.load().set_index("id")
hits_df = df[df["pos|fdr"] < 0.01].copy()
file_hits = ln.File(hits_df, description="hits from schmidt22 crispra GWS")
file_hits.save()
✅ logged in with email testuser1@lamin.ai and id DzTjkKse
🌱 saved: Transform(id='IrONndjTzLqGz8', name='Upload GWS CRISPRa result', stem_id='IrONndjTzLqG', version='0', type='app', updated_at=2023-08-11 19:27:43, created_by_id='DzTjkKse')
🌱 saved: Run(id='fqeK1qSPn64Bnm8OkAT6', run_at=2023-08-11 19:27:43, transform_id='IrONndjTzLqGz8', created_by_id='DzTjkKse')
💡 file in storage '/home/runner/work/lamin-usecases/lamin-usecases/docs/mydata' with key 'schmidt22-crispra-gws-IFNG.csv'
✅ logged in with email testuser2@lamin.ai and id bKeW4T6E
🌱 saved: Transform(id='UGEF7DB1wS5yz8', name='GWS CRIPSRa analysis', stem_id='UGEF7DB1wS5y', version='0', type='notebook', updated_at=2023-08-11 19:27:44, created_by_id='bKeW4T6E')
🌱 saved: Run(id='8y9cnKPY9VRxRBsyQuyR', run_at=2023-08-11 19:27:44, transform_id='UGEF7DB1wS5yz8', created_by_id='bKeW4T6E')
💡 adding file tTsErZDwt4R0kR46obas as input for run 8y9cnKPY9VRxRBsyQuyR, adding parent transform IrONndjTzLqGz8
💡 file will be copied to default storage upon `save()` with key '64t0tY0Ix8tZ2hlhQcY5.parquet'
💡 file is a dataframe, consider using File.from_df() to link column names as features
🌱 storing file '64t0tY0Ix8tZ2hlhQcY5' with key '.lamindb/64t0tY0Ix8tZ2hlhQcY5.parquet'

Let’s see how the data lineage of this looks:

file = ln.File.filter(description="hits from schmidt22 crispra GWS").one()
file.view_lineage()
[figure: data lineage graph of the GWS hits file]

Track notebooks#

In the background, somebody integrated and analyzed the outputs of the app upload and the Cell Ranger pipeline:

# let's add analytics on top of the Cell Ranger pipeline and the phenotypic screen
transform = ln.Transform(
    name="Perform single cell analysis, integrating with CRISPRa screen",
    type="notebook",
)
ln.track(transform)

file_ps = ln.File.filter(description__icontains="perturbseq").one()
adata = file_ps.load()
screen_hits = file_hits.load()
import scanpy as sc

sc.tl.score_genes(adata, adata.var_names.intersection(screen_hits.index).tolist())
filesuffix = "_fig1_score-wgs-hits.png"
sc.pl.umap(adata, color="score", show=False, save=filesuffix)
filepath = f"figures/umap{filesuffix}"
file = ln.File(filepath, key=filepath)
file.save()
filesuffix = "fig2_score-wgs-hits-per-cluster.png"
sc.pl.matrixplot(
    adata, groupby="cluster_name", var_names=["score"], show=False, save=filesuffix
)
filepath = f"figures/matrixplot_{filesuffix}"
file = ln.File(filepath, key=filepath)
file.save()
🌱 saved: Transform(id='Z90ZSejWZMC7z8', name='Perform single cell analysis, integrating with CRISPRa screen', stem_id='Z90ZSejWZMC7', version='0', type='notebook', updated_at=2023-08-11 19:27:44, created_by_id='bKeW4T6E')
🌱 saved: Run(id='3OE6IIIiw5gIebjDe5zO', run_at=2023-08-11 19:27:44, transform_id='Z90ZSejWZMC7z8', created_by_id='bKeW4T6E')
💡 adding file ROHfy0uWmsPg5rQykcNQ as input for run 3OE6IIIiw5gIebjDe5zO, adding parent transform WwwqvcwrXoqA0b
💡 adding file 64t0tY0Ix8tZ2hlhQcY5 as input for run 3OE6IIIiw5gIebjDe5zO, adding parent transform UGEF7DB1wS5yz8
WARNING: saving figure to file figures/umap_fig1_score-wgs-hits.png
💡 file will be copied to default storage upon `save()` with key 'figures/umap_fig1_score-wgs-hits.png'
🌱 storing file 'ANH24ERND4PhXGu8Aphg' with key 'figures/umap_fig1_score-wgs-hits.png'
WARNING: saving figure to file figures/matrixplot_fig2_score-wgs-hits-per-cluster.png
💡 file will be copied to default storage upon `save()` with key 'figures/matrixplot_fig2_score-wgs-hits-per-cluster.png'
🌱 storing file 'PZTirw08DveFsPTpD4AJ' with key 'figures/matrixplot_fig2_score-wgs-hits-per-cluster.png'

The outcome is a few figures stored as image files. Let's query one of them and look at its data lineage:

file = ln.File.filter(key__contains="figures/matrixplot").one()
file.view_lineage()
[figure: data lineage graph of the matrixplot figure file]

We’d now like to track the current Jupyter notebook to continue the work:

ln.track()
🌱 saved: Transform(id='1LCd8kco9lZUz8', name='Bird's eye view', short_name='birds-eye', stem_id='1LCd8kco9lZU', version='0', type=notebook, updated_at=2023-08-11 19:27:46, created_by_id='bKeW4T6E')
🌱 saved: Run(id='QyOZK34ueVMeCQFrsen8', run_at=2023-08-11 19:27:46, transform_id='1LCd8kco9lZUz8', created_by_id='bKeW4T6E')

Let’s load the image file:

file.stage()
💡 adding file PZTirw08DveFsPTpD4AJ as input for run QyOZK34ueVMeCQFrsen8, adding parent transform Z90ZSejWZMC7z8
PosixPath('/home/runner/work/lamin-usecases/lamin-usecases/docs/mydata/figures/matrixplot_fig2_score-wgs-hits-per-cluster.png')

We see that the image file is tracked as an input of the current notebook. The input is highlighted; the notebook follows at the bottom:

file.view_lineage()
[figure: data lineage graph with the input file highlighted and the current notebook at the bottom]

We can also look purely at the sequence of transforms:

transform = ln.Transform.search("Track data lineage", return_queryset=True).first()
transform.parents.df()
name short_name stem_id version type reference updated_at created_by_id
id
QAvEbyFGu1Diz8 Chromium 10x upload None QAvEbyFGu1Di 0 pipeline None 2023-08-11 19:27:41 DzTjkKse
transform.view_parents()
[figure: parent transforms graph]

And if you or another user re-runs a notebook, its parent transforms are reported in the logging:

ln.track()
✅ loaded: Transform(id='1LCd8kco9lZUz8', name='Bird's eye view', short_name='birds-eye', stem_id='1LCd8kco9lZU', version='0', type='notebook', updated_at=2023-08-11 19:27:46, created_by_id='bKeW4T6E')
✅ loaded: Run(id='QyOZK34ueVMeCQFrsen8', run_at=2023-08-11 19:27:46, transform_id='1LCd8kco9lZUz8', created_by_id='bKeW4T6E')
💡   parent transform: Transform(id='Z90ZSejWZMC7z8', name='Perform single cell analysis, integrating with CRISPRa screen', stem_id='Z90ZSejWZMC7', version='0', type='notebook', updated_at=2023-08-11 19:27:46, created_by_id='bKeW4T6E')

Understand runs#

Under the hood, we already tracked pipeline and notebook runs through the global context: context.run.

You can see this most easily by looking at the File.run attribute (in addition to File.transform).

File objects are the inputs and outputs of such runs.
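This relationship can also be inspected from the run side. The sketch below is not executed in this guide; it assumes `Run` exposes `inputs` and `outputs` as the reverse accessors of `File.input_of` and `File.run` — treat the exact accessor names as an assumption.

```python
# sketch (not executed here): inspect a run's files from the run side
# `run.inputs` / `run.outputs` are assumed reverse accessors of
# `File.input_of` / `File.run`
run = ln.Run.filter(transform__name="Cell Ranger").first()
run.inputs.df()   # files that were staged/loaded during this run
run.outputs.df()  # files that were saved during this run
```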

Sometimes, we don’t want to create a global run context but manually pass a run when creating a file:

ln.File(filepath, run=ln.Run(transform=transform))

When a file is accessed (staged, loaded, etc.) from within a transform, two things happen:

  1. The current run is added to the file's input_of

  2. The file's transform is linked as a parent of the current transform

Run outputs are automatically tracked as data sources once you call ln.track(). If needed, you can switch off auto-tracking of run inputs by setting ln.settings.track_run_inputs = False.
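For instance, a bulk download that isn't a true data dependency of the current analysis could be wrapped in this toggle (a sketch against the files created above, not executed in this guide):

```python
# sketch: temporarily disable auto-tracking of run inputs, e.g. for a bulk
# download that shouldn't appear in the lineage graph of the current run
ln.settings.track_run_inputs = False
filepaths = [file.stage() for file in ln.File.filter(key__startswith="fastq/")]
ln.settings.track_run_inputs = True  # restore input tracking
```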

You can also track run inputs on a case-by-case basis by passing is_run_input=True, e.g.:

file.load(is_run_input=True)

Query by provenance#

We can query or search for the notebook that created the file:

transform = ln.Transform.search("GWS CRIPSRa analysis", return_queryset=True).first()

And then find all the files created by that notebook:

ln.File.filter(transform=transform).df()
storage_id key suffix accessor description version initial_version_id size hash hash_type transform_id run_id updated_at created_by_id
id
64t0tY0Ix8tZ2hlhQcY5 42h8wfpM None .parquet DataFrame hits from schmidt22 crispra GWS None None 18368 yw5f-kMLJhaNhdEF-lhxOQ md5 UGEF7DB1wS5yz8 8y9cnKPY9VRxRBsyQuyR 2023-08-11 19:27:44 bKeW4T6E

Which transform ingested a given file?

file = ln.File.filter().first()
file.transform
Transform(id='QAvEbyFGu1Diz8', name='Chromium 10x upload', stem_id='QAvEbyFGu1Di', version='0', type='pipeline', updated_at=2023-08-11 19:27:41, created_by_id='DzTjkKse')

And which user?

file.created_by
User(id='DzTjkKse', handle='testuser1', email='testuser1@lamin.ai', name='Test User1', updated_at=2023-08-11 19:27:36)

Which transforms were created by a given user?

users = ln.User.lookup(field="handle")
ln.Transform.filter(created_by=users.testuser2).df()
name short_name stem_id version type reference updated_at created_by_id
id
TMiTPuhNmVNDsM Cell Ranger None TMiTPuhNmVND 7.2.0 pipeline None 2023-08-11 19:27:42 bKeW4T6E
WwwqvcwrXoqA0b Preprocess Cell Ranger outputs None WwwqvcwrXoqA 2.0 pipeline None 2023-08-11 19:27:43 bKeW4T6E
UGEF7DB1wS5yz8 GWS CRIPSRa analysis None UGEF7DB1wS5y 0 notebook None 2023-08-11 19:27:44 bKeW4T6E
Z90ZSejWZMC7z8 Perform single cell analysis, integrating with... None Z90ZSejWZMC7 0 notebook None 2023-08-11 19:27:46 bKeW4T6E
1LCd8kco9lZUz8 Bird's eye view birds-eye 1LCd8kco9lZU 0 notebook None 2023-08-11 19:27:46 bKeW4T6E

Which notebooks were created by a given user?

ln.Transform.filter(created_by=users.testuser2, type="notebook").df()
name short_name stem_id version type reference updated_at created_by_id
id
UGEF7DB1wS5yz8 GWS CRIPSRa analysis None UGEF7DB1wS5y 0 notebook None 2023-08-11 19:27:44 bKeW4T6E
Z90ZSejWZMC7z8 Perform single cell analysis, integrating with... None Z90ZSejWZMC7 0 notebook None 2023-08-11 19:27:46 bKeW4T6E
1LCd8kco9lZUz8 Bird's eye view birds-eye 1LCd8kco9lZU 0 notebook None 2023-08-11 19:27:46 bKeW4T6E

And of course, we can also view all recent additions to the entire database:

ln.view()
File

storage_id key suffix accessor description version initial_version_id size hash hash_type transform_id run_id updated_at created_by_id
id
PZTirw08DveFsPTpD4AJ 42h8wfpM figures/matrixplot_fig2_score-wgs-hits-per-clu... .png None None None None 28814 JYIPcat0YWYVCX3RVd3mww md5 Z90ZSejWZMC7z8 3OE6IIIiw5gIebjDe5zO 2023-08-11 19:27:46 bKeW4T6E
ANH24ERND4PhXGu8Aphg 42h8wfpM figures/umap_fig1_score-wgs-hits.png .png None None None None 118999 laQjVk4gh70YFzaUyzbUNg md5 Z90ZSejWZMC7z8 3OE6IIIiw5gIebjDe5zO 2023-08-11 19:27:45 bKeW4T6E
64t0tY0Ix8tZ2hlhQcY5 42h8wfpM None .parquet DataFrame hits from schmidt22 crispra GWS None None 18368 yw5f-kMLJhaNhdEF-lhxOQ md5 UGEF7DB1wS5yz8 8y9cnKPY9VRxRBsyQuyR 2023-08-11 19:27:44 bKeW4T6E
tTsErZDwt4R0kR46obas 42h8wfpM schmidt22-crispra-gws-IFNG.csv .csv None Raw data of schmidt22 crispra GWS None None 1729685 cUSH0oQ2w-WccO8_ViKRAQ md5 IrONndjTzLqGz8 fqeK1qSPn64Bnm8OkAT6 2023-08-11 19:27:43 DzTjkKse
ROHfy0uWmsPg5rQykcNQ 42h8wfpM schmidt22_perturbseq.h5ad .h5ad AnnData perturbseq counts None None 20659936 la7EvqEUMDlug9-rpw-udA md5 WwwqvcwrXoqA0b 9HtSWlTYrHTMwE35Fu9W 2023-08-11 19:27:43 bKeW4T6E
nriiOWq70kQqYK2uJokd 42h8wfpM perturbseq/filtered_feature_bc_matrix/barcodes... .tsv.gz None None None None 6 BZn31jO5JAs14hCBvlg8Ug md5 TMiTPuhNmVNDsM b9qsjzeoFJM1fKmzpc5Z 2023-08-11 19:27:42 bKeW4T6E
DWi2Nc3Shn9EdkUNczjK 42h8wfpM perturbseq/filtered_feature_bc_matrix/features... .tsv.gz None None None None 6 p7PDJt6dKvyzCO8N2_wTWA md5 TMiTPuhNmVNDsM b9qsjzeoFJM1fKmzpc5Z 2023-08-11 19:27:42 bKeW4T6E
jrTKqw0BBHztnEwsmc3P 42h8wfpM perturbseq/filtered_feature_bc_matrix/matrix.m... .mtx.gz None None None None 6 WMV3_UVPqYmDAhFKEDPCsQ md5 TMiTPuhNmVNDsM b9qsjzeoFJM1fKmzpc5Z 2023-08-11 19:27:42 bKeW4T6E
DmRDf1414CVge87NB5mE 42h8wfpM fastq/perturbseq_R2_001.fastq.gz .fastq.gz None None None None 6 tp4y2I4sOOoY1xgTIaYRrQ md5 QAvEbyFGu1Diz8 pVs6CrtjyZGBhk9ORAhX 2023-08-11 19:27:41 DzTjkKse
j6lqMHdqFiLC6RogQG5f 42h8wfpM fastq/perturbseq_R1_001.fastq.gz .fastq.gz None None None None 6 mtgKPNggR2xx7y6XNHVUSQ md5 QAvEbyFGu1Diz8 pVs6CrtjyZGBhk9ORAhX 2023-08-11 19:27:41 DzTjkKse
Run

transform_id run_at created_by_id reference reference_type
id
pVs6CrtjyZGBhk9ORAhX QAvEbyFGu1Diz8 2023-08-11 19:27:41 DzTjkKse None None
b9qsjzeoFJM1fKmzpc5Z TMiTPuhNmVNDsM 2023-08-11 19:27:42 bKeW4T6E None None
9HtSWlTYrHTMwE35Fu9W WwwqvcwrXoqA0b 2023-08-11 19:27:42 bKeW4T6E None None
fqeK1qSPn64Bnm8OkAT6 IrONndjTzLqGz8 2023-08-11 19:27:43 DzTjkKse None None
8y9cnKPY9VRxRBsyQuyR UGEF7DB1wS5yz8 2023-08-11 19:27:44 bKeW4T6E None None
3OE6IIIiw5gIebjDe5zO Z90ZSejWZMC7z8 2023-08-11 19:27:44 bKeW4T6E None None
QyOZK34ueVMeCQFrsen8 1LCd8kco9lZUz8 2023-08-11 19:27:46 bKeW4T6E None None
Storage

root type region updated_at created_by_id
id
42h8wfpM /home/runner/work/lamin-usecases/lamin-usecase... local None 2023-08-11 19:27:39 bKeW4T6E
Transform

name short_name stem_id version type reference updated_at created_by_id
id
1LCd8kco9lZUz8 Bird's eye view birds-eye 1LCd8kco9lZU 0 notebook None 2023-08-11 19:27:46 bKeW4T6E
Z90ZSejWZMC7z8 Perform single cell analysis, integrating with... None Z90ZSejWZMC7 0 notebook None 2023-08-11 19:27:46 bKeW4T6E
UGEF7DB1wS5yz8 GWS CRIPSRa analysis None UGEF7DB1wS5y 0 notebook None 2023-08-11 19:27:44 bKeW4T6E
IrONndjTzLqGz8 Upload GWS CRISPRa result None IrONndjTzLqG 0 app None 2023-08-11 19:27:43 DzTjkKse
WwwqvcwrXoqA0b Preprocess Cell Ranger outputs None WwwqvcwrXoqA 2.0 pipeline None 2023-08-11 19:27:43 bKeW4T6E
TMiTPuhNmVNDsM Cell Ranger None TMiTPuhNmVND 7.2.0 pipeline None 2023-08-11 19:27:42 bKeW4T6E
QAvEbyFGu1Diz8 Chromium 10x upload None QAvEbyFGu1Di 0 pipeline None 2023-08-11 19:27:41 DzTjkKse
User

handle email name updated_at
id
bKeW4T6E testuser2 testuser2@lamin.ai Test User2 2023-08-11 19:27:39
DzTjkKse testuser1 testuser1@lamin.ai Test User1 2023-08-11 19:27:36
!lamin login testuser1
!lamin delete mydata
!rm -r ./mydata
✅ logged in with email testuser1@lamin.ai and id DzTjkKse
💡 deleting instance testuser1/mydata
✅     deleted instance settings file: /home/runner/.lamin/instance--testuser1--mydata.env
✅     instance cache deleted
✅     deleted '.lndb' sqlite file
🔶     consider manually delete your stored data: /home/runner/work/lamin-usecases/lamin-usecases/docs/mydata