Curate & link bulk RNA-seq datasets#
Show code cell content
!lamin init --storage test-bulkrna --schema bionty
💡 creating schemas: core==0.45.0 bionty==0.29.2
🌱 saved: User(id='DzTjkKse', handle='testuser1', email='testuser1@lamin.ai', name='Test User1', updated_at=2023-08-11 19:25:22)
🌱 saved: Storage(id='LIYt3lli', root='/home/runner/work/lamin-usecases/lamin-usecases/docs/test-bulkrna', type='local', updated_at=2023-08-11 19:25:22, created_by_id='DzTjkKse')
✅ loaded instance: testuser1/test-bulkrna
💡 did not register local instance on hub (if you want, call `lamin register`)
import lamindb as ln
from pathlib import Path
import lnschema_bionty as lb
import pandas as pd
import anndata as ad
✅ loaded instance: testuser1/test-bulkrna (lamindb 0.50.2)
Let’s start simulating the output of a bulk RNA-seq pipeline:
Show code cell content
# pretned we're running a bulk RNA-seq pipeline
ln.track(ln.Transform(name="nf-core RNA-seq", reference="https://nf-co.re/rnaseq"))
# create a directory for its output
Path("./test-bulkrna/output_dir").mkdir(exist_ok=True)
# get the count matrix
path = ln.dev.datasets.file_tsv_rnaseq_nfcore_salmon_merged_gene_counts()
# move it into the output directory
path = path.rename(f"./test-bulkrna/output_dir/{path.name}")
# register it
ln.File(path, description="Merged Bulk RNA counts").save()
🌱 saved: Transform(id='S30ZfPE1PZDCz8', name='nf-core RNA-seq', stem_id='S30ZfPE1PZDC', version='0', type=notebook, reference='https://nf-co.re/rnaseq', updated_at=2023-08-11 19:25:23, created_by_id='DzTjkKse')
🌱 saved: Run(id='NH7C3WAkR9bPtamb3HAK', run_at=2023-08-11 19:25:23, transform_id='S30ZfPE1PZDCz8', created_by_id='DzTjkKse')
🔶 file has more than two suffixes (path.suffixes), using only last suffix: '.tsv'
💡 file in storage '/home/runner/work/lamin-usecases/lamin-usecases/docs/test-bulkrna' with key 'output_dir/salmon.merged.gene_counts.tsv'
Curate the count matrix#
ln.track()
💡 notebook imports: anndata==0.9.2 lamindb==0.50.2 lnschema_bionty==0.29.2 pandas==1.5.3
🌱 saved: Transform(id='s5V0dNMVwL9iz8', name='Curate & link bulk RNA-seq datasets', short_name='bulkrna', stem_id='s5V0dNMVwL9i', version='0', type=notebook, updated_at=2023-08-11 19:25:24, created_by_id='DzTjkKse')
🌱 saved: Run(id='Ifw1GtrriFLtURYORqeq', run_at=2023-08-11 19:25:24, transform_id='s5V0dNMVwL9iz8', created_by_id='DzTjkKse')
Let’s query the file:
file = ln.File.filter(description="Merged Bulk RNA counts").one()
df = file.load()
💡 adding file 5mEZixsi2gpMwy6LRJ7m as input for run Ifw1GtrriFLtURYORqeq, adding parent transform S30ZfPE1PZDCz8
If we look at it, we realize it deviates far from the tidy data standard [Wickham14], conventions of statistics & machine learning [Hastie09, Murphy12] and the major Python & R data packages.
Variables aren’t in columns and observations aren’t in rows:
df
gene_id | gene_name | RAP1_IAA_30M_REP1 | RAP1_UNINDUCED_REP1 | RAP1_UNINDUCED_REP2 | WT_REP1 | WT_REP2 | |
---|---|---|---|---|---|---|---|
0 | Gfp_transgene_gene | Gfp_transgene_gene | 0.0 | 0.000 | 0.0 | 0.0 | 0.0 |
1 | HRA1 | HRA1 | 0.0 | 8.572 | 0.0 | 0.0 | 0.0 |
2 | snR18 | snR18 | 3.0 | 8.000 | 4.0 | 8.0 | 8.0 |
3 | tA(UGC)A | TGA1 | 0.0 | 0.000 | 0.0 | 0.0 | 0.0 |
4 | tL(CAA)A | SUP56 | 0.0 | 0.000 | 0.0 | 0.0 | 0.0 |
... | ... | ... | ... | ... | ... | ... | ... |
120 | YAR064W | YAR064W | 0.0 | 2.000 | 0.0 | 0.0 | 0.0 |
121 | YAR066W | YAR066W | 3.0 | 13.000 | 8.0 | 5.0 | 11.0 |
122 | YAR068W | YAR068W | 9.0 | 28.000 | 24.0 | 5.0 | 7.0 |
123 | YAR069C | YAR069C | 0.0 | 0.000 | 0.0 | 0.0 | 1.0 |
124 | YAR070C | YAR070C | 0.0 | 0.000 | 0.0 | 0.0 | 0.0 |
125 rows × 7 columns
Let’s change that and move observations into rows:
df = df.T
df
0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 115 | 116 | 117 | 118 | 119 | 120 | 121 | 122 | 123 | 124 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
gene_id | Gfp_transgene_gene | HRA1 | snR18 | tA(UGC)A | tL(CAA)A | tP(UGG)A | tS(AGA)A | YAL001C | YAL002W | YAL003W | ... | YAR050W | YAR053W | YAR060C | YAR061W | YAR062W | YAR064W | YAR066W | YAR068W | YAR069C | YAR070C |
gene_name | Gfp_transgene_gene | HRA1 | snR18 | TGA1 | SUP56 | TRN1 | tS(AGA)A | TFC3 | VPS8 | EFB1 | ... | FLO1 | YAR053W | YAR060C | YAR061W | YAR062W | YAR064W | YAR066W | YAR068W | YAR069C | YAR070C |
RAP1_IAA_30M_REP1 | 0.0 | 0.0 | 3.0 | 0.0 | 0.0 | 0.0 | 1.0 | 55.0 | 36.0 | 632.0 | ... | 4.357 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 3.0 | 9.0 | 0.0 | 0.0 |
RAP1_UNINDUCED_REP1 | 0.0 | 8.572 | 8.0 | 0.0 | 0.0 | 0.0 | 0.0 | 72.0 | 33.0 | 810.0 | ... | 15.72 | 0.0 | 0.0 | 0.0 | 3.0 | 2.0 | 13.0 | 28.0 | 0.0 | 0.0 |
RAP1_UNINDUCED_REP2 | 0.0 | 0.0 | 4.0 | 0.0 | 0.0 | 0.0 | 0.0 | 115.0 | 82.0 | 1693.0 | ... | 13.772 | 0.0 | 4.0 | 0.0 | 2.0 | 0.0 | 8.0 | 24.0 | 0.0 | 0.0 |
WT_REP1 | 0.0 | 0.0 | 8.0 | 0.0 | 0.0 | 1.0 | 0.0 | 60.0 | 63.0 | 1115.0 | ... | 13.465 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 5.0 | 5.0 | 0.0 | 0.0 |
WT_REP2 | 0.0 | 0.0 | 8.0 | 0.0 | 0.0 | 0.0 | 0.0 | 30.0 | 25.0 | 704.0 | ... | 6.891 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 11.0 | 7.0 | 1.0 | 0.0 |
7 rows × 125 columns
Now, it’s clear that the first two rows are in fact no observations, but descriptions of the variables (or features) themselves.
Let’s create an AnnData object to model this. First, create a dataframe for the variables:
var = pd.DataFrame({"gene_name": df.loc["gene_name"].values}, index=df.loc["gene_id"])
var.head()
gene_name | |
---|---|
gene_id | |
Gfp_transgene_gene | Gfp_transgene_gene |
HRA1 | HRA1 |
snR18 | snR18 |
tA(UGC)A | TGA1 |
tL(CAA)A | SUP56 |
Now, let’s create an AnnData:
# we're also fixing the datatype here, which was string in the tsv
adata = ad.AnnData(df.iloc[2:].astype("float32"), var=var)
adata
AnnData object with n_obs × n_vars = 5 × 125
var: 'gene_name'
The AnnData object is in tidy form and complies with conventions of statistics and machine learning:
adata.to_df()
gene_id | Gfp_transgene_gene | HRA1 | snR18 | tA(UGC)A | tL(CAA)A | tP(UGG)A | tS(AGA)A | YAL001C | YAL002W | YAL003W | ... | YAR050W | YAR053W | YAR060C | YAR061W | YAR062W | YAR064W | YAR066W | YAR068W | YAR069C | YAR070C |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
RAP1_IAA_30M_REP1 | 0.0 | 0.000 | 3.0 | 0.0 | 0.0 | 0.0 | 1.0 | 55.0 | 36.0 | 632.0 | ... | 4.357 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 3.0 | 9.0 | 0.0 | 0.0 |
RAP1_UNINDUCED_REP1 | 0.0 | 8.572 | 8.0 | 0.0 | 0.0 | 0.0 | 0.0 | 72.0 | 33.0 | 810.0 | ... | 15.720 | 0.0 | 0.0 | 0.0 | 3.0 | 2.0 | 13.0 | 28.0 | 0.0 | 0.0 |
RAP1_UNINDUCED_REP2 | 0.0 | 0.000 | 4.0 | 0.0 | 0.0 | 0.0 | 0.0 | 115.0 | 82.0 | 1693.0 | ... | 13.772 | 0.0 | 4.0 | 0.0 | 2.0 | 0.0 | 8.0 | 24.0 | 0.0 | 0.0 |
WT_REP1 | 0.0 | 0.000 | 8.0 | 0.0 | 0.0 | 1.0 | 0.0 | 60.0 | 63.0 | 1115.0 | ... | 13.465 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 5.0 | 5.0 | 0.0 | 0.0 |
WT_REP2 | 0.0 | 0.000 | 8.0 | 0.0 | 0.0 | 0.0 | 0.0 | 30.0 | 25.0 | 704.0 | ... | 6.891 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 11.0 | 7.0 | 1.0 | 0.0 |
5 rows × 125 columns
Validate & register the count matrix#
Let’s create a File object from this AnnData. Because this will validate the gene IDs and these are only defined given a species, we have to set a species context:
lb.settings.species = "saccharomyces cerevisiae"
🌱 set species: Species(id='nn8c', name='saccharomyces cerevisiae', taxon_id=559292, scientific_name='saccharomyces_cerevisiae', updated_at=2023-08-11 19:25:24, bionty_source_id='xvRa', created_by_id='DzTjkKse')
Almost all gene IDs are validated:
curated_file = ln.File.from_anndata(
adata, description="Curated bulk RNA counts", var_ref=lb.Gene.stable_id
)
💡 file will be copied to default storage upon `save()` with key 'MuPmvyO8u5fTHrH3PWqY.h5ad'
💡 parsing feature names of X stored in slot 'var'
💡 using global setting species = saccharomyces cerevisiae
✅ validated 123 Gene records from Bionty on stable_id: HRA1, snR18, tA(UGC)A, tL(CAA)A, tP(UGG)A, tS(AGA)A, YAL001C, YAL002W, YAL003W, YAL004W, YAL005C, YAL007C, YAL008W, YAL009W, YAL010C, YAL011W, YAL012W, YAL013W, YAL014C, YAL015C, ...
🔶 did not validate 2 Gene records for stable_ids: Gfp_transgene_gene, YAR062W
🔶 ignoring non-validated features: Gfp_transgene_gene,YAR062W
🌱 linked: FeatureSet(id='eVgFMsgWVGxTCcCMeogu', n=123, type='float', registry='bionty.Gene', hash='8j8y_AHnWb5huZ2hXCDj', created_by_id='DzTjkKse')
Hence, let’s save this file:
curated_file.save()
🌱 saved 1 feature set for slot: ['var']
🌱 storing file 'MuPmvyO8u5fTHrH3PWqY' with key '.lamindb/MuPmvyO8u5fTHrH3PWqY.h5ad'
Example queries#
We have two files in the file registry:
ln.File.filter().df()
storage_id | key | suffix | accessor | description | version | initial_version_id | size | hash | hash_type | transform_id | run_id | updated_at | created_by_id | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
id | ||||||||||||||
5mEZixsi2gpMwy6LRJ7m | LIYt3lli | output_dir/salmon.merged.gene_counts.tsv | .tsv | None | Merged Bulk RNA counts | None | None | 3787 | xxw0k3au3KtxFcgtbEr4eQ | md5 | S30ZfPE1PZDCz8 | NH7C3WAkR9bPtamb3HAK | 2023-08-11 19:25:23 | DzTjkKse |
MuPmvyO8u5fTHrH3PWqY | LIYt3lli | None | .h5ad | AnnData | Curated bulk RNA counts | None | None | 28180 | 6bieh8XjOCCz6bJToN4u1g | md5 | s5V0dNMVwL9iz8 | Ifw1GtrriFLtURYORqeq | 2023-08-11 19:25:25 | DzTjkKse |
curated_file.view_lineage()
Let’s by query by gene: …
Show code cell content
!lamin delete test-bulkrna
!rm -r test-bulkrna
💡 deleting instance testuser1/test-bulkrna
✅ deleted instance settings file: /home/runner/.lamin/instance--testuser1--test-bulkrna.env
✅ instance cache deleted
✅ deleted '.lndb' sqlite file
🔶 consider manually delete your stored data: /home/runner/work/lamin-usecases/lamin-usecases/docs/test-bulkrna