Curate & link multi-modal data
!lamin init --storage ./test-multimodal --schema bionty
💡 creating schemas: core==0.45.0 bionty==0.29.2
🌱 saved: User(id='DzTjkKse', handle='testuser1', email='testuser1@lamin.ai', name='Test User1', updated_at=2023-08-11 19:26:36)
🌱 saved: Storage(id='BtPvJynt', root='/home/runner/work/lamin-usecases/lamin-usecases/docs/test-multimodal', type='local', updated_at=2023-08-11 19:26:36, created_by_id='DzTjkKse')
✅ loaded instance: testuser1/test-multimodal
💡 did not register local instance on hub (if you want, call `lamin register`)
import lamindb as ln
import lnschema_bionty as lb
lb.settings.species = "human"
ln.settings.verbosity = 3
✅ loaded instance: testuser1/test-multimodal (lamindb 0.50.2)
🌱 set species: Species(id='uHJU', name='human', taxon_id=9606, scientific_name='homo_sapiens', updated_at=2023-08-11 19:26:37, bionty_source_id='UqVY', created_by_id='DzTjkKse')
ln.track()
💡 notebook imports: lamindb==0.50.2 lnschema_bionty==0.29.2
🌱 saved: Transform(id='yMWSFirS6qv2z8', name='Curate & link multi-modal data', short_name='multimodal', stem_id='yMWSFirS6qv2', version='0', type=notebook, updated_at=2023-08-11 19:26:37, created_by_id='DzTjkKse')
🌱 saved: Run(id='km3kD5EWDNAXZcGoPvTT', run_at=2023-08-11 19:26:37, transform_id='yMWSFirS6qv2z8', created_by_id='DzTjkKse')
MuData object
Let’s use a MuData object, a sub-sampled version of the Papalexi21 dataset:
mdata = ln.dev.datasets.mudata_papalexi21_subset()
mdata
MuData object with n_obs × n_vars = 200 × 300
  var: 'name'
  4 modalities
    rna: 200 x 173
      obs: 'orig.ident', 'nCount_RNA', 'nFeature_RNA', 'nCount_HTO', 'nFeature_HTO', 'nCount_GDO', 'nCount_ADT', 'nFeature_ADT', 'percent.mito', 'MULTI_ID', 'HTO_classification', 'guide_ID', 'gene_target', 'NT', 'perturbation', 'replicate', 'S.Score', 'G2M.Score', 'Phase'
      var: 'name'
    adt: 200 x 4
      obs: 'orig.ident', 'nCount_RNA', 'nFeature_RNA', 'nCount_HTO', 'nFeature_HTO', 'nCount_GDO', 'nCount_ADT', 'nFeature_ADT', 'percent.mito', 'MULTI_ID', 'HTO_classification', 'guide_ID', 'gene_target', 'NT', 'perturbation', 'replicate', 'S.Score', 'G2M.Score', 'Phase'
      var: 'name'
    hto: 200 x 12
      obs: 'orig.ident', 'nCount_RNA', 'nFeature_RNA', 'nCount_HTO', 'nFeature_HTO', 'nCount_GDO', 'nCount_ADT', 'nFeature_ADT', 'percent.mito', 'MULTI_ID', 'HTO_classification', 'guide_ID', 'gene_target', 'NT', 'perturbation', 'replicate', 'S.Score', 'G2M.Score', 'Phase'
      var: 'name'
    gdo: 200 x 111
      obs: 'orig.ident', 'nCount_RNA', 'nFeature_RNA', 'nCount_HTO', 'nFeature_HTO', 'nCount_GDO', 'nCount_ADT', 'nFeature_ADT', 'percent.mito', 'MULTI_ID', 'HTO_classification', 'guide_ID', 'gene_target', 'NT', 'perturbation', 'replicate', 'S.Score', 'G2M.Score', 'Phase'
      var: 'name'
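The shapes in this summary are internally consistent: in this dataset every modality covers the same 200 observations, and the per-modality variable counts add up to the 300 variables of the container. A minimal sketch in plain Python, with the shapes copied from the output above (the name `modality_shapes` is illustrative and not part of any library):

```python
# Shapes copied from the MuData summary above: modality -> (n_obs, n_vars).
modality_shapes = {
    "rna": (200, 173),
    "adt": (200, 4),
    "hto": (200, 12),
    "gdo": (200, 111),
}

# Collect the distinct observation counts across modalities.
n_obs_values = {n_obs for n_obs, _ in modality_shapes.values()}

# Sum the per-modality variable counts.
total_n_vars = sum(n_vars for _, n_vars in modality_shapes.values())

print(n_obs_values, total_n_vars)
```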
First we register the file:
file = ln.File(
    "papalexi21_subset.h5mu", description="Sub-sampled MuData from Papalexi21"
)
file.save()
🌱 storing file 'WPisD6P743VBQflZHvgn' with key '.lamindb/WPisD6P743VBQflZHvgn.h5mu'
Register features
Now let’s register three feature sets for this data (the hto and gdo modalities are not registered here):
rna
adt
obs (metadata)
modalities
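For the rna and adt modalities, we validate the variable names against Bionty registries: gene symbols (lb.Gene.symbol) for rna and cell marker names (lb.CellMarker.name) for adt: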
mdata["rna"].var_names[:5]
Index(['RP5-827C21.6', 'XX-CR54.1', 'SH2D6', 'RP11-379B18.5', 'RP11-778D9.12'], dtype='object', name='index')
feature_set_rna = ln.FeatureSet.from_values(
    mdata["rna"].var_names, field=lb.Gene.symbol
)
💡 using global setting species = human
✅ validated 93 Gene records from Bionty on symbol: SH2D6, ARHGAP26-AS1, GABRA1, HLA-DQB1-AS1, HLA-DQB1-AS1, HLA-DQB1-AS1, HLA-DQB1-AS1, HLA-DQB1-AS1, HLA-DQB1-AS1, HLA-DQB1-AS1, SPACA1, VNN1, CTAGE15, CTAGE15, PFKFB1, TRPC5, RBPMS-AS1, CA8, CSMD3, ZNF483, ...
🔶 ambiguous validation in Bionty for 11 records: HLA-DQB1-AS1, CTAGE15, CRYAB, CTRB2, LGALS9C, NPHS1, THPO, PCDHB11, XG, TBC1D3G, TUBB1
🔶 did not validate 96 Gene records for symbols: AC002066.1, AC004019.13, AC005150.1, AC006042.7, AC011558.5, AC026471.6, AC073934.6, AC091132.1, AC092295.4, AC092687.5, AE000662.93, AL132989.1, AP000442.4, AP003419.16, C14orf177, C1orf65, CASC1, CTA-373H7.7, CTB-134F13.1, CTB-31O20.9, ...
🔶 ignoring non-validated features: AC002066.1,AC004019.13,AC005150.1,AC006042.7,AC011558.5,AC026471.6,AC073934.6,AC091132.1,AC092295.4,AC092687.5,AE000662.93,AL132989.1,AP000442.4,AP003419.16,C14orf177,C1orf65,CASC1,CTA-373H7.7,CTB-134F13.1,CTB-31O20.9,CTC-467M3.1,CTC-498J12.1,CTD-2562J17.2,CTD-3012A18.1,CTD-3065B20.2,CTD-3193O13.8,FAM65C,HIST1H4K,IBA57-AS1,KIAA1239,LARGE,NBPF16,RP1-1J6.2,RP11-110I1.14,RP11-113K21.4,RP11-120C12.3,RP11-12D24.10,RP11-12J10.3,RP11-134K13.4,RP11-136I14.5,RP11-138C9.1,RP11-146I2.1,RP11-152H18.3,RP11-17J14.2,RP11-186N15.3,RP11-187A9.3,RP11-214C8.2,RP11-219B4.7,RP11-231C14.4,RP11-235C23.5,RP11-247A12.8,RP11-265N6.2,RP11-268G12.1,RP11-2H8.2,RP11-304L19.11,RP11-307N16.6,RP11-324E6.9,RP11-325L7.1,RP11-32B5.7,RP11-335O4.3,RP11-346D14.1,RP11-365N19.2,RP11-379B18.5,RP11-3D4.2,RP11-403P17.5,RP11-408E5.4,RP11-415J8.5,RP11-434D9.1,RP11-465N4.4,RP11-473O4.4,RP11-496I9.1,RP11-524H19.2,RP11-532F6.4,RP11-536K7.5,RP11-546K22.3,RP11-624M8.1,RP11-703G6.1,RP11-717H13.1,RP11-745O10.2,RP11-75C10.9,RP11-760N9.1,RP11-778D9.12,RP11-80H5.7,RP11-835E18.5,RP11-867G23.3,RP11-973N13.4,RP11-982M15.2,RP11-9M16.2,RP13-582O9.7,RP3-327A19.5,RP3-337O18.9,RP5-827C21.6,RP5-855F16.1,TMEM75,U52111.14,XX-CR54.1
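Conceptually, the log above reports a partition of the input symbols into those found in the reference and those not found; only the former end up in the feature set. A minimal sketch of that idea in plain Python, assuming a tiny hypothetical reference set. This is not lamindb's implementation, and `partition_by_reference` and `reference_symbols` are illustrative names:

```python
def partition_by_reference(values, reference):
    """Split values into (validated, not_validated), preserving input order."""
    validated = [v for v in values if v in reference]
    not_validated = [v for v in values if v not in reference]
    return validated, not_validated


# First five rna variable names shown above.
var_names = ["RP5-827C21.6", "XX-CR54.1", "SH2D6", "RP11-379B18.5", "RP11-778D9.12"]

# Hypothetical stand-in for the reference registry of gene symbols.
reference_symbols = {"SH2D6", "GABRA1", "SPACA1"}

validated, not_validated = partition_by_reference(var_names, reference_symbols)
print(validated)
print(not_validated)
```

Of these five names, only SH2D6 lands in the validated group, which agrees with the log: the other four appear in the list of ignored features.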
mdata["adt"].var_names
Index(['CD86', 'PDL1', 'PDL2', 'CD366'], dtype='object', name='index')
feature_set_adt = ln.FeatureSet.from_values(
    mdata["adt"].var_names, field=lb.CellMarker.name
)
💡 using global setting species = human
✅ validated 4 CellMarker records from Bionty on name: CD86, PDL1, PDL2, CD366
Link both feature sets to the file, using the modality name as the slot:
file.features.add_feature_set(feature_set_rna, slot="rna")
file.features.add_feature_set(feature_set_adt, slot="adt")
metadata
The third feature set describes the obs metadata. All modalities carry the same obs columns, so we read them from rna:
obs = mdata["rna"].obs
We’re only interested in a single metadata column, gene_target, so it is the only feature we register before building the feature set from obs:
ln.Feature(name="gene_target", type="category").save()
feature_set_obs = ln.FeatureSet.from_df(obs, "metadata")
✅ validated 1 Feature record on name: gene_target
🔶 did not validate 18 Feature records for names: G2M.Score, HTO_classification, MULTI_ID, NT, Phase, S.Score, guide_ID, nCount_ADT, nCount_GDO, nCount_HTO, nCount_RNA, nFeature_ADT, nFeature_HTO, nFeature_RNA, orig.ident, percent.mito, perturbation, replicate
🔶 ignoring non-validated features: G2M.Score,HTO_classification,MULTI_ID,NT,Phase,S.Score,guide_ID,nCount_ADT,nCount_GDO,nCount_HTO,nCount_RNA,nFeature_ADT,nFeature_HTO,nFeature_RNA,orig.ident,percent.mito,perturbation,replicate
file.features.add_feature_set(feature_set_obs, slot="obs")
Next, we validate the gene_target values against the Bionty Gene registry, save the resulting records, and link them to the file as labels:
gene_targets = lb.Gene.from_values(obs["gene_target"], "symbol")
ln.save(gene_targets)
file.add_labels(gene_targets)
💡 using global setting species = human
✅ validated 35 Gene records from Bionty on symbol: IFNGR1, IFNGR1, CAV1, IRF7, IRF7, IRF7, ATF2, NFKBIA, NFKBIA, STAT1, STAT1, SPI1, JAK2, JAK2, STAT2, STAT2, IFNGR2, IFNGR2, IFNGR2, CD86, ...
🔶 ambiguous validation in Bionty for 10 records: IFNGR1, IRF7, NFKBIA, STAT1, JAK2, STAT2, IFNGR2, SMAD4, STAT3, TNFRSF14
🔶 did not validate 2 Gene records for symbols: MARCH8, NT
🌱 linked labels 'IFNGR1', 'IFNGR1', 'CAV1', 'IRF7', 'IRF7', 'IRF7', 'ATF2', 'NFKBIA', 'NFKBIA', 'STAT1', 'STAT1', 'SPI1', 'JAK2', 'JAK2', 'STAT2', 'STAT2', 'IFNGR2', 'IFNGR2', 'IFNGR2', 'CD86', 'STAT5A', 'SMAD4', 'SMAD4', 'ETV7', 'IRF1', 'UBE2L6', 'PDCD1LG2', 'BRD4', 'POU2F2', 'STAT3', 'STAT3', 'TNFRSF14', 'TNFRSF14', 'CUL3', 'CMTM6', 'MARCH8', 'NT' to feature 'gene_target', linked feature 'gene_target' to registry 'bionty.Gene'
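The numbers in these log lines can be cross-checked with a few lines of plain Python. The sketch below copies the linked labels from the output above and counts repeated symbols. The variable names are illustrative, and reading a repeated symbol as "one symbol matched more than one Gene record" is an assumption based on the ambiguity warning, not something the log states outright:

```python
from collections import Counter

# Labels copied verbatim from the "linked labels" log line above.
linked_labels = [
    "IFNGR1", "IFNGR1", "CAV1", "IRF7", "IRF7", "IRF7", "ATF2", "NFKBIA",
    "NFKBIA", "STAT1", "STAT1", "SPI1", "JAK2", "JAK2", "STAT2", "STAT2",
    "IFNGR2", "IFNGR2", "IFNGR2", "CD86", "STAT5A", "SMAD4", "SMAD4", "ETV7",
    "IRF1", "UBE2L6", "PDCD1LG2", "BRD4", "POU2F2", "STAT3", "STAT3",
    "TNFRSF14", "TNFRSF14", "CUL3", "CMTM6", "MARCH8", "NT",
]

counts = Counter(linked_labels)

# Symbols that occur more than once, in order of first appearance.
repeated = [symbol for symbol, n in counts.items() if n > 1]

n_records = len(linked_labels)  # all linked records
n_symbols = len(counts)         # distinct symbols
print(n_records, n_symbols, repeated)
```

The repeated symbols are the same 10 that the ambiguity warning lists, and the 37 linked records equal the 35 validated records plus the 2 non-validated symbols (MARCH8 and NT).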
labels = []
for col in ["orig.ident", "perturbation", "replicate", "Phase", "guide_ID"]:
    labels += ln.Label.from_values(obs[col])
🔶 did not validate 8 Label records for names: Lane7, Lane4, Lane2, Lane5, Lane3, Lane8, Lane1, Lane6
🔶 did not validate 2 Label records for names: Perturbed, NT
🔶 did not validate 3 Label records for names: rep3, rep1, rep2
🔶 did not validate 3 Label records for names: G1, S, G2M
🔶 did not validate 78 Label records for names: MARCH8g2, IFNGR1g3, MARCH8g4, CAV1g4, IRF7g1, ATF2g1, NFKBIAg2, STAT1g2, SPI1g1, JAK2g3, NTg7, IFNGR1g4, NTg1, STAT2g2, IFNGR2g2, CD86g2, IFNGR2g1, STAT5Ag2, IFNGR1g2, IFNGR1g1, ...
None of these labels validate against the Label registry, and none of them seem worth tracking or validating, so we don’t link them to the file.
file.features
'rna': FeatureSet(id='D2R5QXP4oJIS4SxSRG3i', n=93, type='float', registry='bionty.Gene', hash='6Sd_y8RL6Uy6JQCuHM6Y', updated_at=2023-08-11 19:26:41, created_by_id='DzTjkKse')
'adt': FeatureSet(id='zuntrtRthbybYQywhRMm', n=4, type='float', registry='bionty.CellMarker', hash='b-CtyjgPRO0WN27lTOqC', updated_at=2023-08-11 19:26:41, created_by_id='DzTjkKse')
'obs': FeatureSet(id='SS3moEX0zSscURrJXSV1', name='metadata', n=1, registry='core.Feature', hash='hoVk1NccHPOkOdxRb4yE', updated_at=2023-08-11 19:26:41, created_by_id='DzTjkKse')
file.describe()
💡 File(id=WPisD6P743VBQflZHvgn, key=None, suffix=.h5mu, accessor=MuData, description=Sub-sampled MuData from Papalexi21, version=None, size=606320, hash=RaivS3NesDOP-6kNIuaC3g, hash_type=md5, created_at=2023-08-11 19:26:38.694937+00:00, updated_at=2023-08-11 19:26:38.694958+00:00)
Provenance:
🗃️ storage: Storage(id='BtPvJynt', root='/home/runner/work/lamin-usecases/lamin-usecases/docs/test-multimodal', type='local', updated_at=2023-08-11 19:26:36, created_by_id='DzTjkKse')
📎 initial_version: None
📔 transform: Transform(id='yMWSFirS6qv2z8', name='Curate & link multi-modal data', short_name='multimodal', stem_id='yMWSFirS6qv2', version='0', type='notebook', updated_at=2023-08-11 19:26:38, created_by_id='DzTjkKse')
🚗 run: Run(id='km3kD5EWDNAXZcGoPvTT', run_at=2023-08-11 19:26:37, transform_id='yMWSFirS6qv2z8', created_by_id='DzTjkKse')
👤 created_by: User(id='DzTjkKse', handle='testuser1', email='testuser1@lamin.ai', name='Test User1', updated_at=2023-08-11 19:26:36)
Features:
🗺️ rna:
🔗 index (93, bionty.Gene.id): ['nfMjnjieml5b', 'TdkXUtnP0VAZ', 'rPasaDVKU4X7', 'j0NL3KmI6uNp', 'PXHy9K7Pa0xj'...]
🗺️ adt:
🔗 index (4, bionty.CellMarker.id): ['kbrA7wdDuqDK', 'BK30rjK34sZd', 'L0m6f7FPiDeg', '82nG0xqSuEQD'...]
🗺️ obs (metadata):
🔗 gene_target (37, bionty.Gene): ['IFNGR1', 'IRF7', 'STAT1', 'TNFRSF14', 'CUL3']
file.view_lineage()
!lamin delete test-multimodal
💡 deleting instance testuser1/test-multimodal
✅ deleted instance settings file: /home/runner/.lamin/instance--testuser1--test-multimodal.env
✅ instance cache deleted
✅ deleted '.lndb' sqlite file
🔶 consider manually delete your stored data: /home/runner/work/lamin-usecases/lamin-usecases/docs/test-multimodal