BEDspace: Joint Region and Metadata Embeddings
Overview
BEDspace applies the StarSpace model to genomic data, jointly training numerical embeddings for region sets and their metadata labels in a shared low-dimensional space. Because both kinds of embedding live in the same space, a region set can be compared directly with a label, so queries can cross between regions and metadata.
When to Use
Use BEDspace when working with:
- Region sets with associated metadata (cell types, tissues, conditions)
- Search tasks requiring metadata-aware similarity
- Cross-modal queries (e.g., "find regions similar to label X")
- Joint analysis of genomic content and experimental conditions
Workflow
BEDspace consists of four sequential operations:
1. Preprocess
Format genomic intervals and metadata for StarSpace training:
geniml bedspace preprocess \
--input /path/to/regions/ \
--metadata labels.csv \
--universe universe.bed \
--labels "cell_type,tissue" \
--output preprocessed.txt
Required files:
- Input folder: Directory containing BED files
- Metadata CSV: Must include file_name column matching BED filenames, plus metadata columns
- Universe file: Reference BED file for tokenization
- Labels: Comma-separated list of metadata columns to use
The preprocessing step adds __label__ prefixes to metadata and converts regions to StarSpace-compatible format.
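To make the `__label__` convention concrete, here is a minimal sketch of how one training line could be assembled. It is an illustration, not the code geniml runs: the helper `to_starspace_line`, the `chr_start_end` token spelling, and the replacement of spaces with underscores in labels are all assumptions made for the example.

```python
def to_starspace_line(region_tokens, labels):
    """Join region tokens and prefixed labels into one space-separated line."""
    # StarSpace marks label tokens with a "__label__" prefix.
    # Replacing spaces is an assumption: it keeps each label a single token.
    label_tokens = [f"__label__{label.replace(' ', '_')}" for label in labels]
    return " ".join(list(region_tokens) + label_tokens)


# Hypothetical tokens; the real spelling depends on the universe tokenizer.
line = to_starspace_line(
    ["chr1_1000_2000", "chr2_500_900"],
    ["T_cell", "blood"],
)
print(line)
```

Comparing a few lines of the real `preprocessed.txt` against this shape is a quick way to spot labels that were dropped or split.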
2. Train
Run StarSpace on the preprocessed data:
geniml bedspace train \
--path-to-starspace /path/to/starspace \
--input preprocessed.txt \
--output model/ \
--dim 100 \
--epochs 50 \
--lr 0.05
Key training parameters:
- --dim: Embedding dimension (typical: 50-200)
- --epochs: Training epochs (typical: 20-100)
- --lr: Learning rate (typical: 0.01-0.1)
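When comparing several parameter settings, it can help to build the training command programmatically. The sketch below only assembles the argument list from the flags shown above; the helper name `build_train_command` is hypothetical, and the paths are the same placeholders used in this document.

```python
import shlex


def build_train_command(starspace_path, input_file, output_dir,
                        dim=100, epochs=50, lr=0.05):
    """Return the argv list for one `geniml bedspace train` run."""
    return [
        "geniml", "bedspace", "train",
        "--path-to-starspace", str(starspace_path),
        "--input", str(input_file),
        "--output", str(output_dir),
        "--dim", str(dim),
        "--epochs", str(epochs),
        "--lr", str(lr),
    ]


cmd = build_train_command(
    "/path/to/starspace", "preprocessed.txt", "model/",
    dim=50, epochs=20, lr=0.01,
)
# Print a shell-safe rendering for logging or copy-pasting.
print(shlex.join(cmd))
```

The list can be passed to `subprocess.run(cmd, check=True)` once StarSpace and geniml are installed; giving each run its own output directory avoids overwriting earlier models.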
3. Distances
Compute distance metrics between region sets and metadata labels:
geniml bedspace distances \
--input model/ \
--metadata labels.csv \
--universe universe.bed \
--output distances.pkl
This step creates a distance matrix needed for similarity searches.
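As a rough picture of what such a matrix contains, the sketch below computes pairwise distances between region-set embeddings and label embeddings. Cosine distance is an assumption chosen for illustration; the metric used by `geniml bedspace distances` is not stated here. The two-dimensional vectors and file names are made up.

```python
import math


def cosine_distance(u, v):
    """Return 1 - cosine similarity for two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def distance_matrix(region_embeddings, label_embeddings):
    """Map each region set name to its distance from every label."""
    return {
        name: {
            label: cosine_distance(vec, label_vec)
            for label, label_vec in label_embeddings.items()
        }
        for name, vec in region_embeddings.items()
    }


# Toy embeddings (hypothetical); real ones have --dim components.
regions = {"a.bed": [1.0, 0.0], "b.bed": [0.0, 1.0]}
labels = {"T_cell": [1.0, 0.0], "B_cell": [0.0, 2.0]}
matrix = distance_matrix(regions, labels)
print(matrix["a.bed"])
```

Cosine distance ignores vector length, which is why `b.bed` sits at distance 0 from `B_cell` even though the two vectors differ in magnitude.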
4. Search
Retrieve similar items across three scenarios:
Region-to-Label (r2l): Query region set → retrieve similar metadata labels
geniml bedspace search -t r2l -d distances.pkl -q query_regions.bed -n 10
Label-to-Region (l2r): Query metadata label → retrieve similar region sets
geniml bedspace search -t l2r -d distances.pkl -q "T_cell" -n 10
Region-to-Region (r2r): Query region set → retrieve similar region sets
geniml bedspace search -t r2r -d distances.pkl -q query_regions.bed -n 10
The -n parameter controls the number of results returned.
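Conceptually, each search type ranks candidates by their distance to the query and keeps the closest `n`. A minimal sketch of that ranking step, assuming smaller distances mean greater similarity and using invented distance values:

```python
import heapq


def top_n(distances, n=10):
    """Return the n (name, distance) pairs with the smallest distance."""
    return heapq.nsmallest(n, distances.items(), key=lambda item: item[1])


# Hypothetical distances from one query to four candidate region sets.
distances_to_query = {
    "a.bed": 0.42,
    "b.bed": 0.07,
    "c.bed": 0.91,
    "d.bed": 0.15,
}
nearest = top_n(distances_to_query, n=2)
print(nearest)
```

`heapq.nsmallest` avoids sorting the full candidate list, which matters when `n` is small relative to the number of region sets.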
Python API
from geniml.bedspace import BEDSpaceModel
# Load trained model
model = BEDSpaceModel.load('model/')
# Query similar items
results = model.search(
query="T_cell",
search_type="l2r",
top_k=10
)
Best Practices
- Metadata structure: Ensure the metadata CSV includes a file_name column that exactly matches the BED filenames (without path)
- Label selection: Choose informative metadata columns that capture biological variation of interest
- Universe consistency: Use the same universe file across preprocessing, distances, and any subsequent analyses
- Validation: Preprocess and check output format before investing in training
- StarSpace installation: Install StarSpace separately as it's an external dependency
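The metadata check in the first practice can be automated before preprocessing. The sketch below compares the `file_name` column against the files actually present in the input folder; the helper `find_mismatches` is hypothetical, and the demo builds throwaway files in a temporary directory so it runs on its own.

```python
import csv
import tempfile
from pathlib import Path


def find_mismatches(metadata_csv, bed_dir):
    """Return (listed but missing on disk, on disk but not listed)."""
    with open(metadata_csv, newline="") as handle:
        listed = {row["file_name"] for row in csv.DictReader(handle)}
    present = {p.name for p in Path(bed_dir).iterdir() if p.is_file()}
    return sorted(listed - present), sorted(present - listed)


# Demo with made-up files: one match, one missing, one unlisted.
with tempfile.TemporaryDirectory() as tmp:
    bed_dir = Path(tmp) / "regions"
    bed_dir.mkdir()
    (bed_dir / "sample1.bed").write_text("chr1\t100\t200\n")
    (bed_dir / "extra.bed").write_text("chr1\t300\t400\n")
    metadata = Path(tmp) / "labels.csv"
    metadata.write_text(
        "file_name,cell_type\nsample1.bed,T_cell\nsample2.bed,B_cell\n"
    )
    missing, unlisted = find_mismatches(metadata, bed_dir)

print("Missing BED files:", missing)
print("Files without metadata:", unlisted)
```

Both lists should be empty before running `geniml bedspace preprocess`; a non-empty `missing` list usually points to a path prefix or extension mismatch in the CSV.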
Output Interpretation
Search results return items ranked by similarity in the joint embedding space:
- r2l: Identifies metadata labels characterizing your query regions
- l2r: Finds region sets matching your metadata criteria
- r2r: Discovers region sets with similar genomic content
Requirements
BEDspace requires StarSpace to be installed separately. Download from: https://github.com/facebookresearch/StarSpace