references/census_schema.md

CZ CELLxGENE Census Data Schema Reference

Overview

The CZ CELLxGENE Census is a versioned collection of single-cell data built on the TileDB-SOMA framework. This reference documents the data structure, available metadata fields, and query syntax.

High-Level Structure

The Census is organized as a SOMACollection with two main components:

1. census_info

Summary information, including:

  • summary: Build date, cell counts, dataset statistics
  • datasets: All datasets from CELLxGENE Discover with metadata
  • summary_cell_counts: Cell counts stratified by metadata categories

2. census_data

Organism-specific SOMAExperiment objects:

  • "homo_sapiens": Human single-cell data
  • "mus_musculus": Mouse single-cell data
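
The two top-level components can be explored with a few lines of Python. The following is a minimal sketch, not an official API: the helper names read_census_summary and list_organisms are hypothetical, and census is assumed to be the collection returned by cellxgene_census.open_soma().

```python
def read_census_summary(census):
    """Return census_info["summary"] as a pandas DataFrame."""
    # A SOMA DataFrame read yields Arrow tables; concat() merges them into one.
    return census["census_info"]["summary"].read().concat().to_pandas()


def list_organisms(census):
    """Return the organism keys stored under census_data, sorted by name."""
    return sorted(census["census_data"].keys())
```

With the cellxgene_census package installed, opening the Census in a with block (with cellxgene_census.open_soma() as census:) supplies the census argument and closes the handle afterwards.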

Data Structure Per Organism

Each organism experiment contains:

obs (Cell Metadata)

Cell-level annotations stored as a SOMADataFrame. Access via:

census["census_data"]["homo_sapiens"].obs

ms["RNA"] (Measurement)

RNA measurement data, including:

  • X: Data matrices with layers:
      • raw: Raw count data
      • normalized: Normalized counts (if available)
  • var: Gene metadata
  • feature_dataset_presence_matrix: Sparse boolean array showing which genes were measured in each dataset
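
To check what an organism experiment actually contains (for example, whether a normalized layer is present), its parts can be inspected directly. This is a rough sketch: describe_experiment is a hypothetical helper, and it assumes the experiment exposes obs, ms[...].var (each with an Arrow schema) and ms[...].X as a mapping of layer names, as described above.

```python
def describe_experiment(exp, measurement="RNA"):
    """Summarize the obs columns, var columns and X layers of an experiment."""
    ms = exp.ms[measurement]
    return {
        "obs_columns": list(exp.obs.schema.names),   # cell metadata fields
        "var_columns": list(ms.var.schema.names),    # gene metadata fields
        "x_layers": sorted(ms.X.keys()),             # e.g. "raw", "normalized"
    }
```

A call such as describe_experiment(census["census_data"]["homo_sapiens"]) would return a plain dictionary that is easy to print or log.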

Cell Metadata Fields (obs)

Required/Core Fields

Identity & Dataset:

  • soma_joinid: Unique integer identifier for joins
  • dataset_id: Source dataset identifier
  • is_primary_data: Boolean flag (True = unique cell, False = duplicate across datasets)

Cell Type:

  • cell_type: Human-readable cell type name
  • cell_type_ontology_term_id: Standardized ontology term (e.g., "CL:0000236")

Tissue:

  • tissue: Specific tissue name
  • tissue_general: Broader tissue category (useful for grouping)
  • tissue_ontology_term_id: Standardized ontology term

Assay:

  • assay: Sequencing technology used
  • assay_ontology_term_id: Standardized ontology term

Disease:

  • disease: Disease status or condition
  • disease_ontology_term_id: Standardized ontology term

Donor:

  • donor_id: Unique donor identifier
  • sex: Biological sex (male, female, unknown)
  • self_reported_ethnicity: Ethnicity information
  • development_stage: Life stage (adult, child, embryonic, etc.)
  • development_stage_ontology_term_id: Standardized ontology term

Organism:

  • organism: Scientific name (Homo sapiens, Mus musculus)
  • organism_ontology_term_id: Standardized ontology term

Technical:

  • suspension_type: Sample preparation type (cell, nucleus, na)
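
Because obs can be very large, it usually helps to read only the fields needed. The sketch below shows one way to do that with the SOMA DataFrame read() call; read_obs is a hypothetical wrapper, and the column and filter values in the usage note are only illustrative.

```python
def read_obs(census, organism, value_filter=None, column_names=None):
    """Read selected obs columns for one organism as a pandas DataFrame."""
    obs = census["census_data"][organism].obs
    kwargs = {}
    # Only pass the options the caller actually set.
    if value_filter is not None:
        kwargs["value_filter"] = value_filter
    if column_names is not None:
        kwargs["column_names"] = column_names
    return obs.read(**kwargs).concat().to_pandas()
```

For example, read_obs(census, "homo_sapiens", value_filter="tissue_general == 'lung'", column_names=["cell_type", "donor_id", "assay"]) would fetch three metadata columns for lung cells only.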

Gene Metadata Fields (var)

Access via:

census["census_data"]["homo_sapiens"].ms["RNA"].var

Available Fields:

  • soma_joinid: Unique integer identifier for joins
  • feature_id: Ensembl gene ID (e.g., "ENSG00000161798")
  • feature_name: Gene symbol (e.g., "FOXP2")
  • feature_length: Gene length in base pairs
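
A common use of var is translating between gene symbols and Ensembl IDs. The following sketch assumes the var table has already been read into a list of row dictionaries (for instance via pandas' to_dict("records")); symbol_to_ids is a hypothetical helper, and the identifiers in the usage note are placeholders rather than real genes.

```python
def symbol_to_ids(var_records):
    """Map each feature_name to the list of feature_id values that carry it."""
    mapping = {}
    for row in var_records:
        # Lists are kept in case one symbol corresponds to several Ensembl IDs.
        mapping.setdefault(row["feature_name"], []).append(row["feature_id"])
    return mapping
```

Given rows such as {"feature_id": "ENSG_PLACEHOLDER_1", "feature_name": "GENE_A"}, the result maps "GENE_A" to ["ENSG_PLACEHOLDER_1"].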

Value Filter Syntax

Value filters are Python-like boolean expressions, passed as strings and evaluated by TileDB-SOMA.

Comparison Operators

  • ==: Equal to
  • !=: Not equal to
  • <, >, <=, >=: Numeric comparisons
  • in: Membership test (e.g., feature_id in ['ENSG00000161798', 'ENSG00000188229'])

Logical Operators

  • and, &: Logical AND
  • or, |: Logical OR

Examples

Single condition:

value_filter="cell_type == 'B cell'"

Multiple conditions with AND:

value_filter="cell_type == 'B cell' and tissue_general == 'lung' and is_primary_data == True"

Using IN for multiple values:

value_filter="tissue in ['lung', 'liver', 'kidney']"

Complex condition:

value_filter="(cell_type == 'neuron' or cell_type == 'astrocyte') and disease != 'normal'"

Filtering genes:

var_value_filter="feature_name in ['CD4', 'CD8A', 'CD19']"
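
Filter strings like the ones above are easy to get wrong when assembled by hand, especially the quoting. As a minimal sketch (build_value_filter is a hypothetical helper, not part of any Census package), simple equality and membership conditions can be generated from a dictionary:

```python
def build_value_filter(conditions):
    """Join column conditions with 'and'.

    A list or tuple value becomes an 'in' test; any other value becomes '=='.
    """
    clauses = []
    for column, value in conditions.items():
        if isinstance(value, (list, tuple)):
            clauses.append(f"{column} in {list(value)!r}")
        else:
            # repr() quotes strings and leaves booleans and numbers bare.
            clauses.append(f"{column} == {value!r}")
    return " and ".join(clauses)
```

For instance, build_value_filter({"cell_type": "B cell", "tissue_general": "lung", "is_primary_data": True}) reproduces the multi-condition example shown earlier. Conditions that need or, != or numeric comparisons still have to be written by hand.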

Data Inclusion Criteria

The Census includes all data from CZ CELLxGENE Discover meeting:

  1. Species: Human (Homo sapiens) or mouse (Mus musculus)
  2. Technology: Approved sequencing technologies for RNA
  3. Count Type: Raw counts required (datasets with only processed or normalized values are excluded)
  4. Metadata: Standardized following CELLxGENE schema
  5. Modality: Both spatial and non-spatial (traditional) transcriptomics data are included

Important Data Characteristics

Duplicate Cells

The same cell may appear in more than one dataset. For most analyses, filter with is_primary_data == True so that each cell is counted once.
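
When the primary-data condition is appended to an existing filter, the existing expression should be parenthesized so that any or inside it does not change meaning. The helper below is a small sketch of that idea; primary_only is a hypothetical name.

```python
def primary_only(value_filter=None):
    """Return a filter that also requires is_primary_data == True."""
    clause = "is_primary_data == True"
    if not value_filter:
        return clause
    # Parentheses keep an 'or' in the original filter grouped together.
    return f"({value_filter}) and {clause}"
```

Without the parentheses, a filter such as cell_type == 'neuron' or cell_type == 'astrocyte' would apply the primary-data test to astrocytes only, since and binds more tightly than or.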

Count Types

The Census includes two kinds of counts:

  • Molecule counts: From UMI-based methods
  • Full-gene sequencing read counts: From non-UMI methods

These may need different normalization approaches.
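
To illustrate why the two count types may be handled differently, here is a plain-Python sketch of two common normalizations for a single cell. Both are assumptions chosen for illustration, not methods prescribed by the Census: depth scaling for UMI counts, and gene-length correction (using feature_length from var) followed by scaling for full-length read counts.

```python
def scale_umi_counts(umi_counts, target_sum=10_000):
    """Scale one cell's UMI counts so that they sum to target_sum."""
    total = sum(umi_counts)
    if total == 0:
        return [0.0 for _ in umi_counts]
    return [c * target_sum / total for c in umi_counts]


def scale_read_counts(read_counts, feature_lengths, target_sum=1_000_000):
    """Divide reads by gene length, then scale the rates to target_sum."""
    # Longer genes collect more reads in full-length protocols,
    # so counts are first converted to per-base rates.
    rates = [c / length for c, length in zip(read_counts, feature_lengths)]
    total = sum(rates)
    if total == 0:
        return [0.0 for _ in rates]
    return [r * target_sum / total for r in rates]
```

In practice these operations would be applied to sparse matrices rather than Python lists; the lists here only make the arithmetic visible.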

Versioning

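Census releases are identified by build date (e.g., "2023-07-25"). Aliases such as "stable" point to whichever release currently carries that label, so they can change over time. Always specify a dated version for reproducible analysis: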

census = cellxgene_census.open_soma(census_version="2023-07-25")
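
A script can guard against accidentally using a moving alias. The sketch below treats any ISO-formatted date as a pinned version; is_pinned_version and open_pinned are hypothetical helpers, and cellxgene_census is imported lazily so that the check itself has no dependencies.

```python
from datetime import date


def is_pinned_version(census_version):
    """Return True if the version string is an ISO date such as '2023-07-25'."""
    try:
        date.fromisoformat(census_version)
    except ValueError:
        return False
    return True


def open_pinned(census_version):
    """Open the Census only if a dated (reproducible) version was requested."""
    if not is_pinned_version(census_version):
        raise ValueError(f"{census_version!r} is an alias, not a dated release")
    import cellxgene_census  # imported here so the check above works without it

    return cellxgene_census.open_soma(census_version=census_version)
```

This check only validates the format of the string; whether that release actually exists is determined when the Census is opened.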

Dataset Presence Matrix

To see which genes were measured in each dataset, access the presence matrix:

presence_matrix = census["census_data"]["homo_sapiens"].ms["RNA"]["feature_dataset_presence_matrix"]

This sparse boolean matrix helps understand:

  • Gene coverage across datasets
  • Which datasets to include for specific gene analyses
  • Technical batch effects related to gene coverage
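
As an example of the second use, the sketch below selects datasets in which every gene of interest was measured. It works on the non-zero coordinates of the presence matrix, assumed here to be (dataset index, gene soma_joinid) pairs; that orientation and the helper name datasets_measuring_all are assumptions to verify against the Census release in use.

```python
def datasets_measuring_all(presence_coords, gene_joinids):
    """Return dataset indices whose presence entries cover every requested gene."""
    wanted = set(gene_joinids)
    seen = {}
    for dataset_idx, gene_idx in presence_coords:
        if gene_idx in wanted:
            seen.setdefault(dataset_idx, set()).add(gene_idx)
    # Keep only datasets where all requested genes were observed.
    return sorted(d for d, genes in seen.items() if genes == wanted)
```

If the matrix has been loaded as a SciPy sparse matrix m, the coordinate pairs can be produced with zip(*m.nonzero()).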

SOMA Object Types

Core TileDB-SOMA objects used:

  • DataFrame: Tabular data (obs, var)
  • SparseNDArray: Sparse matrices (X layers, presence matrix)
  • DenseNDArray: Dense arrays (less common)
  • Collection: Container for related objects
  • Experiment: Top-level container for measurements
  • SOMAScene: Spatial transcriptomics scenes
  • obs_spatial_presence: Spatial data availability
