
Evidence Types and Data Sources

Overview

Evidence represents any event or set of events that identifies a target as a potential causal gene or protein for a disease. Evidence is standardized and mapped to:
  • Ensembl gene IDs for targets
  • EFO (Experimental Factor Ontology) for diseases/phenotypes

Evidence is organized into data types (broader categories) and data sources (specific databases/studies).

Evidence Data Types

1. Genetic Association

Evidence from human genetics linking genetic variants to disease phenotypes.

Data Sources:

GWAS (Genome-Wide Association Studies)
  • Population-level common variant associations
  • Filtered with Locus-to-Gene (L2G) scores >0.05
  • Includes fine-mapping and colocalization data
  • Sources: GWAS Catalog, FinnGen, UK Biobank, EBI GWAS

Gene Burden Tests
  • Rare variant association analyses
  • Aggregate effects of multiple rare variants in a gene
  • Particularly relevant for Mendelian and rare diseases

ClinVar Germline
  • Clinical variant interpretations
  • Classifications: pathogenic, likely pathogenic, VUS (variant of uncertain significance), benign
  • Expert-reviewed variant-disease associations

Genomics England PanelApp
  • Expert gene-disease ratings
  • Green (confirmed), amber (probable), red (insufficient evidence)
  • Focus on rare diseases and cancer

Gene2Phenotype
  • Curated gene-disease relationships
  • Allelic requirements and inheritance patterns
  • Clinical validity assessments

UniProt Literature & Variants
  • Literature-based gene-disease associations
  • Expert-curated from scientific publications

Orphanet
  • Rare disease gene associations
  • Expert-reviewed and maintained

ClinGen
  • Clinical genome resource classifications
  • Gene-disease validity assertions
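
As a rough illustration of the GWAS filter listed above, the sketch below keeps only gene assignments whose L2G score exceeds 0.05. The record layout (`gene`, `l2g`) and the example values are hypothetical and are not taken from the platform's actual schema.

```python
L2G_THRESHOLD = 0.05  # cutoff stated for GWAS evidence in this document


def filter_by_l2g(records, threshold=L2G_THRESHOLD):
    """Keep records whose L2G score is strictly above the threshold."""
    return [r for r in records if r["l2g"] > threshold]


# Made-up records, for illustration only
records = [
    {"gene": "GENE_A", "l2g": 0.62},
    {"gene": "GENE_B", "l2g": 0.05},
    {"gene": "GENE_C", "l2g": 0.01},
]
kept = filter_by_l2g(records)
print([r["gene"] for r in kept])
```

The comparison is strict, so a record sitting exactly at the threshold is dropped; whether the platform treats the boundary as inclusive is not stated here.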

2. Somatic Mutations

Evidence from cancer genomics identifying driver genes and therapeutic targets.

Data Sources:

Cancer Gene Census
  • Expert-curated cancer genes
  • Tier classifications (1 = strong evidence, 2 = emerging)
  • Mutation types and cancer types

IntOGen
  • Computational driver gene predictions
  • Aggregated from large cohort studies
  • Statistical significance of mutations

ClinVar Somatic
  • Somatic clinical variant interpretations
  • Oncogenic/likely oncogenic classifications

Cancer Biomarkers
  • FDA/EMA approved biomarkers
  • Clinical trial biomarkers
  • Prognostic and predictive markers

3. Known Drugs

Evidence from clinical precedence showing drugs targeting genes for disease indications.

Data Source:

ChEMBL
  • Approved drugs (Phase 4)
  • Clinical candidates (Phase 1-3)
  • Withdrawn drugs
  • Drug-target-indication triplets with mechanism of action

Clinical Trial Information:
  • phase: Maximum clinical trial phase (1, 2, 3, 4)
  • status: Active, terminated, completed, withdrawn
  • mechanismOfAction: How the drug affects the target

4. Affected Pathways

Evidence linking genes to disease through pathway perturbations and functional screens.

Data Sources:

CRISPR Screens
  • Genome-scale knockout screens
  • Cancer dependency and essentiality data

Project Score (Cancer Dependency Map)
  • CRISPR-Cas9 fitness screens across cancer cell lines
  • Gene essentiality profiles

SLAPenrich
  • Pathway enrichment analysis
  • Somatic mutation pathway impacts

PROGENy
  • Pathway activity inference
  • Signaling pathway perturbations

Reactome
  • Expert-curated pathway annotations
  • Biological pathway representations

Gene Signatures
  • Expression-based signatures
  • Pathway activity patterns

5. RNA Expression

Evidence from differential gene expression in disease vs. control tissues.

Data Source:

Expression Atlas
  • Differential expression data
  • Baseline expression across tissues/conditions
  • RNA-Seq and microarray studies
  • Log2 fold-change and p-values
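
For readers unfamiliar with the log2 fold-change reported for differential expression, the sketch below shows the arithmetic on two made-up mean expression values. The function name and inputs are illustrative only; the actual differential-expression pipeline involves statistical modelling that this does not attempt to reproduce.

```python
import math


def log2_fold_change(disease_mean, control_mean):
    """Return log2(disease / control); both means must be positive."""
    if disease_mean <= 0 or control_mean <= 0:
        raise ValueError("expression means must be positive")
    return math.log2(disease_mean / control_mean)


print(log2_fold_change(40.0, 10.0))  # four times higher in disease: 2.0
print(log2_fold_change(5.0, 10.0))   # halved in disease: -1.0
```

Positive values indicate higher expression in disease than in control tissue, and negative values indicate lower expression.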

6. Animal Models

Evidence from in vivo studies showing phenotypes associated with gene perturbations.

Data Source:

IMPC (International Mouse Phenotyping Consortium)
  • Systematic mouse knockout phenotypes
  • Phenotype-disease mappings via ontologies
  • Standardized phenotyping procedures

7. Literature

Evidence from text-mining of biomedical literature.

Data Source:

Europe PMC
  • Co-occurrence of genes and diseases in abstracts
  • Normalized citation counts
  • Weighted by publication type and recency

Evidence Scoring

Each evidence source has its own scoring methodology:

Score Ranges

  • Most scores are normalized to the 0-1 range
  • Higher scores indicate stronger evidence
  • Scores are NOT confidence levels but relative strength indicators

Common Scoring Approaches:

Categorical Classifications:
  • ClinVar: Pathogenic (1.0), Likely pathogenic (0.99), etc.
  • Gene2Phenotype: Confirmed/probable ratings
  • PanelApp: Green/amber/red classifications

Statistical Measures:
  • GWAS: L2G scores incorporating multiple lines of evidence
  • Gene Burden: Statistical significance of variant aggregation
  • Expression: Adjusted p-values and fold-changes

Clinical Precedence:
  • Known Drugs: Phase weights (Phase 4 = 1.0, Phase 3 = 0.7, etc.)
  • Clinical status modifiers

Computational Predictions:
  • IntOGen: Q-values from driver mutation analysis
  • PROGENy/SLAPenrich: Pathway activity/enrichment scores

Evidence Interpretation Guidelines

Strengths by Data Type

Genetic Association - Strongest human genetic evidence
  • Direct link between genetic variation and disease
  • Mendelian diseases: high confidence
  • GWAS: requires L2G to identify causal gene
  • Consider ancestry and population-specific effects

Somatic Mutations - Direct evidence in cancer
  • Strong for oncology indications
  • Driver mutations indicate therapeutic potential
  • Consider cancer type specificity

Known Drugs - Clinical validation
  • Highest confidence: approved drugs (Phase 4)
  • Consider mechanism relevance to new indication
  • Phase 1-2: early evidence, higher risk

Affected Pathways - Mechanistic insights
  • Supports biological plausibility
  • May not predict clinical success
  • Useful for hypothesis generation

RNA Expression - Observational evidence
  • Correlation, not causation
  • May reflect disease consequence vs. cause
  • Useful for biomarker identification

Animal Models - Translational evidence
  • Strong for understanding biology
  • Variable translation to human disease
  • Most useful when phenotype matches human disease

Literature - Exploratory signal
  • Text-mining captures research focus
  • May reflect publication bias
  • Requires manual literature review for validation

Important Considerations

  1. Multiple evidence types strengthen confidence - Convergent evidence from different data types provides stronger support

  2. Under-studied diseases score lower - Novel or rare diseases may have strong evidence but lower aggregate scores due to limited research

  3. Association scores are not probabilities - Scores rank relative evidence strength, not success probability

  4. Context matters - Evidence strength depends on:
       • Disease mechanism understanding
       • Target biology and druggability
       • Clinical precedence in related indications
       • Safety considerations

  5. Data source reliability varies - Weight expert-curated sources (ClinGen, Gene2Phenotype) higher than computational predictions
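
One simple way to check for convergent evidence, as described in the first consideration above, is to count how many data types contribute a non-zero score for a target-disease pair. The sketch below assumes the scores have already been collected into a plain dictionary; the function name, the dictionary layout, and the values are all illustrative.

```python
def supporting_datatypes(datatype_scores, min_score=0.0):
    """Return the sorted data types whose score is above min_score."""
    return sorted(dt for dt, score in datatype_scores.items() if score > min_score)


# Made-up scores for one target-disease pair
scores = {
    "genetic_association": 0.71,
    "known_drug": 0.95,
    "literature": 0.12,
    "animal_model": 0.0,
}
types = supporting_datatypes(scores)
print(len(types), types)
```

A count like this is only a rough screen. It says nothing about the quality of each source, so the other considerations in this list still apply.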

Using Evidence in Queries

Filtering by Data Type

query = """
  query evidenceByType($ensemblId: String!, $efoId: String!, $dataTypes: [String!]) {
    disease(efoId: $efoId) {
      evidences(ensemblIds: [$ensemblId], datatypes: $dataTypes) {
        rows {
          datasourceId
          score
        }
      }
    }
  }
"""
variables = {
    "ensemblId": "ENSG00000157764",
    "efoId": "EFO_0000249",
    "dataTypes": ["genetic_association", "somatic_mutation"]
}
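
The snippet above only defines the query and its variables. A minimal sketch of sending them with the Python standard library might look like the following. The endpoint URL is an assumption to confirm against the current Open Targets Platform documentation, and `build_payload` and `run_query` are hypothetical helper names rather than part of any official client.

```python
import json
import urllib.request

# Assumed endpoint; verify against the current platform documentation.
API_URL = "https://api.platform.opentargets.org/api/v4/graphql"


def build_payload(query, variables):
    """Serialize a GraphQL query and its variables as a JSON request body."""
    return json.dumps({"query": query, "variables": variables}).encode("utf-8")


def run_query(query, variables, url=API_URL, timeout=30):
    """POST the query and return the decoded JSON response (needs network access)."""
    request = urllib.request.Request(
        url,
        data=build_payload(query, variables),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


# Example call, not executed here: result = run_query(query, variables)
```

If the server rejects an argument or field, GraphQL APIs typically report this in an `errors` entry of the response, so it is worth inspecting that entry before reading `data`.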

Accessing Data Type Scores

Data type scores aggregate all source scores within that type:

query = """
  query associationScores($ensemblId: String!, $efoId: String!) {
    target(ensemblId: $ensemblId) {
      associatedDiseases(efoIds: [$efoId]) {
        rows {
          disease {
            name
          }
          score
          datatypeScores {
            componentId
            score
          }
        }
      }
    }
  }
"""

Evidence Quality Assessment

When evaluating evidence:

  1. Check multiple sources - Single source may be unreliable
  2. Prioritize human genetic evidence - Strongest disease relevance
  3. Consider clinical precedence - Known drugs indicate druggability
  4. Assess mechanistic support - Pathway evidence supports biology
  5. Review literature manually - For critical decisions, read primary publications
  6. Validate in primary databases - Cross-reference with ClinVar, ClinGen, etc.