Evidence Hierarchy and Quality Assessment
Traditional Evidence Hierarchy (Medical/Clinical)
Level 1: Systematic Reviews and Meta-Analyses
Description: Comprehensive synthesis of all available evidence on a question.
Strengths: - Combines multiple studies for greater power - Reduces impact of single-study anomalies - Can identify patterns across studies - Quantifies overall effect size
Weaknesses: - Quality depends on included studies ("garbage in, garbage out") - Publication bias can distort findings - Heterogeneity may make pooling inappropriate - Can mask important differences between studies
Critical evaluation: - Was search comprehensive (multiple databases, grey literature)? - Were inclusion criteria appropriate and prespecified? - Was study quality assessed? - Was heterogeneity explored? - Was publication bias assessed (funnel plots, fail-safe N)? - Were appropriate statistical methods used?
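The pooling and heterogeneity checks named above can be sketched in a few lines. This is a minimal illustration, assuming a fixed-effect inverse-variance model and effect estimates that are already on a common scale (for example log risk ratios); the function name `pool_fixed_effect` and the example numbers are hypothetical, and a real meta-analysis would normally use a dedicated package and consider a random-effects model.

```python
import math


def pool_fixed_effect(effects, std_errors):
    """Inverse-variance fixed-effect pooling with Cochran's Q and I^2.

    effects: per-study effect estimates on a common scale.
    std_errors: the matching standard errors (all > 0).
    """
    if len(effects) != len(std_errors) or len(effects) < 2:
        raise ValueError("need at least two studies with matching inputs")
    # Each study is weighted by the inverse of its variance.
    weights = [1.0 / se ** 2 for se in std_errors]
    total_weight = sum(weights)
    pooled = sum(w * y for w, y in zip(weights, effects)) / total_weight
    pooled_se = math.sqrt(1.0 / total_weight)
    # Cochran's Q: weighted squared deviations from the pooled estimate.
    q = sum(w * (y - pooled) ** 2 for w, y in zip(weights, effects))
    df = len(effects) - 1
    # I^2: share of total variation attributed to heterogeneity, floored at 0.
    i2 = max(0.0, (q - df) / q) * 100.0 if q > 0 else 0.0
    return {
        "pooled": pooled,
        "se": pooled_se,
        "ci95": (pooled - 1.96 * pooled_se, pooled + 1.96 * pooled_se),
        "Q": q,
        "df": df,
        "I2": i2,
    }


# Made-up example: two studies with equal precision but different effects.
example = pool_fixed_effect([0.2, 0.4], [0.1, 0.1])
# pooled is about 0.3, Q about 2.0, I2 about 50 percent
```

A high I^2 in such a sketch is a prompt to explore why studies differ before trusting the pooled number, not a verdict in itself.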
Level 2: Randomized Controlled Trials (RCTs)
Description: Experimental studies with random assignment to conditions.
Strengths: - Gold standard for establishing causation - Controls for known and unknown confounders - Minimizes selection bias - Enables causal inference
Weaknesses: - May not be ethical or feasible - Artificial settings may limit generalizability - Often short-term with selected populations - Expensive and time-consuming
Critical evaluation: - Was randomization adequate (sequence generation, allocation concealment)? - Was blinding implemented (participants, providers, assessors)? - Was sample size adequate (power analysis)? - Was intention-to-treat analysis used? - Was attrition rate acceptable and balanced? - Are results generalizable?
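As a rough check on the "was sample size adequate" question, the required size per arm can be approximated. The sketch below assumes a two-arm trial comparing means, a two-sided test, equal group sizes, and a standardized effect size (Cohen's d); it uses the normal approximation, so exact t-based calculations give slightly larger numbers. The function name `n_per_group` is illustrative only.

```python
import math
from statistics import NormalDist


def n_per_group(effect_size_d, alpha=0.05, power=0.80):
    """Approximate participants needed per arm (normal approximation).

    effect_size_d: expected standardized mean difference (Cohen's d), > 0.
    alpha: two-sided type I error rate.
    power: desired probability of detecting the effect if it is real.
    """
    if effect_size_d <= 0:
        raise ValueError("effect_size_d must be positive")
    if not (0 < alpha < 1 and 0 < power < 1):
        raise ValueError("alpha and power must lie strictly between 0 and 1")
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value of the test
    z_beta = NormalDist().inv_cdf(power)           # quantile for desired power
    return math.ceil(2 * (z_alpha + z_beta) ** 2 / effect_size_d ** 2)


# A medium effect (d = 0.5) needs about 63 per arm under these assumptions;
# a small effect (d = 0.2) needs about 393 per arm.
medium = n_per_group(0.5)
small = n_per_group(0.2)
```

Comparing a trial's actual enrollment (after attrition) with a figure like this gives a quick sense of whether a null result is informative or simply underpowered.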
Level 3: Cohort Studies
Description: Observational studies following groups over time.
Types: - Prospective: Follow forward from exposure to outcome - Retrospective: Look backward at existing data
Strengths: - Can study multiple outcomes - Establishes temporal sequence - Can calculate incidence and relative risk - More feasible than RCTs for many questions
Weaknesses: - Susceptible to confounding - Selection bias possible - Attrition can bias results - Cannot prove causation definitively
Critical evaluation: - Were cohorts comparable at baseline? - Was exposure measured reliably? - Was follow-up adequate and complete? - Were potential confounders measured and controlled? - Was outcome assessment blinded to exposure?
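Because a cohort study follows exposed and unexposed groups forward, incidence and relative risk can be computed directly from the counts. The following is a minimal sketch, assuming a simple 2x2 table with no adjustment for confounders and a log-based (Katz) 95% confidence interval; the function name and the example counts are made up for illustration.

```python
import math


def relative_risk(exposed_events, exposed_total, unexposed_events, unexposed_total):
    """Unadjusted relative risk with an approximate 95% CI (log method)."""
    if min(exposed_events, unexposed_events) <= 0:
        raise ValueError("both groups need at least one event for this method")
    if exposed_events > exposed_total or unexposed_events > unexposed_total:
        raise ValueError("events cannot exceed group totals")
    risk_exposed = exposed_events / exposed_total        # incidence, exposed
    risk_unexposed = unexposed_events / unexposed_total  # incidence, unexposed
    rr = risk_exposed / risk_unexposed
    # Standard error of log(RR).
    se_log = math.sqrt(
        1 / exposed_events - 1 / exposed_total
        + 1 / unexposed_events - 1 / unexposed_total
    )
    lower = math.exp(math.log(rr) - 1.96 * se_log)
    upper = math.exp(math.log(rr) + 1.96 * se_log)
    return rr, lower, upper


# Made-up cohort: 30 of 100 exposed and 15 of 100 unexposed develop the outcome.
rr, lower, upper = relative_risk(30, 100, 15, 100)
# rr is about 2.0, with a 95% CI of roughly 1.15 to 3.48
```

The interval reflects sampling error only; it says nothing about confounding, selection, or attrition, which is why the appraisal questions above still matter.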
Level 4: Case-Control Studies
Description: Compare people with outcome (cases) to those without (controls), looking back at exposures.
Strengths: - Efficient for rare outcomes - Relatively quick and inexpensive - Can study multiple exposures - Useful for generating hypotheses
Weaknesses: - Cannot calculate incidence - Susceptible to recall bias - Selection of controls is challenging - Cannot prove causation
Critical evaluation: - Were cases and controls defined clearly? - Were controls appropriate (same source population)? - Was matching appropriate? - How was exposure ascertained (records vs. recall)? - Were potential confounders controlled? - Could recall bias explain findings?
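Since the investigator fixes the number of cases and controls, incidence cannot be computed here; the usual measure of association is the odds ratio. The sketch below assumes an unmatched design, a single binary exposure, and Woolf's log-based 95% confidence interval; the function name and counts are hypothetical.

```python
import math


def odds_ratio(cases_exposed, cases_unexposed, controls_exposed, controls_unexposed):
    """Unadjusted odds ratio with an approximate 95% CI (Woolf's method)."""
    cells = (cases_exposed, cases_unexposed, controls_exposed, controls_unexposed)
    if min(cells) <= 0:
        raise ValueError("all four cells must be positive for this method")
    # Odds of exposure among cases divided by odds of exposure among controls.
    or_value = (cases_exposed * controls_unexposed) / (
        cases_unexposed * controls_exposed
    )
    # Standard error of log(OR): square root of summed reciprocal cell counts.
    se_log = math.sqrt(sum(1 / c for c in cells))
    lower = math.exp(math.log(or_value) - 1.96 * se_log)
    upper = math.exp(math.log(or_value) + 1.96 * se_log)
    return or_value, lower, upper


# Made-up study: 40 of 100 cases and 20 of 100 controls were exposed.
or_value, lower, upper = odds_ratio(40, 60, 20, 80)
# or_value is about 2.67, with a 95% CI of roughly 1.42 to 5.02
```

The odds ratio approximates the relative risk only when the outcome is rare, and a matched design would call for a matched analysis instead of this simple table.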
Level 5: Cross-Sectional Studies
Description: Snapshot observation at single point in time.
Strengths: - Quick and inexpensive - Can assess prevalence - Useful for hypothesis generation - Can study multiple outcomes and exposures
Weaknesses: - Cannot establish temporal sequence - Cannot determine causation - Prevalence-incidence bias - Survival bias
Critical evaluation: - Was sample representative? - Were measures validated? - Could reverse causation explain findings? - Are confounders acknowledged?
Level 6: Case Series and Case Reports
Description: Description of observations in clinical practice.
Strengths: - Can identify new diseases or effects - Hypothesis-generating - Details rare phenomena - Quick to report
Weaknesses: - No control group - No statistical inference possible - Highly susceptible to bias - Cannot establish causation or frequency
Use: Primarily for hypothesis generation and clinical description.
Level 7: Expert Opinion
Description: Statements by recognized authorities.
Strengths: - Synthesizes experience - Useful when no research available - May integrate multiple sources
Weaknesses: - Subjective and potentially biased - May not reflect current evidence - Appeal to authority fallacy risk - Individual expertise varies
Use: Lowest level of evidence; should be supported by data when possible.
Nuances and Limitations of Traditional Hierarchy
When Lower-Level Evidence Can Be Strong
- Well-designed observational studies with:
  - Large effects (hard to explain by confounding alone)
  - Dose-response relationships
  - Consistent findings across contexts
  - Biological plausibility
  - No plausible confounders
- Multiple converging lines of evidence from different study types
- Natural experiments approximating randomization
When Higher-Level Evidence Can Be Weak
- Poor-quality RCTs with:
  - Inadequate randomization
  - High attrition
  - No blinding even though blinding was feasible
  - Conflicts of interest
- Biased meta-analyses:
  - Publication bias
  - Selective inclusion
  - Inappropriate pooling
  - Poor search strategy
- Not addressing the right question:
  - Wrong population
  - Wrong comparison
  - Wrong outcome
  - Too artificial to generalize
Alternative: GRADE System
GRADE (Grading of Recommendations Assessment, Development and Evaluation) assesses evidence quality across four levels:
High Quality
Definition: Very confident that true effect is close to estimated effect.
Characteristics: - Well-conducted RCTs - Overwhelming evidence from observational studies - Large, consistent effects - No serious limitations
Moderate Quality
Definition: Moderately confident; true effect likely close to estimated, but could be substantially different.
Downgrades from high: - Some risk of bias - Inconsistency across studies - Indirectness (different populations/interventions) - Imprecision (wide confidence intervals) - Publication bias suspected
Low Quality
Definition: Limited confidence; true effect may be substantially different.
Downgrades: - Serious limitations in above factors - Observational studies without special strengths
Very Low Quality
Definition: Very limited confidence; true effect likely substantially different.
Characteristics: - Very serious limitations - Expert opinion - Multiple serious flaws
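The up-and-down logic of the four levels can be sketched as a small function. This is a deliberate simplification, assuming the common GRADE convention that bodies of RCT evidence start at high and observational evidence starts at low, with each serious concern moving the rating down one level and each special strength (such as a large effect or a dose-response gradient) moving it up one. Real GRADE ratings are structured judgments made per outcome, not arithmetic; the names `LEVELS` and `grade_certainty` are illustrative.

```python
LEVELS = ["very low", "low", "moderate", "high"]

# Assumed starting points by study design.
START_INDEX = {"rct": 3, "observational": 1}


def grade_certainty(design, downgrades=0, upgrades=0):
    """Return a GRADE-style level from a starting design and adjustments.

    design: "rct" or "observational".
    downgrades: total levels removed (risk of bias, inconsistency,
        indirectness, imprecision, publication bias).
    upgrades: total levels added (e.g. large effect, dose-response).
    """
    if design not in START_INDEX:
        raise ValueError("design must be 'rct' or 'observational'")
    if downgrades < 0 or upgrades < 0:
        raise ValueError("adjustments must be non-negative")
    index = START_INDEX[design] - downgrades + upgrades
    # Clamp to the four available levels.
    index = max(0, min(index, len(LEVELS) - 1))
    return LEVELS[index]


# RCT evidence with one serious concern (e.g. imprecision) -> "moderate"
rct_rating = grade_certainty("rct", downgrades=1)
# Observational evidence with a very large, dose-dependent effect -> "high"
obs_rating = grade_certainty("observational", upgrades=2)
```

The second example mirrors the point that strong observational evidence can reach the top level, while a flawed trial can fall well below it.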
Study Quality Assessment Criteria
Internal Validity (Bias Control)
Questions: - Was randomization adequate? - Was allocation concealed? - Were groups similar at baseline? - Was blinding implemented? - Was attrition minimal and balanced? - Was intention-to-treat used? - Were all outcomes reported?
External Validity (Generalizability)
Questions: - Is sample representative of target population? - Are inclusion/exclusion criteria too restrictive? - Is setting realistic? - Are results applicable to other populations? - Are effects consistent across subgroups?
Statistical Conclusion Validity
Questions: - Was sample size adequate (power)? - Were statistical tests appropriate? - Were assumptions checked? - Were effect sizes and confidence intervals reported? - Were multiple comparisons addressed? - Was analysis prespecified?
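One way to check whether multiple comparisons were addressed is to re-adjust the reported p-values yourself. The sketch below uses the Holm step-down procedure, one of several standard corrections (Bonferroni and false discovery rate methods are alternatives); the function name `holm_adjust` and the example p-values are hypothetical.

```python
def holm_adjust(p_values):
    """Return Holm-adjusted p-values in the same order as the input."""
    m = len(p_values)
    if any(p < 0 or p > 1 for p in p_values):
        raise ValueError("p-values must lie in [0, 1]")
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        # The smallest p-value is multiplied by m, the next by m - 1, etc.
        candidate = min(1.0, (m - rank) * p_values[idx])
        # Enforce monotonicity so adjusted values never decrease with rank.
        running_max = max(running_max, candidate)
        adjusted[idx] = running_max
    return adjusted


# Made-up example: three outcomes tested in one study.
adjusted = holm_adjust([0.01, 0.04, 0.03])
# adjusted is approximately [0.03, 0.06, 0.06]; only the first stays below 0.05
```

If a paper's headline finding survives only without such a correction, and the analysis was not prespecified, that is worth noting in the appraisal.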
Construct Validity (Measurement)
Questions: - Were measures validated and reliable? - Was outcome defined clearly and appropriately? - Were assessors blinded? - Were exposures measured accurately? - Was timing of measurement appropriate?
Critical Appraisal Tools
For Different Study Types
RCTs: - Cochrane Risk of Bias Tool (RoB 2 in its current version) - Jadad Scale - PEDro Scale (for trials in physical therapy)
Observational Studies: - Newcastle-Ottawa Scale - ROBINS-I (Risk Of Bias In Non-randomised Studies of Interventions)
Diagnostic Studies: - QUADAS-2 (Quality Assessment of Diagnostic Accuracy Studies)
Systematic Reviews: - AMSTAR-2 (A Measurement Tool to Assess Systematic Reviews)
All Study Types: - CASP Checklists (Critical Appraisal Skills Programme)
Domain-Specific Considerations
Basic Science Research
Hierarchy differs: 1. Multiple convergent lines of evidence 2. Mechanistic understanding 3. Reproducible experiments 4. Established theoretical framework
Key considerations: - Replication essential - Mechanistic plausibility - Consistency across model systems - Convergence of methods
Psychological Research
Additional concerns: - Replication crisis - Publication bias particularly problematic - Small effect sizes often expected - Cultural context matters - Measures often indirect (self-report)
Strong evidence includes: - Preregistered studies - Large samples - Multiple measures - Behavioral (not just self-report) outcomes - Cross-cultural replication
Epidemiology
Causal inference frameworks: - Bradford Hill criteria - Rothman's causal pies - Directed Acyclic Graphs (DAGs)
Strong observational evidence: - Dose-response relationships - Temporality (exposure precedes outcome) - Biological plausibility - Specificity - Consistency across populations - Large effects unlikely to be due to confounding
Social Sciences
Challenges: - Complex interventions - Context-dependent effects - Measurement challenges - Ethical constraints on RCTs
Strengthening evidence: - Mixed methods - Natural experiments - Instrumental variables - Regression discontinuity designs - Multiple operationalizations
Synthesizing Evidence Across Studies
Consistency
Strong evidence: - Multiple studies, different investigators - Different populations and settings - Different research designs converge - Different measurement methods
Weak evidence: - Single study - Only one research group - Conflicting results - Publication bias evident
Biological/Theoretical Plausibility
Strengthens evidence: - Known mechanism - Consistent with other knowledge - Dose-response relationship - Coherent with animal/in vitro data
Weakens evidence: - No plausible mechanism - Contradicts established knowledge - Biological implausibility
Temporality
Essential for causation: - Cause must precede effect - Cross-sectional studies cannot establish temporal order - Reverse causation must be ruled out
Specificity
Moderate indicator: - Specific cause → specific effect strengthens causation - But lack of specificity doesn't rule out causation - Most causes have multiple effects
Strength of Association
Strong evidence: - Large effects, which are less likely to be explained by confounding alone - Dose-response relationships - All-or-none effects
Caution: - Small effects may still be real - Large effects can still be confounded
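The intuition that large effects are harder to explain away can be made quantitative. One option, not named in the text above and swapped in here purely as an illustration, is the E-value of VanderWeele and Ding: the minimum strength of association, on the risk-ratio scale, that an unmeasured confounder would need with both exposure and outcome to fully account for an observed risk ratio. The function name `e_value` is an assumption of this sketch.

```python
import math


def e_value(risk_ratio):
    """E-value for an observed risk ratio (point estimate only)."""
    if risk_ratio <= 0:
        raise ValueError("risk_ratio must be positive")
    # Protective associations are inverted so the formula applies to RR >= 1.
    rr = risk_ratio if risk_ratio >= 1 else 1.0 / risk_ratio
    return rr + math.sqrt(rr * (rr - 1.0))


# An observed risk ratio of 2.0 gives an E-value of about 3.41:
# a confounder would need risk ratios of at least that size with both
# exposure and outcome to reduce the association to null.
moderate = e_value(2.0)
# A risk ratio of 1.0 gives an E-value of 1.0: no confounding is needed.
null_case = e_value(1.0)
```

A small E-value means modest confounding could explain the finding; a large one does not rule confounding out, which matches the caution above.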
Red Flags in Evidence Quality
Study Design Red Flags
- No control group
- Self-selected participants
- No randomization even though it was feasible
- No blinding even though it was feasible
- Very small sample
- Inappropriate statistical tests
Reporting Red Flags
- Selective outcome reporting
- No study registration/protocol
- Missing methodological details
- No conflicts of interest statement
- Cherry-picked citations
- Results don't match methods
Interpretation Red Flags
- Causal language from correlational data
- Claiming "proof"
- Ignoring limitations
- Overgeneralizing
- Spinning negative results
- Post hoc rationalization
Context Red Flags
- Industry funding without independence
- Single study in isolation
- Contradicts preponderance of evidence
- No replication
- Published in predatory journal
- Press release before peer review
Practical Decision Framework
When Evaluating Evidence, Ask:
- What type of study is this? (Design)
- How well was it conducted? (Quality)
- What does it actually show? (Results)
- How likely is bias? (Internal validity)
- Does it apply to my question? (External validity)
- How does it fit with other evidence? (Context)
- Are the conclusions justified? (Interpretation)
- What are the limitations? (Uncertainty)
Making Decisions with Imperfect Evidence
High-quality evidence: - Strong confidence in acting on findings - Reasonable to change practice/policy
Moderate-quality evidence: - Provisional conclusions - Consider in conjunction with other factors - May warrant action depending on stakes
Low-quality evidence: - Weak confidence - Hypothesis-generating - Insufficient for major decisions alone - Consider cost/benefit of waiting for better evidence
Very low-quality evidence: - Very uncertain - Should not drive decisions alone - Useful for identifying gaps and research needs
When Evidence is Conflicting
Strategies: 1. Weight by study quality 2. Look for systematic differences (population, methods) 3. Consider publication bias 4. Update with most recent, rigorous evidence 5. Conduct/await systematic review 6. Consider if question is well-formed
Communicating Evidence Strength
Avoid: - Absolute certainty ("proves") - False balance (equal weight to unequal evidence) - Ignoring uncertainty - Cherry-picking studies
Better: - Quantify uncertainty - Describe strength of evidence - Acknowledge limitations - Present range of evidence - Distinguish established from emerging findings - Be clear about what is/isn't known