Common Statistical Pitfalls
P-Value Misinterpretations
Pitfall 1: P-Value = Probability Hypothesis is True
Misconception: p = .05 means 5% chance the null hypothesis is true.
Reality: P-value is the probability of observing data this extreme (or more) if the null hypothesis is true. It says nothing about the probability the hypothesis is true.
Correct interpretation: "If there were truly no effect, we would observe data this extreme only 5% of the time."
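A quick simulation makes this reading concrete. The sketch below (assuming NumPy and SciPy; the group size of 30 and the 10,000 runs are arbitrary choices) repeatedly draws two samples from identical populations, so the null is true by construction, and checks how often the data look "extreme" at the .05 level.

```python
# Minimal sketch: simulate experiments where the null is TRUE (both groups share
# the same mean) and confirm that "data this extreme" (p < .05) occurs ~5% of
# the time. This is what the p-value conditions on; it is not the probability
# that any hypothesis is true.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
n_sims, n_per_group = 10_000, 30

p_values = []
for _ in range(n_sims):
    a = rng.normal(loc=0, scale=1, size=n_per_group)  # null is true:
    b = rng.normal(loc=0, scale=1, size=n_per_group)  # identical populations
    p_values.append(stats.ttest_ind(a, b).pvalue)

print(f"Share of p < .05 under a true null: {np.mean(np.array(p_values) < 0.05):.3f}")
# Expected: close to 0.05
```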
Pitfall 2: Non-Significant = No Effect
Misconception: p > .05 proves there's no effect.
Reality: Absence of evidence ≠ evidence of absence. Non-significant results may indicate: - Insufficient statistical power - True effect too small to detect - High variability - Small sample size
Better approach: - Report confidence intervals - Conduct power analysis - Consider equivalence testing
Pitfall 3: Significant = Important
Misconception: Statistical significance means practical importance.
Reality: With large samples, trivial effects become "significant." A statistically significant 0.1 IQ point difference is meaningless in practice.
Better approach: - Report effect sizes - Consider practical significance - Use confidence intervals
Pitfall 4: P = .049 vs. P = .051
Misconception: These are meaningfully different because one crosses the .05 threshold.
Reality: These represent nearly identical evidence. The .05 threshold is arbitrary.
Better approach: - Treat p-values as continuous measures of evidence - Report exact p-values - Consider context and prior evidence
Pitfall 5: One-Tailed Tests Without Justification
Misconception: One-tailed tests are free extra power.
Reality: One-tailed tests assume effects can only go one direction, which is rarely true. They're often used to artificially boost significance.
When appropriate: Only when an effect in the untested direction is theoretically impossible or would be treated the same as no effect.
Multiple Comparisons Problems
Pitfall 6: Multiple Testing Without Correction
Problem: Testing 20 independent hypotheses at p < .05 gives a 1 - .95^20 ≈ 64% chance of at least one false positive.
Examples: - Testing many outcomes - Testing many subgroups - Conducting multiple interim analyses - Testing at multiple time points
Solutions: - Bonferroni correction (divide α by number of tests) - False Discovery Rate (FDR) control - Prespecify primary outcome - Treat exploratory analyses as hypothesis-generating
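A minimal sketch of the first two corrections, assuming statsmodels is available; the raw p-values below are invented purely for illustration.

```python
# Apply Bonferroni and Benjamini-Hochberg (FDR) corrections to a set of raw
# p-values. multipletests returns, among other things, a reject/keep decision
# per hypothesis at the corrected level.
from statsmodels.stats.multitest import multipletests

raw_p = [0.001, 0.012, 0.031, 0.048, 0.052, 0.20, 0.44, 0.81]  # illustrative only

bonf_reject, bonf_p, _, _ = multipletests(raw_p, alpha=0.05, method="bonferroni")
fdr_reject, fdr_p, _, _ = multipletests(raw_p, alpha=0.05, method="fdr_bh")

for p, b, f in zip(raw_p, bonf_reject, fdr_reject):
    print(f"raw p = {p:.3f}  Bonferroni reject: {b}  FDR(BH) reject: {f}")
```

FDR control is typically less conservative than Bonferroni, which is why it is often preferred when many related hypotheses are tested.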
Pitfall 7: Subgroup Analysis Fishing
Problem: Testing many subgroups until finding significance.
Why problematic: - Inflates the false positive rate - Often reported without disclosing how many subgroups were tested - "The interaction was significant in women" may be a chance finding
Solutions: - Prespecify subgroups - Use interaction tests, not separate tests - Require replication - Correct for multiple comparisons
Pitfall 8: Outcome Switching
Problem: Analyzing many outcomes, reporting only significant ones.
Detection signs: - Secondary outcomes emphasized - Incomplete outcome reporting - Discrepancy between registration and publication
Solutions: - Preregister all outcomes - Report all planned outcomes - Distinguish primary from secondary
Sample Size and Power Issues
Pitfall 9: Underpowered Studies
Problem: Small samples have low probability of detecting true effects.
Consequences: - High false negative rate - Significant results more likely to be false positives - Overestimated effect sizes (when significant)
Solutions: - Conduct a priori power analysis - Aim for 80-90% power - Consider effect size from prior research
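One possible way to run the a priori power analysis, assuming statsmodels; the effect size of d = 0.5 is an illustrative assumption and should come from prior research or the smallest effect of interest.

```python
# A priori power analysis for a two-sample t-test: solve for the sample size
# per group needed to detect d = 0.5 with 80% power at alpha = .05.
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()
n_per_group = analysis.solve_power(effect_size=0.5, alpha=0.05, power=0.80,
                                    alternative="two-sided")
print(f"Required n per group: {n_per_group:.0f}")  # roughly 64 per group
```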
Pitfall 10: Post-Hoc Power Analysis
Problem: Calculating power after seeing results is circular and uninformative.
Why useless: - Non-significant results always have low "post-hoc power" - It recapitulates the p-value without new information
Better approach: - Calculate confidence intervals - Plan replication with adequate sample - Conduct prospective power analysis for future studies
Pitfall 11: Small Sample Fallacy
Problem: Trusting results from very small samples.
Issues: - High sampling variability - Outliers have large influence - Assumptions of tests violated - Confidence intervals very wide
Guidelines: - Be skeptical of n < 30 - Check assumptions carefully - Consider non-parametric tests - Replicate findings
Effect Size Misunderstandings
Pitfall 12: Ignoring Effect Size
Problem: Focusing only on significance, not magnitude.
Why problematic: - Significance ≠ importance - Can't compare across studies - Doesn't inform practical decisions
Solutions: - Always report effect sizes - Use standardized measures (Cohen's d, r, η²) - Interpret using field conventions - Consider minimum clinically important difference
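A minimal sketch of reporting magnitude alongside significance, assuming NumPy and SciPy; the group data are simulated and the pooled-SD version of Cohen's d is used.

```python
# Compute Cohen's d (pooled-SD version) alongside the t-test so that the
# magnitude of the difference is reported, not just its p-value.
import numpy as np
from scipy import stats

def cohens_d(x, y):
    """Standardized mean difference using the pooled standard deviation."""
    nx, ny = len(x), len(y)
    pooled_sd = np.sqrt(((nx - 1) * np.var(x, ddof=1) + (ny - 1) * np.var(y, ddof=1))
                        / (nx + ny - 2))
    return (np.mean(x) - np.mean(y)) / pooled_sd

rng = np.random.default_rng(0)
treatment = rng.normal(0.4, 1.0, 50)  # simulated groups for illustration
control = rng.normal(0.0, 1.0, 50)

t, p = stats.ttest_ind(treatment, control)
print(f"t = {t:.2f}, p = {p:.4f}, Cohen's d = {cohens_d(treatment, control):.2f}")
```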
Pitfall 13: Misinterpreting Standardized Effect Sizes
Problem: Treating Cohen's d = 0.5 as "medium" without context.
Reality: - Field-specific norms vary - Some fields have larger typical effects - Real-world importance depends on context
Better approach: - Compare to effects in same domain - Consider practical implications - Look at raw effect sizes too
Pitfall 14: Confusing Explained Variance with Importance
Problem: "Only explains 5% of variance" = unimportant.
Reality: - Height explains ~5% of variation in NBA player salary but is crucial - Complex phenomena have many small contributors - Predictive accuracy ≠ causal importance
Consideration: Context matters more than percentage alone.
Correlation and Causation
Pitfall 15: Correlation Implies Causation
Problem: Inferring causation from correlation.
Alternative explanations: - Reverse causation (B causes A, not A causes B) - Confounding (C causes both A and B) - Coincidence - Selection bias
Criteria for causation: - Temporal precedence - Covariation - No plausible alternatives - Ideally: experimental manipulation
Pitfall 16: Ecological Fallacy
Problem: Inferring individual-level relationships from group-level data.
Example: The fact that countries with higher chocolate consumption produce more Nobel laureates doesn't mean eating chocolate makes you win a Nobel.
Why problematic: Group-level correlations may not hold at individual level.
Pitfall 17: Simpson's Paradox
Problem: Trend appears in groups but reverses when combined (or vice versa).
Example: Treatment appears worse overall but better in every subgroup.
Cause: Confounding variable distributed differently across groups.
Solution: Consider confounders and look at appropriate level of analysis.
Regression and Modeling Pitfalls
Pitfall 18: Overfitting
Problem: Model fits sample data well but doesn't generalize.
Causes: - Too many predictors relative to sample size - Fitting noise rather than signal - No cross-validation
Solutions: - Use cross-validation - Penalized regression (LASSO, ridge) - Independent test set - Simpler models
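The contrast between in-sample fit and cross-validated performance can be shown in a few lines, assuming scikit-learn; the data here are pure noise with many predictors, so the gap between the two scores is the overfitting.

```python
# Compare in-sample R^2 with 5-fold cross-validated R^2. With 40 predictors and
# only 60 observations of unrelated noise, the in-sample fit looks impressive
# while the cross-validated fit collapses.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(1)
X = rng.normal(size=(60, 40))   # many predictors, modest sample
y = rng.normal(size=60)         # outcome unrelated to the predictors

model = LinearRegression().fit(X, y)
print(f"In-sample R^2:       {model.score(X, y):.2f}")                      # optimistic
print(f"Cross-validated R^2: {cross_val_score(model, X, y, cv=5).mean():.2f}")  # near or below 0
```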
Pitfall 19: Extrapolation Beyond Data Range
Problem: Predicting outside the range of observed data.
Why dangerous: - Relationships may not hold outside observed range - Increased uncertainty not reflected in predictions
Solution: Only interpolate; avoid extrapolation.
Pitfall 20: Ignoring Model Assumptions
Problem: Using statistical tests without checking assumptions.
Common violations: - Non-normality (for parametric tests) - Heteroscedasticity (unequal variances) - Non-independence of observations - Non-linearity - Multicollinearity
Solutions: - Check assumptions with diagnostics - Use robust methods - Transform data - Use appropriate non-parametric alternatives
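A minimal sketch of two common diagnostics for an OLS regression, assuming statsmodels and SciPy; the data are simulated, and formal tests should be read alongside residual plots rather than instead of them.

```python
# Check residual normality (Shapiro-Wilk) and constant variance (Breusch-Pagan)
# after fitting an ordinary least squares model.
import numpy as np
import statsmodels.api as sm
from scipy import stats
from statsmodels.stats.diagnostic import het_breuschpagan

rng = np.random.default_rng(2)
x = rng.uniform(0, 10, 200)
y = 2.0 + 0.5 * x + rng.normal(0, 1, 200)  # simulated, assumptions satisfied

X = sm.add_constant(x)
fit = sm.OLS(y, X).fit()

shapiro_p = stats.shapiro(fit.resid).pvalue
bp_stat, bp_p, _, _ = het_breuschpagan(fit.resid, X)
print(f"Shapiro-Wilk p (residual normality): {shapiro_p:.3f}")
print(f"Breusch-Pagan p (constant variance): {bp_p:.3f}")
# Low p-values flag violations; also inspect residual plots, not only tests.
```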
Pitfall 21: Treating Non-Significant Covariates as Eliminating Confounding
Problem: "We controlled for X and it wasn't significant, so it's not a confounder."
Reality: Non-significant covariates can still be important confounders; a covariate's p-value is not a test of whether it confounds the relationship of interest.
Solution: Include theoretically important covariates regardless of significance.
Pitfall 22: Collinearity Masking Effects
Problem: When predictors are highly correlated, true effects may appear non-significant.
Manifestations: - Large standard errors - Unstable coefficients - Sign changes when adding/removing variables
Detection: - Variance Inflation Factors (VIF) - Correlation matrices
Solutions: - Remove redundant predictors - Combine correlated variables - Use regularization methods
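Variance Inflation Factors are straightforward to compute; a minimal sketch assuming statsmodels and pandas, with x1 and x2 deliberately constructed to be nearly identical.

```python
# Compute a VIF per predictor. Large values for x1 and x2 reveal the
# collinearity that inflates their standard errors.
import numpy as np
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor
from statsmodels.tools import add_constant

rng = np.random.default_rng(3)
x1 = rng.normal(size=500)
x2 = x1 + rng.normal(scale=0.1, size=500)  # nearly a copy of x1
x3 = rng.normal(size=500)                  # independent predictor

X = add_constant(pd.DataFrame({"x1": x1, "x2": x2, "x3": x3}))
for i, name in enumerate(X.columns):
    print(f"VIF {name}: {variance_inflation_factor(X.values, i):.1f}")
# Rule of thumb: VIF above ~5-10 suggests problematic collinearity (x1, x2 here).
```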
Specific Test Misuses
Pitfall 23: T-Test for Multiple Groups
Problem: Conducting multiple t-tests instead of ANOVA.
Why wrong: Inflates Type I error rate dramatically.
Correct approach: - Use ANOVA first - Follow with planned comparisons or post-hoc tests with correction
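A minimal sketch of that workflow, assuming SciPy and statsmodels; the three groups are simulated, with only the third shifted.

```python
# One-way ANOVA followed by Tukey's HSD, which builds family-wise error control
# into the pairwise comparisons, instead of running uncorrected t-tests.
import numpy as np
from scipy import stats
from statsmodels.stats.multicomp import pairwise_tukeyhsd

rng = np.random.default_rng(4)
g1 = rng.normal(0.0, 1.0, 40)
g2 = rng.normal(0.0, 1.0, 40)
g3 = rng.normal(0.8, 1.0, 40)  # shifted group

f, p = stats.f_oneway(g1, g2, g3)
print(f"One-way ANOVA: F = {f:.2f}, p = {p:.4f}")

values = np.concatenate([g1, g2, g3])
labels = ["g1"] * 40 + ["g2"] * 40 + ["g3"] * 40
print(pairwise_tukeyhsd(values, labels, alpha=0.05))  # post-hoc pairwise table
```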
Pitfall 24: Pearson Correlation for Non-Linear Relationships
Problem: Using Pearson's r for curved relationships.
Why misleading: r measures linear relationships only.
Solutions: - Check scatterplots first - Use Spearman's ρ for monotonic relationships - Consider polynomial or non-linear models
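The difference is easy to see numerically; a minimal sketch assuming NumPy and SciPy, with a simulated relationship that is monotonic but strongly curved.

```python
# Pearson's r understates a strong but non-linear (monotonic) relationship
# that Spearman's rho captures.
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
x = rng.uniform(0, 10, 300)
y = np.exp(x) + rng.normal(0, 1, 300)  # monotonic but far from linear

r, _ = stats.pearsonr(x, y)
rho, _ = stats.spearmanr(x, y)
print(f"Pearson r:    {r:.2f}")    # noticeably below 1
print(f"Spearman rho: {rho:.2f}")  # close to 1
```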
Pitfall 25: Chi-Square with Small Expected Frequencies
Problem: Chi-square test with expected cell counts < 5.
Why wrong: The chi-square approximation to the sampling distribution breaks down with small expected counts, so p-values are inaccurate.
Solutions: - Fisher's exact test - Combine categories - Increase sample size
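Fisher's exact test is a one-liner in SciPy; the 2x2 counts below are invented for illustration.

```python
# Fisher's exact test for a 2x2 table with small expected cell counts, where
# the chi-square approximation is unreliable.
from scipy import stats

#                outcome+  outcome-
table = [[3, 7],   # treatment
         [9, 1]]   # control

odds_ratio, p = stats.fisher_exact(table)
print(f"Fisher's exact test: OR = {odds_ratio:.2f}, p = {p:.4f}")
```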
Pitfall 26: Paired vs. Independent Tests
Problem: Using independent samples test for paired data (or vice versa).
Why wrong: - Wastes power (paired data analyzed as independent) - Violates independence assumption (independent data analyzed as paired)
Solution: Match test to design.
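The power cost of mismatching test and design shows up clearly in simulation; a minimal sketch assuming NumPy and SciPy, with simulated pre/post data in which every subject improves slightly but baseline differences between subjects are large.

```python
# The same pre/post data analyzed with a paired test versus (incorrectly) an
# independent-samples test. The paired test removes between-subject variability
# and detects the small, consistent improvement; the independent test misses it.
import numpy as np
from scipy import stats

rng = np.random.default_rng(6)
baseline = rng.normal(100, 15, 30)            # large between-subject spread
followup = baseline + rng.normal(2, 3, 30)    # small, consistent improvement

t_paired, p_paired = stats.ttest_rel(followup, baseline)
t_indep, p_indep = stats.ttest_ind(followup, baseline)

print(f"Paired t-test:      t = {t_paired:.2f}, p = {p_paired:.4f}")
print(f"Independent t-test: t = {t_indep:.2f}, p = {p_indep:.4f}")
```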
Confidence Interval Misinterpretations
Pitfall 27: 95% CI = 95% Probability True Value Inside
Misconception: "95% chance the true value is in this interval."
Reality: The true value either is or isn't in this specific interval. If we repeated the study many times, 95% of resulting intervals would contain the true value.
Better interpretation: "We're 95% confident this interval contains the true value."
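The repeated-sampling meaning of "95%" can be verified directly; a minimal sketch assuming NumPy and SciPy, with a known true mean and arbitrary simulation settings.

```python
# Repeat the same experiment many times and count how often the computed 95% CI
# contains the (known) true mean. Coverage is a property of the procedure, not
# of any single interval.
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
true_mean, n, n_sims = 10.0, 25, 10_000

covered = 0
for _ in range(n_sims):
    sample = rng.normal(true_mean, 2.0, n)
    lo, hi = stats.t.interval(0.95, n - 1,
                              loc=sample.mean(),
                              scale=stats.sem(sample))
    covered += (lo <= true_mean <= hi)

print(f"Proportion of intervals containing the true mean: {covered / n_sims:.3f}")
# Expected: close to 0.95
```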
Pitfall 28: Overlapping CIs = No Difference
Problem: Assuming overlapping confidence intervals mean no significant difference.
Reality: Checking whether two 95% CIs overlap is a more conservative criterion than directly testing the difference. Two CIs can overlap while the difference between groups is still significant.
Guideline: Whether one group's point estimate falls inside the other group's CI is a better rough guide than whether the intervals overlap; better still, compute a CI for the difference itself.
Pitfall 29: Ignoring CI Width
Problem: Focusing only on whether CI includes zero, not precision.
Why important: Wide CIs indicate high uncertainty. "Significant" effects with huge CIs are less convincing.
Consider: Both significance and precision.
Bayesian vs. Frequentist Confusions
Pitfall 30: Mixing Bayesian and Frequentist Interpretations
Problem: Making Bayesian statements from frequentist analyses.
Examples: - "Probability hypothesis is true" (Bayesian) from p-value (frequentist) - "Evidence for null" from non-significant result (frequentist can't support null)
Solution: - Be clear about framework - Use Bayesian methods for Bayesian questions - Use Bayes factors to compare hypotheses
Pitfall 31: Ignoring Prior Probability
Problem: Treating all hypotheses as equally likely initially.
Reality: Extraordinary claims need extraordinary evidence. Prior plausibility matters.
Consider: - Plausibility given existing knowledge - Mechanism plausibility - Base rates
Data Transformation Issues
Pitfall 32: Dichotomizing Continuous Variables
Problem: Splitting continuous variables at arbitrary cutoffs.
Consequences: - Loss of information and power - Arbitrary distinctions - Discarding individual differences
Exceptions: Clinically meaningful cutoffs with strong justification.
Better: Keep continuous or use multiple categories.
Pitfall 33: Trying Multiple Transformations
Problem: Testing many transformations until finding significance.
Why problematic: Inflates Type I error, is a form of p-hacking.
Better approach: - Prespecify transformations - Use theory-driven transformations - Correct for multiple testing if exploring
Missing Data Problems
Pitfall 34: Listwise Deletion by Default
Problem: Automatically deleting all cases with any missing data.
Consequences: - Reduced power - Potential bias if data are not missing completely at random (MCAR)
Better approaches: - Multiple imputation - Maximum likelihood methods - Analyze missingness patterns
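A minimal sketch of model-based imputation with scikit-learn's IterativeImputer as an alternative to listwise deletion; the tiny array is invented, and note that full multiple imputation would repeat the imputation several times and pool results across the imputed datasets.

```python
# Model-based imputation: IterativeImputer models each feature with missing
# values as a function of the others. It is experimental in scikit-learn and
# must be enabled explicitly before import.
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

X = np.array([
    [1.0, 2.0, np.nan],
    [3.0, np.nan, 6.0],
    [5.0, 6.0, 9.0],
    [7.0, 8.0, 12.0],
])

imputer = IterativeImputer(random_state=0)
X_imputed = imputer.fit_transform(X)
print(X_imputed)
```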
Pitfall 35: Ignoring Missing Data Mechanisms
Problem: Not considering why data are missing.
Types: - MCAR (Missing Completely at Random): deletion is unbiased but loses power - MAR (Missing at Random): can be handled by imputation or maximum likelihood - MNAR (Missing Not at Random): likely to bias results
Solution: Analyze patterns, use appropriate methods, consider sensitivity analyses.
Publication and Reporting Issues
Pitfall 36: Selective Reporting
Problem: Only reporting significant results or favorable analyses.
Consequences: - Literature appears more consistent than reality - Meta-analyses biased - Wasted research effort
Solutions: - Preregistration - Report all analyses - Use reporting guidelines (CONSORT, PRISMA, etc.)
Pitfall 37: Rounding to p < .05
Problem: Reporting thresholds or rounding selectively instead of exact values (e.g., "p < .05" when p = .049 but only "n.s." when p = .051).
Why problematic: Obscures values near the threshold and makes p-hacking harder to detect.
Better: Always report exact p-values.
Pitfall 38: No Data Sharing
Problem: Not making data available for verification or reanalysis.
Consequences: - Can't verify results - Can't include in meta-analyses - Hinders scientific progress
Best practice: Share data unless privacy concerns prohibit.
Cross-Validation and Generalization
Pitfall 39: No Cross-Validation
Problem: Testing model on same data used to build it.
Consequence: Overly optimistic performance estimates.
Solutions: - Split data (train/test) - K-fold cross-validation - Independent validation sample
Pitfall 40: Data Leakage
Problem: Information from test set leaking into training.
Examples: - Normalizing or scaling before splitting - Feature selection on the full dataset - Using future information to predict past outcomes (temporal leakage)
Consequence: Inflated performance metrics.
Prevention: All preprocessing decisions made using only training data.
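One common safeguard, assuming scikit-learn, is to wrap preprocessing and the model in a Pipeline so every preprocessing step is refit inside each training fold; the data and model choice below are illustrative.

```python
# Prevent leakage by letting the pipeline fit the scaler on training folds only,
# rather than scaling the full dataset before cross-validation.
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(8)
X = rng.normal(size=(200, 10))
y = (X[:, 0] + rng.normal(scale=1.0, size=200) > 0).astype(int)

# Wrong: scaler.fit(X) on all data, then cross-validate -> test info leaks in.
# Right: the pipeline refits the scaler within each training fold.
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("clf", LogisticRegression()),
])
scores = cross_val_score(pipe, X, y, cv=5)
print(f"Leak-free cross-validated accuracy: {scores.mean():.2f}")
```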
Meta-Analysis Pitfalls
Pitfall 41: Apples and Oranges
Problem: Combining studies with different designs, populations, or measures.
Balance: Inclusion criteria must be broad enough to be comprehensive yet narrow enough that the pooled studies are genuinely comparable.
Solutions: - Clear inclusion criteria - Subgroup analyses - Meta-regression for moderators
Pitfall 42: Ignoring Publication Bias
Problem: Published studies overrepresent significant results.
Consequences: Overestimated effects in meta-analyses.
Detection: - Funnel plots - Trim-and-fill - PET-PEESE - P-curve analysis
Solutions: - Include unpublished studies - Register reviews - Use bias-correction methods
General Best Practices
- Preregister studies: distinguish confirmatory from exploratory analyses
- Report transparently: all analyses, not just the significant ones
- Check assumptions: don't blindly apply tests
- Use appropriate tests: match the test to the data and design
- Report effect sizes: not just p-values
- Consider practical significance: not just statistical significance
- Replicate findings: one study is rarely definitive
- Share data and code: enable verification
- Use confidence intervals: show uncertainty
- Think carefully about causal claims: most research is correlational