Common Statistical Pitfalls
P-Value Misinterpretations
Pitfall 1: P-Value = Probability Hypothesis is True
Misconception: p = .05 means 5% chance the null hypothesis is true.
Reality: P-value is the probability of observing data this extreme (or more) if the null hypothesis is true. It says nothing about the probability the hypothesis is true.
Correct interpretation: "If there were truly no effect, we would observe data this extreme only 5% of the time."
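A quick simulation makes this reading concrete. The sketch below (assuming NumPy and SciPy; the group size of 30 and the 10,000 runs are arbitrary choices) repeatedly draws two samples from identical populations, so the null is true by construction, and checks how often the data look "extreme" at the .05 level.

```python
# Minimal sketch: simulate experiments where the null is TRUE (both groups share
# the same mean) and confirm that "data this extreme" (p < .05) occurs ~5% of
# the time. This is what the p-value conditions on; it is not the probability
# that any hypothesis is true.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
n_sims, n_per_group = 10_000, 30

p_values = []
for _ in range(n_sims):
    a = rng.normal(loc=0, scale=1, size=n_per_group)  # null is true:
    b = rng.normal(loc=0, scale=1, size=n_per_group)  # identical populations
    p_values.append(stats.ttest_ind(a, b).pvalue)

print(f"Share of p < .05 under a true null: {np.mean(np.array(p_values) < 0.05):.3f}")
# Expected: close to 0.05
```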
Pitfall 2: Non-Significant = No Effect
Misconception: p > .05 proves there's no effect.
Reality: Absence of evidence ≠ evidence of absence. Non-significant results may indicate: - Insufficient statistical power - True effect too small to detect - High variability - Small sample size
Better approach: - Report confidence intervals - Conduct power analysis - Consider equivalence testing
Pitfall 3: Significant = Important
Misconception: Statistical significance means practical importance.
Reality: With large samples, trivial effects become "significant." A statistically significant 0.1 IQ point difference is meaningless in practice.
Better approach: - Report effect sizes - Consider practical significance - Use confidence intervals
Pitfall 4: P = .049 vs. P = .051
Misconception: These are meaningfully different because one crosses the .05 threshold.
Reality: These represent nearly identical evidence. The .05 threshold is arbitrary.
Better approach: - Treat p-values as continuous measures of evidence - Report exact p-values - Consider context and prior evidence
Pitfall 5: One-Tailed Tests Without Justification
Misconception: One-tailed tests are free extra power.
Reality: One-tailed tests assume effects can only go one direction, which is rarely true. They're often used to artificially boost significance.
When appropriate: Only when an effect in the untested direction is theoretically impossible or would be treated the same as no effect.
Multiple Comparisons Problems
Pitfall 6: Multiple Testing Without Correction
Problem: Testing 20 independent hypotheses at p < .05 gives a 1 - .95^20 ≈ 64% chance of at least one false positive.
Examples: - Testing many outcomes - Testing many subgroups - Conducting multiple interim analyses - Testing at multiple time points
Solutions: - Bonferroni correction (divide α by number of tests) - False Discovery Rate (FDR) control - Prespecify primary outcome - Treat exploratory analyses as hypothesis-generating
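A minimal sketch of the first two corrections, assuming statsmodels is available; the raw p-values below are invented purely for illustration.

```python
# Apply Bonferroni and Benjamini-Hochberg (FDR) corrections to a set of raw
# p-values. multipletests returns, among other things, a reject/keep decision
# per hypothesis at the corrected level.
from statsmodels.stats.multitest import multipletests

raw_p = [0.001, 0.012, 0.031, 0.048, 0.052, 0.20, 0.44, 0.81]  # illustrative only

bonf_reject, bonf_p, _, _ = multipletests(raw_p, alpha=0.05, method="bonferroni")
fdr_reject, fdr_p, _, _ = multipletests(raw_p, alpha=0.05, method="fdr_bh")

for p, b, f in zip(raw_p, bonf_reject, fdr_reject):
    print(f"raw p = {p:.3f}  Bonferroni reject: {b}  FDR(BH) reject: {f}")
```

FDR control is typically less conservative than Bonferroni, which is why it is often preferred when many related hypotheses are tested.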
Pitfall 7: Subgroup Analysis Fishing
Problem: Testing many subgroups until finding significance.
Why problematic: - Inflates the false positive rate - Often reported without disclosing how many subgroups were tested - "The interaction was significant in women" may be a chance finding
Solutions: - Prespecify subgroups - Use interaction tests, not separate tests - Require replication - Correct for multiple comparisons
Pitfall 8: Outcome Switching
Problem: Analyzing many outcomes, reporting only significant ones.
Detection signs: - Secondary outcomes emphasized - Incomplete outcome reporting - Discrepancy between registration and publication
Solutions: - Preregister all outcomes - Report all planned outcomes - Distinguish primary from secondary
Sample Size and Power Issues
Pitfall 9: Underpowered Studies
Problem: Small samples have low probability of detecting true effects.
Consequences: - High false negative rate - Significant results more likely to be false positives - Overestimated effect sizes (when significant)
Solutions: - Conduct a priori power analysis - Aim for 80-90% power - Consider effect size from prior research
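One possible way to run the a priori power analysis, assuming statsmodels; the effect size of d = 0.5 is an illustrative assumption and should come from prior research or the smallest effect of interest.

```python
# A priori power analysis for a two-sample t-test: solve for the sample size
# per group needed to detect d = 0.5 with 80% power at alpha = .05.
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()
n_per_group = analysis.solve_power(effect_size=0.5, alpha=0.05, power=0.80,
                                    alternative="two-sided")
print(f"Required n per group: {n_per_group:.0f}")  # roughly 64 per group
```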
Pitfall 10: Post-Hoc Power Analysis
Problem: Calculating power after seeing results is circular and uninformative.
Why useless: - Non-significant results always have low "post-hoc power" - It recapitulates the p-value without new information
Better approach: - Calculate confidence intervals - Plan replication with adequate sample - Conduct prospective power analysis for future studies
Pitfall 11: Small Sample Fallacy
Problem: Trusting results from very small samples.
Issues: - High sampling variability - Outliers have large influence - Assumptions of tests violated - Confidence intervals very wide
Guidelines: - Be skeptical of n < 30 - Check assumptions carefully - Consider non-parametric tests - Replicate findings
Effect Size Misunderstandings
Pitfall 12: Ignoring Effect Size
Problem: Focusing only on significance, not magnitude.
Why problematic: - Significance ≠ importance - Can't compare across studies - Doesn't inform practical decisions
Solutions: - Always report effect sizes - Use standardized measures (Cohen's d, r, η²) - Interpret using field conventions - Consider minimum clinically important difference
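A minimal sketch of reporting magnitude alongside significance, assuming NumPy and SciPy; the group data are simulated and the pooled-SD version of Cohen's d is used.

```python
# Compute Cohen's d (pooled-SD version) alongside the t-test so that the
# magnitude of the difference is reported, not just its p-value.
import numpy as np
from scipy import stats

def cohens_d(x, y):
    """Standardized mean difference using the pooled standard deviation."""
    nx, ny = len(x), len(y)
    pooled_sd = np.sqrt(((nx - 1) * np.var(x, ddof=1) + (ny - 1) * np.var(y, ddof=1))
                        / (nx + ny - 2))
    return (np.mean(x) - np.mean(y)) / pooled_sd

rng = np.random.default_rng(0)
treatment = rng.normal(0.4, 1.0, 50)  # simulated groups for illustration
control = rng.normal(0.0, 1.0, 50)

t, p = stats.ttest_ind(treatment, control)
print(f"t = {t:.2f}, p = {p:.4f}, Cohen's d = {cohens_d(treatment, control):.2f}")
```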
Pitfall 13: Misinterpreting Standardized Effect Sizes
Problem: Treating Cohen's d = 0.5 as "medium" without context.
Reality: - Field-specific norms vary - Some fields have larger typical effects - Real-world importance depends on context
Better approach: - Compare to effects in same domain - Consider practical implications - Look at raw effect sizes too
Pitfall 14: Confusing Explained Variance with Importance
Problem: "Only explains 5% of variance" = unimportant.
Reality: - Height explains ~5% of variation in NBA player salary but is crucial - Complex phenomena have many small contributors - Predictive accuracy ≠ causal importance
Consideration: Context matters more than percentage alone.
Correlation and Causation
Pitfall 15: Correlation Implies Causation
Problem: Inferring causation from correlation.
Alternative explanations: - Reverse causation (B causes A, not A causes B) - Confounding (C causes both A and B) - Coincidence - Selection bias
Criteria for causation: - Temporal precedence - Covariation - No plausible alternatives - Ideally: experimental manipulation
Pitfall 16: Ecological Fallacy
Problem: Inferring individual-level relationships from group-level data.
Example: The fact that countries with higher chocolate consumption produce more Nobel laureates doesn't mean eating chocolate makes you win a Nobel.
Why problematic: Group-level correlations may not hold at individual level.
Pitfall 17: Simpson's Paradox
Problem: Trend appears in groups but reverses when combined (or vice versa).
Example: Treatment appears worse overall but better in every subgroup.
Cause: Confounding variable distributed differently across groups.
Solution: Consider confounders and look at appropriate level of analysis.
Regression and Modeling Pitfalls
Pitfall 18: Overfitting
Problem: Model fits sample data well but doesn't generalize.
Causes: - Too many predictors relative to sample size - Fitting noise rather than signal - No cross-validation
Solutions: - Use cross-validation - Penalized regression (LASSO, ridge) - Independent test set - Simpler models
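The contrast between in-sample fit and cross-validated performance can be shown in a few lines, assuming scikit-learn; the data here are pure noise with many predictors, so the gap between the two scores is the overfitting.

```python
# Compare in-sample R^2 with 5-fold cross-validated R^2. With 40 predictors and
# only 60 observations of unrelated noise, the in-sample fit looks impressive
# while the cross-validated fit collapses.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(1)
X = rng.normal(size=(60, 40))   # many predictors, modest sample
y = rng.normal(size=60)         # outcome unrelated to the predictors

model = LinearRegression().fit(X, y)
print(f"In-sample R^2:       {model.score(X, y):.2f}")                      # optimistic
print(f"Cross-validated R^2: {cross_val_score(model, X, y, cv=5).mean():.2f}")  # near or below 0
```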
Pitfall 19: Extrapolation Beyond Data Range
Problem: Predicting outside the range of observed data.
Why dangerous: - Relationships may not hold outside observed range - Increased uncertainty not reflected in predictions
Solution: Only interpolate; avoid extrapolation.
Pitfall 20: Ignoring Model Assumptions
Problem: Using statistical tests without checking assumptions.
Common violations: - Non-normality (for parametric tests) - Heteroscedasticity (unequal variances) - Non-independence of observations - Non-linearity - Multicollinearity
Solutions: - Check assumptions with diagnostics - Use robust methods - Transform data - Use appropriate non-parametric alternatives
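A minimal sketch of two common diagnostics for an OLS regression, assuming statsmodels and SciPy; the data are simulated, and formal tests should be read alongside residual plots rather than instead of them.

```python
# Check residual normality (Shapiro-Wilk) and constant variance (Breusch-Pagan)
# after fitting an ordinary least squares model.
import numpy as np
import statsmodels.api as sm
from scipy import stats
from statsmodels.stats.diagnostic import het_breuschpagan

rng = np.random.default_rng(2)
x = rng.uniform(0, 10, 200)
y = 2.0 + 0.5 * x + rng.normal(0, 1, 200)  # simulated, assumptions satisfied

X = sm.add_constant(x)
fit = sm.OLS(y, X).fit()

shapiro_p = stats.shapiro(fit.resid).pvalue
bp_stat, bp_p, _, _ = het_breuschpagan(fit.resid, X)
print(f"Shapiro-Wilk p (residual normality): {shapiro_p:.3f}")
print(f"Breusch-Pagan p (constant variance): {bp_p:.3f}")
# Low p-values flag violations; also inspect residual plots, not only tests.
```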
Pitfall 21: Treating Non-Significant Covariates as Eliminating Confounding
Problem: "We controlled for X and it wasn't significant, so it's not a confounder."
Reality: Non-significant covariates can still be important confounders; a covariate's p-value is not a test of whether it confounds the relationship of interest.
Solution: Include theoretically important covariates regardless of significance.
Pitfall 22: Collinearity Masking Effects
Problem: When predictors are highly correlated, true effects may appear non-significant.
Manifestations: - Large standard errors - Unstable coefficients - Sign changes when adding/removing variables
Detection: - Variance Inflation Factors (VIF) - Correlation matrices
Solutions: - Remove redundant predictors - Combine correlated variables - Use regularization methods
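Variance Inflation Factors are straightforward to compute; a minimal sketch assuming statsmodels and pandas, with x1 and x2 deliberately constructed to be nearly identical.

```python
# Compute a VIF per predictor. Large values for x1 and x2 reveal the
# collinearity that inflates their standard errors.
import numpy as np
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor
from statsmodels.tools import add_constant

rng = np.random.default_rng(3)
x1 = rng.normal(size=500)
x2 = x1 + rng.normal(scale=0.1, size=500)  # nearly a copy of x1
x3 = rng.normal(size=500)                  # independent predictor

X = add_constant(pd.DataFrame({"x1": x1, "x2": x2, "x3": x3}))
for i, name in enumerate(X.columns):
    print(f"VIF {name}: {variance_inflation_factor(X.values, i):.1f}")
# Rule of thumb: VIF above ~5-10 suggests problematic collinearity (x1, x2 here).
```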
Specific Test Misuses
Pitfall 23: T-Test for Multiple Groups
Problem: Conducting multiple t-tests instead of ANOVA.
Why wrong: Inflates Type I error rate dramatically.
Correct approach: - Use ANOVA first - Follow with planned comparisons or post-hoc tests with correction
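A minimal sketch of that workflow, assuming SciPy and statsmodels; the three groups are simulated, with only the third shifted.

```python
# One-way ANOVA followed by Tukey's HSD, which builds family-wise error control
# into the pairwise comparisons, instead of running uncorrected t-tests.
import numpy as np
from scipy import stats
from statsmodels.stats.multicomp import pairwise_tukeyhsd

rng = np.random.default_rng(4)
g1 = rng.normal(0.0, 1.0, 40)
g2 = rng.normal(0.0, 1.0, 40)
g3 = rng.normal(0.8, 1.0, 40)  # shifted group

f, p = stats.f_oneway(g1, g2, g3)
print(f"One-way ANOVA: F = {f:.2f}, p = {p:.4f}")

values = np.concatenate([g1, g2, g3])
labels = ["g1"] * 40 + ["g2"] * 40 + ["g3"] * 40
print(pairwise_tukeyhsd(values, labels, alpha=0.05))  # post-hoc pairwise table
```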
Pitfall 24: Pearson Correlation for Non-Linear Relationships
Problem: Using Pearson's r for curved relationships.
Why misleading: r measures linear relationships only.
Solutions: - Check scatterplots first - Use Spearman's ρ for monotonic relationships - Consider polynomial or non-linear models
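The difference is easy to see numerically; a minimal sketch assuming NumPy and SciPy, with a simulated relationship that is monotonic but strongly curved.

```python
# Pearson's r understates a strong but non-linear (monotonic) relationship
# that Spearman's rho captures.
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
x = rng.uniform(0, 10, 300)
y = np.exp(x) + rng.normal(0, 1, 300)  # monotonic but far from linear

r, _ = stats.pearsonr(x, y)
rho, _ = stats.spearmanr(x, y)
print(f"Pearson r:    {r:.2f}")    # noticeably below 1
print(f"Spearman rho: {rho:.2f}")  # close to 1
```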
Pitfall 25: Chi-Square with Small Expected Frequencies
Problem: Chi-square test with expected cell counts < 5.
Why wrong: The chi-square approximation to the sampling distribution breaks down with small expected counts, so p-values are inaccurate.
Solutions: - Fisher's exact test - Combine categories - Increase sample size
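Fisher's exact test is a one-liner in SciPy; the 2x2 counts below are invented for illustration.

```python
# Fisher's exact test for a 2x2 table with small expected cell counts, where
# the chi-square approximation is unreliable.
from scipy import stats

#                outcome+  outcome-
table = [[3, 7],   # treatment
         [9, 1]]   # control

odds_ratio, p = stats.fisher_exact(table)
print(f"Fisher's exact test: OR = {odds_ratio:.2f}, p = {p:.4f}")
```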
Pitfall 26: Paired vs. Independent Tests
Problem: Using independent samples test for paired data (or vice versa).
Why wrong: - Wastes power (paired data analyzed as independent) - Violates independence assumption (independent data analyzed as paired)
Solution: Match test to design.
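The power cost of mismatching test and design shows up clearly in simulation; a minimal sketch assuming NumPy and SciPy, with simulated pre/post data in which every subject improves slightly but baseline differences between subjects are large.

```python
# The same pre/post data analyzed with a paired test versus (incorrectly) an
# independent-samples test. The paired test removes between-subject variability
# and detects the small, consistent improvement; the independent test misses it.
import numpy as np
from scipy import stats

rng = np.random.default_rng(6)
baseline = rng.normal(100, 15, 30)            # large between-subject spread
followup = baseline + rng.normal(2, 3, 30)    # small, consistent improvement

t_paired, p_paired = stats.ttest_rel(followup, baseline)
t_indep, p_indep = stats.ttest_ind(followup, baseline)

print(f"Paired t-test:      t = {t_paired:.2f}, p = {p_paired:.4f}")
print(f"Independent t-test: t = {t_indep:.2f}, p = {p_indep:.4f}")
```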
Confidence Interval Misinterpretations
Pitfall 27: 95% CI = 95% Probability True Value Inside
Misconception: "95% chance the true value is in this interval."
Reality: The true value either is or isn't in this specific interval. If we repeated the study many times, 95% of resulting intervals would contain the true value.
Better interpretation: "We're 95% confident this interval contains the true value."
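The repeated-sampling meaning of "95%" can be verified directly; a minimal sketch assuming NumPy and SciPy, with a known true mean and arbitrary simulation settings.

```python
# Repeat the same experiment many times and count how often the computed 95% CI
# contains the (known) true mean. Coverage is a property of the procedure, not
# of any single interval.
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
true_mean, n, n_sims = 10.0, 25, 10_000

covered = 0
for _ in range(n_sims):
    sample = rng.normal(true_mean, 2.0, n)
    lo, hi = stats.t.interval(0.95, n - 1,
                              loc=sample.mean(),
                              scale=stats.sem(sample))
    covered += (lo <= true_mean <= hi)

print(f"Proportion of intervals containing the true mean: {covered / n_sims:.3f}")
# Expected: close to 0.95
```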
Pitfall 28: Overlapping CIs = No Difference
Problem: Assuming overlapping confidence intervals mean no significant difference.
Reality: Checking whether two 95% CIs overlap is a more conservative criterion than directly testing the difference. Two CIs can overlap while the difference between groups is still significant.
Guideline: Whether one group's point estimate falls inside the other group's CI is a better rough guide than whether the intervals overlap; better still, compute a CI for the difference itself.
Pitfall 29: Ignoring CI Width
Problem: Focusing only on whether CI includes zero, not precision.
Why important: Wide CIs indicate high uncertainty. "Significant" effects with huge CIs are less convincing.
Consider: Both significance and precision.
Bayesian vs. Frequentist Confusions
Pitfall 30: Mixing Bayesian and Frequentist Interpretations
Problem: Making Bayesian statements from frequentist analyses.
Examples: - "Probability hypothesis is true" (Bayesian) from p-value (frequentist) - "Evidence for null" from non-significant result (frequentist can't support null)
Solution: - Be clear about framework - Use Bayesian methods for Bayesian questions - Use Bayes factors to compare hypotheses
Pitfall 31: Ignoring Prior Probability
Problem: Treating all hypotheses as equally likely initially.
Reality: Extraordinary claims need extraordinary evidence. Prior plausibility matters.
Consider: - Plausibility given existing knowledge - Mechanism plausibility - Base rates
Data Transformation Issues
Pitfall 32: Dichotomizing Continuous Variables
Problem: Splitting continuous variables at arbitrary cutoffs.
Consequences: - Loss of information and power - Arbitrary distinctions - Discarding individual differences
Exceptions: Clinically meaningful cutoffs with strong justification.
Better: Keep continuous or use multiple categories.
Pitfall 33: Trying Multiple Transformations
Problem: Testing many transformations until finding significance.
Why problematic: Inflates Type I error, is a form of p-hacking.
Better approach: - Prespecify transformations - Use theory-driven transformations - Correct for multiple testing if exploring
Missing Data Problems
Pitfall 34: Listwise Deletion by Default
Problem: Automatically deleting all cases with any missing data.
Consequences: - Reduced power - Potential bias if data are not missing completely at random (MCAR)
Better approaches: - Multiple imputation - Maximum likelihood methods - Analyze missingness patterns
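A minimal sketch of model-based imputation with scikit-learn's IterativeImputer as an alternative to listwise deletion; the tiny array is invented, and note that full multiple imputation would repeat the imputation several times and pool results across the imputed datasets.

```python
# Model-based imputation: IterativeImputer models each feature with missing
# values as a function of the others. It is experimental in scikit-learn and
# must be enabled explicitly before import.
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

X = np.array([
    [1.0, 2.0, np.nan],
    [3.0, np.nan, 6.0],
    [5.0, 6.0, 9.0],
    [7.0, 8.0, 12.0],
])

imputer = IterativeImputer(random_state=0)
X_imputed = imputer.fit_transform(X)
print(X_imputed)
```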
Pitfall 35: Ignoring Missing Data Mechanisms
Problem: Not considering why data are missing.
Types: - MCAR (Missing Completely at Random): deletion is unbiased but loses power - MAR (Missing at Random): can be handled by imputation or maximum likelihood - MNAR (Missing Not at Random): likely to bias results
Solution: Analyze patterns, use appropriate methods, consider sensitivity analyses.
Publication and Reporting Issues
Pitfall 36: Selective Reporting
Problem: Only reporting significant results or favorable analyses.
Consequences: - Literature appears more consistent than reality - Meta-analyses biased - Wasted research effort
Solutions: - Preregistration - Report all analyses - Use reporting guidelines (CONSORT, PRISMA, etc.)
Pitfall 37: Rounding to p < .05
Problem: Reporting thresholds or rounding selectively instead of exact values (e.g., "p < .05" when p = .049 but only "n.s." when p = .051).
Why problematic: Obscures values near the threshold and makes p-hacking harder to detect.
Better: Always report exact p-values.
Pitfall 38: No Data Sharing
Problem: Not making data available for verification or reanalysis.
Consequences: - Can't verify results - Can't include in meta-analyses - Hinders scientific progress
Best practice: Share data unless privacy concerns prohibit.
Cross-Validation and Generalization
Pitfall 39: No Cross-Validation
Problem: Testing model on same data used to build it.
Consequence: Overly optimistic performance estimates.
Solutions: - Split data (train/test) - K-fold cross-validation - Independent validation sample
Pitfall 40: Data Leakage
Problem: Information from test set leaking into training.
Examples: - Normalizing or scaling before splitting - Feature selection on the full dataset - Using future information to predict past outcomes (temporal leakage)
Consequence: Inflated performance metrics.
Prevention: All preprocessing decisions made using only training data.
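One common safeguard, assuming scikit-learn, is to wrap preprocessing and the model in a Pipeline so every preprocessing step is refit inside each training fold; the data and model choice below are illustrative.

```python
# Prevent leakage by letting the pipeline fit the scaler on training folds only,
# rather than scaling the full dataset before cross-validation.
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(8)
X = rng.normal(size=(200, 10))
y = (X[:, 0] + rng.normal(scale=1.0, size=200) > 0).astype(int)

# Wrong: scaler.fit(X) on all data, then cross-validate -> test info leaks in.
# Right: the pipeline refits the scaler within each training fold.
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("clf", LogisticRegression()),
])
scores = cross_val_score(pipe, X, y, cv=5)
print(f"Leak-free cross-validated accuracy: {scores.mean():.2f}")
```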
Meta-Analysis Pitfalls
Pitfall 41: Apples and Oranges
Problem: Combining studies with different designs, populations, or measures.
Balance: Inclusion criteria must be broad enough to be comprehensive yet narrow enough that the pooled studies are genuinely comparable.
Solutions: - Clear inclusion criteria - Subgroup analyses - Meta-regression for moderators
Pitfall 42: Ignoring Publication Bias
Problem: Published studies overrepresent significant results.
Consequences: Overestimated effects in meta-analyses.
Detection: - Funnel plots - Trim-and-fill - PET-PEESE - P-curve analysis
Solutions: - Include unpublished studies - Register reviews - Use bias-correction methods
General Best Practices
- Preregister studies: distinguish confirmatory from exploratory analyses
- Report transparently: all analyses, not just the significant ones
- Check assumptions: don't blindly apply tests
- Use appropriate tests: match the test to the data and design
- Report effect sizes: not just p-values
- Consider practical significance: not just statistical significance
- Replicate findings: one study is rarely definitive
- Share data and code: enable verification
- Use confidence intervals: show uncertainty
- Think carefully about causal claims: most research is correlational