Statistical inference, statistical tests

Critical views on the practice of statistical inference, null hypothesis testing

Armstrong, J. Scott. 2007. “Significance tests harm progress in forecasting.” International Journal of Forecasting 23(2): 321-327. [Science Direct]
Boring, E. G. (1919). Mathematical vs. scientific significance. Psychological Bulletin, 16(10), 335-338. [DOI]
Carver, Ronald P. 1993. “The Case Against Statistical Significance Testing, Revisited.” The Journal of Experimental Education 61(4): 287.
Cohen, Jacob. 1994. “The Earth is Round (p < .05).” American Psychologist 49(12), 997-1003.
Goldstein, Joshua S. 2010. "On Asterisk Inflation." PS: Political Science & Politics 43(01): 59-61. [Cambridge Journals]
Goodman, S N. 1999. “Toward evidence-based medical statistics. 1: The P value fallacy.” Annals of Internal Medicine 130(12): 995-1004.
John P. A. Ioannidis (2019) What Have We (Not) Learnt from Millions of Scientific Papers with P Values?, The American Statistician, 73:sup1, 20-25, DOI:10.1080/00031305.2018.1447512
Steven N. Goodman (2019) Why is Getting Rid of P-Values So Hard? Musings on Science and Statistics, The American Statistician, 73:sup1, 26-30, DOI: 10.1080/00031305.2018.1558111
Johnson, Douglas H. "The Insignificance of Statistical Significance Testing". The Journal of Wildlife Management, Vol. 63, No. 3. (Jul., 1999), pp. 763-772. [JSTOR]
Special Issue of the Journal of Socio-Economics, 2004 33(5) (M. Altman ed.) [ScienceDirect]

Altman, Morris. 2004. “Statistical significance, path dependency, and the culture of journal publication.” pp. 651-663.
Berg, Nathan. 2004. “No-decision classification: an alternative to testing for statistical significance.” pp. 631-650.
Elliott, Graham, and Clive W.J. Granger. 2004. “Evaluating significance: comments on 'Size Matters'.” pp. 547-550.
Fidler, Fiona et al. 2004. “Statistical reform in medicine, psychology and ecology.” pp. 615-630.
Gigerenzer, Gerd. 2004. “Mindless statistics.” pp. 587-606.
Horowitz, Joel L. 2004. “Comments on 'Size Matters'.” pp. 551-554.
Leamer, Edward E. 2004. “Are the roads red? Comments on 'Size Matters'.” pp. 555-557.
Lunt, Peter. 2004. “The significance of the significance test controversy: comments on 'Size Matters'.” pp. 559-564.
O'Brien, Anthony Patrick. 2004. “Why is the standard error of regression so low using historical data?: Comments on 'size matters'.” pp. 565-570.
Thompson, Bruce. 2004. “The "significance" crisis in psychology and education.” pp. 607-613.
Thorbecke, Erik. 2004. “Economic and statistical significance: comments on 'Size Matters'.” pp. 571-575.
Wooldridge, Jeffrey M. 2004. “Statistical significance is okay, too: comment on 'Size Matters'.” pp. 577-579.
Zellner, Arnold. 2004. “To test or not to test and if so, how?: Comments on 'Size Matters'.” pp. 581-586.
Ziliak, Stephen T., and Deirdre N. McCloskey. 2004a. “Significance redux.” pp. 665-675.
Ziliak, Stephen T., and Deirdre N. McCloskey. 2004b. “Size matters: the standard error of regressions in the American Economic Review.” pp. 527-546.

Koehnle, Thomas, Douglas Curran-Everett, and Dale J. Benos. 2005. “The proof is not in the P value.” Am J Physiol Regul Integr Comp Physiol 288(3): R777-778. [AJPREGU]
Lucas, Paul. What are the most misunderstood (and misused) concepts in statistics? [Quora answers]
Markel, William D. 1985. Statistical Significance: a Misunderstood Concept. School Science and Mathematics 85(5). [DOI]
McCloskey, Deirdre Nansen, and Steve Ziliak. 2008. The Cult of Statistical Significance: How the Standard Error Costs Us Jobs, Justice, and Lives. University of Michigan Press.
McCloskey DN. 1995. The insignificance of statistical significance. Sci Am 272:32–33.
Morrison, D.E., Henkel, R.E., 1970. The Significance Test Controversy: A Reader. Aldine, Chicago.
Gigerenzer, Gerd. 2004. ‘Mindless Statistics’. Journal of Socio-Economics 33(5): 587–606. [Science Direct]
Rozeboom, William W. 1960. "The Fallacy of the Null-Hypothesis Significance Test." Psychological Bulletin 57(5): 416-428.
Sterne, Jonathan A C. Sifting the evidence—what's wrong with significance tests? BMJ. 2001 Jan 27; 322(7280): 226–231.
Stephen Stigler. Fisher and the 5% Level. Chance VOL. 21, NO. 4, 2008 [PDF]
Wang, C. 1993. Sense and Nonsense of Statistical Inference. Dekker: New York.
Ziliak, Stephen T., and Deirdre N. McCloskey. 2004. “Size matters: the standard error of regressions in the American Economic Review.” Journal of Socio-Economics 33(5): 527-546. [Science Direct].

Advice, alternatives, publication policies

Cumming, Geoff. 2011. Understanding the New Statistics: Effect Sizes, Confidence Intervals, and Meta-analysis. 1st ed. Routledge Academic. [Book website]
Fidler, Fiona, Geoff Cumming, Mark Burgman, and Neil Thomason. 2004. “Statistical reform in medicine, psychology and ecology.” Journal of Socio-Economics 33(5): 615-630. [Science Direct]
Harlow. 1997. What If There Were No Significance Tests? Lawrence Erlbaum Assoc Inc.
Thompson, Bruce. 1996. “Research news and Comment: AERA Editorial Policies Regarding Statistical Significance Testing: Three Suggested Reforms.” Educational Researcher 25(2): 26-30. [Sage]
Wasserstein, Ronald L., and Nicole A. Lazar. 2016. “The ASA’s Statement on p-Values: Context, Process, and Purpose.” The American Statistician 70(2): 129–33. https://doi.org/10.1080/00031305.2016.1154108
Ronald L. Wasserstein, Allen L. Schirm & Nicole A. Lazar (2019). Moving to a World Beyond “p < 0.05”, The American Statistician, 73:sup1, 1-19, DOI:10.1080/00031305.2019.1583913
Wilkinson, Leland and Task Force on Statistical Inference (APA Board of Scientific Affairs). 1999. “Statistical methods in psychology journals: Guidelines and explanations.” American Psychologist. Vol. 54(8): 594-604. [Ebsco Host]

To classify

Altman D.G., Bland J.M. (1995), “Absence of evidence is not evidence of absence,” British Medical Journal, 311:485.
Altman, D.G., Machin, D., Bryant, T.N., and Gardner, M.J., eds. (2000), Statistics with Confidence, 2nd ed., London: BMJ Books.
Amrhein, Valentin, and Sander Greenland. 2018. “Remove, Rather than Redefine, Statistical Significance.” Nature Human Behaviour 2(1): 4–4.
Berger, J.O., and Delampady, M. (1987), “Testing precise hypotheses,” Statistical Science, 2, 317–335.
Berry, D. (2012), “Multiplicities in Cancer Research: Ubiquitous and Necessary Evils,” Journal of the National Cancer Institute, 104, 1124–1132
Curran-Everett D and Benos DJ. Guidelines for reporting statistics in journals published by the American Physiological Society. Am J Physiol Regul Integr Comp Physiol 287: R247–R249, 2004. [AJPREGU]
Curran-Everett D, Taylor S, and Kafadar K. Fundamental concepts in statistics: elucidation and illustration. J Appl Physiol 85: 775–786, 1998. [AJPREGU]
Christensen, R. (2005), “Testing Fisher, Neyman, Pearson, and Bayes,” The American Statistician, 59, 2, 121-126
Cox, D.R. (1982), “Statistical Significance Tests,” British Journal of Clinical Pharmacology, 14, 325-331
Demidenko, Eugene (2016). The p-Value You Can’t Buy. The American Statistician. 70,1 pages 33-38
Edwards, W., Lindman, H., and Savage, L.J. (1963), “Bayesian statistical inference for psychological research,” Psychological Review, 70, 193–242.
Gelman, A., and Loken, E. (2014), “The Statistical Crisis in Science [online],” American Scientist.
Gelman A, Stern HS. (2006), “The difference between ‘significant’ and ‘not significant’ is not itself statistically significant,” The American Statistician, 60:328–331.
Gelman, Andrew, and Christian Hennig. 2017. “Beyond Subjective and Objective in Statistics.” Journal of the Royal Statistical Society: Series A (Statistics in Society) 180(4): 967–1033.
Gelman, Andrew. 2018. “The Failure of Null Hypothesis Significance Testing When Studying Incremental Changes, and What to Do About It.” Personality and Social Psychology Bulletin 44(1): 16–23.
Gigerenzer G. The Empire of Chance: How Probability Changed Science and Everyday Life. New York: Cambridge Univ. Press, 1989.
Gigerenzer G. (2004), “Mindless statistics,” Journal of Socio-Economics, 33:587–606.
Goodman, S.N. (1999), “Toward Evidence-Based Medical Statistics 1: The P Value Fallacy,” Annals of Internal Medicine, 130, 995-1004.
Goodman, S.N. (1999), “Toward Evidence-Based Medical Statistics. 2: The Bayes Factor,” Annals of Internal Medicine, 130, 1005-1013.
Goodman, S.N. (2008), “A Dirty Dozen: Twelve P-Value Misconceptions,” Seminars in Hematology, 45, 135-140.
Greenland, S. (2011), “Null misinterpretation in statistical testing and its impact on health risk assessment,” Preventive Medicine, 53, 225–228.
Greenland, S. (2012). Nonsignificance plus high power does not imply support for the null over the alternative. Annals of Epidemiology, 22:364–368.
Greenland, S., and Poole C (2011), “Problems in common interpretations of statistics in scientific articles, expert reports, and testimony,” Jurimetrics, 51, 113–129.
Hoenig J.M., and Heisey D.M. (2001). The abuse of power: The pervasive fallacy of power calculations for data analysis. The American Statistician, 55:19–24.
Kuffner, Todd A., and Stephen G. Walker. 2019. “Why Are p-Values Controversial?” The American Statistician 73(1): 1–3.
Ioannidis, J.P. (2005), “Contradicted and initially stronger effects in highly cited clinical research.” Journal of the American Medical Association, 294, 218-228.
Ioannidis, J.P. (2008), “Why most discovered true associations are inflated (with discussion),” Epidemiology 19: 640-658.
Johnson, V.E. (2013), “Revised standards for statistical evidence,” Proceedings of the National Academy of Sciences, 110(48), 19313–19317.
Johnson, V.E. (2013), “Uniformly most powerful Bayesian tests,” Annals of Statistics, 41, 1716-1741.
Lang, J., Rothman K.J., and Cann, C.I. (1998), “That confounded P-value. (Editorial),” Epidemiology, 9, 7-8.
Lavine, M. (1999), “What is Bayesian Statistics and Why Everything Else is Wrong,” UMAP Journal, 20:2
Lew, M.J. (2012), “Bad statistical practice in pharmacology (and other basic biomedical disciplines): you probably don't know P,” British Journal of Pharmacology, 166:5, 1559-1567.
Locascio, Joseph J. 2017. “Results Blind Science Publishing.” Basic and Applied Social Psychology 39(5): 239–46.
Oakes MW. Statistical Inference: A commentary for the Social and Behavioural Sciences. New York: Wiley, 1986.
Mayo, Deborah G. 2018. Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars. Cambridge; New York, NY: Cambridge University Press.
McShane, Blakeley B., and David Gal. 2017. “Statistical Significance and the Dichotomization of Evidence.” Journal of the American Statistical Association 112(519): 885–95.
Morrison, Denton E.; Henkel, Ramon E. (2006) The Significance Test Controversy: A Reader: Methodological Perspectives. Aldine Transaction.
Phillips, C.V. (2004), “Publication bias in situ,” BMC Medical Research Methodology, 4:20.
Pearl, Judea. 2009. Causality. 2nd Revised edition. Cambridge, U.K.; New York: Cambridge University Press.
Pearl, Judea, and Dana Mackenzie. 2018. The Book of Why: The New Science of Cause and Effect. 1st ed. New York: Basic Books.
Poole C. (1987), “Beyond the confidence interval,” American Journal of Public Health, 77, 195–199.
Poole, C. (2001). Low P-values or narrow confidence intervals: Which are more durable? Epidemiology, 12, 291–294.
Rosenthal, Robert. 1979. “The File Drawer Problem and Tolerance for Null Results.” Psychological Bulletin 86(3): 638–41.
Rothman, K.J. (1978), “A show of confidence (Editorial),” New England Journal of Medicine, 299, 1362-1363.
Rothman, K.J. (1986), “Significance questing (Editorial),” Annals of Internal Medicine, 105, 445-447.
Rothman, K.J. (2010), “Curbing type I and type II errors,” European Journal of Epidemiology, 25, 223-224.
Rothman, K.J., Weiss, N.S., Robins, J., Neutra, R., and Stellman, S. (1992), “Amicus Curiae, brief for the U. S. Supreme Court, Daubert v. Merrell Dow Pharmaceuticals, Petition for Writ of Certiorari to the United States Court of Appeals for the Ninth Circuit,” No. 92-102, October Term, 1992
Rozeboom, W.W. (1960), “The fallacy of the null-hypothesis significance test,” Psychological Bulletin, 57:416–428.
Schervish, M.J. (1996), “P Values: What They Are and What They Are Not,” The American Statistician, 50:3, 203-206
Schmidt FL. Statistical significance testing and cumulative knowledge in psychology: implications for training of researchers. Psych Methods 1: 115–29, 1996.
Simmons, J.P., Nelson, L.D., and Simonsohn, U. (2011), “False-Positive Psychology: Undisclosed Flexibility in Data Collection and Analysis Allows Presenting Anything as Significant,” Psychological Science, 22(11), 1359-1366.
Stang, A., and Rothman, K.J. (2011), “That confounded P-value revisited,” Journal of Clinical Epidemiology, 64(9), 1047-1048
Stang, A., Poole, C., and Kuss, O. (2010), “The ongoing tyranny of statistical significance testing in biomedical research,” European Journal of Epidemiology, 25(4), 225-30.
Sterne, J. A. C. (2002). "Teaching hypothesis tests – time for significant change?" Statistics in Medicine, 21, 985-994.
Sterne, J. A. C. and G. D. Smith (2001). "Sifting the evidence – what's wrong with significance tests?" British Medical Journal, 322, 226-231.
Wainer H., Robinson D. H. Shaping Up the Practice of Null Hypothesis Significance Testing. Educational Researcher, 32(7), 22-30, 2003.
Wainer H. One cheer for null hypothesis significance testing. Psychological Methods, 4(2), 212-213, 1999.
Wainer H. & Robinson, DH. On the Past and Future of Null Hypothesis Significance Testing. Journal of Wildlife Management, 66, 263-271, 2002.
Wellek, Stefan. 2017. “A Critical Evaluation of the Current ‘p-Value Controversy’.” Biometrical Journal 59(5): 854–72.
Ronald L. Wasserstein & Nicole A. Lazar (2016): The ASA's statement on p-values: context, process, and purpose, The American Statistician, 70(2), 129-133.
Wasserstein, Ronald L., Allen L. Schirm, and Nicole A. Lazar. 2019. “Moving to a World Beyond ‘p < 0.05’.” The American Statistician 73(sup1): 1–19.
Weisberg, Herbert I. 2014. Willful Ignorance: The Mismeasure of Uncertainty. Hoboken, New Jersey: John Wiley & Sons.
Ziliak, S.T. (2010), “The Validus Medicus and a New Gold Standard,” The Lancet, 376, 9738, 324-325.

Brembs B, Button K, Munafo M. (2013) Deep impact: unintended consequences of journal rank. Front Hum Neurosci 7: 291.
Button KS, Ioannidis JP, Mokrysz C, Nosek BA, Flint J, et al. (2013) Power failure: why small sample size undermines the reliability of neuroscience. Nat Rev Neurosci 14: 365-376.
Chambers CD, Munafo MR. (2013) Trust in science would be improved by study pre-registration.
Scott SK. (2013) Will pre-registration of studies be good for psychology?
Chambers CD. (2013) Registered reports: a new publishing initiative at Cortex. Cortex 49: 609-610.
Christopher D. Chambers, Eva Feredoes, Suresh D. Muthukumaraswamy, Peter J. Etchells. Instead of “playing the game” it is time to change the rules: Registered Reports at AIMS Neuroscience and beyond. AIMS Neuroscience, Volume 1, Issue 1, 4-17. DOI: 10.3934/Neuroscience2014.1.4
Cohen J. (1962) The statistical power of abnormal-social psychological research: a review. J Abnorm Soc Psychol 65: 145-153.
de Groot AD. (2014) The meaning of "significance" for different types of research [translated and annotated by Eric-Jan Wagenmakers, Denny Borsboom, Josine Verhagen, Rogier Kievit, Marjan Bakker, Angelique Cramer, Dora Matzke, Don Mellenbergh, and Han L. J. van der Maas]. Acta Psychol (Amst) 148: 188-194.
Dienes Z. (2011) Bayesian Versus Orthodox Statistics: Which Side Are You On? Perspect Psychol Sci 6: 274-290.
Fanelli D. (2010) “Positive” Results Increase Down the Hierarchy of the Sciences. PLoS One 5: e10068.
Fiedler K, Kutzner F, Krueger JI. (2012) The Long Way From α-Error Control to Validity Proper: Problems With a Short-Sighted False-Positive Debate. Perspect Psychol Sci 7: 661-669.
Gelman A, Loken E. (2014) The garden of forking paths: Why multiple comparisons can be a problem, even when there is no "fishing expedition" or "p-hacking" and the research hypothesis was posited ahead of time. Unpublished manuscript.
Ioannidis JPA. (2005) Why Most Published Research Findings Are False. PLoS Med 2: e124.
Ioannidis JPA. (2012) Why Science Is Not Necessarily Self-Correcting. Perspect Psychol Sci 7: 645-654.
John LK, Loewenstein G, Prelec D. (2012) Measuring the prevalence of questionable research practices with incentives for truth telling. Psychol Sci 23: 524-532.
Kerr NL. (1998) HARKing: hypothesizing after the results are known. Pers Soc Psychol Rev 2: 196-217.
Makel MC, Plucker JA, Hegarty B. (2012) Replications in Psychology Research: How Often Do They Really Occur? Perspect Psychol Sci 7: 537-542.
Mathieu S, Chan AW, Ravaud P. (2013) Use of trial register information during the peer review process. PLoS One 8: e59910.
Munafo MR, Strain E. (2014) Registered Reports: A new submission format at Drug and Alcohol Dependence. Drug Alcohol Depend 137: 1-2.
Nelson LD. (2014) Preregistration: Not just for the Empiro-zealots. http://datacolada.org/2014/01/07/12-preregistration-not-just-for-the-empiro-zealots/
Nosek BA, Lakens D. (in press) Registered reports: A method to increase the credibility of published results. Soc Psychol.
Nosek BA, Spies JR, Motyl M. (2012) Scientific Utopia: II. Restructuring Incentives and Practices to Promote Truth Over Publishability. Perspect Psychol Sci 7: 615-631.
Rouder J, Speckman P, Sun D, Morey R, Iverson G. (2009) Bayesian t tests for accepting and rejecting the null hypothesis. Psychon Bull Rev 16: 225-237.
Scott SK. (2013) Pre-registration would put science in chains. Times Higher Education.
Simmons JP, Nelson LD, Simonsohn U. (2011) False-positive psychology: Undisclosed flexibility in data collection and analysis allows presenting anything as significant. Psychol Sci 22: 1359-1366.
Stahl C. (2014) Experimental psychology: toward reproducible research. Exp Psychol 61: 1-2.
Sterling TD. (1959) Publication Decisions and their Possible Effects on Inferences Drawn from Tests of Significance—or Vice Versa. J Am Stat Assoc 54: 30-34.
Strube MJ. (2006) SNOOP: a program for demonstrating the consequences of premature and repeated null hypothesis testing. Behav Res Methods 38: 24-27.
Wagenmakers EJ. (2007) A practical solution to the pervasive problems of p values. Psychon Bull Rev 14: 779-804.
Whelan R, Conrod PJ, Poline JB, Lourdusamy A, Banaschewski T, et al. (2012) Adolescent impulsivity phenotypes characterized by distinct brain networks. Nat Neurosci 15: 920-925.
Wicherts JM, Bakker M, Molenaar D. (2011) Willingness to share research data is related to the strength of the evidence and the quality of reporting of statistical results. PLoS One 6: e26828.
Wolfe J. (2013) Registered Reports and Replications in Attention, Perception, & Psychophysics. Atten Percept Psychophys 75: 781-783.

On the Internet and in the news

Steve Ziliak; also The Gosset Laboratory
The Validus Medicus and a new gold standard (The Lancet) and a reply.
A Statistical Test Gets Its Closeup (Wall Street Journal)