# The war against p values in medical research
### Claus Thorn Ekstrøm and Theis Lange, UCPH Biostatistics .small[[ekstrom@sund.ku.dk](mailto:ekstrom@sund.ku.dk)] / .small[[thlan@sund.ku.dk](mailto:thlan@sund.ku.dk)] Twitter: @ClausEkstrom / @GetTheisOnTwitter
### IFSV Sept. 19th 2019 .small[Slides: [biostatistics.dk/talks/](https://www.biostatistics.dk/talks/)]

---

# Quiz

---

You want to see if the means of two groups are different. You compare the means statistically and get a **`$p$` value of 0.05** when testing at a **significance level of 0.07**. What is the conclusion?

1. You reject the null hypothesis. Thus you cannot reject that the two population means are the same.
2. You fail to reject the null hypothesis. Thus you cannot reject that the two population means are the same.
3. You reject the null hypothesis. Thus you reject that the two population means are the same.
4. You fail to reject the null hypothesis. Thus you reject that the two population means are the same.
5. Help!!

---

# Exercise

---

# What is a `$p$` value anyway?

> *The `$p$` value is the probability of having obtained a result **at least as extreme** as the one found with our sample if the null hypothesis were true.*  --- Kirkwood & Sterne

IF the null hypothesis is true 
.yellow[AND all the other assumptions about the model are *also* true] 
THEN the `$p$` value expresses the probability of observing something at least as extreme as what you have in your sample.
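
The definition can be made concrete with a permutation test. This is a minimal sketch, not part of the original talk: the measurements are invented, and the function name `permutation_p_value` is hypothetical. Under the null hypothesis the group labels are exchangeable, so the p value is the share of random relabellings that give a difference in means at least as extreme as the observed one.

```python
import random

def permutation_p_value(group_a, group_b, n_perm=10_000, seed=1):
    """Two-sided permutation p value for a difference in group means."""
    rng = random.Random(seed)
    n_a = len(group_a)
    observed = abs(sum(group_a) / n_a - sum(group_b) / len(group_b))
    pooled = list(group_a) + list(group_b)
    n_b = len(pooled) - n_a
    at_least_as_extreme = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)  # relabel the observations at random
        diff = abs(sum(pooled[:n_a]) / n_a - sum(pooled[n_a:]) / n_b)
        if diff >= observed - 1e-12:  # small tolerance for rounding error
            at_least_as_extreme += 1
    # Add-one correction: the observed labelling counts as one relabelling
    return (at_least_as_extreme + 1) / (n_perm + 1)

# Hypothetical measurements, invented for illustration only
treated = [5.1, 6.3, 5.8, 6.9, 6.1, 5.7]
control = [4.8, 5.2, 5.0, 5.9, 4.6, 5.3]
p = permutation_p_value(treated, control)
print(f"permutation p value: {p:.3f}")
```

The result is only valid if the remaining assumptions hold too, here that the observations are exchangeable under the null hypothesis.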

???

If A is TRUE then B cannot occur;
   However, B has occurred;
   Therefore A is false
   
   
If A is TRUE then B probably cannot occur;
   However, B has occurred;
   Therefore A is probably false

---

# What is a `$p$` value anyway?

> *The `$p$` value is the probability of having obtained a result **at least as extreme** as the one found with our sample if the null hypothesis were true.*  --- Kirkwood & Sterne

Roughly: the `$p$` value is a number that measures how surprised you are.

---

# The epistemology of science

---

# A history of the (war against) p values

---

# From ancient times ...

???

Popper did not like the probability argument

---

# ... to recent times

![](pvalues_files/figure-html/unnamed-chunk-3-1.png)

---

# What are the problems with p values?

They try to answer the *"wrong"* question

A researcher typically wants to know if the hypothesis holds:

`$$P(H | D)$$`

but the p value computes

`$$P(D \text{ "or more extreme" } | H)$$`
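
A small worked example of why the two quantities differ, using Bayes' theorem. This is a sketch under stated assumptions: it conditions on "the test was significant" rather than on the exact data, and the prior probabilities, significance level and power are all invented for illustration.

```python
def prob_h0_given_significant(prior_h0, alpha, power):
    """P(H0 | significant result) from Bayes' theorem."""
    p_sig_and_h0 = alpha * prior_h0        # false positives
    p_sig_and_h1 = power * (1 - prior_h0)  # true positives
    return p_sig_and_h0 / (p_sig_and_h0 + p_sig_and_h1)

# Assumed values: alpha = 0.05, power = 0.8, three different priors for H0
for prior_h0 in (0.5, 0.9, 0.99):
    posterior = prob_h0_given_significant(prior_h0, alpha=0.05, power=0.8)
    print(f"P(H0) = {prior_h0:.2f} -> P(H0 | significant) = {posterior:.2f}")
```

The same significance level leads to very different probabilities that the null hypothesis holds, depending on how plausible it was to begin with.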

???

They give a very precise answer to the wrong question instead of an approximate answer to the right question.

---

# The p value is used in the wrong way

Typically used as a decision rule:

`$$p \text{ value} \left\{\begin{array}{ll}<0.05 & \text{reject} - \text{"significant"} \\ \geq 0.05 & \text{not reject} - \text{"not significant" or "no association"} \end{array} \right.$$`

*   Arbitrary threshold for continuous scale
*   Significant does not mean clinically relevant
*   Non-significance does not mean that `$H_0$` is true - only that there was insufficient evidence to reject it (*"absence of evidence is not evidence of absence"*).

???

"No association" is wrong to say

Binary thinking makes everything worse: people inappropriately combine probabilistic statements with Boolean rules.

---

# The p value contains two types of information

When `$N\rightarrow\infty$` *everything* becomes significant.
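
A minimal sketch of this effect, assuming a two-sample z test with known standard deviation and an observed difference that stays fixed at a tiny 0.05 standard deviations (all numbers invented). The p value mixes the size of the effect with the sample size, so the same negligible difference moves from "not significant" to "highly significant" as the groups grow.

```python
import math

def two_sample_z_p_value(mean_diff, sd, n_per_group):
    """Two-sided p value for a difference in means, known common sd."""
    se = sd * math.sqrt(2 / n_per_group)
    z = mean_diff / se
    return math.erfc(abs(z) / math.sqrt(2))  # two-sided normal tail area

# Same tiny difference, increasing sample size
for n in (100, 1_000, 10_000, 100_000):
    p = two_sample_z_p_value(mean_diff=0.05, sd=1.0, n_per_group=n)
    print(f"n per group = {n:>7}: p = {p:.3g}")
```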

---

# Unrealistic null hypothesis

Compare two treatments with effects `$\mu_1$` and `$\mu_2$`

`$$H_0 :  \mu_1 = \mu_2$$`

When do we really believe that the effects of two treatments are *exactly* the same?

Hard to believe in most public health or social science research.

???

At least outside randomization

---

# Alternative proposals

---

# Use confidence intervals

The CI is the set of parameter values that are *not* rejected when tested as `$H_0$`.

![](pvalues_files/figure-html/unnamed-chunk-5-1.gif)

Fully defined from (infinitely many) p values
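
The duality can be sketched by inverting a test over a grid of candidate null values. This is an illustration under assumptions: a normal (z) test, an invented estimate and standard error, and hypothetical function names. Every candidate value with `$p \geq 0.05$` is kept, and the kept values span the 95% CI.

```python
import math

def z_test_p_value(estimate, se, null_value):
    """Two-sided p value for H0: parameter == null_value."""
    z = (estimate - null_value) / se
    return math.erfc(abs(z) / math.sqrt(2))

def ci_by_test_inversion(estimate, se, alpha=0.05, half_width=10.0, steps=20_001):
    """Keep every candidate null value on a grid that is not rejected."""
    start = estimate - half_width * se
    step = 2 * half_width * se / (steps - 1)
    kept = []
    for i in range(steps):
        candidate = start + i * step
        if z_test_p_value(estimate, se, candidate) >= alpha:
            kept.append(candidate)
    return min(kept), max(kept)

estimate, se = 1.2, 0.5  # hypothetical effect estimate and standard error
low, high = ci_by_test_inversion(estimate, se)
print(f"inverted-test CI:     ({low:.2f}, {high:.2f})")
print(f"estimate +/- 1.96 se: ({estimate - 1.96 * se:.2f}, {estimate + 1.96 * se:.2f})")
```

Up to the grid resolution, the two intervals agree.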

---

# ... interpretation of the CI

**Epidemiologists:** interpret confidence intervals as credible intervals.

**Biostatisticians:** Know that CIs are not credible intervals, but interpret them as though they were anyway.

---

# Bayes factors

The Bayes factor is the ratio of the likelihoods of the data under two competing hypotheses:

`$$BF = \frac{P(D | H_1)}{P(D | H_0)}$$`
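
A minimal sketch of the ratio for binomial data, assuming two simple point hypotheses about a success probability. The data (14 successes in 20 trials) and the hypothesised values are invented. With a composite `$H_1$` the numerator would instead be averaged over a prior, which this sketch sidesteps.

```python
import math

def binomial_likelihood(successes, trials, theta):
    """P(D | theta) for binomial data."""
    failures = trials - successes
    return math.comb(trials, successes) * theta**successes * (1 - theta)**failures

def bayes_factor_10(successes, trials, theta_h1, theta_h0):
    """BF = P(D | H1) / P(D | H0) for two point hypotheses."""
    return (binomial_likelihood(successes, trials, theta_h1)
            / binomial_likelihood(successes, trials, theta_h0))

# Hypothetical data: 14 successes in 20 trials, H0: theta = 0.5
bf_07 = bayes_factor_10(14, 20, theta_h1=0.7, theta_h0=0.5)
bf_09 = bayes_factor_10(14, 20, theta_h1=0.9, theta_h0=0.5)
print(f"H1: theta = 0.7 -> BF = {bf_07:.2f}")
print(f"H1: theta = 0.9 -> BF = {bf_09:.2f}")
```

The same data favour `$H_1$` or `$H_0$` depending on how `$H_1$` is specified, which is one of the sensitivity problems with Bayes factors.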

Moves the problem to another scale!

**Several** other problems.

???

If you think p values are problematic, wait until you understand Bayes factors.

Bayes factors depend crucially on aspects of the prior distribution that are typically assigned in a completely arbitrary manner by users.

| If BF10 is ... | then you have ... |
|:-----------|:-------------|
| > 100 | Extreme evidence for H1 |
| 30 – 100 | Very strong evidence for H1 |
| 10 – 30 | Strong evidence for H1 |
| 3 – 10 | Moderate evidence for H1 |
| 1 – 3 | Anecdotal evidence for H1 |
| 1 | No evidence |
| 1/3 – 1 | Anecdotal evidence for H0 |
| 1/10 – 1/3 | Moderate evidence for H0 |
| 1/30 – 1/10 | Strong evidence for H0 |
| 1/100 – 1/30 | Very strong evidence for H0 |
| < 1/100 | Extreme evidence for H0 |

---

# Lower the significance level

Pros:

*   Fewer false positives
*   Improve replicability

Cons:

*   More false negatives
*   Still dichotomizes results
*   Does not fix **any** of the conceptual problems with the p value

Use `$\alpha$`=0.005 instead of 0.05.
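
The trade-off can be sketched for a two-sided z test. The assumptions are mine: the true effect is 2.8 standard errors, which gives roughly 80% power at `$\alpha = 0.05$`, and the sample size stays the same when the threshold is lowered.

```python
from statistics import NormalDist

def z_test_power(effect_in_se_units, alpha):
    """Power of a two-sided z test when the true effect is given in SE units."""
    norm = NormalDist()
    z_crit = norm.inv_cdf(1 - alpha / 2)
    upper = 1 - norm.cdf(z_crit - effect_in_se_units)  # reject in upper tail
    lower = norm.cdf(-z_crit - effect_in_se_units)     # reject in lower tail
    return upper + lower

for alpha in (0.05, 0.005):
    power = z_test_power(effect_in_se_units=2.8, alpha=alpha)
    print(f"alpha = {alpha}: power = {power:.2f}, false negative rate = {1 - power:.2f}")
```

In this sketch the false negative rate rises from about 20% to about 50% unless the sample size is increased.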

---

# Bayesian analysis

Answers the "right" question:

*What is the probability that my hypothesis holds?*

`$$P(H|D)$$`

*   Subjective vs objective
*   Moves discussion to priors
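
A minimal sketch of such a calculation, assuming binomial data with a Beta prior on `$\theta$` and a simple grid approximation of the posterior. The data (14 successes in 20 trials), both priors and the function name are invented for illustration.

```python
def posterior_prob_theta_above(successes, trials, threshold,
                               prior_a=1.0, prior_b=1.0, grid=10_000):
    """P(theta > threshold | D) with a Beta(prior_a, prior_b) prior, on a grid."""
    failures = trials - successes
    thetas = [(i + 0.5) / grid for i in range(grid)]  # grid midpoints in (0, 1)
    # Unnormalised posterior density: likelihood times prior
    weights = [t**(successes + prior_a - 1) * (1 - t)**(failures + prior_b - 1)
               for t in thetas]
    total = sum(weights)
    return sum(w for t, w in zip(thetas, weights) if t > threshold) / total

# Hypothetical data: 14 successes in 20 trials
flat = posterior_prob_theta_above(14, 20, 0.5)  # uniform Beta(1, 1) prior
sceptical = posterior_prob_theta_above(14, 20, 0.5, prior_a=10, prior_b=10)
print(f"flat prior:      P(theta > 0.5 | D) = {flat:.3f}")
print(f"sceptical prior: P(theta > 0.5 | D) = {sceptical:.3f}")
```

The answer is a direct probability statement about the hypothesis, but it changes with the prior, which is where the discussion moves.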

.pull-right[

Posterior distribution of `$\theta$`

![](pvalues_files/figure-html/unnamed-chunk-10-1.png)

]

---

# Let's put things into perspective

*   Which variables?
*   How to measure?
*   Missing data
*   Entry errors

*   Which model?
*   Which specification?
*   Which assumptions?

*   Confounding
*   Collinearity
*   Overfitting
*   `$p$` hacking

*   Interpretation
*   Published?
*   Replicated?

---

# Publication bias

| Scientist  | Salesman |
|:-----------|:-------------|
| Be sceptical of your results | "Sell" your results |
| Interpret conclusions carefully | Highlight / exaggerate importance |
| "Publish" negative results | Publish strategically |
| Replicate replicate replicate | Replicate ... if you must |
| Novel exciting results are less likely to be true | Publish novel results before they get scooped |

---

# What about the future?

---

# Are p values bad?

*   No

*   Medicine / public health has moved forward in leaps and bounds in the last 100 years.

*   Fricker et al. (2019): "Assessing the Statistical Analyses Used in *Basic and Applied Social Psychology* After Their p-Value Ban"

*   The function of significance tests is to *prevent you from making a fool of yourself*, and not to make unpublishable results publishable

???

If a drunk driver crashes into a tree, it is not the car's fault (at least not yet).

31 BASP papers. 17 with some statistics

---

# The sceptical p value

Held (2019): The sceptical p value. Focus on statistical evaluation of replication studies.

![](pvalues_files/figure-html/unnamed-chunk-12-1.png)

---

# Recommendations

Embrace uncertainty!

Know your tools!

*   Report effect sizes and CIs (and perhaps p values)
*   Put as much energy into discussing clinical relevance as statistical results.
*   Abandon dichotomizing and "statistically significant"
*   Never conclude "no difference" or "no effect"

???

Present statistical conclusions with uncertainty rather than as dichotomies

March 2019: 800 scientists have signed.
1: Never conclude "no difference" or "no association"
2: Abandon dichotomizing and "statistically significant"

---

# Who said this?

>   *Some people hate the very name of statistics, but
    I find them full of beauty and interest. Whenever
    **they are not brutalized**, but **delicately handled** by
    the higher methods, and are **warily interpreted**, their
    power of dealing with complicated phenomena is
    extraordinary. They are the **only tools** by which an
    opening can be cut through the formidable thicket of
    difficulties that bars the path of those who pursue the
    Science of man.*
    
--

>   *- Francis Galton (1894)*

???

We can provide the best methods possible. It is up to the researcher to apply them ("delicately handled") and to decipher the results appropriately ("warily interpreted").

---

# References

.small[
*   Amrhein V, Greenland S & McShane B (2019) [Scientists rise up against statistical significance](https://www.nature.com/articles/d41586-019-00857-9)
*   Benjamin et al (2018) [Redefine statistical significance](https://www.nature.com/articles/s41562-017-0189-z). Nature Human Behaviour. 
*   Fisher, R. (1925). [Statistical Methods for Research Workers](http://psychclassics.yorku.ca/Fisher/Methods/)
*   Fisher, R. (1935). [The Logic of Inductive Inference](https://www.jstor.org/stable/2342435?seq=1#metadata_info_tab_contents), JRSS
*   Fricker et al. (2019). [Assessing the Statistical Analyses Used in *Basic and Applied Social Psychology* After Their p-Value Ban](https://www.tandfonline.com/doi/full/10.1080/00031305.2018.1537892)
*   Held, L (2019). [A New Standard for the Analysis and Design of Replication Studies](https://arxiv.org/pdf/1811.10287.pdf)
*   Ioannidis, J. (2005) [Why Most Published Research Findings Are False](https://journals.plos.org/plosmedicine/article?id=10.1371/journal.pmed.0020124) PLoS Medicine
]

---

.small[
*   Lindquist, EF (1940). Statistical analysis in educational research.
*   Mayo, D. (2018). [Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars](https://amzn.to/2Q5rAiJ)
*   Neyman J., Pearson E. S. (1928). [On the use and interpretation of certain test criteria for purposes of statistical inference: part I.](https://www.jstor.org/stable/2331945?seq=1#metadata_info_tab_contents) Biometrika 20A, 175–240. 
*   Popper K (1934) [Logik der Forschung](http://strangebeautiful.com/other-texts/popper-logic-scientific-discovery.pdf)
*   Sheldon, N. (2019), [What does it all mean?](https://rss.onlinelibrary.wiley.com/doi/10.1111/j.1740-9713.2019.01296.x) Significance, 16: 15-17.
*   Trafimow, D. & Marks, M. (2015) [Editorial](https://www.tandfonline.com/doi/full/10.1080/01973533.2015.1012991), Basic and Applied Social Psychology, 37, 1-2.
*   Wasserstein, R.L. & Lazar, N.A. (2016) [The ASA Statement on p-Values: Context, Process, and Purpose](https://amstat.tandfonline.com/doi/full/10.1080/00031305.2016.1154108), The American Statistician, 70:2, 129-133.
*   Open Science Collaboration (2015), [Estimating the reproducibility of psychological science](https://science.sciencemag.org/content/349/6251/aac4716). Science.

]