Saturday 21 May 2016

Misuse of p-values in visual stress studies (1)

"To consult the statistician after the experiment is finished is often merely to ask him to conduct a post mortem examination. He can perhaps say what the experiment died of"
R A Fisher 1938

I thought it might be useful to discuss p-values and their utility in appraising a piece of research.
For a much better account than mine, I recommend a recent post by Dorothy Bishop, "The amazing significo: why researchers need to understand poker", which illustrates some of the problems associated with p-values. Proponents of the treatment of visual stress with coloured overlays and lenses place too much emphasis on p-values. Values below the arbitrary level of 0.05 are said to show that their results could not have arisen by chance, and so to justify rejecting the null hypothesis.
Many people assume that a p-value of 0.05 or less means there is less than a 1 in 20 chance of a false positive result; this is usually wrong. The application of a statistical test is only the final step in the design and execution of a scientific study. If there are problems further back, in the behaviour and practices that produced the data, then no statistical test can rescue a flawed study, and a statistically significant result does not become a clinically significant one. Unfortunately, however, a p-value below 0.05 can lend a study a credibility it does not deserve.
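To illustrate why, here is a rough simulation in Python. The numbers are assumptions chosen purely for illustration, not estimates from any real literature: suppose that only 10% of the hypotheses being tested describe real effects and that the studies have 80% power. Far more than 1 in 20 of the "significant" results then turn out to be false positives.

```python
# Rough simulation with illustrative assumptions (not data from any real study):
# 10% of tested hypotheses describe real effects, studies have 80% power,
# and the significance threshold is 0.05. What fraction of "significant"
# results are false positives?
import random

random.seed(1)
n_studies = 100_000
alpha, power, prior_true = 0.05, 0.80, 0.10

false_pos = true_pos = 0
for _ in range(n_studies):
    effect_is_real = random.random() < prior_true
    if effect_is_real:
        significant = random.random() < power   # real effect detected with 80% power
    else:
        significant = random.random() < alpha   # no effect, but p < 0.05 by chance
    if significant:
        if effect_is_real:
            true_pos += 1
        else:
            false_pos += 1

print(f"Share of 'significant' results that are false positives: "
      f"{false_pos / (false_pos + true_pos):.0%}")   # roughly 36%, not 5%
```

Under these assumptions roughly a third of the "significant" findings are false positives, and the proportion gets worse as power falls or as the prior probability of a real effect shrinks.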
The p-value, then, is only the last step in the sequence of events in a treatment trial. Readers are often kept in the dark about decisions made earlier, in the design and execution of a study, that can be far more important in deciding how much credence to place in its results.

Systematic reviews aim to shine a light on the behaviour and practices that led to the data, rather than focusing exclusively on the final step: the p-value. Randomised controlled trials are evaluated against a template to determine the risk of bias. That is bias in the statistical sense: anything that can make the data veer off in one direction or another. See the figure below.
A systematic review applies the same template to every study to assess the risk of bias, and only studies at low risk of bias count towards the final analysis. The domains evaluated are random sequence generation, allocation concealment, similarity of groups at baseline, blinding of participants and personnel, blinding of outcome assessment, attrition bias and reporting bias.
[Figure: risk of bias assessment across the domains listed above]
A systematic review is more than just a thorough narrative review that takes care to include all studies. The crux of a systematic review is the analysis of all RCTs according to established criteria to estimate the risk of bias.
Those criteria include:
1) Random sequence generation. That is, were the subjects properly randomised to ensure the groups were evenly balanced at the start of the trial?
2) Allocation concealment. Could the experimenters have guessed which arm of the trial the next patient would go into? They could, for example, if alternate patients were allocated to the two arms. A common way around this problem is to have an external office allocate patients to the study, keeping patients and experimenters at arm's length from the randomisation process. An example of good practice is to be found in the RCT by Ritchie et al., reported in the blog of July 2015. Alternatively, you can access the full text via this link.
3) Similarity of groups at baseline. If one group has worse reading at the start of the trial, it has more room for improvement and is more likely to improve simply because of regression to the mean.
4) Blinding of personnel and participants. In an ideal study, neither party knows which is the experimental and which is the placebo intervention. This is not easy in trials of coloured lenses and overlays, although it is not impossible: Wilkins et al. 1994 managed it in a trial using the intuitive colorimeter. Nor is it an absolute, binary issue; a trial comparing the chosen colour with another colour is likely to be more reliable than one that compares the chosen colour with no overlay or a clear overlay.
5) Blinding of outcome assessment. This speaks for itself. An example of good practice would be to record the readers and ask experimenters blinded to the status of the participants to score the reading for accuracy and speed.
6) Attrition. This is very important in trials of coloured overlays and lenses. Although subjects may be randomised on entry into a trial, dropouts are seldom random and can easily bias the outcome. For this reason, results should be analysed on an intention-to-treat basis. For an example of poor practice, see the widely cited paper by Wilkins et al. 1994, reviewed elsewhere on this blog. The trial started with 68 participants, but data were available for only 36 of them. No attempt was made to account for the missing data, and a complete-case analysis rather than an intention-to-treat analysis was carried out.
7) Reporting bias. If you slice and dice your data in enough different ways, but do not disclose this flexible, post-hoc approach to the analysis, you are very likely to find statistically significant effects. For example, you could measure several outcomes and not declare until after the trial is complete which was the primary outcome measure; a rough simulation of this problem is sketched below. Or you could transform your data, by converting from words per minute to syllables per minute, or change the subgroups. For an example of poor practice in this regard, see the paper by Tyrell et al. 1996, reviewed in my blog of August 2015.
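To see how much leeway undeclared multiple outcomes give, here is a rough simulation sketch in Python. It is purely illustrative and not based on any of the trials discussed here: it assumes a hypothetical trial in which the treatment has no effect at all, with thirty participants per arm and ten outcome measures analysed separately.

```python
# Sketch of a hypothetical trial with NO real treatment effect, thirty
# participants per arm and ten outcome measures, each analysed separately.
# How often can at least one outcome be reported as "significant" (p < 0.05)?
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(0)
n_simulated_trials = 5_000
n_per_group, n_outcomes = 30, 10

trials_with_a_hit = 0
for _ in range(n_simulated_trials):
    # Both arms are drawn from the same distribution: the null hypothesis is true.
    treated = rng.normal(0, 1, size=(n_outcomes, n_per_group))
    control = rng.normal(0, 1, size=(n_outcomes, n_per_group))
    p_values = [ttest_ind(t, c).pvalue for t, c in zip(treated, control)]
    if min(p_values) < 0.05:
        trials_with_a_hit += 1

print(f"Trials able to report at least one p < 0.05: "
      f"{trials_with_a_hit / n_simulated_trials:.0%}")   # roughly 40%, despite no effect
```

Even with no real effect, roughly four in ten such trials could report a "significant" result if the choice of primary outcome were left until after the data had been seen; declaring a single primary outcome in advance removes that freedom.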
That this really matters is exemplified by a recent publication that you can download here. The simple act of requiring researchers to pre-register trials, declaring the primary outcome measure and the statistical analyses in advance, has reduced the number of positive trial results. The disturbing implication is that many of the positive trial results that used to be published were false positives.
So a p-value taken in isolation tells you almost nothing useful about a study.

"Statistical significance is perhaps the least important attribute of a good experiment: it is never a sufficient condition for claiming that a theory has been usefully corroborated, that a meaningful empirical fact has been established or that an experimental report ought to be published"