Statistical significance -- a misconstrued notion in medical research 1

The P-value is the probability, computed under the null hypothesis, of obtaining a value of the test statistic at least as extreme as that observed. Medical researchers may, in some situations, disagree on its appropriate use or on its interpretation as a summary measure of consistency with the null hypothesis in a particular data set. More informative statistical measures, such as the likelihood ratio and the Bayesian posterior probability, have been suggested for drawing inferences from clinical trials and epidemiologic studies. Causal inference is not statistical in nature; rather it strives to provide scientific explanations, or criticisms of proposed explanations, that would account for the observed data pattern. In this context, it is important to remember that a finding may not be medically important, and a causal hypothesis may not even be true, even if a study shows a significant P-value.

Significance testing prevails as the general method of analysis in medical research, although overemphasis on the P-value has long been criticized (2). Researchers evidently believe that it is not worthwhile to submit a manuscript to a journal unless it contains a significant P-value. Significance testing is an apparently objective way to decide whether a so-called null hypothesis 3 (eg, treatment A is as good as treatment B) remains valid or should be rejected in favor of a study hypothesis (eg, treatment A is better than treatment B). Instead of the P-value, the computation of more informative statistical measures has been suggested. Such statistics include the P-value function (3), which yields the significance of other hypotheses as well, not merely that of the null hypothesis, and the likelihood ratio test (4), which compares 2 rival hypotheses. Some scientific journals (eg, Cancer Research) instruct authors to indicate the significance of their findings with an appropriate statistical analysis. Other journals, such as the British Medical Journal (5) and The Lancet (6), have recommended that significance testing be replaced by the computation of a confidence interval. Certain statisticians reject significance testing categorically (see, eg, ref. 7). Respected epidemiologists such as Rothman (8) and Greenland (9) would not ban significance tests, but they hold the view that the tests appear to have produced much more harm than good in the social and health sciences.
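A likelihood ratio of the kind mentioned above can be computed directly. The following is a minimal sketch with a hypothetical binomial example comparing 2 rival hypotheses; all numbers are illustrative and not taken from the text:

```python
import math

# Hypothetical data: 14 successes in 20 trials.  H0 states that the
# success probability is 0.5; H1 states that it is 0.7.  The ratio
# L(H1)/L(H0) measures how much better H1 explains the observed data.
def binom_likelihood(k, n, p):
    """Binomial likelihood of k successes in n trials with probability p."""
    return math.comb(n, k) * p**k * (1.0 - p)**(n - k)

k, n = 14, 20
l0 = binom_likelihood(k, n, 0.5)  # likelihood under H0
l1 = binom_likelihood(k, n, 0.7)  # likelihood under H1
print(f"likelihood ratio L1/L0 = {l1 / l0:.2f}")
```

Note that the binomial coefficient cancels in the ratio, so only the relative support of the two hypotheses for the same data matters.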
The traditional ("frequentist") Neyman-Pearson school of statistics and the alternative Bayesian school interpret the notion of probability underlying the test of significance differently. Frequentist statisticians define the P-value as the probability of the observed outcome in a study plus the probabilities of the more extreme (unobserved) outcomes, that is, as a relative frequency, or proportion, in large samples, assuming that the observations are generated according to a given probability model. The P-value measures how compatible the data are with the null hypothesis. It is, however, totally contrary to the spirit of significance testing to compare the P-value with preset levels, conventionally chosen as 5, 1, or 0.1%, and to interpret the result in a rigidly different manner depending on whether the P-value is below or above a certain level. [These reference levels are often marked with 1, 2, or 3 asterisks (*), but they need not be considered in the same light as the stars indicating the quality of hotels.] Significance testing is not to be regarded as decision-making but as statistical inference. Occasionally one sees the frequentist P-value interpreted as giving the probability that "the null hypothesis is true" or that "the result is a random finding". The former interpretation is surely wrong because the computation of the P-value explicitly assumes that the null hypothesis is true. The latter interpretation is problematic since, in a frequentist analysis, one can never infer definitely whether a single hypothesis about the considered parameter 4 (eg, the difference between mean values) is true or whether the unknown value of the studied parameter lies within, say, the 95% confidence interval computed from a particular experimental material.
The frequentist statisticians can only state that, if the experiment were repeated sufficiently many times, then approximately 95% of the computed intervals (which are random variables) would cover the true value of the studied parameter (which is a constant of nature).
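This repeated-sampling interpretation can be checked by simulation. The sketch below uses an illustrative normal model with a known standard deviation; the numerical values are hypothetical:

```python
import random
import math

# Repeatedly draw samples from a normal distribution with a known true
# mean, compute a 95% confidence interval each time, and count how often
# the interval covers the true value.  All figures are illustrative.
random.seed(42)

TRUE_MEAN = 10.0   # the "constant of nature"
SIGMA = 2.0        # known standard deviation
N = 30             # sample size per experiment
REPEATS = 10_000
Z = 1.96           # normal quantile for a 95% interval

covered = 0
for _ in range(REPEATS):
    sample = [random.gauss(TRUE_MEAN, SIGMA) for _ in range(N)]
    mean = sum(sample) / N
    half_width = Z * SIGMA / math.sqrt(N)
    if mean - half_width <= TRUE_MEAN <= mean + half_width:
        covered += 1

print(f"empirical coverage: {covered / REPEATS:.3f}")  # close to 0.95
```

Each individual interval either covers the true mean or it does not; only the long-run proportion of coverings approaches 95%, which is exactly the point made above.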
In the interpretation of the P-value one must also consider the amount of information contained in the data (the "power" of a test). Miettinen (10) provides the following guidelines for interpretation: (i) If information is very sparse, one should not analyze the data at all; (ii) if information is very ample, the P-value is too sensitive to be useful and, instead of testing, one should estimate the magnitude of the effect; (iii) if the amount of information is neither very sparse nor very ample, one may infer that a) a very small P-value supports the study hypothesis, b) a small P-value does not discriminate between the study hypothesis and the null hypothesis, and c) a moderately or especially large P-value is relatively less consistent with the study hypothesis than with the null hypothesis, which speaks for the refutation of the study hypothesis.
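The frequentist tail-probability definition of the P-value discussed earlier, the probability of the observed outcome plus all more extreme outcomes, can be made concrete with a short sketch; the coin-tossing data are hypothetical:

```python
import math

# Hypothetical example: 16 heads in 20 coin tosses.  Under the null
# hypothesis the coin is fair (p = 0.5); the one-sided P-value sums the
# probabilities of the observed outcome and all more extreme outcomes.
def binom_pmf(k, n, p):
    """Probability of exactly k successes in n trials."""
    return math.comb(n, k) * p**k * (1.0 - p)**(n - k)

n, observed = 20, 16
p_value = sum(binom_pmf(k, n, 0.5) for k in range(observed, n + 1))
print(f"one-sided P-value = {p_value:.4f}")  # 0.0059
```

The computation is carried out entirely under the assumption that the null hypothesis is true, which is why the P-value cannot be read as the probability that the null hypothesis itself is true.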
A Bayesian statistician overcomes the interpretative problems of significance testing by viewing probability as a degree of personal belief in the correctness of a study hypothesis. This subjective probability is based on prior knowledge about the uncertainty of the study hypothesis, a preconception that is modified through a model as new empirical evidence accrues. Bayesian statistical theory produces a "posterior" probability distribution for the studied hypothesis, by means of which one can inductively state, for instance, that "with a 95% probability treatment A is more effective than treatment B in at least 10% and at most 20% of the cases." Different experts will often have different preconceptions of the credibility of the studied hypothesis, but in a Bayesian analysis these prior beliefs can naturally be reflected by the presentation of several prior distributions in the context of the same data (11).
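A minimal sketch of such a posterior computation is given below, using a conjugate Beta-binomial model and hypothetical trial counts that are not taken from the text:

```python
import random

# Hypothetical trial data: treatment A, 30 successes in 40 patients;
# treatment B, 22 successes in 40.  With uniform Beta(1, 1) priors the
# posterior for each success probability is Beta(successes + 1,
# failures + 1).  Monte Carlo sampling from the two posteriors then
# gives the posterior probability that A is more effective than B.
random.seed(0)

a_succ, a_fail = 30, 10
b_succ, b_fail = 22, 18
DRAWS = 100_000

wins = 0
for _ in range(DRAWS):
    p_a = random.betavariate(a_succ + 1, a_fail + 1)
    p_b = random.betavariate(b_succ + 1, b_fail + 1)
    if p_a > p_b:
        wins += 1

print(f"posterior P(A better than B) = {wins / DRAWS:.3f}")
```

Unlike a P-value, the resulting number is a direct probability statement about the hypothesis itself, given the data and the chosen prior.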
The Bayesian approach also avoids the frequentist problem that relates to the testing of multiple hypotheses in 1 data set or the simultaneous testing of a single hypothesis in many subsets of the data. For example, when one studies the differential diagnosis of malignant mesothelioma and lung carcinoma with the aid of genetic alterations, one can examine tens of different chromosome changes. Frequentist statisticians try to control the occurrence of falsely significant findings by applying stricter levels of significance. With this procedure, for example, a significant difference (P=0.004) in the frequency of changes observed in chromosome 22 between patient groups becomes nonsignificant if one accounts for the respective tests made for chromosomes 1, ..., 21 in the same investigation and corrects the 5% critical level to 0.05/21 = 0.0024. According to the Bayesian way of thinking, there is no reason to correct a particular P-value merely because other variables were also considered in the same study. The Bayesian solution to the problem is to define the prior joint likelihood of the mutually dependent hypotheses, which would appear to be a scientifically more rational procedure than a mechanistic correction of the P-value. The specification of the prior likelihood function is, however, a challenging data-analytic task, especially in problems involving many parameters (12).
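The mechanistic correction described above is easy to reproduce; a minimal sketch with the figures quoted in the text:

```python
# Bonferroni-style correction as described above: dividing the 5%
# critical level by the number of tests makes the observed P = 0.004
# for chromosome 22 nonsignificant against the adjusted level.
p_observed = 0.004
n_tests = 21
alpha = 0.05

adjusted_level = alpha / n_tests
print(f"adjusted critical level: {adjusted_level:.4f}")  # 0.0024
print("significant after correction:", p_observed < adjusted_level)
```

The correction depends only on how many tests were run, not on how strongly the hypotheses are related to one another, which is precisely the feature the Bayesian critique targets.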
Frequentist inference is thus problematic. Why then isn't everyone a Bayesian (13)? The answer is dictated by practice: for example, the Bayesian likelihood ratio test is harder to compute than the frequentist significance test. Ten years ago, Bayesian analytic solutions of even the simplest epidemiologic problems were difficult to tackle (14). Nowadays, however, simulation techniques make Bayesian analysis feasible even in more complex biomedical applications (15), in which the frequentist and Bayesian analyses do not necessarily result in the same inferences (16). This being the case, Bayesian methods will inevitably come into use in clinical medicine and epidemiology (17). During the transition period, medical scientists should prepare for the change by familiarizing themselves with Bayesian methods (18). The natural simplicity of the Bayesian concepts is appealing.
The role of statistics in cause-effect studies depends on the study design. The traditional theory of statistics was created for randomized experiments. Thus in clinical trials, in which the treatment of patients is randomized, the results produced by customary analyses (the P-value, the confidence interval of a parameter, the likelihood ratio) are interpretable quantities from the point of view of causal inference. In nonexperimental (eg, epidemiologic) studies, in which the exposure of persons is not randomized, probabilistic interpretations of conventional statistics are not necessarily justified and can lead to incorrect inferences from nonrandomized studies (19). Can, then, the P-value be interpreted in any reasonable way in nonrandomized studies? As a remedy for this problem, Greenland (20) suggests that, in the data analysis, one should separate from each other (i) the description of the data variability by means of graphic displays or simple summaries, (ii) the profiling of distributions or relations sought from the data in comparisons with statistical models (pattern recognition, data smoothing), and (iii) scientific inference. Greenland (20) contends that statistical analysis is limited to stage (ii), in which a statistical measure "such as a P-value is not a data summary; rather it is a convolution of the data with some model or preconceived notion about the process that generated the data [p 227]". One should use modern techniques of statistical analysis to examine the impact of discrepant observations on the outcome measures (influence analysis) and the effects of departures from model assumptions on the stability of the findings (sensitivity analysis). Causal inference is not statistical by nature; rather it strives to (i) determine scientific explanations that would explain the results of statistical analyses in a logically coherent way and (ii) criticize proposed explanations that would not lead to the observed data pattern (20).
"Clinical significance" is determined in population studies, for example, as the magnitude of the difference in the mean values between the experimental group and the comparison group. In large population groups even a small difference becomes statistically significant, whereas in small samples a clinically significant observation can remain statistically nonsignificant. Two recent cohort studies on reproductive health furnish examples of surveys in which the size of the material was a significant factor. The notable sample size (over 5000 people) of a Danish study (21) permitted the expression of minimal differences, whereas, in an American study (22), the small number of exposed (only 27) prevented the presentation of differences that were not big.
On the other hand, even though a difference may be small at the group level, a clinical finding can be of decisive importance for some individuals who belong to a risk group. In a Finnish epidemiologic study (23), the risk of dying of coronary disease in a cohort of 343 industrial workers exposed to carbon disulfide was over 2-fold relative to the risk of a same-sized, individually matched comparison cohort. The researchers discussed several biochemical mechanisms that could explain why carbon disulfide exposure caused the increased risk of coronary mortality. A possible indirect mechanism might have been raised blood pressure. At the group level, the differences in the subjects' mean blood pressures were statistically significant, although the differences were relatively small [difference in systolic pressure 8 mmHg (1.1 kPa) and in diastolic pressure 3.5 mmHg (0.5 kPa)]. If a worker had, in addition, other risk factors, even a minor elevation of blood pressure could be a danger. The researchers estimated that raised blood pressure was a causal factor in every 6th death from coronary heart disease originally caused by carbon disulfide exposure (23).
It is not very remarkable if a large study produces a statistically significant result. The finding can be medically important only if one's colleagues still believe in the result after having read a discussion of its significance that makes no reference to P-values.
Acknowledgments I thank Tuula Nurminen for her valuable comments.

1. This commentary was published in Finnish as an editorial in Duodecim, Vol. 113, No. 4, 1997.

3. The null hypothesis is an exact statistical formulation of the assumption that the studied hypothesis is incorrect: for example, that the difference between the mean values of two groups equals zero. Assuming that the null hypothesis holds, one can make deductive inferences about the correctness of the study hypothesis, which is often formulated in less exact terms.

4. A parameter is a quantity that partly or fully determines a probability distribution. A parameter is not directly measurable, but, with the use of a distribution model, one can describe what kinds of samples are associated with particular values of the parameter. Based on the compatibility of the data with the model, one can estimate the most likely value of the parameter.