by Robert Gifford and Whitney Pidot - Harvard Law School '99
The Search for an Objective Measure of Research Quality
Even after abstracting away from the mathematical and statistical complexities inherent in quantitative research, the technical meaning of statistical significance cannot be ignored. This paper intentionally avoids resorting to formal equations, but a definition is required. The significance level is the probability that the researchers would have reached their result through random chance alone even if it were untrue. For example, when a study concludes at a 95% confidence level (that is, a 5% significance level) that a given product triples one's chance of illness, 5% of studies would reach the same conclusion even if the product had no impact - or even if the product actually helped to prevent the illness being considered.
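To make the definition concrete, the following minimal Python sketch (not from the original paper; the sample size and baseline illness rate are invented) simulates many studies of a product that truly has no effect and counts how often a simple one-sided comparison nonetheless crosses the 5% threshold by chance, which happens in roughly 5% of the simulated studies.

```python
import random

# Illustrative sketch (sample size and baseline illness rate are invented):
# simulate many studies of a product that truly has NO effect and count how
# often a simple one-sided comparison still crosses the 5% threshold by chance.

random.seed(0)

def fake_study(n=1000, baseline_rate=0.10):
    """One simulated study of a product with no real effect.

    Returns True if the exposed group's illness rate looks 'significantly'
    higher than the unexposed group's under a normal approximation.
    """
    exposed = sum(random.random() < baseline_rate for _ in range(n))
    unexposed = sum(random.random() < baseline_rate for _ in range(n))
    p1, p0 = exposed / n, unexposed / n
    pooled = (exposed + unexposed) / (2 * n)
    se = (2 * pooled * (1 - pooled) / n) ** 0.5
    z = (p1 - p0) / se if se > 0 else 0.0
    return z > 1.645  # one-sided 5% significance threshold

n_studies = 2000
false_positives = sum(fake_study() for _ in range(n_studies))
print(f"'Significant' harm found in {false_positives / n_studies:.1%} of studies, "
      "even though the product does nothing.")
```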
Proof and Error in Practice: A Closer Look at Science
After conducting our Sun Study, we may find that most people tan when exposed to sunlight for an extended duration. At no point, however, can we say with 100% certainty that this is the true cause. Perhaps tanning is caused by exposure to blue sky, and the fact that sunlight is abundant when the sky is blue is incidental. Or perhaps tanning is a random process that we imagine to be associated with sunlight. Thus, we use statistical correlation to show relationships among variables; H is supported when correlation is high, and H0 is supported when correlation is low. But how high is high? Given that the bulk of the data supports our hypothesis, the question becomes: how likely are we to get this data by sheer chance?

Coincidences do happen. For example, the Powerball Lottery is played across the nation, but in one year three winning tickets were sold within one mile of each other in Fond du Lac, Wisconsin. A hypothesis that Fond du Lac tickets have increased chances of winning sounds ludicrous but would perhaps merit investigation. Our choices are H: Fond du Lac sales are good luck, and H0: it's just chance. In this investigation, two types of error may emerge. An example of Type I error would be a conclusion that Fond du Lac is good luck when in fact the location of the ticket sales was just coincidence (H asserted; H0 true). Conversely, Type II error leads to the claim that the lucky tickets were all coincidentally from Fond du Lac when in truth those tickets had a veritable cosmic advantage (H0 asserted; H true).
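As a rough illustration of the two error types, the sketch below (the win probabilities, ticket counts, and decision rule are all hypothetical) models the Fond du Lac question as a test of whether a location's win rate exceeds the national baseline, and estimates how often each kind of error would occur.

```python
import random

# Illustrative sketch (win probabilities, ticket counts, and the decision rule
# are all hypothetical): model the Fond du Lac question as a test of whether a
# location's win rate exceeds the national baseline, and estimate how often
# each type of error occurs.

random.seed(1)

BASELINE = 0.01    # assumed chance of any ticket winning
LUCKY = 0.02       # win rate if the location really were "lucky"
N_TICKETS = 5000   # tickets observed at the location
THRESHOLD = 65     # assert H ("the location is lucky") if at least this many win

def winners(p):
    return sum(random.random() < p for _ in range(N_TICKETS))

trials = 2000
type_i = sum(winners(BASELINE) >= THRESHOLD for _ in range(trials)) / trials
type_ii = sum(winners(LUCKY) < THRESHOLD for _ in range(trials)) / trials

print(f"Type I  (assert H when H0 is true): {type_i:.1%}")
print(f"Type II (assert H0 when H is true): {type_ii:.1%}")
```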
What is p?

But what is true? Even though the Fond du Lac cluster is an unlikely product of chance, we have not proven that Fond du Lac is special. Perhaps this case is indeed a rare "1%" incidence of chance. Yet science has accepted the data and is drawing conclusions from it. This is a case of Type I error: science might conclude that Fond du Lac has mystical properties when it does not. Conversely, if the study concluded that chance could account for the Fond du Lac cluster 8% of the time, science would not accept H and would conclude that the evidence does not support the notion that Fond du Lac is supernatural. There would, however, still be substantial evidence that H is valid. If H is actually true but we reject it, we have committed Type II error.

It should be noted that the threshold need not be set at the conventional 95% level. If a plaintiff presents a study, the 95% standard means that Type I error will be kept to a minimum. This standard is good for science, which proceeds conservatively, but it produces more false negative results than false positive ones. In fact, several commentators have noted that because one goal of civil law is to make decisions on a more-likely-than-not standard (i.e., greater than 50%), perhaps the threshold should be set at a level which equalizes Type I and Type II error.(2) In a legal setting, they argue, plaintiff and defendant should share the risk that chance accounts for a result; thus, Type I and Type II error should be set equal to each other. The mathematical relationship between the two is not straightforward, but for many studies this equilibrium exists at about 80%-85%.(3)
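The trade-off that motivates this argument can be illustrated with a small simulation (again with invented parameters): as the decision threshold is relaxed, Type I error rises while Type II error falls, and the point at which the two are roughly equal corresponds to a confidence level well below 95%.

```python
import random

# Illustrative sketch (all parameters invented): for a hypothetical "lucky
# location" test, scan over decision thresholds and print how Type I error is
# traded against Type II error as the standard is relaxed or tightened.

random.seed(2)

BASELINE = 0.01    # assumed national win rate
LUCKY = 0.013      # win rate if the location really were lucky
N_TICKETS = 5000
TRIALS = 2000

def winners(p):
    return sum(random.random() < p for _ in range(N_TICKETS))

null_draws = [winners(BASELINE) for _ in range(TRIALS)]   # H0 true
lucky_draws = [winners(LUCKY) for _ in range(TRIALS)]     # H true

for threshold in range(52, 65, 2):
    type_i = sum(x >= threshold for x in null_draws) / TRIALS
    type_ii = sum(x < threshold for x in lucky_draws) / TRIALS
    print(f"threshold {threshold}: Type I = {type_i:.1%}, Type II = {type_ii:.1%}")
```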
Using Confidence Intervals

Master marksman Cedric Tell, like his great-uncle William, must shoot an apple from the top of his son's head. He is allowed to pick his favorite rifle for the task, and naturally he wishes to pick one that is accurate. One of his rifles is a family heirloom that is rumored never to miss, but Cedric is skeptical of old guns and wishes to test it before he uses it. Cedric decides that he will use the gun if he can be 95% certain that it will not injure his son. He takes the gun to the shooting range, where he fires at a target 50 times. Cedric forms the hypothesis H: the rifle is accurate. Because he always aimed at the bullseye (and Cedric rarely misses), he concludes that if the bullet holes are generally far from the center, the gun must be inaccurate.

Cedric performs a few statistical procedures on the bullet hole distribution and finds the center point of the spread, labeled g. It is easy to imagine that g may not actually designate one of the bullet holes; g is simply an average value. Also note that g is unlikely to be located at the exact center of the target. Even if the rifle placed 49 bullets at the exact center and one bullet only one centimeter to the right, g would still deviate from the center by a little bit. Yet in this case, Cedric would be scientifically wrong to conclude that the gun is inaccurate. The p < 0.05 standard only requires 95% certainty. But we're 100% certain that g does not lie on the center bullseye, aren't we? The answer is yes, but it's the wrong question. Cedric knows that g does not lie at the bullseye, but g is only a point estimate of the true aim of the gun. The true aim (call it A) is confounded by a certain element of chance (drafts, temperature, bullet density, etc.). Thus, where 49 bullets pass through the bullseye and one errs, we are comfortable chalking the deviation up to chance.

This conclusion is an example of a confidence interval at work. In common terms, we decided that g was "close enough" to the bullseye. The confidence interval (CI) is a statistical way of saying the same thing. CIs are widely used by statisticians, and their construction follows accepted methods. The CI designates a range of values in which we are confident that A might lie. Imagine Cedric's problem again. He needs to be 95% certain that the bullet will not pass through his son's head if aimed at the center of the apple. To do this, Cedric would form a CI around g in which he is 95% certain that A lies. If the CI extends farther than the radius of an apple, Cedric will pick another rifle. Cedric doesn't really care whether the rifle is possibly perfectly accurate (that the bullseye lies within the CI), but rather that A does not lie outside the area of the apple. Thus, Cedric is 95% certain that the CI contains A; because the CI is entirely contained within the radius of the apple, he is willing to attempt the trick with the heirloom rifle.

At this point a couple of observations should be noted. First, if the sample of 50 bullets is widely dispersed, g might lie near the bullseye, but the CI will be quite large, and Cedric would pick another gun. Additionally, if the initial sample consisted of only 2 shots, the CI would still be large, because the sample size would be too small to support good conclusions. Intuitively, we are more comfortable making inferences from large data pools than from small ones.
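A one-dimensional sketch of Cedric's procedure might look like the following (the shot data, apple radius, and the use of a simple normal approximation are all assumptions made for illustration): compute the sample mean g, build an approximate 95% CI for the true aim A, and use the heirloom rifle only if the entire interval fits within the apple.

```python
import math
import random

# Illustrative one-dimensional sketch of Cedric's test (shot data, apple radius,
# and the normal approximation are all assumptions made for this example).

random.seed(3)

APPLE_RADIUS_CM = 4.0
N_SHOTS = 50

# Simulated horizontal deviation (in cm) of each shot from the bullseye.
shots = [random.gauss(0.2, 1.5) for _ in range(N_SHOTS)]

g = sum(shots) / N_SHOTS                            # point estimate of the true aim A
s = math.sqrt(sum((x - g) ** 2 for x in shots) / (N_SHOTS - 1))
half_width = 1.96 * s / math.sqrt(N_SHOTS)          # approximate 95% CI half-width
ci_low, ci_high = g - half_width, g + half_width

print(f"g = {g:.2f} cm, 95% CI for A = ({ci_low:.2f}, {ci_high:.2f})")
if max(abs(ci_low), abs(ci_high)) < APPLE_RADIUS_CM:
    print("The whole interval lies inside the apple: use the heirloom rifle.")
else:
    print("The interval extends past the apple: pick another gun.")
```

Note that both of the observations above fall out of the half-width formula: a widely dispersed sample inflates s, and a tiny sample shrinks the square-root-of-n divisor, and either widens the interval.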
Confidence Intervals in the Legal Context
A study may show that, compared with baseline, the relative risk is 2.2. On its face, the study suggests that plaintiff has met her burden of showing a doubling of risk. However, if the CI for that study is +/- 0.3, then the range of values defined by the CI is 1.9 to 2.5. That is, we are 95% certain that the actual relative risk lies within that range; technically, however, we can draw no conclusions about where within that range the actual relative risk really lies. This observation has led some observers to conclude that H should only be accepted if the range defined by the appropriate CI does not encompass any values which support the opposite conclusion, H0.(5) Thus, plaintiff would be judged not to have met her burden, because we cannot be 95% certain that the true relative risk is not below 2.0. This conclusion has intuitive appeal. However, it can be criticized as placing too large a burden on the plaintiff. After all, only the largest studies will have negligible 95% CIs. Thus, in order to meet her burden, plaintiff will have to produce a study which reaches a conclusion substantially above the target value, 2.0.(6)
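In code, the stricter reading amounts to a one-line check on the interval's lower bound (the figures below are simply the hypothetical 2.2 +/- 0.3 numbers from the example above).

```python
# Sketch of the stricter reading, using the hypothetical 2.2 +/- 0.3 figures above:
# accept the study only if the entire 95% CI for the relative risk lies above 2.0.

relative_risk = 2.2
ci_half_width = 0.3
threshold = 2.0

ci_low, ci_high = relative_risk - ci_half_width, relative_risk + ci_half_width
print(f"Point estimate {relative_risk}, 95% CI ({ci_low}, {ci_high})")

if ci_low > threshold:
    print("Entire interval exceeds 2.0: burden met under the stricter standard.")
else:
    print("Interval dips below 2.0: burden not met under the stricter standard.")
```

With these figures the interval's lower bound is 1.9, so the stricter standard would not be satisfied despite the point estimate of 2.2.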
Why Confidence Intervals Alone Don't Solve the Problem
Confidence intervals do help to some extent, however, when a legal standard is interested in more than the simple question of whether one condition has any effect at all on another. If the minimum increase in risk that results in legal liability is, for example, a doubling of risk, then the confidence interval can be adjusted accordingly. In particular, the significance level would then measure the chance of finding a doubling effect by random chance alone if the true effect were less than a doubling, rather than if the true effect were zero.
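One way to implement this adjustment, sketched below with an invented estimate and standard error, is to move the null hypothesis from "no effect" to the legal threshold itself and test H0: relative risk <= 2 against H: relative risk > 2, using the usual normal approximation on the log relative risk.

```python
import math

# Illustrative sketch (estimate and standard error are invented): test
# H0: relative risk <= 2 against H: relative risk > 2 by centering the usual
# normal approximation on the legal threshold rather than on "no effect".

rr_hat = 2.2        # estimated relative risk from a hypothetical study
se_log_rr = 0.07    # assumed standard error of ln(relative risk)
threshold = 2.0

z = (math.log(rr_hat) - math.log(threshold)) / se_log_rr
# One-sided p-value: the chance of an estimate at least this large if the true
# relative risk sat exactly at the doubling threshold.
p_value = 0.5 * math.erfc(z / math.sqrt(2))

print(f"z = {z:.2f}, one-sided p = {p_value:.3f}")
print("Doubling supported at the 5% level." if p_value < 0.05
      else "Cannot rule out that the true risk falls short of a doubling.")
```

With these invented numbers the one-sided p-value comes out near 0.09, so the study would not establish a doubling at the 5% level even though its point estimate exceeds 2.0.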
Data Mining and Expert Shopping

To complicate matters even further, assessing the validity of statistical studies does not end with an understanding of significance levels. Through random chance alone, even an ill-conceived research design will appear statistically significant at any given level some fraction of the time. Given the ability of modern computer programs to "data mine"--that is, to search data samples for anything remotely correlated with anything else--opportunities for invented claims of causation abound. Given a litigant's interest in producing support for his theory, some researcher and some data set can almost always come together--even in completely good faith--to produce the desired result. As such, research presented by one side will almost never be dispositive. Since each side is likely to emphasize only those studies supporting its side of any contested issue, it is rarely obvious at first glance which position receives the support of experts in the field. The mere counting of studies proving and disproving a theory, however, provides an unsatisfying solution to this problem, as there is often only a limited pool of data available for analysis. Further, with modern research techniques, it is all too easy for an "expert" to try to convince a court that her narrowly defined correlation is incontrovertible evidence of causation.
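The danger is easy to demonstrate. In the sketch below (sample and variable counts are arbitrary, and every value is pure noise), scanning a few hundred unrelated variables against a random outcome reliably turns up at least one correlation strong enough to look "statistically significant" if presented on its own.

```python
import random

# Illustrative sketch of the data-mining problem (sample and variable counts are
# arbitrary): all data below are pure noise, yet scanning many candidate
# "exposures" against a random outcome still yields a strong-looking correlation.

random.seed(4)

N_SUBJECTS = 200
N_VARIABLES = 500

outcome = [random.gauss(0, 1) for _ in range(N_SUBJECTS)]

def correlation(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

best_r = max(
    correlation([random.gauss(0, 1) for _ in range(N_SUBJECTS)], outcome)
    for _ in range(N_VARIABLES)
)
print(f"Strongest 'finding' among {N_VARIABLES} noise variables: r = {best_r:.2f}")
```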
The Makings of a "Good" Research Design
Admittedly, there comes a point at which the correlation between various phenomena cannot be ignored, even though the technical relationship between them cannot be explained with particularity. For example, although the exact genetic process through which cigarette smoking causes cancer has historically been uncertain, most of the scientific community has been virtually certain of the link for many years. Once a theory has been articulated, the emergence of new data can add power to the hypothesis, as the chance that additional data share the same random characteristics as those in the previous studies becomes lower and lower as more and more samples are considered.

Yet even unlimited access to new data samples does not alleviate the concern over misleading research results. Even with stratospheric significance levels, a plethora of other errors, all outside the scope of this paper, can systematically bias the results of a study. Even after studying many different data sets over a period of time, ill-conceived studies can continue to produce erroneous results, as incomplete or mathematically naïve models may systematically assign "fault" for various phenomena to the wrong explanatory variables.

Given the number of statistical intricacies easily manipulated by careless or unscrupulous experts, the judicial process could benefit from the presence of statistical experts to screen research methodology, or at least from a familiarity among judges with the characteristics common to "good" research. Statistics is a complex field in its own right, and a maze through which not all adjudicators - much less juries - could possibly be expected to maneuver effectively on their own. Nonetheless, judges should at least be familiar with the research procedures that indicate better research, and recognize that high significance levels are not the most important element.