by Robert Gifford and Whitney Pidot - Harvard Law School '99
The Search for an Objective Measure of Research Quality
Even after abstracting away from the mathematical and statistical complexities inherent in quantitative research, the technical meaning of statistical significance cannot be ignored. This paper intentionally avoids resorting to formal equations, but a definition is required. The significance level is the probability that the researchers would have reached their result through random chance alone even if it were untrue. For example, when a study concludes at a 95% confidence level (that is, a 5% significance level) that a given product triples one's chance of illness, 5% of studies would reach the same conclusion even if the product had no impact - or even if the product actually helped to prevent the illness being considered.
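To make the definition concrete, the following minimal Python sketch (not from the original paper; the sample size and baseline illness rate are invented) simulates many studies of a product that truly has no effect and counts how often a simple one-sided comparison nonetheless crosses the 5% threshold by chance, which happens in roughly 5% of the simulated studies.

```python
import random

# Illustrative sketch (sample size and baseline illness rate are invented):
# simulate many studies of a product that truly has NO effect and count how
# often a simple one-sided comparison still crosses the 5% threshold by chance.

random.seed(0)

def fake_study(n=1000, baseline_rate=0.10):
    """One simulated study of a product with no real effect.

    Returns True if the exposed group's illness rate looks 'significantly'
    higher than the unexposed group's under a normal approximation.
    """
    exposed = sum(random.random() < baseline_rate for _ in range(n))
    unexposed = sum(random.random() < baseline_rate for _ in range(n))
    p1, p0 = exposed / n, unexposed / n
    pooled = (exposed + unexposed) / (2 * n)
    se = (2 * pooled * (1 - pooled) / n) ** 0.5
    z = (p1 - p0) / se if se > 0 else 0.0
    return z > 1.645  # one-sided 5% significance threshold

n_studies = 2000
false_positives = sum(fake_study() for _ in range(n_studies))
print(f"'Significant' harm found in {false_positives / n_studies:.1%} of studies, "
      "even though the product does nothing.")
```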
Proof and Error in Practice: A Closer Look at Science
After conducting our Sun Study, we may find that most people tan when exposed to sunlight for an extended duration. At no point, however, can we say with 100% certainty that this is the true cause. Perhaps tanning is caused by exposure to blue sky, and the fact that sunlight is abundant when the sky is blue is incidental. Or perhaps tanning is a random process that we imagine to be associated with sunlight. Thus, we use statistical correlation to show relationships among variables; H is supported when correlation is high, and H0 is supported when correlation is low. But how high is high? Given that the bulk of the data supports our hypothesis, the question becomes: how likely are we to get this data by sheer chance?

Coincidences do happen. For example, the Powerball Lottery is played across the nation, but in one year three winning tickets were sold within one mile of each other in Fond du Lac, Wisconsin. A hypothesis that Fond du Lac tickets have increased chances of winning sounds ludicrous but would perhaps merit investigation. Our choices are H: Fond du Lac sales are good luck, and H0: it's just chance. In this investigation, two types of error may emerge. An example of Type I error would be a conclusion that Fond du Lac is good luck when in fact the location of the ticket sales was just coincidence (H asserted; H0 true). Conversely, Type II error leads to the claim that the lucky tickets were all coincidentally from Fond du Lac when in truth those tickets had a veritable cosmic advantage (H0 asserted; H true).
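As a rough illustration of the two error types, the sketch below (the win probabilities, ticket counts, and decision rule are all hypothetical) models the Fond du Lac question as a test of whether a location's win rate exceeds the national baseline, and estimates how often each kind of error would occur.

```python
import random

# Illustrative sketch (win probabilities, ticket counts, and the decision rule
# are all hypothetical): model the Fond du Lac question as a test of whether a
# location's win rate exceeds the national baseline, and estimate how often
# each type of error occurs.

random.seed(1)

BASELINE = 0.01    # assumed chance of any ticket winning
LUCKY = 0.02       # win rate if the location really were "lucky"
N_TICKETS = 5000   # tickets observed at the location
THRESHOLD = 65     # assert H ("the location is lucky") if at least this many win

def winners(p):
    return sum(random.random() < p for _ in range(N_TICKETS))

trials = 2000
type_i = sum(winners(BASELINE) >= THRESHOLD for _ in range(trials)) / trials
type_ii = sum(winners(LUCKY) < THRESHOLD for _ in range(trials)) / trials

print(f"Type I  (assert H when H0 is true): {type_i:.1%}")
print(f"Type II (assert H0 when H is true): {type_ii:.1%}")
```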
What is p?

But what is true? Even though the Fond du Lac cluster is an unlikely product of chance, we have not proven that Fond du Lac is special. Perhaps this case is indeed a rare "1%" incidence of chance. Yet science has accepted the data and is drawing conclusions from it. This is a case of Type I error: science might conclude that Fond du Lac has mystical properties when it does not. Conversely, if the study concluded that chance could account for the Fond du Lac cluster 8% of the time, science would not accept H and would conclude that the evidence does not support the notion that Fond du Lac is supernatural. There would, however, still be substantial evidence that H is valid. If H is actually true but we reject it, we have committed Type II error.

It should be noted that the threshold need not be set at the conventional 95% level. If a plaintiff presents a study, the 95% standard means that Type I error will be kept to a minimum. This standard is good for science, which proceeds conservatively, but it produces more false negative results than false positive ones. In fact, several commentators have noted that because one goal of civil law is to make decisions on a more-likely-than-not standard (i.e., greater than 50%), perhaps the threshold should be set at a level which equalizes Type I and Type II error.(2) In a legal setting, they argue, plaintiff and defendant should share the risk that chance accounts for a result; thus, Type I and Type II error should be set equal to each other. The mathematical relationship between the two is not straightforward, but for many studies this equilibrium exists at about 80%-85%.(3)
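The trade-off that motivates this argument can be illustrated with a small simulation (again with invented parameters): as the decision threshold is relaxed, Type I error rises while Type II error falls, and the point at which the two are roughly equal corresponds to a confidence level well below 95%.

```python
import random

# Illustrative sketch (all parameters invented): for a hypothetical "lucky
# location" test, scan over decision thresholds and print how Type I error is
# traded against Type II error as the standard is relaxed or tightened.

random.seed(2)

BASELINE = 0.01    # assumed national win rate
LUCKY = 0.013      # win rate if the location really were lucky
N_TICKETS = 5000
TRIALS = 2000

def winners(p):
    return sum(random.random() < p for _ in range(N_TICKETS))

null_draws = [winners(BASELINE) for _ in range(TRIALS)]   # H0 true
lucky_draws = [winners(LUCKY) for _ in range(TRIALS)]     # H true

for threshold in range(52, 65, 2):
    type_i = sum(x >= threshold for x in null_draws) / TRIALS
    type_ii = sum(x < threshold for x in lucky_draws) / TRIALS
    print(f"threshold {threshold}: Type I = {type_i:.1%}, Type II = {type_ii:.1%}")
```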
Using Confidence Intervals

Master marksman Cedric Tell, like his great-uncle William, must shoot an apple from the top of his son's head. He is allowed to pick his favorite rifle for the task, and naturally he wishes to pick one that is accurate. One of his rifles is a family heirloom that is rumored never to miss, but Cedric is skeptical of old guns and wishes to test it before he uses it. Cedric decides that he will use the gun if he can be 95% certain that it will not injure his son. He takes the gun to the shooting range, where he fires at a target 50 times. Cedric forms the hypothesis H: the rifle is accurate. Because he always aimed at the bullseye (and Cedric rarely misses), he concludes that if the bullet holes are generally far from the center, the gun must be inaccurate.

Cedric performs a few statistical procedures on the bullet hole distribution and finds the center point of the spread, labeled g. It is easy to imagine that g may not actually designate one of the bullet holes; g is simply an average value. Also note that g is unlikely to be located at the exact center of the target. Even if the rifle placed 49 bullets at the exact center and one bullet only one centimeter to the right, g would still deviate from the center by a little bit. Yet in this case, Cedric would be scientifically wrong to conclude that the gun is inaccurate. The p < 0.05 standard only requires 95% certainty. But we're 100% certain that g does not lie on the center bullseye, aren't we? The answer is yes, but it's the wrong question. Cedric knows that g does not lie at the bullseye, but g is only a point estimate of the true aim of the gun. The true aim (call it A) is confounded by a certain element of chance (drafts, temperature, bullet density, etc.). Thus, where 49 bullets pass through the bullseye and one errs, we are comfortable chalking the deviation up to chance.

This conclusion is an example of a confidence interval at work. In common terms, we decided that g was "close enough" to the bullseye. The confidence interval (CI) is a statistical way of saying the same thing. CIs are widely used by statisticians, and their construction follows accepted methods. The CI designates a range of values in which we are confident that A might lie. Imagine Cedric's problem again. He needs to be 95% certain that the bullet will not pass through his son's head if aimed at the center of the apple. To do this, Cedric would form a CI around g in which he is 95% certain that A lies. If the CI extends farther than the radius of an apple, Cedric will pick another rifle. Cedric doesn't really care whether the rifle is possibly perfectly accurate (that the bullseye lies within the CI), but rather that A does not lie outside the area of the apple. Thus, Cedric is 95% certain that the CI contains A; because the CI is entirely contained within the radius of the apple, he is willing to attempt the trick with the heirloom rifle.

At this point a couple of observations should be noted. First, if the sample of 50 bullets is widely dispersed, g might lie near the bullseye, but the CI will be quite large, and Cedric would pick another gun. Additionally, if the initial sample consisted of only 2 shots, the CI would still be large, because the sample size would be too small to support good conclusions. Intuitively, we are more comfortable making inferences from large data pools than from small ones.
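A one-dimensional sketch of Cedric's procedure might look like the following (the shot data, apple radius, and the use of a simple normal approximation are all assumptions made for illustration): compute the sample mean g, build an approximate 95% CI for the true aim A, and use the heirloom rifle only if the entire interval fits within the apple.

```python
import math
import random

# Illustrative one-dimensional sketch of Cedric's test (shot data, apple radius,
# and the normal approximation are all assumptions made for this example).

random.seed(3)

APPLE_RADIUS_CM = 4.0
N_SHOTS = 50

# Simulated horizontal deviation (in cm) of each shot from the bullseye.
shots = [random.gauss(0.2, 1.5) for _ in range(N_SHOTS)]

g = sum(shots) / N_SHOTS                            # point estimate of the true aim A
s = math.sqrt(sum((x - g) ** 2 for x in shots) / (N_SHOTS - 1))
half_width = 1.96 * s / math.sqrt(N_SHOTS)          # approximate 95% CI half-width
ci_low, ci_high = g - half_width, g + half_width

print(f"g = {g:.2f} cm, 95% CI for A = ({ci_low:.2f}, {ci_high:.2f})")
if max(abs(ci_low), abs(ci_high)) < APPLE_RADIUS_CM:
    print("The whole interval lies inside the apple: use the heirloom rifle.")
else:
    print("The interval extends past the apple: pick another gun.")
```

Note that both of the observations above fall out of the half-width formula: a widely dispersed sample inflates s, and a tiny sample shrinks the square-root-of-n divisor, and either widens the interval.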
Confidence Intervals in the Legal Context
A study may show that, compared with baseline, the relative risk is 2.2. On its face, the study suggests that plaintiff has met her burden of showing a doubling of risk. However, if the CI for that study is +/- 0.3, then the range of values defined by the CI is 1.9 to 2.5. That is, we are 95% certain that the actual relative risk lies within that range; technically, however, we can draw no conclusions about where within that range the actual relative risk really lies. This observation has led some observers to conclude that H should only be accepted if the range defined by the appropriate CI does not encompass any values which support the opposite conclusion, H0.(5) Thus, plaintiff would be judged not to have met her burden, because we cannot be 95% certain that the true relative risk is not below 2.0. This conclusion has intuitive appeal. However, it can be criticized as placing too large a burden on the plaintiff. After all, only the largest studies will have negligible 95% CIs. Thus, in order to meet her burden, plaintiff will have to produce a study which reaches a conclusion substantially above the target value, 2.0.(6)
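In code, the stricter reading amounts to a one-line check on the interval's lower bound (the figures below are simply the hypothetical 2.2 +/- 0.3 numbers from the example above).

```python
# Sketch of the stricter reading, using the hypothetical 2.2 +/- 0.3 figures above:
# accept the study only if the entire 95% CI for the relative risk lies above 2.0.

relative_risk = 2.2
ci_half_width = 0.3
threshold = 2.0

ci_low, ci_high = relative_risk - ci_half_width, relative_risk + ci_half_width
print(f"Point estimate {relative_risk}, 95% CI ({ci_low}, {ci_high})")

if ci_low > threshold:
    print("Entire interval exceeds 2.0: burden met under the stricter standard.")
else:
    print("Interval dips below 2.0: burden not met under the stricter standard.")
```

With these figures the interval's lower bound is 1.9, so the stricter standard would not be satisfied despite the point estimate of 2.2.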
Why Confidence Intervals Alone Don't Solve the Problem
Confidence intervals do help to some extent, however, when a legal standard is interested in more than the simple question of whether one condition has any effect at all on another. If the minimum increase in risk that results in legal liability is, for example, a doubling of risk, then the confidence interval can be adjusted accordingly. In particular, the significance level would then measure the chance of finding a doubling effect by random chance alone if the true effect were less than a doubling, rather than if the true effect were zero.
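One way to implement this adjustment, sketched below with an invented estimate and standard error, is to move the null hypothesis from "no effect" to the legal threshold itself and test H0: relative risk <= 2 against H: relative risk > 2, using the usual normal approximation on the log relative risk.

```python
import math

# Illustrative sketch (estimate and standard error are invented): test
# H0: relative risk <= 2 against H: relative risk > 2 by centering the usual
# normal approximation on the legal threshold rather than on "no effect".

rr_hat = 2.2        # estimated relative risk from a hypothetical study
se_log_rr = 0.07    # assumed standard error of ln(relative risk)
threshold = 2.0

z = (math.log(rr_hat) - math.log(threshold)) / se_log_rr
# One-sided p-value: the chance of an estimate at least this large if the true
# relative risk sat exactly at the doubling threshold.
p_value = 0.5 * math.erfc(z / math.sqrt(2))

print(f"z = {z:.2f}, one-sided p = {p_value:.3f}")
print("Doubling supported at the 5% level." if p_value < 0.05
      else "Cannot rule out that the true risk falls short of a doubling.")
```

With these invented numbers the one-sided p-value comes out near 0.09, so the study would not establish a doubling at the 5% level even though its point estimate exceeds 2.0.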
Data Mining and Expert Shopping

To complicate matters even further, assessing the validity of statistical studies does not end with an understanding of significance levels. Through random chance alone, even an ill-conceived research design will appear statistically significant at any given level some fraction of the time. Given the ability of modern computer programs to "data mine"--that is, to search data samples for anything remotely correlated with anything else--opportunities for invented claims of causation abound. Given a litigant's interest in producing support for his theory, some researcher and some data set can almost always come together--even in completely good faith--to produce the desired result. As such, research presented by one side will almost never be dispositive. Since each side is likely to emphasize only those studies supporting its side of any contested issue, it is rarely obvious at first glance which position receives the support of experts in the field. The mere counting of studies proving and disproving a theory, however, provides an unsatisfying solution to this problem, as there is often only a limited pool of data available for analysis. Further, with modern research techniques, it is all too easy for an "expert" to try to convince a court that her narrowly defined correlation is incontrovertible evidence of causation.
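The danger is easy to demonstrate. In the sketch below (sample and variable counts are arbitrary, and every value is pure noise), scanning a few hundred unrelated variables against a random outcome reliably turns up at least one correlation strong enough to look "statistically significant" if presented on its own.

```python
import random

# Illustrative sketch of the data-mining problem (sample and variable counts are
# arbitrary): all data below are pure noise, yet scanning many candidate
# "exposures" against a random outcome still yields a strong-looking correlation.

random.seed(4)

N_SUBJECTS = 200
N_VARIABLES = 500

outcome = [random.gauss(0, 1) for _ in range(N_SUBJECTS)]

def correlation(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

best_r = max(
    correlation([random.gauss(0, 1) for _ in range(N_SUBJECTS)], outcome)
    for _ in range(N_VARIABLES)
)
print(f"Strongest 'finding' among {N_VARIABLES} noise variables: r = {best_r:.2f}")
```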
The Makings of a "Good" Research Design
Admittedly, there comes a point at which the correlation between various phenomena cannot be ignored, even though the technical relationship between them cannot be explained with particularity. For example, although the exact genetic process through which cigarette smoking causes cancer has historically been uncertain, most of the scientific community has been virtually certain of the link for many years. Once a theory has been articulated, the emergence of new data can add power to the hypothesis, as the chance that additional data share the same random characteristics as those in the previous studies becomes lower and lower as more and more samples are considered.

Yet even unlimited access to new data samples does not alleviate the concern over misleading research results. Even with stratospheric significance levels, a plethora of other errors, all outside the scope of this paper, can systematically bias the results of a study. Even after studying many different data sets over a period of time, ill-conceived studies can continue to produce erroneous results, as incomplete or mathematically naïve models may systematically assign "fault" for various phenomena to the wrong explanatory variables.

Given the number of statistical intricacies easily manipulated by careless or unscrupulous experts, the judicial process could benefit from the presence of statistical experts to screen research methodology, or at least from a familiarity among judges with the characteristics common to "good" research. Statistics is a complex field in its own right, and a maze through which not all adjudicators - much less juries - could possibly be expected to maneuver effectively on their own. Nonetheless, judges should at least be familiar with the research procedures that indicate better research, and recognize that high significance levels are not the most important element.