Multiple Testing in Litigation: The More Places You Look, the More You Find

How multiple testing and after-the-fact searches can turn ordinary variation into apparently significant litigation evidence.
Author

James G. Scott

Eight heads in a row

A simple example about coin flipping shows why. Suppose I walk into my introductory data science class and ask the student in the center of the front row to flip a coin eight times.

She gets heads on the first flip. Then heads again. Then a third time. By the time she has flipped eight heads in a row, everyone in the room is paying attention. That outcome has probability

\[ \left(\frac{1}{2}\right)^8 = \frac{1}{256}, \]

or about 0.4 percent. If the question was specified in advance—this student, eight flips all coming up heads—the result is genuinely unusual.

But now suppose it’s my large introductory data science course of 700 students, and I ask everyone in the class to do the same thing. Each student flips a coin eight times. I then walk around the room looking for someone who got eight heads in a row.

That changes the meaning of the result. For any one student, eight heads is still unlikely. But across 700 students, the chance that at least one student gets eight heads is

\[ 1 - \left(\frac{255}{256}\right)^{700}, \]

which is about 94 percent. In fact, the expected number of students who get eight heads is \(700/256\), or about 2.7 students.
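
The arithmetic here can be checked in a few lines. What follows is a minimal Python sketch, assuming fair coins and students who flip independently of one another; the variable names are illustrative only.

```python
# Assumptions: fair coins, independent students. Names are illustrative.
n_flips = 8
n_students = 700

# Chance that one pre-chosen student flips eight heads in a row.
p_single = 0.5 ** n_flips

# Chance that at least one of the 700 students does so.
p_at_least_one = 1 - (1 - p_single) ** n_students

# Expected number of students with eight heads.
expected_count = n_students * p_single

print(f"single student:          {p_single:.4f}")
print(f"at least one of {n_students}:     {p_at_least_one:.3f}")
print(f"expected number of them: {expected_count:.2f}")
```

The only thing that changes between the first and second numbers is how many places were searched; the coin itself is the same.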

So if I find a student who flipped eight heads, have I found something remarkable? Not really. I have found just what I should have expected to find after looking in 700 places. The result looks dramatic only if the search is hidden. Once the search is visible, the surprise disappears.

The expert’s table has a history

The coin example tells us what questions we should be asking about the MorningMax 9000.

By the time the expert report is filed, the litigation has a clean statistical object. A table shows that MorningMax 9000 toaster ovens, sold as a private label through one retailer and assembled during one six-month period, had an unusually high rate of complaints using words like “smoke” and “burning.” The expert reports that the result is statistically significant. The table is easy to quote and gives the case a clear center of gravity.

But there are two paths by which a table like that can enter a case, and the difference between them is everything.

In the first path, the legal theory came first. The case had focused on the MorningMax 9000, that retailer, and that production window from the very beginning. The claim was that Dingbat changed something about the heating element for that private-label run, and smoke-and-burning complaints were the predicted consequence. The expert then tested the specific comparison that the theory called for.

This is the single-comparison path. It is like choosing the student in the front row before the coin is flipped. If we see eight heads in a row, the evidence is likely convincing that something fishy is going on.

But in the second path, the search came first, prior to the legal theory acquiring its specific form. The expert began with the broader universe of Dingbat records and examined models, retailers, factories, suppliers, warranty codes, complaint terms, service diagnoses, and production windows. Maybe at first, the overall failure rate for Dingbat’s toaster ovens did not look especially high. But the expert kept looking, slicing the data into ever more granular subsets. The MorningMax 9000 table became important because it was the one table, among many that were checked, that looked most alarming after the search.

This is the multiple-comparison path. It is like walking around the room after 700 students have flipped coins and finding the one who got eight heads.

Multiple comparisons void the warranty

A statistical test has a kind of warranty. Nobody calls it that, but that is what it is: if the question was specified in advance, and the assumptions are correct, the p-value has its advertised meaning. Once the question is selected after a search through the data, the warranty is voided. The reported p-value is just plain wrong.

That is generically referred to as the multiple-testing problem in statistics. The court sees one table, one p-value, and one expert conclusion, rendered as though this particular set of toaster ovens were the only ones examined. But that p-value belongs to an entirely different parallel universe: one where no slicing, dicing, and searching took place, and where the expert decided to ask about the MorningMax 9000 before seeing what the MorningMax 9000 would show. It does not belong to the universe that actually produced the evidence.

A useful cross-examination would therefore begin with the path, not the results themselves:

  • Before you looked at any of the complaint data, had you identified the MorningMax 9000 as the product of concern? (If no, follow up with: So you looked at many other products besides the MorningMax 9000 to see if they had issues?)
  • Before you ran the analysis, had you selected this six-month window as the relevant subset of the data? (If no: So you looked at many other time periods to see if those revealed an issue, too?)
  • Before you examined the complaints, had you decided that ‘smoke’ and ‘burning’ were the only relevant search terms? (If no: So how many search terms did you examine?)
  • So how many total combinations of other toaster models, retailers, production windows, and complaint phrases did you examine before you calculated the figures in this table?

The answers might show a long, meandering statistical walk through Dingbat’s records, with the final table chosen because it was the only one, of many possibilities, worth putting in the report.

The litigation lesson

The same problem can arise in any case, from employment to securities to healthcare fraud, where an expert searches across many outcomes, subgroups, definitions, time periods, or models. The details change, but the question remains the same: how much searching preceded the reported result?

Statistical significance always depends on the question that was asked. When the question emerges from the search itself, the reported p-value no longer carries its usual meaning. A statistical expert can account for the search by applying a multiple-testing correction, simulating the search process, or asking how often a search of comparable size would produce a result that impressive by chance. Once the search is accounted for, evidence that looked dramatic in the expert report may become much less impressive.
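
To make those three approaches concrete, here is a rough Python sketch. It rests on loud assumptions: the reported p-value (0.004) and the number of comparisons (200) are hypothetical, the comparisons are treated as independent, and each null p-value is treated as uniform on [0, 1]. Real searches through overlapping slices of the same records are correlated, so a faithful analysis would have to re-run the expert's actual search rather than this simplified stand-in. The function names are illustrative, not from any library.

```python
import random

def bonferroni(p, m):
    """Bonferroni adjustment: multiply by the number of comparisons, cap at 1."""
    return min(1.0, p * m)

def sidak(p, m):
    """Chance that at least one of m independent null tests reaches p or smaller."""
    return 1.0 - (1.0 - p) ** m

def simulate_search(p, m, n_trials=5000, seed=0):
    """Estimate how often the best of m independent null comparisons
    looks at least as impressive as the reported p-value."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_trials):
        # Assumption: under the null, each p-value is uniform on [0, 1].
        best = min(rng.random() for _ in range(m))
        if best <= p:
            hits += 1
    return hits / n_trials

# Hypothetical inputs, not taken from any real report.
p_reported = 0.004    # p-value printed in the expert's table
n_comparisons = 200   # number of slices examined during the search

adj_bonferroni = bonferroni(p_reported, n_comparisons)
adj_sidak = sidak(p_reported, n_comparisons)
adj_simulated = simulate_search(p_reported, n_comparisons)

print(f"Bonferroni-adjusted: {adj_bonferroni:.2f}")
print(f"Sidak-adjusted:      {adj_sidak:.2f}")
print(f"Simulated search:    {adj_simulated:.2f}")
```

Under these hypothetical inputs, all three adjusted figures land far above the conventional 0.05 threshold, even though the unadjusted p-value looked striking. The simulation and the Sidak formula answer the same question here, which is the coin-flip question again: how often would a search of this size turn up something this impressive by chance?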