When a Healthcare Fraud Model Sees a Green Apple

Why fraud models in healthcare must be validated as detection tools before outliers are treated as evidence of fraud.

Author

James G. Scott

The model starts as a screen and becomes an accusation

In healthcare fraud, payer-audit, and False Claims Act matters, agencies and payers often begin with a practical problem: they cannot review every claim, every provider, or every billing pattern by hand. Statistical models help narrow the field. They identify providers whose billing differs from a chosen benchmark and may therefore deserve a closer look. A provider may, for example, use a specific procedure or drug more often than the model would predict, and thus get flagged for closer review.

That kind of flag can be a sensible place to start. The danger comes when the flag is treated as though its meaning is already known. A provider who bills differently from a benchmark may be doing something improper, but the same pattern may also reflect case mix, specialty, referral patterns, payer rules, coding changes, or ordinary clinical judgment.

The expert’s task is to keep those possibilities separate. First comes a statistical question: has the model identified a billing pattern that is unusual relative to an appropriate benchmark? Only then does the legal and factual question arise: whether the unusual pattern reflects fraud, abuse, lack of medical necessity, or some other problem relevant to the claims in the case.

The warning has to be tested

For a statistical warning to carry legal weight, it must be validated in two different ways.

The first form of validation is statistical. Here we ask whether the model’s rarity claims are well calibrated. Suppose a model assigns each provider a score that is meant to describe how unusual the provider’s billing is relative to a benchmark. If the model says that a provider is in the most unusual one percent, that statement has a clear meaning: among providers whose billing is governed by the model’s assumptions, about one percent should look at least that unusual.

In statistics, that basic question about “truth in advertising” is called calibration: does the model’s language of rarity match the frequency with which those alarms actually occur in the relevant provider population? A well-calibrated model assigns labels that occur at about the rate the model says they should. A poorly calibrated model may call a provider a one-in-a-thousand outlier when, under a better benchmark or a more realistic account of billing variation, providers like that appear far more often. Counsel should always ask for affirmative evidence of good calibration; without it, there is no sound basis for crediting the model’s rarity claims.
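
One way to make the calibration question concrete is to compare the alarm rate a model advertises with the rate at which alarms actually occur. The sketch below is illustrative only. The function name observed_alarm_rate, the one percent threshold, and both sets of rarity scores are hypothetical, constructed for this example rather than taken from any real model or dataset. It assumes the model reports each provider's rarity as a tail probability, where a smaller number means a more unusual provider.

```python
def observed_alarm_rate(tail_probs, threshold):
    """Fraction of providers whose reported tail probability is at or below the threshold."""
    alarms = sum(1 for p in tail_probs if p <= threshold)
    return alarms / len(tail_probs)

# Hypothetical benchmark population: 1,000 providers with ordinary billing.
n = 1000

# A well-calibrated model: reported tail probabilities are spread evenly over (0, 1).
calibrated = [(i + 0.5) / n for i in range(n)]

# A miscalibrated model (an assumption made purely for illustration): it reports the
# square of the true tail probability, so ordinary providers look rarer than they are.
overstated = [p ** 2 for p in calibrated]

threshold = 0.01  # the model advertises "the most unusual one percent"
rate_calibrated = observed_alarm_rate(calibrated, threshold)
rate_overstated = observed_alarm_rate(overstated, threshold)

print(f"advertised alarm rate: {threshold:.1%}")
print(f"calibrated model:      {rate_calibrated:.1%}")
print(f"miscalibrated model:   {rate_overstated:.1%}")
```

In this constructed example the first model raises alarms at the advertised one percent rate, while the second raises them at ten percent, ten times as often as its own rarity language implies.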

The second form of validation is substantive. Here we ask a different question: once the model has identified a provider as statistically unusual, does that alarm reliably correspond to something real in the world? In a healthcare fraud matter, the relevant real-world condition may be fraud, abuse, false claims, lack of medical necessity, or some other problem tied to the legal claims in the case.

A model can pass the first test and still fail the second. Its rarity scores may be statistically well calibrated, while its alarms still do a poor job of identifying fraud. That is because rarity and misconduct are different properties. A provider can be genuinely rare under a billing model because it treats a different patient population, receives unusual referrals, follows different documentation workflows, or operates under different payer rules. None of those explanations involves misconduct.

The apple machine problem

Imagine a machine designed to flag apples on a conveyor belt for inspection. The goal is simple: find the rotten apples.

One question is whether the machine is calibrated as an outlier detector. If it says that one apple in a thousand should trigger an alarm, then alarms should occur at about that rate whenever the machine is applied to apples drawn from the population it was built to assess. If alarms occur far more often, the machine’s rarity scale is wrong. It is overstating how unusual the apples are.

A second question is whether the alarms correspond to rot. Suppose the machine was trained mostly on red apples. Now a green apple comes down the conveyor belt. The machine flags it as unusual. That’s because it is unusual: it differs from the apples the machine was built to expect. But the green apple is not rotten. It is just a different kind of fresh apple, one unanticipated by the designers of the machine.

The same distinction applies to a healthcare fraud model. Statistical validation concerns the model’s measurement of rarity. Substantive validation concerns the connection between rarity and the legally meaningful condition: fraud, abuse, false claims, or lack of medical necessity. A model trained or calibrated on one provider population may correctly identify a provider as unusual while leaving the relevant legal question unanswered: is the apple rotten, or merely green?

How substantive validation should work

So what would it take to validate the model for this more serious use? What should counsel look for if the outlier flag is itself being offered as affirmative evidence of fraud, perhaps even the linchpin of the case?

The model needs a protocol that connects its flags to independently assessed outcomes. The model should be tested on providers or claims that were not used to build the model. Those cases should then be reviewed through a process capable of determining whether the flagged pattern reflects fraud, abuse, false claims, lack of medical necessity, or lawful variation.

Statisticians usually summarize this as a confusion matrix. The term sounds technical, but the idea is simple. The model makes one classification: flag or no flag. The independent review makes another classification: problem or no problem. Substantive validation compares those two classifications, tabulating performance in a simple matrix:

	Problem found on independent review	No problem found on independent review
Model flags provider	True positive	False positive
Model does not flag provider	False negative	True negative

The first row is the critical row when a model is used to support an accusation. A flagged provider may be a true positive: the model identified a billing pattern that independent review confirms as problematic. But a flagged provider may also be a false positive: the model identified a statistically unusual pattern, while independent review showed lawful clinical or billing variation.

A false positive can arise for any number of reasons. A specialist may treat patients who are more complex than the model’s comparison group. A wound-care practice may use an expensive product more often because it receives patients after failed treatment elsewhere. A physician group may bill higher-level E/M codes because its patients are older and have multiple comorbidities. These are all ordinary sources of variation in healthcare delivery. A reliable fraud model must be tested against that variation before its warnings are treated as evidence of fraud.
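
The tabulation itself is mechanical once each provider in the validation sample has both a model flag and an independent review outcome. The following is a minimal sketch, not a description of any particular contractor's method. The function name confusion_counts and the six-provider sample are hypothetical, and the sketch assumes both classifications have already been reduced to yes or no.

```python
from collections import Counter

def confusion_counts(flags, problems):
    """Tally model flags against independent review outcomes.

    flags[i] is True if the model flagged provider i.
    problems[i] is True if independent review found a problem for provider i.
    """
    if len(flags) != len(problems):
        raise ValueError("each provider needs both a flag and a review outcome")
    cells = Counter(zip(flags, problems))
    return {
        "true_positive": cells[(True, True)],
        "false_positive": cells[(True, False)],
        "false_negative": cells[(False, True)],
        "true_negative": cells[(False, False)],
    }

# Hypothetical held-out validation sample of six providers (illustrative values only).
flags = [True, True, False, False, True, False]
problems = [True, False, False, True, True, False]

counts = confusion_counts(flags, problems)
print(counts)
```

The hard part of substantive validation is not this tally but the inputs to it: the providers must be ones the model was not built on, and the review outcome must come from a process that does not depend on the model's flag.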

Strong versus weak evidence

A simple pair of examples shows why substantive validation matters. Imagine that an audit contractor validates a billing-outlier model on 100 providers whose records are later independently reviewed.

In the stronger validation example, most flagged providers turn out to have a confirmed problem:

	Problem found on independent review	No problem found on independent review
Model flags provider	18	2
Model does not flag provider	7	73

Here, 18 of the 20 flagged providers were confirmed on review. That does not prove that any particular flagged provider committed fraud, and counsel would still need to examine the review protocol, the sample, and the definition of “problem.” But the model’s flags have at least been shown to correspond, in this validation exercise, to the kind of issue the model is being used to detect.

Now compare a weaker validation example:

	Problem found on independent review	No problem found on independent review
Model flags provider	6	24
Model does not flag provider	4	66

This model may still be finding unusual billing. It may even identify some confirmed problems. But 24 of the 30 flagged providers were false positives: the model raised an alarm, and independent review did not confirm fraud, abuse, false claims, or lack of medical necessity. That result would be a serious problem if the model’s flag were being used as evidence that a particular provider submitted false claims.
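
The arithmetic behind the two examples can be laid out in a few lines. The sketch below uses the counts from the two tables above. The function name validation_summary is hypothetical, and the terms precision, sensitivity, and specificity are standard statistical labels added here for illustration; they do not appear in the validation exercise itself.

```python
def validation_summary(tp, fp, fn, tn):
    """Summarize a two-by-two validation table of flags versus review outcomes."""
    return {
        "flagged": tp + fp,
        "precision": tp / (tp + fp),    # share of flagged providers confirmed on review
        "sensitivity": tp / (tp + fn),  # share of confirmed problems that were flagged
        "specificity": tn / (tn + fp),  # share of no-problem providers left unflagged
    }

# Counts taken from the stronger and weaker validation tables in the text.
stronger = validation_summary(tp=18, fp=2, fn=7, tn=73)
weaker = validation_summary(tp=6, fp=24, fn=4, tn=66)

for name, s in (("stronger", stronger), ("weaker", weaker)):
    print(
        f"{name}: {s['flagged']} flagged, "
        f"{s['precision']:.0%} of flags confirmed, "
        f"sensitivity {s['sensitivity']:.0%}, "
        f"specificity {s['specificity']:.0%}"
    )
```

With these counts, 90 percent of flags are confirmed in the stronger example and 20 percent in the weaker one. The first-row share is the figure that matters most when a flag is offered against a particular provider, because it answers how often a flagged provider turned out to have a confirmed problem.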

The contrast turns on one question: what happens when the outliers are checked against evidence independent of the model?

The litigation lesson

Whenever a statistical model is used to flag misconduct, counsel should ask whether the model has been validated for the use being made of it.

A model used only to prioritize audits can be judged as a screening tool. But a model offered as evidence of fraud, abuse, false claims, or lack of medical necessity must meet a higher standard. It must be shown to distinguish problematic providers from lawful providers who merely look unusual under the benchmark. Showing that requires both layers of validation. Is the rarity claim statistically calibrated? And among providers who trigger the alarm, how often does independent review confirm the substantive problem the model is said to detect?

Without those answers, the model may justify investigation. It does not yet provide reliable evidence of fraud.