When Predictive AI Fails in the Wild

How lawyers should think about model validation, distribution shift, and causation when a deployed AI system produces disappointing business results.

Author: James G. Scott

Published: April 13, 2025

The business result is only the beginning

Predictive AI systems are increasingly sold as business tools. They are used to recommend products, flag risky transactions, screen applicants, forecast demand, personalize prices, identify fraud, and automate decisions that used to depend on human judgment. When those systems disappoint their users, as many inevitably do, the legal dispute often begins with a simple business fact. The fraud tool missed fraud. The recommendation engine failed to increase sales. The hiring model produced bad candidates. The pricing model did not improve margin.

Those facts matter; indeed, they are usually the very reason a lawsuit is filed. But they rarely answer the important technical and legal questions by themselves.

Consider a clothing retailer that hires a software vendor to deploy an online size-prediction tool. The vendor builds or supplies a machine-learning model that recommends sizes to customers based on information such as height, weight, body measurements, prior purchases, and fit preference. The commercial pitch is easy to understand. If customers choose the right size, they will be less likely to return what they buy. For apparel retailers, that is a serious concern: online returns are often frequent enough to disrupt inventory planning and erode margins.

Now imagine that, after the tool is launched, the retailer sees a high return rate among customers who used it. This can become more than ordinary commercial disappointment because predictive systems are often sold with engineering-like performance assurances: the model will achieve a stated level of accuracy, reliability, or business performance for a specified use case under specified conditions. The retailer concludes that the vendor delivered a defective system. The vendor responds that the model performed as trained, that returns are affected by many things besides size, and that the retailer’s live website traffic differed from the data used to develop the model.

This is a good example because, while the surface issue is easy enough to understand, the underlying statistical issues are subtle. What’s more, the same structure appears in many commercial disputes involving AI or machine learning systems. A model is purchased to improve an outcome, with disappointing results. One side treats the disappointment as proof that the model failed. The other side treats the complexity of the business environment as a reason the model cannot fairly be blamed.

A serious analysis has to do better than either instinct. It has to ask what the model was supposed to predict, how performance was specified, how the model was validated, whether the deployment environment matched the development environment, and whether the observed business outcome is a reliable measure of the alleged technical failure.

What was the model supposed to do?

The first question in a predictive-model dispute is often the simplest to state and the easiest to overlook. What was the model actually supposed to predict?

In the sizing example, the answer may seem obvious. The model was supposed to recommend the correct size. But “correct size” is not a self-defining statistical target.

A model might be trained to predict the size a customer is most likely to buy. It might be trained to predict the size a customer is most likely to keep. It might be trained to predict the probability of return for each possible size. It might be trained to predict whether a customer will exchange an item for a different size. It might be trained to infer a latent concept such as “best fit” from noisy behavioral data.

Those targets are related, but they are not the same. A customer’s purchased size may reflect habit, uncertainty, brand confusion, or a desire to try multiple sizes at home. A customer’s kept size may be a better outcome, but it still may not reveal the customer’s ideal size. A customer may keep a slightly imperfect item because the price was low or the return process was inconvenient. Another customer may return a well-fitting item because the fabric felt different in person or because the color was not flattering.

This distinction generalizes. In many AI disputes, the model’s statistical target only approximates the business objective. A fraud model may predict chargebacks, not fraud itself. A hiring model may predict employee retention, not job performance. A medical-risk model may predict future cost, not medical need. A recommendation engine may predict clicks, not customer satisfaction. The gap between the proxy and the underlying objective can become central when the system is challenged.

For lawyers, this is often the first place where a technical expert can add value. The expert can translate the language of the contract, proposal, sales materials, and model documentation into a precise statistical question. A phrase like “accurate recommendations” may sound clear in ordinary business language, but it may not specify a target variable, a performance metric, a relevant population, or an acceptable error rate.

The business outcome is evidence, not the whole answer

The retailer’s return rate is obviously relevant. If the sizing tool was sold to reduce returns, and returns remain high, that fact deserves attention. But a return is not the same thing as a wrong size recommendation.

Customers return clothing for many reasons. Fit is one reason. So are color, fabric, styling, delivery delay, price reconsideration, duplicate ordering, product quality, and a habit known in retail as bracketing, where customers order several sizes intending to keep one and return the rest. A customer may use the sizing tool, receive a reasonable recommendation, order the recommended size, and return the item for reasons unrelated to size.

The comparison group also matters. Customers who use a sizing tool may not be typical customers. They may use the tool because they are uncertain about their size, unfamiliar with the brand, buying a difficult category, purchasing a gift, or choosing between two adjacent sizes. These customers may have a higher baseline return risk before the tool ever intervenes. A high return rate among tool users may therefore reflect customer selection rather than model failure.
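A small arithmetic sketch can make the selection point concrete. Every figure below is a hypothetical assumption chosen for illustration, not data from any real retailer: two customer segments with different baseline return rates, a tool that lowers each segment's return rate by five percentage points, and tool users drawn mostly from the harder-to-fit segment.

```python
# Hypothetical illustration of selection among tool users.
# Every number here is an assumption made up for the example.

def blended_return_rate(share_uncertain, rate_uncertain, rate_confident):
    """Return rate for a group that mixes uncertain and confident shoppers."""
    return share_uncertain * rate_uncertain + (1 - share_uncertain) * rate_confident

BASE_UNCERTAIN = 0.40   # assumed baseline return rate for uncertain shoppers
BASE_CONFIDENT = 0.15   # assumed baseline return rate for confident shoppers
TOOL_REDUCTION = 0.05   # assumed benefit of the tool, in return-rate points

# Assume 80% of tool users are uncertain shoppers, versus 20% of non-users.
users_with_tool = blended_return_rate(
    0.80, BASE_UNCERTAIN - TOOL_REDUCTION, BASE_CONFIDENT - TOOL_REDUCTION
)
users_without_tool = blended_return_rate(0.80, BASE_UNCERTAIN, BASE_CONFIDENT)
non_users = blended_return_rate(0.20, BASE_UNCERTAIN, BASE_CONFIDENT)

print(f"Tool users, as observed:        {users_with_tool:.0%}")
print(f"Same users, had there been no tool: {users_without_tool:.0%}")
print(f"Non-users:                      {non_users:.0%}")
```

Under these assumptions, tool users return 30 percent of their orders and non-users return 20 percent, even though the same users would have returned 35 percent without the tool. The naive user-versus-non-user comparison points in the wrong direction.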

This is a common problem in disputes over predictive systems. The observed business outcome is often downstream of the model, but it is also downstream of many other forces. If a fraud-detection tool is used on the riskiest transactions, its alerts may be associated with high fraud rates even if it is useful. If a loan-screening tool is deployed during a deteriorating credit environment, default rates may rise even if the model ranks applicants well. If a demand-forecasting model is used during a supply shock, inventory outcomes may disappoint even if the forecast was reasonable given the information available at the time.

The point is not that business outcomes are irrelevant. The point is that they need to be interpreted through a statistical design. A before-and-after comparison may be suggestive. A comparison between users and non-users may be suggestive. But neither automatically isolates the effect of the model.

In the sizing-tool case, a cleaner design might involve an A/B test. Customers could be randomly assigned to receive the size recommendation or not receive it, and outcomes could be compared across the randomized groups. Even then, the analysis should specify which outcomes matter: return rate, exchange rate, conversion rate, average order value, retained revenue, customer satisfaction, or some combination. The tool might increase conversion by giving uncertain customers confidence to buy, while also increasing the number of returns because more uncertain customers entered the purchasing process. Whether that is a failure depends on the promised performance and the proper measure of harm.
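As a rough sketch of how return rates in two randomized groups might be compared, the following uses a pooled two-proportion z-test. That test is one conventional choice, assumed here for illustration; the function name and the counts are hypothetical, and a real analysis would also need to address the other outcomes listed above.

```python
import math

def two_proportion_z_test(returns_a, n_a, returns_b, n_b):
    """Compare return rates in two randomized groups.

    Returns (difference in rates, z statistic, two-sided p-value),
    using the pooled standard error under the null of equal rates.
    """
    if n_a <= 0 or n_b <= 0:
        raise ValueError("both groups must contain at least one order")
    rate_a = returns_a / n_a
    rate_b = returns_b / n_b
    pooled = (returns_a + returns_b) / (n_a + n_b)
    variance = pooled * (1 - pooled) * (1 / n_a + 1 / n_b)
    if variance == 0:
        raise ValueError("pooled rate is 0 or 1; the test is undefined")
    z = (rate_a - rate_b) / math.sqrt(variance)
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return rate_a - rate_b, z, p_value

# Hypothetical counts: 5,000 orders per arm.
# Arm A saw the size recommendation; arm B did not.
diff, z, p_value = two_proportion_z_test(1100, 5000, 1250, 5000)
print(f"difference in return rate: {diff:+.3f}, z = {z:.2f}, p = {p_value:.4f}")
```

With these invented counts, the recommendation arm's return rate is three percentage points lower. Because assignment is random, that gap is not explained by which customers chose to use the tool.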

Validation must match the claim

Machine-learning models are usually evaluated before they are deployed. But “validated” can mean different things, and not every validation exercise answers the question that later matters in litigation.

At a basic level, validation means testing the model on data that were not used to train it. If a model performs well only on the data it learned from, that may show memorization rather than useful prediction. A validation set gives some evidence about whether the model generalizes.

The details of the validation set are crucial. In the sizing example, the vendor might have trained the model on historical purchases and returns, then held back a random subset of those historical records to evaluate accuracy. That is a common procedure, and it may be appropriate for some purposes. But it may produce an overly optimistic picture if the held-out records closely resemble the training records. The same brands, product lines, seasons, website design, customer base, and return policy may appear in both.

For a retailer, the harder question is whether the model will work on future customers buying future products under live commercial conditions. A time-based validation set, which tests the model on later transactions than those used for training, may be more informative than a random split. A product-category holdout may reveal whether the model works on new categories. A brand holdout may test whether the model generalizes across labels. A new-season holdout may be essential if sizing patterns change with new fabrics, suppliers, cuts, or fashion trends.
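The difference between these validation designs is mostly a matter of how the historical records are divided. A minimal sketch, assuming each record is a dictionary with hypothetical `order_date` and `category` fields:

```python
from datetime import date

def time_based_split(records, cutoff):
    """Train on orders placed before the cutoff; test on orders on or after it."""
    train = [r for r in records if r["order_date"] < cutoff]
    test = [r for r in records if r["order_date"] >= cutoff]
    return train, test

def category_holdout(records, held_out):
    """Train on every category except one; test on the held-out category."""
    train = [r for r in records if r["category"] != held_out]
    test = [r for r in records if r["category"] == held_out]
    return train, test

# Hypothetical order history, invented for the example.
records = [
    {"order_id": 1, "order_date": date(2024, 3, 1), "category": "t-shirts"},
    {"order_id": 2, "order_date": date(2024, 6, 15), "category": "denim"},
    {"order_id": 3, "order_date": date(2024, 9, 10), "category": "t-shirts"},
    {"order_id": 4, "order_date": date(2024, 11, 20), "category": "swimwear"},
]

earlier, later = time_based_split(records, cutoff=date(2024, 9, 1))
seen, unseen = category_holdout(records, held_out="swimwear")
```

A random split would instead scatter orders from every season and category across both sets, which is why it can make the test easier than live deployment.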

The appropriate validation design depends on the claim being made. If the vendor represented only that the model achieved a certain accuracy on a specified benchmark dataset, then the benchmark may matter most. If the vendor represented that the tool would perform well on the retailer’s live site, then live or near-live validation may be more relevant. If the model was sold for use across the retailer’s full assortment, subgroup performance may matter. A model that performs adequately on basic T-shirts but poorly on denim, swimwear, outerwear, or formalwear may not satisfy the practical purpose for which it was purchased.

This pattern extends beyond retail. A model validated on last year’s ordinary transactions may not work on this year’s unusual transactions. A model validated on one hospital system may not work at another. A model validated on one jurisdiction’s cases may not work in another jurisdiction. A model validated on one applicant pool may not work after a change in recruiting strategy. The legal significance of validation depends on whether the validation exercise corresponds to the use case at issue.

Accuracy is not always the right metric

In ordinary speech, people often ask whether a model is accurate. That is a reasonable starting point, but it can obscure important choices.

Suppose the sizing model recommends a medium, and the customer keeps a large. Is that simply wrong? Is it less wrong than recommending an extra-small? Does the model get any credit if medium and large were both plausible? What if the model’s top recommendation was medium, but it also warned that large was a close alternative? What if the customer preferred a relaxed fit but entered inconsistent information?

A model’s performance can be measured in many ways. Exact-size accuracy asks whether the top recommendation matched the observed kept size. Top-two accuracy asks whether the kept size appeared among the two most likely recommendations. A return-rate metric asks whether recommended purchases were returned less often. A calibration metric asks whether the model’s stated probabilities corresponded to observed frequencies. A business metric might focus on retained revenue, margin after return costs, or customer lifetime value.
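The first two metrics can differ sharply on the same set of orders. The sketch below assumes the model outputs a ranked list of sizes for each order, most likely first; the data are hypothetical.

```python
def top_k_accuracy(ranked_recommendations, kept_sizes, k):
    """Share of orders whose kept size is among the model's top k sizes."""
    if not kept_sizes or len(ranked_recommendations) != len(kept_sizes):
        raise ValueError("need one ranked list per kept size")
    hits = sum(
        kept in ranked[:k]
        for ranked, kept in zip(ranked_recommendations, kept_sizes)
    )
    return hits / len(kept_sizes)

# Hypothetical rankings for four orders, most likely size first.
ranked = [
    ["M", "L", "S"],
    ["L", "M", "XL"],
    ["S", "M", "L"],
    ["M", "S", "L"],
]
kept = ["M", "M", "L", "M"]  # the size each customer actually kept

exact_size_accuracy = top_k_accuracy(ranked, kept, k=1)
top_two_accuracy = top_k_accuracy(ranked, kept, k=2)
print(exact_size_accuracy, top_two_accuracy)
```

On these four invented orders, exact-size accuracy is 50 percent and top-two accuracy is 75 percent. Which figure matters depends on how the recommendation was presented and what was promised.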

Calibration is especially important when a model communicates confidence. A calibrated model is one whose probabilities mean what they purport to mean. If the tool says there is an 80 percent chance that medium is the best size, then among similar cases where it assigns an 80 percent probability to medium, medium should be right about 80 percent of the time. A model can be useful at ranking options while still being poorly calibrated. That distinction may matter if the website presents a recommendation with unwarranted certainty.
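One simple way to check calibration is to group predictions by stated probability and compare each group's average stated probability with how often the recommendation turned out to be right. The sketch below uses equal-width bins, which is an assumption of this example rather than the only approach; the data are hypothetical.

```python
def calibration_table(probabilities, outcomes, n_bins=5):
    """Compare stated confidence with observed frequency, bin by bin.

    probabilities: the model's stated chance that its top size is right.
    outcomes: 1 if the top size was in fact the kept size, else 0.
    Returns one row per non-empty bin, ordered from low to high confidence.
    """
    if len(probabilities) != len(outcomes):
        raise ValueError("need one outcome per probability")
    bins = [[] for _ in range(n_bins)]
    for p, hit in zip(probabilities, outcomes):
        index = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes in the top bin
        bins[index].append((p, hit))
    rows = []
    for members in bins:
        if not members:
            continue
        n = len(members)
        rows.append({
            "n": n,
            "mean_predicted": sum(p for p, _ in members) / n,
            "observed": sum(hit for _, hit in members) / n,
        })
    return rows

# Hypothetical data: ten recommendations stated at 80% confidence,
# of which six were right, and five stated at 60%, of which three were right.
stated = [0.8] * 10 + [0.6] * 5
was_right = [1] * 6 + [0] * 4 + [1, 1, 1, 0, 0]

for row in calibration_table(stated, was_right):
    print(row)
```

In this invented example the 60 percent recommendations are right 60 percent of the time, but the 80 percent recommendations are also right only 60 percent of the time. The model is overconfident exactly where it sounds most certain.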

Subgroup performance also matters. Overall accuracy can hide serious weaknesses. A sizing tool may perform well on average but poorly for petite customers, plus-size customers, customers outside the United States, men’s tailoring, women’s denim, stretch fabrics, or brands with inconsistent size charts. If the vendor claimed broad applicability, average performance may not be enough. If the alleged damages are concentrated in particular categories, aggregate statistics may be misleading.
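A short sketch shows how an acceptable overall figure can conceal a subgroup where the model never succeeds. The records and field names are hypothetical.

```python
from collections import defaultdict

def accuracy_by_group(records, group_key="category"):
    """Exact-size accuracy computed separately for each subgroup."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    for r in records:
        totals[r[group_key]] += 1
        hits[r[group_key]] += int(r["recommended"] == r["kept"])
    return {group: hits[group] / totals[group] for group in totals}

# Hypothetical orders: eight t-shirt orders (seven correct), two denim orders (none correct).
records = (
    [{"category": "t-shirts", "recommended": "M", "kept": "M"}] * 7
    + [{"category": "t-shirts", "recommended": "M", "kept": "L"}]
    + [{"category": "denim", "recommended": "M", "kept": "L"}] * 2
)

overall = sum(r["recommended"] == r["kept"] for r in records) / len(records)
by_category = accuracy_by_group(records)
print(overall, by_category)
```

Here the overall accuracy is 70 percent, but it is 87.5 percent for t-shirts and zero for denim. If the claimed losses came from denim, the aggregate number would say little about them.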

This is a general lesson for AI litigation. The metric must fit the legal and business question. A single headline accuracy number rarely tells the whole story.

The world may have changed

One of the most important technical concepts in disputes over deployed models is distribution shift. The term refers to a change between the data environment in which a model was developed and the environment in which it is later used. A related term, covariate shift, refers more specifically to changes in the distribution of input variables. In plainer language, the model learned from one world and was deployed in another.

In the sizing-tool example, the possible sources of shift are easy to imagine. The retailer launches a new season. The product mix changes. A new supplier cuts garments differently. A promotion attracts new customers. The site expands internationally. The return policy becomes more generous. Customers increasingly order multiple sizes at once. Product photography changes. Size charts are revised. A new category, such as swimwear or denim, becomes more prominent. The recommendation is moved to a different part of the website. The model is integrated into a mobile interface where customers provide less information.

Any of these changes can affect model performance. A model trained on one mix of customers, products, and behaviors may perform worse when that mix changes. The model may still be doing what it was trained to do, but the learned relationships may no longer apply with the same force.
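One simple way to quantify a change in mix is to compare category shares in the training data with category shares in live traffic. The sketch below uses total variation distance, chosen here for simplicity; it is only one of several possible measures, and the counts are hypothetical. It detects a change in inputs, not a change in model performance.

```python
def shares(counts):
    """Convert raw counts per category into proportions."""
    total = sum(counts.values())
    if total <= 0:
        raise ValueError("counts must sum to a positive number")
    return {category: n / total for category, n in counts.items()}

def total_variation_distance(train_counts, live_counts):
    """Half the sum of absolute differences in category shares.

    0 means the two mixes are identical; 1 means they do not overlap at all.
    """
    p = shares(train_counts)
    q = shares(live_counts)
    categories = set(p) | set(q)
    return 0.5 * sum(abs(p.get(c, 0.0) - q.get(c, 0.0)) for c in categories)

# Hypothetical order counts by product category.
training_orders = {"t-shirts": 600, "denim": 300, "swimwear": 100}
live_orders = {"t-shirts": 150, "denim": 150, "swimwear": 200}

shift = total_variation_distance(training_orders, live_orders)
print(f"category-mix shift: {shift:.2f}")
```

In this invented example, swimwear grows from 10 percent of training orders to 40 percent of live orders, and the distance is 0.30. Whether a shift of that size matters depends on how well the model performs in the categories that grew.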

Distribution shift can be a defense, but it can also support a claim. The legal significance depends on the parties’ obligations. If the vendor promised a static model trained on historical data and validated under specified conditions, then a major change in the retailer’s business environment may complicate an allegation of breach. If the vendor promised an operational AI system suitable for live deployment, then monitoring for shift may have been part of the expected service. If the vendor knew the retailer’s assortment changed seasonally, then seasonal shift may have been foreseeable rather than exceptional.

This is where contract language, technical documentation, and statistical analysis intersect. A model can fail because it was badly built. It can also fail because it was deployed outside the conditions for which it was validated. It can fail because no one monitored whether those conditions continued to hold. In litigation, those are different theories.

Sorting the possible failures

A useful expert analysis often separates several kinds of failure that may otherwise be blurred together.

One possibility is a model-development failure. The training data may have been incomplete, biased, mislabeled, stale, or inappropriate. The model may have overfit historical quirks. Important predictors may have been omitted. The target variable may have been a poor proxy for the business objective.

A second possibility is a validation failure. The model may have looked good because the test was too easy. A random historical validation set may not have tested future seasons, new products, important subgroups, or realistic deployment conditions.

A third possibility is a specification failure. The vendor may have optimized one target while the retailer expected another. The model may have predicted purchase size when the business needed kept size. It may have maximized exact-size accuracy when the retailer cared more about reducing costly returns. It may have treated all errors as equal when some errors had much larger financial consequences.

A fourth possibility is a deployment failure. The model may have been reasonable, but the tool may have been integrated poorly. Recommendations may have been displayed unclearly, mapped to the wrong product sizes, cached incorrectly, shown too late in the customer journey, or overridden by inconsistent size charts. Logs may not accurately record what the customer saw.

A fifth possibility is a monitoring failure. A model that works at launch may degrade as products, customers, and behavior change. Many predictive systems require ongoing monitoring of drift, calibration, subgroup performance, and business outcomes.

A final possibility is that the disappointing result arose from business conditions outside the model. A new promotion, a change in return policy, a product-quality issue, or a shift in customer acquisition can change return behavior without implying that the model was defective.

These categories are not mutually exclusive. Several may be present at once. But distinguishing them helps counsel formulate claims, defenses, discovery requests, and expert questions.

What counsel should want to see

In a dispute like this, the most important evidence will usually include more than the model’s outputs. Counsel should want the contract, statements of work, specifications, service-level commitments, sales materials, and any documents describing what the tool was supposed to accomplish.

The technical record matters as well. That includes model documentation, training-data descriptions, feature definitions, label definitions, preprocessing code, validation reports, model-version history, performance dashboards, and any internal communications about known limitations. It also includes deployment evidence: website logs, recommendation logs, product catalogs, size charts, SKU mappings, user-interface changes, return reasons, exchange records, and the timing of promotions or return-policy changes.

The reason is straightforward. A predictive model is part of a data pipeline and a business process. The data collected from the customer must be transformed into model inputs. The model output must be translated into a recommendation. That recommendation must be mapped to actual inventory. The customer must see and understand it. The resulting purchase, return, or exchange must be logged accurately. A failure anywhere in this chain can look like a model failure from a distance.

This is another general point. In disputes over AI systems, the relevant unit of analysis is often the deployed system, not merely the trained model.

Damages require a counterfactual

Even if the sizing tool performed poorly, damages require a comparison to what would have happened otherwise. That counterfactual can be difficult.

The retailer may claim that the tool caused excess returns. But compared to what? Returns before launch? Returns among customers who did not use the tool? Returns projected under the vendor’s promised performance? Returns that would have occurred under a reasonable alternative tool?

Each comparison has limitations. Before-and-after comparisons can be affected by seasonality, promotions, product mix, and customer acquisition. User-versus-non-user comparisons can be affected by selection, since tool users may be harder to fit in the first place. Comparisons to promised performance require careful interpretation of what was promised and under what conditions.

The tool may also have mixed effects. It might increase returns while also increasing sales. It might reduce size-related returns while increasing purchases by uncertain customers. It might perform well in some categories and poorly in others. A damages analysis that treats every return by a tool user as caused by the model would likely be too crude.
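A brief arithmetic sketch illustrates how returns and revenue can move in the same direction. The order counts, order value, and per-return handling cost are all hypothetical assumptions, and the revenue measure is deliberately simplified.

```python
def net_retained_revenue(orders, returns, avg_order_value, cost_per_return):
    """Revenue from kept orders, less the handling cost of returned orders."""
    kept = orders - returns
    return kept * avg_order_value - returns * cost_per_return

# Hypothetical scenario without the tool: 1,000 orders, 200 returned (20%).
without_tool = net_retained_revenue(
    orders=1000, returns=200, avg_order_value=50, cost_per_return=10
)

# Hypothetical scenario with the tool: more uncertain shoppers buy,
# so there are 1,200 orders and 300 are returned (25%).
with_tool = net_retained_revenue(
    orders=1200, returns=300, avg_order_value=50, cost_per_return=10
)

print(without_tool, with_tool, with_tool - without_tool)
```

Under these assumptions, returns rise from 200 to 300 and the return rate rises from 20 to 25 percent, yet net retained revenue rises from 38,000 to 42,000. A damages figure built only on the extra 100 returns would ignore the additional kept orders.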

This is why the statistical question and the legal question have to be aligned. The legal theory may concern breach, reliance, causation, damages, or some combination. Each requires a different comparison.

The larger lesson

The sizing-tool example is specific, but the pattern is broad. Predictive systems are often evaluated after deployment through business outcomes that are influenced by many forces besides the model. When a dispute arises, the central question is rarely whether the model was “good” in the abstract. The question is whether the system delivered what it was supposed to deliver, for the population and use case at issue, under the conditions the parties reasonably contemplated, measured against an appropriate standard.

That inquiry requires statistical care. It requires understanding the target variable, the validation design, the performance metric, the deployment environment, and the possibility of distribution shift. It also requires a practical sense of how business data are created, recorded, and transformed before they become evidence.

A high return rate may be the beginning of the story. It is not the end of the analysis.