
A team of researchers at MIT has highlighted a fundamental shortcoming in the way machine-learning models are evaluated and deployed. In a paper published in Nature Communications, the group argues that standard evaluation practices, which typically report a model’s average performance across a test dataset, can mask serious failures when models encounter new or different data, MIT News reports. They found that selecting the model with the highest average score during training does not guarantee good performance in practice; in fact, that “best” model may perform worse than alternatives on a large portion of new instances. In one set of experiments, models chosen for their high overall metric scores performed poorly on 6–75% of out-of-training data, even when trained on large datasets.
This discovery raises questions about the trustworthiness of average metrics such as mean accuracy, AUC, or error rate when models are deployed in real-world settings that differ from the training environment. Many machine-learning systems, from medical diagnosis to loan approval and predictive policing, rely on aggregate metrics that may obscure how poorly the model performs on specific subgroups or under uncommon conditions. When a model optimized for average performance fails in edge cases, the consequences can range from frustrated users to dangerous biases or unsafe decisions.
To address this, the MIT team suggests shifting focus from single summary scores toward richer, more granular performance measures that account for distributional changes and hidden correlations. These could include detailed performance breakdowns by data subgroup, uncertainty quantification, and evaluation protocols designed to mimic deployment conditions rather than static test sets.
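As a rough illustration of what a performance breakdown by data subgroup might look like, the Python sketch below computes overall accuracy alongside accuracy for each subgroup. It is not the researchers' method: the function name `accuracy_by_group`, the subgroup labels "A" and "B", and the toy data are all invented for this example.

```python
from collections import defaultdict


def accuracy_by_group(y_true, y_pred, groups):
    """Return (overall_accuracy, {group: accuracy}) for parallel sequences."""
    if not (len(y_true) == len(y_pred) == len(groups)):
        raise ValueError("inputs must have the same length")
    if not y_true:
        raise ValueError("inputs must be non-empty")

    correct = defaultdict(int)  # correct predictions per subgroup
    total = defaultdict(int)    # examples seen per subgroup
    for truth, pred, group in zip(y_true, y_pred, groups):
        total[group] += 1
        correct[group] += int(truth == pred)

    overall = sum(correct.values()) / len(y_true)
    per_group = {g: correct[g] / total[g] for g in total}
    return overall, per_group


# Toy data, invented for illustration: 8 examples from subgroup "A", 2 from "B".
y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 1]
y_pred = [1, 0, 1, 1, 0, 1, 0, 0, 0, 0]
groups = ["A"] * 8 + ["B"] * 2

overall, per_group = accuracy_by_group(y_true, y_pred, groups)
print(f"overall accuracy: {overall:.2f}")
for group, acc in sorted(per_group.items()):
    print(f"  subgroup {group}: {acc:.2f}")
```

On this toy data the overall accuracy is 0.80, yet every example in subgroup "B" is misclassified. The average alone would hide exactly the kind of failure the article describes.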
The researchers stress that these shortcomings aren’t just academic. They affect how models behave when introduced to new populations or environments, a common situation in fields such as healthcare, finance, and autonomous systems. By moving beyond overly aggregated metrics and embracing evaluation strategies that reflect real-world complexity, developers can build more reliable and trustworthy AI systems.