
Synthetic Data in AI: What Works and What Doesn’t

Sep 4, 2025

MIT’s Kalyan Veeramachaneni on the gains, pitfalls, and safeguards you need.
Source: MIT News.

Synthetic data isn’t made up at random; it is algorithmically generated to mirror the statistical traits of real-world data without containing any actual records. In 2024, an estimated 60% of the data used in AI was synthetic, and that share is expected to grow, according to MIT News.
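To make the idea concrete, here is a minimal sketch of that "mirror the statistics, not the records" principle. The dataset and column names are invented for illustration; real generators (copulas, GANs, and other deep generative models) also capture correlations between columns, while this toy version only matches each column's mean and spread.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical "real" dataset: two numeric columns (transaction amount, age).
real = np.column_stack([
    rng.lognormal(mean=3.0, sigma=0.5, size=1000),  # transaction amount
    rng.normal(loc=40, scale=12, size=1000),        # customer age
])

def synthesize(real_data, n_samples):
    """Sample synthetic rows from per-column Gaussians fitted to real data.

    This is a deliberately simple stand-in for a generative model: it
    matches each column's mean and standard deviation, but produces no
    copy of any real row.
    """
    mu = real_data.mean(axis=0)
    sigma = real_data.std(axis=0)
    return rng.normal(mu, sigma, size=(n_samples, real_data.shape[1]))

synthetic = synthesize(real, 1000)

# Column means of the synthetic data track the real data closely.
print(np.abs(real.mean(axis=0) - synthetic.mean(axis=0)))
```

The synthetic rows are statistically faithful in aggregate, yet none corresponds to an actual customer, which is exactly what makes the data safe to share.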

Its strengths are clear. Because synthetic data contains no real identifiers, it sidesteps the privacy issues that come with sensitive datasets. Building AI models becomes faster, cheaper, and safer, especially in testing environments where real data, such as customer transactions, is locked behind firewalls. Generative models also enable custom data creation, say, simulating Ohio customers who made a purchase in March, and support stress testing at scale: you can generate a billion transactions to check system performance. And when real data is scarce, as with rare events like fraud, synthetic data helps fill the gaps, boosting model accuracy or serving as a stand-in when collecting real data isn’t viable.
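The "Ohio customers in March" idea can be sketched as conditional generation. Everything below is hypothetical (field names, value ranges, the rejection-sampling loop); production generators typically condition during sampling rather than filtering afterward, but the filtering version is the easiest to read.

```python
import random
from datetime import date

random.seed(0)

STATES = ["OH", "CA", "NY", "TX"]

def make_transaction():
    """Generate one synthetic customer transaction (all fields invented)."""
    return {
        "state": random.choice(STATES),
        "purchase_date": date(2025, random.randint(1, 12), random.randint(1, 28)),
        "amount": round(random.lognormvariate(3.0, 0.5), 2),
    }

def generate(n, condition=lambda row: True):
    """Rejection-sample n synthetic rows satisfying an arbitrary condition."""
    rows = []
    while len(rows) < n:
        row = make_transaction()
        if condition(row):
            rows.append(row)
    return rows

# "Ohio customers who made a purchase in March":
ohio_march = generate(
    100,
    condition=lambda r: r["state"] == "OH" and r["purchase_date"].month == 3,
)
```

The same `generate` call with `n=1_000_000_000` and no condition is the stress-testing case: the generator is the only bottleneck, not data collection.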

That said, synthetic data comes with caveats. Trust hinges on task-specific evaluation: you must test whether models trained on synthetic data still make valid predictions on real data. Bias is a risk too; any bias in the original data carries over unless you deliberately correct for it during generation. To address these risks, Veeramachaneni’s team developed a Synthetic Data Metrics Library, a suite of tools for measuring statistical quality, privacy, and model-level efficacy, providing checks and balances at every step.
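One widely used task-specific check is "train on synthetic, test on real" (TSTR): fit a model only on synthetic data, then score it on held-out real data. The sketch below uses an invented toy dataset and a plain least-squares linear classifier; it is not the API of the metrics library described above, just an illustration of the evaluation pattern.

```python
import numpy as np

rng = np.random.default_rng(1)

def make_data(n):
    """Two-class toy data: the class label shifts the feature mean."""
    y = rng.integers(0, 2, size=n)
    X = rng.normal(loc=y[:, None] * 2.0, scale=1.0, size=(n, 2))
    return X, y

# Real data is held out for evaluation; the "synthetic" set stands in for
# the output of a generative model fitted to the real data.
X_real, y_real = make_data(500)
X_syn, y_syn = make_data(500)

def fit_linear(X, y):
    """Least-squares linear classifier: predict 1 when X @ w exceeds 0.5."""
    Xb = np.column_stack([X, np.ones(len(X))])  # append a bias column
    w, *_ = np.linalg.lstsq(Xb, y, rcond=None)
    return w

def accuracy(w, X, y):
    Xb = np.column_stack([X, np.ones(len(X))])
    return float(((Xb @ w > 0.5) == y).mean())

# Train on synthetic only, test on real: the TSTR score.
w = fit_linear(X_syn, y_syn)
print(f"TSTR accuracy on real data: {accuracy(w, X_real, y_real):.2f}")
```

If the TSTR score is close to what the same model achieves when trained on real data, the synthetic set is useful for that task; a large gap is the red flag the paragraph above warns about.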

Synthetic data lowers cost, boosts privacy, and scales testing. But only when paired with rigorous evaluation, bias handling, and task-specific validation does it deliver on that promise without undermining performance.