The Synthetic Data Paradox
Why fabricated customer information may create more problems than it solves
Google Gemini’s training corpus contains approximately 30 trillion tokens, the equivalent of 300 million novels. This staggering figure represents both an achievement and a predicament for artificial intelligence companies. As Yann LeCun noted earlier this year at a conference hosted by the chipmaker Nvidia, the industry has essentially “read the whole internet.” The response to this data drought has been synthetic data: algorithmically generated information that mimics real-world patterns without requiring actual human sources.
The appeal is obvious. Synthetic data sidesteps privacy regulations by avoiding personally identifiable information entirely. Organizations can take a small training set of real data, then expand it exponentially without navigating the legal and logistical complexities of collecting, storing, and protecting customer information. For companies with tightly defined business models and limited scenarios to explore, this represents a genuine efficiency gain. Speed to market accelerates; compliance headaches diminish.
Yet the solution contains a fundamental flaw that becomes apparent at scale. The problem organizations once faced was scarcity: how to acquire enough data to make meaningful decisions. Synthetic data inverts this challenge, creating what Ishmael Interactive CEO Ana Monroe calls an “embarrassment of riches.” The new bottleneck is sense-making. When an AI system can generate tens of thousands of fictional customer scenarios, who consumes them? The answer, typically, is another AI system, whose output humans must then interpret in an attempt to extract actionable insights.
This recursive loop exposes deeper issues. Synthetic data does not eliminate organizational blind spots; it amplifies them. Like a television series that spawns increasingly derivative spin-offs, synthetic data perpetuates whatever assumptions and biases existed in the original training set. If a company has not identified why something is not working, fabricated data will not reveal it. Instead, the organization accumulates vast quantities of information that validate existing hunches and justify current approaches.
The individualization problem compounds these concerns. AI tools are fundamentally solitary instruments. An employee working with synthetic data operates alone at a desk, querying an opaque system that cannot explain its reasoning when processing millions of data points. This stands in direct opposition to decades of management orthodoxy emphasizing cross-functional collaboration and breaking down silos. Synthetic data does not facilitate organizational learning; it enables individual confirmation bias at industrial scale.
The economics become questionable at a certain point. Training large language models is extraordinarily expensive, and the marginal returns from expanding a corpus from 20 trillion to 30 trillion tokens are unclear—and likely small. If humans cannot consume the output volume and must employ additional AI systems to interpret results, when does hiring imaginative people who can articulate their reasoning become more cost-effective than continuously running computational models?
The answer suggests a balanced approach: deploy synthetic data where it genuinely accelerates hypothesis testing and scenario planning, but maintain investment in traditional human research practices. Organizations require both efficiency and insight. Synthetic data offers the former; conversations with actual customers provide the latter. As we concluded in our analysis on the CX Pod, sometimes the most effective and economical solution remains remarkably simple: go talk to somebody.
Ishmael Interactive subscribers get our business intelligence delivered every Tuesday, directly to their inboxes. Subscription gets you:
Early access to the weekly pod
Accompanying articles and team analysis providing context and commentary on the interview
Access to the entire Ishmael Interactive back catalogue of CX and business intelligence publications
Special access, offers, and discounts on Ishmael Interactive’s upcoming products and services
Not quite ready to subscribe directly?
Keep up with us here on Substack, where we publish the CX Pod weekly!