Feeding the Data Deity

Photo by Sonika Agarwal on Unsplash


Recent conversations with colleagues brought me, again, to the topic of synthetic data. Their arguments were familiar and, in many cases, reasonable: lack of representative data, privacy constraints, rare events, safety-critical edge cases. Synthetic data promises relief.

Yet the more I listened, the more uneasy I became. Not because synthetic data is inherently misguided, but because of what its growing prominence reveals about a deeper belief in contemporary AI research. A belief so entrenched that it is rarely stated explicitly:

That AI systems cannot be developed without large quantities of data, and progress requires ever-growing amounts of it.

Once this belief is accepted, synthetic data becomes almost inevitable. If reality does not provide enough data, we simply manufacture more. But is this really the case? 

In this post, I look back at history. Not because “it was all better before”, but to understand how we arrived here and, more importantly, what alternatives we have moving forward.

Before data ruled the world

This belief is historically contingent. There was a time, often forgotten today, when AI systems were built from scratch. We constructed models by hand, based on explicit representations of knowledge: rules, symbols, constraints, ontologies, logics. These approaches were clumsy, brittle, narrow in scope, and painfully difficult to scale. They were prone to errors and required deep domain expertise.

But they also had virtues that are now, once again, in demand: transparency, inspectability, explicit assumptions, and explanations.
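To see what that looked like in miniature, consider a toy caricature in Python. The loan-approval domain and its rules are invented here purely for illustration, not taken from any real system: the model simply is the rules, written down by hand, and every decision carries its own explanation.

```python
# A caricature of hand-built AI: the "model" is a set of explicit,
# inspectable rules. The domain, thresholds, and rules are invented
# for illustration only.
def classify_loan(applicant):
    # Each rule encodes an explicit, human-stated assumption.
    if applicant["income"] < 20_000:
        return "reject", "rule: income below minimum threshold"
    if applicant["debt_ratio"] > 0.5:
        return "reject", "rule: debt ratio above 0.5"
    if applicant["years_employed"] >= 2:
        return "accept", "rule: stable employment"
    return "review", "no rule fired decisively"

decision, explanation = classify_loan(
    {"income": 45_000, "debt_ratio": 0.2, "years_employed": 3})
print(decision, "--", explanation)  # accept -- rule: stable employment
```

Clumsy and narrow, certainly. But every assumption is visible, and every output can be traced back to a rule someone wrote and can argue with.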

This is not an argument for returning to “good old-fashioned AI”. It is a reminder that data-driven learning was not always the default, and that its dominance is the result of a technological shift, not a fundamental law.

From building models to extracting them

When data and computation made large-scale learning feasible, the field pivoted. We stopped building models and started deriving them. Patterns discovered in data became the model itself. This shift was extraordinarily successful. The current AI boom is inseparable from it.

But it also subtly reframed how we think about AI development. Data became the primary raw material; models became products of data pipelines. We no longer ask what structure or assumptions a model should embody, but whether we have enough data to let the structure emerge.

In both paradigms, however, one thing remains constant: what ultimately matters is the resulting model. The model is what gets deployed, evaluated, trusted, or blamed.

Which brings me back to synthetic data.

The synthetic data paradox

To generate synthetic data, we need a model of the phenomenon we care about. That model, explicit or implicit, captures assumptions about distributions, correlations, constraints, and causal structure. We then use the synthetic data produced by that model as input to a learning algorithm, in order to train another model: the one we intend to deploy.
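To make the indirection concrete, here is a deliberately toy sketch. Everything in it is invented for illustration: a hand-specified Gaussian generator stands in for “the model of the phenomenon”, and a logistic regression stands in for “the model we intend to deploy”.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(seed=0)

# --- Model 1: the generator ---
# A hand-specified model of the phenomenon: two classes, each assumed
# to follow a Gaussian with a chosen mean and covariance. Every number
# here is an explicit assumption about distributions and correlations.
def generate_synthetic(n_per_class=500):
    X_pos = rng.multivariate_normal(mean=[2.0, 2.0],
                                    cov=[[1.0, 0.3], [0.3, 1.0]],
                                    size=n_per_class)
    X_neg = rng.multivariate_normal(mean=[-2.0, -2.0],
                                    cov=[[1.0, -0.2], [-0.2, 1.0]],
                                    size=n_per_class)
    X = np.vstack([X_pos, X_neg])
    y = np.concatenate([np.ones(n_per_class), np.zeros(n_per_class)])
    return X, y

# --- Model 2: the one we intend to deploy ---
# Trained purely on the output of Model 1, so at best it can
# rediscover the assumptions we just wrote down above.
X_syn, y_syn = generate_synthetic()
deployed_model = LogisticRegression().fit(X_syn, y_syn)
print(deployed_model.coef_, deployed_model.intercept_)
```

Notice what happens: the decision boundary the deployed model learns is already implicit in the means and covariances written into the generator. The learner can, at best, rediscover assumptions we stated ourselves.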

This raises a simple but uncomfortable question:

If we already have a model capable of generating the relevant data, why is that model not already sufficient?

Of course, there are answers. The generative model may be incomplete, approximate, or used only to augment sparse real-world data. The learning algorithm may impose inductive biases we value. The separation of models may offer practical convenience.

But the paradox remains: synthetic data introduces an extra layer of indirection whose necessity is rarely examined. We create a model to generate data so that we can train a model. Because our methodology insists that models must come from data.

Data as offering

At this point, synthetic data starts to look less like a technical solution and more like a ritual.

The dominance of data-driven thinking has produced an ecosystem where:

  • progress is measured in dataset size,
  • credibility is linked to scale,
  • and methodological legitimacy flows from the presence of “enough data”.

When reality fails to provide the required offerings, we manufacture them. The pipelines continue to run. The deity is fed.

This is not a rejection of data. Data matters. Synthetic data has valid uses, particularly for testing, robustness, privacy preservation, and controlled experimentation.

What deserves scrutiny is the all-consuming idea that data is always the core epistemic ingredient. That more data is always the answer, even when we already possess structured understanding.

Anticipating the obvious objections

“But hybrid approaches exist.” Yes, and they are among the most promising directions in the field. Yet even hybrids often treat structured knowledge as a way to improve data efficiency, rather than as a first-class modelling choice. The hierarchy remains intact.

“But synthetic data improves coverage.” Coverage of what, exactly? If the synthetic data generator encodes the same blind spots, assumptions, or simplifications as the target model, coverage may be illusory. More samples do not necessarily mean more understanding.

“But learning from data avoids human bias.” Only if we forget that all data is constructed: the result of human decisions about what is observed, when and where it is recorded, how often it is collected, how it is cleaned, maintained, and ultimately used. Bias does not disappear when it is delegated to data; it merely becomes harder to see.

Feet of clay

The deity of data-driven AI has been extraordinarily generous. It has delivered capabilities that were once unimaginable. But its authority rests on a belief that has hardened into dogma.

Synthetic data exposes the cracks.

It reminds us that models do not magically emerge from data; they are always shaped by assumptions, structure, and intent. Sometimes we already know more than our rituals allow us to admit.

Perhaps it is time to look again at what we are really building, and to notice that the deity we serve may, after all, have feet of clay. 

