Governments and companies are beginning to use AI models to simulate engagement with real communities, forecasting how “rural voters” might react to climate policy or how neighborhoods might respond to zoning changes. But this raises a foundational question:

How do we know when a model is genuinely representing a population, rather than producing a fluent stereotype?

The Collective Intelligence Project’s Evan Hadfield is seeking to answer that question through a project called the Digital Twin Evaluation Framework (DTEF). Although “digital twin” most often describes a virtual replica of a physical system, here it refers to virtual stand-ins for public attitudes, which some researchers call “silicon samples.”

The authors have not yet published the full framework or its methods. Instead, they are sharing the project through a series of posts that outline their proposed approach to evaluating how effectively AI models simulate public opinion.

At its core, the DTEF asks whether AI models can reflect the distribution of opinions within a demographic group, rather than merely the median view or a learned pattern.

What the Framework Will Test

DTEF draws on CIP’s Global Dialogues, a dataset of deliberations and surveys about people’s attitudes toward AI.

In the evaluation, a group of people responds to a hypothetical question. The AI model is given the group’s demographic profile and examples of their responses to previous questions, and then asked to predict the group’s opinion distribution on the same question. The model’s prediction is compared with the real distribution of human responses to assess how well it represents the group.
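To make that comparison concrete, here is a minimal sketch of how a predicted opinion distribution might be scored against the real one. The metric (Jensen-Shannon distance), the five-point answer scale, and all names below are illustrative assumptions, not CIP's published method.

    # Illustrative sketch only: scoring a model's predicted opinion distribution
    # for one group against the real distribution of human responses. The metric
    # and the 5-point answer scale are assumptions, not CIP's published method.
    import numpy as np
    from scipy.spatial.distance import jensenshannon

    OPTIONS = ["strongly oppose", "oppose", "neutral", "support", "strongly support"]

    def response_distribution(responses):
        """Turn raw answers into a normalized frequency vector over OPTIONS."""
        counts = np.array([responses.count(o) for o in OPTIONS], dtype=float)
        return counts / counts.sum()

    def representation_score(predicted, actual):
        """1 minus Jensen-Shannon distance (base 2): 1.0 = identical, 0.0 = disjoint."""
        return 1.0 - jensenshannon(predicted, actual, base=2)

    # Model's predicted split for the group (hypothetical output) vs. observed answers.
    predicted = np.array([0.05, 0.15, 0.30, 0.35, 0.15])
    human = response_distribution(
        ["strongly oppose"] * 7 + ["oppose"] * 18 + ["neutral"] * 35
        + ["support"] * 28 + ["strongly support"] * 12
    )
    print(f"representation score: {representation_score(predicted, human):.3f}")

A real evaluation would repeat this comparison across many questions and demographic groups and report where scores drop, flagging the populations a model struggles to represent.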

The DTEF tests whether a model can mirror real opinion patterns rather than rely on learned assumptions.

The framework aims to produce performance scores showing where a model succeeds, where it fails, and which populations it struggles to represent: in effect, a stress test for representational reliability. CIP suggests that policymakers, developers, and civic organizations could eventually use these results to judge when synthetic data might be trustworthy and when a model is likely to produce overconfident or inaccurate simulations of human attitudes.

Importantly, DTEF measures how well a model represents a group, but it does not address when, whether, or under what conditions such synthetic data should be used in policymaking.

Reflections on the Upcoming Framework

These reflections are not the authors' conclusions, but rather governance questions the framework raises for policymakers.

1. What Counts as “Representative Enough”?

Institutions lack a shared benchmark for what makes a model “representative enough” to inform real decisions. Even when models are tested, reviews often measure technical performance rather than whether predictions align with actual human patterns. DTEF highlights this gap but cannot, on its own, resolve the broader question of when a synthetic public should be trusted at all.

2. When Is Synthetic Input Appropriate?

AI-generated public opinion is often promoted as a way to reduce the cost of engagement, and its appeal is obvious: it is fast, inexpensive, and repeatable. But without clear limits, synthetic inputs risk substituting for genuine participation.

If DTEF reveals uneven performance across groups, it will underscore how little guidance policymakers currently have. When should a model supplement engagement? What verification should precede its use? And where should synthetic publics never replace real people?

Synthetic publics will not fail loudly. They will fail confidently and persuasively.

3. What Makes Synthetic Representation Legitimate?

Silicon samples are optimized for prediction, not democratic fairness. Even a highly representative model may not be the right tool. Many decisions require input from the people most affected, rather than from a statistically representative population.

Representativeness is only one dimension of legitimacy and is often not the most important.

Why These Questions Matter

Digital twins represent the early architecture of a potential alternative representational infrastructure. Once agencies begin using them to pressure-test rules or allocate resources, the model becomes a proxy for the public, shaping decisions quietly, before anyone notices the shift.

The risk is drift: AI systems becoming default decision-makers because they are convenient, not because they are legitimate.

Synthetic consultation can edge out slower, contested forms of genuine engagement, especially when budgets are tight and political timelines are short. And because "digital twins" inherit the worldview of their training data, they may amplify the perspectives of the digitally visible while excluding the communities most affected by policy.

To safeguard legitimacy, policymakers will need to ensure that communities can see, challenge, and correct how they are represented; that models are validated against real population data rather than internal benchmarks; that synthetic publics do not replace statutory public input; and that agencies treat model outputs as signals rather than substitutes for democratic voice.

As digital twins of human behavior and opinion move from experiments to infrastructure, policymakers should be asking:

  • Who controls synthetic publics?

  • Who benefits from their use?

  • What kind of democratic future is being built in our name?

If we don’t set the terms under which AI represents us now, we risk inheriting a future in which institutions answer to synthetic publics rather than real ones.
