
Despite decades of investment in statistical systems and open data initiatives, official data remains difficult to discover, interpret, and apply in practice. The challenge is no longer one of availability, but of (re)usability. This persistent gap underscores a broader paradox at the heart of contemporary data governance: data may be open, yet it remains functionally inaccessible for many intended users.

In this context, the International Monetary Fund has been a pioneer in exploring how artificial intelligence and open data can intersect to address this usability challenge. Its StatGPT: AI for Official Statistics report, by James Tebrake, Bachir Boukherouaa, Jeff Danforth, and Niva Harikrishnan, offers a timely and important contribution to this evolving conversation, pointing toward a future where AI can make official data more navigable, interpretable, and actionable.

The data challenge is no longer just about availability, but about (re)usability.

The report provides a detailed account of the friction users face across the data lifecycle. Even highly motivated users must navigate fragmented portals, inconsistent terminology, and siloed datasets, often spending significant time assembling information that should be readily accessible. 

The result is a fragmented ecosystem in which metadata and data are distributed across institutions and platforms, forcing users to navigate multiple systems and standards—and to reconstruct context—before they can assess whether the data is re-usable. 

This resonates strongly with broader observations across the open data ecosystem: access alone does not guarantee impact. Without the ability to meaningfully engage with data, openness risks becoming performative rather than transformative.

The Promise and Limits of AI-Mediated Data Access

In this context, the report turns to generative AI. At first glance, large language models appear to offer a breakthrough. They can enable conversational access to complex datasets, lowering the barriers to entry. Tasks that previously took hours—searching, filtering, downloading, and combining datasets—can now be completed in minutes. 

This aligns closely with what we have described in our work as the emerging “Fourth Wave of Open Data”: a shift from static access to dynamic, interaction-based engagement with data. This wave builds on earlier phases of open data (transparency, open-by-default publication, purpose-driven reuse) and moves toward a model in which data is accessed and reused through AI-mediated interaction.

However, the report is equally clear about the limitations of current AI approaches. 

When asked to retrieve numerical data, general-purpose models frequently produce results that are not just incorrect, but ‘reasonably incorrect’ and plausible enough to pass casual scrutiny.

When asked to retrieve numerical data, general-purpose models frequently produce results that are not just incorrect, but ‘reasonably incorrect’: values that appear plausible enough to pass casual scrutiny. This outcome is not surprising. These systems are statistical by design, built not to verify whether a number is authoritative or exact, but to predict plausible outputs from patterns in data.

This is a critical insight. In domains such as official statistics, where precision is non-negotiable, even small deviations can undermine trust and lead to flawed decisions. The report’s empirical testing—showing relatively low accuracy rates across repeated queries—underscores the risks of relying on generative AI as a source of truth. 

StatGPT and the Shift to AI as Interface

These limitations point to a deeper question about responsibility. If AI systems are not designed to distinguish between plausible and authoritative answers, then who is responsible for ensuring that users receive correct information? Should the companies behind these systems be expected to return authoritative data, or should the onus fall on data producers to ensure their data is structured in ways that AI systems can reliably access and interpret? 

In practice, the report suggests that while both sides play a role, the burden increasingly falls on data producers to ensure that official statistics can be accessed and understood correctly in an AI-mediated environment. 

This reflects a broader shift in the role of national statistical offices: from passive data producers to active stewards of trusted data. That role requires them to design outputs not only for human users but also for machine consumption, to strengthen attribution, and to engage more actively with technology platforms and AI developers.

It is this reframing, away from expecting AI to produce authoritative answers and toward ensuring that it can reliably retrieve them, that explains the logic behind the proposed solution, StatGPT.

AI becomes an interface layer, not a knowledge producer.

Rather than using AI to generate answers, the system uses AI to interpret user intent and translate it into structured queries against authoritative data sources. In other words, AI becomes an interface layer, not a knowledge producer. This distinction is subtle but profound. It preserves the usability benefits of natural language interaction while anchoring outputs in verified, source-of-truth data. 
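The interface-layer pattern can be sketched in a few lines. In the toy example below, the language model is stood in for by a simple keyword matcher: its only job is to translate a free-text question into a structured query, while the numeric answer always comes from an authoritative store, never from the model. All names here (`INDICATORS`, `OFFICIAL_DATA`, `parse_intent`) are illustrative, not part of the StatGPT system described in the report.

```python
# Sketch of "AI as interface layer": the model interprets intent;
# the authoritative source supplies the value.

# Metadata mapping everyday terms to official series (illustrative).
INDICATORS = {
    "inflation": {"series": "CPI_YOY", "unit": "percent"},
    "unemployment": {"series": "UNEMP_RATE", "unit": "percent"},
}

# Stand-in for a statistical office's source-of-truth API.
OFFICIAL_DATA = {("CPI_YOY", 2023): 3.4, ("UNEMP_RATE", 2023): 3.7}

def parse_intent(question: str) -> dict:
    """Stand-in for the LLM: map free text to a structured query."""
    for keyword, meta in INDICATORS.items():
        if keyword in question.lower():
            year = next(int(t.strip("?.,")) for t in question.split()
                        if t.strip("?.,").isdigit())
            return {"series": meta["series"], "year": year,
                    "unit": meta["unit"]}
    raise ValueError("could not map question to a known indicator")

def answer(question: str) -> dict:
    query = parse_intent(question)          # AI: interpret user intent
    value = OFFICIAL_DATA[(query["series"], query["year"])]  # source of truth
    return {"value": value, **query}

print(answer("What was inflation in 2023?"))
```

The key design choice is that the model never emits the number itself; if the structured lookup fails, the system reports a gap rather than a plausible guess.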

This approach strongly echoes the direction outlined in our own work on the Fourth Wave of Open Data. In this new phase, data is no longer something users download and analyze separately; it becomes something they interact with through intelligent systems. But for this interaction to be meaningful, data must be “AI-ready”—structured, interoperable, richly annotated, and accessible through machine-readable interfaces. 

The IMF report reinforces this point by emphasizing the need for robust APIs, improved metadata standards, and harmonized data models.

This is not simply a matter of formatting, but of ensuring that AI systems can reliably connect user intent to the correct data. For AI systems to function effectively, they must be able to map natural language queries to precise statistical concepts, something that depends on rich, consistent metadata and clearly defined data structures. Metadata plays a central role not only in locating data, but in enabling systems to assess its relevance, reliability, and comparability across sources. 

Without this, even well-designed systems can return the wrong series, units, or time frames. In practice, gaps in metadata can lead systems to make implicit assumptions—for example, equating a general concept like “inflation” with whichever indicator is most readily available—introducing subtle but significant errors. Formats designed primarily for human navigation, such as spreadsheets or static tables, further limit AI systems' ability to retrieve authoritative values, whereas APIs enable precise, structured queries that return the exact published figure.

These technical limitations, in turn, have important governance implications. 

Toward Trustworthy and Governable Data Ecosystems

As data is increasingly accessed through intermediated AI systems, questions of ownership, attribution, and accountability become more pressing. The current ecosystem already suffers from issues such as duplicated datasets, outdated versions, and unclear provenance, often obscuring the original source of the data and making it difficult for users to verify, interpret, or update what they are using. 

In many cases, data is copied and redistributed across platforms in ways that break the chain of ownership, leaving users uncertain about who produced the data or where to turn for clarification. The absence of clear standards for attribution, versioning, and ownership undermines trust in both the data and the systems that rely on it. 

The Fourth Wave of Open Data is not only about technological innovation; it is also about rethinking the institutional and governance arrangements that underpin data access.

Addressing this requires making ownership explicit. Systems must not only return data, but clearly indicate who produced it and, where possible, retrieve it directly from the original source rather than from secondary aggregators. 
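One minimal way to make ownership explicit is to have every returned value travel with provenance fields, so users can see who produced it, which release it came from, and where to verify it at the original source. The field names and example record below are illustrative, not a published standard.

```python
# Sketch: a value never travels without its provenance.
from dataclasses import dataclass, asdict

@dataclass(frozen=True)
class AttributedValue:
    value: float
    series_id: str
    producer: str    # who published the figure
    vintage: str     # which release/version of the series
    source_url: str  # canonical location at the original source

v = AttributedValue(
    value=3.4,
    series_id="CPI_YOY",
    producer="National Statistical Office",
    vintage="2024-01 release",
    source_url="https://stats.example.org/series/CPI_YOY",
)
print(asdict(v))
```

Freezing the dataclass is a small reminder of the larger principle: attribution should be inseparable from the data, not an optional footnote added downstream.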

This is where the connection to data stewardship efforts becomes particularly salient. The Fourth Wave of Open Data is not only about technological innovation; it is also about rethinking the institutional and governance arrangements that underpin data access. 

Concepts such as data stewardship, social license, and trusted intermediaries become critical in ensuring that increased accessibility does not come at the expense of legitimacy. These concepts matter because increased access alone guarantees neither public trust nor public value; legitimacy and utility both depend on whether data is shared and reused through institutions and practices that are responsible, accountable, and oriented toward solving real-world problems. 

From a policy and practice perspective, the report carries several important implications. 

First, investments in open data must now extend beyond publication toward enabling interaction. 

Second, AI adoption strategies must be grounded in an understanding of the limitations of current models, particularly in high-stakes domains. 

Third, building AI-ready data systems requires coordinated action across technical, organizational, and governance dimensions. Achieving this will require collaboration between data producers, technology providers, and standards-setting bodies to ensure that data systems and AI systems evolve together.

The challenge and opportunity lie in designing data ecosystems where usability and trust are not in tension, but mutually reinforcing.

Ultimately, StatGPT reinforces a central insight of the Fourth Wave: the future of data is not simply more openness, but more meaningful and trustworthy use. AI can play a powerful role in unlocking the value of data, but only when coupled with systems that ensure accuracy, provenance, and accountability. The challenge—and opportunity—lies in designing data ecosystems where usability and trust are not in tension, but are mutually reinforcing.

In that sense, the report is both a technical proposal and a broader call to action. It invites the official statistics community and the wider data ecosystem to move beyond access as an end in itself toward a model of data engagement fit for an AI-driven world.

Header image credit: Elise Racine & The Bigger Picture / https://betterimagesofai.org / https://creativecommons.org/licenses/by/4.0/
