
AI is hungry. Like Audrey II in Little Shop of Horrors, modern models feed on data. The breakthroughs of the past decade didn’t appear out of thin air; they grew from public choices to make government and other information legally and technically reusable. 

That decision unlocked enormous value, and it gives us leverage now. As companies run low on usable web text to train the next generation of models, the public sector holds a resource that can both spur innovation and steer it toward public purpose: open, well-governed public data.

The hidden scaffolding behind AI’s leap

This week’s Open Government Partnership Summit in Spain celebrates over fifteen years of government open data policy and a commitment by public organizations at every level to open and publish the data they collect. 

Today’s major models learned to read and reason on the back of public datasets: regulatory filings, scientific repositories, legislative records, and vast crawls of the open web. US Patent and Trademark Office data, which we worked to publish as part of the Open Government Initiative under the Obama Administration, were core to large language model training (at least for the open models researchers could scrutinize). Those corpora didn’t materialize by accident. They were assembled because governments declared information a public asset and invested in publishing it in usable form.


When we launched data.gov with 47 datasets in 2009, the goal of our open government effort was both to make the government more transparent and to spur innovation by researchers, entrepreneurs, and members of the public using data collected and paid for by taxpayer dollars. 

If we want AI that helps solve hard problems—drafting clearer laws, expanding access to services, improving health and climate resilience—we need to accelerate investment in open data and do so in ways that spur the development of the AI we want.


Data tied to purpose

As The GovLab’s Open Data Policy Lab notes: open and accessible government data can “improve the quality of the generative AI output but also help expand generative AI use cases and democratize access to open data.” 

By opening up their own data, governments can build new and better services for all of us.  In Indiana, for example, the Indiana.gov assistant sits on top of agency documents and databases. Instead of sending residents spelunking through PDFs, it answers questions in plain language and points to the right form or program, thanks to the underlying information that has been cleaned up and exposed in ways a model can use.

South Korea’s AI Hub has already provided millions of records to train applications like TTCare, a mobile application that analyzes eye and skin disease symptoms in pets. The app’s AI model was trained on roughly one million records, about half of them from the South Korean government’s AI Hub.

In Abu Dhabi, Bayaan ingests official statistics and lets policymakers pose natural-language questions—“How did youth unemployment change after 2022?”—and get back charts and citations that trace every claim to the source. The output is useful because it’s accountable to public data.

With secure access to administrative data from the UK Biobank and validated against Denmark’s national health records, European researchers built the Delphi-2M model. Delphi doesn’t just predict whether someone might get cancer or diabetes; it can simulate the course of more than a thousand diseases over a lifetime. That kind of leap is only possible because governments invested in collecting, standardizing, and securely sharing their data.

Helsinki goes a step further. Beyond publishing datasets, the city keeps an AI Register—a public catalog of which municipal systems use AI, what data they rely on, what decisions they influence, and who is responsible. It’s a plain promise of visibility and recourse: residents can see how tools that affect them are governed.

America’s opportunity, and its bottleneck

The United States sits on vast, high-quality information, from the Congressional Record and the Federal Register to state and municipal collections. Yet too much of it remains locked in PDFs or spread across portals. If we want AI to do real public work, “AI-ready” has to become part of our basic digital plumbing.

That means using AI itself to clean and de-duplicate, anonymizing where needed, publishing machine-readable versions with clear licenses, and, crucially, keeping high-value corpora current and backed up.
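To make those steps concrete, here is a minimal sketch of what "AI-ready" preparation can look like in practice: normalize text, drop near-verbatim duplicates, redact an obvious identifier, and emit machine-readable records with explicit license metadata. The field names, the sample records, and the choice of CC0 are illustrative assumptions, not a prescribed standard; real pipelines need far more careful privacy review than a single regex.

```python
import hashlib
import re

def normalize(text: str) -> str:
    # Collapse whitespace and case so trivial formatting
    # differences don't defeat de-duplication.
    return re.sub(r"\s+", " ", text).strip().lower()

def prepare(rows, license_id="CC0-1.0"):
    """De-duplicate records, redact email addresses, and yield
    machine-readable dicts carrying explicit license metadata."""
    seen = set()
    for row in rows:
        key = hashlib.sha256(normalize(row["text"]).encode()).hexdigest()
        if key in seen:  # skip near-verbatim duplicates
            continue
        seen.add(key)
        # Crude PII redaction for illustration only: strip email addresses.
        redacted = re.sub(r"[\w.+-]+@[\w-]+\.[\w.-]+", "[email removed]", row["text"])
        yield {"id": key[:12], "text": redacted, "license": license_id}

# Hypothetical sample records standing in for an agency export.
records = [
    {"text": "Permit applications are due  March 1."},
    {"text": "permit applications are due March 1."},  # duplicate after normalization
    {"text": "Contact clerk@example.gov with questions."},
]
cleaned = list(prepare(records))
```

The de-duplication key is a hash of the normalized text, so cosmetic differences (extra spaces, capitalization) collapse into one record while the original wording of the kept copy is preserved.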

Volume and diversity of data matter for innovation, but curation is what makes models useful in public life. Feed a model the right bill texts, agency manuals, and city forms and it starts “speaking the rules,” producing summaries, drafts, and answers that match local practice instead of hallucinating it. 

In New Jersey, for example, we used the data the state collects from training providers together with open federal labor market data to create personalized career-pathway tools. In Massachusetts, AI for Impact students are building grant-writing tools with the help of open grants data, which they use to fine-tune their model.

None of this requires a moonshot, just focusing on curating the AI we need and the data required to get us there. If Congress, courts, and agencies published consistently structured texts, a public model fine-tuned on the Congressional Record could let any resident query legislative history in plain English, with citations. Mining archived rulemaking comments could surface expertise agencies routinely miss. Combining county budgets, zoning codes, transit schedules, and health resources could produce local assistants that guide residents through concrete choices—permits, routes, services—with multilingual support. 

Public inputs, public returns

Openness should serve institutions as well as innovators. If companies draw heavily on taxpayer-funded datasets, the public should see tangible returns: practical reciprocity tied to access.

Just as companies sponsor park cleanups, they should be able to support the hard work of preparing public data: cleaning, documenting, and publishing it in usable formats. But the data itself isn’t for sale. Open data is a shared public asset, and any private funding must come with no exclusivity, no early-access privileges, and no strings that limit reuse. The only acceptable outcome of private support is better public data: freely available to everyone, under open licenses, with clear provenance and privacy safeguards.

Contributions should be governed by public MOUs: no exclusivity clauses; open licenses; publish-all improvements (code, schemas, QA reports); independent privacy review; and agency control over priorities so investment doesn’t skew what gets opened.

Where a model is fine-tuned on public corpora, the developer should also grant a royalty-free civic license for the specific capability that public data enables. 

Where governments host secure enclaves for sensitive records, vendors can accept independent audits and publish documented benchmarks. Where adapters or evaluation suites are developed with public inputs, they can be contributed back so improvements persist even if a vendor changes. And where agencies face compute constraints, providers can dedicate a slice of capacity for public-interest uses. The point isn’t to add hoops; it’s to translate public inputs into public benefits.

Build the pipes and keep them public

Treat open data like roads or power lines: shared infrastructure that others can build on. That means investing in stewardship, the unglamorous work of cleaning, documenting, versioning, and keeping APIs up.

Low-risk datasets—laws, budgets, schedules, manuals, non-identifying statistics—should be prepared for training and benchmarking and made broadly available.

Sensitive administrative records should stay inside secure research environments where code goes to the data, not the other way around; training runs and evaluations are monitored, access is logged, and outputs are vetted before anything leaves. 

Synthetic data can help teams prototype without touching real records. South Korea hosts 87,000 open public datasets and generates synthetic data that mirrors real-world patterns while protecting privacy. But synthetic data should complement, not replace, governed access, and findings should be validated against protected sources before they shape policy or services.

It means creating multi-state data collaboratives with common schemas, so that adapters and benchmarks can travel between jurisdictions.

And it means measuring what works—accuracy, parity across languages and reading levels, citation support, robustness—and sharing results so the floor rises for everyone.

Feed the beast carefully

The first wave of AI rode on public data. The next wave can renew the bargain. If public information fuels private innovation, then private innovation should strengthen public capacity in exchange. That’s not only fair; it’s how we align AI with democracy’s needs.

This isn’t about transparency for the sake of it. It’s about building a data ecosystem that makes democratic AI possible: tools that explain the law in plain language, widen participation, help institutions see patterns in what people ask for, and improve how services get delivered. 

Image: Handmade Baby Audrey 2 Plant from Little Shop of Horrors by Jamie (via Wikimedia Commons), licensed CC BY-SA 2.0.
