
Engaged California is a state program that gives Californians a unique opportunity to share their thoughts and connect with others on topics that matter to them. It creates new opportunities for Californians to connect with their government to inform and shape policy through honest, respectful discussions.

When the California Office of Data and Innovation’s Engaged California team launched its second public engagement last fall—this time asking state employees how to make government more efficient—we expected a lot of unstructured, qualitative input.

What we didn’t fully anticipate was how much the analysis process itself would teach us: not just how AI would be helpful in analyzing the results, but where it can steer you wrong if you’re not careful.

Over 10 weeks in 2025, 1,469 employees participated, leaving 2,477 comments full of ideas for making government work more efficiently.

That’s a lot of signal. Here’s how we worked through it, where AI helped, and where humans had to stay in the loop to catch errors and build trust in the results.

Note: If you want to dive deeper into our work, you’ll notice links throughout this post to our GitHub repository, where you can view the SQL, dbt, and Python code (including specific large language model prompts) we used to transform and analyze the data.

We used AI for two different jobs

AI played two distinct roles for us: one during the engagement, and another after it closed.

During the engagement, we built a Streamlit dashboard that enabled staff members to write their own prompts to explore the comments using a large language model (LLM). It let our team and leadership ask plain-language questions about the live stream of comments—things like:

  • What themes are popping up this week?

  • How often are people bringing up training?

That gave us an early signal and helped us head into the final analysis phase with some grounded hypotheses.
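The dashboard code itself lives in the GitHub repo; as a minimal sketch of the core idea (function and variable names here are hypothetical), the app just needs to assemble a staff member’s plain-language question and a batch of comments into a single LLM prompt:

```python
# Hypothetical sketch of the prompt-assembly step behind the dashboard.
def build_exploration_prompt(question: str, comments: list[str]) -> str:
    """Combine a plain-language question with comment text for the LLM."""
    numbered = "\n".join(f"{i + 1}. {c}" for i, c in enumerate(comments))
    return (
        "You are analyzing public-engagement comments from state employees.\n"
        f"Question: {question}\n\n"
        "Comments:\n"
        f"{numbered}\n\n"
        "Answer the question using only the comments above."
    )

if __name__ == "__main__":
    # In a Streamlit app, the question would come from a text input and
    # the LLM's answer would be rendered back to the page, e.g.:
    #   import streamlit as st
    #   question = st.text_input("Ask about the comments")
    #   st.write(call_llm(build_exploration_prompt(question, comments)))
    demo = build_exploration_prompt(
        "What themes are popping up this week?",
        ["We need better training.", "Procurement takes too long."],
    )
    print(demo.splitlines()[0])
```

The real dashboard let users write their own prompts rather than hard-coding one, but the shape of the call is the same: question plus comments in, summary out.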

After the engagement closed, the real work started: synthesizing 2,477 comments into a final report the audience could actually navigate.

The pivot: from “top solutions” to “themes.”

Our initial goal sounded straightforward: identify the top solutions employees proposed.

People wrote long, layered responses. A single comment could include multiple problems, lots of context, one or more policy ideas, and a personal story. To count the most popular ideas, we decided to use AI to extract discrete problem-solution statements from each comment. 

To pull shorter, more focused solution statements from longer comments, we used Snowflake Cortex with a custom extraction prompt.
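The actual prompt is in the GitHub repo; as a rough sketch of what a Cortex extraction call looks like in SQL (the model choice, prompt wording, and table name below are illustrative, not ours), `SNOWFLAKE.CORTEX.COMPLETE` takes a model name and a prompt string:

```sql
-- Illustrative sketch only; the real prompt and pipeline are in the repo.
SELECT
  comment_id,
  SNOWFLAKE.CORTEX.COMPLETE(
    'llama3.1-70b',  -- model name is illustrative
    CONCAT(
      'Extract each discrete problem-solution statement from the comment ',
      'below as a short, self-contained sentence. Comment: ',
      comment_text
    )
  ) AS extracted_solutions
FROM employee_comments;  -- table name is illustrative
```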

Even with these shorter statements focused on specific actions and ideas, we couldn’t get to clean, matchable items that clustered at a meaningful scale—most “similar enough” solutions only showed up a handful of times.

So we changed the goal. Instead of forcing a definitive ranking of specific ideas that employees wanted most, we focused on organizing what they said so people could explore it for themselves.

That decision shaped our final output: 2,627 extracted ideas categorized into 10 themes and 65 subthemes—grouped for browsability and to assess the prevalence of broader themes.

Where humans stayed in the loop

AI did the heavy lifting at scale, but it didn’t run unattended. We built human checkpoints on purpose.

Before we labeled anything with AI, our User Research team hand-categorized a random sample of 100 comments to build the initial taxonomy.

Researchers worked together, hand-labeling comments, debating category names, and checking each other’s labels until they had shared definitions they trusted.
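The post doesn’t say how the researchers quantified agreement while checking each other’s labels, but a common way to do that between two labelers is Cohen’s kappa, which corrects raw agreement for chance. A pure-Python sketch:

```python
# Hypothetical sketch: chance-corrected agreement between two raters
# who each labeled the same comments. Not necessarily the metric the
# team used; shown to illustrate label-checking between researchers.
from collections import Counter

def cohens_kappa(labels_a: list[str], labels_b: list[str]) -> float:
    """Cohen's kappa for two raters labeling the same items."""
    assert len(labels_a) == len(labels_b) and labels_a
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    # Agreement expected by chance, given each rater's label frequencies.
    expected = sum(
        (counts_a[c] / n) * (counts_b[c] / n)
        for c in set(counts_a) | set(counts_b)
    )
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1 - expected)
```

Values near 1 mean the shared definitions are working; values near 0 mean the raters agree no more than chance, a signal the category definitions need another debate.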

That dataset became our reference point. With the category names and their definitions in hand, we used Snowflake’s AI_CLASSIFY to apply them across the full dataset of comments.
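A sketch of what that classification step can look like in SQL (the category names shown are an illustrative subset, not our actual taxonomy; the real categories, definitions, and examples are in the repo). `AI_CLASSIFY` accepts the text, a list of candidate labels, and an optional config object:

```sql
-- Illustrative sketch only; see the GitHub repo for the real call.
SELECT
  comment_id,
  AI_CLASSIFY(
    comment_text,
    ['Training and development', 'Procurement', 'Technology and tools'],
    {'output_mode': 'multi'}  -- allow multiple subthemes per comment
  ):labels AS subthemes
FROM employee_comments;  -- table name is illustrative
```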

You can see that work here.

Once we had a fully labeled dataset, we set up a QA loop: first with data engineers, who ran a battery of tests and random-sample checks on the AI-labeled data, then with the UX researchers.

In each iteration of the QA loop, we identified specific examples of things the AI got wrong—whether a single mislabeled comment or an emerging pattern, such as “this particular label gets applied incorrectly a lot; let’s figure out why.”

One example stood out: most comments received between one and four subthemes, but when the AI couldn’t figure out how to label a particular comment, it would sometimes apply all 65 subthemes.
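This failure mode is exactly the kind of thing a simple automated check can surface. A sketch of one (function and threshold are hypothetical, not our pipeline code): flag any comment whose label count falls outside the expected range.

```python
# Hypothetical QA check: a comment tagged with far more subthemes than
# expected is a signal the model couldn't decide, not that the comment
# really spans that many topics.
def flag_over_labeled(labels_by_comment: dict[str, list[str]],
                      max_expected: int = 4) -> list[str]:
    """Return IDs of comments with suspiciously many subtheme labels."""
    return [
        comment_id
        for comment_id, labels in labels_by_comment.items()
        if len(labels) > max_expected
    ]
```

Flagged comments then go to a human reviewer rather than straight into the report.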

We used this feedback to tweak the prompt. For example, we changed theme names, added missing themes, made theme definitions more descriptive, and fed the AI_CLASSIFY prompt examples of how specific comments should be labeled.

We found a few remaining edge cases at the very end, so rather than making further changes to the model or the prompt, we hand-labeled them at the end of our pipeline.

Every stage of this work went through a round of human-led QA. This helped us better understand how to use the AI tools, build trust in the results, and develop a framework for future analyses like this.

Learnings

Using AI on open-ended qualitative data at this scale was new territory for us. The ideas employees shared are already informing real programs: the Governor’s Innovation Fellows, results.ca.gov, and case studies happening across the state.

A few lessons we’re carrying forward:

  • Prompt scope matters. The messier the input data, the more specific you have to be with the LLM prompts about what you want back.

  • When labeling text data, start with humans. Having researchers build the taxonomy first enabled the AI to gain critical context from domain experts (in this case, government employees) and simplified the analytics process.

  • Be honest about what the data can’t support. Moving away from rankings was the right call at the time, and it resulted in a rich dataset. Prominent themes remain at the top of the report, while nuanced details from the comments are maintained.

  • Put AI analysis in version control. All of the LLM prompts we used that touched the final dataset are in our version-controlled codebase. There are many benefits to this, one of which is that we know what prompts were used – making them reusable and auditable for the future.
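On that last point, one low-effort pattern (a sketch, not our actual repo layout) is to keep each prompt as a plain text file in the codebase and load it at run time, so prompt changes show up in diffs and code review like any other change:

```python
# Hypothetical sketch: prompts stored as versioned files in the repo.
from pathlib import Path

def load_prompt(name: str, prompt_dir: Path = Path("prompts")) -> str:
    """Read a versioned prompt template from the codebase."""
    return (prompt_dir / f"{name}.txt").read_text()
```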

The findings from this engagement are live at https://engaged.ca.gov/stateemployees/efficiency/, including a full CSV download of the raw data. The code, prompts, and models are on GitHub. 
