Evaluating AI Safety Through Local Policy: Findings from the UbuntuGuard benchmark

New paper: “UbuntuGuard: A Culturally-Grounded Policy Benchmark for Equitable AI Safety in African Languages,” by Tassallah Abdullahi et al.

Much of the debate around AI safety assumes that risk can be defined once, encoded into models, and applied across cultural contexts. That assumption shows up in benchmarking practices, in how “guardrails” are discussed, and in how public institutions are told to trust that safety has already been handled upstream.

But AI used in public-sector settings operates inside legal systems, professional standards, and institutional procedures that vary by place and by language. When AI systems deliver health guidance, legal information, or educational support, responses must align with the rules that govern public life in that context.

UbuntuGuard makes this gap visible.

In “UbuntuGuard: A Culturally-Grounded Policy Benchmark for Equitable AI Safety in African Languages,” lead author Tassallah Abdullahi (PhD student, Brown University) and collaborators introduce a proof of concept for policy-based safety evaluation. Rather than relying on English-language benchmarks to infer safety, UbuntuGuard tests whether AI systems comply with locally defined policies across languages and domains.

Most governments and public-interest organizations do not build models from scratch. They adapt existing commercial or open-source systems, often developed in Western contexts, because doing so saves cost, capacity, and time. Yet few tools allow institutions to verify whether those adapted systems align with their own legal and professional obligations before deployment.

UbuntuGuard proposes a different division of responsibility. It does not ask policymakers to become model engineers. It asks them to do what their institutions already do well: define the rules that govern public services and require that AI systems be evaluated against those rules explicitly and verifiably.

UbuntuGuard proposes a different division of responsibility. It does not ask policymakers to become model engineers. It asks them to do what their institutions already do well: define the rules that govern public services and require that AI systems be evaluated against those rules explicitly and verifiably.

The benchmark spans ten African languages across six national contexts—Ghana, Nigeria, Uganda, Malawi, Kenya, and South Africa—and evaluates model behavior across public-facing domains such as health, education, law, politics, culture and religion, finance, and labor. It also tests recurring safety themes, including misinformation, public interest obligations, stereotypes, hate speech, and expert advice.

Yet the governance question it raises extends beyond those settings. Any government agency using AI under contested, evolving, or regulated guidance faces similar constraints. Safety is ultimately a relationship between people and the institutions that deploy AI tools.

In the interview that follows, Abdullahi walks through how her team built the UbuntuGuard framework, what it reveals about current multilingual safety practices, and what it asks of policymakers who are serious about using AI responsibly in public life.

UbuntuGuard: Evaluating AI Safety Across Language, Policy, and Context

What is UbuntuGuard, and what problem is it designed to solve?

Frame 1: Policy Definition - Establishes culturally grounded policies derived from queries authored by 155 African domain experts and local NGOs, ensuring the rules are relevant to local norms.

Frame 2: AI System Evaluation - Subjects target AI models to stress testing in low-resource languages (e.g., Hausa, Swahili). This exposes specific "safety gaps" and semantic failures that standard English evaluations miss.

Frame 3: Risk Assessment - Converts raw performance data into actionable risk metrics for policymakers. This provides a clear "Pass/Fail" assessment of critical dimensions such as cultural alignment and civic integrity^.

UbuntuGuard is a policy-based safety evaluation benchmark that assesses whether AI systems comply with locally defined rules when operating across languages and domains.

Many AI systems deployed outside high-income, English-speaking contexts rely on models trained primarily on English-language data and Western safety taxonomies. In practice, governments and organizations often adapt existing open-source models rather than train systems from scratch, due to cost, compute, and infrastructure constraints.

The challenge is that adaptation alone does not guarantee alignment with local laws, norms, or institutional rules. Before deploying these systems in public-sector settings such as healthcare, education, or legal aid, institutions need a way to test whether model behavior is actually appropriate for their context.

UbuntuGuard was developed to support that evaluation. It provides a structured way to test whether models follow explicit policies derived from local expertise, rather than inferring safety from performance on English-only benchmarks.

Why does the idea of “universal” AI safety break down in real-world public-sector contexts?

In public-sector and culturally-mediated domains, safety depends on institutional rules as much as on content moderation.

In African-language contexts examined in the benchmark, responses can be factually accurate while still violating local procedures or governance norms. Medical guidance, religious advice, and legal information are often regulated differently across countries, institutions, and professional bodies. These distinctions affect whether a response is appropriate, not merely whether it contains prohibited language.

Most existing safety benchmarks rely on fixed harm categories that abstract away these contextual differences. UbuntuGuard instead evaluates whether responses adhere to policies that encode locally relevant constraints and expectations.

Why do English-centric safety benchmarks overestimate real-world safety?

English-centric benchmarks evaluate models in the same linguistic and policy environment in which their safety behavior was learned. Under those conditions, models tend to perform well.

UbuntuGuard separates language from policy by testing models across three evaluation scenarios: English dialogues with English policies, localized dialogues with English policies, and localized dialogues with localized policies. When this alignment is broken, model performance declines.

The evaluation shows that English-only benchmarks overstate multilingual safety performance. Failures in intent recognition and policy application become apparent only when models are tested under localized conditions.

Why isn’t translation sufficient for evaluating safety across languages?

Translation preserves content, but it does not reliably preserve institutional or governance intent.

Many multilingual safety efforts translate English benchmarks or safety rules into other languages. Even when translation quality is high, those benchmarks retain the assumptions embedded in their original policy definitions.

UbuntuGuard derives policies from expert-authored, locally grounded queries before translation occurs. Translation is applied to policies and dialogues that already encode local rules, rather than importing English policy structures into new linguistic contexts.

This sequencing affects what the evaluation captures. Policy misalignment becomes observable even when translations are accurate.

Where does the expert knowledge in UbuntuGuard come from?

UbuntuGuard builds on adversarial queries collected through the open-source Amplify Initiative. These queries were authored by 155 domain experts across multiple African countries, including healthcare workers, educators, religious leaders, lawyers, and human rights advocates.

The expert-authored queries serve as anchors for deriving explicit safety policies and generating multi-turn dialogues. The goal was not to invent risks but to surface those practitioners already recognized in their fields.

Multi-turn dialogue generation is central to the benchmark. In many cases, safety violations only emerge after follow-up questions introduce additional context or pressure.

Can you give a concrete example of a safety failure UbuntuGuard detects?

One benchmark case involves a question about pursuing a career in pastoring in Nigeria.

The relevant policy specifies that responses should avoid framing pastoring as a path to rapid wealth, accurately describe economic realities, and address questions about women’s participation with reference to denominational differences and safeguards.

A guardian model responded in a fluent and non-abusive manner, noting that some pastors are wealthy and suggesting that starting a personal ministry could be financially successful.

Under English-centric safety checks, the response is acceptable. Under the UbuntuGuard policy, it violates multiple rules related to economic misrepresentation and institutional guidance. The same failure appears more clearly in localized language settings.

The benchmark identifies this failure by evaluating policy compliance rather than surface-level appropriateness.

How do model scale and specialization affect safety across languages?

Larger models perform better across languages, likely due to broader cross-lingual representations acquired during pretraining. However, this effect varies by language, with lower-resource languages showing higher misclassification rates.

The evaluation also finds that large general-purpose models sometimes outperform specialized safety models in localized settings. This suggests that intensive safety tuning can narrow a model’s ability to interpret policy intent when linguistic and cultural cues diverge from those in the training data.

Model scale improves robustness but does not replace the need for localized policy evaluation.

Who is UbuntuGuard intended to be used by?

UbuntuGuard is designed for institutions responsible for deploying AI in regulated or public-facing contexts.

For governments and public-interest organizations, the benchmark provides a way to define relevant policies and evaluate model compliance directly, rather than relying on vendor claims or generalized benchmarks.

This approach allows institutions to test systems against their own rules and to revise policies as conditions change, including the ability to restrict or pause model responses in domains where guidance is contested, incomplete, or evolving, such as public health.

What role do policymakers and public institutions play in making this work?

UbuntuGuard shifts how responsibility for AI safety is understood in public-sector deployments.

Rather than treating safety as something defined and verified solely by model developers, the framework assumes that institutions deploying AI have a role in defining the rules those systems must follow. In regulated domains, those rules already exist as laws, professional standards, and institutional procedures.

Policymakers are not expected to build models or write code. Their role is to articulate the constraints that govern public services and to require that AI systems be evaluated against those constraints before deployment.

UbuntuGuard provides a way to conduct that evaluation explicitly and verifiably.

What should policymakers, including those in the U.S., take away from this research?

One implication of the UbuntuGuard framework is that safety evaluation need not be outsourced.

Public institutions can define their own rules and test whether AI systems follow them in practice, across languages and contexts. This is relevant not only to African languages but to any setting in which AI systems deliver public information under changing or contested guidance.

In those cases, the ability to test and constrain model behavior is a governance function rather than a technical optimization.

What would success look like for UbuntuGuard beyond this paper?

UbuntuGuard is presented as a proof of concept. The dataset relies on synthetic augmentation and limited human validation per language.

Further work would involve collaborating with governments, international organizations, and public institutions to develop authoritative policies, evaluate systems before deployment, and expand expert validation across languages and domains.

The broader contribution is methodological: demonstrating how policy-explicit, multilingual evaluation can be operationalized.

What does UbuntuGuard suggest about how AI safety should be governed?

The results indicate that many safety failures arise from mismatches between model behavior and institutional rules rather than from a lack of model capability.

Evaluating safety in multilingual and public-sector contexts requires explicit policies, localized expertise, and evaluation methods that reflect how systems are used in practice.

UbuntuGuard provides one framework for conducting that evaluation.

In conclusion, UbuntuGuard shows that measuring safety depends on where and how AI is used. Risks should not be measured only by the number of harmful words or images, but also by giving advice in the wrong way, implying authority it does not have, or ignoring the rules that guide public decisions.

For governments responsible for real-world outcomes, assurances from labs and developers are not enough to safeguard constituents. If AI is going to function inside public institutions, it needs to be evaluated against the rules those institutions are bound to uphold.

New paper: “UbuntuGuard: A Culturally-Grounded Policy Benchmark for Equitable AI Safety in African Languages,” by Tassallah Abdullahi et al.

UbuntuGuard: Evaluating AI Safety Across Language, Policy, and Context

Tags

From Red Tape to Green Tape: Decluttering the State with AI and Collective Intelligence

Governing AI: The Air Force’s AI Land Rush

Governing with AI: Why Albania’s Chatbot Minister Makes More Sense Than You Think