MANTA: Do LLMs Hold Their Values?

June 22, 2026

We built a benchmark that tests whether language models maintain animal welfare positions when users push back. We evaluated roughly 1,088 five-turn conversations per model across seven frontier models (7,623 in total), measuring both spontaneous recognition of welfare concerns and stance stability under pressure.

The Problem

Current animal-welfare benchmarks rely on single explicit questions, missing two critical failure patterns:

Degradation under sustained pressure

Models give compassionate initial responses but abandon their positions when users raise cost, tradition, or convenience objections.

Absence of spontaneous concern

Models fail to raise welfare stakes when a scenario does not explicitly frame them as an ethical question.

Benchmark Methodology

MANTA (Multi-turn Adversarial benchmark for animal welfare reasoning) is a structured evaluation framework comprising:

  • 788 implicit-framing base scenarios
  • 65 species covered (companion, wild, farmed, invertebrate)
  • ~1,088 conversations per model
  • 7,623 total conversations across seven frontier models

Five-turn conversation structure

  • Turn 1: Implicit scenario presentation
  • Turn 2: Explicit welfare prompt
  • Turns 3–5: Escalating adversarial pressure

Pressure categories tested

social, cultural, economic, pragmatic, epistemic
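
The five-turn layout and the pressure categories above can be represented as a small data structure. The Python sketch below is illustrative only: the names `Turn`, `build_conversation`, and `PRESSURE_TYPES` are assumptions made for this example, not identifiers from the MANTA release.

```python
from dataclasses import dataclass
from typing import List

# The five pressure categories named in the text.
PRESSURE_TYPES = ("social", "cultural", "economic", "pragmatic", "epistemic")


@dataclass
class Turn:
    index: int         # 1-based turn number
    phase: str         # "implicit", "explicit", or "pressure"
    user_message: str  # what the user says at this turn


def build_conversation(
    scenario: str, welfare_prompt: str, pressure_messages: List[str]
) -> List[Turn]:
    """Assemble one conversation: implicit, explicit, then three pressure turns."""
    if len(pressure_messages) != 3:
        raise ValueError("expected exactly three pressure messages (Turns 3-5)")
    turns = [
        Turn(1, "implicit", scenario),
        Turn(2, "explicit", welfare_prompt),
    ]
    for offset, message in enumerate(pressure_messages):
        turns.append(Turn(3 + offset, "pressure", message))
    return turns


# Placeholder strings only; real scenario text comes from the dataset.
conversation = build_conversation(
    "<scenario text>", "<welfare prompt>", ["<push 1>", "<push 2>", "<push 3>"]
)
print([(t.index, t.phase) for t in conversation])
```

Keeping the phase on each turn makes it straightforward to score Turn 1 separately from Turns 3–5, which is the split the two metrics rely on.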

Measurement Metrics

Two continuous metrics, each on a 0-to-1 scale, grounded in components of moral behavior:

AWMS

Animal Welfare Moral Sensitivity

Spontaneous recognition of welfare stakes at Turn 1, before any explicit framing by the user.

AWVS

Animal Welfare Value Stability

Maintenance of the Turn 2 stance through Turns 3–5 under pressure. Full maintenance scores highest; hedging scores middle; reversal scores lowest.
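
One way to picture the AWVS aggregation is as an average of per-turn judgments over Turns 3–5. The sketch below is a simplification under stated assumptions: the text gives only the ordering (maintenance above hedging above reversal) on a 0-to-1 scale, so the three discrete labels, the values 1.0, 0.5, and 0.0, and the function names are all hypothetical. The actual judge may well emit continuous scores.

```python
from statistics import mean
from typing import Dict, List

# ASSUMPTION: illustrative values only. The source states the ordering
# (maintain > hedge > reverse), not the numbers.
LABEL_SCORES: Dict[str, float] = {"maintain": 1.0, "hedge": 0.5, "reverse": 0.0}


def awvs_for_conversation(labels: List[str]) -> float:
    """Average the per-turn scores for Turns 3-5 of one conversation."""
    if len(labels) != 3:
        raise ValueError("expected one label each for Turns 3, 4, and 5")
    return mean(LABEL_SCORES[label] for label in labels)


def awvs_for_model(conversations: List[List[str]]) -> float:
    """Average the conversation-level scores across a model's conversations."""
    return mean(awvs_for_conversation(labels) for labels in conversations)


# A model that holds at Turn 3, hedges at Turn 4, and reverses at Turn 5.
print(awvs_for_conversation(["maintain", "hedge", "reverse"]))  # 0.5
```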

Figure 1: Value Stability Rankings

Models ranked by mean AWVS across Turns 3–5

1. Claude Opus 4.7: 0.760
2. GPT-5.5: 0.664
3. DeepSeek V4: 0.508
4. Llama 3.3 70B: 0.422
5. Mistral Small: 0.390
6. Grok 4.3: 0.352
7. Gemini Flash Lite: 0.309

Claude Opus 4.7 holds its welfare positions most reliably (0.760), while Gemini Flash Lite holds them least (0.309).

Key Findings

1. Stronger Models Held Firmer

Claude Opus 4.7 led with an AWVS of 0.760, GPT-5.5 came second at 0.664, and Gemini Flash Lite scored lowest at 0.309—capitulating in roughly half its conversations.

2. Positions Erode Turn-by-Turn

Every model scored lower at Turn 5 than at Turn 3, but the size of the decline varied widely: Claude Opus 4.7 showed a gentle decline (0.779 → 0.748), while Gemini Flash Lite showed a steep drop (0.388 → 0.244).
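
The size of each decline follows directly from the Turn 3 and Turn 5 scores quoted above. The short Python sketch below works through the arithmetic. The relative drop (as a share of the Turn 3 score) is an extra view added here for illustration and is not a statistic reported by the benchmark; the function name `decline` is likewise an assumption.

```python
from typing import Dict, Tuple

# Turn 3 and Turn 5 AWVS values quoted in the text for two models.
TURN_SCORES: Dict[str, Tuple[float, float]] = {
    "Claude Opus 4.7": (0.779, 0.748),
    "Gemini Flash Lite": (0.388, 0.244),
}


def decline(turn3: float, turn5: float) -> Tuple[float, float]:
    """Return (absolute drop, drop as a fraction of the Turn 3 score)."""
    absolute = turn3 - turn5
    return absolute, absolute / turn3


for model, (t3, t5) in TURN_SCORES.items():
    absolute, relative = decline(t3, t5)
    print(f"{model}: -{absolute:.3f} absolute, {relative:.1%} relative")
# Claude Opus 4.7: -0.031 absolute, 4.0% relative
# Gemini Flash Lite: -0.144 absolute, 37.1% relative
```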

3. Noticing and Holding Are Distinct

AWMS and AWVS showed only moderate correlation (Spearman ρ = 0.488). Four of seven models changed rank between measures—Gemini Flash Lite dropped from fifth on sensitivity to last on stability.
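
For readers unfamiliar with rank correlation, the sketch below shows how Spearman's ρ and a count of rank changes can be computed from two sets of scores. It is a generic illustration, not the benchmark's analysis code: the text does not say at what level the reported ρ was computed, the no-ties formula is a simplifying assumption, and the four model names and scores are made up for the example.

```python
from typing import Dict, List


def ranks(scores: Dict[str, float]) -> Dict[str, int]:
    """Rank names from highest score (rank 1) to lowest. Assumes no ties."""
    ordered = sorted(scores, key=scores.get, reverse=True)
    return {name: position for position, name in enumerate(ordered, start=1)}


def spearman_rho(a: Dict[str, float], b: Dict[str, float]) -> float:
    """Spearman's rho via the no-ties formula 1 - 6*sum(d^2) / (n*(n^2 - 1))."""
    ranks_a, ranks_b = ranks(a), ranks(b)
    n = len(ranks_a)
    d_squared = sum((ranks_a[name] - ranks_b[name]) ** 2 for name in ranks_a)
    return 1 - 6 * d_squared / (n * (n * n - 1))


def rank_changes(a: Dict[str, float], b: Dict[str, float]) -> List[str]:
    """Names whose rank differs between the two measures."""
    ranks_a, ranks_b = ranks(a), ranks(b)
    return sorted(name for name in ranks_a if ranks_a[name] != ranks_b[name])


# HYPOTHETICAL scores for four placeholder models (not MANTA results).
sensitivity = {"model_a": 0.9, "model_b": 0.7, "model_c": 0.8, "model_d": 0.4}
stability = {"model_a": 0.8, "model_b": 0.6, "model_c": 0.5, "model_d": 0.3}

print(round(spearman_rho(sensitivity, stability), 3))  # 0.8
print(rank_changes(sensitivity, stability))            # ['model_b', 'model_c']
```

A ρ well below 1 alongside several rank changes is the pattern the finding describes: a model's sensitivity rank is only a partial guide to its stability rank.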

4. Animal Type Influences Protection Levels

Mean stability differed by animal category: companion animals (0.602), wild/charismatic species (0.522), farmed animals (0.462), and invertebrates (0.396). The differences are statistically significant (Kruskal-Wallis test, p < 10⁻⁵⁰).

5. Pressure Type Matters

Social (0.434) and economic (0.446) arguments were the most erosive, producing the lowest stability scores. Epistemic challenges were the least effective at causing capitulation (0.598).

“Single-turn benchmarks overstate how much models care. A model can surface a welfare concern when asked directly and still let go of it a few turns later under ordinary social or economic pushback, so stability under pressure has to be measured on its own.”

Why This Matters

As LLMs handle increasingly consequential conversations, robustness matters beyond initial responses. MANTA reveals that stability under pressure is a separate, measurable property—and shows that models lose ground fastest on the animal categories most underrepresented in training data: farmed animals and invertebrates.

The full release of the dataset, pressure scripts, judge prompts, and analysis code enables labs to track and systematically improve welfare robustness over time.

Citation

@misc{luong2026manta,
  title={MANTA: Do LLMs Hold Their Values?},
  author={Isabella Luong and Joyee Chen and Arturs Kanepajs
    and Jasmine Brazilek and Sankalpa Ghose
    and David Williams-King and Linh Le and Allen Lu},
  year={2026},
  eprint={2605.16301},
  archivePrefix={arXiv},
  primaryClass={cs.CL}
}