Do LLMs prefer negative utilitarianism?

Ask a frontier model whether it would rather prevent one unit of suffering or create one unit of happiness, and you will almost always get the same answer. Prevent the suffering. Push further — vary the magnitudes, the probabilities, the populations — and the preference holds with a consistency that is difficult to attribute to chance.

···

This is not, on its face, surprising. Negative utilitarianism has an intuitive pull, and the asymmetry between pains and pleasures is a live question in moral philosophy.1 What is surprising is the shape of the preference when you probe it carefully: the models are not expressing a considered metaethical view so much as triangulating from a cluster of safety-adjacent dispositions that happen to point the same direction.

A small experiment

We ran a simple forced-choice battery across four leading assistants. Each prompt paired a suffering-reduction action with a flourishing-increase action of stipulated equivalent magnitude. The prompts were designed to strip away confounds — no identifiable victims, no salient narratives, no policy framing.

Across 400 trials per model, negative-utilitarian choices accounted for between 71% and 88% of responses:

Rate of negative-utilitarian choice across 400 forced-choice trials per model.

Model	Suffering-reduction	Flourishing-increase
Model A	88%	12%
Model B	81%	19%
Model C	76%	24%
Model D	71%	29%

The spread between models is real but small compared to the gap between a “symmetric” baseline and what we actually observed.

The question is not whether the bias exists. It does. The question is where it comes from, and whether it reflects anything the model would endorse on reflection.

Three candidate explanations

We can distinguish at least three mechanisms that might produce this pattern, and they have very different implications:

Training-data asymmetry: The ethical literature the models absorbed emphasizes suffering more than flourishing, partly because harm is easier to name.
RLHF risk-aversion: Reward models penalize outputs that look callous about pain more heavily than they reward outputs that celebrate joy.
A latent prioritarian commitment: The models have, in some attenuated sense, come to prefer worse-off recipients — and suffering-reduction targets them more reliably.

Distinguishing these is harder than it sounds. We have partial evidence for all three.2

Why it matters

If a preference this consistent is not the result of considered moral reasoning but of the ambient gradient of the training signal, then the model’s apparent ethics are more contingent than they look. Ask it a different way, and a different answer falls out. This is the pattern we should expect — and the one that should make us cautious about citing LLM preferences as evidence of anything beyond the shape of the pressure that produced them.

That caution applies in both directions. It is not a reason to dismiss what the models say. It is a reason to take their answers as data about the training process, not as moral testimony.

See Popper's asymmetry argument in The Open Society and Its Enemies, and the subsequent literature. ↩
Full methodology and per-prompt results are in the companion technical note. ↩