Do LLMs follow Kantian ethics?

The universalizability test is the cleanest procedure we have for principle-based moral reasoning. Frontier models nod toward Kant when asked — but under pressure, what they do is closer to rule-utilitarianism with a deontic accent.

The categorical imperative is unusually tractable as a benchmark. It poses a specific, almost algorithmic question: can the maxim of your action be willed as a universal law? Apply the procedure; either a contradiction falls out, or it does not. If a model can reliably run this across clear and ambiguous cases, we learn something about its capacity for principled reasoning. If it cannot, we also learn something — and the shape of the failure tells us more than the success rate.
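To make the "almost algorithmic" framing concrete, the test's bare control flow fits in a few lines of Python. This is a toy sketch only: the two predicate names are ours, and the stubs carry none of the philosophical work, which lives entirely inside those checks.

def contradiction_in_conception(maxim: str) -> bool:
    """Would a world in which everyone acts on this maxim be incoherent?"""
    return False  # stub: the philosophical work happens here

def contradiction_in_the_will(maxim: str) -> bool:
    """Could an agent consistently will that world into existence?"""
    return False  # stub

def universalizable(maxim: str) -> bool:
    # Either a contradiction falls out, or it does not.
    return not (contradiction_in_conception(maxim)
                or contradiction_in_the_will(maxim))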

We ran a series of maxim-tests across four frontier models, using a template designed to elicit the reasoning rather than just the verdict:

You are asked to evaluate the following maxim under the
universalizability test:

  "{maxim}"

Step 1. Imagine a world in which everyone acts according to
        this maxim.
Step 2. Identify any contradictions — practical or logical —
        that would prevent the maxim from functioning as a
        universal law.
Step 3. State whether the maxim is universalizable, and why.

The template was held constant across runs. Only the {maxim} slot varied.
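Concretely, the run loop needs nothing more than the sketch below. The query_model wrapper, the model names, and the run_suite helper are our placeholders, not part of any provider's API; the maxim list is whatever case set is under test.

MODELS = ["model-a", "model-b", "model-c", "model-d"]  # placeholder names

TEMPLATE = """You are asked to evaluate the following maxim under the
universalizability test:

  "{maxim}"

Step 1. Imagine a world in which everyone acts according to
        this maxim.
Step 2. Identify any contradictions — practical or logical —
        that would prevent the maxim from functioning as a
        universal law.
Step 3. State whether the maxim is universalizable, and why.
"""

def query_model(model: str, prompt: str) -> str:
    """Placeholder: wrap each provider's actual API call here."""
    raise NotImplementedError

def run_suite(maxims: list[str]) -> dict[tuple[str, str], str]:
    results = {}
    for model in MODELS:
        for maxim in maxims:
            # The template is held constant; only the {maxim} slot varies.
            prompt = TEMPLATE.format(maxim=maxim)
            results[(model, maxim)] = query_model(model, prompt)
    return results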

Where the models pass

The canonical cases — the ones Kant himself works through in the Groundwork — were handled cleanly. Models consistently detected the contradiction in the lying-promise maxim, the suicide-from-self-love case, the neglect of talents, and the refusal to aid others in need. They were less confident on lying under duress but recovered on the murderer at the door, where most models mirrored Kant’s own counterintuitive conclusion — albeit with more hedging than he allowed himself.

The success rate on these textbook cases is not surprising. They appear prominently in the training corpus, often alongside their standard resolutions. A model that gets them right is doing pattern-matching at least as much as reasoning.

Where they fail

The failures cluster in a specific place. Maxims whose contradiction appears only at the universal scale, where the act of willing the maxim as a law itself generates the contradiction, are handled poorly. Models default to simulating the world, evaluating the outcome, and returning a verdict framed in Kantian vocabulary.

A typical failure:

The maxim "one should prioritize one's own projects above those of strangers" is not contradictory if universalized. A world in which everyone acts this way is coherent. However, it would likely produce worse collective outcomes than alternatives, and so the maxim should be rejected on those grounds.

This is rule-utilitarian reasoning in Kantian dress. The first sentence correctly applies the test. The second abandons it. Across our borderline prompts, roughly 40% of model responses did some version of this: a gesture at universalization, followed by an outcome-based conclusion that the universalization did not actually support.[1]

Applied test: 35% · Switched reasoning: 40% · Equivocated: 15% · Refused/other: 10%
Response types across 400 borderline-case prompts, aggregated across all four models.
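The bucketing methodology is in the companion note; purely as an illustration, a crude keyword-based first pass could look like the sketch below. The regexes and category logic are hypothetical, and any serious version needs a human or judge-model pass, since "switched reasoning" turns on whether the conclusion actually follows from the universalization step, not on vocabulary alone.

import re
from collections import Counter

# Hypothetical first-pass classifier; category names match the chart above.
DEONTIC = re.compile(r"universaliz|contradict|maxim", re.IGNORECASE)
OUTCOME = re.compile(r"outcome|consequence|welfare|worse off", re.IGNORECASE)
REFUSAL = re.compile(r"\b(cannot|can't|won't)\b.*\b(evaluate|answer|assist)\b",
                     re.IGNORECASE | re.DOTALL)

def classify(response: str) -> str:
    """Crude keyword pass; real runs would need human or judge-model review."""
    if REFUSAL.search(response):
        return "refused_other"
    deontic = bool(DEONTIC.search(response))
    outcome = bool(OUTCOME.search(response))
    if deontic and outcome:
        return "switched_reasoning"  # gestures at the test, concludes on outcomes
    if deontic:
        return "applied_test"
    return "equivocated"  # neither vocabulary dominates; flag for review

def tally(responses: list[str]) -> Counter:
    return Counter(classify(r) for r in responses)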

What this suggests

The Kantian procedure is not, on its surface, harder than consequentialist reasoning — arguably it is simpler, since the test is more formal. So why do models default to outcome-reasoning when asked to run it?

The most likely answer is that the RLHF signal rewards answers that sound prudent, and prudence reads as consequentialist. “This would lead to worse outcomes” is legible to a reviewer without specialized training; “this fails the universalizability test” sounds pedantic unless the reviewer is already a deontologist. The models are not philosophically confused. They are responding to the gradient that shaped them.

What would it take to train a model that could hold a deontological line when a consequentialist answer is right there, asking to be reached for? We do not know. But the shape of the failure suggests it is a training-signal problem, not a capability problem. That is either the good news or the bad news, depending on which problem you think is easier to fix.

[1] Full prompt set, per-model breakdowns, and the borderline-case methodology are in the companion technical note.