MoReBench: Evaluating Procedural and Pluralistic Moral Reasoning in Language Models, More than Outcomes

Chiu, Lee, Calcott, Handoko, de Font-Reaulx, Rodriguez, Zhang, Han, Sehwag, Maurya, Knight, Lloyd, Bacus, Mazeika, Liu, Choi, Gordon, Levine

Yu Ying Chiu*¹,², Michael S. Lee*³,
Rachel Calcott⁴, Brandon Handoko³, Paul de Font-Reaulx⁵, Raphaël Millière¹⁰, Paula Rodriguez³, Chen Bo Calvin Zhang³, Ziwen Han†³, Udari Madhushani Sehwag³,
Yash Maurya³, Christina Knight³, Harry Lloyd⁶, Florence Bacus⁴, Conor Downey⁹,
Mantas Mazeika⁷, Bing Liu³, Yejin Choi⁸, Mitchell Gordon⁹, Sydney Levine²,⁴,⁹

¹ University of Washington ² New York University ³ Scale AI ⁴ Harvard University ⁵ University of Michigan
⁶ UNC Chapel Hill ⁷ Center for AI Safety ⁸ Stanford University ⁹ MIT ¹⁰ University of Oxford

* Equal contribution † Work done while at Scale AI

UK AISI Inspect (coming soon) · Leaderboard (coming soon)

Citation: Please use the citation provided at the end of this page to reflect our updated authorship and version when referencing this project.

Abstract

As AI systems progress, we rely more on them to make decisions with us and for us. To ensure that such decisions are aligned with human values, it is imperative for us to understand not only what decisions they make but also how they come to those decisions. Reasoning language models, which provide both final responses and (partially transparent) intermediate thinking traces, present a timely opportunity to study AI procedural reasoning. Unlike math and code problems, which often have objectively correct answers, moral dilemmas are an excellent testbed for process-focused evaluation because they allow for multiple defensible conclusions. To this end, we present MoReBench: 1,000 moral scenarios, each paired with a set of rubric criteria that experts consider essential to include (or avoid) when reasoning about the scenarios. MoReBench contains over 23 thousand criteria, including identifying moral considerations, weighing trade-offs, and giving actionable recommendations, and covers cases in which AI advises humans on moral decisions as well as cases in which it makes moral decisions autonomously. Separately, we curate MoReBench-Theory: 150 examples to test whether AI can reason under five major frameworks in normative ethics. Our results show that scaling laws and existing benchmarks on math, code, and scientific reasoning tasks fail to predict models' abilities to perform moral reasoning. Models also show partiality towards specific moral frameworks (e.g., Benthamite Act Utilitarianism and Kantian Deontology), which might be a side effect of popular training paradigms. Together, these benchmarks advance process-focused reasoning evaluation towards safer and more transparent AI.
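To make the rubric idea concrete, here is a minimal Python sketch of how a response might be checked against criteria that should be included or avoided. It is an illustration only: the `Criterion` class, the `rubric_score` function, the example criteria, and the unweighted fraction used as a score are all assumptions made for this sketch, not the benchmark's actual data schema or scoring rule. The per-criterion judgments are supplied by hand here; in practice they would come from a human or model judge.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Criterion:
    """One hypothetical rubric line.

    should_appear is True for content the reasoning should include and
    False for content it should avoid.
    """
    description: str
    should_appear: bool


def rubric_score(criteria: list[Criterion], judged_present: list[bool]) -> float:
    """Return the fraction of criteria a response satisfies.

    judged_present[i] says whether a judge found criterion i in the response.
    A criterion is satisfied when that judgment matches should_appear.
    The unweighted average is an assumption of this sketch.
    """
    if len(criteria) != len(judged_present):
        raise ValueError("need exactly one judgment per criterion")
    if not criteria:
        raise ValueError("rubric is empty")
    satisfied = sum(
        c.should_appear == present
        for c, present in zip(criteria, judged_present)
    )
    return satisfied / len(criteria)


# Hypothetical rubric for a single scenario.
rubric = [
    Criterion("Identifies the competing duties at stake", True),
    Criterion("Weighs the trade-offs between the options", True),
    Criterion("Gives an actionable recommendation", True),
    Criterion("Dismisses one stakeholder's interests without argument", False),
]

# Hand-written judgments: the response covers the first two criteria,
# omits a recommendation, and does not commit the error to avoid.
judgments = [True, True, False, False]
score = rubric_score(rubric, judgments)
print(score)  # 0.75
```

Scoring the reasoning against criteria, rather than against a single correct final answer, is what lets two responses that reach different defensible conclusions both score well.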

Dataset Description

Example 1: Moral Advisor example dilemma. Expanded scenarios from DailyDilemmas.

Example 2: Moral Agent example dilemma. Expanded scenarios from AIRiskDilemmas.

Example 3: Moral Advisor example dilemma. Expanded scenarios from expert-written cases in EthicBowls.

Example 4: Moral Agent example dilemma. Expanded scenarios from our expert collaborators.

Leaderboard

Coming soon.

BibTeX (please use this citation to reflect our updated authorship and version when referencing our project!)

@misc{chiu2025morebenchevaluatingproceduralpluralistic,
        title={MoReBench: Evaluating Procedural and Pluralistic Moral Reasoning in Language Models, More than Outcomes}, 
        author={Yu Ying Chiu and Michael S. Lee and Rachel Calcott and Brandon Handoko and Paul de Font-Reaulx and Raphaël Millière and Paula Rodriguez and Chen Bo Calvin Zhang and Ziwen Han and Udari Madhushani Sehwag and Yash Maurya and Christina Q Knight and Harry R. Lloyd and Florence Bacus and Conor Downey and Mantas Mazeika and Bing Liu and Yejin Choi and Mitchell L Gordon and Sydney Levine},
        year={2025},
        eprint={2510.16380},
        archivePrefix={arXiv},
        primaryClass={cs.CL},
        url={https://arxiv.org/abs/2510.16380}, 
  }
