Key Points

1. The dominant practice of AI alignment assumes that preferences are an adequate representation of human values, that human rationality can be understood in terms of maximizing the satisfaction of preferences, and that AI systems should be aligned with the preferences of one or more humans.

2. Preferences fail to capture the thick semantic content of human values, and utility representations neglect the possible incommensurability of those values.

3. Rational agents need not comply with expected utility theory, which is also silent on which preferences are normatively acceptable.

4. Instead of alignment with the preferences of a human user, developer, or humanity-writ-large, AI systems should be aligned with normative standards appropriate to their social roles, as negotiated and agreed upon by all relevant stakeholders.

5. Rational choice theory is an inadequate descriptive model of human behavior and decision-making, since humans are neither perfectly rational nor even noisily rational.

6. Reward and utility functions cannot represent all human preferences, which may be incomplete due to incommensurable values (see the sketch after this list).

7. Expected utility theory is neither necessary nor sufficient for rational agency, and should not be the sole basis for the design or analysis of AI systems.

8. Reward learning and preference matching are only appropriate for AI systems with sufficiently local uses and scopes, not for globally scoped AI assistants.

9. Multi-principal alignment should not be framed as the aggregation of elicited preferences, but rather as the alignment of AI systems with negotiated normative standards that promote mutual benefit and limit harm despite our plural and divergent values.
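
To make points 2 and 6 concrete, the following sketch runs the familiar "small improvement" argument. It is an illustration constructed here, not an example from the paper, and the option names and utility levels are hypothetical: if two options are incommensurable, a real-valued utility function can only encode "no preference either way" as equal utility, and that encoding breaks down once one option is slightly sweetened.

```python
from itertools import product

options = ["A", "A_plus", "B"]  # "A_plus" is option A sweetened with a small bonus

def consistent(u):
    """Check whether a utility assignment u respects the stipulated preferences."""
    return (
        u["A_plus"] > u["A"]       # the bonus makes A_plus strictly better than A
        and u["A"] == u["B"]       # A vs. B: no preference either way, so equal utility
        and u["A_plus"] == u["B"]  # A_plus vs. B: still no preference either way
    )

# Exhaustive search over small integer utility levels finds no consistent
# assignment; indeed none exists over the reals, since the constraints jointly
# require u(A_plus) = u(B) = u(A) while also requiring u(A_plus) > u(A).
solutions = [
    dict(zip(options, values))
    for values in product(range(4), repeat=3)
    if consistent(dict(zip(options, values)))
]
print("consistent utility assignments:", solutions)  # prints []
```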

Summary

The paper challenges the assumptions of the dominant "preferentist" approach to AI alignment. This approach assumes that (1) preferences are an adequate representation of human values, (2) human rationality can be understood in terms of maximizing the satisfaction of preferences, and (3) AI systems should be aligned with the preferences of one or more humans to ensure safe and value-aligned behavior.

Examining the Limits of Rational Choice Theory
The paper first examines the limits of rational choice theory as a descriptive model of human decision-making. It explains how preferences fail to capture the thick semantic content of human values, and how utility representations neglect the possible incommensurability of those values. The paper then critiques the normativity of expected utility theory (EUT), arguing that rational agents need not comply with EUT, while highlighting how EUT provides no guidance on which preferences are normatively acceptable.

Based on these limitations, the paper advocates for a reframing of the targets of AI alignment. Instead of aligning AI systems with the preferences of a human user, developer, or humanity-writ-large, AI systems should be aligned with normative standards appropriate to their social roles, such as the role of a general-purpose assistant. These normative standards should be negotiated and agreed upon by all relevant stakeholders.

An Alternative Conception of Alignment
On this alternative conception of alignment, a multiplicity of AI systems will be able to serve diverse ends, aligned with normative standards that promote mutual benefit and limit harm despite our plural and divergent values. The paper suggests that this move beyond preferences as the target of alignment is necessary to truly address the challenges posed by the value alignment problem, including the problems of social choice, anti-social preferences, preference change, and the difficulty of inferring preferences from human behavior.
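
As a brief illustration of the social choice problem mentioned above (again a constructed example, not one taken from the paper; the stakeholders and options are placeholders), pairwise majority aggregation of three perfectly coherent individual rankings can yield a cyclic group preference, the Condorcet paradox, leaving no single ordering for an AI system to optimize.

```python
# Three stakeholders rank three options from best to worst.
rankings = [
    ["x", "y", "z"],
    ["y", "z", "x"],
    ["z", "x", "y"],
]

def majority_prefers(a, b):
    """True if a strict majority of stakeholders ranks option a above option b."""
    votes = sum(r.index(a) < r.index(b) for r in rankings)
    return votes > len(rankings) / 2

for a, b in [("x", "y"), ("y", "z"), ("z", "x")]:
    print(f"majority prefers {a} over {b}: {majority_prefers(a, b)}")
# All three lines print True: the aggregated relation is cyclic (x > y > z > x),
# so majority aggregation yields no coherent "group preference" to align to.
```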

Reference: https://arxiv.org/abs/2408.169...