an argument for prosocial agents

For machine-learning systems (agents) to be useful in the real world, they need to be able to cooperate with each other and with humans, not only at deployment but throughout their lifetime. I predicate my research on this belief and the following ones:

Multi-agent deployment: Transformative AI is unlikely to appear in isolation, but will instead be developed by multiple competing actors, giving rise to a large population of capable agents [1, 2].

Multi-agent training: Transformative AI is likely to be produced by an automatic curriculum, and one promising approach for this is multi-agent curricula [3].

Lifetime learning: Agents will not simply be trained and then deployed in the real world; they will continue to be trained afterwards. This training will likely be decentralised and take place within a population of competing agents. As such, the boundary between training and deployment will become less clear.

This creates the following concerns:

It is difficult to deploy cooperative agents: A priori, agents that are cooperative or altruistic are vulnerable to being exploited by more selfish agents. Thus, agents that succeed in competitive markets are likely to be selfishly motivated. Large populations of selfish agents can easily lead to catastrophic events, for example large market failures [4] or resource exhaustion [5].

It is difficult (post-deployment) to train cooperative agents: Agents that are trained in a population of selfish agents will be unable to develop cooperative strategies [6]. This restricts our ability to create cooperative agents over time, thus making selfish (misaligned) AI more likely.

It is difficult to de-risk multi-agent systems: Transformative AI systems trained through multi-agent interaction are shaped not only by their reward functions but also by their interactions with other agents in their population [7]. System failure cannot be attributed to a single agent, so work on (single-agent) interpretability or reward modelling is likely insufficient to de-risk these interactions [8].
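The concern about post-deployment training can be illustrated with a toy calculation in the spirit of the iterated prisoner's dilemma studied in [6]. This is a minimal sketch under stated assumptions, not a model of any real training setup: the payoff values (5, 3, 1, 0), the fixed match length of 10 rounds, and the names `match_payoffs`, `expected_payoff` and `cooperation_pays` are all illustrative choices that do not come from the text. "TFT" stands for a reciprocating strategy (tit-for-tat) and "ALLD" for a purely selfish one (always defect).

```python
def match_payoffs(rounds):
    """Total payoff to the first strategy over one match of `rounds` rounds.

    Assumed per-round payoffs: both cooperate 3, both defect 1,
    defecting on a cooperator 5, cooperating with a defector 0.
    """
    return {
        ("TFT", "TFT"): 3 * rounds,      # mutual cooperation every round
        ("TFT", "ALLD"): rounds - 1,     # exploited once, then mutual defection
        ("ALLD", "TFT"): rounds + 4,     # exploits once (5), then mutual defection
        ("ALLD", "ALLD"): rounds,        # mutual defection every round
    }


def expected_payoff(strategy, reciprocator_share, rounds=10):
    """Expected match payoff when opponents are drawn at random from a
    population in which `reciprocator_share` play TFT and the rest play ALLD."""
    pay = match_payoffs(rounds)
    return (reciprocator_share * pay[(strategy, "TFT")]
            + (1 - reciprocator_share) * pay[(strategy, "ALLD")])


def cooperation_pays(reciprocator_share, rounds=10):
    """True if a reciprocator out-earns a selfish agent in this population."""
    return (expected_payoff("TFT", reciprocator_share, rounds)
            > expected_payoff("ALLD", reciprocator_share, rounds))


if __name__ == "__main__":
    for share in (0.0, 0.05, 0.1, 0.5):
        print(share, cooperation_pays(share))
```

Under these assumed numbers, a reciprocating agent placed in an entirely selfish population earns less than the selfish agents around it, so a learner that follows its own reward is pushed away from cooperation. Cooperation only starts to pay once a sufficient share of the population already reciprocates; the exact threshold here is an artefact of the toy payoffs.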

To address the risk from these concerns, we need to:

Develop prosocial agents: Altruistic agents do not survive in competitive markets, whilst the behaviour of selfish agents leads to cooperation failure [9]. Thus, we require prosocial agents: those that actively seek (cooperative) optimal policies whilst being intolerant of exploitation. This behaviour mitigates the aforementioned concerns whilst sustaining the mechanisms of a competitive market.

Ensure multi-agent curricula incentivise prosociality: We need to guarantee that weak AI retains its prosociality when trained post-deployment. Thus, to ensure that training still produces aligned AI, we need to build our understanding of these multi-agent systems and develop methods that continually incentivise prosociality.
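The distinction between altruistic, selfish and prosocial agents can be made concrete with a small iterated prisoner's dilemma. The sketch below is only an illustration under stated assumptions: the payoff table, the 10-round match length and the function names are hypothetical, and tit-for-tat is used as a simple stand-in for "seeks cooperation but does not tolerate exploitation", not as a definition of prosociality.

```python
# Assumed payoffs to the first player: (my move, their move) -> reward.
# "C" = cooperate, "D" = defect.
PAYOFF = {("C", "C"): 3, ("C", "D"): 0, ("D", "C"): 5, ("D", "D"): 1}


def altruist(own_history, opp_history):
    """Always cooperates, whatever the opponent does."""
    return "C"


def selfish(own_history, opp_history):
    """Always defects."""
    return "D"


def prosocial(own_history, opp_history):
    """Tit-for-tat: cooperate first, then mirror the opponent's last move."""
    return opp_history[-1] if opp_history else "C"


def play(agent_a, agent_b, rounds=10):
    """Play a repeated game and return the total scores (score_a, score_b)."""
    hist_a, hist_b = [], []
    score_a = score_b = 0
    for _ in range(rounds):
        move_a = agent_a(hist_a, hist_b)
        move_b = agent_b(hist_b, hist_a)
        score_a += PAYOFF[(move_a, move_b)]
        score_b += PAYOFF[(move_b, move_a)]
        hist_a.append(move_a)
        hist_b.append(move_b)
    return score_a, score_b


if __name__ == "__main__":
    print("altruist  vs selfish  :", play(altruist, selfish))     # (0, 50)
    print("prosocial vs selfish  :", play(prosocial, selfish))    # (9, 14)
    print("prosocial vs prosocial:", play(prosocial, prosocial))  # (30, 30)
    print("selfish   vs selfish  :", play(selfish, selfish))      # (10, 10)
```

In this toy setting the altruist is exploited in every round, two selfish agents lock into low mutual payoffs, and the prosocial agent loses only the first round to a selfish opponent while achieving full cooperation with its own kind.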

[1] https://www.alignmentforum.org/posts/dSAJdi99XmqftqXXq/eight-claims-about-multi-agent-agi-safety

[2] Critch, Andrew, and David Krueger. “AI Research Considerations for Human Existential Safety (ARCHES).” arXiv preprint arXiv:2006.04948 (2020).

[3] Leibo, Joel Z., et al. “Autocurricula and the emergence of innovation from social interaction: A manifesto for multi-agent intelligence research.” arXiv preprint arXiv:1903.00742 (2019).

[4] https://en.wikipedia.org/wiki/2010_flash_crash

[5] https://en.wikipedia.org/wiki/Resource_depletion

[6] Axelrod, Robert, and William Donald Hamilton. “The evolution of cooperation.” Science 211.4489 (1981): 1390-1396.

[7] https://www.alignmentforum.org/posts/BXMCgpktdiawT3K5v/multi-agent-safety

[8] Yudkowsky, Eliezer. Inadequate equilibria: Where and how civilizations get stuck. Machine Intelligence Research Institute, 2017.

[9] https://longtermrisk.org/research-agenda