Scalable AI Safety via Doubly-Efficient Debate

Jonah Brown-Cohen, Geoffrey Irving, Georgios Piliouras
Proceedings of the 41st International Conference on Machine Learning, PMLR 235:4585-4602, 2024.

Abstract

The emergence of pre-trained AI systems with powerful capabilities across a diverse and ever-increasing set of complex domains has raised a critical challenge for AI safety, as tasks can become too complicated for humans to judge directly. Irving et al. (2018) proposed a debate method for this problem, with the goal of pitting the power of such AI models against each other until the problem of identifying (mis)alignment is broken down into a manageable subtask. While the promise of this approach is clear, the original framework was based on the assumption that the honest strategy is able to simulate deterministic AI systems for an exponential number of steps, limiting its applicability. In this paper, we show how to address these challenges by designing a new set of debate protocols where the honest strategy can always succeed using a simulation of a polynomial number of steps, whilst being able to verify the alignment of stochastic AI systems, even when the dishonest strategy is allowed to use exponentially many simulation steps.
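To see why a weak judge can supervise a long computation at all, consider the classic bisection idea behind debate over computations: when two debaters disagree about the outcome of a T-step computation, the dispute can be narrowed, step by step, until the judge only has to verify a single transition. The sketch below (in Python) illustrates just that core idea under simplifying assumptions; all names are hypothetical, and this is not the paper's actual protocol, which additionally bounds the honest debater to polynomially many simulation steps and handles stochastic systems.

# A minimal, hypothetical sketch of the bisection idea behind debate over
# long computations. Both debaters commit to the intermediate states of a
# T-step computation; they agree at step 0 and disagree at step T, so
# binary search finds adjacent steps where agreement turns into
# disagreement, and the judge verifies that single transition.

def debate(step, T, honest_claims, dishonest_claims):
    """Return True iff the honest debater's final claim is upheld."""
    lo, hi = 0, T  # invariant: claims agree at lo and disagree at hi
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if honest_claims[mid] == dishonest_claims[mid]:
            lo = mid
        else:
            hi = mid
    # The judge checks one transition from the last agreed-upon state.
    return step(honest_claims[lo]) == honest_claims[lo + 1]

# Example: debating the result of incrementing 0 eight times.
step = lambda x: x + 1
honest = list(range(9))                       # correct trace: 0, 1, ..., 8
dishonest = honest[:5] + [99, 100, 101, 102]  # diverges from step 5 onward
assert debate(step, 8, honest, dishonest)     # honest claim is upheld

In this toy scheme the judge performs O(log T) comparisons plus one step check, which is why debate can let a weak judge oversee a much longer computation; the paper's contribution is to also make the honest strategy efficient, so that it needs only polynomially many simulation steps even against a dishonest strategy using exponentially many.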

Cite this Paper


BibTeX
@InProceedings{pmlr-v235-brown-cohen24a,
  title     = {Scalable {AI} Safety via Doubly-Efficient Debate},
  author    = {Brown-Cohen, Jonah and Irving, Geoffrey and Piliouras, Georgios},
  booktitle = {Proceedings of the 41st International Conference on Machine Learning},
  pages     = {4585--4602},
  year      = {2024},
  editor    = {Salakhutdinov, Ruslan and Kolter, Zico and Heller, Katherine and Weller, Adrian and Oliver, Nuria and Scarlett, Jonathan and Berkenkamp, Felix},
  volume    = {235},
  series    = {Proceedings of Machine Learning Research},
  month     = {21--27 Jul},
  publisher = {PMLR},
  pdf       = {https://raw.githubusercontent.com/mlresearch/v235/main/assets/brown-cohen24a/brown-cohen24a.pdf},
  url       = {https://proceedings.mlr.press/v235/brown-cohen24a.html},
  abstract  = {The emergence of pre-trained AI systems with powerful capabilities across a diverse and ever-increasing set of complex domains has raised a critical challenge for AI safety, as tasks can become too complicated for humans to judge directly. Irving et al. (2018) proposed a debate method for this problem, with the goal of pitting the power of such AI models against each other until the problem of identifying (mis)alignment is broken down into a manageable subtask. While the promise of this approach is clear, the original framework was based on the assumption that the honest strategy is able to simulate deterministic AI systems for an exponential number of steps, limiting its applicability. In this paper, we show how to address these challenges by designing a new set of debate protocols where the honest strategy can always succeed using a simulation of a polynomial number of steps, whilst being able to verify the alignment of stochastic AI systems, even when the dishonest strategy is allowed to use exponentially many simulation steps.}
}
Endnote
%0 Conference Paper
%T Scalable AI Safety via Doubly-Efficient Debate
%A Jonah Brown-Cohen
%A Geoffrey Irving
%A Georgios Piliouras
%B Proceedings of the 41st International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2024
%E Ruslan Salakhutdinov
%E Zico Kolter
%E Katherine Heller
%E Adrian Weller
%E Nuria Oliver
%E Jonathan Scarlett
%E Felix Berkenkamp
%F pmlr-v235-brown-cohen24a
%I PMLR
%P 4585--4602
%U https://proceedings.mlr.press/v235/brown-cohen24a.html
%V 235
%X The emergence of pre-trained AI systems with powerful capabilities across a diverse and ever-increasing set of complex domains has raised a critical challenge for AI safety, as tasks can become too complicated for humans to judge directly. Irving et al. (2018) proposed a debate method for this problem, with the goal of pitting the power of such AI models against each other until the problem of identifying (mis)alignment is broken down into a manageable subtask. While the promise of this approach is clear, the original framework was based on the assumption that the honest strategy is able to simulate deterministic AI systems for an exponential number of steps, limiting its applicability. In this paper, we show how to address these challenges by designing a new set of debate protocols where the honest strategy can always succeed using a simulation of a polynomial number of steps, whilst being able to verify the alignment of stochastic AI systems, even when the dishonest strategy is allowed to use exponentially many simulation steps.
APA
Brown-Cohen, J., Irving, G. & Piliouras, G. (2024). Scalable AI Safety via Doubly-Efficient Debate. Proceedings of the 41st International Conference on Machine Learning, in Proceedings of Machine Learning Research 235:4585-4602. Available from https://proceedings.mlr.press/v235/brown-cohen24a.html.