Adversarial Robustness in Two-Stage Learning-to-Defer: Algorithms and Guarantees

Yannis Montreuil, Axel Carlier, Lai Xing Ng, Wei Tsang Ooi
Proceedings of the 42nd International Conference on Machine Learning, PMLR 267:44699-44725, 2025.

Abstract

Two-stage Learning-to-Defer (L2D) enables optimal task delegation by assigning each input to either a fixed main model or one of several offline experts, supporting reliable decision-making in complex, multi-agent environments. However, existing L2D frameworks assume clean inputs and are vulnerable to adversarial perturbations that can manipulate query allocation—causing costly misrouting or expert overload. We present the first comprehensive study of adversarial robustness in two-stage L2D systems. We introduce two novel attack strategies—untargeted and targeted—which respectively disrupt optimal allocations or force queries to specific agents. To defend against such threats, we propose SARD, a convex learning algorithm built on a family of surrogate losses that are provably Bayes-consistent and $(\mathcal{R}, \mathcal{G})$-consistent. These guarantees hold across classification, regression, and multi-task settings. Empirical results demonstrate that SARD significantly improves robustness under adversarial attacks while maintaining strong clean performance, marking a critical step toward secure and trustworthy L2D deployment.
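To make the deferral mechanism described above concrete, the following is a minimal, hypothetical sketch (not taken from the paper) of how a two-stage allocation rule might route a single query among a fixed main model and several offline experts. The names allocate, rejector_scores, and consultation_costs are illustrative assumptions, and the sketch does not implement SARD or its surrogate losses.

import numpy as np

# Illustrative two-stage L2D allocation: agent 0 is the fixed main model,
# agents 1..J are offline experts. The rejector scores each agent and the
# query is routed to the agent with the best cost-adjusted score.
def allocate(rejector_scores: np.ndarray, consultation_costs: np.ndarray) -> int:
    assert rejector_scores.shape == consultation_costs.shape
    return int(np.argmax(rejector_scores - consultation_costs))

# Example with one main model and two experts: querying an expert incurs a cost.
scores = np.array([0.62, 0.55, 0.71])   # rejector scores per agent
costs = np.array([0.00, 0.10, 0.25])    # consultation costs per agent
print(allocate(scores, costs))          # -> 0: keep the query on the main model

An adversarial perturbation of the input that shifts these scores can redirect the query away from the optimal agent (untargeted attack) or toward a chosen agent (targeted attack), which is the threat model the paper studies.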

Cite this Paper


BibTeX
@InProceedings{pmlr-v267-montreuil25a,
  title     = {Adversarial Robustness in Two-Stage Learning-to-Defer: Algorithms and Guarantees},
  author    = {Montreuil, Yannis and Carlier, Axel and Ng, Lai Xing and Ooi, Wei Tsang},
  booktitle = {Proceedings of the 42nd International Conference on Machine Learning},
  pages     = {44699--44725},
  year      = {2025},
  editor    = {Singh, Aarti and Fazel, Maryam and Hsu, Daniel and Lacoste-Julien, Simon and Berkenkamp, Felix and Maharaj, Tegan and Wagstaff, Kiri and Zhu, Jerry},
  volume    = {267},
  series    = {Proceedings of Machine Learning Research},
  month     = {13--19 Jul},
  publisher = {PMLR},
  pdf       = {https://raw.githubusercontent.com/mlresearch/v267/main/assets/montreuil25a/montreuil25a.pdf},
  url       = {https://proceedings.mlr.press/v267/montreuil25a.html},
  abstract  = {Two-stage Learning-to-Defer (L2D) enables optimal task delegation by assigning each input to either a fixed main model or one of several offline experts, supporting reliable decision-making in complex, multi-agent environments. However, existing L2D frameworks assume clean inputs and are vulnerable to adversarial perturbations that can manipulate query allocation—causing costly misrouting or expert overload. We present the first comprehensive study of adversarial robustness in two-stage L2D systems. We introduce two novel attack strategies—untargeted and targeted—which respectively disrupt optimal allocations or force queries to specific agents. To defend against such threats, we propose SARD, a convex learning algorithm built on a family of surrogate losses that are provably Bayes-consistent and $(\mathcal{R}, \mathcal{G})$-consistent. These guarantees hold across classification, regression, and multi-task settings. Empirical results demonstrate that SARD significantly improves robustness under adversarial attacks while maintaining strong clean performance, marking a critical step toward secure and trustworthy L2D deployment.}
}
Endnote
%0 Conference Paper
%T Adversarial Robustness in Two-Stage Learning-to-Defer: Algorithms and Guarantees
%A Yannis Montreuil
%A Axel Carlier
%A Lai Xing Ng
%A Wei Tsang Ooi
%B Proceedings of the 42nd International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2025
%E Aarti Singh
%E Maryam Fazel
%E Daniel Hsu
%E Simon Lacoste-Julien
%E Felix Berkenkamp
%E Tegan Maharaj
%E Kiri Wagstaff
%E Jerry Zhu
%F pmlr-v267-montreuil25a
%I PMLR
%P 44699--44725
%U https://proceedings.mlr.press/v267/montreuil25a.html
%V 267
%X Two-stage Learning-to-Defer (L2D) enables optimal task delegation by assigning each input to either a fixed main model or one of several offline experts, supporting reliable decision-making in complex, multi-agent environments. However, existing L2D frameworks assume clean inputs and are vulnerable to adversarial perturbations that can manipulate query allocation—causing costly misrouting or expert overload. We present the first comprehensive study of adversarial robustness in two-stage L2D systems. We introduce two novel attack strategies—untargeted and targeted—which respectively disrupt optimal allocations or force queries to specific agents. To defend against such threats, we propose SARD, a convex learning algorithm built on a family of surrogate losses that are provably Bayes-consistent and $(\mathcal{R}, \mathcal{G})$-consistent. These guarantees hold across classification, regression, and multi-task settings. Empirical results demonstrate that SARD significantly improves robustness under adversarial attacks while maintaining strong clean performance, marking a critical step toward secure and trustworthy L2D deployment.
APA
Montreuil, Y., Carlier, A., Ng, L.X. & Ooi, W.T. (2025). Adversarial Robustness in Two-Stage Learning-to-Defer: Algorithms and Guarantees. Proceedings of the 42nd International Conference on Machine Learning, in Proceedings of Machine Learning Research 267:44699-44725. Available from https://proceedings.mlr.press/v267/montreuil25a.html.