To Steer or Not to Steer? Mechanistic Error Reduction with Abstention for Language Models

Anna Hedström, Salim I. Amoukou, Tom Bewley, Saumitra Mishra, Manuela Veloso
Proceedings of the 42nd International Conference on Machine Learning, PMLR 267:22924-22945, 2025.

Abstract

We introduce Mechanistic Error Reduction with Abstention (MERA), a principled framework for steering language models (LMs) to mitigate errors through selective, adaptive interventions. Existing methods rely on fixed, manually tuned steering strengths, which often results in under- or oversteering. MERA addresses these limitations by (i) optimising the intervention direction, and (ii) calibrating when and how much to steer, thereby provably improving performance or abstaining when no confident correction is possible. Experiments across diverse datasets and LM families demonstrate safe, effective, non-degrading error correction and show that MERA outperforms existing baselines. Moreover, MERA can be applied on top of existing steering techniques to further enhance their performance, establishing it as a general-purpose and efficient approach to mechanistic activation steering.

Cite this Paper


BibTeX
@InProceedings{pmlr-v267-hedstrom25a,
  title     = {To Steer or Not to Steer? {M}echanistic Error Reduction with Abstention for Language Models},
  author    = {Hedstr\"{o}m, Anna and Amoukou, Salim I. and Bewley, Tom and Mishra, Saumitra and Veloso, Manuela},
  booktitle = {Proceedings of the 42nd International Conference on Machine Learning},
  pages     = {22924--22945},
  year      = {2025},
  editor    = {Singh, Aarti and Fazel, Maryam and Hsu, Daniel and Lacoste-Julien, Simon and Berkenkamp, Felix and Maharaj, Tegan and Wagstaff, Kiri and Zhu, Jerry},
  volume    = {267},
  series    = {Proceedings of Machine Learning Research},
  month     = {13--19 Jul},
  publisher = {PMLR},
  pdf       = {https://raw.githubusercontent.com/mlresearch/v267/main/assets/hedstrom25a/hedstrom25a.pdf},
  url       = {https://proceedings.mlr.press/v267/hedstrom25a.html},
  abstract  = {We introduce Mechanistic Error Reduction with Abstention (MERA), a principled framework for steering language models (LMs) to mitigate errors through selective, adaptive interventions. Unlike existing methods that rely on fixed, manually tuned steering strengths, often resulting in under or oversteering, MERA addresses these limitations by (i) optimising the intervention direction, and (ii) calibrating when and how much to steer, thereby provably improving performance or abstaining when no confident correction is possible. Experiments across diverse datasets and LM families demonstrate safe, effective, non-degrading error correction and that MERA outperforms existing baselines. Moreover, MERA can be applied on top of existing steering techniques to further enhance their performance, establishing it as a general-purpose and efficient approach to mechanistic activation steering.}
}
Endnote
%0 Conference Paper
%T To Steer or Not to Steer? Mechanistic Error Reduction with Abstention for Language Models
%A Anna Hedström
%A Salim I. Amoukou
%A Tom Bewley
%A Saumitra Mishra
%A Manuela Veloso
%B Proceedings of the 42nd International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2025
%E Aarti Singh
%E Maryam Fazel
%E Daniel Hsu
%E Simon Lacoste-Julien
%E Felix Berkenkamp
%E Tegan Maharaj
%E Kiri Wagstaff
%E Jerry Zhu
%F pmlr-v267-hedstrom25a
%I PMLR
%P 22924--22945
%U https://proceedings.mlr.press/v267/hedstrom25a.html
%V 267
%X We introduce Mechanistic Error Reduction with Abstention (MERA), a principled framework for steering language models (LMs) to mitigate errors through selective, adaptive interventions. Unlike existing methods that rely on fixed, manually tuned steering strengths, often resulting in under or oversteering, MERA addresses these limitations by (i) optimising the intervention direction, and (ii) calibrating when and how much to steer, thereby provably improving performance or abstaining when no confident correction is possible. Experiments across diverse datasets and LM families demonstrate safe, effective, non-degrading error correction and that MERA outperforms existing baselines. Moreover, MERA can be applied on top of existing steering techniques to further enhance their performance, establishing it as a general-purpose and efficient approach to mechanistic activation steering.
APA
Hedström, A., Amoukou, S. I., Bewley, T., Mishra, S., & Veloso, M. (2025). To Steer or Not to Steer? Mechanistic Error Reduction with Abstention for Language Models. Proceedings of the 42nd International Conference on Machine Learning, in Proceedings of Machine Learning Research 267:22924-22945. Available from https://proceedings.mlr.press/v267/hedstrom25a.html.
