Learning Distribution-wise Control in Representation Space for Language Models

Chunyuan Deng, Ruidi Chang, Hanjie Chen
Proceedings of the 42nd International Conference on Machine Learning, PMLR 267:13044-13068, 2025.

Abstract

Interventions in language models (LMs) are applied strategically to steer model behavior during the forward pass. Learnable interventions, also known as representation fine-tuning, apply pointwise control within a concept subspace and have proven effective at altering high-level behaviors. In this work, we extend this approach to the distribution level, enabling the model to learn not only pointwise transformations but also the surrounding regions of the concept subspace. We demonstrate that these methods perform well in early layers, with larger standard deviations correlating strongly with improved performance. Across eight commonsense reasoning and seven arithmetic reasoning benchmarks, our distribution-wise interventions consistently outperform pointwise interventions in controllability and robustness. These results show that distribution-wise interventions offer a more comprehensive way to steer model behavior and enable finer-grained control over language models. Code is available at https://github.com/chili-lab/D-Intervention.
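
The abstract contrasts pointwise interventions (LoReFT-style edits in a low-rank concept subspace) with the paper's distribution-wise extension. Below is a minimal PyTorch sketch of that contrast. It is an illustrative reconstruction, not the authors' implementation: the class names, the Gaussian parameterization with a learned per-dimension log-std, and the reparameterized sampling are all assumptions about how "learning the surrounding regions" of the subspace might be realized; see the repository linked above for the actual code.

```python
# Minimal sketch contrasting a pointwise intervention with a hypothetical
# distribution-wise variant. Illustrative only -- not the authors' code.
import torch
import torch.nn as nn


class PointwiseIntervention(nn.Module):
    """LoReFT-style edit h' = h + R^T (W h + b - R h) in a rank-r subspace.

    The rows of R span the concept subspace (real implementations typically
    constrain them to be orthonormal); W, b define the learned pointwise target.
    """

    def __init__(self, d_model: int, rank: int):
        super().__init__()
        self.R = nn.Parameter(torch.randn(rank, d_model) / d_model ** 0.5)
        self.target = nn.Linear(d_model, rank)  # computes W h + b

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        # Replace the subspace coordinates R h with the learned target,
        # leaving the orthogonal complement of the subspace untouched.
        return h + (self.target(h) - h @ self.R.T) @ self.R


class DistributionWiseIntervention(PointwiseIntervention):
    """Hypothetical distribution-wise variant: the subspace target is sampled
    from a learned Gaussian via the reparameterization trick, so training also
    covers a region around the pointwise target rather than a single point.
    """

    def __init__(self, d_model: int, rank: int):
        super().__init__(d_model, rank)
        self.log_std = nn.Parameter(torch.zeros(rank))  # learned per-dim std

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        target = self.target(h)
        if self.training:  # sample during training; use the mean at eval time
            target = target + torch.randn_like(target) * self.log_std.exp()
        return h + (target - h @ self.R.T) @ self.R


# Usage: hook into a chosen layer's hidden states (shapes are assumptions).
h = torch.randn(4, 4096)  # batch of hidden states, d_model = 4096
iv = DistributionWiseIntervention(d_model=4096, rank=8)
h_edited = iv(h)          # same shape, edited within the concept subspace
```

In this sketch only R, W, b, and log_std would be trained while the base LM stays frozen, which is what makes such interventions a form of representation fine-tuning rather than weight fine-tuning.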

Cite this Paper


BibTeX
@InProceedings{pmlr-v267-deng25a,
  title     = {Learning Distribution-wise Control in Representation Space for Language Models},
  author    = {Deng, Chunyuan and Chang, Ruidi and Chen, Hanjie},
  booktitle = {Proceedings of the 42nd International Conference on Machine Learning},
  pages     = {13044--13068},
  year      = {2025},
  editor    = {Singh, Aarti and Fazel, Maryam and Hsu, Daniel and Lacoste-Julien, Simon and Berkenkamp, Felix and Maharaj, Tegan and Wagstaff, Kiri and Zhu, Jerry},
  volume    = {267},
  series    = {Proceedings of Machine Learning Research},
  month     = {13--19 Jul},
  publisher = {PMLR},
  pdf       = {https://raw.githubusercontent.com/mlresearch/v267/main/assets/deng25a/deng25a.pdf},
  url       = {https://proceedings.mlr.press/v267/deng25a.html}
}
APA
Deng, C., Chang, R., & Chen, H. (2025). Learning Distribution-wise Control in Representation Space for Language Models. Proceedings of the 42nd International Conference on Machine Learning, in Proceedings of Machine Learning Research 267:13044-13068. Available from https://proceedings.mlr.press/v267/deng25a.html.
