Discovering Bias in Latent Space: An Unsupervised Debiasing Approach

Dyah Adila, Shuai Zhang, Boran Han, Bernie Wang
Proceedings of the 41st International Conference on Machine Learning, PMLR 235:246-261, 2024.

Abstract

The question-answering (QA) capabilities of foundation models are highly sensitive to prompt variations, rendering their performance susceptible to superficial, non-meaning-altering changes. This vulnerability often stems from the model’s preference or bias towards specific input characteristics, such as option position or superficial image features in multi-modal settings. We propose to rectify this bias directly in the model’s internal representation. Our approach, SteerFair, finds the bias direction in the model’s representation space and steers activation values away from it during inference. Specifically, we exploit the observation that bias often adheres to simple association rules, such as the spurious association between the first option and correctness likelihood. Next, we construct demonstrations of these rules from unlabeled samples and use them to identify the bias directions. We empirically show that SteerFair significantly reduces instruction-tuned model performance variance across prompt modifications on three benchmark tasks. Remarkably, our approach surpasses a supervised baseline with 100 labels by an average of 10.86% accuracy points and 12.95 score points and matches the performance with 500 labels.
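For intuition, here is a minimal sketch of the idea the abstract describes, not the authors' released implementation: from unlabeled prompts, build paired demonstrations that satisfy and violate a bias rule (e.g., the correct answer placed in the first option slot), estimate the bias direction from the principal direction of their activation differences, and project that component out of activations at inference. The function names, the SVD-based direction estimate, the single-layer framing, and the steering strength alpha are all illustrative assumptions.

import numpy as np

def find_bias_direction(acts_rule, acts_base):
    # acts_rule: activations for prompts arranged to satisfy the
    # association rule (e.g., answer placed first); acts_base:
    # activations for the same prompts with the rule broken.
    # Both are (n_samples, hidden_dim) arrays from one layer.
    diffs = acts_rule - acts_base
    diffs = diffs - diffs.mean(axis=0, keepdims=True)
    # The top principal component of the centered differences
    # approximates the bias direction in representation space.
    _, _, vt = np.linalg.svd(diffs, full_matrices=False)
    return vt[0]  # unit-norm bias direction

def steer_away(activation, bias_dir, alpha=1.0):
    # Remove (or attenuate, via alpha) the component of a single
    # activation vector that lies along the bias direction.
    proj = activation @ bias_dir
    return activation - alpha * proj * bias_dir

In practice, a procedure like this would be applied per layer (or per attention head) during the forward pass, with the steering strength chosen on unlabeled data; this sketch only conveys the geometry of identifying a bias direction and steering activations away from it.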

Cite this Paper

BibTeX
@InProceedings{pmlr-v235-adila24a,
  title     = {Discovering Bias in Latent Space: An Unsupervised Debiasing Approach},
  author    = {Adila, Dyah and Zhang, Shuai and Han, Boran and Wang, Bernie},
  booktitle = {Proceedings of the 41st International Conference on Machine Learning},
  pages     = {246--261},
  year      = {2024},
  editor    = {Salakhutdinov, Ruslan and Kolter, Zico and Heller, Katherine and Weller, Adrian and Oliver, Nuria and Scarlett, Jonathan and Berkenkamp, Felix},
  volume    = {235},
  series    = {Proceedings of Machine Learning Research},
  month     = {21--27 Jul},
  publisher = {PMLR},
  pdf       = {https://raw.githubusercontent.com/mlresearch/v235/main/assets/adila24a/adila24a.pdf},
  url       = {https://proceedings.mlr.press/v235/adila24a.html},
  abstract  = {The question-answering (QA) capabilities of foundation models are highly sensitive to prompt variations, rendering their performance susceptible to superficial, non-meaning-altering changes. This vulnerability often stems from the model's preference or bias towards specific input characteristics, such as option position or superficial image features in multi-modal settings. We propose to rectify this bias directly in the model's internal representation. Our approach, SteerFair, finds the bias direction in the model's representation space and steers activation values away from it during inference. Specifically, we exploit the observation that bias often adheres to simple association rules, such as the spurious association between the first option and correctness likelihood. Next, we construct demonstrations of these rules from unlabeled samples and use them to identify the bias directions. We empirically show that SteerFair significantly reduces instruction-tuned model performance variance across prompt modifications on three benchmark tasks. Remarkably, our approach surpasses a supervised baseline with 100 labels by an average of 10.86% accuracy points and 12.95 score points and matches the performance with 500 labels.}
}
Endnote
%0 Conference Paper
%T Discovering Bias in Latent Space: An Unsupervised Debiasing Approach
%A Dyah Adila
%A Shuai Zhang
%A Boran Han
%A Bernie Wang
%B Proceedings of the 41st International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2024
%E Ruslan Salakhutdinov
%E Zico Kolter
%E Katherine Heller
%E Adrian Weller
%E Nuria Oliver
%E Jonathan Scarlett
%E Felix Berkenkamp
%F pmlr-v235-adila24a
%I PMLR
%P 246--261
%U https://proceedings.mlr.press/v235/adila24a.html
%V 235
%X The question-answering (QA) capabilities of foundation models are highly sensitive to prompt variations, rendering their performance susceptible to superficial, non-meaning-altering changes. This vulnerability often stems from the model's preference or bias towards specific input characteristics, such as option position or superficial image features in multi-modal settings. We propose to rectify this bias directly in the model's internal representation. Our approach, SteerFair, finds the bias direction in the model's representation space and steers activation values away from it during inference. Specifically, we exploit the observation that bias often adheres to simple association rules, such as the spurious association between the first option and correctness likelihood. Next, we construct demonstrations of these rules from unlabeled samples and use them to identify the bias directions. We empirically show that SteerFair significantly reduces instruction-tuned model performance variance across prompt modifications on three benchmark tasks. Remarkably, our approach surpasses a supervised baseline with 100 labels by an average of 10.86% accuracy points and 12.95 score points and matches the performance with 500 labels.
APA
Adila, D., Zhang, S., Han, B. & Wang, B. (2024). Discovering Bias in Latent Space: An Unsupervised Debiasing Approach. Proceedings of the 41st International Conference on Machine Learning, in Proceedings of Machine Learning Research 235:246-261. Available from https://proceedings.mlr.press/v235/adila24a.html.