BanditQ:Fair Bandits with Guaranteed Rewards

Abhishek Sinha
Proceedings of the Fortieth Conference on Uncertainty in Artificial Intelligence, PMLR 244:3227-3244, 2024.

Abstract

Classic no-regret multi-armed bandit algorithms, including the Upper Confidence Bound (UCB), Hedge, and EXP3, are inherently unfair by design. Their unfairness stems from their objective of playing the most rewarding arm as frequently as possible while ignoring the rest. In this paper, we consider a fair prediction problem in the stochastic setting with a guaranteed minimum rate of accrual of rewards for each arm. We study the problem in both full-information and bandit feedback settings. Combining queueing-theoretic techniques with adversarial bandits, we propose a new online policy called BanditQ that achieves the target reward rates while conceding a regret and target rate violation penalty of at most $O(T^{\frac{3}{4}}).$ The regret bound in the full-information setting can be further improved to $O(\sqrt{T})$ under either a monotonicity assumption or when considering time-averaged regret. The proposed policy is efficient and admits a black-box reduction from the fair prediction problem to the standard adversarial MAB problem. The analysis of the BanditQ policy involves a new self-bounding inequality, which might be of independent interest.

Cite this Paper


BibTeX
@InProceedings{pmlr-v244-sinha24a, title = {BanditQ:Fair Bandits with Guaranteed Rewards}, author = {Sinha, Abhishek}, booktitle = {Proceedings of the Fortieth Conference on Uncertainty in Artificial Intelligence}, pages = {3227--3244}, year = {2024}, editor = {Kiyavash, Negar and Mooij, Joris M.}, volume = {244}, series = {Proceedings of Machine Learning Research}, month = {15--19 Jul}, publisher = {PMLR}, pdf = {https://raw.githubusercontent.com/mlresearch/v244/main/assets/sinha24a/sinha24a.pdf}, url = {https://proceedings.mlr.press/v244/sinha24a.html}, abstract = {Classic no-regret multi-armed bandit algorithms, including the Upper Confidence Bound (UCB), Hedge, and EXP3, are inherently unfair by design. Their unfairness stems from their objective of playing the most rewarding arm as frequently as possible while ignoring the rest. In this paper, we consider a fair prediction problem in the stochastic setting with a guaranteed minimum rate of accrual of rewards for each arm. We study the problem in both full-information and bandit feedback settings. Combining queueing-theoretic techniques with adversarial bandits, we propose a new online policy called BanditQ that achieves the target reward rates while conceding a regret and target rate violation penalty of at most $O(T^{\frac{3}{4}}).$ The regret bound in the full-information setting can be further improved to $O(\sqrt{T})$ under either a monotonicity assumption or when considering time-averaged regret. The proposed policy is efficient and admits a black-box reduction from the fair prediction problem to the standard adversarial MAB problem. The analysis of the BanditQ policy involves a new self-bounding inequality, which might be of independent interest.} }
Endnote
%0 Conference Paper %T BanditQ:Fair Bandits with Guaranteed Rewards %A Abhishek Sinha %B Proceedings of the Fortieth Conference on Uncertainty in Artificial Intelligence %C Proceedings of Machine Learning Research %D 2024 %E Negar Kiyavash %E Joris M. Mooij %F pmlr-v244-sinha24a %I PMLR %P 3227--3244 %U https://proceedings.mlr.press/v244/sinha24a.html %V 244 %X Classic no-regret multi-armed bandit algorithms, including the Upper Confidence Bound (UCB), Hedge, and EXP3, are inherently unfair by design. Their unfairness stems from their objective of playing the most rewarding arm as frequently as possible while ignoring the rest. In this paper, we consider a fair prediction problem in the stochastic setting with a guaranteed minimum rate of accrual of rewards for each arm. We study the problem in both full-information and bandit feedback settings. Combining queueing-theoretic techniques with adversarial bandits, we propose a new online policy called BanditQ that achieves the target reward rates while conceding a regret and target rate violation penalty of at most $O(T^{\frac{3}{4}}).$ The regret bound in the full-information setting can be further improved to $O(\sqrt{T})$ under either a monotonicity assumption or when considering time-averaged regret. The proposed policy is efficient and admits a black-box reduction from the fair prediction problem to the standard adversarial MAB problem. The analysis of the BanditQ policy involves a new self-bounding inequality, which might be of independent interest.
APA
Sinha, A.. (2024). BanditQ:Fair Bandits with Guaranteed Rewards. Proceedings of the Fortieth Conference on Uncertainty in Artificial Intelligence, in Proceedings of Machine Learning Research 244:3227-3244 Available from https://proceedings.mlr.press/v244/sinha24a.html.

Related Material