Opportunistic Strategies for Generalized No-Regret Problems

Andrey Bernstein; Shie Mannor; Nahum Shimkin

Opportunistic Strategies for Generalized No-Regret Problems

Andrey Bernstein, Shie Mannor, Nahum Shimkin

Proceedings of the 26th Annual Conference on Learning Theory, PMLR 30:158-171, 2013.

Abstract

No-regret algorithms has played a key role in on-line learning and prediction problems. In this paper, we focus on a generalized no-regret problem with vector-valued rewards, defined in terms of a desired reward set of the agent. For each mixed action q of the opponent, the agent has a set R(q) where the average reward should reside. In addition, the agent has a response mixed action p which brings the expected reward under these two actions, r(p, q), to R(q). If a strategy of the agent ensures that the average reward converges to R(\barq_n), where \barq_n is the empirical distribution of the opponent’s actions, for any strategy of the opponent, we say that it is a no-regret strategy with respect to R(q). The standard no-regret problem is obtained as a special case for scalar rewards and R(q) = r ∈R: r ≥r(q), where r(q) = \max_p r(p, q). For this problem, the multifunction R(q) is convex, and no-regret strategies can be devised. Our main interest in this paper is in cases where this convexity property does not hold. The best that can be guaranteed in general then is the convergence of the average reward to R^c(\barq_n), the convex hull of R(\barq_n). However, as the game unfolds, it may turn out that the opponent’s choices of actions are limited in some way. If these restrictions were known in advance, the agent could possibly ensure convergence of the average reward to some desired subset of R^c(\barq_n), or even approach R(\barq_n) itself. We formulate appropriate goals for opportunistic no-regret strategies, in the sense that they may exploit such limitations on the opponent’s action sequence in an on-line manner, without knowing them beforehand. As the main technical tool, we propose a class of approachability algorithms that rely on a calibrated forecast of the opponent’s actions, which are opportunistic in the above mentioned sense. As an application, we consider the online no-regret problem with average cost constraints, introduced in Mannor, Tsitsiklis, and Yu (2009), where the best-response-in-hindsight is not generally attainable, but only its convex relaxation. Our proposed algorithm applied to this problem does attain the best-response-in-hindsight if the opponent’s play happens to be stationary (either in terms of its mixed actions, or the empirical frequencies of its pure actions).

Cite this Paper

BibTeX


@InProceedings{pmlr-v30-Bernstein13,
  title = 	 {Opportunistic Strategies for Generalized No-Regret Problems},
  author = 	 {Bernstein, Andrey and Mannor, Shie and Shimkin, Nahum},
  booktitle = 	 {Proceedings of the 26th Annual Conference on Learning Theory},
  pages = 	 {158--171},
  year = 	 {2013},
  editor = 	 {Shalev-Shwartz, Shai and Steinwart, Ingo},
  volume = 	 {30},
  series = 	 {Proceedings of Machine Learning Research},
  address = 	 {Princeton, NJ, USA},
  month = 	 {12--14 Jun},
  publisher =    {PMLR},
  pdf = 	 {http://proceedings.mlr.press/v30/Bernstein13.pdf},
  url = 	 {https://proceedings.mlr.press/v30/Bernstein13.html},
  abstract = 	 {No-regret algorithms has played a key role in on-line learning and prediction problems. In this paper, we focus on a generalized no-regret problem with vector-valued rewards, defined in terms of a desired reward set of the agent. For each mixed action q of the opponent, the agent has a set R(q) where the average reward should reside. In addition, the agent has a response mixed action p which brings the expected reward under these two actions, r(p, q), to R(q). If a strategy of the agent ensures that the average reward converges to R(\barq_n), where \barq_n is the empirical distribution of the opponent’s actions, for any strategy of the opponent, we say that it is a no-regret strategy with respect to R(q). The standard no-regret problem is obtained as a special case for scalar rewards and R(q) = r ∈R: r ≥r(q), where r(q) = \max_p r(p, q). For this problem, the multifunction R(q) is convex, and no-regret strategies can be devised. Our main interest in this paper is in cases where this convexity property does not hold. The best that can be guaranteed in general then is the convergence of the average reward to R^c(\barq_n), the convex hull of R(\barq_n). However, as the game unfolds, it may turn out that the opponent’s choices of actions are limited in some way. If  these restrictions were known in advance, the agent could possibly ensure convergence of the average reward to some desired subset of R^c(\barq_n), or even approach R(\barq_n) itself. We formulate appropriate goals for opportunistic no-regret strategies, in the sense that they may exploit such limitations on the opponent’s action sequence in an on-line manner, without knowing them beforehand. As the main technical tool, we propose a class of approachability algorithms that rely on a calibrated forecast of the opponent’s actions, which are opportunistic in the above mentioned sense. As an application, we consider the online no-regret problem with average cost constraints, introduced in Mannor, Tsitsiklis, and Yu (2009), where the best-response-in-hindsight is not generally attainable, but only its convex relaxation. Our proposed algorithm applied to this problem does attain the best-response-in-hindsight if the opponent’s play happens to be stationary (either in terms of its mixed actions, or the empirical frequencies of its pure actions).}
}

Endnote

%0 Conference Paper
%T Opportunistic Strategies for Generalized No-Regret Problems
%A Andrey Bernstein
%A Shie Mannor
%A Nahum Shimkin
%B Proceedings of the 26th Annual Conference on Learning Theory
%C Proceedings of Machine Learning Research
%D 2013
%E Shai Shalev-Shwartz
%E Ingo Steinwart	
%F pmlr-v30-Bernstein13
%I PMLR
%P 158--171
%U https://proceedings.mlr.press/v30/Bernstein13.html
%V 30
%X No-regret algorithms has played a key role in on-line learning and prediction problems. In this paper, we focus on a generalized no-regret problem with vector-valued rewards, defined in terms of a desired reward set of the agent. For each mixed action q of the opponent, the agent has a set R(q) where the average reward should reside. In addition, the agent has a response mixed action p which brings the expected reward under these two actions, r(p, q), to R(q). If a strategy of the agent ensures that the average reward converges to R(\barq_n), where \barq_n is the empirical distribution of the opponent’s actions, for any strategy of the opponent, we say that it is a no-regret strategy with respect to R(q). The standard no-regret problem is obtained as a special case for scalar rewards and R(q) = r ∈R: r ≥r(q), where r(q) = \max_p r(p, q). For this problem, the multifunction R(q) is convex, and no-regret strategies can be devised. Our main interest in this paper is in cases where this convexity property does not hold. The best that can be guaranteed in general then is the convergence of the average reward to R^c(\barq_n), the convex hull of R(\barq_n). However, as the game unfolds, it may turn out that the opponent’s choices of actions are limited in some way. If  these restrictions were known in advance, the agent could possibly ensure convergence of the average reward to some desired subset of R^c(\barq_n), or even approach R(\barq_n) itself. We formulate appropriate goals for opportunistic no-regret strategies, in the sense that they may exploit such limitations on the opponent’s action sequence in an on-line manner, without knowing them beforehand. As the main technical tool, we propose a class of approachability algorithms that rely on a calibrated forecast of the opponent’s actions, which are opportunistic in the above mentioned sense. As an application, we consider the online no-regret problem with average cost constraints, introduced in Mannor, Tsitsiklis, and Yu (2009), where the best-response-in-hindsight is not generally attainable, but only its convex relaxation. Our proposed algorithm applied to this problem does attain the best-response-in-hindsight if the opponent’s play happens to be stationary (either in terms of its mixed actions, or the empirical frequencies of its pure actions).

RIS


TY  - CPAPER
TI  - Opportunistic Strategies for Generalized No-Regret Problems
AU  - Andrey Bernstein
AU  - Shie Mannor
AU  - Nahum Shimkin
BT  - Proceedings of the 26th Annual Conference on Learning Theory
DA  - 2013/06/13
ED  - Shai Shalev-Shwartz
ED  - Ingo Steinwart	
ID  - pmlr-v30-Bernstein13
PB  - PMLR
DP  - Proceedings of Machine Learning Research
VL  - 30
SP  - 158
EP  - 171
L1  - http://proceedings.mlr.press/v30/Bernstein13.pdf
UR  - https://proceedings.mlr.press/v30/Bernstein13.html
AB  - No-regret algorithms has played a key role in on-line learning and prediction problems. In this paper, we focus on a generalized no-regret problem with vector-valued rewards, defined in terms of a desired reward set of the agent. For each mixed action q of the opponent, the agent has a set R(q) where the average reward should reside. In addition, the agent has a response mixed action p which brings the expected reward under these two actions, r(p, q), to R(q). If a strategy of the agent ensures that the average reward converges to R(\barq_n), where \barq_n is the empirical distribution of the opponent’s actions, for any strategy of the opponent, we say that it is a no-regret strategy with respect to R(q). The standard no-regret problem is obtained as a special case for scalar rewards and R(q) = r ∈R: r ≥r(q), where r(q) = \max_p r(p, q). For this problem, the multifunction R(q) is convex, and no-regret strategies can be devised. Our main interest in this paper is in cases where this convexity property does not hold. The best that can be guaranteed in general then is the convergence of the average reward to R^c(\barq_n), the convex hull of R(\barq_n). However, as the game unfolds, it may turn out that the opponent’s choices of actions are limited in some way. If  these restrictions were known in advance, the agent could possibly ensure convergence of the average reward to some desired subset of R^c(\barq_n), or even approach R(\barq_n) itself. We formulate appropriate goals for opportunistic no-regret strategies, in the sense that they may exploit such limitations on the opponent’s action sequence in an on-line manner, without knowing them beforehand. As the main technical tool, we propose a class of approachability algorithms that rely on a calibrated forecast of the opponent’s actions, which are opportunistic in the above mentioned sense. As an application, we consider the online no-regret problem with average cost constraints, introduced in Mannor, Tsitsiklis, and Yu (2009), where the best-response-in-hindsight is not generally attainable, but only its convex relaxation. Our proposed algorithm applied to this problem does attain the best-response-in-hindsight if the opponent’s play happens to be stationary (either in terms of its mixed actions, or the empirical frequencies of its pure actions).
ER  -

APA


Bernstein, A., Mannor, S. & Shimkin, N.. (2013). Opportunistic Strategies for Generalized No-Regret Problems. Proceedings of the 26th Annual Conference on Learning Theory, in Proceedings of Machine Learning Research 30:158-171 Available from https://proceedings.mlr.press/v30/Bernstein13.html.

Related Material

Download PDF