Follow the Leader with Dropout Perturbations

Tim Van Erven; Wojciech Kotłowski; Manfred K. Warmuth

Follow the Leader with Dropout Perturbations

Tim Van Erven, Wojciech Kotłowski, Manfred K. Warmuth

Proceedings of The 27th Conference on Learning Theory, PMLR 35:949-974, 2014.

Abstract

We consider online prediction with expert advice. Over the course of many trials, the goal of the learning algorithm is to achieve small additional loss (i.e. regret) compared to the loss of the best from a set of K experts. The two most popular algorithms are Hedge/Weighted Majority and Follow the Perturbed Leader (FPL). The latter algorithm first perturbs the loss of each expert by independent additive noise drawn from a fixed distribution, and then predicts with the expert of minimum perturbed loss (“the leader”) where ties are broken uniformly at random. To achieve the optimal worst-case regret as a function of the loss L^* of the best expert in hindsight, the two types of algorithms need to tune their learning rate or noise magnitude, respectively, as a function of L^*. Instead of perturbing the losses of the experts with additive noise, we randomly set them to 0 or 1 before selecting the leader. We show that our perturbations are an instance of dropout — because experts may be interpreted as features — although for non-binary losses the dropout probability needs to be made dependent on the losses to get good regret bounds. We show that this simple, tuning-free version of the FPL algorithm achieves two feats: optimal worst-case O(\sqrtL^* \ln K + \ln K) regret as a function of L^*, and optimal O(\ln K) regret when the loss vectors are drawn i.i.d. from a fixed distribution and there is a gap between the expected loss of the best expert and all others. A number of recent algorithms from the Hedge family (AdaHedge and FlipFlop) also achieve this, but they employ sophisticated tuning regimes. The dropout perturbation of the losses of the experts result in different noise distributions for each expert (because they depend on the expert’s total loss) and curiously enough no additional tuning is needed: the choice of dropout probability only affects the constants.

Cite this Paper

BibTeX


@InProceedings{pmlr-v35-vanerven14,
  title = 	 {Follow the Leader with Dropout Perturbations},
  author = 	 {Van Erven, Tim and Kotłowski, Wojciech and Warmuth, Manfred K.},
  booktitle = 	 {Proceedings of The 27th Conference on Learning Theory},
  pages = 	 {949--974},
  year = 	 {2014},
  editor = 	 {Balcan, Maria Florina and Feldman, Vitaly and Szepesvári, Csaba},
  volume = 	 {35},
  series = 	 {Proceedings of Machine Learning Research},
  address = 	 {Barcelona, Spain},
  month = 	 {13--15 Jun},
  publisher =    {PMLR},
  pdf = 	 {http://proceedings.mlr.press/v35/vanerven14.pdf},
  url = 	 {https://proceedings.mlr.press/v35/vanerven14.html},
  abstract = 	 {We consider online prediction with expert advice. Over the course of many trials, the goal of the learning algorithm is to achieve small additional loss (i.e. regret) compared to the loss of the best from a set of K experts. The two most popular algorithms are Hedge/Weighted Majority and Follow the Perturbed Leader (FPL). The latter algorithm first perturbs the loss of each expert by independent additive noise drawn from a fixed distribution, and then predicts with the expert of minimum perturbed loss (“the leader”) where ties are broken uniformly at random. To achieve the optimal worst-case regret as a function of the loss L^* of the best expert in hindsight, the two types of algorithms need to tune their learning rate or noise magnitude, respectively, as a function of L^*. Instead of perturbing the losses of the experts with additive noise, we randomly set them to 0 or 1 before selecting the leader. We show that our perturbations are an instance of dropout — because experts may be interpreted as features — although for non-binary losses the dropout probability needs to be made dependent on the losses to get good regret bounds. We show that this simple, tuning-free version of the FPL algorithm achieves two feats: optimal worst-case O(\sqrtL^* \ln K + \ln K) regret as a function of L^*, and optimal O(\ln K) regret when the loss vectors are drawn i.i.d. from a fixed distribution and there is a gap between the expected loss of the best expert and all others. A number of recent algorithms from the Hedge family (AdaHedge and FlipFlop) also achieve this, but they employ sophisticated tuning regimes. The dropout perturbation of the losses of the experts result in different noise distributions for each expert (because they depend on the expert’s total loss) and curiously enough no additional tuning is needed: the choice of dropout probability only affects the constants.}
}

Endnote

%0 Conference Paper
%T Follow the Leader with Dropout Perturbations
%A Tim Van Erven
%A Wojciech Kotłowski
%A Manfred K. Warmuth
%B Proceedings of The 27th Conference on Learning Theory
%C Proceedings of Machine Learning Research
%D 2014
%E Maria Florina Balcan
%E Vitaly Feldman
%E Csaba Szepesvári	
%F pmlr-v35-vanerven14
%I PMLR
%P 949--974
%U https://proceedings.mlr.press/v35/vanerven14.html
%V 35
%X We consider online prediction with expert advice. Over the course of many trials, the goal of the learning algorithm is to achieve small additional loss (i.e. regret) compared to the loss of the best from a set of K experts. The two most popular algorithms are Hedge/Weighted Majority and Follow the Perturbed Leader (FPL). The latter algorithm first perturbs the loss of each expert by independent additive noise drawn from a fixed distribution, and then predicts with the expert of minimum perturbed loss (“the leader”) where ties are broken uniformly at random. To achieve the optimal worst-case regret as a function of the loss L^* of the best expert in hindsight, the two types of algorithms need to tune their learning rate or noise magnitude, respectively, as a function of L^*. Instead of perturbing the losses of the experts with additive noise, we randomly set them to 0 or 1 before selecting the leader. We show that our perturbations are an instance of dropout — because experts may be interpreted as features — although for non-binary losses the dropout probability needs to be made dependent on the losses to get good regret bounds. We show that this simple, tuning-free version of the FPL algorithm achieves two feats: optimal worst-case O(\sqrtL^* \ln K + \ln K) regret as a function of L^*, and optimal O(\ln K) regret when the loss vectors are drawn i.i.d. from a fixed distribution and there is a gap between the expected loss of the best expert and all others. A number of recent algorithms from the Hedge family (AdaHedge and FlipFlop) also achieve this, but they employ sophisticated tuning regimes. The dropout perturbation of the losses of the experts result in different noise distributions for each expert (because they depend on the expert’s total loss) and curiously enough no additional tuning is needed: the choice of dropout probability only affects the constants.

RIS


TY  - CPAPER
TI  - Follow the Leader with Dropout Perturbations
AU  - Tim Van Erven
AU  - Wojciech Kotłowski
AU  - Manfred K. Warmuth
BT  - Proceedings of The 27th Conference on Learning Theory
DA  - 2014/05/29
ED  - Maria Florina Balcan
ED  - Vitaly Feldman
ED  - Csaba Szepesvári	
ID  - pmlr-v35-vanerven14
PB  - PMLR
DP  - Proceedings of Machine Learning Research
VL  - 35
SP  - 949
EP  - 974
L1  - http://proceedings.mlr.press/v35/vanerven14.pdf
UR  - https://proceedings.mlr.press/v35/vanerven14.html
AB  - We consider online prediction with expert advice. Over the course of many trials, the goal of the learning algorithm is to achieve small additional loss (i.e. regret) compared to the loss of the best from a set of K experts. The two most popular algorithms are Hedge/Weighted Majority and Follow the Perturbed Leader (FPL). The latter algorithm first perturbs the loss of each expert by independent additive noise drawn from a fixed distribution, and then predicts with the expert of minimum perturbed loss (“the leader”) where ties are broken uniformly at random. To achieve the optimal worst-case regret as a function of the loss L^* of the best expert in hindsight, the two types of algorithms need to tune their learning rate or noise magnitude, respectively, as a function of L^*. Instead of perturbing the losses of the experts with additive noise, we randomly set them to 0 or 1 before selecting the leader. We show that our perturbations are an instance of dropout — because experts may be interpreted as features — although for non-binary losses the dropout probability needs to be made dependent on the losses to get good regret bounds. We show that this simple, tuning-free version of the FPL algorithm achieves two feats: optimal worst-case O(\sqrtL^* \ln K + \ln K) regret as a function of L^*, and optimal O(\ln K) regret when the loss vectors are drawn i.i.d. from a fixed distribution and there is a gap between the expected loss of the best expert and all others. A number of recent algorithms from the Hedge family (AdaHedge and FlipFlop) also achieve this, but they employ sophisticated tuning regimes. The dropout perturbation of the losses of the experts result in different noise distributions for each expert (because they depend on the expert’s total loss) and curiously enough no additional tuning is needed: the choice of dropout probability only affects the constants.
ER  -

APA


Van Erven, T., Kotłowski, W. & Warmuth, M.K.. (2014). Follow the Leader with Dropout Perturbations. Proceedings of The 27th Conference on Learning Theory, in Proceedings of Machine Learning Research 35:949-974 Available from https://proceedings.mlr.press/v35/vanerven14.html.

Related Material

Download PDF