Follow the Leader with Dropout Perturbations

Tim Van Erven, Wojciech Kotłowski, Manfred K. Warmuth
Proceedings of The 27th Conference on Learning Theory, PMLR 35:949-974, 2014.

Abstract

We consider online prediction with expert advice. Over the course of many trials, the goal of the learning algorithm is to achieve small additional loss (i.e. regret) compared to the loss of the best from a set of K experts. The two most popular algorithms are Hedge/Weighted Majority and Follow the Perturbed Leader (FPL). The latter algorithm first perturbs the loss of each expert by independent additive noise drawn from a fixed distribution, and then predicts with the expert of minimum perturbed loss (“the leader”) where ties are broken uniformly at random. To achieve the optimal worst-case regret as a function of the loss L^* of the best expert in hindsight, the two types of algorithms need to tune their learning rate or noise magnitude, respectively, as a function of L^*. Instead of perturbing the losses of the experts with additive noise, we randomly set them to 0 or 1 before selecting the leader. We show that our perturbations are an instance of dropout — because experts may be interpreted as features — although for non-binary losses the dropout probability needs to be made dependent on the losses to get good regret bounds. We show that this simple, tuning-free version of the FPL algorithm achieves two feats: optimal worst-case O(\sqrtL^* \ln K + \ln K) regret as a function of L^*, and optimal O(\ln K) regret when the loss vectors are drawn i.i.d. from a fixed distribution and there is a gap between the expected loss of the best expert and all others. A number of recent algorithms from the Hedge family (AdaHedge and FlipFlop) also achieve this, but they employ sophisticated tuning regimes. The dropout perturbation of the losses of the experts result in different noise distributions for each expert (because they depend on the expert’s total loss) and curiously enough no additional tuning is needed: the choice of dropout probability only affects the constants.

Cite this Paper


BibTeX
@InProceedings{pmlr-v35-vanerven14, title = {Follow the Leader with Dropout Perturbations}, author = {Van Erven, Tim and Kotłowski, Wojciech and Warmuth, Manfred K.}, booktitle = {Proceedings of The 27th Conference on Learning Theory}, pages = {949--974}, year = {2014}, editor = {Balcan, Maria Florina and Feldman, Vitaly and Szepesvári, Csaba}, volume = {35}, series = {Proceedings of Machine Learning Research}, address = {Barcelona, Spain}, month = {13--15 Jun}, publisher = {PMLR}, pdf = {http://proceedings.mlr.press/v35/vanerven14.pdf}, url = {https://proceedings.mlr.press/v35/vanerven14.html}, abstract = {We consider online prediction with expert advice. Over the course of many trials, the goal of the learning algorithm is to achieve small additional loss (i.e. regret) compared to the loss of the best from a set of K experts. The two most popular algorithms are Hedge/Weighted Majority and Follow the Perturbed Leader (FPL). The latter algorithm first perturbs the loss of each expert by independent additive noise drawn from a fixed distribution, and then predicts with the expert of minimum perturbed loss (“the leader”) where ties are broken uniformly at random. To achieve the optimal worst-case regret as a function of the loss L^* of the best expert in hindsight, the two types of algorithms need to tune their learning rate or noise magnitude, respectively, as a function of L^*. Instead of perturbing the losses of the experts with additive noise, we randomly set them to 0 or 1 before selecting the leader. We show that our perturbations are an instance of dropout — because experts may be interpreted as features — although for non-binary losses the dropout probability needs to be made dependent on the losses to get good regret bounds. We show that this simple, tuning-free version of the FPL algorithm achieves two feats: optimal worst-case O(\sqrtL^* \ln K + \ln K) regret as a function of L^*, and optimal O(\ln K) regret when the loss vectors are drawn i.i.d. from a fixed distribution and there is a gap between the expected loss of the best expert and all others. A number of recent algorithms from the Hedge family (AdaHedge and FlipFlop) also achieve this, but they employ sophisticated tuning regimes. The dropout perturbation of the losses of the experts result in different noise distributions for each expert (because they depend on the expert’s total loss) and curiously enough no additional tuning is needed: the choice of dropout probability only affects the constants.} }
Endnote
%0 Conference Paper %T Follow the Leader with Dropout Perturbations %A Tim Van Erven %A Wojciech Kotłowski %A Manfred K. Warmuth %B Proceedings of The 27th Conference on Learning Theory %C Proceedings of Machine Learning Research %D 2014 %E Maria Florina Balcan %E Vitaly Feldman %E Csaba Szepesvári %F pmlr-v35-vanerven14 %I PMLR %P 949--974 %U https://proceedings.mlr.press/v35/vanerven14.html %V 35 %X We consider online prediction with expert advice. Over the course of many trials, the goal of the learning algorithm is to achieve small additional loss (i.e. regret) compared to the loss of the best from a set of K experts. The two most popular algorithms are Hedge/Weighted Majority and Follow the Perturbed Leader (FPL). The latter algorithm first perturbs the loss of each expert by independent additive noise drawn from a fixed distribution, and then predicts with the expert of minimum perturbed loss (“the leader”) where ties are broken uniformly at random. To achieve the optimal worst-case regret as a function of the loss L^* of the best expert in hindsight, the two types of algorithms need to tune their learning rate or noise magnitude, respectively, as a function of L^*. Instead of perturbing the losses of the experts with additive noise, we randomly set them to 0 or 1 before selecting the leader. We show that our perturbations are an instance of dropout — because experts may be interpreted as features — although for non-binary losses the dropout probability needs to be made dependent on the losses to get good regret bounds. We show that this simple, tuning-free version of the FPL algorithm achieves two feats: optimal worst-case O(\sqrtL^* \ln K + \ln K) regret as a function of L^*, and optimal O(\ln K) regret when the loss vectors are drawn i.i.d. from a fixed distribution and there is a gap between the expected loss of the best expert and all others. A number of recent algorithms from the Hedge family (AdaHedge and FlipFlop) also achieve this, but they employ sophisticated tuning regimes. The dropout perturbation of the losses of the experts result in different noise distributions for each expert (because they depend on the expert’s total loss) and curiously enough no additional tuning is needed: the choice of dropout probability only affects the constants.
RIS
TY - CPAPER TI - Follow the Leader with Dropout Perturbations AU - Tim Van Erven AU - Wojciech Kotłowski AU - Manfred K. Warmuth BT - Proceedings of The 27th Conference on Learning Theory DA - 2014/05/29 ED - Maria Florina Balcan ED - Vitaly Feldman ED - Csaba Szepesvári ID - pmlr-v35-vanerven14 PB - PMLR DP - Proceedings of Machine Learning Research VL - 35 SP - 949 EP - 974 L1 - http://proceedings.mlr.press/v35/vanerven14.pdf UR - https://proceedings.mlr.press/v35/vanerven14.html AB - We consider online prediction with expert advice. Over the course of many trials, the goal of the learning algorithm is to achieve small additional loss (i.e. regret) compared to the loss of the best from a set of K experts. The two most popular algorithms are Hedge/Weighted Majority and Follow the Perturbed Leader (FPL). The latter algorithm first perturbs the loss of each expert by independent additive noise drawn from a fixed distribution, and then predicts with the expert of minimum perturbed loss (“the leader”) where ties are broken uniformly at random. To achieve the optimal worst-case regret as a function of the loss L^* of the best expert in hindsight, the two types of algorithms need to tune their learning rate or noise magnitude, respectively, as a function of L^*. Instead of perturbing the losses of the experts with additive noise, we randomly set them to 0 or 1 before selecting the leader. We show that our perturbations are an instance of dropout — because experts may be interpreted as features — although for non-binary losses the dropout probability needs to be made dependent on the losses to get good regret bounds. We show that this simple, tuning-free version of the FPL algorithm achieves two feats: optimal worst-case O(\sqrtL^* \ln K + \ln K) regret as a function of L^*, and optimal O(\ln K) regret when the loss vectors are drawn i.i.d. from a fixed distribution and there is a gap between the expected loss of the best expert and all others. A number of recent algorithms from the Hedge family (AdaHedge and FlipFlop) also achieve this, but they employ sophisticated tuning regimes. The dropout perturbation of the losses of the experts result in different noise distributions for each expert (because they depend on the expert’s total loss) and curiously enough no additional tuning is needed: the choice of dropout probability only affects the constants. ER -
APA
Van Erven, T., Kotłowski, W. & Warmuth, M.K.. (2014). Follow the Leader with Dropout Perturbations. Proceedings of The 27th Conference on Learning Theory, in Proceedings of Machine Learning Research 35:949-974 Available from https://proceedings.mlr.press/v35/vanerven14.html.

Related Material