Open Problem: FirstOrder Regret Bounds for Contextual Bandits
[edit]
Proceedings of the 2017 Conference on Learning Theory, PMLR 65:47, 2017.
Abstract
We describe two open problems related to first order regret bounds for contextual bandits. The first asks for an algorithm with a regret bound of $\tilde{\mathcal{O}}(\sqrt{L_⋆}K \ln N)$ where there are $K$ actions, $N$ policies, and $L_⋆$ is the cumulative loss of the best policy. The second asks for an optimizationoracleefficient algorithm with regret $\tilde{\mathcal{O}}(L_⋆^{2/3}poly(K, \ln(N/δ)))$. We describe some positive results, such as an inefficient algorithm for the second problem, and some partial negative results.
Related Material


