Open Problem: First-Order Regret Bounds for Contextual Bandits


Alekh Agarwal, Akshay Krishnamurthy, John Langford, Haipeng Luo, Robert E. Schapire
Proceedings of the 2017 Conference on Learning Theory, PMLR 65:4-7, 2017.


We describe two open problems related to first-order regret bounds for contextual bandits. The first asks for an algorithm with a regret bound of $\tilde{\mathcal{O}}(\sqrt{L_\star}K \ln N)$, where there are $K$ actions, $N$ policies, and $L_\star$ is the cumulative loss of the best policy. The second asks for an optimization-oracle-efficient algorithm with regret $\tilde{\mathcal{O}}(L_\star^{2/3}\,\mathrm{poly}(K, \ln(N/\delta)))$. We describe some positive results, such as an inefficient algorithm for the second problem, and some partial negative results.
