Proceedings of the Twenty-Second International Conference on Artificial Intelligence and Statistics, PMLR 89:2691-2700, 2019.
Given a data set of $(x,y)$ pairs, a common learning task is to fit a model predicting $y$ (a label or dependent variable) conditioned on $x$. This paper considers the similar but much less understood problem of modeling “higher-order” statistics of $y$’s distribution conditioned on $x$. Such statistics are often challenging to estimate using traditional empirical risk minimization (ERM) approaches. We develop and theoretically analyze an ERM-like approach with multi-observation loss functions. We propose four algorithms formalizing the concept of ERM for this problem. Two of them have statistical guarantees in settings allowing both slow and fast convergence rates, but are outperformed empirically by the other two. Empirical results illustrate the potential practicality of these algorithms in low dimensions and show significant improvement over standard approaches in some settings.
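To make the idea of a multi-observation loss concrete, the following is a minimal sketch and not one of the paper's four algorithms. It assumes, purely for illustration, that two independent draws $y_1, y_2$ of $y$ are available at each $x$, and that the conditional variance is linear in $x$. It relies on the standard identity $\mathbb{E}[(y_1 - y_2)^2/2] = \mathrm{Var}(y)$ for independent, identically distributed $y_1, y_2$. The function names, the synthetic data, and the linear model class are all hypothetical choices.

```python
import random


def two_observation_loss(r, y1, y2):
    # Squared error against (y1 - y2)^2 / 2, whose expectation equals Var(y)
    # when y1 and y2 are independent draws from the same distribution.
    return (r - 0.5 * (y1 - y2) ** 2) ** 2


def empirical_risk(a, b, xs, y1s, y2s):
    # Average two-observation loss of the linear predictor r(x) = a + b * x.
    total = sum(two_observation_loss(a + b * x, y1, y2)
                for x, y1, y2 in zip(xs, y1s, y2s))
    return total / len(xs)


def fit_linear_variance(xs, y1s, y2s):
    # Minimizing the empirical risk over r(x) = a + b * x is ordinary least
    # squares on the targets t = (y1 - y2)^2 / 2, solved here in closed form.
    ts = [0.5 * (y1 - y2) ** 2 for y1, y2 in zip(y1s, y2s)]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_t = sum(ts) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxt = sum((x - mean_x) * (t - mean_t) for x, t in zip(xs, ts))
    b = sxt / sxx
    a = mean_t - b * mean_x
    return a, b


# Synthetic data (an assumption for this sketch): the conditional mean is 3x
# and the conditional variance is Var(y | x) = 1 + 2x, with x uniform on [0, 1].
rng = random.Random(0)
n = 20000
xs = [rng.random() for _ in range(n)]
y1s = [rng.gauss(3 * x, (1 + 2 * x) ** 0.5) for x in xs]
y2s = [rng.gauss(3 * x, (1 + 2 * x) ** 0.5) for x in xs]

a, b = fit_linear_variance(xs, y1s, y2s)
risk_at_fit = empirical_risk(a, b, xs, y1s, y2s)
risk_at_zero = empirical_risk(0.0, 0.0, xs, y1s, y2s)
```

In this sketch the fitted `a` and `b` should land near the true intercept 1 and slope 2 of the variance function, without the conditional mean ever being estimated. With only one observation of $y$ per $x$, as in an ordinary data set of $(x,y)$ pairs, this direct construction is unavailable.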