A Theoretical and Practical Framework for Regression and Classification from Truncated Samples


Andrew Ilyas, Emmanouil Zampetakis, Constantinos Daskalakis
Proceedings of the Twenty Third International Conference on Artificial Intelligence and Statistics, PMLR 108:4463-4473, 2020.


Machine learning and statistics are invaluable for extracting insights from data. A key assumption of most methods, however, is that they have access to independent samples from the distribution of relevant data. As such, these methods often perform poorly in the face of biased data, which breaks this assumption. In this work, we consider the classical challenge of bias due to truncation, wherein samples falling outside of an “observation window” cannot be observed. We present a general framework for regression and classification from samples that are truncated according to the value of the dependent variable. The framework rests on the argument that stochastic gradient descent (SGD) can be efficiently executed on the population log-likelihood of the truncated sample. Our framework is broadly applicable, and we provide end-to-end guarantees for the well-studied problems of truncated logistic and probit regression, where we argue that the true model parameters can be identified in a computationally and statistically efficient manner from truncated data, extending recent work on truncated linear regression. We also provide experiments to illustrate the practicality of our framework on synthetic and real data.
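
To make the idea of running SGD on a truncated log-likelihood concrete, the following is a minimal sketch for the simplest related setting, truncated linear regression, and not the algorithm analyzed in the paper. It assumes unit-variance Gaussian noise, a known one-sided observation window y > c, and no intercept. Under those assumptions the per-sample negative log-likelihood has gradient -(y - E[z | z > c]) x with z ~ N(w·x, 1), and the conditional mean has a closed form, so no sampling is needed. All function names and parameter values here are hypothetical and chosen only for illustration.

```python
import math
import random


def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))


def truncated_mean(mu, c):
    """E[z | z > c] for z ~ N(mu, 1), via the inverse Mills ratio."""
    t = c - mu
    tail = 0.5 * math.erfc(t / math.sqrt(2.0))  # P(z > c)
    if tail < 1e-300:
        # Far-tail fallback (asymptotic approximation) to avoid dividing by zero.
        return c + 1.0 / t
    density = math.exp(-0.5 * t * t) / math.sqrt(2.0 * math.pi)
    return mu + density / tail


def sample_truncated(w_true, c, n, rng):
    """Draw (x, y) pairs from y = w_true . x + noise, keeping only those with y > c."""
    data = []
    while len(data) < n:
        x = [rng.gauss(0.0, 1.0) for _ in w_true]
        y = dot(w_true, x) + rng.gauss(0.0, 1.0)
        if y > c:  # samples outside the observation window are never seen
            data.append((x, y))
    return data


def sgd_truncated_regression(data, c, dim, epochs=5, lr=0.1, rng=None):
    """SGD on the truncated negative log-likelihood (assumed model: unit-variance noise)."""
    rng = rng or random.Random(0)
    data = list(data)
    w = [0.0] * dim
    step = 0
    for _ in range(epochs):
        rng.shuffle(data)
        for x, y in data:
            step += 1
            eta = lr / math.sqrt(step)  # decaying step size
            # Descent direction: (y - E[z | z > c]) * x
            residual = y - truncated_mean(dot(w, x), c)
            w = [wi + eta * residual * xi for wi, xi in zip(w, x)]
    return w


rng = random.Random(0)
w_true = [1.0, -0.5]  # illustrative ground truth
data = sample_truncated(w_true, c=0.0, n=4000, rng=rng)
w_hat = sgd_truncated_regression(data, c=0.0, dim=2, rng=rng)
print("estimate:", w_hat)
```

The only difference from ordinary least-squares SGD is that the residual is taken against the mean of the truncated distribution instead of the untruncated prediction w·x. For a general observation window without a closed-form conditional mean, that term would have to be estimated in some other way, for instance by sampling.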
