A Theoretical and Practical Framework for Regression and Classification from Truncated Samples

Andrew Ilyas, Emmanouil Zampetakis, Constantinos Daskalakis
Proceedings of the Twenty Third International Conference on Artificial Intelligence and Statistics, PMLR 108:4463-4473, 2020.

Abstract

Machine learning and statistics are invaluable for extracting insights from data. A key assumption of most methods, however, is that they have access to independent samples from the distribution of relevant data. As such, these methods often perform poorly in the face of {\em biased data} which breaks this assumption. In this work, we consider the classical challenge of bias due to truncation, wherein samples falling outside of an “observation window” cannot be observed. We present a general framework for regression and classification from samples that are truncated according to the value of the dependent variable. The framework argues that stochastic gradient descent (SGD) can be efficiently executed on the population log-likelihood of the truncated sample. Our framework is broadly applicable, and we provide end-to-end guarantees for the well-studied problems of truncated logistic and probit regression, where we argue that the true model parameters can be identified computationally and statistically efficiently from truncated data, extending recent work on truncated linear regression. We also provide experiments to illustrate the practicality of our framework on synthetic and real data.

Cite this Paper


BibTeX
@InProceedings{pmlr-v108-ilyas20a, title = {A Theoretical and Practical Framework for Regression and Classification from Truncated Samples}, author = {Ilyas, Andrew and Zampetakis, Emmanouil and Daskalakis, Constantinos}, booktitle = {Proceedings of the Twenty Third International Conference on Artificial Intelligence and Statistics}, pages = {4463--4473}, year = {2020}, editor = {Chiappa, Silvia and Calandra, Roberto}, volume = {108}, series = {Proceedings of Machine Learning Research}, month = {26--28 Aug}, publisher = {PMLR}, pdf = {http://proceedings.mlr.press/v108/ilyas20a/ilyas20a.pdf}, url = {http://proceedings.mlr.press/v108/ilyas20a.html}, abstract = {Machine learning and statistics are invaluable for extracting insights from data. A key assumption of most methods, however, is that they have access to independent samples from the distribution of relevant data. As such, these methods often perform poorly in the face of {\em biased data} which breaks this assumption. In this work, we consider the classical challenge of bias due to truncation, wherein samples falling outside of an “observation window” cannot be observed. We present a general framework for regression and classification from samples that are truncated according to the value of the dependent variable. The framework argues that stochastic gradient descent (SGD) can be efficiently executed on the population log-likelihood of the truncated sample. Our framework is broadly applicable, and we provide end-to-end guarantees for the well-studied problems of truncated logistic and probit regression, where we argue that the true model parameters can be identified computationally and statistically efficiently from truncated data, extending recent work on truncated linear regression. We also provide experiments to illustrate the practicality of our framework on synthetic and real data.} }
Endnote
%0 Conference Paper %T A Theoretical and Practical Framework for Regression and Classification from Truncated Samples %A Andrew Ilyas %A Emmanouil Zampetakis %A Constantinos Daskalakis %B Proceedings of the Twenty Third International Conference on Artificial Intelligence and Statistics %C Proceedings of Machine Learning Research %D 2020 %E Silvia Chiappa %E Roberto Calandra %F pmlr-v108-ilyas20a %I PMLR %P 4463--4473 %U http://proceedings.mlr.press/v108/ilyas20a.html %V 108 %X Machine learning and statistics are invaluable for extracting insights from data. A key assumption of most methods, however, is that they have access to independent samples from the distribution of relevant data. As such, these methods often perform poorly in the face of {\em biased data} which breaks this assumption. In this work, we consider the classical challenge of bias due to truncation, wherein samples falling outside of an “observation window” cannot be observed. We present a general framework for regression and classification from samples that are truncated according to the value of the dependent variable. The framework argues that stochastic gradient descent (SGD) can be efficiently executed on the population log-likelihood of the truncated sample. Our framework is broadly applicable, and we provide end-to-end guarantees for the well-studied problems of truncated logistic and probit regression, where we argue that the true model parameters can be identified computationally and statistically efficiently from truncated data, extending recent work on truncated linear regression. We also provide experiments to illustrate the practicality of our framework on synthetic and real data.
APA
Ilyas, A., Zampetakis, E. & Daskalakis, C.. (2020). A Theoretical and Practical Framework for Regression and Classification from Truncated Samples. Proceedings of the Twenty Third International Conference on Artificial Intelligence and Statistics, in Proceedings of Machine Learning Research 108:4463-4473 Available from http://proceedings.mlr.press/v108/ilyas20a.html.

Related Material