Sample Amplification: Increasing Dataset Size even when Learning is Impossible

Brian Axelrod, Shivam Garg, Vatsal Sharan, Gregory Valiant
Proceedings of the 37th International Conference on Machine Learning, PMLR 119:442-451, 2020.

Abstract

Given data drawn from an unknown distribution, D, to what extent is it possible to “amplify” this dataset and faithfully output an even larger set of samples that appear to have been drawn from D? We formalize this question as follows: an (n,m) amplification procedure takes as input n independent draws from an unknown distribution D, and outputs a set of m > n “samples” which must be indistinguishable from m samples drawn iid from D. We consider this sample amplification problem in two fundamental settings: the case where D is an arbitrary discrete distribution supported on k elements, and the case where D is a d-dimensional Gaussian with unknown mean, and fixed covariance matrix. Perhaps surprisingly, we show a valid amplification procedure exists for both of these settings, even in the regime where the size of the input dataset, n, is significantly less than what would be necessary to learn distribution D to non-trivial accuracy. We also show that our procedures are optimal up to constant factors. Beyond these results, we describe potential applications of such data amplification, and formalize a number of curious directions for future research along this vein.

Cite this Paper


BibTeX
@InProceedings{pmlr-v119-axelrod20a, title = {Sample Amplification: Increasing Dataset Size even when Learning is Impossible}, author = {Axelrod, Brian and Garg, Shivam and Sharan, Vatsal and Valiant, Gregory}, booktitle = {Proceedings of the 37th International Conference on Machine Learning}, pages = {442--451}, year = {2020}, editor = {III, Hal Daumé and Singh, Aarti}, volume = {119}, series = {Proceedings of Machine Learning Research}, month = {13--18 Jul}, publisher = {PMLR}, pdf = {http://proceedings.mlr.press/v119/axelrod20a/axelrod20a.pdf}, url = {http://proceedings.mlr.press/v119/axelrod20a.html}, abstract = {Given data drawn from an unknown distribution, D, to what extent is it possible to “amplify” this dataset and faithfully output an even larger set of samples that appear to have been drawn from D? We formalize this question as follows: an (n,m) amplification procedure takes as input n independent draws from an unknown distribution D, and outputs a set of m > n “samples” which must be indistinguishable from m samples drawn iid from D. We consider this sample amplification problem in two fundamental settings: the case where D is an arbitrary discrete distribution supported on k elements, and the case where D is a d-dimensional Gaussian with unknown mean, and fixed covariance matrix. Perhaps surprisingly, we show a valid amplification procedure exists for both of these settings, even in the regime where the size of the input dataset, n, is significantly less than what would be necessary to learn distribution D to non-trivial accuracy. We also show that our procedures are optimal up to constant factors. Beyond these results, we describe potential applications of such data amplification, and formalize a number of curious directions for future research along this vein.} }
Endnote
%0 Conference Paper %T Sample Amplification: Increasing Dataset Size even when Learning is Impossible %A Brian Axelrod %A Shivam Garg %A Vatsal Sharan %A Gregory Valiant %B Proceedings of the 37th International Conference on Machine Learning %C Proceedings of Machine Learning Research %D 2020 %E Hal Daumé III %E Aarti Singh %F pmlr-v119-axelrod20a %I PMLR %P 442--451 %U http://proceedings.mlr.press/v119/axelrod20a.html %V 119 %X Given data drawn from an unknown distribution, D, to what extent is it possible to “amplify” this dataset and faithfully output an even larger set of samples that appear to have been drawn from D? We formalize this question as follows: an (n,m) amplification procedure takes as input n independent draws from an unknown distribution D, and outputs a set of m > n “samples” which must be indistinguishable from m samples drawn iid from D. We consider this sample amplification problem in two fundamental settings: the case where D is an arbitrary discrete distribution supported on k elements, and the case where D is a d-dimensional Gaussian with unknown mean, and fixed covariance matrix. Perhaps surprisingly, we show a valid amplification procedure exists for both of these settings, even in the regime where the size of the input dataset, n, is significantly less than what would be necessary to learn distribution D to non-trivial accuracy. We also show that our procedures are optimal up to constant factors. Beyond these results, we describe potential applications of such data amplification, and formalize a number of curious directions for future research along this vein.
APA
Axelrod, B., Garg, S., Sharan, V. & Valiant, G.. (2020). Sample Amplification: Increasing Dataset Size even when Learning is Impossible. Proceedings of the 37th International Conference on Machine Learning, in Proceedings of Machine Learning Research 119:442-451 Available from http://proceedings.mlr.press/v119/axelrod20a.html.

Related Material