Data Amplification: Instance-Optimal Property Estimation

Yi Hao; Alon Orlitsky

Proceedings of the 37th International Conference on Machine Learning, PMLR 119:4049-4059, 2020.

Abstract

The best-known and most commonly used technique for distribution-property estimation uses a plug-in estimator, with empirical frequency replacing the underlying distribution. We present novel linear-time-computable estimators that significantly “amplify” the effective amount of data available. For a large variety of distribution properties including four of the most popular ones and for every underlying distribution, they achieve the accuracy that the empirical-frequency plug-in estimators would attain using a logarithmic-factor more samples. Specifically, for Shannon entropy and a broad class of Lipschitz properties including the $L_1$ distance to a fixed distribution, the new estimators use $n$ samples to achieve the accuracy attained by the empirical estimators with $n\log n$ samples. For support-size and coverage, the new estimators use $n$ samples to achieve the performance of empirical frequency with sample size $n$ times the logarithm of the property value. Significantly strengthening the traditional min-max formulation, these results hold not only for the worst distributions, but for each and every underlying distribution. Furthermore, the logarithmic amplification factors are optimal. Experiments on a wide variety of distributions show that the new estimators outperform the previous state-of-the-art estimators designed for each specific property.
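For readers unfamiliar with the baseline the abstract refers to, the following is a minimal sketch of the empirical-frequency plug-in approach for three of the properties named above. It is not the paper's amplified estimator; the function names and the dictionary representation of a distribution are assumptions made here for illustration only.

```python
import math
from collections import Counter


def empirical_distribution(samples):
    """Map each observed symbol to its empirical frequency (count / n)."""
    counts = Counter(samples)
    n = len(samples)
    return {symbol: count / n for symbol, count in counts.items()}


def plugin_entropy(samples):
    """Plug-in Shannon entropy in nats: evaluate H on the empirical distribution."""
    p_hat = empirical_distribution(samples)
    return -sum(p * math.log(p) for p in p_hat.values())


def plugin_support_size(samples):
    """Plug-in support size: the number of distinct symbols observed."""
    return len(set(samples))


def plugin_l1_distance(samples, q):
    """Plug-in L1 distance between the empirical distribution and a fixed q.

    q is a dict from symbol to probability (a hypothetical representation).
    """
    p_hat = empirical_distribution(samples)
    symbols = set(p_hat) | set(q)
    return sum(abs(p_hat.get(s, 0.0) - q.get(s, 0.0)) for s in symbols)
```

Each function simply substitutes the empirical frequencies for the unknown distribution, which is the baseline the abstract compares against.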

Cite this Paper

BibTeX


@InProceedings{pmlr-v119-hao20a,
  title = 	 {Data Amplification: Instance-Optimal Property Estimation},
  author =       {Hao, Yi and Orlitsky, Alon},
  booktitle = 	 {Proceedings of the 37th International Conference on Machine Learning},
  pages = 	 {4049--4059},
  year = 	 {2020},
  editor = 	 {Daumé III, Hal and Singh, Aarti},
  volume = 	 {119},
  series = 	 {Proceedings of Machine Learning Research},
  month = 	 {13--18 Jul},
  publisher =    {PMLR},
  pdf = 	 {http://proceedings.mlr.press/v119/hao20a/hao20a.pdf},
  url = 	 {https://proceedings.mlr.press/v119/hao20a.html},
  abstract = 	 {The best-known and most commonly used technique for distribution-property estimation uses a plug-in estimator, with empirical frequency replacing the underlying distribution. We present novel linear-time-computable estimators that significantly “amplify” the effective amount of data available. For a large variety of distribution properties including four of the most popular ones and for every underlying distribution, they achieve the accuracy that the empirical-frequency plug-in estimators would attain using a logarithmic-factor more samples. Specifically, for Shannon entropy and a broad class of Lipschitz properties including the $L_1$ distance to a fixed distribution, the new estimators use $n$ samples to achieve the accuracy attained by the empirical estimators with $n\log n$ samples. For support-size and coverage, the new estimators use $n$ samples to achieve the performance of empirical frequency with sample size $n$ times the logarithm of the property value. Significantly strengthening the traditional min-max formulation, these results hold not only for the worst distributions, but for each and every underlying distribution. Furthermore, the logarithmic amplification factors are optimal. Experiments on a wide variety of distributions show that the new estimators outperform the previous state-of-the-art estimators designed for each specific property.}
}

Endnote

%0 Conference Paper
%T Data Amplification: Instance-Optimal Property Estimation
%A Yi Hao
%A Alon Orlitsky
%B Proceedings of the 37th International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2020
%E Hal Daumé III
%E Aarti Singh
%F pmlr-v119-hao20a
%I PMLR
%P 4049--4059
%U https://proceedings.mlr.press/v119/hao20a.html
%V 119
%X The best-known and most commonly used technique for distribution-property estimation uses a plug-in estimator, with empirical frequency replacing the underlying distribution. We present novel linear-time-computable estimators that significantly “amplify” the effective amount of data available. For a large variety of distribution properties including four of the most popular ones and for every underlying distribution, they achieve the accuracy that the empirical-frequency plug-in estimators would attain using a logarithmic-factor more samples. Specifically, for Shannon entropy and a broad class of Lipschitz properties including the $L_1$ distance to a fixed distribution, the new estimators use $n$ samples to achieve the accuracy attained by the empirical estimators with $n\log n$ samples. For support-size and coverage, the new estimators use $n$ samples to achieve the performance of empirical frequency with sample size $n$ times the logarithm of the property value. Significantly strengthening the traditional min-max formulation, these results hold not only for the worst distributions, but for each and every underlying distribution. Furthermore, the logarithmic amplification factors are optimal. Experiments on a wide variety of distributions show that the new estimators outperform the previous state-of-the-art estimators designed for each specific property.

APA


Hao, Y. & Orlitsky, A. (2020). Data Amplification: Instance-Optimal Property Estimation. Proceedings of the 37th International Conference on Machine Learning, in Proceedings of Machine Learning Research 119:4049-4059. Available from https://proceedings.mlr.press/v119/hao20a.html.
