Simultaneous Inference for Massive Data: Distributed Bootstrap

Yang Yu, Shih-Kang Chao, Guang Cheng
Proceedings of the 37th International Conference on Machine Learning, PMLR 119:10892-10901, 2020.

Abstract

In this paper, we propose a bootstrap method for massive data that are stored and processed distributedly across a large number of machines. The new method is computationally efficient: we bootstrap on the master machine without the over-resampling typically required by existing methods (Kleiner et al., 2014; Sengupta et al., 2016), while provably achieving optimal statistical efficiency with minimal communication. Our method does not require repeatedly re-fitting the model; instead, the master machine applies a multiplier bootstrap to the gradients received from the worker machines. Simulations validate our theory.
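
To illustrate the workflow the abstract describes, the sketch below simulates a distributed linear regression in which each worker sends only its local averaged gradient to the master, and the master runs a multiplier bootstrap on those gradients to build simultaneous confidence intervals. This is a minimal, hypothetical illustration: the model, the one-step update, and the multiplier scheme are simplifying assumptions, not the paper's exact algorithm.

# Hypothetical sketch of a gradient-based multiplier bootstrap on the master
# machine; the linear model and multiplier choice are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)

# Simulated distributed data: k workers, n observations each, p parameters.
k, n, p = 20, 500, 5
theta_true = np.zeros(p)
X = rng.normal(size=(k, n, p))
y = X @ theta_true + rng.normal(size=(k, n))

# Master's current estimate (assumed already obtained by some
# communication-efficient procedure; here simply the global OLS fit).
X_all, y_all = X.reshape(-1, p), y.reshape(-1)
theta_hat, *_ = np.linalg.lstsq(X_all, y_all, rcond=None)

# Each worker communicates only its local averaged gradient at theta_hat.
local_grads = np.stack(
    [X[j].T @ (X[j] @ theta_hat - y[j]) / n for j in range(k)]
)                                              # shape (k, p)

# Master-side quantities: averaged gradient and the master's local Hessian.
g_bar = local_grads.mean(axis=0)
H_inv = np.linalg.inv(X[0].T @ X[0] / n)

# Multiplier bootstrap on the gradients: perturb the centered worker
# gradients with i.i.d. standard normal multipliers and map each draw
# through a one-step update -- no re-fitting, no resampling of raw data.
B, N = 1000, k * n
max_stats = np.empty(B)
for b in range(B):
    eps = rng.normal(size=k)
    g_star = (eps[:, None] * (local_grads - g_bar)).mean(axis=0)
    theta_star = theta_hat - H_inv @ g_star
    max_stats[b] = np.sqrt(N) * np.max(np.abs(theta_star - theta_hat))

# Simultaneous 95% confidence intervals for all p coordinates.
c = np.quantile(max_stats, 0.95)
lower = theta_hat - c / np.sqrt(N)
upper = theta_hat + c / np.sqrt(N)
print(np.column_stack([lower, upper]))

Note that each bootstrap draw reuses the same k worker gradients and the master's own Hessian, so no raw data or model refit travels over the network.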

Cite this Paper


BibTeX
@InProceedings{pmlr-v119-yu20a,
  title     = {Simultaneous Inference for Massive Data: Distributed Bootstrap},
  author    = {Yu, Yang and Chao, Shih-Kang and Cheng, Guang},
  booktitle = {Proceedings of the 37th International Conference on Machine Learning},
  pages     = {10892--10901},
  year      = {2020},
  editor    = {III, Hal Daumé and Singh, Aarti},
  volume    = {119},
  series    = {Proceedings of Machine Learning Research},
  month     = {13--18 Jul},
  publisher = {PMLR},
  pdf       = {http://proceedings.mlr.press/v119/yu20a/yu20a.pdf},
  url       = {https://proceedings.mlr.press/v119/yu20a.html}
}
EndNote
%0 Conference Paper
%T Simultaneous Inference for Massive Data: Distributed Bootstrap
%A Yang Yu
%A Shih-Kang Chao
%A Guang Cheng
%B Proceedings of the 37th International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2020
%E Hal Daumé III
%E Aarti Singh
%F pmlr-v119-yu20a
%I PMLR
%P 10892--10901
%U https://proceedings.mlr.press/v119/yu20a.html
%V 119
APA
Yu, Y., Chao, S.-K., & Cheng, G. (2020). Simultaneous Inference for Massive Data: Distributed Bootstrap. Proceedings of the 37th International Conference on Machine Learning, in Proceedings of Machine Learning Research 119:10892-10901. Available from https://proceedings.mlr.press/v119/yu20a.html.
