Data Driven Resource Allocation for Distributed Learning

Travis Dick, Mu Li, Venkata Krishna Pillutla, Colin White, Nina Balcan, Alex Smola
Proceedings of the 20th International Conference on Artificial Intelligence and Statistics, PMLR 54:662-671, 2017.

Abstract

In distributed machine learning, data is dispatched to multiple machines for processing. Motivated by the fact that similar data points often belong to the same or similar classes, and more generally, that classification rules of high accuracy tend to be “locally simple but globally complex” (Vapnik and Bottou, 1993), we propose data-dependent dispatching that takes advantage of such structure. We present an in-depth analysis of this model, providing new algorithms with provable worst-case guarantees, analysis proving that existing scalable heuristics perform well in natural non-worst-case conditions, and techniques for extending a dispatching rule from a small sample to the entire distribution. We overcome novel technical challenges to satisfy important conditions for accurate distributed learning, including fault tolerance and balancedness. We empirically compare our approach with baselines based on random partitioning, balanced partition trees, and locality-sensitive hashing, showing that we achieve significantly higher accuracy on both synthetic and real-world image and advertising datasets. We also demonstrate that our technique strongly scales with the available computing power.
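
To make the contrast in the abstract concrete, the sketch below compares random dispatching with a simple data-dependent rule: learn representative centers on a small sample, then send every point to the machine whose center is nearest, so nearby (likely same-class) points land together. This is only an illustrative sketch, not the paper's algorithm; the balance and fault-tolerance guarantees developed in the paper are omitted, and all function names here are hypothetical.

# Illustrative sketch only (NumPy); not the paper's dispatching algorithm.
import numpy as np

def random_dispatch(X, k, rng):
    # Baseline: assign each point to one of k machines uniformly at random.
    return rng.integers(0, k, size=len(X))

def learn_centers(sample, k, iters=20, rng=None):
    # Plain Lloyd's iterations on a small sample; one center per machine.
    rng = rng or np.random.default_rng(0)
    centers = sample[rng.choice(len(sample), size=k, replace=False)]
    for _ in range(iters):
        d = ((sample[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        labels = d.argmin(1)
        for j in range(k):
            pts = sample[labels == j]
            if len(pts):
                centers[j] = pts.mean(0)
    return centers

def data_dependent_dispatch(X, centers):
    # Extend the rule from the sample to all data: route each point to the
    # machine whose learned center is closest.
    d = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
    return d.argmin(1)

if __name__ == "__main__":
    rng = np.random.default_rng(42)
    X = rng.normal(size=(10000, 5))
    sample = X[rng.choice(len(X), size=500, replace=False)]
    centers = learn_centers(sample, k=8, rng=rng)
    print(np.bincount(data_dependent_dispatch(X, centers), minlength=8))

Unlike the paper's method, nothing above enforces balanced machine loads or replication for fault tolerance; it only illustrates why locality-aware dispatching can keep locally simple regions of the data on a single machine.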

Cite this Paper


BibTeX
@InProceedings{pmlr-v54-dick17a,
  title     = {{Data Driven Resource Allocation for Distributed Learning}},
  author    = {Dick, Travis and Li, Mu and Pillutla, Venkata Krishna and White, Colin and Balcan, Nina and Smola, Alex},
  booktitle = {Proceedings of the 20th International Conference on Artificial Intelligence and Statistics},
  pages     = {662--671},
  year      = {2017},
  editor    = {Singh, Aarti and Zhu, Jerry},
  volume    = {54},
  series    = {Proceedings of Machine Learning Research},
  month     = {20--22 Apr},
  publisher = {PMLR},
  pdf       = {http://proceedings.mlr.press/v54/dick17a/dick17a.pdf},
  url       = {https://proceedings.mlr.press/v54/dick17a.html},
  abstract  = {In distributed machine learning, data is dispatched to multiple machines for processing. Motivated by the fact that similar data points often belong to the same or similar classes, and more generally, classification rules of high accuracy tend to be “locally simple but globally complex” (Vapnik and Bottou, 1993), we propose data dependent dispatching that takes advantage of such structure. We present an in-depth analysis of this model, providing new algorithms with provable worst-case guarantees, analysis proving existing scalable heuristics perform well in natural non worst-case conditions, and techniques for extending a dispatching rule from a small sample to the entire distribution. We overcome novel technical challenges to satisfy important conditions for accurate distributed learning, including fault tolerance and balancedness. We empirically compare our approach with baselines based on random partitioning, balanced partition trees, and locality sensitive hashing, showing that we achieve significantly higher accuracy on both synthetic and real world image and advertising datasets. We also demonstrate that our technique strongly scales with the available computing power.}
}
Endnote
%0 Conference Paper
%T Data Driven Resource Allocation for Distributed Learning
%A Travis Dick
%A Mu Li
%A Venkata Krishna Pillutla
%A Colin White
%A Nina Balcan
%A Alex Smola
%B Proceedings of the 20th International Conference on Artificial Intelligence and Statistics
%C Proceedings of Machine Learning Research
%D 2017
%E Aarti Singh
%E Jerry Zhu
%F pmlr-v54-dick17a
%I PMLR
%P 662--671
%U https://proceedings.mlr.press/v54/dick17a.html
%V 54
%X In distributed machine learning, data is dispatched to multiple machines for processing. Motivated by the fact that similar data points often belong to the same or similar classes, and more generally, classification rules of high accuracy tend to be “locally simple but globally complex” (Vapnik and Bottou, 1993), we propose data dependent dispatching that takes advantage of such structure. We present an in-depth analysis of this model, providing new algorithms with provable worst-case guarantees, analysis proving existing scalable heuristics perform well in natural non worst-case conditions, and techniques for extending a dispatching rule from a small sample to the entire distribution. We overcome novel technical challenges to satisfy important conditions for accurate distributed learning, including fault tolerance and balancedness. We empirically compare our approach with baselines based on random partitioning, balanced partition trees, and locality sensitive hashing, showing that we achieve significantly higher accuracy on both synthetic and real world image and advertising datasets. We also demonstrate that our technique strongly scales with the available computing power.
APA
Dick, T., Li, M., Pillutla, V.K., White, C., Balcan, N., & Smola, A. (2017). Data Driven Resource Allocation for Distributed Learning. Proceedings of the 20th International Conference on Artificial Intelligence and Statistics, in Proceedings of Machine Learning Research 54:662-671. Available from https://proceedings.mlr.press/v54/dick17a.html.
