Reducing Data Loading Bottleneck with Coarse Feature Vectors for Large Scale Learning

Shingo Takamatsu, Carlos Guestrin
Proceedings of the 3rd International Workshop on Big Data, Streams and Heterogeneous Source Mining: Algorithms, Systems, Programming Models and Applications, PMLR 36:46-60, 2014.

Abstract

In large scale learning, disk I/O for data loading is often the runtime bottleneck. We propose a lossy data compression scheme with fast decompression to reduce disk I/O, allocating fewer than the standard 32 bits for each real value in the data set. We show theoretically that the estimation error induced by the loss in compression decreases exponentially with the number of bits used per value. Our experiments show the proposed method achieves excellent performance with a small number of bits and substantial speedups during training.
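The paper's exact encoding is not reproduced on this page; as a minimal sketch of the general idea, the snippet below applies uniform b-bit quantization to a float32 feature vector and recovers approximate values with a cheap dequantization step. The choice of uniform quantization and the function names are assumptions for illustration, not the authors' scheme; it only demonstrates how the rounding error shrinks on the order of 2^(-b) as more bits are allocated per value.

    import numpy as np

    def quantize(x, bits=8):
        """Map float32 values to unsigned integer codes of `bits` bits (illustrative only)."""
        lo, hi = float(x.min()), float(x.max())
        levels = (1 << bits) - 1
        # Scale to [0, levels] and round to the nearest code.
        codes = np.round((x - lo) / (hi - lo) * levels).astype(np.uint32)
        return codes, lo, hi

    def dequantize(codes, bits, lo, hi):
        """Recover approximate float32 values from the integer codes."""
        levels = (1 << bits) - 1
        return (codes.astype(np.float32) / levels) * (hi - lo) + lo

    x = np.random.rand(10).astype(np.float32)
    codes, lo, hi = quantize(x, bits=6)
    x_hat = dequantize(codes, 6, lo, hi)
    # Worst-case error is roughly (hi - lo) * 2**(-bits), i.e. it decays exponentially in the bit budget.
    print(np.abs(x - x_hat).max())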

Cite this Paper


BibTeX
@InProceedings{pmlr-v36-takamatsu14,
  title     = {Reducing Data Loading Bottleneck with Coarse Feature Vectors for Large Scale Learning},
  author    = {Takamatsu, Shingo and Guestrin, Carlos},
  booktitle = {Proceedings of the 3rd International Workshop on Big Data, Streams and Heterogeneous Source Mining: Algorithms, Systems, Programming Models and Applications},
  pages     = {46--60},
  year      = {2014},
  editor    = {Fan, Wei and Bifet, Albert and Yang, Qiang and Yu, Philip S.},
  volume    = {36},
  series    = {Proceedings of Machine Learning Research},
  address   = {New York, New York, USA},
  month     = {24 Aug},
  publisher = {PMLR},
  pdf       = {http://proceedings.mlr.press/v36/takamatsu14.pdf},
  url       = {https://proceedings.mlr.press/v36/takamatsu14.html},
  abstract  = {In large scale learning, disk I/O for data loading is often the runtime bottleneck. We propose a lossy data compression scheme with a fast decompression to reduce disk I/O, allocating fewer than the standard 32 bits for each real value in the data set. We theoretically show that the estimation error induced by the loss in compression decreases exponentially with the number of the bits used per value. Our experiments show the proposed method achieves excellent performance with a small number of bits and substantial speedups during training.}
}
Endnote
%0 Conference Paper
%T Reducing Data Loading Bottleneck with Coarse Feature Vectors for Large Scale Learning
%A Shingo Takamatsu
%A Carlos Guestrin
%B Proceedings of the 3rd International Workshop on Big Data, Streams and Heterogeneous Source Mining: Algorithms, Systems, Programming Models and Applications
%C Proceedings of Machine Learning Research
%D 2014
%E Wei Fan
%E Albert Bifet
%E Qiang Yang
%E Philip S. Yu
%F pmlr-v36-takamatsu14
%I PMLR
%P 46--60
%U https://proceedings.mlr.press/v36/takamatsu14.html
%V 36
%X In large scale learning, disk I/O for data loading is often the runtime bottleneck. We propose a lossy data compression scheme with a fast decompression to reduce disk I/O, allocating fewer than the standard 32 bits for each real value in the data set. We theoretically show that the estimation error induced by the loss in compression decreases exponentially with the number of the bits used per value. Our experiments show the proposed method achieves excellent performance with a small number of bits and substantial speedups during training.
RIS
TY  - CPAPER
TI  - Reducing Data Loading Bottleneck with Coarse Feature Vectors for Large Scale Learning
AU  - Shingo Takamatsu
AU  - Carlos Guestrin
BT  - Proceedings of the 3rd International Workshop on Big Data, Streams and Heterogeneous Source Mining: Algorithms, Systems, Programming Models and Applications
DA  - 2014/08/13
ED  - Wei Fan
ED  - Albert Bifet
ED  - Qiang Yang
ED  - Philip S. Yu
ID  - pmlr-v36-takamatsu14
PB  - PMLR
DP  - Proceedings of Machine Learning Research
VL  - 36
SP  - 46
EP  - 60
L1  - http://proceedings.mlr.press/v36/takamatsu14.pdf
UR  - https://proceedings.mlr.press/v36/takamatsu14.html
AB  - In large scale learning, disk I/O for data loading is often the runtime bottleneck. We propose a lossy data compression scheme with a fast decompression to reduce disk I/O, allocating fewer than the standard 32 bits for each real value in the data set. We theoretically show that the estimation error induced by the loss in compression decreases exponentially with the number of the bits used per value. Our experiments show the proposed method achieves excellent performance with a small number of bits and substantial speedups during training.
ER  -
APA
Takamatsu, S. & Guestrin, C. (2014). Reducing Data Loading Bottleneck with Coarse Feature Vectors for Large Scale Learning. Proceedings of the 3rd International Workshop on Big Data, Streams and Heterogeneous Source Mining: Algorithms, Systems, Programming Models and Applications, in Proceedings of Machine Learning Research 36:46-60. Available from https://proceedings.mlr.press/v36/takamatsu14.html.