Reducing Data Loading Bottleneck with Coarse Feature Vectors for Large Scale Learning

Shingo Takamatsu; Carlos Guestrin

Proceedings of the 3rd International Workshop on Big Data, Streams and Heterogeneous Source Mining: Algorithms, Systems, Programming Models and Applications, PMLR 36:46-60, 2014.

Abstract

In large scale learning, disk I/O for data loading is often the runtime bottleneck. We propose a lossy data compression scheme with fast decompression to reduce disk I/O, allocating fewer than the standard 32 bits for each real value in the data set. We theoretically show that the estimation error induced by the loss in compression decreases exponentially with the number of bits used per value. Our experiments show the proposed method achieves excellent performance with a small number of bits and substantial speedups during training.
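To make the idea of spending fewer than 32 bits per value concrete, here is a minimal sketch of plain uniform quantization. It is an illustration only and is not claimed to be the scheme from the paper: the function names (`quantize`, `dequantize`, `max_abs_error`), the assumed value range of [0, 1], and the random test data are all hypothetical. It also measures only the per-value rounding error, which is a different quantity from the estimation error analyzed in the paper.

```python
import random

def quantize(values, bits, lo, hi):
    """Map each value in [lo, hi] to an integer code in [0, 2**bits - 1]."""
    levels = (1 << bits) - 1          # highest code available with this many bits
    scale = levels / (hi - lo)
    return [min(levels, max(0, round((v - lo) * scale))) for v in values]

def dequantize(codes, bits, lo, hi):
    """Recover an approximate real value from each integer code."""
    levels = (1 << bits) - 1
    step = (hi - lo) / levels         # spacing between representable values
    return [lo + c * step for c in codes]

def max_abs_error(values, bits, lo=0.0, hi=1.0):
    """Largest round-trip error over the data for a given bit width."""
    codes = quantize(values, bits, lo, hi)
    restored = dequantize(codes, bits, lo, hi)
    return max(abs(a - b) for a, b in zip(values, restored))

# Hypothetical data: uniform random values in [0, 1).
random.seed(0)
data = [random.random() for _ in range(10_000)]

errors = {bits: max_abs_error(data, bits) for bits in (2, 4, 8, 16)}
for bits, err in errors.items():
    print(f"{bits:2d} bits: max abs error = {err:.6f}")
```

In this sketch the rounding error is bounded by half the quantization step, so each extra bit roughly halves it. Decoding is a single multiply and add per value, which is the kind of cheap decompression that keeps the saved disk I/O from being spent on CPU time.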

Cite this Paper

BibTeX


@InProceedings{pmlr-v36-takamatsu14,
  title = 	 {Reducing Data Loading Bottleneck with Coarse Feature Vectors for Large Scale Learning},
  author = 	 {Takamatsu, Shingo and Guestrin, Carlos},
  booktitle = 	 {Proceedings of the 3rd International Workshop on Big Data, Streams and Heterogeneous Source Mining: Algorithms, Systems, Programming Models and Applications},
  pages = 	 {46--60},
  year = 	 {2014},
  editor = 	 {Fan, Wei and Bifet, Albert and Yang, Qiang and Yu, Philip S.},
  volume = 	 {36},
  series = 	 {Proceedings of Machine Learning Research},
  address = 	 {New York, New York, USA},
  month = 	 {24 Aug},
  publisher =    {PMLR},
  pdf = 	 {http://proceedings.mlr.press/v36/takamatsu14.pdf},
  url = 	 {https://proceedings.mlr.press/v36/takamatsu14.html},
  abstract = 	 {In large scale learning, disk I/O for data loading is often the runtime bottleneck. We propose a lossy data compression scheme with a fast decompression to reduce disk I/O, allocating fewer than the standard 32 bits for each real value in the data set. We theoretically show that the estimation error induced by the loss in compression decreases exponentially with the number of the bits used per value. Our experiments show the proposed method achieves excellent performance with a small number of bits and substantial speedups during training.}
}

Endnote

%0 Conference Paper
%T Reducing Data Loading Bottleneck with Coarse Feature Vectors for Large Scale Learning
%A Shingo Takamatsu
%A Carlos Guestrin
%B Proceedings of the 3rd International Workshop on Big Data, Streams and Heterogeneous Source Mining: Algorithms, Systems, Programming Models and Applications
%C Proceedings of Machine Learning Research
%D 2014
%E Wei Fan
%E Albert Bifet
%E Qiang Yang
%E Philip S. Yu
%F pmlr-v36-takamatsu14
%I PMLR
%P 46--60
%U https://proceedings.mlr.press/v36/takamatsu14.html
%V 36
%X In large scale learning, disk I/O for data loading is often the runtime bottleneck. We propose a lossy data compression scheme with a fast decompression to reduce disk I/O, allocating fewer than the standard 32 bits for each real value in the data set. We theoretically show that the estimation error induced by the loss in compression decreases exponentially with the number of the bits used per value. Our experiments show the proposed method achieves excellent performance with a small number of bits and substantial speedups during training.

RIS


TY  - CPAPER
TI  - Reducing Data Loading Bottleneck with Coarse Feature Vectors for Large Scale Learning
AU  - Shingo Takamatsu
AU  - Carlos Guestrin
BT  - Proceedings of the 3rd International Workshop on Big Data, Streams and Heterogeneous Source Mining: Algorithms, Systems, Programming Models and Applications
DA  - 2014/08/13
ED  - Wei Fan
ED  - Albert Bifet
ED  - Qiang Yang
ED  - Philip S. Yu
ID  - pmlr-v36-takamatsu14
PB  - PMLR
DP  - Proceedings of Machine Learning Research
VL  - 36
SP  - 46
EP  - 60
L1  - http://proceedings.mlr.press/v36/takamatsu14.pdf
UR  - https://proceedings.mlr.press/v36/takamatsu14.html
AB  - In large scale learning, disk I/O for data loading is often the runtime bottleneck. We propose a lossy data compression scheme with a fast decompression to reduce disk I/O, allocating fewer than the standard 32 bits for each real value in the data set. We theoretically show that the estimation error induced by the loss in compression decreases exponentially with the number of the bits used per value. Our experiments show the proposed method achieves excellent performance with a small number of bits and substantial speedups during training.
ER  -

APA


Takamatsu, S. & Guestrin, C. (2014). Reducing Data Loading Bottleneck with Coarse Feature Vectors for Large Scale Learning. Proceedings of the 3rd International Workshop on Big Data, Streams and Heterogeneous Source Mining: Algorithms, Systems, Programming Models and Applications, in Proceedings of Machine Learning Research 36:46-60. Available from https://proceedings.mlr.press/v36/takamatsu14.html.
