An Efficient Algorithm for Large Scale Compressive Feature Learning

Hristo Paskov; John Mitchell; Trevor Hastie

An Efficient Algorithm for Large Scale Compressive Feature Learning

Hristo Paskov, John Mitchell, Trevor Hastie

Proceedings of the Seventeenth International Conference on Artificial Intelligence and Statistics, PMLR 33:760-768, 2014.

Abstract

This paper focuses on large-scale unsupervised feature selection from text. We expand upon the recently proposed Compressive Feature Learning (CFL) framework, a method that uses dictionary-based compression to select a K-gram representation for a document corpus. We show that CFL is NP-Complete and provide a novel and efficient approximation algorithm based on a homotopy that transforms a convex relaxation of CFL into the original problem. Our algorithm allows CFL to scale to corpuses comprised of millions of documents because each step is linear in the corpus length and highly parallelizable. We use it to extract features from the BeerAdvocate dataset, a corpus of over 1.5 million beer reviews spanning 10 years. CFL uses two orders of magnitude fewer features than the full trigram space. It beats a standard unigram model in a number of prediction tasks and achieves nearly twice the accuracy on an author identification task.

Cite this Paper

BibTeX


@InProceedings{pmlr-v33-paskov14,
  title = 	 {{An Efficient Algorithm for Large Scale Compressive Feature Learning}},
  author = 	 {Paskov, Hristo and Mitchell, John and Hastie, Trevor},
  booktitle = 	 {Proceedings of the Seventeenth International Conference on Artificial Intelligence and Statistics},
  pages = 	 {760--768},
  year = 	 {2014},
  editor = 	 {Kaski, Samuel and Corander, Jukka},
  volume = 	 {33},
  series = 	 {Proceedings of Machine Learning Research},
  address = 	 {Reykjavik, Iceland},
  month = 	 {22--25 Apr},
  publisher =    {PMLR},
  pdf = 	 {http://proceedings.mlr.press/v33/paskov14.pdf},
  url = 	 {https://proceedings.mlr.press/v33/paskov14.html},
  abstract = 	 {This paper focuses on large-scale unsupervised feature selection from text. We expand upon the recently proposed Compressive Feature Learning (CFL) framework, a method that uses dictionary-based compression to select a K-gram representation for a document corpus. We show that CFL is NP-Complete and provide a novel and efficient approximation algorithm based on a homotopy that transforms a convex relaxation of CFL into the original problem. Our algorithm allows CFL to scale to corpuses comprised of millions of documents because each step is linear in the corpus length and highly parallelizable. We use it to extract features from the BeerAdvocate dataset, a corpus of over 1.5 million beer reviews spanning 10 years. CFL uses two orders of magnitude fewer features than the full trigram space. It beats a standard unigram model in a number of prediction tasks and achieves nearly twice the accuracy on an author identification task.}
}

Endnote

%0 Conference Paper
%T An Efficient Algorithm for Large Scale Compressive Feature Learning
%A Hristo Paskov
%A John Mitchell
%A Trevor Hastie
%B Proceedings of the Seventeenth International Conference on Artificial Intelligence and Statistics
%C Proceedings of Machine Learning Research
%D 2014
%E Samuel Kaski
%E Jukka Corander	
%F pmlr-v33-paskov14
%I PMLR
%P 760--768
%U https://proceedings.mlr.press/v33/paskov14.html
%V 33
%X This paper focuses on large-scale unsupervised feature selection from text. We expand upon the recently proposed Compressive Feature Learning (CFL) framework, a method that uses dictionary-based compression to select a K-gram representation for a document corpus. We show that CFL is NP-Complete and provide a novel and efficient approximation algorithm based on a homotopy that transforms a convex relaxation of CFL into the original problem. Our algorithm allows CFL to scale to corpuses comprised of millions of documents because each step is linear in the corpus length and highly parallelizable. We use it to extract features from the BeerAdvocate dataset, a corpus of over 1.5 million beer reviews spanning 10 years. CFL uses two orders of magnitude fewer features than the full trigram space. It beats a standard unigram model in a number of prediction tasks and achieves nearly twice the accuracy on an author identification task.

RIS


TY  - CPAPER
TI  - An Efficient Algorithm for Large Scale Compressive Feature Learning
AU  - Hristo Paskov
AU  - John Mitchell
AU  - Trevor Hastie
BT  - Proceedings of the Seventeenth International Conference on Artificial Intelligence and Statistics
DA  - 2014/04/02
ED  - Samuel Kaski
ED  - Jukka Corander	
ID  - pmlr-v33-paskov14
PB  - PMLR
DP  - Proceedings of Machine Learning Research
VL  - 33
SP  - 760
EP  - 768
L1  - http://proceedings.mlr.press/v33/paskov14.pdf
UR  - https://proceedings.mlr.press/v33/paskov14.html
AB  - This paper focuses on large-scale unsupervised feature selection from text. We expand upon the recently proposed Compressive Feature Learning (CFL) framework, a method that uses dictionary-based compression to select a K-gram representation for a document corpus. We show that CFL is NP-Complete and provide a novel and efficient approximation algorithm based on a homotopy that transforms a convex relaxation of CFL into the original problem. Our algorithm allows CFL to scale to corpuses comprised of millions of documents because each step is linear in the corpus length and highly parallelizable. We use it to extract features from the BeerAdvocate dataset, a corpus of over 1.5 million beer reviews spanning 10 years. CFL uses two orders of magnitude fewer features than the full trigram space. It beats a standard unigram model in a number of prediction tasks and achieves nearly twice the accuracy on an author identification task.
ER  -

APA


Paskov, H., Mitchell, J. & Hastie, T.. (2014). An Efficient Algorithm for Large Scale Compressive Feature Learning. Proceedings of the Seventeenth International Conference on Artificial Intelligence and Statistics, in Proceedings of Machine Learning Research 33:760-768 Available from https://proceedings.mlr.press/v33/paskov14.html.

An Efficient Algorithm for Large Scale Compressive Feature Learning

Abstract

Cite this Paper

Related Material