An Efficient Algorithm for Large Scale Compressive Feature Learning

Hristo Paskov, John Mitchell, Trevor Hastie
Proceedings of the Seventeenth International Conference on Artificial Intelligence and Statistics, PMLR 33:760-768, 2014.

Abstract

This paper focuses on large-scale unsupervised feature selection from text. We expand upon the recently proposed Compressive Feature Learning (CFL) framework, a method that uses dictionary-based compression to select a K-gram representation for a document corpus. We show that CFL is NP-Complete and provide a novel and efficient approximation algorithm based on a homotopy that transforms a convex relaxation of CFL into the original problem. Our algorithm allows CFL to scale to corpora comprising millions of documents because each step is linear in the corpus length and highly parallelizable. We use it to extract features from the BeerAdvocate dataset, a corpus of over 1.5 million beer reviews spanning 10 years. CFL uses two orders of magnitude fewer features than the full trigram space. It beats a standard unigram model on a number of prediction tasks and achieves nearly twice the accuracy on an author identification task.
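
To make the abstract's moving parts concrete, the sketch below writes down the kind of dictionary-based covering objective CFL works with and shows where the convex relaxation enters. The notation (pointer indicators w_{i,s} for a pointer to substring s placed at corpus position i, a unit cost per pointer, and a dictionary weight λ) is our own shorthand for exposition and is not taken from the paper.

\[
\min_{w_{i,s} \in \{0,1\}} \;\; \sum_{(i,s)} w_{i,s} \;+\; \lambda \sum_{s} |s| \, \max_{i}\, w_{i,s}
\qquad \text{s.t.} \qquad \sum_{(i,s)\,:\; i \le j < i+|s|} w_{i,s} \;\ge\; 1 \;\; \text{for every corpus position } j.
\]

Here w_{i,s} = 1 stores a pointer to substring s (of length at most K) at position i; the constraint forces every character of every document to be reconstructed, the first term counts pointers, and the second term charges for the substrings admitted to the dictionary. With binary w this is the combinatorial problem the paper shows to be NP-Complete; relaxing to 0 ≤ w_{i,s} ≤ 1 replaces the max with an ℓ∞ norm over nonnegative variables and yields a convex program, and the paper's homotopy deforms that relaxation back into the original binary problem.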

Cite this Paper


BibTeX
@InProceedings{pmlr-v33-paskov14,
  title     = {{An Efficient Algorithm for Large Scale Compressive Feature Learning}},
  author    = {Paskov, Hristo and Mitchell, John and Hastie, Trevor},
  booktitle = {Proceedings of the Seventeenth International Conference on Artificial Intelligence and Statistics},
  pages     = {760--768},
  year      = {2014},
  editor    = {Kaski, Samuel and Corander, Jukka},
  volume    = {33},
  series    = {Proceedings of Machine Learning Research},
  address   = {Reykjavik, Iceland},
  month     = {22--25 Apr},
  publisher = {PMLR},
  pdf       = {http://proceedings.mlr.press/v33/paskov14.pdf},
  url       = {https://proceedings.mlr.press/v33/paskov14.html},
  abstract  = {This paper focuses on large-scale unsupervised feature selection from text. We expand upon the recently proposed Compressive Feature Learning (CFL) framework, a method that uses dictionary-based compression to select a K-gram representation for a document corpus. We show that CFL is NP-Complete and provide a novel and efficient approximation algorithm based on a homotopy that transforms a convex relaxation of CFL into the original problem. Our algorithm allows CFL to scale to corpuses comprised of millions of documents because each step is linear in the corpus length and highly parallelizable. We use it to extract features from the BeerAdvocate dataset, a corpus of over 1.5 million beer reviews spanning 10 years. CFL uses two orders of magnitude fewer features than the full trigram space. It beats a standard unigram model in a number of prediction tasks and achieves nearly twice the accuracy on an author identification task.}
}
Endnote
%0 Conference Paper
%T An Efficient Algorithm for Large Scale Compressive Feature Learning
%A Hristo Paskov
%A John Mitchell
%A Trevor Hastie
%B Proceedings of the Seventeenth International Conference on Artificial Intelligence and Statistics
%C Proceedings of Machine Learning Research
%D 2014
%E Samuel Kaski
%E Jukka Corander
%F pmlr-v33-paskov14
%I PMLR
%P 760--768
%U https://proceedings.mlr.press/v33/paskov14.html
%V 33
%X This paper focuses on large-scale unsupervised feature selection from text. We expand upon the recently proposed Compressive Feature Learning (CFL) framework, a method that uses dictionary-based compression to select a K-gram representation for a document corpus. We show that CFL is NP-Complete and provide a novel and efficient approximation algorithm based on a homotopy that transforms a convex relaxation of CFL into the original problem. Our algorithm allows CFL to scale to corpuses comprised of millions of documents because each step is linear in the corpus length and highly parallelizable. We use it to extract features from the BeerAdvocate dataset, a corpus of over 1.5 million beer reviews spanning 10 years. CFL uses two orders of magnitude fewer features than the full trigram space. It beats a standard unigram model in a number of prediction tasks and achieves nearly twice the accuracy on an author identification task.
RIS
TY - CPAPER
TI - An Efficient Algorithm for Large Scale Compressive Feature Learning
AU - Hristo Paskov
AU - John Mitchell
AU - Trevor Hastie
BT - Proceedings of the Seventeenth International Conference on Artificial Intelligence and Statistics
DA - 2014/04/02
ED - Samuel Kaski
ED - Jukka Corander
ID - pmlr-v33-paskov14
PB - PMLR
DP - Proceedings of Machine Learning Research
VL - 33
SP - 760
EP - 768
L1 - http://proceedings.mlr.press/v33/paskov14.pdf
UR - https://proceedings.mlr.press/v33/paskov14.html
AB - This paper focuses on large-scale unsupervised feature selection from text. We expand upon the recently proposed Compressive Feature Learning (CFL) framework, a method that uses dictionary-based compression to select a K-gram representation for a document corpus. We show that CFL is NP-Complete and provide a novel and efficient approximation algorithm based on a homotopy that transforms a convex relaxation of CFL into the original problem. Our algorithm allows CFL to scale to corpuses comprised of millions of documents because each step is linear in the corpus length and highly parallelizable. We use it to extract features from the BeerAdvocate dataset, a corpus of over 1.5 million beer reviews spanning 10 years. CFL uses two orders of magnitude fewer features than the full trigram space. It beats a standard unigram model in a number of prediction tasks and achieves nearly twice the accuracy on an author identification task.
ER -
APA
Paskov, H., Mitchell, J. & Hastie, T. (2014). An Efficient Algorithm for Large Scale Compressive Feature Learning. Proceedings of the Seventeenth International Conference on Artificial Intelligence and Statistics, in Proceedings of Machine Learning Research 33:760-768. Available from https://proceedings.mlr.press/v33/paskov14.html.
