Scaling Graph-based Semi Supervised Learning to Large Number of Labels Using Count-Min Sketch

Partha Talukdar; William Cohen

Scaling Graph-based Semi Supervised Learning to Large Number of Labels Using Count-Min Sketch

Partha Talukdar, William Cohen

Proceedings of the Seventeenth International Conference on Artificial Intelligence and Statistics, PMLR 33:940-947, 2014.

Abstract

Graph-based Semi-supervised learning (SSL) algorithms have been successfully used in a large number of applications. These methods classify initially unlabeled nodes by propagating label information over the structure of graph starting from seed nodes. Graph-based SSL algorithms usually scale linearly with the number of distinct labels (m), and require O(m) space on each node. Unfortunately, there exist many applications of practical significance with very large m over large graphs, demanding better space and time complexity. In this paper, we propose MAD-Sketch, a novel graph-based SSL algorithm which compactly stores label distribution on each node using Count-min Sketch, a randomized data structure. We present theoretical analysis showing that under mild conditions, MAD-Sketch can reduce space complexity at each node from O(m) to O(\log(m)), and achieve similar savings in time complexity as well. We support our analysis through experiments on multiple real world datasets. We observe that MAD-Sketch achieves similar performance as existing state-of-the-art graph-based SSL algorithms, while requiring smaller memory footprint and at the same time achieving up to 10x speedup. We find that MAD-Sketch is able to scale to datasets with one million labels, which is beyond the scope of existing graph-based SSL algorithms.

Cite this Paper

BibTeX


@InProceedings{pmlr-v33-talukdar14,
  title = 	 {{Scaling Graph-based Semi Supervised Learning to Large Number of Labels Using Count-Min Sketch}},
  author = 	 {Talukdar, Partha and Cohen, William},
  booktitle = 	 {Proceedings of the Seventeenth International Conference on Artificial Intelligence and Statistics},
  pages = 	 {940--947},
  year = 	 {2014},
  editor = 	 {Kaski, Samuel and Corander, Jukka},
  volume = 	 {33},
  series = 	 {Proceedings of Machine Learning Research},
  address = 	 {Reykjavik, Iceland},
  month = 	 {22--25 Apr},
  publisher =    {PMLR},
  pdf = 	 {http://proceedings.mlr.press/v33/talukdar14.pdf},
  url = 	 {https://proceedings.mlr.press/v33/talukdar14.html},
  abstract = 	 {Graph-based Semi-supervised learning (SSL) algorithms have been successfully used in a large number of applications. These methods classify initially unlabeled nodes by propagating label information over the structure of graph starting from seed nodes. Graph-based SSL algorithms usually scale linearly with the number of distinct labels (m), and require O(m) space on each node. Unfortunately, there exist many applications of practical significance with very large m over large graphs, demanding better space and time complexity. In this paper, we propose MAD-Sketch, a novel graph-based SSL algorithm which compactly stores label distribution on each node using Count-min Sketch, a randomized data structure. We present theoretical analysis showing that under mild conditions, MAD-Sketch can reduce space complexity at each node from O(m) to O(\log(m)), and achieve similar savings in time complexity as well. We support our analysis through experiments on multiple real world datasets. We observe that MAD-Sketch achieves similar performance as existing state-of-the-art graph-based SSL algorithms, while requiring smaller memory footprint and at the same time achieving up to 10x speedup. We find that MAD-Sketch is able to scale to datasets with one million labels, which is beyond the scope of existing graph-based SSL algorithms.}
}

Endnote

%0 Conference Paper
%T Scaling Graph-based Semi Supervised Learning to Large Number of Labels Using Count-Min Sketch
%A Partha Talukdar
%A William Cohen
%B Proceedings of the Seventeenth International Conference on Artificial Intelligence and Statistics
%C Proceedings of Machine Learning Research
%D 2014
%E Samuel Kaski
%E Jukka Corander	
%F pmlr-v33-talukdar14
%I PMLR
%P 940--947
%U https://proceedings.mlr.press/v33/talukdar14.html
%V 33
%X Graph-based Semi-supervised learning (SSL) algorithms have been successfully used in a large number of applications. These methods classify initially unlabeled nodes by propagating label information over the structure of graph starting from seed nodes. Graph-based SSL algorithms usually scale linearly with the number of distinct labels (m), and require O(m) space on each node. Unfortunately, there exist many applications of practical significance with very large m over large graphs, demanding better space and time complexity. In this paper, we propose MAD-Sketch, a novel graph-based SSL algorithm which compactly stores label distribution on each node using Count-min Sketch, a randomized data structure. We present theoretical analysis showing that under mild conditions, MAD-Sketch can reduce space complexity at each node from O(m) to O(\log(m)), and achieve similar savings in time complexity as well. We support our analysis through experiments on multiple real world datasets. We observe that MAD-Sketch achieves similar performance as existing state-of-the-art graph-based SSL algorithms, while requiring smaller memory footprint and at the same time achieving up to 10x speedup. We find that MAD-Sketch is able to scale to datasets with one million labels, which is beyond the scope of existing graph-based SSL algorithms.

RIS


TY  - CPAPER
TI  - Scaling Graph-based Semi Supervised Learning to Large Number of Labels Using Count-Min Sketch
AU  - Partha Talukdar
AU  - William Cohen
BT  - Proceedings of the Seventeenth International Conference on Artificial Intelligence and Statistics
DA  - 2014/04/02
ED  - Samuel Kaski
ED  - Jukka Corander	
ID  - pmlr-v33-talukdar14
PB  - PMLR
DP  - Proceedings of Machine Learning Research
VL  - 33
SP  - 940
EP  - 947
L1  - http://proceedings.mlr.press/v33/talukdar14.pdf
UR  - https://proceedings.mlr.press/v33/talukdar14.html
AB  - Graph-based Semi-supervised learning (SSL) algorithms have been successfully used in a large number of applications. These methods classify initially unlabeled nodes by propagating label information over the structure of graph starting from seed nodes. Graph-based SSL algorithms usually scale linearly with the number of distinct labels (m), and require O(m) space on each node. Unfortunately, there exist many applications of practical significance with very large m over large graphs, demanding better space and time complexity. In this paper, we propose MAD-Sketch, a novel graph-based SSL algorithm which compactly stores label distribution on each node using Count-min Sketch, a randomized data structure. We present theoretical analysis showing that under mild conditions, MAD-Sketch can reduce space complexity at each node from O(m) to O(\log(m)), and achieve similar savings in time complexity as well. We support our analysis through experiments on multiple real world datasets. We observe that MAD-Sketch achieves similar performance as existing state-of-the-art graph-based SSL algorithms, while requiring smaller memory footprint and at the same time achieving up to 10x speedup. We find that MAD-Sketch is able to scale to datasets with one million labels, which is beyond the scope of existing graph-based SSL algorithms.
ER  -

APA


Talukdar, P. & Cohen, W.. (2014). Scaling Graph-based Semi Supervised Learning to Large Number of Labels Using Count-Min Sketch. Proceedings of the Seventeenth International Conference on Artificial Intelligence and Statistics, in Proceedings of Machine Learning Research 33:940-947 Available from https://proceedings.mlr.press/v33/talukdar14.html.

Related Material

Download PDF