Beyond 1/2Approximation for Submodular Maximization on Massive Data Streams
[edit]
Proceedings of the 35th International Conference on Machine Learning, PMLR 80:38293838, 2018.
Abstract
Many tasks in machine learning and data mining, such as data diversification, nonparametric learning, kernel machines, clustering etc., require extracting a small but representative summary from a massive dataset. Often, such problems can be posed as maximizing a submodular set function subject to a cardinality constraint. We consider this question in the streaming setting, where elements arrive over time at a fast pace and thus we need to design an efficient, lowmemory algorithm. One such method, proposed by Badanidiyuru et al. (2014), always finds a 0.5approximate solution. Can this approximation factor be improved? We answer this question affirmatively by designing a new algorithm Salsa for streaming submodular maximization. It is the first lowmemory, singlepass algorithm that improves the factor 0.5, under the natural assumption that elements arrive in a random order. We also show that this assumption is necessary, i.e., that there is no such algorithm with better than 0.5approximation when elements arrive in arbitrary order. Our experiments demonstrate that Salsa significantly outperforms the state of the art in applications related to exemplarbased clustering, social graph analysis, and recommender systems.
Related Material


