Coresets for Vector Summarization with Applications to Network Graphs

Dan Feldman; Sedat Ozer; Daniela Rus

Proceedings of the 34th International Conference on Machine Learning, PMLR 70:1117-1125, 2017.

Abstract

We provide a deterministic data summarization algorithm that approximates the mean $\bar{p}=\frac{1}{n}\sum_{p\in P} p$ of a set $P$ of $n$ vectors in $\mathbb{R}^d$, by a weighted mean $\tilde{p}$ of a subset of $O(1/\epsilon)$ vectors, i.e., independent of both $n$ and $d$. We prove that the squared Euclidean distance between $\bar{p}$ and $\tilde{p}$ is at most $\epsilon$ multiplied by the variance of $P$. We use this algorithm to maintain an approximated sum of vectors from an unbounded stream, using memory that is independent of $d$, and logarithmic in the $n$ vectors seen so far. Our main application is to extract and represent in a compact way friend groups and activity summaries of users from underlying data exchanges. For example, in the case of mobile networks, we can use GPS traces to identify meetings; in the case of social networks, we can use information exchange to identify friend groups. Our algorithm provably identifies the Heavy Hitter entries in a proximity (adjacency) matrix. The Heavy Hitters can be used to extract and represent in a compact way friend groups and activity summaries of users from underlying data exchanges. We evaluate the algorithm on several large data sets.
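
To illustrate the idea of approximating a mean by a weighted mean of a few input vectors, here is a minimal sketch. It is not the paper's algorithm: it is a generic Frank-Wolfe iteration over the probability simplex, which after `k` steps yields a convex combination of at most `k + 1` input points. The names `sparse_mean` and `sq_dist`, the parameter `k`, and the synthetic Gaussian data are assumptions made for illustration, and the variance-based error bound stated in the abstract is not claimed for this sketch.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((u - v) ** 2 for u, v in zip(a, b))


def sparse_mean(points, k):
    """Frank-Wolfe sketch (illustrative, not the paper's method).

    Approximates the mean of `points` by a convex combination of at most
    k + 1 of them. Returns (weights, approx, mean), where `weights` maps
    point index -> weight.
    """
    n, d = len(points), len(points[0])
    mean = [sum(p[j] for p in points) / n for j in range(d)]
    weights = {0: 1.0}       # start from the first point
    x = list(points[0])      # current weighted mean
    for _ in range(k):
        r = [x[j] - mean[j] for j in range(d)]  # residual x - mean
        # Linear minimization: the point with the smallest inner product
        # against the residual (steepest descent vertex of the simplex).
        i = min(range(n),
                key=lambda a: sum(points[a][j] * r[j] for j in range(d)))
        direction = [points[i][j] - x[j] for j in range(d)]
        denom = sum(v * v for v in direction)
        if denom == 0.0:
            break
        # Exact line search on the segment [x, points[i]], clamped to [0, 1].
        t = -sum(r[j] * direction[j] for j in range(d)) / denom
        t = max(0.0, min(1.0, t))
        if t == 0.0:
            break            # no further progress is possible
        x = [x[j] + t * direction[j] for j in range(d)]
        weights = {a: w * (1.0 - t) for a, w in weights.items()}
        weights[i] = weights.get(i, 0.0) + t
    return weights, x, mean


# Hypothetical usage on synthetic data: 200 Gaussian vectors in R^5.
rng = random.Random(0)
pts = [[rng.gauss(0.0, 1.0) for _ in range(5)] for _ in range(200)]
w, approx, mean = sparse_mean(pts, k=20)
print(len(w), sq_dist(approx, mean))
```

Because each step uses an exact line search, the squared distance to the true mean never increases from one iteration to the next, and the number of points carrying weight grows by at most one per step, independent of the number of input vectors and of their dimension.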

Cite this Paper

BibTeX


@InProceedings{pmlr-v70-feldman17a,
  title = {Coresets for Vector Summarization with Applications to Network Graphs},
  author = {Dan Feldman and Sedat Ozer and Daniela Rus},
  booktitle = {Proceedings of the 34th International Conference on Machine Learning},
  pages = {1117--1125},
  year = {2017},
  editor = {Precup, Doina and Teh, Yee Whye},
  volume = {70},
  series = {Proceedings of Machine Learning Research},
  month = {06--11 Aug},
  publisher = {PMLR},
  pdf = {http://proceedings.mlr.press/v70/feldman17a/feldman17a.pdf},
  url = {https://proceedings.mlr.press/v70/feldman17a.html},
  abstract = 	 {We provide a deterministic data summarization algorithm that approximates the mean $\bar{p}=\frac{1}{n}\sum_{p\in P} p$ of a set $P$ of $n$ vectors in $\mathbb{R}^d$, by a weighted mean $\tilde{p}$ of a subset of $O(1/\epsilon)$ vectors, i.e., independent of both $n$ and $d$. We prove that the squared Euclidean distance between $\bar{p}$ and $\tilde{p}$ is at most $\epsilon$ multiplied by the variance of $P$. We use this algorithm to maintain an approximated sum of vectors from an unbounded stream, using memory that is independent of $d$, and logarithmic in the $n$ vectors seen so far. Our main application is to extract and represent in a compact way friend groups and activity summaries of users from underlying data exchanges. For example, in the case of mobile networks, we can use GPS traces to identify meetings; in the case of social networks, we can use information exchange to identify friend groups. Our algorithm provably identifies the Heavy Hitter entries in a proximity (adjacency) matrix. The Heavy Hitters can be used to extract and represent in a compact way friend groups and activity summaries of users from underlying data exchanges. We evaluate the algorithm on several large data sets.}
}

Endnote

%0 Conference Paper
%T Coresets for Vector Summarization with Applications to Network Graphs
%A Dan Feldman
%A Sedat Ozer
%A Daniela Rus
%B Proceedings of the 34th International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2017
%E Doina Precup
%E Yee Whye Teh
%F pmlr-v70-feldman17a
%I PMLR
%P 1117--1125
%U https://proceedings.mlr.press/v70/feldman17a.html
%V 70
%X We provide a deterministic data summarization algorithm that approximates the mean $\bar{p}=\frac{1}{n}\sum_{p\in P} p$ of a set $P$ of $n$ vectors in $\mathbb{R}^d$, by a weighted mean $\tilde{p}$ of a subset of $O(1/\epsilon)$ vectors, i.e., independent of both $n$ and $d$. We prove that the squared Euclidean distance between $\bar{p}$ and $\tilde{p}$ is at most $\epsilon$ multiplied by the variance of $P$. We use this algorithm to maintain an approximated sum of vectors from an unbounded stream, using memory that is independent of $d$, and logarithmic in the $n$ vectors seen so far. Our main application is to extract and represent in a compact way friend groups and activity summaries of users from underlying data exchanges. For example, in the case of mobile networks, we can use GPS traces to identify meetings; in the case of social networks, we can use information exchange to identify friend groups. Our algorithm provably identifies the Heavy Hitter entries in a proximity (adjacency) matrix. The Heavy Hitters can be used to extract and represent in a compact way friend groups and activity summaries of users from underlying data exchanges. We evaluate the algorithm on several large data sets.

APA


Feldman, D., Ozer, S. & Rus, D. (2017). Coresets for Vector Summarization with Applications to Network Graphs. Proceedings of the 34th International Conference on Machine Learning, in Proceedings of Machine Learning Research 70:1117-1125. Available from https://proceedings.mlr.press/v70/feldman17a.html.
