Maximum Coverage in Turnstile Streams with Applications to Fingerprinting Measures

Alina Ene, Alessandro Epasto, Vahab Mirrokni, Hoai-An Nguyen, Huy Nguyen, David Woodruff, Peilin Zhong
Proceedings of the 42nd International Conference on Machine Learning, PMLR 267:15346-15369, 2025.

Abstract

In the maximum coverage problem we are given $d$ subsets from a universe $[n]$, and the goal is to output $k$ subsets such that their union covers the largest possible number of distinct items. We present the first algorithm for maximum coverage in the turnstile streaming model, where updates which insert or delete an item from a subset come one-by-one. Notably our algorithm only uses $poly\log n$ update time. We also present turnstile streaming algorithms for targeted and general fingerprinting for risk management where the goal is to determine which features pose the greatest re-identification risk in a dataset. As part of our work, we give a result of independent interest: an algorithm to estimate the complement of the $p^{\text{th}}$ frequency moment of a vector for $p \geq 2$. Empirical evaluation confirms the practicality of our fingerprinting algorithms demonstrating a speedup of up to $210$x over prior work.

Cite this Paper


BibTeX
@InProceedings{pmlr-v267-ene25a, title = {Maximum Coverage in Turnstile Streams with Applications to Fingerprinting Measures}, author = {Ene, Alina and Epasto, Alessandro and Mirrokni, Vahab and Nguyen, Hoai-An and Nguyen, Huy and Woodruff, David and Zhong, Peilin}, booktitle = {Proceedings of the 42nd International Conference on Machine Learning}, pages = {15346--15369}, year = {2025}, editor = {Singh, Aarti and Fazel, Maryam and Hsu, Daniel and Lacoste-Julien, Simon and Berkenkamp, Felix and Maharaj, Tegan and Wagstaff, Kiri and Zhu, Jerry}, volume = {267}, series = {Proceedings of Machine Learning Research}, month = {13--19 Jul}, publisher = {PMLR}, pdf = {https://raw.githubusercontent.com/mlresearch/v267/main/assets/ene25a/ene25a.pdf}, url = {https://proceedings.mlr.press/v267/ene25a.html}, abstract = {In the maximum coverage problem we are given $d$ subsets from a universe $[n]$, and the goal is to output $k$ subsets such that their union covers the largest possible number of distinct items. We present the first algorithm for maximum coverage in the turnstile streaming model, where updates which insert or delete an item from a subset come one-by-one. Notably our algorithm only uses $poly\log n$ update time. We also present turnstile streaming algorithms for targeted and general fingerprinting for risk management where the goal is to determine which features pose the greatest re-identification risk in a dataset. As part of our work, we give a result of independent interest: an algorithm to estimate the complement of the $p^{\text{th}}$ frequency moment of a vector for $p \geq 2$. Empirical evaluation confirms the practicality of our fingerprinting algorithms demonstrating a speedup of up to $210$x over prior work.} }
Endnote
%0 Conference Paper %T Maximum Coverage in Turnstile Streams with Applications to Fingerprinting Measures %A Alina Ene %A Alessandro Epasto %A Vahab Mirrokni %A Hoai-An Nguyen %A Huy Nguyen %A David Woodruff %A Peilin Zhong %B Proceedings of the 42nd International Conference on Machine Learning %C Proceedings of Machine Learning Research %D 2025 %E Aarti Singh %E Maryam Fazel %E Daniel Hsu %E Simon Lacoste-Julien %E Felix Berkenkamp %E Tegan Maharaj %E Kiri Wagstaff %E Jerry Zhu %F pmlr-v267-ene25a %I PMLR %P 15346--15369 %U https://proceedings.mlr.press/v267/ene25a.html %V 267 %X In the maximum coverage problem we are given $d$ subsets from a universe $[n]$, and the goal is to output $k$ subsets such that their union covers the largest possible number of distinct items. We present the first algorithm for maximum coverage in the turnstile streaming model, where updates which insert or delete an item from a subset come one-by-one. Notably our algorithm only uses $poly\log n$ update time. We also present turnstile streaming algorithms for targeted and general fingerprinting for risk management where the goal is to determine which features pose the greatest re-identification risk in a dataset. As part of our work, we give a result of independent interest: an algorithm to estimate the complement of the $p^{\text{th}}$ frequency moment of a vector for $p \geq 2$. Empirical evaluation confirms the practicality of our fingerprinting algorithms demonstrating a speedup of up to $210$x over prior work.
APA
Ene, A., Epasto, A., Mirrokni, V., Nguyen, H., Nguyen, H., Woodruff, D. & Zhong, P.. (2025). Maximum Coverage in Turnstile Streams with Applications to Fingerprinting Measures. Proceedings of the 42nd International Conference on Machine Learning, in Proceedings of Machine Learning Research 267:15346-15369 Available from https://proceedings.mlr.press/v267/ene25a.html.

Related Material