Coresets for Data Discretization and Sine Wave Fitting

Alaa Maalouf, Murad Tukan, Eric Price, Daniel M. Kane, Dan Feldman
Proceedings of The 25th International Conference on Artificial Intelligence and Statistics, PMLR 151:10622-10639, 2022.


In the monitoring problem, the input is an unbounded stream $P={p_1,p_2\cdots}$ of integers in $[N]:=\{1,\cdots,N\}$, that are obtained from a sensor (such as GPS or heart beats of a human). The goal (e.g., for anomaly detection) is to approximate the $n$ points received so far in $P$ by a single frequency $\sin$, e.g. $\min_{c\in C}cost(P,c)+\lambda(c)$, where $cost(P,c)=\sum_{i=1}^n \sin^2(\frac{2\pi}{N} p_ic)$, $C\subseteq [N]$ is a feasible set of solutions, and $\lambda$ is a given regularization function. For any approximation error $\varepsilon>0$, we prove that every set $P$ of $n$ integers has a weighted subset $S\subseteq P$ (sometimes called core-set) of cardinality $|S|\in O(\log(N)^{O(1)})$ that approximates $cost(P,c)$ (for every $c\in [N]$) up to a multiplicative factor of $1\pm\varepsilon$. Using known coreset techniques, this implies streaming algorithms using only $O((\log(N)\log(n))^{O(1)})$ memory. Our results hold for a large family of functions. Experimental results and open source code are provided.
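As a rough illustration of the objective in the abstract, the following Python sketch evaluates $cost(P,c)$, optionally with per-point weights, and brute-forces the regularized minimum over a feasible set $C$. The function names (`cost`, `best_frequency`, `relative_error`) and the toy inputs are assumptions made for this example. It does not reproduce the paper's coreset construction; it only shows what it means for a weighted subset to approximate the cost.

```python
import math


def cost(points, c, N, weights=None):
    """Weighted sum of sin^2((2*pi/N) * p * c) over the points.

    With weights=None every point has weight 1, which gives cost(P, c)
    as defined in the abstract.
    """
    if weights is None:
        weights = [1.0] * len(points)
    return sum(w * math.sin(2 * math.pi / N * p * c) ** 2
               for p, w in zip(points, weights))


def best_frequency(points, N, candidates, reg=lambda c: 0.0, weights=None):
    """Brute-force argmin over the feasible set C of cost(P, c) + lambda(c).

    `reg` plays the role of the regularization function lambda; the default
    (no regularization) is an assumption for illustration.
    """
    return min(candidates, key=lambda c: cost(points, c, N, weights) + reg(c))


def relative_error(points, subset, weights, c, N):
    """Relative gap between the full cost and the weighted-subset cost."""
    full = cost(points, c, N)
    approx = cost(subset, c, N, weights)
    return abs(full - approx) / full if full > 0 else abs(approx)


if __name__ == "__main__":
    N = 8
    P = [2, 2, 4]                  # toy stream with a repeated reading
    S, w = [2, 4], [2.0, 1.0]      # weighted subset that merges the duplicate
    for c in range(1, N + 1):
        print(c, cost(P, c, N), relative_error(P, S, w, c, N))
    print("best c:", best_frequency(P, N, candidates=range(1, N + 1)))
```

In this toy case the weighted subset matches the full cost exactly because it only merges duplicate points. The result stated in the abstract is much stronger: a subset of size polylogarithmic in $N$ whose weighted cost is within a factor of $1\pm\varepsilon$ for every $c\in [N]$.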

Cite this Paper

@InProceedings{pmlr-v151-maalouf22a,
  title = {Coresets for Data Discretization and Sine Wave Fitting},
  author = {Maalouf, Alaa and Tukan, Murad and Price, Eric and Kane, Daniel M. and Feldman, Dan},
  booktitle = {Proceedings of The 25th International Conference on Artificial Intelligence and Statistics},
  pages = {10622--10639},
  year = {2022},
  editor = {Camps-Valls, Gustau and Ruiz, Francisco J. R. and Valera, Isabel},
  volume = {151},
  series = {Proceedings of Machine Learning Research},
  month = {28--30 Mar},
  publisher = {PMLR},
  pdf = {},
  url = {},
  abstract = {In the monitoring problem, the input is an unbounded stream $P={p_1,p_2\cdots}$ of integers in $[N]:=\{1,\cdots,N\}$, that are obtained from a sensor (such as GPS or heart beats of a human). The goal (e.g., for anomaly detection) is to approximate the $n$ points received so far in $P$ by a single frequency $\sin$, e.g. $\min_{c\in C}cost(P,c)+\lambda(c)$, where $cost(P,c)=\sum_{i=1}^n \sin^2(\frac{2\pi}{N} p_ic)$, $C\subseteq [N]$ is a feasible set of solutions, and $\lambda$ is a given regularization function. For any approximation error $\varepsilon>0$, we prove that every set $P$ of $n$ integers has a weighted subset $S\subseteq P$ (sometimes called core-set) of cardinality $|S|\in O(\log(N)^{O(1)})$ that approximates $cost(P,c)$ (for every $c\in [N]$) up to a multiplicative factor of $1\pm\varepsilon$. Using known coreset techniques, this implies streaming algorithms using only $O((\log(N)\log(n))^{O(1)})$ memory. Our results hold for a large family of functions. Experimental results and open source code are provided.}
}
%0 Conference Paper
%T Coresets for Data Discretization and Sine Wave Fitting
%A Alaa Maalouf
%A Murad Tukan
%A Eric Price
%A Daniel M. Kane
%A Dan Feldman
%B Proceedings of The 25th International Conference on Artificial Intelligence and Statistics
%C Proceedings of Machine Learning Research
%D 2022
%E Gustau Camps-Valls
%E Francisco J. R. Ruiz
%E Isabel Valera
%F pmlr-v151-maalouf22a
%I PMLR
%P 10622--10639
%U
%V 151
%X In the monitoring problem, the input is an unbounded stream $P={p_1,p_2\cdots}$ of integers in $[N]:=\{1,\cdots,N\}$, that are obtained from a sensor (such as GPS or heart beats of a human). The goal (e.g., for anomaly detection) is to approximate the $n$ points received so far in $P$ by a single frequency $\sin$, e.g. $\min_{c\in C}cost(P,c)+\lambda(c)$, where $cost(P,c)=\sum_{i=1}^n \sin^2(\frac{2\pi}{N} p_ic)$, $C\subseteq [N]$ is a feasible set of solutions, and $\lambda$ is a given regularization function. For any approximation error $\varepsilon>0$, we prove that every set $P$ of $n$ integers has a weighted subset $S\subseteq P$ (sometimes called core-set) of cardinality $|S|\in O(\log(N)^{O(1)})$ that approximates $cost(P,c)$ (for every $c\in [N]$) up to a multiplicative factor of $1\pm\varepsilon$. Using known coreset techniques, this implies streaming algorithms using only $O((\log(N)\log(n))^{O(1)})$ memory. Our results hold for a large family of functions. Experimental results and open source code are provided.
Maalouf, A., Tukan, M., Price, E., Kane, D.M. & Feldman, D. (2022). Coresets for Data Discretization and Sine Wave Fitting. Proceedings of The 25th International Conference on Artificial Intelligence and Statistics, in Proceedings of Machine Learning Research 151:10622-10639. Available from

Related Material