Active Cost-aware Labeling of Streaming Data

Ting Cai; Kirthevasan Kandasamy

Proceedings of The 26th International Conference on Artificial Intelligence and Statistics, PMLR 206:9117-9136, 2023.

Abstract

We study actively labeling streaming data, where an active learner is faced with a stream of data points and must carefully choose which of these points to label via an expensive experiment. Such problems frequently arise in applications such as healthcare and astronomy. We first study a setting in which the data’s inputs belong to one of K discrete distributions and formalize this problem via a loss that captures the labeling cost and the prediction error. When the labeling cost is B, our algorithm, which chooses to label a point if the uncertainty is larger than a time- and cost-dependent threshold, achieves a worst-case upper bound of $\tilde O(B^{\frac{1}{3}} K^{\frac{1}{3}} T^{\frac{2}{3}})$ on the loss after T rounds. We also provide a more nuanced upper bound which demonstrates that the algorithm can adapt to the arrival pattern and achieves better performance when the arrival pattern is more favorable. We complement both upper bounds with matching lower bounds. We next study this problem when the inputs belong to a continuous domain and the output of the experiment is a smooth function with bounded RKHS norm. After T rounds in d dimensions, we show that the loss is bounded by $\tilde O(B^{\frac{1}{d+3}} T^{\frac{d+2}{d+3}})$ in an RKHS with a squared exponential kernel and by $\tilde O(B^{\frac{1}{2d+3}} T^{\frac{2d+2}{2d+3}})$ in an RKHS with a Matérn kernel. Our empirical evaluation demonstrates that our method outperforms baselines in several synthetic experiments and two real experiments in medicine and astronomy.
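
As a rough sanity check on the discrete-setting rate (a heuristic sketch, not the paper's proof), assume each of the K distributions is labeled n times at cost B per label, and that the per-round prediction error then scales like $n^{-1/2}$. Both the uniform allocation and the error model are assumptions made here for illustration; the abstract does not state them. Balancing the two terms of the loss recovers the stated exponents, ignoring constants and logarithmic factors:

$$
\underbrace{B K n}_{\text{labeling cost}} \;\approx\; \underbrace{T\, n^{-1/2}}_{\text{prediction error}}
\;\Longrightarrow\;
n \approx \left(\frac{T}{BK}\right)^{\frac{2}{3}}
\;\Longrightarrow\;
B K n \;\approx\; T\, n^{-1/2} \;\approx\; B^{\frac{1}{3}} K^{\frac{1}{3}} T^{\frac{2}{3}}.
$$

Under these assumptions, a larger labeling cost B pushes the balance toward fewer labels per distribution, which is consistent with a labeling threshold that depends on both time and cost.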

Cite this Paper

BibTeX

@InProceedings{pmlr-v206-cai23a,
  title = {Active Cost-aware Labeling of Streaming Data},
  author = {Cai, Ting and Kandasamy, Kirthevasan},
  booktitle = {Proceedings of The 26th International Conference on Artificial Intelligence and Statistics},
  pages = {9117--9136},
  year = {2023},
  editor = {Ruiz, Francisco and Dy, Jennifer and van de Meent, Jan-Willem},
  volume = {206},
  series = {Proceedings of Machine Learning Research},
  month = {25--27 Apr},
  publisher = {PMLR},
  pdf = {https://proceedings.mlr.press/v206/cai23a/cai23a.pdf},
  url = {https://proceedings.mlr.press/v206/cai23a.html},
  abstract = 	 {We study actively labeling streaming data, where an active learner is faced with a stream of data points and must carefully choose which of these points to label via an expensive experiment. Such problems frequently arise in applications such as healthcare and astronomy. We first study a setting when the data’s inputs belong to one of K discrete distributions and formalize this problem via a loss that captures the labeling cost and the prediction error. When the labeling cost is B, our algorithm, which chooses to label a point if the uncertainty is larger than a time and cost dependent threshold, achieves a worst-case upper bound of $\tilde O(B^{\frac{1}{3}} K^{\frac{1}{3}} T^{\frac{2}{3}})$ on the loss after T rounds. We also provide a more nuanced upper bound which demonstrates that the algorithm can adapt to the arrival pattern, and achieves better performance when the arrival pattern is more favorable. We complement both upper bounds with matching lower bounds. We next study this problem when the inputs belong to a continuous domain and the output of the experiment is a smooth function with bounded RKHS norm. After T rounds in d dimensions, we show that the loss is bounded by $\tilde O(B^{\frac{1}{d+3}} T^{\frac{d+2}{d+3}})$ in an RKHS with a squared exponential kernel and by $\tilde O(B^{\frac{1}{2d+3}} T^{\frac{2d+2}{2d+3}})$ in an RKHS with a Matérn kernel. Our empirical evaluation demonstrates that our method outperforms other baselines in several synthetic experiments and two real experiments in medicine and astronomy.}
}

Endnote

%0 Conference Paper
%T Active Cost-aware Labeling of Streaming Data
%A Ting Cai
%A Kirthevasan Kandasamy
%B Proceedings of The 26th International Conference on Artificial Intelligence and Statistics
%C Proceedings of Machine Learning Research
%D 2023
%E Francisco Ruiz
%E Jennifer Dy
%E Jan-Willem van de Meent
%F pmlr-v206-cai23a
%I PMLR
%P 9117--9136
%U https://proceedings.mlr.press/v206/cai23a.html
%V 206
%X We study actively labeling streaming data, where an active learner is faced with a stream of data points and must carefully choose which of these points to label via an expensive experiment. Such problems frequently arise in applications such as healthcare and astronomy. We first study a setting when the data’s inputs belong to one of K discrete distributions and formalize this problem via a loss that captures the labeling cost and the prediction error. When the labeling cost is B, our algorithm, which chooses to label a point if the uncertainty is larger than a time and cost dependent threshold, achieves a worst-case upper bound of $\tilde O(B^{\frac{1}{3}} K^{\frac{1}{3}} T^{\frac{2}{3}})$ on the loss after T rounds. We also provide a more nuanced upper bound which demonstrates that the algorithm can adapt to the arrival pattern, and achieves better performance when the arrival pattern is more favorable. We complement both upper bounds with matching lower bounds. We next study this problem when the inputs belong to a continuous domain and the output of the experiment is a smooth function with bounded RKHS norm. After T rounds in d dimensions, we show that the loss is bounded by $\tilde O(B^{\frac{1}{d+3}} T^{\frac{d+2}{d+3}})$ in an RKHS with a squared exponential kernel and by $\tilde O(B^{\frac{1}{2d+3}} T^{\frac{2d+2}{2d+3}})$ in an RKHS with a Matérn kernel. Our empirical evaluation demonstrates that our method outperforms other baselines in several synthetic experiments and two real experiments in medicine and astronomy.

APA

Cai, T. & Kandasamy, K. (2023). Active Cost-aware Labeling of Streaming Data. Proceedings of The 26th International Conference on Artificial Intelligence and Statistics, in Proceedings of Machine Learning Research 206:9117-9136. Available from https://proceedings.mlr.press/v206/cai23a.html.
