Margin-based sampling in high dimensions: When being active is less efficient than staying passive

Alexandru Tifrea; Jacob Clarysse; Fanny Yang

Proceedings of the 40th International Conference on Machine Learning, PMLR 202:34222-34262, 2023.

Abstract

It is widely believed that given the same labeling budget, active learning (AL) algorithms like margin-based active learning achieve better predictive performance than passive learning (PL), albeit at a higher computational cost. Recent empirical evidence suggests that this added cost might be in vain, as margin-based AL can sometimes perform even worse than PL. While existing works offer different explanations in the low-dimensional regime, this paper shows that the underlying mechanism is entirely different in high dimensions: we prove for logistic regression that PL outperforms margin-based AL even for noiseless data and when using the Bayes optimal decision boundary for sampling. Insights from our proof indicate that this high-dimensional phenomenon is exacerbated when the separation between the classes is small. We corroborate this intuition with experiments on 20 high-dimensional datasets spanning a diverse range of applications, from finance and histology to chemistry and computer vision.
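
To make the two sampling strategies compared in the abstract concrete, here is a minimal sketch in plain Python. It is not the authors' implementation. It assumes a fixed linear decision boundary `w` through the origin (standing in for whichever boundary is used for sampling), and the helper names `margin`, `margin_based_query`, and `passive_query` are hypothetical. In practice, margin-based AL usually refits the classifier between queries, which this sketch omits.

```python
import random


def margin(w, x):
    """Unnormalized distance of point x to the hyperplane w . x = 0."""
    return abs(sum(wi * xi for wi, xi in zip(w, x)))


def margin_based_query(w, pool, budget):
    """Active strategy: indices of the `budget` pool points closest to the boundary."""
    ranked = sorted(range(len(pool)), key=lambda i: margin(w, pool[i]))
    return ranked[:budget]


def passive_query(pool, budget, seed=0):
    """Passive strategy: `budget` indices drawn uniformly at random, without replacement."""
    rng = random.Random(seed)
    return rng.sample(range(len(pool)), budget)


# Toy unlabeled pool (illustrative values only) and an assumed boundary w.
pool = [[2.0, 0.0], [0.1, 5.0], [-0.3, 1.0], [-4.0, 2.0]]
w = [1.0, 0.0]

active_idx = margin_based_query(w, pool, budget=2)   # points with the smallest |w . x|
passive_idx = passive_query(pool, budget=2)          # uniform random subset
```

Both functions return indices into the pool, so the same labeling budget can be spent under either strategy and the resulting classifiers compared.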

Cite this Paper

BibTeX


@InProceedings{pmlr-v202-tifrea23a,
  title = 	 {Margin-based sampling in high dimensions: When being active is less efficient than staying passive},
  author =       {Tifrea, Alexandru and Clarysse, Jacob and Yang, Fanny},
  booktitle = 	 {Proceedings of the 40th International Conference on Machine Learning},
  pages = 	 {34222--34262},
  year = 	 {2023},
  editor = 	 {Krause, Andreas and Brunskill, Emma and Cho, Kyunghyun and Engelhardt, Barbara and Sabato, Sivan and Scarlett, Jonathan},
  volume = 	 {202},
  series = 	 {Proceedings of Machine Learning Research},
  month = 	 {23--29 Jul},
  publisher =    {PMLR},
  pdf = 	 {https://proceedings.mlr.press/v202/tifrea23a/tifrea23a.pdf},
  url = 	 {https://proceedings.mlr.press/v202/tifrea23a.html},
  abstract = 	 {It is widely believed that given the same labeling budget, active learning (AL) algorithms like margin-based active learning achieve better predictive performance than passive learning (PL), albeit at a higher computational cost. Recent empirical evidence suggests that this added cost might be in vain, as margin-based AL can sometimes perform even worse than PL. While existing works offer different explanations in the low-dimensional regime, this paper shows that the underlying mechanism is entirely different in high dimensions: we prove for logistic regression that PL outperforms margin-based AL even for noiseless data and when using the Bayes optimal decision boundary for sampling. Insights from our proof indicate that this high-dimensional phenomenon is exacerbated when the separation between the classes is small. We corroborate this intuition with experiments on 20 high-dimensional datasets spanning a diverse range of applications, from finance and histology to chemistry and computer vision.}
}

Endnote

%0 Conference Paper
%T Margin-based sampling in high dimensions: When being active is less efficient than staying passive
%A Alexandru Tifrea
%A Jacob Clarysse
%A Fanny Yang
%B Proceedings of the 40th International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2023
%E Andreas Krause
%E Emma Brunskill
%E Kyunghyun Cho
%E Barbara Engelhardt
%E Sivan Sabato
%E Jonathan Scarlett
%F pmlr-v202-tifrea23a
%I PMLR
%P 34222--34262
%U https://proceedings.mlr.press/v202/tifrea23a.html
%V 202
%X It is widely believed that given the same labeling budget, active learning (AL) algorithms like margin-based active learning achieve better predictive performance than passive learning (PL), albeit at a higher computational cost. Recent empirical evidence suggests that this added cost might be in vain, as margin-based AL can sometimes perform even worse than PL. While existing works offer different explanations in the low-dimensional regime, this paper shows that the underlying mechanism is entirely different in high dimensions: we prove for logistic regression that PL outperforms margin-based AL even for noiseless data and when using the Bayes optimal decision boundary for sampling. Insights from our proof indicate that this high-dimensional phenomenon is exacerbated when the separation between the classes is small. We corroborate this intuition with experiments on 20 high-dimensional datasets spanning a diverse range of applications, from finance and histology to chemistry and computer vision.

APA


Tifrea, A., Clarysse, J. & Yang, F. (2023). Margin-based sampling in high dimensions: When being active is less efficient than staying passive. Proceedings of the 40th International Conference on Machine Learning, in Proceedings of Machine Learning Research 202:34222-34262. Available from https://proceedings.mlr.press/v202/tifrea23a.html.
