Margin-based sampling in high dimensions: When being active is less efficient than staying passive

Alexandru Tifrea, Jacob Clarysse, Fanny Yang
Proceedings of the 40th International Conference on Machine Learning, PMLR 202:34222-34262, 2023.

Abstract

It is widely believed that given the same labeling budget, active learning (AL) algorithms like margin-based active learning achieve better predictive performance than passive learning (PL), albeit at a higher computational cost. Recent empirical evidence suggests that this added cost might be in vain, as margin-based AL can sometimes perform even worse than PL. While existing works offer different explanations in the low-dimensional regime, this paper shows that the underlying mechanism is entirely different in high dimensions: we prove for logistic regression that PL outperforms margin-based AL even for noiseless data and when using the Bayes optimal decision boundary for sampling. Insights from our proof indicate that this high-dimensional phenomenon is exacerbated when the separation between the classes is small. We corroborate this intuition with experiments on 20 high-dimensional datasets spanning a diverse range of applications, from finance and histology to chemistry and computer vision.
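The margin-based active learning strategy studied in the paper can be illustrated with a minimal sketch: repeatedly fit a logistic regression model and query the label of the unlabeled point closest to the current decision boundary. The synthetic Gaussian data, pool size, separation parameter, and budget below are illustrative assumptions for exposition, not the paper's exact experimental setup.

```python
# Hedged sketch of margin-based (uncertainty) sampling for logistic
# regression; dataset parameters are illustrative, not from the paper.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
d, n_pool, r = 50, 1000, 1.0           # dimension, pool size, class separation
y = rng.choice([-1, 1], size=n_pool)
X = rng.normal(size=(n_pool, d))
X[:, 0] += r * y                       # noiseless signal along one coordinate

labeled = list(rng.choice(n_pool, size=10, replace=False))  # seed set
for _ in range(40):                    # labeling budget of 40 queries
    clf = LogisticRegression().fit(X[labeled], y[labeled])
    margins = np.abs(clf.decision_function(X))   # distance-to-boundary proxy
    margins[labeled] = np.inf          # never re-query an already-labeled point
    labeled.append(int(np.argmin(margins)))      # query the smallest-margin point
```

Passive learning corresponds to replacing the `argmin` query with a uniformly random draw from the unlabeled pool; the paper's result is that in high dimensions this random baseline can dominate the margin-based rule above, especially when the separation `r` is small.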

Cite this Paper


BibTeX
@InProceedings{pmlr-v202-tifrea23a,
  title = {Margin-based sampling in high dimensions: When being active is less efficient than staying passive},
  author = {Tifrea, Alexandru and Clarysse, Jacob and Yang, Fanny},
  booktitle = {Proceedings of the 40th International Conference on Machine Learning},
  pages = {34222--34262},
  year = {2023},
  editor = {Krause, Andreas and Brunskill, Emma and Cho, Kyunghyun and Engelhardt, Barbara and Sabato, Sivan and Scarlett, Jonathan},
  volume = {202},
  series = {Proceedings of Machine Learning Research},
  month = {23--29 Jul},
  publisher = {PMLR},
  pdf = {https://proceedings.mlr.press/v202/tifrea23a/tifrea23a.pdf},
  url = {https://proceedings.mlr.press/v202/tifrea23a.html},
  abstract = {It is widely believed that given the same labeling budget, active learning (AL) algorithms like margin-based active learning achieve better predictive performance than passive learning (PL), albeit at a higher computational cost. Recent empirical evidence suggests that this added cost might be in vain, as margin-based AL can sometimes perform even worse than PL. While existing works offer different explanations in the low-dimensional regime, this paper shows that the underlying mechanism is entirely different in high dimensions: we prove for logistic regression that PL outperforms margin-based AL even for noiseless data and when using the Bayes optimal decision boundary for sampling. Insights from our proof indicate that this high-dimensional phenomenon is exacerbated when the separation between the classes is small. We corroborate this intuition with experiments on 20 high-dimensional datasets spanning a diverse range of applications, from finance and histology to chemistry and computer vision.}
}
Endnote
%0 Conference Paper
%T Margin-based sampling in high dimensions: When being active is less efficient than staying passive
%A Alexandru Tifrea
%A Jacob Clarysse
%A Fanny Yang
%B Proceedings of the 40th International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2023
%E Andreas Krause
%E Emma Brunskill
%E Kyunghyun Cho
%E Barbara Engelhardt
%E Sivan Sabato
%E Jonathan Scarlett
%F pmlr-v202-tifrea23a
%I PMLR
%P 34222--34262
%U https://proceedings.mlr.press/v202/tifrea23a.html
%V 202
%X It is widely believed that given the same labeling budget, active learning (AL) algorithms like margin-based active learning achieve better predictive performance than passive learning (PL), albeit at a higher computational cost. Recent empirical evidence suggests that this added cost might be in vain, as margin-based AL can sometimes perform even worse than PL. While existing works offer different explanations in the low-dimensional regime, this paper shows that the underlying mechanism is entirely different in high dimensions: we prove for logistic regression that PL outperforms margin-based AL even for noiseless data and when using the Bayes optimal decision boundary for sampling. Insights from our proof indicate that this high-dimensional phenomenon is exacerbated when the separation between the classes is small. We corroborate this intuition with experiments on 20 high-dimensional datasets spanning a diverse range of applications, from finance and histology to chemistry and computer vision.
APA
Tifrea, A., Clarysse, J. & Yang, F. (2023). Margin-based sampling in high dimensions: When being active is less efficient than staying passive. Proceedings of the 40th International Conference on Machine Learning, in Proceedings of Machine Learning Research 202:34222-34262. Available from https://proceedings.mlr.press/v202/tifrea23a.html.