Actively Learning Halfspaces without Synthetic Data

Hadley Black, Kasper Green Larsen, Arya Mazumdar, Barna Saha, Geelon So
Proceedings of Thirty Ninth Conference on Learning Theory, PMLR 336:708-728, 2026.

Abstract

In the classic point location problem, one is given an arbitrary dataset X in d-dimensional Euclidean space R^d of n points with query access to an unknown halfspace f: R^d -> {0,1}, and the goal is to learn the label of every point in X. This problem is extremely well-studied and a nearly-optimal \tilde{O}(d log n) query algorithm is known due to Hopkins-Kane-Lovett-Mahajan (FOCS 2020). However, their algorithm is granted the power to query arbitrary points outside of X (point synthesis), and in fact without this power there is an \Omega(n) query lower bound due to Dasgupta (NeurIPS 2004). Nonetheless, query access to arbitrary synthesized data points is unrealistic in many contexts. Our objective in this work is to design efficient algorithms for learning halfspaces without point synthesis. To circumvent the \Omega(n) lower bound, we consider learning halfspaces whose normal vectors come from a known set of size D, and show tight bounds \Theta(D + log n). As a corollary, we obtain an optimal O(d + log n) query deterministic learner for the fundamental class of decision stumps (depth-one decision trees, or axis-aligned halfspaces), closing a previous gap of O(d log n) vs. \Omega(d + log n) left open in the active learning literature. In fact, our algorithm solves the more general problem of learning a Boolean function f over n elements which is monotone under at least one of D provided orderings of these elements. Our technical insight is to exploit the structure in these orderings to essentially perform a binary search in parallel rather than considering each ordering sequentially, and we believe our approach may be of broader interest. Furthermore, we use our exact learning algorithm to obtain nearly optimal algorithms for PAC-learning. We show that O(min(D + log(1/\epsilon), 1/\epsilon) * log D) queries suffice to learn f within error \epsilon, even in a setting when f can be adversarially corrupted on a c\epsilon-fraction of points, for a sufficiently small constant c. This bound is optimal up to a log D factor, including in the realizable setting.

Cite this Paper


BibTeX
@InProceedings{pmlr-v336-black26a, title = {Actively Learning Halfspaces without Synthetic Data}, author = {Black, Hadley and Larsen, Kasper Green and Mazumdar, Arya and Saha, Barna and So, Geelon}, booktitle = {Proceedings of Thirty Ninth Conference on Learning Theory}, pages = {708--728}, year = {2026}, editor = {Hanneke, Steve and Lattimore, Tor}, volume = {336}, series = {Proceedings of Machine Learning Research}, month = {29 Jun--03 Jul}, publisher = {PMLR}, pdf = {https://raw.githubusercontent.com/mlresearch/v336/main/assets/black26a/black26a.pdf}, url = {https://proceedings.mlr.press/v336/black26a.html}, abstract = {In the classic point location problem, one is given an arbitrary dataset X in d-dimensional Euclidean space R^d of n points with query access to an unknown halfspace f: R^d -> {0,1}, and the goal is to learn the label of every point in X. This problem is extremely well-studied and a nearly-optimal \tilde{O}(d log n) query algorithm is known due to Hopkins-Kane-Lovett-Mahajan (FOCS 2020). However, their algorithm is granted the power to query arbitrary points outside of X (point synthesis), and in fact without this power there is an \Omega(n) query lower bound due to Dasgupta (NeurIPS 2004). Nonetheless, query access to arbitrary synthesized data points is unrealistic in many contexts. Our objective in this work is to design efficient algorithms for learning halfspaces without point synthesis. To circumvent the \Omega(n) lower bound, we consider learning halfspaces whose normal vectors come from a known set of size D, and show tight bounds \Theta(D + log n). As a corollary, we obtain an optimal O(d + log n) query deterministic learner for the fundamental class of decision stumps (depth-one decision trees, or axis-aligned halfspaces), closing a previous gap of O(d log n) vs. \Omega(d + log n) left open in the active learning literature. In fact, our algorithm solves the more general problem of learning a Boolean function f over n elements which is monotone under at least one of D provided orderings of these elements. Our technical insight is to exploit the structure in these orderings to essentially perform a binary search in parallel rather than considering each ordering sequentially, and we believe our approach may be of broader interest. Furthermore, we use our exact learning algorithm to obtain nearly optimal algorithms for PAC-learning. We show that O(min(D + log(1/\epsilon), 1/\epsilon) * log D) queries suffice to learn f within error \epsilon, even in a setting when f can be adversarially corrupted on a c\epsilon-fraction of points, for a sufficiently small constant c. This bound is optimal up to a log D factor, including in the realizable setting.} }
Endnote
%0 Conference Paper %T Actively Learning Halfspaces without Synthetic Data %A Hadley Black %A Kasper Green Larsen %A Arya Mazumdar %A Barna Saha %A Geelon So %B Proceedings of Thirty Ninth Conference on Learning Theory %C Proceedings of Machine Learning Research %D 2026 %E Steve Hanneke %E Tor Lattimore %F pmlr-v336-black26a %I PMLR %P 708--728 %U https://proceedings.mlr.press/v336/black26a.html %V 336 %X In the classic point location problem, one is given an arbitrary dataset X in d-dimensional Euclidean space R^d of n points with query access to an unknown halfspace f: R^d -> {0,1}, and the goal is to learn the label of every point in X. This problem is extremely well-studied and a nearly-optimal \tilde{O}(d log n) query algorithm is known due to Hopkins-Kane-Lovett-Mahajan (FOCS 2020). However, their algorithm is granted the power to query arbitrary points outside of X (point synthesis), and in fact without this power there is an \Omega(n) query lower bound due to Dasgupta (NeurIPS 2004). Nonetheless, query access to arbitrary synthesized data points is unrealistic in many contexts. Our objective in this work is to design efficient algorithms for learning halfspaces without point synthesis. To circumvent the \Omega(n) lower bound, we consider learning halfspaces whose normal vectors come from a known set of size D, and show tight bounds \Theta(D + log n). As a corollary, we obtain an optimal O(d + log n) query deterministic learner for the fundamental class of decision stumps (depth-one decision trees, or axis-aligned halfspaces), closing a previous gap of O(d log n) vs. \Omega(d + log n) left open in the active learning literature. In fact, our algorithm solves the more general problem of learning a Boolean function f over n elements which is monotone under at least one of D provided orderings of these elements. Our technical insight is to exploit the structure in these orderings to essentially perform a binary search in parallel rather than considering each ordering sequentially, and we believe our approach may be of broader interest. Furthermore, we use our exact learning algorithm to obtain nearly optimal algorithms for PAC-learning. We show that O(min(D + log(1/\epsilon), 1/\epsilon) * log D) queries suffice to learn f within error \epsilon, even in a setting when f can be adversarially corrupted on a c\epsilon-fraction of points, for a sufficiently small constant c. This bound is optimal up to a log D factor, including in the realizable setting.
APA
Black, H., Larsen, K.G., Mazumdar, A., Saha, B. & So, G.. (2026). Actively Learning Halfspaces without Synthetic Data. Proceedings of Thirty Ninth Conference on Learning Theory, in Proceedings of Machine Learning Research 336:708-728 Available from https://proceedings.mlr.press/v336/black26a.html.

Related Material