Linear Regression under Missing or Corrupted Coordinates

Ilias Diakonikolas, Jelena Diakonikolas, Daniel M. Kane, Jasper C. H. Lee, Thanasis Pittas
Proceedings of Thirty Ninth Conference on Learning Theory, PMLR 336:1857-1901, 2026.

Abstract

We study multivariate linear regression under Gaussian covariates in two settings where data may be erased or corrupted by an adversary subject to a coordinate-wise budget. In the incomplete data setting, an adversary may inspect the dataset and delete entries in up to an $\eta$-fraction of samples per coordinate, yielding a strong form of the Missing Not At Random model. In the corrupted data setting, the adversary instead replaces values arbitrarily, and the corruption locations are unknown to the learner. Despite substantial work on missing data, linear regression under such adversarial missingness remains poorly understood, even from an information-theoretic perspective. Unlike the clean setting, where the estimation error vanishes as the number of samples grows, the optimal error in these models remains bounded away from zero and depends on the problem parameters. Our main contribution is a characterization of this error, up to constant factors, over essentially the entire parameter range. Specifically, we establish novel information-theoretic lower bounds on the achievable error and show that they match the guarantees of computationally efficient algorithms. A key implication of our results is that the optimal error in the missing data setting matches that in the corruption setting, indicating that knowledge of the corruption locations provides no general advantage.

Cite this Paper


BibTeX
@InProceedings{pmlr-v336-diakonikolas26b, title = {Linear Regression under Missing or Corrupted Coordinates}, author = {Diakonikolas, Ilias and Diakonikolas, Jelena and Kane, Daniel M. and Lee, Jasper C. H. and Pittas, Thanasis}, booktitle = {Proceedings of Thirty Ninth Conference on Learning Theory}, pages = {1857--1901}, year = {2026}, editor = {Hanneke, Steve and Lattimore, Tor}, volume = {336}, series = {Proceedings of Machine Learning Research}, month = {29 Jun--03 Jul}, publisher = {PMLR}, pdf = {https://raw.githubusercontent.com/mlresearch/v336/main/assets/diakonikolas26b/diakonikolas26b.pdf}, url = {https://proceedings.mlr.press/v336/diakonikolas26b.html}, abstract = {We study multivariate linear regression under Gaussian covariates in two settings where data may be erased or corrupted by an adversary subject to a coordinate-wise budget. In the incomplete data setting, an adversary may inspect the dataset and delete entries in up to an $\eta$-fraction of samples per coordinate, yielding a strong form of the Missing Not At Random model. In the corrupted data setting, the adversary instead replaces values arbitrarily, and the corruption locations are unknown to the learner. Despite substantial work on missing data, linear regression under such adversarial missingness remains poorly understood, even from an information-theoretic perspective. Unlike the clean setting, where the estimation error vanishes as the number of samples grows, the optimal error in these models remains bounded away from zero and depends on the problem parameters. Our main contribution is a characterization of this error, up to constant factors, over essentially the entire parameter range. Specifically, we establish novel information-theoretic lower bounds on the achievable error and show that they match the guarantees of computationally efficient algorithms. A key implication of our results is that the optimal error in the missing data setting matches that in the corruption setting, indicating that knowledge of the corruption locations provides no general advantage.} }
Endnote
%0 Conference Paper %T Linear Regression under Missing or Corrupted Coordinates %A Ilias Diakonikolas %A Jelena Diakonikolas %A Daniel M. Kane %A Jasper C. H. Lee %A Thanasis Pittas %B Proceedings of Thirty Ninth Conference on Learning Theory %C Proceedings of Machine Learning Research %D 2026 %E Steve Hanneke %E Tor Lattimore %F pmlr-v336-diakonikolas26b %I PMLR %P 1857--1901 %U https://proceedings.mlr.press/v336/diakonikolas26b.html %V 336 %X We study multivariate linear regression under Gaussian covariates in two settings where data may be erased or corrupted by an adversary subject to a coordinate-wise budget. In the incomplete data setting, an adversary may inspect the dataset and delete entries in up to an $\eta$-fraction of samples per coordinate, yielding a strong form of the Missing Not At Random model. In the corrupted data setting, the adversary instead replaces values arbitrarily, and the corruption locations are unknown to the learner. Despite substantial work on missing data, linear regression under such adversarial missingness remains poorly understood, even from an information-theoretic perspective. Unlike the clean setting, where the estimation error vanishes as the number of samples grows, the optimal error in these models remains bounded away from zero and depends on the problem parameters. Our main contribution is a characterization of this error, up to constant factors, over essentially the entire parameter range. Specifically, we establish novel information-theoretic lower bounds on the achievable error and show that they match the guarantees of computationally efficient algorithms. A key implication of our results is that the optimal error in the missing data setting matches that in the corruption setting, indicating that knowledge of the corruption locations provides no general advantage.
APA
Diakonikolas, I., Diakonikolas, J., Kane, D.M., Lee, J.C.H. & Pittas, T.. (2026). Linear Regression under Missing or Corrupted Coordinates. Proceedings of Thirty Ninth Conference on Learning Theory, in Proceedings of Machine Learning Research 336:1857-1901 Available from https://proceedings.mlr.press/v336/diakonikolas26b.html.

Related Material