[edit]
Linear Regression under Missing or Corrupted Coordinates
Proceedings of Thirty Ninth Conference on Learning Theory, PMLR 336:1857-1901, 2026.
Abstract
We study multivariate linear regression under Gaussian covariates in two settings where data may be erased or corrupted by an adversary subject to a coordinate-wise budget. In the incomplete data setting, an adversary may inspect the dataset and delete entries in up to an $\eta$-fraction of samples per coordinate, yielding a strong form of the Missing Not At Random model. In the corrupted data setting, the adversary instead replaces values arbitrarily, and the corruption locations are unknown to the learner. Despite substantial work on missing data, linear regression under such adversarial missingness remains poorly understood, even from an information-theoretic perspective. Unlike the clean setting, where the estimation error vanishes as the number of samples grows, the optimal error in these models remains bounded away from zero and depends on the problem parameters. Our main contribution is a characterization of this error, up to constant factors, over essentially the entire parameter range. Specifically, we establish novel information-theoretic lower bounds on the achievable error and show that they match the guarantees of computationally efficient algorithms. A key implication of our results is that the optimal error in the missing data setting matches that in the corruption setting, indicating that knowledge of the corruption locations provides no general advantage.