High-Dimensional Gaussian Mean Estimation under Realizable Contamination

Ilias Diakonikolas; Daniel M. Kane; Thanasis Pittas

High-Dimensional Gaussian Mean Estimation under Realizable Contamination

Ilias Diakonikolas, Daniel M. Kane, Thanasis Pittas

Proceedings of Thirty Ninth Conference on Learning Theory, PMLR 336:1814-1856, 2026.

Abstract

We study mean estimation for a Gaussian distribution with identity covariance in $\mathbb{R}^d$ under a missing data scheme termed realizable $\epsilon$-contamination. In this model, an adversary chooses a function $r(x)$ taking values in $[0,\epsilon]$, and each sample $x$ is removed independently with probability $r(x)$. Recent work introduced this model as an intermediate-strength setting between Missing Completely At Random (MCAR), where missingness is independent of the data, and Missing Not At Random (MNAR), where missingness may depend arbitrarily on the sample values and can lead to non-identifiability. Prior work established information-theoretic upper and lower bounds for mean estimation in the realizable contamination model, but the proposed estimators require runtime exponential in the dimension, leaving open the possibility of computationally efficient algorithms in high dimensions. In this work, we establish an information–computation gap in the Statistical Query model and, as a consequence, for low-degree polynomial and polynomial-threshold-function algorithms. Specifically, we show that any such algorithm must either use substantially more samples than information-theoretically necessary or incur exponential runtime. We complement our lower bound with an algorithm whose sample–time tradeoff nearly matches our lower bound. Together, these results provide a qualitative characterization of the computational complexity of Gaussian mean estimation under realizable $\epsilon$-contamination.

Cite this Paper

BibTeX

@InProceedings{pmlr-v336-diakonikolas26a,
  title = 	 {High-Dimensional Gaussian Mean Estimation under Realizable Contamination},
  author =       {Diakonikolas, Ilias and Kane, Daniel M. and Pittas, Thanasis},
  booktitle = 	 {Proceedings of Thirty Ninth Conference on Learning Theory},
  pages = 	 {1814--1856},
  year = 	 {2026},
  editor = 	 {Hanneke, Steve and Lattimore, Tor},
  volume = 	 {336},
  series = 	 {Proceedings of Machine Learning Research},
  month = 	 {29 Jun--03 Jul},
  publisher =    {PMLR},
  pdf = 	 {https://raw.githubusercontent.com/mlresearch/v336/main/assets/diakonikolas26a/diakonikolas26a.pdf},
  url = 	 {https://proceedings.mlr.press/v336/diakonikolas26a.html},
  abstract = 	 {We study mean estimation for a Gaussian distribution with identity covariance in $\mathbb{R}^d$ under a missing data scheme termed realizable $\epsilon$-contamination. In this model, an adversary chooses a function $r(x)$ taking values in $[0,\epsilon]$, and each sample $x$ is removed independently with probability $r(x)$. Recent work introduced this model as an intermediate-strength setting between Missing Completely At Random (MCAR), where missingness is independent of the data, and Missing Not At Random (MNAR), where missingness may depend arbitrarily on the sample values and can lead to non-identifiability. Prior work established information-theoretic upper and lower bounds for mean estimation in the realizable contamination model, but the proposed estimators require runtime exponential in the dimension, leaving open the possibility of computationally efficient algorithms in high dimensions. In this work, we establish an information–computation gap in the Statistical Query model and, as a consequence, for low-degree polynomial and polynomial-threshold-function algorithms. Specifically, we show that any such algorithm must either use substantially more samples than information-theoretically necessary or incur exponential runtime. We complement our lower bound with an algorithm whose sample–time tradeoff nearly matches our lower bound. Together, these results provide a qualitative characterization of the computational complexity of Gaussian mean estimation under realizable $\epsilon$-contamination.}
}

Endnote

%0 Conference Paper
%T High-Dimensional Gaussian Mean Estimation under Realizable Contamination
%A Ilias Diakonikolas
%A Daniel M. Kane
%A Thanasis Pittas
%B Proceedings of Thirty Ninth Conference on Learning Theory
%C Proceedings of Machine Learning Research
%D 2026
%E Steve Hanneke
%E Tor Lattimore	
%F pmlr-v336-diakonikolas26a
%I PMLR
%P 1814--1856
%U https://proceedings.mlr.press/v336/diakonikolas26a.html
%V 336
%X We study mean estimation for a Gaussian distribution with identity covariance in $\mathbb{R}^d$ under a missing data scheme termed realizable $\epsilon$-contamination. In this model, an adversary chooses a function $r(x)$ taking values in $[0,\epsilon]$, and each sample $x$ is removed independently with probability $r(x)$. Recent work introduced this model as an intermediate-strength setting between Missing Completely At Random (MCAR), where missingness is independent of the data, and Missing Not At Random (MNAR), where missingness may depend arbitrarily on the sample values and can lead to non-identifiability. Prior work established information-theoretic upper and lower bounds for mean estimation in the realizable contamination model, but the proposed estimators require runtime exponential in the dimension, leaving open the possibility of computationally efficient algorithms in high dimensions. In this work, we establish an information–computation gap in the Statistical Query model and, as a consequence, for low-degree polynomial and polynomial-threshold-function algorithms. Specifically, we show that any such algorithm must either use substantially more samples than information-theoretically necessary or incur exponential runtime. We complement our lower bound with an algorithm whose sample–time tradeoff nearly matches our lower bound. Together, these results provide a qualitative characterization of the computational complexity of Gaussian mean estimation under realizable $\epsilon$-contamination.

APA

Diakonikolas, I., Kane, D.M. & Pittas, T.. (2026). High-Dimensional Gaussian Mean Estimation under Realizable Contamination. Proceedings of Thirty Ninth Conference on Learning Theory, in Proceedings of Machine Learning Research 336:1814-1856 Available from https://proceedings.mlr.press/v336/diakonikolas26a.html.

Related Material

Download PDF