Distribution-Independent Regression for Generalized Linear Models with Oblivious Corruptions

Ilias Diakonikolas; Sushrut Karmalkar; Jong Ho Park; Christos Tzamos

Distribution-Independent Regression for Generalized Linear Models with Oblivious Corruptions

Ilias Diakonikolas, Sushrut Karmalkar, Jong Ho Park, Christos Tzamos

Proceedings of Thirty Sixth Conference on Learning Theory, PMLR 195:5453-5475, 2023.

Abstract

We demonstrate the first algorithms for the problem of regression for generalized linear models (GLMs) in the presence of additive oblivious noise. We assume we have sample access to examples

$(x, y)$ where

$y$ is a noisy measurement of

$g(w^* \cdot x)$ . In particular,

$y = g(w^* \cdot x) + \xi + \eps$ where

$\xi$ is the oblivious noise drawn independently of

$x$ , satisfying

$\Pr[\xi = 0] \geq o(1)$ , and

$\eps \sim \cN(0, \sigma^2)$ . Our goal is to accurately recover a function

$g(w \cdot x)$ with arbitrarily small error when compared to the true values

$g(w^* \cdot x)$ , rather than the noisy measurements

$y$ . We present an algorithm that tackles the problem in its most general distribution-independent setting, where the solution may not be identifiable. The algorithm is designed to return the solution if it is identifiable, and otherwise return a small list of candidates, one of which is close to the true solution. Furthermore, we characterize a necessary and sufficient condition for identifiability, which holds in broad settings. The problem is identifiable when the quantile at which

$\xi + \eps = 0$ is known, or when the family of hypotheses does not contain candidates that are nearly equal to a translated

$g(w^* \cdot x) + A$ for some real number

$A$ , while also having large error when compared to

$g(w^* \cdot x)$ . This is the first result for GLM regression which can handle more than half the samples being arbitrarily corrupted. Prior work focused largely on the setting of linear regression with oblivious noise, and giving algorithms under more restrictive assumptions.

Cite this Paper

BibTeX


@InProceedings{pmlr-v195-diakonikolas23e,
  title = 	 {Distribution-Independent Regression for Generalized Linear Models with Oblivious Corruptions},
  author =       {Diakonikolas, Ilias and Karmalkar, Sushrut and Park, Jong Ho and Tzamos, Christos},
  booktitle = 	 {Proceedings of Thirty Sixth Conference on Learning Theory},
  pages = 	 {5453--5475},
  year = 	 {2023},
  editor = 	 {Neu, Gergely and Rosasco, Lorenzo},
  volume = 	 {195},
  series = 	 {Proceedings of Machine Learning Research},
  month = 	 {12--15 Jul},
  publisher =    {PMLR},
  pdf = 	 {https://proceedings.mlr.press/v195/diakonikolas23e/diakonikolas23e.pdf},
  url = 	 {https://proceedings.mlr.press/v195/diakonikolas23e.html},
  abstract = 	 {We demonstrate the first algorithms for the problem of regression for generalized linear models (GLMs) in the presence of additive oblivious noise. We assume we have sample access to examples $(x, y)$ where $y$ is a noisy measurement of $g(w^* \cdot x)$. In particular, $y = g(w^* \cdot x) + \xi + \eps$ where $\xi$ is the oblivious noise drawn independently of $x$, satisfying  $\Pr[\xi = 0] \geq o(1)$, and $\eps \sim \cN(0, \sigma^2)$. Our goal is to accurately recover a function $g(w \cdot x)$ with arbitrarily small error when compared to the true values $g(w^* \cdot x)$, rather than the noisy measurements $y$. We present an algorithm that tackles the problem in its most general distribution-independent setting, where the solution may not be identifiable. The algorithm is designed to return the solution if it is identifiable, and otherwise return a small list of candidates, one of which is close to the true solution. Furthermore, we characterize a necessary and sufficient condition for identifiability, which holds in broad settings. The problem is identifiable when the quantile at which $\xi + \eps = 0$ is known, or when the family of hypotheses does not contain candidates that are nearly equal to a translated $g(w^* \cdot x) + A$ for some real number $A$, while also having large error when compared to $g(w^* \cdot x)$. This is the first result for GLM regression which can handle more than half the samples being arbitrarily corrupted. Prior work focused largely on the setting of linear regression with oblivious noise, and giving algorithms under more restrictive assumptions.  }
}

Endnote

%0 Conference Paper
%T Distribution-Independent Regression for Generalized Linear Models with Oblivious Corruptions
%A Ilias Diakonikolas
%A Sushrut Karmalkar
%A Jong Ho Park
%A Christos Tzamos
%B Proceedings of Thirty Sixth Conference on Learning Theory
%C Proceedings of Machine Learning Research
%D 2023
%E Gergely Neu
%E Lorenzo Rosasco	
%F pmlr-v195-diakonikolas23e
%I PMLR
%P 5453--5475
%U https://proceedings.mlr.press/v195/diakonikolas23e.html
%V 195
%X We demonstrate the first algorithms for the problem of regression for generalized linear models (GLMs) in the presence of additive oblivious noise. We assume we have sample access to examples $(x, y)$ where $y$ is a noisy measurement of $g(w^* \cdot x)$. In particular, $y = g(w^* \cdot x) + \xi + \eps$ where $\xi$ is the oblivious noise drawn independently of $x$, satisfying  $\Pr[\xi = 0] \geq o(1)$, and $\eps \sim \cN(0, \sigma^2)$. Our goal is to accurately recover a function $g(w \cdot x)$ with arbitrarily small error when compared to the true values $g(w^* \cdot x)$, rather than the noisy measurements $y$. We present an algorithm that tackles the problem in its most general distribution-independent setting, where the solution may not be identifiable. The algorithm is designed to return the solution if it is identifiable, and otherwise return a small list of candidates, one of which is close to the true solution. Furthermore, we characterize a necessary and sufficient condition for identifiability, which holds in broad settings. The problem is identifiable when the quantile at which $\xi + \eps = 0$ is known, or when the family of hypotheses does not contain candidates that are nearly equal to a translated $g(w^* \cdot x) + A$ for some real number $A$, while also having large error when compared to $g(w^* \cdot x)$. This is the first result for GLM regression which can handle more than half the samples being arbitrarily corrupted. Prior work focused largely on the setting of linear regression with oblivious noise, and giving algorithms under more restrictive assumptions.

APA


Diakonikolas, I., Karmalkar, S., Park, J.H. & Tzamos, C.. (2023). Distribution-Independent Regression for Generalized Linear Models with Oblivious Corruptions. Proceedings of Thirty Sixth Conference on Learning Theory, in Proceedings of Machine Learning Research 195:5453-5475 Available from https://proceedings.mlr.press/v195/diakonikolas23e.html.

Related Material

Download PDF