Two-temperature logistic regression based on the Tsallis divergence

Ehsan Amid; Manfred K. Warmuth; Sriram Srinivasan

Proceedings of the Twenty-Second International Conference on Artificial Intelligence and Statistics, PMLR 89:2388-2396, 2019.

Abstract

We develop a variant of multiclass logistic regression that is significantly more robust to noise. The algorithm has one weight vector per class, and the surrogate loss is a function of the linear activations (one per class). The surrogate loss of an example with linear activation vector $\mathbf{a}$ and class $c$ has the form $-\log_{t_1} \exp_{t_2} (a_c - G_{t_2}(\mathbf{a}))$, where the two temperatures $t_1$ and $t_2$ “temper” the $\log$ and $\exp$, respectively, and $G_{t_2}(\mathbf{a})$ is a scalar value that generalizes the log-partition function. We motivate this loss using the Tsallis divergence. Our method allows transitioning between non-convex and convex losses by the choice of the temperature parameters. As the temperature $t_1$ of the logarithm becomes smaller than the temperature $t_2$ of the exponential, the surrogate loss becomes “quasi-convex”. Various tunings of the temperatures recover previous methods, and tuning the degree of non-convexity is crucial in the experiments. In particular, quasi-convexity and boundedness of the loss provide significant robustness to outliers. We explain this by showing that $t_1 < 1$ caps the surrogate loss and $t_2 > 1$ makes the predictive distribution have a heavy tail. We show that the surrogate loss is Bayes-consistent, even in the non-convex case. Additionally, we provide efficient iterative algorithms that calculate the log-partition value in only a few iterations. Our compelling experimental results on large real-world datasets show the advantage of using the two-temperature variant in the noisy as well as the noise-free case.
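
The loss form in the abstract can be illustrated with a minimal Python sketch. The abstract does not spell out the tempered functions, so this sketch assumes the standard definitions $\log_t(x) = (x^{1-t} - 1)/(1-t)$ and $\exp_t(x) = [1 + (1-t)x]_+^{1/(1-t)}$, and assumes $G_{t_2}(\mathbf{a})$ is the value that makes $\sum_i \exp_{t_2}(a_i - G_{t_2}(\mathbf{a})) = 1$. The function names are hypothetical, and the normalizer is found by plain bisection, which is a stand-in and not the iterative algorithm the paper proposes.

```python
import numpy as np

def log_t(x, t):
    """Tempered logarithm (assumed standard form); equals np.log at t = 1."""
    if t == 1.0:
        return np.log(x)
    return (x ** (1.0 - t) - 1.0) / (1.0 - t)

def exp_t(x, t):
    """Tempered exponential (assumed standard form); equals np.exp at t = 1.
    Only called here with x <= 0, where the expression is finite for any t."""
    if t == 1.0:
        return np.exp(x)
    return np.maximum(0.0, 1.0 + (1.0 - t) * x) ** (1.0 / (1.0 - t))

def normalizer(a, t, iters=100):
    """Find G with sum_i exp_t(a_i - G) = 1 by bisection.
    This is a simple substitute for the paper's iterative algorithm."""
    a = np.asarray(a, dtype=float)
    lo = a.max()                           # the max entry alone contributes 1, so sum >= 1
    hi = a.max() - log_t(1.0 / a.size, t)  # every term is <= 1/k here, so sum <= 1
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if exp_t(a - mid, t).sum() > 1.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def two_temperature_loss(a, c, t1, t2):
    """Surrogate loss -log_{t1} exp_{t2}(a_c - G_{t2}(a)) for one example."""
    a = np.asarray(a, dtype=float)
    p_c = exp_t(a[c] - normalizer(a, t2), t2)  # tempered probability of class c
    return -log_t(p_c, t1)

# Example: a badly misclassified point under three temperature settings.
a = [-50.0, 50.0]
softmax_loss = two_temperature_loss(a, 0, t1=1.0, t2=1.0)  # ordinary logistic loss
capped_loss = two_temperature_loss(a, 0, t1=0.5, t2=1.0)   # bounded by 1 / (1 - t1)
tempered_loss = two_temperature_loss(a, 0, t1=0.5, t2=2.0)
```

Under these assumed definitions, $t_1 = t_2 = 1$ gives the usual softmax cross-entropy, while $t_1 < 1$ bounds the loss by $1/(1 - t_1)$ because $\log_{t_1}(0) = -1/(1 - t_1)$; this is one way to read the abstract's statement that $t_1 < 1$ caps the surrogate loss.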

Cite this Paper

BibTeX

@InProceedings{pmlr-v89-amid19a,
  title = 	 {Two-temperature logistic regression based on the Tsallis divergence},
  author =       {Amid, Ehsan and Warmuth, Manfred K. and Srinivasan, Sriram},
  booktitle = 	 {Proceedings of the Twenty-Second International Conference on Artificial Intelligence and Statistics},
  pages = 	 {2388--2396},
  year = 	 {2019},
  editor = 	 {Chaudhuri, Kamalika and Sugiyama, Masashi},
  volume = 	 {89},
  series = 	 {Proceedings of Machine Learning Research},
  month = 	 {16--18 Apr},
  publisher =    {PMLR},
  pdf = 	 {http://proceedings.mlr.press/v89/amid19a/amid19a.pdf},
  url = 	 {https://proceedings.mlr.press/v89/amid19a.html},
  abstract = 	 {We develop a variant of multiclass logistic regression that is significantly more robust to noise. The algorithm has one weight vector per class and the surrogate loss is a function of the linear activations (one per class). The surrogate loss  of an example with linear activation vector $\mathbf{a}$ and class $c$ has the form $-\log_{t_1} \exp_{t_2} (a_c - G_{t_2}(\mathbf{a}))$ where the two temperatures $t_1$ and $t_2$ “temper” the $\log$ and $\exp$, respectively, and $G_{t_2}(\mathbf{a})$ is a scalar value that generalizes the log-partition function. We motivate this loss using the Tsallis divergence. Our method allows transitioning between non-convex and convex losses by the choice of the temperature parameters. As the temperature $t_1$ of the logarithm becomes smaller than the temperature $t_2$ of the exponential, the surrogate loss becomes “quasi convex”. Various tunings of the temperatures recover previous methods and tuning the degree of non-convexity is crucial in the experiments. In particular, quasi-convexity and boundedness of the loss provide significant robustness to the outliers. We explain this by showing that $t_1 < 1$ caps the surrogate loss and $t_2 >1$ makes the predictive distribution have a heavy tail.  We show that the surrogate loss is Bayes-consistent, even in the non-convex case. Additionally, we provide efficient iterative algorithms for calculating the log-partition value only in a few number of iterations. Our compelling experimental results on large real-world datasets show the advantage of using the two-temperature variant in the noisy as well as the noise free case.}
}

Endnote

%0 Conference Paper
%T Two-temperature logistic regression based on the Tsallis divergence
%A Ehsan Amid
%A Manfred K. Warmuth
%A Sriram Srinivasan
%B Proceedings of the Twenty-Second International Conference on Artificial Intelligence and Statistics
%C Proceedings of Machine Learning Research
%D 2019
%E Kamalika Chaudhuri
%E Masashi Sugiyama	
%F pmlr-v89-amid19a
%I PMLR
%P 2388--2396
%U https://proceedings.mlr.press/v89/amid19a.html
%V 89
%X We develop a variant of multiclass logistic regression that is significantly more robust to noise. The algorithm has one weight vector per class and the surrogate loss is a function of the linear activations (one per class). The surrogate loss  of an example with linear activation vector $\mathbf{a}$ and class $c$ has the form $-\log_{t_1} \exp_{t_2} (a_c - G_{t_2}(\mathbf{a}))$ where the two temperatures $t_1$ and $t_2$ “temper” the $\log$ and $\exp$, respectively, and $G_{t_2}(\mathbf{a})$ is a scalar value that generalizes the log-partition function. We motivate this loss using the Tsallis divergence. Our method allows transitioning between non-convex and convex losses by the choice of the temperature parameters. As the temperature $t_1$ of the logarithm becomes smaller than the temperature $t_2$ of the exponential, the surrogate loss becomes “quasi convex”. Various tunings of the temperatures recover previous methods and tuning the degree of non-convexity is crucial in the experiments. In particular, quasi-convexity and boundedness of the loss provide significant robustness to the outliers. We explain this by showing that $t_1 < 1$ caps the surrogate loss and $t_2 >1$ makes the predictive distribution have a heavy tail.  We show that the surrogate loss is Bayes-consistent, even in the non-convex case. Additionally, we provide efficient iterative algorithms for calculating the log-partition value only in a few number of iterations. Our compelling experimental results on large real-world datasets show the advantage of using the two-temperature variant in the noisy as well as the noise free case.

APA

Amid, E., Warmuth, M. K., & Srinivasan, S. (2019). Two-temperature logistic regression based on the Tsallis divergence. Proceedings of the Twenty-Second International Conference on Artificial Intelligence and Statistics, in Proceedings of Machine Learning Research 89:2388-2396. Available from https://proceedings.mlr.press/v89/amid19a.html.
