Two-temperature logistic regression based on the Tsallis divergence
Proceedings of the Twenty-Second International Conference on Artificial Intelligence and Statistics, PMLR 89:2388-2396, 2019.
Abstract
We develop a variant of multiclass logistic regression that is significantly more robust to noise. The algorithm has one weight vector per class, and the surrogate loss is a function of the linear activations (one per class). The surrogate loss of an example with linear activation vector $a$ and class $c$ has the form $-\log_{t_1} \exp_{t_2}(a_c - G_{t_2}(a))$, where the two temperatures $t_1$ and $t_2$ "temper" the log and exp, respectively, and $G_{t_2}(a)$ is a scalar value that generalizes the log-partition function. We motivate this loss using the Tsallis divergence. Our method allows transitioning between non-convex and convex losses via the choice of the temperature parameters. As the temperature $t_1$ of the logarithm becomes smaller than the temperature $t_2$ of the exponential, the surrogate loss becomes "quasi-convex". Various tunings of the temperatures recover previous methods, and tuning the degree of non-convexity is crucial in the experiments. In particular, quasi-convexity and boundedness of the loss provide significant robustness to outliers. We explain this by showing that $t_1 < 1$ caps the surrogate loss and $t_2 > 1$ makes the predictive distribution heavy-tailed. We show that the surrogate loss is Bayes-consistent, even in the non-convex case. Additionally, we provide efficient iterative algorithms for calculating the log-partition value in only a few iterations. Our compelling experimental results on large real-world datasets show the advantage of using the two-temperature variant in the noisy as well as the noise-free case.
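To make the loss concrete, below is a minimal NumPy sketch of the tempered logarithm and exponential and the two-temperature surrogate loss described in the abstract. The fixed-point routine for the normalizer $G_{t_2}(a)$ (written here for $t_2 > 1$) and the iteration count `num_iters` are illustrative assumptions, not necessarily the paper's exact algorithm.

```python
# Sketch of the two-temperature loss; not the authors' reference implementation.
import numpy as np

def log_t(x, t):
    """Tempered logarithm; reduces to log(x) as t -> 1."""
    if t == 1.0:
        return np.log(x)
    return (x ** (1.0 - t) - 1.0) / (1.0 - t)

def exp_t(x, t):
    """Tempered exponential; reduces to exp(x) as t -> 1."""
    if t == 1.0:
        return np.exp(x)
    return np.maximum(1.0 + (1.0 - t) * x, 0.0) ** (1.0 / (1.0 - t))

def normalizer(a, t2, num_iters=5):
    """Fixed-point estimate of G_t2(a) so that sum_i exp_t2(a_i - G) = 1 (assumes t2 > 1)."""
    mu = np.max(a)
    a_shift = a - mu
    for _ in range(num_iters):
        z = np.sum(exp_t(a_shift, t2))
        a_shift = z ** (1.0 - t2) * (a - mu)
    z = np.sum(exp_t(a_shift, t2))
    return -log_t(1.0 / z, t2) + mu

def two_temperature_loss(a, c, t1, t2):
    """Surrogate loss -log_t1 exp_t2(a_c - G_t2(a)) for activations a and true class c."""
    g = normalizer(a, t2)
    return -log_t(exp_t(a[c] - g, t2), t1)

# Example with t1 < 1 < t2: bounded loss and heavy-tailed predictive distribution.
a = np.array([2.0, 0.5, -1.0])
print(two_temperature_loss(a, c=0, t1=0.7, t2=1.3))
```

With $t_1 = t_2 = 1$ the sketch reduces to the ordinary multiclass logistic (softmax cross-entropy) loss; choosing $t_1 < 1$ caps the loss and $t_2 > 1$ thickens the tail of the predictive distribution, which is the source of the robustness discussed above.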