MoXCo: How I learned to stop exploring and love my local minima?

Esha Singh, Shoham Sabach, Yu-Xiang Wang
Conference on Parsimony and Learning, PMLR 280:521-544, 2025.

Abstract

Deep neural networks are well-known for their generalization capabilities, largely attributed to optimizers’ ability to find "good" solutions in high-dimensional loss landscapes. This work aims to deepen the understanding of optimization specifically through the lens of loss landscapes. We propose a generalized framework for adaptive optimization that favors convergence to these "good" solutions. Our approach shifts the optimization paradigm from merely finding solutions quickly to discovering solutions that generalize well, establishing a careful balance between optimization efficiency and model generalization. We empirically validate our claims using a two-layer, fully connected neural network with ReLU activation and demonstrate practical applicability through binary quantization of ResNets. Our numerical results demonstrate that these adaptive optimizers facilitate exploration, leading to faster convergence, and narrow the generalization gap between stochastic gradient descent and other adaptive methods.
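
The abstract names the architecture used for empirical validation: a two-layer, fully connected network with ReLU activation. The following is a minimal PyTorch sketch of that kind of model with a plain SGD baseline; it is not the authors' MoXCo optimizer, and the layer widths, learning rate, and stand-in data are illustrative assumptions, not values from the paper.

# Minimal sketch of a two-layer, fully connected ReLU network (illustrative only;
# the MoXCo adaptive-optimization framework itself is described in the paper).
import torch
import torch.nn as nn

class TwoLayerReLUNet(nn.Module):
    def __init__(self, in_dim=784, hidden_dim=512, out_dim=10):  # assumed sizes
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden_dim),   # first fully connected layer
            nn.ReLU(),                       # ReLU activation
            nn.Linear(hidden_dim, out_dim),  # second fully connected layer
        )

    def forward(self, x):
        return self.net(x)

# Illustrative training loop with plain SGD as the baseline the paper compares against.
model = TwoLayerReLUNet()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)  # assumed learning rate
loss_fn = nn.CrossEntropyLoss()

x = torch.randn(64, 784)         # stand-in batch of inputs
y = torch.randint(0, 10, (64,))  # stand-in class labels

for step in range(100):
    optimizer.zero_grad()
    loss = loss_fn(model(x), y)
    loss.backward()
    optimizer.step()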

Cite this Paper


BibTeX
@InProceedings{pmlr-v280-singh25a,
  title     = {MoXCo: How I learned to stop exploring and love my local minima?},
  author    = {Singh, Esha and Sabach, Shoham and Wang, Yu-Xiang},
  booktitle = {Conference on Parsimony and Learning},
  pages     = {521--544},
  year      = {2025},
  editor    = {Chen, Beidi and Liu, Shijia and Pilanci, Mert and Su, Weijie and Sulam, Jeremias and Wang, Yuxiang and Zhu, Zhihui},
  volume    = {280},
  series    = {Proceedings of Machine Learning Research},
  month     = {24--27 Mar},
  publisher = {PMLR},
  pdf       = {https://raw.githubusercontent.com/mlresearch/v280/main/assets/singh25a/singh25a.pdf},
  url       = {https://proceedings.mlr.press/v280/singh25a.html},
  abstract  = {Deep neural networks are well-known for their generalization capabilities, largely attributed to optimizers’ ability to find "good" solutions in high-dimensional loss landscapes. This work aims to deepen the understanding of optimization specifically through the lens of loss landscapes. We propose a generalized framework for adaptive optimization that favors convergence to these "good" solutions. Our approach shifts the optimization paradigm from merely finding solutions quickly to discovering solutions that generalize well, establishing a careful balance between optimization efficiency and model generalization. We empirically validate our claims using two-layer, fully connected neural network with ReLU activation and demonstrate practical applicability through binary quantization of ResNets. Our numerical results demonstrate that these adaptive optimizers facilitate exploration leading to faster convergence speeds and narrow the generalization gap between stochastic gradient descent and other adaptive methods.}
}
Endnote
%0 Conference Paper
%T MoXCo: How I learned to stop exploring and love my local minima?
%A Esha Singh
%A Shoham Sabach
%A Yu-Xiang Wang
%B Conference on Parsimony and Learning
%C Proceedings of Machine Learning Research
%D 2025
%E Beidi Chen
%E Shijia Liu
%E Mert Pilanci
%E Weijie Su
%E Jeremias Sulam
%E Yuxiang Wang
%E Zhihui Zhu
%F pmlr-v280-singh25a
%I PMLR
%P 521--544
%U https://proceedings.mlr.press/v280/singh25a.html
%V 280
%X Deep neural networks are well-known for their generalization capabilities, largely attributed to optimizers’ ability to find "good" solutions in high-dimensional loss landscapes. This work aims to deepen the understanding of optimization specifically through the lens of loss landscapes. We propose a generalized framework for adaptive optimization that favors convergence to these "good" solutions. Our approach shifts the optimization paradigm from merely finding solutions quickly to discovering solutions that generalize well, establishing a careful balance between optimization efficiency and model generalization. We empirically validate our claims using two-layer, fully connected neural network with ReLU activation and demonstrate practical applicability through binary quantization of ResNets. Our numerical results demonstrate that these adaptive optimizers facilitate exploration leading to faster convergence speeds and narrow the generalization gap between stochastic gradient descent and other adaptive methods.
APA
Singh, E., Sabach, S. & Wang, Y. (2025). MoXCo: How I learned to stop exploring and love my local minima?. Conference on Parsimony and Learning, in Proceedings of Machine Learning Research 280:521-544. Available from https://proceedings.mlr.press/v280/singh25a.html.