Modular Duality in Deep Learning

Jeremy Bernstein, Laker Newhouse
Proceedings of the 42nd International Conference on Machine Learning, PMLR 267:3920-3930, 2025.

Abstract

An old idea in optimization theory says that since the gradient is a dual vector it may not be subtracted from the weights without first being mapped to the primal space where the weights reside. We take this idea seriously in this paper and construct such a duality map for general neural networks. Our map, which we call modular dualization, forms a unifying theoretical basis for training algorithms that are a) fast and b) scalable. Modular dualization involves first assigning operator norms to layers based on the semantics of each layer, and then using these layerwise norms to recursively induce a duality map on the weight space of the full neural architecture. We derive GPU-friendly algorithms for dualizing Embed, Linear and Conv2D layers—the latter two methods are based on a Newton-Schulz iteration. We conclude with small experiments demonstrating the speed, scalability and novel numerical properties of duality-based optimizers. Our methods were used in the Muon optimizer, which recently set speed records for training NanoGPT and was scaled up to a 1.5 billion parameter transformer.
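The abstract mentions that the dualization of Linear (and Conv2D) layers is carried out with a Newton-Schulz iteration. As a rough illustration of that idea, here is a minimal sketch that approximately maps a gradient matrix to its dual under the spectral norm by driving its singular values toward one. It uses the classical cubic Newton-Schulz iteration rather than the tuned polynomial described in the paper, and the helper name `dualize_linear` is ours, not the authors'.

```python
import torch

def dualize_linear(grad: torch.Tensor, num_iters: int = 10) -> torch.Tensor:
    """Approximately replace grad = U S V^T by its semi-orthogonal factor U V^T.

    Illustrative sketch only: the paper and the Muon optimizer use a tuned
    higher-order Newton-Schulz polynomial; here we use the classical cubic
    iteration X <- 1.5 X - 0.5 X X^T X.
    """
    # Frobenius normalization bounds the spectral norm by 1, which keeps the
    # cubic iteration in its convergence region.
    X = grad / (grad.norm() + 1e-7)
    transposed = X.shape[0] > X.shape[1]
    if transposed:
        # Work in the short-and-wide orientation so X @ X.T is the smaller Gram matrix.
        X = X.T
    for _ in range(num_iters):
        X = 1.5 * X - 0.5 * X @ X.T @ X
    return X.T if transposed else X
```

A duality-based update along these lines would then subtract a learning-rate multiple of `dualize_linear(weight.grad)` from the weight matrix, rather than the raw gradient.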

Cite this Paper


BibTeX
@InProceedings{pmlr-v267-bernstein25a,
  title     = {Modular Duality in Deep Learning},
  author    = {Bernstein, Jeremy and Newhouse, Laker},
  booktitle = {Proceedings of the 42nd International Conference on Machine Learning},
  pages     = {3920--3930},
  year      = {2025},
  editor    = {Singh, Aarti and Fazel, Maryam and Hsu, Daniel and Lacoste-Julien, Simon and Berkenkamp, Felix and Maharaj, Tegan and Wagstaff, Kiri and Zhu, Jerry},
  volume    = {267},
  series    = {Proceedings of Machine Learning Research},
  month     = {13--19 Jul},
  publisher = {PMLR},
  pdf       = {https://raw.githubusercontent.com/mlresearch/v267/main/assets/bernstein25a/bernstein25a.pdf},
  url       = {https://proceedings.mlr.press/v267/bernstein25a.html},
  abstract  = {An old idea in optimization theory says that since the gradient is a dual vector it may not be subtracted from the weights without first being mapped to the primal space where the weights reside. We take this idea seriously in this paper and construct such a duality map for general neural networks. Our map, which we call modular dualization, forms a unifying theoretical basis for training algorithms that are a) fast and b) scalable. Modular dualization involves first assigning operator norms to layers based on the semantics of each layer, and then using these layerwise norms to recursively induce a duality map on the weight space of the full neural architecture. We derive GPU-friendly algorithms for dualizing Embed, Linear and Conv2D layers—the latter two methods are based on a Newton-Schulz iteration. We conclude with small experiments demonstrating the speed, scalability and novel numerical properties of duality-based optimizers. Our methods were used in the Muon optimizer, which recently set speed records for training NanoGPT and was scaled up to a 1.5 billion parameter transformer.}
}
Endnote
%0 Conference Paper
%T Modular Duality in Deep Learning
%A Jeremy Bernstein
%A Laker Newhouse
%B Proceedings of the 42nd International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2025
%E Aarti Singh
%E Maryam Fazel
%E Daniel Hsu
%E Simon Lacoste-Julien
%E Felix Berkenkamp
%E Tegan Maharaj
%E Kiri Wagstaff
%E Jerry Zhu
%F pmlr-v267-bernstein25a
%I PMLR
%P 3920--3930
%U https://proceedings.mlr.press/v267/bernstein25a.html
%V 267
%X An old idea in optimization theory says that since the gradient is a dual vector it may not be subtracted from the weights without first being mapped to the primal space where the weights reside. We take this idea seriously in this paper and construct such a duality map for general neural networks. Our map, which we call modular dualization, forms a unifying theoretical basis for training algorithms that are a) fast and b) scalable. Modular dualization involves first assigning operator norms to layers based on the semantics of each layer, and then using these layerwise norms to recursively induce a duality map on the weight space of the full neural architecture. We derive GPU-friendly algorithms for dualizing Embed, Linear and Conv2D layers—the latter two methods are based on a Newton-Schulz iteration. We conclude with small experiments demonstrating the speed, scalability and novel numerical properties of duality-based optimizers. Our methods were used in the Muon optimizer, which recently set speed records for training NanoGPT and was scaled up to a 1.5 billion parameter transformer.
APA
Bernstein, J. & Newhouse, L. (2025). Modular Duality in Deep Learning. Proceedings of the 42nd International Conference on Machine Learning, in Proceedings of Machine Learning Research 267:3920-3930. Available from https://proceedings.mlr.press/v267/bernstein25a.html.
