Efficiently Vectorized MCMC on Modern Accelerators

Hugh Dance, Pierre Glaser, Peter Orbanz, Ryan P Adams
Proceedings of the 42nd International Conference on Machine Learning, PMLR 267:12436-12458, 2025.

Abstract

With the advent of automatic vectorization tools (e.g., JAX’s vmap), writing multi-chain MCMC algorithms is often now as simple as invoking those tools on single-chain code. Whilst convenient, for various MCMC algorithms this results in a synchronization problem—loosely speaking, at each iteration all chains running in parallel must wait until the last chain has finished drawing its sample. In this work, we show how to design single-chain MCMC algorithms in a way that avoids synchronization overheads when vectorizing with tools like vmap, by using the framework of finite state machines (FSMs). Using a simplified model, we derive an exact theoretical form of the obtainable speed-ups using our approach, and use it to make principled recommendations for optimal algorithm design. We implement several popular MCMC algorithms as FSMs, including Elliptical Slice Sampling, HMC-NUTS, and Delayed Rejection, demonstrating speed-ups of up to an order of magnitude in experiments.

Cite this Paper


BibTeX
@InProceedings{pmlr-v267-dance25a, title = {Efficiently Vectorized {MCMC} on Modern Accelerators}, author = {Dance, Hugh and Glaser, Pierre and Orbanz, Peter and Adams, Ryan P}, booktitle = {Proceedings of the 42nd International Conference on Machine Learning}, pages = {12436--12458}, year = {2025}, editor = {Singh, Aarti and Fazel, Maryam and Hsu, Daniel and Lacoste-Julien, Simon and Berkenkamp, Felix and Maharaj, Tegan and Wagstaff, Kiri and Zhu, Jerry}, volume = {267}, series = {Proceedings of Machine Learning Research}, month = {13--19 Jul}, publisher = {PMLR}, pdf = {https://raw.githubusercontent.com/mlresearch/v267/main/assets/dance25a/dance25a.pdf}, url = {https://proceedings.mlr.press/v267/dance25a.html}, abstract = {With the advent of automatic vectorization tools (e.g., JAX’s vmap), writing multi-chain MCMC algorithms is often now as simple as invoking those tools on single-chain code. Whilst convenient, for various MCMC algorithms this results in a synchronization problem—loosely speaking, at each iteration all chains running in parallel must wait until the last chain has finished drawing its sample. In this work, we show how to design single-chain MCMC algorithms in a way that avoids synchronization overheads when vectorizing with tools like vmap, by using the framework of finite state machines (FSMs). Using a simplified model, we derive an exact theoretical form of the obtainable speed-ups using our approach, and use it to make principled recommendations for optimal algorithm design. We implement several popular MCMC algorithms as FSMs, including Elliptical Slice Sampling, HMC-NUTS, and Delayed Rejection, demonstrating speed-ups of up to an order of magnitude in experiments.} }
Endnote
%0 Conference Paper %T Efficiently Vectorized MCMC on Modern Accelerators %A Hugh Dance %A Pierre Glaser %A Peter Orbanz %A Ryan P Adams %B Proceedings of the 42nd International Conference on Machine Learning %C Proceedings of Machine Learning Research %D 2025 %E Aarti Singh %E Maryam Fazel %E Daniel Hsu %E Simon Lacoste-Julien %E Felix Berkenkamp %E Tegan Maharaj %E Kiri Wagstaff %E Jerry Zhu %F pmlr-v267-dance25a %I PMLR %P 12436--12458 %U https://proceedings.mlr.press/v267/dance25a.html %V 267 %X With the advent of automatic vectorization tools (e.g., JAX’s vmap), writing multi-chain MCMC algorithms is often now as simple as invoking those tools on single-chain code. Whilst convenient, for various MCMC algorithms this results in a synchronization problem—loosely speaking, at each iteration all chains running in parallel must wait until the last chain has finished drawing its sample. In this work, we show how to design single-chain MCMC algorithms in a way that avoids synchronization overheads when vectorizing with tools like vmap, by using the framework of finite state machines (FSMs). Using a simplified model, we derive an exact theoretical form of the obtainable speed-ups using our approach, and use it to make principled recommendations for optimal algorithm design. We implement several popular MCMC algorithms as FSMs, including Elliptical Slice Sampling, HMC-NUTS, and Delayed Rejection, demonstrating speed-ups of up to an order of magnitude in experiments.
APA
Dance, H., Glaser, P., Orbanz, P. & Adams, R.P.. (2025). Efficiently Vectorized MCMC on Modern Accelerators. Proceedings of the 42nd International Conference on Machine Learning, in Proceedings of Machine Learning Research 267:12436-12458 Available from https://proceedings.mlr.press/v267/dance25a.html.

Related Material