Compress Then Test: Powerful Kernel Testing in Near-linear Time

Carles Domingo-Enrich, Raaz Dwivedi, Lester Mackey
Proceedings of The 26th International Conference on Artificial Intelligence and Statistics, PMLR 206:1174-1218, 2023.

Abstract

Kernel two-sample testing provides a powerful framework for distinguishing any pair of distributions based on $n$ sample points. However, existing kernel tests either run in $n^2$ time or sacrifice undue power to improve runtime. To address these shortcomings, we introduce Compress Then Test (CTT), a new framework for high-powered kernel testing based on sample compression. CTT cheaply approximates an expensive test by compressing each $n$-point sample into a small but provably high-fidelity coreset. For standard kernels and subexponential distributions, CTT inherits the statistical behavior of a quadratic-time test—recovering the same optimal detection boundary—while running in near-linear time. We couple these advances with cheaper permutation testing, justified by new power analyses; improved time-vs.-quality guarantees for low-rank approximation; and a fast aggregation procedure for identifying especially discriminating kernels. In our experiments with real and simulated data, CTT and its extensions provide 20–200x speed-ups over state-of-the-art approximate MMD tests with no loss of power.
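To make the compress-then-test recipe concrete, the following is a minimal, self-contained sketch in Python/NumPy. It is not the paper's implementation: the compression step below uses simple uniform subsampling as a stand-in for the KT-Compress coresets that CTT actually relies on, and the Gaussian kernel, coreset size, and permutation count are illustrative choices.

import numpy as np

def gaussian_kernel(A, B, bandwidth=1.0):
    # Gaussian kernel matrix k(a, b) = exp(-||a - b||^2 / (2 * bandwidth^2)).
    sq_dists = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return np.exp(-sq_dists / (2 * bandwidth**2))

def compress(sample, coreset_size, rng):
    # Stand-in compression step: uniform subsampling.
    # CTT instead builds a provably high-fidelity coreset (via KT-Compress).
    idx = rng.choice(len(sample), size=coreset_size, replace=False)
    return sample[idx]

def mmd_sq(X, Y, bandwidth=1.0):
    # Quadratic-time (biased) squared MMD, paid only on the small coresets.
    Kxx = gaussian_kernel(X, X, bandwidth)
    Kyy = gaussian_kernel(Y, Y, bandwidth)
    Kxy = gaussian_kernel(X, Y, bandwidth)
    return Kxx.mean() + Kyy.mean() - 2 * Kxy.mean()

def compress_then_test(X, Y, coreset_size=64, n_perm=199, alpha=0.05,
                       bandwidth=1.0, seed=0):
    # Compress each sample to a coreset, then run an MMD permutation test on the coresets.
    rng = np.random.default_rng(seed)
    Xc, Yc = compress(X, coreset_size, rng), compress(Y, coreset_size, rng)
    observed = mmd_sq(Xc, Yc, bandwidth)
    pooled = np.vstack([Xc, Yc])
    n = len(Xc)
    exceed = 0
    for _ in range(n_perm):
        perm = rng.permutation(len(pooled))
        exceed += mmd_sq(pooled[perm[:n]], pooled[perm[n:]], bandwidth) >= observed
    p_value = (1 + exceed) / (1 + n_perm)
    return p_value, p_value <= alpha

# Example: two Gaussian samples with shifted means should yield a rejection.
rng = np.random.default_rng(1)
X = rng.normal(0.0, 1.0, size=(2000, 2))
Y = rng.normal(0.5, 1.0, size=(2000, 2))
p, reject = compress_then_test(X, Y)
print(f"p-value = {p:.3f}, reject H0: {reject}")

The point of the sketch is the cost structure: the quadratic-time MMD is evaluated only on coresets of size far smaller than $n$, so each kernel evaluation pass is cheap; the paper's analysis shows that, with its compression scheme, this cheap test retains the power of the full quadratic-time test.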

Cite this Paper


BibTeX
@InProceedings{pmlr-v206-domingo-enrich23a,
  title     = {Compress Then Test: Powerful Kernel Testing in Near-linear Time},
  author    = {Domingo-Enrich, Carles and Dwivedi, Raaz and Mackey, Lester},
  booktitle = {Proceedings of The 26th International Conference on Artificial Intelligence and Statistics},
  pages     = {1174--1218},
  year      = {2023},
  editor    = {Ruiz, Francisco and Dy, Jennifer and van de Meent, Jan-Willem},
  volume    = {206},
  series    = {Proceedings of Machine Learning Research},
  month     = {25--27 Apr},
  publisher = {PMLR},
  pdf       = {https://proceedings.mlr.press/v206/domingo-enrich23a/domingo-enrich23a.pdf},
  url       = {https://proceedings.mlr.press/v206/domingo-enrich23a.html},
  abstract  = {Kernel two-sample testing provides a powerful framework for distinguishing any pair of distributions based on n sample points. However, existing kernel tests either run in $n^2$ time or sacrifice undue power to improve runtime. To address these shortcomings, we introduce Compress Then Test (CTT), a new framework for high-powered kernel testing based on sample compression. CTT cheaply approximates an expensive test by compressing each n point sample into a small but provably high-fidelity coreset. For standard kernels and subexponential distributions, CTT inherits the statistical behavior of a quadratic-time test—recovering the same optimal detection boundary—while running in near-linear time. We couple these advances with cheaper permutation testing, justified by new power analyses; improved time-vs.-quality guarantees for low-rank approximation; and a fast aggregation procedure for identifying especially discriminating kernels. In our experiments with real and simulated data, CTT and its extensions provide 20–200x speed-ups over state-of-the-art approximate MMD tests with no loss of power.}
}
Endnote
%0 Conference Paper
%T Compress Then Test: Powerful Kernel Testing in Near-linear Time
%A Carles Domingo-Enrich
%A Raaz Dwivedi
%A Lester Mackey
%B Proceedings of The 26th International Conference on Artificial Intelligence and Statistics
%C Proceedings of Machine Learning Research
%D 2023
%E Francisco Ruiz
%E Jennifer Dy
%E Jan-Willem van de Meent
%F pmlr-v206-domingo-enrich23a
%I PMLR
%P 1174--1218
%U https://proceedings.mlr.press/v206/domingo-enrich23a.html
%V 206
%X Kernel two-sample testing provides a powerful framework for distinguishing any pair of distributions based on n sample points. However, existing kernel tests either run in $n^2$ time or sacrifice undue power to improve runtime. To address these shortcomings, we introduce Compress Then Test (CTT), a new framework for high-powered kernel testing based on sample compression. CTT cheaply approximates an expensive test by compressing each n point sample into a small but provably high-fidelity coreset. For standard kernels and subexponential distributions, CTT inherits the statistical behavior of a quadratic-time test—recovering the same optimal detection boundary—while running in near-linear time. We couple these advances with cheaper permutation testing, justified by new power analyses; improved time-vs.-quality guarantees for low-rank approximation; and a fast aggregation procedure for identifying especially discriminating kernels. In our experiments with real and simulated data, CTT and its extensions provide 20–200x speed-ups over state-of-the-art approximate MMD tests with no loss of power.
APA
Domingo-Enrich, C., Dwivedi, R. & Mackey, L. (2023). Compress Then Test: Powerful Kernel Testing in Near-linear Time. Proceedings of The 26th International Conference on Artificial Intelligence and Statistics, in Proceedings of Machine Learning Research 206:1174-1218. Available from https://proceedings.mlr.press/v206/domingo-enrich23a.html.
