A Family of Exact Goodness-of-Fit Tests for High-Dimensional Discrete Distributions

Feras A. Saad, Cameron E. Freer, Nathanael L. Ackerman, Vikash K. Mansinghka
Proceedings of the Twenty-Second International Conference on Artificial Intelligence and Statistics, PMLR 89:1640-1649, 2019.

Abstract

The objective of goodness-of-fit testing is to assess whether a dataset of observations is likely to have been drawn from a candidate probability distribution. This paper presents a rank-based family of goodness-of-fit tests that is specialized to discrete distributions on high-dimensional domains. The test is readily implemented using a simulation-based, linear-time procedure. The testing procedure can be customized by the practitioner using knowledge of the underlying data domain. Unlike most existing test statistics, the proposed test statistic is distribution-free and its exact (non-asymptotic) sampling distribution is known in closed form. We establish consistency of the test against all alternatives by showing that the test statistic is distributed as a discrete uniform if and only if the samples were drawn from the candidate distribution. We illustrate its efficacy for assessing the sample quality of approximate sampling algorithms over combinatorially large spaces with intractable probabilities, including random partitions in Dirichlet process mixture models and random lattices in Ising models.
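
The rank-based, simulation-driven idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `stochastic_rank` and `chi_square_uniform`, the `simulate` and `key` parameters, and the uniform tie-breaking rule are assumptions chosen for illustration. The sketch ranks one observation among `m` draws simulated from the candidate distribution under a total order on the domain. If the observation was itself drawn from the candidate, the rank with random tie-breaking is uniform on {0, ..., m}, so a standard uniformity check on the ranks can serve as the test.

```python
import random
from collections import Counter


def stochastic_rank(y, simulate, m, rng, key=lambda x: x):
    """Rank observation y among m draws from the candidate distribution.

    `simulate(rng)` returns one draw from the candidate distribution.
    `key` is assumed to be injective, so that it induces a total order
    on the domain (hypothetical parameter, for illustration only).
    """
    xs = [simulate(rng) for _ in range(m)]
    ky = key(y)
    below = sum(1 for x in xs if key(x) < ky)
    ties = sum(1 for x in xs if key(x) == ky)
    # Break ties uniformly at random so the rank takes values in {0, ..., m}.
    return below + rng.randint(0, ties)


def chi_square_uniform(ranks, m):
    """Pearson chi-square statistic of the ranks against uniform on {0..m}."""
    expected = len(ranks) / (m + 1)
    counts = Counter(ranks)
    return sum((counts.get(r, 0) - expected) ** 2 / expected
               for r in range(m + 1))


if __name__ == "__main__":
    rng = random.Random(0)
    m = 4

    def candidate(r):
        # Toy candidate distribution: uniform on {0, ..., 9}.
        return r.randrange(10)

    # Observations drawn from the candidate itself (null hypothesis holds).
    null_ranks = [stochastic_rank(candidate(rng), candidate, m, rng)
                  for _ in range(1000)]
    # Observations from a different distribution (always the largest value).
    alt_ranks = [stochastic_rank(9, candidate, m, rng) for _ in range(1000)]

    print("chi-square under null:       ", chi_square_uniform(null_ranks, m))
    print("chi-square under alternative:", chi_square_uniform(alt_ranks, m))
```

Each rank costs `m` simulations and one pass over them, so the work grows linearly with the number of observations. The choice of `key` is where domain knowledge would enter in this sketch, since different orderings are sensitive to different departures from the candidate distribution.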

Cite this Paper


BibTeX
@InProceedings{pmlr-v89-saad19a,
  title = {A Family of Exact Goodness-of-Fit Tests for High-Dimensional Discrete Distributions},
  author = {Saad, Feras A. and Freer, Cameron E. and Ackerman, Nathanael L. and Mansinghka, Vikash K.},
  booktitle = {Proceedings of the Twenty-Second International Conference on Artificial Intelligence and Statistics},
  pages = {1640--1649},
  year = {2019},
  editor = {Chaudhuri, Kamalika and Sugiyama, Masashi},
  volume = {89},
  series = {Proceedings of Machine Learning Research},
  month = {16--18 Apr},
  publisher = {PMLR},
  pdf = {http://proceedings.mlr.press/v89/saad19a/saad19a.pdf},
  url = {https://proceedings.mlr.press/v89/saad19a.html},
  abstract = {The objective of goodness-of-fit testing is to assess whether a dataset of observations is likely to have been drawn from a candidate probability distribution. This paper presents a rank-based family of goodness-of-fit tests that is specialized to discrete distributions on high-dimensional domains. The test is readily implemented using a simulation-based, linear-time procedure. The testing procedure can be customized by the practitioner using knowledge of the underlying data domain. Unlike most existing test statistics, the proposed test statistic is distribution-free and its exact (non-asymptotic) sampling distribution is known in closed form. We establish consistency of the test against all alternatives by showing that the test statistic is distributed as a discrete uniform if and only if the samples were drawn from the candidate distribution. We illustrate its efficacy for assessing the sample quality of approximate sampling algorithms over combinatorially large spaces with intractable probabilities, including random partitions in Dirichlet process mixture models and random lattices in Ising models.}
}
Endnote
%0 Conference Paper
%T A Family of Exact Goodness-of-Fit Tests for High-Dimensional Discrete Distributions
%A Feras A. Saad
%A Cameron E. Freer
%A Nathanael L. Ackerman
%A Vikash K. Mansinghka
%B Proceedings of the Twenty-Second International Conference on Artificial Intelligence and Statistics
%C Proceedings of Machine Learning Research
%D 2019
%E Kamalika Chaudhuri
%E Masashi Sugiyama
%F pmlr-v89-saad19a
%I PMLR
%P 1640--1649
%U https://proceedings.mlr.press/v89/saad19a.html
%V 89
%X The objective of goodness-of-fit testing is to assess whether a dataset of observations is likely to have been drawn from a candidate probability distribution. This paper presents a rank-based family of goodness-of-fit tests that is specialized to discrete distributions on high-dimensional domains. The test is readily implemented using a simulation-based, linear-time procedure. The testing procedure can be customized by the practitioner using knowledge of the underlying data domain. Unlike most existing test statistics, the proposed test statistic is distribution-free and its exact (non-asymptotic) sampling distribution is known in closed form. We establish consistency of the test against all alternatives by showing that the test statistic is distributed as a discrete uniform if and only if the samples were drawn from the candidate distribution. We illustrate its efficacy for assessing the sample quality of approximate sampling algorithms over combinatorially large spaces with intractable probabilities, including random partitions in Dirichlet process mixture models and random lattices in Ising models.
APA
Saad, F.A., Freer, C.E., Ackerman, N.L. & Mansinghka, V.K. (2019). A Family of Exact Goodness-of-Fit Tests for High-Dimensional Discrete Distributions. Proceedings of the Twenty-Second International Conference on Artificial Intelligence and Statistics, in Proceedings of Machine Learning Research 89:1640-1649. Available from https://proceedings.mlr.press/v89/saad19a.html.
