A new ranking scheme for modern data and its application to two-sample hypothesis testing

Doudou Zhou, Hao Chen
Proceedings of Thirty Sixth Conference on Learning Theory, PMLR 195:3615-3668, 2023.

Abstract

Rank-based approaches are among the most popular nonparametric methods for univariate data in tackling statistical problems such as hypothesis testing due to their robustness and effectiveness. However, they are unsatisfactory for more complex data. In the era of big data, high-dimensional and non-Euclidean data, such as networks and images, are ubiquitous and pose challenges for statistical analysis. Existing multivariate ranks such as component-wise, spatial, and depth-based ranks do not apply to non-Euclidean data and have limited performance for high-dimensional data. Instead of dealing with the ranks of observations, we propose two types of ranks applicable to complex data based on a similarity graph constructed on observations: a graph-induced rank defined by the inductive nature of the graph and an overall rank defined by the weight of edges in the graph. To illustrate their utilization, both the new ranks are used to construct test statistics for the two-sample hypothesis testing, which converge to the $\chi_2^2$ distribution under the permutation null distribution and some mild conditions of the ranks, enabling an easy type-I error control. Simulation studies show that the new method exhibits good power under a wide range of alternatives compared to existing methods. The new test is illustrated on the New York City taxi data for comparing travel patterns in consecutive months and a brain network dataset comparing male and female subjects.

Cite this Paper


BibTeX
@InProceedings{pmlr-v195-zhou23a, title = {A new ranking scheme for modern data and its application to two-sample hypothesis testing}, author = {Zhou, Doudou and Chen, Hao}, booktitle = {Proceedings of Thirty Sixth Conference on Learning Theory}, pages = {3615--3668}, year = {2023}, editor = {Neu, Gergely and Rosasco, Lorenzo}, volume = {195}, series = {Proceedings of Machine Learning Research}, month = {12--15 Jul}, publisher = {PMLR}, pdf = {https://proceedings.mlr.press/v195/zhou23a/zhou23a.pdf}, url = {https://proceedings.mlr.press/v195/zhou23a.html}, abstract = {Rank-based approaches are among the most popular nonparametric methods for univariate data in tackling statistical problems such as hypothesis testing due to their robustness and effectiveness. However, they are unsatisfactory for more complex data. In the era of big data, high-dimensional and non-Euclidean data, such as networks and images, are ubiquitous and pose challenges for statistical analysis. Existing multivariate ranks such as component-wise, spatial, and depth-based ranks do not apply to non-Euclidean data and have limited performance for high-dimensional data. Instead of dealing with the ranks of observations, we propose two types of ranks applicable to complex data based on a similarity graph constructed on observations: a graph-induced rank defined by the inductive nature of the graph and an overall rank defined by the weight of edges in the graph. To illustrate their utilization, both the new ranks are used to construct test statistics for the two-sample hypothesis testing, which converge to the $\chi_2^2$ distribution under the permutation null distribution and some mild conditions of the ranks, enabling an easy type-I error control. Simulation studies show that the new method exhibits good power under a wide range of alternatives compared to existing methods. The new test is illustrated on the New York City taxi data for comparing travel patterns in consecutive months and a brain network dataset comparing male and female subjects.} }
Endnote
%0 Conference Paper %T A new ranking scheme for modern data and its application to two-sample hypothesis testing %A Doudou Zhou %A Hao Chen %B Proceedings of Thirty Sixth Conference on Learning Theory %C Proceedings of Machine Learning Research %D 2023 %E Gergely Neu %E Lorenzo Rosasco %F pmlr-v195-zhou23a %I PMLR %P 3615--3668 %U https://proceedings.mlr.press/v195/zhou23a.html %V 195 %X Rank-based approaches are among the most popular nonparametric methods for univariate data in tackling statistical problems such as hypothesis testing due to their robustness and effectiveness. However, they are unsatisfactory for more complex data. In the era of big data, high-dimensional and non-Euclidean data, such as networks and images, are ubiquitous and pose challenges for statistical analysis. Existing multivariate ranks such as component-wise, spatial, and depth-based ranks do not apply to non-Euclidean data and have limited performance for high-dimensional data. Instead of dealing with the ranks of observations, we propose two types of ranks applicable to complex data based on a similarity graph constructed on observations: a graph-induced rank defined by the inductive nature of the graph and an overall rank defined by the weight of edges in the graph. To illustrate their utilization, both the new ranks are used to construct test statistics for the two-sample hypothesis testing, which converge to the $\chi_2^2$ distribution under the permutation null distribution and some mild conditions of the ranks, enabling an easy type-I error control. Simulation studies show that the new method exhibits good power under a wide range of alternatives compared to existing methods. The new test is illustrated on the New York City taxi data for comparing travel patterns in consecutive months and a brain network dataset comparing male and female subjects.
APA
Zhou, D. & Chen, H.. (2023). A new ranking scheme for modern data and its application to two-sample hypothesis testing. Proceedings of Thirty Sixth Conference on Learning Theory, in Proceedings of Machine Learning Research 195:3615-3668 Available from https://proceedings.mlr.press/v195/zhou23a.html.

Related Material