Measuring Mutual Information Between All Pairs of Variables in Subquadratic Complexity
Proceedings of the Twenty Third International Conference on Artificial Intelligence and Statistics, PMLR 108:4399-4409, 2020.
Abstract
Finding associations between pairs of variables in large datasets is crucial for various disciplines. The brute-force method for solving this problem requires computing the mutual information between $\binom{N}{2}$ pairs. In this paper, we consider the problem of finding pairs of variables with high mutual information in sub-quadratic complexity. This problem is analogous to nearest-neighbor search, where the goal is to find pairs among $N$ variables that are similar to each other. To solve this problem, we develop a new algorithm for finding associations based on constructing a decision tree that assigns a hash to each variable, such that pairs with higher mutual information are more likely to receive the same hash. For any $1 \leq \lambda \leq 2$, we prove that in the case of binary data, we can reduce the number of mutual information computations needed to find all pairs satisfying $I(X,Y) > 2 - \lambda$ from $O(N^2)$ to $O(N^\lambda)$, where $I(X,Y)$ is the empirical mutual information between variables $X$ and $Y$. Finally, we confirm our theory with experiments on simulated and real data. The implementation of our method and experiments is publicly available at \href{https://github.com/mohimanilab/HashMI}{https://github.com/mohimanilab/HashMI}.
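To make the hash-then-filter idea concrete, below is a minimal Python sketch. It is an illustrative assumption, not the paper's method: the function names (`empirical_mi`, `bucket_by_hash`, `candidate_pairs`) are hypothetical, and the hash here simply samples a few coordinates of each variable, so it only detects positively agreeing pairs, whereas the paper constructs a decision tree whose collisions track mutual information more generally. The sketch shows the overall structure: compute empirical MI only within hash buckets rather than over all $\binom{N}{2}$ pairs.

```python
import numpy as np
from collections import defaultdict
from itertools import combinations

def empirical_mi(x, y):
    """Empirical mutual information (in bits) between two binary vectors."""
    mi = 0.0
    for a in (0, 1):
        for b in (0, 1):
            p_ab = np.mean((x == a) & (y == b))
            p_a = np.mean(x == a)
            p_b = np.mean(y == b)
            if p_ab > 0:
                mi += p_ab * np.log2(p_ab / (p_a * p_b))
    return mi

def bucket_by_hash(data, num_bits=8, seed=0):
    """Toy hash: each variable (row) is keyed by its values at a few random
    sample positions. Positively associated variables tend to collide; the
    paper's decision-tree hash handles dependence more generally."""
    rng = np.random.default_rng(seed)
    idx = rng.choice(data.shape[1], size=num_bits, replace=False)
    buckets = defaultdict(list)
    for v, row in enumerate(data):
        buckets[tuple(row[idx])].append(v)
    return buckets

def candidate_pairs(data, **kw):
    """Only pairs sharing a bucket are kept as MI candidates."""
    pairs = set()
    for members in bucket_by_hash(data, **kw).values():
        pairs.update(combinations(members, 2))
    return pairs

# Usage: 1000 binary variables, 200 samples, with one planted dependent pair.
rng = np.random.default_rng(1)
X = rng.integers(0, 2, size=(1000, 200))
X[1] = X[0]  # variables 0 and 1 are identical, so I(X_0, X_1) = H(X_0)
cands = candidate_pairs(X, num_bits=8)  # far fewer than the ~500k brute-force pairs
high = [(i, j) for i, j in cands if empirical_mi(X[i], X[j]) > 0.5]
```

With 8 random sample positions, two independent uniform binary variables collide with probability about $2^{-8}$, so the number of MI evaluations drops from roughly $5 \times 10^5$ to a few thousand, while the planted pair always lands in the same bucket.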