Robust Document Distance with Wasserstein-Fisher-Rao metric
Proceedings of The 12th Asian Conference on Machine Learning, PMLR 129:721-736, 2020.
Computing the distance between linguistic objects is an essential problem in natural language processing. The word mover’s distance (WMD) has been successfully applied to measure document distance by combining low-level word similarity with the framework of optimal transport (OT). However, due to the global transportation nature of OT, the WMD may overestimate the semantic dissimilarity when documents contain unequal semantic details. In this paper, we propose to address this overestimation issue with a novel Wasserstein-Fisher-Rao (WFR) document distance grounded in unbalanced optimal transport theory. Compared to the WMD, the WFR document distance provides a trade-off between global transportation and local truncation, which leads to a better similarity measure for documents with unequal semantic details. Moreover, an efficient pruning strategy is designed specifically for the WFR document distance to facilitate top-k queries over a large number of documents. Extensive experimental results show that the WFR document distance achieves higher accuracy than WMD and even its supervised variant s-WMD.
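The trade-off between global transportation and local truncation can be illustrated with a small toy sketch. This is not the paper's implementation: it assumes the commonly used static form of the WFR ground cost, c(d) = -log cos²(min(d / (2η), π/2)), and swaps in a generic entropic unbalanced Sinkhorn iteration with KL-relaxed marginals. The names `wfr_cost` and `unbalanced_sinkhorn`, the parameters `eta`, `eps`, `rho`, and the 1-D "embeddings" are all hypothetical choices made for illustration.

```python
import math


def wfr_cost(d, eta):
    """Assumed WFR-style ground cost: finite only when d < pi * eta."""
    if d >= math.pi * eta:
        return math.inf  # too far apart: transport is truncated
    return -2.0 * math.log(math.cos(d / (2.0 * eta)))


def unbalanced_sinkhorn(a, b, C, eps=0.05, rho=1.0, iters=200):
    """Entropic unbalanced OT with KL-relaxed marginals; returns the plan.

    a, b : word weights of the two documents
    C    : ground cost matrix (entries may be math.inf)
    eps  : entropic regularization strength (illustrative value)
    rho  : weight of the KL marginal penalties (illustrative value)
    """
    n, m = len(a), len(b)
    # Infinite cost gives a zero kernel entry, so no mass can move there.
    K = [[math.exp(-C[i][j] / eps) for j in range(m)] for i in range(n)]
    tau = rho / (rho + eps)  # exponent < 1 relaxes the marginal constraints
    u, v = [1.0] * n, [1.0] * m
    for _ in range(iters):
        for i in range(n):
            s = sum(K[i][j] * v[j] for j in range(m))
            u[i] = (a[i] / s) ** tau if s > 0.0 else 0.0
        for j in range(m):
            s = sum(K[i][j] * u[i] for i in range(n))
            v[j] = (b[j] / s) ** tau if s > 0.0 else 0.0
    return [[u[i] * K[i][j] * v[j] for j in range(m)] for i in range(n)]


# Toy 1-D "embeddings": document A has an extra word far from anything in B.
pos_a, a = [0.0, 10.0], [0.5, 0.5]
pos_b, b = [0.1], [1.0]
eta = 1.0

C = [[wfr_cost(abs(x - y), eta) for y in pos_b] for x in pos_a]
P = unbalanced_sinkhorn(a, b, C)
mass_moved = sum(map(sum, P))

print("plan:", P)
print("total transported mass:", mass_moved)
```

In this sketch the distant word in document A is not transported at all, because its cost to every word in B is infinite, whereas a balanced OT formulation such as the WMD would be forced to move its full weight across the large distance. Only part of the total mass is matched, which is the local truncation effect the abstract describes.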