Robust Document Distance with Wasserstein-Fisher-Rao metric

Zihao Wang, Datong Zhou, Ming Yang, Yong Zhang, Chenglong Rao, Hao Wu
Proceedings of The 12th Asian Conference on Machine Learning, PMLR 129:721-736, 2020.

Abstract

Computing the distance among linguistic objects is an essential problem in natural language processing. The word mover’s distance (WMD) has been successfully applied to measure document distance by combining low-level word similarity with the framework of optimal transport (OT). However, due to the global transportation nature of OT, the WMD may overestimate the semantic dissimilarity when documents contain unequal semantic details. In this paper, we propose to address this overestimation issue with a novel Wasserstein-Fisher-Rao (WFR) document distance grounded in unbalanced optimal transport theory. Compared to the WMD, the WFR document distance provides a trade-off between global transportation and local truncation, which leads to a better similarity measure for unequal semantic details. Moreover, an efficient pruning strategy is designed specifically for the WFR document distance to facilitate top-k queries among a large number of documents. Extensive experimental results show that the WFR document distance achieves higher accuracy than WMD and even its supervised variant s-WMD.
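
The contrast the abstract draws between balanced and unbalanced transport can be written out as a short sketch. The first problem is the standard WMD formulation; the second is a generic KL-relaxed unbalanced OT problem, given only as an illustration and not as the paper's exact WFR definition, whose ground cost and constants may differ. The symbols a and b (normalized bag-of-words weights of the two documents), C_ij (distance between the embeddings of word i and word j), T (transport plan), and λ (marginal penalty weight) are notation assumed here, not taken from the abstract.

```latex
% Balanced OT (WMD): every unit of mass in document a must be moved to document b.
\mathrm{WMD}(a, b) = \min_{T \ge 0} \sum_{i=1}^{n} \sum_{j=1}^{m} T_{ij} C_{ij}
\quad \text{s.t.} \quad T \mathbf{1}_m = a, \quad T^{\top} \mathbf{1}_n = b

% Generic unbalanced relaxation (illustrative assumption, not the paper's WFR formula):
% the hard marginal constraints are replaced by KL penalties weighted by \lambda.
\mathrm{UOT}_{\lambda}(a, b) = \min_{T \ge 0} \sum_{i=1}^{n} \sum_{j=1}^{m} T_{ij} C_{ij}
+ \lambda \, \mathrm{KL}\!\left(T \mathbf{1}_m \,\|\, a\right)
+ \lambda \, \mathrm{KL}\!\left(T^{\top} \mathbf{1}_n \,\|\, b\right)
```

Because the marginal constraints in the first problem are hard, words with no close counterpart in the other document must still be transported somewhere, which is the overestimation the abstract describes. In the relaxed problem such mass can instead be left unmatched, at a price controlled by λ.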

Cite this Paper


BibTeX
@InProceedings{pmlr-v129-wang20c,
  title = {Robust Document Distance with Wasserstein-Fisher-Rao metric},
  author = {Wang, Zihao and Zhou, Datong and Yang, Ming and Zhang, Yong and Rao, Chenglong and Wu, Hao},
  booktitle = {Proceedings of The 12th Asian Conference on Machine Learning},
  pages = {721--736},
  year = {2020},
  editor = {Pan, Sinno Jialin and Sugiyama, Masashi},
  volume = {129},
  series = {Proceedings of Machine Learning Research},
  month = {18--20 Nov},
  publisher = {PMLR},
  pdf = {http://proceedings.mlr.press/v129/wang20c/wang20c.pdf},
  url = {https://proceedings.mlr.press/v129/wang20c.html},
  abstract = {Computing the distance among linguistic objects is an essential problem in natural language processing. The word mover’s distance (WMD) has been successfully applied to measure the document distance by synthesizing the low-level word similarity with the framework of optimal transport (OT). However, due to the global transportation nature of OT, the WMD may overestimate the semantic dissimilarity when documents contain unequal semantic details. In this paper, we propose to address this overestimation issue with a novel Wasserstein-Fisher-Rao (WFR) document distance grounded on unbalanced optimal transport theory. Compared to the WMD, the WFR document distance provides a trade-off between global transportation and local truncation, which leads to a better similarity measure for unequal semantic details. Moreover, an efficient prune strategy is particularly designed for the WFR document distance to facilitate the top-k queries among a large number of documents. Extensive experimental results show that the WFR document distance achieves higher accuracy than WMD and even its supervised variation s-WMD.}
}
Endnote
%0 Conference Paper
%T Robust Document Distance with Wasserstein-Fisher-Rao metric
%A Zihao Wang
%A Datong Zhou
%A Ming Yang
%A Yong Zhang
%A Chenglong Rao
%A Hao Wu
%B Proceedings of The 12th Asian Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2020
%E Sinno Jialin Pan
%E Masashi Sugiyama
%F pmlr-v129-wang20c
%I PMLR
%P 721--736
%U https://proceedings.mlr.press/v129/wang20c.html
%V 129
%X Computing the distance among linguistic objects is an essential problem in natural language processing. The word mover’s distance (WMD) has been successfully applied to measure the document distance by synthesizing the low-level word similarity with the framework of optimal transport (OT). However, due to the global transportation nature of OT, the WMD may overestimate the semantic dissimilarity when documents contain unequal semantic details. In this paper, we propose to address this overestimation issue with a novel Wasserstein-Fisher-Rao (WFR) document distance grounded on unbalanced optimal transport theory. Compared to the WMD, the WFR document distance provides a trade-off between global transportation and local truncation, which leads to a better similarity measure for unequal semantic details. Moreover, an efficient prune strategy is particularly designed for the WFR document distance to facilitate the top-k queries among a large number of documents. Extensive experimental results show that the WFR document distance achieves higher accuracy than WMD and even its supervised variation s-WMD.
APA
Wang, Z., Zhou, D., Yang, M., Zhang, Y., Rao, C. & Wu, H. (2020). Robust Document Distance with Wasserstein-Fisher-Rao metric. Proceedings of The 12th Asian Conference on Machine Learning, in Proceedings of Machine Learning Research 129:721-736. Available from https://proceedings.mlr.press/v129/wang20c.html.

Related Material