Inverse Visual Question Answering with Multi-Level Attentions

Yaser Alwattar, Yuhong Guo
Proceedings of The 12th Asian Conference on Machine Learning, PMLR 129:449-464, 2020.

Abstract

Inverse Visual Question Answering (iVQA) is a contemporary task that has emerged from the need to improve visual and language understanding. It tackles the challenging problem of generating a corresponding question for a given image-answer pair. In this paper, we propose a novel deep multi-level attention model to address inverse visual question answering. The proposed model generates regional visual and semantic features at the object level and then enhances them with the answer cue using attention mechanisms. Two levels of attention are employed in the model: dual attention at the partial question encoding step and dynamic attention at the question's next-word generation step. We evaluate the proposed model on the VQA V1 dataset, where it achieves state-of-the-art performance on multiple commonly used metrics.
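
The architecture outlined in the abstract (object-level visual and semantic features, an answer cue injected through attention, dual attention while encoding the partial question, and dynamic attention when emitting the next word) can be pictured with a short PyTorch sketch. Everything below is a minimal reconstruction from the abstract alone: the module names, feature dimensions (e.g. 2048-d region features, 300-d word embeddings), and the exact way the answer embedding is fused with the decoder state are illustrative assumptions, not the authors' implementation.

# Minimal sketch of an answer-guided multi-level attention decoder for iVQA.
# Reconstructed from the abstract; all names and dimensions are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class GuidedAttention(nn.Module):
    # Soft attention over K per-object features, conditioned on a query
    # vector (the answer cue, the partial-question state, or both).
    def __init__(self, feat_dim, query_dim, hidden_dim=512):
        super().__init__()
        self.feat_proj = nn.Linear(feat_dim, hidden_dim)
        self.query_proj = nn.Linear(query_dim, hidden_dim)
        self.score = nn.Linear(hidden_dim, 1)

    def forward(self, feats, query):
        # feats: (B, K, feat_dim); query: (B, query_dim)
        h = torch.tanh(self.feat_proj(feats) + self.query_proj(query).unsqueeze(1))
        alpha = F.softmax(self.score(h), dim=1)   # attention weights, (B, K, 1)
        return (alpha * feats).sum(dim=1)         # attended feature, (B, feat_dim)

class IVQADecoder(nn.Module):
    def __init__(self, vocab_size, vis_dim=2048, sem_dim=300,
                 ans_dim=300, emb_dim=300, hid_dim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.lstm = nn.LSTMCell(emb_dim, hid_dim)
        # Dual attention at the partial-question encoding step: one head over
        # visual region features, one over semantic (object-label) features,
        # both conditioned on the decoder state fused with the answer cue.
        self.vis_att = GuidedAttention(vis_dim, hid_dim + ans_dim)
        self.sem_att = GuidedAttention(sem_dim, hid_dim + ans_dim)
        # Dynamic attention at the next-word generation step, conditioned on
        # the current decoder state only.
        self.dyn_att = GuidedAttention(vis_dim, hid_dim)
        self.out = nn.Linear(hid_dim + vis_dim + sem_dim + vis_dim, vocab_size)

    def step(self, word_ids, state, vis_feats, sem_feats, ans_emb):
        h, c = self.lstm(self.embed(word_ids), state)
        query = torch.cat([h, ans_emb], dim=1)
        v = self.vis_att(vis_feats, query)   # answer-cued visual context
        s = self.sem_att(sem_feats, query)   # answer-cued semantic context
        d = self.dyn_att(vis_feats, h)       # dynamic context for the next word
        logits = self.out(torch.cat([h, v, s, d], dim=1))
        return logits, (h, c)

# Usage: one decoding step over K = 36 hypothetical object regions.
B, K = 2, 36
model = IVQADecoder(vocab_size=10000)
vis = torch.randn(B, K, 2048)   # e.g. region features from an object detector
sem = torch.randn(B, K, 300)    # e.g. embeddings of detected object labels
ans = torch.randn(B, 300)       # answer embedding (the answer cue)
state = (torch.zeros(B, 512), torch.zeros(B, 512))
logits, state = model.step(torch.zeros(B, dtype=torch.long), state, vis, sem, ans)

At generation time this step would be unrolled over the question length, feeding the sampled word back in; training would use teacher forcing over ground-truth question prefixes. Those details are not specified in the abstract and are left out of the sketch.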

Cite this Paper

BibTeX
@InProceedings{pmlr-v129-alwattar20a,
  title     = {Inverse Visual Question Answering with Multi-Level Attentions},
  author    = {Alwattar, Yaser and Guo, Yuhong},
  booktitle = {Proceedings of The 12th Asian Conference on Machine Learning},
  pages     = {449--464},
  year      = {2020},
  editor    = {Pan, Sinno Jialin and Sugiyama, Masashi},
  volume    = {129},
  series    = {Proceedings of Machine Learning Research},
  month     = {18--20 Nov},
  publisher = {PMLR},
  pdf       = {http://proceedings.mlr.press/v129/alwatter20a/alwattar20a.pdf},
  url       = {https://proceedings.mlr.press/v129/alwattar20a.html},
  abstract  = {Inverse Visual Question Answering (iVQA) is a contemporary task that has emerged from the need to improve visual and language understanding. It tackles the challenging problem of generating a corresponding question for a given image-answer pair. In this paper, we propose a novel deep multi-level attention model to address inverse visual question answering. The proposed model generates regional visual and semantic features at the object level and then enhances them with the answer cue using attention mechanisms. Two levels of attention are employed in the model: dual attention at the partial question encoding step and dynamic attention at the question's next-word generation step. We evaluate the proposed model on the VQA V1 dataset, where it achieves state-of-the-art performance on multiple commonly used metrics.}
}
Endnote
%0 Conference Paper
%T Inverse Visual Question Answering with Multi-Level Attentions
%A Yaser Alwattar
%A Yuhong Guo
%B Proceedings of The 12th Asian Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2020
%E Sinno Jialin Pan
%E Masashi Sugiyama
%F pmlr-v129-alwattar20a
%I PMLR
%P 449--464
%U https://proceedings.mlr.press/v129/alwattar20a.html
%V 129
%X Inverse Visual Question Answering (iVQA) is a contemporary task that has emerged from the need to improve visual and language understanding. It tackles the challenging problem of generating a corresponding question for a given image-answer pair. In this paper, we propose a novel deep multi-level attention model to address inverse visual question answering. The proposed model generates regional visual and semantic features at the object level and then enhances them with the answer cue using attention mechanisms. Two levels of attention are employed in the model: dual attention at the partial question encoding step and dynamic attention at the question's next-word generation step. We evaluate the proposed model on the VQA V1 dataset, where it achieves state-of-the-art performance on multiple commonly used metrics.
APA
Alwattar, Y. & Guo, Y. (2020). Inverse Visual Question Answering with Multi-Level Attentions. Proceedings of The 12th Asian Conference on Machine Learning, in Proceedings of Machine Learning Research 129:449-464. Available from https://proceedings.mlr.press/v129/alwattar20a.html.
