Inverse Visual Question Answering with Multi-Level Attentions
Proceedings of The 12th Asian Conference on Machine Learning, PMLR 129:449-464, 2020.
Inverse Visual Question Answering (iVQA) is a recent task that emerged from the need to improve visual and language understanding. It tackles the challenging problem of generating a question that corresponds to a given image-answer pair. In this paper, we propose a novel deep multi-level attention model for iVQA. The proposed model generates regional visual and semantic features at the object level and then enhances them with the answer cue through attention mechanisms. The model employs attention at two levels: dual attention at the partial question encoding step and dynamic attention at the step that generates the question’s next word. We evaluate the proposed model on the VQA V1 dataset, where it achieves state-of-the-art performance on multiple commonly used metrics.
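To illustrate how an answer cue can re-weight object-level region features, the following is a minimal sketch and not the paper's model. It assumes scaled dot-product scoring and a shared embedding dimension for regions and the answer; the abstract does not specify the scoring function, and the name `answer_guided_attention` and the toy inputs are hypothetical.

```python
import numpy as np

def softmax(x):
    # Subtract the max for numerical stability before exponentiating.
    e = np.exp(x - np.max(x))
    return e / e.sum()

def answer_guided_attention(regions, answer):
    """Weight region features by their similarity to an answer embedding.

    regions: (k, d) array, one feature vector per detected object region.
    answer:  (d,) array, embedding of the answer cue.
    Scaled dot-product scoring is an assumption made for this sketch.
    """
    d = regions.shape[1]
    scores = regions @ answer / np.sqrt(d)   # one relevance score per region
    weights = softmax(scores)                # attention distribution over regions
    context = weights @ regions              # answer-aware summary of the image
    return weights, context

# Toy example: four orthogonal "regions"; the answer points at region 1.
regions = np.eye(4)
answer = np.array([0.0, 2.0, 0.0, 0.0])
weights, context = answer_guided_attention(regions, answer)
print(int(np.argmax(weights)))
```

In this toy setup the region aligned with the answer receives the largest weight. A question decoder could recompute such weights at every generated word, which is one plausible reading of "dynamic attention" in the abstract.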