Relation Also Need Attention: Integrating Relation Information Into Image Captioning

Tianyu Chen; Zhixin Li; Tiantao Xian; Canlong Zhang; Huifang Ma

Proceedings of The 13th Asian Conference on Machine Learning, PMLR 157:1537-1552, 2021.

Abstract

Image captioning methods with attention mechanisms lead this field, especially models with global and local attention. However, few conventional models integrate the relationship information between different regions of an image. In this paper, such relationship features are embedded into the fused attention mechanism to explore the internal visual and semantic relations between different object regions. In addition, to alleviate the exposure bias problem and make training more efficient, we combine a Generative Adversarial Network with Reinforcement Learning and employ greedy decoding to generate a dynamic baseline reward for self-critical training. Finally, experiments on the MSCOCO dataset show that the model generates more accurate and vivid captions and outperforms previous advanced models on multiple prevailing metrics.

Cite this Paper

BibTeX


@InProceedings{pmlr-v157-chen21d,
  title = 	 {Relation Also Need Attention: Integrating Relation Information Into Image Captioning},
  author =       {Chen, Tianyu and Li, Zhixin and Xian, Tiantao and Zhang, Canlong and Ma, Huifang},
  booktitle = 	 {Proceedings of The 13th Asian Conference on Machine Learning},
  pages = 	 {1537--1552},
  year = 	 {2021},
  editor = 	 {Balasubramanian, Vineeth N. and Tsang, Ivor},
  volume = 	 {157},
  series = 	 {Proceedings of Machine Learning Research},
  month = 	 {17--19 Nov},
  publisher =    {PMLR},
  pdf = 	 {https://proceedings.mlr.press/v157/chen21d/chen21d.pdf},
  url = 	 {https://proceedings.mlr.press/v157/chen21d.html},
  abstract = 	 {Image captioning methods with attention mechanism are leading this field, especially models with global and local attention. But there are few conventional models to integrate the relationship information between various regions of the image. In this paper, this kind of relationship features are embedded into the fused attention mechanism to explore the internal visual and semantic relations between different object regions. Besides, to alleviate the exposure bias problem and make the training process more efficient, we combine Generative Adversarial Network with Reinforcement Learning and employ the greedy decoding method to generate a dynamic baseline reward for self-critical training. Finally, experiments on MSCOCO datasets show that the model can generate more accurate and vivid image captioning sentences and perform better in multiple prevailing metrics than the previous advanced models.}
}

Endnote

%0 Conference Paper
%T Relation Also Need Attention: Integrating Relation Information Into Image Captioning
%A Tianyu Chen
%A Zhixin Li
%A Tiantao Xian
%A Canlong Zhang
%A Huifang Ma
%B Proceedings of The 13th Asian Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2021
%E Vineeth N. Balasubramanian
%E Ivor Tsang
%F pmlr-v157-chen21d
%I PMLR
%P 1537--1552
%U https://proceedings.mlr.press/v157/chen21d.html
%V 157
%X Image captioning methods with attention mechanism are leading this field, especially models with global and local attention. But there are few conventional models to integrate the relationship information between various regions of the image. In this paper, this kind of relationship features are embedded into the fused attention mechanism to explore the internal visual and semantic relations between different object regions. Besides, to alleviate the exposure bias problem and make the training process more efficient, we combine Generative Adversarial Network with Reinforcement Learning and employ the greedy decoding method to generate a dynamic baseline reward for self-critical training. Finally, experiments on MSCOCO datasets show that the model can generate more accurate and vivid image captioning sentences and perform better in multiple prevailing metrics than the previous advanced models.

APA


Chen, T., Li, Z., Xian, T., Zhang, C. & Ma, H. (2021). Relation Also Need Attention: Integrating Relation Information Into Image Captioning. Proceedings of The 13th Asian Conference on Machine Learning, in Proceedings of Machine Learning Research 157:1537-1552. Available from https://proceedings.mlr.press/v157/chen21d.html.
