Show, Attend and Tell: Neural Image Caption Generation with Visual Attention

Kelvin Xu; Jimmy Ba; Ryan Kiros; Kyunghyun Cho; Aaron Courville; Ruslan Salakhudinov; Rich Zemel; Yoshua Bengio

Proceedings of the 32nd International Conference on Machine Learning, PMLR 37:2048-2057, 2015.

Abstract

Inspired by recent work in machine translation and object detection, we introduce an attention based model that automatically learns to describe the content of images. We describe how we can train this model in a deterministic manner using standard backpropagation techniques and stochastically by maximizing a variational lower bound. We also show through visualization how the model is able to automatically learn to fix its gaze on salient objects while generating the corresponding words in the output sequence. We validate the use of attention with state-of-the-art performance on three benchmark datasets: Flickr8k, Flickr30k and MS COCO.

Cite this Paper

BibTeX


@InProceedings{pmlr-v37-xuc15,
  title = 	 {Show, Attend and Tell: Neural Image Caption Generation with Visual Attention},
  author = 	 {Xu, Kelvin and Ba, Jimmy and Kiros, Ryan and Cho, Kyunghyun and Courville, Aaron and Salakhudinov, Ruslan and Zemel, Rich and Bengio, Yoshua},
  booktitle = 	 {Proceedings of the 32nd International Conference on Machine Learning},
  pages = 	 {2048--2057},
  year = 	 {2015},
  editor = 	 {Bach, Francis and Blei, David},
  volume = 	 {37},
  series = 	 {Proceedings of Machine Learning Research},
  address = 	 {Lille, France},
  month = 	 {07--09 Jul},
  publisher =    {PMLR},
  pdf = 	 {http://proceedings.mlr.press/v37/xuc15.pdf},
  url = 	 {https://proceedings.mlr.press/v37/xuc15.html},
}

Endnote

%0 Conference Paper
%T Show, Attend and Tell: Neural Image Caption Generation with Visual Attention
%A Kelvin Xu
%A Jimmy Ba
%A Ryan Kiros
%A Kyunghyun Cho
%A Aaron Courville
%A Ruslan Salakhudinov
%A Rich Zemel
%A Yoshua Bengio
%B Proceedings of the 32nd International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2015
%E Francis Bach
%E David Blei
%F pmlr-v37-xuc15
%I PMLR
%P 2048--2057
%U https://proceedings.mlr.press/v37/xuc15.html
%V 37

RIS


TY  - CPAPER
TI  - Show, Attend and Tell: Neural Image Caption Generation with Visual Attention
AU  - Kelvin Xu
AU  - Jimmy Ba
AU  - Ryan Kiros
AU  - Kyunghyun Cho
AU  - Aaron Courville
AU  - Ruslan Salakhudinov
AU  - Rich Zemel
AU  - Yoshua Bengio
BT  - Proceedings of the 32nd International Conference on Machine Learning
DA  - 2015/06/01
ED  - Francis Bach
ED  - David Blei
ID  - pmlr-v37-xuc15
PB  - PMLR
DP  - Proceedings of Machine Learning Research
VL  - 37
SP  - 2048
EP  - 2057
L1  - http://proceedings.mlr.press/v37/xuc15.pdf
UR  - https://proceedings.mlr.press/v37/xuc15.html
ER  -

APA


Xu, K., Ba, J., Kiros, R., Cho, K., Courville, A., Salakhudinov, R., Zemel, R. & Bengio, Y. (2015). Show, Attend and Tell: Neural Image Caption Generation with Visual Attention. Proceedings of the 32nd International Conference on Machine Learning, in Proceedings of Machine Learning Research 37:2048-2057. Available from https://proceedings.mlr.press/v37/xuc15.html.
