TVT: Two-View Transformer Network for Video Captioning

Ming Chen, Yingming Li, Zhongfei Zhang, Siyu Huang
Proceedings of The 10th Asian Conference on Machine Learning, PMLR 95:847-862, 2018.

Abstract

Video captioning is the task of automatically generating a natural-language description of a given video. Within an encoder-decoder framework, video captioning poses two main challenges: 1) how to model sequential information, and 2) how to combine modalities such as video and text. For challenge 1), recurrent neural network (RNN)-based methods are currently the most common approach to learning temporal representations of videos, but they suffer from a high computational cost. For challenge 2), the features of different modalities are often simply concatenated without much analysis. In this paper, we introduce a novel video captioning framework, the Two-View Transformer (TVT). TVT comprises a Transformer-network backbone for sequential representation and two types of fusion blocks in the decoder layers for effectively combining different modalities. Empirical study shows that our TVT model outperforms state-of-the-art methods on the MSVD dataset and achieves competitive performance on the MSR-VTT dataset under four common metrics.
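To make the encoder-decoder pipeline described above more concrete, below is a minimal PyTorch sketch. It is not the authors' implementation: it assumes two precomputed feature streams per video (e.g., appearance and motion features), fuses them by simple concatenation plus a linear projection (only one possible fusion variant; the paper's two fusion block designs are not reproduced here), and feeds the result to a standard Transformer encoder-decoder that predicts caption tokens. All class and variable names are illustrative.

import torch
import torch.nn as nn


class TwoViewCaptioner(nn.Module):
    def __init__(self, feat_dim_a, feat_dim_b, vocab_size,
                 d_model=512, nhead=8, num_layers=4, max_len=64):
        super().__init__()
        # Fuse the two video "views" into a single d_model-dim frame sequence.
        # Concatenation + linear projection is one simple fusion variant only.
        self.fuse = nn.Linear(feat_dim_a + feat_dim_b, d_model)
        self.frame_pos = nn.Embedding(max_len, d_model)  # learned frame positions
        self.word_pos = nn.Embedding(max_len, d_model)   # learned word positions
        self.embed = nn.Embedding(vocab_size, d_model)
        enc_layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        dec_layer = nn.TransformerDecoderLayer(d_model, nhead, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers)
        self.decoder = nn.TransformerDecoder(dec_layer, num_layers)
        self.out = nn.Linear(d_model, vocab_size)

    def forward(self, feats_a, feats_b, captions):
        # feats_a, feats_b: (batch, frames, feat_dim_*); captions: (batch, length)
        frames = self.fuse(torch.cat([feats_a, feats_b], dim=-1))
        frames = frames + self.frame_pos(torch.arange(frames.size(1)))
        memory = self.encoder(frames)
        words = self.embed(captions) + self.word_pos(torch.arange(captions.size(1)))
        # Causal mask so each word attends only to earlier words.
        length = captions.size(1)
        tgt_mask = torch.triu(torch.full((length, length), float("-inf")), diagonal=1)
        hidden = self.decoder(words, memory, tgt_mask=tgt_mask)
        return self.out(hidden)  # (batch, length, vocab_size) logits


# Toy usage: 2 videos, 20 frames, 2048-/1024-dim features, 10-word captions.
model = TwoViewCaptioner(feat_dim_a=2048, feat_dim_b=1024, vocab_size=5000)
logits = model(torch.randn(2, 20, 2048), torch.randn(2, 20, 1024),
               torch.randint(0, 5000, (2, 10)))
print(logits.shape)  # torch.Size([2, 10, 5000])

Replacing recurrent encoders and decoders with the self-attention stacks above removes the step-by-step recurrence that the abstract identifies as the main computational bottleneck of RNN-based captioners, since all frames and words are processed in parallel during training.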

Cite this Paper


BibTeX
@InProceedings{pmlr-v95-chen18b,
  title = {TVT: Two-View Transformer Network for Video Captioning},
  author = {Chen, Ming and Li, Yingming and Zhang, Zhongfei and Huang, Siyu},
  booktitle = {Proceedings of The 10th Asian Conference on Machine Learning},
  pages = {847--862},
  year = {2018},
  editor = {Zhu, Jun and Takeuchi, Ichiro},
  volume = {95},
  series = {Proceedings of Machine Learning Research},
  month = {14--16 Nov},
  publisher = {PMLR},
  pdf = {http://proceedings.mlr.press/v95/chen18b/chen18b.pdf},
  url = {https://proceedings.mlr.press/v95/chen18b.html},
  abstract = {Video captioning is a task of automatically generating the natural text description of a given video. There are two main challenges in video captioning under the context of an encoder-decoder framework: 1) How to model the sequential information; 2) How to combine the modalities including video and text. For challenge 1), the recurrent neural networks (RNNs) based methods are currently the most common approaches for learning temporal representations of videos, while they suffer from a high computational cost. For challenge 2), the features of different modalities are often roughly concatenated together without insightful discussion. In this paper, we introduce a novel video captioning framework, i.e., Two-View Transformer (TVT). TVT comprises of a backbone of Transformer network for sequential representation and two types of fusion blocks in decoder layers for combining different modalities effectively. Empirical study shows that our TVT model outperforms the state-of-the-art methods on the MSVD dataset and achieves a competitive performance on the MSR-VTT dataset under four common metrics.}
}
Endnote
%0 Conference Paper
%T TVT: Two-View Transformer Network for Video Captioning
%A Ming Chen
%A Yingming Li
%A Zhongfei Zhang
%A Siyu Huang
%B Proceedings of The 10th Asian Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2018
%E Jun Zhu
%E Ichiro Takeuchi
%F pmlr-v95-chen18b
%I PMLR
%P 847--862
%U https://proceedings.mlr.press/v95/chen18b.html
%V 95
%X Video captioning is a task of automatically generating the natural text description of a given video. There are two main challenges in video captioning under the context of an encoder-decoder framework: 1) How to model the sequential information; 2) How to combine the modalities including video and text. For challenge 1), the recurrent neural networks (RNNs) based methods are currently the most common approaches for learning temporal representations of videos, while they suffer from a high computational cost. For challenge 2), the features of different modalities are often roughly concatenated together without insightful discussion. In this paper, we introduce a novel video captioning framework, i.e., Two-View Transformer (TVT). TVT comprises of a backbone of Transformer network for sequential representation and two types of fusion blocks in decoder layers for combining different modalities effectively. Empirical study shows that our TVT model outperforms the state-of-the-art methods on the MSVD dataset and achieves a competitive performance on the MSR-VTT dataset under four common metrics.
APA
Chen, M., Li, Y., Zhang, Z. & Huang, S. (2018). TVT: Two-View Transformer Network for Video Captioning. Proceedings of The 10th Asian Conference on Machine Learning, in Proceedings of Machine Learning Research 95:847-862. Available from https://proceedings.mlr.press/v95/chen18b.html.
