Refining Visual Perception for Decoration Display: A Self-Enhanced Deep Captioning Model

Longfei Huang; Xiangyu Wu; Jingyuan Wang; Weili Guo; Yang Yang

Proceedings of the 16th Asian Conference on Machine Learning, PMLR 260:527-542, 2025.

Abstract

Traditional decoration displays usually include renderings and corresponding descriptions to give users a deeper understanding and feeling. However, describing a massive number of renderings requires a lot of manpower. Thanks to the development of artificial intelligence, especially deep learning techniques, image captioning has been developed to automatically generate captions for given images. Yet existing captioning approaches, when transferred to the decoration display task, fall short in producing “perceptive” words (e.g., bright, capacious, and comfortable). To address this issue, we propose a self-enhanced deep captioning model, which generates captions with visual perception using the designed Self-Enhanced Transformer (SET). In detail, SET first pre-trains a scene-aware encoder, which employs a multi-task-based multi-modal transformer to enhance the perceptive semantics of the visual representations. Then, SET combines the pre-trained encoder with a transformer decoder for fine-tuning and adds a knowledge-enhanced module on top of the decoder to adaptively fuse the decoded representations and retrieved language cues for more suitable word prediction. In experiments, we first validate SET on the MS-COCO dataset, where it achieves an improvement of at least 0.6 in CIDEr-D score. Furthermore, to assess the effectiveness of SET on the decoration display task, we collect a new dataset called DecorationCap. We present a thorough empirical analysis to verify the generality of SET and find that SET surpasses the compared methods by at least 6.8 in CIDEr-D score.

Cite this Paper

BibTeX

@InProceedings{pmlr-v260-huang25a,
  title = 	 {{Refining Visual Perception for Decoration Display}: {A} Self-Enhanced Deep Captioning Model},
  author =       {Huang, Longfei and Wu, Xiangyu and Wang, Jingyuan and Guo, Weili and Yang, Yang},
  booktitle = 	 {Proceedings of the 16th Asian Conference on Machine Learning},
  pages = 	 {527--542},
  year = 	 {2025},
  editor = 	 {Nguyen, Vu and Lin, Hsuan-Tien},
  volume = 	 {260},
  series = 	 {Proceedings of Machine Learning Research},
  month = 	 {05--08 Dec},
  publisher =    {PMLR},
  pdf = 	 {https://raw.githubusercontent.com/mlresearch/v260/main/assets/huang25a/huang25a.pdf},
  url = 	 {https://proceedings.mlr.press/v260/huang25a.html},
  abstract = 	 {Traditional decoration displays usually include renderings and corresponding descriptions to give users a deeper understanding and feeling. Nevertheless, describing massive renderings undoubtedly requires a lot of manpower. Thanks to the development of artificial intelligence, especially deep learning techniques, image captioning has been developed to automatically generate captions for given images. However, the defect of exploring “perceptive” words (e.g., bright, capacious, and comfortable, etc.) is exposed when transferring existing captioning approaches to the decoration display task. To address this issue, in this paper, we propose a self-enhanced deep captioning model, which generates the captions with visual perception using the designed Self-Enhanced Transformer (SET). In detail, SET first pre-trains the scene-aware encoder, which employs the multi-task-based multi-modal transformer to enhance the perceptive semantics of the visual representations. Then, SET combines the pre-trained encoder with the transformer decoder for fine-tuning and designs a knowledge-enhanced module on the top of the decoder to adaptively fuse the decoded representations and retrieved language cues for making more suitable word prediction. In experiments, we first validate SET on the MS-COCO dataset, and we achieve at least 0.6 improvements on the CIDEr-D score. Furthermore, to address the effectiveness of SET on the decoration display task, we collect a new dataset called DecorationCap. We present a thorough empirical analysis to verify the generality of SET and find that SET surpasses other comparison methods with at least 6.8 improvements on the CIDEr-D score.}
}

Endnote

%0 Conference Paper
%T Refining Visual Perception for Decoration Display: A Self-Enhanced Deep Captioning Model
%A Longfei Huang
%A Xiangyu Wu
%A Jingyuan Wang
%A Weili Guo
%A Yang Yang
%B Proceedings of the 16th Asian Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2025
%E Vu Nguyen
%E Hsuan-Tien Lin
%F pmlr-v260-huang25a
%I PMLR
%P 527--542
%U https://proceedings.mlr.press/v260/huang25a.html
%V 260
%X Traditional decoration displays usually include renderings and corresponding descriptions to give users a deeper understanding and feeling. Nevertheless, describing massive renderings undoubtedly requires a lot of manpower. Thanks to the development of artificial intelligence, especially deep learning techniques, image captioning has been developed to automatically generate captions for given images. However, the defect of exploring “perceptive” words (e.g., bright, capacious, and comfortable, etc.) is exposed when transferring existing captioning approaches to the decoration display task. To address this issue, in this paper, we propose a self-enhanced deep captioning model, which generates the captions with visual perception using the designed Self-Enhanced Transformer (SET). In detail, SET first pre-trains the scene-aware encoder, which employs the multi-task-based multi-modal transformer to enhance the perceptive semantics of the visual representations. Then, SET combines the pre-trained encoder with the transformer decoder for fine-tuning and designs a knowledge-enhanced module on the top of the decoder to adaptively fuse the decoded representations and retrieved language cues for making more suitable word prediction. In experiments, we first validate SET on the MS-COCO dataset, and we achieve at least 0.6 improvements on the CIDEr-D score. Furthermore, to address the effectiveness of SET on the decoration display task, we collect a new dataset called DecorationCap. We present a thorough empirical analysis to verify the generality of SET and find that SET surpasses other comparison methods with at least 6.8 improvements on the CIDEr-D score.

APA

Huang, L., Wu, X., Wang, J., Guo, W., & Yang, Y. (2025). Refining Visual Perception for Decoration Display: A Self-Enhanced Deep Captioning Model. Proceedings of the 16th Asian Conference on Machine Learning, in Proceedings of Machine Learning Research 260:527-542. Available from https://proceedings.mlr.press/v260/huang25a.html.
