ConText: Driving In-context Learning for Text Removal and Segmentation

Fei Zhang; Pei Zhang; Baosong Yang; Fei Huang; Yanfeng Wang; Ya Zhang

Proceedings of the 42nd International Conference on Machine Learning, PMLR 267:76998-77016, 2025.

Abstract

This paper presents the first study on adapting the visual in-context learning (V-ICL) paradigm to optical character recognition tasks, specifically text removal and segmentation. Most existing V-ICL generalists employ a reasoning-as-reconstruction approach: they use a straightforward image-label compositor as the prompt and query input, then mask the query label and have the model reconstruct it as the desired output. This direct prompt confines the model to a challenging single-step reasoning process. To address this, we propose a task-chaining compositor in the form of image-removal-segmentation, providing an enhanced prompt that elicits reasoning with enriched intermediates. Additionally, we introduce context-aware aggregation, which integrates the chained prompt pattern into the latent query representation and thereby strengthens the model’s in-context reasoning. We also consider visual heterogeneity, which complicates the selection of homogeneous demonstrations in text recognition. We address this issue with a simple self-prompting strategy that prevents the model’s in-context learnability from devolving into specialist-like, context-free inference. Collectively, these insights culminate in our ConText model, which achieves new state-of-the-art results across both in- and out-of-domain benchmarks. The code is available at https://github.com/Ferenas/ConText.
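The task-chaining compositor described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation (see the linked repository for that); it only shows one plausible way to lay out an image-removal-segmentation prompt above a query whose removal and segmentation slots are masked for reconstruction. The grid layout, the `MASK` placeholder, and the names `hstack`, `masked_like`, and `compose_chained` are all assumptions made for illustration.

```python
# Hypothetical sketch: layout and names are assumptions, not the paper's code.
from typing import List

Image = List[List[int]]  # toy grayscale image stored as rows of pixel values

MASK = -1  # assumed placeholder marking pixels the model must reconstruct


def hstack(tiles: List[Image]) -> Image:
    """Concatenate equally tall tiles from left to right."""
    height = len(tiles[0])
    if any(len(tile) != height for tile in tiles):
        raise ValueError("all tiles must have the same height")
    return [sum((tile[r] for tile in tiles), []) for r in range(height)]


def masked_like(img: Image) -> Image:
    """Return a fully masked tile with the same shape as img."""
    return [[MASK] * len(row) for row in img]


def compose_chained(prompt_img: Image, prompt_removal: Image,
                    prompt_seg: Image, query_img: Image) -> Image:
    """Build a two-row canvas: demonstration on top, masked query below.

    Top row:    prompt image | its text-removed version | its text mask
    Bottom row: query image  | masked slot              | masked slot
    """
    top = hstack([prompt_img, prompt_removal, prompt_seg])
    blank = masked_like(query_img)
    bottom = hstack([query_img, blank, masked_like(query_img)])
    return top + bottom
```

For example, calling `compose_chained` with four 2x2 tiles yields a 4x6 canvas whose lower-right 2x4 region holds only `MASK` values, which is the part a reconstruction-style model would be asked to fill in.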

Cite this Paper

BibTeX

@InProceedings{pmlr-v267-zhang25dg,
  title     = {{C}on{T}ext: Driving In-context Learning for Text Removal and Segmentation},
  author    = {Zhang, Fei and Zhang, Pei and Yang, Baosong and Huang, Fei and Wang, Yanfeng and Zhang, Ya},
  booktitle = {Proceedings of the 42nd International Conference on Machine Learning},
  pages     = {76998--77016},
  year      = {2025},
  editor    = {Singh, Aarti and Fazel, Maryam and Hsu, Daniel and Lacoste-Julien, Simon and Berkenkamp, Felix and Maharaj, Tegan and Wagstaff, Kiri and Zhu, Jerry},
  volume    = {267},
  series    = {Proceedings of Machine Learning Research},
  month     = {13--19 Jul},
  publisher = {PMLR},
  pdf       = {https://raw.githubusercontent.com/mlresearch/v267/main/assets/zhang25dg/zhang25dg.pdf},
  url       = {https://proceedings.mlr.press/v267/zhang25dg.html},
  abstract = 	 {This paper presents the first study on adapting the visual in-context learning (V-ICL) paradigm to optical character recognition tasks, specifically focusing on text removal and segmentation. Most existing V-ICL generalists employ a reasoning-as-reconstruction approach: they turn to using a straightforward image-label compositor as the prompt and query input, and then masking the query label to generate the desired output. This direct prompt confines the model to a challenging single-step reasoning process. To address this, we propose a task-chaining compositor in the form of image-removal-segmentation, providing an enhanced prompt that elicits reasoning with enriched intermediates. Additionally, we introduce context-aware aggregation, integrating the chained prompt pattern into the latent query representation, thereby strengthening the model’s in-context reasoning. We also consider the issue of visual heterogeneity, which complicates the selection of homogeneous demonstrations in text recognition. Accordingly, this is effectively addressed through a simple self-prompting strategy, preventing the model’s in-context learnability from devolving into specialist-like, context-free inference. Collectively, these insights culminate in our ConText model, which achieves new state-of-the-art across both in- and out-of-domain benchmarks. The code is available at https://github.com/Ferenas/ConText.}
}

Endnote

%0 Conference Paper
%T ConText: Driving In-context Learning for Text Removal and Segmentation
%A Fei Zhang
%A Pei Zhang
%A Baosong Yang
%A Fei Huang
%A Yanfeng Wang
%A Ya Zhang
%B Proceedings of the 42nd International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2025
%E Aarti Singh
%E Maryam Fazel
%E Daniel Hsu
%E Simon Lacoste-Julien
%E Felix Berkenkamp
%E Tegan Maharaj
%E Kiri Wagstaff
%E Jerry Zhu
%F pmlr-v267-zhang25dg
%I PMLR
%P 76998--77016
%U https://proceedings.mlr.press/v267/zhang25dg.html
%V 267
%X This paper presents the first study on adapting the visual in-context learning (V-ICL) paradigm to optical character recognition tasks, specifically focusing on text removal and segmentation. Most existing V-ICL generalists employ a reasoning-as-reconstruction approach: they turn to using a straightforward image-label compositor as the prompt and query input, and then masking the query label to generate the desired output. This direct prompt confines the model to a challenging single-step reasoning process. To address this, we propose a task-chaining compositor in the form of image-removal-segmentation, providing an enhanced prompt that elicits reasoning with enriched intermediates. Additionally, we introduce context-aware aggregation, integrating the chained prompt pattern into the latent query representation, thereby strengthening the model’s in-context reasoning. We also consider the issue of visual heterogeneity, which complicates the selection of homogeneous demonstrations in text recognition. Accordingly, this is effectively addressed through a simple self-prompting strategy, preventing the model’s in-context learnability from devolving into specialist-like, context-free inference. Collectively, these insights culminate in our ConText model, which achieves new state-of-the-art across both in- and out-of-domain benchmarks. The code is available at https://github.com/Ferenas/ConText.

APA

Zhang, F., Zhang, P., Yang, B., Huang, F., Wang, Y., & Zhang, Y. (2025). ConText: Driving In-context Learning for Text Removal and Segmentation. Proceedings of the 42nd International Conference on Machine Learning, in Proceedings of Machine Learning Research 267:76998-77016. Available from https://proceedings.mlr.press/v267/zhang25dg.html.
