Preprocessing Pathology Reports for Vision-Language Model Development

Ruben T. Lucassen, Tijn van de Luijtgaarden, Sander P. J. Moonemans, Willeke A. M. Blokx, Mitko Veta
Proceedings of the MICCAI Workshop on Computational Pathology, PMLR 254:61-71, 2024.

Abstract

Pathology reports are increasingly being used for development of vision-language models. Because the reports often include information that cannot directly be derived from paired images, careful selection of information is required to prevent hallucinations in tasks like report generation. In this paper, we present a language model for subsentence segmentation based on the information content, as part of a preprocessing workflow for 27,500 pathology reports of cutaneous melanocytic lesions. After initial cleanup, the reports were first translated from Dutch to English and then segmented by separate language models. Both models were developed using an iterative approach, in which the development dataset was expanded with manually corrected model predictions for previously unannotated reports before finetuning the next version of the models. Over the course of eight iterations, the development dataset was scaled up to 1,500 translated and annotated reports. On the independent test set of 3,597 sentences from 150 reports, 219 translation errors (6.1%) of different severities were counted. The subsentence segmentation model achieved a strong predictive performance on the test set with a macro average F1-score of 0.921 (95% CI, 0.890-0.940) and a weighted average F1-score of 0.952 (95% CI, 0.944-0.960) over 13 different classes. The remaining 25,850 unannotated reports were translated and segmented using the final models to complete the dataset preprocessing. Differences in word count and class distribution between section types of the reports were explored in preparation for future vision-language modeling. The presented methodology is generic and can, therefore, easily be extended to multiple or different pathology domains beyond melanocytic skin lesions. Code and trained model parameters are made publicly available.
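
The abstract reports both a macro average and a weighted average F1-score. As a minimal sketch of how these two averages differ, the Python below computes both on a toy two-class example. The function names and the labels "A" and "B" are hypothetical and are not taken from the paper's code or its 13 classes.

```python
from collections import Counter


def f1_per_class(y_true, y_pred):
    """Return {label: F1}, computed one-vs-rest for every label that occurs."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = {}
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        denom = 2 * tp + fp + fn
        scores[label] = 2 * tp / denom if denom else 0.0
    return scores


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1: every class counts equally."""
    scores = f1_per_class(y_true, y_pred)
    return sum(scores.values()) / len(scores)


def weighted_f1(y_true, y_pred):
    """Mean of per-class F1 weighted by the number of true items per class."""
    scores = f1_per_class(y_true, y_pred)
    support = Counter(y_true)
    return sum(scores[label] * support[label] for label in scores) / len(y_true)


# Toy data (hypothetical): a frequent class "A" and a rare class "B".
y_true = ["A"] * 8 + ["B"] * 2
y_pred = ["A"] * 8 + ["A", "B"]  # one "B" item is mislabeled as "A"

print(round(macro_f1(y_true, y_pred), 3))
print(round(weighted_f1(y_true, y_pred), 3))
```

In this toy case the weighted average is higher than the macro average because the rare class, which is predicted less accurately, contributes less to it. The same pattern would be consistent with the reported weighted score (0.952) exceeding the macro score (0.921), although the per-class results are not given here.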

Cite this Paper


BibTeX
@InProceedings{pmlr-v254-lucassen24a,
  title = {Preprocessing Pathology Reports for Vision-Language Model Development},
  author = {Lucassen, Ruben T. and Luijtgaarden, Tijn van de and Moonemans, Sander P. J. and Blokx, Willeke A. M. and Veta, Mitko},
  booktitle = {Proceedings of the MICCAI Workshop on Computational Pathology},
  pages = {61--71},
  year = {2024},
  editor = {Ciompi, Francesco and Khalili, Nadieh and Studer, Linda and Poceviciute, Milda and Khan, Amjad and Veta, Mitko and Jiao, Yiping and Haj-Hosseini, Neda and Chen, Hao and Raza, Shan and Minhas, Fayyaz and Zlobec, Inti and Burlutskiy, Nikolay and Vilaplana, Veronica and Brattoli, Biagio and Muller, Henning and Atzori, Manfredo},
  volume = {254},
  series = {Proceedings of Machine Learning Research},
  month = {06 Oct},
  publisher = {PMLR},
  pdf = {https://raw.githubusercontent.com/mlresearch/v254/main/assets/lucassen24a/lucassen24a.pdf},
  url = {https://proceedings.mlr.press/v254/lucassen24a.html},
  abstract = {Pathology reports are increasingly being used for development of vision-language models. Because the reports often include information that cannot directly be derived from paired images, careful selection of information is required to prevent hallucinations in tasks like report generation. In this paper, we present a language model for subsentence segmentation based on the information content, as part of a preprocessing workflow for 27,500 pathology reports of cutaneous melanocytic lesions. After initial cleanup, the reports were first translated from Dutch to English and then segmented by separate language models. Both models were developed using an iterative approach, in which the development dataset was expanded with manually corrected model predictions for previously unannotated reports before finetuning the next version of the models. Over the course of eight iterations, the development dataset was scaled up to 1,500 translated and annotated reports. On the independent test set of 3,597 sentences from 150 reports, 219 translation errors (6.1%) of different severities were counted. The subsentence segmentation model achieved a strong predictive performance on the test set with a macro average F1-score of 0.921 (95% CI, 0.890-0.940) and a weighted average F1-score of 0.952 (95% CI, 0.944-0.960) over 13 different classes. The remaining 25,850 unannotated reports were translated and segmented using the final models to complete the dataset preprocessing. Differences in word count and class distribution between section types of the reports were explored in preparation for future vision-language modeling. The presented methodology is generic and can, therefore, easily be extended to multiple or different pathology domains beyond melanocytic skin lesions. Code and trained model parameters are made publicly available.}
}
Endnote
%0 Conference Paper
%T Preprocessing Pathology Reports for Vision-Language Model Development
%A Ruben T. Lucassen
%A Tijn van de Luijtgaarden
%A Sander P. J. Moonemans
%A Willeke A. M. Blokx
%A Mitko Veta
%B Proceedings of the MICCAI Workshop on Computational Pathology
%C Proceedings of Machine Learning Research
%D 2024
%E Francesco Ciompi
%E Nadieh Khalili
%E Linda Studer
%E Milda Poceviciute
%E Amjad Khan
%E Mitko Veta
%E Yiping Jiao
%E Neda Haj-Hosseini
%E Hao Chen
%E Shan Raza
%E Fayyaz Minhas
%E Inti Zlobec
%E Nikolay Burlutskiy
%E Veronica Vilaplana
%E Biagio Brattoli
%E Henning Muller
%E Manfredo Atzori
%F pmlr-v254-lucassen24a
%I PMLR
%P 61--71
%U https://proceedings.mlr.press/v254/lucassen24a.html
%V 254
%X Pathology reports are increasingly being used for development of vision-language models. Because the reports often include information that cannot directly be derived from paired images, careful selection of information is required to prevent hallucinations in tasks like report generation. In this paper, we present a language model for subsentence segmentation based on the information content, as part of a preprocessing workflow for 27,500 pathology reports of cutaneous melanocytic lesions. After initial cleanup, the reports were first translated from Dutch to English and then segmented by separate language models. Both models were developed using an iterative approach, in which the development dataset was expanded with manually corrected model predictions for previously unannotated reports before finetuning the next version of the models. Over the course of eight iterations, the development dataset was scaled up to 1,500 translated and annotated reports. On the independent test set of 3,597 sentences from 150 reports, 219 translation errors (6.1%) of different severities were counted. The subsentence segmentation model achieved a strong predictive performance on the test set with a macro average F1-score of 0.921 (95% CI, 0.890-0.940) and a weighted average F1-score of 0.952 (95% CI, 0.944-0.960) over 13 different classes. The remaining 25,850 unannotated reports were translated and segmented using the final models to complete the dataset preprocessing. Differences in word count and class distribution between section types of the reports were explored in preparation for future vision-language modeling. The presented methodology is generic and can, therefore, easily be extended to multiple or different pathology domains beyond melanocytic skin lesions. Code and trained model parameters are made publicly available.
APA
Lucassen, R.T., Luijtgaarden, T.v.d., Moonemans, S.P.J., Blokx, W.A.M. & Veta, M. (2024). Preprocessing Pathology Reports for Vision-Language Model Development. Proceedings of the MICCAI Workshop on Computational Pathology, in Proceedings of Machine Learning Research 254:61-71. Available from https://proceedings.mlr.press/v254/lucassen24a.html.
