PathAlign: A vision–language model for whole slide images in histopathology

Faruk Ahmed; Andrew Sellergen; Lin Yang; Shawn Xu; Boris Babenko; Abbi Ward; Niels Olson; Arash Mohtashamian; Yossi Matias; Greg S. Corrado; Quang Duong; Dale R. Webster; Shravya Shetty; Daniel Golden; Yun Liu; David F. Steiner; Ellery Wulczyn

Proceedings of the MICCAI Workshop on Computational Pathology, PMLR 254:72-108, 2024.

Abstract

Microscopic interpretation of histopathology images underlies many important diagnostic and treatment decisions. While advances in vision–language modeling raise new opportunities for analysis of such images, the gigapixel-scale size of whole slide images (WSIs) introduces unique challenges. Additionally, pathology reports simultaneously highlight key findings from small regions while also aggregating interpretation across multiple slides, often making it difficult to create robust image–text pairs. As such, pathology reports remain a largely untapped source of supervision in computational pathology, with most efforts relying on region-of-interest annotations or self-supervision at the patch-level. In this work, we develop a vision–language model based on the BLIP-2 framework using WSIs paired with curated text from pathology reports. This enables applications utilizing a shared image–text embedding space, such as text or image retrieval for finding cases of interest, as well as integration of the WSI encoder with a frozen large language model (LLM) for WSI-based generative text capabilities such as report generation or AI-in-the-loop interactions. We utilize a de-identified dataset of over 350,000 WSIs and diagnostic text pairs, spanning a wide range of diagnoses, procedure types, and tissue types. We present pathologist evaluation of text generation and text retrieval using WSI embeddings, as well as results for WSI classification and workflow prioritization (slide-level triaging). Model-generated text for WSIs was rated by pathologists as accurate, without clinically significant error or omission, for 78% of WSIs on average. This work demonstrates exciting potential capabilities for language-aligned WSI embeddings.

Cite this Paper

BibTeX


@InProceedings{pmlr-v254-ahmed24a,
  title = 	 {PathAlign: A vision–language model for whole slide images in histopathology},
  author =       {Ahmed, Faruk and Sellergen, Andrew and Yang, Lin and Xu, Shawn and Babenko, Boris and Ward, Abbi and Olson, Niels and Mohtashamian, Arash and Matias, Yossi and Corrado, Greg S. and Duong, Quang and Webster, Dale R. and Shetty, Shravya and Golden, Daniel and Liu, Yun and Steiner, David F. and Wulczyn, Ellery},
  booktitle = 	 {Proceedings of the MICCAI Workshop on Computational Pathology},
  pages = 	 {72--108},
  year = 	 {2024},
  editor = 	 {Ciompi, Francesco and Khalili, Nadieh and Studer, Linda and Poceviciute, Milda and Khan, Amjad and Veta, Mitko and Jiao, Yiping and Haj-Hosseini, Neda and Chen, Hao and Raza, Shan and Minhas, Fayyaz and Zlobec, Inti and Burlutskiy, Nikolay and Vilaplana, Veronica and Brattoli, Biagio and Muller, Henning and Atzori, Manfredo},
  volume = 	 {254},
  series = 	 {Proceedings of Machine Learning Research},
  month = 	 {06 Oct},
  publisher =    {PMLR},
  pdf = 	 {https://raw.githubusercontent.com/mlresearch/v254/main/assets/ahmed24a/ahmed24a.pdf},
  url = 	 {https://proceedings.mlr.press/v254/ahmed24a.html},
  abstract = 	 {Microscopic interpretation of histopathology images underlies many important diagnostic and treatment decisions. While advances in vision–language modeling raise new opportunities for analysis of such images, the gigapixel-scale size of whole slide images (WSIs) introduces unique challenges. Additionally, pathology reports simultaneously highlight key findings from small regions while also aggregating interpretation across multiple slides, often making it difficult to create robust image–text pairs. As such, pathology reports remain a largely untapped source of supervision in computational pathology, with most efforts relying on region-of-interest annotations or self-supervision at the patch-level. In this work, we develop a vision–language model based on the BLIP-2 framework using WSIs paired with curated text from pathology reports. This enables applications utilizing a shared image–text embedding space, such as text or image retrieval for finding cases of interest, as well as integration of the WSI encoder with a frozen large language model (LLM) for WSI-based generative text capabilities such as report generation or AI-in-the-loop interactions. We utilize a de-identified dataset of over 350,000 WSIs and diagnostic text pairs, spanning a wide range of diagnoses, procedure types, and tissue types. We present pathologist evaluation of text generation and text retrieval using WSI embeddings, as well as results for WSI classification and workflow prioritization (slide-level triaging). Model-generated text for WSIs was rated by pathologists as accurate, without clinically significant error or omission, for 78% of WSIs on average. This work demonstrates exciting potential capabilities for language-aligned WSI embeddings.}
}

Endnote

%0 Conference Paper
%T PathAlign: A vision–language model for whole slide images in histopathology
%A Faruk Ahmed
%A Andrew Sellergen
%A Lin Yang
%A Shawn Xu
%A Boris Babenko
%A Abbi Ward
%A Niels Olson
%A Arash Mohtashamian
%A Yossi Matias
%A Greg S. Corrado
%A Quang Duong
%A Dale R. Webster
%A Shravya Shetty
%A Daniel Golden
%A Yun Liu
%A David F. Steiner
%A Ellery Wulczyn
%B Proceedings of the MICCAI Workshop on Computational Pathology
%C Proceedings of Machine Learning Research
%D 2024
%E Francesco Ciompi
%E Nadieh Khalili
%E Linda Studer
%E Milda Poceviciute
%E Amjad Khan
%E Mitko Veta
%E Yiping Jiao
%E Neda Haj-Hosseini
%E Hao Chen
%E Shan Raza
%E Fayyaz Minhas
%E Inti Zlobec
%E Nikolay Burlutskiy
%E Veronica Vilaplana
%E Biagio Brattoli
%E Henning Muller
%E Manfredo Atzori
%F pmlr-v254-ahmed24a
%I PMLR
%P 72--108
%U https://proceedings.mlr.press/v254/ahmed24a.html
%V 254
%X Microscopic interpretation of histopathology images underlies many important diagnostic and treatment decisions. While advances in vision–language modeling raise new opportunities for analysis of such images, the gigapixel-scale size of whole slide images (WSIs) introduces unique challenges. Additionally, pathology reports simultaneously highlight key findings from small regions while also aggregating interpretation across multiple slides, often making it difficult to create robust image–text pairs. As such, pathology reports remain a largely untapped source of supervision in computational pathology, with most efforts relying on region-of-interest annotations or self-supervision at the patch-level. In this work, we develop a vision–language model based on the BLIP-2 framework using WSIs paired with curated text from pathology reports. This enables applications utilizing a shared image–text embedding space, such as text or image retrieval for finding cases of interest, as well as integration of the WSI encoder with a frozen large language model (LLM) for WSI-based generative text capabilities such as report generation or AI-in-the-loop interactions. We utilize a de-identified dataset of over 350,000 WSIs and diagnostic text pairs, spanning a wide range of diagnoses, procedure types, and tissue types. We present pathologist evaluation of text generation and text retrieval using WSI embeddings, as well as results for WSI classification and workflow prioritization (slide-level triaging). Model-generated text for WSIs was rated by pathologists as accurate, without clinically significant error or omission, for 78% of WSIs on average. This work demonstrates exciting potential capabilities for language-aligned WSI embeddings.

APA


Ahmed, F., Sellergen, A., Yang, L., Xu, S., Babenko, B., Ward, A., Olson, N., Mohtashamian, A., Matias, Y., Corrado, G.S., Duong, Q., Webster, D.R., Shetty, S., Golden, D., Liu, Y., Steiner, D.F. & Wulczyn, E. (2024). PathAlign: A vision–language model for whole slide images in histopathology. Proceedings of the MICCAI Workshop on Computational Pathology, in Proceedings of Machine Learning Research 254:72-108. Available from https://proceedings.mlr.press/v254/ahmed24a.html.
