Federated Multilingual Models for Medical Transcript Analysis

Andrea Manoel; Mirian del Carmen Hipolito Garcia; Tal Baumel; Shize Su; Jialei Chen; Robert Sim; Dan Miller; Danny Karmon; Dimitrios Dimitriadis

Proceedings of the Conference on Health, Inference, and Learning, PMLR 209:147-162, 2023.

Abstract

Federated Learning (FL) is a machine learning approach that allows the model trainer to access more data samples by training across multiple decentralized data sources while enforcing data access constraints. Models trained this way can achieve significantly higher performance than is possible when training on a single data source. In an FL setting, none of the training data is ever transmitted to any central location; i.e., sensitive data remains local and private. These characteristics make FL well suited for applications in healthcare, where a variety of compliance constraints restrict how data may be handled. Despite these apparent benefits in compliance and privacy, certain scenarios, such as heterogeneity of the local data distributions, pose significant challenges for FL. Such challenges are even more pronounced in a multilingual setting. This paper presents an FL system for pre-training a large-scale multilingual model suitable for fine-tuning on downstream tasks such as medical entity tagging. Our work represents one of the first such production-scale systems, capable of training across multiple highly heterogeneous data providers and achieving levels of accuracy that could not otherwise be achieved by central training with public data only. We also show that the global model performance can be further improved by a local training step.

Cite this Paper

BibTeX


@InProceedings{pmlr-v209-manoel23a,
  title     = {Federated Multilingual Models for Medical Transcript Analysis},
  author    = {Manoel, Andrea and Garcia, Mirian del Carmen Hipolito and Baumel, Tal and Su, Shize and Chen, Jialei and Sim, Robert and Miller, Dan and Karmon, Danny and Dimitriadis, Dimitrios},
  booktitle = {Proceedings of the Conference on Health, Inference, and Learning},
  pages     = {147--162},
  year      = {2023},
  editor    = {Mortazavi, Bobak J. and Sarker, Tasmie and Beam, Andrew and Ho, Joyce C.},
  volume    = {209},
  series    = {Proceedings of Machine Learning Research},
  month     = {22 Jun--24 Jun},
  publisher = {PMLR},
  pdf       = {https://proceedings.mlr.press/v209/manoel23a/manoel23a.pdf},
  url       = {https://proceedings.mlr.press/v209/manoel23a.html},
}

Endnote

%0 Conference Paper
%T Federated Multilingual Models for Medical Transcript Analysis
%A Andrea Manoel
%A Mirian del Carmen Hipolito Garcia
%A Tal Baumel
%A Shize Su
%A Jialei Chen
%A Robert Sim
%A Dan Miller
%A Danny Karmon
%A Dimitrios Dimitriadis
%B Proceedings of the Conference on Health, Inference, and Learning
%C Proceedings of Machine Learning Research
%D 2023
%E Bobak J. Mortazavi
%E Tasmie Sarker
%E Andrew Beam
%E Joyce C. Ho
%F pmlr-v209-manoel23a
%I PMLR
%P 147--162
%U https://proceedings.mlr.press/v209/manoel23a.html
%V 209

APA


Manoel, A., Garcia, M.d.C.H., Baumel, T., Su, S., Chen, J., Sim, R., Miller, D., Karmon, D. & Dimitriadis, D. (2023). Federated Multilingual Models for Medical Transcript Analysis. Proceedings of the Conference on Health, Inference, and Learning, in Proceedings of Machine Learning Research 209:147-162. Available from https://proceedings.mlr.press/v209/manoel23a.html.
