ECG-Byte: A Tokenizer for End-to-End Generative Electrocardiogram Language Modeling

William Han, Chaojing Duan, Michael Rosenberg, Emerson Liu, Ding Zhao
Proceedings of the 10th Machine Learning for Healthcare Conference, PMLR 298, 2025.

Abstract

Large Language Models (LLMs) have demonstrated exceptional versatility across domains, including applications to electrocardiograms (ECGs). A growing body of work focuses on generating text from multi-channeled ECG signals and corresponding textual prompts. Existing approaches often involve a two-stage process: pretraining an ECG-specific encoder with a self-supervised learning (SSL) objective, followed by finetuning an LLM for natural language generation (NLG) using encoder-derived features. However, these methods face two key limitations: inefficiency due to multi-stage training and challenges in interpreting encoder-generated features. To overcome these issues, we propose ECG-Byte, an adapted byte pair encoding (BPE) tokenizer pipeline for autoregressive language modeling of ECGs. ECG-Byte compresses and encodes ECG signals into tokens, enabling direct end-to-end LLM training by combining ECG and text tokens. This approach enhances interpretability, as ECG tokens can be directly mapped back to the original signals. Leveraging ECG-Byte, we achieve competitive NLG performance while training 3 times faster and using just 48% of the data required by traditional two-stage methods.
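The abstract describes adapting byte pair encoding to ECG signals: continuous samples are first compressed into a discrete symbol sequence, BPE merges are learned over that sequence, and each resulting token can be expanded back to the exact run of samples it covers. The sketch below illustrates that general idea with a uniform-quantization step and a textbook BPE loop; the function names, the binning scheme, and the merge budget are illustrative assumptions, not the paper's actual pipeline.

```python
import numpy as np

def quantize(signal, n_bins=64):
    # Illustrative discretization: map continuous ECG samples onto a
    # small symbol alphabet via uniform bins, giving BPE a byte-like
    # sequence to operate on. (The paper's encoding may differ.)
    edges = np.linspace(signal.min(), signal.max(), n_bins - 1)
    return [int(b) for b in np.digitize(signal, edges)]

def learn_bpe(seq, num_merges):
    # Standard BPE loop: repeatedly replace the most frequent adjacent
    # symbol pair with a fresh token id, shortening the sequence.
    seq = list(seq)
    merges = {}                      # (left, right) -> new token id
    next_id = max(seq) + 1
    for _ in range(num_merges):
        pairs = {}
        for a, b in zip(seq, seq[1:]):
            pairs[(a, b)] = pairs.get((a, b), 0) + 1
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges[best] = next_id
        merged, i = [], 0
        while i < len(seq):
            if i + 1 < len(seq) and (seq[i], seq[i + 1]) == best:
                merged.append(next_id)
                i += 2
            else:
                merged.append(seq[i])
                i += 1
        seq = merged
        next_id += 1
    return seq, merges

def decode(tokens, merges):
    # Interpretability property from the abstract: every ECG token
    # expands back to the exact quantized samples it was merged from.
    inv = {tok: pair for pair, tok in merges.items()}
    out, stack = [], list(reversed(tokens))
    while stack:
        t = stack.pop()
        if t in inv:
            a, b = inv[t]
            stack.extend([b, a])     # push right first so left pops first
        else:
            out.append(t)
    return out
```

Running `learn_bpe` on a quantized waveform yields a shorter token sequence, and `decode` recovers the original symbol sequence exactly, which is the mapping-back-to-signal property the abstract highlights.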

Cite this Paper


BibTeX
@InProceedings{pmlr-v298-han25a,
  title     = {{ECG}-Byte: A Tokenizer for End-to-End Generative Electrocardiogram Language Modeling},
  author    = {Han, William and Duan, Chaojing and Rosenberg, Michael and Liu, Emerson and Zhao, Ding},
  booktitle = {Proceedings of the 10th Machine Learning for Healthcare Conference},
  year      = {2025},
  editor    = {Agrawal, Monica and Deshpande, Kaivalya and Engelhard, Matthew and Joshi, Shalmali and Tang, Shengpu and Urteaga, Iñigo},
  volume    = {298},
  series    = {Proceedings of Machine Learning Research},
  month     = {15--16 Aug},
  publisher = {PMLR},
  pdf       = {https://raw.githubusercontent.com/mlresearch/v298/main/assets/han25a/han25a.pdf},
  url       = {https://proceedings.mlr.press/v298/han25a.html},
  abstract  = {Large Language Models (LLMs) have demonstrated exceptional versatility across domains, including applications to electrocardiograms (ECGs). A growing body of work focuses on generating text from multi-channeled ECG signals and corresponding textual prompts. Existing approaches often involve a two-stage process: pretraining an ECG-specific encoder with a self-supervised learning (SSL) objective, followed by finetuning an LLM for natural language generation (NLG) using encoder-derived features. However, these methods face two key limitations: inefficiency due to multi-stage training and challenges in interpreting encoder-generated features. To overcome these issues, we propose ECG-Byte, an adapted byte pair encoding (BPE) tokenizer pipeline for autoregressive language modeling of ECGs. ECG-Byte compresses and encodes ECG signals into tokens, enabling direct end-to-end LLM training by combining ECG and text tokens. This approach enhances interpretability, as ECG tokens can be directly mapped back to the original signals. Leveraging ECG-Byte, we achieve competitive NLG performance while training 3 times faster and using just 48% of the data required by traditional two-stage methods.}
}
Endnote
%0 Conference Paper
%T ECG-Byte: A Tokenizer for End-to-End Generative Electrocardiogram Language Modeling
%A William Han
%A Chaojing Duan
%A Michael Rosenberg
%A Emerson Liu
%A Ding Zhao
%B Proceedings of the 10th Machine Learning for Healthcare Conference
%C Proceedings of Machine Learning Research
%D 2025
%E Monica Agrawal
%E Kaivalya Deshpande
%E Matthew Engelhard
%E Shalmali Joshi
%E Shengpu Tang
%E Iñigo Urteaga
%F pmlr-v298-han25a
%I PMLR
%U https://proceedings.mlr.press/v298/han25a.html
%V 298
%X Large Language Models (LLMs) have demonstrated exceptional versatility across domains, including applications to electrocardiograms (ECGs). A growing body of work focuses on generating text from multi-channeled ECG signals and corresponding textual prompts. Existing approaches often involve a two-stage process: pretraining an ECG-specific encoder with a self-supervised learning (SSL) objective, followed by finetuning an LLM for natural language generation (NLG) using encoder-derived features. However, these methods face two key limitations: inefficiency due to multi-stage training and challenges in interpreting encoder-generated features. To overcome these issues, we propose ECG-Byte, an adapted byte pair encoding (BPE) tokenizer pipeline for autoregressive language modeling of ECGs. ECG-Byte compresses and encodes ECG signals into tokens, enabling direct end-to-end LLM training by combining ECG and text tokens. This approach enhances interpretability, as ECG tokens can be directly mapped back to the original signals. Leveraging ECG-Byte, we achieve competitive NLG performance while training 3 times faster and using just 48% of the data required by traditional two-stage methods.
APA
Han, W., Duan, C., Rosenberg, M., Liu, E., & Zhao, D. (2025). ECG-Byte: A Tokenizer for End-to-End Generative Electrocardiogram Language Modeling. Proceedings of the 10th Machine Learning for Healthcare Conference, in Proceedings of Machine Learning Research, 298. Available from https://proceedings.mlr.press/v298/han25a.html.