A Systematic Comparison of Data Representations for Transformer-Based ECG Arrhythmia Classification
Proceedings of The Second AAAI Bridge Program on AI for Medicine and Healthcare, PMLR 317:37-45, 2026.
Abstract
Automated electrocardiogram (ECG) classification plays a key role in detecting cardiac arrhythmias efficiently and objectively. Despite major advances in deep learning, there remains no consensus on whether one-dimensional (1D) temporal or two-dimensional (2D) time–frequency representations yield superior diagnostic accuracy. This study presents a controlled comparison between Vision Transformer (ViT) architectures trained on raw 1D ECG sequences and Short-Time Fourier Transform (STFT)-based 2D spectrograms using the CPSC2018 dataset. Both models share comparable architectures and parameter counts to isolate the effect of signal representation. The 1D-ViT achieved the highest overall accuracy (96.5%) and F1-score (96.5%), confirming that preserving temporal continuity is critical for arrhythmia discrimination. The 2D-ViT achieved lower accuracy (92.6%) due to temporal information loss, though it maintained competitive calibration (AUC 98.6%) and generalization. A bidirectional fusion model combining both encoders through cross-attention exhibited complementary behavior but did not surpass the 1D baseline. These findings indicate that while spectro-temporal information can enhance interpretability and stability, temporal-domain fidelity remains the dominant factor for reliable ECG classification.
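The time-frequency representation compared above can be illustrated with a minimal sketch. This is not the paper's code: the sampling rate matches CPSC2018 (500 Hz), but the window length, hop size, and the synthetic input signal are illustrative assumptions.

```python
# Hypothetical sketch (not the paper's implementation): converting a 1D ECG
# segment into a log-magnitude STFT spectrogram for a 2D model.
# Window/hop sizes are illustrative; CPSC2018 records are sampled at 500 Hz.
import numpy as np

fs = 500                           # sampling rate in Hz
ecg = np.sin(2 * np.pi * 1.2 * np.arange(10 * fs) / fs)  # toy 10 s surrogate signal

win, hop = 128, 32                 # STFT window length and hop, in samples
window = np.hanning(win)

# Slide a Hann window over the signal and take the real FFT of each frame.
# Stacking the frames yields a (frequency x time) image; the coarse hop is
# exactly the temporal-resolution loss the 1D model avoids.
frames = [ecg[i:i + win] * window for i in range(0, len(ecg) - win + 1, hop)]
spec = np.log1p(np.abs(np.fft.rfft(frames, axis=1))).T  # shape: (win//2 + 1, n_frames)

print(spec.shape)  # → (65, 153)
```

Each column of `spec` summarizes a 256 ms window of signal, so fine-grained morphology such as QRS onset timing is blurred across frames, which is consistent with the accuracy gap the abstract attributes to temporal information loss.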