Regularizing and Interpreting Vision Transformer by Patch Selection on Echocardiography Data
Proceedings of the fifth Conference on Health, Inference, and Learning, PMLR 248:155-168, 2024.
Abstract
This work introduces a novel approach to model regularization and explanation in Vision Transformers (ViTs), particularly beneficial for small-scale but high-dimensional data regimes, such as in healthcare. We introduce stochastic embedded feature selection in the context of echocardiography video analysis, specifically focusing on the EchoNet-Dynamic dataset for the prediction of left ventricular ejection fraction (LVEF). Our proposed method, termed Gumbel Video Vision Transformers (G-ViTs), augments Video Vision Transformers (V-ViTs), a performant transformer architecture for videos, with Concrete Autoencoders (CAEs), a common dataset-level feature selection technique, to enhance the V-ViT's generalization and interpretability. The key contribution lies in the incorporation of stochastic token selection individually for each video frame during training. Such token selection regularizes the training of the V-ViT, improves its interpretability, and is achieved by differentiable sampling of categoricals using the Gumbel-Softmax distribution. Our experiments on EchoNet-Dynamic demonstrate a consistent and notable regularization effect. The G-ViT model outperforms both a random selection baseline and the standard V-ViT. The G-ViT is also compared against recent works on EchoNet-Dynamic, where it exhibits state-of-the-art performance among end-to-end learned methods. Finally, we explore model explainability by visualizing selected patches, providing insights into how the G-ViT utilizes regions known to be crucial for LVEF prediction for humans. This proposed approach, therefore, extends beyond regularization, offering enhanced interpretability for ViTs.
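To make the sampling step concrete, the following is a minimal sketch of relaxed categorical sampling with the Gumbel-Softmax distribution, applied to the patch tokens of a single frame. It is an illustration under stated assumptions, not the authors' implementation: the function name `gumbel_softmax_select`, the temperature `tau`, the number of selectors, and the token shapes are all hypothetical, and NumPy is used only to show the forward computation (a trainable version would rely on an autodiff framework).

```python
import numpy as np


def gumbel_softmax_select(logits, tokens, tau=1.0, rng=None):
    """Relaxed selection of patch tokens for one frame (illustrative only).

    logits: (k, n) array, one row of selection logits per selector (hypothetical layout).
    tokens: (n, d) array of patch embeddings for a single video frame.
    Returns (weights, selected): weights is (k, n) with rows summing to 1,
    selected is (k, d), a convex combination of tokens per selector.
    """
    rng = np.random.default_rng() if rng is None else rng
    # Gumbel(0, 1) noise via the inverse-CDF trick; u is kept away from 0.
    u = rng.uniform(low=1e-9, high=1.0, size=logits.shape)
    gumbel = -np.log(-np.log(u))
    # Temperature-scaled softmax over the perturbed logits.
    z = (logits + gumbel) / tau
    z = z - z.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(z)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights, weights @ tokens


# Usage: 16 selectors over 196 patches with 64-dimensional embeddings (assumed sizes).
rng = np.random.default_rng(0)
tokens = rng.normal(size=(196, 64))
logits = np.zeros((16, 196))
weights, selected = gumbel_softmax_select(logits, tokens, tau=0.5, rng=rng)
print(weights.shape, selected.shape)  # (16, 196) (16, 64)
```

Lower values of `tau` push each row of `weights` toward a one-hot vector, so every selector approaches a hard choice of one patch while the computation stays differentiable with respect to `logits`.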