On the Interpretability of Attention Networks
Proceedings of The 14th Asian Conference on Machine Learning, PMLR 189:832-847, 2023.
Abstract
Attention mechanisms form a core component of
several successful deep learning architectures, and
are based on one key idea: “The output depends only
on a small (but unknown) segment of the input.” In
several practical applications like image captioning
and language translation, this is mostly true. In
trained models with an attention mechanism, the
outputs of an intermediate module that encodes the
segment of input responsible for the output are often
used as a way to peek into the ‘reasoning’ of the
network. We make this notion precise for attention-model
architectures applied to a variant of the classification
problem that we term selective dependence classification
(SDC). Under such a
setting, we demonstrate various error modes where an
attention model can be accurate but fail to be
interpretable, and show that such models do occur as
a result of training. We illustrate situations that can
either accentuate or mitigate this behaviour. Finally, we
use our objective definition
of interpretability for SDC tasks to evaluate a few
attention model learning algorithms designed to
encourage sparsity and demonstrate that these
algorithms help improve interpretability.
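
As a hedged illustration of the idea described in the abstract (and not the paper's own architecture), the minimal Python/NumPy sketch below computes standard scaled dot-product attention for a single query over a sequence of input vectors. The intermediate attention weights, called alpha here, are the quantity that is commonly inspected to judge which segment of the input the output is attributed to. All function names, variable names, and shapes are assumptions made for this example.

    import numpy as np

    def softmax(x, axis=-1):
        # Numerically stable softmax.
        x = x - x.max(axis=axis, keepdims=True)
        e = np.exp(x)
        return e / e.sum(axis=axis, keepdims=True)

    def scaled_dot_product_attention(query, keys, values):
        """Single-query attention over a sequence (illustrative sketch).

        query:  (d,)     -- e.g. a classifier readout vector
        keys:   (T, d)   -- one key per input position
        values: (T, d_v) -- one value per input position
        Returns the attended output (d_v,) and the weights (T,),
        which are what is typically inspected for interpretability.
        """
        d = query.shape[-1]
        scores = keys @ query / np.sqrt(d)   # similarity of query to each position
        alpha = softmax(scores)              # attention weights over positions
        output = alpha @ values              # weighted sum of value vectors
        return output, alpha

    # Toy usage: which of the T=5 input positions does the model "attend" to?
    rng = np.random.default_rng(0)
    q = rng.normal(size=8)
    K = rng.normal(size=(5, 8))
    V = rng.normal(size=(5, 4))
    out, alpha = scaled_dot_product_attention(q, K, V)
    print("attention weights:", np.round(alpha, 3))  # non-negative, sum to 1

In the SDC setting described above, the question is whether weights like alpha faithfully point at the input segment that actually determines the label; the abstract's point is that a model can be accurate while such weights fail to be interpretable.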