VL-InterpreT: An Interactive Visualization Tool for Interpreting Vision-Language Transformers

  • Published 25 May 2024
  • VL-InterpreT was accepted to CVPR 2022.
    Paper: arxiv.org/abs/2203.17247
    Demo: vlinterpretenv4env-env.eba-vmh...
    VL-InterpreT provides novel interactive visualizations for interpreting the attentions and hidden representations in multimodal transformers. It is a task-agnostic and integrated tool that (1) tracks a variety of statistics in attention heads throughout all layers for both vision and language components, (2) visualizes cross-modal and intra-modal attentions through easily readable heatmaps, and (3) plots the hidden representations of vision and language tokens as they pass through the transformer layers. In this paper, we demonstrate the functionalities of VL-InterpreT by analyzing KD-VLP, an end-to-end pretrained multimodal transformer-based vision-language model, on Visual Commonsense Reasoning (VCR) and WebQA, two visual question answering benchmarks. Furthermore, we present a few interesting findings about multimodal transformer behaviors uncovered through our tool. (A minimal code sketch illustrating points (2) and (3) follows after this list.)
  • Science & Technology
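
Points (2) and (3) of the abstract can be illustrated with a short, self-contained sketch. The snippet below is not VL-InterpreT's own code and does not use KD-VLP (whose weights are not assumed available here); it uses ViLT, a publicly available single-stream vision-language transformer from HuggingFace Transformers, as a stand-in. The model name, the example image URL, the choice of layer and head, and the token-slicing logic are all assumptions for illustration.

```python
# Minimal sketch (not VL-InterpreT's code): cross-modal attention heatmap and
# a hidden-state trajectory for a single-stream vision-language transformer.
# ViLT is used as a stand-in for KD-VLP; names and indices below are assumptions.
import torch
import requests
import matplotlib.pyplot as plt
from PIL import Image
from sklearn.decomposition import PCA
from transformers import ViltProcessor, ViltModel

processor = ViltProcessor.from_pretrained("dandelin/vilt-b32-mlm")
model = ViltModel.from_pretrained("dandelin/vilt-b32-mlm")

url = "http://images.cocodataset.org/val2017/000000039769.jpg"  # example image
image = Image.open(requests.get(url, stream=True).raw)
text = "a photo of two cats on a couch"

inputs = processor(image, text, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs, output_attentions=True, output_hidden_states=True)

# Point (2): outputs.attentions is a tuple of per-layer tensors with shape
# (batch, heads, seq, seq). In ViLT the joint sequence is
# [text tokens | image tokens], so the text-to-image attention block is the
# top-right slice of each attention matrix.
num_text = inputs["input_ids"].shape[1]
layer, head = 5, 3                          # arbitrary layer/head to inspect
attn = outputs.attentions[layer][0, head]   # (seq, seq)
cross = attn[:num_text, num_text:]          # text queries -> image keys

plt.imshow(cross.numpy(), aspect="auto", cmap="viridis")
plt.xlabel("image tokens")
plt.ylabel("text tokens")
plt.title(f"Text-to-image attention, layer {layer}, head {head}")
plt.colorbar()
plt.show()

# Point (3): track one token's hidden representation across layers by
# projecting each layer's hidden state to 2-D with PCA.
token_idx = 1  # first text token after [CLS]; an assumption for illustration
states = torch.stack([h[0, token_idx] for h in outputs.hidden_states])
xy = PCA(n_components=2).fit_transform(states.numpy())
plt.figure()
plt.plot(xy[:, 0], xy[:, 1], marker="o")
for i, (x, y) in enumerate(xy):
    plt.annotate(str(i), (x, y))  # label each point with its layer index
plt.title("Hidden-state trajectory of one text token across layers (PCA)")
plt.show()
```

VL-InterpreT wraps this kind of analysis in an interactive interface; the sketch only shows how the underlying attention and hidden-state tensors can be sliced and plotted for a single example.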
