Semantics Fusion of Hierarchical Transformers for Multimodal Named Entity Recognition

Intelligent Computing Technology and Applications (ICIC), 19 May 2024

Zhao Tong, Qiang Liu, Haichao Shi, Yuwei Xia, Shu Wu & Xiao-Yu Zhang

Abstract

Multimodal Named Entity Recognition (MNER) is crucial for identifying named entities by exploiting information across different modalities. However, traditional approaches face two major challenges: (1) they struggle to learn efficiently from both textual and visual features to improve recognition performance; (2) most of them fail to consider the relevance between text and images, which can introduce visual information unrelated to the text and exert uncertain or even negative effects on the learning process of multimodal models. To address these issues, we introduce a novel method employing connection-based transformers for MNER, which establishes a direct linkage between unimodal and multimodal encoders. This approach enables the top layers of the unimodal encoders to communicate with every layer of the multimodal encoder, thereby improving representation learning through the rich semantics of hierarchical Transformers. Additionally, our method features a gating mechanism that selectively integrates visual clues, effectively filtering out irrelevant visual information and enhancing multimodal learning. Comparative evaluations on the Twitter-2015 and Twitter-2017 datasets demonstrate our method's superiority, showing performance increases of 2.71%–10.76% and 1.16%–13.79%, respectively. These improvements underscore our model's effectiveness and its advancement over current state-of-the-art models.
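
To make the gating idea concrete, the following is a minimal PyTorch sketch of one plausible form of such a visual gate; the module name, tensor shapes, and the use of cross-attention here are illustrative assumptions, not the paper's exact implementation.

import torch
import torch.nn as nn

class GatedVisualFusion(nn.Module):
    """Sketch of a sigmoid gate that decides how much visual context to
    mix into each text token representation (names/shapes illustrative)."""

    def __init__(self, hidden_dim: int, num_heads: int = 8):
        super().__init__()
        # Cross-attention from text tokens (queries) to visual patches (keys/values)
        self.cross_attn = nn.MultiheadAttention(hidden_dim, num_heads, batch_first=True)
        # Gate computed from the concatenated text and attended-visual states
        self.gate = nn.Linear(2 * hidden_dim, hidden_dim)

    def forward(self, text_states: torch.Tensor, visual_states: torch.Tensor) -> torch.Tensor:
        # text_states:   (batch, seq_len, hidden_dim) from the text encoder
        # visual_states: (batch, n_patches, hidden_dim) from the vision encoder
        attended, _ = self.cross_attn(text_states, visual_states, visual_states)
        g = torch.sigmoid(self.gate(torch.cat([text_states, attended], dim=-1)))
        # g near 0 suppresses irrelevant visual clues; g near 1 lets them through
        return text_states + g * attended

# Example usage with BERT-sized text states and ViT-style patch features:
# fusion = GatedVisualFusion(hidden_dim=768)
# fused = fusion(torch.randn(2, 32, 768), torch.randn(2, 49, 768))

A gate of this form drives the mixing weight toward zero for tokens whose visual context is uninformative, which is one standard way to realize the filtering of irrelevant visual data described above.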