VTKG: A Vision Transformer Model with Integration of Knowledge Graph for Enhanced Image Captioning

Yugandhara A. Thakare, K. H. Walse, Mohammad Atique

Abstract

The Transformer model has exhibited impressive results in machine translation tasks. In this research, we utilize the Transformer model to improve the performance of image captioning. We tackle the image captioning task from a novel sequence-to-sequence perspective and present VTKG, a Vision Transformer model with an integrated Knowledge Graph: a full Transformer network that replaces the CNN in the encoder with a convolution-free Transformer encoder. To generate more meaningful captions and address the issue of mispredictions, we further introduce a novel approach for integrating common-sense knowledge extracted from a knowledge graph, which significantly improves the overall adaptability of our captioning model. By combining these strategies, we attain strong performance on multiple established evaluation metrics, outperforming existing benchmarks. Experimental results demonstrate improvements of 1.32%, 1.7%, 1.25%, 1.14%, 2.8%, and 2.5% in BLEU-1, BLEU-2, BLEU-4, METEOR, ROUGE-L, and CIDEr scores, respectively, compared to state-of-the-art methods.
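The sketch below illustrates, in PyTorch, one plausible reading of the architecture the abstract describes: a convolution-free Transformer encoder over image patch embeddings, fused with knowledge-graph concept embeddings, feeding a standard Transformer decoder that generates the caption. All module names, dimensions, and the concatenation-based fusion (VTKGSketch, d_model=512, num_concepts, etc.) are illustrative assumptions, not the authors' implementation.

import torch
import torch.nn as nn

class VTKGSketch(nn.Module):
    def __init__(self, d_model=512, patch_dim=768, num_patches=196,
                 vocab_size=10000, num_concepts=5000):
        super().__init__()
        # Patch projection + positional embeddings replace the CNN feature
        # extractor (convolution-free encoder).
        self.patch_proj = nn.Linear(patch_dim, d_model)
        self.pos_emb = nn.Parameter(torch.zeros(1, num_patches, d_model))
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True),
            num_layers=6)
        # Embeddings for common-sense concepts retrieved from a knowledge graph.
        self.kg_emb = nn.Embedding(num_concepts, d_model)
        # Transformer decoder generates the caption autoregressively.
        self.tok_emb = nn.Embedding(vocab_size, d_model)
        self.decoder = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(d_model, nhead=8, batch_first=True),
            num_layers=6)
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, patches, kg_concept_ids, caption_ids):
        # patches: (B, num_patches, patch_dim); kg_concept_ids: (B, K);
        # caption_ids: (B, T)
        vis = self.encoder(self.patch_proj(patches) + self.pos_emb)
        kg = self.kg_emb(kg_concept_ids)
        # Fuse visual tokens and KG concept embeddings into one memory sequence.
        memory = torch.cat([vis, kg], dim=1)
        tgt = self.tok_emb(caption_ids)
        T = caption_ids.size(1)
        causal_mask = torch.triu(torch.full((T, T), float('-inf')), diagonal=1)
        out = self.decoder(tgt, memory, tgt_mask=causal_mask)
        return self.lm_head(out)

model = VTKGSketch()
logits = model(torch.randn(2, 196, 768),
               torch.randint(0, 5000, (2, 10)),
               torch.randint(0, 10000, (2, 12)))
print(logits.shape)  # torch.Size([2, 12, 10000])

In this reading, knowledge-graph fusion is done simply by concatenating concept embeddings to the visual memory that the decoder attends over; the paper's actual fusion mechanism may differ.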

Article Details

How to Cite
Yugandhara A. Thakare, et al. (2023). VTKG: A Vision Transformer Model with Integration of Knowledge Graph for Enhanced Image Captioning. International Journal on Recent and Innovation Trends in Computing and Communication, 11(9), 889–896. https://doi.org/10.17762/ijritcc.v11i9.8981