Transforming Image Captioning: Refining Models with Advanced Encoder-Decoder Architecture and Attention Mechanism

Vikash Kumar Singh, Ankita Gandhi, Brijesh Vala

Abstract

Image captioning is the task of generating textual descriptions of the content of an image. It has extensive utility in diverse applications, including the analysis of large unlabeled image datasets, the discovery of hidden patterns to support machine learning applications, the guidance of self-driving vehicles, and the development of software that aids visually impaired individuals. Image captioning relies heavily on deep learning models, which have greatly simplified the generation of captions for images. This paper focuses on an encoder-decoder model with an attention mechanism for image captioning. In classic image captioning models, the generated words usually describe only a part of the image; with an attention mechanism, however, both the low-level and high-level features of the image receive dedicated attention. Using a stable dataset and an improved encoder-decoder model, it is possible to generate captions that accurately describe an image, with a CIDEr score 16.52% higher than that of established models.
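The attention step described above can be sketched minimally: at each decoding step, the decoder's hidden state scores every encoder feature (e.g., CNN region features), a softmax turns the scores into weights, and the weighted sum forms a context vector that conditions the next word. The sketch below uses additive (Bahdanau-style) attention with random toy weights; all dimensions and parameter names are illustrative assumptions, not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions (assumptions for illustration only):
L, D, H = 49, 256, 512   # 49 image regions, feature dim 256, decoder hidden dim 512

features = rng.standard_normal((L, D))   # encoder (CNN) region features
hidden = rng.standard_normal(H)          # current decoder (RNN) hidden state

# Additive attention parameters (randomly initialized here; learned in practice)
W_f = rng.standard_normal((D, H)) * 0.01
W_h = rng.standard_normal((H, H)) * 0.01
v = rng.standard_normal(H) * 0.01

def attention(features, hidden):
    # Score each image region against the current decoder state.
    scores = np.tanh(features @ W_f + hidden @ W_h) @ v   # shape (L,)
    # Softmax over regions gives the attention distribution.
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    # Context vector: attention-weighted sum of region features.
    context = weights @ features                          # shape (D,)
    return context, weights

context, weights = attention(features, hidden)
print(context.shape, weights.shape)
```

The context vector would then be concatenated with the word embedding as decoder input at each step, so different words in the caption can attend to different image regions.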

Article Details

How to Cite
Singh, V. K., Gandhi, A., & Vala, B. (2024). Transforming Image Captioning: Refining Models with Advanced Encoder-Decoder Architecture and Attention Mechanism. International Journal on Recent and Innovation Trends in Computing and Communication, 12(2), 251–261. Retrieved from https://ijritcc.org/index.php/ijritcc/article/view/10562