Transforming Image Captioning: Refining Models with Advanced Encoder-Decoder Architecture and Attention Mechanism

Vikash Kumar Singh, Ankita Gandhi, Brijesh Vala

Abstract

Image captioning is the task of generating textual descriptions of the content of an image. It has extensive utility in diverse applications, including the analysis of large unlabeled image datasets, the discovery of hidden patterns to support machine learning applications, the guidance of self-driving vehicles, and the development of software that aids visually impaired individuals. Image captioning relies heavily on deep learning models, which have greatly simplified the generation of captions for images. This paper focuses on an encoder-decoder model with an attention mechanism for image captioning. In classic image captioning models, the generated words usually describe only a part of the image; with an attention mechanism, however, both the low-level and high-level features of the image receive dedicated attention. Using a stable dataset and an improved encoder-decoder model, it is possible to generate captions that accurately describe an image, with a CIDEr score 16.52% higher than that of established models.
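The attention step described above can be sketched minimally: at each decoding step, the decoder's hidden state scores every encoder feature (e.g., CNN region features), a softmax turns the scores into weights, and the weighted sum forms a context vector that conditions the next word. The sketch below uses additive (Bahdanau-style) attention with random toy weights; all dimensions and parameter names are illustrative assumptions, not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions (assumptions for illustration only):
L, D, H = 49, 256, 512   # 49 image regions, feature dim 256, decoder hidden dim 512

features = rng.standard_normal((L, D))   # encoder (CNN) region features
hidden = rng.standard_normal(H)          # current decoder (RNN) hidden state

# Additive attention parameters (randomly initialized here; learned in practice)
W_f = rng.standard_normal((D, H)) * 0.01
W_h = rng.standard_normal((H, H)) * 0.01
v = rng.standard_normal(H) * 0.01

def attention(features, hidden):
    # Score each image region against the current decoder state.
    scores = np.tanh(features @ W_f + hidden @ W_h) @ v   # shape (L,)
    # Softmax over regions gives the attention distribution.
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    # Context vector: attention-weighted sum of region features.
    context = weights @ features                          # shape (D,)
    return context, weights

context, weights = attention(features, hidden)
print(context.shape, weights.shape)
```

The context vector would then be concatenated with the word embedding as decoder input at each step, so different words in the caption can attend to different image regions.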

Article Details

How to Cite
Singh, V. K., Gandhi, A., & Vala, B. (2024). Transforming Image Captioning: Refining Models with Advanced Encoder-Decoder Architecture and Attention Mechanism. International Journal on Recent and Innovation Trends in Computing and Communication, 12(2), 251–261. Retrieved from https://ijritcc.org/index.php/ijritcc/article/view/10562