Transforming Image Captioning: Refining Models with Advanced Encoder-Decoder Architecture and Attention Mechanism
Abstract
Image captioning involves generating textual descriptions of the content of an image. This task has extensive utility in diverse applications, including analyzing large unlabeled image datasets, uncovering concealed patterns to facilitate machine learning applications, guiding self-driving vehicles, and developing software to aid visually impaired individuals. Image captioning relies heavily on deep learning models, which have greatly simplified the task of generating captions for images. This paper focuses on an encoder-decoder model with an attention mechanism for image captioning. In a classic image captioning model, the generated words usually describe only a part of the image; with an attention mechanism, however, special attention is given to both the low-level and high-level features of the image. With the use of a stable dataset and improved encoder-decoder modelling, it is possible to generate captions that accurately describe an image, with a CIDEr score 16.52% higher than that of established models.
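To make the core idea concrete, the sketch below shows a minimal soft (Bahdanau-style) attention module of the kind commonly used in encoder-decoder captioning models: at each decoding step, the decoder's hidden state is scored against every spatial region of the CNN feature map, and the resulting weights form a context vector for predicting the next word. The class name, dimensions, and layer choices here are illustrative assumptions, not the authors' exact implementation.

```python
# Minimal sketch of soft attention for captioning (assumed, not the paper's code).
import torch
import torch.nn as nn

class SoftAttention(nn.Module):
    def __init__(self, encoder_dim, decoder_dim, attn_dim):
        super().__init__()
        self.enc_proj = nn.Linear(encoder_dim, attn_dim)  # project image region features
        self.dec_proj = nn.Linear(decoder_dim, attn_dim)  # project decoder hidden state
        self.score = nn.Linear(attn_dim, 1)               # scalar alignment score per region

    def forward(self, features, hidden):
        # features: (batch, num_regions, encoder_dim) -- CNN feature map regions
        # hidden:   (batch, decoder_dim)              -- current decoder state
        scores = self.score(torch.tanh(
            self.enc_proj(features) + self.dec_proj(hidden).unsqueeze(1)
        )).squeeze(-1)                                        # (batch, num_regions)
        alpha = torch.softmax(scores, dim=1)                  # attention weights over regions
        context = (alpha.unsqueeze(-1) * features).sum(dim=1)  # weighted sum of features
        return context, alpha

# Example: attend over 196 (14x14) regions of 512-d CNN features.
attn = SoftAttention(encoder_dim=512, decoder_dim=256, attn_dim=128)
context, alpha = attn(torch.randn(4, 196, 512), torch.randn(4, 256))
print(context.shape, alpha.shape)  # torch.Size([4, 512]) torch.Size([4, 196])
```

Because the weights `alpha` are recomputed at every time step, the decoder can shift its focus across image regions as the caption unfolds, which is what lets attention-based models cover both low-level and high-level image features rather than describing only a single part of the image.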