Feature Fusion Based on Neural Image Captioning with Spatial Attention
Abstract— Generating a natural language description of an image is a challenging but meaningful task. It combines two significant fields of artificial intelligence: computer vision and natural language processing. The task is valuable for many applications, such as image search and assisting visually impaired people in perceiving the world. Most approaches adopt an encoder-decoder framework, and many subsequent methods build on this framework. In these methods, image features are extracted by the VGG network or other networks, but important information can be lost from the feature map during processing. In this paper, we fuse two kinds of image features extracted by two networks, VGG19 and ResNet50, and feed the fused features into the neural network for training. We also add a spatial attention mechanism to a basic neural encoder-decoder model for generating natural-sentence descriptions: at each time step, our model attends to the image features and picks up the most meaningful parts to generate captions. We test our model on the benchmark dataset IAPR TC-12 and, by comparison with other methods, show that our model achieves state-of-the-art performance.
Index Terms— Image captioning, feature fusion, encoder-decoder framework, attention
Qingqing Lu, Xin Kang, Fuji Ren
Faculty of Engineering, Tokushima University 2-1 Minami Josanjima, JAPAN
Qingqing Lu, Xiaomei Zhang
School of Information Science and Technology, Nantong University, CHINA
Cite: Qingqing Lu, Xiaomei Zhang, Xin Kang, Fuji Ren, "Feature Fusion Based on Neural Image Captioning with Spatial Attention," Proceedings of 2019 the 9th International Workshop on Computer Science and Engineering, pp. 195-200, Hong Kong, 15-17 June, 2019.