Automatically generating captions for images has emerged as a prominent interdisciplinary research problem in both academia and industry. It can aid visually impaired users and make it easier to organize and navigate large amounts of typically unstructured visual data. In order to generate high-quality captions, the model needs to incorporate fine-grained visual cues from the image. Recently, visual attention-based neural encoder-decoder models have been explored, where the attention mechanism typically produces a spatial map highlighting the image regions relevant to each generated word.
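To make this mechanism concrete, the following is a minimal PyTorch-style sketch of soft spatial attention over region features; the layer names, dimensions, and the 7x7 feature grid are illustrative assumptions, not the exact architecture used in the paper.

```python
# Minimal sketch of soft spatial attention over image regions.
# All shapes and layer names are hypothetical.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpatialAttention(nn.Module):
    def __init__(self, feat_dim, hidden_dim, attn_dim):
        super().__init__()
        self.proj_v = nn.Linear(feat_dim, attn_dim)    # project region features
        self.proj_h = nn.Linear(hidden_dim, attn_dim)  # project decoder state
        self.score = nn.Linear(attn_dim, 1)            # scalar score per region

    def forward(self, regions, h):
        # regions: (B, k, feat_dim) spatial features; h: (B, hidden_dim) decoder state
        e = self.score(torch.tanh(self.proj_v(regions) + self.proj_h(h).unsqueeze(1)))
        alpha = F.softmax(e.squeeze(-1), dim=1)           # attention map over k regions
        context = (alpha.unsqueeze(-1) * regions).sum(1)  # weighted sum = visual context
        return context, alpha

# Usage: the context vector feeds the decoder when predicting the next word.
attn = SpatialAttention(feat_dim=2048, hidden_dim=512, attn_dim=256)
regions = torch.randn(2, 49, 2048)   # e.g. a 7x7 CNN feature grid, flattened
h = torch.randn(2, 512)
context, alpha = attn(regions, h)    # alpha highlights the relevant regions
```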
Most attention models for image captioning attend to the image at every time step, irrespective of which word is going to be emitted next. However, not all words in a caption have corresponding visual signals. Consider, for example, an image whose generated caption is "A white bird perched on top of a red stop sign". The words "a" and "of" have no canonical visual signal. Moreover, language correlations make the visual signal unnecessary when generating words like "on" and "top" following "perched", or "sign" following "a red stop". In fact, gradients from non-visual words can mislead the model and diminish the overall effectiveness of the visual signal in guiding caption generation.
In this paper, we propose a novel adaptive attention model with a visual sentinel. At each time step, our model decides whether to attend to the image (and if so, to which regions) or to the visual sentinel, so as to extract the most meaningful information for sequential word generation. We test our method on the COCO image captioning 2015 challenge dataset and Flickr30K. Our approach sets a new state of the art by a significant margin.
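The core idea can be sketched as follows, assuming a PyTorch-style decoder: a sentinel vector is gated from the decoder's memory and then competes with the image regions for attention, and the resulting mixing weight indicates how much the decoder relies on the image versus its own language-model state. Names, shapes, and the gating form are assumptions for illustration, not the authors' released implementation.

```python
# Hedged sketch of adaptive attention with a visual sentinel.
# Layer names and dimensions are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class AdaptiveAttention(nn.Module):
    def __init__(self, in_dim, feat_dim, hidden_dim, attn_dim):
        super().__init__()
        self.gate_x = nn.Linear(in_dim, hidden_dim)      # sentinel gate from LSTM input
        self.gate_h = nn.Linear(hidden_dim, hidden_dim)  # sentinel gate from previous state
        self.proj_v = nn.Linear(feat_dim, attn_dim)      # project region features
        self.proj_s = nn.Linear(hidden_dim, attn_dim)    # project the sentinel
        self.proj_h = nn.Linear(hidden_dim, attn_dim)    # project current decoder state
        self.score = nn.Linear(attn_dim, 1)
        self.to_feat = nn.Linear(hidden_dim, feat_dim)   # map sentinel into feature space

    def forward(self, regions, x, h_prev, h, mem):
        # regions: (B, k, feat_dim); x: (B, in_dim) decoder input;
        # h_prev, h, mem: (B, hidden_dim) previous/current hidden state and memory cell.
        # 1) Visual sentinel: a gated view of what the decoder already "knows".
        g = torch.sigmoid(self.gate_x(x) + self.gate_h(h_prev))
        sentinel = g * torch.tanh(mem)

        # 2) Attention over the k regions plus the sentinel as an extra candidate.
        q = self.proj_h(h).unsqueeze(1)                                       # (B, 1, attn_dim)
        e_img = self.score(torch.tanh(self.proj_v(regions) + q)).squeeze(-1)  # (B, k)
        e_sen = self.score(torch.tanh(self.proj_s(sentinel).unsqueeze(1) + q)).squeeze(-1)  # (B, 1)
        alpha = F.softmax(torch.cat([e_img, e_sen], dim=1), dim=1)

        # 3) beta is the weight on the sentinel: "don't look" vs. attend to the image.
        beta = alpha[:, -1:]
        visual_ctx = (alpha[:, :-1].unsqueeze(-1) * regions).sum(1)
        adaptive_ctx = beta * self.to_feat(sentinel) + (1 - beta) * visual_ctx
        return adaptive_ctx, beta
```

In this sketch, a beta close to 1 means the decoder is relying on its language-model state (e.g., for words like "of" or "the"), while a beta close to 0 means the attended image regions dominate the prediction.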
Jiasen Lu^, Caiming Xiong^, Devi Parikh, and Richard Socher. 2016.
Knowing When to Look: Adaptive Attention via A Visual Sentinel for Image Captioning.
(^ equal contribution)