What technology area does this patent fall under?

Primary CPC classification G06V10/82. Mapped technology areas include Physics.

When was this patent published?

Published Feb 8, 2022 (kind code B2). Legal status and post-grant events are not shown on this page.

What related patents are in patentsdb?

We list 12 related publications on this page (citations in our corpus or others sharing the same primary CPC).

Adaptive attention model for image captioning

US11244111B2 · US · B2

Patent metadata
Field	Value
Publication number	US-11244111-B2
Application number	US-201916668333-A
Country	US
Kind code	B2
Filing date	Oct 30, 2019
Priority date	Nov 18, 2016
Publication date	Feb 8, 2022
Grant date	Feb 8, 2022

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

Title: What the patent document calls the invention.
Abstract: A short plain-language summary of the technical disclosure.
Assignees and inventors: Who owns or filed the patent and who is credited as inventor.
Key dates: Filing, priority, publication, and grant dates set the timeline.
First independent claim: The legal scope of protection; read this for what is actually claimed.
CPC / IPC classifications: Technology tags used to group this patent with similar filings.
Citations and related patents: Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

The technology disclosed presents a novel spatial attention model that uses current hidden state information of a decoder long short-term memory (LSTM) to guide attention and to extract spatial image features for use in image captioning. The technology disclosed also presents a novel adaptive attention model for image captioning that mixes visual information from a convolutional neural network (CNN) and linguistic information from an LSTM. At each timestep, the adaptive attention model automatically decides how heavily to rely on the image, as opposed to the linguistic model, to emit the next caption word. The technology disclosed further adds a new auxiliary sentinel gate to an LSTM architecture and produces a sentinel LSTM (Sn-LSTM). The sentinel gate produces a visual sentinel at each timestep, which is an additional representation, derived from the LSTM's memory, of long and short term visual and linguistic information.
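
The abstract does not give the sentinel gate's equations. As a rough illustration only, the Python sketch below assumes a commonly used gated form: a sigmoid gate computed from the current input and the previous hidden state, applied element-wise to the tanh of the LSTM memory cell. The function name `visual_sentinel`, the weight matrices `W_x` and `W_h`, and the formula itself are assumptions made for this example, not text taken from this page.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def matvec(W, v):
    # Multiply a matrix (list of rows) by a vector.
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def visual_sentinel(W_x, W_h, x_t, h_prev, m_t):
    """Assumed form: g_t = sigmoid(W_x x_t + W_h h_prev); s_t = g_t * tanh(m_t)."""
    pre = [a + b for a, b in zip(matvec(W_x, x_t), matvec(W_h, h_prev))]
    gate = [sigmoid(p) for p in pre]
    # The gate decides, per dimension, how much of the memory cell is exposed.
    return [g * math.tanh(m) for g, m in zip(gate, m_t)]

# Toy usage with hypothetical 2-dimensional states and all-zero weights.
zeros = [[0.0, 0.0], [0.0, 0.0]]
s_t = visual_sentinel(zeros, zeros, x_t=[1.0, 2.0], h_prev=[0.5, -0.5], m_t=[0.0, 2.0])
```

With all-zero weights the gate is 0.5 in every dimension, so the sentinel is half of `tanh(m_t)`; trained weights would open or close the gate depending on the input word and the previous hidden state.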

First claim

Opening claim text (preview).

What is claimed is:

1. A system comprising: an encoder for processing an input image to generate encoded image features; a decoder for processing a previously emitted caption word combined with the encoded image features to produce, at each decoder iteration, a current hidden state of the decoder and a visual sentinel; an adaptive attender for: attending to the encoded image features at each decoder iteration to produce an image context conditioned on the current hidden state of the decoder; and mixing the image context and the visual sentinel to produce an adaptive context at each decoder iteration; and an emitter for generating a natural language caption for the input image based on the adaptive contexts produced over successive decoder iterations.

2. The system of claim 1, wherein the encoder comprises a convolutional neural network (CNN).

3. The system of claim 1, wherein the decoder comprises a long short-term memory network (LSTM).

4. The system of claim 1, wherein the adaptive attender enhances attention directed to the image context when a next caption word is a visual word.

5. The system of claim 1, wherein the adaptive attender enhances attention directed to the visual sentinel when a next caption word is a non-visual word or linguistically correlated to the previously emitted caption word.

6. The system of claim 1, further configured to prevent, during training, backpropagation of gradients from the decoder to the encoder when a next caption word is a non-visual word or linguistically correlated to the previously emitted caption word.

7. The system of claim 1, wherein the visual sentinel includes visual context determined from previously processed image features and textual context determined from previously emitted caption words.

8. A method comprising: processing an input image with an encoder to generate encoded image features; processing with a decoder a previously emitted caption word combined with the encoded image features to produce, at each decoder iteration, a current hidden state of the decoder and a visual sentinel; attending to the encoded image features at each decoder iteration to produce an image context conditioned on the current hidden state of the decoder; and mixing the image context and the visual sentinel to produce an adaptive context at each decoder iteration; and generating a natural language caption for the input image with an emitter based on the adaptive contexts produced over successive decoder iterations.

9. The method of claim 8, wherein the encoder comprises a convolutional neural network (CNN).

10. The system of claim 8, wherein the decoder comprises a long short-term memory network (LSTM).

11. The method of claim 8, comprising enhancing attention directed to the image context when a next caption word is a visual word.

12. The method of claim 8, comprising enhancing attention directed to the visual sentinel when a next caption word is a non-visual word or linguistically correlated to the previously emitted caption word.

13. The method of claim 8, comprising preventing, during training, backpropagation of gradients from the decoder to the encoder when a next caption word is a non-visual word or linguistically correlated to the previously emitted caption word.

14. The method of claim 8, wherein the visual sentinel includes visual context determined from previously processed image features and textual context determined from previously emitted caption words.

15. A non-transitory machine-readable medium comprising a plurality of machine-readable instructions which, when executed by one or more processors, are adapted to cause the one or more processors to perform a method comprising: processing an input image with an encoder to generate encoded image features; processing with a decoder a previously emitted caption word combined with the encoded image features to produce, at each decoder iteration, a current hidden state of the decoder and a visual sentinel; attending to the encoded image features at each decoder iteration to produce an image context conditioned on the current hidden state of the decoder; and mixing the image context and the visual sentinel to produce an adaptive context at each decoder iteration; and generating a natural language caption for the input image with an emitter based on the adaptive contexts produced over successive decoder iterations.

16. The non-transitory machine-readable medium of claim 15, wherein the encoder comprises a convolutional neural network (CNN).

17. The non-transitory machine-readable medium of claim 15, wherein the decoder comprises a long short-term memory network (LSTM).

18. The non-transitory machine-readable medium of claim 15, comprising enhancing attention directed to the image context when a next caption word is a visual word.

19. The non-transitory machine-readable medium of claim 15, comprising enhancing attention directed to the visual sentinel when a next caption word is a non-visual word or linguistically correlated to the previously emitted caption word.

20. The non-transitory machine-readable medium of claim 15, comprising preventing, during training, backpropagation of gradients from the decoder to the encoder when a next caption word is a non-visual word or linguistically correlated to the previously emitted caption word.

21. The non-transitory machine-readable medium of claim 15, wherein the visual sentinel includes visual context determined from previously processed image features and textual context determined from previously emitted caption words.
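
The attend-and-mix step recited in the independent claims can be pictured with a small Python sketch. It is a simplified illustration under stated assumptions, not the claimed implementation: dot-product scoring against the hidden state stands in for whatever learned scoring network the patent describes, and the names `adaptive_context`, `regions`, `sentinel`, and `hidden` are hypothetical. The mixing weight `beta` is taken as the sentinel's share of a softmax over the region scores plus the sentinel score.

```python
import math

def softmax(scores):
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def adaptive_context(regions, sentinel, hidden):
    """Return (adaptive context, beta) for one decoder iteration."""
    # Assumption: dot-product scores conditioned on the current hidden state.
    region_scores = [dot(v, hidden) for v in regions]
    sentinel_score = dot(sentinel, hidden)
    # Image context: attention-weighted sum of the encoded image features.
    alpha = softmax(region_scores)
    image_context = [sum(a * v[d] for a, v in zip(alpha, regions))
                     for d in range(len(hidden))]
    # beta: weight on the sentinel when it competes with the regions.
    beta = softmax(region_scores + [sentinel_score])[-1]
    mixed = [beta * s + (1.0 - beta) * c for s, c in zip(sentinel, image_context)]
    return mixed, beta

# Toy usage: two hypothetical image regions and a 2-dimensional hidden state.
regions = [[1.0, 0.0], [0.0, 1.0]]
sentinel = [1.0, 1.0]
context, beta = adaptive_context(regions, sentinel, hidden=[0.0, 0.0])
```

In this sketch a `beta` near 1 means the step leans on the visual sentinel, which corresponds to the non-visual-word case in claims 5, 12, and 19, while a `beta` near 0 means it leans on the image context, as in claims 4, 11, and 18.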

Assignees

Salesforce Com Inc

Inventors

Classifications

G06V10/82 (Primary): using neural networks
G06V30/19173: Classification techniques
G06F40/274 (Primary): Converting codes to words; Guess-ahead of partial word inputs
G06F18/24133: Distances to prototypes
G06N3/045: Combinations of networks

Patent family

Related publications grouped by family.

Patent family ID: 62147067

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US11244111B2 cover?: The technology disclosed presents a novel spatial attention model that uses current hidden state information of a decoder long short-term memory (LSTM) to guide attention and to extract spatial image features for use in image captioning. The technology disclosed also presents a novel adaptive attention model for image captioning that mixes visual information from a convolutional neural network …
Who is the assignee on this patent?: Salesforce Com Inc