Image Captioning with Weak Supervision
US-2017200065-A1 · Jul 13, 2017 · US
US9792534B2 · US · B2
| Field | Value |
|---|---|
| Publication number | US-9792534-B2 |
| Application number | US-201614995042-A |
| Country | US |
| Kind code | B2 |
| Filing date | Jan 13, 2016 |
| Priority date | Jan 13, 2016 |
| Publication date | Oct 17, 2017 |
| Grant date | Oct 17, 2017 |
Official abstract text for this publication.
Techniques for image captioning with word vector representations are described. In implementations, instead of outputting results of caption analysis directly, the framework is adapted to output points in a semantic word vector space. These word vector representations reflect distance values in the context of the semantic word vector space. In this approach, words are mapped into a vector space and the results of caption analysis are expressed as points in the vector space that capture semantics between words. In the vector space, similar concepts will have small distance values. The word vectors are not tied to particular words or a single dictionary. A post-processing step is employed to map the points to words and convert the word vector representations to captions. Accordingly, conversion is delayed to a later stage in the process.
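The post-processing step described in the abstract can be illustrated with a minimal Python sketch. It is not taken from the patent: the 2-D embeddings, the dictionary names `GENERAL` and `SPECIALIST`, the helper names, and the choice of cosine distance are all assumptions made for illustration. The sketch only shows how the same predicted points could be converted to different captions depending on which dictionary is selected afterwards.

```python
import math


def cosine_distance(u, v):
    """Return 1 - cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def decode(points, dictionary):
    """Map each predicted point to the nearest word in the selected dictionary."""
    return [
        min(dictionary, key=lambda word: cosine_distance(point, dictionary[word]))
        for point in points
    ]


# Hypothetical toy embeddings; a real system would use a learned vector space.
GENERAL = {"dog": (1.0, 0.1), "runs": (0.1, 1.0), "grass": (0.7, 0.7)}
SPECIALIST = {"terrier": (1.0, 0.2), "sprints": (0.2, 1.0), "lawn": (0.7, 0.6)}

# Stand-in for the points a caption generator might output for one image.
points = [(0.9, 0.15), (0.15, 0.95), (0.72, 0.66)]

general_caption = decode(points, GENERAL)
specialist_caption = decode(points, SPECIALIST)
```

Because the points are produced before any dictionary is chosen, swapping `GENERAL` for `SPECIALIST` changes the caption without rerunning the caption generator, which is the benefit the abstract attributes to delaying conversion.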
Opening claim text (preview).
What is claimed is:
1. In a digital media environment to facilitate management of image collections using one or more computing devices, a method to automatically generate image captions using word vector representations comprising: obtaining a target image for caption analysis; applying feature extraction to the target image to generate attributes corresponding to the image; supplying the attributes to a caption generator to initiate caption generation; and outputting by the caption generator a word vector in a semantic word vector space indicative of semantic relationships between words in sentences formed as a combination of the attributes, the word vector usable to generate a corresponding caption.
2. The method as described in claim 1, further comprising converting the word vector into a caption for the target image as a post-processing operation.
3. The method as described in claim 2, wherein converting the word vector into a caption for the target image comprises selecting a dictionary and mapping the word vector to words in the semantic word vector space based on the selected dictionary.
4. The method as described in claim 1, wherein the caption generator is configured to generate word vectors as intermediate results of caption analysis.
5. The method of claim 1, wherein the feature extraction is implemented using a pre-trained convolution neural network (CNN) to encode the image with keywords indicative of the attributes.
6. The method of claim 1, wherein supplying the attributes to a caption generator to initiate caption generation comprises providing the attributes to a recurrent neural network (RNN) designed to implement language modeling and sentence construction techniques for generating a caption for the target image.
7. The method of claim 6, wherein an objective function implemented by the RNN is adapted to consider distances in the semantic word vector space instead of probability distributions for word sequences.
8. The method of claim 6, wherein word vector conversion is delayed to a post-processing operation performed after operations of the RNN occur to output the word vector.
9. The method of claim 6, wherein the word vector conversion occurs in the context of a dictionary selected outside of the caption analysis performed via the RNN.
10. The method of claim 1, wherein the word vector is usable to generate a corresponding caption with multiple different dictionaries selected after the word vector is generated.
11. In a digital media environment to facilitate access to collections of images using one or more computing devices, a system comprising; one or more processing devices; one or more computer-readable media storing instructions executable via the one or more processing devices to implement a caption generator configured to perform operations to automatically generate image captions using word vector representations including: obtaining a target image for caption analysis; applying feature extraction to the target image to generate attributes corresponding to the image; supplying the attributes to the caption generator to initiate caption generation; outputting by the caption generator a word vector in a semantic word vector space indicative of semantic relationships between words in sentences formed as a combination of the attributes; and subsequently using the word vector in post-processing operations to generate a corresponding caption by: selecting a dictionary; and mapping the word vector to words in the semantic word vector space based on the selected dictionary.
12. A system as recited in claim 11, wherein outputting the word vector in the semantic word vector space enables changing of the selected dictionary for different contexts.
13. A system as recited in claim 11, wherein the feature extraction is implemented using a pre-trained convolution neural network (CNN) to encode the image with keywords indicative of the attributes.
14. A system as recited in claim 11, wherein supplying the attributes to a caption generator to initiate caption generation comprises providing the attributes to a recurrent neural network (RNN) designed to implement language modeling and sentence construction techniques for generating a caption for the target image.
15. A system as recited in claim 14, wherein an objective function implemented by the RNN is adapted to consider distances in the semantic word vector space instead of probability distributions for word sequences.
16. One or more non-transitory computer-readable storage media storing instructions executable via the one or more processing devices to implement a caption generator configured to perform operations to automatically generate image captions using word vector representations including: obtaining a target image for caption analysis; applying feature extraction to the target image to generate attributes corresponding to the image; supplying the attributes to the caption generator to initiate caption generation; outputting by the caption generator a word vector in a semantic word vector space indicative of semantic relationships between words in sentences formed as a combination of the attributes; and subsequently using the word vector in post-processing operations to generate a corresponding caption by: selecting a dictionary; and mapping the word vector to words in the semantic word vector space based on the selected dictionary.
17. One or more non-transitory computer-readable storage media as recited in claim 16, wherein outputting the word vector in the semantic word vector space enables changing of the selected dictionary for different contexts.
18. One or more non-transitory computer-readable storage media as recited in claim 16, wherein the feature extraction is implemented using a pre-trained convolution neural network (CNN) to encode the image with keywords indicative of the attributes.
19. One or more non-transitory computer-readable storage media as recited in claim 16, wherein supplying the attributes to a caption generator to initiate caption generation comprises providing the attributes to a recurrent neural network (RNN) designed to implement language modeling and sentence construction techniques for generating a caption for the target image.
20. One or more non-transitory computer-readable storage media as recited in claim 19, wherein an objective function implemented by the RNN is adapted to consider distances in the semantic word vector space instead of probability distributions for word sequences.
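Claims 7, 15, and 20 recite an objective function that considers distances in the semantic word vector space instead of probability distributions for word sequences. The claims do not specify the distance measure, so the following Python sketch is only one possible reading: it assumes mean squared Euclidean distance, and the function names and toy vectors are hypothetical. A conventional cross-entropy loss is included purely for contrast.

```python
import math


def distance_loss(predicted, reference):
    """Mean squared Euclidean distance between predicted and reference points.

    Assumed form of a distance-based objective; not specified by the claims.
    """
    total = 0.0
    for p, r in zip(predicted, reference):
        total += sum((a - b) ** 2 for a, b in zip(p, r))
    return total / len(predicted)


def cross_entropy_loss(prob_rows, target_ids):
    """Conventional baseline: mean negative log-probability of each reference word."""
    return -sum(math.log(row[i]) for row, i in zip(prob_rows, target_ids)) / len(target_ids)


# Hypothetical 2-D reference vectors for a two-word caption.
reference = [(1.0, 0.0), (0.0, 1.0)]

# A prediction close to the reference points, and one far from them.
near = [(0.9, 0.1), (0.1, 0.9)]
far = [(0.0, 1.0), (1.0, 0.0)]

near_loss = distance_loss(near, reference)
far_loss = distance_loss(far, reference)
```

Under this assumed objective, a prediction that lands near the reference word in the vector space is penalized only slightly, whereas a cross-entropy loss over a fixed vocabulary treats every word other than the reference as equally wrong.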
Labelling scene content, e.g. deriving syntactic or semantic representations · CPC title
using neural networks · CPC title
using classification, e.g. of video objects · CPC title
based on the proximity to a decision surface, e.g. support vector machines · CPC title
Combinations of networks · CPC title