Image captioning with weak supervision

US9811765B2 · US · B2

Patent metadata
Field               Value
Publication number  US-9811765-B2
Application number  US-201614995032-A
Country             US
Kind code           B2
Filing date         Jan 13, 2016
Priority date       Jan 13, 2016
Publication date    Nov 7, 2017
Grant date          Nov 7, 2017

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

Techniques for image captioning with weak supervision are described herein. In implementations, weak supervision data regarding a target image is obtained and utilized to provide detail information that supplements global image concepts derived for image captioning. Weak supervision data refers to noisy data that is not closely curated and may include errors. Given a target image, weak supervision data for visually similar images may be collected from sources of weakly annotated images, such as online social networks. Generally, images posted online include “weak” annotations in the form of tags, titles, labels, and short descriptions added by users. Weak supervision data for the target image is generated by extracting keywords for visually similar images discovered in the different sources. The keywords included in the weak supervision data are then employed to modulate weights applied for probabilistic classifications during image captioning analysis.
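
The keyword-gathering step in the abstract can be illustrated with a minimal sketch. It assumes each image is already encoded as a feature vector and that each weakly annotated image carries a list of user tags. The cosine similarity measure, the similarity-weighted scoring, and the names `cosine`, `collect_keywords`, `top_images`, and `top_keywords` are illustrative assumptions, not details taken from the patent.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def collect_keywords(target_vec, annotated_images, top_images=2, top_keywords=3):
    """Build weak supervision keywords for a target image.

    annotated_images is a list of (feature_vector, tags) pairs standing in
    for a source of weakly annotated images (hypothetical data layout).
    """
    # Rank the source images by visual similarity to the target image.
    ranked = sorted(
        annotated_images,
        key=lambda item: cosine(target_vec, item[0]),
        reverse=True,
    )
    # Score each tag by the similarity of the images it appears on.
    scores = Counter()
    for vec, tags in ranked[:top_images]:
        similarity = cosine(target_vec, vec)
        for tag in set(tags):
            scores[tag.lower()] += similarity
    # Keep only the top-ranking keywords as the weak supervision data.
    return [keyword for keyword, _ in scores.most_common(top_keywords)]


SOURCE = [
    ([1.0, 0.1], ["Dog", "frisbee"]),
    ([0.9, 0.2], ["dog", "park"]),
    ([0.0, 1.0], ["car"]),
]
keywords = collect_keywords([1.0, 0.0], SOURCE)
```

Because the tags are noisy, scoring by similarity and keeping only the top-ranked keywords is one simple way to limit the influence of errors. Other relevance criteria would fit the same structure.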

First claim

Opening claim text (preview).

What is claimed is:

1. In a digital media environment to facilitate management of image collections using at least one computing device, a method to automatically generate image captions using weak supervision data comprising: obtaining, by the at least one computing device, a target image for caption analysis; applying, by the at least one computing device, feature extraction to the target image to generate global concepts corresponding to the image; comparing, by the at least one computing device, the target image to images from a source of weakly annotated images to identify visually similar images; building, by the at least one computing device, a collection of keywords for the target image indicative of image details by extracting the keywords from the visually similar images; and supplying, by the at least one computing device, the collection of keywords indicative of image details as the weak supervision data for caption generation along with the global concepts.

2. The method as described in claim 1, further comprising generating a caption for the target image using the collection of keywords to modulate word weights applied for sentence construction.

3. The method as described in claim 1, wherein the collection of keywords expands a set of candidate captions available for the caption analysis to include specific objects, attributes, and terms derived from the weak supervision data in addition to the global concepts derived from the feature extraction.

4. The method as described in claim 1, wherein the collection of keywords is supplied to a language processing model operable to probabilistically generate a descriptive caption for the image by computing probability distributions that account for the weak supervision data.

5. The method of claim 1, wherein applying feature extraction to the target image comprises using a pre-trained convolution neural network (CNN) to encode the image with global descriptive terms indicative of the global concepts.

6. The method of claim 1, wherein supplying the collection of keywords comprises providing keywords to a recurrent neural network (RNN) designed to implement language modeling and sentence construction techniques for generating a caption for the target image.

7. The method of claim 6, wherein the RNN iteratively predicts a sequence of words to combine as the caption for the target image based upon probability distributions computed in accordance with weight factors in multiple iterations.

8. The method of claim 7, wherein the collection of keywords is injected in the RNN for each of the multiple iterations to modulate the weight factors used to predict the sequence.

9. The method of claim 1, wherein caption generation includes multiple iterations to determine a sequence of words to combine as the caption for the target image and supplying the collection of keywords comprises providing the same keywords for each of the multiple iterations.

10. The method of claim 1, wherein building the collection of keywords comprises scoring and ranking keywords associated with the visually similar images based on relevance criteria and generating a filtered list of top ranking keywords.

11. The method as described in claim 1, wherein keywords in the collection of keywords are assigned keyword weights effective to change word probabilities in probabilistic categorization implemented for caption generation to favor keywords indicative of the image details.

12. The method as described in claim 1, wherein the source of weakly annotated images comprises an online repository for images accessible over a network.

13. In a digital media environment to facilitate access to collections of images using one or more computing devices, a system comprising; one or more processing devices; one or more computer-readable media storing instructions executable via the one or more processing devices to implement a caption generator configured to perform operations to automatically generate image captions using weak supervision data including: processing a target image for caption analysis via a convolution neural network (CNN), the CNN configured to extract global concepts corresponding to the target image; comparing the target image to images from at least one source of weakly annotated images to identify visually similar images; building a collection of keywords for the target image indicative of image details by extracting the keywords from the visually similar images as weak supervision data used to inform caption generation; supplying the collection of keywords indicative of image details to a recurrent neural network (RNN) along with the global concepts, the RNN configured to implement language modeling and sentence construction techniques for generating a caption for the target image; and generating the caption for the target image via the RNN using the collection of keywords to modulate word weights applied by the RNN for sentence construction.

14. A system as recited in claim 13, wherein the at least one source of weakly annotated images includes a social networking site having a database of images associated by users with weak annotations indicative of low-level image details.

15. A system as recited in claim 13, wherein the at least one source of weakly annotated images includes a collection of training images used to train the caption generator.

16. A system as recited in claim 13, wherein: the RNN iteratively predicts a sequence of words to combine as the caption for the target image based upon probability distributions computed in accordance with weight factors in multiple iterations; and the same collection of keywords derived from the weak supervision data is injected in the RNN for each of the multiple iterations to modulate the weight factors used to predict the sequence.

17. In a digital media environment to facilitate management of image collections using at least one computing device, a method to automatically generate image captions implemented via an image service comprising: comparing, by the at least one computing device, a target image for caption analysis to images from at least one source of weakly annotated images to identify visually similar images; building, by the at least one computing device, a collection of keywords for the target image indicative of image details by extracting the keywords from the visually similar images as weak supervision data used to inform caption generation; supplying, by the at least one computing device, the collection of keywords indicative of the image details to a caption generation model configured to iteratively combine words derived from the concepts and attributes to construct a caption in multiple iterations; and constructing, by the at least one computing device, the caption according to a semantic attention model configured to modulate weights assigned to the keywords for each of the multiple iterations based on relevance to a word predicted in a preceding iteration.

18. The method as described in claim 17, wherein the semantic attention model causes different keywords to be considered at each of the multiple iterations.

19. The method as described in claim 18, wherein the caption generation model comprises a recurrent neural network (RNN) designed to implement language modeling and sentence construction techniques for generating the caption for the target image.

20. The method as described in claim 19, wherein the semantic attention model includes an input attention model applied to input for each node
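
Claims 2, 8, 9, and 11 describe keywords that modulate word weights at every iteration of caption generation. The sketch below shows one way that idea could look, under loud assumptions: the per-step score dictionaries in `STEP_LOGITS` are hand-written stand-ins for what an RNN would output, the modulation is a simple additive bias before a softmax, and decoding is greedy. The names `softmax`, `modulate`, and `greedy_caption` are hypothetical and are not taken from the patent.

```python
import math


def softmax(logits):
    """Turn a {word: score} mapping into a probability distribution."""
    peak = max(logits.values())
    exps = {word: math.exp(score - peak) for word, score in logits.items()}
    total = sum(exps.values())
    return {word: value / total for word, value in exps.items()}


def modulate(logits, keyword_weights):
    """Add each keyword's weight to its score, then renormalize.

    The additive bias is an assumption made for illustration; it raises
    the probability of words that appear in the weak supervision data.
    """
    boosted = {
        word: score + keyword_weights.get(word, 0.0)
        for word, score in logits.items()
    }
    return softmax(boosted)


def greedy_caption(step_logits, keyword_weights):
    """Pick the most probable word per step, supplying the same keyword
    weights at every iteration."""
    caption = []
    for logits in step_logits:
        probs = modulate(logits, keyword_weights)
        caption.append(max(probs, key=probs.get))
    return caption


# Hypothetical per-step scores standing in for RNN outputs.
STEP_LOGITS = [
    {"a": 2.0, "the": 1.0},
    {"animal": 1.0, "dog": 0.8, "cat": 0.1},
    {"running": 1.5, "sitting": 1.2},
]

plain = greedy_caption(STEP_LOGITS, {})
guided = greedy_caption(STEP_LOGITS, {"dog": 0.5, "sitting": 0.5})
```

With no keywords the sketch picks the generic word "animal"; with the keyword weights it favors the more specific "dog". The semantic attention model of claim 17 would go further by changing the keyword weights from one iteration to the next, which this fixed-weight sketch does not attempt.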

Assignees

Adobe Systems Inc

Inventors

Classifications

  • G06V20/70 (Primary)

    Labelling scene content, e.g. deriving syntactic or semantic representations

  • using neural networks

  • using classification, e.g. of video objects

  • based on the proximity to a decision surface, e.g. support vector machines

  • Combinations of networks

Patent family

Related publications grouped by family.

External sources

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US9811765B2 cover?
Techniques for image captioning with weak supervision are described herein. In implementations, weak supervision data regarding a target image is obtained and utilized to provide detail information that supplements global image concepts derived for image captioning. Weak supervision data refers to noisy data that is not closely curated and may include errors. Given a target image, weak supervis…
Who is the assignee on this patent?
Adobe Systems Inc
What technology area does this patent fall under?
Primary CPC classification G06V20/70. Mapped technology areas include Physics.
When was this patent published?
Publication date Nov 7, 2017 (B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 4 related publications on this page (citations in our corpus or others sharing the same primary CPC).