Discovery of semantic similarities between images and text

US9836671B2 · US · B2

Patent metadata
Field               Value
Publication number  US-9836671-B2
Application number  US-201514839430-A
Country             US
Kind code           B2
Filing date         Aug 28, 2015
Priority date       Aug 28, 2015
Publication date    Dec 5, 2017
Grant date          Dec 5, 2017

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

Disclosed herein are technologies directed to discovering semantic similarities between images and text, which can include performing image search using a textual query, performing text search using an image as a query, and/or generating captions for images using a caption generator. A semantic similarity framework can include a caption generator and can be based on a deep multimodal similar model. The deep multimodal similarity model can receive sentences and determine the relevancy of the sentences based on similarity of text vectors generated for one or more sentences to an image vector generated for an image. The text vectors and the image vector can be mapped in a semantic space, and their relevance can be determined based at least in part on the mapping. The sentence associated with the text vector determined to be the most relevant can be output as a caption for the image.
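
The relevance step described in the abstract can be illustrated with a minimal sketch. It assumes the image vector and the text vectors have already been produced by some image model and text model and mapped into the same semantic space; the three-dimensional vectors, the candidate sentences, and the function names `cosine_similarity` and `rank_captions` are hypothetical and chosen only for illustration. Cosine similarity is used as the relevance measure because claims 5 and 14 recite it; this is not the patent's implementation.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank_captions(image_vector, sentence_vectors):
    """Return (score, sentence) pairs sorted from most to least similar to the image."""
    scored = [
        (cosine_similarity(image_vector, vector), sentence)
        for sentence, vector in sentence_vectors.items()
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored


# Hypothetical vectors standing in for the outputs of an image model and a text model.
image_vector = [0.9, 0.1, 0.3]
candidates = {
    "a dog runs on grass": [0.8, 0.2, 0.4],
    "a plate of food": [0.1, 0.9, 0.2],
    "a dog sits indoors": [0.6, 0.5, 0.1],
}

ranked = rank_captions(image_vector, candidates)
best_caption = ranked[0][1]  # the sentence whose text vector is closest to the image vector
print(best_caption)
```

In this toy setup the top-ranked sentence plays the role of the caption output, and the full ranked list corresponds to the ordering of candidate sentences by relevance.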

First claim

Opening claim text (preview).

What is claimed is:

1. A device comprising: a processor; and a computer-readable medium in communication with the processor, the computer-readable medium including modules comprising: a caption generator module configured to receive a query, the query comprising an image, a detector module configured to detect a set of words with a determined probability to be associated with the query, a sentence generator module configured to generate a set of sentences from the set of words, a sentence re-ranker module configured to rank the set of sentences to generate caption sentences by using a deep multimodal similarity model module, the deep multimodal similarity model module configured to use an image model to generate an image vector from the query and a text model to generate a plurality of text vectors from the set of sentences, and the deep multimodal similarity module further configured to associate the most relevant sentence of the caption sentences as a caption for the image.

2. A device as claim 1 recites, wherein to generate the plurality of text vectors from the set of sentences, the deep multimodal similarity detector module is further configured to: convert each word of the plurality of sentences of the set of sentences to a letter-trigram count vector; and propagate forward the letter-trigram count vector through a deep convolutional neural network to produce a semantic vector.

3. A device as claim 1 recites, wherein the deep multimodal similarity detector module is further configured to map the image vector and the plurality of text vectors into a semantic space.

4. A device as claim 3 recites, wherein the deep multimodal similarity detector module is further configured to establish a relevance space in the semantic space to determine relevance of the plurality of text vectors to the image vector, whereby text vectors in the relevance space of the semantic space are determined to be more relevant than text vectors outside of the relevance space of the semantic space.

5. A device as claim 4 recites, wherein the deep multimodal similarity detector module is further configured to determine an order of relevance for the caption sentences by measuring a cosine similarity between the image vector and one or more of the plurality of text vectors.

6. A device as claim 1 recites, wherein to detect a set of words, the detector module is configured to determine a number of common words found in training captions.

7. A device as claim 6 recites, wherein the number of common words found in training captions is set at a determined number of words.

8. A device as claim 1 recites, wherein the detector module is further configured to teach a set of detectors using a weakly-supervised approach of multiple instance learning, wherein the weakly-supervised approach of multiple instance learning comprises iteratively selecting instances within a set of positive bags of bounding boxes.

9. A method, comprising: receiving an image; detecting a set of words with a determined probability to be associated with the image; generating a ranked set of sentences from the set of words, the ranked set of sentences comprising a plurality of sentences in a ranked order; re-ranking the ranked set of sentences by using a deep multimodal similarity model comprising an image model and a text model to generate caption sentences; and associating a relevant sentence of the caption sentences as a caption for the image.

10. A method as claim 9 recites, wherein re-ranking the ranked set of sentences by using the deep multimodal similarity model comprising the image model and the text model to generate caption sentences comprises: generating an image vector from the image using the image model; and generating a plurality of text vectors from the ranked set of sentences using the text model.

11. A method as claim 10 recites, wherein generating the plurality of text vectors from the image comprises: converting each word of the plurality of sentences of the ranked set of sentences to a letter-trigram count vector; and propagating forward the letter-trigram count vector through a deep convolutional neural network to produce a semantic vector.

12. A method as claim 10 recites, further comprising mapping the image vector and the plurality of text vectors into a semantic space.

13. A method as claim 12 recites, further comprising establishing a relevance space in the semantic space to determine relevance of the plurality of text vectors to the image vector, whereby text vectors in the relevance space of the semantic space are determined to be more relevant than text vectors outside of the relevance space of the semantic space.

14. A method as claim 13 recites, further comprising determining an order of relevance for the caption sentences by measuring a cosine similarity between the image vector and one or more of the plurality of text vectors.

15. A method as claim 9 recites, wherein the set of words comprises determining a number of common words found in training captions.

16. A method as claim 15 recites, wherein the number of common words found in training captions is no greater than one thousand words.

17. A method as claim 9 recites, wherein detecting the set of words comprises teaching a set of detectors using a weakly-supervised approach of multiple instance learning.

18. A method as claim 17 recites, wherein the weakly-supervised approach of multiple instance learning comprises iteratively selecting instances within a set of positive bags of bounding boxes.

19. A method as claim 9 recites, further comprising generating an image file comprising the image and the relevant sentence as the caption.

20. A computer storage medium having computer-executable instructions thereupon that, when executed by a computer, cause the computer to: receive a set of sentences generated from a set of words associated with an image, rank the set of sentences to generate caption sentences by using a deep multimodal similarity model, the deep multimodal similarity model comprising an image model to generate an image vector from the query and a text model to generate a plurality of text vectors from the set of sentences, determine a relevant sentence of the caption sentences by comparing the image vector and the plurality of text vectors, with a sentence associated with a text vector having a highest similarity determined to be the relevant sentence, associate the relevant sentence of the caption sentences as a caption for the image, and create an image file comprising the image and the relevant sentence.

21. A method, comprising: receiving a search image; detecting a set of words with a determined probability to be associated with the search image; generating a ranked set of sentences from the set of words, the ranked set of sentences comprising a plurality of sentences in a ranked order; re-ranking the ranked set of sentences by using a deep multimodal similarity model comprising an image model and a text model to generate a ranked list of caption sentences; and providing the ranked list of caption sentences as a search result for the search image.
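
Claims 2 and 11 recite converting each word to a letter-trigram count vector before passing it through a convolutional network. The sketch below shows one common way such a vector can be built; it is an illustration under assumptions, not the claimed implementation. In particular, the `#` word-boundary marker, the lowercasing, the way the trigram vocabulary is derived from the example sentence, and the function names `letter_trigrams` and `trigram_count_vector` are all hypothetical choices that the claims do not specify.

```python
from collections import Counter


def letter_trigrams(word, boundary="#"):
    """Split a word into overlapping three-letter windows, padded with a boundary marker."""
    padded = f"{boundary}{word.lower()}{boundary}"
    return [padded[i:i + 3] for i in range(len(padded) - 2)]


def trigram_count_vector(word, vocabulary):
    """Count how often each vocabulary trigram occurs in the word."""
    counts = Counter(letter_trigrams(word))
    return [counts.get(trigram, 0) for trigram in vocabulary]


# Assumption: the vocabulary is built from the sentence itself; a real system
# would use a fixed trigram vocabulary collected from training text.
sentence = "a dog runs"
words = sentence.split()
vocabulary = sorted({t for w in words for t in letter_trigrams(w)})

# One count vector per word; these would be the inputs to the text model.
vectors = [trigram_count_vector(w, vocabulary) for w in words]
for word, vector in zip(words, vectors):
    print(word, vector)
```

Each word becomes a fixed-length vector over the trigram vocabulary, so words of different lengths share one input representation. The forward pass through the deep convolutional neural network that the claims recite is not shown here.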

Assignees

Microsoft Technology Licensing LLC

Inventors

Classifications

  • G06V20/70 (Primary)

    Labelling scene content, e.g. deriving syntactic or semantic representations · CPC title

  • Classification techniques · CPC title

  • using neural networks · CPC title

  • Classification techniques · CPC title

  • Distances to cluster centroïds · CPC title

Patent family

Related publications grouped by family.

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US9836671B2 cover?
Disclosed herein are technologies directed to discovering semantic similarities between images and text, which can include performing image search using a textual query, performing text search using an image as a query, and/or generating captions for images using a caption generator. A semantic similarity framework can include a caption generator and can be based on a deep multimodal similar mo…
Who is the assignee on this patent?
Microsoft Technology Licensing LLC
What technology area does this patent fall under?
Primary CPC classification G06V20/70. Mapped technology areas include Physics.
When was this patent published?
Publication date Dec 5, 2017 (B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 4 related publications on this page (citations in our corpus or others sharing the same primary CPC).