Semantically-relevant discovery of solutions
US-2017060844-A1 · Mar 2, 2017 · US
US9836671B2 · US · B2
| Field | Value |
|---|---|
| Publication number | US-9836671-B2 |
| Application number | US-201514839430-A |
| Country | US |
| Kind code | B2 |
| Filing date | Aug 28, 2015 |
| Priority date | Aug 28, 2015 |
| Publication date | Dec 5, 2017 |
| Grant date | Dec 5, 2017 |
Official abstract text for this publication.
Disclosed herein are technologies directed to discovering semantic similarities between images and text, which can include performing image search using a textual query, performing text search using an image as a query, and/or generating captions for images using a caption generator. A semantic similarity framework can include a caption generator and can be based on a deep multimodal similarity model. The deep multimodal similarity model can receive sentences and determine the relevancy of the sentences based on similarity of text vectors generated for one or more sentences to an image vector generated for an image. The text vectors and the image vector can be mapped in a semantic space, and their relevance can be determined based at least in part on the mapping. The sentence associated with the text vector determined to be the most relevant can be output as a caption for the image.
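The ranking step described in the abstract can be illustrated with a minimal sketch. It assumes the image vector and the text vectors have already been produced by trained image and text models, which are not shown. The function names and the toy vectors are hypothetical and do not come from the patent. Cosine similarity is used as the relevance measure, as recited in claims 5 and 14.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # assumption: a zero vector is treated as not relevant
    return dot / (norm_a * norm_b)


def rank_captions(image_vector, text_vectors):
    """Return (sentence, score) pairs sorted from most to least similar.

    text_vectors maps each candidate sentence to its vector in the
    shared semantic space.
    """
    scored = [
        (sentence, cosine_similarity(image_vector, vector))
        for sentence, vector in text_vectors.items()
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored


# Toy vectors, invented for illustration only.
image_vector = [0.9, 0.1, 0.3]
text_vectors = {
    "a dog runs on grass": [0.8, 0.2, 0.3],
    "a plate of food": [0.1, 0.9, 0.2],
}

ranked = rank_captions(image_vector, text_vectors)
caption = ranked[0][0]  # the highest-scoring sentence becomes the caption
```

In this sketch the sentence whose vector points in nearly the same direction as the image vector is ranked first and selected as the caption.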
Opening claim text (preview).
What is claimed is:
1. A device comprising: a processor; and a computer-readable medium in communication with the processor, the computer-readable medium including modules comprising: a caption generator module configured to receive a query, the query comprising an image, a detector module configured to detect a set of words with a determined probability to be associated with the query, a sentence generator module configured to generate a set of sentences from the set of words, a sentence re-ranker module configured to rank the set of sentences to generate caption sentences by using a deep multimodal similarity model module, the deep multimodal similarity model module configured to use an image model to generate an image vector from the query and a text model to generate a plurality of text vectors from the set of sentences, and the deep multimodal similarity module further configured to associate the most relevant sentence of the caption sentences as a caption for the image.
2. A device as claim 1 recites, wherein to generate the plurality of text vectors from the set of sentences, the deep multimodal similarity detector module is further configured to: convert each word of the plurality of sentences of the set of sentences to a letter-trigram count vector; and propagate forward the letter-trigram count vector through a deep convolutional neural network to produce a semantic vector.
3. A device as claim 1 recites, wherein the deep multimodal similarity detector module is further configured to map the image vector and the plurality of text vectors into a semantic space.
4. A device as claim 3 recites, wherein the deep multimodal similarity detector module is further configured to establish a relevance space in the semantic space to determine relevance of the plurality of text vectors to the image vector, whereby text vectors in the relevance space of the semantic space are determined to be more relevant than text vectors outside of the relevance space of the semantic space.
5. A device as claim 4 recites, wherein the deep multimodal similarity detector module is further configured to determine an order of relevance for the caption sentences by measuring a cosine similarity between the image vector and one or more of the plurality of text vectors.
6. A device as claim 1 recites, wherein to detect a set of words, the detector module is configured to determine a number of common words found in training captions.
7. A device as claim 6 recites, wherein the number of common words found in training captions is set at a determined number of words.
8. A device as claim 1 recites, wherein the detector module is further configured to teach a set of detectors using a weakly-supervised approach of multiple instance learning, wherein the weakly-supervised approach of multiple instance learning comprises iteratively selecting instances within a set of positive bags of bounding boxes.
9. A method, comprising: receiving an image; detecting a set of words with a determined probability to be associated with the image; generating a ranked set of sentences from the set of words, the ranked set of sentences comprising a plurality of sentences in a ranked order; re-ranking the ranked set of sentences by using a deep multimodal similarity model comprising an image model and a text model to generate caption sentences; and associating a relevant sentence of the caption sentences as a caption for the image.
10. A method as claim 9 recites, wherein re-ranking the ranked set of sentences by using the deep multimodal similarity model comprising the image model and the text model to generate caption sentences comprises: generating an image vector from the image using the image model; and generating a plurality of text vectors from the ranked set of sentences using the text model.
11. A method as claim 10 recites, wherein generating the plurality of text vectors from the image comprises: converting each word of the plurality of sentences of the ranked set of sentences to a letter-trigram count vector; and propagating forward the letter-trigram count vector through a deep convolutional neural network to produce a semantic vector.
12. A method as claim 10 recites, further comprising mapping the image vector and the plurality of text vectors into a semantic space.
13. A method as claim 12 recites, further comprising establishing a relevance space in the semantic space to determine relevance of the plurality of text vectors to the image vector, whereby text vectors in the relevance space of the semantic space are determined to be more relevant than text vectors outside of the relevance space of the semantic space.
14. A method as claim 13 recites, further comprising determining an order of relevance for the caption sentences by measuring a cosine similarity between the image vector and one or more of the plurality of text vectors.
15. A method as claim 9 recites, wherein the set of words comprises determining a number of common words found in training captions.
16. A method as claim 15 recites, wherein the number of common words found in training captions is no greater than one thousand words.
17. A method as claim 9 recites, wherein detecting the set of words comprises teaching a set of detectors using a weakly-supervised approach of multiple instance learning.
18. A method as claim 17 recites, wherein the weakly-supervised approach of multiple instance learning comprises iteratively selecting instances within a set of positive bags of bounding boxes.
19. A method as claim 9 recites, further comprising generating an image file comprising the image and the relevant sentence as the caption.
20. A computer storage medium having computer-executable instructions thereupon that, when executed by a computer, cause the computer to: receive a set of sentences generated from a set of words associated with an image, rank the set of sentences to generate caption sentences by using a deep multimodal similarity model, the deep multimodal similarity model comprising an image model to generate an image vector from the query and a text model to generate a plurality of text vectors from the set of sentences, determine a relevant sentence of the caption sentences by comparing the image vector and the plurality of text vectors, with a sentence associated with a text vector having a highest similarity determined to be the relevant sentence, associate the relevant sentence of the caption sentences as a caption for the image, and create an image file comprising the image and the relevant sentence.
21. A method, comprising: receiving a search image; detecting a set of words with a determined probability to be associated with the search image; generating a ranked set of sentences from the set of words, the ranked set of sentences comprising a plurality of sentences in a ranked order; re-ranking the ranked set of sentences by using a deep multimodal similarity model comprising an image model and a text model to generate a ranked list of caption sentences; and providing the ranked list of caption sentences as a search result for the search image.
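Claims 2 and 11 recite converting each word to a letter-trigram count vector before it is fed to a deep convolutional neural network. The sketch below illustrates only the first of those two steps, under stated assumptions: the `#` word-boundary marker, the lowercasing, and all function names are hypothetical choices made for illustration, since the claims do not specify them. The network itself is not shown.

```python
from collections import Counter


def letter_trigrams(word):
    """Split a word into overlapping three-letter windows.

    Assumption: '#' marks the word boundaries and the word is lowercased.
    """
    padded = f"#{word.lower()}#"
    return [padded[i:i + 3] for i in range(len(padded) - 2)]


def letter_trigram_counts(word):
    """Count how often each letter trigram occurs in the word."""
    return Counter(letter_trigrams(word))


def to_dense(counts, vocabulary):
    """Project sparse counts onto a fixed, ordered trigram vocabulary."""
    return [counts.get(trigram, 0) for trigram in vocabulary]


# Hypothetical trigram vocabulary; a real one would be built from training text.
vocabulary = ["#do", "dog", "og#", "#ca", "cat", "at#"]
dog_vector = to_dense(letter_trigram_counts("dog"), vocabulary)
```

Here `letter_trigrams("dog")` yields `#do`, `dog`, and `og#`, and the dense vector marks which vocabulary trigrams the word contains. One reason to represent words this way is that the vocabulary of letter trigrams stays bounded even when the word vocabulary grows.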
Labelling scene content, e.g. deriving syntactic or semantic representations · CPC title
Classification techniques · CPC title
using neural networks · CPC title
Classification techniques · CPC title
Distances to cluster centroïds · CPC title