Stacked cross-modal matching

US11093560B2 · US · B2

Patent metadata
Publication number: US-11093560-B2
Application number: US-201816138587-A
Country: US
Kind code: B2
Filing date: Sep 21, 2018
Priority date: Sep 21, 2018
Publication date: Aug 17, 2021
Grant date: Aug 17, 2021

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection. Read this to see what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

The present concepts relate to matching data of two different modalities using two stages of attention. First data is encoded as a set of first vectors representing components of the first data, and second data is encoded as a set of second vectors representing components of the second data. In the first stage, the components of the first data are attended by comparing the first vectors and the second vectors to generate a set of attended vectors. In the second stage, the components of the second data are attended by comparing the second vectors and the attended vectors to generate a plurality of relevance scores. Then, the relevance scores are pooled to calculate a similarity score that indicates a degree of similarity between the first data and the second data.
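The two stages described in the abstract can be illustrated with a minimal NumPy sketch. It uses the image/sentence case from the claims, where the words play the role of the first data (the components that are weighted) and the image regions play the role of the second data. The function names, the softmax weighting, the `temperature` parameter, and the choice of average pooling are assumptions made for illustration; the abstract itself only says that vectors are compared, attended vectors are generated, and relevance scores are pooled.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-8):
    """Scale vectors to (near) unit length so dot products act as cosine similarities."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=axis, keepdims=True)


def stacked_attention_similarity(region_vecs, word_vecs, temperature=1.0):
    """Two-stage matching of k region vectors against n word vectors.

    region_vecs: (k, d) array, one row per image region
    word_vecs:   (n, d) array, one row per word
    Returns (similarity_score, relevance_scores), where relevance_scores has shape (k,).
    """
    regions = l2_normalize(region_vecs)
    words = l2_normalize(word_vecs)

    # Stage 1: cosine similarity matrix between every region and every word.
    cos = regions @ words.T                        # shape (k, n)
    # ASSUMPTION: turn similarities into attention weights with a softmax over words.
    weights = softmax(cos / temperature, axis=1)   # each row sums to 1
    # One attended sentence vector per region: a weighted sum of the word vectors.
    attended = weights @ word_vecs                 # shape (k, d)

    # Stage 2: cosine similarity between each region and its attended sentence vector.
    relevance = np.sum(regions * l2_normalize(attended), axis=1)   # shape (k,)

    # Pooling. ASSUMPTION: a plain average; other pooling functions are possible.
    return float(relevance.mean()), relevance
```

Calling `stacked_attention_similarity(np.eye(3), np.eye(3), temperature=0.01)` yields a score close to 1, because each region attends almost entirely to the one word it matches.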

First claim

Opening claim text (preview).

The invention claimed is:

1. A system, comprising: a first neural network for detecting a plurality of regions in an image; a second neural network for generating a plurality of region vectors associated with the plurality of regions; a third neural network for generating a plurality of word vectors associated with a plurality of words in a sentence; one or more storage resources storing the first neural network, the second neural network, and the third neural network; a search engine receiving the image as a search query and returning the sentence as a search result; one or more hardware processors; and at least one computer-readable storage medium storing computer-readable instructions which, when executed by the one or more hardware processors, cause the one or more hardware processors to: detect the plurality of regions based at least on the image using the first neural network; generate the plurality of region vectors based at least on the plurality of regions using the second neural network; generate the plurality of word vectors based at least on the sentence using the third neural network; generate a plurality of attended sentence vectors associated with the plurality of region vectors based at least on a comparison between the plurality of region vectors and the plurality of word vectors, the plurality of attended sentence vectors including weights indicating correspondence between the plurality of regions and the plurality of words; generate a plurality of region-sentence relevance scores associated with the plurality of region vectors based at least on a comparison between the plurality of region vectors and the plurality of attended sentence vectors, the plurality of region-sentence relevance scores indicating relevance of the plurality of regions with respect to the sentence; and generate an image-sentence similarity score indicating a similarity between the image and the sentence based at least on the plurality of region-sentence relevance scores, the search engine returning the sentence based at least on the image-sentence similarity score.

2. The system of claim 1, wherein the computer-readable instructions further cause the one or more hardware processors to: train at least one of the second neural network or the third neural network using at least a plurality of matching image-sentence pairs.

3. The system of claim 1, wherein the computer-readable instructions further cause the one or more hardware processors to: compute a cosine similarity matrix based at least on the plurality of region vectors and the plurality of word vectors to generate the plurality of attended sentence vectors.

4. The system of claim 1, wherein the computer-readable instructions further cause the one or more hardware processors to: compute cosine similarities between the plurality of region vectors and the plurality of attended sentence vectors to generate the plurality of region-sentence relevance scores.

5. The system of claim 1, wherein the computer-readable instructions further cause the one or more hardware processors to: compute an average of the plurality of region-sentence relevance scores to generate the image-sentence similarity score.

6. A method, comprising: receiving a sentence including a plurality of words as a search query; retrieving an image as a candidate search result; inputting the image into a first neural network to detect a plurality of regions in the image; inputting the plurality of regions into a second neural network to generate a plurality of region vectors; inputting the sentence into a third neural network to generate a plurality of word vectors; comparing the plurality of region vectors with the plurality of word vectors to generate a plurality of attended sentence vectors, the plurality of attended sentence vectors indicating correspondence between the plurality of regions and the plurality of words; comparing the plurality of region vectors with the plurality of attended sentence vectors to generate a plurality of region-sentence relevance scores indicating correspondence between the plurality of regions and the sentence; pooling the plurality of region-sentence relevance scores to generate an image-sentence similarity score indicating correspondence between the image and the sentence; and outputting the image as a search result based at least on the image-sentence similarity score.

7. The method of claim 6, further comprising: training at least one of the second neural network or the third neural network using at least a plurality of matching image-sentence pairs.

8. The method of claim 7, further comprising: training at least one of the second neural network or the third neural network using at least a plurality of mismatching image-sentence pairs.

9. The method of claim 6, wherein the second neural network is a convolutional neural network.

10. The method of claim 6, wherein the third neural network is a recurrent neural network.

11. The method of claim 6, wherein the comparing of the plurality of region vectors with the plurality of word vectors comprises computing a cosine similarity matrix.

12. The method of claim 6, wherein the plurality of attended sentence vectors are generated based at least on weighted sums of the plurality of word vectors.

13. The method of claim 6, wherein the plurality of region-sentence relevance scores are generated based at least on cosine similarities between the plurality of region vectors and the plurality of attended sentence vectors.

14. The method of claim 6, wherein the pooling of the plurality of region-sentence relevance scores comprises using a LogSumExp function on the plurality of region-sentence relevance scores.

15. The method of claim 6, wherein the pooling of the plurality of region-sentence relevance scores comprises computing a maximum of the plurality of region-sentence relevance scores.

16. The method of claim 6, wherein the plurality of region vectors, the plurality of word vectors, and the plurality of attended sentence vectors map to a common semantic vector space.

17. The method of claim 6, further comprising: comparing the plurality of word vectors with the plurality of region vectors to generate a plurality of attended image vectors; comparing the plurality of word vectors with the plurality of attended image vectors to generate a plurality of word-image relevance scores indicating correspondence between the plurality of words and the image; pooling the plurality of word-image relevance scores to generate a sentence-image similarity score indicating correspondence between the sentence and the image; and generating a composite similarity score based at least on the image-sentence similarity score and the sentence-image similarity score, wherein the outputting of the image as the search result is based at least on the composite similarity score.

18. The method of claim 6, further comprising: generating a plurality of image-sentence similarity scores for a plurality of candidate images, the image being one of the plurality of candidate images, wherein the outputting of the image as the search result is based at least on the image-sentence similarity score of the image being the highest among the plurality of image-sentence similarity scores.

19. A method, comprising: receiving first data of a first modality as a search query over a network from a client device; retrieving second data of a second modality that is distinct from the first modality; encoding a plurality of first vectors re
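The pooling options named in claims 5, 14, and 15 (average, LogSumExp, maximum), the composite score of claim 17, and the highest-score selection of claim 18 can be sketched as follows. All function names are hypothetical. The `scale` parameter in the LogSumExp pooling and the use of a simple sum for the composite score are assumptions; the claims only say that LogSumExp is used and that the composite score is "based at least on" both directional scores.

```python
import numpy as np


def pool_average(relevance):
    """Average pooling of per-region relevance scores (claim 5)."""
    return float(np.mean(relevance))


def pool_max(relevance):
    """Max pooling of per-region relevance scores (claim 15)."""
    return float(np.max(relevance))


def pool_logsumexp(relevance, scale=1.0):
    """LogSumExp pooling (claim 14).

    ASSUMPTION: `scale` is a hypothetical sharpness parameter. Larger values
    make the result approach the maximum relevance score.
    """
    z = scale * np.asarray(relevance, dtype=float)
    m = z.max()  # subtract the max before exponentiating for numerical stability
    return float((m + np.log(np.exp(z - m).sum())) / scale)


def composite_score(image_sentence_score, sentence_image_score):
    """Combine both matching directions (claim 17).

    ASSUMPTION: a plain sum; the claim does not fix the combination rule.
    """
    return image_sentence_score + sentence_image_score


def best_candidate(similarity_scores):
    """Index of the candidate with the highest similarity score (claim 18)."""
    return int(np.argmax(similarity_scores))
```

For example, given relevance scores `[0.2, 0.4, 0.6]`, max pooling keeps only the best-matching region, while LogSumExp is a smooth upper bound on the maximum, so every region contributes to the similarity score to some degree.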

Assignees

Microsoft Technology Licensing, LLC

Inventors

Classifications

  • Syntactic representation, e.g. by using alphabets or grammars

  • Syntactic or structural pattern recognition, e.g. symbolic string recognition

  • Labelling scene content, e.g. deriving syntactic or semantic representations

  • Scenes; Scene-specific elements (control of digital cameras H04N23/60)

  • using neural networks

Patent family

Related publications grouped by family.

External sources

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US11093560B2 cover?
It covers matching data of two different modalities, such as an image and a sentence, using two stages of attention. Each input is encoded as a set of vectors representing its components. The first stage compares the two sets of vectors to generate attended vectors, the second stage compares the attended vectors with the second set of vectors to generate relevance scores, and the relevance scores are pooled into a single similarity score.
Who is the assignee on this patent?
Microsoft Technology Licensing, LLC
What technology area does this patent fall under?
Primary CPC classification G06F16/951. Mapped technology areas include Physics.
When was this patent published?
Publication date: Aug 17, 2021 (kind code B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 9 related publications on this page (citations in our corpus or others sharing the same primary CPC).