Who is the assignee on this patent?

Microsoft Technology Licensing, LLC

What technology area does this patent fall under?

Primary CPC classification: G06F16/951. Mapped technology areas include Physics (CPC section G).

When was this patent published?

Publication date: Aug 17, 2021 (kind code B2). Legal status and post-grant events are not shown on this page.

What related patents are in patentsdb?

We list 9 related publications on this page (citations in our corpus or others sharing the same primary CPC).

Stacked cross-modal matching

US11093560B2 · US · B2

Patent metadata
Field	Value
Publication number	US-11093560-B2
Application number	US-201816138587-A
Country	US
Kind code	B2
Filing date	Sep 21, 2018
Priority date	Sep 21, 2018
Publication date	Aug 17, 2021
Grant date	Aug 17, 2021

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

Title
What the patent document calls the invention.
Abstract
A short plain-language summary of the technical disclosure.
Assignees and inventors
Who owns or filed the patent and who is credited as inventor.
Key dates
Filing, priority, publication, and grant dates set the timeline.
First independent claim
The legal scope of protection — read this for what is actually claimed.
CPC / IPC classifications
Technology tags used to group this patent with similar filings.
Citations and related patents
Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

The present concepts relate to matching data of two different modalities using two stages of attention. First data is encoded as a set of first vectors representing components of the first data, and second data is encoded as a set of second vectors representing components of the second data. In the first stage, the components of the first data are attended by comparing the first vectors and the second vectors to generate a set of attended vectors. In the second stage, the components of the second data are attended by comparing the second vectors and the attended vectors to generate a plurality of relevance scores. Then, the relevance scores are pooled to calculate a similarity score that indicates a degree of similarity between the first data and the second data.
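The two stages described in the abstract can be sketched in a few lines. This is a minimal illustration, not the patented implementation: the function names, the softmax weighting, the `sharpness` factor, the average pooling, and the toy vectors are all assumptions chosen for readability. Only the overall flow (compare, attend, compare again, pool) comes from the abstract.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def softmax(values, sharpness):
    """Numerically stable softmax; `sharpness` is an assumed scaling factor."""
    peak = max(values)
    exps = [math.exp(sharpness * (x - peak)) for x in values]
    total = sum(exps)
    return [e / total for e in exps]


def attended_vectors(first_vectors, second_vectors, sharpness=4.0):
    """Stage 1: for each second vector, build a weighted sum of the first vectors."""
    dim = len(first_vectors[0])
    result = []
    for s in second_vectors:
        # Assumption: attention weights come from a softmax over cosine similarities.
        weights = softmax([cosine(s, f) for f in first_vectors], sharpness)
        result.append(
            [sum(w * f[d] for w, f in zip(weights, first_vectors)) for d in range(dim)]
        )
    return result


def similarity_score(first_vectors, second_vectors):
    """Stage 2 and pooling: score each second vector against its attended vector."""
    attended = attended_vectors(first_vectors, second_vectors)
    relevance = [cosine(s, a) for s, a in zip(second_vectors, attended)]
    # Assumption: average pooling; other pooling functions could be swapped in.
    return sum(relevance) / len(relevance)


words = [[1.0, 0.0], [0.0, 1.0]]    # toy "first data" vectors (hypothetical)
regions = [[1.0, 0.0], [0.0, 1.0]]  # toy "second data" vectors (hypothetical)
print(similarity_score(words, regions))
```

With well-aligned inputs the score approaches 1, and with orthogonal inputs it drops to 0, which is the behavior a similarity score of this kind would be expected to show.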

First claim

Opening claim text (preview).

The invention claimed is:

1. A system, comprising: a first neural network for detecting a plurality of regions in an image; a second neural network for generating a plurality of region vectors associated with the plurality of regions; a third neural network for generating a plurality of word vectors associated with a plurality of words in a sentence; one or more storage resources storing the first neural network, the second neural network, and the third neural network; a search engine receiving the image as a search query and returning the sentence as a search result; one or more hardware processors; and at least one computer-readable storage medium storing computer-readable instructions which, when executed by the one or more hardware processors, cause the one or more hardware processors to: detect the plurality of regions based at least on the image using the first neural network; generate the plurality of region vectors based at least on the plurality of regions using the second neural network; generate the plurality of word vectors based at least on the sentence using the third neural network; generate a plurality of attended sentence vectors associated with the plurality of region vectors based at least on a comparison between the plurality of region vectors and the plurality of word vectors, the plurality of attended sentence vectors including weights indicating correspondence between the plurality of regions and the plurality of words; generate a plurality of region-sentence relevance scores associated with the plurality of region vectors based at least on a comparison between the plurality of region vectors and the plurality of attended sentence vectors, the plurality of region-sentence relevance scores indicating relevance of the plurality of regions with respect to the sentence; and generate an image-sentence similarity score indicating a similarity between the image and the sentence based at least on the plurality of region-sentence relevance scores, the search engine returning the sentence based at least on the image-sentence similarity score.
2. The system of claim 1, wherein the computer-readable instructions further cause the one or more hardware processors to: train at least one of the second neural network or the third neural network using at least a plurality of matching image-sentence pairs.
3. The system of claim 1, wherein the computer-readable instructions further cause the one or more hardware processors to: compute a cosine similarity matrix based at least on the plurality of region vectors and the plurality of word vectors to generate the plurality of attended sentence vectors.
4. The system of claim 1, wherein the computer-readable instructions further cause the one or more hardware processors to: compute cosine similarities between the plurality of region vectors and the plurality of attended sentence vectors to generate the plurality of region-sentence relevance scores.
5. The system of claim 1, wherein the computer-readable instructions further cause the one or more hardware processors to: compute an average of the plurality of region-sentence relevance scores to generate the image-sentence similarity score.
6. A method, comprising: receiving a sentence including a plurality of words as a search query; retrieving an image as a candidate search result; inputting the image into a first neural network to detect a plurality of regions in the image; inputting the plurality of regions into a second neural network to generate a plurality of region vectors; inputting the sentence into a third neural network to generate a plurality of word vectors; comparing the plurality of region vectors with the plurality of word vectors to generate a plurality of attended sentence vectors, the plurality of attended sentence vectors indicating correspondence between the plurality of regions and the plurality of words; comparing the plurality of region vectors with the plurality of attended sentence vectors to generate a plurality of region-sentence relevance scores indicating correspondence between the plurality of regions and the sentence; pooling the plurality of region-sentence relevance scores to generate an image-sentence similarity score indicating correspondence between the image and the sentence; and outputting the image as a search result based at least on the image-sentence similarity score.
7. The method of claim 6, further comprising: training at least one of the second neural network or the third neural network using at least a plurality of matching image-sentence pairs.
8. The method of claim 7, further comprising: training at least one of the second neural network or the third neural network using at least a plurality of mismatching image-sentence pairs.
9. The method of claim 6, wherein the second neural network is a convolutional neural network.
10. The method of claim 6, wherein the third neural network is a recurrent neural network.
11. The method of claim 6, wherein the comparing of the plurality of region vectors with the plurality of word vectors comprises computing a cosine similarity matrix.
12. The method of claim 6, wherein the plurality of attended sentence vectors are generated based at least on weighted sums of the plurality of word vectors.
13. The method of claim 6, wherein the plurality of region-sentence relevance scores are generated based at least on cosine similarities between the plurality of region vectors and the plurality of attended sentence vectors.
14. The method of claim 6, wherein the pooling of the plurality of region-sentence relevance scores comprises using a LogSumExp function on the plurality of region-sentence relevance scores.
15. The method of claim 6, wherein the pooling of the plurality of region-sentence relevance scores comprises computing a maximum of the plurality of region-sentence relevance scores.
16. The method of claim 6, wherein the plurality of region vectors, the plurality of word vectors, and the plurality of attended sentence vectors map to a common semantic vector space.
17. The method of claim 6, further comprising: comparing the plurality of word vectors with the plurality of region vectors to generate a plurality of attended image vectors; comparing the plurality of word vectors with the plurality of attended image vectors to generate a plurality of word-image relevance scores indicating correspondence between the plurality of words and the image; pooling the plurality of word-image relevance scores to generate a sentence-image similarity score indicating correspondence between the sentence and the image; and generating a composite similarity score based at least on the image-sentence similarity score and the sentence-image similarity score, wherein the outputting of the image as the search result is based at least on the composite similarity score.
18. The method of claim 6, further comprising: generating a plurality of image-sentence similarity scores for a plurality of candidate images, the image being one of the plurality of candidate images, wherein the outputting of the image as the search result is based at least on the image-sentence similarity score of the image being the highest among the plurality of image-sentence similarity scores.
19. A method, comprising: receiving first data of a first modality as a search query over a network from a client device; retrieving second data of a second modality that is distinct from the first modality; encoding a plurality of first vectors re
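
Claims 5, 14, and 15 name three ways to pool region-sentence relevance scores (average, LogSumExp, maximum), and claim 18 selects the candidate with the highest similarity score. The sketch below illustrates those options on made-up numbers. The function names, the `scale` parameter, and the sample scores are hypothetical; the claims do not specify a scaling factor or any particular data layout.

```python
import math


def pool_average(scores):
    """Average pooling (the option named in claim 5)."""
    return sum(scores) / len(scores)


def pool_max(scores):
    """Maximum pooling (the option named in claim 15)."""
    return max(scores)


def pool_logsumexp(scores, scale=1.0):
    """LogSumExp pooling (the option named in claim 14).

    Computes log(sum(exp(scale * s))) / scale in a numerically stable way.
    `scale` is an assumed hyperparameter; larger values approach the maximum.
    """
    peak = max(scores)
    return peak + math.log(sum(math.exp(scale * (s - peak)) for s in scores)) / scale


def best_candidate(relevance_by_candidate, pool):
    """Return the candidate whose pooled score is highest (as in claim 18)."""
    return max(relevance_by_candidate, key=lambda name: pool(relevance_by_candidate[name]))


# Hypothetical relevance scores for two candidate images.
candidates = {"image_a": [0.1, 0.2, 0.3], "image_b": [0.9, 0.7, 0.8]}
for pool in (pool_average, pool_max, pool_logsumexp):
    print(pool.__name__, best_candidate(candidates, pool))
```

Average pooling treats every region equally, maximum pooling lets the single most relevant region decide, and LogSumExp sits between the two depending on the assumed `scale`.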

Assignees

Microsoft Technology Licensing, LLC

Inventors

Classifications

G06V10/424
Syntactic representation, e.g. by using alphabets or grammars
G06V30/1983
Syntactic or structural pattern recognition, e.g. symbolic string recognition
G06V20/70
Labelling scene content, e.g. deriving syntactic or semantic representations
G06V20/00
Scenes; Scene-specific elements (control of digital cameras H04N23/60)
G06V10/82
using neural networks

Patent family

Related publications grouped by family.

View patent family 69883447
