Multi-stage image querying

US10997233B2 · US · B2

Patent metadata
Field               Value
Publication number  US-10997233-B2
Application number  US-201615097086-A
Country             US
Kind code           B2
Filing date         Apr 12, 2016
Priority date       Apr 12, 2016
Publication date    May 4, 2021
Grant date          May 4, 2021

Abstract

Official abstract text for this publication.

In some examples, a computing device refines feature information of query text. The device repeatedly determines attention information based at least in part on feature information of the image and the feature information of the query text, and modifies the feature information of the query text based at least in part on the attention information. The device selects at least one of a predetermined plurality of outputs based at least in part on the refined feature information of the query text. In some examples, the device operates a convolutional computational model to determine feature information of the image. The device operates network computational models (NCMs) to determine feature information of the query and to determine attention information based at least in part on the feature information of the image and the feature information of the query. Examples include a microphone to detect audio corresponding to the query text.
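
The refinement loop in the abstract can be sketched roughly as follows. This is a minimal illustration, not the patented implementation: the function names (`attend`, `refine_query`), the dot-product scoring, and the softmax normalization are all assumptions standing in for the trained network computational models, and the update rule follows the "incrementing" variant recited in claim 9.

```python
import math


def softmax(scores):
    """Normalize raw scores into weights that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(region_feats, query_vec):
    """Return one relevance weight per image region.

    Assumption: a plain dot product stands in for the trained
    attention model described in the patent.
    """
    scores = [sum(r * q for r, q in zip(region, query_vec))
              for region in region_feats]
    return softmax(scores)


def refine_query(region_feats, query_vec, stages=2):
    """Repeat 'attend, then update the query' for a fixed number of stages."""
    refined = list(query_vec)
    history = []
    for _ in range(stages):
        weights = attend(region_feats, refined)
        # Attention-weighted sum of the region feature vectors.
        attended = [sum(w * region[d] for w, region in zip(weights, region_feats))
                    for d in range(len(refined))]
        # Increment the query features by the attended image features.
        refined = [q + a for q, a in zip(refined, attended)]
        history.append(weights)
    return refined, history


if __name__ == "__main__":
    # Two hypothetical image regions with 2-dimensional features.
    regions = [[1.0, 0.0], [0.0, 1.0]]
    refined, history = refine_query(regions, [1.0, 0.0], stages=2)
    for stage, weights in enumerate(history, start=1):
        print(f"stage {stage} attention: {weights}")
    print("refined query features:", refined)
```

In this toy setup the second stage places more weight on the first region than the first stage does, which is the intended effect of repeating the attention step on the updated query features.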

First claim

Opening claim text (preview).

What is claimed is:

1. A device, comprising: at least one processing unit adapted to execute modules configured to answer natural-language queries about content depicted in images; and one or more computer-readable media communicatively coupled to the at least one processing unit and storing the modules, the modules comprising: a module of an image-representation engine that is configured to operate a convolutional computational model (CCM) to determine feature information of an image, the feature information comprising feature values for a plurality of image regions within the image; a module of a query-representation engine that is configured to operate a first network computational model to determine feature information of a natural-language query; and a module of a filtering engine that is configured to: operate a second network computational model to determine first attention information based at least in part on the feature information of the image and the feature information of the query, the first attention information representing a relevance of each of the plurality of image regions to the query; determine revised feature information based at least in part on the feature values for the plurality of image regions, weighted by the first attention information, and the feature information of the query; operate a third network computational model to determine second attention information based at least in part on the feature information of the image and the revised feature information, the second attention information representing a revised relevance of each of the plurality of image regions to the query; determine second revised feature information based at least in part on the feature values of the plurality of image regions, weighted by the second attention information; and determine, as an answer to the query, a natural-language filter output based at least in part on the second revised feature information.

2. A device as claim 1 recites, wherein the filtering engine is configured to determine the second revised feature information further based at least in part on at least the feature information of the query or the revised feature information.

3. A device as claim 1 recites, wherein the filtering engine is configured to: operate a fourth network computational model to determine respective output-element values of a plurality of output elements based at least in part on the second revised feature information; and determine the filter output by selecting at least one of the output elements based at least in part on the respective output-element values.

4. A device as claim 3 recites, wherein the filtering engine is configured to: determine respective ranks of the output elements based at least in part on the output-element values; and select the at least one of the output elements having respective ranks in a selected range.

5. A device as claim 1 recites, further comprising jointly training at least the first network computational model, the second network computational model, and the third network computational model based at least in part on training data.

6. A method of analyzing an image to answer a natural-language query about content depicted in the image, the method comprising: refining feature information of query text of the natural-language query, wherein the refining comprises performing a group of actions at least twice, and the group of actions comprises: determining attention information based at least in part on feature information of the image and the feature information of the query text, the feature information of the image comprising feature values for a plurality of image regions within the image, and the attention information representing a relevance of each of the plurality of image regions to the query; and modifying the feature information of the query text based at least in part on the feature values for the plurality of image regions weighted by the attention information; and selecting, as an answer to the query, at least one of a predetermined plurality of natural-language outputs based at least in part on the refined feature information of the query text.

7. A method as claim 6 recites, further comprising, before refining the feature information of the query text: determining the feature information of the image by applying a convolutional computational model to data of the image; and determining the feature information of the query text by applying a network computational model to the query text.

8. A method as claim 6 recites, wherein the determining the attention information comprises operating a network computational model having as input the feature information of the image and the feature information of the query text.

9. A method as claim 6 recites, wherein the modifying the feature information of the query text comprises incrementing the feature information of the query text by feature information of the plurality of image regions weighted by the attention information.

10. A method as claim 6 recites, wherein: the method further comprises, before the refining, copying the feature information of the query text to determine reference feature information of the query text; and the modifying the feature information of the query text comprises replacing the feature information of the query text with a sum or concatenation of the reference feature information of the query text and feature information of the image regions weighted by the attention information.

11. A method as claim 6 recites, wherein the selecting comprises: operating a network computational model to determine respective scores for individual outputs of the plurality of outputs based at least in part on the refined feature information of the query text; and selecting the at least one of the plurality of outputs having respective scores in a predetermined range.

12. A system for answering natural-language queries about content depicted in images, the system comprising: a microphone configured to provide an audio-input signal; a speaker configured to receive an audio-output signal and produce corresponding output audio; at least one processor; and a memory storing instructions that, when executed by the at least one processor, cause the at least one processor to perform operations comprising: determining query text corresponding to the audio-input signal; operating a network computational model to determine feature information of the query text; operating a convolutional computational model (CCM) to determine feature information of an image, the feature information of the image comprising feature values for a plurality of image regions within the image; operating a first computational stage of a plurality of computational stages to determine, based at least in part on the feature information of the query text and the feature information of the image, first attention information representing a relevance of each of the plurality of image regions to the query, and to determine feature information of the first computational stage based at least in part on the feature information of the query text and the feature information of the image weighted by the attention information; operating at least one subsequent stage of the plurality of computational stages to determine, based at least in part on the feature information of a respective preceding stage of the plurality of stages, subsequent attention information, and to determine feature information of that stage based at least in part on the feature information of the image, weighted by the subsequent attention information, and on the feature information of a respective preceding stage…
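
The selection step recited in claims 3, 4, and 11 (score each predetermined output, then pick by rank or by score range) might look like the sketch below. Everything here is illustrative: `score_outputs`, `select_by_rank`, `select_in_score_range`, the candidate answers, and the linear scoring weights are hypothetical, and a simple dot product stands in for the trained network computational model that the claims describe.

```python
def score_outputs(refined_vec, output_weights):
    """Score each predetermined answer against the refined feature vector.

    Assumption: one hypothetical weight row per candidate answer, combined
    with the features by a dot product instead of a trained network.
    """
    return {answer: sum(w * x for w, x in zip(row, refined_vec))
            for answer, row in output_weights.items()}


def select_by_rank(scores, top_k=1):
    """Rank answers by descending score and keep the top_k (claim 4 style)."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:top_k]


def select_in_score_range(scores, low, high):
    """Keep answers whose score lies within [low, high] (claim 11 style)."""
    return [answer for answer, s in scores.items() if low <= s <= high]


if __name__ == "__main__":
    # Hypothetical refined features and candidate answers.
    refined = [1.7, 0.3]
    candidates = {
        "dog": [1.0, 0.0],
        "cat": [0.0, 1.0],
        "bicycle": [0.5, 0.5],
    }
    scores = score_outputs(refined, candidates)
    print("best answer:", select_by_rank(scores, top_k=1))
    print("answers scoring at least 0.9:", select_in_score_range(scores, 0.9, 2.0))
```

Ranking and score-range selection are kept as separate functions because the claims recite them as distinct options.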

Assignees

Microsoft Technology Licensing, LLC

Classifications

  • Combinations of networks

  • Recurrent networks, e.g. Hopfield networks

  • Characterised by memory or gating, e.g. long short-term memory [LSTM] or gated recurrent units [GRU]

  • Weakly supervised learning, e.g. semi-supervised or self-supervised learning

  • Supervised learning

Frequently asked questions

Who is the assignee on this patent?
Microsoft Technology Licensing, LLC
What technology area does this patent fall under?
Primary CPC classification G06F16/583. Mapped technology areas include Physics.
When was this patent published?
Publication date May 4, 2021 (kind code B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 4 related publications on this page (citations in our corpus or others sharing the same primary CPC).