What technology area does this patent fall under?

Primary CPC classification: G06N3/08 (Learning methods). The mapped technology area is Physics, which is CPC section G.

When was this patent published?

Publication date: Mar 28, 2023 (kind code B2). Legal status and post-grant events are not shown on this page.

What related patents are in patentsdb?

We list 5 related publications on this page (citations in our corpus or others sharing the same primary CPC).

Generating responses to queries about videos utilizing a multi-modal neural network with attention

US11615308B2 · US · B2

Patent metadata
Field	Value
Publication number	US-11615308-B2
Application number	US-202117563901-A
Country	US
Kind code	B2
Filing date	Dec 28, 2021
Priority date	Feb 6, 2020
Publication date	Mar 28, 2023
Grant date	Mar 28, 2023

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

Title
What the patent document calls the invention.
Abstract
A short plain-language summary of the technical disclosure.
Assignees and inventors
Who owns or filed the patent and who is credited as inventor.
Key dates
Filing, priority, publication, and grant dates set the timeline.
First independent claim
The legal scope of protection — read this for what is actually claimed.
CPC / IPC classifications
Technology tags used to group this patent with similar filings.
Citations and related patents
Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

The present disclosure relates to systems, methods, and non-transitory computer-readable media for generating a response to a question received from a user during display or playback of a video segment by utilizing a query-response-neural network. The disclosed systems can extract a query vector from a question corresponding to the video segment using the query-response-neural network. The disclosed systems further generate context vectors representing both visual cues and transcript cues corresponding to the video segment using context encoders or other layers from the query-response-neural network. By utilizing additional layers from the query-response-neural network, the disclosed systems generate (i) a query-context vector based on the query vector and the context vectors, and (ii) candidate-response vectors representing candidate responses to the question from a domain-knowledge base or other source. To respond to a user's question, the disclosed systems further select a response from the candidate responses based on a comparison of the query-context vector and the candidate-response vectors.
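The final step of the abstract, selecting a response by comparing the query-context vector with the candidate-response vectors, can be illustrated with a minimal sketch. The abstract does not say which comparison is used, so cosine similarity is an assumption here, and the names `cosine_similarity` and `select_response` are hypothetical and chosen only for illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def select_response(query_context, candidates):
    """Return the text of the candidate whose vector is most similar to query_context.

    candidates is a list of (response_text, candidate_response_vector) pairs.
    Cosine similarity is an assumed comparison, not one stated in the abstract.
    """
    best_text, _ = max(
        candidates,
        key=lambda pair: cosine_similarity(query_context, pair[1]),
    )
    return best_text


# Toy vectors standing in for the outputs of the encoders described above.
query_context = [1.0, 0.0]
candidates = [
    ("Open the Layers panel.", [0.0, 1.0]),
    ("Use the Crop tool.", [2.0, 0.1]),
]
chosen = select_response(query_context, candidates)
```

In this toy case the second candidate points in nearly the same direction as the query-context vector, so it is the one selected.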

First claim

Opening claim text (preview).

What is claimed is:

1. A non-transitory computer-readable medium comprising instructions that, when executed by at least one processor, cause the at least one processor to: extract a query vector from a question corresponding to a video segment; generate one or more context vectors representing at least one of a visual feature corresponding to the video segment or transcript text corresponding to the video segment; generate a query-context vector by combining the query vector and the one or more context vectors utilizing a neural network and one or more attention mechanisms; generate candidate-response vectors representing candidate responses to the question; and select a response from the candidate responses by comparing the query-context vector to the candidate-response vectors.

2. The non-transitory computer-readable medium of claim 1, further comprising instructions that, when executed by the at least one processor, cause the at least one processor to generate the one or more context vectors by: generating visual-context vectors representing visual features corresponding to the video segment; and generating textual-context vectors representing transcript text corresponding to the video segment.

3. The non-transitory computer-readable medium of claim 2, further comprising instructions that, when executed by the at least one processor, cause the at least one processor to generate the query-context vector by: generating, utilizing a spatial attention mechanism, a precursor query-context vector based on a combination of the visual-context vectors and the query vector; and combining the precursor query-context vector and at least one of the textual-context vectors utilizing the neural network.

4. The non-transitory computer-readable medium of claim 2, further comprising instructions that, when executed by the at least one processor, cause the at least one processor to generate the query-context vector by: generating, utilizing the neural network, hidden-feature vectors based on at least one of the textual-context vectors; generating, utilizing a temporal attention mechanism, a precursor query-context vector based on a combination of the hidden-feature vectors and the query vector; and combining the precursor query-context vector and at least one of the visual-context vectors utilizing a spatial attention mechanism.

5. The non-transitory computer-readable medium of claim 2, further comprising instructions that, when executed by the at least one processor, cause the at least one processor to generate the query-context vector by: generating, utilizing the neural network, hidden-feature vectors based on the visual-context vectors and the textual-context vectors; and combining the hidden-feature vectors and the query vector utilizing a temporal attention mechanism.

6. The non-transitory computer-readable medium of claim 1, further comprising instructions that, when executed by the at least one processor, cause the at least one processor to generate a visual-context vector by: detecting, utilizing a detection neural network, an object portrayed within the video segment comprises a pop-up dialogue or panel; and extracting, utilizing a graphical-object-matching engine, a feature embedding based on textual elements inside the object.

7. The non-transitory computer-readable medium of claim 1, further comprising instructions that, when executed by the at least one processor, cause the at least one processor to generate a visual-context vector by utilizing a tool-recognition classifier to detect a software-user-interface tool.

8. A system comprising: one or more memory devices comprising a video and dual-attention mechanisms comprising a spatial attention mechanism and a temporal attention mechanism; and one or more processors configured to cause the system to: extract a query vector from a question corresponding to the video; generate one or more context vectors representing at least one of a visual feature corresponding to the video or transcript text corresponding to the video; generate a query-context vector by combining the query vector and the one or more context vectors utilizing the dual-attention mechanisms; generate candidate-response vectors representing candidate responses to the question; and select a response from the candidate responses by comparing the query-context vector to the candidate-response vectors.

9. The system of claim 8, wherein the one or more processors are configured to cause the system to generate the one or more context vectors by: generating visual-context vectors representing visual features corresponding to video frames of the video; and generating textual-context vectors representing transcript text corresponding to the video frames of the video.

10. The system of claim 8, wherein the one or more processors are configured to cause the system to generate the query-context vector utilizing a neural network.

11. The system of claim 8, wherein the one or more processors are configured to cause the system to generate the query-context vector by: generating, utilizing the spatial attention mechanism, a precursor query-context vector based on a combination of visual-context vectors and the query vector; and combining the precursor query-context vector and textual-context vectors utilizing gated recurrent units (GRUs) of a recurrent neural network.

12. The system of claim 8, wherein the one or more processors are configured to cause the system to generate the query-context vector by: generating, utilizing GRUs of a recurrent neural network, hidden-feature vectors based on textual-context vectors; generating, utilizing the temporal attention mechanism, a precursor query-context vector based on a combination of the hidden-feature vectors and the query vector; and combining the precursor query-context vector and visual-context vectors utilizing the spatial attention mechanism.

13. The system of claim 8, wherein the one or more processors are configured to cause the system to generate the query-context vector by: generating, utilizing GRUs of a recurrent neural network, hidden-feature vectors based on visual-context vectors and textual-context vectors; and combining the hidden-feature vectors and the query vector utilizing the temporal attention mechanism.

14. The system of claim 8, wherein the one or more processors are configured to cause the system to: detect, utilizing a detection neural network comprising a convolutional neural network, an object portrayed within the video comprises a pop-up dialogue or a panel; and extract, utilizing a graphical-object-matching engine, a textual-feature embedding based on textual elements inside the pop-up dialogue or the panel.

15. The system of claim 14, wherein the one or more processors are configured to cause the system to: compare the textual-feature embedding with feature embeddings of training-sample objects by generating similarity scores indicating a similarity between the textual-feature embedding and a particular feature embedding associated with a training-sample object; and generate, based on the similarity scores, a visual-context vector indicating a visual-feature category for the pop-up dialogue or the panel.

16. A computer-implemented method comprising: extracting a query vector from a question corresponding to a video segment; generating visual-context vectors representing visual features corresponding to the video segment; generating a query-context vector by combining the query vector and the visual-context vectors utilizing a spatial attention mechanism
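Several of the claims combine the query vector with context vectors through an attention mechanism. The claim text shown here does not specify the scoring function, so the sketch below assumes plain dot-product attention with a softmax, which is one common choice and not necessarily what the patent describes. The function names `softmax` and `attend` are hypothetical.

```python
import math


def softmax(scores):
    """Turn raw scores into weights that are positive and sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, context_vectors):
    """Pool context vectors into one vector, weighted by relevance to the query.

    Relevance is the dot product of the query with each context vector
    (an assumed scoring function, not one taken from the claims).
    Returns (pooled_vector, attention_weights).
    """
    scores = [sum(q * c for q, c in zip(query, vec)) for vec in context_vectors]
    weights = softmax(scores)
    dim = len(context_vectors[0])
    pooled = [
        sum(w * vec[i] for w, vec in zip(weights, context_vectors))
        for i in range(dim)
    ]
    return pooled, weights


# Toy example: two visual-context vectors and a query aligned with the first.
query = [1.0, 0.0]
visual_context = [[1.0, 0.0], [0.0, 1.0]]
precursor, weights = attend(query, visual_context)
```

Here the pooled vector plays the role of a precursor query-context vector: the context vector that matches the query receives the larger weight. A fuller sketch would pass this result, together with the textual-context vectors, through a recurrent network, which is omitted here.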

Assignees

Adobe Inc

Inventors

Classifications

G06N3/0455
Auto-encoder networks; Encoder-decoder networks · CPC title
G06N3/09
Supervised learning · CPC title
G06N3/0442
characterised by memory or gating, e.g. long short-term memory [LSTM] or gated recurrent units [GRU] · CPC title
G06N3/0464
Convolutional networks [CNN, ConvNet] · CPC title
G06N3/08 (Primary)
Learning methods · CPC title

Patent family

Related publications grouped by family.

View patent family 77178385

External sources

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US11615308B2 cover?: The present disclosure relates to systems, methods, and non-transitory computer-readable media for generating a response to a question received from a user during display or playback of a video segment by utilizing a query-response-neural network. The disclosed systems can extract a query vector from a question corresponding to the video segment using the query-response-neural network. The disc…
Who is the assignee on this patent?: Adobe Inc

Related patents

Generating a response to a user query utilizing visual features of a video segment and a query-response-neural network

Video visual relation detection methods and systems

System for automated dynamic guidance for diy projects

Spatial and temporal attention-based deep reinforcement learning of hierarchical lane-change policies for controlling an autonomous vehicle

Semantic clustering based retrieval for candidate set expansion
