Generating a response to a user query utilizing visual features of a video segment and a query-response-neural network
US-11244167-B2 · Feb 8, 2022 · US
US11615308B2 · US · B2
| Field | Value |
|---|---|
| Publication number | US-11615308-B2 |
| Application number | US-202117563901-A |
| Country | US |
| Kind code | B2 |
| Filing date | Dec 28, 2021 |
| Priority date | Feb 6, 2020 |
| Publication date | Mar 28, 2023 |
| Grant date | Mar 28, 2023 |
Official abstract text for this publication.
The present disclosure relates to systems, methods, and non-transitory computer-readable media for generating a response to a question received from a user during display or playback of a video segment by utilizing a query-response-neural network. The disclosed systems can extract a query vector from a question corresponding to the video segment using the query-response-neural network. The disclosed systems further generate context vectors representing both visual cues and transcript cues corresponding to the video segment using context encoders or other layers from the query-response-neural network. By utilizing additional layers from the query-response-neural network, the disclosed systems generate (i) a query-context vector based on the query vector and the context vectors, and (ii) candidate-response vectors representing candidate responses to the question from a domain-knowledge base or other source. To respond to a user's question, the disclosed systems further select a response from the candidate responses based on a comparison of the query-context vector and the candidate-response vectors.
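The final selection step in the abstract can be read as a nearest-vector lookup. The sketch below is only an illustration under stated assumptions: the abstract says "a comparison" without naming a metric, so cosine similarity is assumed here, and the names `cosine_similarity` and `select_response`, along with the toy vectors, are hypothetical and do not come from the patent.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors.

    Returns 0.0 when either vector is all zeros (an assumption made
    here to avoid division by zero).
    """
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def select_response(
    query_context: Sequence[float],
    candidates: List[Tuple[str, Sequence[float]]],
) -> str:
    """Return the candidate text whose vector is most similar to the
    query-context vector (assumed metric: cosine similarity)."""
    if not candidates:
        raise ValueError("at least one candidate response is required")
    best_text, _ = max(
        candidates,
        key=lambda item: cosine_similarity(query_context, item[1]),
    )
    return best_text
```

In use, `select_response(query_context_vector, [(text, vector), ...])` would return the text of the closest candidate. In the disclosed system the vectors come from layers of the query-response-neural network, which this sketch does not model.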
Opening claim text (preview).
What is claimed is:
1. A non-transitory computer-readable medium comprising instructions that, when executed by at least one processor, cause the at least one processor to: extract a query vector from a question corresponding to a video segment; generate one or more context vectors representing at least one of a visual feature corresponding to the video segment or transcript text corresponding to the video segment; generate a query-context vector by combining the query vector and the one or more context vectors utilizing a neural network and one or more attention mechanisms; generate candidate-response vectors representing candidate responses to the question; and select a response from the candidate responses by comparing the query-context vector to the candidate-response vectors.
2. The non-transitory computer-readable medium of claim 1, further comprising instructions that, when executed by the at least one processor, cause the at least one processor to generate the one or more context vectors by: generating visual-context vectors representing visual features corresponding to the video segment; and generating textual-context vectors representing transcript text corresponding to the video segment.
3. The non-transitory computer-readable medium of claim 2, further comprising instructions that, when executed by the at least one processor, cause the at least one processor to generate the query-context vector by: generating, utilizing a spatial attention mechanism, a precursor query-context vector based on a combination of the visual-context vectors and the query vector; and combining the precursor query-context vector and at least one of the textual-context vectors utilizing the neural network.
4. The non-transitory computer-readable medium of claim 2, further comprising instructions that, when executed by the at least one processor, cause the at least one processor to generate the query-context vector by: generating, utilizing the neural network, hidden-feature vectors based on at least one of the textual-context vectors; generating, utilizing a temporal attention mechanism, a precursor query-context vector based on a combination of the hidden-feature vectors and the query vector; and combining the precursor query-context vector and at least one of the visual-context vectors utilizing a spatial attention mechanism.
5. The non-transitory computer-readable medium of claim 2, further comprising instructions that, when executed by the at least one processor, cause the at least one processor to generate the query-context vector by: generating, utilizing the neural network, hidden-feature vectors based on the visual-context vectors and the textual-context vectors; and combining the hidden-feature vectors and the query vector utilizing a temporal attention mechanism.
6. The non-transitory computer-readable medium of claim 1, further comprising instructions that, when executed by the at least one processor, cause the at least one processor to generate a visual-context vector by: detecting, utilizing a detection neural network, an object portrayed within the video segment comprises a pop-up dialogue or panel; and extracting, utilizing a graphical-object-matching engine, a feature embedding based on textual elements inside the object.
7. The non-transitory computer-readable medium of claim 1, further comprising instructions that, when executed by the at least one processor, cause the at least one processor to generate a visual-context vector by utilizing a tool-recognition classifier to detect a software-user-interface tool.
8. A system comprising: one or more memory devices comprising a video and dual-attention mechanisms comprising a spatial attention mechanism and a temporal attention mechanism; and one or more processors configured to cause the system to: extract a query vector from a question corresponding to the video; generate one or more context vectors representing at least one of a visual feature corresponding to the video or transcript text corresponding to the video; generate a query-context vector by combining the query vector and the one or more context vectors utilizing the dual-attention mechanisms; generate candidate-response vectors representing candidate responses to the question; and select a response from the candidate responses by comparing the query-context vector to the candidate-response vectors.
9. The system of claim 8, wherein the one or more processors are configured to cause the system to generate the one or more context vectors by: generating visual-context vectors representing visual features corresponding to video frames of the video; and generating textual-context vectors representing transcript text corresponding to the video frames of the video.
10. The system of claim 8, wherein the one or more processors are configured to cause the system to generate the query-context vector utilizing a neural network.
11. The system of claim 8, wherein the one or more processors are configured to cause the system to generate the query-context vector by: generating, utilizing the spatial attention mechanism, a precursor query-context vector based on a combination of visual-context vectors and the query vector; and combining the precursor query-context vector and textual-context vectors utilizing gated recurrent units (GRUs) of a recurrent neural network.
12. The system of claim 8, wherein the one or more processors are configured to cause the system to generate the query-context vector by: generating, utilizing GRUs of a recurrent neural network, hidden-feature vectors based on textual-context vectors; generating, utilizing the temporal attention mechanism, a precursor query-context vector based on a combination of the hidden-feature vectors and the query vector; and combining the precursor query-context vector and visual-context vectors utilizing the spatial attention mechanism.
13. The system of claim 8, wherein the one or more processors are configured to cause the system to generate the query-context vector by: generating, utilizing GRUs of a recurrent neural network, hidden-feature vectors based on visual-context vectors and textual-context vectors; and combining the hidden-feature vectors and the query vector utilizing the temporal attention mechanism.
14. The system of claim 8, wherein the one or more processors are configured to cause the system to: detect, utilizing a detection neural network comprising a convolutional neural network, an object portrayed within the video comprises a pop-up dialogue or a panel; and extract, utilizing a graphical-object-matching engine, a textual-feature embedding based on textual elements inside the pop-up dialogue or the panel.
15. The system of claim 14, wherein the one or more processors are configured to cause the system to: compare the textual-feature embedding with feature embeddings of training-sample objects by generating similarity scores indicating a similarity between the textual-feature embedding and a particular feature embedding associated with a training-sample object; and generate, based on the similarity scores, a visual-context vector indicating a visual-feature category for the pop-up dialogue or the panel.
16. A computer-implemented method comprising: extracting a query vector from a question corresponding to a video segment; generating visual-context vectors representing visual features corresponding to the video segment; generating a query-context vector by combining the query vector and the visual-context vectors utilizing a spatial attention mechanism
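Several claims combine a query vector with a set of context vectors through an attention mechanism. The claims do not fix the attention formula, so the following is a minimal sketch assuming softmax dot-product attention pooling: each context vector is scored against the query, the scores are normalized, and the weighted sum serves as a stand-in for a precursor query-context vector. The function names `softmax` and `attend` are hypothetical, and the GRU-based combination step recited in the claims is not modeled.

```python
import math
from typing import List, Sequence


def softmax(scores: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a non-empty list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(
    query: Sequence[float],
    contexts: Sequence[Sequence[float]],
) -> List[float]:
    """Pool context vectors into one vector, weighting each by the
    softmax of its dot product with the query (assumed scoring rule).

    Applied over image regions this plays the role of a spatial
    attention; applied over time steps, a temporal attention.
    """
    if not contexts:
        raise ValueError("at least one context vector is required")
    scores = [sum(q * c for q, c in zip(query, ctx)) for ctx in contexts]
    weights = softmax(scores)
    dim = len(contexts[0])
    return [
        sum(w * ctx[i] for w, ctx in zip(weights, contexts))
        for i in range(dim)
    ]
```

Under this assumption, a context vector that aligns strongly with the query dominates the pooled result, while unrelated context vectors contribute little.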
Auto-encoder networks; Encoder-decoder networks · CPC title
Supervised learning · CPC title
characterised by memory or gating, e.g. long short-term memory [LSTM] or gated recurrent units [GRU] · CPC title
Convolutional networks [CNN, ConvNet] · CPC title
Learning methods · CPC title