Artificial intelligence in interactive storytelling
US-2019304157-A1 · Oct 3, 2019 · US
US11210523B2 · US · B2
| Field | Value |
|---|---|
| Publication number | US-11210523-B2 |
| Application number | US-202016783538-A |
| Country | US |
| Kind code | B2 |
| Filing date | Feb 6, 2020 |
| Priority date | Feb 6, 2020 |
| Publication date | Dec 28, 2021 |
| Grant date | Dec 28, 2021 |
Official abstract text for this publication.
A scene aware dialog system includes an input interface to receive a sequence of video frames, contextual information, and a query and a memory configured to store neural networks trained to generate a response to the input query by analyzing one or combination of input sequence of video frames and the input contextual information. The system further includes a processor configured to detect and classify objects in each video frame of the sequence of video frames; determine relationships among the classified objects in each of the video frame; extract features representing the classified objects and the determined relationships for each of the video frame to produce a sequence of feature vectors; and submit the sequence of feature vectors, the input query and the input contextual information to the neural network to generate a response to the input query.
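The processing steps in the abstract can be sketched as a small data flow. This is a minimal illustration under stated assumptions, not the patented implementation: the `Detection` type, the `relate` rule (a toy "left_of" test on box centers), the label vocabularies, and the count-based encoding are all hypothetical stand-ins for the trained object classifier, relationship classifier, and feature extraction network that the abstract describes.

```python
from dataclasses import dataclass
from itertools import permutations

# Hypothetical label vocabularies; a real system would use trained classifiers.
OBJECT_LABELS = ["person", "cup", "table"]
RELATION_LABELS = ["left_of"]


@dataclass(frozen=True)
class Detection:
    label: str   # class assigned by the (assumed) object classifier
    box: tuple   # (x_min, y_min, x_max, y_max) region of interest


def relate(a, b):
    """Toy intra-frame relationship: 'left_of' if a's box center is left of b's."""
    center_a = (a.box[0] + a.box[2]) / 2
    center_b = (b.box[0] + b.box[2]) / 2
    return "left_of" if center_a < center_b else None


def frame_features(detections):
    """Build one feature vector per frame: object counts, then relation counts."""
    object_counts = [0.0] * len(OBJECT_LABELS)
    for det in detections:
        object_counts[OBJECT_LABELS.index(det.label)] += 1.0

    relation_counts = [0.0] * len(RELATION_LABELS)
    # Relationships are only evaluated between objects of the same frame.
    for a, b in permutations(detections, 2):
        rel = relate(a, b)
        if rel is not None:
            relation_counts[RELATION_LABELS.index(rel)] += 1.0

    return object_counts + relation_counts


def video_features(frames):
    """Map a sequence of frames (lists of detections) to a sequence of feature vectors."""
    return [frame_features(dets) for dets in frames]


frames = [
    [Detection("person", (0, 0, 10, 20)), Detection("cup", (30, 5, 35, 10))],
    [Detection("table", (0, 10, 50, 30))],
]
features = video_features(frames)
```

In the described system, the resulting sequence would then be submitted, together with the query and the contextual information, to the trained dialog network; that step is outside the scope of this sketch.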
Opening claim text (preview).
We claim:
1. A scene-aware dialog system, comprising: an input interface configured to receive a sequence of video frames, contextual information, and a query: a memory configured to store at least one neural network comprising a visual scene-aware dialog neural network trained to generate a response to the input query by analyzing one or combination of the input sequence of video frames and the input contextual information provided to the neural network; a processor configured to detect and classify objects in each video frame of the sequence of video frames; integrate region of interests of objects in the sequence of video frames to determine relationships among the classified objects in each of the video frame, wherein at least one video frame of the sequence of video frames includes at least two classified objects, and wherein the relationship between the two classified objects is an intra-frame object relationship confined within the video frame; extract features representing the classified objects and the determined relationships for each of the video frame to produce a sequence of feature vectors, wherein there is one feature vector for one video frame; and submit the sequence of feature vectors, the input query and the input contextual information to the neural network to generate a response to the input query; and an output interface to render the response to the input query.
2. The scene-aware dialog system of claim 1, wherein the input query concerns one or combination of objects, relationships among the objects, and temporal evolutions of the objects in the input sequence of video frames, and wherein the contextual information includes one or combination of audio information and textual information about the input video, such that the neural network is a multi-modal neural network configured to process information of modalities.
3. The scene-aware dialog system of claim 2, wherein the processor is further configured to modify values of each feature vector of the sequence of feature vectors with weighted values of neighboring feature vectors in the sequence of feature vectors.
4. The scene-aware dialog system of claim 3, wherein the values of each of the feature vector are determined as a weighted combination of values of multiple feature vectors fitting a window centered on the feature vector.
5. The scene-aware dialog system of claim 3, wherein the at least one neural network stored in the memory includes an audio visual scene aware dialog (AVSD) neural network trained to prepare the response to the input query, a feature extraction neural network trained to represent the objects and the corresponding relationships among the objects in the sequence of video frames with the sequence of feature vectors, and an aggregation neural network trained to determine the values of each feature vectors of the sequence of feature vectors as a weighted combination of values of multiple feature vectors fitting the window centered on the feature vector.
6. The scene-aware dialog system of claim 5, wherein the AVSD neural network corresponds to an attention-based architecture and includes one or combination of a faster region-based convolutional neural network (faster RCNN) and a 3-dimensional (3D) convolutional neural network (CNN).
7. The scene-aware dialog system of claim 1, wherein the memory stores a set of neural network based classifiers comprising an object classifier configured to detect and classify a predefined type of objects in the input sequence of video frames and a relationship classifier to classify relationships among the classified objects, and wherein the processor is configured to select and execute the selected neural network based classifiers to detect and classify the objects and corresponding relationships among the classified objects in each video frame of the input sequence of video frames.
8. The scene-aware dialog system of claim 7, wherein the processor is further configured to select the object classifier and the relationship classifier from the set of neural network based classifiers based on the input sequence of video frames, the input contextual information, the input query or combination thereof.
9. The scene-aware dialog system of claim 1, wherein the memory stores an object and a relationship classifiers configured to detect and classify objects and their relationship relevant for generating navigation instructions for driving a vehicle, and wherein the processor is configured to generate a navigation instruction using a description and a relationships of an object pertinent to a navigation route to a destination of the vehicle.
10. The scene-aware dialog system of claim 1, wherein the processor is further configured to generate a spatio-temporal scene graph representation (STSGR) model for each frame of the sequence of video frames based on an integrated region of interests and the visual memory, and wherein the at least one neural network is trained to perform spatio-temporal relational learning on training STSGR models of the sequence of video frames to generate responses to training queries.
11. The scene-aware dialog system of claim 10, wherein each STSGR model represents each corresponding video frame as a spatio-temporal visual graphs stream and a semantic graph stream, and wherein the at least one neural network is a multi-head shuffled transformer for generating an object-level graph reasoning, the multi-head shuffled transformer enable shuffling heads of the sequence of feature vectors.
12. The scene-aware dialog system of claim 1, wherein the processor is further configured to aggregate the classified objects and the determined relationships for generating visual memory for each video frame of the sequence of video frames.
13. A scene-aware dialog method, wherein the method uses a processor coupled with stored instructions implementing the method, wherein the instructions, when executed by the processor carry out steps of the method, comprising: receiving a sequence of video frames, contextual information, and a query; detecting and classifying objects in each video frame of the sequence of video frames; integrating region of interests of objects in the sequence of video frames for determining relationships among the classified objects in each of the video frame, wherein at least one video frame of the sequence of video frames includes at least two classified objects, and wherein the relationship between the two classified objects is an intra-frame object relationship confined within the video frame; extracting features representing the classified objects and the determined relationships for each of the video frame to produce a sequence of feature vectors, wherein there is one feature vector for one video frame; submitting the sequence of feature vectors, the input query and the input contextual information to at least one neural network comprising a visual scene-aware dialog neural network trained to generate a response to the input query by analyzing one or combination of input sequence of video frames and the input contextual information; and rendering the response to the input query via an output interface.
14. The method of claim 13, wherein the input query concerns one or combination of objects, relationships among the objects, and temporal evolutions of the objects in the input sequence of video frames, and wherein the contextual information includes one or combination of audio information and textual information about the input video, such that the neural network is a multi-modal neural network configured to process information of different modalities.
15. The method of c
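Claims 3 and 4 describe replacing each feature vector with a weighted combination of the feature vectors inside a window centered on it. The sketch below illustrates that idea only; it is an assumption-laden example, not the claimed aggregation network. The function name `smooth_features`, the fixed hand-picked weights, and the edge handling (truncating the window and renormalizing) are hypothetical choices. Claim 5 describes a trained aggregation neural network determining these values, whereas this example uses constant weights for readability.

```python
def smooth_features(sequence, weights):
    """Replace each vector with a weighted combination of the vectors in a
    window centered on it.

    `weights` must have odd length and a positive sum. Near the ends of the
    sequence the window is truncated and the remaining weights are
    renormalized (an assumption; the source does not specify edge handling).
    """
    if len(weights) % 2 == 0:
        raise ValueError("weights must have odd length so the window is centered")
    half = len(weights) // 2
    smoothed = []
    for t in range(len(sequence)):
        acc = [0.0] * len(sequence[t])
        total = 0.0
        for offset in range(-half, half + 1):
            s = t + offset
            if 0 <= s < len(sequence):  # skip window positions outside the video
                w = weights[offset + half]
                total += w
                for i, value in enumerate(sequence[s]):
                    acc[i] += w * value
        smoothed.append([v / total for v in acc])
    return smoothed


# Three frames, two feature dimensions each (illustrative values only).
sequence = [[0.0, 4.0], [2.0, 0.0], [4.0, 4.0]]
smoothed = smooth_features(sequence, weights=[0.25, 0.5, 0.25])
```

With a three-frame window, the middle vector becomes a blend of itself and both neighbors, which is one simple way neighboring frames could inform each frame's representation.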
Knowledge engineering; Knowledge acquisition · CPC title
using probabilistic graphical models from image or video features, e.g. Markov models or Bayesian networks · CPC title
using neural networks · CPC title
using classification, e.g. of video objects · CPC title
Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items (segmenting video sequences G06V20/49) · CPC title