Scene-aware video dialog

US11210523B2 · US · B2

Patent metadata
Publication number: US-11210523-B2
Application number: US-202016783538-A
Country: US
Kind code: B2
Filing date: Feb 6, 2020
Priority date: Feb 6, 2020
Publication date: Dec 28, 2021
Grant date: Dec 28, 2021

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

A scene aware dialog system includes an input interface to receive a sequence of video frames, contextual information, and a query and a memory configured to store neural networks trained to generate a response to the input query by analyzing one or combination of input sequence of video frames and the input contextual information. The system further includes a processor configured to detect and classify objects in each video frame of the sequence of video frames; determine relationships among the classified objects in each of the video frame; extract features representing the classified objects and the determined relationships for each of the video frame to produce a sequence of feature vectors; and submit the sequence of feature vectors, the input query and the input contextual information to the neural network to generate a response to the input query.
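The data flow in the abstract (objects and relationships per frame, then one feature vector per frame) can be illustrated with a minimal sketch. Everything named here is hypothetical: the vocabularies, the `Frame` type, and the count-based encoding are illustrative stand-ins, and detection is assumed to have already happened. The patent describes trained neural networks for these steps, not the simple counting used below.

```python
from dataclasses import dataclass, field

# Hypothetical vocabularies; the abstract does not specify any.
OBJECT_CLASSES = ["person", "cup", "table"]
RELATIONS = ["holding", "next_to", "on"]


@dataclass
class Frame:
    """Objects and intra-frame relationships already detected for one frame."""
    objects: list[str]
    # Each relationship is (subject index, relation label, object index).
    relationships: list[tuple[int, str, int]] = field(default_factory=list)


def frame_to_feature_vector(frame: Frame) -> list[float]:
    """Encode one frame as object-class counts followed by relation counts."""
    vec = [0.0] * (len(OBJECT_CLASSES) + len(RELATIONS))
    for label in frame.objects:
        vec[OBJECT_CLASSES.index(label)] += 1.0
    for subj, rel, obj in frame.relationships:
        # Both endpoints must index objects of this same frame.
        if not (0 <= subj < len(frame.objects) and 0 <= obj < len(frame.objects)):
            raise ValueError("relationship refers to an object outside this frame")
        vec[len(OBJECT_CLASSES) + RELATIONS.index(rel)] += 1.0
    return vec


def video_to_feature_sequence(frames: list[Frame]) -> list[list[float]]:
    """Produce exactly one feature vector per video frame."""
    return [frame_to_feature_vector(f) for f in frames]


frames = [
    Frame(objects=["person", "cup"], relationships=[(0, "holding", 1)]),
    Frame(objects=["cup", "table"], relationships=[(0, "on", 1)]),
]
features = video_to_feature_sequence(frames)
```

In a full system the resulting sequence would be passed, together with the query and the contextual information, to the dialog network; that step is omitted here because the abstract gives no interface for it.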

First claim

Opening claim text (preview).

We claim:

1. A scene-aware dialog system, comprising: an input interface configured to receive a sequence of video frames, contextual information, and a query: a memory configured to store at least one neural network comprising a visual scene-aware dialog neural network trained to generate a response to the input query by analyzing one or combination of the input sequence of video frames and the input contextual information provided to the neural network; a processor configured to detect and classify objects in each video frame of the sequence of video frames; integrate region of interests of objects in the sequence of video frames to determine relationships among the classified objects in each of the video frame, wherein at least one video frame of the sequence of video frames includes at least two classified objects, and wherein the relationship between the two classified objects is an intra-frame object relationship confined within the video frame; extract features representing the classified objects and the determined relationships for each of the video frame to produce a sequence of feature vectors, wherein there is one feature vector for one video frame; and submit the sequence of feature vectors, the input query and the input contextual information to the neural network to generate a response to the input query; and an output interface to render the response to the input query.

2. The scene-aware dialog system of claim 1, wherein the input query concerns one or combination of objects, relationships among the objects, and temporal evolutions of the objects in the input sequence of video frames, and wherein the contextual information includes one or combination of audio information and textual information about the input video, such that the neural network is a multi-modal neural network configured to process information of modalities.

3. The scene-aware dialog system of claim 2, wherein the processor is further configured to modify values of each feature vector of the sequence of feature vectors with weighted values of neighboring feature vectors in the sequence of feature vectors.

4. The scene-aware dialog system of claim 3, wherein the values of each of the feature vector are determined as a weighted combination of values of multiple feature vectors fitting a window centered on the feature vector.

5. The scene-aware dialog system of claim 3, wherein the at least one neural network stored in the memory includes an audio visual scene aware dialog (AVSD) neural network trained to prepare the response to the input query, a feature extraction neural network trained to represent the objects and the corresponding relationships among the objects in the sequence of video frames with the sequence of feature vectors, and an aggregation neural network trained to determine the values of each feature vectors of the sequence of feature vectors as a weighted combination of values of multiple feature vectors fitting the window centered on the feature vector.

6. The scene-aware dialog system of claim 5, wherein the AVSD neural network corresponds to an attention-based architecture and includes one or combination of a faster region-based convolutional neural network (faster RCNN) and a 3-dimensional (3D) convolutional neural network (CNN).

7. The scene-aware dialog system of claim 1, wherein the memory stores a set of neural network based classifiers comprising an object classifier configured to detect and classify a predefined type of objects in the input sequence of video frames and a relationship classifier to classify relationships among the classified objects, and wherein the processor is configured to select and execute the selected neural network based classifiers to detect and classify the objects and corresponding relationships among the classified objects in each video frame of the input sequence of video frames.

8. The scene-aware dialog system of claim 7, wherein the processor is further configured to select the object classifier and the relationship classifier from the set of neural network based classifiers based on the input sequence of video frames, the input contextual information, the input query or combination thereof.

9. The scene-aware dialog system of claim 1, wherein the memory stores an object and a relationship classifiers configured to detect and classify objects and their relationship relevant for generating navigation instructions for driving a vehicle, and wherein the processor is configured to generate a navigation instruction using a description and a relationships of an object pertinent to a navigation route to a destination of the vehicle.

10. The scene-aware dialog system of claim 1, wherein the processor is further configured to generate a spatio-temporal scene graph representation (STSGR) model for each frame of the sequence of video frames based on an integrated region of interests and the visual memory, and wherein the at least one neural network is trained to perform spatio-temporal relational learning on training STSGR models of the sequence of video frames to generate responses to training queries.

11. The scene-aware dialog system of claim 10, wherein each STSGR model represents each corresponding video frame as a spatio-temporal visual graphs stream and a semantic graph stream, and wherein the at least one neural network is a multi-head shuffled transformer for generating an object-level graph reasoning, the multi-head shuffled transformer enable shuffling heads of the sequence of feature vectors.

12. The scene-aware dialog system of claim 1, wherein the processor is further configured to aggregate the classified objects and the determined relationships for generating visual memory for each video frame of the sequence of video frames.

13. A scene-aware dialog method, wherein the method uses a processor coupled with stored instructions implementing the method, wherein the instructions, when executed by the processor carry out steps of the method, comprising: receiving a sequence of video frames, contextual information, and a query; detecting and classifying objects in each video frame of the sequence of video frames; integrating region of interests of objects in the sequence of video frames for determining relationships among the classified objects in each of the video frame, wherein at least one video frame of the sequence of video frames includes at least two classified objects, and wherein the relationship between the two classified objects is an intra-frame object relationship confined within the video frame; extracting features representing the classified objects and the determined relationships for each of the video frame to produce a sequence of feature vectors, wherein there is one feature vector for one video frame; submitting the sequence of feature vectors, the input query and the input contextual information to at least one neural network comprising a visual scene-aware dialog neural network trained to generate a response to the input query by analyzing one or combination of input sequence of video frames and the input contextual information; and rendering the response to the input query via an output interface.

14. The method of claim 13, wherein the input query concerns one or combination of objects, relationships among the objects, and temporal evolutions of the objects in the input sequence of video frames, and wherein the contextual information includes one or combination of audio information and textual information about the input video, such that the neural network is a multi-modal neural network configured to process information of different modalities.

15. The method of c
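
Claims 3 and 4 describe replacing each feature vector with a weighted combination of the vectors inside a window centered on it. A minimal sketch of that idea follows. The function name `smooth_features`, the fixed weights, and the boundary handling (truncate the window and renormalise) are assumptions made for illustration; claim 5 attributes the weighting to a trained aggregation neural network, which is not reproduced here.

```python
def smooth_features(features, weights):
    """Replace each vector by a weighted combination of its window neighbours.

    `weights` must have odd length; its middle entry applies to the vector
    itself. Near the ends of the sequence the window is truncated and the
    remaining weights are renormalised (an assumption; the claims do not
    say how boundaries are handled).
    """
    if len(weights) % 2 == 0:
        raise ValueError("window length must be odd so it can be centered")
    half = len(weights) // 2
    smoothed = []
    for t in range(len(features)):
        acc = [0.0] * len(features[t])
        total = 0.0
        for k, w in enumerate(weights):
            s = t + k - half  # index of the neighbour covered by weight k
            if 0 <= s < len(features):
                total += w
                for d, value in enumerate(features[s]):
                    acc[d] += w * value
        smoothed.append([a / total for a in acc])
    return smoothed


# Three one-dimensional "feature vectors" and a hypothetical 3-wide window.
sequence = [[0.0], [4.0], [8.0]]
smoothed = smooth_features(sequence, weights=[1.0, 2.0, 1.0])
```

The output keeps one vector per frame, so the sequence length is unchanged; only the values are blended with those of neighbouring frames.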

Assignees

Mitsubishi Electric Res Laboratories Inc

Classifications

  • G06N5/022 (Primary)

    Knowledge engineering; Knowledge acquisition · CPC title

  • using probabilistic graphical models from image or video features, e.g. Markov models or Bayesian networks · CPC title

  • using neural networks · CPC title

  • using classification, e.g. of video objects · CPC title

  • Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items (segmenting video sequences G06V20/49) · CPC title

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US11210523B2 cover?
A scene aware dialog system includes an input interface to receive a sequence of video frames, contextual information, and a query and a memory configured to store neural networks trained to generate a response to the input query by analyzing one or combination of input sequence of video frames and the input contextual information. The system further includes a processor configured to detect an…
Who is the assignee on this patent?
Mitsubishi Electric Res Laboratories Inc
What technology area does this patent fall under?
Primary CPC classification G06N5/022. Mapped technology areas include Physics.
When was this patent published?
Publication date Dec 28, 2021 (B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 2 related publications on this page (citations in our corpus or others sharing the same primary CPC).