Who is the assignee on this patent?

Mitsubishi Electric Res Laboratories Inc

What technology area does this patent fall under?

Primary CPC classification G06F16/3329. Mapped technology areas include Physics.

When was this patent published?

Publication date Mar 1, 2022 (B2). Legal status and post-grant events are not shown on this page.

What related patents are in patentsdb?

We list 6 related publications on this page (citations in our corpus or others sharing the same primary CPC).

System and method for a dialogue response generation system

US11264009B2 · US · B2

Patent metadata
Field	Value
Publication number	US-11264009-B2
Application number	US-201916569679-A
Country	US
Kind code	B2
Filing date	Sep 13, 2019
Priority date	Sep 13, 2019
Publication date	Mar 1, 2022
Grant date	Mar 1, 2022

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

Title
What the patent document calls the invention.
Abstract
A short plain-language summary of the technical disclosure.
Assignees and inventors
Who owns or filed the patent and who is credited as inventor.
Key dates
Filing, priority, publication, and grant dates set the timeline.
First independent claim
The legal scope of protection — read this for what is actually claimed.
CPC / IPC classifications
Technology tags used to group this patent with similar filings.
Citations and related patents
Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

A computer-implemented method for training a dialogue response generation system and the dialogue response generation system are provided. The method includes arranging a first multimodal encoder-decoder for the dialogue response generation or video description having a first input and a first output, wherein the first multimodal encoder-decoder has been pretrained by training audio-video datasets with training video description sentences, arranging a second multimodal encoder-decoder for dialog response generation having a second input and a second output, providing first audio-visual datasets with first corresponding video description sentences to the first input of the first multimodal encoder-decoder, wherein the first encoder-decoder generates first output values based on the first audio-visual datasets with the first corresponding description sentences, providing the first audio-visual datasets excluding the first corresponding video description sentences to the second multimodal encoder-decoder. In this case, the second multimodal encoder-decoder generates second output values based on the first audio-visual datasets without the first corresponding video description sentences.
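The arrangement in the abstract can be read as a teacher-student setup: the first encoder-decoder sees the descriptions, the second does not, and the second is trained to match the first. The claims add that an optimizer updates the second network's parameters until the error falls into a predetermined range, with cross entropy named as one loss and a mean square error between context vectors as an optional term. The sketch below is a toy illustration under stated assumptions, not the patent's implementation: the teacher output is a fixed probability vector, the student "network" is just a vector of logits, and the names `train_student`, `combined_loss`, `lr`, `tolerance`, and `mse_weight` are hypothetical.

```python
import math


def softmax(logits):
    """Turn raw scores into a probability distribution."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(teacher_probs, student_probs):
    """Cross entropy of the student distribution against the teacher's."""
    return -sum(t * math.log(s)
                for t, s in zip(teacher_probs, student_probs) if t > 0)


def mean_square_error(a, b):
    """Mean square error between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def combined_loss(teacher_probs, student_probs,
                  teacher_context, student_context, mse_weight=1.0):
    """Cross entropy plus a weighted MSE between context vectors.

    The weighting scheme (mse_weight) is an assumption for illustration.
    """
    return (cross_entropy(teacher_probs, student_probs)
            + mse_weight * mean_square_error(teacher_context, student_context))


def train_student(teacher_probs, student_logits,
                  lr=0.5, tolerance=1e-3, max_steps=10_000):
    """Update only the student until its error is within `tolerance`.

    The teacher output is held fixed (compare claim 4). The error is the
    cross entropy minus its minimum possible value, the teacher entropy.
    """
    teacher_entropy = cross_entropy(teacher_probs, teacher_probs)
    logits = list(student_logits)
    for step in range(max_steps):
        student_probs = softmax(logits)
        error = cross_entropy(teacher_probs, student_probs) - teacher_entropy
        if error < tolerance:
            return logits, error, step
        # Gradient of the cross entropy w.r.t. the logits is (student - teacher).
        logits = [z - lr * (s - t)
                  for z, s, t in zip(logits, student_probs, teacher_probs)]
    student_probs = softmax(logits)
    error = cross_entropy(teacher_probs, student_probs) - teacher_entropy
    return logits, error, max_steps


teacher_probs = [0.7, 0.2, 0.1]  # toy teacher output over a 3-word vocabulary
student_logits, final_error, steps = train_student(teacher_probs, [0.0, 0.0, 0.0])
print(steps, final_error, softmax(student_logits))
```

In a real system the student would be a full encoder-decoder updated by back propagation through all of its layers; the stopping rule on a tolerance is the part that mirrors the "predetermined range" wording.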

First claim

Opening claim text (preview).

We claim:

1. A computer-implemented method for training a dialogue response generation system comprising steps of: arranging a first multimodal encoder-decoder having a first input and a first output, wherein the first multimodal encoder-decoder has been pretrained by training audio-video datasets with training descriptions; arranging a second multimodal encoder-decoder having a second input and a second output; providing first audio-visual datasets with first corresponding description sentences to the first input of the first multimodal encoder-decoder, wherein the first multimodal encoder-decoder generates first output values based on the first audio-visual datasets with the first corresponding description sentences; providing the first audio-visual datasets excluding the first corresponding description sentences to the second multimodal encoder-decoder, wherein the second multimodal encoder-decoder generates second output values based on the first audio-visual datasets without the first corresponding description sentences, wherein an optimizer module updates second network parameters of the second multimodal encoder-decoder until errors between the first output values and the second output values are reduced into a predetermined range, wherein the errors are computed based on a loss function.

2. The computer-implemented method of claim 1, wherein the loss function is a cross entropy loss function.

3. The computer-implemented method of claim 2, the loss function incorporates mean square error between context vectors of the first and the second multimodal encoder-decoders.

4. The computer-implemented method of claim 1, wherein first parameters of the first multimodal encoder-decoder are not updated.

5. The computer-implemented method of claim 1, wherein the optimizer module updates first parameters of the first multimodal encoder-decoder based on a cross entropy loss function.

6. The computer-implemented method of claim 1, wherein the optimizer module updates the second network parameters of the second multimodal encoder-decoder using a back propagation method.

7. The computer-implemented method of claim 1, further comprises providing second audio-visual datasets to the first input of the first multimodal encoder-decoder to generate third audio-visual datasets, wherein the generated third audio-visual datasets are further provided to the second multimodal encoder-decoder to further update the second network parameters.

8. A system for training a dialogue response generation system, comprising: a memory and one or more storage devices storing instructions of a computer-implemented method of claim 1; one or more processors in connection with the memory and one or more storage devices that are operable, when executed by the one or more processors, to cause the one or more processors to perform operations comprising steps of: arranging a first multimodal encoder-decoder having a first input and a first output, wherein the first multimodal encoder-decoder has been pretrained by training audio-video datasets with training descriptions; arranging a second multimodal encoder-decoder having a second input and a second output; providing first audio-visual datasets with first corresponding description sentences to the first input of the first multimodal encoder-decoder, wherein the first multimodal encoder-decoder generates first output values based on the first audio-visual datasets with the first corresponding description sentences; providing the first audio-visual datasets excluding the first corresponding description sentences to the second multimodal encoder-decoder, wherein the second multimodal encoder-decoder generates second output values based on the first audio-visual datasets without the first corresponding description sentences, wherein an optimizer module updates second network parameters of the second multimodal encoder-decoder until errors between the first output values and the second output values are reduced into a predetermined range, wherein the errors are computed based on a loss function.

9. The system of claim 8, wherein the loss function is a cross entropy loss function.

10. The system of claim 9, the loss function incorporates mean square error between context vectors of the first and the second multimodal encoder-decoders.

11. The system of claim 8, wherein first parameters of the first multimodal encoder-decoder are not updated.

12. The system of claim 8, wherein the optimizer module updates first parameters of the first multimodal encoder-decoder based on a cross entropy loss function.

13. The system of claim 8, wherein the optimizer module updates the second network parameters of the second multimodal encoder-decoder using a back propagation method.

14. The system of claim 8, further comprises providing second audio-visual datasets to the first input of the first multimodal encoder-decoder to generate third audio-visual datasets, wherein the generated third audio-visual datasets are further provided to the second multimodal encoder-decoder to further update the second network parameters.

15. A dialogue response generation system comprising: a memory and one or more storage devices storing instructions of multimodal encoder-decoders, wherein the multimodal encoder-decoders have been trained by a computer-implemented method comprising steps of: arranging a first multimodal encoder-decoder having a first input and a first output, wherein the first multimodal encoder-decoder has been pretrained by training audio-video datasets with training descriptions; arranging a second multimodal encoder-decoder having a second input and a second output; providing first audio-visual datasets with first corresponding description sentences to the first input of the first multimodal encoder-decoder, wherein the first multimodal encoder-decoder generates first output values based on the first audio-visual datasets with the first corresponding description sentences; and providing the first audio-visual datasets excluding the first corresponding description sentences to the second multimodal encoder-decoder, wherein the second multimodal encoder-decoder generates second output values based on the first audio-visual datasets without the first corresponding description sentences, wherein an optimizer module updates second network parameters of the second multimodal encoder-decoder until errors between the first output values and the second output values are reduced into a predetermined range, wherein the errors are computed based on a loss function; one or more processors in connection with the memory and one or more storage devices that are operable, when executed by the one or more processors, to cause the one or more processors to perform operations comprising steps of: receiving first and second input vectors according to first and second sequential intervals; extracting first and second feature vectors using first and second feature extractors, respectively, from the first and second input; estimating a first set of weights and a second set of weights, respectively, from the first and second feature vectors and a prestep context vector of a sequence generator; calculating a first context vector from the first set of weights and the first feature vectors, and calculating a second context vector from the second set of weights and the second feature vectors; transforming the first context vector into a first modal context vector having a predetermined dimension and transforming the second context vector into a second modal context vector having the predetermined dimension; estimating a set
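The per-modality steps recited in claim 15 (estimate weights from the feature vectors and the sequence generator's previous-step vector, form a context vector as a weighted sum, then transform it to a predetermined dimension) can be sketched as follows. This is a minimal illustration under assumptions, not the claimed implementation: the bilinear dot-product scoring, the softmax normalization, the linear projections, and every name here (`modal_context`, `query_matrix`, `out_matrix`, `prev_state`) are hypothetical choices. The sketch stops where the claim preview is cut off.

```python
import math


def softmax(scores):
    """Normalize scores into weights that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def project(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def modal_context(features, prev_state, query_matrix, out_matrix):
    """Attention over one modality's feature vectors.

    Returns the weights and the context vector projected to the shared
    (predetermined) dimension given by the number of rows of `out_matrix`.
    """
    # Map the previous-step vector into this modality's feature space.
    query = project(query_matrix, prev_state)
    # One score per time step: dot product of feature vector and query.
    scores = [sum(f * q for f, q in zip(feat, query)) for feat in features]
    weights = softmax(scores)
    # Context vector: weighted sum of the feature vectors.
    dim = len(features[0])
    context = [sum(w * feat[d] for w, feat in zip(weights, features))
               for d in range(dim)]
    # Transform to the predetermined dimension.
    return weights, project(out_matrix, context)


prev_state = [1.0, 0.0]

# Toy "audio" stream: two identical 2-dimensional feature vectors.
audio_features = [[1.0, 2.0], [1.0, 2.0]]
audio_query = [[1.0, 0.0], [0.0, 1.0]]
audio_out = [[1.0, 0.0], [0.0, 1.0]]

# Toy "video" stream: three 3-dimensional feature vectors.
video_features = [[0.0, 1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 1.0]]
video_query = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
video_out = [[1.0, 0.0, 0.0], [0.0, 1.0, 1.0]]

audio_weights, audio_modal = modal_context(
    audio_features, prev_state, audio_query, audio_out)
video_weights, video_modal = modal_context(
    video_features, prev_state, video_query, video_out)
print(audio_weights, audio_modal)
print(video_weights, video_modal)
```

Projecting both streams to the same dimension is what lets modalities with different feature sizes (2 and 3 in this toy) be combined afterwards.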

Assignees

Mitsubishi Electric Res Laboratories Inc

Inventors

Classifications

G06F16/3329 · Primary
Natural language query formulation · CPC title
G06N3/045
Combinations of networks · CPC title
G06N3/044
Recurrent networks, e.g. Hopfield networks · CPC title
G06N3/0442
characterised by memory or gating, e.g. long short-term memory [LSTM] or gated recurrent units [GRU] · CPC title
G06N3/0455
Auto-encoder networks; Encoder-decoder networks · CPC title

Patent family

Related publications grouped by family.

View patent family 72322507

External sources

Related patents

Learning affinity via a spatial propagation neural network

Utilizing machine learning to generate parametric distributions for digital bids in a real-time digital bidding environment

Integrated understanding of user characteristics by multimodal processing

Vehicle environment modeling with a camera

Utilizing interactive deep learning to select objects in digital visual media

Encoding and reconstructing inputs using neural networks
