System and method for a dialogue response generation system

US11264009B2 · US · B2

Patent metadata
Field               Value
Publication number  US-11264009-B2
Application number  US-201916569679-A
Country             US
Kind code           B2
Filing date         Sep 13, 2019
Priority date       Sep 13, 2019
Publication date    Mar 1, 2022
Grant date          Mar 1, 2022

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

A computer-implemented method for training a dialogue response generation system and the dialogue response generation system are provided. The method includes arranging a first multimodal encoder-decoder for the dialogue response generation or video description having a first input and a first output, wherein the first multimodal encoder-decoder has been pretrained by training audio-video datasets with training video description sentences, arranging a second multimodal encoder-decoder for dialog response generation having a second input and a second output, providing first audio-visual datasets with first corresponding video description sentences to the first input of the first multimodal encoder-decoder, wherein the first encoder-decoder generates first output values based on the first audio-visual datasets with the first corresponding description sentences, providing the first audio-visual datasets excluding the first corresponding video description sentences to the second multimodal encoder-decoder. In this case, the second multimodal encoder-decoder generates second output values based on the first audio-visual datasets without the first corresponding video description sentences.
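The arrangement in the abstract resembles a teacher-student setup: the first (pretrained) encoder-decoder sees the description sentences, the second does not, and the second is trained to reproduce the first's outputs. The following is a minimal sketch of that idea, not the patent's implementation. It assumes the output values are probability distributions over words, that the cross entropy and mean square error terms named in claims 2 and 3 are simply added with a hypothetical weight `ctx_weight`, and that the student's "parameters" are just its logits. All function names are illustrative.

```python
import math

def softmax(logits):
    """Turn raw scores into a probability distribution."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(teacher_probs, student_probs, eps=1e-12):
    """Cross entropy of the student distribution against the teacher's."""
    return -sum(p * math.log(q + eps) for p, q in zip(teacher_probs, student_probs))

def mean_square_error(a, b):
    """Mean square error between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

def distillation_loss(teacher_probs, student_probs, teacher_ctx, student_ctx, ctx_weight=1.0):
    """Assumed combination: output cross entropy plus weighted context-vector MSE."""
    return (cross_entropy(teacher_probs, student_probs)
            + ctx_weight * mean_square_error(teacher_ctx, student_ctx))

def train_student(teacher_probs, student_logits, lr=0.5, tolerance=1e-3, max_steps=10000):
    """Toy loop: the teacher is fixed; only the student logits are updated.

    Stops once the excess cross entropy over the teacher's own entropy
    falls below `tolerance` (standing in for the "predetermined range").
    """
    logits = list(student_logits)
    floor = cross_entropy(teacher_probs, teacher_probs)  # lowest reachable value
    for _ in range(max_steps):
        probs = softmax(logits)
        if cross_entropy(teacher_probs, probs) - floor < tolerance:
            break
        # Gradient of the cross entropy w.r.t. the logits is (student - teacher).
        logits = [z - lr * (q - p) for z, q, p in zip(logits, probs, teacher_probs)]
    return logits
```

For example, `softmax(train_student([0.7, 0.2, 0.1], [0.0, 0.0, 0.0]))` ends up close to the teacher distribution. A real system would backpropagate the same error through the student's network weights rather than through bare logits.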

First claim

Opening claim text (preview).

We claim:

1. A computer-implemented method for training a dialogue response generation system comprising steps of: arranging a first multimodal encoder-decoder having a first input and a first output, wherein the first multimodal encoder-decoder has been pretrained by training audio-video datasets with training descriptions; arranging a second multimodal encoder-decoder having a second input and a second output; providing first audio-visual datasets with first corresponding description sentences to the first input of the first multimodal encoder-decoder, wherein the first multimodal encoder-decoder generates first output values based on the first audio-visual datasets with the first corresponding description sentences; providing the first audio-visual datasets excluding the first corresponding description sentences to the second multimodal encoder-decoder, wherein the second multimodal encoder-decoder generates second output values based on the first audio-visual datasets without the first corresponding description sentences, wherein an optimizer module updates second network parameters of the second multimodal encoder-decoder until errors between the first output values and the second output values are reduced into a predetermined range, wherein the errors are computed based on a loss function.

2. The computer-implemented method of claim 1, wherein the loss function is a cross entropy loss function.

3. The computer-implemented method of claim 2, the loss function incorporates mean square error between context vectors of the first and the second multimodal encoder-decoders.

4. The computer-implemented method of claim 1, wherein first parameters of the first multimodal encoder-decoder are not updated.

5. The computer-implemented method of claim 1, wherein the optimizer module updates first parameters of the first multimodal encoder-decoder based on a cross entropy loss function.

6. The computer-implemented method of claim 1, wherein the optimizer module updates the second network parameters of the second multimodal encoder-decoder using a back propagation method.

7. The computer-implemented method of claim 1, further comprises providing second audio-visual datasets to the first input of the first multimodal encoder-decoder to generate third audio-visual datasets, wherein the generated third audio-visual datasets are further provided to the second multimodal encoder-decoder to further update the second network parameters.

8. A system for training a dialogue response generation system, comprising: a memory and one or more storage devices storing instructions of a computer-implemented method of claim 1; one or more processors in connection with the memory and one or more storage devices that are operable, when executed by the one or more processors, to cause the one or more processors to perform operations comprising steps of: arranging a first multimodal encoder-decoder having a first input and a first output, wherein the first multimodal encoder-decoder has been pretrained by training audio-video datasets with training descriptions; arranging a second multimodal encoder-decoder having a second input and a second output; providing first audio-visual datasets with first corresponding description sentences to the first input of the first multimodal encoder-decoder, wherein the first multimodal encoder-decoder generates first output values based on the first audio-visual datasets with the first corresponding description sentences; providing the first audio-visual datasets excluding the first corresponding description sentences to the second multimodal encoder-decoder, wherein the second multimodal encoder-decoder generates second output values based on the first audio-visual datasets without the first corresponding description sentences, wherein an optimizer module updates second network parameters of the second multimodal encoder-decoder until errors between the first output values and the second output values are reduced into a predetermined range, wherein the errors are computed based on a loss function.

9. The system of claim 8, wherein the loss function is a cross entropy loss function.

10. The system of claim 9, the loss function incorporates mean square error between context vectors of the first and the second multimodal encoder-decoders.

11. The system of claim 8, wherein first parameters of the first multimodal encoder-decoder are not updated.

12. The system of claim 8, wherein the optimizer module updates first parameters of the first multimodal encoder-decoder based on a cross entropy loss function.

13. The system of claim 8, wherein the optimizer module updates the second network parameters of the second multimodal encoder-decoder using a back propagation method.

14. The system of claim 8, further comprises providing second audio-visual datasets to the first input of the first multimodal encoder-decoder to generate third audio-visual datasets, wherein the generated third audio-visual datasets are further provided to the second multimodal encoder-decoder to further update the second network parameters.

15. A dialogue response generation system comprising: a memory and one or more storage devices storing instructions of multimodal encoder-decoders, wherein the multimodal encoder-decoders have been trained by a computer-implemented method comprising steps of: arranging a first multimodal encoder-decoder having a first input and a first output, wherein the first multimodal encoder-decoder has been pretrained by training audio-video datasets with training descriptions; arranging a second multimodal encoder-decoder having a second input and a second output; providing first audio-visual datasets with first corresponding description sentences to the first input of the first multimodal encoder-decoder, wherein the first multimodal encoder-decoder generates first output values based on the first audio-visual datasets with the first corresponding description sentences; and providing the first audio-visual datasets excluding the first corresponding description sentences to the second multimodal encoder-decoder, wherein the second multimodal encoder-decoder generates second output values based on the first audio-visual datasets without the first corresponding description sentences, wherein an optimizer module updates second network parameters of the second multimodal encoder-decoder until errors between the first output values and the second output values are reduced into a predetermined range, wherein the errors are computed based on a loss function; one or more processors in connection with the memory and one or more storage devices that are operable, when executed by the one or more processors, to cause the one or more processors to perform operations comprising steps of: receiving first and second input vectors according to first and second sequential intervals; extracting first and second feature vectors using first and second feature extractors, respectively, from the first and second input; estimating a first set of weights and a second set of weights, respectively, from the first and second feature vectors and a prestep context vector of a sequence generator; calculating a first context vector from the first set of weights and the first feature vectors, and calculating a second context vector from the second set of weights and the second feature vectors; transforming the first context vector into a first modal context vector having a predetermined dimension and transforming the second context vector into a second modal context vector having the predetermined dimension; estimating a set
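The per-modality steps recited in claim 15 (weights estimated from feature vectors and the previous decoder state, a context vector as a weighted sum, then a transformation to a shared dimension) can be illustrated with a small sketch. The claim preview is cut off after "estimating a set", so the sketch stops at the modal context vector and does not guess at what follows. Dot-product scoring, the `score_matrix` and `proj_matrix` parameters, and every function name here are assumptions made for illustration. The claim does not specify how the weights are computed.

```python
import math

def softmax(scores):
    """Normalize scores into weights that sum to 1."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [dot(row, vector) for row in matrix]

def modal_context(features, prev_state, score_matrix, proj_matrix):
    """Attend over one modality and project the result to a shared dimension.

    features     : list of feature vectors, one per time step
    prev_state   : previous-step state of the sequence generator
    score_matrix : maps prev_state into the feature space (assumed scoring scheme)
    proj_matrix  : maps the context vector to the predetermined dimension
    """
    query = matvec(score_matrix, prev_state)
    # One attention weight per time step of this modality.
    weights = softmax([dot(f, query) for f in features])
    # Context vector: attention-weighted sum of the feature vectors.
    dim = len(features[0])
    context = [sum(w * f[i] for w, f in zip(weights, features)) for i in range(dim)]
    return weights, matvec(proj_matrix, context)
```

Calling `modal_context` once for audio features and once for video features, each with its own matrices, yields two modal context vectors of the same length even when the raw feature dimensions differ.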

Assignees

Mitsubishi Electric Res Laboratories Inc

Inventors

Classifications

  • Natural language query formulation · CPC title

  • Combinations of networks · CPC title

  • Recurrent networks, e.g. Hopfield networks · CPC title

  • characterised by memory or gating, e.g. long short-term memory [LSTM] or gated recurrent units [GRU] · CPC title

  • Auto-encoder networks; Encoder-decoder networks · CPC title

Patent family

Related publications grouped by family.

External sources

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US11264009B2 cover?
A computer-implemented method for training a dialogue response generation system and the dialogue response generation system are provided. The method includes arranging a first multimodal encoder-decoder for the dialogue response generation or video description having a first input and a first output, wherein the first multimodal encoder-decoder has been pretrained by training audio-video datas…
Who is the assignee on this patent?
Mitsubishi Electric Res Laboratories Inc
What technology area does this patent fall under?
Primary CPC classification G06F16/3329. Mapped technology areas include Physics.
When was this patent published?
Publication date: Mar 1, 2022 (B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 6 related publications on this page (citations in our corpus or others sharing the same primary CPC).