US11264009B2 · US · B2
| Field | Value |
|---|---|
| Publication number | US-11264009-B2 |
| Application number | US-201916569679-A |
| Country | US |
| Kind code | B2 |
| Filing date | Sep 13, 2019 |
| Priority date | Sep 13, 2019 |
| Publication date | Mar 1, 2022 |
| Grant date | Mar 1, 2022 |
Official abstract text for this publication.
A computer-implemented method for training a dialogue response generation system and the dialogue response generation system are provided. The method includes arranging a first multimodal encoder-decoder for the dialogue response generation or video description having a first input and a first output, wherein the first multimodal encoder-decoder has been pretrained by training audio-video datasets with training video description sentences, arranging a second multimodal encoder-decoder for dialog response generation having a second input and a second output, providing first audio-visual datasets with first corresponding video description sentences to the first input of the first multimodal encoder-decoder, wherein the first encoder-decoder generates first output values based on the first audio-visual datasets with the first corresponding description sentences, providing the first audio-visual datasets excluding the first corresponding video description sentences to the second multimodal encoder-decoder. In this case, the second multimodal encoder-decoder generates second output values based on the first audio-visual datasets without the first corresponding video description sentences.
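The training scheme in the abstract pairs a pretrained first encoder-decoder, which sees the description sentences, with a second one that does not; the claims add that the error between their outputs may be a cross entropy loss, optionally combined with a mean square error between context vectors. A minimal sketch of such a combined loss follows. It assumes the output values are unnormalized scores over a vocabulary that are turned into probabilities with a softmax; the function names and the `context_weight` factor are hypothetical and do not come from the patent.

```python
import math


def softmax(logits):
    """Turn raw scores into a probability distribution (numerically stable)."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def soft_cross_entropy(first_probs, second_probs, eps=1e-12):
    """Cross entropy of the second model's distribution against the first model's."""
    return -sum(p * math.log(q + eps) for p, q in zip(first_probs, second_probs))


def mean_square_error(a, b):
    """Mean square error between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def combined_loss(first_logits, second_logits,
                  first_context, second_context, context_weight=1.0):
    """Cross entropy between outputs plus weighted MSE between context vectors.

    first_*  : values from the encoder-decoder that received the descriptions
    second_* : values from the encoder-decoder that did not
    context_weight is an assumed balancing factor, not taken from the patent.
    """
    ce = soft_cross_entropy(softmax(first_logits), softmax(second_logits))
    mse = mean_square_error(first_context, second_context)
    return ce + context_weight * mse
```

In a training loop, an optimizer would adjust only the second model's parameters to lower this value, stopping once it falls inside a chosen range. Note that the cross entropy term does not reach zero even when both models agree exactly; it bottoms out at the entropy of the first model's distribution.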
Opening claim text (preview).
We claim:

1. A computer-implemented method for training a dialogue response generation system comprising steps of: arranging a first multimodal encoder-decoder having a first input and a first output, wherein the first multimodal encoder-decoder has been pretrained by training audio-video datasets with training descriptions; arranging a second multimodal encoder-decoder having a second input and a second output; providing first audio-visual datasets with first corresponding description sentences to the first input of the first multimodal encoder-decoder, wherein the first multimodal encoder-decoder generates first output values based on the first audio-visual datasets with the first corresponding description sentences; providing the first audio-visual datasets excluding the first corresponding description sentences to the second multimodal encoder-decoder, wherein the second multimodal encoder-decoder generates second output values based on the first audio-visual datasets without the first corresponding description sentences, wherein an optimizer module updates second network parameters of the second multimodal encoder-decoder until errors between the first output values and the second output values are reduced into a predetermined range, wherein the errors are computed based on a loss function.

2. The computer-implemented method of claim 1, wherein the loss function is a cross entropy loss function.

3. The computer-implemented method of claim 2, the loss function incorporates mean square error between context vectors of the first and the second multimodal encoder-decoders.

4. The computer-implemented method of claim 1, wherein first parameters of the first multimodal encoder-decoder are not updated.

5. The computer-implemented method of claim 1, wherein the optimizer module updates first parameters of the first multimodal encoder-decoder based on a cross entropy loss function.

6. The computer-implemented method of claim 1, wherein the optimizer module updates the second network parameters of the second multimodal encoder-decoder using a back propagation method.

7. The computer-implemented method of claim 1, further comprises providing second audio-visual datasets to the first input of the first multimodal encoder-decoder to generate third audio-visual datasets, wherein the generated third audio-visual datasets are further provided to the second multimodal encoder-decoder to further update the second network parameters.

8. A system for training a dialogue response generation system, comprising: a memory and one or more storage devices storing instructions of a computer-implemented method of claim 1; one or more processors in connection with the memory and one or more storage devices that are operable, when executed by the one or more processors, to cause the one or more processors to perform operations comprising steps of: arranging a first multimodal encoder-decoder having a first input and a first output, wherein the first multimodal encoder-decoder has been pretrained by training audio-video datasets with training descriptions; arranging a second multimodal encoder-decoder having a second input and a second output; providing first audio-visual datasets with first corresponding description sentences to the first input of the first multimodal encoder-decoder, wherein the first multimodal encoder-decoder generates first output values based on the first audio-visual datasets with the first corresponding description sentences; providing the first audio-visual datasets excluding the first corresponding description sentences to the second multimodal encoder-decoder, wherein the second multimodal encoder-decoder generates second output values based on the first audio-visual datasets without the first corresponding description sentences, wherein an optimizer module updates second network parameters of the second multimodal encoder-decoder until errors between the first output values and the second output values are reduced into a predetermined range, wherein the errors are computed based on a loss function.

9. The system of claim 8, wherein the loss function is a cross entropy loss function.

10. The system of claim 9, the loss function incorporates mean square error between context vectors of the first and the second multimodal encoder-decoders.

11. The system of claim 8, wherein first parameters of the first multimodal encoder-decoder are not updated.

12. The system of claim 8, wherein the optimizer module updates first parameters of the first multimodal encoder-decoder based on a cross entropy loss function.

13. The system of claim 8, wherein the optimizer module updates the second network parameters of the second multimodal encoder-decoder using a back propagation method.

14. The system of claim 8, further comprises providing second audio-visual datasets to the first input of the first multimodal encoder-decoder to generate third audio-visual datasets, wherein the generated third audio-visual datasets are further provided to the second multimodal encoder-decoder to further update the second network parameters.

15. A dialogue response generation system comprising: a memory and one or more storage devices storing instructions of multimodal encoder-decoders, wherein the multimodal encoder-decoders have been trained by a computer-implemented method comprising steps of: arranging a first multimodal encoder-decoder having a first input and a first output, wherein the first multimodal encoder-decoder has been pretrained by training audio-video datasets with training descriptions; arranging a second multimodal encoder-decoder having a second input and a second output; providing first audio-visual datasets with first corresponding description sentences to the first input of the first multimodal encoder-decoder, wherein the first multimodal encoder-decoder generates first output values based on the first audio-visual datasets with the first corresponding description sentences; and providing the first audio-visual datasets excluding the first corresponding description sentences to the second multimodal encoder-decoder, wherein the second multimodal encoder-decoder generates second output values based on the first audio-visual datasets without the first corresponding description sentences, wherein an optimizer module updates second network parameters of the second multimodal encoder-decoder until errors between the first output values and the second output values are reduced into a predetermined range, wherein the errors are computed based on a loss function; one or more processors in connection with the memory and one or more storage devices that are operable, when executed by the one or more processors, to cause the one or more processors to perform operations comprising steps of: receiving first and second input vectors according to first and second sequential intervals; extracting first and second feature vectors using first and second feature extractors, respectively, from the first and second input; estimating a first set of weights and a second set of weights, respectively, from the first and second feature vectors and a prestep context vector of a sequence generator; calculating a first context vector from the first set of weights and the first feature vectors, and calculating a second context vector from the second set of weights and the second feature vectors; transforming the first context vector into a first modal context vector having a predetermined dimension and transforming the second context vector into a second modal context vector having the predetermined dimension; estimating a set
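The claim preview is cut off at this point, so the sketch below covers only the three steps that are fully stated in claim 15: estimating per-modality weights from the feature vectors and the previous decoder state, forming a context vector as the weighted sum of features, and mapping each context vector to a shared dimension. The patent text shown here does not say how the weights are scored, so a bilinear dot-product score followed by a softmax is assumed; all function names and the `score_matrix` and `projection` parameters are hypothetical.

```python
import math


def softmax(scores):
    """Normalize scores into weights that sum to one."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def linear(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def attend(feature_vectors, prev_state, score_matrix):
    """Weights over time steps, then the weighted sum of feature vectors.

    score_matrix maps the previous decoder state into the feature space so a
    dot product can be taken; this scoring rule is an assumption.
    """
    query = linear(score_matrix, prev_state)
    scores = [sum(f * q for f, q in zip(feat, query)) for feat in feature_vectors]
    weights = softmax(scores)
    dim = len(feature_vectors[0])
    context = [sum(w * feat[i] for w, feat in zip(weights, feature_vectors))
               for i in range(dim)]
    return weights, context


def modal_context(feature_vectors, prev_state, score_matrix, projection):
    """Attend within one modality, then project to the shared dimension."""
    weights, context = attend(feature_vectors, prev_state, score_matrix)
    return weights, linear(projection, context)
```

Calling `modal_context` once per modality, each with its own `score_matrix` and `projection`, yields vectors of equal length even when the underlying feature extractors produce different dimensions, which is what makes a later combination step possible.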
Natural language query formulation · CPC title
Combinations of networks · CPC title
Recurrent networks, e.g. Hopfield networks · CPC title
characterised by memory or gating, e.g. long short-term memory [LSTM] or gated recurrent units [GRU] · CPC title
Auto-encoder networks; Encoder-decoder networks · CPC title