Method and system for multi-modal fusion model

US10417498B2 · US · B2

Patent metadata
Field: Value
Publication number: US-10417498-B2
Application number: US-201715472797-A
Country: US
Kind code: B2
Filing date: Mar 29, 2017
Priority date: Dec 30, 2016
Publication date: Sep 17, 2019
Grant date: Sep 17, 2019

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

A system for generating a word sequence includes one or more processors in connection with a memory and one or more storage devices storing instructions causing operations that include receiving first and second input vectors, extracting first and second feature vectors, estimating a first set of weights and a second set of weights, calculating a first content vector from the first set of weights and the first feature vectors, and calculating a second content vector, transforming the first content vector into a first modal content vector having a predetermined dimension and transforming the second content vector into a second modal content vector having the predetermined dimension, estimating a set of modal attention weights, generating a weighted content vector having the predetermined dimension from the set of modal attention weights and the first and second modal content vectors, and generating a predicted word using the sequence generator.
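The fusion step summarized in the abstract can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the patented implementation: the dot-product scoring, the linear projections, the random parameters, and every name (`attend`, `fuse`, `rand_matrix`, the dimensions) are hypothetical choices made only to show the data flow from per-modality attention to a single weighted content vector.

```python
import math
import random


def softmax(scores):
    # Numerically stable softmax over a list of floats.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def matvec(matrix, vec):
    # matrix is a list of rows; returns one value per row.
    return [dot(row, vec) for row in matrix]


def weighted_sum(weights, vectors):
    dim = len(vectors[0])
    return [sum(w * v[i] for w, v in zip(weights, vectors)) for i in range(dim)]


def attend(features, context, score_proj):
    """Attention over the feature vectors of one modality.

    Assumption: scores are dot products between each feature vector and a
    linear projection of the pre-step context vector.
    """
    query = matvec(score_proj, context)
    weights = softmax([dot(query, f) for f in features])
    return weights, weighted_sum(weights, features)  # (weights, content vector)


def fuse(contents, context, projections, modal_score_proj):
    """Project each content vector to a shared dimension, then attend over modalities."""
    modal_contents = [matvec(p, c) for p, c in zip(projections, contents)]
    query = matvec(modal_score_proj, context)
    modal_weights = softmax([dot(query, d) for d in modal_contents])
    return modal_weights, weighted_sum(modal_weights, modal_contents)


def rand_matrix(rows, cols):
    # Random stand-in for learned parameters or extracted features.
    return [[random.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]


random.seed(0)
features1 = rand_matrix(5, 4)  # modality 1: 5 feature vectors of dimension 4
features2 = rand_matrix(6, 3)  # modality 2: 6 feature vectors of dimension 3
context = [random.uniform(-1, 1) for _ in range(2)]  # pre-step context vector

w1, c1 = attend(features1, context, rand_matrix(4, 2))
w2, c2 = attend(features2, context, rand_matrix(3, 2))
projections = [rand_matrix(3, 4), rand_matrix(3, 3)]  # shared dimension: 3
modal_weights, fused = fuse([c1, c2], context, projections, rand_matrix(3, 2))

print("modal attention weights:", modal_weights)
print("weighted content vector:", fused)
```

The two modalities start with different feature dimensions (4 and 3 here), which is why each content vector is first mapped to the same predetermined dimension before the modal attention weights combine them.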

First claim

Opening claim text (preview).

We claim:
1. A system for generating a word sequence from multi-modal input vectors, comprising: one or more processors in connection with a memory and one or more storage devices storing instructions that are operable, when executed by the one or more processors, to cause the one or more processors to perform operations comprising: receiving first and second input vectors according to first and second sequential intervals; extracting first and second feature vectors using first and second feature extractors, respectively, from the first and second input; estimating a first set of weights and a second set of weights, respectively, from the first and second feature vectors and a prestep context vector of a sequence generator; calculating a first content vector from the first set of weights and the first feature vectors, and calculating a second content vector from the second set of weights and the second feature vectors; transforming the first content vector into a first modal content vector having a predetermined dimension and transforming the second content vector into a second modal content vector having the predetermined dimension; estimating a set of modal attention weights from the pre-step context vector and the first and second content vectors or the first and second modal content vectors; generating a weighted content vector having the predetermined dimension from the set of modal attention weights and the first and second modal content vectors; and generating a predicted word using the sequence generator for generating the word sequence from the weighted content vector.
2. The system of claim 1, wherein the first and second sequential intervals are an identical interval.
3. The system of claim 1, wherein the first and second input vectors are different modalities.
4. The system of claim 1, wherein the operations further comprising: accumulating the predicted word into the memory or the one or more storage devices to generate the word sequence.
5. The system of claim 4, wherein the accumulating is continued until an end label is received.
6. The system of claim 1, wherein the operations further comprising: transmitting the predicted word generated from the sequence generator.
7. The system of claim 1, wherein the first and second feature extractors are pretrained Convolutional Neural Networks (CNNs) having been trained for an image or a video classification task.
8. The system of claim 1, wherein the feature extractors are Long Short-Term Memory (LSTM) networks.
9. The system of claim 1, wherein the predicted word having a highest probability in all possible words given the weighted content vector and the prestep context vector is determined.
10. The system of claim 1, wherein the sequence generator employs a Long Short-Term Memory (LSTM) network.
11. The system of claim 1, wherein the first input vector is received via a first input/output (I/O) interface and the second input vector is received via a second I/O interface.
12. A non-transitory computer-readable medium storing software for generating a word sequence from multi-modal input vectors, comprising instructions executable by one or more processors which, upon such execution, cause the one or more processors in connection with a memory to perform operations comprising: receiving first and second input vectors according to first and second sequential intervals; extracting first and second feature vectors using first and second feature extractors, respectively, from the first and second input; estimating a first set of weights and a second set of weights, respectively, from the first and second feature vectors and a prestep context vector of a sequence generator; calculating a first content vector from the first set of weights and the first feature vectors, and calculating a second content vector from the second set of weights and the second feature vectors; transforming the first content vector into a first modal content vector having a predetermined dimension and transforming the second content vector into a second modal content vector having the predetermined dimension; estimating a set of modal attention weights from the pre-step context vector and the first and second content vectors or the first and second modal content vectors; generating a weighted content vector having the predetermined dimension from the set of modal attention weights and the first and second modal content vectors; and generating a predicted word using the sequence generator for generating the word sequence from the weighted content vector.
13. The computer-readable medium of claim 12, wherein the first and second sequential intervals are an identical interval.
14. The computer-readable medium of claim 12, wherein the first and second input vectors are different modalities.
15. The computer-readable medium of claim 12, wherein the operations further comprising: accumulating the predicted word into the memory or the one or more storage devices to generate the word sequence.
16. The computer-readable medium of claim 15, wherein the accumulating is continued until an end label is received.
17. The computer-readable medium of claim 12, wherein the operations further comprising: transmitting the predicted word generated from the sequence generator.
18. The computer-readable medium of claim 12, wherein the first and second feature extractors are pretrained Convolutional Neural Networks (CNNs) having been trained for an image or a video classification task.
19. A method for generating a word sequence from multi-modal input, comprising: receiving first and second input vectors according to first and second sequential intervals; extracting first and second feature vectors using first and second feature extractors, respectively, from the first and second input; estimating a first set of weights and a second set of weights, respectively, from the first and second feature vectors and a prestep context vector of a sequence generator; calculating a first content vector from the first set of weights and the first feature vectors, and calculating a second content vector from the second set of weights and the second feature vectors; transforming the first content vector into a first modal content vector having a predetermined dimension and transforming the second content vector into a second modal content vector having the predetermined dimension; estimating a set of modal attention weights from the prestep context vector and the first and second content vectors or the first and second modal content vectors; generating a weighted content vector having the predetermined dimension from the set of modal attention weights and the first and second modal content vectors; and generating a predicted word using the sequence generator for generating the word sequence from the weighted content vector.
20. The method of claim 19, wherein the first and second sequential intervals are an identical interval.
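The word-by-word generation described in the dependent claims (picking the highest-probability word, accumulating predicted words, and stopping at an end label) can be illustrated with a short loop. This is a hedged sketch only: `END_LABEL`, `generate_sequence`, `toy_step`, and the scripted probabilities are hypothetical stand-ins, and the toy step function replaces the sequence generator (an LSTM in one claimed variant), which in the claims would consume the weighted content vector and the pre-step context vector.

```python
END_LABEL = "<eos>"  # hypothetical end-label token


def generate_sequence(step_fn, initial_context, max_len=20):
    """Accumulate predicted words until the end label is produced.

    step_fn(context) must return (word_probabilities, next_context). In a real
    system it would wrap attention, fusion, and the sequence generator.
    """
    context = initial_context
    words = []
    for _ in range(max_len):
        probs, context = step_fn(context)
        word = max(probs, key=probs.get)  # word with the highest probability
        if word == END_LABEL:
            break
        words.append(word)  # accumulate into the word sequence
    return words


# Scripted per-step distributions standing in for a trained generator.
SCRIPT = [
    {"a": 0.7, "dog": 0.2, END_LABEL: 0.1},
    {"a": 0.1, "dog": 0.8, END_LABEL: 0.1},
    {"a": 0.1, "dog": 0.1, "barks": 0.7, END_LABEL: 0.1},
    {"a": 0.05, "dog": 0.05, END_LABEL: 0.9},
]


def toy_step(context):
    # Here the "context" is just a step counter (an illustrative simplification).
    probs = SCRIPT[min(context, len(SCRIPT) - 1)]
    return probs, context + 1


sentence = generate_sequence(toy_step, initial_context=0)
print(" ".join(sentence))  # prints "a dog barks"
```

The `max_len` cap is a defensive addition of this sketch so the loop terminates even if the end label is never predicted; the claims themselves only describe continuing until an end label is received.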

Assignees

Mitsubishi Electric Res Laboratories Inc

Inventors

Classifications

  • of extracted features · CPC title

  • Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames · CPC title

  • G06V20/41 · Primary

    Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items (segmenting video sequences G06V20/49) · CPC title

  • Recurrent networks, e.g. Hopfield networks · CPC title

  • Combinations of networks · CPC title

Patent family

Related publications grouped by family.

External sources

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US10417498B2 cover?
A system for generating a word sequence includes one or more processors in connection with a memory and one or more storage devices storing instructions causing operations that include receiving first and second input vectors, extracting first and second feature vectors, estimating a first set of weights and a second set of weights, calculating a first content vector from the first set of weigh…
Who is the assignee on this patent?
Mitsubishi Electric Res Laboratories Inc
What technology area does this patent fall under?
Primary CPC classification G06V20/41. Mapped technology areas include Physics.
When was this patent published?
Publication date Sep 17, 2019 (B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 1 related publication on this page (citations in our corpus or others sharing the same primary CPC).