Multilingual image question answering
US-2016342895-A1 · Nov 24, 2016 · US
US10417498B2 · US · B2
| Field | Value |
|---|---|
| Publication number | US-10417498-B2 |
| Application number | US-201715472797-A |
| Country | US |
| Kind code | B2 |
| Filing date | Mar 29, 2017 |
| Priority date | Dec 30, 2016 |
| Publication date | Sep 17, 2019 |
| Grant date | Sep 17, 2019 |
Official abstract text for this publication.
A system for generating a word sequence includes one or more processors in connection with a memory and one or more storage devices storing instructions causing operations that include receiving first and second input vectors, extracting first and second feature vectors, estimating a first set of weights and a second set of weights, calculating a first content vector from the first set of weights and the first feature vectors, and calculating a second content vector, transforming the first content vector into a first modal content vector having a predetermined dimension and transforming the second content vector into a second modal content vector having the predetermined dimension, estimating a set of modal attention weights, generating a weighted content vector having the predetermined dimension from the set of modal attention weights and the first and second modal content vectors, and generating a predicted word using the sequence generator.
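The data flow in the abstract can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the abstract does not fix a scoring function, so dot-product scoring with a softmax is swapped in here, and all names (`attend`, `fuse`, `random_matrix`, the dimensions, the random parameters) are hypothetical stand-ins for trained components.

```python
import math
import random


def softmax(scores):
    """Normalize raw scores into weights that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(matrix, vec):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def weighted_sum(weights, vectors):
    """Combine equal-length vectors using the given weights."""
    dim = len(vectors[0])
    return [sum(w * v[i] for w, v in zip(weights, vectors)) for i in range(dim)]


def attend(features, query_matrix, context):
    """One modality: estimate a set of weights over its feature vectors,
    then calculate the content vector as their weighted sum.
    Dot-product scoring is an assumption, not taken from the patent."""
    query = matvec(query_matrix, context)  # map the context into feature space
    weights = softmax([dot(query, f) for f in features])
    return weights, weighted_sum(weights, features)


def fuse(contents, projections, modal_query_matrix, context):
    """Transform each content vector to the common dimension, estimate modal
    attention weights, and return the weighted content vector."""
    modal_contents = [matvec(p, c) for p, c in zip(projections, contents)]
    query = matvec(modal_query_matrix, context)
    modal_weights = softmax([dot(query, d) for d in modal_contents])
    return modal_weights, weighted_sum(modal_weights, modal_contents)


def random_matrix(rows, cols, rng):
    """Hypothetical stand-in for trained parameters or extracted features."""
    return [[rng.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
CONTEXT_DIM, COMMON_DIM = 4, 5
first_features = random_matrix(6, 3, rng)   # 6 feature vectors of size 3
second_features = random_matrix(9, 2, rng)  # 9 feature vectors of size 2
context = [rng.uniform(-1, 1) for _ in range(CONTEXT_DIM)]

first_weights, first_content = attend(
    first_features, random_matrix(3, CONTEXT_DIM, rng), context)
second_weights, second_content = attend(
    second_features, random_matrix(2, CONTEXT_DIM, rng), context)

projections = [random_matrix(COMMON_DIM, 3, rng), random_matrix(COMMON_DIM, 2, rng)]
modal_weights, fused = fuse(
    [first_content, second_content], projections,
    random_matrix(COMMON_DIM, CONTEXT_DIM, rng), context)
```

In this sketch the two modalities have different sequence lengths and feature sizes, yet `fused` always has `COMMON_DIM` entries, which mirrors the "predetermined dimension" requirement in the abstract.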
Opening claim text (preview).
We claim:
1. A system for generating a word sequence from multi-modal input vectors, comprising: one or more processors in connection with a memory and one or more storage devices storing instructions that are operable, when executed by the one or more processors, to cause the one or more processors to perform operations comprising: receiving first and second input vectors according to first and second sequential intervals; extracting first and second feature vectors using first and second feature extractors, respectively, from the first and second input; estimating a first set of weights and a second set of weights, respectively, from the first and second feature vectors and a pre-step context vector of a sequence generator; calculating a first content vector from the first set of weights and the first feature vectors, and calculating a second content vector from the second set of weights and the second feature vectors; transforming the first content vector into a first modal content vector having a predetermined dimension and transforming the second content vector into a second modal content vector having the predetermined dimension; estimating a set of modal attention weights from the pre-step context vector and the first and second content vectors or the first and second modal content vectors; generating a weighted content vector having the predetermined dimension from the set of modal attention weights and the first and second modal content vectors; and generating a predicted word using the sequence generator for generating the word sequence from the weighted content vector.
2. The system of claim 1, wherein the first and second sequential intervals are an identical interval.
3. The system of claim 1, wherein the first and second input vectors are different modalities.
4. The system of claim 1, wherein the operations further comprising: accumulating the predicted word into the memory or the one or more storage devices to generate the word sequence.
5. The system of claim 4, wherein the accumulating is continued until an end label is received.
6. The system of claim 1, wherein the operations further comprising: transmitting the predicted word generated from the sequence generator.
7. The system of claim 1, wherein the first and second feature extractors are pretrained Convolutional Neural Networks (CNNs) having been trained for an image or a video classification task.
8. The system of claim 1, wherein the feature extractors are Long Short-Term Memory (LSTM) networks.
9. The system of claim 1, wherein the predicted word having a highest probability in all possible words given the weighted content vector and the pre-step context vector is determined.
10. The system of claim 1, wherein the sequence generator employs a Long Short-Term Memory (LSTM) network.
11. The system of claim 1, wherein the first input vector is received via a first input/output (I/O) interface and the second input vector is received via a second I/O interface.
12. A non-transitory computer-readable medium storing software for generating a word sequence from multi-modal input vectors, comprising instructions executable by one or more processors which, upon such execution, cause the one or more processors in connection with a memory to perform operations comprising: receiving first and second input vectors according to first and second sequential intervals; extracting first and second feature vectors using first and second feature extractors, respectively, from the first and second input; estimating a first set of weights and a second set of weights, respectively, from the first and second feature vectors and a pre-step context vector of a sequence generator; calculating a first content vector from the first set of weights and the first feature vectors, and calculating a second content vector from the second set of weights and the second feature vectors; transforming the first content vector into a first modal content vector having a predetermined dimension and transforming the second content vector into a second modal content vector having the predetermined dimension; estimating a set of modal attention weights from the pre-step context vector and the first and second content vectors or the first and second modal content vectors; generating a weighted content vector having the predetermined dimension from the set of modal attention weights and the first and second modal content vectors; and generating a predicted word using the sequence generator for generating the word sequence from the weighted content vector.
13. The computer-readable medium of claim 12, wherein the first and second sequential intervals are an identical interval.
14. The computer-readable medium of claim 12, wherein the first and second input vectors are different modalities.
15. The computer-readable medium of claim 12, wherein the operations further comprising: accumulating the predicted word into the memory or the one or more storage devices to generate the word sequence.
16. The computer-readable medium of claim 15, wherein the accumulating is continued until an end label is received.
17. The computer-readable medium of claim 12, wherein the operations further comprising: transmitting the predicted word generated from the sequence generator.
18. The computer-readable medium of claim 12, wherein the first and second feature extractors are pretrained Convolutional Neural Networks (CNNs) having been trained for an image or a video classification task.
19. A method for generating a word sequence from multi-modal input, comprising: receiving first and second input vectors according to first and second sequential intervals; extracting first and second feature vectors using first and second feature extractors, respectively, from the first and second input; estimating a first set of weights and a second set of weights, respectively, from the first and second feature vectors and a pre-step context vector of a sequence generator; calculating a first content vector from the first set of weights and the first feature vectors, and calculating a second content vector from the second set of weights and the second feature vectors; transforming the first content vector into a first modal content vector having a predetermined dimension and transforming the second content vector into a second modal content vector having the predetermined dimension; estimating a set of modal attention weights from the pre-step context vector and the first and second content vectors or the first and second modal content vectors; generating a weighted content vector having the predetermined dimension from the set of modal attention weights and the first and second modal content vectors; and generating a predicted word using the sequence generator for generating the word sequence from the weighted content vector.
20. The method of claim 19, wherein the first and second sequential intervals are an identical interval.
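The dependent claims on word selection and accumulation (choosing the word with the highest probability, and accumulating predicted words until an end label is received) can be illustrated with a small greedy loop. This is a minimal sketch, not the patented sequence generator: `greedy_decode`, `toy_step`, the `"<eos>"` end label, and the `max_words` cap are all hypothetical, and `toy_step` simply replays a fixed script in place of a trained LSTM.

```python
def greedy_decode(step_fn, initial_state, end_label="<eos>", max_words=20):
    """Accumulate the most probable word at each step until the end label.

    step_fn(state) must return (probabilities_by_word, next_state).
    max_words is a safety cap added for this sketch only.
    """
    words = []
    state = initial_state
    for _ in range(max_words):
        probs, state = step_fn(state)
        word = max(probs, key=probs.get)  # highest-probability word
        if word == end_label:
            break  # end label received: stop accumulating
        words.append(word)
    return words


def toy_step(state):
    """Hypothetical stand-in for one sequence-generator step.

    The state is just a position in a fixed script; the scripted word
    gets probability 0.7 and every other word gets 0.1.
    """
    script = ["a", "dog", "barks", "<eos>"]
    target = script[min(state, len(script) - 1)]
    probs = {w: (0.7 if w == target else 0.1) for w in script}
    return probs, state + 1


print(greedy_decode(toy_step, 0))  # ['a', 'dog', 'barks']
```

In a real system the step function would also take the weighted content vector and the pre-step context vector as inputs; they are omitted here to keep the loop itself in focus.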
of extracted features · CPC title
Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames · CPC title
Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items (segmenting video sequences G06V20/49) · CPC title
Recurrent networks, e.g. Hopfield networks · CPC title
Combinations of networks · CPC title