Method and System for Multi-Modal Fusion Model
US-2018189572-A1 · Jul 5, 2018 · US
US10402658B2 · US · B2
| Field | Value |
|---|---|
| Publication number | US-10402658-B2 |
| Application number | US-201715794802-A |
| Country | US |
| Kind code | B2 |
| Filing date | Oct 26, 2017 |
| Priority date | Nov 3, 2016 |
| Publication date | Sep 3, 2019 |
| Grant date | Sep 3, 2019 |
Official abstract text for this publication.
A video retrieval system is provided that includes a set of servers configured to retrieve a video sequence from a database and forward it to a requesting device responsive to a match between an input text and a caption for the video sequence. The servers are further configured to translate the video sequence into the caption by (A) applying a C3D to image frames of the video sequence to obtain therefor (i) intermediate feature representations across L convolutional layers and (ii) top-layer features, (B) producing a first word of the caption for the video sequence by applying the top-layer features to an LSTM, and (C) producing subsequent words of the caption by (i) dynamically performing spatiotemporal attention and layer attention using the representations to form a context vector, and (ii) applying the LSTM to the context vector, a previous word of the caption, and a hidden state of the LSTM.
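The context-vector step in (C)(i) can be illustrated with a minimal soft-attention sketch. This is not the patented implementation: the dot-product scoring against the hidden state, the function names (`context_vector`, `weighted_sum`), and the toy feature values are all assumptions made for illustration, and the features of every layer are assumed to have already been mapped to a common dimension.

```python
import math

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def weighted_sum(weights, vectors):
    dim = len(vectors[0])
    return [sum(w * vec[d] for w, vec in zip(weights, vectors)) for d in range(dim)]

def context_vector(layer_features, hidden):
    """layer_features[l][i] is the feature vector at location i of layer l.

    Assumption: relevance is scored by a dot product with the LSTM hidden
    state; the patent text shown here does not specify the scoring function.
    """
    per_layer = []
    for locations in layer_features:
        # Spatiotemporal attention: weights over locations within one layer.
        alpha = softmax([dot(f, hidden) for f in locations])
        per_layer.append(weighted_sum(alpha, locations))
    # Layer attention: weights over the L per-layer summaries.
    beta = softmax([dot(s, hidden) for s in per_layer])
    return weighted_sum(beta, per_layer), beta

# Toy input: 2 layers with 2 and 3 locations, feature dimension 2.
features = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[0.5, 0.5], [2.0, 0.0], [0.0, 2.0]],
]
hidden = [1.0, 0.0]
context, layer_weights = context_vector(features, hidden)
```

In a full decoder, `context` would be fed to the LSTM together with the previous word and the hidden state, and the weights would be recomputed at every time step.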
Opening claim text (preview).
What is claimed is:
1. A video retrieval system comprising: a set of servers, configured to retrieve a video sequence from a database of multiple video sequences and forward the video sequence to a requesting hardware device responsive to a match between an input text provided by a user of the requesting hardware device and a video caption for the video sequence, wherein the set of servers are further configured to translate the video sequence into the video caption by applying a three-dimensional Convolutional Neural Network (C3D) to image frames of the video sequence to obtain, for the video sequence, (i) intermediate feature representations across L convolutional layers and (ii) top-layer features, producing a first word of the video caption for the video sequence by applying the top-layer features to a Long Short Term Memory (LSTM), and producing subsequent words of the video caption by (i) dynamically performing spatiotemporal attention and layer attention using the intermediate feature representation to form a context vector, and (ii) applying the LSTM to the context vector, a previous word of the video caption, and a hidden state of the LSTM, wherein each of the intermediate feature representations is extracted at a respective location in a respective one of the L convolution layers, and wherein the spatiotemporal attention and layer attention generates, for each of the intermediate feature representations, two positive weight vectors for a particular time step that respectively measure a relative importance, to the respective location and to the respective one of the L convolutional layers, for producing the subsequent words based on history word information.
2. A video retrieval system comprising: a set of servers, configured to retrieve a video sequence from a database of multiple video sequences and forward the video sequence to a requesting hardware device responsive to a match between an input text provided by a user of the requesting hardware device and a video caption for the video sequence, wherein the set of servers are further configured to translate the video sequence into the video caption by applying a three-dimensional Convolutional Neural Network (C3D) to image frames of the video sequence to obtain, for the video sequence, (i) intermediate feature representations across L convolutional layers and (ii) top-layer features, producing a first word of the video caption for the video sequence by applying the top-layer features to a Long Short Term Memory (LSTM), and producing subsequent words of the video caption by (i) dynamically performing spatiotemporal attention and layer attention using the intermediate feature representation to form a context vector, and (ii) applying the LSTM to the context vector, a previous word of the video caption, and a hidden state of the LSTM, wherein the spatiotemporal attention and layer attention adaptively and sequentially emphasize different ones of the L convolutional layers while imposing attention within local regions of feature maps at each of the L convolutional layers in order to form the context vector.
3. The video retrieval system of claim 2, wherein the spatiotemporal attention and layer attention selectively uses an attention type selected from the group consisting of a soft attention and a hard attention, wherein the hard attention is configured to use a multi-sample stochastic lower bound to approximate an objective function to be optimized.
4. A video retrieval system comprising: a set of servers, configured to retrieve a video sequence from a database of multiple video sequences and forward the video sequence to a requesting hardware device responsive to a match between an input text provided by a user of the requesting hardware device and a video caption for the video sequence, wherein the set of servers are further configured to translate the video sequence into the video caption by applying a three-dimensional Convolutional Neural Network (C3D) to image frames of the video sequence to obtain, for the video sequence, (i) intermediate feature representations across L convolutional layers and (ii) top-layer features, producing a first word of the video caption for the video sequence by applying the top-layer features to a Long Short Term Memory (LSTM), and producing subsequent words of the video caption by (i) dynamically performing spatiotemporal attention and layer attention using the intermediate feature representation to form a context vector, and (ii) applying the LSTM to the context vector, a previous word of the video caption, and a hidden state of the LSTM, wherein the spatiotemporal attention and layer attention involve direct comparisons between different ones of the L convolutional layers to produce the context vector, the direct comparisons enabled by applying a set of convolutional transformations to map different ones of the intermediate feature representations in different ones of the L convolutional layers to a same semantic-space dimension.
5. A computer-implemented method for video retrieval comprising: retrieving, by a set of servers, a video sequence from a database of multiple video sequences and forwarding the video sequence to a requesting hardware device responsive to a match between an input text provided by a user of the requesting hardware device and a video caption for the video sequence, wherein the method further comprises translating, by the set of servers, the video sequence into the video caption by applying a three-dimensional Convolutional Neural Network (C3D) to image frames of the video sequence to obtain, for the video sequence, (i) intermediate feature representations across L convolutional layers and (ii) top-layer features, producing a first word of the video caption for the video sequence by applying the top-layer features to a Long Short Term Memory (LSTM), and producing subsequent words of the video caption by (i) dynamically performing spatiotemporal attention and layer attention using the intermediate feature representation to form a context vector, and (ii) applying the LSTM to the context vector, a previous word of the video caption, and a hidden state of the LSTM, wherein each of the intermediate feature representations is extracted at a respective location in a respective one of the L convolutional layers, and wherein the spatiotemporal attention and layer attention generates, for each of the intermediate feature representations, two positive weights for a particular time step that respectively measure a relative importance, to the respective location and to the respective one of the L convolutional layers, for producing the subsequent words based on history word information.
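The retrieval step shared by the independent claims (returning a video when the user's input text matches its generated caption) might be sketched as follows. The claims do not define what counts as a "match", so the word-overlap criterion, the `retrieve` and `tokenize` helpers, and the sample captions below are purely hypothetical.

```python
def tokenize(text):
    """Lowercase word set with trailing punctuation stripped (assumed scheme)."""
    words = (w.strip(".,!?").lower() for w in text.split())
    return {w for w in words if w}

def retrieve(query, captions):
    """Return video ids whose caption shares at least one word with the query.

    Assumption: a "match" is any word overlap, ranked by overlap count;
    the claims leave the matching criterion open.
    """
    query_words = tokenize(query)
    matches = []
    for video_id, caption in captions.items():
        overlap = len(query_words & tokenize(caption))
        if overlap:
            matches.append((overlap, video_id))
    # Highest overlap first; ties broken by id for deterministic output.
    matches.sort(key=lambda item: (-item[0], item[1]))
    return [video_id for _, video_id in matches]

# Hypothetical captions, as if produced by the captioning model.
captions = {
    "vid_a": "A dog runs across a field.",
    "vid_b": "A man is playing a guitar.",
    "vid_c": "A woman is playing a piano.",
}
results = retrieve("man playing guitar", captions)
```

Here `results` ranks `vid_b` first (three shared words) and `vid_c` second (one shared word), while `vid_a` is not returned.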
for receiving images from a single remote source · CPC title
using neural networks · CPC title
Detecting features for summarising video content · CPC title
characterised by the process organisation or structure, e.g. boosting cascade · CPC title
Combinations of networks · CPC title