What technology area does this patent fall under?

Primary CPC classification H04N7/183. Mapped technology areas include Electricity.

When was this patent published?

Publication date Tue Sep 03 2019 00:00:00 GMT+0000 (Coordinated Universal Time) (B2). Legal status and post-grant events are not shown on this page.

What related patents are in patentsdb?

We list 11 related publications on this page (citations in our corpus or others sharing the same primary CPC).

Video retrieval system using adaptive spatiotemporal convolution feature representation with dynamic abstraction for video to language translation

US10402658B2 · US · B2

Patent metadata
Field	Value
Publication number	US-10402658-B2
Application number	US-201715794802-A
Country	US
Kind code	B2
Filing date	Oct 26, 2017
Priority date	Nov 3, 2016
Publication date	Sep 3, 2019
Grant date	Sep 3, 2019

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

Title
What the patent document calls the invention.
Abstract
A short plain-language summary of the technical disclosure.
Assignees and inventors
Who owns or filed the patent and who is credited as inventor.
Key dates
Filing, priority, publication, and grant dates set the timeline.
First independent claim
The legal scope of protection — read this for what is actually claimed.
CPC / IPC classifications
Technology tags used to group this patent with similar filings.
Citations and related patents
Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

A video retrieval system is provided, that includes a set of servers, configured to retrieve a video sequence from a database and forward it to a requesting device responsive to a match between an input text and a caption for the video sequence. The servers are further configured to translate the video sequence into the caption by (A) applying a C3D to image frames of the video sequence to obtain therefor (i) intermediate feature representations across L convolutional layers and (ii) top-layer features, (B) producing a first word of the caption for the video sequence by applying the top-layer features to a LSTM, and (C) producing subsequent words of the caption by (i) dynamically performing spatiotemporal attention and layer attention using the representations to form a context vector, and (ii) applying the LSTM to the context vector, a previous word of the caption, and a hidden state of the LSTM.

First claim

Opening claim text (preview).

What is claimed is: 1. A video retrieval system comprising: a set of servers, configured to retrieve a video sequence from a database of multiple video sequences and forward the video sequence to a requesting hardware device responsive to a match between an input text provided by a user of the requesting hardware device and a video caption for the video sequence, wherein the set of servers are further configured to translate the video sequence into the video caption by applying a three-dimensional Convolutional Neural Network (C3D) to image frames of the video sequence to obtain, for the video sequence, (i) intermediate feature representations across L convolutional layers and (ii) top-layer features, producing a first word of the video caption for the video sequence by applying the top-layer features to a Long Short Term Memory (LSTM), and producing subsequent words of the video caption by (i) dynamically performing spatiotemporal attention and layer attention using the intermediate feature representation to form a context vector, and (ii) applying the LSTM to the context vector, a previous word of the video caption, and a hidden state of the LSTM, wherein each of the intermediate feature representations is extracted at a respective location in a respective one of the L convolution layers, and wherein the spatiotemporal attention and layer attention generates, for each of the intermediate feature representations, two positive weight vectors for a particular time step that respectively measure a relative importance, to the respective location and to the respective one of the L convolutional layers, for producing the subsequent words based on history word information. 2. A video retrieval system comprising: a set of servers, configured to retrieve a video sequence from a database of multiple video sequences and forward the video sequence to a requesting hardware device responsive to a match between an input text provided by a user of the requesting hardware device and a video caption for the video sequence, wherein the set of servers are further configured to translate the video sequence into the video caption by applying a three-dimensional Convolutional Neural Network (C3D) to image frames of the video sequence to obtain, for the video sequence, (i) intermediate feature representations across L convolutional layers and (ii) top-layer features, producing a first word of the video caption for the video sequence by applying the top-layer features to a Long Short Term Memory (LSTM), and producing subsequent words of the video caption by (i) dynamically performing spatiotemporal attention and layer attention using the intermediate feature representation to form a context vector, and (ii) applying the LSTM to the context vector, a previous word of the video caption, and a hidden state of the LSTM, wherein the spatiotemporal attention and layer attention adaptively and sequentially emphasize different ones of the L convolutional layers while imposing attention within local regions of feature maps at each of the L convolutional layers in order to form the context vector. 3. The video retrieval system of claim 2 , wherein the spatiotemporal attention and layer attention selectively uses an attention type selected from the group consisting of a soft attention and a hard attention, wherein the hard attention is configured to use a multi-sample stochastic lower bound to approximate an objective function to be optimized. 4. A video retrieval system comprising: a set of servers, configured to retrieve a video sequence from a database of multiple video sequences and forward the video sequence to a requesting hardware device responsive to a match between an input text provided by a user of the requesting hardware device and a video caption for the video sequence, wherein the set of servers are further configured to translate the video sequence into the video caption by applying a three-dimensional Convolutional Neural Network (C3D) to image frames of the video sequence to obtain, for the video sequence, (i) intermediate feature representations across L convolutional layers and (ii) top-layer features, producing a first word of the video caption for the video sequence by applying the top-layer features to a Long Short Term Memory (LSTM), and producing subsequent words of the video caption by (i) dynamically performing spatiotemporal attention and layer attention using the intermediate feature representation to form a context vector, and (ii) applying the LSTM to the context vector, a previous word of the video caption, and a hidden state of the LSTM, wherein the spatiotemporal attention and layer attention involve direct comparisons between different ones of the L convolutional layers to produce the context vector, the direct comparisons enabled by applying a set of convolutional transformations to map different ones of the intermediate feature representations in different ones of the L convolutional layers to a same semantic-space dimension. 5. A computer-implemented method for video retrieval comprising: retrieving, by a set of servers, a video sequence from a database of multiple video sequences and forwarding the video sequence to a requesting hardware device responsive to a match between an input text provided by a user of the requesting hardware device and a video caption for the video sequence, wherein the method further comprises translating, by the set of servers, the video sequence into the video caption by applying a three-dimensional Convolutional Neural Network (C3D) to image frames of the video sequence to obtain, for the video sequence, (i) intermediate feature representations across L convolutional layers and (ii) top-layer features, producing a first word of the video caption for the video sequence by applying the top-layer features to a Long Short Term Memory (LSTM), and producing subsequent words of the video caption by (i) dynamically performing spatiotemporal attention and layer attention using the intermediate feature representation to form a context vector, and (ii) applying the LSTM to the context vector, a previous word of the video caption, and a hidden state of the LSTM, wherein each of the intermediate feature representations is extracted at a respective location in a respective one of the L convolutional layers, and wherein the spatiotemporal attention and layer attention generates, for each of the intermediate feature representations, two positive weights for a particular time step that respectively measure a relative importance, to the respective location and to the respective one of the L convolutional layers, for producing the subsequent words based on history word information.

Assignees

Inventors

Classifications

H04N7/183Primary
for receiving images from a single remote source · CPC title
G06V10/82
using neural networks · CPC title
G06V20/47
Detecting features for summarising video content · CPC title
G06F18/2148
characterised by the process organisation or structure, e.g. boosting cascade · CPC title
G06N3/045
Combinations of networks · CPC title

Patent family

Related publications grouped by family.

View patent family 62021569

External sources

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US10402658B2 cover?: A video retrieval system is provided, that includes a set of servers, configured to retrieve a video sequence from a database and forward it to a requesting device responsive to a match between an input text and a caption for the video sequence. The servers are further configured to translate the video sequence into the caption by (A) applying a C3D to image frames of the video sequence to obta…
Who is the assignee on this patent?: Nec Lab America Inc, Nec Corp
What technology area does this patent fall under?: Primary CPC classification H04N7/183. Mapped technology areas include Electricity.
When was this patent published?: Publication date Tue Sep 03 2019 00:00:00 GMT+0000 (Coordinated Universal Time) (B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?: We list 11 related publications on this page (citations in our corpus or others sharing the same primary CPC).