Systems and methods for video paragraph captioning using hierarchical recurrent neural networks

US2017127016A1 · US · A1

Patent metadata
Publication number: US-2017127016-A1
Application number: US-201615183678-A
Country: US
Kind code: A1
Filing date: Jun 15, 2016
Priority date: Oct 29, 2015
Publication date: May 4, 2017
Grant date:

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

Described herein are systems and methods that exploit hierarchical Recurrent Neural Networks (RNNs) to tackle the video captioning problem; that is, generating one or multiple sentences to describe a realistic video. Embodiments of the hierarchical framework comprise a sentence generator and a paragraph generator. In embodiments, the sentence generator produces one simple short sentence that describes a specific short video interval. In embodiments, it exploits both temporal- and spatial-attention mechanisms to selectively focus on visual elements during generation. In embodiments, the paragraph generator captures the inter-sentence dependency by taking as input the sentential embedding produced by the sentence generator, combining it with the paragraph history, and outputting the new initial state for the sentence generator.
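The two-level control flow in the abstract can be illustrated with a toy sketch. It is not the patented implementation: the element-wise `rnn_step`, the `tanh` combinations, the fixed weights, and the `EMBED` table are all assumptions made for illustration, and the visual attention path is left out. It shows only that the word-level state updates at every word, the sentence-level state updates once per sentence, and the resulting paragraph state becomes the initial state for the next sentence.

```python
import math

def rnn_step(hidden, inp, w_h=0.5, w_x=0.5):
    """Toy element-wise recurrent update (a stand-in for a gated RNN)."""
    return [math.tanh(w_h * h + w_x * x) for h, x in zip(hidden, inp)]

def run_paragraph(sentences, embed, dim=2):
    """Run the two-level loop over already tokenized sentences.

    Returns the initial word-level state used for each sentence and the
    number of updates performed at the word level and sentence level.
    """
    word_hidden = [0.0] * dim   # first (word-level) recurrent layer
    para_hidden = [0.0] * dim   # second (sentence-level) recurrent layer
    initial_states, word_updates, para_updates = [], 0, 0

    for sentence in sentences:
        initial_states.append(list(word_hidden))
        word_vecs = []
        for word in sentence:
            vec = embed[word]
            word_hidden = rnn_step(word_hidden, vec)  # updates at every word
            word_vecs.append(vec)
            word_updates += 1

        # Sentence embedding: averaged word embeddings combined with the
        # last word-level hidden state (the tanh combination is an assumption).
        avg = [sum(col) / len(word_vecs) for col in zip(*word_vecs)]
        sent_emb = [math.tanh(a + h) for a, h in zip(avg, word_hidden)]

        # The sentence-level layer updates only after a full sentence.
        para_hidden = rnn_step(para_hidden, sent_emb)
        para_updates += 1

        # Paragraph state: becomes the initial state for the next sentence.
        word_hidden = [math.tanh(p + s) for p, s in zip(para_hidden, sent_emb)]

    return initial_states, word_updates, para_updates

# Hypothetical two-dimensional word embeddings, for illustration only.
EMBED = {
    "a": [1.0, 0.0], "man": [0.0, 1.0],
    "he": [0.5, 0.5], "cooks": [1.0, 1.0], "rice": [0.0, 0.5],
}

states, n_word, n_para = run_paragraph(
    [["a", "man"], ["he", "cooks", "rice"]], EMBED
)
```

In this toy run the first sentence starts from a zero state, while the second starts from a state derived from the first sentence, which is the inter-sentence dependency the abstract describes.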

First claim

Opening claim text (preview).

1. A computer-implemented method for automating describing a video with a paragraph comprising multiple sentences, the method comprising: producing, using a sentence generator, multiple single sentences, each single sentence being generated sequentially and describing a specific time span and video region or regions within the video, the sentence generator comprising: a word embedding layer word converting an input word into a word embedding; a first recurrent layer for language modeling based at least on the word embedding, the first recurrent layer updating its hidden state when receiving a word embedding; an attention model coupled to the first recurrent layer for selectively focusing on input video features in a video feature pool; and a multimodal layer for integrating outputs from the first recurrent layer and the attention model to connect vision components with the language model; using a paragraph generator to affect inter-sentence dependency for the sentence generator when producing a next sentence, the paragraph generator comprising: a sentence embedding layer receiving word embeddings from the word embedding layer and current hidden state of the first recurrent layer to output a sentence embedding; a second recurrent layer linked to the sentence embedding layer for inter-sentence dependency modeling, the second recurrent layer updating its hidden state when receiving a sentence embedding; and a paragraph state component combining current hidden state of the second recurrent layer and the sentence embedding to generate of a current paragraph state as an initial hidden state when the first recurrent layer is reinitialized for next sentence generation.

2. The computer-implemented method of claim 1 wherein the first recurrent layer and second recurrent layer are asynchronous, the first recurrent layer updating its current hidden state at every time step and the second recurrent layer only updating its hidden state when a full sentence has been processed.
3. The computer-implemented method of claim 1 wherein the second recurrent layer operates whenever a full sentence goes through the sentence generator and the final sentence representation is produced by the sentence embedding layer.

4. The computer-implemented method of claim 1 wherein the attention model exploits both temporal- and spatial-attention mechanisms for selectively focusing on input video features.

5. The computer-implemented method of claim 1 wherein the attention model comprises: at least one attention layer and a sequential Softmax layer to compute attention weights for features in the video feature pool; and a weighted average module performing weighted averaging from the calculated attention weights for a single feature vector.

6. The computer-implemented method of claim 1 wherein the sentence generator further comprises a Softmax layer coupled to the multimodal layer and a MaxID layer selects an index that points to the maximal value in the output of the Softmax layer as a predicted word.

7. The computer-implemented method of claim 7 wherein the predicted word is fed back to the word embedding layer of the sentence generator as next input word.

8. The computer-implemented method of claim 1 wherein the sentence embedding layer receiving all word embeddings from the sentence generator via an embedding average layer, which accumulates all the word embeddings of the sentence currently generated and takes an average to get a compact embedding vector.

9. The computer-implemented method of claim 1 wherein the last hidden state of the first recurrent layer is taken as a compact representation for the sentence.

10. A computer-implemented method for generating multiple sentences to describe a video, the method comprising: receiving a one-hot vector input at a word embedding layer and converting the input to a dense representation in a dimensional space with each row as a word embedding; receiving the word embeddings at a first Recurrent Neural Network (RNN) for its hidden state updating and encoding a sentence semantics in a compact form up to the word embeddings that have been fed in; outputting the compact form to at least one attention layer and a sequential softmax layer to compute attention weights for features in a video feature pool; obtaining a weighted sum by weighted averaging in a weighted average block; feeding the weighted sum and the output of the first RNN into a multimodal layer; feeding the output of the multimodal layer into a hidden layer and then feeding the output of the hidden layer into a Softmax layer; picking an index pointing to the maximal value in an output of the Softmax layer as a predicted word; feeding back the predicted word to the word embedding layer again as a next input word; repeating above steps until an end-of-sentence symbol received at the wording embedding layer to generate a complete sentence; and receiving, at the first RNN, a reinitialization input from a paragraph generator such that the first RNN is reinitialized for a next sentence generation.

11. The computer-implemented method of claim 10 wherein the hidden layer has the same dimension as the word embedding layer.

12. The computer-implemented method of claim 10 wherein the attention weights for features in a video feature pool are calculated via at least one channel with each feature channel having a different set of weights and biases to be determined.

13. The computer-implemented method of claim 10 wherein the reinitialization input is generated at the paragraph generator based at least on word embeddings from the sentence generator and last hidden state of the first RNN.
14. The computer-implemented method of claim 10 wherein the at least one attention layer comprises a first attention layer projecting the hidden state of the first RNN and the features from the video feature pool into a dimensional space.

15. The computer-implemented method of claim 14 wherein the at least one attention layer comprises a second attention layer compressing the projected dimensional space into a scalar for each feature.

16. The computer-implemented method of claim 10 wherein when the first RNN receives the reinitialization input from the paragraph generator, the word embedding layer accepts a begin-of-sentence (BOS) symbol to start new sentence generation.

17. The computer-implemented method of claim 10 wherein the first RNN is trained in a hierarchical framework with randomly initialized parameters.

18. The computer-implemented method of claim 10 wherein the next input word is provided by annotated sentences of an annotated video clip in a training process.

19. A non-transitory computer-readable medium or media comprising one or more sequences of instructions which, when executed by one or more processors, causes the steps to be performed comprising: receiving, at a sentence embedding layer, word embeddings from a sentence generator and a current hidden state of a first gated Recurrent Neural Network (RNN) within a sentence generator; outputting a sentence representation from the sentence embedding layer; receiving, at a second gated RNN, the sentence representation for inter-sentence dependency modeling, the second gated RNN updating its hidden state whenever a full sentence goes through the sentence generator and the sentence representation is produced by the sentence embedding layer; combining the updated hidden state of the second gated RNN and the sentence representation at a paragraph state layer fo
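
The attention path recited in claims 5, 14, and 15 (project the hidden state and each video feature into a shared space, compress each projection to a scalar, apply a softmax, then take a weighted average) can be sketched as follows. This is a minimal illustration under stated assumptions, not the claimed implementation: the `tanh` nonlinearity, the parameter names `W_f`, `W_h`, `b`, and `w_out`, and the toy values are hypothetical and do not come from the claims.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scalar scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def matvec(matrix, vec):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]

def attend(hidden, features, W_f, W_h, b, w_out):
    """Return (attention weights, weighted-average feature vector)."""
    proj_h = matvec(W_h, hidden)
    scores = []
    for feat in features:
        proj_f = matvec(W_f, feat)
        # First attention layer: project the feature and the hidden state
        # into a shared space (tanh is an assumed nonlinearity).
        layer1 = [math.tanh(pf + ph + bi)
                  for pf, ph, bi in zip(proj_f, proj_h, b)]
        # Second attention layer: compress the projection to one scalar.
        scores.append(sum(w * a for w, a in zip(w_out, layer1)))

    weights = softmax(scores)  # one weight per feature in the pool
    dim = len(features[0])
    # Weighted average: collapse the pool into a single feature vector.
    pooled = [sum(wt * feat[i] for wt, feat in zip(weights, features))
              for i in range(dim)]
    return weights, pooled

# Hypothetical feature pool (three 2-d features) and RNN hidden state.
features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
hidden = [0.2, -0.1]
W_f = [[0.3, -0.2], [0.1, 0.4]]
W_h = [[0.5, 0.0], [0.0, 0.5]]
b = [0.0, 0.1]
w_out = [1.0, -1.0]

weights, pooled = attend(hidden, features, W_f, W_h, b, w_out)
```

Claim 12 describes separate weights and biases per feature channel; under that reading, `attend` would be called once per channel with its own parameter set.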

Assignees

Baidu USA LLC

Inventors

Classifications

  • G06N3/084 · Primary

    Backpropagation, e.g. using gradient descent · CPC title

  • Recurrent networks, e.g. Hopfield networks · CPC title

  • Data services, e.g. news ticker {(systems specially adapted for using meteorological information in broadcast systems H04H60/71)} · CPC title

  • characterised by memory or gating, e.g. long short-term memory [LSTM] or gated recurrent units [GRU] · CPC title

  • Supervised learning · CPC title

Patent family

Related publications grouped by family.

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US2017127016A1 cover?
Described herein are systems and methods that exploit hierarchical Recurrent Neural Networks (RNNs) to tackle the video captioning problem; that is, generating one or multiple sentences to describe a realistic video. Embodiments of the hierarchical framework comprise a sentence generator and a paragraph generator. In embodiments, the sentence generator produces one simple short sentence that de…
Who is the assignee on this patent?
Baidu USA LLC
What technology area does this patent fall under?
Primary CPC classification G06N3/084. Mapped technology areas include Physics.
When was this patent published?
Publication date May 4, 2017 (A1). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 8 related publications on this page (citations in our corpus or others sharing the same primary CPC).