Method for video recognition and related products

US12254690B2 · US · B2

Patent metadata
Field: Value
Publication number: US-12254690-B2
Application number: US-202217932360-A
Country: US
Kind code: B2
Filing date: Sep 15, 2022
Priority date: Mar 26, 2020
Publication date: Mar 18, 2025
Grant date: Mar 18, 2025

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

A method for video recognition and related products are provided. The method includes the following. An original set of clip descriptors is obtained by providing multiple clips of a video as an input of a 3D CNN of a neural network, where the neural network includes the 3D CNN and at least one first fully connected layer, and each of the multiple clips includes at least one frame. An attention vector corresponding to the original set of clip descriptors is determined. An enhanced set of clip descriptors is obtained based on the original set of clip descriptors and the attention vector. The enhanced set of clip descriptors is input into the at least one first fully connected layer and video recognition is performed based on an output of the at least one first fully connected layer.
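The data flow in the abstract can be sketched in a few lines of plain Python. This is only an illustrative sketch, not the patented implementation: the 3D CNN is not modeled (its clip descriptors are simply passed in as lists of numbers), the gating and residual steps follow claims 2, 3, 5, and 6 as previewed on this page, and several details are assumptions because the page does not state them: pooling each clip descriptor to one scalar, using a sigmoid as the final activation, and averaging over clips to form the vector fed to the first fully connected layer. All function names, weight shapes, and numbers are hypothetical.

```python
import math

def relu(v):
    return [max(0.0, x) for x in v]

def sigmoid(v):
    return [1.0 / (1.0 + math.exp(-x)) for x in v]

def matvec(w, v):
    """Multiply matrix w (a list of rows) by vector v."""
    return [sum(wij * vj for wij, vj in zip(row, v)) for row in w]

def softmax(v):
    m = max(v)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in v]
    total = sum(exps)
    return [e / total for e in exps]

def attention_vector(descriptors, w1, w2):
    # Global average pooling: one scalar per clip (assumed pooling axis).
    pooled = [sum(d) / len(d) for d in descriptors]
    # Gating: weight -> ReLU -> weight -> activation (sigmoid is an assumption).
    return sigmoid(matvec(w2, relu(matvec(w1, pooled))))

def enhance(descriptors, attention, residual=True):
    # Scale each clip descriptor by its attention weight.
    scaled = [[a * x for x in d] for d, a in zip(descriptors, attention)]
    if not residual:
        return scaled
    # Residual variant: add the scaled descriptors back to the originals.
    return [[s + x for s, x in zip(sd, d)] for sd, d in zip(scaled, descriptors)]

def recognize(descriptors, w1, w2, w_fc):
    att = attention_vector(descriptors, w1, w2)
    enhanced = enhance(descriptors, att)
    # Video-level vector: mean over clips (assumed aggregation).
    n = len(enhanced)
    video_vec = [sum(col) / n for col in zip(*enhanced)]
    return att, softmax(matvec(w_fc, video_vec))

# Hypothetical example: 3 clips, 4-dimensional descriptors, 2 classes.
descriptors = [[0.2, 0.4, 0.1, 0.3], [0.9, 0.8, 0.7, 0.6], [0.0, 0.1, 0.0, 0.1]]
w1 = [[0.5, -0.2, 0.1], [0.3, 0.8, -0.5]]        # 3 clips -> 2 hidden units
w2 = [[1.0, -1.0], [0.5, 0.5], [-0.3, 0.2]]      # 2 hidden units -> 3 clips
w_fc = [[0.1, 0.2, 0.3, 0.4], [-0.4, 0.3, -0.2, 0.1]]
att, probs = recognize(descriptors, w1, w2, w_fc)
print("attention per clip:", att)
print("class probabilities:", probs)
```

With a sigmoid gate every attention weight lies strictly between 0 and 1, so the residual variant can only amplify a clip descriptor by a factor between 1 and 2 and never suppresses it entirely.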

First claim

Opening claim text (preview).

The invention claimed is:

1. A method for video recognition, comprising: obtaining an original set of clip descriptors by providing a plurality of clips of a video as an input of a three-dimensional (3D) convolutional neural network (CNN) of a neural network, wherein the neural network comprises the 3D CNN and at least one first fully connected layer, and each of the plurality of clips comprises at least one frame; determining an attention vector corresponding to the original set of clip descriptors; obtaining an enhanced set of clip descriptors based on the original set of clip descriptors and the attention vector; and inputting the enhanced set of clip descriptors into the at least one first fully connected layer and performing video recognition based on an output of the at least one first fully connected layer.

2. The method of claim 1, wherein determining the attention vector corresponding to the original set of clip descriptors comprises: obtaining a first vector by performing a global average pooling on the original set of clip descriptors; and obtaining the attention vector by employing a gating mechanism on the first vector based on a weight of at least one second fully connected layer, wherein the 3D CNN comprises at least one convolutional layer and the at least one second fully connected layer.

3. The method of claim 2, wherein obtaining the attention vector by employing the gating mechanism on the first vector based on the weight of the at least one second fully connected layer comprises: multiplying the first vector by a first weight of the at least one second fully connected layer to obtain a second vector; processing the second vector based on a rectified linear unit (ReLU) function to obtain a third vector; multiplying the third vector by a second weight of the at least one second fully connected layer to obtain a fourth vector; and processing the fourth vector based on an activation function to obtain the attention vector.

4. The method of claim 1, wherein obtaining the enhanced set of clip descriptors based on the original set of clip descriptors and the attention vector comprises: obtaining a first set of clip descriptors as the enhanced set of clip descriptors by multiplying the original set of clip descriptors by the attention vector.

5. The method of claim 1, wherein obtaining the enhanced set of clip descriptors based on the original set of clip descriptors and the attention vector comprises: obtaining a first set of clip descriptors by multiplying the original set of clip descriptors by the attention vector; and obtaining the enhanced set of clip descriptors by adding the first set of clip descriptors to the original set of clip descriptors.

6. The method of claim 1, wherein inputting the enhanced set of clip descriptors into the at least one first fully connected layer and performing video recognition based on the output of the at least one first fully connected layer comprises: determining a fifth vector based on the enhanced set of clip descriptors; obtaining the output of the at least one first fully connected layer by multiplying the fifth vector by a weight of the at least one first fully connected layer; and obtaining an output of the neural network which is used for video recognition by processing the output of the at least one first fully connected layer based on a SoftMax function.

7. The method of claim 1, further comprising: obtaining parameters of the neural network based on a loss, wherein parameters of the neural network comprise a weight of the at least one first fully connected layer and a weight of at least one second fully connected layer, the loss comprises a classification loss corresponding to an output of the neural network and a sparsity loss corresponding to the attention vector.

8. The method of claim 7, wherein the classification loss is based on a standard cross-entropy loss between a ground truth corresponding to the input and the output of the neural network corresponding to the input, and the sparsity loss is obtained by performing L1 norm on the attention vector.

9. A method for training a neural network, comprising, obtaining an original set of clip descriptors by providing a plurality of clips of a video as an input of a three-dimensional (3D) convolutional neural network (CNN) of a neural network, wherein the neural network comprises the 3D CNN and at least one first fully connected layer, the 3D CNN comprises at least one convolutional layer and at least one second fully connected layer, and each of the plurality of clips comprises at least one frame; determining an attention vector corresponding to the original set of clip descriptors; obtaining an enhanced set of clip descriptors based on the original set of clip descriptors and the attention vector; inputting the enhanced set of clip descriptors into the at least one first fully connected layer and obtaining an output of the neural network; and training the neural network by updating parameters of the neural network based on a loss of the neural network, wherein the parameters of the neural network comprise a weight of the at least one first fully connected layer and a weight of the at least one second fully connected layer.

10. The method of claim 9, wherein the loss comprises a classification loss corresponding to the output of the neural network and a sparsity loss corresponding to the attention vector.

11. The method of claim 10, wherein the classification loss is based on a standard cross-entropy loss between a ground truth corresponding to the input and the output of the neural network corresponding to the input, and the sparsity loss is obtained by performing L1 norm on the attention vector.

12. A neural network based apparatus for video recognition, comprising: at least one processor; a memory coupled with the at least one processor and configured to store instructions which, when executed by the at least one processor, are operable with the processor to implement a neural network to: obtain an original set of clip descriptors by providing a plurality of clips of a video as an input of a three-dimensional (3D) convolutional neural network (CNN) of a neural network, wherein the neural network comprises the 3D CNN and at least one first fully connected layer, and each of the plurality of clips comprises at least one frame; determine an attention vector corresponding to the original set of clip descriptors; obtain an enhanced set of clip descriptors based on the original set of clip descriptors and the attention vector; and input the enhanced set of clip descriptors into the at least one first fully connected layer and perform video recognition based on an output of the at least one first fully connected layer.

13. The apparatus of claim 12, wherein the instructions being operable with the at least one processor to implement the neural network to determine the attention vector corresponding to the original set of clip descriptors are operable with the at least one processor to implement the neural network to: obtain a first vector by performing a global average pooling on the original set of clip descriptors; and obtain the attention vector by employing a gating mechanism on the first vector based on a weight of at least one second fully connected layer, wherein the 3D CNN comprises at least one convolutional layer and the at least one second fully connected layer.

14. The apparatus of claim 13, wherein the instructions being operable with the at least one processor to implement the neural network to obtain the attention vector by employing the gating mechanism on the fir
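Claims 7, 8, 10, and 11 describe a training loss made of a cross-entropy classification term and an L1 sparsity term on the attention vector. A minimal sketch of that combination follows. The claims previewed here do not say how the two terms are weighted, so the `sparsity_weight` factor and its default value are assumptions, as are all function names.

```python
import math

def cross_entropy(probs, label):
    """Standard cross-entropy for one example with an integer class label."""
    return -math.log(probs[label])

def l1_norm(v):
    """Sum of absolute values; used here as the sparsity term."""
    return sum(abs(x) for x in v)

def total_loss(probs, label, attention, sparsity_weight=0.01):
    # sparsity_weight is a hypothetical balancing factor, not from the patent text.
    return cross_entropy(probs, label) + sparsity_weight * l1_norm(attention)

# Hypothetical example: 2 classes, 3 clips.
print(total_loss([0.7, 0.3], 0, [0.9, 0.1, 0.4]))
```

Penalizing the L1 norm pushes attention weights toward zero, which favors solutions where only a few clips receive high attention.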

Assignees

Guangdong Oppo Mobile Telecommunications Corp Ltd

Inventors

Classifications

  • Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting · CPC title

  • Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items (segmenting video sequences G06V20/49) · CPC title

  • Event detection · CPC title

  • using neural networks · CPC title

  • Smoothing the distance, e.g. radial basis function networks [RBFN] · CPC title

Patent family

Related publications grouped by family.

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US12254690B2 cover?
A method for video recognition and related products are provided. The method includes the following. An original set of clip descriptors is obtained by providing multiple clips of a video as an input of a 3D CNN of a neural network, where the neural network includes the 3D CNN and at least one first fully connected layer, and each of the multiple clips includes at least one frame. An attention …
Who is the assignee on this patent?
Guangdong Oppo Mobile Telecommunications Corp Ltd
What technology area does this patent fall under?
Primary CPC classification G06V20/40. Mapped technology areas include Physics.
When was this patent published?
Publication date: Mar 18, 2025 (B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 2 related publications on this page (citations in our corpus or others sharing the same primary CPC).