System and method for the fusion of bottom-up whole-image features and top-down entity classification for accurate image/video scene classification
US-2019005330-A1 · Jan 3, 2019 · US
US11967134B2 · US · B2
| Field | Value |
|---|---|
| Publication number | US-11967134-B2 |
| Application number | US-202017611673-A |
| Country | US |
| Kind code | B2 |
| Filing date | Mar 19, 2020 |
| Priority date | Jun 5, 2019 |
| Publication date | Apr 23, 2024 |
| Grant date | Apr 23, 2024 |
Official abstract text for this publication.
Disclosed are a method and device for recognizing a video. One specific embodiment of the method comprises: obtaining a video to be recognized; and inputting the video into a pre-trained local and global diffusion (LGD) model to obtain the category of the video, wherein the LGD model learns a spatio-temporal representation of the video based on diffusion between local and global representations. According to this embodiment, the spatio-temporal representation of the video is learned based on diffusion between the local and global representations.
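The diffusion between the local and global representations can be illustrated with a minimal NumPy sketch. It follows the wording of claims 4 and 5 (a projected global vector added to the preceding local feature map; global average pooling of the current local map linearly embedded with the preceding global vector), but it is an illustration under assumptions, not the patented implementation: the names `lgd_step`, `w_xg`, `w_gx`, `w_gg`, the ReLU activation, the pooled initialisation of the global vector, and the sharing of weights across modules are all hypothetical, and the learned convolutional transformation a real local path would apply is omitted.

```python
import numpy as np


def relu(a):
    return np.maximum(a, 0.0)


def lgd_step(x_prev, g_prev, w_xg, w_gx, w_gg):
    """One illustrative LGD update (hypothetical names and activation).

    x_prev: local feature map, shape (C, T, H, W)
    g_prev: global feature vector, shape (C,)
    w_xg, w_gx, w_gg: (C, C) projection matrices
    """
    # Global-to-local: project the preceding global vector and add it,
    # broadcast over every spatio-temporal location, to the preceding local map.
    residual = (w_xg @ g_prev)[:, None, None, None]
    x_cur = relu(x_prev + residual)

    # Local-to-global: global average pooling of the current local map,
    # linearly embedded together with the preceding global vector.
    pooled = x_cur.mean(axis=(1, 2, 3))
    g_cur = relu(w_gx @ pooled + w_gg @ g_prev)
    return x_cur, g_cur


rng = np.random.default_rng(0)
C, T, H, W = 8, 4, 5, 5
x = rng.standard_normal((C, T, H, W))
g = x.mean(axis=(1, 2, 3))  # assumption: start the global path from pooled features
mats = [rng.standard_normal((C, C)) * 0.1 for _ in range(3)]

# Three cascaded steps; weights are shared here only to keep the sketch short.
for _ in range(3):
    x, g = lgd_step(x, g, *mats)

print(x.shape, g.shape)
```

The three matrices correspond loosely to the "at least three projection matrices" of claim 6; a low-rank variant would replace each `(C, C)` matrix with a product of two thinner matrices.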
Opening claim text (preview).
What is claimed is:

1. A method for recognizing a video, comprising: acquiring a to-be-recognized video; and inputting the to-be-recognized video into a pre-trained local and global diffusion (LGD) model to obtain a category of the to-be-recognized video, the LGD model learning a spatio-temporal representation in the to-be-recognized video based on diffusion between a local representation and a global representation.

2. The method according to claim 1, wherein the LGD model comprises a plurality of cascaded LGD modules, a local and global combination classifier and a fully connected layer.

3. The method according to claim 2, wherein each LGD module comprises a local path and a global path interacting with each other, respectively describing local variation and holistic appearance at each spatio-temporal location.

4. The method according to claim 3, wherein diffusion directions in the each LGD module comprise a global-to-local diffusion direction and a local-to-global diffusion direction, wherein, in the global-to-local diffusion direction, a local feature map at a current LGD module is learned based on a local feature map at a preceding LGD module and a global feature vector at the preceding LGD module, and in the local-to-global diffusion direction, a global feature vector at the current LGD module is learned based on the local feature map at the current LGD module and the global feature vector at the preceding LGD module.

5. The method according to claim 4, wherein learning the local feature map at the current LGD module based on the local feature map at the preceding LGD module and the global feature vector at the preceding LGD module comprises: attaching a residual value of a global path at the preceding LGD module to the local feature map at the preceding LGD module, to generate the local feature map at the current LGD module, wherein learning the global feature vector at the current LGD module based on the local feature map at the current LGD module and the global feature vector at the preceding LGD module comprises: embedding linearly the global feature vector at the preceding LGD module and global average pooling of the local feature map at the current LGD module, to generate the global feature vector at the current LGD module.

6. The method according to claim 5, wherein the each LGD module generates a local feature map and a global feature vector through at least three projection matrices, and uses a low-rank approximation of each projection matrix to reduce a number of additional parameters of the LGD module.

7. The method according to claim 2, wherein the inputting the to-be-recognized video into the pre-trained local and global diffusion (LGD) model to obtain the category of the to-be-recognized video comprises: learning the local representation and the global representation of the to-be-recognized video in parallel based on the to-be-recognized video and the plurality of cascaded LGD modules; inputting the local representation and the global representation of the to-be-recognized video into the local and global combination classifier, to synthesize a combined representation of the to-be-recognized video; and inputting the combined representation of the to-be-recognized video into the fully connected layer, to obtain the category of the to-be-recognized video.

8. The method according to claim 7, wherein the each LGD module is a two-dimensional LGD (LGD-2D) module or a three-dimensional LGD (LGD-3D) module.

9. The method according to claim 8, wherein the learning the local representation and the global representation of the to-be-recognized video in parallel based on the to-be-recognized video and the plurality of cascaded LGD modules comprises: segmenting the to-be-recognized video into a plurality of to-be-recognized video segments; selecting a plurality of to-be-recognized video frames from the plurality of to-be-recognized video segments; and inputting the plurality of to-be-recognized video frames into a plurality of cascaded LGD-2D modules to learn a local representation and a global representation of the plurality of to-be-recognized video frames in parallel, and using the learned local representation and global representation as the local representation and the global representation of the to-be-recognized video.

10. The method according to claim 9, wherein selecting at least one to-be-recognized video frame from each to-be-recognized video segment in the plurality of to-be-recognized video segments.

11. The method according to claim 8, wherein the learning the local representation and the global representation of the to-be-recognized video in parallel based on the to-be-recognized video and the plurality of cascaded LGD modules comprises: segmenting the to-be-recognized video into a plurality of to-be-recognized video segments; and inputting the plurality of to-be-recognized video segments into a plurality of cascaded LGD-3D modules to learn a local representation and a global representation of the plurality of to-be-recognized video segments in parallel, and using the learned local representation and global representation as the local representation and the global representation of the to-be-recognized video.

12. The method according to claim 11, wherein the plurality of cascaded LGD-3D modules decompose three-dimensional learning into two-dimensional convolutions in a spatial space and one-dimensional operations in a temporal dimension.

13. The method according to claim 2, wherein the local and global combination classifier is a kernel-based classifier.

14. A server, comprising: one or more processors; and a storage apparatus, configured to store one or more programs, wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement operations, the operations comprising: acquiring a to-be-recognized video; and inputting the to-be-recognized video into a pre-trained local and global diffusion (LGD) model to obtain a category of the to-be-recognized video, the LGD model learning a spatio-temporal representation in the to-be-recognized video based on diffusion between a local representation and a global representation.

15. A computer readable medium, storing a computer program thereon, wherein the computer program, when executed by a processor, cause the processor to implement operations, the operations comprising: acquiring a to-be-recognized video; and inputting the to-be-recognized video into a pre-trained local and global diffusion (LGD) model to obtain a category of the to-be-recognized video, the LGD model learning a spatio-temporal representation in the to-be-recognized video based on diffusion between a local representation and a global representation.

16. The server according to claim 14, wherein the LGD model comprises a plurality of cascaded LGD modules, a local and global combination classifier and a fully connected layer.

17. The server according to claim 16, wherein each LGD module comprises a local path and a global path interacting with each other, respectively describing local variation and holistic appearance at each spatio-temporal location.

18. The server according to claim 17, wherein diffusion directions in the each LGD module comprise a global-to-local diffusion direction and a local-to-global diffusion direction, wherein, in the global-to-local diffusion direction, a local feature map at a current LGD module is learned based on a local feature map at a preceding LGD module and a global feature vector at the preceding LGD m
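
The segment-and-select step recited in claims 9 and 10 (split the video into segments, then take at least one frame from each) can be sketched as follows. This is a hedged illustration only: the function name `sample_frame_indices`, the near-equal segment boundaries, and the choice of one uniformly random frame per segment are assumptions, since the claims do not fix how segments are sized or how frames are picked.

```python
import random


def sample_frame_indices(num_frames, num_segments, seed=None):
    """Split [0, num_frames) into near-equal segments and pick one index from each.

    Hypothetical helper: the claims only require at least one frame per segment.
    """
    if num_segments <= 0 or num_frames < num_segments:
        raise ValueError("need at least one frame per segment")
    rng = random.Random(seed)
    # Segment i covers the half-open index range [bounds[i], bounds[i + 1]).
    bounds = [i * num_frames // num_segments for i in range(num_segments + 1)]
    return [rng.randrange(bounds[i], bounds[i + 1]) for i in range(num_segments)]


# Example: a 100-frame video split into 4 segments of 25 frames each.
indices = sample_frame_indices(num_frames=100, num_segments=4, seed=0)
print(indices)
```

The selected frames would then be fed to the cascaded LGD-2D modules; for the LGD-3D variant of claim 11, whole segments are passed instead of individual frames.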
Quantised networks; Sparse networks; Compressed networks · CPC title
Supervised learning · CPC title
Convolutional networks [CNN, ConvNet] · CPC title
using classification, e.g. of video objects · CPC title
Extraction of image or video features · CPC title