What technology area does this patent fall under?

Primary CPC classification G06V20/41. Mapped technology areas include Physics.

When was this patent published?

Publication date: Apr 23, 2024 (B2). Legal status and post-grant events are not shown on this page.

What related patents are in patentsdb?

We list 6 related publications on this page (citations in our corpus or others sharing the same primary CPC).

Video classification method and apparatus, model training method and apparatus, device, and storage medium

US11967151B2 · US · B2

Patent metadata
Field	Value
Publication number	US-11967151-B2
Application number	US-202117515164-A
Country	US
Kind code	B2
Filing date	Oct 29, 2021
Priority date	Nov 15, 2019
Publication date	Apr 23, 2024
Grant date	Apr 23, 2024

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

Title
What the patent document calls the invention.
Abstract
A short plain-language summary of the technical disclosure.
Assignees and inventors
Who owns or filed the patent and who is credited as inventor.
Key dates
Filing, priority, publication, and grant dates set the timeline.
First independent claim
The legal scope of protection — read this for what is actually claimed.
CPC / IPC classifications
Technology tags used to group this patent with similar filings.
Citations and related patents
Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

Embodiments of this application disclose a video classification method performed by a computer device and belong to the field of computer vision (CV) technologies. The method includes: obtaining a video; selecting n image frames from the video; extracting respective feature information of the n image frames according to a learned feature fusion policy by using a feature extraction network, the learned feature fusion policy being used for indicating proportions of the feature information of the other image frames that have been fused with feature information of a first image frame in the n image frames; and determining a classification result of the video according to the respective feature information of the n image frames. By replacing complex and repeated 3D convolution operations with simple feature information fusion between adjacent image frames, time for finally obtaining a classification result of the video is therefore reduced, thereby having high efficiency.
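The fusion step in the abstract can be illustrated with a minimal sketch. It is not the patent's implementation. It assumes each frame carries one scalar per channel (real feature maps would be H×W per channel), a temporal kernel of size 3 per channel, and zero padding at the first and last frames. The function name `fuse_temporal` and the kernel values are hypothetical; in the patent the proportions are learned during training.

```python
from typing import List, Sequence, Tuple

Kernel = Tuple[float, float, float]  # (weight of previous, current, next frame)


def fuse_temporal(features: Sequence[Sequence[float]],
                  kernels: Sequence[Kernel]) -> List[List[float]]:
    """Fuse each frame's features with those of its temporal neighbours.

    features[t][i] is the feature of channel i in frame t.
    kernels[i] holds the mixing proportions for channel i; here they are
    fixed numbers, standing in for weights that would be learned.
    """
    n = len(features)
    c = len(kernels)
    fused = [[0.0] * c for _ in range(n)]
    for t in range(n):
        for i in range(c):
            w_prev, w_cur, w_next = kernels[i]
            # Assumption: edge frames see zeros where a neighbour is missing.
            prev_val = features[t - 1][i] if t > 0 else 0.0
            next_val = features[t + 1][i] if t < n - 1 else 0.0
            fused[t][i] = (w_prev * prev_val
                           + w_cur * features[t][i]
                           + w_next * next_val)
    return fused


# Three frames, two channels. Channel 0 blends neighbours; channel 1 is
# passed through unchanged.
frames = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]]
example_kernels = [(0.25, 0.5, 0.25), (0.0, 1.0, 0.0)]
fused_frames = fuse_temporal(frames, example_kernels)
```

Each output value is a weighted sum over at most three frames in one channel, which is far cheaper than a full 3D convolution over space and time. That is the efficiency argument the abstract makes.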

First claim

Opening claim text (preview).

What is claimed is:

1. A video classification method performed by a computer device, the method comprising: obtaining a video; dividing the video into n segments of equal length, n being a positive integer; selecting n image frames from the video, each image frame from a corresponding one of the n segments; extracting respective feature information of each of the n image frames by using a feature extraction network; fusing the feature information of each of the n image frames according to a learned feature fusion policy, the learned feature fusion policy being used for indicating, when a first image frame in the n image frames is fused with feature information of other image frames in the n image frames, proportions of the feature information of the other image frames; and determining a classification result of the video according to the respective feature information of the n image frames, wherein feature information of an edge image frame in the n image frames is weighted differently from feature information of a non-edge image frame in the n image frames.

2. The method according to claim 1, wherein the feature extraction network comprises m cascaded network structures, m being a positive integer; and the fusing the feature information of each of the n image frames according to a learned feature fusion policy comprises: before first feature information of the first image frame is inputted into a kth network structure of the feature extraction network, performing feature fusion on the first feature information of the first image frame according to the feature fusion policy, to obtain processed first feature information, the processed first feature information being fused with feature information of the first image frame and the proportions of the feature information of the other image frames, k being a positive integer less than or equal to m; and processing the processed first feature information by using the kth network structure, to generate second feature information of the first image frame, the second feature information being feature information of the first image frame outputted by the feature extraction network, or intermediate feature information of the first image frame generated by the feature extraction network.

3. The method according to claim 2, wherein the first feature information comprises features of c channels, c being a positive integer; and the performing feature fusion on the first feature information of the first image frame according to the feature fusion policy, to obtain processed first feature information comprises: performing, for a feature of an ith channel in the first feature information of the first image frame, a convolution operation on the feature of the ith channel in the first image frame and features of the ith channel in the other image frames by using a learned convolution kernel, to obtain a processed feature of the ith channel in the first image frame, i being a positive integer less than or equal to c, the convolution kernel being configured to define a feature fusion policy corresponding to the feature of the ith channel in the first image frame; and obtaining the processed first feature information according to the processed features of the channels in the first image frame.

4. The method according to claim 1, wherein the determining a classification result of the video according to the respective feature information of the n image frames comprises: obtaining n classification results corresponding to the n image frames according to the respective feature information of the n image frames; and determining the classification result of the video according to the n classification results.

5. The method according to claim 4, wherein the obtaining n classification results corresponding to the n image frames according to the respective feature information of the n image frames comprises: performing dimension reduction on feature information of a jth image frame in the n image frames, to obtain dimension-reduced feature information of the jth image frame; and obtaining a classification result corresponding to the jth image frame according to the dimension-reduced feature information of the jth image frame by using a jth classifier in n classifiers, j being a positive integer less than or equal to n.

6. The method according to claim 4, wherein the determining the classification result of the video according to the n classification results comprises: determining a sum of products of the n classification results and weights respectively corresponding to the n classification results as the classification result of the video.

7. The method according to claim 1, wherein the selecting n image frames from the video comprises: extracting image frames from the video according to a preset frame rate, to obtain a video frame sequence; equally dividing the video frame sequence into n subsequences; and extracting one image frame from each of the n subsequences, to obtain the n image frames.

8. A computer device, comprising a processor and a memory, the memory storing at least one program, the at least one program being loaded and executed by the processor to perform a plurality of operations including: obtaining a video; dividing the video into n segments of equal length, n being a positive integer; selecting n image frames from the video, each image frame from a corresponding one of the n segments; extracting respective feature information of each of the n image frames by using a feature extraction network; fusing the feature information of each of the n image frames according to a learned feature fusion policy, the learned feature fusion policy being used for indicating, when a first image frame in the n image frames is fused with feature information of other image frames in the n image frames, proportions of the feature information of the other image frames; and determining a classification result of the video according to the respective feature information of the n image frames, wherein feature information of an edge image frame in the n image frames is weighted differently from feature information of a non-edge image frame in the n image frames.

9. The computer device according to claim 8, wherein the feature extraction network comprises m cascaded network structures, m being a positive integer; and the fusing the feature information of each of the n image frames according to a learned feature fusion policy comprises: before first feature information of the first image frame is inputted into a kth network structure of the feature extraction network, performing feature fusion on the first feature information of the first image frame according to the feature fusion policy, to obtain processed first feature information, the processed first feature information being fused with feature information of the first image frame and the other image frames, k being a positive integer less than or equal to m; and processing the processed first feature information by using the kth network structure, to generate second feature information of the first image frame, the second feature information being feature information of the first image frame outputted by the feature extraction network, or intermediate feature information of the first image frame generated by the feature extraction network.

10. The computer device according to claim 9, wherein the first feature information comprises features of c channels, c being a positive integer; and the performing feature fusion on the first feature information of the first image frame according to the feature fusion policy, to obtain processed first feature information comprises: performing,
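The frame selection in claims 1 and 7 and the weighted aggregation in claim 6 can be sketched as follows. This is an illustration under stated assumptions, not the claimed implementation. The function names are hypothetical. Picking the middle frame of each segment is an assumption (the claims only require one frame per segment), as is dropping leftover frames when the frame count is not divisible by n. The weights are plain numbers here.

```python
from typing import List, Sequence


def select_frame_indices(num_frames: int, n: int) -> List[int]:
    """Split frame indices 0..num_frames-1 into n equal segments and
    return one index per segment (assumption: the middle frame)."""
    if n <= 0 or num_frames < n:
        raise ValueError("need a positive n and at least n frames")
    seg_len = num_frames // n  # leftover trailing frames are ignored
    return [k * seg_len + seg_len // 2 for k in range(n)]


def aggregate_scores(per_frame_scores: Sequence[Sequence[float]],
                     weights: Sequence[float]) -> List[float]:
    """Return the video-level scores as the sum of products of each
    frame's class scores and that frame's weight."""
    if len(per_frame_scores) != len(weights):
        raise ValueError("one weight per frame is required")
    num_classes = len(per_frame_scores[0])
    video_scores = [0.0] * num_classes
    for scores, w in zip(per_frame_scores, weights):
        for cls, s in enumerate(scores):
            video_scores[cls] += w * s
    return video_scores


# A 12-frame sequence split into 4 segments of 3 frames each.
indices = select_frame_indices(12, 4)

# Four per-frame results over two classes, equally weighted.
frame_scores = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 0.0]]
video_scores = aggregate_scores(frame_scores, [0.25, 0.25, 0.25, 0.25])
```

With equal weights the aggregation reduces to a plain average. Unequal weights would let some frames count more than others toward the video-level result.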

Assignees

Tencent Tech Shenzhen Co Ltd

Inventors

Classifications

G06N3/09
Supervised learning · CPC title
G06N3/0464
Convolutional networks [CNN, ConvNet] · CPC title
G06V20/41 (Primary)
Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items (segmenting video sequences G06V20/49) · CPC title
G06F18/214
Generating training patterns; Bootstrap methods, e.g. bagging or boosting · CPC title
G06F18/253
of extracted features · CPC title

Patent family

Related publications grouped by family.

View patent family 69853121

External sources

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US11967151B2 cover?: Embodiments of this application disclose a video classification method performed by a computer device and belong to the field of computer vision (CV) technologies. The method includes: obtaining a video; selecting n image frames from the video; extracting respective feature information of the n image frames according to a learned feature fusion policy by using a feature extraction network, the …
Who is the assignee on this patent?: Tencent Tech Shenzhen Co Ltd

Related patents

Method of segmenting pedestrians in roadside image by using convolutional network fusing features at different scales

Feature fusion and dense connection-based method for infrared plane object detection

Method and apparatus for sar image recognition based on multi-scale features and broad learning

Recurrent multimodal attention system based on expert gated networks

Multi-layer fusion in a convolutional neural network for image classification

Method, apparatus and computer program product for human-face features extraction
