Video classification method and apparatus, model training method and apparatus, device, and storage medium

US11967151B2 · US · B2

Patent metadata
Publication number: US-11967151-B2
Application number: US-202117515164-A
Country: US
Kind code: B2
Filing date: Oct 29, 2021
Priority date: Nov 15, 2019
Publication date: Apr 23, 2024
Grant date: Apr 23, 2024

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

Embodiments of this application disclose a video classification method performed by a computer device and belong to the field of computer vision (CV) technologies. The method includes: obtaining a video; selecting n image frames from the video; extracting respective feature information of the n image frames according to a learned feature fusion policy by using a feature extraction network, the learned feature fusion policy being used for indicating proportions of the feature information of the other image frames that have been fused with feature information of a first image frame in the n image frames; and determining a classification result of the video according to the respective feature information of the n image frames. By replacing complex and repeated 3D convolution operations with simple feature information fusion between adjacent image frames, time for finally obtaining a classification result of the video is therefore reduced, thereby having high efficiency.
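The fusion between adjacent image frames described in the abstract can be illustrated with a minimal sketch. It is not the patented implementation: it assumes each frame's feature is one scalar per channel, that the fusion policy is a three-tap weight vector per channel (previous, current, next frame), and that missing neighbours at the first and last frame count as zero. The names `fuse_adjacent_frames` and `kernels` are hypothetical, and the weights are hard-coded here rather than learned.

```python
from typing import List

# features[t][ch] is the value of channel ch for frame t (assumed scalar per channel).
Features = List[List[float]]


def fuse_adjacent_frames(features: Features, kernels: List[List[float]]) -> Features:
    """Mix each frame's per-channel feature with the same channel of its neighbours.

    kernels[ch] = [w_prev, w_cur, w_next] gives the proportions taken from the
    previous, current, and next frame for channel ch. In a real system these
    weights would be learned; here they are supplied by the caller.
    """
    n = len(features)
    fused: Features = []
    for t in range(n):
        row = []
        for ch, (w_prev, w_cur, w_next) in enumerate(kernels):
            # Assumption: frames beyond the sequence boundary contribute zero.
            prev_val = features[t - 1][ch] if t > 0 else 0.0
            next_val = features[t + 1][ch] if t < n - 1 else 0.0
            row.append(w_prev * prev_val + w_cur * features[t][ch] + w_next * next_val)
        fused.append(row)
    return fused


# Three frames, two channels. Channel 0 blends neighbours; channel 1 is left unchanged.
features = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]]
kernels = [[0.25, 0.5, 0.25], [0.0, 1.0, 0.0]]
print(fuse_adjacent_frames(features, kernels))
# prints [[1.0, 10.0], [2.0, 20.0], [2.0, 30.0]]
```

Each output value costs three multiplications per channel, which gives a sense of why this kind of fusion is cheaper than a full 3D convolution over the same frames.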

First claim

Opening claim text (preview).

What is claimed is:

1. A video classification method performed by a computer device, the method comprising: obtaining a video; dividing the video into n segments of equal length, n being a positive integer; selecting n image frames from the video, each image frame from a corresponding one of the n segments; extracting respective feature information of each of the n image frames by using a feature extraction network; fusing the feature information of each of the n image frames according to a learned feature fusion policy, the learned feature fusion policy being used for indicating, when a first image frame in the n image frames is fused with feature information of other image frames in the n image frames, proportions of the feature information of the other image frames; and determining a classification result of the video according to the respective feature information of the n image frames, wherein feature information of an edge image frame in the n image frames is weighted differently from feature information of a non-edge image frame in the n image frames.

2. The method according to claim 1, wherein the feature extraction network comprises m cascaded network structures, m being a positive integer; and the fusing the feature information of each of the n image frames according to a learned feature fusion policy comprises: before first feature information of the first image frame is inputted into a k-th network structure of the feature extraction network, performing feature fusion on the first feature information of the first image frame according to the feature fusion policy, to obtain processed first feature information, the processed first feature information being fused with feature information of the first image frame and the proportions of the feature information of the other image frames, k being a positive integer less than or equal to m; and processing the processed first feature information by using the k-th network structure, to generate second feature information of the first image frame, the second feature information being feature information of the first image frame outputted by the feature extraction network, or intermediate feature information of the first image frame generated by the feature extraction network.

3. The method according to claim 2, wherein the first feature information comprises features of c channels, c being a positive integer; and the performing feature fusion on the first feature information of the first image frame according to the feature fusion policy, to obtain processed first feature information comprises: performing, for a feature of an i-th channel in the first feature information of the first image frame, a convolution operation on the feature of the i-th channel in the first image frame and features of the i-th channel in the other image frames by using a learned convolution kernel, to obtain a processed feature of the i-th channel in the first image frame, i being a positive integer less than or equal to c, the convolution kernel being configured to define a feature fusion policy corresponding to the feature of the i-th channel in the first image frame; and obtaining the processed first feature information according to the processed features of the channels in the first image frame.

4. The method according to claim 1, wherein the determining a classification result of the video according to the respective feature information of the n image frames comprises: obtaining n classification results corresponding to the n image frames according to the respective feature information of the n image frames; and determining the classification result of the video according to the n classification results.

5. The method according to claim 4, wherein the obtaining n classification results corresponding to the n image frames according to the respective feature information of the n image frames comprises: performing dimension reduction on feature information of a j-th image frame in the n image frames, to obtain dimension-reduced feature information of the j-th image frame; and obtaining a classification result corresponding to the j-th image frame according to the dimension-reduced feature information of the j-th image frame by using a j-th classifier in n classifiers, j being a positive integer less than or equal to n.

6. The method according to claim 4, wherein the determining the classification result of the video according to the n classification results comprises: determining a sum of products of the n classification results and weights respectively corresponding to the n classification results as the classification result of the video.

7. The method according to claim 1, wherein the selecting n image frames from the video comprises: extracting image frames from the video according to a preset frame rate, to obtain a video frame sequence; equally dividing the video frame sequence into n subsequences; and extracting one image frame from each of the n subsequences, to obtain the n image frames.

8. A computer device, comprising a processor and a memory, the memory storing at least one program, the at least one program being loaded and executed by the processor to perform a plurality of operations including: obtaining a video; dividing the video into n segments of equal length, n being a positive integer; selecting n image frames from the video, each image frame from a corresponding one of the n segments; extracting respective feature information of each of the n image frames by using a feature extraction network; fusing the feature information of each of the n image frames according to a learned feature fusion policy, the learned feature fusion policy being used for indicating, when a first image frame in the n image frames is fused with feature information of other image frames in the n image frames, proportions of the feature information of the other image frames; and determining a classification result of the video according to the respective feature information of the n image frames, wherein feature information of an edge image frame in the n image frames is weighted differently from feature information of a non-edge image frame in the n image frames.

9. The computer device according to claim 8, wherein the feature extraction network comprises m cascaded network structures, m being a positive integer; and the fusing the feature information of each of the n image frames according to a learned feature fusion policy comprises: before first feature information of the first image frame is inputted into a k-th network structure of the feature extraction network, performing feature fusion on the first feature information of the first image frame according to the feature fusion policy, to obtain processed first feature information, the processed first feature information being fused with feature information of the first image frame and the other image frames, k being a positive integer less than or equal to m; and processing the processed first feature information by using the k-th network structure, to generate second feature information of the first image frame, the second feature information being feature information of the first image frame outputted by the feature extraction network, or intermediate feature information of the first image frame generated by the feature extraction network.

10. The computer device according to claim 9, wherein the first feature information comprises features of c channels, c being a positive integer; and the performing feature fusion on the first feature information of the first image frame according to the feature fusion policy, to obtain processed first feature information comprises: performing,
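Two of the simpler steps in the claims, picking one frame from each of n equal segments (claims 1 and 7) and combining n per-frame classification results as a weighted sum (claim 6), can be sketched as follows. This is an illustration under stated assumptions, not the claimed implementation: the function names are hypothetical, the middle frame of each segment is chosen arbitrarily (the claims only require one frame per segment), leftover frames at the end of the sequence are ignored, and the weights are supplied by hand.

```python
from typing import List


def sample_one_frame_per_segment(num_frames: int, n: int) -> List[int]:
    """Return one frame index from each of n equal-length segments.

    Assumption: the middle index of each segment is used, and any frames left
    over after integer division are ignored.
    """
    if n <= 0 or num_frames < n:
        raise ValueError("need a positive n no larger than num_frames")
    seg_len = num_frames // n
    return [s * seg_len + seg_len // 2 for s in range(n)]


def combine_frame_scores(frame_scores: List[List[float]], weights: List[float]) -> List[float]:
    """Combine per-frame class scores into video-level scores by a weighted sum.

    frame_scores[j][k] is the score of class k for frame j; weights[j] is the
    weight given to frame j (hard-coded here, not learned).
    """
    if len(frame_scores) != len(weights):
        raise ValueError("one weight is required per frame")
    num_classes = len(frame_scores[0])
    return [
        sum(w * scores[k] for w, scores in zip(weights, frame_scores))
        for k in range(num_classes)
    ]


# A 12-frame sequence split into 3 segments of 4 frames each.
print(sample_one_frame_per_segment(12, 3))
# prints [2, 6, 10]

# Three frames, two classes; the middle frame is weighted more than the edge frames.
print(combine_frame_scores([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]], [0.25, 0.5, 0.25]))
# prints [0.75, 0.25]
```

Giving the edge frames a different weight from the middle frame in the usage example is only meant to echo the "weighted differently" wording of claim 1. How the patent actually derives those weights is not shown in this preview.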

Assignees

Tencent Tech Shenzhen Co Ltd

Inventors

Classifications

  • Supervised learning · CPC title

  • Convolutional networks [CNN, ConvNet] · CPC title

  • G06V20/41 · Primary

    Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items (segmenting video sequences G06V20/49) · CPC title

  • Generating training patterns; Bootstrap methods, e.g. bagging or boosting · CPC title

  • of extracted features · CPC title

Patent family

Related publications grouped by family.

External sources

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US11967151B2 cover?
Embodiments of this application disclose a video classification method performed by a computer device and belong to the field of computer vision (CV) technologies. The method includes: obtaining a video; selecting n image frames from the video; extracting respective feature information of the n image frames according to a learned feature fusion policy by using a feature extraction network, the …
Who is the assignee on this patent?
Tencent Tech Shenzhen Co Ltd
What technology area does this patent fall under?
Primary CPC classification G06V20/41. Mapped technology areas include Physics.
When was this patent published?
Publication date: Apr 23, 2024 (kind code B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 6 related publications on this page (citations in our corpus or others sharing the same primary CPC).