Video classification method and apparatus, computer device, and storage medium

US2021192220A1 · US · A1

Patent metadata
Field	Value
Publication number	US-2021192220-A1
Application number	US-202117192580-A
Country	US
Kind code	A1
Filing date	Mar 4, 2021
Priority date	Dec 14, 2018
Publication date	Jun 24, 2021
Grant date	—

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

Title: What the patent document calls the invention.
Abstract: A short plain-language summary of the technical disclosure.
Assignees and inventors: Who owns or filed the patent and who is credited as inventor.
Key dates: Filing, priority, publication, and grant dates set the timeline.
First independent claim: The legal scope of protection; read this for what is actually claimed.
CPC / IPC classifications: Technology tags used to group this patent with similar filings.
Citations and related patents: Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

Video classification accuracy can be improved by utilizing multiple features. Classification based on a combination of an image classification model, an audio classification model, and a textual description classification model may improve classification. The image classification result is based on an image feature of the image frame. The audio classification result is based on an audio feature of the audio. The textual description classification result is based on a text feature of the textual description information. A target classification result of the target video is determined based on the image classification result, the audio classification result, and the textual classification result.
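The combination step in the abstract can be illustrated with a minimal sketch. It follows the wording of claim 9 (splice the per-class probabilities into one vector, then pass it to a softmax-based target classifier), but everything concrete here is an assumption for illustration: the two class names, the probability values, the function names `splice` and `fuse`, and the hand-set weights. In practice the weights would be learned, and the patent preview does not specify them.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def splice(image_probs, audio_probs, text_probs):
    """Concatenate per-class probabilities from the three models into one vector."""
    return list(image_probs) + list(audio_probs) + list(text_probs)

def fuse(feature, weights, biases):
    """Linear layer followed by softmax (one simple softmax classification model)."""
    scores = [
        sum(w * x for w, x in zip(row, feature)) + b
        for row, b in zip(weights, biases)
    ]
    return softmax(scores)

# Hypothetical two-class setup; all numbers below are made up.
classes = ["sports", "music"]
image_probs = [0.7, 0.3]
audio_probs = [0.4, 0.6]
text_probs = [0.8, 0.2]

# Hypothetical, untrained weights: one row per output class,
# one column per entry of the spliced vector.
weights = [
    [1.0, 0.0, 1.0, 0.0, 1.0, 0.0],  # adds up the "sports" probabilities
    [0.0, 1.0, 0.0, 1.0, 0.0, 1.0],  # adds up the "music" probabilities
]
biases = [0.0, 0.0]

feature = splice(image_probs, audio_probs, text_probs)
target_probs = fuse(feature, weights, biases)
target_class = classes[target_probs.index(max(target_probs))]
print(target_class, target_probs)
```

With these made-up inputs the image and text models outvote the audio model, so the fused result favors "sports". A trained target classifier could weight the three modalities unequally.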

First claim

Opening claim text (preview).

What is claimed is:

1. A video classification method, performed by a computer device, the method comprising: obtaining a target video; classifying an image frame in the target video by using a first classification model, to obtain an image classification result, the first classification model being configured to perform classification based on an image feature of the image frame; classifying an audio in the target video by using a second classification model, to obtain an audio classification result, the second classification model being configured to perform classification based on an audio feature of the audio; classifying textual description information corresponding to the target video by using a third classification model, to obtain a textual classification result, the third classification model being configured to perform classification based on a text feature of the textual description information; and determining a target classification result of the target video according to the image classification result, the audio classification result, and the textual classification result.

2. The method according to claim 1, wherein the image classification result comprises a first image classification result, further wherein the classifying an image frame in the target video by using a first classification model, to obtain an image classification result further comprises: determining an original image frame extracted from the target video as an RGB image frame; and classifying the RGB image frame by using a residual network and an RGB classifier in the first classification model, to obtain the first image classification result, wherein the RGB classifier is configured to perform classification based on a static image feature of the RGB image frame.

3. The method according to claim 2, wherein the image classification result comprises a second image classification result, further wherein the classifying an image frame in the target video by using a first classification model, to obtain an image classification result further comprises: generating an RGB difference image frame according to two adjacent original image frames in the target video; and classifying the RGB difference image frame by using a residual network and an RGB difference classifier in the first classification model, to obtain the second image classification result, wherein the RGB difference classifier is configured to perform classification based on a dynamic image feature of the RGB difference image frame.

4. The method according to claim 3, wherein the image classification result comprises a third image classification result, further wherein the classifying an image frame in the target video by using a first classification model, to obtain an image classification result further comprises: determining an original image frame extracted from the target video as an RGB image frame; and classifying the RGB image frame by using a target detection network and a fine granularity classifier in the first classification model, to obtain a third image classification result, wherein the target detection network is configured to extract a fine granularity image feature of a target object in the RGB image frame, and the fine granularity classifier is configured to perform classification based on the fine granularity image feature.

5. The method according to claim 1, wherein the audio classification result comprises a first audio classification result, further wherein the classifying an audio in the target video by using a second classification model, to obtain an audio classification result further comprises: extracting a Mel-frequency cepstral coefficient (MFCC) of the audio; performing feature extraction on the MFCC by using a VGGish network in a second classification model, to obtain a VGGish feature; and classifying the VGGish feature by using a general classifier in the second classification model, to obtain the first audio classification result.

6. The method according to claim 5, wherein the audio classification result further comprises a second audio classification result, wherein the method further comprises: classifying the VGGish feature by using at least one specific classifier in the second classification model, to obtain the second audio classification result outputted by each specific classifier, wherein a quantity of classes in the general classifier are a same quantity of preset classes for videos, wherein the specific classifier is configured to perform classification based on a specific class, which is one of the preset classes for videos, and different specific classifiers correspond to different specific classes.

7. The method according to claim 1, wherein the classifying textual description information corresponding to the target video by using a third classification model, to obtain a textual classification result further comprises: obtaining the textual description information corresponding to the target video, the textual description information comprising at least one of a video title, video content description information, video background music information, or video publisher information; preprocessing the textual description information, wherein the preprocessing comprises at least one of de-noising, word segmentation, entity word retrieving, or stop word removal; and classifying the preprocessed textual description information by using a Bi-directional long short-term memory network (Bi-LSTM) and a text classifier in the third classification model, to obtain the textual classification result.

8. The method according to claim 7, wherein the classifying the preprocessed textual description information by using the Bi-LSTM and the text classifier in the third classification model, to obtain the textual classification result further comprises: inputting the preprocessed textual description information to the Bi-LSTM; performing weight correction on an output result of the Bi-LSTM by using an attention mechanism; and classifying the corrected output result of the Bi-LSTM by using the text classifier, to obtain the textual classification result.

9. The method according to claim 1, wherein the determining a target classification result of the target video according to the image classification result, the audio classification result, and the textual classification result further comprises: splicing probabilities corresponding to classes in the image classification result, the audio classification result, and the textual classification result, to generate a classification feature vector; and inputting the classification feature vector to a target classifier, to obtain the target classification result, the target classifier being constructed based on a softmax classification model.

10. A computing apparatus comprising a processor and a memory, the memory storing computer-readable instructions, the computer-readable instructions, when executed by the processor, causing the processor to perform operations comprising: obtaining a target video; classifying an image frame in the target video by using a first classification model, to obtain an image classification result, the first classification model being configured to perform a classification based on an image feature of the image frame; classifying an audio in the target video by using a second classification model, to obtain an audio classification result, the second classification model being configured to perform a classification based on an audio feature of the audio; classifying textual description information corresponding to the target video by using a third classification model, to obtain a textual classification result, the
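As one illustration of the claim language, the RGB difference image frame in claim 3 can be sketched as a per-pixel subtraction of two adjacent frames. This is only a sketch under stated assumptions: the claim preview does not say whether the difference is signed or absolute, so signed subtraction is assumed here, and the function names `rgb_difference` and `difference_frames`, the nested-list frame layout, and the sample pixel values are all hypothetical.

```python
def rgb_difference(frame_a, frame_b):
    """Per-pixel, per-channel difference between two adjacent frames.

    Frames are nested lists shaped [height][width][3] with integer values 0-255.
    Assumption: the sign is kept, so results fall in the range -255..255.
    """
    if len(frame_a) != len(frame_b) or any(
        len(row_a) != len(row_b) for row_a, row_b in zip(frame_a, frame_b)
    ):
        raise ValueError("frames must have the same height and width")
    return [
        [
            [cb - ca for ca, cb in zip(pixel_a, pixel_b)]
            for pixel_a, pixel_b in zip(row_a, row_b)
        ]
        for row_a, row_b in zip(frame_a, frame_b)
    ]

def difference_frames(frames):
    """Build one difference frame per pair of adjacent original frames."""
    return [rgb_difference(a, b) for a, b in zip(frames, frames[1:])]

# Tiny made-up frames, each 1 pixel high and 2 pixels wide.
frame_0 = [[[10, 20, 30], [0, 0, 0]]]
frame_1 = [[[12, 20, 25], [0, 0, 0]]]
frame_2 = [[[12, 20, 25], [255, 0, 0]]]

diffs = difference_frames([frame_0, frame_1, frame_2])
print(diffs)
```

Unchanged pixels come out as zeros, so a difference frame mostly carries what moved between frames. That is consistent with the claim's description of the RGB difference classifier working on a dynamic image feature, while the RGB classifier in claim 2 works on a static one.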

Assignees

Tencent Tech Shenzhen Co Ltd

Classifications

G06F40/30 (Primary): Semantic analysis
G06V20/635: Overlay text, e.g. embedded captions in a TV programme
G06V10/811: the classifiers operating on different input data, e.g. multi-modal recognition
G06V20/40: in video content (extracting overlay text G06V20/62; video retrieval G06F16/70; processing of video elementary streams in video servers H04N21/234; processing of video elementary streams in video clients H04N21/44)
G06V10/82: using neural networks

Patent family

Related publications grouped by family.

View patent family 65328892

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US2021192220A1 cover?: Video classification accuracy can be improved by utilizing multiple features. Classification based on a combination of an image classification model, an audio classification model, and a textual description classification model may improve classification. The image classification result is based on an image feature of the image frame. The audio classification result is based on an audio feature…
Who is the assignee on this patent?: Tencent Tech Shenzhen Co Ltd
What technology area does this patent fall under?: Primary CPC classification G06F40/30. Mapped technology areas include Physics.
When was this patent published?: Publication date Jun 24, 2021 (A1). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?: We list 8 related publications on this page (citations in our corpus or others sharing the same primary CPC).