Machine learning collaboration techniques
US-2024420212-A1 · Dec 19, 2024 · US
US2021192220A1 · US · A1
| Field | Value |
|---|---|
| Publication number | US-2021192220-A1 |
| Application number | US-202117192580-A |
| Country | US |
| Kind code | A1 |
| Filing date | Mar 4, 2021 |
| Priority date | Dec 14, 2018 |
| Publication date | Jun 24, 2021 |
| Grant date | — |
Official abstract text for this publication.
Video classification accuracy can be improved by utilizing multiple features. Classification based on a combination of an image classification model, an audio classification model, and a textual description classification model may improve classification. The image classification result is based on an image feature of the image frame. The audio classification result is based on an audio feature of the audio. The textual description classification result is based on a text feature of the textual description information. A target classification result of the target video is determined based on the image classification result, the audio classification result, and the textual classification result.
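The combination step described in the abstract can be illustrated with a small sketch. Claim 9 of this publication states that the per-class probabilities from the three results are spliced into a classification feature vector and passed to a softmax-based target classifier; the code below follows that outline. The function names, the three-class setup, and the hand-set weights are assumptions made for illustration only and do not come from the publication, where the target classifier would be trained.

```python
import numpy as np


def softmax(logits):
    """Numerically stable softmax over a 1-D array."""
    shifted = logits - np.max(logits)
    exps = np.exp(shifted)
    return exps / exps.sum()


def fuse(image_probs, audio_probs, text_probs, weights, bias):
    """Splice per-class probabilities and apply a softmax classifier.

    weights has shape (num_classes, 3 * num_classes); bias has shape
    (num_classes,). Both are hypothetical parameters for this sketch.
    """
    # Splice (concatenate) the three probability vectors into one feature vector.
    features = np.concatenate([image_probs, audio_probs, text_probs])
    logits = weights @ features + bias
    return softmax(logits)


num_classes = 3
image_probs = np.array([0.7, 0.2, 0.1])  # made-up image model output
audio_probs = np.array([0.5, 0.3, 0.2])  # made-up audio model output
text_probs = np.array([0.6, 0.1, 0.3])   # made-up text model output

# Assumption: untrained weights that let each modality vote equally
# for its own class.
weights = np.hstack([np.eye(num_classes)] * 3)
bias = np.zeros(num_classes)

fused = fuse(image_probs, audio_probs, text_probs, weights, bias)
predicted_class = int(np.argmax(fused))
print(fused, predicted_class)
```

With these equal-vote weights the fused distribution favors the class that the three models agree on; a trained target classifier could instead learn to trust one modality more than another for particular classes.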
Opening claim text (preview).
What is claimed is:

1. A video classification method, performed by a computer device, the method comprising: obtaining a target video; classifying an image frame in the target video by using a first classification model, to obtain an image classification result, the first classification model being configured to perform classification based on an image feature of the image frame; classifying an audio in the target video by using a second classification model, to obtain an audio classification result, the second classification model being configured to perform classification based on an audio feature of the audio; classifying textual description information corresponding to the target video by using a third classification model, to obtain a textual classification result, the third classification model being configured to perform classification based on a text feature of the textual description information; and determining a target classification result of the target video according to the image classification result, the audio classification result, and the textual classification result.

2. The method according to claim 1, wherein the image classification result comprises a first image classification result, further wherein the classifying an image frame in the target video by using a first classification model, to obtain an image classification result further comprises: determining an original image frame extracted from the target video as an RGB image frame; and classifying the RGB image frame by using a residual network and an RGB classifier in the first classification model, to obtain the first image classification result, wherein the RGB classifier is configured to perform classification based on a static image feature of the RGB image frame.

3. The method according to claim 2, wherein the image classification result comprises a second image classification result, further wherein the classifying an image frame in the target video by using a first classification model, to obtain an image classification result further comprises: generating an RGB difference image frame according to two adjacent original image frames in the target video; and classifying the RGB difference image frame by using a residual network and an RGB difference classifier in the first classification model, to obtain the second image classification result, wherein the RGB difference classifier is configured to perform classification based on a dynamic image feature of the RGB difference image frame.

4. The method according to claim 3, wherein the image classification result comprises a third image classification result, further wherein the classifying an image frame in the target video by using a first classification model, to obtain an image classification result further comprises: determining an original image frame extracted from the target video as an RGB image frame; and classifying the RGB image frame by using a target detection network and a fine granularity classifier in the first classification model, to obtain a third image classification result, wherein the target detection network is configured to extract a fine granularity image feature of a target object in the RGB image frame, and the fine granularity classifier is configured to perform classification based on the fine granularity image feature.

5. The method according to claim 1, wherein the audio classification result comprises a first audio classification result, further wherein the classifying an audio in the target video by using a second classification model, to obtain an audio classification result further comprises: extracting a Mel-frequency cepstral coefficient (MFCC) of the audio; performing feature extraction on the MFCC by using a VGGish network in a second classification model, to obtain a VGGish feature; and classifying the VGGish feature by using a general classifier in the second classification model, to obtain the first audio classification result.

6. The method according to claim 5, wherein the audio classification result further comprises a second audio classification result, wherein the method further comprises: classifying the VGGish feature by using at least one specific classifier in the second classification model, to obtain the second audio classification result outputted by each specific classifier, wherein a quantity of classes in the general classifier are a same quantity of preset classes for videos, wherein the specific classifier is configured to perform classification based on a specific class, which is one of the preset classes for videos, and different specific classifiers correspond to different specific classes.

7. The method according to claim 1, wherein the classifying textual description information corresponding to the target video by using a third classification model, to obtain a textual classification result further comprises: obtaining the textual description information corresponding to the target video, the textual description information comprising at least one of a video title, video content description information, video background music information, or video publisher information; preprocessing the textual description information, wherein the preprocessing comprises at least one of de-noising, word segmentation, entity word retrieving, or stop word removal; and classifying the preprocessed textual description information by using a Bi-directional long short-term memory network (Bi-LSTM) and a text classifier in the third classification model, to obtain the textual classification result.

8. The method according to claim 7, wherein the classifying the preprocessed textual description information by using the Bi-LSTM and the text classifier in the third classification model, to obtain the textual classification result further comprises: inputting the preprocessed textual description information to the Bi-LSTM; performing weight correction on an output result of the Bi-LSTM by using an attention mechanism; and classifying the corrected output result of the Bi-LSTM by using the text classifier, to obtain the textual classification result.

9. The method according to claim 1, wherein the determining a target classification result of the target video according to the image classification result, the audio classification result, and the textual classification result further comprises: splicing probabilities corresponding to classes in the image classification result, the audio classification result, and the textual classification result, to generate a classification feature vector; and inputting the classification feature vector to a target classifier, to obtain the target classification result, the target classifier being constructed based on a softmax classification model.

10. A computing apparatus comprising a processor and a memory, the memory storing computer-readable instructions, the computer-readable instructions, when executed by the processor, causing the processor to perform operations comprising: obtaining a target video; classifying an image frame in the target video by using a first classification model, to obtain an image classification result, the first classification model being configured to perform a classification based on an image feature of the image frame; classifying an audio in the target video by using a second classification model, to obtain an audio classification result, the second classification model being configured to perform a classification based on an audio feature of the audio; classifying textual description information corresponding to the target video by using a third classification model, to obtain a textual classification result, the
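
Claim 3 above refers to generating an RGB difference image frame from two adjacent original image frames. The publication text shown here does not give the arithmetic, so the following is only a minimal sketch under the assumption that the difference is a per-pixel, per-channel subtraction. The function name `rgb_difference` and the tiny 2x2 frames are hypothetical.

```python
import numpy as np


def rgb_difference(prev_frame, next_frame):
    """Per-pixel difference between two adjacent RGB frames.

    Assumption: a plain signed subtraction. Frames are uint8 arrays of
    shape (height, width, 3); casting to int16 first avoids uint8
    wraparound when a pixel gets darker.
    """
    return next_frame.astype(np.int16) - prev_frame.astype(np.int16)


# Two made-up adjacent frames, 2x2 pixels each.
prev_frame = np.zeros((2, 2, 3), dtype=np.uint8)
prev_frame[1, 1] = [200, 200, 200]

next_frame = np.full((2, 2, 3), 10, dtype=np.uint8)
next_frame[1, 1] = [50, 50, 50]

diff = rgb_difference(prev_frame, next_frame)
print(diff[0, 0], diff[1, 1])
```

The signed result keeps the direction of change (brighter or darker), which is the kind of dynamic information the claim attributes to the RGB difference classifier; a real pipeline might rescale or clip the values before feeding them to a network.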
Semantic analysis · CPC title
Overlay text, e.g. embedded captions in a TV programme · CPC title
the classifiers operating on different input data, e.g. multi-modal recognition · CPC title
in video content (extracting overlay text G06V20/62; video retrieval G06F16/70; processing of video elementary streams in video servers H04N21/234; processing of video elementary streams in video clients H04N21/44) · CPC title
using neural networks · CPC title