US12511329B1 · US · B1
| Field | Value |
|---|---|
| Publication number | US-12511329-B1 |
| Application number | US-202418781552-A |
| Country | US |
| Kind code | B1 |
| Filing date | Jul 23, 2024 |
| Priority date | Jun 28, 2024 |
| Publication date | Dec 30, 2025 |
| Grant date | Dec 30, 2025 |
Official abstract text for this publication.
Illustrative embodiments of the disclosure include a method, device, and computer program product for generating a video database. The method includes determining a contextual feature indicating contextual information of a video. The method further includes determining, for a video frame in the video, an audio feature indicating voice text and an ambient sound associated with the video frame. The method further includes determining, for the video frame, a visual feature of the video frame. The method further includes generating a video database based on the contextual feature, the audio feature, and the visual feature. In this way, video features stored in the video database more accurately reflect the real meaning and contextual information of the video, and the generated video database can provide matching results that are more accurate and better conform with user demands for both a video retrieval system and a recommendation system, thus improving the user experience.
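The abstract's idea of a per-frame database built from contextual, audio, and visual features can be sketched roughly as follows. This is a minimal illustration, not the patented method: the names `FrameRecord`, `build_video_database`, and `search` are hypothetical, plain concatenation stands in for the learned fusion, and cosine similarity is assumed as the matching measure.

```python
import math
from dataclasses import dataclass


@dataclass
class FrameRecord:
    """One database entry: a frame index and its combined feature vector."""
    frame_index: int
    feature: list


def l2_normalize(vec):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def build_video_database(contextual, audio_per_frame, visual_per_frame):
    """Combine one video-level contextual vector with per-frame audio and
    visual vectors. ASSUMPTION: concatenation replaces the learned fusion."""
    if len(audio_per_frame) != len(visual_per_frame):
        raise ValueError("audio and visual features must cover the same frames")
    database = []
    for index, (audio, visual) in enumerate(zip(audio_per_frame, visual_per_frame)):
        combined = l2_normalize(list(contextual) + list(audio) + list(visual))
        database.append(FrameRecord(index, combined))
    return database


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def search(database, query_feature, top_k=1):
    """Return the indices of the top_k frames most similar to the query."""
    ranked = sorted(
        database,
        key=lambda record: cosine_similarity(record.feature, query_feature),
        reverse=True,
    )
    return [record.frame_index for record in ranked[:top_k]]
```

In this sketch a query must already be expressed in the same feature space as the stored frames; how a text query is converted into such a feature is not covered here.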
Opening claim text (preview).
What is claimed is:
1. A method, comprising: determining, in a processor-based machine learning system, a contextual feature indicating contextual information of a video, the processor-based machine learning system implementing a plurality of feature fusion models, including at least a first fusion model, a second fusion model and a third fusion model, each such fusion model comprising at least one neural network; for a video frame in the video, determining, in the processor-based machine learning system, an audio feature indicating voice text and an ambient sound associated with the video frame; for the video frame, determining, in the processor-based machine learning system, a visual feature of the video frame; and generating, in the processor-based machine learning system, a video database based on the contextual feature, the audio feature, and the visual feature; wherein generating the video database comprises: applying the contextual feature and the audio feature to respective inputs of the first fusion model; applying an output of the first fusion model to an input of the second fusion model, the second fusion model also receiving as an additional input an additional feature different than the contextual feature, the audio feature and the visual feature, the additional feature comprising a temporal-related feature; applying an output of the second fusion model to an input of the third fusion model, the third fusion model also receiving as an additional input the visual feature; and generating the video database based on an output of the third fusion model.
2. The method according to claim 1, wherein generating the video database comprises: for the video frame in the video, determining an adjacent frame to the video frame; determining an adjacent feature corresponding to the video frame based on the adjacent frame; and generating the video database based on the contextual feature, the audio feature, the visual feature, and the adjacent feature.
3. The method according to claim 2, wherein generating the video database further comprises: for the video frame in the video, determining a difference between the video frame and a subsequent frame; generating, by a sequence model, a temporal difference feature based on the difference; and generating the video database based on the contextual feature, the audio feature, the visual feature, the adjacent feature, and the temporal difference feature.
4. The method according to claim 3, wherein generating the video database further comprises: generating, by the first fusion model, a first fused feature based on the contextual feature, the audio feature, and the adjacent feature; generating, by the second fusion model, a second fused feature based on the first fused feature and the temporal difference feature; generating, by the third fusion model, a third fused feature based on the second fused feature and the visual feature; and integrating the third fused feature and one or more additional fused features in temporal order to obtain a video feature corresponding to the video so as to generate the video database.
5. The method according to claim 4, wherein training of the feature fusion models comprises: generating, by the first fusion model, a first training feature based on a training contextual feature, a training audio feature, and a training adjacent feature; generating, by the second fusion model, a second training feature based on the first training feature and a training temporal difference feature; generating, by the third fusion model, a third training feature based on the second training feature and a training visual feature; and integrating the third training feature and one or more additional training features in temporal order to generate a training video feature corresponding to the video so as to generate a training database.
6. The method according to claim 5, further comprising: in response to receiving a training user query, converting the training user query into a training query feature; and training the first fusion model, the second fusion model, and the third fusion model based on the training query feature and the training database.
7. The method according to claim 6, wherein training the first fusion model, the second fusion model, and the third fusion model comprises: determining a positive sample and a negative sample in the training database based on the training query feature and a preset strategy; calculating a similarity between the training query feature and the positive sample as well as the negative sample; and training the first fusion model, the second fusion model, and the third fusion model based on a contrastive loss function and the similarity.
8. The method according to claim 1, further comprising: in response to receiving a user query, converting the user query into a query feature; determining one or more video frames associated with the user query in the video database based on the query feature; and displaying a video clip based on the one or more video frames associated with the user query.
9. The method according to claim 8, wherein determining one or more video frames associated with the user query in the video database comprises: determining a similarity between the query feature and a video feature stored in the video database; and determining one or more video frames associated with the user query in the video database based on the similarity.
10. An electronic device, comprising: at least one processor; and memory coupled to the at least one processor and having instructions stored therein, wherein the instructions, when executed by the at least one processor, cause the electronic device to perform actions comprising: determining, in a processor-based machine learning system, a contextual feature indicating contextual information of a video, the processor-based machine learning system implementing a plurality of feature fusion models, including at least a first fusion model, a second fusion model and a third fusion model, each such fusion model comprising at least one neural network; for a video frame in the video, determining, in the processor-based machine learning system, an audio feature indicating voice text and an ambient sound associated with the video frame; for the video frame, determining, in the processor-based machine learning system, a visual feature of the video frame; and generating, in the processor-based machine learning system, a video database based on the contextual feature, the audio feature, and the visual feature; wherein generating the video database comprises: applying the contextual feature and the audio feature to respective inputs of the first fusion model; applying an output of the first fusion model to an input of the second fusion model, the second fusion model also receiving as an additional input an additional feature different than the contextual feature, the audio feature and the visual feature, the additional feature comprising a temporal-related feature; applying an output of the second fusion model to an input of the third fusion model, the third fusion model also receiving as an additional input the visual feature; and generating the video database based on an output of the third fusion model.
11. The electronic device according to claim 10, wherein generating the video database comprises: for the video frame in the video, determining an adjacent frame to the video frame; determining an adjacent feature corresponding to the video frame based on the adjacent frame; and generating the video database based on the contextual feature, the audio feature, the vi
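The three-stage fusion order recited in claim 1 and the contrastive training mentioned in claim 7 can be illustrated with a small sketch. Everything concrete here is an assumption made for illustration: the class `FusionCascade`, the single dense layer with `tanh` used as each "fusion model", the random (untrained) weights, and the InfoNCE-style form of `contrastive_loss` with its `temperature` parameter. The claims specify only that each fusion model contains at least one neural network and that training uses a contrastive loss and a similarity.

```python
import math
import random


def make_linear(in_dim, out_dim, rng):
    """Random weights for one dense layer (stand-in for a trained network)."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(in_dim)] for _ in range(out_dim)]


def fuse(weights, *inputs):
    """One fusion step: concatenate the inputs, apply a dense layer and tanh."""
    x = [value for vec in inputs for value in vec]
    if any(len(row) != len(x) for row in weights):
        raise ValueError("input size does not match the layer's weights")
    return [math.tanh(sum(w * xi for w, xi in zip(row, x))) for row in weights]


class FusionCascade:
    """Hypothetical cascade following the input order of claim 1:
    (contextual, audio) -> first; (first, temporal) -> second;
    (second, visual) -> third."""

    def __init__(self, ctx_dim, audio_dim, temporal_dim, visual_dim,
                 hidden_dim=4, seed=0):
        rng = random.Random(seed)
        self.w1 = make_linear(ctx_dim + audio_dim, hidden_dim, rng)
        self.w2 = make_linear(hidden_dim + temporal_dim, hidden_dim, rng)
        self.w3 = make_linear(hidden_dim + visual_dim, hidden_dim, rng)

    def __call__(self, contextual, audio, temporal, visual):
        first = fuse(self.w1, contextual, audio)   # first fusion model
        second = fuse(self.w2, first, temporal)    # adds the temporal feature
        return fuse(self.w3, second, visual)       # adds the visual feature


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def contrastive_loss(query, positive, negatives, temperature=0.1):
    """ASSUMED InfoNCE-style loss: small when the query is closer to the
    positive sample than to the negative samples."""
    pos = math.exp(cosine(query, positive) / temperature)
    neg = sum(math.exp(cosine(query, n) / temperature) for n in negatives)
    return -math.log(pos / (pos + neg))
```

A per-frame output of `FusionCascade` would then be stored, in temporal order, as that frame's entry in the database. The "preset strategy" for choosing positive and negative samples is not described in the claim text shown here, so the sketch takes them as given inputs.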
Indexing; Data structures therefor; Storage structures · CPC title
using metadata automatically derived from the content · CPC title
in form of a video summary, e.g. the video summary being a video sequence, a composite still image or having synthesized frames · CPC title