Method, device, and computer program product for generating video database

US12511329B1 · US · B1

Patent metadata
Field: Value
Publication number: US-12511329-B1
Application number: US-202418781552-A
Country: US
Kind code: B1
Filing date: Jul 23, 2024
Priority date: Jun 28, 2024
Publication date: Dec 30, 2025
Grant date: Dec 30, 2025

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

Illustrative embodiments of the disclosure include a method, device, and computer program product for generating a video database. The method includes determining a contextual feature indicating contextual information of a video. The method further includes determining, for a video frame in the video, an audio feature indicating voice text and an ambient sound associated with the video frame. The method further includes determining, for the video frame, a visual feature of the video frame. The method further includes generating a video database based on the contextual feature, the audio feature, and the visual feature. In this way, video features stored in the video database more accurately reflect the real meaning and contextual information of the video, and the generated video database can provide matching results that are more accurate and better conform with user demands for both a video retrieval system and a recommendation system, thus improving the user experience.
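As a rough illustration of the kind of per-frame database the abstract describes, the sketch below stores one combined feature vector per video frame and looks frames up by similarity to a query feature. It is a minimal sketch under stated assumptions, not the patented method: the names `FrameRecord`, `VideoDatabase`, `add_frame`, and `search` are hypothetical, plain concatenation stands in for the learned fusion, and cosine similarity is an assumed similarity measure that the abstract does not specify.

```python
import math
from dataclasses import dataclass, field
from typing import List


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


@dataclass
class FrameRecord:
    """One database entry: a frame index plus its combined feature vector."""
    frame_index: int
    feature: List[float]


@dataclass
class VideoDatabase:
    """Hypothetical in-memory store of per-frame features."""
    records: List[FrameRecord] = field(default_factory=list)

    def add_frame(self, frame_index, contextual, audio, visual):
        # Assumption: simple concatenation replaces the learned fusion models.
        self.records.append(FrameRecord(frame_index, contextual + audio + visual))

    def search(self, query_feature, top_k=1):
        # Rank stored frames by similarity to the query, most similar first.
        ranked = sorted(
            self.records,
            key=lambda r: cosine_similarity(query_feature, r.feature),
            reverse=True,
        )
        return [r.frame_index for r in ranked[:top_k]]


db = VideoDatabase()
db.add_frame(0, contextual=[1.0, 0.0], audio=[0.0, 1.0], visual=[0.0, 0.0])
db.add_frame(1, contextual=[0.0, 1.0], audio=[1.0, 0.0], visual=[1.0, 1.0])

# The query vector here is hand-written; a real system would derive it from a user query.
best = db.search([0.0, 1.0, 1.0, 0.0, 1.0, 1.0], top_k=1)
```

In this toy setup the query matches the stored feature of frame 1 exactly, so `best` contains that frame index. A production system would likely use an approximate nearest-neighbor index rather than a full sort.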

First claim

Opening claim text (preview).

What is claimed is:

1. A method, comprising:
    determining, in a processor-based machine learning system, a contextual feature indicating contextual information of a video, the processor-based machine learning system implementing a plurality of feature fusion models, including at least a first fusion model, a second fusion model and a third fusion model, each such fusion model comprising at least one neural network;
    for a video frame in the video, determining, in the processor-based machine learning system, an audio feature indicating voice text and an ambient sound associated with the video frame;
    for the video frame, determining, in the processor-based machine learning system, a visual feature of the video frame; and
    generating, in the processor-based machine learning system, a video database based on the contextual feature, the audio feature, and the visual feature;
    wherein generating the video database comprises:
    applying the contextual feature and the audio feature to respective inputs of the first fusion model;
    applying an output of the first fusion model to an input of the second fusion model, the second fusion model also receiving as an additional input an additional feature different than the contextual feature, the audio feature and the visual feature, the additional feature comprising a temporal-related feature;
    applying an output of the second fusion model to an input of the third fusion model, the third fusion model also receiving as an additional input the visual feature; and
    generating the video database based on an output of the third fusion model.

2. The method according to claim 1, wherein generating the video database comprises:
    for the video frame in the video, determining an adjacent frame to the video frame;
    determining an adjacent feature corresponding to the video frame based on the adjacent frame; and
    generating the video database based on the contextual feature, the audio feature, the visual feature, and the adjacent feature.

3. The method according to claim 2, wherein generating the video database further comprises:
    for the video frame in the video, determining a difference between the video frame and a subsequent frame;
    generating, by a sequence model, a temporal difference feature based on the difference; and
    generating the video database based on the contextual feature, the audio feature, the visual feature, the adjacent feature, and the temporal difference feature.

4. The method according to claim 3, wherein generating the video database further comprises:
    generating, by the first fusion model, a first fused feature based on the contextual feature, the audio feature, and the adjacent feature;
    generating, by the second fusion model, a second fused feature based on the first fused feature and the temporal difference feature;
    generating, by the third fusion model, a third fused feature based on the second fused feature and the visual feature; and
    integrating the third fused feature and one or more additional fused features in temporal order to obtain a video feature corresponding to the video so as to generate the video database.

5. The method according to claim 4, wherein training of the feature fusion models comprises:
    generating, by the first fusion model, a first training feature based on a training contextual feature, a training audio feature, and a training adjacent feature;
    generating, by the second fusion model, a second training feature based on the first training feature and a training temporal difference feature;
    generating, by the third fusion model, a third training feature based on the second training feature and a training visual feature; and
    integrating the third training feature and one or more additional training features in temporal order to generate a training video feature corresponding to the video so as to generate a training database.

6. The method according to claim 5, further comprising:
    in response to receiving a training user query, converting the training user query into a training query feature; and
    training the first fusion model, the second fusion model, and the third fusion model based on the training query feature and the training database.

7. The method according to claim 6, wherein training the first fusion model, the second fusion model, and the third fusion model comprises:
    determining a positive sample and a negative sample in the training database based on the training query feature and a preset strategy;
    calculating a similarity between the training query feature and the positive sample as well as the negative sample; and
    training the first fusion model, the second fusion model, and the third fusion model based on a contrastive loss function and the similarity.

8. The method according to claim 1, further comprising:
    in response to receiving a user query, converting the user query into a query feature;
    determining one or more video frames associated with the user query in the video database based on the query feature; and
    displaying a video clip based on the one or more video frames associated with the user query.

9. The method according to claim 8, wherein determining one or more video frames associated with the user query in the video database comprises:
    determining a similarity between the query feature and a video feature stored in the video database; and
    determining one or more video frames associated with the user query in the video database based on the similarity.

10. An electronic device, comprising:
    at least one processor; and
    memory coupled to the at least one processor and having instructions stored therein, wherein the instructions, when executed by the at least one processor, cause the electronic device to perform actions comprising:
    determining, in a processor-based machine learning system, a contextual feature indicating contextual information of a video, the processor-based machine learning system implementing a plurality of feature fusion models, including at least a first fusion model, a second fusion model and a third fusion model, each such fusion model comprising at least one neural network;
    for a video frame in the video, determining, in the processor-based machine learning system, an audio feature indicating voice text and an ambient sound associated with the video frame;
    for the video frame, determining, in the processor-based machine learning system, a visual feature of the video frame; and
    generating, in the processor-based machine learning system, a video database based on the contextual feature, the audio feature, and the visual feature;
    wherein generating the video database comprises:
    applying the contextual feature and the audio feature to respective inputs of the first fusion model;
    applying an output of the first fusion model to an input of the second fusion model, the second fusion model also receiving as an additional input an additional feature different than the contextual feature, the audio feature and the visual feature, the additional feature comprising a temporal-related feature;
    applying an output of the second fusion model to an input of the third fusion model, the third fusion model also receiving as an additional input the visual feature; and
    generating the video database based on an output of the third fusion model.

11. The electronic device according to claim 10, wherein generating the video database comprises:
    for the video frame in the video, determining an adjacent frame to the video frame;
    determining an adjacent feature corresponding to the video frame based on the adjacent frame; and
    generating the video database based on the contextual feature, the audio feature, the vi
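The three-stage fusion recited in claim 1 and the contrastive training recited in claim 7 can be sketched as follows. This is a hedged illustration only: the claims require that each fusion model comprise at least one neural network but do not fix an architecture, so the single `tanh` layer, the helper names `make_fusion_model` and `fuse_frame`, the feature dimension, and the InfoNCE-style form of `contrastive_loss` with its `temperature` parameter are all assumptions made here for illustration.

```python
import math
import random


def make_fusion_model(in_dim, out_dim, rng):
    """Return a one-layer network y = tanh(W x); a hypothetical stand-in for a fusion model."""
    weights = [[rng.uniform(-0.5, 0.5) for _ in range(in_dim)] for _ in range(out_dim)]

    def fuse(*inputs):
        # Concatenate all input vectors, then apply the linear layer and tanh.
        x = [v for vec in inputs for v in vec]
        assert len(x) == in_dim, "input size must match the layer"
        return [math.tanh(sum(w * v for w, v in zip(row, x))) for row in weights]

    return fuse


def fuse_frame(models, contextual, audio, temporal, visual):
    """Cascade the three fusion models in the order recited in claim 1."""
    first, second, third = models
    f1 = first(contextual, audio)   # stage 1: contextual + audio features
    f2 = second(f1, temporal)       # stage 2: stage-1 output + temporal-related feature
    return third(f2, visual)        # stage 3: stage-2 output + visual feature


def contrastive_loss(sim_pos, sim_neg, temperature=0.1):
    """Assumed InfoNCE-style loss for one positive and one negative similarity."""
    pos = math.exp(sim_pos / temperature)
    neg = math.exp(sim_neg / temperature)
    return -math.log(pos / (pos + neg))


rng = random.Random(0)  # fixed seed so the sketch is reproducible
D = 4                   # assumed feature dimension for every input
models = (
    make_fusion_model(2 * D, D, rng),
    make_fusion_model(2 * D, D, rng),
    make_fusion_model(2 * D, D, rng),
)

# Toy per-frame features; real ones would come from upstream feature extractors.
frame_feature = fuse_frame(models, [0.1] * D, [0.2] * D, [0.3] * D, [0.4] * D)
```

The loss is small when the query is more similar to the positive sample than to the negative one, and equals `log 2` when the two similarities are equal. Training would adjust the weights of all three models to reduce this loss; the gradient step is omitted here.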

Assignees

Dell Products L.P.

Classifications

  • G06F16/71 · Primary

    Indexing; Data structures therefor; Storage structures · CPC title

  • using metadata automatically derived from the content · CPC title

  • G06F16/739 · Primary

    in form of a video summary, e.g. the video summary being a video sequence, a composite still image or having synthesized frames · CPC title

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US12511329B1 cover?
Illustrative embodiments of the disclosure include a method, device, and computer program product for generating a video database. The method includes determining a contextual feature indicating contextual information of a video. The method further includes determining, for a video frame in the video, an audio feature indicating voice text and an ambient sound associated with the video frame. T…
Who is the assignee on this patent?
Dell Products L.P.
What technology area does this patent fall under?
Primary CPC classification G06F16/71. Mapped technology areas include Physics.
When was this patent published?
Publication date: Dec 30, 2025 (B1). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 7 related publications on this page (citations in our corpus or others sharing the same primary CPC).