What technology area does this patent fall under?

Primary CPC classification G06F16/71. Mapped technology areas include Physics.

When was this patent published?

Publication date: Dec 30, 2025 (kind code B1). Legal status and post-grant events are not shown on this page.

What related patents are in patentsdb?

We list 7 related publications on this page (citations in our corpus or others sharing the same primary CPC).

Method, device, and computer program product for generating video database

US12511329B1 · US · B1

Patent metadata
Field	Value
Publication number	US-12511329-B1
Application number	US-202418781552-A
Country	US
Kind code	B1
Filing date	Jul 23, 2024
Priority date	Jun 28, 2024
Publication date	Dec 30, 2025
Grant date	Dec 30, 2025

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

Title
What the patent document calls the invention.
Abstract
A short plain-language summary of the technical disclosure.
Assignees and inventors
Who owns or filed the patent and who is credited as inventor.
Key dates
Filing, priority, publication, and grant dates set the timeline.
First independent claim
The legal scope of protection — read this for what is actually claimed.
CPC / IPC classifications
Technology tags used to group this patent with similar filings.
Citations and related patents
Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

Illustrative embodiments of the disclosure include a method, device, and computer program product for generating a video database. The method includes determining a contextual feature indicating contextual information of a video. The method further includes determining, for a video frame in the video, an audio feature indicating voice text and an ambient sound associated with the video frame. The method further includes determining, for the video frame, a visual feature of the video frame. The method further includes generating a video database based on the contextual feature, the audio feature, and the visual feature. In this way, video features stored in the video database more accurately reflect the real meaning and contextual information of the video, and the generated video database can provide matching results that are more accurate and better conform with user demands for both a video retrieval system and a recommendation system, thus improving the user experience.
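
The data flow described in the abstract can be pictured with a minimal sketch. The names `FrameFeatures` and `build_video_database`, the vector sizes, and the use of plain concatenation are all illustrative assumptions; the abstract does not state how the three features are combined or stored.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class FrameFeatures:
    """Per-frame inputs named in the abstract (hypothetical container)."""
    frame_index: int
    audio: List[float]   # stands in for voice-text and ambient-sound features
    visual: List[float]  # stands in for the visual feature of the frame


def build_video_database(
    contextual: List[float], frames: List[FrameFeatures]
) -> Dict[int, List[float]]:
    """Map each frame index to one combined feature vector.

    Concatenation is a placeholder assumption, not the patented method.
    """
    return {f.frame_index: contextual + f.audio + f.visual for f in frames}


frames = [
    FrameFeatures(0, audio=[0.1, 0.2], visual=[0.9]),
    FrameFeatures(1, audio=[0.3, 0.4], visual=[0.8]),
]
database = build_video_database(contextual=[1.0, 0.0], frames=frames)
print(database[1])  # [1.0, 0.0, 0.3, 0.4, 0.8]
```

The contextual feature is computed once per video and shared by every frame, while the audio and visual features vary per frame.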

First claim

Opening claim text (preview).

What is claimed is:

1. A method, comprising: determining, in a processor-based machine learning system, a contextual feature indicating contextual information of a video, the processor-based machine learning system implementing a plurality of feature fusion models, including at least a first fusion model, a second fusion model and a third fusion model, each such fusion model comprising at least one neural network; for a video frame in the video, determining, in the processor-based machine learning system, an audio feature indicating voice text and an ambient sound associated with the video frame; for the video frame, determining, in the processor-based machine learning system, a visual feature of the video frame; and generating, in the processor-based machine learning system, a video database based on the contextual feature, the audio feature, and the visual feature; wherein generating the video database comprises: applying the contextual feature and the audio feature to respective inputs of the first fusion model; applying an output of the first fusion model to an input of the second fusion model, the second fusion model also receiving as an additional input an additional feature different than the contextual feature, the audio feature and the visual feature, the additional feature comprising a temporal-related feature; applying an output of the second fusion model to an input of the third fusion model, the third fusion model also receiving as an additional input the visual feature; and generating the video database based on an output of the third fusion model.

2. The method according to claim 1, wherein generating the video database comprises: for the video frame in the video, determining an adjacent frame to the video frame; determining an adjacent feature corresponding to the video frame based on the adjacent frame; and generating the video database based on the contextual feature, the audio feature, the visual feature, and the adjacent feature.

3. The method according to claim 2, wherein generating the video database further comprises: for the video frame in the video, determining a difference between the video frame and a subsequent frame; generating, by a sequence model, a temporal difference feature based on the difference; and generating the video database based on the contextual feature, the audio feature, the visual feature, the adjacent feature, and the temporal difference feature.

4. The method according to claim 3, wherein generating the video database further comprises: generating, by the first fusion model, a first fused feature based on the contextual feature, the audio feature, and the adjacent feature; generating, by the second fusion model, a second fused feature based on the first fused feature and the temporal difference feature; generating, by the third fusion model, a third fused feature based on the second fused feature and the visual feature; and integrating the third fused feature and one or more additional fused features in temporal order to obtain a video feature corresponding to the video so as to generate the video database.

5. The method according to claim 4, wherein training of the feature fusion models comprises: generating, by the first fusion model, a first training feature based on a training contextual feature, a training audio feature, and a training adjacent feature; generating, by the second fusion model, a second training feature based on the first training feature and a training temporal difference feature; generating, by the third fusion model, a third training feature based on the second training feature and a training visual feature; and integrating the third training feature and one or more additional training features in temporal order to generate a training video feature corresponding to the video so as to generate a training database.

6. The method according to claim 5, further comprising: in response to receiving a training user query, converting the training user query into a training query feature; and training the first fusion model, the second fusion model, and the third fusion model based on the training query feature and the training database.

7. The method according to claim 6, wherein training the first fusion model, the second fusion model, and the third fusion model comprises: determining a positive sample and a negative sample in the training database based on the training query feature and a preset strategy; calculating a similarity between the training query feature and the positive sample as well as the negative sample; and training the first fusion model, the second fusion model, and the third fusion model based on a contrastive loss function and the similarity.

8. The method according to claim 1, further comprising: in response to receiving a user query, converting the user query into a query feature; determining one or more video frames associated with the user query in the video database based on the query feature; and displaying a video clip based on the one or more video frames associated with the user query.

9. The method according to claim 8, wherein determining one or more video frames associated with the user query in the video database comprises: determining a similarity between the query feature and a video feature stored in the video database; and determining one or more video frames associated with the user query in the video database based on the similarity.

10. An electronic device, comprising: at least one processor; and memory coupled to the at least one processor and having instructions stored therein, wherein the instructions, when executed by the at least one processor, cause the electronic device to perform actions comprising: determining, in a processor-based machine learning system, a contextual feature indicating contextual information of a video, the processor-based machine learning system implementing a plurality of feature fusion models, including at least a first fusion model, a second fusion model and a third fusion model, each such fusion model comprising at least one neural network; for a video frame in the video, determining, in the processor-based machine learning system, an audio feature indicating voice text and an ambient sound associated with the video frame; for the video frame, determining, in the processor-based machine learning system, a visual feature of the video frame; and generating, in the processor-based machine learning system, a video database based on the contextual feature, the audio feature, and the visual feature; wherein generating the video database comprises: applying the contextual feature and the audio feature to respective inputs of the first fusion model; applying an output of the first fusion model to an input of the second fusion model, the second fusion model also receiving as an additional input an additional feature different than the contextual feature, the audio feature and the visual feature, the additional feature comprising a temporal-related feature; applying an output of the second fusion model to an input of the third fusion model, the third fusion model also receiving as an additional input the visual feature; and generating the video database based on an output of the third fusion model.

11. The electronic device according to claim 10, wherein generating the video database comprises: for the video frame in the video, determining an adjacent frame to the video frame; determining an adjacent feature corresponding to the video frame based on the adjacent frame; and generating the video database based on the contextual feature, the audio feature, the vi
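
The three-stage fusion cascade recited in claim 1, the similarity lookup of claims 8 and 9, and the contrastive training signal of claim 7 can be sketched roughly as follows. Everything concrete here is an assumption made for illustration: `FusionModel` is a hypothetical single-layer network with random weights, the feature vectors are random placeholders, cosine similarity stands in for the unspecified similarity measure, and the InfoNCE-style loss is only one possible contrastive loss. The claims do not fix any of these choices.

```python
import math
import random
from typing import Dict, List, Sequence

Vector = List[float]


class FusionModel:
    """Hypothetical one-layer network: tanh(W applied to concat(a, b))."""

    def __init__(self, in_dim: int, out_dim: int, rng: random.Random) -> None:
        self.weights = [[rng.uniform(-1.0, 1.0) for _ in range(in_dim)]
                        for _ in range(out_dim)]

    def __call__(self, a: Vector, b: Vector) -> Vector:
        x = a + b  # concatenate the two input vectors
        return [math.tanh(sum(w * v for w, v in zip(row, x)))
                for row in self.weights]


def fuse_frame(models: Sequence[FusionModel], contextual: Vector, audio: Vector,
               temporal: Vector, visual: Vector) -> Vector:
    first, second, third = models
    h1 = first(contextual, audio)  # contextual + audio into the first model
    h2 = second(h1, temporal)      # first output + temporal-related feature
    return third(h2, visual)       # second output + visual feature


def cosine(u: Vector, v: Vector) -> float:
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(x * x for x in v))
    return sum(x * y for x, y in zip(u, v)) / norm if norm else 0.0


def search(database: Dict[int, Vector], query: Vector, top_k: int = 1) -> List[int]:
    """Return the frame indices whose stored features best match the query."""
    ranked = sorted(database, key=lambda i: cosine(database[i], query), reverse=True)
    return ranked[:top_k]


def contrastive_loss(query: Vector, positive: Vector, negative: Vector,
                     temperature: float = 0.1) -> float:
    """InfoNCE-style loss over one positive and one negative (an assumption)."""
    pos = math.exp(cosine(query, positive) / temperature)
    neg = math.exp(cosine(query, negative) / temperature)
    return -math.log(pos / (pos + neg))


DIM = 4  # illustrative feature size shared by all inputs
rng = random.Random(0)
models = [FusionModel(2 * DIM, DIM, rng) for _ in range(3)]


def random_vector() -> Vector:
    return [rng.random() for _ in range(DIM)]


contextual = random_vector()  # one contextual feature for the whole video
database = {
    i: fuse_frame(models, contextual, random_vector(), random_vector(), random_vector())
    for i in range(3)
}
# A stored frame feature stands in for an encoded user query here.
print(search(database, query=database[2], top_k=2))
```

In this sketch the query reuses a stored frame feature only to keep the example self-contained; the claims describe converting a user query into a query feature without specifying the encoder. The loss is small when the query is closer to the positive sample than to the negative one, which is the direction training would push the three fusion models.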

Assignees

Dell Products Lp

Inventors

Classifications

G06F16/71 · Primary
Indexing; Data structures therefor; Storage structures · CPC title
G06F16/783
using metadata automatically derived from the content · CPC title
G06F16/739 · Primary
in form of a video summary, e.g. the video summary being a video sequence, a composite still image or having synthesized frames · CPC title

Patent family

Related publications grouped by family.

View patent family 98143536

External sources

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US12511329B1 cover?: Illustrative embodiments of the disclosure include a method, device, and computer program product for generating a video database. The method includes determining a contextual feature indicating contextual information of a video. The method further includes determining, for a video frame in the video, an audio feature indicating voice text and an ambient sound associated with the video frame. T…
Who is the assignee on this patent?: Dell Products Lp

Related patents

Systems and Methods for Video Genre Classification

Intelligent reframing

Document body vectorization and noise-contrastive training

Cognitive video and audio search aggregation

Gating model for video analysis

Systems and methods for video processing

Systems and methods for video paragraph captioning using hierarchical recurrent neural networks