What technology area does this patent fall under?

The primary CPC classification is G06F18/214. Mapped technology areas include Physics (CPC section G).

When was this patent published?

Publication date: Jul 22, 2025 (kind code B2). Legal status and post-grant events are not shown on this page.

What related patents are in patentsdb?

We list 2 related publications on this page (citations in our corpus or others sharing the same primary CPC).

Method, electronic device, and computer program product for training model

US12367259B2 · US · B2

Patent metadata
Field	Value
Publication number	US-12367259-B2
Application number	US-202217588515-A
Country	US
Kind code	B2
Filing date	Jan 31, 2022
Priority date	Dec 31, 2021
Publication date	Jul 22, 2025
Grant date	Jul 22, 2025

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

Title
What the patent document calls the invention.
Abstract
A short plain-language summary of the technical disclosure.
Assignees and inventors
Who owns or filed the patent and who is credited as inventor.
Key dates
Filing, priority, publication, and grant dates set the timeline.
First independent claim
The legal scope of protection — read this for what is actually claimed.
CPC / IPC classifications
Technology tags used to group this patent with similar filings.
Citations and related patents
Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

Embodiments of the present disclosure provide a method, an electronic device, and a computer program product for training a model. The method may include determining image features, audio features, and text features of a reference object based on reference image information, reference audio information, and reference text information associated with the reference object, respectively. The method may also include constructing a feature tensor from the image features, the audio features, and the text features. In addition, the method may further include decomposing the feature tensor into a first feature vector, a second feature vector, and a third feature vector corresponding to the image features, the audio features, and the text features, respectively, to determine a loss function value of the model. The method may also include updating parameters of the model based on the loss function value.
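
The abstract describes decomposing the feature tensor into three vectors and scoring them against the encoder features, but this page does not say which decomposition is used. As a minimal sketch, assuming a rank-1 fit (tensor ≈ a ⊗ b ⊗ c, with the residual playing the role of the noise mentioned in claim 6) computed by alternating least squares, and assuming each modality contributes one scalar feature per item, the step might look like the following. The names `rank1_decompose` and `l1_loss` are hypothetical and do not come from the patent.

```python
def rank1_decompose(tensor, iters=20):
    """Fit tensor[i][j][k] ~ a[i] * b[j] * c[k] by alternating least squares.

    Assumption: a rank-1 model; the patent excerpt does not name the method.
    Assumes the tensor is not all zeros (otherwise the norms below vanish).
    """
    n_i, n_j, n_k = len(tensor), len(tensor[0]), len(tensor[0][0])
    a, b, c = [1.0] * n_i, [1.0] * n_j, [1.0] * n_k

    def sq(v):
        # Squared Euclidean norm of a vector.
        return sum(x * x for x in v)

    for _ in range(iters):
        # Each update is the least-squares solution for one vector
        # while the other two are held fixed.
        a = [sum(tensor[i][j][k] * b[j] * c[k]
                 for j in range(n_j) for k in range(n_k)) / (sq(b) * sq(c))
             for i in range(n_i)]
        b = [sum(tensor[i][j][k] * a[i] * c[k]
                 for i in range(n_i) for k in range(n_k)) / (sq(a) * sq(c))
             for j in range(n_j)]
        c = [sum(tensor[i][j][k] * a[i] * b[j]
                 for i in range(n_i) for j in range(n_j)) / (sq(a) * sq(b))
             for k in range(n_k)]
    return a, b, c


def l1_loss(factors, features):
    """Sum of absolute differences between each factor vector and the
    features of the matching modality (image, audio, text)."""
    return sum(abs(f - x)
               for factor, feature in zip(factors, features)
               for f, x in zip(factor, feature))
```

A rank-1 factorization is only defined up to scale (multiplying `a` by 2 and `b` by 0.5 gives the same product), so a real training loop would need some normalization convention before comparing the factors with encoder outputs. That convention is not stated on this page.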

First claim

Opening claim text (preview).

What is claimed is:

1. A method for training a model, comprising: determining image features, audio features, and text features of a reference object, in respective ones of a video encoder, an audio encoder and a text encoder, based on reference image information, reference audio information, and reference text information associated with the reference object, respectively; constructing a feature tensor from the image features, the audio features, and the text features, the feature tensor defining a multi-dimensional space in which a given position within the multi-dimensional space corresponds to a combination of a corresponding image feature of the image features, a corresponding audio feature of the audio features, and a corresponding text feature of the text features, wherein a match type indicating a relationship between the corresponding image feature, the corresponding audio feature and the corresponding text feature is represented in the feature tensor as a particular one of a plurality of values each indicating a different match type; decomposing the feature tensor into a first feature vector, a second feature vector, and a third feature vector corresponding to the image features, the audio features, and the text features, respectively, to determine a loss function value of the model, wherein the loss function value is computed at least in part based on a combination of a first absolute value of a difference between the first feature vector and one or more of the image features, a second absolute value of a difference between the second feature vector and one or more of the audio features, and a third absolute value of a difference between the third feature vector and one or more of the text features; and updating parameters of the model based on the loss function value; wherein the method is performed by at least one processing device comprising a processor coupled to a memory.

2. The method of claim 1, wherein determining the loss function value comprises: determining the loss function value of the model based on the first feature vector, the second feature vector, the third feature vector, and the corresponding image features, audio features, and text features.

3. The method of claim 1, wherein constructing the feature tensor comprises: arranging the image features, the audio features, and the text features respectively along a first coordinate, a second coordinate, and a third coordinate to form a three-dimensional space, a given position in the three-dimensional space corresponding to a combination of the corresponding image feature of the image features, the corresponding audio feature of the audio features, and the corresponding text feature of the text features; and determining a value of the position based on pre-labeled associated information of the combination to form a part of the feature tensor.

4. The method of claim 3, wherein the first feature vector, the second feature vector, and the third feature vector each comprise the associated information of the feature tensor which has been de-noised.

5. The method of claim 1, wherein determining the image features, the audio features, and the text features comprises: determining the image features based on the reference image information using a video encoder; determining the audio features based on the reference audio information using an audio encoder; and determining the text features based on the reference text information using a text encoder.

6. The method of claim 1, wherein the feature tensor corresponds to the first feature vector, the second feature vector, the third feature vector, and noise.

7. The method of claim 1, wherein the reference object is a human face.

8. An electronic device, comprising: a processor; and a memory coupled to the processor and having instructions stored therein which, when executed by the processor, cause the electronic device to perform actions for training a model, the actions comprising: determining image features, audio features, and text features of a reference object, in respective ones of a video encoder, an audio encoder and a text encoder, based on reference image information, reference audio information, and reference text information associated with the reference object, respectively; constructing a feature tensor from the image features, the audio features, and the text features, the feature tensor defining a multi-dimensional space in which a given position within the multi-dimensional space corresponds to a combination of a corresponding image feature of the image features, a corresponding audio feature of the audio features, and a corresponding text feature of the text features, wherein a match type indicating a relationship between the corresponding image feature, the corresponding audio feature and the corresponding text feature is represented in the feature tensor as a particular one of a plurality of values each indicating a different match type; decomposing the feature tensor into a first feature vector, a second feature vector, and a third feature vector corresponding to the image features, the audio features, and the text features, respectively, to determine a loss function value of the model, wherein the loss function value is computed at least in part based on a combination of a first absolute value of a difference between the first feature vector and one or more of the image features, a second absolute value of a difference between the second feature vector and one or more of the audio features, and a third absolute value of a difference between the third feature vector and one or more of the text features; and updating parameters of the model based on the loss function value.

9. The electronic device of claim 8, wherein determining the loss function value comprises: determining the loss function value of the model based on the first feature vector, the second feature vector, the third feature vector, and the corresponding image features, audio features, and text features.

10. The electronic device of claim 8, wherein constructing the feature tensor comprises: arranging the image features, the audio features, and the text features respectively along a first coordinate, a second coordinate, and a third coordinate to form a three-dimensional space, a given position in the three-dimensional space corresponding to a combination of the corresponding image feature of the image features, the corresponding audio feature of the audio features, and the corresponding text feature of the text features; and determining a value of the position based on pre-labeled associated information of the combination to form a part of the feature tensor.

11. The electronic device of claim 10, wherein the first feature vector, the second feature vector, and the third feature vector each comprise the associated information of the feature tensor which has been de-noised.

12. The electronic device of claim 8, wherein determining the image features, the audio features, and the text features comprises: determining the image features based on the reference image information using a video encoder; determining the audio features based on the reference audio information using an audio encoder; and determining the text features based on the reference text information using a text encoder.

13. The electronic device of claim 8, wherein the feature tensor corresponds to the first feature vector, the second feature vector, the third feature vector, and noise.

14. The electronic device of claim 8, wherein the reference object is a human face.

15. A computer program product comprising a non-trans
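
Claims 1 and 3 describe a three-dimensional tensor whose axes index image, audio, and text features, with each position holding a value that encodes the match type of that combination. The sketch below illustrates one way this could be built, assuming the pre-labeled information is a subject identifier per clip and assuming three match types with the codes 1.0, 0.5, and 0.0. The identifiers, the codes, and the function names are all hypothetical; the claims only require that different match types map to different values.

```python
# Hypothetical match-type codes (not taken from the patent).
FULL_MATCH = 1.0     # image, audio, and text all belong to the same subject
PARTIAL_MATCH = 0.5  # exactly two of the three belong to the same subject
NO_MATCH = 0.0       # all three belong to different subjects


def match_type(image_id, audio_id, text_id):
    """Return the value stored at one tensor position, based on labels."""
    distinct = len({image_id, audio_id, text_id})
    if distinct == 1:
        return FULL_MATCH
    if distinct == 2:
        return PARTIAL_MATCH
    return NO_MATCH


def build_feature_tensor(image_ids, audio_ids, text_ids):
    """Arrange the three modalities along three axes; tensor[i][j][k] holds
    the match type of (image i, audio j, text k)."""
    return [[[match_type(i_id, a_id, t_id) for t_id in text_ids]
             for a_id in audio_ids]
            for i_id in image_ids]


# Example labels: which subject each image, audio clip, and transcript shows.
tensor = build_feature_tensor(["anna", "ben"], ["anna", "ben"], ["ben", "cara"])
```

Here `tensor[1][1][0]` combines Ben's image, Ben's audio, and Ben's transcript, so it holds the full-match code, while `tensor[0][1][1]` mixes three different subjects and holds the no-match code.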

Assignees

Dell Products Lp

Classifications

Code	CPC title
G06F18/253	Fusion techniques of extracted features
G06V20/46	Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
G06F40/126	Character encoding
G06V40/168	Feature extraction; Face representation
G06N3/08	Learning methods

Patent family

Related publications grouped by family.

View patent family 86991682

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US12367259B2 cover?: Embodiments of the present disclosure provide a method, an electronic device, and a computer program product for training a model. The method may include determining image features, audio features, and text features of a reference object based on reference image information, reference audio information, and reference text information associated with the reference object, respectively. The metho…
Who is the assignee on this patent?: Dell Products Lp