US12367259B2 · US · B2
| Field | Value |
|---|---|
| Publication number | US-12367259-B2 |
| Application number | US-202217588515-A |
| Country | US |
| Kind code | B2 |
| Filing date | Jan 31, 2022 |
| Priority date | Dec 31, 2021 |
| Publication date | Jul 22, 2025 |
| Grant date | Jul 22, 2025 |
Official abstract text for this publication.
Embodiments of the present disclosure provide a method, an electronic device, and a computer program product for training a model. The method may include determining image features, audio features, and text features of a reference object based on reference image information, reference audio information, and reference text information associated with the reference object, respectively. The method may also include constructing a feature tensor from the image features, the audio features, and the text features. In addition, the method may further include decomposing the feature tensor into a first feature vector, a second feature vector, and a third feature vector corresponding to the image features, the audio features, and the text features, respectively, to determine a loss function value of the model. The method may also include updating parameters of the model based on the loss function value.
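The decomposition step in the abstract can be illustrated with a small sketch. It assumes, as one possible reading and not necessarily the patented method, that the three-way feature tensor is approximated by the outer product of three vectors (a rank-1 CP decomposition) fitted by alternating least squares. The names `rank1_cp` and `reconstruct` are hypothetical and do not come from the patent.

```python
import numpy as np

def rank1_cp(tensor, n_iter=20):
    """Fit tensor ~ a x b x c (outer product) by alternating least squares.

    Assumption: the contractions below never vanish, so no division by zero.
    """
    n_i, n_j, n_k = tensor.shape
    a, b, c = np.ones(n_i), np.ones(n_j), np.ones(n_k)
    for _ in range(n_iter):
        # Each update is the least-squares optimum with the other two fixed.
        a = np.einsum("ijk,j,k->i", tensor, b, c) / ((b @ b) * (c @ c))
        b = np.einsum("ijk,i,k->j", tensor, a, c) / ((a @ a) * (c @ c))
        c = np.einsum("ijk,i,j->k", tensor, a, b) / ((a @ a) * (b @ b))
    return a, b, c

def reconstruct(a, b, c):
    """Rebuild the rank-1 tensor from its three factor vectors."""
    return np.einsum("i,j,k->ijk", a, b, c)

# Usage: a rank-1 tensor plus small noise (illustrative values only).
rng = np.random.default_rng(0)
clean = reconstruct(np.array([1.0, 2.0]),
                    np.array([1.0, 0.5, 2.0]),
                    np.array([3.0, 1.0]))
noisy = clean + 0.01 * rng.standard_normal(clean.shape)
a, b, c = rank1_cp(noisy)
residual = noisy - reconstruct(a, b, c)  # the part treated as noise
```

In this reading, the three vectors correspond to the image, audio, and text axes, and whatever the outer product does not explain is treated as noise.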
Opening claim text (preview).
What is claimed is:
1. A method for training a model, comprising: determining image features, audio features, and text features of a reference object, in respective ones of a video encoder, an audio encoder and a text encoder, based on reference image information, reference audio information, and reference text information associated with the reference object, respectively; constructing a feature tensor from the image features, the audio features, and the text features, the feature tensor defining a multi-dimensional space in which a given position within the multi-dimensional space corresponds to a combination of a corresponding image feature of the image features, a corresponding audio feature of the audio features, and a corresponding text feature of the text features, wherein a match type indicating a relationship between the corresponding image feature, the corresponding audio feature and the corresponding text feature is represented in the feature tensor as a particular one of a plurality of values each indicating a different match type; decomposing the feature tensor into a first feature vector, a second feature vector, and a third feature vector corresponding to the image features, the audio features, and the text features, respectively, to determine a loss function value of the model, wherein the loss function value is computed at least in part based on a combination of a first absolute value of a difference between the first feature vector and one or more of the image features, a second absolute value of a difference between the second feature vector and one or more of the audio features, and a third absolute value of a difference between the third feature vector and one or more of the text features; and updating parameters of the model based on the loss function value; wherein the method is performed by at least one processing device comprising a processor coupled to a memory.
2. The method of claim 1, wherein determining the loss function value comprises: determining the loss function value of the model based on the first feature vector, the second feature vector, the third feature vector, and the corresponding image features, audio features, and text features.
3. The method of claim 1, wherein constructing the feature tensor comprises: arranging the image features, the audio features, and the text features respectively along a first coordinate, a second coordinate, and a third coordinate to form a three-dimensional space, a given position in the three-dimensional space corresponding to a combination of the corresponding image feature of the image features, the corresponding audio feature of the audio features, and the corresponding text feature of the text features; and determining a value of the position based on pre-labeled associated information of the combination to form a part of the feature tensor.
4. The method of claim 3, wherein the first feature vector, the second feature vector, and the third feature vector each comprise the associated information of the feature tensor which has been de-noised.
5. The method of claim 1, wherein determining the image features, the audio features, and the text features comprises: determining the image features based on the reference image information using a video encoder; determining the audio features based on the reference audio information using an audio encoder; and determining the text features based on the reference text information using a text encoder.
6. The method of claim 1, wherein the feature tensor corresponds to the first feature vector, the second feature vector, the third feature vector, and noise.
7. The method of claim 1, wherein the reference object is a human face.
8. An electronic device, comprising: a processor; and a memory coupled to the processor and having instructions stored therein which, when executed by the processor, cause the electronic device to perform actions for training a model, the actions comprising: determining image features, audio features, and text features of a reference object, in respective ones of a video encoder, an audio encoder and a text encoder, based on reference image information, reference audio information, and reference text information associated with the reference object, respectively; constructing a feature tensor from the image features, the audio features, and the text features, the feature tensor defining a multi-dimensional space in which a given position within the multi-dimensional space corresponds to a combination of a corresponding image feature of the image features, a corresponding audio feature of the audio features, and a corresponding text feature of the text features, wherein a match type indicating a relationship between the corresponding image feature, the corresponding audio feature and the corresponding text feature is represented in the feature tensor as a particular one of a plurality of values each indicating a different match type; decomposing the feature tensor into a first feature vector, a second feature vector, and a third feature vector corresponding to the image features, the audio features, and the text features, respectively, to determine a loss function value of the model, wherein the loss function value is computed at least in part based on a combination of a first absolute value of a difference between the first feature vector and one or more of the image features, a second absolute value of a difference between the second feature vector and one or more of the audio features, and a third absolute value of a difference between the third feature vector and one or more of the text features; and updating parameters of the model based on the loss function value.
9. The electronic device of claim 8, wherein determining the loss function value comprises: determining the loss function value of the model based on the first feature vector, the second feature vector, the third feature vector, and the corresponding image features, audio features, and text features.
10. The electronic device of claim 8, wherein constructing the feature tensor comprises: arranging the image features, the audio features, and the text features respectively along a first coordinate, a second coordinate, and a third coordinate to form a three-dimensional space, a given position in the three-dimensional space corresponding to a combination of the corresponding image feature of the image features, the corresponding audio feature of the audio features, and the corresponding text feature of the text features; and determining a value of the position based on pre-labeled associated information of the combination to form a part of the feature tensor.
11. The electronic device of claim 10, wherein the first feature vector, the second feature vector, and the third feature vector each comprise the associated information of the feature tensor which has been de-noised.
12. The electronic device of claim 8, wherein determining the image features, the audio features, and the text features comprises: determining the image features based on the reference image information using a video encoder; determining the audio features based on the reference audio information using an audio encoder; and determining the text features based on the reference text information using a text encoder.
13. The electronic device of claim 8, wherein the feature tensor corresponds to the first feature vector, the second feature vector, the third feature vector, and noise.
14. The electronic device of claim 8, wherein the reference object is a human face.
15. A computer program product comprising a non-trans
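Claims 1 and 3 describe filling a three-dimensional tensor with match-type values taken from pre-labeled combinations, and a loss built from absolute differences between the decomposed vectors and the encoder features. The sketch below is an illustration under stated assumptions, not the claimed implementation: the `MATCH_VALUES` encoding, the function names, and the reduction of each feature to one scalar per item are all hypothetical choices made so the shapes line up.

```python
import numpy as np

# Hypothetical encoding: the claims only require a distinct value per match type.
MATCH_VALUES = {"none": 0.0, "partial": 0.5, "full": 1.0}

def build_feature_tensor(n_image, n_audio, n_text, labels):
    """Place a match-type value at each pre-labeled (image, audio, text) index.

    labels maps (i, j, k) index triples to a key of MATCH_VALUES;
    unlabeled combinations default to "none" (an assumption).
    """
    tensor = np.full((n_image, n_audio, n_text), MATCH_VALUES["none"])
    for (i, j, k), match in labels.items():
        tensor[i, j, k] = MATCH_VALUES[match]
    return tensor

def l1_loss(vectors, features):
    """Sum, over the three modalities, of the mean absolute difference
    between a decomposed vector and the matching per-item feature scores."""
    return sum(float(np.mean(np.abs(v - f))) for v, f in zip(vectors, features))

# Usage with illustrative labels: one full match and one partial match.
tensor = build_feature_tensor(2, 2, 2, {(0, 0, 0): "full", (1, 1, 0): "partial"})
vectors = [np.array([1.0, 2.0])] * 3   # stand-ins for the decomposed vectors
features = [np.array([0.0, 2.0])] * 3  # stand-ins for encoder feature scores
loss = l1_loss(vectors, features)
```

A training loop would then adjust the encoder parameters to reduce `loss`; that update step depends on the model and is left out here.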
of extracted features · CPC title
Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames · CPC title
Character encoding · CPC title
Feature extraction; Face representation · CPC title
Learning methods · CPC title