Weighted deep fusion architecture

US12293273B2 · US · B2

Patent metadata
Field               Value
Publication number  US-12293273-B2
Application number  US-202016928094-A
Country             US
Kind code           B2
Filing date         Jul 14, 2020
Priority date       Jul 14, 2020
Publication date    May 6, 2025
Grant date          May 6, 2025

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

A method, a computer program product, and a computer system fuse features for multi-modal classifications for a plurality of modality inputs. The method includes receiving a request indicative of the modality inputs to be selected. The method includes performing an embeddings level fusion operation to concatenate features from the modality inputs. The method includes performing a multi-modal discriminative feature level fusion operation that integrates feature representations learned by applying different network structures on the modality inputs. The method includes determining weights of the concatenated features and the feature representations based on a measure of the concatenated features and the feature representations indicative of affecting a final prediction performance. The method includes generating fused features for the modality inputs based on the concatenated features, the feature representations, and the weights. The method includes generating a response to the request based on the fused features. The method includes transmitting the response.
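The steps named in the abstract can be illustrated with a minimal sketch. This is not the patented implementation: the function names, the toy feature values, and the use of a softmax over per-block scores as the weighting "measure" are all assumptions made for illustration, since the abstract does not say how the weights are computed.

```python
import math


def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def embeddings_level_fusion(modality_features):
    """Concatenate per-modality feature vectors in the order given."""
    fused = []
    for _name, vector in modality_features:
        fused.extend(vector)
    return fused


def weighted_fusion(blocks, scores):
    """Scale each feature block by its weight, then concatenate the blocks.

    `scores` is a hypothetical stand-in for the abstract's "measure ...
    indicative of affecting a final prediction performance".
    """
    weights = softmax(scores)
    fused = []
    for weight, block in zip(weights, blocks):
        fused.extend(weight * x for x in block)
    return weights, fused


# Toy features for the three modalities named in claim 1 (values invented).
features = [
    ("text", [0.2, 0.4]),
    ("audio", [0.1]),
    ("video", [0.3, 0.5, 0.7]),
]
concatenated = embeddings_level_fusion(features)

# Hypothetical output of the discriminative feature-level fusion step.
representations = [1.0, 0.0]

# Hypothetical per-block scores, e.g. validation accuracy of each block alone.
weights, fused = weighted_fusion([concatenated, representations], [0.8, 0.6])
```

In this sketch the block with the higher score receives the larger weight, so it contributes more to the fused vector that a downstream classifier would consume.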

First claim

Opening claim text (preview).

What is claimed is:

1. A computer-implemented method for fusing data for multi-modal classifications for a plurality of modality inputs, the computer-implemented method comprising: receiving, by a system operatively coupled to a processor, via the Internet, from an entity employing a service client at a first location remote from a second location, a request indicative of the modality inputs to be selected, the modality inputs including text, audio, and video; performing, by the system, via a multi-modal embeddings level fusion of automatic weighted deep fusion (AWD) process, features from the modality inputs employing a similarity matrix by aligning representative feature maps across the modality inputs; applying, by the system, different network structures on the modality inputs; integrating, by the system, via a multi-modal discriminative feature level fusion of the AWD process, feature representations learned by the applying the different network structures on the modality inputs; generating, by the system, weights of the concatenated features and the feature representations based on a measure of the concatenated features and the feature representations indicative of affecting a final prediction performance; generating, by the system, fused features for the modality inputs based on the concatenated features, the feature representations, and the weights; generating, by the system, a response to the request based on the fused features; and transmitting, by the system, via the Internet, the response to the service client associated with the entity.

2. The computer-implemented method of claim 1, wherein the modality inputs have a deep architecture including a convolution neural network, a recurrent neural network, or a combination thereof.

3. The computer-implemented method of claim 1, wherein the concatenated features in the modality inputs are concatenated based on a distribution, an embedding, or a combination thereof of the feature in the modality inputs.

4. The computer-implemented method of claim 1, wherein the multi-modal discriminative level feature fusion operation includes a deep correlation fusion operation that determines contributions of correlations of the feature representations.

5. The computer-implemented method of claim 4, wherein the deep correlation fusion operation determines a degree of correlation of a first one of the feature representations to a second one of the feature representations.

6. The computer-implemented method of claim 5, wherein the deep correlation fusion operation determines a corresponding contribution for each of the modality inputs through a weighted sum of each degree of correlation of the feature representations.

7. The computer-implemented method of claim 1, wherein the multi-modal discriminative level feature fusion operation includes a pair-wise matching fusion operation is indicative of a pair-wise matching degree of the feature representations according to embeddings obtained for different modality inputs.

8. A computer program product for fusing data for multi-modal classifications for a plurality of modality inputs, the computer program product comprising: one or more non-transitory computer-readable storage media and program instructions stored on the one or more non-transitory computer-readable storage media capable of performing a method, the method comprising: receiving, via the Internet, from an entity employing a service client at a first location remote from a second location of the computer program product, a request indicative of the modality inputs to be selected, the modality inputs including text, audio, and video; concatenating, via a multi-modal embeddings level fusion of automatic weighted deep fusion (AWD) process, features from the modality inputs employing a similarity matrix by aligning representative feature maps across the modality inputs; applying different network structures on the modality inputs; integrating, via a multi-modal discriminative feature level fusion of the AWD process, feature representations learned by the applying the different network structures on the modality inputs; generating weights of the concatenated features and the feature representations based on a measure of the concatenated features and the feature representations indicative of affecting a final prediction performance; generating fused features for the modality inputs based on the concatenated features, the feature representations, and the weights; generating a response to the request based on the fused features; and transmitting, via the Internet, the response to the service client associated with the entity.

9. The computer program product of claim 8, wherein the modality inputs have a deep architecture including a convolution neural network, a recurrent neural network, or a combination thereof.

10. The computer program product of claim 8, wherein the concatenated features in the modality inputs are concatenated based on a distribution, an embedding, or a combination thereof of the feature in the modality inputs.

11. The computer program product of claim 8, wherein the multi-modal discriminative level feature fusion operation includes a deep correlation fusion operation that determines contributions of correlations of the feature representations.

12. The computer program product of claim 11, wherein the deep correlation fusion operation determines a degree of correlation of a first one of the feature representations to a second one of the feature representations.

13. The computer program product of claim 12, wherein the deep correlation fusion operation determines a corresponding contribution for each of the modality inputs through a weighted sum of each degree of correlation of the feature representations.

14. The computer program product of claim 8, wherein the multi-modal discriminative level feature fusion operation includes a pair-wise matching fusion operation is indicative of a pair-wise matching degree of the feature representations according to embeddings obtained for different modality inputs.

15. A computer system for fusing data for multi-modal classifications for a plurality of modality inputs, the computer system comprising: one or more computer processors, one or more computer-readable storage media, and program instructions stored on the one or more of the computer-readable storage media for execution by at least one of the one or more computer processors capable of performing a method, the method comprising: receiving via the Internet, from an entity employing a service client at a first location remote from a second location, a request indicative of the modality inputs to be selected, the modality inputs including text, audio, and video; performing, via a multi-modal embeddings level fusion operation to concatenate of automatic weighted deep fusion (AWD) process, features from the modality inputs based on a similarity matrix that aligns representative feature maps across the modality inputs; applying, by the system, different network structures on the modality inputs; integrating, via a multi-modal discriminative feature level fusion of the AWD process operation that integrates feature representations learned by the applying the different network structures on the modality inputs; generating weights of the concatenated features and the feature representations based on a measure of the concatenated features and the feature representations indicative of affecting a final prediction performance; generating fused features for the modality inputs based on the concatenated features, the feature representations, and the we
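The deep correlation fusion operation described in claims 4 to 6 (and mirrored in claims 11 to 13) can be sketched as follows. The claims do not fix a correlation measure or the weights of the sum, so this example assumes Pearson correlation between equal-length representation vectors and caller-supplied weights; all names and values are hypothetical.

```python
import math


def pearson(a, b):
    """Degree of correlation between two equal-length representations."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return 0.0  # a constant vector has no defined correlation
    return cov / math.sqrt(var_a * var_b)


def correlation_matrix(reps):
    """Correlation of every representation with every other one."""
    names = list(reps)
    return {a: {b: pearson(reps[a], reps[b]) for b in names} for a in names}


def contributions(corr, pair_weights):
    """Per-modality contribution as a weighted sum of its correlations
    with the other modalities (the assumed reading of claim 6)."""
    return {
        a: sum(pair_weights[b] * corr[a][b] for b in corr[a] if b != a)
        for a in corr
    }


# Toy feature representations for the three claimed modalities.
reps = {
    "text": [1.0, 2.0, 3.0],
    "audio": [2.0, 4.0, 6.0],
    "video": [3.0, 2.0, 1.0],
}
corr = correlation_matrix(reps)
contrib = contributions(corr, {"text": 1.0, "audio": 1.0, "video": 1.0})
```

Here the text and audio representations move together while video moves against both, so video ends up with the lowest contribution under equal weights. A learned system would presumably fit the weights rather than fix them by hand.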

Assignees

IBM

Inventors

Classifications

  • Supervised learning · CPC title

  • Convolutional networks [CNN, ConvNet] · CPC title

  • characterised by memory or gating, e.g. long short-term memory [LSTM] or gated recurrent units [GRU] · CPC title

  • G06N3/08 · Primary

    Learning methods · CPC title

  • Interfaces, programming languages or software development kits, e.g. for simulating neural networks · CPC title

Patent family

Related publications grouped by family.

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US12293273B2 cover?
A method, a computer program product, and a computer system fuse features for multi-modal classifications for a plurality of modality inputs. The method includes receiving a request indicative of the modality inputs to be selected. The method includes performing an embeddings level fusion operation to concatenate features from the modality inputs. The method includes performing a multi-modal di…
Who is the assignee on this patent?
IBM
What technology area does this patent fall under?
Primary CPC classification G06N3/08. Mapped technology areas include Physics.
When was this patent published?
Publication date May 6, 2025 (B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 2 related publications on this page (citations in our corpus or others sharing the same primary CPC).