Multimodal fine-grained mixing method and system, device, and storage medium

US11436451B2 · US · B2

Patent metadata
Field: Value
Publication number: US-11436451-B2
Application number: US-202217577099-A
Country: US
Kind code: B2
Filing date: Jan 17, 2022
Priority date: Jan 25, 2021
Publication date: Sep 6, 2022
Grant date: Sep 6, 2022

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

The present disclosure provides a multimodal fine-grained mixing method and system, a device, and a storage medium. The method includes: extracting data features from multimodal graphic and textual data, and obtaining each composition of the data features, the data features including a visual regional feature and a text word feature; performing fine-grained classification on modal information of each composition of the data features, to obtain classification results; and performing inter-modal and intra-modal information fusion on each composition according to the classification results, to obtain a fusion feature. The method enables a multimodal model to utilize a complementary characteristic of the multimodal data, with no influence by irrelevant information.

First claim

Opening claim text (preview).

What is claimed is:

1. A multimodal fine-grained mixing method, comprising:

extracting data features from multimodal graphic and textual data, and obtaining each composition of the data features, the data features comprising a visual regional feature and a text word feature;

performing fine-grained classification on modal information of each composition of the data features, to obtain classification results of the data features; and

performing intra-modal and inter-modal information fusion on each composition according to the classification results of the data features, to obtain a fusion feature;

wherein the step of performing fine-grained classification on modal information of each composition of the data features, to obtain classification results of the data features comprises: calculating an intra-modal correlation and an inter-modal correlation of each visual feature composition $V_i$, to obtain characteristics of each visual feature composition $V_i$, so as to obtain a classification result of the visual regional feature; and calculating an intra-modal correlation and an inter-modal correlation of each text feature composition $E_i$, to obtain characteristics of each text feature composition $E_i$, so as to obtain a classification result of the text word feature;

wherein the step of calculating an intra-modal correlation and an inter-modal correlation of each visual feature composition $V_i$, to obtain characteristics of each visual feature composition $V_i$, so as to obtain a classification result of the visual regional feature comprises: performing normalization on the intra-modal correlation $R_i^{VB}$ and the inter-modal correlation $R_i^{VA}$ of each visual feature composition $V_i$, to obtain the characteristics of each visual feature composition $V_i$:

$R_i^{VA} = \mathrm{softmax}(R_i^{VA}, R_i^{VB})$; $R_i^{VB} = \mathrm{softmax}(R_i^{VB}, R_i^{VA})$;

wherein the step of calculating an intra-modal correlation and an inter-modal correlation of each text feature composition $E_i$, to obtain characteristics of each text feature composition $E_i$, so as to obtain a classification result of the text word feature comprises: performing normalization on the intra-modal correlation $R_i^{EB}$ and the inter-modal correlation $R_i^{EA}$ of each text feature composition $E_i$, to obtain the characteristics of each text feature composition $E_i$:

$R_i^{EA} = \mathrm{softmax}(R_i^{EA}, R_i^{EB})$; $R_i^{EB} = \mathrm{softmax}(R_i^{EB}, R_i^{EA})$;

and wherein the step of performing intra-modal and inter-modal information fusion on each composition according to the classification results of the data features, to obtain a fusion feature comprises:

converting each visual feature composition and each text feature composition into corresponding query features and key-value pair features;

calculating a dot product of a visual regional query feature and a visual key feature corresponding to each visual feature composition, to obtain a self-attention weight of each visual feature composition, and performing normalization on the self-attention weight of each visual feature composition, to obtain self-modal information;

calculating a dot product of the visual regional query feature corresponding to each visual feature composition and a word key feature, to obtain a cross-modal attention weight of each visual feature composition, and performing normalization on the cross-modal attention weight of each visual feature composition, to obtain cross-modal information of each visual feature composition; and

obtaining, according to products obtained by respectively multiplying the characteristics of each visual regional composition with the self-modal information and the cross-modal information of each visual regional composition, a fusion visual feature composition by using a residual structure, and constructing a fusion visual feature with each fusion visual feature composition.

2. The multimodal fine-grained mixing method according to claim 1, wherein the step of calculating an intra-modal correlation and an inter-modal correlation of each visual feature composition $V_i$, to obtain characteristics of each visual feature composition $V_i$, so as to obtain a classification result of the visual regional feature comprises: calculating the intra-modal correlation $R_i^{VB}$ of each visual feature composition $V_i$:

$M_{ij}^{V} = \dfrac{V_i^{T} V_j}{\lVert V_i \rVert \, \lVert V_j \rVert}$, $i \in [1, L_V]$, $j \in [1, L_V]$; $B_i^{V} = \sum_{j=1}^{L_V} \beta_{ij}^{V} V$ …
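The visual branch of claim 1 can be illustrated with a small sketch. This is only an illustration under stated assumptions, not the patented implementation: the raw correlation scores `r_inter` and `r_intra` are taken as given inputs (the preview is cut off before it shows how they are derived from the cosine similarity), the query, key, and value projections are assumed to be the identity, no attention scaling is applied, and it is assumed that the intra-modal characteristic scales the self-modal information while the inter-modal characteristic scales the cross-modal information. All function names are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def cosine(a, b):
    """Cosine similarity a^T b / (||a|| ||b||), as in the start of claim 2."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


def gate(r_inter, r_intra):
    """Two-way softmax of claim 1: normalized (inter, intra) characteristics."""
    g_inter, g_intra = softmax([r_inter, r_intra])
    return g_inter, g_intra


def attend(query, keys, values):
    """Dot-product attention: softmax-normalized weights applied to values."""
    weights = softmax([dot(query, k) for k in keys])
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]


def fuse_visual(visual, words, r_inter, r_intra):
    """Fuse each region with self-modal and cross-modal information.

    Assumptions (not from the source): identity projections, so each
    feature vector serves as its own query, key, and value; the intra
    gate multiplies self-modal info, the inter gate cross-modal info.
    """
    fused = []
    for i, region in enumerate(visual):
        g_inter, g_intra = gate(r_inter[i], r_intra[i])
        self_info = attend(region, visual, visual)   # region vs. regions
        cross_info = attend(region, words, words)    # region vs. words
        # Residual structure: original feature plus the gated information.
        fused.append([
            x + g_intra * s + g_inter * c
            for x, s, c in zip(region, self_info, cross_info)
        ])
    return fused


# Toy data: two image regions and three words in a 2-dimensional space.
visual = [[1.0, 0.0], [0.0, 1.0]]
words = [[1.0, 1.0], [0.5, -0.5], [0.0, 2.0]]
fused = fuse_visual(visual, words, r_inter=[0.2, 1.5], r_intra=[0.9, 0.1])
print(len(fused), len(fused[0]))
```

Because the two characteristics of a composition come from one softmax, they sum to 1, so each region draws a fixed total amount of context that is split between its own modality and the other one. A practical implementation would use learned projections and an array library instead of Python lists.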

Classifications

  • Probabilistic or stochastic networks

  • G06F18/256 (Primary): of results relating to different input data, e.g. multimodal recognition

  • G06F18/241 (Primary): relating to the classification model, e.g. parametric or non-parametric approaches

  • of extracted features

  • Combinations of networks

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US11436451B2 cover?
The present disclosure provides a multimodal fine-grained mixing method and system, a device, and a storage medium. The method includes: extracting data features from multimodal graphic and textual data, and obtaining each composition of the data features, the data features including a visual regional feature and a text word feature; performing fine-grained classification on modal information o…
Who is the assignee on this patent?
Harbin Institute Of Tech Shenzhen Institute Of Science And Tech Innovation, Univ Dongguan Technology
What technology area does this patent fall under?
Primary CPC classification G06F18/256. Mapped technology areas include Physics.
When was this patent published?
Publication date: Sep 6, 2022 (kind code B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 8 related publications on this page (citations in our corpus or others sharing the same primary CPC).