US11436451B2 · US · B2
| Field | Value |
|---|---|
| Publication number | US-11436451-B2 |
| Application number | US-202217577099-A |
| Country | US |
| Kind code | B2 |
| Filing date | Jan 17, 2022 |
| Priority date | Jan 25, 2021 |
| Publication date | Sep 6, 2022 |
| Grant date | Sep 6, 2022 |
Official abstract text for this publication.
The present disclosure provides a multimodal fine-grained mixing method and system, a device, and a storage medium. The method includes: extracting data features from multimodal graphic and textual data, and obtaining each composition of the data features, the data features including a visual regional feature and a text word feature; performing fine-grained classification on modal information of each composition of the data features, to obtain classification results; and performing inter-modal and intra-modal information fusion on each composition according to the classification results, to obtain a fusion feature. The method enables a multimodal model to exploit the complementary characteristics of the multimodal data without being influenced by irrelevant information.
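The fine-grained classification step can be illustrated with a small sketch. It assumes, in line with the softmax normalization recited in claim 1, that each composition carries one inter-modal and one intra-modal correlation score and that the pair is normalized by a two-element softmax. The function name `two_way_softmax` and the sample scores are hypothetical and are not taken from the patent.

```python
import math


def two_way_softmax(r_inter: float, r_intra: float) -> tuple[float, float]:
    """Normalize one (inter-modal, intra-modal) score pair so it sums to 1.

    Assumption: softmax(R_A, R_B) in the claim is read as a softmax over
    the two scores belonging to a single feature composition.
    """
    m = max(r_inter, r_intra)  # subtract the max for numerical stability
    e_inter = math.exp(r_inter - m)
    e_intra = math.exp(r_intra - m)
    total = e_inter + e_intra
    return e_inter / total, e_intra / total


# Hypothetical raw scores for three visual regions: (inter-modal, intra-modal).
raw_scores = [(2.0, 0.5), (0.1, 0.1), (-1.0, 1.5)]
gates = [two_way_softmax(a, b) for a, b in raw_scores]

for (a, b), (g_inter, g_intra) in zip(raw_scores, gates):
    print(f"raw=({a}, {b}) -> inter={g_inter:.3f}, intra={g_intra:.3f}")
```

Under this reading, a composition with a high inter-modal weight would draw more on the other modality during fusion, while one with a high intra-modal weight would rely mostly on its own modality.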
Opening claim text (preview).
What is claimed is:

1. A multimodal fine-grained mixing method, comprising:

extracting data features from multimodal graphic and textual data, and obtaining each composition of the data features, the data features comprising a visual regional feature and a text word feature;

performing fine-grained classification on modal information of each composition of the data features, to obtain classification results of the data features; and

performing intra-modal and inter-modal information fusion on each composition according to the classification results of the data features, to obtain a fusion feature;

wherein the step of performing fine-grained classification on modal information of each composition of the data features, to obtain classification results of the data features comprises:

calculating an intra-modal correlation and an inter-modal correlation of each visual feature composition $V_i$, to obtain characteristics of each visual feature composition $V_i$, so as to obtain a classification result of the visual regional feature; and

calculating an intra-modal correlation and an inter-modal correlation of each text feature composition $E_i$, to obtain characteristics of each text feature composition $E_i$, so as to obtain a classification result of the text word feature;

wherein the step of calculating an intra-modal correlation and an inter-modal correlation of each visual feature composition $V_i$, to obtain characteristics of each visual feature composition $V_i$, so as to obtain a classification result of the visual regional feature comprises:

performing normalization on the intra-modal correlation $R_i^{VB}$ and the inter-modal correlation $R_i^{VA}$ of each visual feature composition $V_i$, to obtain the characteristics of each visual feature composition $V_i$:

$$R_i^{VA} = \mathrm{softmax}(R_i^{VA}, R_i^{VB}); \qquad R_i^{VB} = \mathrm{softmax}(R_i^{VB}, R_i^{VA});$$

wherein the step of calculating an intra-modal correlation and an inter-modal correlation of each text feature composition $E_i$, to obtain characteristics of each text feature composition $E_i$, so as to obtain a classification result of the text word feature comprises:

performing normalization on the intra-modal correlation $R_i^{EB}$ and the inter-modal correlation $R_i^{EA}$ of each text feature composition $E_i$, to obtain the characteristics of each text feature composition $E_i$:

$$R_i^{EA} = \mathrm{softmax}(R_i^{EA}, R_i^{EB}); \qquad R_i^{EB} = \mathrm{softmax}(R_i^{EB}, R_i^{EA});$$

and wherein the step of performing intra-modal and inter-modal information fusion on each composition according to the classification results of the data features, to obtain a fusion feature comprises:

converting each visual feature composition and each text feature composition into corresponding query features and key-value pair features;

calculating a dot product of a visual regional query feature and a visual key feature corresponding to each visual feature composition, to obtain a self-attention weight of each visual feature composition, and performing normalization on the self-attention weight of each visual feature composition, to obtain self-modal information;

calculating a dot product of the visual regional query feature corresponding to each visual feature composition and a word key feature, to obtain a cross-modal attention weight of each visual feature composition, and performing normalization on the cross-modal attention weight of each visual feature composition, to obtain cross-modal information of each visual feature composition; and

obtaining, according to products obtained by respectively multiplying the characteristics of each visual regional composition with the self-modal information and the cross-modal information of each visual regional composition, a fusion visual feature composition by using a residual structure, and constructing a fusion visual feature with each fusion visual feature composition.

2. The multimodal fine-grained mixing method according to claim 1, wherein the step of calculating an intra-modal correlation and an inter-modal correlation of each visual feature composition $V_i$, to obtain characteristics of each visual feature composition $V_i$, so as to obtain a classification result of the visual regional feature comprises:

calculating the intra-modal correlation $R_i^{VB}$ of each visual feature composition $V_i$:

$$M_{ij}^{V} = \frac{V_i^{T} V_j}{\lVert V_i \rVert \, \lVert V_j \rVert}, \quad i \in [1, L_V],\ j \in [1, L_V];$$

$B_i^{V} = \sum_{j=1}^{L_V} \beta_{ij}^{V} V$ …
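The cosine-similarity matrix $M_{ij}^{V}$ of claim 2 and the gated attention fusion at the end of claim 1 can be sketched roughly as follows. This is an illustration under stated assumptions, not the patented implementation: the query, key, and value projections are taken to be the identity, the intra-modal weight is assumed to scale the self-modal term and the inter-modal weight the cross-modal term, and all function names and sample vectors are hypothetical.

```python
import math

Vector = list[float]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def cosine_matrix(feats: list[Vector]) -> list[list[float]]:
    """M[i][j] = (V_i . V_j) / (||V_i|| * ||V_j||), as in claim 2."""
    norms = [math.sqrt(dot(v, v)) for v in feats]
    return [
        [dot(vi, vj) / (ni * nj) for vj, nj in zip(feats, norms)]
        for vi, ni in zip(feats, norms)
    ]


def softmax(scores: list[float]) -> list[float]:
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query: Vector, keys: list[Vector], values: list[Vector]) -> Vector:
    """Dot-product attention: softmax over query-key scores, then a weighted sum of values."""
    weights = softmax([dot(query, k) for k in keys])
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]


def fuse_region(v: Vector, gate_inter: float, gate_intra: float,
                regions: list[Vector], words: list[Vector]) -> Vector:
    """Residual fusion of one visual composition.

    Assumptions: identity projections (query = key = value = raw feature),
    and gate_intra pairs with self-modal information while gate_inter
    pairs with cross-modal information.
    """
    self_info = attend(v, regions, regions)  # attention within the visual modality
    cross_info = attend(v, words, words)     # attention from the region to the words
    return [x + gate_intra * s + gate_inter * c
            for x, s, c in zip(v, self_info, cross_info)]


# Hypothetical toy features: two visual regions and one word, all 2-dimensional.
regions = [[1.0, 0.0], [0.0, 1.0]]
words = [[1.0, 1.0]]

similarity = cosine_matrix(regions)
fused = fuse_region(regions[0], gate_inter=1.0, gate_intra=0.0,
                    regions=regions, words=words)
print(similarity)
print(fused)
```

With both weights set to zero the residual path returns the input composition unchanged, which is one way to read the abstract's statement that irrelevant information should not influence the result.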
Probabilistic or stochastic networks · CPC title
of results relating to different input data, e.g. multimodal recognition · CPC title
relating to the classification model, e.g. parametric or non-parametric approaches · CPC title
of extracted features · CPC title
Combinations of networks · CPC title