What technology area does this patent fall under?

Primary CPC classification: G06F40/284 (lexical analysis, e.g. tokenisation or collocates). Mapped technology areas include Physics.

When was this patent published?

Publication date: Mar 10, 2026 (kind code B2). Legal status and post-grant events are not shown on this page.

What related patents are in patentsdb?

We list 9 related publications on this page (citations in our corpus or others sharing the same primary CPC).

System and method for cross-modal interaction based on pre-trained model

US12572780B2 · US · B2

Patent metadata
Field	Value
Publication number	US-12572780-B2
Application number	US-202217900592-A
Country	US
Kind code	B2
Filing date	Aug 31, 2022
Priority date	Aug 31, 2022
Publication date	Mar 10, 2026
Grant date	Mar 10, 2026

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

Title
What the patent document calls the invention.
Abstract
A short plain-language summary of the technical disclosure.
Assignees and inventors
Who owns or filed the patent and who is credited as inventor.
Key dates
Filing, priority, publication, and grant dates set the timeline.
First independent claim
The legal scope of protection — read this for what is actually claimed.
CPC / IPC classifications
Technology tags used to group this patent with similar filings.
Citations and related patents
Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

A method is provided for data processing performed by a processing system. The method comprises determining a set of first tokens for first data and a set of second tokens for second data, each token comprising information associated with a segment of the respective data, determining pair-wise similarities between the set of first tokens and the set of second tokens, each pair comprising a first token in the set of first tokens and a second token in the set of second tokens, determining, for each first token in the set of first tokens, a maximum similarity based on the determined pair-wise similarities between the respective first token and the second tokens in the set of second tokens, and determining a first similarity between the first data and the second data by aggregating the maximum similarities corresponding to the first tokens in the set of first tokens.
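The steps in the abstract (pair-wise similarities, a per-token maximum, then aggregation) can be sketched in a few lines. This is only an illustration under stated assumptions: the abstract does not name the similarity measure or the aggregation, so cosine similarity and a plain mean are assumed here, and the function names `cosine` and `max_similarity_score` are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity of two non-zero vectors (assumed similarity measure)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def max_similarity_score(first_tokens, second_tokens):
    """For each first token, keep its best match among the second tokens,
    then aggregate the maxima. The mean is an assumed aggregation."""
    maxima = []
    for t1 in first_tokens:
        pairwise = [cosine(t1, t2) for t2 in second_tokens]
        maxima.append(max(pairwise))
    return sum(maxima) / len(maxima)


# Toy token embeddings, invented for illustration only.
first = [[1.0, 0.0], [0.0, 1.0]]
second = [[1.0, 0.0], [1.0, 1.0]]
score = max_similarity_score(first, second)
print(round(score, 4))
```

Note that the score is directional: swapping the two arguments takes the maximum per second token instead, which in general gives a different value.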

First claim

Opening claim text (preview).

What is claimed is:
1. A method for processing multi-modal data, comprising: obtaining, by a processing system, an image comprising the multi-modal data, wherein the multi-modal data comprises image data and text data, wherein the processing system implements a neural network model comprising a first neural network and a second neural network; dividing, by the processing system, the image into a plurality of image blocks, wherein each image block has a predefined width and height; dividing, by the processing system, a first respective image block of the plurality of image blocks into a plurality of image patches; dividing, by the processing system, text from a second respective image block of the image blocks into a plurality of semantic units; generating, by the processing system, using the first neural network, a set of visual tokens for image data of the first respective image block based on the plurality of image patches; generating, by the processing system, using the second neural network, a set of textual tokens for text data of the second respective image block based on the plurality of semantic units; determining, by the processing system, pair-wise similarities between the set of visual tokens and the set of textual tokens, each pair comprising a visual token from the set of visual tokens and a textual token from the set of second textual tokens; determining, by the processing system, for each visual token in the set of visual tokens, a maximum similarity based on the determined pair-wise similarities between the respective visual token and the textual tokens in the set of textual tokens; and determining, by the processing system, a first similarity between the image data and the text data by aggregating the maximum similarities corresponding to the visual tokens in the set of visual tokens.
2. The method according to claim 1, wherein the neural network model comprises an adjustable embedding size, and wherein the method further comprises: reducing the adjustable embedding size to 256; wherein the adjustable embedding size defines dimensions of vectors obtained by the first and second neural networks in the neural network model.
3. The method according to claim 1, wherein the set of visual tokens or the set of textual tokens is obtained using a half-precision floating point format.
4. The method according to claim 1, further comprising: training the neural network model on a set of training data, the training data comprising training image data and training text data; determining, by the processing system, first similarities between image data in the training image data and text data in the training text data; and determining, by the processing system based on the first similarities, a first contractive loss for the set of training data.
5. The method according to claim 4, further comprising: determining, by the processing system, for each textual token in a respective set of textual tokens, a maximum similarity based on pair-wise similarities between the respective textual token and respective visual tokens in a respective set of visual tokens; and determining, by the processing system, a second similarity between respective text data corresponding to the respective set of textual tokens and respective image data corresponding to the respective set of visual tokens by aggregating the maximum similarities corresponding to the respective textual tokens in the respective set of textual tokens.
6. The method according to claim 5, further comprising: determining, by the processing system, second similarities between respective image data in the training image data and respective text data in the training text data; determining, by the processing system, based on the second similarities, a second contractive loss for the set of training data; determining, by the processing system, an aggregated contractive loss by combining the first contractive loss and the second contractive loss by weights; and updating the neural network model based on the aggregated contractive loss.
7. The method according to claim 6, further comprising: generating, by the processing system, a plurality of derived texts by applying templates to respective text of the training text data, wherein the plurality of derived texts are added to the set of training data as additional training text data to obtain an updated set of training data, and the derived texts are associated with the respective text; and determining, by the processing system, for image data in the updated set of training data, a mean similarity associated with respective text.
8. The method according to claim 7, wherein determining the mean similarity further comprises: determining, by the processing system, token-wise similarities between respective image data and the plurality of derived texts; determining, by the processing system, first similarities between the respective image data and the plurality of derived texts; and aggregating, by the processing system, the first similarities between the respective image data and the plurality of derived texts.
9. The method according to claim 1, wherein generating the set of visual tokens for the image data comprises: passing image patches of the image data through a linear projection layer of the first neural network to obtain a first number of vectors based on the image patches and associate the first number of vectors with positional embeddings; and encoding, by an image encoder layer of the first neural network, the first number of vectors with the positional embeddings to obtain the set of visual tokens.
10. The method according to claim 9, wherein the first number of vectors includes a first vector defined as a class embedding that is associated with a respective input image.
11. The method according to claim 1, wherein generating the set of textual tokens for the text data comprises: converting, by a token embedding layer of the second neural network, semantic units of the text data into a vector representation with a predefined dimension to obtain a second number of vectors based on the semantic units and associate the second number of vectors with positional embeddings; and encoding, by a text encoder layer of the second neural network, the second number of vectors with the positional embeddings to obtain the set of textual tokens.
12. A system for processing multi-modal data, comprising: one or more processors; and one or more memories having processor-executable instructions stored thereon; wherein the one or more processors are configured to execute the processor-executable instructions to cause the system to perform the following: obtaining an image comprising the multi-modal data, wherein the multi-modal data comprises image data and text data, wherein the system implements a neural network model comprising a first neural network and a second neural network; dividing the image into a plurality of image blocks, wherein each image block has a predefined width and height; dividing a first respective image block of the plurality of image blocks into a plurality of image patches; dividing text from a second respective image block of the image blocks into a plurality of semantic units; generating, using the first neural network, a set of visual tokens for image data of the first respective image block based on the plurality of image patches; generating, using the second neural network, a set of textual tokens for text data of the second respective image block based on the plurality of semantic units; determining pair-wise similarities between the set of visual tokens and the set of te
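Claims 4 to 6 describe a first loss computed from image-to-text similarities, a second loss from text-to-image similarities, and an aggregated loss that combines the two by weights. The claims (which use the word "contractive") do not give the loss formula, so the following is a rough sketch under assumptions: a softmax cross-entropy (InfoNCE-style) loss is assumed for each direction, and the `temperature` value, the equal default weights, and the function names are all hypothetical.

```python
import math


def directional_loss(sim, temperature=0.07):
    """Mean softmax cross-entropy over rows of a square similarity matrix.
    Row i's matched partner is assumed to sit in column i.
    The temperature value is an illustrative assumption."""
    total = 0.0
    for i, row in enumerate(sim):
        logits = [s / temperature for s in row]
        m = max(logits)  # subtract the max for numerical stability
        log_denominator = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_denominator - logits[i]
    return total / len(sim)


def aggregated_loss(sim_i2t, sim_t2i, w_first=0.5, w_second=0.5):
    """Weighted combination of the two directional losses.
    sim_i2t[i][j]: first similarity of image i to text j.
    sim_t2i[j][i]: second similarity of text j to image i.
    The default weights are assumed, not taken from the patent."""
    first_loss = directional_loss(sim_i2t)
    second_loss = directional_loss(sim_t2i)
    return w_first * first_loss + w_second * second_loss


# Toy similarity matrices for a batch of two image-text pairs (invented values).
uninformative = [[0.0, 0.0], [0.0, 0.0]]
well_matched = [[0.9, 0.1], [0.2, 0.8]]
loss_flat = aggregated_loss(uninformative, uninformative)
loss_good = aggregated_loss(well_matched, well_matched)
```

With all similarities equal, each direction reduces to the log of the batch size; matrices whose diagonal dominates give a lower loss, which is the behaviour a training update would push toward.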

Assignees

Huawei Tech Co Ltd

Inventors

Classifications

G06F40/284 · Primary
Lexical analysis, e.g. tokenisation or collocates · CPC title
G06N3/045 · Primary
Combinations of networks · CPC title
G06F40/56
Natural language generation · CPC title
G06N3/08
Learning methods · CPC title

Patent family

Related publications grouped by family.

Patent family ID: 89997086

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US12572780B2 cover?: A method is provided for data processing performed by a processing system. The method comprises determining a set of first tokens for first data and a set of second tokens for second data, each token comprising information associated with a segment of the respective data, determining pair-wise similarities between the set of first tokens and the set of second tokens, each pair comprising a first…
Who is the assignee on this patent?: Huawei Tech Co Ltd