Who is the assignee on this patent?

Beijing Baidu Netcom Sci & Tech Co Ltd

What technology area does this patent fall under?

Primary CPC classification G06V10/82. Mapped technology areas include Physics.

When was this patent published?

Publication date: Jun 17, 2025 (kind code B2). Legal status and post-grant events are not shown on this page.

What related patents are in patentsdb?

We list 4 related publications on this page (citations in our corpus or others sharing the same primary CPC).

Multimodal data processing

US12333795B2 · US · B2

Patent metadata
Field	Value
Publication number	US-12333795-B2
Application number	US-202217945415-A
Country	US
Kind code	B2
Filing date	Sep 15, 2022
Priority date	Sep 17, 2021
Publication date	Jun 17, 2025
Grant date	Jun 17, 2025

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

Title
What the patent document calls the invention.
Abstract
A short plain-language summary of the technical disclosure.
Assignees and inventors
Who owns or filed the patent and who is credited as inventor.
Key dates
Filing, priority, publication, and grant dates set the timeline.
First independent claim
The legal scope of protection — read this for what is actually claimed.
CPC / IPC classifications
Technology tags used to group this patent with similar filings.
Citations and related patents
Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

Disclosed are a method for processing multimodal data using a neural network, a device, and a medium, and relates to the field of artificial intelligence and, in particular to multimodal data processing, video classification, and deep learning. The neural network includes: an input subnetwork configured to receive the multimodal data to output respective first features of a plurality of modalities; a plurality of cross-modal feature subnetworks, each of which is configured to receive respective first features of two corresponding modalities to output a cross-modal feature corresponding to the two modalities; a plurality of cross-modal fusion subnetworks, each of which is configured to receive at least one cross-modal feature corresponding to a corresponding target modality and other modalities to output a second feature of the target modality; and an output subnetwork configured to receive respective second features of the plurality of modalities to output a processing result of the multimodal data.
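The data flow described in the abstract can be illustrated with a toy sketch. The Python below is not the patented implementation: every function body (mean pooling, element-wise product, averaging, concatenation) is a placeholder assumption, and all names and sample values are hypothetical. It only shows how the four stages connect: one cross-modal feature per pair of modalities, one fusion step per modality, then a single output step.

```python
from itertools import combinations

def input_subnetwork(data):
    """Map each modality's sequence to a 'first feature' (mean pooling, an assumption)."""
    return {m: [sum(col) / len(col) for col in zip(*seq)] for m, seq in data.items()}

def cross_modal_feature(feat_a, feat_b):
    """Toy cross-modal feature for two modalities: element-wise product (assumption)."""
    return [a * b for a, b in zip(feat_a, feat_b)]

def cross_modal_fusion(features):
    """Fuse the cross-modal features involving one target modality by averaging (assumption)."""
    return [sum(col) / len(col) for col in zip(*features)]

def output_subnetwork(second_features):
    """Concatenate per-modality second features in sorted modality order (assumption)."""
    return [x for m in sorted(second_features) for x in second_features[m]]

def process(data):
    first = input_subnetwork(data)
    # One cross-modal feature subnetwork per unordered pair of modalities.
    cross = {pair: cross_modal_feature(first[pair[0]], first[pair[1]])
             for pair in combinations(sorted(first), 2)}
    # One fusion subnetwork per modality, fed by every pair that contains it.
    second = {m: cross_modal_fusion([f for pair, f in cross.items() if m in pair])
              for m in first}
    return output_subnetwork(second)

# Hypothetical input: each modality is a short sequence of 2-dimensional vectors.
example = {
    "image": [[1.0, 2.0], [3.0, 4.0]],
    "text": [[0.5, 0.5]],
    "audio": [[2.0, 0.0], [0.0, 2.0]],
}
result = process(example)
```

With three modalities there are three pairs, so each fusion step receives two cross-modal features, and the result holds one 2-dimensional second feature per modality.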

First claim

Opening claim text (preview).

What is claimed is:
1. A neural network for multimodal data, comprising: an input subnetwork configured to receive multimodal data to output respective first features of a plurality of modalities comprised in the multimodal data; a plurality of cross-modal feature subnetworks, each cross-modal feature subnetwork of the plurality of cross-modal feature subnetworks corresponds to two modalities of the plurality of modalities and is configured to receive the respective first features of the two modalities to output a cross-modal feature corresponding to the two modalities; a plurality of cross-modal fusion subnetworks in a one-to-one correspondence with the plurality of modalities, wherein each cross-modal fusion subnetwork of the plurality of cross-modal fusion subnetworks is configured to: for a modality corresponding to the cross-modal fusion subnetwork, receive at least one cross-modal feature corresponding to the modality to output a second feature of the modality; an output subnetwork configured to receive the respective second features of the plurality of modalities to output a processing result of the multimodal data; and a first correlation calculation subnetwork configured to calculate a correlation coefficient between every two modalities of the plurality of modalities; wherein each of the cross-modal fusion subnetworks is further configured to fuse the at least one cross-modal feature based on a correlation coefficient between respective two modalities corresponding to the at least one cross-modal feature, to output the second feature of a target modality.
2. The network according to claim 1, wherein for each cross-modal feature subnetwork of the cross-modal feature subnetworks, the cross-modal feature subnetwork is configured to: for a first modality and a second modality corresponding to the cross-modal feature subnetwork, output a first cross-modal feature of the first modality with respect to the second modality and a second cross-modal feature of the second modality with respect to the first modality; and wherein for each cross-modal fusion subnetwork of the cross-modal fusion subnetworks, the cross-modal fusion subnetwork is configured to receive at least one cross-modal feature of a target modality with respect to at least one of the other modalities, to output the second feature of the target modality.
3. The network according to claim 2, wherein the input subnetwork is further configured to map each of the respective first features of the plurality of modalities to a query feature, a key feature, and a value feature for outputting; and wherein each cross-modal feature subnetwork of the cross-modal feature subnetworks is further configured to: receive a query feature, a key feature, and a value feature of the corresponding first modality and a query feature, a key feature, and a value feature of the corresponding second modality; determine the first cross-modal feature based on the query feature of the corresponding first modality, the key feature of the corresponding second modality, and the value feature of the corresponding second modality; and determine the second cross-modal feature based on the query feature of the corresponding second modality, the key feature of the corresponding first modality, and the value feature of the corresponding first modality.
4. The network according to claim 2, further comprising: a second correlation calculation subnetwork configured to determine a correlation coefficient of each modality of the plurality of modalities with respect to each modality of modalities other than the modality, wherein the correlation coefficient is determined at least based on respective first features of the two corresponding modalities, and wherein each cross-modal fusion subnetwork of the cross-modal fusion subnetworks is further configured to fuse the at least one cross-modal feature of the target modality with respect to at least one of the other modalities based on at least one correlation coefficient of the target modality with respect to at least one of the other modalities, to output the second feature of the target modality.
5. The network according to claim 4, wherein the second correlation calculation subnetwork is further configured to: normalize, for each modality of the plurality of modalities, the correlation coefficient of the modality with respect to each modality of the modalities other than the modality.
6. The network according to claim 1, wherein the input subnetwork comprises: a plurality of feature extraction subnetworks in a one-to-one correspondence with the plurality of modalities, wherein each feature extraction subnetwork of the plurality of feature extraction subnetworks is configured to: determine an initial feature sequence of a modality in the multimodal data corresponding to the feature extraction subnetwork based on data of the modality, wherein each item in the initial feature sequence corresponds to one part of the data of the modality; and determine the first feature of the modality at least based on the initial feature sequence.
7. The network according to claim 6, wherein the determining the first feature of the modality at least based on the initial feature sequence comprises: determining a first feature component based on the initial feature sequence; determining a second feature component, wherein the second feature component indicating a type of the modality; and determining the first feature of the modality based on the first feature component and the second feature component.
8. The network according to claim 7, wherein the first feature component is determined by performing max-pooling on the initial feature sequence.
9. The network according to claim 1, wherein the multimodal data is video data.
10. The network according to claim 9, wherein the plurality of modalities comprises an image modality, a text modality, and an audio modality.
11. A method for processing multimodal data using a neural network, wherein the neural network comprises an input subnetwork, a plurality of parallel cross-modal feature subnetworks, a plurality of parallel cross-modal fusion subnetworks, a first correlation calculation subnetwork, and an output subnetwork, wherein the plurality of parallel cross-modal feature subnetworks, the plurality of parallel cross-modal fusion subnetworks, and the output subnetwork are sequentially connected, wherein each cross-modal feature subnetwork of the plurality of parallel cross-modal feature subnetworks corresponds to two modalities in a plurality of modalities comprised in the multimodal data, and the plurality of parallel cross-modal fusion subnetworks are in a one-to-one correspondence with the plurality of modalities, wherein the first correlation calculation subnetwork is located between the input subnetwork and the plurality of parallel cross-modal fusion subnetworks, and wherein the method comprises: inputting the multimodal data to the input subnetwork to obtain respective first features of the plurality of modalities that are output by the input subnetwork; inputting the respective first features of every two modalities of the plurality of modalities to a corresponding cross-modal feature subnetwork of the plurality of parallel cross-modal feature subnetworks, to obtain a cross-modal feature to the corresponding two modalities that is output by each cross-modal feature subnetwork of the plurality of parallel cross-modal feature subnetworks; for each modality of the plurality of modalities, inputting at least one cross-modal feature corresponding to the modality to a cross-modal fusion subnetwork of the plurality of parallel cross-modal fusion subnetworks corresponding to
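
Claims 3 to 5 describe cross-modal features computed from query, key, and value features, and a fusion step weighted by normalized correlation coefficients. The following is a minimal sketch of one way this could look, assuming scaled dot-product attention and softmax normalization; the claim text specifies neither, and all function names and sample values are hypothetical.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def cross_attention(queries, keys, values):
    """Queries come from one modality, keys and values from the other.
    Scaled dot-product attention is an assumption, not taken from the claims."""
    scale = math.sqrt(len(keys[0]))
    out = []
    for q in queries:
        weights = softmax([dot(q, k) / scale for k in keys])
        out.append([sum(w * v[i] for w, v in zip(weights, values))
                    for i in range(len(values[0]))])
    return out

def fuse(cross_features, coefficients):
    """Weighted sum of a target modality's cross-modal features.
    Coefficients are softmax-normalized first (one possible normalization)."""
    weights = softmax(coefficients)
    return [sum(w * f[i] for w, f in zip(weights, cross_features))
            for i in range(len(cross_features[0]))]

# Hypothetical query of the image modality, keys and values of the text modality.
image_q = [[1.0, 0.0]]
text_k = [[1.0, 0.0], [0.0, 1.0]]
text_v = [[10.0, 0.0], [0.0, 10.0]]
image_wrt_text = cross_attention(image_q, text_k, text_v)[0]

# Hypothetical cross-modal feature of image with respect to audio.
image_wrt_audio = [4.0, 6.0]
# Equal correlation coefficients reduce the fusion to a plain average.
second_feature = fuse([image_wrt_text, image_wrt_audio], [0.0, 0.0])
```

Swapping the roles of the two modalities (text query, image keys and values) would give the second cross-modal feature of the pair, mirroring the symmetry in claim 3.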

Assignees

Beijing Baidu Netcom Sci & Tech Co Ltd

Inventors

Classifications

G06N3/0464
Convolutional networks [CNN, ConvNet] · CPC title
G06N3/09
Supervised learning · CPC title
G06N3/08
Learning methods · CPC title
G06V10/80
Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level (multimodal speaker identification or verification G10L17/10) · CPC title
G06V10/82 · Primary
using neural networks · CPC title

Patent family

Related publications grouped by family.

View patent family 78895862

External sources

Related patents

Dual-modality relation networks for audio-visual event localization

Methods and systems for multimodal content analytics

Neural network training method and apparatus, computer device, and storage medium

Method and apparatus for classifying video
