Multimodal speech recognition method and system, and computer-readable storage medium

US12112744B2 · US · B2

Patent metadata
Publication number: US-12112744-B2
Application number: US-202217684958-A
Country: US
Kind code: B2
Filing date: Mar 2, 2022
Priority date: Aug 10, 2021
Publication date: Oct 8, 2024
Grant date: Oct 8, 2024

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

The disclosure provides a multimodal speech recognition method and system, and a computer-readable storage medium. The method includes calculating a first logarithmic mel-frequency spectral coefficient and a second logarithmic mel-frequency spectral coefficient when a target millimeter-wave signal and a target audio signal both contain speech information corresponding to a target user; inputting the first and the second logarithmic mel-frequency spectral coefficient into a fusion network to determine a target fusion feature, where the fusion network includes at least a calibration module and a mapping module, the calibration module is configured to perform mutual feature calibration on the target audio/millimeter-wave signals, and the mapping module is configured to fuse a calibrated millimeter-wave feature and a calibrated audio feature; and inputting the target fusion feature into a semantic feature network to determine a speech recognition result corresponding to the target user. The disclosure can implement high-accuracy speech recognition.
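The abstract relies on logarithmic mel-frequency spectral coefficients computed for each modality. As a rough illustration only, the sketch below computes log-mel features from a one-dimensional signal with NumPy. The frame length, hop size, number of mel bands, Hann window, and the helper names (`hz_to_mel`, `mel_filterbank`, `log_mel`) are assumptions chosen for the example; the page does not state the parameters the patent uses, nor how the millimeter-wave signal is prepared before this step.

```python
import numpy as np


def hz_to_mel(f):
    """Convert frequency in Hz to the mel scale (common HTK-style formula)."""
    return 2595.0 * np.log10(1.0 + np.asarray(f, dtype=float) / 700.0)


def mel_to_hz(m):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (np.asarray(m, dtype=float) / 2595.0) - 1.0)


def mel_filterbank(n_mels, n_fft, sample_rate):
    """Triangular filters evenly spaced on the mel scale.

    Returns an array of shape (n_mels, n_fft // 2 + 1).
    """
    fft_freqs = np.linspace(0.0, sample_rate / 2.0, n_fft // 2 + 1)
    mel_points = np.linspace(hz_to_mel(0.0), hz_to_mel(sample_rate / 2.0), n_mels + 2)
    hz_points = mel_to_hz(mel_points)
    bank = np.zeros((n_mels, fft_freqs.size))
    for i in range(n_mels):
        left, center, right = hz_points[i:i + 3]
        rising = (fft_freqs - left) / (center - left)
        falling = (right - fft_freqs) / (right - center)
        bank[i] = np.maximum(0.0, np.minimum(rising, falling))
    return bank


def log_mel(signal, sample_rate, n_fft=512, hop=160, n_mels=40, eps=1e-10):
    """Log-mel features of shape (n_frames, n_mels).

    All default parameter values are illustrative assumptions.
    """
    signal = np.asarray(signal, dtype=float)
    if signal.size < n_fft:
        # Zero-pad short inputs so that at least one frame exists.
        signal = np.pad(signal, (0, n_fft - signal.size))
    n_frames = 1 + (signal.size - n_fft) // hop
    window = np.hanning(n_fft)
    # Short-time Fourier transform: windowed frames, then a real FFT per frame.
    frames = np.stack(
        [signal[i * hop:i * hop + n_fft] * window for i in range(n_frames)]
    )
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2
    mel_energy = power @ mel_filterbank(n_mels, n_fft, sample_rate).T
    return np.log(mel_energy + eps)
```

In a two-branch setup like the one described, the same function could be applied once to the audio signal and once to the speech-related millimeter-wave signal, giving one feature matrix per branch.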

First claim

Opening claim text (preview).

What is claimed is:

1. A multimodal speech recognition method, comprising: obtaining a target millimeter-wave signal and a target audio signal; calculating a first logarithmic mel-frequency spectral coefficient and a second logarithmic mel-frequency spectral coefficient when the target millimeter-wave signal and the target audio signal both contain speech information corresponding to a target user, wherein the first logarithmic mel-frequency spectral coefficient is determined based on the target millimeter-wave signal, and the second logarithmic mel-frequency spectral coefficient is determined based on the target audio signal; inputting the first logarithmic mel-frequency spectral coefficient and the second logarithmic mel-frequency spectral coefficient into a fusion network to determine a target fusion feature, wherein the fusion network comprises at least a calibration module and a mapping module; the calibration module is configured to perform feature calibration on the target millimeter-wave signal based on the target audio signal to obtain a calibrated millimeter-wave feature and perform feature calibration on the target audio signal based on the target millimeter-wave signal to obtain a calibrated audio feature; and the mapping module is configured to fuse the calibrated millimeter-wave feature and the calibrated audio feature to obtain the target fusion feature; and inputting the target fusion feature into a semantic feature network to determine a speech recognition result corresponding to the target user, wherein the fusion network further comprises two identical branch networks including a first branch network and a second branch network; and each branch network comprises a first residual block with efficient channel attention (ResECA), a second ResECA, a third ResECA, a fourth ResECA, and a fifth ResECA; wherein an input end of the calibration module is respectively connected to an output end of the third ResECA of the first branch network and an output end of the third ResECA of the second branch network; and an output end of the calibration module is respectively connected to an input end of the fourth ResECA of the first branch network and an input end of the fourth ResECA of the second branch network; an input end of the first ResECA of the first branch network is used to input the first logarithmic mel-frequency spectral coefficient; and an output end of the first ResECA of the first branch network is connected to an input end of the second ResECA of the first branch network, an output end of the second ResECA of the first branch network is connected to an input end of the third ResECA of the first branch network, and an output end of the fourth ResECA of the first branch network is connected to an input end of the fifth ResECA of the first branch network; an input end of the first ResECA of the second branch network is used to input the second logarithmic mel-frequency spectral coefficient; and an output end of the first ResECA of the second branch network is connected to an input end of the second ResECA of the second branch network, an output end of the second ResECA of the second branch network is connected to an input end of the third ResECA of the second branch network, and an output end of the fourth ResECA of the second branch network is connected to an input end of the fifth ResECA of the second branch network; and an input end of the mapping module is respectively connected to an output end of the fifth ResECA of the first branch network and an output end of the fifth ResECA of the second branch network.

2. The multimodal speech recognition method according to claim 1, wherein obtaining the target millimeter-wave signal and the target audio signal comprises: obtaining the target millimeter-wave signal acquired by a millimeter-wave radar; and obtaining the target audio signal acquired by a microphone.

3. The multimodal speech recognition method according to claim 1, wherein calculating the first logarithmic mel-frequency spectral coefficient and the second logarithmic mel-frequency spectral coefficient when the target millimeter-wave signal and the target audio signal both contain the speech information corresponding to the target user, comprises: determining whether the target millimeter-wave signal and the target audio signal both contain the speech information to obtain a first determining result; when the first determining result indicates that the target millimeter-wave signal and the target audio signal both contain the speech information, determining whether the target millimeter-wave signal and the target audio signal both come from the target user to obtain a second determining result; and when the second determining result indicates that the target millimeter-wave signal and the target audio signal both come from the target user, performing short-time Fourier transform (STFT) on the target millimeter-wave signal and the target audio signal to determine the first logarithmic mel-frequency spectral coefficient and the second logarithmic mel-frequency spectral coefficient.

4. The multimodal speech recognition method according to claim 3, wherein determining whether the target millimeter-wave signal and the target audio signal both contain the speech information to obtain the first determining result, comprises: preprocessing the target millimeter-wave signal and the target audio signal; performing fast Fourier transform (FFT) on the preprocessed target millimeter-wave signal to extract a millimeter-wave phase signal; performing a difference operation on the millimeter-wave phase signal to extract a millimeter-wave phase difference signal; multiplying a preprocessed target audio signal and the millimeter-wave phase difference signal to obtain a target product component; calculating a spectral entropy of the target product component; and determining whether the spectral entropy is greater than a specified threshold; wherein when the spectral entropy is greater than the specified threshold, it indicates that the target millimeter-wave signal and the target audio signal both contain the speech information.

5. The multimodal speech recognition method according to claim 4, wherein determining whether the target millimeter-wave signal and the target audio signal both come from the target user, comprises: processing the target product component to extract a target linear prediction coding (LPC) component; and inputting the target LPC component into a trained one-class support vector machine (OC-SVM) to determine whether the target millimeter-wave signal and the target audio signal both come from the target user; wherein the trained OC-SVM is determined based on training data and an OC-SVM; the training data comprises a plurality of calibration product components and a label corresponding to each calibration product component; the label is a calibration user; and the calibration product component is a product component determined based on a millimeter-wave signal and an audio signal corresponding to the calibration user.

6. The multimodal speech recognition method according to claim 1, wherein the feature calibration performed by the calibration module comprises: calculating a first channel feature distribution based on a first intermediate feature, wherein the first intermediate feature is a signal output by the output end of the third ResECA of the first branch network; calculating a second channel feature distribution based on a second intermediate feature, wherein the second intermediate feature is a signal output by the output end of the third ResECA of the second branch network; calibrating the first intermediate feature based on the second channel feature distribution; and calibrating the second intermediate feature based on the first ch
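The speech-detection steps recited in claim 4 (FFT, phase extraction, phase difference, multiplication with the audio, spectral entropy, threshold test) can be sketched roughly as follows. This is a minimal illustration under several assumptions not stated on this page: the preprocessed millimeter-wave data is a complex array of shape (chirps, samples per chirp), the phase is read from the range bin with the highest mean magnitude, the audio has already been resampled and aligned to the chirp rate, and the threshold is supplied by the caller. The function names are hypothetical.

```python
import numpy as np


def phase_difference(mmwave_frames):
    """Extract a phase-difference signal from radar frames.

    mmwave_frames: complex array of shape (n_chirps, n_samples), an assumed layout.
    Returns an array of length n_chirps - 1.
    """
    # FFT along each chirp gives a range profile per chirp.
    range_profile = np.fft.fft(mmwave_frames, axis=1)
    # Assumption: the speaker sits in the range bin with the strongest return.
    target_bin = np.argmax(np.abs(range_profile).mean(axis=0))
    # Phase over time at that bin, unwrapped to remove 2*pi jumps.
    phase = np.unwrap(np.angle(range_profile[:, target_bin]))
    # Difference operation on the phase signal.
    return np.diff(phase)


def spectral_entropy(x, eps=1e-12):
    """Shannon entropy of the normalized power spectrum of x."""
    power = np.abs(np.fft.rfft(x)) ** 2
    p = power / (power.sum() + eps)
    return float(-np.sum(p * np.log(p + eps)))


def both_contain_speech(mmwave_frames, audio, threshold):
    """Return True when the entropy of the product component exceeds threshold.

    The comparison direction follows the claim text; the threshold value
    itself is not given on this page and must be chosen by the caller.
    """
    dphi = phase_difference(mmwave_frames)
    audio = np.asarray(audio, dtype=float)
    n = min(dphi.size, audio.size)
    product = audio[:n] * dphi[:n]  # target product component
    return spectral_entropy(product) > threshold
```

The product component computed here is also the input that claim 5 processes further (LPC extraction followed by a one-class SVM); that stage is not sketched.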

Assignees

Inventors

Classifications

  • Detection of presence or absence of voice signals (switching of direction of transmission by voice frequency in two-way loud-speaking telephone systems H04M9/10)

  • the extracted parameters being spectral information of each sub-band

  • Constructional details of speech recognition systems

  • Procedures used during a speech recognition process, e.g. man-machine dialogue

  • Semantic context, e.g. disambiguation of the recognition hypotheses based on word meaning

Patent family

Related publications grouped by family.

External sources

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US12112744B2 cover?
The disclosure provides a multimodal speech recognition method and system, and a computer-readable storage medium. The method includes calculating a first logarithmic mel-frequency spectral coefficient and a second logarithmic mel-frequency spectral coefficient when a target millimeter-wave signal and a target audio signal both contain speech information corresponding to a target user; inputtin…
Who is the assignee on this patent?
Univ Zhejiang
What technology area does this patent fall under?
Primary CPC classification G10L15/02. Mapped technology areas include Physics.
When was this patent published?
Publication date Oct 8, 2024 (B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 1 related publication on this page (citations in our corpus or others sharing the same primary CPC).