Audio matching method and related device
US-2023008363-A1 · Jan 12, 2023 · US
US12112744B2 · US · B2
| Field | Value |
|---|---|
| Publication number | US-12112744-B2 |
| Application number | US-202217684958-A |
| Country | US |
| Kind code | B2 |
| Filing date | Mar 2, 2022 |
| Priority date | Aug 10, 2021 |
| Publication date | Oct 8, 2024 |
| Grant date | Oct 8, 2024 |
Official abstract text for this publication.
The disclosure provides a multimodal speech recognition method and system, and a computer-readable storage medium. The method includes calculating a first logarithmic mel-frequency spectral coefficient and a second logarithmic mel-frequency spectral coefficient when a target millimeter-wave signal and a target audio signal both contain speech information corresponding to a target user; inputting the first and the second logarithmic mel-frequency spectral coefficient into a fusion network to determine a target fusion feature, where the fusion network includes at least a calibration module and a mapping module, the calibration module is configured to perform mutual feature calibration on the target audio/millimeter-wave signals, and the mapping module is configured to fuse a calibrated millimeter-wave feature and a calibrated audio feature; and inputting the target fusion feature into a semantic feature network to determine a speech recognition result corresponding to the target user. The disclosure can implement high-accuracy speech recognition.
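As an illustration of the logarithmic mel-frequency spectral coefficients named in the abstract, the following is a minimal sketch, not the patented implementation. It frames a signal, applies a Hann window, takes a power spectrum, applies a triangular mel filterbank, and takes the logarithm. The function names, the frame length, hop size, number of mel bands, the HTK-style mel formula, and the use of the natural logarithm are all assumptions; this record does not state them. A naive DFT is used only to keep the example dependency-free.

```python
import cmath
import math


def hz_to_mel(hz):
    # HTK-style mel scale (an assumption; other variants exist).
    return 2595.0 * math.log10(1.0 + hz / 700.0)


def mel_to_hz(mel):
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def power_spectrum(frame):
    """Hann-window one frame and return its one-sided power spectrum (naive DFT)."""
    n = len(frame)
    windowed = [x * (0.5 - 0.5 * math.cos(2.0 * math.pi * i / (n - 1)))
                for i, x in enumerate(frame)]
    spectrum = []
    for k in range(n // 2 + 1):
        acc = sum(windowed[t] * cmath.exp(-2j * math.pi * k * t / n)
                  for t in range(n))
        spectrum.append(abs(acc) ** 2)
    return spectrum


def mel_filterbank(n_mels, n_fft, sample_rate):
    """Triangular filters whose centers are evenly spaced on the mel scale."""
    top = hz_to_mel(sample_rate / 2.0)
    edges_hz = [mel_to_hz(top * i / (n_mels + 1)) for i in range(n_mels + 2)]
    bin_hz = [k * sample_rate / n_fft for k in range(n_fft // 2 + 1)]
    bank = []
    for m in range(1, n_mels + 1):
        lo, mid, hi = edges_hz[m - 1], edges_hz[m], edges_hz[m + 1]
        row = []
        for f in bin_hz:
            if lo < f <= mid:
                row.append((f - lo) / (mid - lo))   # rising edge
            elif mid < f < hi:
                row.append((hi - f) / (hi - mid))   # falling edge
            else:
                row.append(0.0)
        bank.append(row)
    return bank


def log_mel_frames(signal, sample_rate, frame_len=64, hop=32, n_mels=8, eps=1e-10):
    """Short-time analysis: one list of n_mels log band energies per frame."""
    bank = mel_filterbank(n_mels, frame_len, sample_rate)
    frames = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        power = power_spectrum(signal[start:start + frame_len])
        frames.append([math.log(sum(w * p for w, p in zip(row, power)) + eps)
                       for row in bank])
    return frames
```

In the method described here, the same computation would be applied twice: once to the millimeter-wave signal and once to the audio signal, giving the first and second coefficients that feed the two branches of the fusion network.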
Opening claim text (preview).
What is claimed is:

1. A multimodal speech recognition method, comprising:
obtaining a target millimeter-wave signal and a target audio signal;
calculating a first logarithmic mel-frequency spectral coefficient and a second logarithmic mel-frequency spectral coefficient when the target millimeter-wave signal and the target audio signal both contain speech information corresponding to a target user, wherein the first logarithmic mel-frequency spectral coefficient is determined based on the target millimeter-wave signal, and the second logarithmic mel-frequency spectral coefficient is determined based on the target audio signal;
inputting the first logarithmic mel-frequency spectral coefficient and the second logarithmic mel-frequency spectral coefficient into a fusion network to determine a target fusion feature, wherein the fusion network comprises at least a calibration module and a mapping module; the calibration module is configured to perform feature calibration on the target millimeter-wave signal based on the target audio signal to obtain a calibrated millimeter-wave feature and perform feature calibration on the target audio signal based on the target millimeter-wave signal to obtain a calibrated audio feature; and the mapping module is configured to fuse the calibrated millimeter-wave feature and the calibrated audio feature to obtain the target fusion feature; and
inputting the target fusion feature into a semantic feature network to determine a speech recognition result corresponding to the target user,
wherein the fusion network further comprises two identical branch networks including a first branch network and a second branch network; and each branch network comprises a first residual block with efficient channel attention (ResECA), a second ResECA, a third ResECA, a fourth ResECA, and a fifth ResECA;
wherein an input end of the calibration module is respectively connected to an output end of the third ResECA of the first branch network and an output end of the third ResECA of the second branch network; and an output end of the calibration module is respectively connected to an input end of the fourth ResECA of the first branch network and an input end of the fourth ResECA of the second branch network;
an input end of the first ResECA of the first branch network is used to input the first logarithmic mel-frequency spectral coefficient; and an output end of the first ResECA of the first branch network is connected to an input end of the second ResECA of the first branch network, an output end of the second ResECA of the first branch network is connected to an input end of the third ResECA of the first branch network, and an output end of the fourth ResECA of the first branch network is connected to an input end of the fifth ResECA of the first branch network;
an input end of the first ResECA of the second branch network is used to input the second logarithmic mel-frequency spectral coefficient; and an output end of the first ResECA of the second branch network is connected to an input end of the second ResECA of the second branch network, an output end of the second ResECA of the second branch network is connected to an input end of the third ResECA of the second branch network, and an output end of the fourth ResECA of the second branch network is connected to an input end of the fifth ResECA of the second branch network; and
an input end of the mapping module is respectively connected to an output end of the fifth ResECA of the first branch network and an output end of the fifth ResECA of the second branch network.

2. The multimodal speech recognition method according to claim 1, wherein obtaining the target millimeter-wave signal and the target audio signal comprises:
obtaining the target millimeter-wave signal acquired by a millimeter-wave radar; and
obtaining the target audio signal acquired by a microphone.

3. The multimodal speech recognition method according to claim 1, wherein calculating the first logarithmic mel-frequency spectral coefficient and the second logarithmic mel-frequency spectral coefficient when the target millimeter-wave signal and the target audio signal both contain the speech information corresponding to the target user, comprises:
determining whether the target millimeter-wave signal and the target audio signal both contain the speech information to obtain a first determining result;
when the first determining result indicates that the target millimeter-wave signal and the target audio signal both contain the speech information, determining whether the target millimeter-wave signal and the target audio signal both come from the target user to obtain a second determining result; and
when the second determining result indicates that the target millimeter-wave signal and the target audio signal both come from the target user, performing short-time Fourier transform (STFT) on the target millimeter-wave signal and the target audio signal to determine the first logarithmic mel-frequency spectral coefficient and the second logarithmic mel-frequency spectral coefficient.

4. The multimodal speech recognition method according to claim 3, wherein determining whether the target millimeter-wave signal and the target audio signal both contain the speech information to obtain the first determining result, comprises:
preprocessing the target millimeter-wave signal and the target audio signal;
performing fast Fourier transform (FFT) on the preprocessed target millimeter-wave signal to extract a millimeter-wave phase signal;
performing a difference operation on the millimeter-wave phase signal to extract a millimeter-wave phase difference signal;
multiplying a preprocessed target audio signal and the millimeter-wave phase difference signal to obtain a target product component;
calculating a spectral entropy of the target product component; and
determining whether the spectral entropy is greater than a specified threshold;
wherein when the spectral entropy is greater than the specified threshold, it indicates that the target millimeter-wave signal and the target audio signal both contain the speech information.

5. The multimodal speech recognition method according to claim 4, wherein determining whether the target millimeter-wave signal and the target audio signal both come from the target user, comprises:
processing the target product component to extract a target linear prediction coding (LPC) component; and
inputting the target LPC component into a trained one-class support vector machine (OC-SVM) to determine whether the target millimeter-wave signal and the target audio signal both come from the target user;
wherein the trained OC-SVM is determined based on training data and an OC-SVM; the training data comprises a plurality of calibration product components and a label corresponding to each calibration product component; the label is a calibration user; and the calibration product component is a product component determined based on a millimeter-wave signal and an audio signal corresponding to the calibration user.

6. The multimodal speech recognition method according to claim 1, wherein the feature calibration performed by the calibration module comprises:
calculating a first channel feature distribution based on a first intermediate feature, wherein the first intermediate feature is a signal output by the output end of the third ResECA of the first branch network;
calculating a second channel feature distribution based on a second intermediate feature, wherein the second intermediate feature is a signal output by the output end of the third ResECA of the second branch network;
calibrating the first intermediate feature based on the second channel feature distribution; and
calibrating the second intermediate feature based on the first ch
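
The speech-presence check recited in claim 4 (phase difference, product with the audio signal, spectral entropy, threshold comparison) can be sketched roughly as follows. This is a hedged illustration only. It assumes the millimeter-wave phase signal has already been extracted by the FFT step and that both signals are preprocessed, time-aligned, and sampled at the same rate; none of those steps are shown. The function names, the entropy definition (Shannon entropy in nats over the normalized one-sided power spectrum), and any threshold value are assumptions, not details taken from the claims.

```python
import cmath
import math


def phase_difference(phase):
    """First-order difference of a phase sequence (one sample shorter)."""
    return [b - a for a, b in zip(phase, phase[1:])]


def product_component(audio, phase_diff):
    """Element-wise product over the overlapping samples of the two signals."""
    return [a * p for a, p in zip(audio, phase_diff)]


def spectral_entropy(samples):
    """Shannon entropy (nats) of the normalized one-sided power spectrum."""
    n = len(samples)
    power = []
    for k in range(n // 2 + 1):
        acc = sum(samples[t] * cmath.exp(-2j * math.pi * k * t / n)
                  for t in range(n))
        power.append(abs(acc) ** 2)
    total = sum(power)
    if total == 0.0:
        return 0.0  # all-zero input: treated as zero entropy (assumption)
    probs = [p / total for p in power]
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def both_contain_speech(audio, mmwave_phase, threshold):
    """Decision rule as worded in claim 4: entropy above the threshold
    is taken to mean both signals contain speech information."""
    product = product_component(audio, phase_difference(mmwave_phase))
    return spectral_entropy(product) > threshold
```

The threshold would have to be chosen empirically for a given radar, microphone, and preprocessing chain; the claims only say it is "specified".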
Detection of presence or absence of voice signals (switching of direction of transmission by voice frequency in two-way loud-speaking telephone systems H04M9/10) · CPC title
the extracted parameters being spectral information of each sub-band · CPC title
Constructional details of speech recognition systems · CPC title
Procedures used during a speech recognition process, e.g. man-machine dialogue · CPC title
Semantic context, e.g. disambiguation of the recognition hypotheses based on word meaning · CPC title