US12230249B2 · US · B2
| Field | Value |
|---|---|
| Publication number | US-12230249-B2 |
| Application number | US-202217655903-A |
| Country | US |
| Kind code | B2 |
| Filing date | Mar 22, 2022 |
| Priority date | Mar 26, 2021 |
| Publication date | Feb 18, 2025 |
| Grant date | Feb 18, 2025 |
Official abstract text for this publication.
A method includes receiving audio data corresponding to an utterance and generating a pair of positive audio data examples. Here, each positive audio data example includes a respective augmented copy of the received audio data. For each respective positive audio data example, the method includes generating a respective sequence of encoder outputs and projecting the respective sequence of encoder outputs for the positive data example into a contrastive loss space. The method also includes determining a L2 distance between each corresponding encoder output in the projected sequences of encoder outputs for the positive audio data examples and determining a per-utterance consistency loss by averaging the L2 distances. The method also includes generating corresponding speech recognition results for each respective positive audio data example. The method also includes updating parameters of the speech recognition model based on a respective supervised loss term and the per-utterance consistency loss.
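The loss described in the abstract can be illustrated with a minimal sketch. It assumes the two projected sequences are already aligned frame by frame and have equal length. The function names, the `weight` parameter, and the choice to sum the two supervised terms with a weighted consistency term are illustrative assumptions; the abstract only states that the update is "based on" both kinds of loss.

```python
import math
from typing import Sequence

Vector = Sequence[float]


def l2_distance(a: Vector, b: Vector) -> float:
    """Euclidean (L2) distance between two projected encoder outputs."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def per_utterance_consistency_loss(
    proj_a: Sequence[Vector], proj_b: Sequence[Vector]
) -> float:
    """Average the frame-wise L2 distances between two projected sequences."""
    if not proj_a or len(proj_a) != len(proj_b):
        raise ValueError("sequences must be non-empty and of equal length")
    distances = [l2_distance(u, v) for u, v in zip(proj_a, proj_b)]
    return sum(distances) / len(distances)


def total_loss(
    supervised_a: float,
    supervised_b: float,
    consistency: float,
    weight: float = 1.0,  # hypothetical weighting, not stated in the source
) -> float:
    """Assumed combination: sum of supervised terms plus weighted consistency."""
    return supervised_a + supervised_b + weight * consistency


# Two toy projected sequences (2 frames, 2 dimensions each).
view_a = [[0.0, 0.0], [3.0, 4.0]]
view_b = [[0.0, 0.0], [0.0, 0.0]]
consistency = per_utterance_consistency_loss(view_a, view_b)
combined = total_loss(1.0, 2.0, consistency, weight=0.5)
```

In this toy case the frame distances are 0 and 5, so the per-utterance consistency loss is their mean. Identical views give a loss of zero, which is the behavior the training objective pushes the encoder toward for differently augmented copies of the same utterance.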
Opening claim text (preview).
What is claimed is:

1. A computer-implemented method when executed by data processing hardware causes the data processing hardware to perform operations comprising: receiving a set of training utterances, each training utterance in the set of training utterances comprising a non-synthetic speech representation of a corresponding utterance; for each training utterance in the set of training utterances, converting, using a text-to-speech (TTS) model, a ground-truth transcription of the corresponding utterance to generate one or more synthetic speech representations of the same corresponding utterance; receiving audio data corresponding to an utterance by: receiving one of the non-synthetic speech representations of the corresponding utterance; or receiving one of the one or more synthetic speech representations of the corresponding utterance; generating, using a data augmentation module, a pair of positive audio data examples, each positive audio data example in the pair of positive audio data examples comprising a respective augmented copy of the received audio data corresponding to the utterance; for each respective positive audio data example in the pair of positive audio data examples: generating, using a neural network encoder, a respective sequence of encoder outputs; and projecting, using a convolutional neural network (CNN), the respective sequence of encoder outputs for the respective positive audio data example into a contrastive loss space; determining a L2 distance between each corresponding encoder output in each projected respective sequence of encoder outputs for the pair of positive audio data examples; determining a per-utterance consistency loss by averaging a set of L2 distances determined for the respective sequence of encoder outputs in the projected respective sequence of encoder outputs; generating, using a speech recognition model, corresponding speech recognition results for each respective positive audio data example in the pair of positive audio data examples; and updating parameters of the speech recognition model based on a respective supervised loss term associated with each corresponding speech recognition result and the per-utterance consistency loss.

2. The method of claim 1, wherein the CNN comprises a first CNN layer, followed by a rectified linear activation function (ReLU) activation and LayerNorm layer, and a second CNN layer with linear activation.

3. The method of claim 1, wherein the data augmentation module adds at least one of noise, reverberation, or manipulates timing of the received audio data.

4. The method of claim 1, wherein the speech recognition model comprises a sequence transducer model having a Conformer-based encoder and a long short-term memory (LSTM) decoder.

5. The method of claim 4, wherein the Conformer-based encoder comprises a stack of conformer layers each comprising a series of multi-headed self-attention, depth-wise convolution, and feedforward layers.

6. The method of claim 1, wherein generating the corresponding speech recognition results for each respective positive audio data example in the pair of positive audio data examples comprises determining, using a decoder, a probability distribution over possible speech recognition hypotheses for the respective sequence of encoder outputs.

7. The method of claim 1, wherein the operations further comprise determining the respective supervised loss term by comparing the corresponding speech recognition result for the respective positive audio data example and a corresponding ground-truth transcription of the respective positive audio data example.

8. The method of claim 1, wherein each positive audio data example in the pair of positive audio data examples comprises a different respective augmented copy of the received audio data corresponding to the utterance than each other positive audio data example in the pair of positive audio data examples.

9. The method of claim 1, wherein generating the pair of positive audio data examples comprises generating each positive audio data example in the pair of positive audio data examples based on a single observation of the utterance.

10. A system comprising: data processing hardware; and memory hardware in communication with the data processing hardware, the memory hardware storing instructions that when executed on the data processing hardware cause the data processing hardware to perform operations comprising: receiving a set of training utterances, each training utterance in the set of training utterances comprising a non-synthetic speech representation of a corresponding utterance; for each training utterance in the set of training utterances, converting, using a text-to-speech (TTS) model, a ground-truth transcription of the corresponding utterance to generate one or more synthetic speech representations of the same corresponding utterance; receiving audio data corresponding to an utterance by: receiving one of the non-synthetic speech representations of the corresponding utterance: or receiving one of the one or more synthetic speech representations of the corresponding utterance; generating, using a data augmentation module, a pair of positive audio data examples, each positive audio data example in the pair of positive audio data examples comprising a respective augmented copy of the received audio data corresponding to the utterance; for each respective positive audio data example in the pair of positive audio data examples: generating, using a neural network encoder, a respective sequence of encoder outputs; and projecting, using a convolutional neural network (CNN), the respective sequence of encoder outputs for the respective positive audio data example into a contrastive loss space; determining a L2 distance between each corresponding encoder output in each projected respective sequence of encoder outputs for the pair of positive audio data examples; determining a per-utterance consistency loss by averaging a set of L2 distances determined for the respective sequence of encoder outputs in the projected respective sequence of encoder outputs; generating, using a speech recognition model, corresponding speech recognition results for each respective positive audio data example in the pair of positive audio data examples; and updating parameters of the speech recognition model based on a respective supervised loss term associated with each corresponding speech recognition result and the per-utterance consistency loss.

11. The system of claim 10, wherein the CNN comprises a first CNN layer, followed by a rectified linear activation function (ReLU) activation and LayerNorm layer, and a second CNN layer with linear activation.

12. The system of claim 10, wherein the data augmentation module adds one of noise, reverberation, or manipulates timing of the received audio data.

13. The system of claim 10, wherein the speech recognition model comprises a sequence transducer model having a Conformer-based encoder and a long short-term memory (LSTM) decoder.

14. The system of claim 13, wherein the Conformer-based encoder comprises a stack of conformer layers each comprising a series of multi-headed self-attention, depth-wise convolution, and feedforward layers.

15. The system of claim 10, wherein generating the corresponding speech recognition results for each respective positive audio data example in the pair of positive audio data examples comprises determining, using a decoder, a probability distribution over possible speech recognition hypotheses for the respective sequence of encoder outputs.

16. The sy
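Claims 2 and 11 describe the projection network as a first CNN layer, a ReLU activation and LayerNorm layer, and a second CNN layer with linear activation. The following is a rough pure-Python sketch of that layer order only. Kernel sizes, channel counts, zero padding, the convolution running over the time axis, and a LayerNorm without learned gain or bias are all assumptions made for illustration; the claims do not specify them.

```python
import math
from typing import List

Frame = List[float]      # one encoder output (feature vector)
Frames = List[Frame]     # time-major sequence of encoder outputs


def conv1d_same(x: Frames, weight: List[List[List[float]]], bias: List[float]) -> Frames:
    """1-D convolution over time, zero-padded so output length equals input length.

    weight[o][k][i] is the weight for output channel o, kernel tap k,
    input channel i. An odd number of taps is assumed.
    """
    taps = len(weight[0])
    pad = taps // 2
    in_dim = len(x[0])
    out: Frames = []
    for t in range(len(x)):
        frame = []
        for o, b in enumerate(bias):
            acc = b
            for k in range(taps):
                src = t + k - pad
                if 0 <= src < len(x):  # positions outside the sequence count as zeros
                    acc += sum(weight[o][k][i] * x[src][i] for i in range(in_dim))
            frame.append(acc)
        out.append(frame)
    return out


def relu(x: Frames) -> Frames:
    return [[max(0.0, v) for v in frame] for frame in x]


def layer_norm(x: Frames, eps: float = 1e-5) -> Frames:
    """Normalize each frame to zero mean and unit variance (no learned scale)."""
    out: Frames = []
    for frame in x:
        mean = sum(frame) / len(frame)
        var = sum((v - mean) ** 2 for v in frame) / len(frame)
        out.append([(v - mean) / math.sqrt(var + eps) for v in frame])
    return out


def project(x: Frames, w1, b1, w2, b2) -> Frames:
    """First conv -> ReLU -> LayerNorm -> second conv with linear activation."""
    hidden = layer_norm(relu(conv1d_same(x, w1, b1)))
    return conv1d_same(hidden, w2, b2)  # no nonlinearity after the second layer


# Toy usage with hypothetical identity-like weights (1 tap, 2 channels).
identity = [[[1.0, 0.0]], [[0.0, 1.0]]]
encoder_outputs = [[1.0, -2.0], [3.0, 4.0]]
projected = project(encoder_outputs, identity, [0.0, 0.0], identity, [0.0, 0.0])
```

The output has one projected vector per input frame, so two augmented views of the same utterance yield sequences that can be compared position by position when computing the L2 distances recited in claims 1 and 10.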
Supervised learning · CPC title
Weakly supervised learning, e.g. semi-supervised or self-supervised learning · CPC title
Convolutional networks [CNN, ConvNet] · CPC title
Auto-encoder networks; Encoder-decoder networks · CPC title
characterised by memory or gating, e.g. long short-term memory [LSTM] or gated recurrent units [GRU] · CPC title