Noisy far-field speech recognition

US12266346B2 · US · B2

Patent metadata
Field               Value
Publication number  US-12266346-B2
Application number  US-202117390788-A
Country             US
Kind code           B2
Filing date         Jul 30, 2021
Priority date       Jul 30, 2021
Publication date    Apr 1, 2025
Grant date          Apr 1, 2025

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

The accuracy of automatic speech recognition (ASR) tasks is improved using trained models. A speech recognition model is applied in a noisy environment where speech is spoken at a distance from the microphones. The techniques may include extracting speech features, data augmentation by adding feature perturbation, and/or a multi-domain end-to-end speech recognition model. In some implementations, the described technology includes using a teacher-group knowledge distillation strategy to train a deep end-to-end speech recognition model on original speech samples and the sample speech augmentation of the original speech samples, that outputs recognized text transcriptions corresponding to speech detected in the original speech samples and the sample speech augmentation.
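The teacher-group knowledge distillation strategy named in the abstract can be illustrated with a small sketch. Claims 2 and 3 on this page describe the student's training label as a linear interpolation of a hard label and a soft label from the teacher trained on the matching noise scenario. The code below is only an illustration under stated assumptions: the scenario names, the three-class toy vocabulary, the interpolation weight of 0.5, the cross-entropy loss, and all function names are hypothetical and are not taken from the patent. A real end-to-end model would produce sequence-level outputs rather than a single class distribution.

```python
import numpy as np

def softmax(logits):
    """Numerically stable softmax over the last axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

def interpolated_label(hard_index, teacher_probs, weight):
    """Linear interpolation of a one-hot hard label and a teacher soft label.

    `weight` is a hypothetical mixing coefficient in [0, 1]; the patent text
    on this page does not give a value.
    """
    one_hot = np.zeros_like(teacher_probs)
    one_hot[hard_index] = 1.0
    return weight * one_hot + (1.0 - weight) * teacher_probs

def student_loss(student_logits, target):
    """Cross-entropy between the student's distribution and the target (assumed loss)."""
    log_probs = np.log(softmax(student_logits))
    return float(-(target * log_probs).sum())

# Hypothetical teacher outputs, one teacher per noise scenario.
teacher_logits = {
    "street": np.array([2.0, 0.5, -1.0]),
    "indoor": np.array([0.1, 1.5, 0.3]),
}

# Pick the teacher that matches the sample's noise scenario, then mix labels.
scenario = "street"
target = interpolated_label(0, softmax(teacher_logits[scenario]), weight=0.5)
loss = student_loss(np.array([1.0, 0.0, 0.0]), target)
```

Selecting the teacher by noise scenario is what separates this from single-teacher distillation: each sample is supervised by the teacher trained on data resembling it.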

First claim

Opening claim text (preview).

What is claimed is:

1. A method comprising: training a model using audio recordings from noise scenarios in a set of training data; decomposing a training signal from the set of training data into a message component and a noise component; scaling the noise component by a random scale factor to obtain a scaled noise, wherein the random scale factor is a power with a base that is a constant and an exponent that includes a random variable; adding the scaled noise to the message component to obtain a perturbed audio signal that is included in the set of training data; training a first teacher model using a first subset of the set of training data associated with a first noise scenario of the noise scenarios; training a second teacher model using a second subset of the set of training data associated with a second noise scenario of the noise scenarios; and training a student model using soft labels output from the first teacher model and soft labels output from the second teacher model.

2. The method of claim 1, wherein training the student model using soft labels output from the second teacher model comprises: determining a label for a training signal from the set of training data as a linear interpolation of a soft label from the second teacher model and a hard label for the training signal.

3. The method of claim 1, wherein training the student model using soft labels output from the second teacher model comprises: randomly selecting a training signal from the set of training data; identifying a noise scenario associated with the selected training signal; and determining a label for the selected training signal as a linear interpolation of a hard label for the training signal and a soft label from a teacher model trained using a subset of the training data associated with the identified noise scenario.

4. The method of claim 1, wherein the first subset of the training data associated with the first noise scenario is based on audio recordings from streets, and the second subset of the training data associated with the second noise scenario is based on audio recordings from rooms inside buildings.

5. The method of claim 1, wherein the random scale factor is chosen from a range uniformly sampled in [−8 dB, −1 dB).

6. The method of claim 1, wherein the message component is an audio signal recorded with a microphone near a desired audio source while the training signal is recorded with a microphone far from the desired audio source.

7. The method of claim 1, wherein decomposing the training signal from the set of training data into the message component and the noise component comprises: applying feature extraction, including a log-mel filter bank, to the training signal and to the message component; and subtracting features of the message component from features of the training signal to obtain features of the noise component.

8. A system comprising: a network interface, a processor, and a memory, wherein the memory stores instructions executable by the processor to: train a model using audio recordings from noise scenarios in a set of training data; decompose a training signal from the set of training data into a message component and a noise component; scale the noise component by a random scale factor to obtain a scaled noise, wherein the random scale factor is a power with a base that is a constant and an exponent that includes a random variable; add the scaled noise to the message component to obtain a perturbed audio signal that is included in the set of training data; train a first teacher model using a first subset of the set of training data associated with a first noise scenario of the noise scenarios; train a second teacher model using a second subset of the set of training data associated with a second noise scenario of the noise scenarios; and train a student model using soft labels output from the first teacher model and soft labels output from the second teacher model.

9. The system of claim 8, wherein the memory stores instructions executable by the processor to: determine a label for a training signal from the set of training data as a linear interpolation of a soft label from the second teacher model and a hard label for the training signal.

10. The system of claim 8, wherein the memory stores instructions executable by the processor to: randomly select a training signal from the set of training data; identify a noise scenario associated with the selected training signal; and determine a label for the selected training signal as a linear interpolation of a hard label for the training signal and a soft label from a teacher model trained using a subset of the training data associated with the identified noise scenario.

11. The system of claim 8, wherein the memory stores instructions executable by the processor to: input data based on an audio signal to the student model to obtain a transcript of speech recorded in the audio signal.

12. The system of claim 8, wherein the random scale factor is chosen from a range uniformly sampled in [−8 dB, −1 dB).

13. The system of claim 8, wherein the message component is an audio signal recorded with a microphone near a desired audio source while the training signal is recorded with a microphone far from the desired audio source.

14. The system of claim 8, wherein the memory stores instructions executable by the processor to: apply feature extraction, including a log-mel filter bank, to the training signal and to the message component; and subtract features of the message component from features of the training signal to obtain features of the noise component.

15. A non-transitory computer-readable storage medium, comprising executable instructions that, when executed by a processor, facilitate performance of operations, comprising: training a model using audio recordings from noise scenarios in a set of training data; decomposing a training signal from the set of training data into a message component and a noise component; scaling the noise component by a random scale factor to obtain a scaled noise, wherein the random scale factor is a power with a base that is a constant and an exponent that includes a random variable; adding the scaled noise to the message component to obtain a perturbed audio signal that is included in the set of training data; training a first teacher model using a first subset of the set of training data associated with a first noise scenario of the noise scenarios; training a second teacher model using a second subset of the set of training data associated with a second noise scenario of the noise scenarios; and training a student model using soft labels output from the first teacher model and soft labels output from the second teacher model.

16. The non-transitory computer-readable storage medium of claim 15, wherein training the student model using soft labels output from the second teacher model comprises: determining a label for a training signal from the set of training data as a linear interpolation of a soft label from the second teacher model and a hard label for the training signal.

17. The non-transitory computer-readable storage medium of claim 15, wherein the random scale factor is chosen from a range uniformly sampled in [−8 dB, −1 dB).
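The augmentation steps recited in claims 1, 5, and 7 can be sketched as follows: subtract the message (near-microphone) features from the far-field features to get noise features, scale that noise by a random factor, and add it back to the message. This is a minimal sketch, not the patent's implementation. The base of 10 and the division by 20 (an amplitude-style decibel conversion) are assumptions, since the claims only say the factor is a power with a constant base and a random exponent. The function name, array shapes, and synthetic stand-in features are likewise hypothetical.

```python
import numpy as np

def perturb_features(far_feats, near_feats, rng, low_db=-8.0, high_db=-1.0, base=10.0):
    """Return (perturbed_features, scale) for one training example.

    far_feats:  features of the far-field training signal
    near_feats: features of the message component (near-microphone recording)
    """
    # Noise features are the far-field features minus the message features.
    noise = far_feats - near_feats
    # Random exponent drawn uniformly from [low_db, high_db); the /20 dB-to-amplitude
    # conversion and base 10 are assumptions, not stated in the claims.
    exponent = rng.uniform(low_db, high_db) / 20.0
    scale = base ** exponent
    # Add the scaled noise back onto the message component.
    return near_feats + scale * noise, scale

rng = np.random.default_rng(0)
# Synthetic stand-ins for log-mel filter bank features (frames x mel bins).
near = rng.normal(size=(100, 80))
far = near + rng.normal(scale=0.5, size=(100, 80))

perturbed, scale = perturb_features(far, near, rng)
```

Under these assumptions the scale falls roughly between 0.40 and 0.89, so each perturbed example carries an attenuated copy of the original noise and can be added to the training set alongside the unmodified sample.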

Assignees

Zoom Communications Inc

Inventors

Classifications

  • Feature extraction for speech recognition; Selection of recognition unit · CPC title

  • for discriminating voice from noise · CPC title

  • Speech recognition techniques specially adapted for robustness in adverse environments, e.g. in noise, of stress induced speech (G10L21/02 takes precedence) · CPC title

  • G10L15/063 · Primary

    Training · CPC title

  • G10L15/16 · Primary

    using artificial neural networks · CPC title

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US12266346B2 cover?
The accuracy of automatic speech recognition (ASR) tasks is improved using trained models. A speech recognition model is applied in a noisy environment where speech is spoken at a distance from the microphones. The techniques may include extracting speech features, data augmentation by adding feature perturbation, and/or a multi-domain end-to-end speech recognition model. In some implementation…
Who is the assignee on this patent?
Zoom Communications Inc
What technology area does this patent fall under?
Primary CPC classification G10L15/063. Mapped technology areas include Physics.
When was this patent published?
Publication date Apr 1, 2025 (B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 3 related publications on this page (citations in our corpus or others sharing the same primary CPC).