What technology area does this patent fall under?

Primary CPC classification G10L15/063. Mapped technology areas include Physics.

When was this patent published?

Publication date: Apr 1, 2025 (kind code B2). Legal status and post-grant events are not shown on this page.

What related patents are in patentsdb?

We list 3 related publications on this page (citations in our corpus or others sharing the same primary CPC).

Noisy far-field speech recognition

US12266346B2 · US · B2

Patent metadata
Field	Value
Publication number	US-12266346-B2
Application number	US-202117390788-A
Country	US
Kind code	B2
Filing date	Jul 30, 2021
Priority date	Jul 30, 2021
Publication date	Apr 1, 2025
Grant date	Apr 1, 2025

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

Title
What the patent document calls the invention.
Abstract
A short plain-language summary of the technical disclosure.
Assignees and inventors
Who owns or filed the patent and who is credited as inventor.
Key dates
Filing, priority, publication, and grant dates set the timeline.
First independent claim
The legal scope of protection — read this for what is actually claimed.
CPC / IPC classifications
Technology tags used to group this patent with similar filings.
Citations and related patents
Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

The accuracy of automatic speech recognition (ASR) tasks is improved using trained models. A speech recognition model is applied in a noisy environment where speech is spoken at a distance from the microphones. The techniques may include extracting speech features, data augmentation by adding feature perturbation, and/or a multi-domain end-to-end speech recognition model. In some implementations, the described technology includes using a teacher-group knowledge distillation strategy to train a deep end-to-end speech recognition model on original speech samples and the sample speech augmentation of the original speech samples, that outputs recognized text transcriptions corresponding to speech detected in the original speech samples and the sample speech augmentation.
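The teacher-group distillation step mentioned in the abstract can be illustrated with a small sketch. It follows the wording of the dependent claims, where the student's label is a linear interpolation of a hard label and a soft label from the teacher trained on the matching noise scenario. Everything concrete here is an assumption made for illustration and does not come from the patent: the function names (`one_hot`, `interpolate_label`, `student_target`), the dictionary keys, the interpolation weight of 0.5, and the stand-in teachers that return fixed distributions instead of running a trained model. Real end-to-end ASR models work on label sequences rather than a single class, which this sketch ignores.

```python
import random

def one_hot(index, size):
    """Hard label: probability 1.0 on the reference class, 0.0 elsewhere."""
    return [1.0 if i == index else 0.0 for i in range(size)]

def interpolate_label(hard, soft, weight):
    """Linear interpolation: weight * soft + (1 - weight) * hard."""
    if len(hard) != len(soft):
        raise ValueError("hard and soft labels must have the same length")
    return [weight * s + (1.0 - weight) * h for h, s in zip(hard, soft)]

def student_target(sample, teachers, weight=0.5):
    """Build the student's training label for one sample.

    sample   : dict with hypothetical keys "features", "label", "scenario"
    teachers : dict mapping a noise-scenario name to a callable that
               returns a probability distribution over classes
    weight   : interpolation weight (0.5 is an arbitrary placeholder)
    """
    teacher = teachers[sample["scenario"]]  # teacher for this noise scenario
    soft = teacher(sample["features"])
    hard = one_hot(sample["label"], len(soft))
    return interpolate_label(hard, soft, weight)

# Stand-in teachers that return fixed distributions, for illustration only.
teachers = {
    "street": lambda feats: [0.7, 0.2, 0.1],
    "room": lambda feats: [0.1, 0.8, 0.1],
}
dataset = [
    {"features": [0.3, 0.1], "label": 0, "scenario": "street"},
    {"features": [0.2, 0.9], "label": 1, "scenario": "room"},
]

rng = random.Random(0)
sample = rng.choice(dataset)  # randomly select a training signal
target = student_target(sample, teachers)
print(sample["scenario"], target)
```

Because both the hard label and the soft label are probability distributions, any weight between 0 and 1 yields a target that still sums to 1. A weight of 0 reduces to ordinary supervised training on hard labels, and a weight of 1 trains on the teacher output alone.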

First claim

Opening claim text (preview).

What is claimed is:

1. A method comprising: training a model using audio recordings from noise scenarios in a set of training data; decomposing a training signal from the set of training data into a message component and a noise component; scaling the noise component by a random scale factor to obtain a scaled noise, wherein the random scale factor is a power with a base that is a constant and an exponent that includes a random variable; adding the scaled noise to the message component to obtain a perturbed audio signal that is included in the set of training data; training a first teacher model using a first subset of the set of training data associated with a first noise scenario of the noise scenarios; training a second teacher model using a second subset of the set of training data associated with a second noise scenario of the noise scenarios; and training a student model using soft labels output from the first teacher model and soft labels output from the second teacher model.

2. The method of claim 1, wherein training the student model using soft labels output from the second teacher model comprises: determining a label for a training signal from the set of training data as a linear interpolation of a soft label from the second teacher model and a hard label for the training signal.

3. The method of claim 1, wherein training the student model using soft labels output from the second teacher model comprises: randomly selecting a training signal from the set of training data; identifying a noise scenario associated with the selected training signal; and determining a label for the selected training signal as a linear interpolation of a hard label for the training signal and a soft label from a teacher model trained using a subset of the training data associated with the identified noise scenario.

4. The method of claim 1, wherein the first subset of the training data associated with the first noise scenario is based on audio recordings from streets, and the second subset of the training data associated with the second noise scenario is based on audio recordings from rooms inside buildings.

5. The method of claim 1, wherein the random scale factor is chosen from a range uniformly sampled in [−8 dB, −1 dB).

6. The method of claim 1, wherein the message component is an audio signal recorded with a microphone near a desired audio source while the training signal is recorded with a microphone far from the desired audio source.

7. The method of claim 1, wherein decomposing the training signal from the set of training data into the message component and the noise component comprises: applying feature extraction, including a log-mel filter bank, to the training signal and to the message component; and subtracting features of the message component from features of the training signal to obtain features of the noise component.

8. A system comprising: a network interface, a processor, and a memory, wherein the memory stores instructions executable by the processor to: train a model using audio recordings from noise scenarios in a set of training data; decompose a training signal from the set of training data into a message component and a noise component; scale the noise component by a random scale factor to obtain a scaled noise, wherein the random scale factor is a power with a base that is a constant and an exponent that includes a random variable; add the scaled noise to the message component to obtain a perturbed audio signal that is included in the set of training data; train a first teacher model using a first subset of the set of training data associated with a first noise scenario of the noise scenarios; train a second teacher model using a second subset of the set of training data associated with a second noise scenario of the noise scenarios; and train a student model using soft labels output from the first teacher model and soft labels output from the second teacher model.

9. The system of claim 8, wherein the memory stores instructions executable by the processor to: determine a label for a training signal from the set of training data as a linear interpolation of a soft label from the second teacher model and a hard label for the training signal.

10. The system of claim 8, wherein the memory stores instructions executable by the processor to: randomly select a training signal from the set of training data; identify a noise scenario associated with the selected training signal; and determine a label for the selected training signal as a linear interpolation of a hard label for the training signal and a soft label from a teacher model trained using a subset of the training data associated with the identified noise scenario.

11. The system of claim 8, wherein the memory stores instructions executable by the processor to: input data based on an audio signal to the student model to obtain a transcript of speech recorded in the audio signal.

12. The system of claim 8, wherein the a random scale factor is chosen from a range uniformly sampled in [−8 dB, −1 dB).

13. The system of claim 8, wherein the message component is an audio signal recorded with a microphone near a desired audio source while the training signal is recorded with a microphone far from the desired audio source.

14. The system of claim 8, wherein the memory stores instructions executable by the processor to: apply feature extraction, including a log-mel filter bank, to the training signal and to the message component; and subtract features of the message component from features of the training signal to obtain features of the noise component.

15. A non-transitory computer-readable storage medium, comprising executable instructions that, when executed by a processor, facilitate performance of operations, comprising: training a model using audio recordings from noise scenarios in a set of training data; decomposing a training signal from the set of training data into a message component and a noise component; scaling the noise component by a random scale factor to obtain a scaled noise, wherein the random scale factor is a power with a base that is a constant and an exponent that includes a random variable; adding the scaled noise to the message component to obtain a perturbed audio signal that is included in the set of training data; training a first teacher model using a first subset of the set of training data associated with a first noise scenario of the noise scenarios; training a second teacher model using a second subset of the set of training data associated with a second noise scenario of the noise scenarios; and training a student model using soft labels output from the first teacher model and soft labels output from the second teacher model.

16. The non-transitory computer-readable storage medium of claim 15, wherein training the student model using soft labels output from the second teacher model comprises: determining a label for a training signal from the set of training data as a linear interpolation of a soft label from the second teacher model and a hard label for the training signal.

17. The non-transitory computer-readable storage medium of claim 15, wherein the random scale factor is chosen from a range uniformly sampled in [−8 dB, −1 dB).
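The noise-perturbation step in the claims (decompose, scale, add back) can be sketched in a few lines. This is a minimal illustration, not the patented implementation. It assumes that features for a far-microphone training signal and for the matching near-microphone message component have already been extracted with the same front end (the claims mention a log-mel filter bank) and are aligned frame by frame. The claims only say the scale factor is a constant base raised to an exponent containing a random variable, so the mapping 10 ** (u / 20) with u drawn uniformly from [−8, −1) is an assumption, as are the function names `random_scale_factor` and `perturb_features` and the toy feature values.

```python
import random

def random_scale_factor(rng, low_db=-8.0, high_db=-1.0):
    """Draw u uniformly from [low_db, high_db) and return 10 ** (u / 20).

    The base 10 and the divisor 20 are assumptions; the claims only
    require a constant base and an exponent with a random variable.
    """
    u = low_db + (high_db - low_db) * rng.random()  # rng.random() is in [0, 1)
    return 10.0 ** (u / 20.0)

def perturb_features(far_feats, near_feats, rng):
    """Return perturbed features: message + scale * (training - message).

    far_feats  : features of the training signal (far microphone)
    near_feats : features of the message component (near microphone)
    Both are lists of frames; each frame is a list of floats.
    """
    if len(far_feats) != len(near_feats):
        raise ValueError("feature sequences must have the same number of frames")
    scale = random_scale_factor(rng)  # one factor per training signal
    perturbed = []
    for far_frame, near_frame in zip(far_feats, near_feats):
        frame = []
        for far, near in zip(far_frame, near_frame):
            noise = far - near                   # noise component
            frame.append(near + scale * noise)   # message plus scaled noise
        perturbed.append(frame)
    return perturbed

# Toy two-frame, two-bin feature matrices (made-up numbers).
far = [[1.0, 2.0], [3.0, 4.0]]
near = [[0.5, 1.0], [2.0, 3.5]]

rng = random.Random(0)
augmented = perturb_features(far, near, rng)
print(augmented)
```

With a negative decibel range the scale factor is always below 1, so each perturbed value lies between the message feature and the original training feature. Drawing a fresh factor per training signal gives a different noise level on every pass over the data.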

Assignees

Zoom Communications Inc

Inventors

Classifications

G10L15/02
Feature extraction for speech recognition; Selection of recognition unit · CPC title
G10L25/84
for discriminating voice from noise · CPC title
G10L15/20
Speech recognition techniques specially adapted for robustness in adverse environments, e.g. in noise, of stress induced speech (G10L21/02 takes precedence) · CPC title
G10L15/063 · Primary
Training · CPC title
G10L15/16 · Primary
using artificial neural networks · CPC title

Patent family

Related publications grouped by family.

View patent family 85038524

External sources

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US12266346B2 cover?

The accuracy of automatic speech recognition (ASR) tasks is improved using trained models. A speech recognition model is applied in a noisy environment where speech is spoken at a distance from the microphones. The techniques may include extracting speech features, data augmentation by adding feature perturbation, and/or a multi-domain end-to-end speech recognition model. In some implementation…

Who is the assignee on this patent?

Zoom Communications Inc

Related patents

Extreme Language Model Compression with Optimal Sub-Words and Shared Projections

Soft label generation for knowledge distillation

Acoustic model training
