Systems and methods for human listening and live captioning

US11922963B2 · US · B2

Patent metadata
Field: Value
Publication number: US-11922963-B2
Application number: US-202117331448-A
Country: US
Kind code: B2
Filing date: May 26, 2021
Priority date: May 26, 2021
Publication date: Mar 5, 2024
Grant date: Mar 5, 2024

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

Systems and methods are provided for generating and operating a speech enhancement model optimized for generating noise-suppressed speech outputs for improved human listening and live captioning. A computing system obtains a speech enhancement model trained on a first training dataset to generate noise-suppressed speech outputs and an automatic speech recognition model trained on a second training dataset to generate transcription labels for spoken language utterances. A third training dataset comprising a set of spoken language utterances is applied to the speech enhancement model to obtain a first noise-suppressed speech output which is applied to the automatic speech recognition model to generate a noise-suppressed transcription output for the set of spoken language utterances. Speech enhancement model parameters are updated to optimize the speech enhancement model to generate optimized noise-suppressed speech outputs based on a comparison of the noise-suppressed transcription output and ground truth transcription labels.
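The update described in the abstract can be illustrated with a toy sketch: noisy input passes through the enhancement model, then through a recognizer whose weights stay fixed, and only the enhancement parameters move in response to the transcription error. Everything here is a hypothetical stand-in chosen for illustration (the per-feature `gain`, the linear `recognize` scorer, the squared-error loss, the learning rate); the patent's actual models are neural networks and its loss is not specified in this preview.

```python
# Toy sketch, not the patented implementation: all names and models are assumptions.

def enhance(noisy, gain):
    """Stand-in speech enhancement model: one trainable gain per feature."""
    return [g * x for g, x in zip(gain, noisy)]


def recognize(features, asr_weights):
    """Stand-in recognizer: a fixed linear score for a single numeric label."""
    return sum(w * f for w, f in zip(asr_weights, features))


def asr_loss_step(noisy, label, gain, asr_weights, lr=0.1):
    """Return updated enhancement gains and the loss measured before the update."""
    score = recognize(enhance(noisy, gain), asr_weights)
    error = score - label              # comparison with the ground truth label
    loss = 0.5 * error * error
    # Chain rule through the recognizer: d loss / d gain_i = error * w_i * x_i.
    # asr_weights are read but never modified, mirroring a frozen recognizer.
    new_gain = [g - lr * error * w * x
                for g, w, x in zip(gain, asr_weights, noisy)]
    return new_gain, loss


if __name__ == "__main__":
    gain = [1.0, 1.0]
    weights = [0.5, -0.25]
    for _ in range(5):
        gain, loss = asr_loss_step([1.0, 2.0], 1.0, gain, weights)
        print(round(loss, 5), gain)
```

In this sketch the recognizer acts only as a fixed differentiable scoring function, so the transcription error shapes the enhancement parameters without retraining the recognizer itself.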

First claim

Opening claim text (preview).

What is claimed is:

1. A computing system comprising: one or more processors; and one or more storage devices storing computer-readable instructions that are executable by the one or more processors to configure the computing system to at least: obtain a speech enhancement model trained on a first training dataset to generate noise-suppressed speech outputs; obtain an automatic speech recognition model trained on a second training dataset to generate noise-suppressed transcription output for spoken language utterances; and optimize the speech enhancement model by iteratively alternating between (i) performing automatic speech recognition training by applying the speech enhancement model to a third training dataset comprising noisy audio and corresponding ground truth transcription labels and (ii) performing noise suppressing training applying the speech enhancement model to a fourth training dataset comprising parallel noisy audio and clean reference audio data.

2. The computing system of claim 1, the computer-readable instructions being further executable to further configure the computing system to: apply a set of spoken language utterances to the speech enhancement model to obtain a first noise-suppressed speech output; apply the first noise-suppressed speech output from the speech enhancement model to the automatic speech recognition model to generate a first noise-suppressed transcription output for the set of spoken language utterances; access a set of ground truth transcription labels corresponding to the set of spoken language utterances; update one or more speech enhancement model parameters to optimize the speech enhancement model to generate optimized noise-suppressed speech outputs based on a first comparison of the noise-suppressed transcription output and the ground truth transcription labels; and prior to updating the one or more speech enhancement model parameters, freezing a set of internal layers of the automatic speech recognition model.

3. The computing system of claim 2, the computer-readable instructions being further executable by the one or more processors to further configure the computing system to: after obtaining the speech enhancement model and the automatic speech recognition model but prior to updating the one or more speech enhancement model parameters, concatenating the speech enhancement model and the automatic speech recognition model.

4. The computing system of claim 2, the computer-readable instructions being further executable by the one or more processors to further configure the computing system to: obtain a set of noisy audio data and a set of clean reference audio data corresponding to the set of noisy audio data; apply the noisy audio data to the speech enhancement model to obtain a second noise-suppressed speech output; and update the one or more speech enhancement model parameters to minimize signal quality loss during generation of the optimized noise-suppressed speech outputs based on a second comparison of the second noise-suppressed speech output and the set of clean reference audio data.

5. The computing system of claim 4, the fourth training dataset comprising a subset of the first training dataset.

6. The computing system of claim 1, the computer-readable instructions being further executable by the one or more processors to further configure the computing system to: obtain user enrollment data comprising a speaker embedding vector corresponding to a target speaker; extract the speaker embedding vector corresponding to the target speaker; and personalize the speech enhancement model to the target speaker by appending the speaker embedding vector to an internal layer of the speech enhancement model to configure the speech enhancement model to remove background noise and non-target speaker speech in order to generate personalized noise-suppressed speech outputs.

7. The computing system of claim 1, the speech enhancement model configured as a deep complex convolution recurrent network for phase-aware speech enhancement comprising one or more short time Fourier transform layers, a complex encoder layer, a complex unified long short term memory layer, or a complex decoder layer.

8. The computing system of claim 1, the first training dataset comprising simulated data comprising a mixture of clean speech and one or more of: room impulse responses, isotropic noise, or transient noise.

9. The computing system of claim 1, the automatic speech recognition model configured as a sequence-to-sequence model using an attention-based encoder-decoder structure.

10. The computing system of claim 1, the second training dataset comprising non-simulated audio data comprising spoken language utterances without a corresponding clean speech reference signal.

11. The computing system of claim 1, the second training dataset comprising non-simulated audio data and simulated audio data.

12. The computing system of claim 1, the computer-readable instructions that are executable by the one or more processors to configure the computing system to at least apply the third training dataset comprising a set of spoken language utterances to the speech enhancement model to obtain a first noise-suppressed speech output, the third training dataset comprising a subset of the second training dataset.

13. The computing system of claim 1, the computer-readable instructions that are executable by the one or more processors to configure the computing system to at least apply the third training dataset comprising a set of spoken language utterances to the speech enhancement model to obtain a first noise-suppressed speech output, the third training dataset comprising speech data for a target domain corresponding to one or more of: a target enterprise or a target speaking context.

14. The computing system of claim 1, the computer-readable instructions that are executable by the one or more processors to configure the computing system to at least apply the third training dataset comprising a set of spoken language utterances to the speech enhancement model to obtain a first noise-suppressed speech output, the third training dataset comprising speech data for a target domain corresponding to a particular target user.

15. The computing system of claim 1, the computer-readable instructions being further executable by the one or more processors to further configure the computing system to: apply a set of spoken language utterances to the speech enhancement model to obtain a first noise-suppressed speech output; apply the first noise-suppressed speech output from the speech enhancement model to the automatic speech recognition model to generate a noise-suppressed transcription output for the set of spoken language utterances; obtain ground truth transcription labels for the set of spoken language utterances included in the third training dataset; update one or more speech enhancement model parameters to optimize the speech enhancement model to generate optimized noise-suppressed speech outputs based on a first comparison of the noise-suppressed transcription output and the ground truth transcription labels; and update the one or more speech enhancement model parameters by adjusting a probability parameter corresponding to a frequency at which the speech enhancement model is updated.

16. The computing system of claim 1, the computer-readable instructions being further executable by the one or more processors to further configure the computing system to: apply a set of spoken language utterances to the speech enhancement model to obtain a first noise-suppressed spee
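The alternation in claim 1 between (i) recognition-driven training and (ii) noise-suppression training can be sketched on a toy model. This is a loose illustration under stated assumptions, not the claimed method: the scalar-gain `enhance` model, the linear recognizer weights, the squared-error losses, and every function name are hypothetical. Choosing each step at random with probability `p_asr` is one possible reading of the "probability parameter" mentioned in claim 15; the claim text shown here does not say how that parameter is applied.

```python
import random

# Toy sketch, not the patented implementation: all names and models are assumptions.

def enhance(noisy, gain):
    """Stand-in speech enhancement model: one trainable gain per feature."""
    return [g * x for g, x in zip(gain, noisy)]


def asr_step(gain, noisy, label, asr_weights, lr):
    """Step (i): transcription loss measured through a fixed recognizer."""
    score = sum(w * f for w, f in zip(asr_weights, enhance(noisy, gain)))
    error = score - label
    return [g - lr * error * w * x
            for g, w, x in zip(gain, asr_weights, noisy)]


def ns_step(gain, noisy, clean, lr):
    """Step (ii): signal loss against parallel clean reference audio."""
    enhanced = enhance(noisy, gain)
    return [g - lr * (e - c) * x
            for g, e, c, x in zip(gain, enhanced, clean, noisy)]


def train(gain, asr_batches, ns_batches, asr_weights,
          p_asr=0.5, steps=200, lr=0.05, seed=0):
    """Alternate between the two step types, picking (i) with probability p_asr."""
    rng = random.Random(seed)
    counts = {"asr": 0, "ns": 0}
    for _ in range(steps):
        if rng.random() < p_asr:
            noisy, label = rng.choice(asr_batches)
            gain = asr_step(gain, noisy, label, asr_weights, lr)
            counts["asr"] += 1
        else:
            noisy, clean = rng.choice(ns_batches)
            gain = ns_step(gain, noisy, clean, lr)
            counts["ns"] += 1
    return gain, counts


if __name__ == "__main__":
    asr_batches = [([2.0], 1.0)]    # (noisy features, numeric label)
    ns_batches = [([2.0], [1.0])]   # (noisy features, clean reference)
    print(train([1.0], asr_batches, ns_batches, asr_weights=[1.0]))
```

Raising `p_asr` toward 1 spends more updates on the transcription objective, and lowering it favors the signal-quality objective. A strict round-robin schedule would be an equally plausible reading of "iteratively alternating".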

Assignees

Microsoft Technology Licensing LLC

Inventors

Classifications

  • Noise filtering

  • Backpropagation, e.g. using gradient descent

  • using neural networks

  • for comparison or discrimination

  • Training

Patent family

Related publications grouped by family.

External sources

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US11922963B2 cover?
Systems and methods are provided for generating and operating a speech enhancement model optimized for generating noise-suppressed speech outputs for improved human listening and live captioning. A computing system obtains a speech enhancement model trained on a first training dataset to generate noise-suppressed speech outputs and an automatic speech recognition model trained on a second train…
Who is the assignee on this patent?
Microsoft Technology Licensing LLC
What technology area does this patent fall under?
Primary CPC classification G10L21/0208. Mapped technology areas include Physics.
When was this patent published?
Publication date Mar 5, 2024 (B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 9 related publications on this page (citations in our corpus or others sharing the same primary CPC).