User-perceived latency while maintaining accuracy

US11929076B2 · US · B2

Patent metadata
Publication number: US-11929076-B2
Application number: US-202218060949-A
Country: US
Kind code: B2
Filing date: Dec 1, 2022
Priority date: Dec 15, 2020
Publication date: Mar 12, 2024
Grant date: Mar 12, 2024

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

Disclosed speech recognition techniques improve user-perceived latency while maintaining accuracy by: receiving an audio stream, in parallel, by a primary (e.g., accurate) speech recognition engine (SRE) and a secondary (e.g., fast) SRE; generating, with the primary SRE, a primary result; generating, with the secondary SRE, a secondary result; appending the secondary result to a word list; and merging the primary result into the secondary result in the word list. Combining output from the primary and secondary SREs into a single decoder as described herein improves user-perceived latency while maintaining or improving accuracy, among other advantages.
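
The append-then-merge flow in the abstract can be illustrated with a minimal Python sketch. The abstract does not say how the merge is performed, so everything here is an assumption made for illustration: the `Word` and `WordList` names, the millisecond time spans, and the policy of replacing any listed word that overlaps the time span covered by the primary result.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Word:
    text: str
    start_ms: int
    end_ms: int
    source: str  # "secondary" (fast) or "primary" (accurate)


@dataclass
class WordList:
    """Hypothetical word list shown to the user."""
    words: List[Word] = field(default_factory=list)

    def append_secondary(self, result: List[Word]) -> None:
        # The fast engine's words are shown as soon as they arrive.
        self.words.extend(result)

    def merge_primary(self, result: List[Word]) -> None:
        # Assumed merge policy: drop every listed word that overlaps the
        # time span of the primary result, then insert the primary words.
        if not result:
            return
        span_start = result[0].start_ms
        span_end = result[-1].end_ms
        kept = [
            w for w in self.words
            if w.end_ms <= span_start or w.start_ms >= span_end
        ]
        self.words = sorted(kept + result, key=lambda w: w.start_ms)

    def text(self) -> str:
        return " ".join(w.text for w in self.words)


word_list = WordList()
word_list.append_secondary([
    Word("wreck", 0, 300, "secondary"),
    Word("a", 300, 400, "secondary"),
    Word("nice", 400, 700, "secondary"),
])
word_list.append_secondary([Word("beach", 700, 1000, "secondary")])
# The slower, more accurate engine later revises the first 700 ms.
word_list.merge_primary([Word("recognize", 0, 700, "primary")])
print(word_list.text())
```

In this sketch the user sees the fast engine's words immediately, and the later primary result overwrites only the span it covers, leaving the newest secondary word in place.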

First claim

Opening claim text (preview).

What is claimed is:

1. A computer-implemented method comprising: receiving an audio stream, in parallel, by a primary speech recognition engine (SRE) and a secondary SRE; generating a concatenated output including an encoded output of the secondary SRE and an early-stage encoded output of the primary SRE; processing the concatenated output by the primary SRE; and based upon the processing of the concatenated output by the primary SRE, generating a speech recognition result.
2. The computer-implemented method of claim 1, wherein the primary SRE uses a first acoustic model (AM) and the secondary SRE uses a second AM, wherein the second AM is faster than the first AM.
3. The computer-implemented method of claim 2, wherein the first AM uses a hybrid model and the second AM uses a recurrent neural network transducer (RNN-T) model.
4. The computer-implemented method of claim 1, wherein the speech recognition result is generated by a primary decoder included in the primary SRE.
5. The computer-implemented method of claim 1, wherein the primary SRE is on a remote node and the secondary SRE is on a local node.
6. The computer-implemented method of claim 1, wherein the primary SRE has a first look-ahead buffer and the secondary SRE has a second look-ahead buffer, wherein the first look-ahead buffer is longer than the second look-ahead buffer.
7. The computer-implemented method of claim 1, wherein the primary SRE and the secondary SRE reside on a single computing device.
8. A system comprising: a processor; and a computer-readable medium storing instructions that are operative upon execution by the processor to: receive an audio stream, in parallel, by a primary speech recognition engine (SRE) and a secondary SRE; generate a concatenated output including an encoded output of the secondary SRE and an early-stage encoded output of the primary SRE; process the concatenated output by the primary SRE; and based upon the processing of the concatenated output by the primary SRE, generate a speech recognition result.
9. The system of claim 8, wherein the primary SRE uses a first acoustic model (AM) and the secondary SRE uses a second AM, wherein the second AM is faster than the first AM.
10. The system of claim 8, wherein the primary SRE comprises an early-stage encoder, a joint encoder, and a primary decoder, wherein the secondary SRE comprises a secondary encoder and a secondary decoder, wherein the encoded output of the secondary SRE is generated by the secondary encoder and the early-stage encoded output of the primary SRE is generated by the early-stage encoder, wherein the concatenated output is provided to the joint encoder, wherein the speech recognition result is generated by the primary decoder included in the primary SRE.
11. The system of claim 8, wherein the instructions are further operative to: based on the speech recognition result, display a word list as captioning for a streaming video, captioning for a video conference, or a real-time transcription of a live conversation.
12. The system of claim 8, wherein the primary SRE is on a first node and the secondary SRE is on a second node, wherein the first node is different from the second node.
13. The system of claim 8, wherein the primary SRE has a first look-ahead buffer and the secondary SRE has a second look-ahead buffer, wherein the first look-ahead buffer is longer than the second look-ahead buffer.
14. The system of claim 8, wherein the primary SRE and the secondary SRE reside on a single computing device.
15. A computer storage medium having computer-executable instructions stored thereon, which, on execution by a processor, cause the processor to perform operations comprising: receiving an audio stream, in parallel, by a primary speech recognition engine (SRE) and a secondary SRE; generating a concatenated output including an encoded output of the secondary SRE and an early-stage encoded output of the primary SRE; processing the concatenated output by the primary SRE; and based upon the processing of the concatenated output by the primary SRE, generating a speech recognition result.
16. The computer storage medium of claim 15, wherein the primary SRE uses a first acoustic model (AM) and the secondary SRE uses a second AM, wherein the second AM is faster than the first AM.
17. The computer storage medium of claim 16, wherein the first AM uses a hybrid model and the second AM uses a recurrent neural network transducer (RNN-T) model.
18. The computer storage medium of claim 15, wherein the speech recognition result is generated by a primary decoder included in the primary SRE.
19. The computer storage medium of claim 15, wherein the primary SRE is on a first node and the secondary SRE is on a second node, wherein the first node is different from the second node.
20. The computer storage medium of claim 15, wherein the primary SRE has a first look-ahead buffer and the secondary SRE has a second look-ahead buffer, wherein the first look-ahead buffer is longer than the second look-ahead buffer.
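
The data flow recited in claims 1 and 10 (secondary encoder and early-stage encoder run in parallel, their outputs are concatenated, and the result passes through the joint encoder and primary decoder) can be sketched in Python as follows. This is only an illustration under stated assumptions: the claims do not specify how the outputs are concatenated, so frame-wise concatenation along the feature dimension is assumed, and every function name and toy encoder below is hypothetical rather than taken from the patent.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Sequence

Frame = List[float]  # one encoded feature vector per audio frame (assumed)


def concat_frames(secondary_enc: Sequence[Frame],
                  early_enc: Sequence[Frame]) -> List[Frame]:
    # Assumed layout: join the two feature vectors frame by frame.
    if len(secondary_enc) != len(early_enc):
        raise ValueError("encoders must emit the same number of frames")
    return [list(s) + list(e) for s, e in zip(secondary_enc, early_enc)]


def recognize(audio: Sequence[float],
              secondary_encoder: Callable,
              early_stage_encoder: Callable,
              joint_encoder: Callable,
              primary_decoder: Callable):
    # Both engines receive the same audio stream in parallel.
    with ThreadPoolExecutor(max_workers=2) as pool:
        secondary_future = pool.submit(secondary_encoder, audio)
        early_future = pool.submit(early_stage_encoder, audio)
        secondary_out = secondary_future.result()
        early_out = early_future.result()
    concatenated = concat_frames(secondary_out, early_out)
    # The primary SRE processes the concatenated output and decodes it.
    return primary_decoder(joint_encoder(concatenated))


# Toy stand-ins for the real models (purely illustrative).
def toy_secondary(audio):
    return [[x] for x in audio]


def toy_early(audio):
    return [[2 * x, 3 * x] for x in audio]


def toy_joint(frames):
    return [sum(f) for f in frames]


def toy_decoder(scores):
    return ["hi" if s > 3 else "lo" for s in scores]


print(recognize([0.5, 1.0], toy_secondary, toy_early, toy_joint, toy_decoder))
```

The point of the sketch is the ordering of the stages: only the primary decoder produces the final result, but it does so from an input that already includes the fast engine's encoding.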

Assignees

Microsoft Technology Licensing, LLC

Inventors

Classifications

  • G10L15/32 (Primary)

    Multiple recognisers used in sequence or in parallel; Score combination systems therefor, e.g. voting systems · CPC title

  • using artificial neural networks · CPC title

  • Distributed recognition, e.g. in client-server systems, for mobile phones or network applications · CPC title

  • Audio streaming, i.e. formatting and decoding of an encoded audio signal representation into a data stream for transmission or storage purposes · CPC title

  • for comparison or discrimination · CPC title

Patent family

Related publications grouped by family.

External sources

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US11929076B2 cover?
Disclosed speech recognition techniques improve user-perceived latency while maintaining accuracy by: receiving an audio stream, in parallel, by a primary (e.g., accurate) speech recognition engine (SRE) and a secondary (e.g., fast) SRE; generating, with the primary SRE, a primary result; generating, with the secondary SRE, a secondary result; appending the secondary result to a word list; and …
Who is the assignee on this patent?
Microsoft Technology Licensing, LLC
What technology area does this patent fall under?
Primary CPC classification G10L15/32. Mapped technology areas include Physics.
When was this patent published?
Publication date: Mar 12, 2024 (kind code B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 3 related publications on this page (citations in our corpus or others sharing the same primary CPC).