Deliberation Model-Based Two-Pass End-To-End Speech Recognition
US-2023186907-A1 · Jun 15, 2023 · US
US11929076B2 · US · B2
| Field | Value |
|---|---|
| Publication number | US-11929076-B2 |
| Application number | US-202218060949-A |
| Country | US |
| Kind code | B2 |
| Filing date | Dec 1, 2022 |
| Priority date | Dec 15, 2020 |
| Publication date | Mar 12, 2024 |
| Grant date | Mar 12, 2024 |
Official abstract text for this publication.
Disclosed speech recognition techniques improve user-perceived latency while maintaining accuracy by: receiving an audio stream, in parallel, by a primary (e.g., accurate) speech recognition engine (SRE) and a secondary (e.g., fast) SRE; generating, with the primary SRE, a primary result; generating, with the secondary SRE, a secondary result; appending the secondary result to a word list; and merging the primary result into the secondary result in the word list. Combining output from the primary and secondary SREs into a single decoder as described herein improves user-perceived latency while maintaining or improving accuracy, among other advantages.
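The append-then-merge behavior described in the abstract can be illustrated with a small sketch. This is only an illustration under stated assumptions, not the patented implementation: the `Entry` and `WordList` classes, the `segment_id` key used to align the two results, and the sample words are all hypothetical, and the abstract does not say how the primary and secondary results are aligned.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Entry:
    segment_id: int       # hypothetical key tying a result to a span of audio
    words: List[str]
    final: bool = False   # True once the primary (accurate) result is merged


@dataclass
class WordList:
    entries: List[Entry] = field(default_factory=list)

    def _find(self, segment_id: int) -> Optional[Entry]:
        for entry in self.entries:
            if entry.segment_id == segment_id:
                return entry
        return None

    def append_secondary(self, segment_id: int, words: List[str]) -> None:
        """Show the fast (secondary) result right away as a tentative entry."""
        if self._find(segment_id) is None:
            self.entries.append(Entry(segment_id, list(words)))

    def merge_primary(self, segment_id: int, words: List[str]) -> None:
        """Overwrite the tentative entry with the accurate (primary) result."""
        entry = self._find(segment_id)
        if entry is None:
            # The primary result arrived before any secondary result.
            self.entries.append(Entry(segment_id, list(words), final=True))
        else:
            entry.words = list(words)
            entry.final = True

    def text(self) -> str:
        return " ".join(w for e in self.entries for w in e.words)


# Usage: the fast engine emits words first, the accurate engine corrects them.
wl = WordList()
wl.append_secondary(0, ["wreck", "a", "nice", "beach"])
wl.append_secondary(1, ["today"])
wl.merge_primary(0, ["recognize", "speech"])
print(wl.text())  # recognize speech today
```

In this sketch the user sees the secondary words with low latency, and each segment is replaced in place once the slower primary result for it is available.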
Opening claim text (preview).
What is claimed is:
1. A computer-implemented method comprising: receiving an audio stream, in parallel, by a primary speech recognition engine (SRE) and a secondary SRE; generating a concatenated output including an encoded output of the secondary SRE and an early-stage encoded output of the primary SRE; processing the concatenated output by the primary SRE; and based upon the processing of the concatenated output by the primary SRE, generating a speech recognition result.
2. The computer-implemented method of claim 1, wherein the primary SRE uses a first acoustic model (AM) and the secondary SRE uses a second AM, wherein the second AM is faster than the first AM.
3. The computer-implemented method of claim 2, wherein the first AM uses a hybrid model and the second AM uses a recurrent neural network transducer (RNN-T) model.
4. The computer-implemented method of claim 1, wherein the speech recognition result is generated by a primary decoder included in the primary SRE.
5. The computer-implemented method of claim 1, wherein the primary SRE is on a remote node and the secondary SRE is on a local node.
6. The computer-implemented method of claim 1, wherein the primary SRE has a first look-ahead buffer and the secondary SRE has a second look-ahead buffer, wherein the first look-ahead buffer is longer than the second look-ahead buffer.
7. The computer-implemented method of claim 1, wherein the primary SRE and the secondary SRE reside on a single computing device.
8. A system comprising: a processor; and a computer-readable medium storing instructions that are operative upon execution by the processor to: receive an audio stream, in parallel, by a primary speech recognition engine (SRE) and a secondary SRE; generate a concatenated output including an encoded output of the secondary SRE and an early-stage encoded output of the primary SRE; process the concatenated output by the primary SRE; and based upon the processing of the concatenated output by the primary SRE, generate a speech recognition result.
9. The system of claim 8, wherein the primary SRE uses a first acoustic model (AM) and the secondary SRE uses a second AM, wherein the second AM is faster than the first AM.
10. The system of claim 8, wherein the primary SRE comprises an early-stage encoder, a joint encoder, and a primary decoder, wherein the secondary SRE comprises a secondary encoder and a secondary decoder, wherein the encoded output of the secondary SRE is generated by the secondary encoder and the early-stage encoded output of the primary SRE is generated by the early-stage encoder, wherein the concatenated output is provided to the joint encoder, wherein the speech recognition result is generated by the primary decoder included in the primary SRE.
11. The system of claim 8, wherein the instructions are further operative to: based on the speech recognition result, display a word list as captioning for a streaming video, captioning for a video conference, or a real-time transcription of a live conversation.
12. The system of claim 8, wherein the primary SRE is on a first node and the secondary SRE is on a second node, wherein the first node is different from the second node.
13. The system of claim 8, wherein the primary SRE has a first look-ahead buffer and the secondary SRE has a second look-ahead buffer, wherein the first look-ahead buffer is longer than the second look-ahead buffer.
14. The system of claim 8, wherein the primary SRE and the secondary SRE reside on a single computing device.
15. A computer storage medium having computer-executable instructions stored thereon, which, on execution by a processor, cause the processor to perform operations comprising: receiving an audio stream, in parallel, by a primary speech recognition engine (SRE) and a secondary SRE; generating a concatenated output including an encoded output of the secondary SRE and an early-stage encoded output of the primary SRE; processing the concatenated output by the primary SRE; and based upon the processing of the concatenated output by the primary SRE, generating a speech recognition result.
16. The computer storage medium of claim 15, wherein the primary SRE uses a first acoustic model (AM) and the secondary SRE uses a second AM, wherein the second AM is faster than the first AM.
17. The computer storage medium of claim 16, wherein the first AM uses a hybrid model and the second AM uses a recurrent neural network transducer (RNN-T) model.
18. The computer storage medium of claim 15, wherein the speech recognition result is generated by a primary decoder included in the primary SRE.
19. The computer storage medium of claim 15, wherein the primary SRE is on a first node and the secondary SRE is on a second node, wherein the first node is different from the second node.
20. The computer storage medium of claim 15, wherein the primary SRE has a first look-ahead buffer and the secondary SRE has a second look-ahead buffer, wherein the first look-ahead buffer is longer than the second look-ahead buffer.
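The data flow recited in claims 1 and 10 (both engines receive the audio in parallel, the secondary encoder output and the early-stage primary encoder output are concatenated, and the joint encoder and primary decoder produce the result) can be sketched as follows. This is a toy illustration, not the claimed implementation: the function names, the frame-wise concatenation along the feature dimension, the assumption that both encoders emit one vector per input frame, and the arithmetic stand-in encoders and decoder are all assumptions made for the example.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List

Frame = List[float]
Encoder = Callable[[List[Frame]], List[Frame]]
Decoder = Callable[[List[Frame]], List[str]]


def concat_frames(secondary: List[Frame], early: List[Frame]) -> List[Frame]:
    """Concatenate two frame-aligned encodings along the feature dimension."""
    if len(secondary) != len(early):
        raise ValueError("encodings must have the same number of frames")
    return [s + p for s, p in zip(secondary, early)]


def two_pass_recognize(audio: List[Frame],
                       secondary_encoder: Encoder,
                       early_stage_encoder: Encoder,
                       joint_encoder: Encoder,
                       primary_decoder: Decoder) -> List[str]:
    # Both engines receive the same audio in parallel.
    with ThreadPoolExecutor(max_workers=2) as pool:
        sec_future = pool.submit(secondary_encoder, audio)
        early_future = pool.submit(early_stage_encoder, audio)
        secondary_out = sec_future.result()
        early_out = early_future.result()
    # The concatenated output is processed by the rest of the primary engine.
    joined = joint_encoder(concat_frames(secondary_out, early_out))
    return primary_decoder(joined)


# Toy stand-ins (hypothetical): real encoders and decoders are neural networks.
def toy_secondary(frames):
    return [[sum(f)] for f in frames]


def toy_early(frames):
    return [[max(f), min(f)] for f in frames]


def toy_joint(frames):
    return [[sum(f)] for f in frames]


def toy_decoder(frames):
    return ["hi" if f[0] > 10 else "lo" for f in frames]


audio = [[1.0, 2.0], [4.0, 3.0]]
result = two_pass_recognize(audio, toy_secondary, toy_early,
                            toy_joint, toy_decoder)
print(result)  # ['lo', 'hi']
```

The thread pool only mirrors the "in parallel" wording of the claims. Under claims 5, 12, and 19 the two engines may sit on different nodes, in which case the two `submit` calls would correspond to requests to separate machines.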
Multiple recognisers used in sequence or in parallel; Score combination systems therefor, e.g. voting systems · CPC title
using artificial neural networks · CPC title
Distributed recognition, e.g. in client-server systems, for mobile phones or network applications · CPC title
Audio streaming, i.e. formatting and decoding of an encoded audio signal representation into a data stream for transmission or storage purposes · CPC title
for comparison or discrimination · CPC title