What technology area does this patent fall under?

Primary CPC classification G10L15/26. Mapped technology areas include Physics.

When was this patent published?

Publication date: Mar 18, 2025 (B2). Legal status and post-grant events are not shown on this page.

What related patents are in patentsdb?

We list 5 related publications on this page (citations in our corpus or others sharing the same primary CPC).

One model unifying streaming and non-streaming speech recognition

US12254869B2 · US · B2

Patent metadata
Field	Value
Publication number	US-12254869-B2
Application number	US-202318357225-A
Country	US
Kind code	B2
Filing date	Jul 24, 2023
Priority date	Oct 5, 2020
Publication date	Mar 18, 2025
Grant date	Mar 18, 2025

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

Title
What the patent document calls the invention.
Abstract
A short plain-language summary of the technical disclosure.
Assignees and inventors
Who owns or filed the patent and who is credited as inventor.
Key dates
Filing, priority, publication, and grant dates set the timeline.
First independent claim
The legal scope of protection — read this for what is actually claimed.
CPC / IPC classifications
Technology tags used to group this patent with similar filings.
Citations and related patents
Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

A transformer-transducer model for unifying streaming and non-streaming speech recognition includes an audio encoder, a label encoder, and a joint network. The audio encoder receives a sequence of acoustic frames, and generates, at each of a plurality of time steps, a higher order feature representation for a corresponding acoustic frame. The label encoder receives a sequence of non-blank symbols output by a final softmax layer, and generates, at each of the plurality of time steps, a dense representation. The joint network receives the higher order feature representation and the dense representation at each of the plurality of time steps, and generates a probability distribution over possible speech recognition hypothesis. The audio encoder of the model further includes a neural network having an initial stack of transformer layers trained with zero look ahead audio context, and a final stack of transformer layers trained with a variable look ahead audio context.
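The "look ahead audio context" in the abstract can be pictured as a limit on how many future frames each attention layer may see. The following is a minimal sketch of that idea, not the patented implementation: the function names, the `left_context` parameter, and the frame counts are all assumptions made for illustration.

```python
from typing import List


def lookahead_mask(num_frames: int, left_context: int, right_context: int) -> List[List[bool]]:
    """Build an attention mask; mask[t][s] is True when frame t may attend to frame s.

    right_context is the look-ahead in frames (0 means no future frames are visible).
    left_context is a hypothetical cap on how far back a frame may look.
    """
    return [
        [t - left_context <= s <= t + right_context for s in range(num_frames)]
        for t in range(num_frames)
    ]


def stacked_lookahead(per_layer_right_context: List[int]) -> int:
    """Future frames visible at the top of a stack: per-layer look-ahead adds up."""
    return sum(per_layer_right_context)


# Zero look-ahead (as in the initial stack) versus one frame of look-ahead.
streaming = lookahead_mask(5, left_context=2, right_context=0)
buffered = lookahead_mask(5, left_context=2, right_context=1)

print(streaming[2])  # frame 2 sees frames 0..2 only
print(buffered[2])   # frame 2 also sees frame 3
print(stacked_lookahead([0, 0, 2, 2]))  # two zero-look-ahead layers, then two layers with 2 frames each
```

In this sketch a stack whose layers all use zero look-ahead never waits for future audio, while each layer with a non-zero look-ahead adds its own frames to the total wait.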

First claim

Opening claim text (preview).

What is claimed is:
1. A computer-implemented method executed on data processing hardware that causes the data processing hardware to perform operations comprising: receiving audio data corresponding to a spoken utterance; encoding, by an initial stack of multi-head attention layers, the audio data to compute shared activations: while receiving the audio data corresponding to the spoken utterance: encoding, by a final stack of multi-head attention layers while applying a first look ahead audio context, the shared activations to compute low latency activations; and decoding the low latency activations into partial speech recognition results for the spoken utterance; and after the audio data corresponding to the spoken utterance is received: encoding, by the final stack of multi-head attention layers while applying a second look ahead audio context, the shared activations to compute high latency activations; and decoding the high latency activations into a final speech recognition result for the spoken utterance, wherein: the initial stack of multi-head attention layers are trained with zero look ahead audio context; and the final stack of multi-head attention layers are trained with variable look ahead audio context.
2. The computer-implemented method of claim 1, wherein the operations further comprise, while receiving the audio data corresponding to the spoken utterance, streaming the partial speech recognition results for the spoken utterance.
3. The computer-implemented method of claim 2, wherein the operations further comprise, after the audio data corresponding to the spoken utterance is received, replacing the streamed partial speech recognition results with the final speech recognition result.
4. The computer-implemented method of claim 1, wherein during training, the variable look ahead audio context is uniformly sampled for each multi-head attention layer in the final stack of multi-head attention layers.
5. The computer-implemented method of claim 1, wherein the initial stack of multi-head attention layers comprises more multi-head attention layers than the final stack of multi-head attention layers.
6. The computer-implemented method of claim 1, wherein the first look ahead audio context comprises zero look ahead audio context.
7. The computer-implemented method of claim 1, wherein a low latency decoding branch decodes the low latency activations into the partial speech recognition results for the spoken utterance in parallel with a high latency decoding branch decoding the high latency activations into the final speech recognition result for the spoken utterance.
8. The computer-implemented method of claim 1, wherein the final speech recognition result decoded from the shared activations is delayed from the partial speech recognition results decoded from the shared activations by a duration based on a difference between the second look ahead audio context and the first look ahead audio context.
9. The computer-implemented method of claim 1, wherein the operations further comprise: receiving an application identifier indicating a type of application the spoken utterance is directed toward; and setting a duration of the second look ahead audio context based on the application identifier.
10. A system comprising: data processing hardware; and memory hardware in communication with the data processing hardware and storing instructions that when executed by the data processing hardware cause the data processing hardware to perform operations comprising: receiving audio data corresponding to a spoken utterance; encoding, by an initial stack of multi-head attention layers, the audio data to compute shared activations: while receiving the audio data corresponding to the spoken utterance: encoding, by a final stack of multi-head attention layers while applying a first look ahead audio context, the shared activations to compute low latency activations; and decoding the low latency activations into partial speech recognition results for the spoken utterance; and after the audio data corresponding to the spoken utterance is received: encoding, by the final stack of multi-head attention layers while applying a second look ahead audio context, the shared activations to compute high latency activations; and decoding the high latency activations into a final speech recognition result for the spoken utterance, wherein: the initial stack of multi-head attention layers are trained with zero look ahead audio context; and the final stack of multi-head attention layers are trained with variable look ahead audio context.
11. The system of claim 10, wherein the operations further comprise, while receiving the audio data corresponding to the spoken utterance, streaming the partial speech recognition results for the spoken utterance.
12. The system of claim 11, wherein the operations further comprise, after the audio data corresponding to the spoken utterance is received, replacing the streamed partial speech recognition results with the final speech recognition result.
13. The system of claim 10, wherein during training, the variable look ahead audio context is uniformly sampled for each multi-head attention layer in the final stack of multi-head attention layers.
14. The system of claim 10, wherein the initial stack of multi-head attention layers comprises more multi-head attention layers than the final stack of multi-head attention layers.
15. The system of claim 10, wherein the first look ahead audio context comprises zero look ahead audio context.
16. The system of claim 10, wherein a low latency decoding branch decodes the low latency activations into the partial speech recognition results for the spoken utterance in parallel with a high latency decoding branch decoding the high latency activations into the final speech recognition result for the spoken utterance.
17. The system of claim 10, wherein the final speech recognition result decoded from the shared activations is delayed from the partial speech recognition results decoded from the shared activations by a duration based on a difference between the second look ahead audio context and the first look ahead audio context.
18. The system of claim 10, wherein the operations further comprise: receiving an application identifier indicating a type of application the spoken utterance is directed toward; and setting a duration of the second look ahead audio context based on the application identifier.
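The flow in claims 1, 4, 8, and 9 can be sketched roughly as follows. This is a toy illustration under stated assumptions, not the claimed implementation: simple averaging stands in for the attention stacks, and `FRAME_MS`, the application names, the look-ahead values, and every function name are hypothetical. The claims only say the delay is "based on" the look-ahead difference; multiplying by a frame duration is one assumed reading.

```python
import random
from typing import Dict, List, Tuple

FRAME_MS = 30  # hypothetical frame duration; the page gives no value


def initial_stack(frames: List[float]) -> List[float]:
    """Toy stand-in for the zero-look-ahead stack: a causal running mean."""
    out, total = [], 0.0
    for i, x in enumerate(frames, start=1):
        total += x
        out.append(total / i)
    return out


def final_stack(shared: List[float], lookahead_frames: int) -> List[float]:
    """Toy stand-in for the final stack: mean over the past plus `lookahead_frames` future frames."""
    out = []
    for t in range(len(shared)):
        window = shared[: min(len(shared), t + lookahead_frames + 1)]
        out.append(sum(window) / len(window))
    return out


def two_pass(frames: List[float], first: int, second: int) -> Tuple[List[float], List[float]]:
    """Compute shared activations once, then reuse them with two look-ahead settings (claim 1)."""
    shared = initial_stack(frames)
    return final_stack(shared, first), final_stack(shared, second)


def sample_final_stack_lookahead(num_layers: int, max_frames: int, rng: random.Random) -> List[int]:
    """Draw one look-ahead per final-stack layer, uniformly, as a training-time sketch (claim 4)."""
    return [rng.randint(0, max_frames) for _ in range(num_layers)]


def final_result_delay_ms(first: int, second: int, frame_ms: int = FRAME_MS) -> int:
    """Assumed reading of claim 8: delay grows with the look-ahead difference."""
    return (second - first) * frame_ms


# Hypothetical mapping from application identifier to second look-ahead (claim 9).
APP_LOOKAHEAD_FRAMES: Dict[str, int] = {"voice_search": 8, "captioning": 80}


def second_lookahead_for(app_id: str, default: int = 40) -> int:
    return APP_LOOKAHEAD_FRAMES.get(app_id, default)


low, high = two_pass([1.0, 2.0, 3.0, 4.0], first=0, second=2)
sampled = sample_final_stack_lookahead(num_layers=3, max_frames=10, rng=random.Random(0))
delay = final_result_delay_ms(0, second_lookahead_for("voice_search"))
```

The point of the sketch is that `initial_stack` runs once per utterance, so the low-latency and high-latency branches differ only in the look-ahead passed to `final_stack`.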

Assignees

Google LLC

Inventors

Classifications

G06N3/0499
Feedforward networks · CPC title
G06N3/09
Supervised learning · CPC title
G06N3/0455
Auto-encoder networks; Encoder-decoder networks · CPC title
G10L15/30
Distributed recognition, e.g. in client-server systems, for mobile phones or network applications · CPC title
G10L15/22
Procedures used during a speech recognition process, e.g. man-machine dialogue · CPC title

Patent family

Related publications grouped by family.

View patent family 75539917

External sources

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US12254869B2 cover?: A transformer-transducer model for unifying streaming and non-streaming speech recognition includes an audio encoder, a label encoder, and a joint network. The audio encoder receives a sequence of acoustic frames, and generates, at each of a plurality of time steps, a higher order feature representation for a corresponding acoustic frame. The label encoder receives a sequence of non-blank symbo…
Who is the assignee on this patent?: Google Llc

Related patents

Method and apparatus with speech processing

Pre-Training With Alignments For Recurrent Neural Network Transducer Based End-To-End Speech Recognition

System and Method for Streaming end-to-end Speech Recognition with Asynchronous Decoders

Multi-task training architecture and strategy for attention-based speech recognition system

End-to-end speech recognition
