What technology area does this patent fall under?

Primary CPC classification G10L15/26. Mapped technology areas include Physics.

When was this patent published?

Publication date Aug 6, 2024 (B2). Legal status and post-grant events are not shown on this page.

What related patents are in patentsdb?

We list 1 related publication on this page (citations in our corpus or others sharing the same primary CPC).

Reducing streaming ASR model delay with self alignment

US12057124B2 · US · B2

Patent metadata
Field	Value
Publication number	US-12057124-B2
Application number	US-202117644377-A
Country	US
Kind code	B2
Filing date	Dec 15, 2021
Priority date	Mar 26, 2021
Publication date	Aug 6, 2024
Grant date	Aug 6, 2024

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

Title: What the patent document calls the invention.
Abstract: A short plain-language summary of the technical disclosure.
Assignees and inventors: Who owns or filed the patent and who is credited as inventor.
Key dates: Filing, priority, publication, and grant dates set the timeline.
First independent claim: The legal scope of protection; read this for what is actually claimed.
CPC / IPC classifications: Technology tags used to group this patent with similar filings.
Citations and related patents: Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

A streaming speech recognition model includes an audio encoder configured to receive a sequence of acoustic frames and generate a higher order feature representation for a corresponding acoustic frame in the sequence of acoustic frames. The streaming speech recognition model also includes a label encoder configured to receive a sequence of non-blank symbols output by a final softmax layer and generate a dense representation. The streaming speech recognition model also includes a joint network configured to receive the higher order feature representation generated by the audio encoder and the dense representation generated by the label encoder and generate a probability distribution over possible speech recognition hypotheses. Here, the streaming speech recognition model is trained using self-alignment to reduce prediction delay by encouraging an alignment path that is one frame left from a reference forced-alignment frame.
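To make the abstract's three components concrete, here is a minimal sketch of how a joint network might turn one audio-encoder vector and one label-encoder vector into a probability distribution over output symbols. The additive combination, the `tanh` non-linearity, the toy dimensions, and all names (`joint_network`, `weights`, `bias`) are illustrative assumptions; the abstract does not specify them.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def joint_network(audio_feat, label_feat, weights, bias):
    """Combine one audio-encoder vector with one label-encoder vector.

    Assumption: the two vectors share a dimension and are combined by
    addition followed by tanh and a linear projection to symbol logits.
    """
    hidden = [math.tanh(a + b) for a, b in zip(audio_feat, label_feat)]
    logits = [
        sum(w * h for w, h in zip(row, hidden)) + b
        for row, b in zip(weights, bias)
    ]
    return softmax(logits)

# Toy inputs: 3-dimensional features, 4 output symbols (index 0 = blank).
audio_feat = [0.5, -0.2, 0.1]   # higher order feature for one acoustic frame
label_feat = [0.1, 0.3, -0.4]   # dense representation of the label history
weights = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
    [0.5, 0.5, 0.5],
]
bias = [0.0, 0.0, 0.0, 0.0]

probs = joint_network(audio_feat, label_feat, weights, bias)
```

In a real transducer this computation would run for every pair of time step and label position; the sketch shows a single pair only.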

First claim

Opening claim text (preview).

What is claimed is:
1. A system comprising data processing hardware and memory hardware in communication with the data processing hardware, the memory hardware storing instructions that cause the data processing hardware to execute a streaming speech recognition model, the speech recognition model comprising: an audio encoder configured to: receive, as input, a sequence of acoustic frames characterizing an utterance; and generate, at each of a plurality of time steps, a higher order feature representation for a corresponding acoustic frame in the sequence of acoustic frames; a label encoder configured to: receive, as input, a sequence of non-blank symbols output by a final softmax layer; and generate, at each of the plurality of time steps, a dense representation; and a joint network configured to: receive, as input, the higher order feature representation generated by the audio encoder at each of the plurality of time steps and the dense representation generated by the label encoder at each of the plurality of time steps; and generate, at each of the plurality of time steps, a probability distribution over possible speech recognition hypotheses at the corresponding time step, wherein the streaming speech recognition model is trained using self alignment to reduce prediction delay by: obtaining, using a decoding graph, a speech recognition result for the utterance based on the probability distribution over possible speech recognition hypotheses generated by the joint network at each of the plurality of time steps; obtaining, from the decoding graph, a reference-forced alignment path comprising reference forced-alignment frames; identifying, from the decoding graph, one frame to the left from each reference forced-alignment frame in the reference-forced alignment path; summing label transition probabilities based on the identified frames to the left from each forced-alignment frame in the reference-forced alignment path; and updating the streaming speech recognition model based on the summing of the label transition probabilities.
2. The system of claim 1, wherein the streaming speech recognition model comprises a Transformer-Transducer model.
3. The system of claim 2, wherein the audio encoder comprises a stack of transformer layers, each transformer layer comprising: a normalization layer; a masked multi-head attention layer with relative position encoding; residual connections; a stacking/unstacking layer; and a feedforward layer.
4. The system of claim 3, wherein the stacking/unstacking layer is configured to change a frame rate of the corresponding transformer layer to adjust processing time by the Transformer-Transducer model during training and inference.
5. The system of claim 2, wherein the label encoder comprises a stack of transformer layers, each transformer layer comprising: a normalization layer; a masked multi-head attention layer with relative position encoding; residual connections; a stacking/unstacking layer; and a feedforward layer.
6. The system of claim 1, wherein the label encoder comprises a bigram embedding lookup decoder model.
7. The system of claim 1, wherein the streaming speech recognition model comprises one of: a recurrent neural-transducer (RNN-T) model; a Transformer-Transducer model; a Convolutional Network-Transducer (ConvNet-Transducer) model; or a Conformer-Transducer model.
8. The system of claim 1, wherein training the streaming speech recognition model using self alignment to reduce prediction delay comprises using self alignment without using any external aligner model to constrain alignment of F.
9. The system of claim 1, wherein the streaming speech recognition model executes on a user device or a server.
10. The system of claim 1, wherein each acoustic frame in the sequence of acoustic frames comprises a dimensional feature vector.
11. A computer-implemented method when executed on data processing hardware causes the data processing hardware to perform operations for training a streaming speech recognition model using self alignment to reduce prediction delay, the operations comprising: receiving, as input to the streaming speech recognition model, a sequence of acoustic frames corresponding to an utterance, the streaming speech recognition model configured to learn an alignment probability between the sequence of acoustic frames and an output sequence of label tokens; generating, as output from the streaming speech recognition model, using a decoding graph, a speech recognition result for the utterance, the speech recognition result comprising the output sequence of label tokens; generating a speech recognition model loss based on the speech recognition result and a ground-truth transcription of the utterance; obtaining, from the decoding graph, a reference-forced alignment path comprising reference forced-alignment frames; identifying, from the decoding graph, one frame to the left from each reference forced-alignment frame in the reference-forced alignment path; summing label transition probabilities based on the identified frames to the left from each forced-alignment frame in the reference-forced alignment path; and updating the streaming speech recognition model based on the summing of the label transition probabilities and the speech recognition model loss.
12. The computer-implemented method of claim 11, wherein the operations further comprise: generating, by an audio encoder of the streaming speech recognition model, at each of a plurality of time steps, a higher order feature representation for a corresponding acoustic frame in the sequence of acoustic frames; receiving, as input to a label encoder of the streaming speech recognition model, a sequence of non-blank symbols output by a final softmax layer; generating, by the label encoder, at each of the plurality of time steps, a dense representation; receiving, as input to a joint network of the streaming speech recognition model, the higher order feature representation generated by the audio encoder at each of the plurality of time steps and the dense representation generated by the label encoder at each of the plurality of time steps; and generating, by the joint network, at each of the plurality of time steps, a probability distribution over possible speech recognition hypotheses at the corresponding time step.
13. The computer-implemented method of claim 12, wherein the label encoder comprises a stack of transformer layers, each transformer layer comprising: a normalization layer; a masked multi-head attention layer with relative position encoding; residual connections; a stacking/unstacking layer; and a feedforward layer.
14. The computer-implemented method of claim 12, wherein the label encoder comprises a bigram embedding lookup decoder model.
15. The computer-implemented method of claim 12, wherein the streaming speech recognition model comprises a Transformer-Transducer model.
16. The computer-implemented method of claim 15, wherein the audio encoder comprises a stack of transformer layers, each transformer layer comprising: a normalization layer; a masked multi-head attention layer with relative position encoding; residual connections; a stacking/unstacking layer; and a feedforward layer.
17. The computer-implemented method of claim 16, wherein the stacking/unstacking layer is configured to change a frame rate of the corresponding transformer layer to adjust processing time by the Transformer-Transducer model during training and inference.
18. The
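The self-alignment steps recited in claims 1 and 11 (take the reference forced-alignment frames, look one frame to the left of each, sum the label transition probabilities there, and combine the result with the recognition loss) can be sketched roughly as follows. This is an illustration under stated assumptions, not the patented implementation: the forced-alignment frames are passed in rather than extracted from a decoding graph, probabilities are summed in the log domain, labels aligned to frame 0 are skipped because no frame exists to their left, and the `weight` factor and all function names are hypothetical.

```python
import math

def self_alignment_term(label_log_probs, forced_frames):
    """Negative sum of label log-probabilities one frame left of the alignment.

    label_log_probs[t][u]: log-probability of emitting label u at frame t
        (assumed layout of the transducer lattice).
    forced_frames[u]: frame at which label u is emitted on the reference
        forced-alignment path (assumed to be given).
    """
    total = 0.0
    for u, frame in enumerate(forced_frames):
        left = frame - 1
        if left < 0:
            continue  # assumption: skip labels with no frame to their left
        total += label_log_probs[left][u]
    return -total

def combined_loss(asr_loss, label_log_probs, forced_frames, weight=0.1):
    """Recognition loss plus a weighted self-alignment term.

    The weighting scheme is an assumption; the claims only state that the
    update is based on both quantities.
    """
    return asr_loss + weight * self_alignment_term(label_log_probs, forced_frames)

# Toy lattice: 4 frames, 2 labels.
label_log_probs = [
    [math.log(0.50), math.log(0.10)],
    [math.log(0.30), math.log(0.10)],
    [math.log(0.10), math.log(0.25)],
    [math.log(0.10), math.log(0.60)],
]
forced_frames = [1, 3]  # label 0 aligned to frame 1, label 1 to frame 3

term = self_alignment_term(label_log_probs, forced_frames)
loss = combined_loss(1.0, label_log_probs, forced_frames)
```

Minimizing this term raises the probability of emitting each label one frame earlier than the current alignment, which is how the training objective pushes predictions to the left and reduces delay.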

Assignees

Google LLC

Inventors

Classifications

G10L15/16
using artificial neural networks · CPC title
G10L15/26 · Primary
Speech to text systems (G10L15/08 takes precedence) · CPC title
G10L15/063 · Primary
Training · CPC title

Patent family

Related publications grouped by family.

View patent family 80168120

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US12057124B2 cover?: A streaming speech recognition model includes an audio encoder configured to receive a sequence of acoustic frames and generate a higher order feature representation for a corresponding acoustic frame in the sequence of acoustic frames. The streaming speech recognition model also includes a label encoder configured to receive a sequence of non-blank symbols output by a final softmax layer and g…
Who is the assignee on this patent?: Google LLC