Reducing streaming ASR model delay with self alignment

US12057124B2 · US · B2

Patent metadata
Publication number: US-12057124-B2
Application number: US-202117644377-A
Country: US
Kind code: B2
Filing date: Dec 15, 2021
Priority date: Mar 26, 2021
Publication date: Aug 6, 2024
Grant date: Aug 6, 2024

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

A streaming speech recognition model includes an audio encoder configured to receive a sequence of acoustic frames and generate a higher order feature representation for a corresponding acoustic frame in the sequence of acoustic frames. The streaming speech recognition model also includes a label encoder configured to receive a sequence of non-blank symbols output by a final softmax layer and generate a dense representation. The streaming speech recognition model also includes a joint network configured to receive the higher order feature representation generated by the audio encoder and the dense representation generated by the label encoder and generate a probability distribution over possible speech recognition hypotheses. Here, the streaming speech recognition model is trained using self-alignment to reduce prediction delay by encouraging an alignment path that is one frame left from a reference forced-alignment frame.
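The data flow in the abstract can be illustrated with a minimal sketch. It assumes a joint network that adds the two input vectors, applies tanh, projects to the output vocabulary, and normalizes with a softmax. That particular combination, the function names, the toy dimensions, and the weights are hypothetical and are not taken from the patent; the audio encoder and label encoder outputs are represented by fixed stand-in vectors.

```python
import math


def softmax(logits):
    """Turn raw scores into a probability distribution."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def joint_network(audio_feature, label_feature, output_weights):
    """Combine one audio-encoder vector with one label-encoder vector.

    Assumed form (not from the patent): elementwise sum, tanh,
    linear projection to the vocabulary, then softmax.
    """
    hidden = [math.tanh(a + b) for a, b in zip(audio_feature, label_feature)]
    logits = [sum(w * h for w, h in zip(row, hidden)) for row in output_weights]
    return softmax(logits)


# Toy setup: hidden size 2, vocabulary of 3 symbols (blank, "a", "b").
audio_feature = [0.5, -0.2]   # stand-in for the higher order feature representation
label_feature = [0.1, 0.3]    # stand-in for the dense representation of the label history
output_weights = [[1.0, 0.0], [0.0, 1.0], [-1.0, 1.0]]  # one row per output symbol

distribution = joint_network(audio_feature, label_feature, output_weights)
```

In a streaming decoder this function would be evaluated once per time step, with the label feature recomputed whenever a non-blank symbol is emitted.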

First claim

Opening claim text (preview).

What is claimed is:

1. A system comprising data processing hardware and memory hardware in communication with the data processing hardware, the memory hardware storing instructions that cause the data processing hardware to execute a streaming speech recognition model, the speech recognition model comprising: an audio encoder configured to: receive, as input, a sequence of acoustic frames characterizing an utterance; and generate, at each of a plurality of time steps, a higher order feature representation for a corresponding acoustic frame in the sequence of acoustic frames; a label encoder configured to: receive, as input, a sequence of non-blank symbols output by a final softmax layer; and generate, at each of the plurality of time steps, a dense representation; and a joint network configured to: receive, as input, the higher order feature representation generated by the audio encoder at each of the plurality of time steps and the dense representation generated by the label encoder at each of the plurality of time steps; and generate, at each of the plurality of time steps, a probability distribution over possible speech recognition hypotheses at the corresponding time step, wherein the streaming speech recognition model is trained using self alignment to reduce prediction delay by: obtaining, using a decoding graph, a speech recognition result for the utterance based on the probability distribution over possible speech recognition hypotheses generated by the joint network at each of the plurality of time steps; obtaining, from the decoding graph, a reference-forced alignment path comprising reference forced-alignment frames; identifying, from the decoding graph, one frame to the left from each reference forced-alignment frame in the reference-forced alignment path; summing label transition probabilities based on the identified frames to the left from each forced-alignment frame in the reference-forced alignment path; and updating the streaming speech recognition model based on the summing of the label transition probabilities.

2. The system of claim 1, wherein the streaming speech recognition model comprises a Transformer-Transducer model.

3. The system of claim 2, wherein the audio encoder comprises a stack of transformer layers, each transformer layer comprising: a normalization layer; a masked multi-head attention layer with relative position encoding; residual connections; a stacking/unstacking layer; and a feedforward layer.

4. The system of claim 3, wherein the stacking/unstacking layer is configured to change a frame rate of the corresponding transformer layer to adjust processing time by the Transformer-Transducer model during training and inference.

5. The system of claim 2, wherein the label encoder comprises a stack of transformer layers, each transformer layer comprising: a normalization layer; a masked multi-head attention layer with relative position encoding; residual connections; a stacking/unstacking layer; and a feedforward layer.

6. The system of claim 1, wherein the label encoder comprises a bigram embedding lookup decoder model.

7. The system of claim 1, wherein the streaming speech recognition model comprises one of: a recurrent neural-transducer (RNN-T) model; a Transformer-Transducer model; a Convolutional Network-Transducer (ConvNet-Transducer) model; or a Conformer-Transducer model.

8. The system of claim 1, wherein training the streaming speech recognition model using self alignment to reduce prediction delay comprises using self alignment without using any external aligner model to constrain alignment of F.

9. The system of claim 1, wherein the streaming speech recognition model executes on a user device or a server.

10. The system of claim 1, wherein each acoustic frame in the sequence of acoustic frames comprises a dimensional feature vector.

11. A computer-implemented method when executed on data processing hardware causes the data processing hardware to perform operations for training a streaming speech recognition model using self alignment to reduce prediction delay, the operations comprising: receiving, as input to the streaming speech recognition model, a sequence of acoustic frames corresponding to an utterance, the streaming speech recognition model configured to learn an alignment probability between the sequence of acoustic frames and an output sequence of label tokens; generating, as output from the streaming speech recognition model, using a decoding graph, a speech recognition result for the utterance, the speech recognition result comprising the output sequence of label tokens; generating a speech recognition model loss based on the speech recognition result and a ground-truth transcription of the utterance; obtaining, from the decoding graph, a reference-forced alignment path comprising reference forced-alignment frames; identifying, from the decoding graph, one frame to the left from each reference forced-alignment frame in the reference-forced alignment path; summing label transition probabilities based on the identified frames to the left from each forced-alignment frame in the reference-forced alignment path; and updating the streaming speech recognition model based on the summing of the label transition probabilities and the speech recognition model loss.

12. The computer-implemented method of claim 11, wherein the operations further comprise: generating, by an audio encoder of the streaming speech recognition model, at each of a plurality of time steps, a higher order feature representation for a corresponding acoustic frame in the sequence of acoustic frames; receiving, as input to a label encoder of the streaming speech recognition model, a sequence of non-blank symbols output by a final softmax layer; generating, by the label encoder, at each of the plurality of time steps, a dense representation; receiving, as input to a joint network of the streaming speech recognition model, the higher order feature representation generated by the audio encoder at each of the plurality of time steps and the dense representation generated by the label encoder at each of the plurality of time steps; and generating, by the joint network, at each of the plurality of time steps, a probability distribution over possible speech recognition hypotheses at the corresponding time step.

13. The computer-implemented method of claim 12, wherein the label encoder comprises a stack of transformer layers, each transformer layer comprising: a normalization layer; a masked multi-head attention layer with relative position encoding; residual connections; a stacking/unstacking layer; and a feedforward layer.

14. The computer-implemented method of claim 12, wherein the label encoder comprises a bigram embedding lookup decoder model.

15. The computer-implemented method of claim 12, wherein the streaming speech recognition model comprises a Transformer-Transducer model.

16. The computer-implemented method of claim 15, wherein the audio encoder comprises a stack of transformer layers, each transformer layer comprising: a normalization layer; a masked multi-head attention layer with relative position encoding; residual connections; a stacking/unstacking layer; and a feedforward layer.

17. The computer-implemented method of claim 16, wherein the stacking/unstacking layer is configured to change a frame rate of the corresponding transformer layer to adjust processing time by the Transformer-Transducer model during training and inference.

18. The
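The training steps recited in claims 1 and 11 (take the reference forced-alignment frames, step one frame to the left of each, sum the label transition probabilities there, and update the model using that sum together with the model loss) can be sketched roughly as follows. This is an illustration under stated assumptions, not the claimed implementation: the probabilities are assumed to be stored as log-probabilities in a frame-by-label table, the sum is assumed to be subtracted from the model loss with a hypothetical `weight`, and a label aligned to frame 0 is simply skipped. All function and variable names are invented for this example.

```python
import math


def self_alignment_term(label_log_probs, forced_alignment_frames):
    """Sum label-transition log-probabilities one frame left of the reference path.

    label_log_probs[t][u]: log-probability of emitting label u at frame t,
        given that the first u labels have already been emitted.
    forced_alignment_frames[u]: frame at which the reference
        forced-alignment path emits label u.
    """
    total = 0.0
    for u, frame in enumerate(forced_alignment_frames):
        left_frame = frame - 1
        if left_frame < 0:
            continue  # assumed handling: no frame exists to the left of frame 0
        total += label_log_probs[left_frame][u]
    return total


def combined_loss(model_loss, label_log_probs, forced_alignment_frames, weight=0.1):
    """Assumed combination: lower loss when the left-shifted path is more probable."""
    term = self_alignment_term(label_log_probs, forced_alignment_frames)
    return model_loss - weight * term


# Toy example: 4 frames, 2 labels; the reference path emits label 0 at
# frame 2 and label 1 at frame 3 (hypothetical values).
label_log_probs = [
    [math.log(0.10), math.log(0.10)],
    [math.log(0.20), math.log(0.10)],
    [math.log(0.60), math.log(0.25)],
    [math.log(0.10), math.log(0.70)],
]
forced_alignment_frames = [2, 3]

term = self_alignment_term(label_log_probs, forced_alignment_frames)
loss = combined_loss(2.0, label_log_probs, forced_alignment_frames)
```

Here the term reads the entries at (frame 1, label 0) and (frame 2, label 1). Raising those probabilities lowers the combined loss, which is one way to encourage the model to emit each label one frame earlier than the reference path.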

Assignees

Google LLC

Inventors

Classifications

  • using artificial neural networks · CPC title

  • G10L15/26 · Primary

    Speech to text systems (G10L15/08 takes precedence) · CPC title

  • G10L15/063 · Primary

    Training · CPC title

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US12057124B2 cover?
A streaming speech recognition model includes an audio encoder configured to receive a sequence of acoustic frames and generate a higher order feature representation for a corresponding acoustic frame in the sequence of acoustic frames. The streaming speech recognition model also includes a label encoder configured to receive a sequence of non-blank symbols output by a final softmax layer and g…
Who is the assignee on this patent?
Google LLC
What technology area does this patent fall under?
Primary CPC classification G10L15/26. Mapped technology areas include Physics.
When was this patent published?
Publication date: Aug 6, 2024 (kind code B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 1 related publication on this page (citations in our corpus or others sharing the same primary CPC).