Fast emit low-latency streaming ASR with sequence-level emission regularization utilizing forward and backward probabilities between nodes of an alignment lattice

US12094453B2 · US · B2

Patent metadata
Field               Value
Publication number  US-12094453-B2
Application number  US-202117447285-A
Country             US
Kind code           B2
Filing date         Sep 9, 2021
Priority date       Oct 20, 2020
Publication date    Sep 17, 2024
Grant date          Sep 17, 2024

Abstract

Official abstract text for this publication.

A computer-implemented method of training a streaming speech recognition model that includes receiving, as input to the streaming speech recognition model, a sequence of acoustic frames. The streaming speech recognition model is configured to learn an alignment probability between the sequence of acoustic frames and an output sequence of vocabulary tokens. The vocabulary tokens include a plurality of label tokens and a blank token. At each output step, the method includes determining a first probability of emitting one of the label tokens and determining a second probability of emitting the blank token. The method also includes generating the alignment probability at a sequence level based on the first probability and the second probability. The method also includes applying a tuning parameter to the alignment probability at the sequence level to maximize the first probability of emitting one of the label tokens.
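The abstract's first probability, second probability, and tuning parameter can be written out in the usual transducer notation. The block below is a sketch under assumptions, not text from the patent: the symbols \hat{y}(t,u) (label probability), b(t,u) (blank probability), \alpha and \beta (forward and backward variables), \lambda (tuning parameter), and the indexing 1 \le t \le T, 0 \le u \le U are all introduced here for illustration.

```latex
% Forward and backward variables on the alignment lattice (assumed notation),
% with boundary conditions \alpha(1,0) = 1 and \beta(T,U) = b(T,U).
\begin{align*}
\alpha(t,u) &= \hat{y}(t,u-1)\,\alpha(t,u-1) + b(t-1,u)\,\alpha(t-1,u) \\
\beta(t,u)  &= \hat{y}(t,u)\,\beta(t,u+1) + b(t,u)\,\beta(t+1,u)
\end{align*}

% Sequence-level alignment probability: every alignment crosses each
% anti-diagonal t + u = n exactly once, so for any such n
\begin{align*}
P(y \mid x) &= \sum_{(t,u):\,t+u=n} \alpha(t,u)\,\beta(t,u) \\
            &= \sum_{(t,u):\,t+u=n} \alpha(t,u)\,
               \bigl[\, b(t,u)\,\beta(t+1,u) + \hat{y}(t,u)\,\beta(t,u+1) \bigr]
\end{align*}

% One way to apply the tuning parameter (an assumed form): up-weight only
% the label-emission term and leave the blank term untouched.
\begin{align*}
\tilde{P}_n &= \sum_{(t,u):\,t+u=n} \alpha(t,u)\,
               \bigl[\, b(t,u)\,\beta(t+1,u)
               + (1+\lambda)\,\hat{y}(t,u)\,\beta(t,u+1) \bigr]
\end{align*}
```

Under this reading, \lambda = 0 recovers the ordinary transducer likelihood, and \lambda > 0 rewards label emission at every node without using any speech-word alignment information.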

First claim

Opening claim text (preview).

What is claimed is:

1. A computer-implemented method when executed on data processing hardware causes the data processing hardware to perform operations for training a streaming speech recognition model, the operations comprising:
  receiving, as input to the streaming speech recognition model, a sequence of acoustic frames, the streaming speech recognition model configured to learn an alignment probability between the sequence of acoustic frames and an output sequence of vocabulary tokens, the vocabulary tokens comprising a plurality of label tokens and a blank token;
  generating an alignment lattice comprising a plurality of nodes, the alignment lattice defined as a matrix with T columns of nodes and U rows of nodes, each column of the T columns corresponding to a corresponding step of the plurality of output steps, each row of the U rows corresponding to a label that textually represents the sequence of acoustic frames;
  at each node location in the matrix of the alignment lattice:
    determining a forward probability for predicting a subsequent node adjacent to the respective node; and
    determining, from the subsequent node adjacent to the respective node, a backward probability of including the respective subsequent node in an output sequence of vocabulary tokens;
  at each step of a plurality of output steps:
    determining a first probability of emitting one of the label tokens; and
    determining a second probability of emitting the blank token, wherein the forward probability comprises the first probability and the second probability;
  generating the alignment probability at a sequence level based on the first probability of emitting one of the label tokens and the second probability of emitting the blank token at each output step; and
  applying a tuning parameter to the alignment probability at the sequence level to maximize the first probability of emitting one of the label tokens, the tuning parameter applied to the alignment probability independent of any speech-word alignment information.

2. The computer-implemented method of claim 1, wherein the first probability of emitting one of the label tokens at a respective step corresponds to a probability of emitting one of the label tokens after previously emitting a respective label token.

3. The computer-implemented method of claim 1, wherein the second probability of emitting the blank token at a respective step corresponds to a probability of emitting the blank token after emitting one of the blank token or a label token at a step immediately preceding the respective step.

4. The computer-implemented method of claim 1, wherein the first probability and the second probability define a forward variable of a forward-backward propagation algorithm.

5. The computer-implemented method of claim 1, wherein generating the alignment probability at the sequence level comprises aggregating the forward probability and the backward probability for all nodes at each respective step of the alignment lattice.

6. The computer-implemented method of claim 1, wherein applying the tuning parameter to the alignment probability at the sequence level balances a loss at the streaming speech recognition model and a regularization loss when training the streaming speech recognition model.

7. The computer-implemented method of claim 1, wherein emission of the blank token at one of the output steps is not penalized.

8. The computer-implemented method of claim 1, wherein the streaming speech recognition model comprises at least one of: a recurrent neural-transducer (RNN-T) model; a Transformer-Transducer model; a Convolutional Network-Transducer (ConvNet-Transducer) model; or a Conformer-Transducer model.

9. The computer-implemented method of claim 1, wherein the streaming speech recognition model comprises a recurrent neural-transducer (RNN-T) model.

10. The computer-implemented method of claim 1, wherein the streaming speech recognition model comprises a Conformer-Transducer model.

11. The computer-implemented method of claim 1, wherein after training the streaming speech recognition model, the trained streaming speech recognition model executes on a user device to transcribe speech in a streaming fashion.

12. The computer-implemented method of claim 1, wherein, after training the streaming speech recognition model, the trained streaming speech recognition model executes on a server.

13. A system of training a streaming speech recognition model, the system comprising:
  data processing hardware; and
  memory hardware in communication with the data processing hardware, the memory hardware storing instructions that when executed by the data processing hardware cause the data processing hardware to perform operations comprising:
    receiving, as input to the speech recognition model, a sequence of acoustic frames, the streaming speech recognition model configured to learn an alignment probability between the sequence of acoustic frames and an output sequence of vocabulary tokens, the vocabulary tokens comprising a plurality of label tokens and a blank token;
    generating an alignment lattice comprising a plurality of nodes, the alignment lattice defined as a matrix with T columns of nodes and U rows of nodes, each column of the T columns corresponding to a corresponding step of the plurality of output steps, each row of the U rows corresponding to a label that textually represents the sequence of acoustic frames;
    at each node location in the matrix of the alignment lattice:
      determining a forward probability for predicting a subsequent node adjacent to the respective node; and
      determining, from the subsequent node adjacent to the respective node, a backward probability of including the respective subsequent node in an output sequence of vocabulary tokens;
    at each step of a plurality of output steps:
      determining a first probability of emitting one of the label tokens; and
      determining a second probability of emitting the blank token, wherein the forward probability comprises the first probability and the second probability;
    generating the alignment probability at a sequence level based on the first probability of emitting one of the label tokens and the second probability of emitting the blank token at each output step; and
    applying a tuning parameter to the alignment probability at the sequence level to maximize the first probability of emitting one of the label tokens, the tuning parameter applied to the alignment probability independent of any speech-word alignment information.

14. The system of claim 13, wherein the first probability of emitting one of the label tokens at a respective step corresponds to a probability of emitting one of the label tokens after previously emitting a respective label token.

15. The system of claim 13, wherein the second probability of emitting the blank token at a respective step corresponds to a probability of emitting the blank label after emitting one of the blank label or a label token at a step immediately preceding the respective step.

16. The system of claim 13, wherein the first probability and the second probability define a forward variable of a forward-backward propagation algorithm.

17. The system of claim 13, wherein generating the alignment probability at the sequence level comprises aggregating the forward probability and the backward probability for all nodes at each respective step of the alignment lattice.

18. The system of claim 13, wherein applying the tuning parameter to the alignment probability at the sequence level balances a loss at the streaming spee…

Assignees

Google LLC

Classifications

  • Convolutional networks [CNN, ConvNet] · CPC title

  • Supervised learning · CPC title

  • characterised by memory or gating, e.g. long short-term memory [LSTM] or gated recurrent units [GRU] · CPC title

  • Phonemic context, e.g. pronunciation rules, phonotactical constraints or phoneme n-grams · CPC title

  • Distributed recognition, e.g. in client-server systems, for mobile phones or network applications · CPC title

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US12094453B2 cover?
A computer-implemented method of training a streaming speech recognition model that includes receiving, as input to the streaming speech recognition model, a sequence of acoustic frames. The streaming speech recognition model is configured to learn an alignment probability between the sequence of acoustic frames and an output sequence of vocabulary tokens. The vocabulary tokens include a plural…
Who is the assignee on this patent?
Google LLC
What technology area does this patent fall under?
Primary CPC classification G10L15/063. Mapped technology areas include Physics.
When was this patent published?
Publication date: Sep 17, 2024 (kind code B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 5 related publications on this page (citations in our corpus or others sharing the same primary CPC).