Method and system for acoustic data selection for training the parameters of an acoustic model
US-2018114525-A1 · Apr 26, 2018 · US
US12057124B2 · US · B2
| Field | Value |
|---|---|
| Publication number | US-12057124-B2 |
| Application number | US-202117644377-A |
| Country | US |
| Kind code | B2 |
| Filing date | Dec 15, 2021 |
| Priority date | Mar 26, 2021 |
| Publication date | Aug 6, 2024 |
| Grant date | Aug 6, 2024 |
Official abstract text for this publication.
A streaming speech recognition model includes an audio encoder configured to receive a sequence of acoustic frames and generate a higher order feature representation for a corresponding acoustic frame in the sequence of acoustic frames. The streaming speech recognition model also includes a label encoder configured to receive a sequence of non-blank symbols output by a final softmax layer and generate a dense representation. The streaming speech recognition model also includes a joint network configured to receive the higher order feature representation generated by the audio encoder and the dense representation generated by the label encoder and generate a probability distribution over possible speech recognition hypotheses. Here, the streaming speech recognition model is trained using self-alignment to reduce prediction delay by encouraging an alignment path that is one frame left from a reference forced-alignment frame.
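The three components named in the abstract can be illustrated with a small sketch of the joint step alone. This is not the patent's implementation: the function names, the choice to combine the two representations by addition followed by `tanh`, and the toy weights are all assumptions made for illustration. It only shows how one higher order feature representation (from the audio encoder) and one dense representation (from the label encoder) could be turned into a probability distribution over output symbols.

```python
import math
from typing import List, Sequence


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def joint_network(
    audio_feature: Sequence[float],
    label_feature: Sequence[float],
    output_weights: Sequence[Sequence[float]],
) -> List[float]:
    """Combine one audio-encoder vector and one label-encoder vector.

    Assumption: the two vectors have the same size and are combined by
    element-wise addition followed by tanh, then projected to one logit
    per output symbol. The source does not specify the combination.
    """
    hidden = [math.tanh(a + g) for a, g in zip(audio_feature, label_feature)]
    logits = [sum(w * h for w, h in zip(row, hidden)) for row in output_weights]
    return softmax(logits)


# Toy example: 2-dimensional features, 3 output symbols (hypothetical values).
audio_feature = [0.2, -0.1]   # stand-in for the audio encoder output at one time step
label_feature = [0.0, 0.3]    # stand-in for the label encoder output
output_weights = [[0.5, -0.2], [0.1, 0.4], [-0.3, 0.3]]

probs = joint_network(audio_feature, label_feature, output_weights)
print(probs)
```

In a full transducer the same joint step would be evaluated for every pair of time step and label position, which is what gives the model a lattice of alignment paths to train over.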
Opening claim text (preview).
What is claimed is:
1. A system comprising data processing hardware and memory hardware in communication with the data processing hardware, the memory hardware storing instructions that cause the data processing hardware to execute a streaming speech recognition model, the speech recognition model comprising: an audio encoder configured to: receive, as input, a sequence of acoustic frames characterizing an utterance; and generate, at each of a plurality of time steps, a higher order feature representation for a corresponding acoustic frame in the sequence of acoustic frames; a label encoder configured to: receive, as input, a sequence of non-blank symbols output by a final softmax layer; and generate, at each of the plurality of time steps, a dense representation; and a joint network configured to: receive, as input, the higher order feature representation generated by the audio encoder at each of the plurality of time steps and the dense representation generated by the label encoder at each of the plurality of time steps; and generate, at each of the plurality of time steps, a probability distribution over possible speech recognition hypotheses at the corresponding time step, wherein the streaming speech recognition model is trained using self alignment to reduce prediction delay by; obtaining, using a decoding graph, a speech recognition result for the utterance based on the probability distribution over possible speech recognition hypotheses generated by the joint network at each of the plurality of time steps; obtaining, from the decoding graph, a reference-forced alignment path comprising reference forced-alignment frames; identifying, from the decoding graph, one frame to the left from each reference forced-alignment frame in the reference-forced alignment path; summing label transition probabilities based on the identified frames to the left from each forced-alignment frame in the reference-forced alignment path; and updating the streaming speech recognition model based on the summing of the label transition probabilities.
2. The system of claim 1, wherein the streaming speech recognition model comprises a Transformer-Transducer model.
3. The system of claim 2, wherein the audio encoder comprises a stack of transformer layers, each transformer layer comprising: a normalization layer; a masked multi-head attention layer with relative position encoding; residual connections; a stacking/unstacking layer; and a feedforward layer.
4. The system of claim 3, wherein the stacking/unstacking layer is configured to change a frame rate of the corresponding transformer layer to adjust processing time by the Transformer-Transducer model during training and inference.
5. The system of claim 2, wherein the label encoder comprises a stack of transformer layers, each transformer layer comprising: a normalization layer; a masked multi-head attention layer with relative position encoding; residual connections; a stacking/unstacking layer; and a feedforward layer.
6. The system of claim 1, wherein the label encoder comprises a bigram embedding lookup decoder model.
7. The system of claim 1, wherein the streaming speech recognition model comprises one of: a recurrent neural-transducer (RNN-T) model; a Transformer-Transducer model; a Convolutional Network-Transducer (ConvNet-Transducer) model; or a Conformer-Transducer model.
8. The system of claim 1, wherein training the streaming speech recognition model using self alignment to reduce prediction delay comprises using self alignment without using any external aligner model to constrain alignment of F.
9. The system of claim 1, wherein the streaming speech recognition model executes on a user device or a server.
10. The system of claim 1, wherein each acoustic frame in the sequence of acoustic frames comprises a dimensional feature vector.
11. A computer-implemented method when executed on data processing hardware causes the data processing hardware to perform operations for training a streaming speech recognition model using self alignment to reduce prediction delay, the operations comprising: receiving, as input to the streaming speech recognition model, a sequence of acoustic frames corresponding to an utterance, the streaming speech recognition model configured to learn an alignment probability between the sequence of acoustic frames and an output sequence of label tokens; generating, as output from the streaming speech recognition model, using a decoding graph, a speech recognition result for the utterance, the speech recognition result comprising the output sequence of label tokens; generating a speech recognition model loss based on the speech recognition result and a ground-truth transcription of the utterance; obtaining, from the decoding graph, a reference-forced alignment path comprising reference forced-alignment frames; identifying, from the decoding graph, one frame to the left from each reference forced-alignment frame in the reference-forced alignment path; summing label transition probabilities based on the identified frames to the left from each forced-alignment frame in the reference-forced alignment path; and updating the streaming speech recognition model based on the summing of the label transition probabilities and the speech recognition model loss.
12. The computer-implemented method of claim 11, wherein the operations further comprise: generating, by an audio encoder of the streaming speech recognition model, at each of a plurality of time steps, a higher order feature representation for a corresponding acoustic frame in the sequence of acoustic frames; receiving, as input to a label encoder of the streaming speech recognition model, a sequence of non-blank symbols output by a final softmax layer; generating, by the label encoder, at each of the plurality of time steps, a dense representation; receiving, as input to a joint network of the streaming speech recognition model, the higher order feature representation generated by the audio encoder at each of the plurality of time steps and the dense representation generated by the label encoder at each of the plurality of time steps; and generating, by the joint network, at each of the plurality of time steps, a probability distribution over possible speech recognition hypotheses at the corresponding time step.
13. The computer-implemented method of claim 12, wherein the label encoder comprises a stack of transformer layers, each transformer layer comprising: a normalization layer; a masked multi-head attention layer with relative position encoding; residual connections; a stacking/unstacking layer; and a feedforward layer.
14. The computer-implemented method of claim 12, wherein the label encoder comprises a bigram embedding lookup decoder model.
15. The computer-implemented method of claim 12, wherein the streaming speech recognition model comprises a Transformer-Transducer model.
16. The computer-implemented method of claim 15, wherein the audio encoder comprises a stack of transformer layers, each transformer layer comprising: a normalization layer; a masked multi-head attention layer with relative position encoding; residual connections; a stacking/unstacking layer; and a feedforward layer.
17. The computer-implemented method of claim 16, wherein the stacking/unstacking layer is configured to change a frame rate of the corresponding transformer layer to adjust processing time by the Transformer-Transducer model during training and inference.
18. The
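
The self-alignment steps recited in claims 1 and 11 (take the reference forced-alignment frames, step one frame to the left of each, sum the label transition probabilities there, and combine the result with the model loss) can be sketched roughly as below. This is a minimal illustration under stated assumptions, not the claimed implementation: the claims do not say how the sum enters the update, so summing log-probabilities, negating them into a penalty, the `weight` parameter, the clamping at frame 0, and every function and variable name here are hypothetical.

```python
import math
from typing import List, Sequence


def left_shifted_frames(forced_frames: Sequence[int]) -> List[int]:
    """Return the frame one step to the left of each forced-alignment frame.

    Assumption: a label already aligned to frame 0 stays at frame 0.
    """
    return [max(t - 1, 0) for t in forced_frames]


def self_alignment_loss(
    label_log_probs: Sequence[Sequence[float]],
    forced_frames: Sequence[int],
) -> float:
    """Negative sum of label-transition log-probabilities on the left-shifted path.

    label_log_probs[t][u] is the log-probability of emitting label u at frame t.
    forced_frames[u] is the frame where label u is emitted on the reference path.
    """
    left = left_shifted_frames(forced_frames)
    return -sum(label_log_probs[t][u] for u, t in enumerate(left))


def total_loss(
    model_loss: float,
    label_log_probs: Sequence[Sequence[float]],
    forced_frames: Sequence[int],
    weight: float = 0.1,  # hypothetical weighting factor, not from the source
) -> float:
    """Combine the speech recognition model loss with the self-alignment term."""
    return model_loss + weight * self_alignment_loss(label_log_probs, forced_frames)


# Toy example: 4 frames, 2 labels (hypothetical probabilities).
label_probs = [[0.1, 0.1], [0.5, 0.2], [0.8, 0.25], [0.3, 0.9]]
label_log_probs = [[math.log(p) for p in row] for row in label_probs]
forced_frames = [2, 3]  # label 0 emitted at frame 2, label 1 at frame 3

print(left_shifted_frames(forced_frames))  # [1, 2]
print(self_alignment_loss(label_log_probs, forced_frames))
```

Minimizing a term of this shape raises the probability of emitting each label one frame earlier than the current forced alignment, which is how the abstract describes the reduction in prediction delay.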
using artificial neural networks · CPC title
Speech to text systems (G10L15/08 takes precedence) · CPC title
Training · CPC title