Tied and Reduced RNN-T
US-2022310071-A1 · Sep 29, 2022 · US
US11727920B2 · US · B2
| Field | Value |
|---|---|
| Publication number | US-11727920-B2 |
| Application number | US-202117330446-A |
| Country | US |
| Kind code | B2 |
| Filing date | May 26, 2021 |
| Priority date | Mar 23, 2021 |
| Publication date | Aug 15, 2023 |
| Grant date | Aug 15, 2023 |
Official abstract text for this publication.
An RNN-T model includes a prediction network configured to, at each of a plurality of time steps subsequent to an initial time step, receive a sequence of non-blank symbols. For each non-blank symbol, the prediction network is also configured to generate, using a shared embedding matrix, an embedding of the corresponding non-blank symbol, assign a respective position vector to the corresponding non-blank symbol, and weight the embedding proportional to a similarity between the embedding and the respective position vector. The prediction network is also configured to generate a single embedding vector at the corresponding time step. The RNN-T model also includes a joint network configured to, at each of the plurality of time steps subsequent to the initial time step, receive the single embedding vector generated as output from the prediction network at the corresponding time step and generate a probability distribution over possible speech recognition hypotheses.
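The prediction-network step described in the abstract can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not the patented implementation: the function names `cosine_similarity` and `prediction_step` are hypothetical, cosine similarity is used because the claims name it as one option, combining the weighted embeddings with a plain mean is an assumption, and position index 0 is assumed to correspond to the first symbol in the input sequence.

```python
import math
from typing import List, Sequence

Vector = List[float]


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def prediction_step(
    symbol_ids: Sequence[int],
    embedding_matrix: Sequence[Vector],
    position_vectors: Sequence[Vector],
) -> Vector:
    """Collapse a history of non-blank symbols into a single embedding vector.

    Assumption: the weighted embeddings are combined with a plain mean.
    """
    if not symbol_ids:
        raise ValueError("expected at least one non-blank symbol")
    dim = len(embedding_matrix[0])
    total = [0.0] * dim
    for position, symbol in enumerate(symbol_ids):
        # Look up the symbol in the shared embedding matrix.
        embedding = embedding_matrix[symbol]
        # Weight the embedding by its similarity to this position's vector.
        weight = cosine_similarity(embedding, position_vectors[position])
        for i in range(dim):
            total[i] += weight * embedding[i]
    return [value / len(symbol_ids) for value in total]
```

Calling `prediction_step([0, 1], embedding_matrix, position_vectors)` with a two-row embedding matrix and two position vectors returns one vector with the same dimension as the embeddings, which is what the joint network would then consume alongside the encoder output.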
Opening claim text (preview).
What is claimed is:
1. A system comprising: data processing hardware; and memory hardware in communication with the data processing hardware and storing instructions that, when executed on the data processing hardware, cause the data processing hardware to perform operations, the operations comprising executing a recurrent neural network-transducer (RNN-T) model, the RNN-T model comprising: an audio encoder configured to: receive, as input, a sequence of acoustic frames; and generate, at each of a plurality of time steps, a higher order feature representation for a corresponding acoustic frame in the sequence of acoustic frames; a prediction network configured to, at each of the plurality of time steps subsequent to an initial time step: receive, as input, a sequence of non-blank symbols output by a final Softmax layer; for each non-blank symbol in the sequence of non-blank symbols received as input at the corresponding time step: generate, using a shared embedding matrix, an embedding of the corresponding non-blank symbol; assign a respective position vector to the corresponding non-blank symbol; and weight the embedding proportional to a similarity between the embedding and the respective position vector; and generate, as output, a single embedding vector at the corresponding time step, the single embedding vector based on a weighted average of the weighted embeddings; a joint network configured to, at each of the plurality of time steps subsequent to the initial time step: receive, as input, the single embedding vector generated as output from the prediction network at the corresponding time step; receive, as input, the higher order feature representation generated by the audio encoder at the corresponding time step; and generate, as output, a probability distribution over possible speech recognition hypotheses at the corresponding time step; and the final Softmax layer, the final Softmax layer configured to: receive, as input, the probability distribution over possible speech recognition hypotheses generated as output from the joint network; and determine, as output of the RNN-T model, a speech recognition result for the sequence of acoustic frames based on the probability distribution over possible speech recognition hypotheses.
2. The system of claim 1, wherein the prediction network ties a dimensionality of the shared embedding matrix to a dimensionality of an output layer of the joint network.
3. The system of claim 1, wherein weighting the embedding proportional to the similarity between the embedding and the respective position vector comprises weighting the embedding proportional to a cosine similarity between the embedding and the respective position vector.
4. The system of claim 1, wherein the sequence of non-blank symbols output by the final Softmax layer comprise wordpieces.
5. The system of claim 1, wherein the sequence of non-blank symbols output by the final Softmax layer comprise graphemes.
6. The system of claim 1, wherein each of the embeddings comprise a same dimension size as each of the position vectors.
7. The system of claim 1, wherein the sequence of non-blank symbols received as input is limited to N previous non-blank symbols output by the final Softmax layer.
8. The system of claim 7, wherein N is equal to two.
9. The system of claim 7, wherein N is equal to five.
10. The system of claim 1, wherein the prediction network comprises a multi-headed attention mechanism, the multi-headed attention mechanism sharing the shared embedding matrix across each head of the multi-headed attention mechanism.
11. The system of claim 10, wherein the prediction network is configured to, at each of the plurality of time steps subsequent to the initial time step: at each head of the multi-headed attention mechanism: for each non-blank symbol in the sequence of non-blank symbols received as input at the corresponding time step: generate, using the shared embedding matrix, the same embedding of the corresponding non-blank symbol as the embedding generated at each other head of the multi-headed attention mechanism; assign a different respective position vector to the corresponding non-blank symbol than the respective position vectors assigned to the corresponding non-blank symbol at each other head of the multi-headed attention mechanism; and weight the embedding proportional to the similarity between the embedding and the respective position vector; and generate, as output from the corresponding head of the multi-headed attention mechanism, a respective weighted average of the weighted embeddings of the sequence of non-blank symbols; and generate, as output, the single embedding vector at the corresponding time step by averaging the respective weighted averages output from the corresponding heads of the multi-headed attention mechanism.
12. The system of claim 10, wherein the multi-headed attention mechanism comprises four heads.
13. A computer-implemented method when executed on data processing hardware causes the data processing hardware to perform operations comprising: receiving a sequence of acoustic frames; at each of a plurality of time steps subsequent to an initial time step: generating, by an audio encoder, a higher order feature representation for a corresponding acoustic frame in the sequence of acoustic frames; receiving, as input to a prediction network of a recurrent neural network-transducer (RNN-T) model, a sequence of non-blank symbols output by a final Softmax layer; for each non-blank symbol in the sequence of non-blank symbols received as input at the corresponding time step: generating, by the prediction network, using a shared embedding matrix, an embedding of the corresponding non-blank symbol; assigning, by the prediction network, a respective position vector to the corresponding non-blank symbol; and weighting, by the prediction network, the embedding proportional to a similarity between the embedding and the respective position vector; generating, as output from the prediction network, a single embedding vector at the corresponding time step, the single embedding vector based on a weighted average of the weighted embeddings; and generating, by a joint network of the RNN-T model, using the single embedding vector generated as output from the prediction network at the corresponding time step and the higher order feature representation generated by the audio encoder, a probability distribution over possible speech recognition hypotheses at the corresponding time step; and generating, by the final Softmax layer, as output, a speech recognition result for the sequence of acoustic frames based on the probability distribution over possible speech recognition hypotheses.
14. The computer-implemented method of claim 13, wherein weighting the embedding proportional to the similarity between the embedding and the respective position vector comprises weighting the embedding proportional to a cosine similarity between the embedding and the respective position vector.
15. The computer-implemented method of claim 13, wherein the sequence of non-blank symbols output by the final Softmax layer comprise wordpieces.
16. The computer-implemented method of claim 13, wherein the sequence of non-blank symbols output by the final Softmax layer comprise graphemes.
17. The computer-implemented method of claim 13, wherein each of the embeddings comprise a same dimension size as each of the position vectors.
18. The computer-implemented method of claim 13, wherein the sequenc
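The multi-headed variant in claims 10 and 11, combined with the limited history of claim 7, can be illustrated with a short plain-Python sketch. It is a hedged example rather than the claimed implementation: the names `head_output` and `multi_head_prediction` and the parameter `max_context` (standing in for N) are hypothetical, cosine similarity is one option named in the claims, and using a plain mean inside each head is an assumption.

```python
import math
from typing import List, Sequence

Vector = List[float]


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def head_output(embeddings: Sequence[Vector], positions: Sequence[Vector]) -> Vector:
    """One head: weight each embedding by its similarity to that head's
    position vector, then average (plain mean is an assumption)."""
    dim = len(embeddings[0])
    total = [0.0] * dim
    for embedding, position in zip(embeddings, positions):
        weight = cosine(embedding, position)
        for i in range(dim):
            total[i] += weight * embedding[i]
    return [value / len(embeddings) for value in total]


def multi_head_prediction(
    history: Sequence[int],
    embedding_matrix: Sequence[Vector],
    head_position_vectors: Sequence[Sequence[Vector]],
    max_context: int = 2,
) -> Vector:
    """Average the per-head outputs into a single embedding vector."""
    if max_context < 1 or not history:
        raise ValueError("need a positive context size and a non-empty history")
    # Keep only the last N non-blank symbols.
    recent = list(history)[-max_context:]
    # Every head reuses the same embeddings from the shared matrix.
    embeddings = [embedding_matrix[symbol] for symbol in recent]
    # Each head has its own set of position vectors.
    outputs = [head_output(embeddings, pv) for pv in head_position_vectors]
    dim = len(embeddings[0])
    return [sum(out[i] for out in outputs) / len(outputs) for i in range(dim)]
```

Passing four entries in `head_position_vectors` would correspond to the four-head configuration of claim 12; the sketch itself works for any number of heads.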
characterised by memory or gating, e.g. long short-term memory [LSTM] or gated recurrent units [GRU] · CPC title
Supervised learning · CPC title
Quantised networks; Sparse networks; Compressed networks · CPC title
using artificial neural networks · CPC title
Recognition networks (G10L15/142, G10L15/16 take precedence) · CPC title