Tied and reduced RNN-T

US11727920B2 · US · B2

Patent metadata
Field               Value
Publication number  US-11727920-B2
Application number  US-202117330446-A
Country             US
Kind code           B2
Filing date         May 26, 2021
Priority date       Mar 23, 2021
Publication date    Aug 15, 2023
Grant date          Aug 15, 2023

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

An RNN-T model includes a prediction network configured to, at each of a plurality of time steps subsequent to an initial time step, receive a sequence of non-blank symbols. For each non-blank symbol the prediction network is also configured to generate, using a shared embedding matrix, an embedding of the corresponding non-blank symbol, assign a respective position vector to the corresponding non-blank symbol, and weight the embedding proportional to a similarity between the embedding and the respective position vector. The prediction network is also configured to generate a single embedding vector at the corresponding time step. The RNN-T model also includes a joint network configured to, at each of the plurality of time steps subsequent to the initial time step, receive the single embedding vector generated as output from the prediction network at the corresponding time step and generate a probability distribution over possible speech recognition hypotheses.
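To make the abstract's prediction-network step concrete, here is a minimal sketch in plain Python. It is an illustration under stated assumptions, not the patented implementation: the function names, the use of cosine similarity as the similarity measure, the choice to index position vectors from the oldest symbol in the history, and the plain mean used to combine the weighted embeddings are all assumptions made for this example.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def prediction_step(symbol_ids, embedding_matrix, position_vectors):
    """Collapse a history of non-blank symbol ids into one embedding vector.

    symbol_ids:       previously emitted non-blank symbols, oldest first
    embedding_matrix: one row per symbol (the shared embedding matrix)
    position_vectors: one vector per history position (hypothetical layout)
    """
    if not symbol_ids:
        raise ValueError("the history must contain at least one non-blank symbol")
    dim = len(embedding_matrix[0])
    weighted = []
    for position, symbol_id in enumerate(symbol_ids):
        # Look up the embedding of the symbol in the shared matrix.
        embedding = embedding_matrix[symbol_id]
        # Weight the embedding by its similarity to the position vector.
        weight = cosine_similarity(embedding, position_vectors[position])
        weighted.append([weight * x for x in embedding])
    # Assumption: combine the weighted embeddings with a plain mean.
    return [sum(vec[d] for vec in weighted) / len(weighted) for d in range(dim)]
```

The output has the same dimension as a single embedding row, regardless of how many symbols are in the history, which is what lets a joint network consume one vector per time step.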

First claim

Opening claim text (preview).

What is claimed is:

1. A system comprising: data processing hardware; and memory hardware in communication with the data processing hardware and storing instructions that, when executed on the data processing hardware, cause the data processing hardware to perform operations, the operations comprising executing a recurrent neural network-transducer (RNN-T) model, the RNN-T model comprising: an audio encoder configured to: receive, as input, a sequence of acoustic frames; and generate, at each of a plurality of time steps, a higher order feature representation for a corresponding acoustic frame in the sequence of acoustic frames; a prediction network configured to, at each of the plurality of time steps subsequent to an initial time step: receive, as input, a sequence of non-blank symbols output by a final Softmax layer; for each non-blank symbol in the sequence of non-blank symbols received as input at the corresponding time step: generate, using a shared embedding matrix, an embedding of the corresponding non-blank symbol; assign a respective position vector to the corresponding non-blank symbol; and weight the embedding proportional to a similarity between the embedding and the respective position vector; and generate, as output, a single embedding vector at the corresponding time step, the single embedding vector based on a weighted average of the weighted embeddings; a joint network configured to, at each of the plurality of time steps subsequent to the initial time step: receive, as input, the single embedding vector generated as output from the prediction network at the corresponding time step; receive, as input, the higher order feature representation generated by the audio encoder at the corresponding time step; and generate, as output, a probability distribution over possible speech recognition hypotheses at the corresponding time step; and the final Softmax layer, the final Softmax layer configured to: receive, as input, the probability distribution over possible speech recognition hypotheses generated as output from the joint network; and determine, as output of the RNN-T model, a speech recognition result for the sequence of acoustic frames based on the probability distribution over possible speech recognition hypotheses.

2. The system of claim 1, wherein the prediction network ties a dimensionality of the shared embedding matrix to a dimensionality of an output layer of the joint network.

3. The system of claim 1, wherein weighting the embedding proportional to the similarity between the embedding and the respective position vector comprises weighting the embedding proportional to a cosine similarity between the embedding and the respective position vector.

4. The system of claim 1, wherein the sequence of non-blank symbols output by the final Softmax layer comprise wordpieces.

5. The system of claim 1, wherein the sequence of non-blank symbols output by the final Softmax layer comprise graphemes.

6. The system of claim 1, wherein each of the embeddings comprise a same dimension size as each of the position vectors.

7. The system of claim 1, wherein the sequence of non-blank symbols received as input is limited to N previous non-blank symbols output by the final Softmax layer.

8. The system of claim 7, wherein N is equal to two.

9. The system of claim 7, wherein N is equal to five.

10. The system of claim 1, wherein the prediction network comprises a multi-headed attention mechanism, the multi-headed attention mechanism sharing the shared embedding matrix across each head of the multi-headed attention mechanism.

11. The system of claim 10, wherein the prediction network is configured to, at each of the plurality of time steps subsequent to the initial time step: at each head of the multi-headed attention mechanism: for each non-blank symbol in the sequence of non-blank symbols received as input at the corresponding time step: generate, using the shared embedding matrix, the same embedding of the corresponding non-blank symbol as the embedding generated at each other head of the multi-headed attention mechanism; assign a different respective position vector to the corresponding non-blank symbol than the respective position vectors assigned to the corresponding non-blank symbol at each other head of the multi-headed attention mechanism; and weight the embedding proportional to the similarity between the embedding and the respective position vector; and generate, as output from the corresponding head of the multi-headed attention mechanism, a respective weighted average of the weighted embeddings of the sequence of non-blank symbols; and generate, as output, the single embedding vector at the corresponding time step by averaging the respective weighted averages output from the corresponding heads of the multi-headed attention mechanism.

12. The system of claim 10, wherein the multi-headed attention mechanism comprises four heads.

13. A computer-implemented method when executed on data processing hardware causes the data processing hardware to perform operations comprising: receiving a sequence of acoustic frames; at each of a plurality of time steps subsequent to an initial time step: generating, by an audio encoder, a higher order feature representation for a corresponding acoustic frame in the sequence of acoustic frames; receiving, as input to a prediction network of a recurrent neural network-transducer (RNN-T) model, a sequence of non-blank symbols output by a final Softmax layer; for each non-blank symbol in the sequence of non-blank symbols received as input at the corresponding time step: generating, by the prediction network, using a shared embedding matrix, an embedding of the corresponding non-blank symbol; assigning, by the prediction network, a respective position vector to the corresponding non-blank symbol; and weighting, by the prediction network, the embedding proportional to a similarity between the embedding and the respective position vector; generating, as output from the prediction network, a single embedding vector at the corresponding time step, the single embedding vector based on a weighted average of the weighted embeddings; and generating, by a joint network of the RNN-T model, using the single embedding vector generated as output from the prediction network at the corresponding time step and the higher order feature representation generated by the audio encoder, a probability distribution over possible speech recognition hypotheses at the corresponding time step; and generating, by the final Softmax layer, as output, a speech recognition result for the sequence of acoustic frames based on the probability distribution over possible speech recognition hypotheses.

14. The computer-implemented method of claim 13, wherein weighting the embedding proportional to the similarity between the embedding and the respective position vector comprises weighting the embedding proportional to a cosine similarity between the embedding and the respective position vector.

15. The computer-implemented method of claim 13, wherein the sequence of non-blank symbols output by the final Softmax layer comprise wordpieces.

16. The computer-implemented method of claim 13, wherein the sequence of non-blank symbols output by the final Softmax layer comprise graphemes.

17. The computer-implemented method of claim 13, wherein each of the embeddings comprise a same dimension size as each of the position vectors.

18. The computer-implemented method of claim 13, wherein the sequenc
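The multi-headed variant described in the dependent claims (a history limited to N previous non-blank symbols, one embedding matrix shared by every head, a different set of position vectors per head, and an average over the head outputs) can be sketched as follows. This is a hedged illustration only: the function and parameter names are hypothetical, cosine similarity is used because it is the one similarity measure the claims name, and each head combines its weighted embeddings with a plain mean, which is an assumption of this sketch.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def multi_head_prediction(history, shared_embeddings, head_position_vectors, n_context=2):
    """Return one embedding vector from the last n_context non-blank symbols.

    history:               all non-blank symbol ids emitted so far, oldest first
    shared_embeddings:     one row per symbol, used identically by every head
    head_position_vectors: one list of position vectors per head
    n_context:             N, the number of previous symbols kept (must be >= 1)
    """
    if not history:
        raise ValueError("the history must contain at least one non-blank symbol")
    recent = history[-n_context:]  # keep only the N most recent symbols
    dim = len(shared_embeddings[0])
    head_outputs = []
    for position_vectors in head_position_vectors:
        total = [0.0] * dim
        for position, symbol_id in enumerate(recent):
            # Every head performs the same lookup in the shared matrix.
            embedding = shared_embeddings[symbol_id]
            # Only the position vector differs from head to head.
            weight = cosine(embedding, position_vectors[position])
            for d in range(dim):
                total[d] += weight * embedding[d]
        # Assumption: each head averages its weighted embeddings uniformly.
        head_outputs.append([x / len(recent) for x in total])
    # Average the per-head vectors into the single output embedding.
    return [sum(out[d] for out in head_outputs) / len(head_outputs) for d in range(dim)]
```

Because the heads share one embedding matrix, adding a head in this sketch adds only N position vectors, not another full embedding table.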

Assignees

Inventors

Classifications

  • characterised by memory or gating, e.g. long short-term memory [LSTM] or gated recurrent units [GRU] · CPC title

  • Supervised learning · CPC title

  • Quantised networks; Sparse networks; Compressed networks · CPC title

  • G10L15/16 · Primary

    using artificial neural networks · CPC title

  • Recognition networks (G10L15/142, G10L15/16 take precedence) · CPC title

Patent family

Related publications grouped by family.

External sources

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US11727920B2 cover?
An RNN-T model includes a prediction network configured to, at each of a plurality of time steps subsequent to an initial time step, receive a sequence of non-blank symbols. For each non-blank symbol the prediction network is also configured to generate, using a shared embedding matrix, an embedding of the corresponding non-blank symbol, assign a respective position vector to the corresponding …
Who is the assignee on this patent?
Google LLC
What technology area does this patent fall under?
Primary CPC classification G10L15/16. Mapped technology areas include Physics.
When was this patent published?
Publication date: Aug 15, 2023 (B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 7 related publications on this page (citations in our corpus or others sharing the same primary CPC).