Tied and reduced RNN-T

US11727920B2 · US · B2

Patent metadata
Field	Value
Publication number	US-11727920-B2
Application number	US-202117330446-A
Country	US
Kind code	B2
Filing date	May 26, 2021
Priority date	Mar 23, 2021
Publication date	Aug 15, 2023
Grant date	Aug 15, 2023

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

Title
What the patent document calls the invention.
Abstract
A short plain-language summary of the technical disclosure.
Assignees and inventors
Who owns or filed the patent and who is credited as inventor.
Key dates
Filing, priority, publication, and grant dates set the timeline.
First independent claim
The legal scope of protection — read this for what is actually claimed.
CPC / IPC classifications
Technology tags used to group this patent with similar filings.
Citations and related patents
Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

A RNN-T model includes a prediction network configured to, at each of a plurality of times steps subsequent to an initial time step, receive a sequence of non-blank symbols. For each non-blank symbol the prediction network is also configured to generate, using a shared embedding matrix, an embedding of the corresponding non-blank symbol, assign a respective position vector to the corresponding non-blank symbol, and weight the embedding proportional to a similarity between the embedding and the respective position vector. The prediction network is also configured to generate a single embedding vector at the corresponding time step. The RNN-T model also includes a joint network configured to, at each of the plurality of time steps subsequent to the initial time step, receive the single embedding vector generated as output from the prediction network at the corresponding time step and generate a probability distribution over possible speech recognition hypotheses.
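The prediction-network step described in the abstract can be illustrated with a small sketch. This is not the patented implementation: the function names, the toy values, the use of a plain dot product as the similarity measure, and the normalization by the number of symbols are all assumptions made here for illustration.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def prediction_step(history, embedding_matrix, position_vectors):
    """Collapse a non-empty history of non-blank symbol ids into one vector.

    history:          list of symbol ids, oldest first (hypothetical layout)
    embedding_matrix: one row per symbol; the same matrix is used for every
                      position, which is what "shared" means in this sketch
    position_vectors: one vector per history position, same size as a row
    """
    dim = len(embedding_matrix[0])
    out = [0.0] * dim
    for pos, symbol in enumerate(history):
        emb = embedding_matrix[symbol]
        # Assumed similarity measure: dot product of embedding and position vector.
        weight = dot(emb, position_vectors[pos])
        for i in range(dim):
            out[i] += weight * emb[i]
    # Assumed normalization: divide by the number of symbols in the history.
    return [x / len(history) for x in out]


# Toy values: two symbols, two history positions, dimension 2.
E = [[1.0, 0.0], [0.0, 1.0]]
P = [[1.0, 0.0], [0.0, 1.0]]
print(prediction_step([0, 1], E, P))  # [0.5, 0.5]
print(prediction_step([1, 0], E, P))  # [0.0, 0.0]
```

Swapping the order of the two symbols changes the result because each embedding is compared against the vector for its position, so the output depends on where a symbol sits in the history and not only on which symbols are present.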

First claim

Opening claim text (preview).

What is claimed is:

1. A system comprising: data processing hardware; and memory hardware in communication with the data processing hardware and storing instructions that, when executed on the data processing hardware, cause the data processing hardware to perform operations, the operations comprising executing a recurrent neural network-transducer (RNN-T) model, the RNN-T model comprising: an audio encoder configured to: receive, as input, a sequence of acoustic frames; and generate, at each of a plurality of time steps, a higher order feature representation for a corresponding acoustic frame in the sequence of acoustic frames; a prediction network configured to, at each of the plurality of time steps subsequent to an initial time step: receive, as input, a sequence of non-blank symbols output by a final Softmax layer; for each non-blank symbol in the sequence of non-blank symbols received as input at the corresponding time step: generate, using a shared embedding matrix, an embedding of the corresponding non-blank symbol; assign a respective position vector to the corresponding non-blank symbol; and weight the embedding proportional to a similarity between the embedding and the respective position vector; and generate, as output, a single embedding vector at the corresponding time step, the single embedding vector based on a weighted average of the weighted embeddings; a joint network configured to, at each of the plurality of time steps subsequent to the initial time step: receive, as input, the single embedding vector generated as output from the prediction network at the corresponding time step; receive, as input, the higher order feature representation generated by the audio encoder at the corresponding time step; and generate, as output, a probability distribution over possible speech recognition hypotheses at the corresponding time step; and the final Softmax layer, the final Softmax layer configured to: receive, as input, the probability distribution over possible speech recognition hypotheses generated as output from the joint network; and determine, as output of the RNN-T model, a speech recognition result for the sequence of acoustic frames based on the probability distribution over possible speech recognition hypotheses.

2. The system of claim 1, wherein the prediction network ties a dimensionality of the shared embedding matrix to a dimensionality of an output layer of the joint network.

3. The system of claim 1, wherein weighting the embedding proportional to the similarity between the embedding and the respective position vector comprises weighting the embedding proportional to a cosine similarity between the embedding and the respective position vector.

4. The system of claim 1, wherein the sequence of non-blank symbols output by the final Softmax layer comprise wordpieces.

5. The system of claim 1, wherein the sequence of non-blank symbols output by the final Softmax layer comprise graphemes.

6. The system of claim 1, wherein each of the embeddings comprise a same dimension size as each of the position vectors.

7. The system of claim 1, wherein the sequence of non-blank symbols received as input is limited to N previous non-blank symbols output by the final Softmax layer.

8. The system of claim 7, wherein N is equal to two.

9. The system of claim 7, wherein N is equal to five.

10. The system of claim 1, wherein the prediction network comprises a multi-headed attention mechanism, the multi-headed attention mechanism sharing the shared embedding matrix across each head of the multi-headed attention mechanism.

11. The system of claim 10, wherein the prediction network is configured to, at each of the plurality of time steps subsequent to the initial time step: at each head of the multi-headed attention mechanism: for each non-blank symbol in the sequence of non-blank symbols received as input at the corresponding time step: generate, using the shared embedding matrix, the same embedding of the corresponding non-blank symbol as the embedding generated at each other head of the multi-headed attention mechanism; assign a different respective position vector to the corresponding non-blank symbol than the respective position vectors assigned to the corresponding non-blank symbol at each other head of the multi-headed attention mechanism; and weight the embedding proportional to the similarity between the embedding and the respective position vector; and generate, as output from the corresponding head of the multi-headed attention mechanism, a respective weighted average of the weighted embeddings of the sequence of non-blank symbols; and generate, as output, the single embedding vector at the corresponding time step by averaging the respective weighted averages output from the corresponding heads of the multi-headed attention mechanism.

12. The system of claim 10, wherein the multi-headed attention mechanism comprises four heads.

13. A computer-implemented method when executed on data processing hardware causes the data processing hardware to perform operations comprising: receiving a sequence of acoustic frames; at each of a plurality of time steps subsequent to an initial time step: generating, by an audio encoder, a higher order feature representation for a corresponding acoustic frame in the sequence of acoustic frames; receiving, as input to a prediction network of a recurrent neural network-transducer (RNN-T) model, a sequence of non-blank symbols output by a final Softmax layer; for each non-blank symbol in the sequence of non-blank symbols received as input at the corresponding time step: generating, by the prediction network, using a shared embedding matrix, an embedding of the corresponding non-blank symbol; assigning, by the prediction network, a respective position vector to the corresponding non-blank symbol; and weighting, by the prediction network, the embedding proportional to a similarity between the embedding and the respective position vector; generating, as output from the prediction network, a single embedding vector at the corresponding time step, the single embedding vector based on a weighted average of the weighted embeddings; and generating, by a joint network of the RNN-T model, using the single embedding vector generated as output from the prediction network at the corresponding time step and the higher order feature representation generated by the audio encoder, a probability distribution over possible speech recognition hypotheses at the corresponding time step; and generating, by the final Softmax layer, as output, a speech recognition result for the sequence of acoustic frames based on the probability distribution over possible speech recognition hypotheses.

14. The computer-implemented method of claim 13, wherein weighting the embedding proportional to the similarity between the embedding and the respective position vector comprises weighting the embedding proportional to a cosine similarity between the embedding and the respective position vector.

15. The computer-implemented method of claim 13, wherein the sequence of non-blank symbols output by the final Softmax layer comprise wordpieces.

16. The computer-implemented method of claim 13, wherein the sequence of non-blank symbols output by the final Softmax layer comprise graphemes.

17. The computer-implemented method of claim 13, wherein each of the embeddings comprise a same dimension size as each of the position vectors.

18. The computer-implemented method of claim 13, wherein the sequenc
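The dependent claims add cosine similarity (claims 3 and 14), a history limited to the N previous non-blank symbols (claims 7 to 9), and a multi-headed mechanism that shares one embedding matrix while giving each head its own position vectors (claims 10 to 12). A rough sketch of how these pieces could fit together is below. It is an illustration under assumptions, not the claimed implementation: the function names, the toy values, the oldest-first alignment of positions within the window, and the division by the window length are all hypothetical choices made here.

```python
import math


def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return sum(a * b for a, b in zip(u, v)) / (norm_u * norm_v)


def head_output(symbols, embedding_matrix, head_positions):
    """One head: weight each embedding by its cosine similarity to the
    head's position vector, then average over the window (assumed scheme)."""
    dim = len(embedding_matrix[0])
    acc = [0.0] * dim
    for pos, symbol in enumerate(symbols):
        emb = embedding_matrix[symbol]  # same embedding in every head
        weight = cosine(emb, head_positions[pos])
        for i in range(dim):
            acc[i] += weight * emb[i]
    return [x / len(symbols) for x in acc]


def multi_head_prediction(history, embedding_matrix, positions_per_head, n=2):
    """Average the per-head outputs into a single embedding vector.

    history must be non-empty; only its last n symbols are used.
    positions_per_head holds one list of position vectors per head.
    """
    recent = history[-n:]  # limit input to the N most recent symbols
    outputs = [head_output(recent, embedding_matrix, p) for p in positions_per_head]
    dim = len(outputs[0])
    return [sum(o[i] for o in outputs) / len(outputs) for i in range(dim)]


# Toy values: three symbols, two heads, window of N = 2, dimension 2.
E = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
HEADS = [
    [[1.0, 0.0], [0.0, 1.0]],  # position vectors for head 0
    [[0.0, 1.0], [1.0, 0.0]],  # position vectors for head 1
]
print(multi_head_prediction([2, 0, 1], E, HEADS, n=2))  # [0.25, 0.25]
```

In this sketch the only per-head parameters are the position vectors, so adding a head adds one small set of vectors and leaves the embedding matrix untouched.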

Assignees

Google LLC

Inventors

Classifications

G06N3/0442
characterised by memory or gating, e.g. long short-term memory [LSTM] or gated recurrent units [GRU] · CPC title
G06N3/09
Supervised learning · CPC title
G06N3/0495
Quantised networks; Sparse networks; Compressed networks · CPC title
G10L15/16 (Primary)
using artificial neural networks · CPC title
G10L15/083
Recognition networks (G10L15/142, G10L15/16 take precedence) · CPC title

Patent family

Related publications grouped by family.

View patent family 76624155

External sources

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US11727920B2 cover?: A RNN-T model includes a prediction network configured to, at each of a plurality of times steps subsequent to an initial time step, receive a sequence of non-blank symbols. For each non-blank symbol the prediction network is also configured to generate, using a shared embedding matrix, an embedding of the corresponding non-blank symbol, assign a respective position vector to the corresponding …
Who is the assignee on this patent?: Google LLC
What technology area does this patent fall under?: Primary CPC classification G10L15/16. Mapped technology areas include Physics.
When was this patent published?: Publication date Aug 15, 2023 (B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?: We list 7 related publications on this page (citations in our corpus or others sharing the same primary CPC).

Related patents

Tied and Reduced RNN-T

Contextual biasing for speech recognition

Synthetic speech processing

Contextual biasing for speech recognition using grapheme and phoneme data

Neural paraphrase generator

Systems and methods for fast novel visual concept learning from sentence descriptions of images

System and method for ranking of hybrid speech recognition results with neural networks
