Speech recognition with attention-based recurrent neural networks

US11151985B2 · US · B2

Patent metadata
Field: Value
Publication number: US-11151985-B2
Application number: US-201916713298-A
Country: US
Kind code: B2
Filing date: Dec 13, 2019
Priority date: Feb 26, 2016
Publication date: Oct 19, 2021
Grant date: Oct 19, 2021

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

Methods, systems, and apparatus, including computer programs encoded on computer storage media for speech recognition. One method includes obtaining an input acoustic sequence, the input acoustic sequence representing an utterance, and the input acoustic sequence comprising a respective acoustic feature representation at each of a first number of time steps, processing the input acoustic sequence using a first neural network to convert the input acoustic sequence into an alternative representation for the input acoustic sequence, processing the alternative representation for the input acoustic sequence using an attention-based Recurrent Neural Network (RNN) to generate, for each position in an output sequence order, a set of substring scores that includes a respective substring score for each substring in a set of substrings; and generating a sequence of substrings that represent a transcription of the utterance.
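
The last step of the abstract, turning per-position substring scores into a sequence of substrings, can be illustrated with a minimal sketch. It picks the highest-scoring substring at each position, which is one option the claims describe (they also mention beam search). The `<eos>` end marker, the `step_scores` callable standing in for the attention-based RNN, and the toy scorer are hypothetical names and values introduced for illustration; none of them come from the patent.

```python
from typing import Callable, Dict, List

EOS = "<eos>"  # hypothetical end-of-sequence marker (assumption)


def greedy_transcribe(
    step_scores: Callable[[List[str]], Dict[str, float]],
    max_len: int = 50,
) -> List[str]:
    """Pick the highest-scoring substring at each output position.

    `step_scores` stands in for the attention-based RNN: given the
    substrings chosen so far, it returns a score for every substring
    in the substring set.
    """
    chosen: List[str] = []
    for _ in range(max_len):
        scores = step_scores(chosen)
        best = max(scores, key=scores.get)
        if best == EOS:  # stop once the end marker wins
            break
        chosen.append(best)
    return chosen


def toy_scores(prefix: List[str]) -> Dict[str, float]:
    """Toy scorer (assumption) that favors 'h', then 'i', then the end marker."""
    target = ["h", "i", EOS]
    want = target[min(len(prefix), len(target) - 1)]
    return {s: (0.9 if s == want else 0.05) for s in ["h", "i", EOS]}


transcript = "".join(greedy_transcribe(toy_scores))
print(transcript)
```

Greedy selection keeps only one hypothesis per position; a beam search would keep several partial transcriptions and extend each of them at every step.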

First claim

Opening claim text (preview).

What is claimed is:

1. A method comprising: obtaining, at a pyramid Bidirectional Long Short Term Memory (BLSTM) Recurrent Neural Network (RNN) executing on data processing hardware, an input sequence representing an utterance, the BLSTM RNN comprising: a bottom BLSTM layer configured to receive, as input, the input sequence representing the utterance; a first pyramid BLSTM layer configured to receive, as input, an output of the bottom BLSTM layer; and a second pyramid BLSTM layer configured to receive, as input, an output of the first pyramid BLSTM layer; at each of a first number of time steps, processing, using the bottom BLSTM layer, a respective feature representation of the input sequence to generate a respective bottom BLSTM layer output; processing, using the first pyramid BLSTM layer, the respective bottom BLSTM layer outputs generated for each of the first number of time steps to generate a sequence of first pyramid BLSTM layer outputs; at each of a second number of time steps: receiving, at the second pyramid BLSTM layer, a respective concatenation of consecutive first pyramid BLSTM layer outputs of the sequence of first pyramid BLSTM layer outputs generated using the first pyramid BLSTM layer; and processing, using the second pyramid BLSTM layer, the respective concatenation of consecutive first pyramid BLSTM layer outputs to generate a respective alternative feature representation for the corresponding time step of the second number of time steps; receiving, at an attention-based neural network executing on the data processing hardware, an alternative representation for the input sequence; and for each position in an output sequence, generating, using the attention-based neural network, a probability distribution over possible outputs by processing the alternative representation for the input sequence.

2. The method of claim 1, wherein the alternative representation for the input sequence comprises the respective alternative feature representation for each of the second number of time steps.

3. The method of claim 2, wherein the second number is smaller than the first number.

4. The method of claim 1, wherein processing the alternative representation for the input sequence using an attention-based neural network comprises, for an initial position in the output sequence order: processing a placeholder start of sequence token and a placeholder initial attention context vector using the attention-based neural network to update a hidden state of the attention-based neural network from an initial hidden state to a hidden state for the initial position in the output sequence order; generating an attention context vector for the initial position from the alternative representation and the hidden state for the initial position in the output sequence order; and generating the set of substring scores for the initial position using the attention context vector for the initial position and the hidden state for the initial position.

5. The method of claim 4, further comprising selecting, by the data processing hardware, the highest scoring possible output from the probability distribution of possible outputs at the initial position in the output sequence order.

6. The method of claim 1, wherein processing the alternative representation for the input sequence using the attention-based neural network comprises, for each position after an initial position in the output sequence order: processing a substring at the preceding position in the output sequence order and the attention context vector for the preceding position in the order using the attention-based network to update the hidden state of the attention-based neural network from the hidden state for the preceding position to a hidden state for the position; generating an attention context vector for the position from the alternative representation and the neural network hidden state for the position in the output sequence order; and generating the set of substring scores for the position using the attention context vector for the position and the neural network hidden state for the position.

7. The method of claim 6, further comprising selecting, by the data processing hardware, the highest scoring substring from the set of substring scores for the position as the substring at the position in the output sequence of substrings.

8. The method of claim 6, wherein generating an attention context vector for the position from the alternative representation and the neural network hidden state for the position in the output sequence order comprises: computing a scalar energy for the position using the alternative representation and the hidden state of the attention-based neural network for the position; converting the computed scalar energy into a probability distribution using a softmax function; and using the probability distribution to create a context vector by combining the alternative representation at different positions.

9. The method of claim 1, wherein the BLSTM RNN and the attention-based recurrent neural network are trained jointly.

10. The method of claim 1, wherein processing the alternative representation for the input sequence using the attention-based neural network comprises processing the alternative representation using the attention-based neural network using a left to right beam search decoding.

11. A system comprising: data processing hardware; and memory hardware in communication with the data processing hardware and storing instructions that when executed on the data processing hardware cause the data processing hardware to perform operations comprising: obtaining, at a pyramid Bidirectional Long Short Term Memory (BLSTM) Recurrent Neural Network (RNN) executing on the data processing hardware, an input sequence representing an utterance, the BLSTM RNN comprising: a bottom BLSTM layer configured to receive, as input, the input sequence representing the utterance; a first pyramid BLSTM layer configured to receive, as input, an output of the bottom BLSTM layer; and a second pyramid BLSTM layer configured to receive, as input, an output of the first pyramid BLSTM layer; at each of a first number of time steps, processing, using the bottom BLSTM layer, a respective feature representation of the input sequence to generate a respective bottom BLSTM layer output; processing, using the first pyramid BLSTM layer, the respective bottom BLSTM layer outputs generated for each of the first number of time steps to generate a sequence of first pyramid BLSTM layer outputs; at each of a second number of time steps: receiving, at the second pyramid BLSTM layer, a respective concatenation of consecutive first pyramid BLSTM layer outputs of the sequence of first pyramid BLSTM layer outputs generated using the first pyramid BLSTM layer; and processing, using the second pyramid BLSTM layer, the respective concatenation of consecutive first pyramid BLSTM layer outputs to generate a respective alternative feature representation for the corresponding time step of the second number of time steps; receiving, at an attention-based neural network executing on the data processing hardware, an alternative representation for the input sequence; and for each position in an output sequence, generating, using the attention-based neural network, a probability distribution over possible outputs by processing the alternative representation for the input sequence.

12. The system of claim 11, wherein the alternative representation for the input sequence comprises the respective alternative feature representation for each of the second number of time steps.

13. The system of
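
Two mechanisms in the claims lend themselves to a short illustration: the concatenation of consecutive layer outputs that feeds a pyramid layer (claims 1 and 3), and the energy, softmax, and weighted-sum recipe for the attention context vector (claim 8). The sketch below is a simplified illustration under stated assumptions, not the patented implementation. The group size of 2, the dot-product energy, dropping a trailing incomplete group, and the names `concat_consecutive` and `attention_context` are all assumptions; the claims only require "consecutive" outputs and "a scalar energy". The BLSTM processing that a pyramid layer applies after concatenation is omitted.

```python
import math
from typing import List

Vector = List[float]


def concat_consecutive(outputs: List[Vector], group: int = 2) -> List[Vector]:
    """Concatenate each run of `group` consecutive layer outputs.

    With group=2, T input steps become T // 2 steps. A trailing
    incomplete group is dropped (an assumption of this sketch).
    """
    usable = len(outputs) - len(outputs) % group
    return [
        [x for vec in outputs[i:i + group] for x in vec]
        for i in range(0, usable, group)
    ]


def attention_context(state: Vector, features: List[Vector]) -> Vector:
    """Scalar energies -> softmax -> weighted sum of the features."""
    # Dot-product energy is an assumption; it needs len(state) == len(feature).
    energies = [sum(s * h for s, h in zip(state, feat)) for feat in features]
    peak = max(energies)  # subtract the max for numerical stability
    exps = [math.exp(e - peak) for e in energies]
    total = sum(exps)
    weights = [e / total for e in exps]  # probability distribution over time steps
    dim = len(features[0])
    return [sum(w * feat[d] for w, feat in zip(weights, features)) for d in range(dim)]


# Toy data: 8 time steps of 2-dimensional layer outputs.
layer_outputs = [[float(t), 1.0] for t in range(8)]
reduced = concat_consecutive(layer_outputs)  # 4 steps of 4 dimensions each
context = attention_context([0.0, 0.0, 0.0, 0.0], reduced)
print(len(reduced), len(reduced[0]), context)
```

With an all-zero decoder state every energy is equal, so the softmax weights are uniform and the context vector is simply the mean of the reduced features. A learned state would concentrate the weights on the time steps most relevant to the current output position.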

Assignees

Inventors

Classifications

  • Recurrent networks, e.g. Hopfield networks · CPC title

  • Combinations of networks · CPC title

  • characterised by memory or gating, e.g. long short-term memory [LSTM] or gated recurrent units [GRU] · CPC title

  • Auto-encoder networks; Encoder-decoder networks · CPC title

  • Supervised learning · CPC title

Patent family

Related publications grouped by family.

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US11151985B2 cover?
Methods, systems, and apparatus, including computer programs encoded on computer storage media for speech recognition. One method includes obtaining an input acoustic sequence, the input acoustic sequence representing an utterance, and the input acoustic sequence comprising a respective acoustic feature representation at each of a first number of time steps, processing the input acoustic sequen…
Who is the assignee on this patent?
Google LLC
What technology area does this patent fall under?
Primary CPC classification G10L15/16. Mapped technology areas include Physics.
When was this patent published?
Publication date Oct 19, 2021 (kind code B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 12 related publications on this page (citations in our corpus or others sharing the same primary CPC).