Speech recognition with attention-based recurrent neural networks

US9799327B1 · US · B1

Patent metadata
Publication number: US-9799327-B1
Application number: US-201615055476-A
Country: US
Kind code: B1
Filing date: Feb 26, 2016
Priority date: Feb 26, 2016
Publication date: Oct 24, 2017
Grant date: Oct 24, 2017

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

Methods, systems, and apparatus, including computer programs encoded on computer storage media for speech recognition. One method includes obtaining an input acoustic sequence, the input acoustic sequence representing an utterance, and the input acoustic sequence comprising a respective acoustic feature representation at each of a first number of time steps; processing the input acoustic sequence using a first neural network to convert the input acoustic sequence into an alternative representation for the input acoustic sequence; processing the alternative representation for the input acoustic sequence using an attention-based Recurrent Neural Network (RNN) to generate, for each position in an output sequence order, a set of substring scores that includes a respective substring score for each substring in a set of substrings; and generating a sequence of substrings that represent a transcription of the utterance.
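The last step of the abstract, turning per-position substring scores into a transcription, can be sketched as follows. This is a minimal illustration, not the patented implementation: the function name `greedy_transcription` and the score values are hypothetical, the scores are precomputed here rather than produced by an attention-based RNN, and greedy selection is only one option (the claims also mention beam search). The `<sos>` and `<eos>` tokens are the ones named in the claims.

```python
# Sketch only: in the patent, each score set would come from the
# attention-based RNN and depend on earlier outputs. Here the scores
# are made-up, precomputed values used purely for illustration.
SOS, EOS = "<sos>", "<eos>"


def greedy_transcription(score_steps):
    """Pick the highest-scoring substring at each position until <eos>."""
    output = [SOS]
    for scores in score_steps:
        best = max(scores, key=scores.get)  # substring with the top score
        output.append(best)
        if best == EOS:
            break
    return output


# Hypothetical substring scores for three output positions.
steps = [
    {"h": 0.7, "i": 0.2, EOS: 0.1},
    {"h": 0.1, "i": 0.8, EOS: 0.1},
    {"h": 0.05, "i": 0.05, EOS: 0.9},
]
tokens = greedy_transcription(steps)
text = "".join(t for t in tokens if t not in (SOS, EOS))
```

With these made-up scores, `tokens` holds the start token, the two chosen characters, and the end token, and `text` is the transcription with the two marker tokens removed.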

First claim

Opening claim text (preview).

What is claimed is:

1. A computer implemented method comprising: obtaining an input acoustic sequence, the input acoustic sequence representing an utterance, and the input acoustic sequence comprising a respective acoustic feature representation at each of a first number of time steps; processing the input acoustic sequence using a first neural network to convert the input acoustic sequence into an alternative representation for the input acoustic sequence, wherein the alternative representation for the input acoustic sequence comprises a respective alternative acoustic feature representation for each of a second number of time steps, and wherein the second number is smaller than the first number; processing the alternative representation for the input acoustic sequence using an attention-based Recurrent Neural Network (RNN) to generate, for each position in an output sequence order, a set of substring scores that includes a respective substring score for each substring in a set of substrings; and generating a sequence of substrings that represent a transcription of the utterance.

2. The method of claim 1, wherein a substring comprises one or more characters.

3. The method of claim 2, wherein the set of substrings comprises a set of alphabetic letters which is used to write one or more natural languages.

4. The method of claim 3, wherein the substrings in the set of substrings further comprise a space character, a comma character, a period character, an apostrophe character, and an unknown character.

5. The method of claim 1, wherein the generated sequence of substrings begins with a start of sequence token <sos> and ends with an end of sequence token <eos>.

6. The method of claim 1, wherein the first neural network is a pyramid Bidirectional Long Short Term Memory (BLSTM) RNN.

7. The method of claim 6, wherein processing the input acoustic sequence using a first neural network to convert the input acoustic sequence into an alternative representation for the input acoustic sequence comprises: processing the input acoustic sequence through a bottom BLSTM layer to generate a BLSTM layer output; and processing the BLSTM layer output through each a plurality of pyramid BLSTM layers, wherein consecutive outputs of each pyramid BLSTM layer are concatenated before being provided to the next pyramid BLSTM layer.

8. The method of claim 1, wherein processing the alternative representation for the input acoustic sequence using an attention-based RNN comprises, for an initial position in the output sequence order: processing a placeholder start of sequence token and a placeholder initial attention context vector using the attention-based RNN to update a hidden state of the attention-based RNN from an initial hidden state to a hidden state for the initial position in the output sequence order; generating an attention context vector for the initial position from the alternative representation and the RNN hidden state for the initial position in the output sequence order; and generating the set of substring scores for the initial position using the attention context vector for the initial position and the RNN hidden state for the initial position.

9. The method of claim 8, further comprising selecting a highest scoring substring from the set of substring scores as the substring at the initial position in the output sequence of substrings.

10. The method of claim 1, wherein processing the alternative representation for the input acoustic sequence using an attention-based Recurrent Neural Network (RNN) comprises, for each position after an initial position in the output sequence order: processing a substring at the preceding position in the output sequence order and an attention context vector for the preceding position in the order using the attention-based RNN to update the hidden state of the attention-based RNN from the hidden state for the preceding position to a hidden state for the position; generating an attention context vector for the position from the alternative representation and the RNN hidden state for the position in the output sequence order; and generating the set of substring scores for the position using the attention context vector for the position and the RNN hidden state for the position.

11. The method of claim 10, further comprising selecting a highest scoring substring from the set of substring scores for the position as the substring at the position in the output sequence of substrings.

12. The method of claim 10, wherein generating an attention context vector for the position from the alternative representation and the RNN hidden state for the position in the output sequence order comprises: computing a scalar energy for the position using the alternative representation and the hidden state of the attention-based RNN for the position; converting the computed scalar energy into a probability distribution using a softmax function; and using the probability distribution to create a context vector by combining the alternative representation at different positions.

13. The method of claim 10, wherein generating the set of substring scores for the position using the attention context vector for the position and the RNN hidden state for the position comprises: providing the hidden state of the attention-based RNN for the position and generated attention context vector for the position as input to a multi-layer perceptron (MLP) with a softmax output layer; and processing the hidden state of the attention-based RNN for the position and generated attention context vector for the position using the MLP to generate a set of substring scores for each substring in the set of substrings for the position.

14. The method of claim 1, wherein the first neural network and the attention-based recurrent neural network are trained jointly.

15. The method of claim 1, wherein processing the alternative representation for the input sequence using an attention-based Recurrent Neural Network (RNN) comprises processing the alternative representation using an attention-based RNN using a left to right beam search decoding.

16. A system comprising one or more computers and one or more storage devices storing instructions that are operable, when executed by the one or more computers, to cause the one or more computers to perform operations comprising: obtaining an input acoustic sequence, the input acoustic sequence representing an utterance, and the input acoustic sequence comprising a respective acoustic feature representation at each of a first number of time steps; processing the input acoustic sequence using a first neural network to convert the input acoustic sequence into an alternative representation for the input acoustic sequence, wherein the alternative representation for the input acoustic sequence comprises a respective alternative acoustic feature representation for each of a second number of time steps, and wherein the second number is smaller than the first number; processing the alternative representation for the input acoustic sequence using an attention-based Recurrent Neural Network (RNN) to generate, for each position in an output sequence order, a set of substring scores that includes a respective substring score for each substring in a set of substrings; and generating a sequence of substrings that represent a transcription of the utterance.

17. The system of claim 16, wherein a substring comprises one or more characters.

18. The system of claim 17, wherein the set of substrings comprises a set of alphabetic letters whic
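Two mechanisms in the claims lend themselves to a small worked sketch: the concatenation of consecutive outputs between pyramid layers (claim 7), which is what makes the second number of time steps smaller than the first, and the energy, softmax, and weighted-sum steps that build an attention context vector (claim 12). The code below is an illustration under stated assumptions, not the claimed implementation: the function names are hypothetical, the BLSTM layers themselves are omitted so only the concatenation is shown, frames are merged in pairs, an odd trailing frame is dropped, and the scalar energy is a plain dot product (the claims only require some scalar energy).

```python
import math


def pyramid_concat(frames):
    """Concatenate each pair of consecutive frames, halving the time steps."""
    # Assumption: an odd trailing frame is dropped; the claims do not say.
    return [frames[i] + frames[i + 1] for i in range(0, len(frames) - 1, 2)]


def softmax(energies):
    """Turn scalar energies into a probability distribution."""
    peak = max(energies)  # subtract the max for numerical stability
    exps = [math.exp(e - peak) for e in energies]
    total = sum(exps)
    return [x / total for x in exps]


def attention_context(state, features):
    """Energy per time step -> softmax weights -> weighted sum of features."""
    # Assumption: dot-product energy between decoder state and each feature.
    energies = [sum(s * f for s, f in zip(state, feat)) for feat in features]
    weights = softmax(energies)
    dim = len(features[0])
    context = [
        sum(w * feat[d] for w, feat in zip(weights, features))
        for d in range(dim)
    ]
    return weights, context


# Eight made-up one-dimensional frames stand in for acoustic features.
frames = [[float(t)] for t in range(8)]
level1 = pyramid_concat(frames)   # 4 time steps, 2 values each
level2 = pyramid_concat(level1)   # 2 time steps, 4 values each

# Hypothetical decoder hidden state with the same width as level2 frames.
state = [0.1, 0.0, 0.0, 0.0]
weights, context = attention_context(state, level2)
```

Each concatenation level halves the number of time steps while doubling the width of each frame, so the attention step has fewer positions to weigh. The returned `weights` sum to one, and `context` is the weighted combination of the reduced frames.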

Assignees

Google Inc

Inventors

Classifications

  • Combinations of networks · CPC title

  • Recurrent networks, e.g. Hopfield networks · CPC title

  • using neural networks · CPC title

  • Version control (for software G06F8/71) · CPC title

  • Use of codes for handling textual entities · CPC title

Patent family

Related publications grouped by family.

External sources

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US9799327B1 cover?
Methods, systems, and apparatus, including computer programs encoded on computer storage media for speech recognition. One method includes obtaining an input acoustic sequence, the input acoustic sequence representing an utterance, and the input acoustic sequence comprising a respective acoustic feature representation at each of a first number of time steps; processing the input acoustic sequen…
Who is the assignee on this patent?
Google Inc
What technology area does this patent fall under?
Primary CPC classification G10L15/16. Mapped technology areas include Physics.
When was this patent published?
Publication date Oct 24, 2017 (B1). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 11 related publications on this page (citations in our corpus or others sharing the same primary CPC).