Processing acoustic sequences using long short-term memory (LSTM) neural networks that include recurrent projection layers

US9620108B2 · US · B2

Patent metadata
Field               Value
Publication number  US-9620108-B2
Application number  US-201414557725-A
Country             US
Kind code           B2
Filing date         Dec 2, 2014
Priority date       Dec 10, 2013
Publication date    Apr 11, 2017
Grant date          Apr 11, 2017

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

Methods, systems, and apparatus, including computer programs encoded on computer storage media, for generating phoneme representations of acoustic sequences using projection sequences. One of the methods includes receiving an acoustic sequence, the acoustic sequence representing an utterance, and the acoustic sequence comprising a respective acoustic feature representation at each of a plurality of time steps; for each of the plurality of time steps, processing the acoustic feature representation through each of one or more long short-term memory (LSTM) layers; and for each of the plurality of time steps, processing the recurrent projected output generated by the highest LSTM layer for the time step using an output layer to generate a set of scores for the time step.
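The final step of the abstract, turning each time step's recurrent projected output into a set of scores, can be illustrated with a small NumPy sketch. This is not taken from the patent: the softmax normalisation, the function name `output_layer_scores`, the weight names `W_out` and `b_out`, and the toy sizes are all assumptions made for illustration.

```python
import numpy as np


def output_layer_scores(projected_outputs, W_out, b_out):
    """Map each time step's recurrent projected output to a set of scores.

    projected_outputs: array of shape (T, P), one row per time step.
    W_out: array of shape (K, P); b_out: array of shape (K,).
    K is the number of phonemes or phoneme subdivisions being scored.
    Returns an array of shape (T, K) whose rows each sum to 1.
    Assumption: a softmax output layer (the abstract does not name one).
    """
    logits = projected_outputs @ W_out.T + b_out
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    exp = np.exp(logits)
    return exp / exp.sum(axis=1, keepdims=True)


# Toy sizes (hypothetical): 5 time steps, 4-dim projected output, 3 classes.
rng = np.random.default_rng(0)
T, P, K = 5, 4, 3
r = rng.standard_normal((T, P))  # stand-in for the highest layer's outputs
scores = output_layer_scores(r, rng.standard_normal((K, P)), np.zeros(K))
print(scores.shape)  # (5, 3)
```

Each row of `scores` is one time step's distribution over the K classes. In a real system the rows of `r` would come from the highest LSTM layer rather than from random numbers.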

First claim

Opening claim text (preview).

What is claimed is:

1. A method comprising:
  receiving an acoustic sequence, the acoustic sequence representing an utterance, and the acoustic sequence comprising a respective acoustic feature representation at each of a plurality of time steps;
  for each of the plurality of time steps, processing the acoustic feature representation through each of one or more long short-term memory (LSTM) layers, wherein the one or more LSTM layers are arranged in a sequence from a lowest LSTM layer to a highest LSTM layer, and wherein each of the one or more LSTM layers is configured to perform operations comprising:
    receiving a layer input at the time step;
    generating an LSTM output for the time step by processing, through one or more LSTM memory blocks, the layer input at the time step and a previous recurrent projected output,
    processing the LSTM output for the time step using a recurrent projection layer, wherein the recurrent projection layer is configured to:
      generate a recurrent projected output for the time step by applying a matrix of current values of weights to the LSTM output to project the LSTM output to a lower dimensional space, and
    updating the previous recurrent projected output with the recurrent projected output, wherein the updated previous recurrent projected output is used by the LSTM memory blocks in generating an LSTM output for a next time step; and
  for each of the plurality of time steps, processing the recurrent projected output generated by the highest LSTM layer for the time step using an output layer to generate a set of scores for the time step, the set of scores for the time step comprising a respective score for each of a plurality of phonemes or phoneme subdivisions, the score for each phoneme or phoneme subdivision representing a likelihood that the phoneme or phoneme subdivision represents the utterance at the time step.

2. The method of claim 1, the operations further comprising: processing the LSTM output for the time step using a non-recurrent projection layer to generate a non-recurrent projected output for the time step.

3. The method of claim 2, further comprising: for each of the time steps, processing the recurrent projected output generated by the highest LSTM layer for the time step and the non-recurrent projected output generated by the highest LSTM layer for the time step using the output layer to generate the set of scores for the time step.

4. The method of claim 1, wherein each LSTM memory block comprises one or more LSTM memory cells, and wherein each LSTM memory cell generates a cell output that is aggregated to generate the LSTM output for the time step.

5. The method of claim 4, wherein the cell output $m_t$ for the time step satisfies:

$$
\begin{aligned}
i_t &= \sigma(W_{ix} x_t + W_{ir} r_{t-1} + W_{ic} c_{t-1} + b_i) \\
f_t &= \sigma(W_{fx} x_t + W_{rf} r_{t-1} + W_{cf} c_{t-1} + b_f) \\
c_t &= f_t \odot c_{t-1} + i_t \odot g(W_{cx} x_t + W_{cr} r_{t-1} + b_c) \\
o_t &= \sigma(W_{ox} x_t + W_{or} r_{t-1} + W_{oc} c_t + b_o) \\
m_t &= o_t \odot h(c_t)
\end{aligned}
$$

where $i_t$ is an input gate activation at the time step, $f_t$ is a forget gate activation at the time step, $o_t$ is an output gate activation at the time step, $c_t$ is a cell activation at the time step, $c_{t-1}$ is a cell activation for a previous time step, $\odot$ is an element-wise product operation, $g$ is a cell input activation function, $h$ is a cell output activation function, each $W$ term is a respective matrix of current weight values for the LSTM memory cell, $b_i$, $b_f$, $b_c$, and $b_o$ are bias vectors, and $r_{t-1}$ is a recurrent projected output generated by the recurrent projected layer for the previous time step.

6. The method of claim 1, wherein the LSTM output for the time step is a vector having a first dimensionality, and wherein the recurrent projected output for the time step is a vector having a second, smaller dimensionality.

7. The method of claim 1, wherein the set of scores for the time step defines a probability distribution over a set of Hidden Markov Model (HMM) states.

8. The method of claim 1, wherein the layer input for the time step for the lowest LSTM layer is the acoustic feature representation for the time step.

9. The method of claim 8, wherein the layer input for the time step for each LSTM layer subsequent to the lowest LSTM layer in the sequence is the layer output generated by a preceding LSTM layer in the sequence for the time step.

10. A system comprising one or more computers and one or more storage devices storing instructions that when executed by one or more computers cause the one or more computers to perform first operations comprising:
  receiving an acoustic sequence, the acoustic sequence representing an utterance, and the acoustic sequence comprising a respective acoustic feature representation at each of a plurality of time steps;
  for each of the plurality of time steps, processing the acoustic feature representation through each of one or more long short-term memory (LSTM) layers, wherein the one or more LSTM layers are arranged in a sequence from a lowest LSTM layer to a highest LSTM layer, and wherein each of the one or more LSTM layers is configured to perform second operations comprising:
    receiving a layer input at the time step;
    generating, by processing the layer input at the time step and a previous recurrent projected output through one or more LSTM memory blocks, an LSTM output for the time step,
    processing the LSTM output for the time step using a recurrent projection layer, wherein the recurrent projection layer is configured to:
      generate a recurrent projected output for the time step by applying a matrix of current values of weights to the LSTM output to project the LSTM output to a lower dimensional space, and
    updating the previous recurrent projected output with the recurrent projected output, wherein the updated previous recurrent projected output is used by the LSTM memory blocks in generating an LSTM output for a next time step; and
  for each of the plurality of time steps, processing the recurrent projected output generated by the highest LSTM layer for the time step using an output layer to generate a set of scores for the time step, the set of scores for the time step comprising a respective score for each of a plurality of phonemes or phoneme subdivisions, the score for each phoneme or phoneme subdivision representing a likelihood that the phoneme or phoneme subdivision represents the utterance at the time step.

11. The system of claim 10, the second operations further comprising: processing the LSTM output for the time step using a non-recurrent projection layer to generate a non-recurrent projected output for the time step.

12. The system of claim 11, the first operations further comprising: for each of the time steps, processing the recurrent projected output generated by the highest LSTM layer for the time step and the non-recurrent projected output generated by the highest LSTM layer for the time step using the output layer to generate the set of scores for the time step.

13. The system of claim 10, wherein each LSTM memory block comprises one or more LSTM memory cells, and wherein each LSTM memory cell generates a cell output that is aggregated to generate the LSTM output for the time step.

14. The system of claim 13, wherein the cell output $m_t$ for the time step satisfies:

$$
\begin{aligned}
i_t &= \sigma(W_{ix} x_t + W_{ir} r_{t-1} + W_{ic} c_{t-1} + b_i) \\
f_t &= \sigma(W_{fx} x_t + W_{rf} r_{t-1} + W_{cf} c_{t-1} + b_f) \\
c_t &= f_t \odot c_{t-1} + i_t \odot g(W_{cx} x_t + {}
\end{aligned}
$$

…
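One time step of the cell equations recited in claim 5, followed by the recurrent projection of claim 1, can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the patented implementation: `tanh` for both `g` and `h`, zero initial states, the random initialisation, the dictionary key names, and the projection matrix key `"rm"` are all hypothetical choices.

```python
import numpy as np


def sigmoid(z):
    """Logistic function, used for the three gates."""
    return 1.0 / (1.0 + np.exp(-z))


def init_params(n_in, n_cell, n_proj, rng):
    """Small random weights with shapes implied by the claim 5 equations.

    Key names are hypothetical shorthand (gate first, then source); the
    claim text itself writes the forget-gate terms as W_rf and W_cf.
    """
    p = {}
    for gate in ("i", "f", "o"):
        p[gate + "x"] = 0.1 * rng.standard_normal((n_cell, n_in))    # on x_t
        p[gate + "r"] = 0.1 * rng.standard_normal((n_cell, n_proj))  # on r_{t-1}
        p[gate + "c"] = 0.1 * rng.standard_normal((n_cell, n_cell))  # on cell state
        p["b" + gate] = np.zeros(n_cell)
    p["cx"] = 0.1 * rng.standard_normal((n_cell, n_in))
    p["cr"] = 0.1 * rng.standard_normal((n_cell, n_proj))
    p["bc"] = np.zeros(n_cell)
    p["rm"] = 0.1 * rng.standard_normal((n_proj, n_cell))  # projection matrix
    return p


def lstmp_step(x_t, r_prev, c_prev, p, g=np.tanh, h=np.tanh):
    """One time step: gates, cell update, cell output, then projection."""
    i_t = sigmoid(p["ix"] @ x_t + p["ir"] @ r_prev + p["ic"] @ c_prev + p["bi"])
    f_t = sigmoid(p["fx"] @ x_t + p["fr"] @ r_prev + p["fc"] @ c_prev + p["bf"])
    c_t = f_t * c_prev + i_t * g(p["cx"] @ x_t + p["cr"] @ r_prev + p["bc"])
    o_t = sigmoid(p["ox"] @ x_t + p["or"] @ r_prev + p["oc"] @ c_t + p["bo"])
    m_t = o_t * h(c_t)   # LSTM output (first, larger dimensionality)
    r_t = p["rm"] @ m_t  # recurrent projected output (smaller dimensionality)
    return r_t, c_t, m_t


# Toy sizes (hypothetical): 3-dim features, 8 cells, 4-dim projection.
rng = np.random.default_rng(0)
n_in, n_cell, n_proj = 3, 8, 4
params = init_params(n_in, n_cell, n_proj, rng)

r = np.zeros(n_proj)  # previous recurrent projected output (assumed zero at start)
c = np.zeros(n_cell)  # previous cell activation (assumed zero at start)
for x_t in rng.standard_normal((6, n_in)):  # six time steps of random features
    r, c, m = lstmp_step(x_t, r, c, params)  # r is fed back at the next step

print(m.shape, r.shape)  # (8,) (4,)
```

With these toy sizes each recurrent weight matrix has 8 × 4 = 32 entries, whereas feeding back the unprojected 8-dimensional LSTM output would need 8 × 8 = 64. That reduction is the practical effect of the smaller second dimensionality described in claim 6.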

Assignees

Google Inc

Inventors

Classifications

  • Combinations of networks · CPC title

  • Recurrent networks, e.g. Hopfield networks · CPC title

  • Supervised learning · CPC title

  • characterised by memory or gating, e.g. long short-term memory [LSTM] or gated recurrent units [GRU] · CPC title

  • using dynamic programming techniques, e.g. dynamic time warping [DTW] · CPC title

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US9620108B2 cover?
Methods, systems, and apparatus, including computer programs encoded on computer storage media, for generating phoneme representations of acoustic sequences using projection sequences. One of the methods includes receiving an acoustic sequence, the acoustic sequence representing an utterance, and the acoustic sequence comprising a respective acoustic feature representation at each of a pluralit…
Who is the assignee on this patent?
Google Inc
What technology area does this patent fall under?
Primary CPC classification G10L15/08. Mapped technology areas include Physics.
When was this patent published?
Publication date Apr 11, 2017 (B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 11 related publications on this page (citations in our corpus or others sharing the same primary CPC).