Automatic speech recognition using multi-dimensional models

US9984683B2 · US · B2

Patent metadata
Publication number: US-9984683-B2
Application number: US-201615217457-A
Country: US
Kind code: B2
Filing date: Jul 22, 2016
Priority date: Jul 22, 2016
Publication date: May 29, 2018
Grant date: May 29, 2018

Abstract

Official abstract text for this publication.

Methods, systems, and apparatus, including computer programs encoded on a computer storage medium, for automatic speech recognition using multi-dimensional models. In some implementations, audio data that describes an utterance is received. A transcription for the utterance is determined using an acoustic model that includes a neural network having first memory blocks for time information and second memory blocks for frequency information. The transcription for the utterance is provided as output of an automated speech recognizer.

First claim

Opening claim text (preview).

What is claimed is:

1. A method performed by one or more computers, the method comprising: receiving, by the one or more computers, audio data that describes an utterance; processing the audio data using a neural network that has been trained as an acoustic model, wherein the processing comprises: providing, as input to the neural network, input vectors having values describing the utterance, the values including values representing audio waveform features, wherein the audio waveform features are determined using a filterbank having parameters trained jointly with weights of the neural network, wherein the neural network has first memory blocks for time information and second memory blocks for frequency information, the first memory blocks being different from the second memory blocks; wherein the first memory blocks are time-LSTM blocks that each have a state, and wherein the second memory blocks are frequency-LSTM blocks that each have a state and a corresponding frequency step in a sequence of multiple frequency steps, wherein the states are determined for each of a sequence of multiple time steps; wherein, for each of at least some of the frequency-LSTM blocks, the frequency-LSTM block determines its state using the state of the time-LSTM block corresponding to the same frequency step at the previous time step; and wherein, for each of at least some of the time-LSTM blocks, the time-LSTM block determines its state using the state of the frequency-LSTM block corresponding to the same time step and the previous frequency step; receiving, as output of the neural network, one or more scores that each indicate a likelihood that a respective phonetic unit represents a portion of the utterance; determining, by the one or more computers, a transcription for the utterance based on the one or more scores; and providing, by the one or more computers, the determined transcription as output of an automated speech recognizer.

2. The method of claim 1, wherein receiving the audio data comprises receiving, over a network, audio data generated by a client device; and wherein providing the transcription comprises providing, over the network, the transcription to the client device.

3. The method of claim 1, wherein the neural network comprises a grid-LSTM module, a linear projection layer, one or more LSTM layers, and a deep neural network (DNN); wherein the grid-LSTM module includes the first memory blocks and the second memory blocks, and the grid-LSTM module provides output to the linear projection layer; wherein the linear projection layer provides output to the one or more LSTM layers; and wherein the one or more LSTM layers provide output to the DNN.

4. The method of claim 1, wherein the neural network is configured to share information from each of the first memory blocks with a respective proper subset of the second memory blocks, and the neural network is configured to share information from each of the second memory blocks with a respective proper subset of the first memory blocks.

5. The method of claim 1, wherein, for at least some of the time-LSTM blocks, the time-LSTM block determines its state based on (i) input received for a current time step, (ii) the state of the time-LSTM block at a previous time step, and (iii) a state of exactly one of the frequency-LSTM blocks; and wherein, for at least some of the frequency-LSTM blocks, the frequency-LSTM block determines its state based on (i) input received for a current time step, (ii) the state of the frequency-LSTM block for a previous frequency step at the current time step, and (iii) a state of exactly one of the time-LSTM blocks.

6. The method of claim 1, wherein each of the time-LSTM blocks and the frequency-LSTM blocks has one or more weights, and wherein the weights for the time-LSTM blocks are independent of the weights for the frequency-LSTM blocks.

7. The method of claim 1, wherein each of the time-LSTM blocks and the frequency-LSTM blocks has one or more weights, and wherein at least some of the weights are shared between the time-LSTM blocks and the frequency-LSTM blocks.

8. The method of claim 1, wherein receiving the one or more scores comprises receiving, as output of the neural network, multiple outputs corresponding to different context-dependent states of phones, wherein each of the multiple outputs indicates a likelihood of occurrence for the corresponding context-dependent state.

9. The method of claim 1, wherein providing the input vectors comprises providing input vectors comprising values for log-mel features.

10. The method of claim 1, wherein providing the input vectors comprises providing values representing characteristics of multiple channels of audio describing the utterance.

11. The method of claim 1, wherein processing the audio data using the neural network comprises processing audio data representing characteristics of multiple channels of audio describing the utterance.

12. A system comprising: one or more computers and one or more storage devices storing instructions that are operable, when executed by the one or more computers, to cause the one or more computers to perform operations comprising: receiving, by the one or more computers, audio data that describes an utterance; processing the audio data using a neural network that has been trained as an acoustic model, wherein the processing comprises: providing, as input to the neural network, input vectors having values describing the utterance, the values including values representing audio waveform features, wherein the audio waveform features are determined using a filterbank having parameters trained jointly with weights of the neural network, wherein the neural network has first memory blocks for time information and second memory blocks for frequency information, the first memory blocks being different from the second memory blocks; wherein the first memory blocks are time-LSTM blocks that each have a state, and wherein the second memory blocks are frequency-LSTM blocks that each have a state and a corresponding frequency step in a sequence of multiple frequency steps, wherein the states are determined for each of a sequence of multiple time steps; wherein, for each of at least some of the frequency-LSTM blocks, the frequency-LSTM block determines its state using the state of the time-LSTM block corresponding to the same frequency step at the previous time step; and wherein, for each of at least some of the time-LSTM blocks, the time-LSTM block determines its state using the state of the frequency-LSTM block corresponding to the same time step and the previous frequency step; receiving, as output of the neural network, one or more scores that each indicate a likelihood that a respective phonetic unit represents a portion of the utterance; determining, by the one or more computers, a transcription for the utterance based on the one or more scores; and providing, by the one or more computers, the determined transcription as output of an automated speech recognizer.

13. The system of claim 12, wherein the neural network comprises a grid-LSTM module, a linear projection layer, one or more LSTM layers, and a deep neural network (DNN); wherein the grid-LSTM module includes the first memory blocks and the second memory blocks, and the grid-LSTM module provides output to the linear projection layer; wherein the linear projection layer provides output to the one or more LSTM layers; and wherein the one or more LSTM layers provide output to the DNN.

14. The system of claim 12, wherein processing the audio data using the neural network comprises processing audio data repres
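
The state dependencies recited in claim 1 (and spelled out in claim 5) can be illustrated with a small sketch. This is only a toy under stated assumptions, not the patented implementation: the names `ScalarLSTMCell` and `run_grid`, the scalar states, the random weights, the zero initial states at the grid boundary, and the reuse of one set of weights at every grid position are all hypothetical choices made for illustration. The only part taken from the claim text is which neighbouring state each block reads.

```python
import math
import random


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


class ScalarLSTMCell:
    """Toy LSTM block with a scalar state and two recurrent inputs (assumption)."""

    def __init__(self, rng: random.Random) -> None:
        # One weight row per gate: input, forget, candidate, output.
        # Each row weighs (x, own previous output, other block's output, bias).
        self.w = [[rng.uniform(-0.5, 0.5) for _ in range(4)] for _ in range(4)]

    def step(self, x, own_prev, other_prev):
        """own_prev and other_prev are (output, cell) pairs; returns the new pair."""
        feats = (x, own_prev[0], other_prev[0], 1.0)
        pre = [sum(w * f for w, f in zip(row, feats)) for row in self.w]
        i, f, o = sigmoid(pre[0]), sigmoid(pre[1]), sigmoid(pre[3])
        cell = f * own_prev[1] + i * math.tanh(pre[2])
        return (o * math.tanh(cell), cell)


def run_grid(features, seed=0):
    """features[t][k] is a scalar feature at time step t and frequency step k."""
    rng = random.Random(seed)
    # Separate weights for the two block types (the option described in claim 6).
    time_cell, freq_cell = ScalarLSTMCell(rng), ScalarLSTMCell(rng)
    num_t, num_k = len(features), len(features[0])
    zero = (0.0, 0.0)  # assumed initial state outside the grid
    time_state = [[zero] * num_k for _ in range(num_t)]
    freq_state = [[zero] * num_k for _ in range(num_t)]
    for t in range(num_t):
        for k in range(num_k):
            # Time-LSTM state: same frequency step, previous time step.
            t_prev = time_state[t - 1][k] if t > 0 else zero
            # Frequency-LSTM state: same time step, previous frequency step.
            f_prev = freq_state[t][k - 1] if k > 0 else zero
            x = features[t][k]
            # The time block reads its own past plus the frequency block's state.
            time_state[t][k] = time_cell.step(x, t_prev, f_prev)
            # The frequency block reads its own past plus the time block's state.
            freq_state[t][k] = freq_cell.step(x, f_prev, t_prev)
    return time_state, freq_state
```

Calling `run_grid` on a small grid of values, for example three time steps by four frequency steps, returns two grids of `(output, cell)` pairs, one per block type. Because each position reads only the previous time step and the previous frequency step, a change to a later feature cannot alter an earlier state.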

Assignees

Google LLC

Classifications

  • Recurrent networks, e.g. Hopfield networks · CPC title

  • Combinations of networks · CPC title

  • characterised by memory or gating, e.g. long short-term memory [LSTM] or gated recurrent units [GRU] · CPC title

  • Supervised learning · CPC title

  • G10L15/16 (Primary): using artificial neural networks · CPC title

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US9984683B2 cover?
Methods, systems, and apparatus for automatic speech recognition using multi-dimensional models. Audio data describing an utterance is received, and a transcription is determined using an acoustic model that includes a neural network with first memory blocks for time information and second memory blocks for frequency information.
Who is the assignee on this patent?
Google LLC
What technology area does this patent fall under?
Primary CPC classification G10L15/16. Mapped technology areas include Physics.
When was this patent published?
Publication date: May 29, 2018 (kind code B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 12 related publications on this page (citations in our corpus or others sharing the same primary CPC).