Very deep convolutional neural networks for end-to-end speech recognition

US11080599B2 · US · B2

Patent metadata
Field               Value
Publication number  US-11080599-B2
Application number  US-201916692538-A
Country             US
Kind code           B2
Filing date         Nov 22, 2019
Priority date       Oct 10, 2016
Publication date    Aug 3, 2021
Grant date          Aug 3, 2021

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

A speech recognition neural network system includes an encoder neural network and a decoder neural network. The encoder neural network generates an encoded sequence from an input acoustic sequence that represents an utterance. The input acoustic sequence includes a respective acoustic feature representation at each of a plurality of input time steps, the encoded sequence includes a respective encoded representation at each of a plurality of time reduced time steps, and the number of time reduced time steps is less than the number of input time steps. The encoder neural network includes a time reduction subnetwork, a convolutional LSTM subnetwork, and a network in network subnetwork. The decoder neural network receives the encoded sequence and processes the encoded sequence to generate, for each position in an output sequence order, a set of substring scores that includes a respective substring score for each substring in a set of substrings.
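The last sentence of the abstract describes the decoder output as one set of substring scores per output position. The following Python sketch illustrates how such score sets might be turned into text. It is a minimal illustration, not the patented method: the vocabulary `SUBSTRINGS`, the helper `one_hot_logits`, the softmax normalization, and the greedy (arg-max) selection are all assumptions made for the example. The abstract does not specify how a transcription is selected from the scores.

```python
import math

# Hypothetical substring vocabulary: lowercase letters plus a few punctuation marks.
SUBSTRINGS = list("abcdefghijklmnopqrstuvwxyz") + [" ", ",", ".", "'"]


def softmax(logits):
    """Normalize raw scores into a probability distribution."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def greedy_transcription(score_sets):
    """Pick the highest-scoring substring at each output position (assumed strategy)."""
    pieces = []
    for scores in score_sets:
        best = max(range(len(scores)), key=scores.__getitem__)
        pieces.append(SUBSTRINGS[best])
    return "".join(pieces)


def one_hot_logits(ch, peak=5.0):
    """Toy logits that favour a single substring; stands in for real decoder output."""
    return [peak if s == ch else 0.0 for s in SUBSTRINGS]


# One score set per output position, each covering every substring in the vocabulary.
score_sets = [softmax(one_hot_logits(ch)) for ch in "hi, it's ok."]
transcript = greedy_transcription(score_sets)
print(transcript)
```

In practice a search procedure such as beam search is often used in place of the per-position arg-max shown here; the greedy choice simply keeps the sketch short.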

First claim

Opening claim text (preview).

What is claimed is:

1. A speech recognition neural network system implemented by one or more computers, comprising: an encoder neural network configured to generate an encoded sequence from an input acoustic sequence, the input acoustic sequence representing an utterance, the input acoustic sequence comprising a respective acoustic feature representation at each of a plurality of input time steps, the encoded sequence comprising a respective encoded representation at each of a plurality of time reduced time steps, the number of time reduced time steps being less than the number of input time steps, and the encoder neural network comprising: a time reduction subnetwork configured to process the input acoustic sequence to generate a sequence of reduced representations comprising a respective reduced representation at each of the plurality of time reduced time steps; a convolutional Long short-term memory (LSTM) subnetwork configured to, for each time reduced time step, process the reduced representation at the time reduced time step to generate a convolutional LSTM output for the time step; and a network in network subnetwork configured to, for each time reduced time step, process the convolutional LSTM output at the time reduced time step to generate the encoded representation for the time reduced time step; and a decoder neural network configured to receive the encoded sequence and process the encoded sequence to generate, for each position in an output sequence order, a set of substring scores that includes a respective substring score for each substring in a set of substrings.

2. The system of claim 1, wherein the convolutional LSTM subnetwork comprises a plurality of residual blocks stacked one after the other, and wherein each residual block comprises: a convolutional neural network layer and a convolutional LSTM neural network layer separated by at least a batch normalization layer.

3. The system of claim 2, wherein each residual block further comprises: a skip connection from an input to the residual block to an output of the convolutional LSTM neural network layer.

4. The system of claim 1, wherein the network in network subnetwork comprises a plurality of LSTM layers.

5. The system of claim 4, wherein the network in network subnetwork comprises a respective convolutional layer that uses a 1×1 dimensional filter in between each pair of LSTM layers.

6. The system of claim 4, wherein each convolutional layer that uses a 1×1 dimensional filter is followed by a respective batch normalization layer.

7. The system of claim 1, further comprising: a decoder subsystem configured to generate a sequence of substrings from the substring scores that represents a transcription of the utterance.

8. The system of claim 1, wherein the time reduction subnetwork comprises: a first time reduction block comprising: a first depth concatenation layer configured to depth concatenate acoustic feature representations at multiple adjacent input time steps at predetermined intervals in the input acoustic sequence to generate a first sequence of concatenated representations; and a first time-reduction convolutional layer configured to process the first sequence of concatenated representations to generate a sequence of initial reduced representations comprising a respective initial reduced representation at each of a plurality of initial time reduced time steps; and a second time reduction block comprising: a second depth concatenation layer configured to depth concatenate initial reduced representations at multiple adjacent initial time reduced time steps at predetermined intervals in the initial reduced sequence to generate a second sequence of concatenated representations; and a second time-reduction convolutional layer configured to process the second sequence of concatenated representations to generate the sequence of reduced representations comprising a reduced representation at each of the plurality of time reduced time steps.

9. A method comprising: receiving an input acoustic sequence representing an utterance, the input acoustic sequence comprising a respective acoustic feature representation at each of a plurality of input time steps; and processing the input acoustic sequence using an encoder neural network to generate an encoded sequence comprising a respective encoded representation at each of a plurality of time reduced time steps, the number of time reduced time steps being less than the number of input time steps, wherein processing the input acoustic sequence using the encoder neural network comprises: processing, using a time reduction subnetwork of the encoder neural network, the input acoustic sequence to generate a sequence of reduced representations comprising a respective reduced representation at each of the plurality of time reduced time steps; for each time reduced time step, processing, using a convolutional Long short-term memory (LSTM) subnetwork of the encoder neural network, the reduced representation at the time reduced time step to generate a convolutional LSTM output for the time step; and for each time reduced time step, processing, using a network in network subnetwork of the encoder neural network, the convolutional LSTM output at the time reduced time step to generate the encoded representation for the time reduced time step; and processing, using a decoder neural network, the encoded sequence to generate, for each position in an output sequence order, a set of substring scores that includes a respective substring score for each substring in a set of substrings.

10. The method of claim 9, further comprising: generating a sequence of substrings from the substring scores that represents a transcription of the utterance.

11. The method of claim 9, wherein processing, using the time reduction subnetwork of the encoder neural network, the input acoustic sequence to generate the sequence of reduced representations comprising a respective reduced representation at each of the plurality of time reduced time steps comprises: depth concatenating acoustic feature representations at multiple adjacent input time steps at predetermined intervals in the input acoustic sequence to generate a first sequence of concatenated representations; processing the first sequence of concatenated representations to generate a sequence of initial reduced representations comprising a respective initial reduced representation at each of a plurality of initial time reduced time steps; depth concatenating initial reduced representations at multiple adjacent initial time reduced time steps at predetermined intervals in the initial reduced sequence to generate a second sequence of concatenated representations; and processing the second sequence of concatenated representations to generate the sequence of reduced representations comprising a reduced representation at each of the plurality of time reduced time steps.

12. The method of claim 9, wherein a respective substring score for each substring defines a likelihood that the substring represents a correct transcription of the utterance represented by the input acoustic sequence.

13. The method of claim 9, wherein the set of substrings includes a set of alphabetic letters corresponding to one or more natural languages.

14. The method of claim 9, wherein the set of substrings includes one or more of a space character, a comma character, a period character, or an apostrophe character.

15. The method of claim 9, wherein the set of substrings includes word pieces.

16. The method of claim 9, wherein th
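Claims 8 and 11 describe a time reduction subnetwork built from two blocks, each of which depth concatenates adjacent time steps and then applies a time-reduction convolutional layer. The Python sketch below illustrates only how the sequence length shrinks across the two blocks. Several details are assumptions made for the example: the group size of 2, the dropping of leftover frames, and the function `toy_layer`, which is a simple chunk-averaging stand-in and not a real convolution. The claims refer only to "multiple adjacent" time steps at "predetermined intervals".

```python
def depth_concatenate(frames, group=2):
    """Join each run of `group` adjacent frames along the feature (depth) axis.

    Trailing frames that do not fill a whole group are dropped (an assumption).
    """
    usable = len(frames) - len(frames) % group
    return [
        [value for frame in frames[i:i + group] for value in frame]
        for i in range(0, usable, group)
    ]


def toy_layer(frames, out_dim):
    """Stand-in for the time-reduction convolutional layer (hypothetical).

    Averages equal-sized chunks of each frame down to `out_dim` features;
    it keeps the number of time steps unchanged.
    """
    reduced = []
    for frame in frames:
        chunk = len(frame) // out_dim
        reduced.append(
            [sum(frame[j * chunk:(j + 1) * chunk]) / chunk for j in range(out_dim)]
        )
    return reduced


def time_reduction(frames, feature_dim, group=2):
    """Apply two time reduction blocks in sequence, as in claims 8 and 11."""
    initial = toy_layer(depth_concatenate(frames, group), feature_dim)
    return toy_layer(depth_concatenate(initial, group), feature_dim)


# Eight input time steps with three acoustic features each (toy values).
acoustic = [[float(t)] * 3 for t in range(8)]
reduced = time_reduction(acoustic, feature_dim=3)
print(len(acoustic), "->", len(reduced))
```

With a group size of 2 in both blocks, the sequence is shortened by a factor of 4 overall, which is one way to satisfy the requirement that the number of time reduced time steps be less than the number of input time steps.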

Assignees

Inventors

Classifications

  • G10L15/16 (Primary)

    using artificial neural networks · CPC title

  • Combinations of networks · CPC title

  • Recurrent networks, e.g. Hopfield networks · CPC title

  • characterised by memory or gating, e.g. long short-term memory [LSTM] or gated recurrent units [GRU] · CPC title

  • Convolutional networks [CNN, ConvNet] · CPC title

Patent family

Related publications grouped by family.

External sources

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US11080599B2 cover?
A speech recognition neural network system includes an encoder neural network and a decoder neural network. The encoder neural network generates an encoded sequence from an input acoustic sequence that represents an utterance. The input acoustic sequence includes a respective acoustic feature representation at each of a plurality of input time steps, the encoded sequence includes a respective e…
Who is the assignee on this patent?
Google LLC
What technology area does this patent fall under?
Primary CPC classification G10L15/16. Mapped technology areas include Physics.
When was this patent published?
Publication date Aug 3, 2021 (kind code B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 8 related publications on this page (citations in our corpus or others sharing the same primary CPC).