Speech recognition with acoustic models

US9818410B2 · US · B2

Patent metadata
Field: Value
Publication number: US-9818410-B2
Application number: US-201514983315-A
Country: US
Kind code: B2
Filing date: Dec 29, 2015
Priority date: Jun 19, 2015
Publication date: Nov 14, 2017
Grant date: Nov 14, 2017

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

Methods, systems, and apparatus, including computer programs encoded on computer storage media for learning pronunciations from acoustic sequences. One method includes receiving an acoustic sequence, the acoustic sequence representing an utterance, and the acoustic sequence comprising a sequence of multiple frames of acoustic data at each of a plurality of time steps; stacking one or more frames of acoustic data to generate a sequence of modified frames of acoustic data; processing the sequence of modified frames of acoustic data through an acoustic modeling neural network comprising one or more recurrent neural network (RNN) layers and a final CTC output layer to generate a neural network output, wherein processing the sequence of modified frames of acoustic data comprises: subsampling the modified frames of acoustic data; and processing each subsampled modified frame of acoustic data through the acoustic modeling neural network.
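
The stacking and subsampling steps named in the abstract can be illustrated with a minimal sketch. It assumes an overlapping sliding window for stacking and a fixed decimation step for subsampling; the names `stack_frames`, `subsample`, `stack_size`, and `step`, and the toy values used, are hypothetical and do not come from the patent text.

```python
from typing import List, Sequence

Frame = List[float]


def stack_frames(frames: Sequence[Frame], stack_size: int) -> List[Frame]:
    """Concatenate each frame with the next (stack_size - 1) frames.

    Assumption: an overlapping sliding window. Only full windows are kept,
    so the output has len(frames) - stack_size + 1 modified frames.
    """
    if stack_size < 1:
        raise ValueError("stack_size must be at least 1")
    stacked = []
    for start in range(len(frames) - stack_size + 1):
        window = frames[start:start + stack_size]
        # Flatten the window into one longer feature vector.
        stacked.append([value for frame in window for value in frame])
    return stacked


def subsample(frames: Sequence[Frame], step: int) -> List[Frame]:
    """Keep every step-th modified frame and remove the rest."""
    if step < 1:
        raise ValueError("step must be at least 1")
    return list(frames[::step])


# Toy input: 8 frames, each with 2 acoustic features (illustrative values).
acoustic = [[float(t), float(t) + 0.5] for t in range(8)]
modified = stack_frames(acoustic, stack_size=3)  # 6 modified frames, 6 features each
reduced = subsample(modified, step=3)            # keeps the windows starting at t=0 and t=3
```

In this sketch the network would see 2 inputs instead of 8, each covering a wider span of audio, which is the effect the abstract describes: fewer, richer frames passed to the RNN layers.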

First claim

Opening claim text (preview).

What is claimed is:

1. A method comprising: receiving, by one or more computers, an acoustic sequence, the acoustic sequence representing an utterance, and the acoustic sequence comprising a sequence of multiple frames of acoustic data at each of a plurality of time steps; stacking, by the one or more computers, one or more frames of acoustic data to generate a sequence of modified frames of acoustic data; subsampling, by the one or more computers, the sequence of modified frames of acoustic data to generate a sequence of subsampled modified frames by removing one or more modified frames from the sequence of modified frames; and processing, by the one or more computers, the sequence of subsampled modified frames of acoustic data through an acoustic modeling neural network comprising one or more recurrent neural network (RNN) layers and a final connectionist temporal classification (CTC) output layer, wherein the acoustic modeling neural network is configured to, for each subsampled modified frame: process the subsampled modified frame through the one or more RNN layers to generate a recurrent output, and process the recurrent output through the final CTC output layer to generate a set of scores for the subsampled modified frame, the set of scores for the subsampled modified frame comprising (i) a respective score for each of a plurality of vocabulary phonemes and (ii) a score for a blank character, the score for each vocabulary phoneme representing a respective likelihood that the vocabulary phoneme represents the utterance at the subsampled modified frame and the score for the blank character representing a likelihood that the utterance at the subsampled modified frame is incomplete.

2. The method of claim 1, further comprising, for each subsampled modified frame: providing an output derived from the neural network output for the subsampled modified frame to a decoder for speech decoding of the utterance.

3. The method of claim 2, wherein providing the output derived from the neural network output comprises scaling the blank character score for the subsampled modified frame, wherein scaling the blank character score comprises adding a negative logarithm of a constant scalar to the blank character score.

4. The method of claim 1, further comprising, for each subsampled modified frame: determining whether the score for the blank character for the subsampled modified frame exceeds a threshold value; when the score for the subsampled modified frame does not exceed the threshold value, providing an output derived from the neural network output for the subsampled modified frame to a decoder for use in speech decoding of the utterance, and when the score for the subsampled modified frame exceeds the threshold value, causing the decoder to skip the subsampled modified frame when speech decoding the utterance.

5. The method of claim 1, further comprising, for each subsampled modified frame: determining whether the score for the blank character for the subsampled modified frame exceeds a threshold value; when the score for the subsampled modified frame does not exceed the threshold value, providing an output derived from the neural network output for the subsampled modified frame to a decoder for use in speech decoding of the utterance, and when the score for the subsampled modified frame exceeds the threshold value, causing the decoder to transition into a blank state instead of using the output derived from the neural network output for the subsampled modified frame in speech decoding of the utterance.

6. The method of claim 5, wherein the blank state is a state of the decoder that predicts with certainty that the utterance represented by the subsampled modified frame of acoustic data is incomplete.

7. The method of claim 5, further comprising, for each subsampled modified frame: when the score for the subsampled modified frame exceeds the threshold value and when the decoder is already in the blank state, causing the decoder to skip the subsampled modified frame when speech decoding the utterance.

8. The method of claim 1, wherein stacking one or more frames of acoustic data to generate a sequence of modified frames of acoustic data comprises sequentially concatenating pluralities of frames of acoustic data to generate one or more modified frames of acoustic data.

9. The method of claim 1, wherein the sequence of modified frames of acoustic data is shorter than the sequence of frames of acoustic data.

10. The method of claim 1, wherein subsampling the modified frames of acoustic data comprises decimating one or more frames of acoustic data.

11. The method of claim 1, wherein the neural network has been trained for speech decoding using state-level minimum Bayes risk (sMBR) sequence discriminative training criterion.

12. The method of claim 1, wherein the RNN layers are Long Short-Term Memory (LSTM) neural network layers.

13. A system comprising one or more computers and one or more storage devices storing instructions that are operable, when executed by the one or more computers, to cause the one or more computers to perform operations comprising: receiving an acoustic sequence, the acoustic sequence representing an utterance, and the acoustic sequence comprising a sequence of multiple frames of acoustic data at each of a plurality of time steps; stacking one or more frames of acoustic data to generate a sequence of modified frames of acoustic data; subsampling the sequence of modified frames of acoustic data to generate a sequence of subsampled modified frames by removing one or more modified frames from the sequence of modified frames; and processing the sequence of subsampled modified frames of acoustic data through an acoustic modeling neural network comprising one or more recurrent neural network (RNN) layers and a final connectionist temporal classification (CTC) output layer, wherein the acoustic modeling neural network is configured to, for each subsampled modified frame: process the subsampled modified frame through the one or more RNN layers to generate a recurrent output, and process the recurrent output through the final CTC output layer to generate a set of scores for the subsampled modified frame, the set of scores for the subsampled modified frame comprising (i) a respective score for each of a plurality of vocabulary phonemes and (ii) a score for a blank character, the score for each vocabulary phoneme representing a respective likelihood that the vocabulary phoneme represents the utterance at the subsampled modified frame and the score for the blank character representing a likelihood that the utterance at the subsampled modified frame is incomplete.

14. The system of claim 13, the operations further comprising, for each subsampled modified frame: providing an output derived from the neural network output for the subsampled modified frame to a decoder for speech decoding of the utterance.

15. The system of claim 14, wherein providing the output derived from the neural network output comprises scaling the blank character score for the subsampled modified frame, wherein scaling the blank character score comprises adding a negative logarithm of a constant scalar to the blank character score.

16. The system of claim 13, the operations further comprising, for each subsampled modified frame: determining whether the score for the blank character for the subsampled modified frame exceeds a threshold value; when the score for the subsampled modified frame does not exceed the threshold value, providing an output derived from the neural network output for the
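
The blank-score handling in claims 3, 5, and 7 can be sketched as follows. This is an illustrative reading, not the patented implementation: the label `BLANK`, the function names, the action strings, and the toy scores and threshold are all hypothetical, and the rule that the decoder leaves the blank state once a frame is decoded normally is an assumption the claims do not spell out.

```python
import math
from typing import Dict, List, Sequence, Tuple

BLANK = "<blank>"  # hypothetical label for the CTC blank character


def scale_blank(scores: Dict[str, float], constant: float) -> Dict[str, float]:
    """Return a copy of the scores with -log(constant) added to the blank score."""
    scaled = dict(scores)
    scaled[BLANK] = scores[BLANK] + (-math.log(constant))
    return scaled


def route_frames(
    frame_scores: Sequence[Dict[str, float]], threshold: float
) -> List[Tuple[int, str]]:
    """Choose a decoder action for each subsampled modified frame.

    decode            : blank score does not exceed the threshold
    enter_blank_state : blank score exceeds it and the decoder is not in the blank state
    skip              : blank score exceeds it and the decoder is already in the blank state
    """
    actions = []
    in_blank_state = False
    for index, scores in enumerate(frame_scores):
        if scores[BLANK] > threshold:
            if in_blank_state:
                actions.append((index, "skip"))
            else:
                actions.append((index, "enter_blank_state"))
                in_blank_state = True
        else:
            actions.append((index, "decode"))
            in_blank_state = False  # assumption: decoding a frame leaves the blank state
    return actions


# Toy per-frame scores for one phoneme and the blank (illustrative values).
frames = [
    {"a": 0.80, BLANK: 0.20},
    {"a": 0.05, BLANK: 0.95},
    {"a": 0.03, BLANK: 0.97},
    {"a": 0.90, BLANK: 0.10},
]
actions = route_frames(frames, threshold=0.9)
rescaled = scale_blank({"a": 2.0, BLANK: 1.0}, constant=0.5)
```

With these toy values, the two consecutive blank-dominated frames cost the decoder one state transition and one skipped frame rather than two full decoding steps.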

Assignees

Inventors

Classifications

  • Recurrent networks, e.g. Hopfield networks · CPC title

  • characterised by memory or gating, e.g. long short-term memory [LSTM] or gated recurrent units [GRU] · CPC title

  • Supervised learning · CPC title

  • G10L17/14 · Primary

    Use of phonemic categorisation or speech recognition prior to speaker recognition or verification · CPC title

  • G10L15/02 · Primary

    Feature extraction for speech recognition; Selection of recognition unit · CPC title

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US9818410B2 cover?
Methods, systems, and apparatus, including computer programs encoded on computer storage media for learning pronunciations from acoustic sequences. One method includes receiving an acoustic sequence, the acoustic sequence representing an utterance, and the acoustic sequence comprising a sequence of multiple frames of acoustic data at each of a plurality of time steps; stacking one or more frame…
Who is the assignee on this patent?
Google Inc
What technology area does this patent fall under?
Primary CPC classification G10L17/14. Mapped technology areas include Physics.
When was this patent published?
Publication date: Nov 14, 2017 (B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 8 related publications on this page (citations in our corpus or others sharing the same primary CPC).