Voiceprint authentication method and apparatus
US-2016372121-A1 · Dec 22, 2016 · US
US9818410B2 · US · B2
| Field | Value |
|---|---|
| Publication number | US-9818410-B2 |
| Application number | US-201514983315-A |
| Country | US |
| Kind code | B2 |
| Filing date | Dec 29, 2015 |
| Priority date | Jun 19, 2015 |
| Publication date | Nov 14, 2017 |
| Grant date | Nov 14, 2017 |
Official abstract text for this publication.
Methods, systems, and apparatus, including computer programs encoded on computer storage media for learning pronunciations from acoustic sequences. One method includes receiving an acoustic sequence, the acoustic sequence representing an utterance, and the acoustic sequence comprising a sequence of multiple frames of acoustic data at each of a plurality of time steps; stacking one or more frames of acoustic data to generate a sequence of modified frames of acoustic data; processing the sequence of modified frames of acoustic data through an acoustic modeling neural network comprising one or more recurrent neural network (RNN) layers and a final CTC output layer to generate a neural network output, wherein processing the sequence of modified frames of acoustic data comprises: subsampling the modified frames of acoustic data; and processing each subsampled modified frame of acoustic data through the acoustic modeling neural network.
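The stacking and subsampling steps in the abstract can be sketched as follows. This is a minimal illustration, not the patented implementation: the names `stack_frames`, `subsample_frames`, `stack_size`, and `keep_every` are hypothetical, and the sliding-window reading of "stacking" is an assumption, since the abstract fixes neither a stack size nor a subsampling rate.

```python
from typing import List, Sequence

Frame = List[float]


def stack_frames(frames: Sequence[Frame], stack_size: int) -> List[Frame]:
    """Concatenate each frame with the following stack_size - 1 frames.

    Assumption: a sliding window with step 1. The result has
    len(frames) - stack_size + 1 entries, so it is shorter than the
    input whenever stack_size > 1.
    """
    if stack_size < 1:
        raise ValueError("stack_size must be at least 1")
    stacked: List[Frame] = []
    for start in range(len(frames) - stack_size + 1):
        merged: Frame = []
        for frame in frames[start:start + stack_size]:
            merged.extend(frame)
        stacked.append(merged)
    return stacked


def subsample_frames(frames: Sequence[Frame], keep_every: int) -> List[Frame]:
    """Keep every keep_every-th modified frame and drop the rest (decimation)."""
    if keep_every < 1:
        raise ValueError("keep_every must be at least 1")
    return list(frames[::keep_every])


# Ten one-dimensional toy frames stand in for real acoustic features.
toy_frames = [[float(i)] for i in range(10)]
modified = stack_frames(toy_frames, stack_size=3)
subsampled = subsample_frames(modified, keep_every=3)
print(len(toy_frames), len(modified), len(subsampled))
```

With a stack size of 3 and a keep rate of 3, each retained frame covers a distinct window of the input, so the network would run on roughly one third as many time steps without discarding the acoustic data in between.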
Opening claim text (preview).
What is claimed is:

1. A method comprising: receiving, by one or more computers, an acoustic sequence, the acoustic sequence representing an utterance, and the acoustic sequence comprising a sequence of multiple frames of acoustic data at each of a plurality of time steps; stacking, by the one or more computers, one or more frames of acoustic data to generate a sequence of modified frames of acoustic data; subsampling, by the one or more computers, the sequence of modified frames of acoustic data to generate a sequence of subsampled modified frames by removing one or more modified frames from the sequence of modified frames; and processing, by the one or more computers, the sequence of subsampled modified frames of acoustic data through an acoustic modeling neural network comprising one or more recurrent neural network (RNN) layers and a final connectionist temporal classification (CTC) output layer, wherein the acoustic modeling neural network is configured to, for each subsampled modified frame: process the subsampled modified frame through the one or more RNN layers to generate a recurrent output, and process the recurrent output through the final CTC output layer to generate a set of scores for the subsampled modified frame, the set of scores for the subsampled modified frame comprising (i) a respective score for each of a plurality of vocabulary phonemes and (ii) a score for a blank character, the score for each vocabulary phoneme representing a respective likelihood that the vocabulary phoneme represents the utterance at the subsampled modified frame and the score for the blank character representing a likelihood that the utterance at the subsampled modified frame is incomplete.

2. The method of claim 1, further comprising, for each subsampled modified frame: providing an output derived from the neural network output for the subsampled modified frame to a decoder for speech decoding of the utterance.

3. The method of claim 2, wherein providing the output derived from the neural network output comprises scaling the blank character score for the subsampled modified frame, wherein scaling the blank character score comprises adding a negative logarithm of a constant scalar to the blank character score.

4. The method of claim 1, further comprising, for each subsampled modified frame: determining whether the score for the blank character for the subsampled modified frame exceeds a threshold value; when the score for the subsampled modified frame does not exceed the threshold value, providing an output derived from the neural network output for the subsampled modified frame to a decoder for use in speech decoding of the utterance, and when the score for the subsampled modified frame exceeds the threshold value, causing the decoder to skip the subsampled modified frame when speech decoding the utterance.

5. The method of claim 1, further comprising, for each subsampled modified frame: determining whether the score for the blank character for the subsampled modified frame exceeds a threshold value; when the score for the subsampled modified frame does not exceed the threshold value, providing an output derived from the neural network output for the subsampled modified frame to a decoder for use in speech decoding of the utterance, and when the score for the subsampled modified frame exceeds the threshold value, causing the decoder to transition into a blank state instead of using the output derived from the neural network output for the subsampled modified frame in speech decoding of the utterance.

6. The method of claim 5, wherein the blank state is a state of the decoder that predicts with certainty that the utterance represented by the subsampled modified frame of acoustic data is incomplete.

7. The method of claim 5, further comprising, for each subsampled modified frame: when the score for the subsampled modified frame exceeds the threshold value and when the decoder is already in the blank state, causing the decoder to skip the subsampled modified frame when speech decoding the utterance.

8. The method of claim 1, wherein stacking one or more frames of acoustic data to generate a sequence of modified frames of acoustic data comprises sequentially concatenating pluralities of frames of acoustic data to generate one or more modified frames of acoustic data.

9. The method of claim 1, wherein the sequence of modified frames of acoustic data is shorter than the sequence of frames of acoustic data.

10. The method of claim 1, wherein subsampling the modified frames of acoustic data comprises decimating one or more frames of acoustic data.

11. The method of claim 1, wherein the neural network has been trained for speech decoding using state-level minimum Bayes risk (sMBR) sequence discriminative training criterion.

12. The method of claim 1, wherein the RNN layers are Long Short-Term Memory (LSTM) neural network layers.

13. A system comprising one or more computers and one or more storage devices storing instructions that are operable, when executed by the one or more computers, to cause the one or more computers to perform operations comprising: receiving an acoustic sequence, the acoustic sequence representing an utterance, and the acoustic sequence comprising a sequence of multiple frames of acoustic data at each of a plurality of time steps; stacking one or more frames of acoustic data to generate a sequence of modified frames of acoustic data; subsampling the sequence of modified frames of acoustic data to generate a sequence of subsampled modified frames by removing one or more modified frames from the sequence of modified frames; and processing the sequence of subsampled modified frames of acoustic data through an acoustic modeling neural network comprising one or more recurrent neural network (RNN) layers and a final connectionist temporal classification (CTC) output layer, wherein the acoustic modeling neural network is configured to, for each subsampled modified frame: process the subsampled modified frame through the one or more RNN layers to generate a recurrent output, and process the recurrent output through the final CTC output layer to generate a set of scores for the subsampled modified frame, the set of scores for the subsampled modified frame comprising (i) a respective score for each of a plurality of vocabulary phonemes and (ii) a score for a blank character, the score for each vocabulary phoneme representing a respective likelihood that the vocabulary phoneme represents the utterance at the subsampled modified frame and the score for the blank character representing a likelihood that the utterance at the subsampled modified frame is incomplete.

14. The system of claim 13, the operations further comprising, for each subsampled modified frame: providing an output derived from the neural network output for the subsampled modified frame to a decoder for speech decoding of the utterance.

15. The system of claim 14, wherein providing the output derived from the neural network output comprises scaling the blank character score for the subsampled modified frame, wherein scaling the blank character score comprises adding a negative logarithm of a constant scalar to the blank character score.

16. The system of claim 13, the operations further comprising, for each subsampled modified frame: determining whether the score for the blank character for the subsampled modified frame exceeds a threshold value; when the score for the subsampled modified frame does not exceed the threshold value, providing an output derived from the neural network output for the
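The blank-score handling in claims 3, 5, and 7 can be sketched as below. This is an illustrative reading only: the function names, the action labels `"decode"`, `"enter_blank"`, and `"skip"`, and the rule that the decoder leaves the blank state on the next frame it decodes are all assumptions not stated in the claims. The score domain (log or probability) and the threshold value are likewise unspecified in the text.

```python
import math
from typing import List


def scale_blank_score(blank_score: float, scalar: float) -> float:
    """Claim 3: add the negative logarithm of a constant scalar to the blank score."""
    if scalar <= 0:
        raise ValueError("scalar must be positive")
    return blank_score + (-math.log(scalar))


def plan_decoder_actions(blank_scores: List[float], threshold: float) -> List[str]:
    """Choose one decoder action per subsampled frame from its blank score.

    Claim 5: a blank score above the threshold moves the decoder into a
    blank state instead of passing the frame's output to it.
    Claim 7: if the decoder is already in the blank state, the frame is skipped.
    Assumption: decoding any frame takes the decoder out of the blank state.
    """
    actions: List[str] = []
    in_blank_state = False
    for score in blank_scores:
        if score > threshold:
            if in_blank_state:
                actions.append("skip")
            else:
                actions.append("enter_blank")
                in_blank_state = True
        else:
            actions.append("decode")
            in_blank_state = False
    return actions


# Hypothetical blank scores for four consecutive frames.
print(plan_decoder_actions([0.1, 0.95, 0.97, 0.2], threshold=0.9))
```

Under these assumptions, a run of high-blank frames costs the decoder a single state transition followed by skips, which is the source of the decoding savings the claims describe.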
Recurrent networks, e.g. Hopfield networks · CPC title
characterised by memory or gating, e.g. long short-term memory [LSTM] or gated recurrent units [GRU] · CPC title
Supervised learning · CPC title
Use of phonemic categorisation or speech recognition prior to speaker recognition or verification · CPC title
Feature extraction for speech recognition; Selection of recognition unit · CPC title