US9786270B2 · US · B2
| Field | Value |
|---|---|
| Publication number | US-9786270-B2 |
| Application number | US-201615205263-A |
| Country | US |
| Kind code | B2 |
| Filing date | Jul 8, 2016 |
| Priority date | Jul 9, 2015 |
| Publication date | Oct 10, 2017 |
| Grant date | Oct 10, 2017 |
Official abstract text for this publication.
Methods, systems, and apparatus, including computer programs encoded on a computer storage medium, for generating acoustic models. In some implementations, a first neural network trained as an acoustic model using the connectionist temporal classification algorithm is obtained. Output distributions from the first neural network are obtained for an utterance. A second neural network is trained as an acoustic model using the output distributions produced by the first neural network as output targets for the second neural network. An automated speech recognizer configured to use the trained second neural network is provided.
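The training step described in the abstract, where the second network learns from the first network's output distributions, can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration and is not taken from the patent: the "first network" is stood in for by a fixed random linear-softmax model rather than a CTC-trained network, the second network is a single softmax layer, and the names `train_student`, `soft_target_cross_entropy`, and the toy dimensions are hypothetical.

```python
import numpy as np

def softmax(z):
    # Numerically stable softmax over the last axis.
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def soft_target_cross_entropy(student_probs, teacher_probs, eps=1e-12):
    # Cross-entropy where the targets are full distributions, not one-hot labels.
    return float(-(teacher_probs * np.log(student_probs + eps)).sum(axis=-1).mean())

def train_student(frames, teacher_probs, steps=300, lr=0.2, seed=0):
    # Hypothetical second network: one linear layer followed by softmax.
    rng = np.random.default_rng(seed)
    n_feat, n_units = frames.shape[1], teacher_probs.shape[1]
    W = rng.normal(scale=0.01, size=(n_feat, n_units))
    b = np.zeros(n_units)
    losses = []
    for _ in range(steps):
        probs = softmax(frames @ W + b)
        losses.append(soft_target_cross_entropy(probs, teacher_probs))
        # Gradient of the soft-target cross-entropy with respect to the logits.
        grad_logits = (probs - teacher_probs) / len(frames)
        W -= lr * (frames.T @ grad_logits)
        b -= lr * grad_logits.sum(axis=0)
    return W, b, losses

# Toy data: 50 input vectors of 8 features, 4 phonetic units (all assumed sizes).
rng = np.random.default_rng(1)
frames = rng.normal(size=(50, 8))
W_teacher = rng.normal(size=(8, 4))           # stand-in for the first network
teacher_probs = softmax(frames @ W_teacher)   # its per-frame output distributions

W, b, losses = train_student(frames, teacher_probs)
student_probs = softmax(frames @ W + b)
```

In this toy setting the loss falls as the second network's per-frame distributions move toward those of the first network, which is the behavior claim 11 describes as "at least approximate the output distributions from the first neural network".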
Opening claim text (preview).
What is claimed is:
1. A method performed by one or more computers, the method comprising: obtaining, by the one or more computers, a first neural network trained as an acoustic model using connectionist temporal classification; obtaining, by the one or more computers, output distributions from the first neural network for an utterance, the output distributions comprising scores indicating likelihoods corresponding to different phonetic units; training, by the one or more computers, a second neural network as an acoustic model using the output distributions produced by the first neural network as output targets for the second neural network; and providing, by the one or more computers, an automated speech recognizer configured to use the trained second neural network to generate transcriptions for utterances.
2. The method of claim 1, wherein providing an automated speech recognizer comprises: receiving audio data for an utterance; generating a transcription for the audio data using the trained second neural network; and providing the generated transcription for display.
3. The method of claim 1, wherein providing an automated speech recognizer comprises providing the trained second neural network to another device for the performance of speech recognition by the other device.
4. The method of claim 1, wherein the output distributions from the first neural network for the utterance are obtained using a first set of audio data for the utterance, and the second neural network is trained using a second set of audio data for the utterance, the second set of audio data having increased noise compared to the first set of training data.
5. The method of claim 1, wherein training the second neural network as an acoustic model comprises: obtaining audio data for the utterance; adding noise to the audio data for the utterance to generate an altered version of the audio data; generating a sequence of input vectors based on the altered version of the audio data; and training the second neural network using output distributions produced by the first neural network as output targets corresponding to the sequence of input vectors generated based on the altered version of the audio data.
6. The method of claim 1, wherein training the second neural network comprises training the second neural network with a loss function that uses two or more different output targets.
7. The method of claim 6, wherein training the second neural network using the loss function comprises training the second neural network using a loss function that is a weighted combination of the two or more loss functions.
8. The method of claim 7, wherein the weighted combination is a combination of (i) a first loss function that constrains the alignment of inputs and outputs, and (ii) a second loss function that does not constrain the alignment of inputs and outputs.
9. The method of claim 7, wherein the two or more loss functions include at least two of a Baum-Welch loss function, a connectionist temporal classification loss function, and a Viterbi alignment loss function.
10. The method of claim 1, wherein the second neural network has fewer parameters than the first neural network.
11. The method of claim 1, wherein training the second neural network comprises training the second neural network to provide output distributions for the utterance that at least approximate the output distributions from the first neural network for the utterance.
12. A system comprising: one or more computers and one or more storage devices storing instructions that are operable, when executed by the one or more computers, to cause the one or more computers to perform operations comprising: obtaining, by the one or more computers, a first neural network trained as an acoustic model using connectionist temporal classification; obtaining, by the one or more computers, output distributions from the first neural network for an utterance, the output distributions comprising scores indicating likelihoods corresponding to different phonetic units; training, by the one or more computers, a second neural network as an acoustic model using the output distributions produced by the first neural network as output targets for the second neural network; and providing, by the one or more computers, an automated speech recognizer configured to use the trained second neural network to generate transcriptions for utterances.
13. The system of claim 12, wherein providing an automated speech recognizer comprises: receiving audio data for an utterance; generating a transcription for the audio data using the trained second neural network; and providing the generated transcription for display.
14. The system of claim 12, wherein providing an automated speech recognizer comprises providing the trained second neural network to another device for the performance of speech recognition by the other device.
15. The system of claim 12, wherein the output distributions from the first neural network for the utterance are obtained using a first set of audio data for the utterance, and the second neural network is trained using a second set of audio data for the utterance, the second set of audio data having increased noise compared to the first set of audio data.
16. The system of claim 12, wherein training the second neural network as an acoustic model comprises: obtaining audio data for the utterance; adding noise to the audio data for the utterance to generate an altered version of the audio data; generating a sequence of input vectors based on the altered version of the audio data; and training the second neural network using output distributions produced by the first neural network as output targets corresponding to the sequence of input vectors generated based on the altered version of the audio data.
17. One or more non-transitory computer-readable storage media storing with a computer program, the program comprising instructions that when executed by one or more computers cause the one or more computers to perform operations comprising: obtaining, by the one or more computers, a first neural network trained as an acoustic model using connectionist temporal classification; obtaining, by the one or more computers, output distributions from the first neural network for an utterance, the output distributions comprising scores indicating likelihoods corresponding to different phonetic units; training, by the one or more computers, a second neural network as an acoustic model using the output distributions produced by the first neural network as output targets for the second neural network; and providing, by the one or more computers, an automated speech recognizer configured to use the trained second neural network to generate transcriptions for utterances.
18. The one or more non-transitory computer-readable storage media of claim 17, wherein providing an automated speech recognizer comprises: receiving audio data for an utterance; generating a transcription for the audio data using the trained second neural network; and providing the generated transcription for display.
19. The one or more non-transitory computer-readable storage media of claim 17, wherein providing an automated speech recognizer comprises providing the trained second neural network to another device for the performance of speech recognition by the other device.
20. The one or more non-transitory computer-readable storage media of claim 17, wherein the output distributions from the first n
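Two of the dependent claims describe concrete data and loss handling: claims 5 and 16 add noise to the audio before generating the second network's input vectors, and claim 7 combines two or more loss functions with weights. The sketch below shows one way these steps could look. It rests on assumptions not stated in the claims: the noise is white Gaussian noise at a chosen signal-to-noise ratio, the "input vectors" are raw fixed-length sample windows rather than real acoustic features, and `add_noise`, `frame_audio`, and `combined_loss` are hypothetical helper names.

```python
import numpy as np

def add_noise(audio, snr_db, rng):
    # Assumed noise model: white Gaussian noise scaled to a target SNR in dB.
    signal_power = np.mean(audio ** 2)
    noise_power = signal_power / (10 ** (snr_db / 10))
    noise = rng.normal(scale=np.sqrt(noise_power), size=audio.shape)
    return audio + noise

def frame_audio(audio, frame_len, hop):
    # Cut the waveform into overlapping windows, one input vector per window.
    n_frames = 1 + (len(audio) - frame_len) // hop
    return np.stack([audio[i * hop : i * hop + frame_len] for i in range(n_frames)])

def combined_loss(losses, weights):
    # Weighted combination of individual loss values.
    return float(np.dot(weights, losses))

rng = np.random.default_rng(0)
t = np.arange(1600) / 16000.0
clean = np.sin(2 * np.pi * 440.0 * t)          # toy "utterance"
noisy = add_noise(clean, snr_db=10.0, rng=rng) # altered version of the audio

clean_frames = frame_audio(clean, frame_len=400, hop=160)
noisy_frames = frame_audio(noisy, frame_len=400, hop=160)

# Placeholder loss values standing in for two different loss functions.
total = combined_loss([2.0, 4.0], [0.25, 0.75])
```

Because the clean and noisy versions are framed identically, output distributions computed by the first network on `clean_frames` line up one-to-one with `noisy_frames`, so they can serve as targets for the second network on the noisier input.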
using artificial neural networks · CPC title
Phonemic context, e.g. pronunciation rules, phonotactical constraints or phoneme n-grams · CPC title
Training · CPC title