Acoustic model training corpus selection
US-2016093294-A1 · Mar 31, 2016 · US
US10229672B1 · US · B1
| Field | Value |
|---|---|
| Publication number | US-10229672-B1 |
| Application number | US-201715397327-A |
| Country | US |
| Kind code | B1 |
| Filing date | Jan 3, 2017 |
| Priority date | Dec 31, 2015 |
| Publication date | Mar 12, 2019 |
| Grant date | Mar 12, 2019 |
Official abstract text for this publication.
Methods, systems, and apparatus, including computer programs encoded on a computer storage medium, for training acoustic models and using the trained acoustic models. A connectionist temporal classification (CTC) acoustic model is accessed, the CTC acoustic model having been trained using a context-dependent state inventory generated from approximate phonetic alignments determined by another CTC acoustic model trained without fixed alignment targets. Audio data for a portion of an utterance is received. Input data corresponding to the received audio data is provided to the accessed CTC acoustic model. Data indicating a transcription for the utterance is generated based on output that the accessed CTC acoustic model produced in response to the input data. The data indicating the transcription is provided as output of an automated speech recognition service.
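To make the abstract's "context-dependent state inventory generated from approximate phonetic alignments" more concrete, here is a minimal sketch, not the patent's procedure. It assumes the first model's output has already been reduced to one best label per frame, collapses that into a phone sequence, and expands each phone into a triphone label. The function names, the `<b>` blank symbol, the `sil` boundary symbol, and the `left-center+right` notation are illustrative assumptions. Real systems typically also cluster triphones into a smaller set of tied states, which is omitted here.

```python
from collections import Counter
from typing import List

BLANK = "<b>"     # assumed blank symbol emitted by the first CTC model
BOUNDARY = "sil"  # assumed symbol for utterance boundaries


def collapse_alignment(frame_labels: List[str]) -> List[str]:
    """Merge repeated frame labels, then drop blanks (standard CTC collapse)."""
    phones = []
    previous = None
    for label in frame_labels:
        if label != previous and label != BLANK:
            phones.append(label)
        previous = label
    return phones


def to_triphones(phones: List[str]) -> List[str]:
    """Label each phone with its left and right neighbours."""
    padded = [BOUNDARY] + phones + [BOUNDARY]
    return [
        f"{padded[i - 1]}-{padded[i]}+{padded[i + 1]}"
        for i in range(1, len(padded) - 1)
    ]


# Hypothetical best-label-per-frame output for the word "cat".
frames = ["<b>", "k", "k", "<b>", "ae", "ae", "<b>", "t", "<b>"]
phones = collapse_alignment(frames)
inventory = Counter(to_triphones(phones))  # counts of each context-dependent label
print(phones)
print(sorted(inventory))
```

Counting the labels across a whole corpus would give a raw inventory from which the targets of a second model could be chosen. How the patent actually selects or ties those states is not shown by this sketch.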
Opening claim text (preview).
What is claimed is:
1. A method performed by one or more computers of a speech recognition system, the method comprising: training, by the one or more computers of the speech recognition system, a first connectionist temporal classification (CTC) acoustic model on first training data to generate, as unmodified outputs, second training data of context-dependent state inventory from approximate phonetic alignments, the first training data comprising context-independent phones generated without using any previously determined phonetic alignments; training, by the one or more computers of the speech recognition system, a second CTC acoustic model on the second training data to generate outputs corresponding to one or more context-dependent states; accessing, by the one or more computers of the speech recognition system, the second CTC acoustic model; receiving, by the one or more computers of the speech recognition system, audio data for a portion of an utterance; providing, by the one or more computers of the speech recognition system, input data corresponding to the received audio data as input to the accessed second CTC acoustic model that has been trained on the second training data; generating, by the one or more computers of the speech recognition system, data indicating a transcription for the utterance based on output that the accessed second CTC acoustic model produced in response to the input data corresponding to the received audio data; and providing, by the one or more computers of the speech recognition system, the data indicating the transcription as output of the automated speech recognition system.
2. The method of claim 1, wherein receiving the audio data comprises receiving the audio data from a client device over a computer network, and wherein providing the data indicating the transcription comprises providing the data indicating the transcription to the client device over the computer network.
3. The method of claim 1, wherein providing the data indicating the transcription comprises live streaming speech recognition results such that the data indicating the transcription is provided while the one or more computers concurrently receive audio data for an additional portion of the utterance.
4. The method of claim 1, wherein the accessed second CTC acoustic model is a unidirectional CTC acoustic model and the first CTC acoustic model is a bidirectional CTC acoustic model.
5. The method of claim 1, wherein the accessed second CTC acoustic model comprises a recurrent neural network trained using CTC techniques comprising one or more long short-term memory layers, and wherein the first CTC acoustic model comprises a recurrent neural network trained using CTC techniques comprising one or more long short-term memory layers.
6. The method of claim 1, wherein the second CTC acoustic model is configured to provide outputs that identify labels for triphones or to provide outputs that include scores for labels for triphones.
7. The method of claim 1, wherein the second CTC acoustic model is trained independent of non-CTC acoustic models.
8. The method of claim 1, wherein the second CTC acoustic model is trained to recognize multiple different pronunciations of a word using a pronunciation model that indicates multiple different phonetic sequences as valid pronunciations of the word.
9. The method of claim 1, wherein the second CTC acoustic model is trained to recognize multiple different verbalizations of a written word using a verbalization model that indicates multiple different spoken words as valid verbalizations of the written word.
10. The method of claim 1, wherein the accessed second CTC model is a recurrent neural network model that has been trained using a CTC loss function, and is configured to selectively indicate that a blank output label has a higher likelihood score than phoneme output labels in response to input data to the accessed second CTC model.
11. The method of claim 1, wherein the operations further comprise generating, as the output that the accessed second CTC acoustic model produced in response to the input data corresponding to the received audio data, output vectors that each indicate (i) likelihoods corresponding to different phonetic units and (ii) a likelihood corresponding to a blank label that does not represent a phonetic unit.
12. A speech recognition system comprising: one or more computers of the speech recognition system; and a non-transitory computer-readable medium coupled to the one or more computers having instructions stored thereon which, when executed by the one or more computers, cause the one or more computers to perform operations comprising: training a first connectionist temporal classification (CTC) acoustic model on first training data to generate, as unmodified outputs, second training data of context-dependent state inventory from approximate phonetic alignments, the first training data comprising context-independent phones generated without using any previously determined phonetic alignments; training a second CTC acoustic model on the second training data to generate outputs corresponding to one or more context-dependent states; accessing the second CTC acoustic model; receiving audio data for a portion of an utterance; providing input data corresponding to the received audio data as input to the accessed second CTC acoustic model that has been trained on the second training data; generating data indicating a transcription for the utterance based on output that the accessed second CTC acoustic model produced in response to the input data corresponding to the received audio data; and providing the data indicating the transcription as output of the automated speech recognition system.
13. The system of claim 12, wherein receiving the audio data comprises receiving the audio data from a client device over a computer network, and wherein providing the data indicating the transcription comprises providing the data indicating the transcription to the client device over the computer network.
14. The system of claim 12, wherein providing the data indicating the transcription comprises live streaming speech recognition results such that the data indicating the transcription is provided while the one or more computers concurrently receive audio data for an additional portion of the utterance.
15. The system of claim 12, wherein the accessed second CTC acoustic model is a unidirectional CTC acoustic model and the first CTC acoustic model is a bidirectional CTC acoustic model.
16. The system of claim 12, wherein the accessed second CTC acoustic model comprises a recurrent neural network trained using CTC techniques comprising one or more long short-term memory layers, and wherein the first CTC acoustic model comprises a recurrent neural network trained using CTC techniques comprising one or more long short-term memory layers.
17. The system of claim 12, wherein the second CTC acoustic model is configured to provide outputs that identify labels for triphones or to provide outputs that include scores for labels for triphones.
18. The system of claim 12, wherein the second CTC acoustic model is trained to recognize multiple different pronunciations of a word using a pronunciation model that indicates multiple different phonetic sequences as valid pronunciations of the word.
19. The system of claim 12, wherein the second CTC acoustic model is trained to recognize multiple different verbalizations of a written word using a verbalization model that indicates multi
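
The claims describe output vectors that score each phonetic unit alongside a blank label, and streaming results while more audio is still arriving. As a minimal sketch of how those two ideas can fit together, the following keeps decoder state across chunks and emits labels incrementally using greedy (best-path) decoding. This is an illustration under stated assumptions, not the decoder the patent claims: the class name `StreamingGreedyDecoder`, the label list, and the blank at index 0 are hypothetical, and a full recognizer would normally combine these scores with pronunciation and language models.

```python
from typing import List, Sequence


class StreamingGreedyDecoder:
    """Emits labels incrementally as per-frame output vectors arrive."""

    def __init__(self, labels: List[str], blank_index: int = 0) -> None:
        self.labels = labels
        self.blank_index = blank_index
        # Best index of the last frame seen; kept so that a label repeated
        # across a chunk boundary is not emitted twice.
        self._previous = blank_index

    def feed(self, frames: Sequence[Sequence[float]]) -> List[str]:
        """Consume one chunk of score vectors and return newly emitted labels."""
        emitted = []
        for scores in frames:
            best = max(range(len(scores)), key=lambda i: scores[i])
            if best != self._previous and best != self.blank_index:
                emitted.append(self.labels[best])
            self._previous = best
        return emitted


# Hypothetical label set and scores; index 0 is the blank label.
labels = ["<blank>", "k", "ae", "t"]
decoder = StreamingGreedyDecoder(labels)

first_chunk = [[0.1, 0.8, 0.05, 0.05], [0.1, 0.8, 0.05, 0.05]]
second_chunk = [[0.9, 0.03, 0.04, 0.03], [0.1, 0.1, 0.7, 0.1], [0.1, 0.1, 0.1, 0.7]]

partial = decoder.feed(first_chunk)   # labels available before the utterance ends
rest = decoder.feed(second_chunk)
print(partial, rest)
```

Because each decision depends only on the current and previous frame, this style of decoding pairs naturally with a unidirectional model, which does not need to see the end of the utterance before producing output.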
using artificial neural networks · CPC title
Demisyllables, biphones or triphones being the recognition units · CPC title
Phonemic context, e.g. pronunciation rules, phonotactical constraints or phoneme n-grams · CPC title
Distributed recognition, e.g. in client-server systems, for mobile phones or network applications · CPC title