Statistical-analysis-based reset of recurrent neural networks for automatic speech recognition
US-2019005945-A1 · Jan 3, 2019 · US
US11417322B2 · US · B2
| Field | Value |
|---|---|
| Publication number | US-11417322-B2 |
| Application number | US-201916712492-A |
| Country | US |
| Kind code | B2 |
| Filing date | Dec 12, 2019 |
| Priority date | Dec 12, 2018 |
| Publication date | Aug 16, 2022 |
| Grant date | Aug 16, 2022 |
Official abstract text for this publication.
Methods, systems, and apparatus, including computer programs stored on a computer-readable storage medium, for transliteration for speech recognition training and scoring. In some implementations, language examples are accessed, some of which include words in a first script and words in one or more other scripts. At least portions of some of the language examples are transliterated to the first script to generate a training data set. A language model is generated based on occurrences of the different sequences of words in the training data set in the first script. The language model is used to perform speech recognition for an utterance.
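The flow described in the abstract (transliterate mixed-script examples into one script, then build a language model from word-sequence occurrences) can be illustrated with a minimal sketch. Everything named here is an assumption for illustration only: `TOY_TABLE` is a hypothetical lookup standing in for a real transliteration model, the first script is assumed to be Devanagari with ASCII Latin as the only other script, and bigram counts stand in for whatever sequence statistics an actual language model would use.

```python
from collections import Counter

# Hypothetical toy lookup (Latin -> Devanagari). This table is an assumption
# for illustration; it stands in for a real transliteration model.
TOY_TABLE = {"cricket": "क्रिकेट", "match": "मैच"}


def is_out_of_script(word):
    """Assumption: the first script is Devanagari and the only other script
    is ASCII Latin, so any ASCII letter marks the word as out-of-script."""
    return any(ch.isascii() and ch.isalpha() for ch in word)


def transliterate(sentence, table):
    """Return the sentence's tokens with out-of-script words mapped into the
    first script. Words missing from the table are left unchanged."""
    return [
        table.get(word.lower(), word) if is_out_of_script(word) else word
        for word in sentence.split()
    ]


def bigram_counts(examples, table):
    """Count adjacent word pairs over the transliterated training set."""
    counts = Counter()
    for sentence in examples:
        tokens = ["<s>"] + transliterate(sentence, table) + ["</s>"]
        counts.update(zip(tokens, tokens[1:]))
    return counts


# Two spellings of the same phrase collapse onto one token sequence.
examples = ["आज cricket match है", "आज क्रिकेट match है"]
counts = bigram_counts(examples, TOY_TABLE)
```

Because both example sentences are normalized to the same script before counting, their statistics are pooled instead of being split across spelling variants, which is the effect the abstract attributes to transliterating the training data.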
Opening claim text (preview).
The invention claimed is:

1. A method performed by one or more computers, the method comprising: accessing, by the one or more computers, a set of data indicating language examples for a first script, wherein at least some of the language examples include words in the first script and out-of-script words in one or more other scripts; accessing, by the one or more computers, a blacklist of terms in a script different than the first script; selectively transliterating, by the one or more computers, at least portions of some of the language examples by transliterating a portion of the out-of-script words to the first script and bypassing transliteration of a remaining portion of the out-of-script words that includes instances of the terms from the blacklist to generate a training data set having the portion of the out-of-script words transliterated into the first script and the remaining portion of the out-of-script words kept in the one or more other scripts; and generating, by the one or more computers, a speech recognition model based on occurrences of sequences of words in the training data set having the portion of the out-of-script words transliterated into the first script and the remaining portion of the out-of-script words kept in the one or more other scripts.

2. The method of claim 1, wherein the speech recognition model is a language model, an acoustic model, a sequence-to-sequence model, or an end-to-end model.

3. The method of claim 1, wherein selectively transliterating comprises mapping different tokens that represent text from different scripts to a single normalized transliterated representation.

4. The method of claim 1, wherein selectively transliterating the language examples comprises transliterating the portion of the out-of-script words in the language examples that are not in the first script into the first script.

5. The method of claim 1, wherein selectively transliterating the language examples comprises generating altered language examples in which words written in a second script different from the first script are replaced with one or more words in the first script that approximate acoustic properties of the word in the second script.

6. The method of claim 5, wherein the words written in the second script are individually transliterated into the first script on a word-by-word basis.

7. The method of claim 1, further comprising: determining a test set of language examples with which to test the speech recognition model; generating a normalized test set by transliterating into the first script words of the language examples in the test set that are not written in the first script; obtaining output of the speech recognition model corresponding to the language examples in the test set; normalizing output of the speech recognition model by transliterating into the first script words of the speech recognition model output that are not written in the first script; and determining an error rate of the speech recognition model based on a comparison of the normalized test set with the normalized speech recognition model output.

8. The method of claim 7, wherein the error rate is a word error rate, and wherein the method includes, based on the word error rate: determining whether to continue training or terminate training of the speech recognition model; altering a training data set used to train the speech recognition model; setting a size, structure, or other characteristic of the speech recognition model; or selecting one or more speech recognition models for a speech recognition task.

9. The method of claim 1, further comprising determining a modelling error rate for the speech recognition model in which acoustically similar words written in any of multiple scripts are accepted as correct transcriptions, without penalizing output of a word in a different script than a corresponding word in a reference transcription.

10. The method of claim 9, further comprising determining a rendering error rate for the speech recognition model that is a measure of differences between a script of words in the output of the speech recognition model relative to a script of corresponding words in reference transcriptions.

11. The method of claim 1, wherein selectively transliterating is performed using a finite state transducer network trained to perform transliteration into the first script.

12. The method of claim 1, wherein selectively transliterating comprises, for at least one language example, performing multiple rounds of transliteration between scripts to reach a transliterated representation in the first script that is included in the training data set in the first script.

13. The method of claim 1, further comprising determining a score indicating a level of mixing of scripts in the language examples; and based on the score: selecting a parameter for pruning a finite state transducer network for transliteration; selecting a parameter for pruning the speech recognition model; or selecting a size or structure for the speech recognition model.

14. The method of claim 1, wherein generating the speech recognition model comprises: after selectively transliterating at least portions of some the language examples by transliterating the portion of the out-of-script words to the first script, determining, by the one or more computers, a count of occurrences of different sequences of words in the training data set in the first script; and generating, by the one or more computers, a speech recognition model based on the counts of occurrences of the different sequences of words in the training data set in the first script.

15. The method of claim 1, wherein the speech recognition model comprises a recurrent neural network, and generating the speech recognition model comprises training the recurrent neural network.

16. The method of claim 1, further comprising using, by the one or more computers, the model to perform speech recognition for an utterance.

17. The method of claim 1, further comprising: receiving, by one or more computers, audio data representing an utterance; and using, by the one or more computers, the speech generation model to map the audio data to text representing the utterance.

18. A system comprising: one or more computers; and one or more computer-readable media storing instructions that, when executed by the one or more computers, cause the one or more computers to perform operations comprising: accessing a set of data indicating language examples for a first script, wherein at least some of the language examples include words in the first script and out-of-script words in one or more other scripts; accessing a blacklist of terms in a script different than the first script; selectively transliterating at least portions of some of the language examples by transliterating a portion of the out-of-script words to the first script and bypassing transliteration of a remaining portion of the out-of-script words that includes instances of the terms from the blacklist to generate a training data set having the portion of the out-of-script words transliterated into the first script and the remaining portion of the out-of-script words kept in the one or more other scripts; and generating a speech recognition model based on occurrences of sequences of words in the training data set having the portion of the out-of-script words transliterated into the first script and remaining portion of the out-of-script words kept in the one or more other scripts.

19. One or more non-transitory
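Two ideas from the claims lend themselves to a short illustration: the blacklist-aware selective transliteration of claim 1, and the error rate of claim 7 computed after both the reference and the recognizer output are normalized into the first script. The following is a rough sketch under stated assumptions, not the patented implementation: `TOY_TABLE` is a hypothetical lookup (the claims mention a finite state transducer network, which is not modeled here), the first script is assumed to be Devanagari with ASCII Latin as the other script, and the word error rate is plain Levenshtein distance over tokens divided by reference length.

```python
# Hypothetical toy lookup (Latin -> Devanagari); an assumption for illustration.
TOY_TABLE = {"cricket": "क्रिकेट", "match": "मैच"}


def is_out_of_script(word):
    """Assumption: any ASCII letter marks a word as not in the first script."""
    return any(ch.isascii() and ch.isalpha() for ch in word)


def selective_transliterate(tokens, table, blacklist=frozenset()):
    """Transliterate out-of-script tokens, but bypass blacklisted terms so
    they stay in their original script."""
    out = []
    for tok in tokens:
        if is_out_of_script(tok) and tok.lower() not in blacklist:
            out.append(table.get(tok.lower(), tok))
        else:
            out.append(tok)
    return out


def edit_distance(ref, hyp):
    """Token-level Levenshtein distance using a rolling row."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (r != h)))  # substitution or match
        prev = cur
    return prev[-1]


def word_error_rate(ref, hyp):
    return edit_distance(ref, hyp) / max(len(ref), 1)


def normalized_wer(ref, hyp, table):
    """Normalize both sides into the first script, then compare."""
    return word_error_rate(selective_transliterate(ref, table),
                           selective_transliterate(hyp, table))


reference = ["आज", "cricket", "है"]
hypothesis = ["आज", TOY_TABLE["cricket"], "है"]
raw = word_error_rate(reference, hypothesis)
normalized = normalized_wer(reference, hypothesis, TOY_TABLE)
```

In this toy case the raw comparison counts the script mismatch as a substitution, while the normalized comparison does not, since both spellings map to the same first-script token. Passing a non-empty `blacklist` to `selective_transliterate` keeps the listed terms in their original script when building training data.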
Recurrent networks, e.g. Hopfield networks · CPC title
characterised by memory or gating, e.g. long short-term memory [LSTM] or gated recurrent units [GRU] · CPC title
Supervised learning · CPC title
Training · CPC title
using artificial neural networks · CPC title