Transliteration for speech recognition training and scoring
US-2020193977-A1 · Jun 18, 2020 · US
US10964309B2 · US · B2
| Field | Value |
|---|---|
| Publication number | US-10964309-B2 |
| Application number | US-201916410556-A |
| Country | US |
| Kind code | B2 |
| Filing date | May 13, 2019 |
| Priority date | Apr 16, 2019 |
| Publication date | Mar 30, 2021 |
| Grant date | Mar 30, 2021 |
Official abstract text for this publication.
A CS CTC model may be initialized from a major language CTC model by keeping network hidden weights and replacing output tokens with a union of major and secondary language output tokens. The initialized model may be trained by updating parameters with training data from both languages, and a LID model may also be trained with the data. During a decoding process for each of a series of audio frames, if silence dominates a current frame then a silence output token may be emitted. If silence does not dominate the frame, then a major language output token posterior vector from the CS CTC model may be multiplied with the LID major language probability to create a probability vector from the major language. A similar step is performed for the secondary language, and the system may emit an output token associated with the highest probability across all tokens from both languages.
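The per-frame decoding rule in the abstract can be sketched as follows. This is a minimal illustration, not the patented implementation: the function name `decode_frame`, the `"<sil>"` token string, the dictionary layout, and the reading of "silence dominates" as "silence has the largest LID probability" are all assumptions, and the two token sets are assumed to be disjoint.

```python
from typing import Dict

SILENCE = "<sil>"  # hypothetical name for the silence output token


def decode_frame(
    major_post: Dict[str, float],
    secondary_post: Dict[str, float],
    lid: Dict[str, float],
) -> str:
    """Choose one output token for a single audio frame.

    major_post / secondary_post: CS CTC posteriors over each language's tokens.
    lid: frame-level LID probabilities keyed "major", "secondary", "silence".
    """
    # Assumption: silence "dominates" when it is the most probable LID class.
    if lid["silence"] >= max(lid["major"], lid["secondary"]):
        return SILENCE

    # Weight each language's token posteriors by that language's LID probability.
    scored = {tok: p * lid["major"] for tok, p in major_post.items()}
    scored.update({tok: p * lid["secondary"] for tok, p in secondary_post.items()})

    # Emit the token with the highest weighted probability across both languages.
    return max(scored, key=scored.get)
```

For example, with major posteriors `{"a": 0.6, "b": 0.1}`, secondary posteriors `{"x": 0.3}`, and LID probabilities of 0.2 (major), 0.7 (secondary), and 0.1 (silence), the weighted scores are 0.12, 0.02, and 0.21, so the secondary-language token `"x"` is emitted even though `"a"` has the larger raw posterior.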
Opening claim text (preview).
What is claimed is:
1. A system for automatic speech recognition associated with a major language and a secondary language, comprising: a computer processor; and a memory storage device including instructions that when executed by the computer processor enable the system to: during a decoding process for each of a series of frames associated with a speech waveform input, if silence dominates a current frame then emit a silence output token; and if silence does not dominate the current frame, then: multiply a major language output token posterior vector constructed from posteriors of major language tokens from a Code-Switching (“CS”) Connectionist Temporal Classification (“CTC”) model with a Language Identification (“LID”) model major language probability to create a probability vector from the major language, wherein the CS CTC model was: (1) initialized from the major language CTC model by keeping network hidden weights and replacing output tokens of the major language CTC model with a union of major language output tokens, secondary language output tokens, and the silence output token and (2) the initialized CTC model was trained by updating parameters with training data from both the major language and the secondary language, and further wherein the LID model was trained with the training data in connection with three-frame-by-frame outputs: (1) the major language probability, (2) a secondary language probability, and (3) a silence probability, multiply a secondary language output token posterior vector from the CS CTC model with the LID secondary language probability to create a probability vector from the secondary language, and emit the output token associated with the highest probability across all tokens from the major and secondary language.
2. The system of claim 1, further comprising instructions that when executed by the computer processor enable the system to: collapse the emitted output tokens using greedy decoding to generate an automatic speech recognition decoding hypothesis.
3. The system of claim 2, wherein the greedy decoding removes silence output tokens and repetitive language output tokens.
4. The system of claim 1, wherein the CTC model comprises bidirectional Long Short-Term Memory (“LSTM”) Recurrent Neural Networks (“RNNs”).
5. The system of claim 4, wherein an objective function for the CTC model is CTC loss.
6. The system of claim 1, wherein the LID model is associated with feed-forward Deep Neural Networks (“DNNs”) and Long Short-Term Memories (“LSTMs”) to build a frame-level LID model.
7. The system of claim 6, wherein the LID model utilizes information from a context window that includes data external to the current frame.
8. A computer-implemented method for automatic speech recognition associated with a first language and a second language, comprising: during a decoding process for each of a series of frames associated with a speech waveform input, if silence dominates a current frame, emitting a silence output token; and if silence does not dominate the current frame: multiplying a first language output token posterior vector constructed from posteriors of major language tokens from a Code-Switching (“CS”) Connectionist Temporal Classification (“CTC”) model with a Language Identification (“LID”) model first language probability to create a probability vector from the first language, wherein the CS CTC model was: (1) initialized from the major language CTC model by keeping network hidden weights and replacing output tokens of the major language CTC model with a union of major language output tokens, secondary language output tokens, and the silence output token and (2) the initialized CTC model was trained by updating parameters with training data from both the major language and the secondary language, and further wherein the LID model was trained with the training data in connection with three frame-by-frame outputs: (1) the major language probability, (2) a secondary language probability, and (3) a silence probability, multiplying a second language output token posterior vector from the CS CTC model with the LID second language probability to create a probability vector from the second language, and emitting the output token associated with the highest probability across all tokens from the major and secondary language.
9. The method of claim 8, further comprising: collapsing the emitted output tokens using greedy decoding to generate an automatic speech recognition decoding hypothesis.
10. The method of claim 9, wherein the greedy decoding removes silence output tokens and repetitive language output tokens.
11. The method of claim 8, wherein the CTC model comprises bidirectional Long Short-Term Memory (“LSTM”) Recurrent Neural Networks (“RNNs”).
12. The method of claim 11, wherein an objective function for the CTC model is CTC loss.
13. The method of claim 8, wherein the LID model is associated with feed-forward Deep Neural Networks (“DNNs”) and Long Short-Term Memories (“LSTMs”) to build a frame-level LID model.
14. The method of claim 13, wherein the LID model utilizes information from a context window that includes data external to the current frame.
15. A non-transient, computer-readable medium storing instructions to be executed by a processor to perform a method for automatic speech recognition associated with a major language and a secondary language, the method comprising: during a decoding process for each of a series of frames associated with a speech waveform input, if silence dominates a current frame, emitting a silence output token; if silence does not dominate the current frame: multiplying a first language output token posterior vector constructed from posteriors of major language tokens from a Code-Switching (“CS”) Connectionist Temporal Classification (“CTC”) model with a Language Identification (“LID”) model first language probability to create a probability vector from the first language, wherein the CS CTC model was: (1) initialized from the major language CTC model by keeping network hidden weights and replacing output tokens of the major language CTC model with a union of major language output tokens, secondary language output tokens, and the silence output token and (2) the initialized CTC model was trained by updating parameters with training data from both the major language and the secondary language, and further wherein the LID model was trained with the training data in connection with three frame-by-frame outputs: (1) the major language probability, (2) a secondary language probability, and (3) a silence probability, multiplying a second language output token posterior vector from the CS CTC model with the LID second language probability to create a probability vector from the second language, and emitting the output token associated with the highest probability across all tokens from the first and second language; and collapsing the emitted output tokens using greedy decoding to generate an automatic speech recognition decoding hypothesis.
16. The medium of claim 15, wherein the greedy decoding removes silence output tokens and repetitive language output tokens.
17. The medium of claim 15, wherein the CTC model comprises bidirectional Long Short-Term Memory (“LSTM”) Recurrent Neural Networks (“RNNs”).
18. The medium of claim 17, wherein an objective function for the CTC model is CTC loss.
19. The medium of claim 15, wherein the LID model is associated with feed-forward Deep Neural Networks (“DNNs”) and Long Short-Term Mem
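The greedy collapse recited in claims 2, 3, 9, 10, 15 and 16 (removing silence output tokens and repetitive language output tokens) might look like the sketch below. The function name `collapse`, the `"<sil>"` token string, and the order of operations (merge consecutive repeats first, then drop silence, as in conventional CTC greedy decoding) are assumptions; the claims do not specify an order.

```python
from itertools import groupby
from typing import List

SILENCE = "<sil>"  # hypothetical name for the silence output token


def collapse(tokens: List[str], silence: str = SILENCE) -> List[str]:
    """Collapse frame-level output tokens into a decoding hypothesis."""
    # Merge runs of identical consecutive tokens into a single token.
    merged = [tok for tok, _ in groupby(tokens)]
    # Drop silence tokens. Because merging happens first, a token that
    # recurs on both sides of a silence gap is kept twice.
    return [tok for tok in merged if tok != silence]
```

Applied to the frame sequence `["<sil>", "h", "h", "<sil>", "h", "i", "i", "<sil>"]`, this returns `["h", "h", "i"]`: the repeated `"h"` frames merge, but the silence between them keeps the two occurrences distinct.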
Probabilistic or stochastic networks · CPC title
Combinations of networks · CPC title
Recurrent networks, e.g. Hopfield networks · CPC title
Supervised learning · CPC title
Feedforward networks · CPC title