Code-switching speech recognition with end-to-end connectionist temporal classification model

US10964309B2 · US · B2

Patent metadata
Publication number: US-10964309-B2
Application number: US-201916410556-A
Country: US
Kind code: B2
Filing date: May 13, 2019
Priority date: Apr 16, 2019
Publication date: Mar 30, 2021
Grant date: Mar 30, 2021

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

A CS CTC model may be initialized from a major language CTC model by keeping network hidden weights and replacing output tokens with a union of major and secondary language output tokens. The initialized model may be trained by updating parameters with training data from both languages, and a LID model may also be trained with the data. During a decoding process for each of a series of audio frames, if silence dominates a current frame then a silence output token may be emitted. If silence does not dominate the frame, then a major language output token posterior vector from the CS CTC model may be multiplied with the LID major language probability to create a probability vector from the major language. A similar step is performed for the secondary language, and the system may emit an output token associated with the highest probability across all tokens from both languages.
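
The per-frame decoding rule in the abstract can be sketched in a few lines of Python. This is only an illustration, not the patented implementation: the function name `decode_frame`, the token name `<sil>`, and the dictionary layout are hypothetical, and the test for "silence dominates" (the LID silence probability being at least as large as either language probability) is an assumption, since the abstract does not define it.

```python
from typing import Dict, Iterable

SIL = "<sil>"  # hypothetical name for the silence output token


def decode_frame(
    ctc_post: Dict[str, float],        # CS CTC posteriors for every output token
    major_tokens: Iterable[str],       # tokens belonging to the major language
    secondary_tokens: Iterable[str],   # tokens belonging to the secondary language
    lid: Dict[str, float],             # LID outputs: "major", "secondary", "sil"
) -> str:
    """Choose one output token for a single frame."""
    # Assumption: silence "dominates" when its LID probability is the largest.
    if lid["sil"] >= max(lid["major"], lid["secondary"]):
        return SIL

    scores: Dict[str, float] = {}
    # Scale each major language posterior by the LID major language probability.
    for tok in major_tokens:
        scores[tok] = ctc_post[tok] * lid["major"]
    # Scale each secondary language posterior by the LID secondary probability.
    for tok in secondary_tokens:
        scores[tok] = ctc_post[tok] * lid["secondary"]

    # Emit the token with the highest weighted probability across both languages.
    return max(scores, key=scores.get)


if __name__ == "__main__":
    posteriors = {SIL: 0.1, "en_hi": 0.4, "zh_ni": 0.5}
    lid_speech = {"major": 0.8, "secondary": 0.15, "sil": 0.05}
    print(decode_frame(posteriors, ["en_hi"], ["zh_ni"], lid_speech))
```

In this toy frame the CTC model alone slightly prefers the secondary language token, but the LID weighting shifts the decision to the major language token, which is the kind of correction the fusion step is meant to provide.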

First claim

Opening claim text (preview).

What is claimed is:

1. A system for automatic speech recognition associated with a major language and a secondary language, comprising: a computer processor; and a memory storage device including instructions that when executed by the computer processor enable the system to: during a decoding process for each of a series of frames associated with a speech waveform input, if silence dominates a current frame then emit a silence output token; and if silence does not dominate the current frame, then: multiply a major language output token posterior vector constructed from posteriors of major language tokens from a Code-Switching (“CS”) Connectionist Temporal Classification (“CTC”) model with a Language Identification (“LID”) model major language probability to create a probability vector from the major language, wherein the CS CTC model was: (1) initialized from the major language CTC model by keeping network hidden weights and replacing output tokens of the major language CTC model with a union of major language output tokens, secondary language output tokens, and the silence output token and (2) the initialized CTC model was trained by updating parameters with training data from both the major language and the secondary language, and further wherein the LID model was trained with the training data in connection with three-frame-by-frame outputs: (1) the major language probability, (2) a secondary language probability, and (3) a silence probability, multiply a secondary language output token posterior vector from the CS CTC model with the LID secondary language probability to create a probability vector from the secondary language, and emit the output token associated with the highest probability across all tokens from the major and secondary language.

2. The system of claim 1, further comprising instructions that when executed by the computer processor enable the system to: collapse the emitted output tokens using greedy decoding to generate an automatic speech recognition decoding hypothesis.

3. The system of claim 2, wherein the greedy decoding removes silence output tokens and repetitive language output tokens.

4. The system of claim 1, wherein the CTC model comprises bidirectional Long Short-Term Memory (“LSTM”) Recurrent Neural Networks (“RNNs”).

5. The system of claim 4, wherein an objective function for the CTC model is CTC loss.

6. The system of claim 1, wherein the LID model is associated with feed-forward Deep Neural Networks (“DNNs”) and Long Short-Term Memories (“LSTMs”) to build a frame-level LID model.

7. The system of claim 6, wherein the LID model utilizes information from a context window that includes data external to the current frame.

8. A computer-implemented method for automatic speech recognition associated with a first language and a second language, comprising: during a decoding process for each of a series of frames associated with a speech waveform input, if silence dominates a current frame, emitting a silence output token; and if silence does not dominate the current frame: multiplying a first language output token posterior vector constructed from posteriors of major language tokens from a Code-Switching (“CS”) Connectionist Temporal Classification (“CTC”) model with a Language Identification (“LID”) model first language probability to create a probability vector from the first language, wherein the CS CTC model was: (1) initialized from the major language CTC model by keeping network hidden weights and replacing output tokens of the major language CTC model with a union of major language output tokens, secondary language output tokens, and the silence output token and (2) the initialized CTC model was trained by updating parameters with training data from both the major language and the secondary language, and further wherein the LID model was trained with the training data in connection with three frame-by-frame outputs: (1) the major language probability, (2) a secondary language probability, and (3) a silence probability, multiplying a second language output token posterior vector from the CS CTC model with the LID second language probability to create a probability vector from the second language, and emitting the output token associated with the highest probability across all tokens from the major and secondary language.

9. The method of claim 8, further comprising: collapsing the emitted output tokens using greedy decoding to generate an automatic speech recognition decoding hypothesis.

10. The method of claim 9, wherein the greedy decoding removes silence output tokens and repetitive language output tokens.

11. The method of claim 8, wherein the CTC model comprises bidirectional Long Short-Term Memory (“LSTM”) Recurrent Neural Networks (“RNNs”).

12. The method of claim 11, wherein an objective function for the CTC model is CTC loss.

13. The method of claim 8, wherein the LID model is associated with feed-forward Deep Neural Networks (“DNNs”) and Long Short-Term Memories (“LSTMs”) to build a frame-level LID model.

14. The method of claim 13, wherein the LID model utilizes information from a context window that includes data external to the current frame.

15. A non-transient, computer-readable medium storing instructions to be executed by a processor to perform a method for automatic speech recognition associated with a major language and a secondary language, the method comprising: during a decoding process for each of a series of frames associated with a speech waveform input, if silence dominates a current frame, emitting a silence output token; if silence does not dominate the current frame: multiplying a first language output token posterior vector constructed from posteriors of major language tokens from a Code-Switching (“CS”) Connectionist Temporal Classification (“CTC”) model with a Language Identification (“LID”) model first language probability to create a probability vector from the first language, wherein the CS CTC model was: (1) initialized from the major language CTC model by keeping network hidden weights and replacing output tokens of the major language CTC model with a union of major language output tokens, secondary language output tokens, and the silence output token and (2) the initialized CTC model was trained by updating parameters with training data from both the major language and the secondary language, and further wherein the LID model was trained with the training data in connection with three frame-by-frame outputs: (1) the major language probability, (2) a secondary language probability, and (3) a silence probability, multiplying a second language output token posterior vector from the CS CTC model with the LID second language probability to create a probability vector from the second language, and emitting the output token associated with the highest probability across all tokens from the first and second language; and collapsing the emitted output tokens using greedy decoding to generate an automatic speech recognition decoding hypothesis.

16. The medium of claim 15, wherein the greedy decoding removes silence output tokens and repetitive language output tokens.

17. The medium of claim 15, wherein the CTC model comprises bidirectional Long Short-Term Memory (“LSTM”) Recurrent Neural Networks (“RNNs”).

18. The medium of claim 17, wherein an objective function for the CTC model is CTC loss.

19. The medium of claim 15, wherein the LID model is associated with feed-forward Deep Neural Networks (“DNNs”) and Long Short-Term Mem
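
Claims 2, 3, 9, 10, 15 and 16 refer to collapsing the emitted tokens with greedy decoding, removing silence tokens and repeated language tokens. A minimal sketch of that collapse follows. The name `greedy_collapse` and the token `<sil>` are hypothetical, and the order of operations (merge consecutive repeats first, then drop silence) is the conventional CTC rule, assumed here because the claims do not spell it out.

```python
from itertools import groupby
from typing import List, Sequence

SIL = "<sil>"  # hypothetical name for the silence output token


def greedy_collapse(frame_tokens: Sequence[str]) -> List[str]:
    """Collapse per-frame output tokens into a decoding hypothesis."""
    # Merge runs of identical consecutive tokens into a single token.
    merged = [tok for tok, _run in groupby(frame_tokens)]
    # Drop silence tokens; a silence between two identical tokens keeps
    # them as two separate tokens in the hypothesis.
    return [tok for tok in merged if tok != SIL]


if __name__ == "__main__":
    frames = ["h", "h", SIL, "i", "i", SIL, SIL, "i"]
    print(greedy_collapse(frames))
```

Merging before removing silence matters: if silence were removed first, the two separate "i" tokens in the example would be merged into one.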

Assignees

Inventors

Classifications

  • Probabilistic or stochastic networks

  • Combinations of networks

  • Recurrent networks, e.g. Hopfield networks

  • Supervised learning

  • Feedforward networks

Patent family

Related publications grouped by family.

External sources

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US10964309B2 cover?
A CS CTC model may be initialized from a major language CTC model by keeping network hidden weights and replacing output tokens with a union of major and secondary language output tokens. The initialized model may be trained by updating parameters with training data from both languages, and a LID model may also be trained with the data. During a decoding process for each of a series of audio fram…
Who is the assignee on this patent?
Microsoft Technology Licensing LLC
What technology area does this patent fall under?
Primary CPC classification G10L15/005. Mapped technology areas include Physics.
When was this patent published?
Publication date: Mar 30, 2021 (B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 7 related publications on this page (citations in our corpus or others sharing the same primary CPC).