Who is the assignee on this patent?

Microsoft Technology Licensing, LLC

What technology area does this patent fall under?

Primary CPC classification G10L15/005. Mapped technology areas include Physics.

When was this patent published?

Publication date Mar 30, 2021 (B2). Legal status and post-grant events are not shown on this page.

What related patents are in patentsdb?

We list 7 related publications on this page (citations in our corpus or others sharing the same primary CPC).

Code-switching speech recognition with end-to-end connectionist temporal classification model

US10964309B2 · US · B2

Patent metadata
Field	Value
Publication number	US-10964309-B2
Application number	US-201916410556-A
Country	US
Kind code	B2
Filing date	May 13, 2019
Priority date	Apr 16, 2019
Publication date	Mar 30, 2021
Grant date	Mar 30, 2021

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

Title
What the patent document calls the invention.
Abstract
A short plain-language summary of the technical disclosure.
Assignees and inventors
Who owns or filed the patent and who is credited as inventor.
Key dates
Filing, priority, publication, and grant dates set the timeline.
First independent claim
The legal scope of protection — read this for what is actually claimed.
CPC / IPC classifications
Technology tags used to group this patent with similar filings.
Citations and related patents
Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

A CS CTC model may be initialized from a major language CTC model by keeping network hidden weights and replacing output tokens with a union of major and secondary language output tokens. The initialized model may be trained by updating parameters with training data from both languages, and a LID model may also be trained with the data. During a decoding process for each of a series of audio frames, if silence dominates a current frame then a silence output token may be emitted. If silence does not dominate the frame, then a major language output token posterior vector from the CS CTC model may be multiplied with the LID major language probability to create a probability vector from the major language. A similar step is performed for the secondary language, and the system may emit an output token associated with the highest probability across all tokens from both languages.
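The per-frame decoding rule in the abstract can be illustrated with a minimal Python sketch. It is not the patent's implementation. The function name `decode_frame`, the `<sil>` token string, the dictionary layout, and the rule used to decide that "silence dominates" (the LID silence probability being the largest of the three LID outputs) are all assumptions made for illustration; the abstract does not define them.

```python
SILENCE = "<sil>"  # hypothetical name for the silence output token


def decode_frame(major_post, secondary_post, lid):
    """Pick one output token for a single audio frame.

    major_post / secondary_post: dict mapping token -> CS CTC posterior.
    lid: dict with keys "major", "secondary", "silence" holding the
         frame-level LID probabilities.
    Assumes the two languages use disjoint token sets.
    """
    # Assumption: silence "dominates" when its LID probability is the largest.
    if lid["silence"] >= max(lid["major"], lid["secondary"]):
        return SILENCE

    # Weight each language's token posteriors by that language's LID probability.
    scores = {tok: p * lid["major"] for tok, p in major_post.items()}
    scores.update({tok: p * lid["secondary"] for tok, p in secondary_post.items()})

    # Emit the token with the highest weighted probability across both languages.
    return max(scores, key=scores.get)


if __name__ == "__main__":
    major = {"hello": 0.5, "hi": 0.1}
    secondary = {"ni": 0.3, "hao": 0.1}
    lid = {"major": 0.2, "secondary": 0.7, "silence": 0.1}
    print(decode_frame(major, secondary, lid))
```

In this toy frame the major-language token "hello" has the highest raw posterior, but the LID model favors the secondary language, so the weighted scores select "ni" instead.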

First claim

Opening claim text (preview).

What is claimed is:

1. A system for automatic speech recognition associated with a major language and a secondary language, comprising: a computer processor; and a memory storage device including instructions that when executed by the computer processor enable the system to: during a decoding process for each of a series of frames associated with a speech waveform input, if silence dominates a current frame then emit a silence output token; and if silence does not dominate the current frame, then: multiply a major language output token posterior vector constructed from posteriors of major language tokens from a Code-Switching (“CS”) Connectionist Temporal Classification (“CTC”) model with a Language Identification (“LID”) model major language probability to create a probability vector from the major language, wherein the CS CTC model was: (1) initialized from the major language CTC model by keeping network hidden weights and replacing output tokens of the major language CTC model with a union of major language output tokens, secondary language output tokens, and the silence output token and (2) the initialized CTC model was trained by updating parameters with training data from both the major language and the secondary language, and further wherein the LID model was trained with the training data in connection with three-frame-by-frame outputs: (1) the major language probability, (2) a secondary language probability, and (3) a silence probability, multiply a secondary language output token posterior vector from the CS CTC model with the LID secondary language probability to create a probability vector from the secondary language, and emit the output token associated with the highest probability across all tokens from the major and secondary language.

2. The system of claim 1, further comprising instructions that when executed by the computer processor enable the system to: collapse the emitted output tokens using greedy decoding to generate an automatic speech recognition decoding hypothesis.

3. The system of claim 2, wherein the greedy decoding removes silence output tokens and repetitive language output tokens.

4. The system of claim 1, wherein the CTC model comprises bidirectional Long Short-Term Memory (“LSTM”) Recurrent Neural Networks (“RNNs”).

5. The system of claim 4, wherein an objective function for the CTC model is CTC loss.

6. The system of claim 1, wherein the LID model is associated with feed-forward Deep Neural Networks (“DNNs”) and Long Short-Term Memories (“LSTMs”) to build a frame-level LID model.

7. The system of claim 6, wherein the LID model utilizes information from a context window that includes data external to the current frame.

8. A computer-implemented method for automatic speech recognition associated with a first language and a second language, comprising: during a decoding process for each of a series of frames associated with a speech waveform input, if silence dominates a current frame, emitting a silence output token; and if silence does not dominate the current frame: multiplying a first language output token posterior vector constructed from posteriors of major language tokens from a Code-Switching (“CS”) Connectionist Temporal Classification (“CTC”) model with a Language Identification (“LID”) model first language probability to create a probability vector from the first language, wherein the CS CTC model was: (1) initialized from the major language CTC model by keeping network hidden weights and replacing output tokens of the major language CTC model with a union of major language output tokens, secondary language output tokens, and the silence output token and (2) the initialized CTC model was trained by updating parameters with training data from both the major language and the secondary language, and further wherein the LID model was trained with the training data in connection with three frame-by-frame outputs: (1) the major language probability, (2) a secondary language probability, and (3) a silence probability, multiplying a second language output token posterior vector from the CS CTC model with the LID second language probability to create a probability vector from the second language, and emitting the output token associated with the highest probability across all tokens from the major and secondary language.

9. The method of claim 8, further comprising: collapsing the emitted output tokens using greedy decoding to generate an automatic speech recognition decoding hypothesis.

10. The method of claim 9, wherein the greedy decoding removes silence output tokens and repetitive language output tokens.

11. The method of claim 8, wherein the CTC model comprises bidirectional Long Short-Term Memory (“LSTM”) Recurrent Neural Networks (“RNNs”).

12. The method of claim 11, wherein an objective function for the CTC model is CTC loss.

13. The method of claim 8, wherein the LID model is associated with feed-forward Deep Neural Networks (“DNNs”) and Long Short-Term Memories (“LSTMs”) to build a frame-level LID model.

14. The method of claim 13, wherein the LID model utilizes information from a context window that includes data external to the current frame.

15. A non-transient, computer-readable medium storing instructions to be executed by a processor to perform a method for automatic speech recognition associated with a major language and a secondary language, the method comprising: during a decoding process for each of a series of frames associated with a speech waveform input, if silence dominates a current frame, emitting a silence output token; if silence does not dominate the current frame: multiplying a first language output token posterior vector constructed from posteriors of major language tokens from a Code-Switching (“CS”) Connectionist Temporal Classification (“CTC”) model with a Language Identification (“LID”) model first language probability to create a probability vector from the first language, wherein the CS CTC model was: (1) initialized from the major language CTC model by keeping network hidden weights and replacing output tokens of the major language CTC model with a union of major language output tokens, secondary language output tokens, and the silence output token and (2) the initialized CTC model was trained by updating parameters with training data from both the major language and the secondary language, and further wherein the LID model was trained with the training data in connection with three frame-by-frame outputs: (1) the major language probability, (2) a secondary language probability, and (3) a silence probability, multiplying a second language output token posterior vector from the CS CTC model with the LID second language probability to create a probability vector from the second language, and emitting the output token associated with the highest probability across all tokens from the first and second language; and collapsing the emitted output tokens using greedy decoding to generate an automatic speech recognition decoding hypothesis.

16. The medium of claim 15, wherein the greedy decoding removes silence output tokens and repetitive language output tokens.

17. The medium of claim 15, wherein the CTC model comprises bidirectional Long Short-Term Memory (“LSTM”) Recurrent Neural Networks (“RNNs”).

18. The medium of claim 17, wherein an objective function for the CTC model is CTC loss.

19. The medium of claim 15, wherein the LID model is associated with feed-forward Deep Neural Networks (“DNNs”) and Long Short-Term Mem
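Claims 2, 3, 9, 10, 15 and 16 describe collapsing the emitted tokens with greedy decoding, which removes silence output tokens and repetitive language output tokens. The following Python sketch shows one way this could look. It assumes the usual CTC convention of merging consecutive repeats first and then dropping silence, so that a token repeated across a silence gap is kept twice. That ordering, the function name `collapse`, and the `<sil>` token string are assumptions, not details taken from the claims.

```python
from itertools import groupby

SILENCE = "<sil>"  # hypothetical name for the silence output token


def collapse(tokens, silence=SILENCE):
    """Collapse a frame-level token sequence into a decoding hypothesis."""
    # Step 1: merge runs of identical consecutive tokens into one token.
    merged = [tok for tok, _ in groupby(tokens)]
    # Step 2: drop silence tokens (assumed to act like the CTC blank).
    return [tok for tok in merged if tok != silence]


if __name__ == "__main__":
    frames = ["<sil>", "ni", "ni", "<sil>", "hao", "hao", "hao", "<sil>", "hello"]
    print(collapse(frames))
```

Merging before removing silence means a sequence such as "a, silence, a" yields two "a" tokens, which is how standard CTC decoding distinguishes a genuinely repeated token from one that simply spans several frames.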

Assignees

Microsoft Technology Licensing, LLC

Inventors

Classifications

G06N3/047
Probabilistic or stochastic networks · CPC title
G06N3/045
Combinations of networks · CPC title
G06N3/044
Recurrent networks, e.g. Hopfield networks · CPC title
G06N3/09
Supervised learning · CPC title
G06N3/0499
Feedforward networks · CPC title

Patent family

Related publications grouped by family.

View patent family 72832812

External sources

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US10964309B2 cover?: A CS CTC model may be initialized from a major language CTC model by keeping network hidden weights and replacing output tokens with a union of major and secondary language output tokens. The initialized model may be trained by updating parameters with training data from both languages, and a LID model may also be trained with the data. During a decoding process for each of a series of audio fram…

Related patents

Transliteration for speech recognition training and scoring

Training of speech recognition systems

Multi-dialect and multilingual speech recognition

Transcription generation from multiple speech recognition systems

Methods and Systems for Recognizing Simultaneous Speech by Multiple Speakers

Method and Apparatus for Multi-Lingual End-to-End Speech Recognition

Dialog device with dialog support generated using a mixture of language models combined using a recurrent neural network
