Who is the assignee on this patent?

Microsoft Technology Licensing LLC

What technology area does this patent fall under?

Primary CPC classification G10L15/08. Mapped technology areas include Physics.

When was this patent published?

Published Apr 21, 2020 (kind code B2). Legal status and post-grant events are not shown on this page.

What related patents are in patentsdb?

We list 8 related publications on this page (citations in our corpus or others sharing the same primary CPC).

Advancing word-based speech recognition processing

US10629193B2 · US · B2

Patent metadata
Field	Value
Publication number	US-10629193-B2
Application number	US-201815917082-A
Country	US
Kind code	B2
Filing date	Mar 9, 2018
Priority date	Mar 9, 2018
Publication date	Apr 21, 2020
Grant date	Apr 21, 2020

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

Title: What the patent document calls the invention.
Abstract: A short plain-language summary of the technical disclosure.
Assignees and inventors: Who owns or filed the patent and who is credited as inventor.
Key dates: Filing, priority, publication, and grant dates set the timeline.
First independent claim: The legal scope of protection. Read this for what is actually claimed.
CPC / IPC classifications: Technology tags used to group this patent with similar filings.
Citations and related patents: Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

Non-limiting examples of the present disclosure describe advancements in acoustic-to-word modeling that improve accuracy in speech recognition processing through the replacement of out-of-vocabulary (OOV) tokens. During the decoding of speech signals, better accuracy in speech recognition processing is achieved through training and implementation of multiple different solutions that present enhanced speech recognition models. In one example, a hybrid neural network model for speech recognition processing combines a word-based neural network model as a primary model and a character-based neural network model as an auxiliary model. The primary word-based model emits a word sequence, and an output of character-based auxiliary model is consulted at a segment where the word-based model emits an OOV token. In another example, a mixed unit speech recognition model is developed and trained to generate a mixed word and character sequence during decoding of a speech signal without requiring generation of OOV tokens.
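
The hybrid example in the abstract can be illustrated with a small sketch. It is not the patented implementation: it assumes both models have already been reduced to one greedy label per frame, and the names `BLANK`, `OOV`, `collapse_with_spans`, `hybrid_decode`, and the `char_model` callable are hypothetical. The rule used to pick the segment (from the end of the previous word emission to the start of the next one) is likewise an assumption made for illustration.

```python
from itertools import groupby

BLANK = "<b>"    # hypothetical CTC blank label
OOV = "<OOV>"    # hypothetical out-of-vocabulary label


def collapse_with_spans(frames):
    """Greedy CTC collapse: merge repeated labels, then drop blanks.

    Returns (label, start, end) triples; end is exclusive.
    """
    spans, t = [], 0
    for label, run in groupby(frames):
        n = sum(1 for _ in run)
        if label != BLANK:
            spans.append((label, t, t + n))
        t += n
    return spans


def hybrid_decode(word_frames, char_model):
    """Emit the word sequence; consult char_model only at OOV segments.

    char_model(start, end) is a hypothetical callable returning the
    character-level frame labels for frames [start, end).
    """
    tokens = collapse_with_spans(word_frames)
    words = []
    for i, (label, _, _) in enumerate(tokens):
        if label != OOV:
            words.append(label)
            continue
        # Assumed segment: everything between the neighbouring word emissions.
        seg_start = tokens[i - 1][2] if i > 0 else 0
        seg_end = tokens[i + 1][1] if i + 1 < len(tokens) else len(word_frames)
        chars = collapse_with_spans(char_model(seg_start, seg_end))
        words.append("".join(c for c, _, _ in chars))
    return words


# Toy frame labels, invented for this example.
word_frames = [BLANK, "play", BLANK, OOV, OOV, BLANK, BLANK, BLANK, "now"]
char_frames = [BLANK, "p", "a", "b", "b", BLANK, "b", "a", "n"]
print(hybrid_decode(word_frames, lambda s, e: char_frames[s:e]))
# prints ['play', 'abba', 'now']
```

Passing the character model as a callable keeps it idle unless the word-level output contains an OOV label, which mirrors the "only when the OOV token is detected" wording in claim 1.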

First claim

Opening claim text (preview).

What is claimed is:

1. A method comprising: detecting that a speech signal comprises an out-of-vocabulary (OOV) token using a word-based language model; propagating the speech signal to a character-based language model for character-based evaluation only when the OOV token is detected in the speech signal by the word-based language model; generating a character sequence for the OOV token; and outputting a speech recognition result for the speech signal that comprises the generated character sequence for the OOV token.
2. The method of claim 1, wherein the speech signal is processed using a hybrid neural network model that comprises an acoustic-to-word model for detection of the OOV token and character-based auxiliary model evaluation for evaluation of the OOV token.
3. The method of claim 2, further comprising: training the hybrid neural network model based on processing associated with the speech recognition result.
4. The method of claim 3, wherein the training further comprises receiving, from a productivity service, usage data associated with an interaction with the speech recognition result and updating training data of the hybrid neural network model based on the usage data.
5. The method of claim 2, wherein the hybrid neural network model is a hybrid Connectionist Temporal Classification (CTC) model that comprises an acoustic-to-word CTC model for evaluation of the speech signal and a character-based CTC model for evaluation of the OOV token.
6. The method of claim 5, wherein the hybrid neural network model is a hybrid Connectionist Temporal Classification (CTC) model that comprises an acoustic-to-word CTC model is trained to identify frequent words, and wherein the OOV token is generated when the speech signal is identified as an infrequent word that is not recognized by the acoustic-to-word CTC model.
7. The method of claim 1, further comprising: collapsing the character sequence into an output unit for the OOV token, wherein the outputting outputs the output unit in the speech recognition result.
8. The method of claim 1, wherein the outputting comprises propagating the speech recognition result to an application or productivity service.
9. A method comprising: receiving a speech signal; decoding the speech signal using a mixed unit speech recognition model that is trained based on word and character sequences; generating, for the speech signal, a mixed word and character sequence based on an evaluation of the speech signal by the mixed unit speech recognition model, wherein the mixed unit speech recognition model applies a word-based language model to detect an out of vocabulary (OOV) token and applies a character-based language model to evaluate the OOV token only when the OOV token is detected in the speech signal by the word-based language model; decomposing the mixed word and character sequence; and outputting a speech recognition result for the speech signal that comprises the decomposed mixed word and character sequence.
10. The method of claim 9, further comprising: collapsing the character sequence of the decomposed mixed word and character sequence, wherein the speech recognition result, output in the outputting, comprises the collapsed character sequence.
11. The method of claim 9, wherein the mixed unit speech recognition model is a mixed Connectionist Temporal Classification (CTC) model.
12. The method of claim 9, wherein the speech signal is received during real-time execution of an application or service.
13. The method of claim 9, wherein the outputting comprises propagating the speech recognition result to an application or service for subsequent processing.
14. The method of claim 9, further comprising: updating training data for the mixed unit speech recognition model based on usage data, of the speech recognition result, that is associated with an application or service.
15. A system comprising: at least one processor; and a memory, operatively connected with the at least one processor, storing computer-executable instructions that, when executed by the at least one processor, causes the at least one processor to execute a method that comprises: receiving a speech signal; decoding the speech signal using a mixed unit speech recognition model that is trained based on word and character sequences; generating, for the speech signal, a mixed word and character sequence based on an evaluation of the speech signal by the mixed unit speech recognition model, wherein the mixed unit speech recognition model applies a word-based language model to detect an out of vocabulary (OOV) token and applies a character-based language model to evaluate the OOV token only when the OOV token is detected in the speech signal by the word-based language model; decomposing the mixed word and character sequence; and outputting a speech recognition result for the speech signal that comprises the decomposed mixed word and character sequence.
16. The system of claim 15, wherein the method, executed by the at least one processor, further comprises: collapsing the character sequence of the decomposed mixed word and character sequence, wherein the speech recognition result, output in the outputting, comprises the collapsed character sequence.
17. The system of claim 15, wherein the mixed unit speech recognition model is a mixed Connectionist Temporal Classification (CTC) model.
18. The method of claim 15, wherein the speech signal is received during real-time execution of an application or service.
19. The method of claim 15, wherein the outputting comprises propagating the speech recognition result to an application or service for subsequent processing.
20. The system of claim 15, wherein the method, executed by the at least one processor, further comprises: updating training data for the mixed unit speech recognition model based on usage data, of the speech recognition result, that is associated with an application or service.
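
The mixed word and character sequence in claims 9 to 20 can be pictured with a toy round trip. This is a sketch under stated assumptions, not the claimed method: the `("word", ...)` and `("char", ...)` tags, the `vocab` set, and the helpers `to_mixed_units` and `collapse_mixed_units` are all hypothetical, and this page does not show how the patent itself marks word boundaries.

```python
from itertools import groupby


def to_mixed_units(words, vocab):
    """Keep in-vocabulary words whole; spell every other word as characters."""
    units = []
    for w in words:
        if w in vocab:
            units.append(("word", w))
        else:
            units.extend(("char", c) for c in w)
    return units


def collapse_mixed_units(units):
    """Merge each run of consecutive character units back into one word."""
    words = []
    for kind, run in groupby(units, key=lambda u: u[0]):
        texts = [text for _, text in run]
        if kind == "char":
            words.append("".join(texts))
        else:
            words.extend(texts)
    return words


vocab = {"call", "now"}  # invented vocabulary of "frequent" words
units = to_mixed_units(["call", "jin", "now"], vocab)
print(units)
print(collapse_mixed_units(units))
```

One limitation of this simplification: two adjacent spelled-out words would be merged into a single word on collapse, so a real system would need some explicit boundary signal between them.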

Assignees

Microsoft Technology Licensing LLC

Classifications

G10L15/16: using artificial neural networks
G10L15/08 (Primary): Speech classification or search
G10L15/063: Training
G10L2015/223: Execution procedure of a spoken command
G10L15/187 (Primary): Phonemic context, e.g. pronunciation rules, phonotactical constraints or phoneme n-grams

Patent family

Related publications grouped by family.

Patent family ID: 67844564
