Advancing word-based speech recognition processing

US10629193B2 · US · B2

Patent metadata
Publication number: US-10629193-B2
Application number: US-201815917082-A
Country: US
Kind code: B2
Filing date: Mar 9, 2018
Priority date: Mar 9, 2018
Publication date: Apr 21, 2020
Grant date: Apr 21, 2020

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

Non-limiting examples of the present disclosure describe advancements in acoustic-to-word modeling that improve accuracy in speech recognition processing through the replacement of out-of-vocabulary (OOV) tokens. During the decoding of speech signals, better accuracy in speech recognition processing is achieved through training and implementation of multiple different solutions that present enhanced speech recognition models. In one example, a hybrid neural network model for speech recognition processing combines a word-based neural network model as a primary model and a character-based neural network model as an auxiliary model. The primary word-based model emits a word sequence, and an output of character-based auxiliary model is consulted at a segment where the word-based model emits an OOV token. In another example, a mixed unit speech recognition model is developed and trained to generate a mixed word and character sequence during decoding of a speech signal without requiring generation of OOV tokens.
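The hybrid approach in the abstract can be illustrated with a minimal sketch. It is not the patented implementation: the `<OOV>` spelling, the `Segment` type, the `replace_oov` function, and the frame-overlap rule used to pick characters are all assumptions made for illustration. The character-based output is passed in as a callable so that it is only consulted when the word-based output contains an OOV token.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

OOV = "<OOV>"  # hypothetical spelling of the out-of-vocabulary token


@dataclass
class Segment:
    token: str  # word emitted by the word-based model, or OOV
    start: int  # first frame of the segment (inclusive)
    end: int    # end frame of the segment (exclusive)


def replace_oov(word_segments: List[Segment],
                char_decode: Callable[[], List[Tuple[str, int]]]) -> List[str]:
    """Return the word sequence with each OOV segment spelled out in characters."""
    # Skip the character-based model entirely when no OOV token was emitted.
    if not any(seg.token == OOV for seg in word_segments):
        return [seg.token for seg in word_segments]

    char_frames = char_decode()  # list of (character, frame index) pairs
    result = []
    for seg in word_segments:
        if seg.token != OOV:
            result.append(seg.token)
            continue
        # Assumed rule: keep the characters whose frame lies inside the OOV segment.
        chars = [c for c, frame in char_frames if seg.start <= frame < seg.end]
        result.append("".join(chars) if chars else OOV)
    return result


words = [Segment("play", 0, 10), Segment(OOV, 10, 30), Segment("now", 30, 40)]
chars = [("p", 2), ("l", 4), ("a", 6), ("y", 8),
         ("z", 12), ("o", 16), ("e", 20), ("y", 25),
         ("n", 31), ("o", 34), ("w", 37)]
print(replace_oov(words, lambda: chars))  # ['play', 'zoey', 'now']
```

If no characters fall inside an OOV segment, the sketch leaves the OOV token in place rather than guessing a word.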

First claim

Opening claim text (preview).

What is claimed is:

1. A method comprising: detecting that a speech signal comprises an out-of-vocabulary (OOV) token using a word-based language model; propagating the speech signal to a character-based language model for character-based evaluation only when the OOV token is detected in the speech signal by the word-based language model; generating a character sequence for the OOV token; and outputting a speech recognition result for the speech signal that comprises the generated character sequence for the OOV token.

2. The method of claim 1, wherein the speech signal is processed using a hybrid neural network model that comprises an acoustic-to-word model for detection of the OOV token and character-based auxiliary model evaluation for evaluation of the OOV token.

3. The method of claim 2, further comprising: training the hybrid neural network model based on processing associated with the speech recognition result.

4. The method of claim 3, wherein the training further comprises receiving, from a productivity service, usage data associated with an interaction with the speech recognition result and updating training data of the hybrid neural network model based on the usage data.

5. The method of claim 2, wherein the hybrid neural network model is a hybrid Connectionist Temporal Classification (CTC) model that comprises an acoustic-to-word CTC model for evaluation of the speech signal and a character-based CTC model for evaluation of the OOV token.

6. The method of claim 5, wherein the hybrid neural network model is a hybrid Connectionist Temporal Classification (CTC) model that comprises an acoustic-to-word CTC model is trained to identify frequent words, and wherein the OOV token is generated when the speech signal is identified as an infrequent word that is not recognized by the acoustic-to-word CTC model.

7. The method of claim 1, further comprising: collapsing the character sequence into an output unit for the OOV token, wherein the outputting outputs the output unit in the speech recognition result.

8. The method of claim 1, wherein the outputting comprises propagating the speech recognition result to an application or productivity service.

9. A method comprising: receiving a speech signal; decoding the speech signal using a mixed unit speech recognition model that is trained based on word and character sequences; generating, for the speech signal, a mixed word and character sequence based on an evaluation of the speech signal by the mixed unit speech recognition model, wherein the mixed unit speech recognition model applies a word-based language model to detect an out of vocabulary (OOV) token and applies a character-based language model to evaluate the OOV token only when the OOV token is detected in the speech signal by the word-based language model; decomposing the mixed word and character sequence; and outputting a speech recognition result for the speech signal that comprises the decomposed mixed word and character sequence.

10. The method of claim 9, further comprising: collapsing the character sequence of the decomposed mixed word and character sequence, wherein the speech recognition result, output in the outputting, comprises the collapsed character sequence.

11. The method of claim 9, wherein the mixed unit speech recognition model is a mixed Connectionist Temporal Classification (CTC) model.

12. The method of claim 9, wherein the speech signal is received during real-time execution of an application or service.

13. The method of claim 9, wherein the outputting comprises propagating the speech recognition result to an application or service for subsequent processing.

14. The method of claim 9, further comprising: updating training data for the mixed unit speech recognition model based on usage data, of the speech recognition result, that is associated with an application or service.

15. A system comprising: at least one processor; and a memory, operatively connected with the at least one processor, storing computer-executable instructions that, when executed by the at least one processor, causes the at least one processor to execute a method that comprises: receiving a speech signal; decoding the speech signal using a mixed unit speech recognition model that is trained based on word and character sequences; generating, for the speech signal, a mixed word and character sequence based on an evaluation of the speech signal by the mixed unit speech recognition model, wherein the mixed unit speech recognition model applies a word-based language model to detect an out of vocabulary (OOV) token and applies a character-based language model to evaluate the OOV token only when the OOV token is detected in the speech signal by the word-based language model; decomposing the mixed word and character sequence; and outputting a speech recognition result for the speech signal that comprises the decomposed mixed word and character sequence.

16. The system of claim 15, wherein the method, executed by the at least one processor, further comprises: collapsing the character sequence of the decomposed mixed word and character sequence, wherein the speech recognition result, output in the outputting, comprises the collapsed character sequence.

17. The system of claim 15, wherein the mixed unit speech recognition model is a mixed Connectionist Temporal Classification (CTC) model.

18. The method of claim 15, wherein the speech signal is received during real-time execution of an application or service.

19. The method of claim 15, wherein the outputting comprises propagating the speech recognition result to an application or service for subsequent processing.

20. The system of claim 15, wherein the method, executed by the at least one processor, further comprises: updating training data for the mixed unit speech recognition model based on usage data, of the speech recognition result, that is associated with an application or service.
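The collapsing steps in claims 7, 10, and 16 can be pictured with a short sketch. The first function is the standard greedy CTC collapse (merge repeated labels, then drop blanks). The second joins runs of character units in a mixed word and character sequence into whole words. The blank symbol `<b>`, the `(kind, text)` unit tagging, and both function names are assumptions made for illustration; the claims do not specify how units are marked.

```python
from typing import List, Tuple

BLANK = "<b>"  # hypothetical spelling of the CTC blank label


def ctc_collapse(frame_labels: List[str]) -> List[str]:
    """Standard CTC collapse: merge consecutive repeats, then remove blanks."""
    out = []
    prev = None
    for label in frame_labels:
        if label != prev and label != BLANK:
            out.append(label)
        prev = label
    return out


def collapse_mixed(units: List[Tuple[str, str]]) -> List[str]:
    """Join each run of ("char", c) units into one word; pass ("word", w) through."""
    words, buf = [], []
    for kind, text in units:
        if kind == "char":
            buf.append(text)
            continue
        if buf:  # a word unit ends the current run of characters
            words.append("".join(buf))
            buf = []
        words.append(text)
    if buf:  # flush a run of characters at the end of the sequence
        words.append("".join(buf))
    return words


print(ctc_collapse(["p", "p", BLANK, "l", "a", "a", BLANK, "y"]))
# ['p', 'l', 'a', 'y']
mixed = [("word", "play"), ("char", "z"), ("char", "o"),
         ("char", "e"), ("char", "y"), ("word", "now")]
print(collapse_mixed(mixed))
# ['play', 'zoey', 'now']
```

One limitation of this tagging scheme: two adjacent spelled-out words would be merged into one, so a real system would need some word-boundary marker between them.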

Assignees

Inventors

Classifications

  • using artificial neural networks · CPC title

  • G10L15/08 · Primary

    Speech classification or search · CPC title

  • Training · CPC title

  • Execution procedure of a spoken command · CPC title

  • G10L15/187 · Primary

    Phonemic context, e.g. pronunciation rules, phonotactical constraints or phoneme n-grams · CPC title

Patent family

Related publications grouped by family.

External sources

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US10629193B2 cover?
Non-limiting examples of the present disclosure describe advancements in acoustic-to-word modeling that improve accuracy in speech recognition processing through the replacement of out-of-vocabulary (OOV) tokens. During the decoding of speech signals, better accuracy in speech recognition processing is achieved through training and implementation of multiple different solutions that present enh…
Who is the assignee on this patent?
Microsoft Technology Licensing, LLC
What technology area does this patent fall under?
Primary CPC classification G10L15/08. Mapped technology areas include Physics.
When was this patent published?
Publication date Apr 21, 2020 (kind code B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 8 related publications on this page (citations in our corpus or others sharing the same primary CPC).