Canonical training for highly configurable multilingual speech

US12249336B2 · US · B2

Patent metadata
Field               Value
Publication number  US-12249336-B2
Application number  US-202118573846-A
Country             US
Kind code           B2
Filing date         Jun 29, 2021
Priority date       Jun 29, 2021
Publication date    Mar 11, 2025
Grant date          Mar 11, 2025

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

Embodiments are provided for building a configurable multilingual model. A computing system obtains a plurality of language-specific automatic speech recognition modules and a universal automatic speech recognition module trained on a multi-language training dataset comprising training data corresponding to each of the plurality of different languages. The computing system then compiles the universal automatic speech recognition module with the plurality of language-specific automatic speech recognition modules to generate a configurable multilingual model that is configured to selectively and dynamically utilize a sub-set of the plurality of language-specific automatic speech recognition modules with the universal automatic speech recognition module to process audio content in response to user input identifying one or more target languages associated with the audio content.
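
To make the selection step in the abstract concrete, here is a minimal sketch that is not taken from the patent. It assumes each module can be modeled as a plain function from input features to output scores, and that the selected modules are combined by an element-wise sum. The class name `ConfigurableMultilingualModel`, its methods, and the language codes are hypothetical and used only for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Sequence

# Assumption: a "module" is any function from input features to output scores.
Module = Callable[[Sequence[float]], List[float]]


@dataclass
class ConfigurableMultilingualModel:
    """Hypothetical container for one universal module and per-language modules."""

    universal: Module
    language_specific: Dict[str, Module]

    def select(self, target_languages: Sequence[str]) -> List[Module]:
        """Return the universal module plus the sub-set for the target languages."""
        unknown = [lang for lang in target_languages
                   if lang not in self.language_specific]
        if unknown:
            raise KeyError(f"no language-specific module for: {unknown}")
        chosen = [self.language_specific[lang] for lang in target_languages]
        return [self.universal] + chosen

    def process(self, features: Sequence[float],
                target_languages: Sequence[str]) -> List[float]:
        """Run the selected modules and combine their outputs.

        Assumption: outputs are combined by element-wise sum. The abstract
        does not state how the modules' outputs are merged.
        """
        outputs = [module(features) for module in self.select(target_languages)]
        return [sum(values) for values in zip(*outputs)]


def scaling_module(factor: float) -> Module:
    """Toy stand-in for a trained module: multiplies each feature by a factor."""
    return lambda features: [factor * x for x in features]


model = ConfigurableMultilingualModel(
    universal=scaling_module(1.0),
    language_specific={"en": scaling_module(10.0), "de": scaling_module(100.0)},
)
english_only = model.process([1.0, 2.0], ["en"])  # universal + "en" module
universal_only = model.process([1.0, 2.0], [])    # no target language given
```

In this sketch the universal module always runs, and the user's language choice only decides which language-specific modules are added on top of it.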

First claim

Opening claim text (preview).

What is claimed is:

1. A computing system comprising: one or more processors; and one or more hardware storage devices storing one or more computer-readable instructions that are executable by the one or more processors to configure the computing system to at least: obtain a plurality of language-specific automatic speech recognition modules, each language-specific automatic speech recognition module of the plurality of language-specific automatic speech recognition modules having been trained on a different language-specific training dataset and such that each of the plurality of language-specific automatic speech recognition modules is configured to recognize speech in a correspondingly different language of a plurality of different languages; obtain a universal automatic speech recognition module trained on a multi-language training dataset comprising training data corresponding to each of the plurality of different languages and such that the universal automatic speech recognition module is trained to recognize speech in all of the plurality of different languages; compile the universal automatic speech recognition module with the plurality of language-specific automatic speech recognition modules as a configurable multilingual model that is configured to selectively and dynamically utilize a sub-set of the plurality of language-specific automatic speech recognition modules with the universal automatic speech recognition module to process audio content in response to user input identifying one or more target languages associated with the audio content; and training the configurable multilingual model to recognize user input for selecting combinations of the plurality of different languages when configuring the configurable multilingual model into a user-specific automatic speech recognition model by providing the configurable multilingual model with user choice input vectors corresponding to different combinations of the plurality of different languages.

2. The computing system of claim 1, the one or more computer-readable instructions being further executable to further configure the computing system to: obtain a one-hot vector corresponding to a first language; obtain a multi-hot vector corresponding the first language and one or more additional languages; and randomly present the one-hot vector and the multi-hot vector as the user choice input vectors to the configurable multilingual model during training of the configurable multilingual model.

3. The computing system of claim 2, the one or more computer-readable instructions being further executable to further configure the computing system to: apply a language-independent training dataset without language identification data.

4. The computing system of claim 2, as a result of compiling the configurable multilingual model, the configurable multilingual model comprises a language-specific embedding based on the multi-hot vector and an input acoustic feature, a language-specific layer comprising the universal automatic speech recognition module and the plurality of language-specific automatic speech recognition modules, and a language-specific vocabulary that merges one or more language-specific vocabularies in response to user input interpretable for selecting one or more languages, each language corresponding to a different language-specific vocabulary dataset.

5. A computing system comprising: one or more processors; and one or more hardware storage devices storing one or more computer-readable instructions that are executable by the one or more processors to configure the computing system to at least: obtain a configurable multilingual model comprising a universal automatic speech recognition module and a plurality of language-specific automatic speech recognition modules, the configurable multilingual model being trained to dynamically select the universal automatic speech recognition module and a sub-set of language-specific automatic speech recognition modules from the plurality of language-specific automatic speech recognition modules to generate a user-specific automatic speech recognition model configured to recognize spoken utterances in one or more user-identified languages; receive user input comprising (i) a null value corresponding to the universal automatic speech recognition module or (ii) a language identification vector indicating one or more target languages; select the universal automatic speech recognition module; and when the user input comprises the language identification vector, select the sub-set of language-specific automatic speech recognition modules, each language-specific automatic speech recognition modules included in the sub-set of language-specific automatic speech recognition modules trained to recognize spoken utterances in a different language of the one or more target languages.

6. The computing system of claim 5, the one or more computer-readable instructions being further executable to further configure the computing system to: extract the universal automatic speech recognition module and the sub-set of language-specific automatic speech recognition modules from the configurable multilingual model; and at inference time, generate the user-specific automatic speech recognition model by combining the universal automatic speech recognition module and the sub-set of language-specific automatic speech recognition modules.

7. The computing system of claim 6, the one or more computer-readable instructions being further executable to further configure the computing system to: transmit the user-specific automatic speech recognition model to a user device.

8. The computing system of claim 5, the one or more computer-readable instructions being further executable to further configure the computing system to compile the configurable multilingual model by: identifying one or more module languages; obtaining one or more language-specific automatic speech recognition modules, each language-specific automatic speech recognition module of the one or more language-specific automatic speech recognition modules trained on a different language-specific training dataset to train each language-specific automatic speech recognition module to recognize spoken utterances in a different language of the one or more module languages; obtaining a universal automatic speech recognition module trained on a multi-language training dataset comprising training data corresponding to each of the one or more module languages to train the universal automatic speech recognition module to recognize spoken utterances in any of the one or more module languages; and combining the universal automatic speech recognition module and the one or more language-specific automatic speech recognition modules.

9. The computing system of claim 5, the language identification vector comprising a one-hot vector corresponding a single target language.

10. The computing system of claim 5, the language identification vector comprising a multi-hot vector corresponding to a plurality of target languages.

11. The computing system of claim 5, the one or more computer-readable instructions being further executable to further configure the computing system to select the sub-set of language-specific automatic speech recognition modules by: positively weighting each language-specific automatic speech recognition module included in the sub-set of language-specific automatic speech recognition modules; and unweighting each language-specific automatic speech recognition module not included in the sub-set of language-specific automatic speech recognition modules.

12. The computing system of claim 5, the one or more
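
The user choice vectors in claims 2, 9 and 10 and the weighting in claim 11 can be illustrated with a short sketch. It is an interpretation under stated assumptions, not the patented implementation: the language inventory `LANGUAGES` is hypothetical, a positive weight is taken to be 1.0, "unweighting" is taken to mean a weight of 0.0, and the null value of claim 5 is represented by Python's `None`. All function names are made up for the example.

```python
import random
from typing import List, Optional, Sequence

# Hypothetical language inventory; the claims do not name specific languages.
LANGUAGES = ["en", "de", "es", "fr"]


def language_vector(targets: Sequence[str],
                    languages: Sequence[str] = LANGUAGES) -> List[int]:
    """One-hot vector for a single target, multi-hot vector for several."""
    unknown = set(targets) - set(languages)
    if unknown:
        raise ValueError(f"unknown languages: {sorted(unknown)}")
    return [1 if lang in targets else 0 for lang in languages]


def sample_user_choice_vector(first_language: str, rng: random.Random,
                              languages: Sequence[str] = LANGUAGES) -> List[int]:
    """Randomly present a one-hot or a multi-hot vector for a training step.

    The multi-hot vector covers the first language plus one or more
    additional languages, drawn at random here (an assumption).
    """
    one_hot = language_vector([first_language], languages)
    others = [lang for lang in languages if lang != first_language]
    if not others:
        return one_hot
    extra = rng.sample(others, k=rng.randint(1, len(others)))
    multi_hot = language_vector([first_language, *extra], languages)
    return rng.choice([one_hot, multi_hot])


def module_weights(user_input: Optional[Sequence[int]],
                   languages: Sequence[str] = LANGUAGES) -> List[float]:
    """Map user input to per-language module weights.

    None (the null value) selects the universal module alone, so every
    language-specific weight is 0.0. Otherwise selected modules get 1.0
    and all other modules get 0.0.
    """
    if user_input is None:
        return [0.0] * len(languages)
    if len(user_input) != len(languages):
        raise ValueError("vector length must match the language inventory")
    return [1.0 if bit else 0.0 for bit in user_input]


rng = random.Random(0)
training_vector = sample_user_choice_vector("en", rng)
weights = module_weights(language_vector(["en", "fr"]))
```

Presenting both vector types during training is what would let a single model later accept either one language or a combination of languages as the user's choice.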

Assignees

Inventors

Classifications

  • updating or merging of old and new templates; Mean values; Weighting · CPC title

  • Distributed recognition, e.g. in client-server systems, for mobile phones or network applications · CPC title

  • Training · CPC title

  • G10L15/005 (Primary)

    Language recognition · CPC title

  • using artificial neural networks · CPC title

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US12249336B2 cover?
Embodiments are provided for building a configurable multilingual model. A computing system obtains a plurality of language-specific automatic speech recognition modules and a universal automatic speech recognition module trained on a multi-language training dataset comprising training data corresponding to each of the plurality of different languages. The computing system then compiles the uni…
Who is the assignee on this patent?
Microsoft Technology Licensing LLC, Li Jinyu, Zhou Long, and 2 more
What technology area does this patent fall under?
Primary CPC classification G10L15/005. Mapped technology areas include Physics.
When was this patent published?
Publication date Mar 11, 2025 (B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 12 related publications on this page (citations in our corpus or others sharing the same primary CPC).