US2024135923A1 · US · A1
| Field | Value |
|---|---|
| Publication number | US-2024135923-A1 |
| Application number | US-202318485271-A |
| Country | US |
| Kind code | A1 |
| Filing date | Oct 11, 2023 |
| Priority date | Oct 13, 2022 |
| Publication date | Apr 25, 2024 |
| Grant date | — |
Official abstract text for this publication.
A method includes receiving a sequence of acoustic frames as input to a multilingual automated speech recognition (ASR) model configured to recognize speech in a plurality of different supported languages and generating, by an audio encoder of the multilingual ASR, a higher order feature representation for a corresponding acoustic frame in the sequence of acoustic frames. The method also includes generating, by a language identification (LID) predictor of the multilingual ASR, a language prediction representation for a corresponding higher order feature representation. The method also includes generating, by a decoder of the multilingual ASR, a probability distribution over possible speech recognition results based on the corresponding higher order feature representation, a sequence of non-blank symbols, and a corresponding language prediction representation. The decoder includes a monolingual output layer having a plurality of output nodes each sharing a plurality of language-specific wordpiece models.
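The node sharing described in the abstract can be illustrated with a small sketch. It is not the applicant's implementation: the toy vocabularies, the function names (`build_node_tables`, `decode_node`, `layer_shapes`), and the comparison against a layer with separate nodes per language are assumptions made for illustration. The alphabetical ordering follows claim 7, and the H×V shape follows claim 2.

```python
from typing import Dict, List, Tuple


def build_node_tables(vocabs: Dict[str, List[str]]) -> Dict[str, List[str]]:
    """Map output node i to the i-th wordpiece of each language.

    Assumption: every language has the same number V of wordpieces, and
    nodes are associated with wordpieces in alphabetical order (claim 7).
    """
    sizes = {len(wordpieces) for wordpieces in vocabs.values()}
    if len(sizes) != 1:
        raise ValueError("this sketch assumes the same V for every language")
    return {lang: sorted(wordpieces) for lang, wordpieces in vocabs.items()}


def decode_node(tables: Dict[str, List[str]], lang: str, node: int) -> str:
    """Resolve a shared output node to a wordpiece for one language."""
    return tables[lang][node]


def layer_shapes(
    hidden_size: int, vocab_size: int, num_languages: int
) -> Tuple[Tuple[int, int], Tuple[int, int]]:
    """Return (shared H x V shape, hypothetical H x (L*V) shape)."""
    shared = (hidden_size, vocab_size)
    # Hypothetical baseline with one node per (language, wordpiece) pair.
    separate = (hidden_size, vocab_size * num_languages)
    return shared, separate


# Toy vocabularies, invented for this example.
toy_vocabs = {"en": ["the", "cat", "a"], "fr": ["le", "chat", "un"]}
tables = build_node_tables(toy_vocabs)
shared_shape, separate_shape = layer_shapes(
    hidden_size=4, vocab_size=3, num_languages=len(toy_vocabs)
)

print(decode_node(tables, "en", 0), decode_node(tables, "fr", 0))
print(shared_shape, separate_shape)
```

In this sketch the same node index resolves to a different wordpiece depending on the language, so the layer width stays at V regardless of how many languages are supported.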
Opening claim text (preview).
What is claimed is:
1. A multilingual automated speech recognition (ASR) model for recognizing speech in a plurality of different supported languages, the multilingual ASR model comprising: an audio encoder configured to: receive, as input, a sequence of acoustic frames; and generate, at each of a plurality of output steps, a higher order feature representation for a corresponding acoustic frame in the sequence of acoustic frames; and a language identification (LID) predictor configured to: receive, as input, the higher order feature representation generated by the audio encoder at each of the plurality of output steps; and generate, at each of the plurality of output steps, a language prediction representation; and a decoder comprising a monolingual output layer having a plurality of output nodes each sharing a plurality of language-specific wordpiece models, wherein the decoder is configured to: receive, as input, the higher order feature representation generated by the audio encoder at each of the plurality of output steps, a sequence of non-blank symbols output by the monolingual output layer, and the language prediction representation generated by the LID predictor at each of the plurality of output steps; and generate, at each of the plurality of output steps, a probability distribution over possible speech recognition results.
2. The multilingual ASR model of claim 1, wherein: each language of the plurality of different supported languages comprises V number of wordpiece models; the monolingual output layer comprises an input size equal to H; and the monolingual output layer comprises a dimension equal to H×V.
3. The multilingual ASR model of claim 1, wherein each language-specific wordpiece model of the plurality of language-specific wordpiece models shared by each corresponding output node comprises a language-specific wordpiece model corresponding to a respective language among the plurality of different supported languages that is different than the respective languages corresponding to the other language-specific wordpiece models shared by the corresponding output node.
4. The multilingual ASR model of claim 3, wherein each language-specific wordpiece model comprises a respective wordpiece token vocabulary in a writing system corresponding to the respective language.
5. The multilingual ASR model of claim 1, wherein the sequence of acoustic frames received as input at the audio encoder characterize an utterance spoken in at least one of the plurality of different supported languages.
6. The multilingual ASR model of claim 5, wherein the utterance comprises a code-mixed utterance comprising one or more words spoken in a first language and one or more other words spoken in a second language.
7. The multilingual ASR model of claim 1, wherein, for each of the plurality of different supported languages, the plurality of output nodes of the monolingual output layer associate to corresponding language-specific wordpiece models for each of the plurality of different supported languages alphabetically.
8. The multilingual ASR model of claim 1, wherein, when two or more of the plurality of different supported languages share a same corresponding language-specific wordpiece model, the monolingual output layer associates the same corresponding language-specific wordpiece model to share a same one of the plurality of output nodes.
9. The multilingual ASR model of claim 8, wherein an associating process associates same language-specific wordpiece models shared by different languages to output nodes by: identifying all language-specific wordpiece models across all of the plurality of different supported languages that are shared by two or more of the plurality of different languages; and for each corresponding language-specific wordpiece model identified as being shared by two or more of the plurality of different languages: indexing the corresponding language-specific wordpiece model from 1 to S, wherein S denotes a number of the different languages that share the corresponding language-specific wordpiece model; and assigning the corresponding language-specific wordpiece model to occupy a respective one of the plurality of output nodes for each of the S number of the different languages that share the corresponding language-specific wordpiece model.
10. The multilingual ASR model of claim 9, wherein, for the corresponding language-specific wordpiece model assigned to occupy the respective one of the plurality of output nodes for each of the S number of different languages, the associating process merges the corresponding language-specific wordpiece model indexed from 1 to S into a single language-specific wordpiece model shared by each of the S number of the different languages.
11. The multilingual ASR model of claim 1, wherein: the language prediction representation received as input at the decoder at each of the plurality of output steps represents a probability distribution over possible languages among the plurality of different supported languages that is predicted for a corresponding acoustic frame in the sequence of acoustic frames; and the decoder generates the probability distribution over possible speech recognition results at each of the plurality of output steps only over the language-specific wordpiece models that correspond to the top-K languages in the probability distribution over possible languages represented by the language prediction representation at the corresponding output step.
12. The multilingual ASR model of claim 11, wherein: K is less than a total number of the different supported languages; and K comprises a frame-dependent variable that adapts.
13. The multilingual ASR model of claim 1, wherein the monolingual output layer performs beam-searching over a top N candidate hypotheses selected from the probability distribution over possible speech recognition results at each of the plurality of output steps.
14. The multilingual ASR model of claim 1, wherein the decoder further comprises: a prediction network configured to: receive, as input, the sequence of non-blank symbols output by the monolingual output layer and the language prediction representation generated by the LID predictor at each of the plurality of output steps; and generate, at each of the plurality of output steps, a dense representation; and a joint network configured to: receive, as input, the dense representation generated by the prediction network at each of the plurality of output steps, the higher order feature representation generated by the audio encoder at each of the plurality of output steps, and the language prediction representation generated by the LID predictor at each of the plurality of output steps; and generate, at each of the plurality of output steps, the probability distribution over possible speech recognition results.
15. The multilingual ASR model of claim 14, wherein the joint network comprises a combination structure that stacks gating and bilinear pooling to fuse the dense representation generated by the prediction network and the higher order feature representation generated by the audio encoder.
16. The multilingual ASR model of claim 1, wherein: the audio encoder comprises a cascaded encoder comprising: a first encoder configured to: receive, as input, the sequence of acoustic frames; and generate, at each of the plurality of output steps, a first higher order feature representation for a corresponding acoustic frame in the sequence of acoustic frames; and a second
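The top-K language restriction recited in claims 11 and 12 can be sketched as follows. This is a minimal illustration under stated assumptions, not the claimed implementation: the helper names, the toy tables and probabilities, and in particular the cumulative-mass rule used to make K frame-dependent are hypothetical, since the claims only state that K adapts per frame and stays below the number of supported languages.

```python
from typing import Dict, List, Tuple


def top_k_languages(lid_probs: Dict[str, float], k: int) -> List[str]:
    """Return the k most probable languages (ties broken by name)."""
    ranked = sorted(lid_probs.items(), key=lambda item: (-item[1], item[0]))
    return [lang for lang, _ in ranked[:k]]


def adaptive_k(lid_probs: Dict[str, float], mass: float) -> int:
    """Hypothetical rule: smallest K whose cumulative LID mass reaches
    `mass`, capped so K stays below the number of languages (claim 12)."""
    total = 0.0
    k = 0
    for prob in sorted(lid_probs.values(), reverse=True):
        total += prob
        k += 1
        if total >= mass:
            break
    return min(k, len(lid_probs) - 1)


def candidates_for_frame(
    node_probs: List[float],
    tables: Dict[str, List[str]],
    lid_probs: Dict[str, float],
    k: int,
) -> List[Tuple[str, str, float]]:
    """Expand shared output nodes only into wordpieces of top-K languages."""
    languages = top_k_languages(lid_probs, k)
    return [
        (lang, tables[lang][node], prob)
        for node, prob in enumerate(node_probs)
        for lang in languages
    ]


# Toy data, invented for this example.
tables = {
    "en": ["a", "cat", "the"],
    "fr": ["chat", "le", "un"],
    "de": ["das", "ein", "katze"],
}
lid_probs = {"en": 0.7, "fr": 0.2, "de": 0.1}
node_probs = [0.1, 0.6, 0.3]

k = adaptive_k(lid_probs, mass=0.8)
frame_candidates = candidates_for_frame(node_probs, tables, lid_probs, k)
for lang, wordpiece, prob in frame_candidates:
    print(lang, wordpiece, prob)
```

With these toy values the third language is excluded for the frame, so only wordpieces from the two most likely languages are considered. A frame with a sharper language prediction would yield a smaller K under the same hypothetical rule.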
Probabilistic grammars, e.g. word n-grams · CPC title
Language recognition · CPC title
Feature extraction for speech recognition; Selection of recognition unit · CPC title
using artificial neural networks · CPC title