Phoneme-based contextualization for cross-lingual speech recognition in end-to-end models
US-2020349923-A1 · Nov 5, 2020 · US
US12548561B2 · US · B2
| Field | Value |
|---|---|
| Publication number | US-12548561-B2 |
| Application number | US-202318485271-A |
| Country | US |
| Kind code | B2 |
| Filing date | Oct 11, 2023 |
| Priority date | Oct 13, 2022 |
| Publication date | Feb 10, 2026 |
| Grant date | Feb 10, 2026 |
Official abstract text for this publication.
A method includes receiving a sequence of acoustic frames as input to a multilingual automated speech recognition (ASR) model configured to recognize speech in a plurality of different supported languages and generating, by an audio encoder of the multilingual ASR, a higher order feature representation for a corresponding acoustic frame in the sequence of acoustic frames. The method also includes generating, by a language identification (LID) predictor of the multilingual ASR, a language prediction representation for a corresponding higher order feature representation. The method also includes generating, by a decoder of the multilingual ASR, a probability distribution over possible speech recognition results based on the corresponding higher order feature representation, a sequence of non-blank symbols, and a corresponding language prediction representation. The decoder includes a monolingual output layer having a plurality of output nodes each sharing a plurality of language-specific wordpiece models.
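One way to picture "output nodes each sharing a plurality of language-specific wordpiece models" is a lookup table in which output node i stands for the i-th wordpiece of every supported language, so the layer needs only V nodes rather than one node per wordpiece per language. The sketch below is illustrative only: the toy vocabularies and the names `VOCABS`, `build_node_table`, and `decode_node` are assumptions made for this example and do not come from the patent.

```python
# Hypothetical toy vocabularies (assumption); real wordpiece models are far larger.
VOCABS = {
    "en": ["_the", "_cat", "ing"],
    "fr": ["_le", "_chat", "ment"],
    "de": ["_der", "_katze", "ung"],
}


def build_node_table(vocabs):
    """Map each output node index to one wordpiece per language.

    Assumes every language has the same number V of wordpieces,
    so the shared output layer has exactly V nodes.
    """
    sizes = {len(pieces) for pieces in vocabs.values()}
    if len(sizes) != 1:
        raise ValueError("this sketch assumes every language has V wordpieces")
    num_nodes = sizes.pop()
    return [
        {lang: pieces[i] for lang, pieces in vocabs.items()}
        for i in range(num_nodes)
    ]


def decode_node(table, node, lang):
    """Resolve a shared output node to a wordpiece, given the language."""
    return table[node][lang]


NODE_TABLE = build_node_table(VOCABS)
print(len(NODE_TABLE))                    # number of shared output nodes
print(decode_node(NODE_TABLE, 1, "fr"))   # wordpiece for node 1 in French
```

In this arrangement the same node fires for "_cat", "_chat", and "_katze"; which wordpiece is emitted depends on the language information supplied alongside the node index.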
Opening claim text (preview).
What is claimed is:
1. A system comprising: data processing hardware; and memory hardware in communication with the data processing hardware, the memory hardware storing instructions that when executed on the data processing hardware cause the data processing hardware to perform operations corresponding to a multilingual automated speech recognition (ASR) model for recognizing speech in a plurality of different supported languages, the multilingual ASR model comprising: an audio encoder configured to: receive, as input, a sequence of acoustic frames; and generate, at each of a plurality of output steps, a higher order feature representation for a corresponding acoustic frame in the sequence of acoustic frames; and a language identification (LID) predictor configured to: receive, as input, the higher order feature representation generated by the audio encoder at each of the plurality of output steps; and generate, at each of the plurality of output steps, a language prediction representation; and a decoder comprising a monolingual output layer having a plurality of output nodes each sharing a plurality of language-specific wordpiece models, wherein the decoder is configured to: receive, as input, the higher order feature representation generated by the audio encoder at each of the plurality of output steps, a sequence of non-blank symbols output by the monolingual output layer, and the language prediction representation generated by the LID predictor at each of the plurality of output steps; and generate, at each of the plurality of output steps, a probability distribution over possible speech recognition results.
2. The system of claim 1, wherein: each language of the plurality of different supported languages comprises V number of wordpiece models; the monolingual output layer comprises an input size equal to H; and the monolingual output layer comprises a dimension equal to H×V.
3. The system of claim 1, wherein each language-specific wordpiece model of the plurality of language-specific wordpiece models shared by each corresponding output node comprises a language-specific wordpiece model corresponding to a respective language among the plurality of different supported languages that is different than the respective languages corresponding to the other language-specific wordpiece models shared by the corresponding output node.
4. The system of claim 3, wherein each language-specific wordpiece model comprises a respective wordpiece token vocabulary in a writing system corresponding to the respective language.
5. The system of claim 1, wherein the sequence of acoustic frames received as input at the audio encoder characterize an utterance spoken in at least one of the plurality of different supported languages.
6. The system of claim 5, wherein the utterance comprises a code-mixed utterance comprising one or more words spoken in a first language and one or more other words spoken in a second language.
7. The system of claim 1, wherein, for each of the plurality of different supported languages, the plurality of output nodes of the monolingual output layer associate to corresponding language-specific wordpiece models for each of the plurality of different supported languages alphabetically.
8. The system of claim 1, wherein, when two or more of the plurality of different supported languages share a same corresponding language-specific wordpiece model, the monolingual output layer associates the same corresponding language-specific wordpiece model to share a same one of the plurality of output nodes.
9. The system of claim 8, wherein an associating process associates same language-specific wordpiece models shared by different languages to output nodes by: identifying all language-specific wordpiece models across all of the plurality of different supported languages that are shared by two or more of the plurality of different languages; and for each corresponding language-specific wordpiece model identified as being shared by two or more of the plurality of different languages: indexing the corresponding language-specific wordpiece model from 1 to S, wherein S denotes a number of the different languages that share the corresponding language-specific wordpiece model; and assigning the corresponding language-specific wordpiece model to occupy a respective one of the plurality of output nodes for each of the S number of the different languages that share the corresponding language-specific wordpiece model.
10. The system of claim 9, wherein, for the corresponding language-specific wordpiece model assigned to occupy the respective one of the plurality of output nodes for each of the S number of different languages, the associating process merges the corresponding language-specific wordpiece model indexed from 1 to S into a single language-specific wordpiece model shared by each of the S number of the different languages.
11. The system of claim 1, wherein: the language prediction representation received as input at the decoder at each of the plurality of output steps represents a probability distribution over possible languages among the plurality of different supported languages that is predicted for a corresponding acoustic frame in the sequence of acoustic frames; and the decoder generates the probability distribution over possible speech recognition results at each of the plurality of output steps only over the language-specific wordpiece models that correspond to the top-K languages in the probability distribution over possible languages represented by the language prediction representation at the corresponding output step.
12. The system of claim 11, wherein: K is less than a total number of the different supported languages; and K comprises a frame-dependent variable that adapts.
13. The system of claim 1, wherein the monolingual output layer performs beam-searching over a top N candidate hypotheses selected from the probability distribution over possible speech recognition results at each of the plurality of output steps.
14. The system of claim 1, wherein the decoder further comprises: a prediction network configured to: receive, as input, the sequence of non-blank symbols output by the monolingual output layer and the language prediction representation generated by the LID predictor at each of the plurality of output steps; and generate, at each of the plurality of output steps, a dense representation; and a joint network configured to: receive, as input, the dense representation generated by the prediction network at each of the plurality of output steps, the higher order feature representation generated by the audio encoder at each of the plurality of output steps, and the language prediction representation generated by the LID predictor at each of the plurality of output steps; and generate, at each of the plurality of output steps, the probability distribution over possible speech recognition results.
15. The system of claim 14, wherein the joint network comprises a combination structure that stacks gating and bilinear pooling to fuse the dense representation generated by the prediction network and the higher order feature representation generated by the audio encoder.
16. The system of claim 1, wherein: the audio encoder comprises a cascaded encoder comprising: a first encoder configured to: receive, as input, the sequence of acoustic frames; and generate, at each of the plurality of output steps, a first higher order feature representation for a corresponding acoustic fr
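Claim 11 restricts the decoder's output distribution, at each step, to the wordpiece models of the top-K languages predicted by the LID predictor. The sketch below shows one possible reading of that step for a single frame. It rests on several assumptions that the claims do not state: the shared-node scores are turned into probabilities with a softmax, the LID probabilities of the K retained languages are renormalized, and the two are combined by a simple product. All names and values (`NODE_TABLE`, `restricted_distribution`, the example scores) are hypothetical.

```python
import math

# Hypothetical shared output nodes (assumption): node i maps to the i-th
# wordpiece of each language.
NODE_TABLE = [
    {"en": "_the", "fr": "_le", "de": "_der"},
    {"en": "_cat", "fr": "_chat", "de": "_katze"},
    {"en": "ing", "fr": "ment", "de": "ung"},
]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def top_k_languages(lid_probs, k):
    """Keep the k most probable languages and renormalize them to sum to 1."""
    ranked = sorted(lid_probs.items(), key=lambda item: item[1], reverse=True)[:k]
    total = sum(prob for _, prob in ranked)
    return {lang: prob / total for lang, prob in ranked}


def restricted_distribution(node_logits, lid_probs, node_table, k):
    """Distribution over (language, wordpiece) pairs for the top-k languages only.

    The product rule used here is an assumption of this sketch, not a
    combination rule stated in the claims.
    """
    node_probs = softmax(node_logits)
    kept = top_k_languages(lid_probs, k)
    dist = {}
    for node, p_node in enumerate(node_probs):
        for lang, p_lang in kept.items():
            dist[(lang, node_table[node][lang])] = p_node * p_lang
    return dist


# Example frame: made-up scores for the three nodes and a made-up LID posterior.
frame_logits = [2.0, 1.0, 0.0]
frame_lid = {"en": 0.6, "fr": 0.3, "de": 0.1}
DIST = restricted_distribution(frame_logits, frame_lid, NODE_TABLE, k=2)
best = max(DIST, key=DIST.get)
print(best)
```

With K = 2 the German wordpieces never enter the candidate set for this frame, which shrinks the search space from 9 (language, wordpiece) pairs to 6. Claim 12 additionally allows K to vary per frame; here that would amount to passing a different `k` for each call.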
Feature extraction for speech recognition; Selection of recognition unit · CPC title
Language recognition · CPC title
Probabilistic grammars, e.g. word n-grams · CPC title
using artificial neural networks · CPC title