What technology area does this patent fall under?

Primary CPC classification G10L15/197. Mapped technology areas include Physics.

When was this patent published?

Publication date Feb 10, 2026 (kind code B2). Legal status and post-grant events are not shown on this page.

What related patents are in patentsdb?

We list 12 related publications on this page (citations in our corpus or others sharing the same primary CPC).

Universal monolingual output layer for multilingual speech recognition

US12548561B2 · US · B2

Patent metadata
Field	Value
Publication number	US-12548561-B2
Application number	US-202318485271-A
Country	US
Kind code	B2
Filing date	Oct 11, 2023
Priority date	Oct 13, 2022
Publication date	Feb 10, 2026
Grant date	Feb 10, 2026

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

Title
What the patent document calls the invention.
Abstract
A short plain-language summary of the technical disclosure.
Assignees and inventors
Who owns or filed the patent and who is credited as inventor.
Key dates
Filing, priority, publication, and grant dates set the timeline.
First independent claim
The legal scope of protection — read this for what is actually claimed.
CPC / IPC classifications
Technology tags used to group this patent with similar filings.
Citations and related patents
Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

A method includes receiving a sequence of acoustic frames as input to a multilingual automated speech recognition (ASR) model configured to recognize speech in a plurality of different supported languages and generating, by an audio encoder of the multilingual ASR, a higher order feature representation for a corresponding acoustic frame in the sequence of acoustic frames. The method also includes generating, by a language identification (LID) predictor of the multilingual ASR, a language prediction representation for a corresponding higher order feature representation. The method also includes generating, by a decoder of the multilingual ASR, a probability distribution over possible speech recognition results based on the corresponding higher order feature representation, a sequence of non-blank symbols, and a corresponding language prediction representation. The decoder includes a monolingual output layer having a plurality of output nodes each sharing a plurality of language-specific wordpiece models.
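The last sentence of the abstract is easier to picture with a toy example. The sketch below is illustrative only and is not taken from the patent: it assumes every language has the same vocabulary size V and assigns wordpieces to nodes in alphabetical order (the ordering mentioned in claim 7). The names `build_node_tables` and `node_to_wordpiece`, the toy vocabularies, and the hidden size of 640 are all hypothetical. The comparison against a concatenated per-language vocabulary is likewise an assumption made for contrast; the page itself only states the H×V dimension (claim 2).

```python
from typing import Dict, List


def build_node_tables(vocabs: Dict[str, List[str]]) -> Dict[str, List[str]]:
    """Sort each language's wordpieces so that node i maps to its i-th wordpiece.

    Assumption: every language has the same vocabulary size V.
    """
    sizes = {len(pieces) for pieces in vocabs.values()}
    if len(sizes) != 1:
        raise ValueError("this sketch assumes one vocabulary size V for all languages")
    return {lang: sorted(pieces) for lang, pieces in vocabs.items()}


def node_to_wordpiece(tables: Dict[str, List[str]], lang: str, node: int) -> str:
    """Interpret a shared output node as a wordpiece of the given language."""
    return tables[lang][node]


# Hypothetical toy vocabularies (V = 3 wordpieces per language, L = 2 languages).
vocabs = {"en": ["the", "cat", "a"], "fr": ["le", "chat", "un"]}
tables = build_node_tables(vocabs)

num_languages = len(tables)          # L
num_nodes = len(tables["en"])        # V: one set of nodes shared by all languages
hidden_size = 640                    # H: hypothetical input size of the output layer

shared_weights = hidden_size * num_nodes                         # H x V
concatenated_weights = hidden_size * num_languages * num_nodes   # H x (L * V), for contrast

print(node_to_wordpiece(tables, "en", 0), node_to_wordpiece(tables, "fr", 0))
print(shared_weights, concatenated_weights)
```

In this sketch node 0 stands for a different wordpiece in each language, so the output layer does not grow with the number of languages; the language prediction is what selects which table to read.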

First claim

Opening claim text (preview).

What is claimed is:
1. A system comprising: data processing hardware; and memory hardware in communication with the data processing hardware, the memory hardware storing instructions that when executed on the data processing hardware cause the data processing hardware to perform operations corresponding to a multilingual automated speech recognition (ASR) model for recognizing speech in a plurality of different supported languages, the multilingual ASR model comprising: an audio encoder configured to: receive, as input, a sequence of acoustic frames; and generate, at each of a plurality of output steps, a higher order feature representation for a corresponding acoustic frame in the sequence of acoustic frames; and a language identification (LID) predictor configured to: receive, as input, the higher order feature representation generated by the audio encoder at each of the plurality of output steps; and generate, at each of the plurality of output steps, a language prediction representation; and a decoder comprising a monolingual output layer having a plurality of output nodes each sharing a plurality of language-specific wordpiece models, wherein the decoder is configured to: receive, as input, the higher order feature representation generated by the audio encoder at each of the plurality of output steps, a sequence of non-blank symbols output by the monolingual output layer, and the language prediction representation generated by the LID predictor at each of the plurality of output steps; and generate, at each of the plurality of output steps, a probability distribution over possible speech recognition results.
2. The system of claim 1, wherein: each language of the plurality of different supported languages comprises V number of wordpiece models; the monolingual output layer comprises an input size equal to H; and the monolingual output layer comprises a dimension equal to H×V.
3. The system of claim 1, wherein each language-specific wordpiece model of the plurality of language-specific wordpiece models shared by each corresponding output node comprises a language-specific wordpiece model corresponding to a respective language among the plurality of different supported languages that is different than the respective languages corresponding to the other language-specific wordpiece models shared by the corresponding output node.
4. The system of claim 3, wherein each language-specific wordpiece model comprises a respective wordpiece token vocabulary in a writing system corresponding to the respective language.
5. The system of claim 1, wherein the sequence of acoustic frames received as input at the audio encoder characterize an utterance spoken in at least one of the plurality of different supported languages.
6. The system of claim 5, wherein the utterance comprises a code-mixed utterance comprising one or more words spoken in a first language and one or more other words spoken in a second language.
7. The system of claim 1, wherein, for each of the plurality of different supported languages, the plurality of output nodes of the monolingual output layer associate to corresponding language-specific wordpiece models for each of the plurality of different supported languages alphabetically.
8. The system of claim 1, wherein, when two or more of the plurality of different supported languages share a same corresponding language-specific wordpiece model, the monolingual output layer associates the same corresponding language-specific wordpiece model to share a same one of the plurality of output nodes.
9. The system of claim 8, wherein an associating process associates same language-specific wordpiece models shared by different languages to output nodes by: identifying all language-specific wordpiece models across all of the plurality of different supported languages that are shared by two or more of the plurality of different languages; and for each corresponding language-specific wordpiece model identified as being shared by two or more of the plurality of different languages: indexing the corresponding language-specific wordpiece model from 1 to S, wherein S denotes a number of the different languages that share the corresponding language-specific wordpiece model; and assigning the corresponding language-specific wordpiece model to occupy a respective one of the plurality of output nodes for each of the S number of the different languages that share the corresponding language-specific wordpiece model.
10. The system of claim 9, wherein, for the corresponding language-specific wordpiece model assigned to occupy the respective one of the plurality of output nodes for each of the S number of different languages, the associating process merges the corresponding language-specific wordpiece model indexed from 1 to S into a single language-specific wordpiece model shared by each of the S number of the different languages.
11. The system of claim 1, wherein: the language prediction representation received as input at the decoder at each of the plurality of output steps represents a probability distribution over possible languages among the plurality of different supported languages that is predicted for a corresponding acoustic frame in the sequence of acoustic frames; and the decoder generates the probability distribution over possible speech recognition results at each of the plurality of output steps only over the language-specific wordpiece models that correspond to the top-K languages in the probability distribution over possible languages represented by the language prediction representation at the corresponding output step.
12. The system of claim 11, wherein: K is less than a total number of the different supported languages; and K comprises a frame-dependent variable that adapts.
13. The system of claim 1, wherein the monolingual output layer performs beam-searching over a top N candidate hypotheses selected from the probability distribution over possible speech recognition results at each of the plurality of output steps.
14. The system of claim 1, wherein the decoder further comprises: a prediction network configured to: receive, as input, the sequence of non-blank symbols output by the monolingual output layer and the language prediction representation generated by the LID predictor at each of the plurality of output steps; and generate, at each of the plurality of output steps, a dense representation; and a joint network configured to: receive, as input, the dense representation generated by the prediction network at each of the plurality of output steps, the higher order feature representation generated by the audio encoder at each of the plurality of output steps, and the language prediction representation generated by the LID predictor at each of the plurality of output steps; and generate, at each of the plurality of output steps, the probability distribution over possible speech recognition results.
15. The system of claim 14, wherein the joint network comprises a combination structure that stacks gating and bilinear pooling to fuse the dense representation generated by the prediction network and the higher order feature representation generated by the audio encoder.
16. The system of claim 1, wherein: the audio encoder comprises a cascaded encoder comprising: a first encoder configured to: receive, as input, the sequence of acoustic frames; and generate, at each of the plurality of output steps, a first higher order feature representation for a corresponding acoustic fr
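Claims 11 and 12 restrict decoding to the wordpiece models of the top-K predicted languages. The following is a minimal sketch of that idea for a single output step, not the patented implementation. It assumes the language posterior and the node posterior are combined by simple multiplication, which the claim text does not specify, and it uses a fixed K rather than the frame-dependent K of claim 12. The function name `top_k_candidates`, the per-language tables, and all probabilities are hypothetical.

```python
from typing import Dict, List, Tuple


def top_k_candidates(
    node_probs: List[float],
    lang_probs: Dict[str, float],
    tables: Dict[str, List[str]],
    k: int,
) -> List[Tuple[str, str, float]]:
    """Score (language, wordpiece) pairs for the k most probable languages only."""
    top_langs = sorted(lang_probs, key=lambda lang: lang_probs[lang], reverse=True)[:k]
    candidates = []
    for lang in top_langs:
        for node, p_node in enumerate(node_probs):
            # Assumption: joint score = P(language) * P(node); the claims do not say how
            # the two distributions are combined.
            candidates.append((lang, tables[lang][node], lang_probs[lang] * p_node))
    # Highest-scoring candidate first.
    candidates.sort(key=lambda c: c[2], reverse=True)
    return candidates


# Hypothetical per-language tables: node i maps to the i-th wordpiece of each language.
tables = {
    "en": ["a", "cat", "the"],
    "fr": ["chat", "le", "un"],
    "de": ["die", "ein", "katze"],
}
node_probs = [0.1, 0.7, 0.2]                      # distribution over the V shared nodes
lang_probs = {"en": 0.6, "fr": 0.3, "de": 0.1}    # LID prediction for this frame

candidates = top_k_candidates(node_probs, lang_probs, tables, k=2)
best_lang, best_piece, best_score = candidates[0]
print(best_lang, best_piece, round(best_score, 2))
```

With K = 2 the German table is never consulted, so the candidate list holds 2 × V entries instead of 3 × V. A beam search such as the one in claim 13 would then keep the top N of these candidates.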

Assignees

Google LLC

Inventors

Classifications

G10L15/02
Feature extraction for speech recognition; Selection of recognition unit · CPC title
G10L15/005
Language recognition · CPC title
G10L15/197 · Primary
Probabilistic grammars, e.g. word n-grams · CPC title
G10L15/16 · Primary
using artificial neural networks · CPC title

Patent family

Related publications grouped by family.

View patent family 88695377

External sources

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US12548561B2 cover?: A method includes receiving a sequence of acoustic frames as input to a multilingual automated speech recognition (ASR) model configured to recognize speech in a plurality of different supported languages and generating, by an audio encoder of the multilingual ASR, a higher order feature representation for a corresponding acoustic frame in the sequence of acoustic frames. The method also includ…
Who is the assignee on this patent?: Google LLC