What technology area does this patent fall under?

Primary CPC classification G10L15/005. Mapped technology areas include Physics.

When was this patent published?

Publication date Mar 18, 2025 (B2). Legal status and post-grant events are not shown on this page.

What related patents are in patentsdb?

We list 2 related publications on this page (citations in our corpus or others sharing the same primary CPC).

Multi-dialect and multilingual speech recognition

US12254865B2 · US · B2

Patent metadata
Field	Value
Publication number	US-12254865-B2
Application number	US-202418418246-A
Country	US
Kind code	B2
Filing date	Jan 20, 2024
Priority date	Nov 21, 2018
Publication date	Mar 18, 2025
Grant date	Mar 18, 2025

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

Title
What the patent document calls the invention.
Abstract
A short plain-language summary of the technical disclosure.
Assignees and inventors
Who owns or filed the patent and who is credited as inventor.
Key dates
Filing, priority, publication, and grant dates set the timeline.
First independent claim
The legal scope of protection — read this for what is actually claimed.
CPC / IPC classifications
Technology tags used to group this patent with similar filings.
Citations and related patents
Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

Methods, systems, and apparatus, including computer programs encoded on computer-readable media, for speech recognition using multi-dialect and multilingual models. In some implementations, audio data indicating audio characteristics of an utterance is received. Input features determined based on the audio data are provided to a speech recognition model that has been trained to output scores indicating the likelihood of linguistic units for each of multiple different languages or dialects. The speech recognition model can be one that has been trained using cluster adaptive training. Output that the speech recognition model generated in response to receiving the input features determined based on the audio data is received. A transcription of the utterance generated based on the output of the speech recognition model is provided.
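The abstract names cluster adaptive training but does not spell out a formulation. As that technique is generally described, a layer's parameters are expressed as a weighted combination of shared cluster bases, with one set of interpolation weights per language or dialect. The sketch below illustrates only that general idea; the function names, the toy matrices, and the choice of interpolating a single weight matrix are assumptions for illustration and are not taken from the patent.

```python
def interpolate_weights(cluster_bases, dialect_weights):
    """Combine shared cluster weight matrices into one dialect-specific matrix.

    cluster_bases: list of equally sized matrices (lists of rows), one per cluster.
    dialect_weights: one interpolation weight per cluster for a given dialect
    (hypothetical values; in training these would be learned or one-hot).
    """
    if len(cluster_bases) != len(dialect_weights):
        raise ValueError("need exactly one interpolation weight per cluster basis")
    rows, cols = len(cluster_bases[0]), len(cluster_bases[0][0])
    combined = [[0.0] * cols for _ in range(rows)]
    for basis, weight in zip(cluster_bases, dialect_weights):
        for i in range(rows):
            for j in range(cols):
                combined[i][j] += weight * basis[i][j]
    return combined


def apply_layer(weights, features):
    """Apply a linear layer (no bias, no nonlinearity) to one feature vector."""
    return [sum(w * x for w, x in zip(row, features)) for row in weights]


# Two toy cluster bases shared by all dialects (assumed values).
bases = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[0.0, 2.0], [2.0, 0.0]],
]

# A one-hot weight vector selects a single basis; a mixed vector blends them.
first_dialect_matrix = interpolate_weights(bases, [1.0, 0.0])
blended_matrix = interpolate_weights(bases, [0.5, 0.5])
blended_output = apply_layer(blended_matrix, [2.0, 4.0])
```

With this arrangement the bases are shared across all languages or dialects, and only the small interpolation vector differs per dialect.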

First claim

Opening claim text (preview).

What is claimed is:

1. A computer-implemented method of jointly performing speech recognition and language prediction using a sequence-to-sequence speech recognition model, the method when executed on data processing hardware causes the data processing hardware to perform operations comprising: receiving audio data characterizing a spoken utterance; processing, using the sequence-to-sequence speech recognition model, the audio data to generate, at each of a plurality of time steps: a probability distribution over a predetermined set of linguistic units; and a predicted language of the spoken utterance among multiple different languages the speech recognition model has been trained to recognize; and providing, as an output from the sequence-to-sequence speech recognition model, a transcription of the utterance based on the probability distribution over the predetermined set of linguistic units and the predicted language generated at each of the plurality of time steps, wherein the speech recognition model is trained using multi-task learning using: a first objective function corresponding to grapheme prediction; and a second objective function corresponding to a language or dialect classification cost, the first objective function and second objective function being weighted such that the speech recognition model is trained to learn hidden representations that are effective for both language and dialect classification and grapheme prediction.

2. The computer-implemented method of claim 1, wherein the sequence-to-sequence speech recognition model is trained using multi-task learning to teach the sequence-to-sequence speech recognition model to learn how to jointly predict linguistic units and language from input audio data.

3. The computer-implemented method of claim 1, wherein the sequence-to-sequence speech recognition model is trained using multi-task learning by: obtaining training data comprising: training audio data characterizing training utterances each spoken in one of a plurality of different languages; and for each training utterance, a corresponding target output label sequence corresponding to a transcription of the corresponding training utterance, wherein each target output label sequence is annotated with a special language symbol indicating the language of the corresponding training utterance; and training the sequence-to-sequence speech recognition model on the training data to learn how to jointly predict the target output label sequence and the special language symbol for each training utterance.

4. The computer-implemented method of claim 3, wherein training the speech recognition model on the training data causes the speech recognition model to: output scores indicative of special language symbols representing the multiple different languages of the utterances; and generate output sequences that include one of the special language symbols representing the multiple different languages.

5. The computer-implemented method of claim 1, wherein the sequence-to-sequence speech recognition model comprises an encoder and a decoder.

6. The computer-implemented method of claim 5, wherein the encoder comprises one or more neural network layers that have parameters learned through training using training examples representing speech in the multiple different languages.

7. The computer-implemented method of claim 5, wherein the decoder comprises one or more neural network layers that have parameters learned through training using training examples representing speech in the multiple different languages.

8. The computer-implemented method of claim 1, wherein the linguistic units are word pieces.

9. The computer-implemented method of claim 1, wherein the linguistic units are graphemes.

10. A system comprising: data processing hardware; and memory hardware in communication with the data processing hardware and storing instructions that when executed on the data processing hardware causes the data processing hardware to perform operations comprising: receiving audio data characterizing a spoken utterance; processing, using the sequence-to-sequence speech recognition model, the audio data to generate, at each of a plurality of time steps: a probability distribution over a predetermined set of linguistic units; and a predicted language of the spoken utterance among multiple different languages the speech recognition model has been trained to recognize; and providing, as an output from the sequence-to-sequence speech recognition model, a transcription of the utterance based on the probability distribution over the predetermined set of linguistic units and the predicted language generated at each of the plurality of time steps, wherein the speech recognition model is trained using multi-task learning using: a first objective function corresponding to grapheme prediction; and a second objective function corresponding to a language or dialect classification cost, the first objective function and second objective function being weighted such that the speech recognition model is trained to learn hidden representations that are effective for both language and dialect classification and grapheme prediction.

11. The system of claim 10, wherein the sequence-to-sequence speech recognition model is trained using multi-task learning to teach the sequence-to-sequence speech recognition model to learn how to jointly predict linguistic units and language from input audio data.

12. The system of claim 10, wherein the sequence-to-sequence speech recognition model is trained using multi-task learning by: obtaining training data comprising: training audio data characterizing training utterances each spoken in one of a plurality of different languages; and for each training utterance, a corresponding target output label sequence corresponding to a transcription of the corresponding training utterance, wherein each target output label sequence is annotated with a special language symbol indicating the language of the corresponding training utterance; and training the sequence-to-sequence speech recognition model on the training data to learn how to jointly predict the target output label sequence and the special language symbol for each training utterance.

13. The system of claim 12, wherein training the speech recognition model on the training data causes the speech recognition model to: output scores indicative of special language symbols representing the multiple different languages of the utterances; and generate output sequences that include one of the special language symbols representing the multiple different languages.

14. The system of claim 10, wherein the sequence-to-sequence speech recognition model comprises an encoder and a decoder.

15. The system of claim 14, wherein the encoder comprises one or more neural network layers that have parameters learned through training using training examples representing speech in the multiple different languages.

16. The system of claim 14, wherein the decoder comprises one or more neural network layers that have parameters learned through training using training examples representing speech in the multiple different languages.

17. The system of claim 10, wherein the linguistic units are word pieces.

18. The system of claim 10, wherein the linguistic units are graphemes.
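Two training ideas in the claims lend themselves to a small illustration: the weighted combination of a grapheme-prediction objective with a language or dialect classification cost (claims 1 and 10), and target label sequences annotated with a special language symbol (claims 3 and 12). The sketch below is a minimal illustration under stated assumptions, not the patented implementation: the use of cross-entropy for both objectives, the convex weighting scheme, the `language_weight` parameter, the `<en-us>` symbol format, and the choice to place the symbol at the start of the sequence are all hypothetical, since the claims do not specify them.

```python
import math


def cross_entropy(probs, target_index):
    """Negative log-probability assigned to the correct class."""
    return -math.log(probs[target_index])


def multitask_loss(grapheme_probs, grapheme_targets,
                   language_probs, language_target,
                   language_weight=0.1):
    """Weighted sum of a grapheme loss and a language-classification loss.

    grapheme_probs: one probability distribution per time step.
    grapheme_targets: index of the correct grapheme at each time step.
    language_weight: hypothetical weighting in [0, 1]; the claims say the two
    objectives are weighted but give no value or scheme.
    """
    grapheme_loss = sum(
        cross_entropy(p, t) for p, t in zip(grapheme_probs, grapheme_targets)
    ) / len(grapheme_targets)
    language_loss = cross_entropy(language_probs, language_target)
    return (1.0 - language_weight) * grapheme_loss + language_weight * language_loss


def annotate_with_language_symbol(graphemes, language):
    """Add a special language symbol to a target label sequence.

    Placing the symbol first is an assumption; the claims only require that
    the sequence be annotated with it.
    """
    return [f"<{language}>"] + list(graphemes)


# Toy example: two time steps over a two-grapheme vocabulary, two languages.
step_probs = [[0.5, 0.5], [0.25, 0.75]]
step_targets = [0, 1]
lang_probs = [0.5, 0.5]

grapheme_only = multitask_loss(step_probs, step_targets, lang_probs, 0, language_weight=0.0)
language_only = multitask_loss(step_probs, step_targets, lang_probs, 0, language_weight=1.0)
annotated = annotate_with_language_symbol("hi", "en-us")
```

Setting `language_weight` to 0 or 1 recovers either objective alone, which makes it easy to check how much each term contributes to the combined loss.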

Assignees

Google LLC

Inventors

Classifications

Code	CPC title
G10L15/07	to the speaker
G10L15/16	using artificial neural networks
G10L2015/0631	Creating reference templates; Clustering
G10L15/063	Training
G10L15/005 (Primary)	Language recognition

Patent family

Related publications grouped by family.

View patent family 70728058

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US12254865B2 cover?: Methods, systems, and apparatus, including computer programs encoded on a computer-readable media, for speech recognition using multi-dialect and multilingual models. In some implementations, audio data indicating audio characteristics of an utterance is received. Input features determined based on the audio data are provided to a speech recognition model that has been trained to output score i…
Who is the assignee on this patent?: Google LLC