Scalable model specialization framework for speech model personalization

US12562154B2 · US · B2

Patent metadata
Field	Value
Publication number	US-12562154-B2
Application number	US-202318184630-A
Country	US
Kind code	B2
Filing date	Mar 15, 2023
Priority date	Mar 18, 2022
Publication date	Feb 24, 2026
Grant date	Feb 24, 2026

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

Title
What the patent document calls the invention.
Abstract
A short plain-language summary of the technical disclosure.
Assignees and inventors
Who owns or filed the patent and who is credited as inventor.
Key dates
Filing, priority, publication, and grant dates set the timeline.
First independent claim
The legal scope of protection — read this for what is actually claimed.
CPC / IPC classifications
Technology tags used to group this patent with similar filings.
Citations and related patents
Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

A method for speech conversion includes obtaining a speech conversion model configured to convert input utterances of human speech directly into corresponding output utterances of synthesized speech. The method further includes receiving a speech conversion request including input audio data corresponding to an utterance spoken by a target speaker associated with atypical speech and a speaker identifier uniquely identifying the target speaker. The method includes activating, using the speaker identifier, a particular sub-model for biasing the speech conversion model to recognize a type of the atypical speech associated with the target speaker identified by the speaker identifier. The method includes converting, using the speech conversion model biased by the activated particular sub-model, the input audio data corresponding to the utterance spoken by the target speaker associated with atypical speech into output audio data corresponding to a synthesized canonical fluent speech representation of the utterance spoken by the target speaker.
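The request flow in the abstract (look up a sub-model from the speaker identifier, then convert with the biased model) can be illustrated with a toy sketch. This is not the patented implementation: the class name, the speaker-to-type lookup table, the function-valued sub-models, and the placeholder decoder are all hypothetical, and plain lists of floats stand in for real audio features.

```python
from typing import Callable, Dict, List

Frame = List[float]                    # toy stand-in for one audio feature vector
SubModel = Callable[[Frame], Frame]    # hypothetical: a sub-model modeled as a function


class ToySpeechConversionModel:
    """Illustrative only: one shared base model plus per-type sub-models."""

    def __init__(self, sub_models: Dict[str, SubModel],
                 speaker_types: Dict[str, str]) -> None:
        self.sub_models = sub_models        # atypical-speech type -> sub-model
        self.speaker_types = speaker_types  # speaker identifier -> type (assumed lookup)

    def activate(self, speaker_id: str) -> SubModel:
        # Select the sub-model for the type of speech tied to this speaker.
        speech_type = self.speaker_types[speaker_id]
        return self.sub_models[speech_type]

    def convert(self, speaker_id: str, frames: List[Frame]) -> List[Frame]:
        bias = self.activate(speaker_id)
        # "Encoder" biased by the activated sub-model (toy: apply it per frame).
        encoded = [bias(frame) for frame in frames]
        # Placeholder "decoder": doubles every value; a real decoder emits audio.
        return [[2.0 * value for value in frame] for frame in encoded]


model = ToySpeechConversionModel(
    sub_models={
        "type_a": lambda frame: [value + 1.0 for value in frame],
        "type_b": lambda frame: [value - 1.0 for value in frame],
    },
    speaker_types={"speaker-001": "type_a"},
)
output = model.convert("speaker-001", [[1.0, 2.0]])
```

The point of the sketch is the structure: the base model is shared by every speaker, and only the small sub-model selected through the speaker identifier differs per request.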

First claim

Opening claim text (preview).

What is claimed is:

1. A computer-implemented method when executed on data processing hardware causes the data processing hardware to perform operations comprising: obtaining a speech conversion model configured to convert input utterances of human speech directly into corresponding output utterances of synthesized speech, the speech conversion model comprising an encoder and a decoder; receiving a speech conversion request comprising input audio data corresponding to an utterance spoken by a target speaker associated with atypical speech and a speaker identifier uniquely identifying the target speaker; activating, using the speaker identifier, a particular sub-model for biasing the speech conversion model to recognize a type of the atypical speech associated with the target speaker identified by the speaker identifier; and converting, using the speech conversion model biased by the activated particular sub-model, the input audio data corresponding to the utterance spoken by the target speaker associated with atypical speech into output audio data corresponding to a synthesized canonical fluent speech representation of the utterance spoken by the target speaker by: generating, as output from the encoder configured to receive the input audio data as input, encoded audio data, the encoded audio data including a series of vectors; and generating, as output from the decoder configured to receive the encoded audio data output from the encoder as input, the output audio data corresponding to the synthesized canonical fluent representation of the utterance spoken by the target speaker.

2. The computer-implemented method of claim 1, wherein the speech conversion model is: trained on generalized training data; and speaker- and domain-independent.

3. The computer-implemented method of claim 1, wherein generating the output audio data corresponding to the synthesized canonical fluent representation of the utterance spoken by the target speaker comprises generating the output audio data corresponding to the synthesized canonical fluent representation of the utterance spoken by the target speaker without performing any intermediate text-to-speech conversion on a textual representation corresponding to a transcription of the utterance.

4. The computer-implemented method of claim 3, wherein the encoder comprises a stack of self-attention blocks each having a multi-headed self attention mechanism.

5. The computer-implemented method of claim 4, wherein the particular sub-model comprises a stack of residual adaptors disposed between each of the self-attention blocks in the stack of self-attention blocks of the encoder.

6. The computer-implemented method of claim 5, wherein each residual adaptor comprises a normalization layer, followed by a feed-forward layer with down-projection to a bottleneck dimension and a non-linear activation, and another feed-forward layer with up-projection.

7. The computer-implemented method of claim 3, wherein the speech conversion model further comprises a wordpiece decoder configured to: receive, as input, the encoded audio data from the encoder; and generate, as output, a textual representation corresponding to a transcription of the utterance.

8. The computer-implemented method of claim 3, wherein the speech conversion model further comprises a phoneme decoder configured to: receive, as input, the encoded audio data from the encoder; and generate, as output, a phoneme representation of the utterance.

9. The computer-implemented method of claim 1, wherein: the input audio data comprises one of an input spectrogram or an input audio waveform; and the output audio data comprises one of an output spectrogram or an output audio waveform.

10. The computer-implemented method of claim 1, wherein activating the particular sub-model for biasing the speech conversion model comprises: selecting, from among a plurality of sub-models each associated with a different type of atypical speech, the particular sub-model associated with the type of atypical speech associated with the target speaker; and loading the particular sub-model into the speech conversion model for biasing the speech conversion model to recognize the type of the atypical speech associated with the target speaker.

11. A system comprising: data processing hardware; and memory hardware in communication with the data processing hardware, the memory hardware storing instructions that when executed on the data processing hardware cause the data processing hardware to perform operations comprising: obtaining a speech conversion model configured to convert input utterances of human speech directly into corresponding output utterances of synthesized speech, the speech conversion model comprising an encoder and a decoder; receiving a speech conversion request comprising input audio data corresponding to an utterance spoken by a target speaker associated with atypical speech and a speaker identifier uniquely identifying the target speaker; activating, using the speaker identifier, a particular sub-model for biasing the speech conversion model to recognize a type of the atypical speech associated with the target speaker identified by the speaker identifier; and converting, using the speech conversion model biased by the activated particular sub-model, the input audio data corresponding to the utterance spoken by the target speaker associated with atypical speech into output audio data corresponding to a synthesized canonical fluent speech representation of the utterance spoken by the target speaker by: generating, as output from the encoder configured to receive the input audio data as input, encoded audio data, the encoded audio data including a series of vectors; and generating, as output from the decoder configured to receive the encoded audio data output from the encoder as input, the output audio data corresponding to the synthesized canonical fluent representation of the utterance spoken by the target speaker.

12. The system of claim 11, wherein the speech conversion model is: trained on generalized training data; and speaker- and domain-independent.

13. The system of claim 11, wherein generating the output audio data corresponding to the synthesized canonical fluent representation of the utterance spoken by the target speaker comprises generating the output audio data corresponding to the synthesized canonical fluent representation of the utterance spoken by the target speaker without performing any intermediate text-to-speech conversion on a textual representation corresponding to a transcription of the utterance.

14. The system of claim 13, wherein the encoder comprises a stack of self-attention blocks each having a multi-headed self attention mechanism.

15. The system of claim 14, wherein the particular sub-model comprises a stack of residual adaptors disposed between each of the self-attention blocks in the stack of self-attention blocks of the encoder.

16. The system of claim 15, wherein each residual adaptor comprises a normalization layer, followed by a feed-forward layer with down-projection to a bottleneck dimension and a non-linear activation, and another feed-forward layer with up-projection.

17. The system of claim 13, wherein the speech conversion model further comprises a wordpiece decoder configured to: receive, as input, the encoded audio data from the encoder; and generate, as output, a textual representation corresponding to a transcription of the utterance.

18. The syste
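Claims 6 and 16 describe each residual adaptor as a normalization layer, a feed-forward down-projection to a bottleneck dimension with a non-linear activation, and a feed-forward up-projection. A minimal sketch of that layer order in plain Python follows. Several details are assumptions not stated in the claim text: layer normalization without learned scale or bias, ReLU as the activation, no bias terms in the projections, and a skip connection that adds the adaptor output back to its input (suggested by the word "residual"). The function and parameter names are illustrative.

```python
import math
from typing import List

Vector = List[float]
Matrix = List[List[float]]  # row-major: one inner list per output unit


def layer_norm(x: Vector, eps: float = 1e-6) -> Vector:
    # Assumed normalization: zero mean, unit variance, no learned parameters.
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def matvec(w: Matrix, x: Vector) -> Vector:
    # Feed-forward layer without bias: one dot product per output unit.
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]


def residual_adaptor(x: Vector, w_down: Matrix, w_up: Matrix) -> Vector:
    h = layer_norm(x)                 # normalization layer
    h = matvec(w_down, h)             # down-projection to the bottleneck dimension
    h = [max(0.0, v) for v in h]      # non-linear activation (ReLU, assumed)
    h = matvec(w_up, h)               # up-projection back to the model dimension
    return [xi + hi for xi, hi in zip(x, h)]  # skip connection (assumed)


# Model dimension 4, bottleneck dimension 2 (toy sizes).
x = [1.0, 2.0, 3.0, 4.0]
w_down = [[0.5, 0.0, 0.0, 0.5], [0.0, 0.5, 0.5, 0.0]]
w_up_zero = [[0.0, 0.0] for _ in range(4)]
unchanged = residual_adaptor(x, w_down, w_up_zero)
```

With the up-projection set to zero, the adaptor returns its input unchanged. Under these assumptions, that means an untrained adaptor of this shape would leave the shared encoder's behavior intact until its weights are learned.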

Assignees

Google LLC

Inventors

Classifications

G10L2015/025
Phonemes, fenemes or fenones being the recognition units · CPC title
G10L15/063
Training · CPC title
G10L15/02
Feature extraction for speech recognition; Selection of recognition unit · CPC title
G10L13/00
Speech synthesis; Text to speech systems · CPC title
G10L21/003
Changing voice quality, e.g. pitch or formants · CPC title

Patent family

Related publications grouped by family.

View patent family 85937302

External sources

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US12562154B2 cover?: A method for speech conversion includes obtaining a speech conversion model configured to convert input utterances of human speech directly into corresponding output utterances of synthesized speech. The method further includes receiving a speech conversion request including input audio data corresponding to an utterance spoken by a target speaker associated with atypical speech and a speaker i…
Who is the assignee on this patent?: Google LLC
What technology area does this patent fall under?: Primary CPC classification G10L15/16. Mapped technology areas include Physics.
When was this patent published?: Publication date Feb 24, 2026 (B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?: We list 5 related publications on this page (citations in our corpus or others sharing the same primary CPC).

Related patents

System and method for secure transcription generation

Canonical training for highly configurable multilingual speech

Integrating text inputs for training and adapting neural network transducer asr models

Direct Speech-to-Speech Translation via Machine Learning

Large-Scale Multilingual Speech Recognition With A Streaming End-To-End Model