Scalable model specialization framework for speech model personalization

US12562154B2 · US · B2

Patent metadata
Publication number: US-12562154-B2
Application number: US-202318184630-A
Country: US
Kind code: B2
Filing date: Mar 15, 2023
Priority date: Mar 18, 2022
Publication date: Feb 24, 2026
Grant date: Feb 24, 2026

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

A method for speech conversion includes obtaining a speech conversion model configured to convert input utterances of human speech directly into corresponding output utterances of synthesized speech. The method further includes receiving a speech conversion request including input audio data corresponding to an utterance spoken by a target speaker associated with atypical speech and a speaker identifier uniquely identifying the target speaker. The method includes activating, using the speaker identifier, a particular sub-model for biasing the speech conversion model to recognize a type of the atypical speech associated with the target speaker identified by the speaker identifier. The method includes converting, using the speech conversion model biased by the activated particular sub-model, the input audio data corresponding to the utterance spoken by the target speaker associated with atypical speech into output audio data corresponding to a synthesized canonical fluent speech representation of the utterance spoken by the target speaker.
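The request flow in the abstract (speaker identifier in, sub-model selected, biased conversion out) can be pictured with a small sketch. Everything here is hypothetical and chosen for illustration only: the names `SubModel`, `ConversionRequest`, and `SpeechConverter`, the registry dictionaries, and the toy "conversion" that merely shifts feature values. The patent does not disclose this code, and a real system would run a neural encoder and decoder where the placeholder arithmetic sits.

```python
from dataclasses import dataclass
from typing import Dict, List

Frame = List[float]  # one frame of audio features (toy stand-in for a spectrogram frame)


@dataclass(frozen=True)
class SubModel:
    """Hypothetical stand-in for a small set of per-type parameters."""
    speech_type: str
    offset: float  # placeholder for real sub-model weights


@dataclass(frozen=True)
class ConversionRequest:
    audio: List[Frame]
    speaker_id: str


class SpeechConverter:
    """Toy shared base model plus a registry of swappable sub-models."""

    def __init__(self, sub_models: Dict[str, SubModel], speaker_types: Dict[str, str]):
        self._sub_models = sub_models        # speech type -> sub-model
        self._speaker_types = speaker_types  # speaker id -> speech type

    def activate(self, speaker_id: str) -> SubModel:
        # Select the sub-model for the type of speech tied to this speaker.
        speech_type = self._speaker_types[speaker_id]
        return self._sub_models[speech_type]

    def convert(self, request: ConversionRequest) -> List[Frame]:
        sub_model = self.activate(request.speaker_id)
        # Placeholder "biased conversion": the base model is the identity
        # and the sub-model shifts every feature by its offset.
        return [[x + sub_model.offset for x in frame] for frame in request.audio]


converter = SpeechConverter(
    sub_models={"type_a": SubModel("type_a", 0.5), "type_b": SubModel("type_b", -1.0)},
    speaker_types={"speaker-1": "type_a", "speaker-2": "type_b"},
)
out = converter.convert(ConversionRequest(audio=[[1.0, 2.0]], speaker_id="speaker-1"))
```

The point of the design is that the shared base model stays fixed while only the small sub-model changes per request, which is what lets one deployed model serve many speakers.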

First claim

Opening claim text (preview).

What is claimed is:

1. A computer-implemented method when executed on data processing hardware causes the data processing hardware to perform operations comprising: obtaining a speech conversion model configured to convert input utterances of human speech directly into corresponding output utterances of synthesized speech, the speech conversion model comprising an encoder and a decoder; receiving a speech conversion request comprising input audio data corresponding to an utterance spoken by a target speaker associated with atypical speech and a speaker identifier uniquely identifying the target speaker; activating, using the speaker identifier, a particular sub-model for biasing the speech conversion model to recognize a type of the atypical speech associated with the target speaker identified by the speaker identifier; and converting, using the speech conversion model biased by the activated particular sub-model, the input audio data corresponding to the utterance spoken by the target speaker associated with atypical speech into output audio data corresponding to a synthesized canonical fluent speech representation of the utterance spoken by the target speaker by: generating, as output from the encoder configured to receive the input audio data as input, encoded audio data, the encoded audio data including a series of vectors; and generating, as output from the decoder configured to receive the encoded audio data output from the encoder as input, the output audio data corresponding to the synthesized canonical fluent representation of the utterance spoken by the target speaker.

2. The computer-implemented method of claim 1, wherein the speech conversion model is: trained on generalized training data; and speaker- and domain-independent.

3. The computer-implemented method of claim 1, wherein generating the output audio data corresponding to the synthesized canonical fluent representation of the utterance spoken by the target speaker comprises generating the output audio data corresponding to the synthesized canonical fluent representation of the utterance spoken by the target speaker without performing any intermediate text-to-speech conversion on a textual representation corresponding to a transcription of the utterance.

4. The computer-implemented method of claim 3, wherein the encoder comprises a stack of self-attention blocks each having a multi-headed self attention mechanism.

5. The computer-implemented method of claim 4, wherein the particular sub-model comprises a stack of residual adaptors disposed between each of the self-attention blocks in the stack of self-attention blocks of the encoder.

6. The computer-implemented method of claim 5, wherein each residual adaptor comprises a normalization layer, followed by a feed-forward layer with down-projection to a bottleneck dimension and a non-linear activation, and another feed-forward layer with up-projection.

7. The computer-implemented method of claim 3, wherein the speech conversion model further comprises a wordpiece decoder configured to: receive, as input, the encoded audio data from the encoder; and generate, as output, a textual representation corresponding to a transcription of the utterance.

8. The computer-implemented method of claim 3, wherein the speech conversion model further comprises a phoneme decoder configured to: receive, as input, the encoded audio data from the encoder; and generate, as output, a phoneme representation of the utterance.

9. The computer-implemented method of claim 1, wherein: the input audio data comprises one of an input spectrogram or an input audio waveform; and the output audio data comprises one of an output spectrogram or an output audio waveform.

10. The computer-implemented method of claim 1, wherein activating the particular sub-model for biasing the speech conversion model comprises: selecting, from among a plurality of sub-models each associated with a different type of atypical speech, the particular sub-model associated with the type of atypical speech associated with the target speaker; and loading the particular sub-model into the speech conversion model for biasing the speech conversion model to recognize the type of the atypical speech associated with the target speaker.

11. A system comprising: data processing hardware; and memory hardware in communication with the data processing hardware, the memory hardware storing instructions that when executed on the data processing hardware cause the data processing hardware to perform operations comprising: obtaining a speech conversion model configured to convert input utterances of human speech directly into corresponding output utterances of synthesized speech, the speech conversion model comprising an encoder and a decoder; receiving a speech conversion request comprising input audio data corresponding to an utterance spoken by a target speaker associated with atypical speech and a speaker identifier uniquely identifying the target speaker; activating, using the speaker identifier, a particular sub-model for biasing the speech conversion model to recognize a type of the atypical speech associated with the target speaker identified by the speaker identifier; and converting, using the speech conversion model biased by the activated particular sub-model, the input audio data corresponding to the utterance spoken by the target speaker associated with atypical speech into output audio data corresponding to a synthesized canonical fluent speech representation of the utterance spoken by the target speaker by: generating, as output from the encoder configured to receive the input audio data as input, encoded audio data, the encoded audio data including a series of vectors; and generating, as output from the decoder configured to receive the encoded audio data output from the encoder as input, the output audio data corresponding to the synthesized canonical fluent representation of the utterance spoken by the target speaker.

12. The system of claim 11, wherein the speech conversion model is: trained on generalized training data; and speaker- and domain-independent.

13. The system of claim 11, wherein generating the output audio data corresponding to the synthesized canonical fluent representation of the utterance spoken by the target speaker comprises generating the output audio data corresponding to the synthesized canonical fluent representation of the utterance spoken by the target speaker without performing any intermediate text-to-speech conversion on a textual representation corresponding to a transcription of the utterance.

14. The system of claim 13, wherein the encoder comprises a stack of self-attention blocks each having a multi-headed self attention mechanism.

15. The system of claim 14, wherein the particular sub-model comprises a stack of residual adaptors disposed between each of the self-attention blocks in the stack of self-attention blocks of the encoder.

16. The system of claim 15, wherein each residual adaptor comprises a normalization layer, followed by a feed-forward layer with down-projection to a bottleneck dimension and a non-linear activation, and another feed-forward layer with up-projection.

17. The system of claim 13, wherein the speech conversion model further comprises a wordpiece decoder configured to: receive, as input, the encoded audio data from the encoder; and generate, as output, a textual representation corresponding to a transcription of the utterance.

18. The syste
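The residual adaptor recited in claims 6 and 16 (a normalization layer, a down-projection to a bottleneck with a non-linear activation, then an up-projection) can be sketched in a few lines. This is a minimal illustration, not the patented implementation. Several details are assumptions that the claims leave open: layer normalization as the normalization layer, ReLU as the non-linearity, a skip connection that adds the input back (suggested only by the word "residual"), the absence of bias terms, and the toy dimensions and weights.

```python
import math
from typing import List

Vector = List[float]
Matrix = List[List[float]]  # rows = output units, columns = input units


def layer_norm(x: Vector, eps: float = 1e-6) -> Vector:
    # Assumed normalization: zero mean, unit variance across the vector.
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def linear(w: Matrix, x: Vector) -> Vector:
    # Matrix-vector product; bias terms omitted for brevity.
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]


def relu(x: Vector) -> Vector:
    # Assumed non-linear activation; the claims do not name one.
    return [max(0.0, v) for v in x]


def residual_adapter(x: Vector, w_down: Matrix, w_up: Matrix) -> Vector:
    h = layer_norm(x)                         # normalization layer
    h = relu(linear(w_down, h))               # down-projection to bottleneck + non-linearity
    h = linear(w_up, h)                       # up-projection back to the model dimension
    return [xi + hi for xi, hi in zip(x, h)]  # skip connection (assumed)


# Model dimension 4, bottleneck dimension 2 (illustrative sizes only).
w_down = [[0.1, 0.2, 0.3, 0.4], [0.0, -0.1, 0.1, 0.0]]
w_up_zero = [[0.0, 0.0] for _ in range(4)]
x = [1.0, 2.0, 3.0, 4.0]
y = residual_adapter(x, w_down, w_up_zero)  # zero up-projection leaves x unchanged
```

Because the bottleneck is narrower than the model dimension, each adaptor holds far fewer parameters than the self-attention block it sits beside, which is consistent with the claims treating the adaptor stack as a loadable sub-model. In this sketch a zero up-projection makes the adaptor an identity map, which is used above as a sanity check.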

Assignees

Google LLC

Inventors

Classifications

  • Phonemes, fenemes or fenones being the recognition units · CPC title

  • Training · CPC title

  • Feature extraction for speech recognition; Selection of recognition unit · CPC title

  • Speech synthesis; Text to speech systems · CPC title

  • Changing voice quality, e.g. pitch or formants · CPC title

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US12562154B2 cover?
A method for speech conversion includes obtaining a speech conversion model configured to convert input utterances of human speech directly into corresponding output utterances of synthesized speech. The method further includes receiving a speech conversion request including input audio data corresponding to an utterance spoken by a target speaker associated with atypical speech and a speaker i…
Who is the assignee on this patent?
Google LLC
What technology area does this patent fall under?
Primary CPC classification G10L15/16. Mapped technology areas include Physics.
When was this patent published?
Publication date: Feb 24, 2026 (kind code B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 5 related publications on this page (citations in our corpus or others sharing the same primary CPC).