US12562154B2 · US · B2
| Field | Value |
|---|---|
| Publication number | US-12562154-B2 |
| Application number | US-202318184630-A |
| Country | US |
| Kind code | B2 |
| Filing date | Mar 15, 2023 |
| Priority date | Mar 18, 2022 |
| Publication date | Feb 24, 2026 |
| Grant date | Feb 24, 2026 |
Official abstract text for this publication.
A method for speech conversion includes obtaining a speech conversion model configured to convert input utterances of human speech directly into corresponding output utterances of synthesized speech. The method further includes receiving a speech conversion request including input audio data corresponding to an utterance spoken by a target speaker associated with atypical speech and a speaker identifier uniquely identifying the target speaker. The method includes activating, using the speaker identifier, a particular sub-model for biasing the speech conversion model to recognize a type of the atypical speech associated with the target speaker identified by the speaker identifier. The method includes converting, using the speech conversion model biased by the activated particular sub-model, the input audio data corresponding to the utterance spoken by the target speaker associated with atypical speech into output audio data corresponding to a synthesized canonical fluent speech representation of the utterance spoken by the target speaker.
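The activation step in the abstract (look up a sub-model from the speaker identifier, then use it to bias a shared conversion model) can be illustrated with a toy sketch. Everything below is hypothetical and not taken from the patent: the class names, the `type_a`/`type_b` labels, the speaker IDs, and the additive `bias` that stands in for learned sub-model weights.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional

Vector = List[float]


@dataclass
class SubModel:
    """Hypothetical sub-model tied to one type of atypical speech."""
    speech_type: str
    bias: float  # toy parameter; a real sub-model would hold learned weights


@dataclass
class SpeechConversionModel:
    """Toy shared base model; 'conversion' here only shifts the input values."""
    active: Optional[SubModel] = None

    def activate(self, sub_model: SubModel) -> None:
        # Load the selected sub-model so later conversions are biased by it.
        self.active = sub_model

    def convert(self, audio: Vector) -> Vector:
        offset = self.active.bias if self.active else 0.0
        return [sample + offset for sample in audio]


# Assumed lookup tables: speaker identifier -> speech type -> sub-model.
SPEAKER_TO_TYPE: Dict[str, str] = {"speaker-001": "type_a", "speaker-002": "type_b"}
SUB_MODELS: Dict[str, SubModel] = {
    "type_a": SubModel("type_a", bias=0.5),
    "type_b": SubModel("type_b", bias=-0.5),
}


def handle_request(model: SpeechConversionModel, speaker_id: str, audio: Vector) -> Vector:
    """Select the sub-model for this speaker, activate it, then convert."""
    speech_type = SPEAKER_TO_TYPE[speaker_id]
    model.activate(SUB_MODELS[speech_type])
    return model.convert(audio)
```

Calling `handle_request(SpeechConversionModel(), "speaker-001", [1.0, 2.0])` activates the `type_a` sub-model before converting. The point of the sketch is only the control flow: one shared model, with a small per-type component swapped in per request.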
Opening claim text (preview).
What is claimed is:

1. A computer-implemented method when executed on data processing hardware causes the data processing hardware to perform operations comprising: obtaining a speech conversion model configured to convert input utterances of human speech directly into corresponding output utterances of synthesized speech, the speech conversion model comprising an encoder and a decoder; receiving a speech conversion request comprising input audio data corresponding to an utterance spoken by a target speaker associated with atypical speech and a speaker identifier uniquely identifying the target speaker; activating, using the speaker identifier, a particular sub-model for biasing the speech conversion model to recognize a type of the atypical speech associated with the target speaker identified by the speaker identifier; and converting, using the speech conversion model biased by the activated particular sub-model, the input audio data corresponding to the utterance spoken by the target speaker associated with atypical speech into output audio data corresponding to a synthesized canonical fluent speech representation of the utterance spoken by the target speaker by: generating, as output from the encoder configured to receive the input audio data as input, encoded audio data, the encoded audio data including a series of vectors; and generating, as output from the decoder configured to receive the encoded audio data output from the encoder as input, the output audio data corresponding to the synthesized canonical fluent representation of the utterance spoken by the target speaker.

2. The computer-implemented method of claim 1, wherein the speech conversion model is: trained on generalized training data; and speaker- and domain-independent.

3. The computer-implemented method of claim 1, wherein generating the output audio data corresponding to the synthesized canonical fluent representation of the utterance spoken by the target speaker comprises generating the output audio data corresponding to the synthesized canonical fluent representation of the utterance spoken by the target speaker without performing any intermediate text-to-speech conversion on a textual representation corresponding to a transcription of the utterance.

4. The computer-implemented method of claim 3, wherein the encoder comprises a stack of self-attention blocks each having a multi-headed self attention mechanism.

5. The computer-implemented method of claim 4, wherein the particular sub-model comprises a stack of residual adaptors disposed between each of the self-attention blocks in the stack of self-attention blocks of the encoder.

6. The computer-implemented method of claim 5, wherein each residual adaptor comprises a normalization layer, followed by a feed-forward layer with down-projection to a bottleneck dimension and a non-linear activation, and another feed-forward layer with up-projection.

7. The computer-implemented method of claim 3, wherein the speech conversion model further comprises a wordpiece decoder configured to: receive, as input, the encoded audio data from the encoder; and generate, as output, a textual representation corresponding to a transcription of the utterance.

8. The computer-implemented method of claim 3, wherein the speech conversion model further comprises a phoneme decoder configured to: receive, as input, the encoded audio data from the encoder; and generate, as output, a phoneme representation of the utterance.

9. The computer-implemented method of claim 1, wherein: the input audio data comprises one of an input spectrogram or an input audio waveform; and the output audio data comprises one of an output spectrogram or an output audio waveform.

10. The computer-implemented method of claim 1, wherein activating the particular sub-model for biasing the speech conversion model comprises: selecting, from among a plurality of sub-models each associated with a different type of atypical speech, the particular sub-model associated with the type of atypical speech associated with the target speaker; and loading the particular sub-model into the speech conversion model for biasing the speech conversion model to recognize the type of the atypical speech associated with the target speaker.

11. A system comprising: data processing hardware; and memory hardware in communication with the data processing hardware, the memory hardware storing instructions that when executed on the data processing hardware cause the data processing hardware to perform operations comprising: obtaining a speech conversion model configured to convert input utterances of human speech directly into corresponding output utterances of synthesized speech, the speech conversion model comprising an encoder and a decoder; receiving a speech conversion request comprising input audio data corresponding to an utterance spoken by a target speaker associated with atypical speech and a speaker identifier uniquely identifying the target speaker; activating, using the speaker identifier, a particular sub-model for biasing the speech conversion model to recognize a type of the atypical speech associated with the target speaker identified by the speaker identifier; and converting, using the speech conversion model biased by the activated particular sub-model, the input audio data corresponding to the utterance spoken by the target speaker associated with atypical speech into output audio data corresponding to a synthesized canonical fluent speech representation of the utterance spoken by the target speaker by: generating, as output from the encoder configured to receive the input audio data as input, encoded audio data, the encoded audio data including a series of vectors; and generating, as output from the decoder configured to receive the encoded audio data output from the encoder as input, the output audio data corresponding to the synthesized canonical fluent representation of the utterance spoken by the target speaker.

12. The system of claim 11, wherein the speech conversion model is: trained on generalized training data; and speaker- and domain-independent.

13. The system of claim 11, wherein generating the output audio data corresponding to the synthesized canonical fluent representation of the utterance spoken by the target speaker comprises generating the output audio data corresponding to the synthesized canonical fluent representation of the utterance spoken by the target speaker without performing any intermediate text-to-speech conversion on a textual representation corresponding to a transcription of the utterance.

14. The system of claim 13, wherein the encoder comprises a stack of self-attention blocks each having a multi-headed self attention mechanism.

15. The system of claim 14, wherein the particular sub-model comprises a stack of residual adaptors disposed between each of the self-attention blocks in the stack of self-attention blocks of the encoder.

16. The system of claim 15, wherein each residual adaptor comprises a normalization layer, followed by a feed-forward layer with down-projection to a bottleneck dimension and a non-linear activation, and another feed-forward layer with up-projection.

17. The system of claim 13, wherein the speech conversion model further comprises a wordpiece decoder configured to: receive, as input, the encoded audio data from the encoder; and generate, as output, a textual representation corresponding to a transcription of the utterance.

18. The syste
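The residual adaptor structure recited in claims 6 and 16 (normalization, down-projection to a bottleneck with a non-linear activation, then up-projection) can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the patented implementation: ReLU is an assumed choice of non-linearity, the skip connection is assumed from the word "residual", the normalization has no learned scale or shift, and the projections carry no bias terms.

```python
import math
from typing import List

Vector = List[float]
Matrix = List[List[float]]  # each row holds the weights of one output unit


def layer_norm(x: Vector, eps: float = 1e-5) -> Vector:
    """Normalize a vector to zero mean and (approximately) unit variance."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def linear(w: Matrix, x: Vector) -> Vector:
    """Matrix-vector product; bias omitted for brevity."""
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]


def relu(x: Vector) -> Vector:
    """Assumed non-linearity; the claims only say 'non-linear activation'."""
    return [max(0.0, v) for v in x]


def residual_adaptor(x: Vector, w_down: Matrix, w_up: Matrix) -> Vector:
    """Apply norm -> down-projection + activation -> up-projection, then add the input."""
    h = layer_norm(x)
    h = relu(linear(w_down, h))  # model dimension -> bottleneck dimension
    h = linear(w_up, h)          # bottleneck dimension -> model dimension
    return [xi + hi for xi, hi in zip(x, h)]  # assumed residual (skip) connection
```

With `w_up` set to all zeros the adaptor returns its input unchanged, which is one reason bottleneck adaptors of this shape are convenient to insert between the blocks of an already trained encoder: they can start as an identity mapping and be trained without disturbing the base model.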
Phonemes, fenemes or fenones being the recognition units · CPC title
Training · CPC title
Feature extraction for speech recognition; Selection of recognition unit · CPC title
Speech synthesis; Text to speech systems · CPC title
Changing voice quality, e.g. pitch or formants · CPC title