US2023186035A1 · US · A1
| Field | Value |
|---|---|
| Publication number | US-2023186035-A1 |
| Application number | US-202217889116-A |
| Country | US |
| Kind code | A1 |
| Filing date | Aug 16, 2022 |
| Priority date | Dec 14, 2021 |
| Publication date | Jun 15, 2023 |
| Grant date | — |
Official abstract text for this publication.
In one embodiment, a method includes accessing a first utterance of a content by a first speaker, generating first discrete speech units from the first utterance based on a speech-learning model, wherein each of the first discrete speech units is associated with a speech cluster, accessing second utterances of the content by second speakers different from the first speaker, and training a speech normalizer by processing each of the second utterances using the speech normalizer to generate second discrete speech units and updating the speech normalizer by using the first discrete speech units as an optimization target for the second discrete speech units associated with each of the second utterances.
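The pairing described in the abstract can be sketched as a small data-preparation step: one reference speaker's units become the shared target for every other speaker's utterance of the same content. This is only an illustrative sketch, not the applicant's implementation. The names `build_training_pairs`, `reference_speaker`, and `to_units` are hypothetical, and `to_units` stands in for the speech-learning model plus clustering, which the abstract does not specify in detail.

```python
from collections import defaultdict
from typing import Callable, Iterable, List, Sequence, Tuple

# Hypothetical record layout: (content_id, speaker_id, audio samples).
Utterance = Tuple[str, str, Sequence[float]]


def build_training_pairs(
    utterances: Iterable[Utterance],
    reference_speaker: str,
    to_units: Callable[[Sequence[float]], List[int]],
) -> List[Tuple[Sequence[float], List[int]]]:
    """Pair each non-reference utterance with the reference speaker's units.

    `to_units` is an assumed stand-in for "speech-learning model + clustering".
    Content items without a reference-speaker utterance are skipped.
    """
    by_content = defaultdict(list)
    for content_id, speaker_id, audio in utterances:
        by_content[content_id].append((speaker_id, audio))

    pairs = []
    for items in by_content.values():
        reference_audio = [a for spk, a in items if spk == reference_speaker]
        if not reference_audio:
            continue  # no optimization target available for this content
        target_units = to_units(reference_audio[0])
        for spk, audio in items:
            if spk != reference_speaker:
                # Input: a second speaker's audio. Target: first speaker's units.
                pairs.append((audio, target_units))
    return pairs
```

A normalizer would then be trained so that its output for each input audio approaches the paired target units; the loss used for that update is not stated in the abstract and is left out here.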
Opening claim text (preview).
What is claimed is:
1. A method comprising, by one or more computing systems: accessing a first utterance of a content by a first speaker; generating, based on a speech-learning model, a plurality of first discrete speech units from the first utterance, wherein each of the plurality of first discrete speech units is associated with a speech cluster; accessing one or more second utterances of the content by one or more second speakers different from the first speaker; and training a speech normalizer by: processing each of the one or more second utterances using the speech normalizer to generate a plurality of second discrete speech units; and updating the speech normalizer by using the plurality of first discrete speech units as an optimization target for the plurality of second discrete speech units associated with each of the one or more second utterances.
2. The method of claim 1, wherein generating the plurality of first discrete speech units comprises: generating a plurality of intermediate representations by processing the first utterance with the speech-learning model; and applying one or more clustering algorithms to the plurality of intermediate representations.
3. The method of claim 1, further comprising: reducing one or more repeating first content units from the plurality of first content units.
4. The method of claim 1, wherein the trained speech normalizer comprises one or more of a finetuned speech-learning model or a decoder.
5. The method of claim 1, further comprising: accessing a third utterance by a third speaker; and processing the third utterance using the trained speech normalizer to generate a plurality of normalized speech units.
6. The method of claim 5, further comprising: anonymizing the third speaker based on removing one or more normalized speech units associated with speech characteristics specific to the third speaker from the plurality of normalized speech units.
7. The method of claim 5, further comprising: denoising the third utterance based on removing one or more normalized speech units corresponding to background noises from the plurality of normalized speech units.
8. The method of claim 5, further comprising: removing one or more normalized speech units corresponding to silence longer than a threshold time from the plurality of normalized speech units.
9. The method of claim 1, further comprising: processing a plurality of first training data associated with a target language by the trained speech normalizer to generate a plurality of normalized target speech units; and training a textless speech-to-speech translation model based on the plurality of normalized target speech units and a plurality of second training data associated with a source language.
10. One or more computer-readable non-transitory storage media embodying software that is operable when executed to: access a first utterance of a content by a first speaker; generate, based on a speech-learning model, a plurality of first discrete speech units from the first utterance, wherein each of the plurality of first discrete speech units is associated with a speech cluster; access one or more second utterances of the content by one or more second speakers different from the first speaker; and train a speech normalizer by: processing each of the one or more second utterances using the speech normalizer to generate a plurality of second discrete speech units; and updating the speech normalizer by using the plurality of first discrete speech units as an optimization target for the plurality of second discrete speech units associated with each of the one or more second utterances.
11. The media of claim 10, wherein generating the plurality of first discrete speech units comprises: generating a plurality of intermediate representations by processing the first utterance with the speech-learning model; and applying one or more clustering algorithms to the plurality of intermediate representations.
12. The media of claim 10, wherein the software is further operable when executed to: reduce one or more repeating first content units from the plurality of first content units.
13. The media of claim 10, wherein the trained speech normalizer comprises one or more of a finetuned speech-learning model or a decoder.
14. The media of claim 10, wherein the software is further operable when executed to: access a third utterance by a third speaker; and process the third utterance using the trained speech normalizer to generate a plurality of normalized speech units.
15. The media of claim 14, wherein the software is further operable when executed to: anonymize the third speaker based on removing one or more normalized speech units associated with speech characteristics specific to the third speaker from the plurality of normalized speech units.
16. The media of claim 15, wherein the software is further operable when executed to: denoise the third utterance based on removing one or more normalized speech units corresponding to background noises from the plurality of normalized speech units.
17. The media of claim 15, wherein the software is further operable when executed to: remove one or more normalized speech units corresponding to silence longer than a threshold time from the plurality of normalized speech units.
18. The media of claim 10, wherein the software is further operable when executed to: process a plurality of first training data associated with a target language by the trained speech normalizer to generate a plurality of normalized target speech units; and train a textless speech-to-speech translation model based on the plurality of normalized target speech units and a plurality of second training data associated with a source language.
19. A system comprising: one or more processors; and a non-transitory memory coupled to the processors comprising instructions executable by the processors, the processors operable when executing the instructions to: access a first utterance of a content by a first speaker; generate, based on a speech-learning model, a plurality of first discrete speech units from the first utterance, wherein each of the plurality of first discrete speech units is associated with a speech cluster; access one or more second utterances of the content by one or more second speakers different from the first speaker; and train a speech normalizer by: processing each of the one or more second utterances using the speech normalizer to generate a plurality of second discrete speech units; and updating the speech normalizer by using the plurality of first discrete speech units as an optimization target for the plurality of second discrete speech units associated with each of the one or more second utterances.
20. The system of claim 19, wherein the processors are further operable when executing the instructions to: process a plurality of first training data associated with a target language by the trained speech normalizer to generate a plurality of normalized target speech units; and train a textless speech-to-speech translation model based on the plurality of normalized target speech units and a plurality of second training data associated with a source language.
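Several dependent claims describe simple operations on unit sequences: assigning intermediate representations to clusters (claims 2 and 11), reducing repeated units (claims 3 and 12), and removing long silences (claims 8 and 17). A minimal sketch of these steps is given below for illustration. It assumes nearest-centroid assignment as the clustering step, assumes that a single unit id marks silence, and measures the threshold in frames rather than time; none of these choices is stated in the claims, and the function names are hypothetical.

```python
from itertools import groupby
from typing import List, Sequence


def quantize(frames: Sequence[Sequence[float]],
             centroids: Sequence[Sequence[float]]) -> List[int]:
    """Map each frame to the index of its nearest centroid (squared Euclidean).

    The centroids are assumed to come from a clustering step run beforehand.
    """
    def sqdist(a: Sequence[float], b: Sequence[float]) -> float:
        return sum((x - y) ** 2 for x, y in zip(a, b))

    return [
        min(range(len(centroids)), key=lambda k: sqdist(frame, centroids[k]))
        for frame in frames
    ]


def reduce_repeats(units: Sequence[int]) -> List[int]:
    """Collapse each run of identical consecutive units to one unit."""
    return [unit for unit, _ in groupby(units)]


def drop_long_silence(units: Sequence[int], silence_unit: int,
                      max_run: int) -> List[int]:
    """Drop runs of `silence_unit` longer than `max_run` frames.

    Shorter silence runs and all other units are kept unchanged. Dropping the
    whole run is one possible reading; truncating it would be another.
    """
    kept: List[int] = []
    for unit, run in groupby(units):
        length = len(list(run))
        if unit == silence_unit and length > max_run:
            continue
        kept.extend([unit] * length)
    return kept
```

For example, with centroids `[[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]]`, frames close to the first centroid map to unit 0, and a sequence such as `[0, 0, 1, 2, 2]` reduces to `[0, 1, 2]`.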
Creating reference templates; Clustering · CPC title
Training · CPC title
Noise filtering · CPC title
Data-driven translation · CPC title
Concept to speech synthesisers; Generation of natural phrases from machine-based concepts (generation of parameters for speech synthesis out of text G10L13/08) · CPC title