Textless Speech-to-Speech Translation on Real Data

US2023186035A1 · US · A1

Patent metadata
Field: Value
Publication number: US-2023186035-A1
Application number: US-202217889116-A
Country: US
Kind code: A1
Filing date: Aug 16, 2022
Priority date: Dec 14, 2021
Publication date: Jun 15, 2023
Grant date: (none listed)

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

In one embodiment, a method includes accessing a first utterance of a content by a first speaker, generating first discrete speech units from the first utterance based on a speech-learning model, wherein each of the first discrete speech units is associated with a speech cluster, accessing second utterances of the content by second speakers different from the first speaker, and training a speech normalizer by processing each of the second utterances using the speech normalizer to generate second discrete speech units and updating the speech normalizer by using the first discrete speech units as an optimization target for the second discrete speech units associated with each of the second utterances.
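The unit-generation and target-pairing steps described in the abstract can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration, not taken from the patent: the 2-D toy feature frames, the three hand-picked centroids, the nearest-centroid assignment standing in for the clustering step, and the file names. The repeat-collapsing helper reflects one reading of the "reducing repeating units" step in the dependent claims.

```python
from math import dist

def to_units(frames, centroids):
    """Map each feature frame to the index of its nearest centroid (its speech cluster)."""
    return [
        min(range(len(centroids)), key=lambda k: dist(frame, centroids[k]))
        for frame in frames
    ]

def collapse_repeats(units):
    """Drop consecutive duplicate units, keeping one unit per run."""
    collapsed = []
    for unit in units:
        if not collapsed or collapsed[-1] != unit:
            collapsed.append(unit)
    return collapsed

# Hypothetical cluster centroids and toy features for the first (reference) speaker.
centroids = [(0.0, 0.0), (1.0, 1.0), (2.0, 0.0)]
reference_frames = [(0.1, 0.0), (0.0, 0.2), (0.9, 1.1), (2.1, 0.1), (1.9, -0.1)]

frame_units = to_units(reference_frames, centroids)
reference_units = collapse_repeats(frame_units)

# Each second-speaker utterance of the same content is paired with the
# reference speaker's units, which act as the optimization target.
# The file names are placeholders.
training_pairs = [(utt, reference_units) for utt in ("speaker_b.wav", "speaker_c.wav")]
```

In this sketch the normalizer itself is left out; the point is only that every utterance of the same content, regardless of speaker, shares one target unit sequence.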

First claim

Opening claim text (preview).

What is claimed is:

1. A method comprising, by one or more computing systems: accessing a first utterance of a content by a first speaker; generating, based on a speech-learning model, a plurality of first discrete speech units from the first utterance, wherein each of the plurality of first discrete speech units is associated with a speech cluster; accessing one or more second utterances of the content by one or more second speakers different from the first speaker; and training a speech normalizer by: processing each of the one or more second utterances using the speech normalizer to generate a plurality of second discrete speech units; and updating the speech normalizer by using the plurality of first discrete speech units as an optimization target for the plurality of second discrete speech units associated with each of the one or more second utterances.

2. The method of claim 1, wherein generating the plurality of first discrete speech units comprises: generating a plurality of intermediate representations by processing the first utterance with the speech-learning model; and applying one or more clustering algorithms to the plurality of intermediate representations.

3. The method of claim 1, further comprising: reducing one or more repeating first content units from the plurality of first content units.

4. The method of claim 1, wherein the trained speech normalizer comprises one or more of a finetuned speech-learning model or a decoder.

5. The method of claim 1, further comprising: accessing a third utterance by a third speaker; and processing the third utterance using the trained speech normalizer to generate a plurality of normalized speech units.

6. The method of claim 5, further comprising: anonymizing the third speaker based on removing one or more normalized speech units associated with speech characteristics specific to the third speaker from the plurality of normalized speech units.

7. The method of claim 5, further comprising: denoising the third utterance based on removing one or more normalized speech units corresponding to background noises from the plurality of normalized speech units.

8. The method of claim 5, further comprising: removing one or more normalized speech units corresponding to silence longer than a threshold time from the plurality of normalized speech units.

9. The method of claim 1, further comprising: processing a plurality of first training data associated with a target language by the trained speech normalizer to generate a plurality of normalized target speech units; and training a textless speech-to-speech translation model based on the plurality of normalized target speech units and a plurality of second training data associated with a source language.

10. One or more computer-readable non-transitory storage media embodying software that is operable when executed to: access a first utterance of a content by a first speaker; generate, based on a speech-learning model, a plurality of first discrete speech units from the first utterance, wherein each of the plurality of first discrete speech units is associated with a speech cluster; access one or more second utterances of the content by one or more second speakers different from the first speaker; and train a speech normalizer by: processing each of the one or more second utterances using the speech normalizer to generate a plurality of second discrete speech units; and updating the speech normalizer by using the plurality of first discrete speech units as an optimization target for the plurality of second discrete speech units associated with each of the one or more second utterances.

11. The media of claim 10, wherein generating the plurality of first discrete speech units comprises: generating a plurality of intermediate representations by processing the first utterance with the speech-learning model; and applying one or more clustering algorithms to the plurality of intermediate representations.

12. The media of claim 10, wherein the software is further operable when executed to: reduce one or more repeating first content units from the plurality of first content units.

13. The media of claim 10, wherein the trained speech normalizer comprises one or more of a finetuned speech-learning model or a decoder.

14. The media of claim 10, wherein the software is further operable when executed to: access a third utterance by a third speaker; and process the third utterance using the trained speech normalizer to generate a plurality of normalized speech units.

15. The media of claim 14, wherein the software is further operable when executed to: anonymize the third speaker based on removing one or more normalized speech units associated with speech characteristics specific to the third speaker from the plurality of normalized speech units.

16. The media of claim 15, wherein the software is further operable when executed to: denoise the third utterance based on removing one or more normalized speech units corresponding to background noises from the plurality of normalized speech units.

17. The media of claim 15, wherein the software is further operable when executed to: remove one or more normalized speech units corresponding to silence longer than a threshold time from the plurality of normalized speech units.

18. The media of claim 10, wherein the software is further operable when executed to: process a plurality of first training data associated with a target language by the trained speech normalizer to generate a plurality of normalized target speech units; and train a textless speech-to-speech translation model based on the plurality of normalized target speech units and a plurality of second training data associated with a source language.

19. A system comprising: one or more processors; and a non-transitory memory coupled to the processors comprising instructions executable by the processors, the processors operable when executing the instructions to: access a first utterance of a content by a first speaker; generate, based on a speech-learning model, a plurality of first discrete speech units from the first utterance, wherein each of the plurality of first discrete speech units is associated with a speech cluster; access one or more second utterances of the content by one or more second speakers different from the first speaker; and train a speech normalizer by: processing each of the one or more second utterances using the speech normalizer to generate a plurality of second discrete speech units; and updating the speech normalizer by using the plurality of first discrete speech units as an optimization target for the plurality of second discrete speech units associated with each of the one or more second utterances.

20. The system of claim 19, wherein the processors are further operable when executing the instructions to: process a plurality of first training data associated with a target language by the trained speech normalizer to generate a plurality of normalized target speech units; and train a textless speech-to-speech translation model based on the plurality of normalized target speech units and a plurality of second training data associated with a source language.
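The silence-removal step in claims 8 and 17 can be sketched as a filter over a frame-level unit sequence. This is one possible reading, not the patent's stated implementation: it drops an entire run of silence units when the run exceeds the threshold. The silence unit id, the 20 ms frame duration, and the 100 ms threshold are all hypothetical values chosen for the example.

```python
from itertools import groupby

SILENCE_UNIT = 0   # hypothetical id of the unit that represents silence
FRAME_MS = 20      # assumed duration covered by one unit, in milliseconds

def drop_long_silence(units, max_silence_ms=100):
    """Remove runs of the silence unit that last longer than max_silence_ms."""
    kept = []
    for unit, run in groupby(units):
        run = list(run)
        if unit == SILENCE_UNIT and len(run) * FRAME_MS > max_silence_ms:
            continue  # the whole over-long silence run is removed
        kept.extend(run)
    return kept

# Toy normalized unit sequence: a 40 ms pause, then a 120 ms pause.
normalized_units = [5, 5, 0, 0, 7, 0, 0, 0, 0, 0, 0, 9]
cleaned_units = drop_long_silence(normalized_units)
```

With these assumed values the short pause survives and the long one is dropped. A variant that trims long pauses down to the threshold instead of deleting them would fit the claim language equally well.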

Assignees

Meta Platforms Inc

Inventors

Classifications

  • Creating reference templates; Clustering · CPC title

  • Training · CPC title

  • Noise filtering · CPC title

  • G06F40/42 (Primary)

    Data-driven translation · CPC title

  • Concept to speech synthesisers; Generation of natural phrases from machine-based concepts (generation of parameters for speech synthesis out of text G10L13/08) · CPC title

Patent family

Related publications grouped by family.

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US2023186035A1 cover?
In one embodiment, a method includes accessing a first utterance of a content by a first speaker, generating first discrete speech units from the first utterance based on a speech-learning model, wherein each of the first discrete speech units is associated with a speech cluster, accessing second utterances of the content by second speakers different from the first speaker, and training a speec…
Who is the assignee on this patent?
Meta Platforms Inc
What technology area does this patent fall under?
Primary CPC classification G06F40/42. Mapped technology areas include Physics.
When was this patent published?
Publication date: Jun 15, 2023 (A1). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 3 related publications on this page (citations in our corpus or others sharing the same primary CPC).