Synthesized speech generation

US11676571B2 · US · B2

Patent metadata
Field	Value
Publication number	US-11676571-B2
Application number	US-202117154372-A
Country	US
Kind code	B2
Filing date	Jan 21, 2021
Priority date	Jan 21, 2021
Publication date	Jun 13, 2023
Grant date	Jun 13, 2023

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

Title
What the patent document calls the invention.
Abstract
A short plain-language summary of the technical disclosure.
Assignees and inventors
Who owns or filed the patent and who is credited as inventor.
Key dates
Filing, priority, publication, and grant dates set the timeline.
First independent claim
The legal scope of protection — read this for what is actually claimed.
CPC / IPC classifications
Technology tags used to group this patent with similar filings.
Citations and related patents
Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

A device for speech generation includes one or more processors configured to receive one or more control parameters indicating target speech characteristics. The one or more processors are also configured to process, using a multi-encoder, an input representation of speech based on the one or more control parameters to generate encoded data corresponding to an audio signal that represents a version of the speech based on the target speech characteristics.
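
As a rough illustration of the data flow the abstract describes, the Python sketch below passes one input representation through an encoder that ignores the control parameters and through one conditioned encoder per control parameter. The split into an unconditioned first encoder and conditioned second encoders follows the wording of claim 4. Everything else is an assumption made for illustration: the names `multi_encode`, `first_encoder`, `second_encoder`, and `style_offset`, the dictionary of control parameters, the one-encoder-per-parameter layout, and the toy arithmetic standing in for trained networks. It is not the patented implementation.

```python
from typing import Dict, List

Frame = List[float]  # one feature frame, e.g. one column of a mel-scale spectrogram


def first_encoder(frames: List[Frame]) -> List[Frame]:
    """Toy stand-in for an encoder that ignores the control parameters."""
    return [[2.0 * x for x in frame] for frame in frames]


def style_offset(value: str) -> float:
    """Hypothetical 'style embedding': one deterministic number per label."""
    return (sum(ord(c) for c in value) % 100) / 100.0


def second_encoder(frames: List[Frame], value: str) -> List[Frame]:
    """Toy stand-in for an encoder conditioned on one control parameter."""
    offset = style_offset(value)
    return [[x + offset for x in frame] for frame in frames]


def multi_encode(frames: List[Frame], control: Dict[str, str]) -> dict:
    """Return first encoded data plus one second encoding per control parameter.

    Assumption: one conditioned encoder per control parameter. The source
    only says "one or more second encoders".
    """
    return {
        "first": first_encoder(frames),
        "second": {
            name: second_encoder(frames, value)
            for name, value in control.items()
        },
    }
```

Calling `multi_encode(frames, {"emotion": "happy"})` returns a dictionary whose `first` entry does not depend on the control parameters and whose `second` entry holds one encoding keyed by `emotion`. Changing the emotion label changes only the `second` entry.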

First claim

Opening claim text (preview).

What is claimed is:
1. A device for speech generation comprising: one or more processors configured to: receive an input speech signal; receive one or more control parameters indicating target speech characteristics; perform audio feature extraction on the input speech signal to generate mel-scale spectrograms, fundamental frequency (F0) features, or both, of the input speech signal; and process, using a multi-encoder, an input representation of speech based on the one or more control parameters to generate encoded data corresponding to an audio signal that represents a version of the speech based on the target speech characteristics, wherein the input representation of speech includes the mel-scale spectrograms, fundamental frequency (F0) features, or both, of the input speech signal.
2. The device of claim 1, wherein the one or more control parameters indicate a target person whose speech characteristics are to be used, a target emotion, a target rate of speech, or a combination thereof.
3. The device of claim 1, wherein the one or more processors are further configured to generate merged style data based on the input representation and the one or more control parameters, and wherein the merged style data is used by the multi-encoder during processing of the input representation.
4. The device of claim 1, wherein the multi-encoder includes: a first encoder configured to encode the input representation independently of the one or more control parameters to generate first encoded data; and one or more second encoders configured to encode the input representation based on the one or more control parameters to generate second encoded data, wherein the encoded data includes the first encoded data and the second encoded data.
5. The device of claim 4, wherein the one or more processors are further configured to: process, at a speech characteristic encoder, the input representation based on at least one of the one or more control parameters to generate an encoded input speech representation; generate, at an encoder pre-network, merged style data based at least in part on the encoded input speech representation; provide the input representation to the first encoder to generate the first encoded data; and provide the merged style data to the one or more second encoders to generate the second encoded data.
6. The device of claim 4, further comprising a multi-encoder transformer including the multi-encoder and a decoder, wherein the first encoder includes a first attention network, wherein each of the one or more second encoders includes a second attention network, and wherein the decoder includes a decoder attention network that is distinct from the first attention network and the second attention network of each of the one or more second encoders.
7. The device of claim 6, wherein: the first encoder comprises: a first layer including the first attention network, wherein the first attention network corresponds to a first multi-head attention network; and a second layer including a first neural network, and each of the one or more second encoders comprises: a first layer including the second attention network, wherein the second attention network corresponds to a second multi-head attention network; and a second layer including a second neural network.
8. The device of claim 4, further comprising: a decoder coupled to the multi-encoder, the decoder including a decoder network that is configured to generate output spectral data based on the first encoded data and the second encoded data; and a speech synthesizer configured to generate, based on the output spectral data, the audio signal that represents the version of the speech based on the target speech characteristics.
9. The device of claim 8, wherein the decoder network includes a decoder attention network comprising: a first multi-head attention network configured to process the first encoded data; one or more second multi-head attention networks configured to process the second encoded data; and a combiner configured to combine outputs of the first multi-head attention network and the one or more second multi-head attention networks.
10. The device of claim 9, wherein the decoder further comprises: a masked multi-head attention network coupled to an input of the decoder attention network; and a decoder neural network coupled to an output of the decoder attention network.
11. The device of claim 1, wherein the one or processors are further configured to: generate one or more estimated control parameters from the audio signal; and based on a comparison of the one or more control parameters and the one or more estimated control parameters, train one or more neural network weights of the multi-encoder, one or more speech characteristic encoders, an encoder pre-network, a decoder network, or a combination thereof.
12. The device of claim 1, further comprising a microphone, wherein the one or more processors are configured to receive the input speech signal via the microphone.
13. The device of claim 1, wherein the one or more processors are further configured to receive the input speech signal from a speech repository.
14. The device of claim 1, wherein the one or more processors are configured to receive an input signal that includes the input speech signal and a video signal.
15. A method of speech generation comprising: receiving an input speech signal at a device; receiving, at the device, one or more control parameters indicating target speech characteristics; performing, at the device, audio feature extraction on the input speech signal to generate mel-scale spectrograms, fundamental frequency (F0) features, or both, of the input speech signal; and processing, using a multi-encoder, an input representation of speech based on the one or more control parameters to generate encoded data corresponding to an audio signal that represents a version of the speech based on the target speech characteristics, wherein the input representation of speech includes the mel-scale spectrograms, fundamental frequency (F0) features, or both, of the input speech signal.
16. The method of claim 15, wherein the one or more control parameters indicate a target person whose speech characteristics are to be used, a target emotion, a target rate of speech, or a combination thereof.
17. The method of claim 15, further comprising generating, at the device, merged style data based on the input representation and the one or more control parameters, wherein the merged style data is used by the multi-encoder during processing of the input representation.
18. The method of claim 15, further comprising: encoding, at a first encoder of the multi-encoder, the input representation independently of the one or more control parameters to generate first encoded data; and encoding, at one or more second encoders of the multi-encoder, the input representation based on the one or more control parameters to generate second encoded data, wherein the encoded data includes the first encoded data and the second encoded data.
19. The method of claim 18, further comprising: processing, at a speech characteristic encoder, the input representation based on at least one of the one or more control parameters to generate an encoded input speech representation; generating, at an encoder pre-network, merged style data based at least in part on the encoded input speech representation; provide the input representation to the first encoder to generate the first encoded data; an
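
Claim 9 describes a decoder attention network in which separate attention networks process the first and second encoded data and a combiner merges their outputs. The Python sketch below shows one way such a structure could look. It is a simplification under stated assumptions, not the claimed design: each attention network is reduced to single-head scaled dot-product attention (the claim recites multi-head attention), the encoded frames serve as both keys and values with no learned projections, and the combiner is assumed to be an element-wise mean, which the claim text does not specify. All function names are hypothetical.

```python
import math
from typing import List

Vector = List[float]


def softmax(scores: Vector) -> Vector:
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query: Vector, memory: List[Vector]) -> Vector:
    """Single-head scaled dot-product attention.

    Simplification: the rows of `memory` act as both keys and values.
    """
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in memory]
    weights = softmax(scores)
    width = len(memory[0])
    return [sum(w * row[i] for w, row in zip(weights, memory)) for i in range(width)]


def combine(contexts: List[Vector]) -> Vector:
    """Assumed combiner: element-wise mean of the attention outputs."""
    return [sum(values) / len(contexts) for values in zip(*contexts)]


def decoder_attention(
    query: Vector,
    first_encoded: List[Vector],
    second_encoded: List[List[Vector]],
) -> Vector:
    """Attend to each encoder output separately, then combine the results."""
    contexts = [attend(query, first_encoded)]
    contexts += [attend(query, encoded) for encoded in second_encoded]
    return combine(contexts)
```

Here `second_encoded` holds one list of encoded frames per second encoder, so the combiner always receives one context vector from the first encoder and one from each second encoder. Other combiners, such as concatenation followed by a learned projection, would fit the same structure.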

Assignees

Qualcomm Inc

Inventors

Classifications

G10L19/02
using spectral analysis, e.g. transform vocoders or subband vocoders · CPC title
G10L2021/0135
Voice conversion or morphing · CPC title
G10L25/63
for estimating an emotional state · CPC title
G10L13/033 · Primary
Voice editing, e.g. manipulating the voice of the synthesiser · CPC title
G06N3/045
Combinations of networks · CPC title

Patent family

Related publications grouped by family.

View patent family 79092855

External sources

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US11676571B2 cover?: A device for speech generation includes one or more processors configured to receive one or more control parameters indicating target speech characteristics. The one or more processors are also configured to process, using a multi-encoder, an input representation of speech based on the one or more control parameters to generate encoded data corresponding to an audio signal that represents a ver…
Who is the assignee on this patent?: Qualcomm Inc
What technology area does this patent fall under?: Primary CPC classification G10L13/033. Mapped technology areas include Physics.
When was this patent published?: Publication date Jun 13, 2023 (B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?: We list 5 related publications on this page (citations in our corpus or others sharing the same primary CPC).

Related patents

Synthesized speech generation

Synthesized Data Augmentation Using Voice Conversion and Speech Recognition Models

Text-to-speech (tts) processing

Text-to-speech (TTS) processing

Machine translation using neural network models
