Synthesized speech generation
US-2022230623-A1 · Jul 21, 2022 · US
US11676571B2 · US · B2
| Field | Value |
|---|---|
| Publication number | US-11676571-B2 |
| Application number | US-202117154372-A |
| Country | US |
| Kind code | B2 |
| Filing date | Jan 21, 2021 |
| Priority date | Jan 21, 2021 |
| Publication date | Jun 13, 2023 |
| Grant date | Jun 13, 2023 |
Official abstract text for this publication.
A device for speech generation includes one or more processors configured to receive one or more control parameters indicating target speech characteristics. The one or more processors are also configured to process, using a multi-encoder, an input representation of speech based on the one or more control parameters to generate encoded data corresponding to an audio signal that represents a version of the speech based on the target speech characteristics.
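To make the abstract's data flow concrete, here is a minimal sketch of a multi-encoder driven by control parameters, following the split in claim 4 (one encoder that ignores the control parameters, and one or more that use them). Every name in it (`ControlParameters`, `EMOTION_OFFSETS`, `first_encoder`, `emotion_encoder`, `rate_encoder`, `multi_encoder`) is hypothetical, and the arithmetic inside each encoder is a toy stand-in rather than the neural networks the patent describes. It only illustrates which inputs reach which encoder.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional

Frame = List[float]  # one feature frame, e.g. one column of a mel-scale spectrogram


@dataclass
class ControlParameters:
    """Target speech characteristics (fields mirror claim 2; names are assumed)."""
    target_person: Optional[str] = None
    target_emotion: Optional[str] = None
    target_rate: Optional[float] = None


# Hypothetical lookup standing in for a learned emotion embedding.
EMOTION_OFFSETS: Dict[str, float] = {"neutral": 0.0, "happy": 0.5, "sad": -0.5}


def first_encoder(frames: List[Frame]) -> List[Frame]:
    """Encode the input independently of the control parameters (toy: mean-centre each frame)."""
    encoded = []
    for frame in frames:
        mean = sum(frame) / len(frame)
        encoded.append([x - mean for x in frame])
    return encoded


def emotion_encoder(frames: List[Frame], params: ControlParameters) -> List[Frame]:
    """A second encoder conditioned on the target emotion (toy: add an offset)."""
    offset = EMOTION_OFFSETS.get(params.target_emotion or "neutral", 0.0)
    return [[x + offset for x in frame] for frame in frames]


def rate_encoder(frames: List[Frame], params: ControlParameters) -> List[Frame]:
    """A second encoder conditioned on the target rate of speech (toy: scale values)."""
    rate = params.target_rate or 1.0
    return [[x * rate for x in frame] for frame in frames]


def multi_encoder(frames: List[Frame], params: ControlParameters) -> Dict[str, object]:
    """Return first encoded data plus one second-encoded stream per conditioned encoder."""
    return {
        "first": first_encoder(frames),
        "second": [emotion_encoder(frames, params), rate_encoder(frames, params)],
    }


if __name__ == "__main__":
    params = ControlParameters(target_emotion="happy", target_rate=2.0)
    print(multi_encoder([[1.0, 3.0]], params))
```

Changing `params` alters only the `"second"` streams; the `"first"` stream depends on the input frames alone, which is the property claim 4 attributes to the first encoder.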
Opening claim text (preview).
What is claimed is:
1. A device for speech generation comprising: one or more processors configured to: receive an input speech signal; receive one or more control parameters indicating target speech characteristics; perform audio feature extraction on the input speech signal to generate mel-scale spectrograms, fundamental frequency (F0) features, or both, of the input speech signal; and process, using a multi-encoder, an input representation of speech based on the one or more control parameters to generate encoded data corresponding to an audio signal that represents a version of the speech based on the target speech characteristics, wherein the input representation of speech includes the mel-scale spectrograms, fundamental frequency (F0) features, or both, of the input speech signal.
2. The device of claim 1, wherein the one or more control parameters indicate a target person whose speech characteristics are to be used, a target emotion, a target rate of speech, or a combination thereof.
3. The device of claim 1, wherein the one or more processors are further configured to generate merged style data based on the input representation and the one or more control parameters, and wherein the merged style data is used by the multi-encoder during processing of the input representation.
4. The device of claim 1, wherein the multi-encoder includes: a first encoder configured to encode the input representation independently of the one or more control parameters to generate first encoded data; and one or more second encoders configured to encode the input representation based on the one or more control parameters to generate second encoded data, wherein the encoded data includes the first encoded data and the second encoded data.
5. The device of claim 4, wherein the one or more processors are further configured to: process, at a speech characteristic encoder, the input representation based on at least one of the one or more control parameters to generate an encoded input speech representation; generate, at an encoder pre-network, merged style data based at least in part on the encoded input speech representation; provide the input representation to the first encoder to generate the first encoded data; and provide the merged style data to the one or more second encoders to generate the second encoded data.
6. The device of claim 4, further comprising a multi-encoder transformer including the multi-encoder and a decoder, wherein the first encoder includes a first attention network, wherein each of the one or more second encoders includes a second attention network, and wherein the decoder includes a decoder attention network that is distinct from the first attention network and the second attention network of each of the one or more second encoders.
7. The device of claim 6, wherein: the first encoder comprises: a first layer including the first attention network, wherein the first attention network corresponds to a first multi-head attention network; and a second layer including a first neural network, and each of the one or more second encoders comprises: a first layer including the second attention network, wherein the second attention network corresponds to a second multi-head attention network; and a second layer including a second neural network.
8. The device of claim 4, further comprising: a decoder coupled to the multi-encoder, the decoder including a decoder network that is configured to generate output spectral data based on the first encoded data and the second encoded data; and a speech synthesizer configured to generate, based on the output spectral data, the audio signal that represents the version of the speech based on the target speech characteristics.
9. The device of claim 8, wherein the decoder network includes a decoder attention network comprising: a first multi-head attention network configured to process the first encoded data; one or more second multi-head attention networks configured to process the second encoded data; and a combiner configured to combine outputs of the first multi-head attention network and the one or more second multi-head attention networks.
10. The device of claim 9, wherein the decoder further comprises: a masked multi-head attention network coupled to an input of the decoder attention network; and a decoder neural network coupled to an output of the decoder attention network.
11. The device of claim 1, wherein the one or processors are further configured to: generate one or more estimated control parameters from the audio signal; and based on a comparison of the one or more control parameters and the one or more estimated control parameters, train one or more neural network weights of the multi-encoder, one or more speech characteristic encoders, an encoder pre-network, a decoder network, or a combination thereof.
12. The device of claim 1, further comprising a microphone, wherein the one or more processors are configured to receive the input speech signal via the microphone.
13. The device of claim 1, wherein the one or more processors are further configured to receive the input speech signal from a speech repository.
14. The device of claim 1, wherein the one or more processors are configured to receive an input signal that includes the input speech signal and a video signal.
15. A method of speech generation comprising: receiving an input speech signal at a device; receiving, at the device, one or more control parameters indicating target speech characteristics; performing, at the device, audio feature extraction on the input speech signal to generate mel-scale spectrograms, fundamental frequency (F0) features, or both, of the input speech signal; and processing, using a multi-encoder, an input representation of speech based on the one or more control parameters to generate encoded data corresponding to an audio signal that represents a version of the speech based on the target speech characteristics, wherein the input representation of speech includes the mel-scale spectrograms, fundamental frequency (F0) features, or both, of the input speech signal.
16. The method of claim 15, wherein the one or more control parameters indicate a target person whose speech characteristics are to be used, a target emotion, a target rate of speech, or a combination thereof.
17. The method of claim 15, further comprising generating, at the device, merged style data based on the input representation and the one or more control parameters, wherein the merged style data is used by the multi-encoder during processing of the input representation.
18. The method of claim 15, further comprising: encoding, at a first encoder of the multi-encoder, the input representation independently of the one or more control parameters to generate first encoded data; and encoding, at one or more second encoders of the multi-encoder, the input representation based on the one or more control parameters to generate second encoded data, wherein the encoded data includes the first encoded data and the second encoded data.
19. The method of claim 18, further comprising: processing, at a speech characteristic encoder, the input representation based on at least one of the one or more control parameters to generate an encoded input speech representation; generating, at an encoder pre-network, merged style data based at least in part on the encoded input speech representation; provide the input representation to the first encoder to generate the first encoded data; an
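
The decoder attention network of claim 9 (one attention network per encoded stream, followed by a combiner) can be sketched as below. This is an illustration under stated assumptions, not the patented implementation: each multi-head attention network is reduced to a single head of scaled dot-product attention with the encoded data serving as both keys and values, and the combiner is assumed to be an element-wise mean, since the claim does not say how outputs are combined. The names `softmax`, `attend`, `combine`, and `decoder_attention` are hypothetical.

```python
import math
from typing import List

Vector = List[float]
Matrix = List[Vector]  # rows are time steps, columns are feature dimensions


def softmax(scores: Vector) -> Vector:
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(queries: Matrix, memory: Matrix) -> Matrix:
    """Single-head scaled dot-product attention with keys = values = memory."""
    dim = len(memory[0])
    scale = math.sqrt(dim)
    output = []
    for query in queries:
        scores = [sum(q * k for q, k in zip(query, key)) / scale for key in memory]
        weights = softmax(scores)
        output.append(
            [sum(w * value[d] for w, value in zip(weights, memory)) for d in range(dim)]
        )
    return output


def combine(outputs: List[Matrix]) -> Matrix:
    """Assumed combiner: element-wise mean of the per-stream attention outputs."""
    count = len(outputs)
    return [[sum(values) / count for values in zip(*rows)] for rows in zip(*outputs)]


def decoder_attention(
    queries: Matrix, first_encoded: Matrix, second_encoded: List[Matrix]
) -> Matrix:
    """Attend to the first encoded data and to each second encoded stream, then combine."""
    outputs = [attend(queries, first_encoded)]
    outputs += [attend(queries, stream) for stream in second_encoded]
    return combine(outputs)


if __name__ == "__main__":
    queries = [[1.0, 0.0]]
    print(decoder_attention(queries, [[2.0, 4.0]], [[[4.0, 8.0]]]))
```

Keeping a separate attention call per stream lets the decoder weight the control-independent encoding and each control-conditioned encoding on its own terms before they are merged. A concatenation followed by a learned projection would be another plausible combiner.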
using spectral analysis, e.g. transform vocoders or subband vocoders · CPC title
Voice conversion or morphing · CPC title
for estimating an emotional state · CPC title
Voice editing, e.g. manipulating the voice of the synthesiser · CPC title
Combinations of networks · CPC title