Voice-transformation based data augmentation for prosodic classification
US-2019272818-A1 · Sep 5, 2019 · US
US11373633B2 · US · B2
| Field | Value |
|---|---|
| Publication number | US-11373633-B2 |
| Application number | US-201916586007-A |
| Country | US |
| Kind code | B2 |
| Filing date | Sep 27, 2019 |
| Priority date | Sep 27, 2019 |
| Publication date | Jun 28, 2022 |
| Grant date | Jun 28, 2022 |
Official abstract text for this publication.
During text-to-speech processing, a speech model creates synthesized speech that corresponds to input data. The speech model may include an encoder for encoding the input data into a context vector and a decoder for decoding the context vector into spectrogram data. The speech model may further include a voice decoder that receives vocal characteristic data representing a desired vocal characteristic of synthesized speech. The voice decoder may process the vocal characteristic data to determine configuration data, such as weights, for use by the speech decoder.
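The data flow in the abstract can be illustrated with a toy example. This is a minimal sketch under stated assumptions, not the patented implementation: the dimensions, the random stand-in parameters, the character-based encoder, and the function names `encode`, `voice_decoder`, and `speech_decoder` are all hypothetical. It shows only the idea that a voice decoder turns vocal characteristic data into weights, which the speech decoder then applies to the context vector.

```python
import math
import random

# Hypothetical sizes chosen only for illustration.
CONTEXT_DIM, VOICE_DIM, MEL_BINS = 4, 3, 5
_rng = random.Random(0)  # fixed seed so the toy parameters are reproducible


def _random_matrix(rows, cols):
    return [[_rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


def _matvec(matrix, vector):
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


# Stand-in for trained voice-decoder parameters (assumption: a linear map).
VOICE_DECODER_PARAMS = _random_matrix(MEL_BINS * CONTEXT_DIM, VOICE_DIM)


def encode(text):
    """Toy encoder: map input text to a fixed-size context vector."""
    if not text:
        raise ValueError("input text must not be empty")
    sums = [0.0] * CONTEXT_DIM
    for ch in text:
        for k in range(CONTEXT_DIM):
            sums[k] += math.sin(ord(ch) * (k + 1))
    return [math.tanh(s / len(text)) for s in sums]


def voice_decoder(vocal_characteristics):
    """Map vocal characteristic data to a weight matrix for the speech decoder."""
    flat = _matvec(VOICE_DECODER_PARAMS, vocal_characteristics)
    return [flat[r * CONTEXT_DIM:(r + 1) * CONTEXT_DIM] for r in range(MEL_BINS)]


def speech_decoder(context, weights):
    """Decode the context vector into one spectrogram-like frame."""
    return _matvec(weights, context)


context = encode("hello world")
calm = speech_decoder(context, voice_decoder([1.0, 0.0, 0.2]))
excited = speech_decoder(context, voice_decoder([0.0, 1.0, 0.9]))
```

The same context vector yields different frames for the two voice vectors because only the generated weights change. A real system would use trained neural networks and produce a sequence of frames rather than a single one.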
Opening claim text (preview).
What is claimed is:
1. A computer-implemented method for generating speech from text, the method comprising: receiving a request to generate output speech data corresponding to input text data, the request including a description of a speaking style; determining, using a natural-language understanding (NLU) component, that the input text data represents an intent to perform synthesis of speech using a description of a speaking style indicated in the input text data; determining the description corresponds to a vocal characteristic; determining vocal characteristic data representing at least the vocal characteristic; processing, using a voice decoder, the vocal characteristic data to determine a neural-network model weight; processing, using a first encoder, the input text data to determine encoded linguistic data; receiving second user input corresponding to a speech synthesis task; processing, using a second encoder, the second user input to determine encoded data, wherein determining the encoded data comprises processing, using the second encoder, the encoded linguistic data and the vocal characteristic data; processing, using a speech decoder and the neural-network model weight, the encoded data to determine synthesized speech data; and causing output of audio corresponding to the synthesized speech data.
2. The computer-implemented method of claim 1, further comprising: receiving audio data representing an utterance; processing the audio data to determine prosody data; and processing the prosody data using a trained model to determine a portion of the vocal characteristic data.
3. The computer-implemented method of claim 1, further comprising: processing, using a third encoder, the vocal characteristic data to determine encoded paralinguistic data, wherein determining the encoded data further comprises processing, using the second encoder, the encoded paralinguistic data.
4. The computer-implemented method of claim 1, wherein the request comprises a first portion and a second portion, further comprising: determining that the first portion lacks the description of the vocal characteristic; determining, using a dialog model, audio data representing a prompt for the vocal characteristic; and causing the audio data to be outputted.
5. A computer-implemented method comprising: receiving first input data representing a vocal characteristic and a request to output content corresponding to the vocal characteristic; performing natural language understanding (NLU) processing using the first input data to determine an intent to perform speech synthesis corresponding to the vocal characteristic represented in a portion of the first input data; based at least in part on determining the intent to perform speech synthesis corresponding to the vocal characteristic represented in the portion of the first input data, processing the first input data to determine vocal characteristic data representing at least the vocal characteristic; determining, using a trained model and the vocal characteristic data, a model output data; receiving second input data corresponding to a speech synthesis task; determining, using an encoder and the second input data, encoded data; and determining, using a decoder, the model output data and the encoded data, synthesized speech data corresponding to the vocal characteristic.
6. The computer-implemented method of claim 5, wherein the first input data comprises audio data, and further comprising: determining that the audio data represents an utterance; processing the audio data to determine prosody data; and processing the prosody data using a trained model to determine a portion of the vocal characteristic data.
7. The computer-implemented method of claim 5, further comprising: prior to receiving the first input data, receiving third input data; determining, using NLU processing, that the third input data lacks a description of the vocal characteristic; and causing an indication of a request for the vocal characteristic to be sent to a local device.
8. The computer-implemented method of claim 5, wherein the encoder comprises a first encoder and a second encoder, further comprising: determining, using the first encoder and the first input data, encoded linguistic data, wherein determining the encoded data comprises processing, using the second encoder, the encoded linguistic data and the vocal characteristic data.
9. The computer-implemented method of claim 5, further comprising: determining identification data indicating that the synthesized speech data includes a representation of synthesized speech; determining modified synthesized speech data by processing the identification data with the synthesized speech data; and sending the modified synthesized speech data.
10. The computer-implemented method of claim 5, wherein the first input data comprises a description of the vocal characteristic, and further comprising: determining a second vocal characteristic different from the description, wherein the vocal characteristic data further represents the second vocal characteristic.
11. The computer-implemented method of claim 5, wherein the encoder comprises a first encoder and a second encoder, further comprising: determining, using the first encoder and the vocal characteristic data, encoded paralinguistic data, wherein determining the encoded data comprises processing, using the second encoder, the encoded paralinguistic data and the vocal characteristic data.
12. The computer-implemented method of claim 5, wherein the portion of the first input data comprises audio data representing speech and the method further comprises: determining that the intent indicates the vocal characteristic is to be determined to match a characteristic of the speech.
13. The computer-implemented method of claim 5, wherein the portion of the first input data comprises a description of the vocal characteristic and the method further comprises: determining that the intent indicates the vocal characteristic is to be determined based at least in part on the description.
14. The computer-implemented method of claim 5, wherein the portion of the first input data comprises image data associated with the vocal characteristic and the method further comprises: determining that the intent indicates the vocal characteristic is to be determined based at least in part on the image data.
15. A system comprising: at least one processor; and at least one memory including instructions that, when executed by the at least one processor, cause the system to: receive first input data representing a vocal characteristic and a request to output content corresponding to the vocal characteristic; perform natural language understanding (NLU) processing using the first input data to determine an intent to perform speech synthesis corresponding to the vocal characteristic represented in a portion of the first input data; based at least in part on determining the intent to perform speech synthesis corresponding to the vocal characteristic represented in the portion of the first input data, process the first input data to determine vocal characteristic data representing at least the vocal characteristic; determine, using a trained model and the vocal characteristic data, a model output data; receive second input data corresponding to a speech synthesis task; determine, using an encoder and the second input data, encoded data; and determine, using a decoder, the model output data and the encoded data, synthesized speech data corresponding to the vocal characteristic
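The control flow recited in claims 5 and 7 (determine the intent, check for a described vocal characteristic, prompt when it is missing, otherwise proceed to synthesis) might be sketched as follows. Everything here is an illustrative assumption: the keyword matcher stands in for a real NLU component, and `KNOWN_CHARACTERISTICS`, `run_nlu`, `handle_request`, and the numeric values are hypothetical names and data that do not come from the patent.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical mapping from descriptive words to (pitch, rate) values.
KNOWN_CHARACTERISTICS = {
    "deep": (0.2, 0.5),
    "cheerful": (0.8, 0.7),
    "slow": (0.5, 0.2),
}


@dataclass
class NluResult:
    intent: Optional[str]
    characteristic: Optional[str]


def run_nlu(text):
    """Keyword-based stand-in for NLU processing (assumption, not a real NLU)."""
    words = [w.strip(".,!?") for w in text.lower().split()]
    wants_speech = any(w in ("say", "read", "speak") for w in words)
    characteristic = next((w for w in words if w in KNOWN_CHARACTERISTICS), None)
    return NluResult("synthesize_speech" if wants_speech else None, characteristic)


def handle_request(first_input, second_input):
    """Route a request: ignore it, prompt for a voice, or hand off to synthesis."""
    nlu = run_nlu(first_input)
    if nlu.intent != "synthesize_speech":
        return {"action": "ignore"}
    if nlu.characteristic is None:
        # Analogous to claim 7: ask the user for the missing vocal characteristic.
        return {"action": "prompt", "message": "What should the voice sound like?"}
    return {
        "action": "synthesize",
        "vocal_characteristic_data": KNOWN_CHARACTERISTICS[nlu.characteristic],
        "text": second_input,  # the speech synthesis task to encode and decode
    }
```

For example, `handle_request("Read this in a deep voice", "Once upon a time")` returns a `synthesize` action carrying the vocal characteristic data, while `handle_request("Please read this", "Once upon a time")` returns a `prompt` action instead.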
Recurrent networks, e.g. Hopfield networks · CPC title
Combinations of networks · CPC title
Auto-encoder networks; Encoder-decoder networks · CPC title
characterised by memory or gating, e.g. long short-term memory [LSTM] or gated recurrent units [GRU] · CPC title
Convolutional networks [CNN, ConvNet] · CPC title