Text-to-speech processing using input voice characteristic data

US11373633B2 · US · B2

Patent metadata
Field | Value
Publication number | US-11373633-B2
Application number | US-201916586007-A
Country | US
Kind code | B2
Filing date | Sep 27, 2019
Priority date | Sep 27, 2019
Publication date | Jun 28, 2022
Grant date | Jun 28, 2022

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

During text-to-speech processing, a speech model creates synthesized speech that corresponds to input data. The speech model may include an encoder for encoding the input data into a context vector and a decoder for decoding the context vector into spectrogram data. The speech model may further include a voice decoder that receives vocal characteristic data representing a desired vocal characteristic of synthesized speech. The voice decoder may process the vocal characteristic data to determine configuration data, such as weights, for use by the speech decoder.
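
The data flow in the abstract can be illustrated with a toy sketch. This is not the patented model: every name, dimension, and formula below is an assumption made for illustration. It only shows the wiring the abstract describes, in which a voice decoder turns vocal characteristic data into weights and the speech decoder then uses those weights to decode the encoder's context vector into a spectrogram frame.

```python
import math
import random

# All sizes are illustrative assumptions; the abstract does not specify them.
CONTEXT_DIM = 4   # length of the context vector
VOICE_DIM = 3     # length of the vocal characteristic vector
MEL_BINS = 5      # spectrogram bins per output frame


def encode(text):
    """Toy encoder: fold character codes into a fixed-size context vector."""
    context = [0.0] * CONTEXT_DIM
    for i, ch in enumerate(text):
        context[i % CONTEXT_DIM] += ord(ch) / 1000.0
    n = max(len(text), 1)
    return [c / n for c in context]


def make_voice_decoder(seed=0):
    """Toy voice decoder: a fixed random linear map from vocal characteristic
    data to a MEL_BINS x CONTEXT_DIM weight matrix for the speech decoder."""
    rng = random.Random(seed)
    proj = [[rng.uniform(-1, 1) for _ in range(VOICE_DIM)]
            for _ in range(MEL_BINS * CONTEXT_DIM)]

    def voice_decoder(voice_vec):
        flat = [sum(p * v for p, v in zip(row, voice_vec)) for row in proj]
        return [flat[r * CONTEXT_DIM:(r + 1) * CONTEXT_DIM]
                for r in range(MEL_BINS)]

    return voice_decoder


def speech_decoder(context, weights):
    """Toy speech decoder: one spectrogram frame from the context vector,
    computed with weights supplied by the voice decoder."""
    return [math.tanh(sum(w * c for w, c in zip(row, context)))
            for row in weights]


voice_decoder = make_voice_decoder()
context = encode("hello world")
frame_a = speech_decoder(context, voice_decoder([1.0, 0.0, 0.0]))
frame_b = speech_decoder(context, voice_decoder([0.0, 1.0, 0.0]))
```

The same text produces different frames for the two vocal characteristic vectors because only the decoder weights change, which is the role the abstract assigns to the configuration data.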

First claim

Opening claim text (preview).

What is claimed is:

1. A computer-implemented method for generating speech from text, the method comprising: receiving a request to generate output speech data corresponding to input text data, the request including a description of a speaking style; determining, using a natural-language understanding (NLU) component, that the input text data represents an intent to perform synthesis of speech using a description of a speaking style indicated in the input text data; determining the description corresponds to a vocal characteristic; determining vocal characteristic data representing at least the vocal characteristic; processing, using a voice decoder, the vocal characteristic data to determine a neural-network model weight; processing, using a first encoder, the input text data to determine encoded linguistic data; receiving second user input corresponding to a speech synthesis task; processing, using a second encoder, the second user input to determine encoded data, wherein determining the encoded data comprises processing, using the second encoder, the encoded linguistic data and the vocal characteristic data; processing, using a speech decoder and the neural-network model weight, the encoded data to determine synthesized speech data; and causing output of audio corresponding to the synthesized speech data.

2. The computer-implemented method of claim 1, further comprising: receiving audio data representing an utterance; processing the audio data to determine prosody data; and processing the prosody data using a trained model to determine a portion of the vocal characteristic data.

3. The computer-implemented method of claim 1, further comprising: processing, using a third encoder, the vocal characteristic data to determine encoded paralinguistic data, wherein determining the encoded data further comprises processing, using the second encoder, the encoded paralinguistic data.

4. The computer-implemented method of claim 1, wherein the request comprises a first portion and a second portion, further comprising: determining that the first portion lacks the description of the vocal characteristic; determining, using a dialog model, audio data representing a prompt for the vocal characteristic; and causing the audio data to be outputted.

5. A computer-implemented method comprising: receiving first input data representing a vocal characteristic and a request to output content corresponding to the vocal characteristic; performing natural language understanding (NLU) processing using the first input data to determine an intent to perform speech synthesis corresponding to the vocal characteristic represented in a portion of the first input data; based at least in part on determining the intent to perform speech synthesis corresponding to the vocal characteristic represented in the portion of the first input data, processing the first input data to determine vocal characteristic data representing at least the vocal characteristic; determining, using a trained model and the vocal characteristic data, a model output data; receiving second input data corresponding to a speech synthesis task; determining, using an encoder and the second input data, encoded data; and determining, using a decoder, the model output data and the encoded data, synthesized speech data corresponding to the vocal characteristic.

6. The computer-implemented method of claim 5, wherein the first input data comprises audio data, and further comprising: determining that the audio data represents an utterance; processing the audio data to determine prosody data; and processing the prosody data using a trained model to determine a portion of the vocal characteristic data.

7. The computer-implemented method of claim 5, further comprising: prior to receiving the first input data, receiving third input data; determining, using NLU processing, that the third input data lacks a description of the vocal characteristic; and causing an indication of a request for the vocal characteristic to be sent to a local device.

8. The computer-implemented method of claim 5, wherein the encoder comprises a first encoder and a second encoder, further comprising: determining, using the first encoder and the first input data, encoded linguistic data, wherein determining the encoded data comprises processing, using the second encoder, the encoded linguistic data and the vocal characteristic data.

9. The computer-implemented method of claim 5 further comprising: determining identification data indicating that the synthesized speech data includes a representation of synthesized speech; determining modified synthesized speech data by processing the identification data with the synthesized speech data; and sending the modified synthesized speech data.

10. The computer-implemented method of claim 5, wherein the first input data comprises a description of the vocal characteristic, and further comprising: determining a second vocal characteristic different from the description, wherein the vocal characteristic data further represents the second vocal characteristic.

11. The computer-implemented method of claim 5, wherein the encoder comprises a first encoder and a second encoder, further comprising: determining, using the first encoder and the vocal characteristic data, encoded paralinguistic data, wherein determining the encoded data comprises processing, using the second encoder, the encoded paralinguistic data and the vocal characteristic data.

12. The computer-implemented method of claim 5, wherein the portion of the first input data comprises audio data representing speech and the method further comprises: determining that the intent indicates the vocal characteristic is to be determined to match a characteristic of the speech.

13. The computer-implemented method of claim 5, wherein the portion of the first input data comprises a description of the vocal characteristic and the method further comprises: determining that the intent indicates the vocal characteristic is to be determined based at least in part on the description.

14. The computer-implemented method of claim 5, wherein the portion of the first input data comprises image data associated with the vocal characteristic and the method further comprises: determining that the intent indicates the vocal characteristic is to be determined based at least in part on the image data.

15. A system comprising: at least one processor; and at least one memory including instructions that, when executed by the at least one processor, cause the system to: receive first input data representing a vocal characteristic and a request to output content corresponding to the vocal characteristic; perform natural language understanding (NLU) processing using the first input data to determine an intent to perform speech synthesis corresponding to the vocal characteristic represented in a portion of the first input data; based at least in part on determining the intent to perform speech synthesis corresponding to the vocal characteristic represented in the portion of the first input data, process the first input data to determine vocal characteristic data representing at least the vocal characteristic; determine, using a trained model and the vocal characteristic data, a model output data; receive second input data corresponding to a speech synthesis task; determine, using an encoder and the second input data, encoded data; and determine, using a decoder, the model output data and the encoded data, synthesized speech data corresponding to the vocal characteristic
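
The front half of the claimed flow (detect a speech-synthesis intent, map a spoken style description to vocal characteristic data, and prompt when the description is missing, as in claims 4 and 7) can be sketched in a few lines. This is a rough illustration only: the keyword lookup stands in for a real NLU component, and the lexicon, trigger words, field names, and default values are all hypothetical rather than taken from the patent.

```python
# Hypothetical lexicon mapping descriptive words to vocal characteristic values.
STYLE_LEXICON = {
    "whisper": {"volume": 0.2},
    "loud": {"volume": 0.9},
    "slow": {"rate": 0.5},
    "fast": {"rate": 1.5},
    "cheerful": {"pitch": 1.2},
}
# Hypothetical words treated as a request for speech synthesis.
SYNTHESIS_TRIGGERS = ("say", "read", "speak")


def understand(request):
    """Stand-in for NLU: detect a synthesis intent and any style description."""
    words = request.lower().replace(",", " ").split()
    has_trigger = any(w in SYNTHESIS_TRIGGERS for w in words)
    intent = "synthesize_speech" if has_trigger else None
    description = [w for w in words if w in STYLE_LEXICON]
    return intent, description


def vocal_characteristic_data(description):
    """Turn the matched description words into vocal characteristic data."""
    data = {"volume": 0.5, "rate": 1.0, "pitch": 1.0}  # assumed neutral defaults
    for word in description:
        data.update(STYLE_LEXICON[word])
    return data


def handle(request):
    """Route a request: ignore it, prompt for a style, or hand off to synthesis."""
    intent, description = understand(request)
    if intent is None:
        return {"action": "ignore"}
    if not description:
        # Mirrors the prompting behaviour of claims 4 and 7.
        return {"action": "prompt", "text": "What should the voice sound like?"}
    return {
        "action": "synthesize",
        "vocal_characteristic_data": vocal_characteristic_data(description),
    }
```

For example, `handle("Read this in a slow whisper")` yields a `synthesize` action carrying volume and rate values, while `handle("Read this aloud")` yields a `prompt` action because no style word was found.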

Assignees

Amazon Tech Inc

Inventors

Classifications

  • Recurrent networks, e.g. Hopfield networks · CPC title

  • Combinations of networks · CPC title

  • Auto-encoder networks; Encoder-decoder networks · CPC title

  • characterised by memory or gating, e.g. long short-term memory [LSTM] or gated recurrent units [GRU] · CPC title

  • Convolutional networks [CNN, ConvNet] · CPC title

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US11373633B2 cover?
During text-to-speech processing, a speech model creates synthesized speech that corresponds to input data. The speech model may include an encoder for encoding the input data into a context vector and a decoder for decoding the context vector into spectrogram data. The speech model may further include a voice decoder that receives vocal characteristic data representing a desired vocal characte…
Who is the assignee on this patent?
Amazon Tech Inc
What technology area does this patent fall under?
Primary CPC classification G10L13/033. Mapped technology areas include Physics.
When was this patent published?
Publication date: Jun 28, 2022 (B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 12 related publications on this page (citations in our corpus or others sharing the same primary CPC).