Speech synthesis utilizing audio waveform difference signal(s)

US11915682B2 · US · B2

Patent metadata
Publication number: US-11915682-B2
Application number: US-201917610934-A
Country: US
Kind code: B2
Filing date: May 20, 2019
Priority date: May 15, 2019
Publication date: Feb 27, 2024
Grant date: Feb 27, 2024

Abstract

Techniques are disclosed that enable generation of an audio waveform representing synthesized speech based on a difference signal determined using an autoregressive model. Various implementations include using a distribution of the difference signal values to represent sounds found in human speech with a higher level of granularity than sounds not frequently found in human speech. Additional or alternative implementations include using one or more speakers of a client device to render the generated audio waveform.
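
The idea of representing common speech sounds at a finer granularity than rare ones can be pictured with a set of difference values that are closely spaced near zero and widely spaced at large magnitudes. The following is a minimal sketch under stated assumptions, not the patented distribution: the function names, the 16-bit amplitude range, the `curvature` parameter, and the mu-law-style spacing are all hypothetical choices made for illustration.

```python
import math


def build_difference_codebook(num_values=256, max_diff=32767.0, curvature=255.0):
    """Return signed difference values, dense near zero and sparse near +/-max_diff.

    All parameters are illustrative assumptions, not values taken from the patent
    (apart from 256 being one of the sizes the claims mention).
    """
    codebook = []
    for i in range(num_values):
        # Map the index to a uniform position in [-1, 1].
        u = 2.0 * i / (num_values - 1) - 1.0
        # Expand with an inverse mu-law style curve so step size grows with magnitude.
        magnitude = (math.pow(1.0 + curvature, abs(u)) - 1.0) / curvature
        codebook.append(math.copysign(magnitude * max_diff, u))
    return codebook


def quantize_difference(diff, codebook):
    """Return the index of the codebook entry closest to diff."""
    return min(range(len(codebook)), key=lambda i: abs(codebook[i] - diff))


codebook = build_difference_codebook()
# Spacing between neighbouring values near zero versus at the extreme.
small_step = codebook[129] - codebook[128]
large_step = codebook[255] - codebook[254]
print(small_step < large_step)
```

With this spacing, a 256-entry index fits in 8 bits while still covering a 16-bit amplitude range, at the cost of coarse resolution for large sample-to-sample jumps.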

First claim

Opening claim text (preview).

What is claimed is:

1. A method implemented by one or more processors, the method comprising: iteratively generating samples of an audio waveform that is synthesized speech of provided text, wherein generating the samples of the audio waveform comprises: at each iteration of a plurality of sequential iterations: generating a respective difference signal for the iteration using an autoregressive model, wherein the respective difference signal is a predicted difference based on an amplitude of a respective preceding sample of the audio waveform generated in an immediately preceding iteration and an amplitude of a respective sample for the iteration, wherein an input to the autoregressive model comprises: a respective representation of at least part of the provided text, the respective preceding sample of the audio waveform generated in the immediately preceding iteration of the sequential iterations, and a respective preceding difference signal generated in the immediately preceding iteration; and determining the respective sample for the iteration using the respective difference signal for the iteration and the respective preceding sample of the audio waveform generated in the immediately preceding iteration, the respective sample for the iteration being one of the samples of the audio waveform; and causing a client device to render the audio waveform by rendering the samples of the audio waveform.

2. The method of claim 1, wherein the one or more processors are one or more processors of the client device, wherein the client device includes memory and one or more speakers, wherein the autoregressive model is stored in the memory, wherein the audio waveform is generated using one or more of the processors of the client device, and wherein the audio waveform is rendered using one or more of the speakers of the client device.

3. The method of claim 2, further comprising: determining that one or more conditions of the client device are satisfied; and in response to determining that the one or more conditions are satisfied: determining to utilize the autoregressive model to generate the audio waveform based on difference signals generated using the autoregressive model, instead of utilizing an alternative autoregressive model that is more resource intensive to utilize than the autoregressive model.

4. The method of claim 3, wherein the one or more conditions of the client device include the client device being powered by a battery which is not fully charged.

5. The method of claim 3, wherein the one or more conditions of the client device include the one or more of the processors of the client device being throttled by heat.

6. The method of claim 1, wherein the one or more processors are one or more processors of a server that is remote from the client device, wherein the autoregressive model is stored in memory of the server, wherein the audio waveform is generated using one or more of the processors of the server, and wherein causing the client device to render the audio waveform comprises transmitting the samples of the audio waveform to the client device.

7. The method of claim 6, further comprising: determining that one or more conditions of the server are satisfied; and in response to determining that the one or more conditions are satisfied: determining to utilize the autoregressive model to generate the audio waveform based on difference signals generated using the autoregressive model, instead of utilizing an alternative autoregressive model that is more resource intensive to utilize than the autoregressive model.

8. The method of claim 7, wherein the one or more conditions of the server include one or more of the processors of the server being throttled by heat.

9. The method of claim 1, wherein the autoregressive model is a recurrent neural network model.

10. The method of claim 1, wherein the difference signal generated for the iteration is a smaller number of bits than a number of bits for the respective sample of the audio waveform of the iteration.

11. The method of claim 1, wherein the difference signal is a discrete value selected from a difference signal distribution.

12. The method of claim 11, wherein the difference signal distribution is a log uniform distribution.

13. The method of claim 11, wherein the difference signal distribution includes 256 discrete values or 512 discrete values.

14. The method of claim 11, wherein the difference signal distribution includes at least a first difference signal value and a second difference signal value, wherein the first difference signal value represents a change in sound corresponding to a high amplitude high frequency sound not found in human speech, or found in human speech with less than a threshold frequency, wherein the second difference signal value represents a change in sound found in human speech, or found in human speech with greater than a threshold frequency, and wherein the change in sound represented by the first difference signal value is greater than the change in sound represented by the second difference signal value.

15. The method of claim 11, wherein the difference signal distribution excludes a difference signal value representing a high amplitude high frequency sound not found in human speech, or found in human speech with less than a threshold frequency.

16. The method of claim 1, wherein the audio waveform comprises the synthesized speech of the provided text representing an individual word.

17. The method of claim 1, wherein the audio waveform comprises the synthesized speech of the provided text representing an individual phoneme.

18. The method of claim 1, further comprising: training the autoregressive model using a speech synthesis training instance including provided training text and a ground truth audio waveform corresponding to the provided training text, wherein training the autoregressive model comprises: at each iteration of a plurality of sequential training iterations of generating samples of a training audio waveform: generating a respective training difference signal for the iteration using the autoregressive model, wherein the respective training difference signal is a predicted difference based on an amplitude of a respective preceding training sample of the training audio waveform generated in an immediately preceding iteration and an amplitude of a respective training sample for the iteration, wherein an input to the autoregressive model comprises: a respective representation of at least part of the provided training text, the respective preceding training sample of the training audio waveform generated in the immediately preceding iteration of the sequential training iterations, and a respective preceding training difference signal generated in the immediately preceding iteration; determining the respective training sample for the iteration using the respective training difference signal for the iteration and the respective preceding training sample of the training audio waveform generated in the immediately preceding iteration, the respective training sample for the iteration being one of the samples of the training audio waveform; determining a difference between the respective training sample for the iteration and a corresponding sample of the ground truth audio waveform; and updating one or more weights of the autoregressive model based on the determined difference.

19. The method of claim 1, wherein the client device executes an automated assistant client.
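
The iterative generation recited in claim 1 can be sketched as a loop in which each step feeds the text representation, the previous sample, and the previous difference signal into a model, then derives the next sample from the predicted difference. This is a rough illustration only: `synthesize`, `toy_model`, the additive update, and the clamping to a 16-bit range are assumptions introduced here, and `toy_model` is a hand-written placeholder rather than a trained autoregressive network.

```python
def synthesize(text_features, predict_difference, num_samples,
               initial_sample=0, initial_diff=0,
               min_amp=-32768, max_amp=32767):
    """Generate samples one at a time from predicted difference signals.

    predict_difference is a hypothetical callable standing in for the
    autoregressive model; it receives (text_features, prev_sample, prev_diff).
    """
    samples = []
    prev_sample, prev_diff = initial_sample, initial_diff
    for _ in range(num_samples):
        # Predict the change relative to the previous sample.
        diff = predict_difference(text_features, prev_sample, prev_diff)
        # Assumed update rule: add the difference, then clamp to the amplitude range.
        sample = max(min_amp, min(max_amp, prev_sample + diff))
        samples.append(sample)
        # The new sample and difference become inputs to the next iteration.
        prev_sample, prev_diff = sample, diff
    return samples


def toy_model(text_features, prev_sample, prev_diff):
    """Placeholder model: drifts toward a level derived from the input length."""
    target = 100 * len(text_features)
    return (target - prev_sample) // 2 + prev_diff // 4


samples = synthesize([1, 2, 3], toy_model, num_samples=5)
print(samples)
```

Because each sample depends on the one before it, the loop cannot be parallelized across time steps; the per-step cost of the model therefore dominates synthesis time, which is one reason a smaller difference-signal output can matter.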

Assignees

Deepmind Tech Ltd

Inventors

Classifications

  • characterised by memory or gating, e.g. long short-term memory [LSTM] or gated recurrent units [GRU]

  • Supervised learning

  • G10L13/047 (Primary) · Architecture of speech synthesisers

  • Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination

  • using neural networks

Frequently asked questions

Who is the assignee on this patent?
Deepmind Tech Ltd
What technology area does this patent fall under?
Primary CPC classification G10L13/047. Mapped technology areas include Physics.
When was this patent published?
Publication date: Feb 27, 2024 (B2). Legal status and post-grant events are not shown on this page.