Synthesizing speech from text using neural networks

US12148444B2 · US · B2

Patent metadata
Publication number: US-12148444-B2
Application number: US-202117222736-A
Country: US
Kind code: B2
Filing date: Apr 5, 2021
Priority date: Aug 8, 2018
Publication date: Nov 19, 2024
Grant date: Nov 19, 2024

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

Methods, systems, and computer program products for generating, from an input character sequence, an output sequence of audio data representing the input character sequence. The output sequence of audio data includes a respective audio output sample for each of a number of time steps. One example method includes, for each of the time steps: generating a mel-frequency spectrogram for the time step by processing a representation of a respective portion of the input character sequence using a decoder neural network; generating a probability distribution over a plurality of possible audio output samples for the time step by processing the mel-frequency spectrogram for the time step using a vocoder neural network; and selecting the audio output sample for the time step from the possible audio output samples in accordance with the probability distribution.
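The per-time-step loop described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the patented implementation: `synthesize`, `toy_decoder`, `toy_vocoder`, and `CANDIDATES` are hypothetical names, and the two toy functions are simple arithmetic stand-ins for the decoder and vocoder neural networks. The sketch also assumes a small discrete set of candidate sample values so that the probability distribution can be a plain list.

```python
import math
import random
from typing import Callable, List, Sequence

MelFrame = List[float]
Distribution = List[float]  # one probability per candidate sample value

# Hypothetical candidate amplitudes; a real system would use a far finer grid
# or a continuous distribution.
CANDIDATES: List[float] = [-1.0, -0.5, 0.0, 0.5, 1.0]


def synthesize(
    char_reprs: Sequence[Sequence[float]],
    decoder: Callable[[Sequence[float], List[MelFrame]], MelFrame],
    vocoder: Callable[[MelFrame, List[float]], Distribution],
    candidates: Sequence[float],
    rng: random.Random,
) -> List[float]:
    """Run the three steps from the abstract once per time step."""
    mel_history: List[MelFrame] = []
    audio: List[float] = []
    for rep in char_reprs:
        # Step 1: decoder turns a representation of a portion of the input
        # into a mel-frequency spectrogram frame for this time step.
        mel = decoder(rep, mel_history)
        mel_history.append(mel)
        # Step 2: vocoder turns the frame into a probability distribution
        # over the possible audio output samples.
        probs = vocoder(mel, audio)
        # Step 3: select a sample in accordance with that distribution.
        sample = rng.choices(list(candidates), weights=probs, k=1)[0]
        audio.append(sample)
    return audio


def toy_decoder(rep: Sequence[float], mel_history: List[MelFrame]) -> MelFrame:
    """Stand-in decoder: mixes the previous frame with the current input."""
    prev = mel_history[-1][0] if mel_history else 0.0
    return [0.5 * prev + sum(rep) / len(rep)]


def toy_vocoder(mel: MelFrame, audio_so_far: List[float]) -> Distribution:
    """Stand-in vocoder: favors candidates close to tanh of the frame value.

    audio_so_far is accepted to mirror autoregressive conditioning, but this
    toy version ignores it.
    """
    center = math.tanh(mel[0])
    weights = [math.exp(-((c - center) ** 2) / 0.1) for c in CANDIDATES]
    total = sum(weights)
    return [w / total for w in weights]


if __name__ == "__main__":
    reps = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
    print(synthesize(reps, toy_decoder, toy_vocoder, CANDIDATES, random.Random(0)))
```

Passing a seeded `random.Random` keeps the sampling step reproducible, which is convenient when comparing runs.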

First claim

Opening claim text (preview).

What is claimed:

1. A method for generating, from an input data representing a text input, an output sequence of audio data corresponding to the input data, wherein the output sequence of audio data comprises a respective audio output sample for each of a plurality of time steps, and wherein the method comprises, for each of the plurality of time steps: generating a mel-frequency spectrogram for a timestep of the plurality of timesteps by processing a representation of a respective portion of the input data using a decoder neural network, wherein the decoder neural network is an autoregressive neural network comprising an LSTM subnetwork, a linear transform, and a convolutional subnetwork; generating a probability distribution over a plurality of possible audio output samples for the time step by processing the mel-frequency spectrogram for the time step using a vocoder neural network; and selecting an audio output sample for the time step from the plurality of possible audio output samples in accordance with the probability distribution.

2. The method of claim 1, wherein the vocoder neural network is an autoregressive neural network, and wherein generating the probability distribution further comprises: conditioning the autoregressive neural network on a current output sequence of audio data comprising respective audio output samples for each preceding time step in the output sequence.

3. The method of claim 1, wherein processing the representation of the respective portion of the input data, comprises: processing, by an encoder neural network, the input data to generate a feature representation of the input data; and processing, by an attention network, the feature representation, to generate a fixed-length context vector for the time step.

4. The method of claim 3, wherein the encoder neural network comprises a convolutional subnetwork and a bidirectional long short-term memory (LSTM) layer, and wherein the feature representation is a sequential feature representation that represents a local structure of the input data around a particular character in the input data.

5. The method of claim 2, wherein the autoregressive neural network of the vocoder neural network, comprises: a convolutional subnetwork having one or more dilated convolutional layers; and a logistic output layer configured to generate a continuous probability distribution of possible audio output samples.

6. The method of claim 1, wherein the probability distribution is a logistic distribution.

7. The method of claim 1, wherein selecting one of the plurality of possible audio samples in accordance with the probability distribution for the time step comprises sampling from the probability distribution.

8. The method of claim 1, wherein: each of the plurality of time steps corresponds to a respective time in an audio waveform, and wherein the respective audio sample at each of the plurality of time steps is an amplitude value of the audio waveform at the corresponding time step; and a frame length of each mel-frequency spectrogram is 12.5 milliseconds.

9. A system comprising one or more computers and one or more storage devices storing instructions that are operable, when executed by the one or more computers, to cause the one or more computers to perform operations comprising: generating, from an input data representing a text input, an output sequence of audio data representing the input data, wherein the output sequence of audio data comprises a respective audio output sample for each of a plurality of time steps, including, for each of the plurality of time steps: generating a mel-frequency spectrogram for a timestep of the plurality of timesteps by processing a representation of a respective portion of the input data using a decoder neural network, wherein the decoder neural network is an autoregressive neural network comprising an LSTM subnetwork, a linear transform, and a convolutional subnetwork; generating a probability distribution over a plurality of possible audio output samples for the time step by processing the mel-frequency spectrogram for the time step using a vocoder neural network; and selecting an audio output sample for the time step from the plurality of possible audio output samples in accordance with the probability distribution.

10. The system of claim 9, wherein the vocoder neural network is an autoregressive neural network, and wherein generating the probability distribution further comprises: conditioning the autoregressive neural network on a current output sequence of audio data comprising respective audio output samples for each preceding time step in the output sequence.

11. The system of claim 9, wherein processing the representation of the respective portion of the input data using the decoder neural network, comprises: processing, by an encoder neural network, the input data to generate a feature representation of a character sequence in the input data; and processing, by an attention network, the feature representation, to generate a fixed-length context vector for the time step.

12. The system of claim 11, wherein the encoder neural network comprises a convolutional subnetwork and a bidirectional long short-term memory (LSTM) layer, and wherein the feature representation is a sequential feature representation that represents a local structure of the input data around a particular character in an input character sequence included in the input data.

13. The system of claim 9, wherein the autoregressive neural network of the vocoder neural network comprises: a convolutional subnetwork having one or more dilated convolutional layers; and a logistic output layer configured to generate a continuous probability distribution of possible audio output samples.

14. The system of claim 9, wherein the probability distribution is a logistic distribution and wherein selecting one of the plurality of possible audio samples in accordance with the probability distribution for the time step comprises sampling from the probability distribution.

15. The system of claim 11, wherein each of the plurality of time steps corresponds to a respective time in an audio waveform, and wherein the respective audio sample at each of the plurality of time steps is an amplitude value of the audio waveform at the corresponding time step and wherein a frame length of each mel-frequency spectrogram is 12.5 milliseconds.

16. A method for generating, from an input data representing a text input, an output sequence of audio data corresponding to the input data, wherein the output sequence of audio data comprises a respective audio output sample for each of a plurality of time steps, and wherein the method comprises, for each of the plurality of time steps: generating a mel-frequency spectrogram for a timestep of the plurality of timesteps by processing a representation of a respective portion of the input data using a decoder neural network, wherein the decoder neural network is an autoregressive neural network comprising an LSTM subnetwork, a linear transform, and a convolutional subnetwork; determining, using an autoregressive vocoder neural network and based on the mel-frequency spectrogram, an audio output sample for the time step from among a plurality of possible audio output samples for the time step; and generating an output sequence of audio data based on the audio output samples for the plurality of time steps.

17. The method of claim 16, further comprising: conditioning the autoregressive vocoder neural network on a current output sequence of audio data comprising res
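Two details in the dependent claims lend themselves to a short worked example: sampling an amplitude from a logistic distribution (claims 6, 7, and 14) and the 12.5 millisecond frame length (claims 8 and 15). The sketch below is illustrative only. The function names are hypothetical, the 24,000 Hz sample rate is an assumption that does not appear in the claim text shown here, and clipping to the range [-1, 1] is likewise an assumed convention for amplitude values.

```python
import math
import random


def sample_logistic(mu: float, scale: float, rng: random.Random) -> float:
    """Draw one value from a logistic distribution by inverting its CDF.

    The logistic CDF is F(x) = 1 / (1 + exp(-(x - mu) / scale)), so
    x = mu + scale * log(u / (1 - u)) for u uniform on (0, 1).
    """
    u = rng.random()
    # Keep u strictly inside (0, 1) so the logarithm stays finite.
    u = min(max(u, 1e-12), 1.0 - 1e-12)
    return mu + scale * math.log(u / (1.0 - u))


def clip_amplitude(x: float, lo: float = -1.0, hi: float = 1.0) -> float:
    """Clamp a sampled amplitude to an assumed valid range."""
    return max(lo, min(hi, x))


def samples_per_frame(sample_rate_hz: int, frame_ms: float = 12.5) -> int:
    """Number of waveform samples spanned by one spectrogram frame."""
    return round(sample_rate_hz * frame_ms / 1000)


if __name__ == "__main__":
    rng = random.Random(0)
    # mu and scale would come from the vocoder's output layer; the values
    # here are made up for illustration.
    amplitude = clip_amplitude(sample_logistic(mu=0.2, scale=0.1, rng=rng))
    print(amplitude)
    # Assumed sample rate of 24,000 Hz (not stated in the claims above).
    print(samples_per_frame(24000))
```

Under the assumed 24,000 Hz rate, one 12.5 ms frame spans 300 waveform samples, which illustrates why a sample-level vocoder has to produce many output samples for every spectrogram frame.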

Assignees

Inventors

Classifications

  • characterised by memory or gating, e.g. long short-term memory [LSTM] or gated recurrent units [GRU]

  • Convolutional networks [CNN, ConvNet]

  • Auto-encoder networks; Encoder-decoder networks

  • Supervised learning

  • Probabilistic graphical models, e.g. probabilistic networks

Patent family

Related publications grouped by family.

External sources

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US12148444B2 cover?
Methods, systems, and computer program products for generating, from an input character sequence, an output sequence of audio data representing the input character sequence. The output sequence of audio data includes a respective audio output sample for each of a number of time steps. One example method includes, for each of the time steps: generating a mel-frequency spectrogram for the time st…
Who is the assignee on this patent?
Google LLC
What technology area does this patent fall under?
Primary CPC classification G10L25/30. Mapped technology areas include Physics.
When was this patent published?
Publication date Nov 19, 2024 (B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 6 related publications on this page (citations in our corpus or others sharing the same primary CPC).