What technology area does this patent fall under?

The primary CPC classification is G10L25/30. Mapped technology areas include Physics.

When was this patent published?

Published on Jun 3, 2025 (kind code B2). Legal status and post-grant events are not shown on this page.

What related patents are in patentsdb?

We list 12 related publications on this page (citations in our corpus or others sharing the same primary CPC).

Generating audio using auto-regressive generative neural networks

US12322380B2 · US · B2

Patent metadata
Field	Value
Publication number	US-12322380-B2
Application number	US-202418412394-A
Country	US
Kind code	B2
Filing date	Jan 12, 2024
Priority date	Sep 7, 2022
Publication date	Jun 3, 2025
Grant date	Jun 3, 2025

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

Title
What the patent document calls the invention.
Abstract
A short plain-language summary of the technical disclosure.
Assignees and inventors
Who owns or filed the patent and who is credited as inventor.
Key dates
Filing, priority, publication, and grant dates set the timeline.
First independent claim
The legal scope of protection — read this for what is actually claimed.
CPC / IPC classifications
Technology tags used to group this patent with similar filings.
Citations and related patents
Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

Methods, systems, and apparatus, including computer programs encoded on computer storage media, for generating a prediction of an audio signal. One of the methods includes receiving a request to generate an audio signal conditioned on an input; processing the input using an embedding neural network to map the input to one or more embedding tokens; generating a semantic representation of the audio signal; generating, using one or more generative neural networks and conditioned on at least the semantic representation and the embedding tokens, an acoustic representation of the audio signal; and processing at least the acoustic representation using a decoder neural network to generate the prediction of the audio signal.
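The stages listed in the abstract can be read as a simple data flow. The Python sketch below is only an illustration of that ordering, not the patented implementation: the function name `generate_audio`, the integer-token types, and the toy stand-in stages are all assumptions made for this example. The abstract does not say what the semantic stage is conditioned on; the sketch assumes it receives the embedding tokens, which matches the training setup described in claim 2.

```python
from typing import Callable, List

Tokens = List[int]


def generate_audio(
    conditioning_input: str,
    embed: Callable[[str], Tokens],
    semantic_model: Callable[[Tokens], Tokens],
    acoustic_model: Callable[[Tokens, Tokens], Tokens],
    decode: Callable[[Tokens], List[float]],
) -> List[float]:
    """Run the four stages in the order the abstract lists them."""
    # 1. Map the input to one or more embedding tokens.
    embedding_tokens = embed(conditioning_input)
    # 2. Generate a semantic representation (assumed here to be
    #    conditioned on the embedding tokens).
    semantic = semantic_model(embedding_tokens)
    # 3. Generate an acoustic representation conditioned on both the
    #    semantic representation and the embedding tokens.
    acoustic = acoustic_model(semantic, embedding_tokens)
    # 4. Decode the acoustic representation into audio samples.
    return decode(acoustic)


# Toy stand-ins (hypothetical): real stages would be neural networks.
def toy_embed(text: str) -> Tokens:
    return [len(word) % 8 for word in text.split()]


def toy_semantic(emb: Tokens) -> Tokens:
    return [(t * 3 + 1) % 16 for t in emb]


def toy_acoustic(sem: Tokens, emb: Tokens) -> Tokens:
    return [s + e for s, e in zip(sem, emb)]


def toy_decode(acoustic: Tokens) -> List[float]:
    return [t / 32.0 for t in acoustic]


samples = generate_audio(
    "calm piano melody", toy_embed, toy_semantic, toy_acoustic, toy_decode
)
print(samples)
```

Passing the stages in as callables keeps the sketch neutral about how each network is built; only the order of the stages and what each one is conditioned on are taken from the abstract.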

First claim

Opening claim text (preview).

What is claimed is:

1. A method for training one or more generative neural networks for generating a prediction of an audio signal, the audio signal having a respective audio sample at each of a plurality of output time steps spanning a time window, the training comprising: obtaining, as training data, a plurality of target audio signals; training, on a set of target semantic representations generated from the target audio signals, a third generative neural network that generates a semantic representation of the audio signal, wherein the semantic representation specifies a respective semantic token at each of a plurality of first time steps spanning the time window, each semantic token representing semantic content of the audio signal at the corresponding first time step; and training, on a set of target acoustic representations generated from the target audio signals, a first generative neural network and a second generative neural network that generate an acoustic representation of the audio signal, wherein the acoustic representation specifies a set of one or more respective acoustic tokens at each of a plurality of second time steps spanning the time window, the one or more respective acoustic tokens at each second time step representing acoustic properties of the audio signal at the corresponding second time step.

2. The method of claim 1, wherein training, on a set of target semantic representations generated from the target audio signals, a third generative neural network that generates a semantic representation of the audio signal comprises: processing each of the plurality of target audio signals using one or more layers of an audio representation neural network to generate a respective set of audio representation embeddings for each target audio signal; generating a respective target semantic representation for each set of audio representation embeddings by assigning each audio representation embedding to a cluster having a nearest centroid to the audio representation embedding; and training the third generative neural network on the set of target semantic representations to predict, when conditioned on one or more embedding tokens for a given target audio signal, the target semantic representation for the audio representation embeddings generated for the target audio signal.

3. The method of claim 2, wherein obtaining the set of target semantic representations further comprises removing consecutive repetitions of semantic tokens from each target semantic representation.

4. The method of claim 2, wherein the audio representation neural network has been trained to minimize a masked language model (MLM) loss and a contrastive loss.

5. The method of claim 2, wherein the audio representation neural network comprises a self-attention based model that has been trained on a music representation task.

6. The method of claim 2, wherein the one or more embedding tokens are generated by processing an input using at least an embedding neural network, comprising: generating an embedding vector for the input in a joint embedding space using the embedding neural network; and quantizing the embedding vector to generate the one or more embedding tokens.

7. The method of claim 6, wherein the input comprises the given target audio signal.

8. The method of claim 6, wherein the embedding neural network is trained on training data comprising audio signals.

9. The method of claim 6, wherein the embedding neural network has been trained on an objective so that text that describes audio signals, and the corresponding audio signals, have embeddings that are close to each other in a joint embedding space.

10. The method of claim 6, wherein the input comprises a melody audio signal representing a melody, and wherein the one or more embedding tokens comprise melody embedding tokens, and wherein the melody embedding tokens are generated by processing the melody audio signal using a melody embedding neural network.

11. The method of claim 10, wherein the melody embedding neural network has been trained on an objective so that audio clips containing a same melody have embeddings that are close to each other in a joint embedding space.

12. The method of claim 1, wherein training, on a set of target acoustic representations generated from the target audio signals, a first generative neural network and a second generative neural network that generate an acoustic representation of the audio signal comprises: processing each of the plurality of target audio signals using an encoder neural network to generate a respective embedding at each of the plurality of second time steps for each target audio signal; generating each target acoustic representation by applying quantization to each of the respective embeddings; and training the first generative neural network and the second generative neural network on the set of target acoustic representations to predict the target acoustic representations when conditioned on one or more embedding tokens for the target audio signals.

13. The method of claim 12, wherein the quantization is residual vector quantization that encodes each embedding using a hierarchy of a plurality of vector quantizers that each generate a respective acoustic token from a corresponding vocabulary of acoustic tokens for the vector quantizer, wherein the hierarchy comprises one or more coarse vector quantizers at one or more first positions in the hierarchy and one or more fine vector quantizers at one or more last positions in the hierarchy.

14. The method of claim 13, wherein training the first generative neural network and the second generative neural network on the set of target acoustic representations to predict the target acoustic representations comprises training the first generative neural network to predict the target acoustic representations when conditioned on the one or more embedding tokens for the target audio signals and semantic representations for the target audio signals.

15. The method of claim 13, wherein training the first generative neural network and the second generative neural network on the set of target acoustic representations to predict the target acoustic representations comprises training the second generative neural network to predict the target acoustic representations when conditioned on the one or more embedding tokens for the target audio signals.

16. The method of claim 1, wherein a prediction of the audio signal is generated by processing at least the acoustic representation using a decoder neural network.

17. The method of claim 16, wherein the decoder neural network is a decoder neural network of a neural audio codec that has been trained jointly with an encoder neural network on an objective that measures reconstruction quality of predicted audio signals generated by the decoder neural network from acoustic representations generated using outputs generated by the encoder neural network.

18. A system comprising one or more computers and one or more storage devices storing instructions that, when executed by the one or more computers, cause the one or more computers to perform operations comprising a method for training one or more generative neural networks for generating a prediction of an audio signal, the audio signal having a respective audio sample at each of a plurality of output time steps spanning a time window, the method comprising: obtaining, as training data, a plurality of target audio signals; training, on a set of target semantic representations generated from the target audio signals, a third gener…
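Claims 2, 3, and 13 describe three tokenization steps: assigning each audio representation embedding to the cluster with the nearest centroid, removing consecutive repeated semantic tokens, and residual vector quantization with coarse quantizers followed by fine ones. The Python sketch below illustrates these steps on toy two-dimensional data. It is not the patented implementation: the function names, the centroid and codebook values, and the choice of Euclidean distance are assumptions made for this example. The residual step follows the standard definition of residual vector quantization, in which each quantizer encodes what the previous quantizers left over.

```python
from math import dist  # Euclidean distance (Python 3.8+)


def nearest_centroid(vector, centroids):
    """Return the index of the centroid closest to `vector`."""
    return min(range(len(centroids)), key=lambda i: dist(vector, centroids[i]))


def semantic_tokens(embeddings, centroids):
    """Claim 2 (sketch): one token per embedding, the id of its nearest centroid."""
    return [nearest_centroid(e, centroids) for e in embeddings]


def remove_consecutive_repeats(tokens):
    """Claim 3 (sketch): collapse each run of identical tokens to a single token."""
    out = []
    for token in tokens:
        if not out or out[-1] != token:
            out.append(token)
    return out


def residual_vector_quantize(embedding, codebooks):
    """Claim 13 (sketch): return one acoustic token per quantizer, coarse first.

    Each quantizer picks the codebook entry nearest to the residual that
    the earlier quantizers have not yet explained.
    """
    residual = list(embedding)
    tokens = []
    for codebook in codebooks:
        index = nearest_centroid(residual, codebook)
        tokens.append(index)
        residual = [r - c for r, c in zip(residual, codebook[index])]
    return tokens


# Toy data (hypothetical values chosen for illustration only).
centroids = [(0.0, 0.0), (1.0, 1.0), (4.0, 4.0)]
embeddings = [(0.1, -0.1), (0.2, 0.0), (0.9, 1.2), (3.8, 4.1), (4.2, 3.9)]

tokens = semantic_tokens(embeddings, centroids)
deduplicated = remove_consecutive_repeats(tokens)

coarse_codebook = [(0.0, 0.0), (2.0, 2.0), (4.0, 4.0)]
fine_codebook = [(-0.5, -0.5), (0.0, 0.0), (0.5, 0.5)]
acoustic = residual_vector_quantize((2.4, 2.6), [coarse_codebook, fine_codebook])

print(tokens)        # [0, 0, 1, 2, 2]
print(deduplicated)  # [0, 1, 2]
print(acoustic)      # [1, 2]
```

In the last call, the coarse quantizer selects (2.0, 2.0), leaving a residual of roughly (0.4, 0.6), which the fine quantizer maps to (0.5, 0.5). This is why the hierarchy in claim 13 is ordered from coarse to fine.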

Assignees

Google LLC

Inventors

Classifications

G10L21/0272
Voice signal separating · CPC title
G10L13/027
Concept to speech synthesisers; Generation of natural phrases from machine-based concepts (generation of parameters for speech synthesis out of text G10L13/08) · CPC title
G10H2250/311
Neural networks for electrophonic musical instruments or musical processing, e.g. for musical recognition or control, automatic composition or improvisation · CPC title
G10H1/0008
Associated control or indicating means · CPC title
G06N3/09
Supervised learning · CPC title

Patent family

Related publications grouped by family.

Patent family ID: 88237636

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US12322380B2 cover?: Methods, systems, and apparatus, including computer programs encoded on computer storage media, for generating a prediction of an audio signal. One of the methods includes receiving a request to generate an audio signal conditioned on an input; processing the input using an embedding neural network to map the input to one or more embedding tokens; generating a semantic representation of the aud…
Who is the assignee on this patent?: Google LLC