What technology area does this patent fall under?

Primary CPC classification G10L25/30. Mapped technology areas include Physics.

When was this patent published?

Publication date: Feb 22, 2022 (kind code B2). Legal status and post-grant events are not shown on this page.

What related patents are in patentsdb?

We list 3 related publications on this page (citations in our corpus or others sharing the same primary CPC).

Speech coding using content latent embedding vectors and speaker latent embedding vectors

US11257507B2 · US · B2

Patent metadata
Field	Value
Publication number	US-11257507-B2
Application number	US-202016746703-A
Country	US
Kind code	B2
Filing date	Jan 17, 2020
Priority date	Jan 17, 2019
Publication date	Feb 22, 2022
Grant date	Feb 22, 2022

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

Title
What the patent document calls the invention.
Abstract
A short plain-language summary of the technical disclosure.
Assignees and inventors
Who owns or filed the patent and who is credited as inventor.
Key dates
Filing, priority, publication, and grant dates set the timeline.
First independent claim
The legal scope of protection — read this for what is actually claimed.
CPC / IPC classifications
Technology tags used to group this patent with similar filings.
Citations and related patents
Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

Methods, systems, and apparatus, including computer programs encoded on computer storage media, for generating discrete latent representations of input audio data. Only the discrete latent representation needs to be transmitted from an encoder system to a decoder system in order for the decoder system to be able to effectively decode, i.e., reconstruct, the input audio data.
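The idea of transmitting only a discrete representation can be illustrated with a small sketch. It assumes, as the first claim describes, that each encoded vector is replaced by its nearest entry in a shared set of content embedding vectors, and that a single speaker vector (here formed by mean pooling, as in claim 3) is replaced by its nearest speaker embedding vector. The function names, the toy codebooks, and the use of squared Euclidean distance as the measure of "nearest" are illustrative assumptions, not details taken from the patent, and the encoder and decoder neural networks themselves are not modeled.

```python
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def squared_distance(a: Vector, b: Vector) -> float:
    # Assumption: "nearest" is measured with squared Euclidean distance.
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest_index(vector: Vector, codebook: Sequence[Vector]) -> int:
    # Index of the codebook entry closest to the given vector.
    return min(range(len(codebook)),
               key=lambda i: squared_distance(vector, codebook[i]))


def mean_pool(vectors: Sequence[Vector]) -> List[float]:
    # Combine several vectors into one by averaging each dimension.
    count = len(vectors)
    return [sum(column) / count for column in zip(*vectors)]


def encode_to_indices(encoded: Sequence[Vector],
                      content_codebook: Sequence[Vector],
                      speaker_codebook: Sequence[Vector]) -> Tuple[List[int], int]:
    # One content index per encoded vector, plus one speaker index.
    content_ids = [nearest_index(v, content_codebook) for v in encoded]
    speaker_id = nearest_index(mean_pool(encoded), speaker_codebook)
    return content_ids, speaker_id


def decoder_input(content_ids: Sequence[int], speaker_id: int,
                  content_codebook: Sequence[Vector],
                  speaker_codebook: Sequence[Vector]):
    # The receiving side holds the same codebooks and looks the vectors up.
    return [content_codebook[i] for i in content_ids], speaker_codebook[speaker_id]


# Toy data (hypothetical): stand-ins for a real encoder output and codebooks.
content_codebook = [[0.0, 0.0], [1.0, 1.0], [4.0, 0.0]]
speaker_codebook = [[0.0, 0.0], [2.0, 0.5]]
encoded = [[0.1, -0.2], [0.9, 1.2], [3.5, 0.4]]

content_ids, speaker_id = encode_to_indices(encoded, content_codebook, speaker_codebook)
print(content_ids, speaker_id)  # only these integers would need to be sent
```

In this sketch the payload is a short list of integers rather than the encoded vectors themselves, which is the property the abstract highlights.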

First claim

Opening claim text (preview).

What is claimed is:

1. A system comprising: a memory for storing: a set of content latent embedding vectors; and a set of speaker latent embedding vectors; one or more computers and one or more storage devices storing instructions that when executed by the one or more computers cause the one or more computers to implement: an encoder neural network configured to: receive input audio data; and process the input audio data to generate an encoder output that comprises a plurality of encoded vectors, each respective encoded vector of the plurality of encoded vectors corresponding to a respective different latent variable in a sequence of a plurality of latent variables; and a subsystem configured to: provide the input audio data as input to the encoder neural network to obtain the encoder output for the input audio data that comprises the plurality of encoded vectors that each correspond to a respective different latent variable in the sequence of the plurality of latent variables; and generate a discrete latent representation of the input audio data from the encoder output, comprising: for each of the latent variables in the sequence of latent variables, determining, from the set of content latent embedding vectors in the memory, a content latent embedding vector that is nearest to the encoded vector corresponding to the latent variable; generating a speaker vector by combining at least the plurality of encoded vectors in the encoder output generated by the encoder neural network into a single vector; and determining, from the set of speaker latent embedding vectors in the memory, a speaker latent embedding vector from the set of speaker latent embedding vectors stored in the memory that is nearest to the speaker vector that is generated by combining at least the plurality of encoded vectors in the encoder output generated by the encoder neural network into a single vector, wherein: the content latent embedding vectors in the set of content latent embedding vectors are learned during joint training of the encoder neural network and a decoder neural network; and the speaker latent embedding vectors in the set of speaker latent embedding vectors are learned during the joint training of the encoder neural network and the decoder neural network.

2. The system of claim 1, wherein the discrete latent representation of the input audio data includes (i) for each of the latent variables, an identifier of the nearest latent embedding vector to the encoded vector for the latent variable and (ii) an identifier of the speaker latent embedding vector that is nearest to the speaker vector.

3. The system of claim 1, wherein generating the speaker vector comprises: applying mean pooling over the encoder vectors to generate the single vector.

4. The system of claim 1, wherein the input audio data is a portion of an utterance, wherein the input audio data is preceded in the utterance by one or more other portions, and wherein generating the speaker vector comprises: applying mean pooling over the encoder vectors for the input audio data and encoder vectors generated for the one or more other portions of the utterance to generate the single vector.

5. The system of claim 1, wherein the encoder neural network is a convolutional neural network.

6. The system of claim 5, wherein the encoder neural network has a dilated convolutional architecture.

7. The system of claim 1, wherein the instructions further cause the one or more computers to implement: the decoder neural network, wherein the decoder neural network is configured to: receive a decoder input derived from the discrete latent representation of the input audio data, and process the decoder input to generate a reconstruction of the input audio data, and wherein the subsystem is further configured to: generate the decoder input, wherein the decoder input comprises, (i) for each of the latent variables, the content latent embedding vector that is nearest to the encoded vector for the latent variable in the encoder output and (ii) the speaker latent embedding vector that is nearest to the speaker vector, and provide the decoder input as input to the decoder neural network to obtain the reconstruction of the input audio data.

8. The system of claim 7, wherein the decoder neural network is an auto-regressive convolutional neural network that is configured to auto-regressively generate the reconstruction conditioned on the decoder input.

9. The system of claim 7, wherein the reconstruction of the audio input data is a predicted companded and quantized representation of the audio input data.

10. A method of training an encoder neural network having a plurality of encoder network parameters and a decoder neural network having a plurality of decoder network parameters and of updating a set of content latent embedding vectors and a set of speaker latent embedding vectors, the method comprising: receiving a training audio input; processing the training audio input through the encoder neural network in accordance with current values of the encoder network parameters of the encoder neural network to generate a training encoder output that comprises a plurality of training encoded vectors, each training encoded vector corresponding to a different latent variable in a sequence of a plurality of latent variables; selecting, for each latent variable and from a plurality of current content latent embedding vectors currently stored in the memory, a current latent embedding vector that is nearest to the training encoded vector for the latent variable; generating a training speaker vector by combining at least the plurality of training encoded vectors in the training encoder output into a single vector; and selecting, from a plurality of current speaker latent embedding vectors currently stored in the memory, a current speaker latent embedding vector that is nearest to the training speaker vector that is generated by combining at least the plurality of training encoded vectors in the training encoder output generated by the encoder neural network into a single vector; generating a training decoder input that includes the nearest current content latent embedding vectors and the nearest current speaker latent embedding vector; processing the training decoder input through the decoder neural network in accordance with current values of the decoder network parameters of the decoder neural network to generate a training reconstruction of the training audio input; determining a reconstruction update to the current values of the decoder network parameters and the encoder network parameters by determining a gradient with respect to the current values of the decoder network parameters and the encoder network parameters to optimize a reconstruction error between the training reconstruction and the training audio input; and updating the current content latent embedding vectors and the current speaker latent embedding vectors based on the training speaker vector and the plurality of training encoded vectors in the training encoder output.

11. The method of claim 10, wherein updating the current content latent embedding vectors and the current speaker latent embedding vectors comprises: for each latent variable, determining an update to the nearest current content latent embedding vector for the latent variable by determining a gradient with respect to the nearest current latent embedding vector to minimize an error between the training encoded vector for the latent variable and the nearest current content latent embedding vector to the training encoded vector for the latent variable.

12. The method of claim 10, wherein up
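The embedding update described in claim 11 can be sketched as a single gradient step that moves the nearest content embedding vector toward the training encoded vector. This is a minimal illustration under stated assumptions: the error is taken to be the squared Euclidean distance, the step is plain gradient descent, and `learning_rate`, the function names, and the toy values are all hypothetical. The claim text does not specify the error function or the optimizer, and the encoder and decoder parameter updates of claim 10 are not modeled here.

```python
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def squared_error(encoded: Vector, embedding: Vector) -> float:
    # Assumed error: squared Euclidean distance between the two vectors.
    return sum((z - e) ** 2 for z, e in zip(encoded, embedding))


def nearest_index(encoded: Vector, codebook: Sequence[Vector]) -> int:
    # Index of the current embedding vector nearest to the encoded vector.
    return min(range(len(codebook)),
               key=lambda i: squared_error(encoded, codebook[i]))


def update_nearest_embedding(encoded: Vector,
                             codebook: Sequence[Vector],
                             learning_rate: float = 0.25
                             ) -> Tuple[int, List[List[float]]]:
    idx = nearest_index(encoded, codebook)
    # Gradient of ||z - e||^2 with respect to e is 2 * (e - z).
    gradient = [2.0 * (e - z) for z, e in zip(encoded, codebook[idx])]
    # Copy the codebook so the caller's data is left untouched.
    updated = [list(row) for row in codebook]
    # Gradient descent step: only the nearest entry changes.
    updated[idx] = [e - learning_rate * g
                    for e, g in zip(codebook[idx], gradient)]
    return idx, updated


# Toy data (hypothetical): two embedding vectors and one training encoded vector.
codebook = [[0.0, 0.0], [1.0, 1.0]]
encoded = [0.8, 1.4]

idx, new_codebook = update_nearest_embedding(encoded, codebook)
print(idx, new_codebook[idx])
print(squared_error(encoded, codebook[idx]), squared_error(encoded, new_codebook[idx]))
```

Only the selected entry moves, and its error against the encoded vector shrinks after the step. The same kind of step could plausibly be applied to the nearest speaker embedding vector using the training speaker vector, although the visible claim text is cut off before any further detail.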

Assignees

Deepmind Tech Ltd

Inventors

Classifications

G10L25/30 · Primary
using neural networks · CPC title
G06N3/045
Combinations of networks · CPC title
G06N3/09
Supervised learning · CPC title
G06N3/0464
Convolutional networks [CNN, ConvNet] · CPC title
G06N3/0455
Auto-encoder networks; Encoder-decoder networks · CPC title

Patent family

Related publications grouped by family.

Patent family 69177166

External sources

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US11257507B2 cover?: Methods, systems, and apparatus, including computer programs encoded on computer storage media, for generating discrete latent representations of input audio data. Only the discrete latent representation needs to be transmitted from an encoder system to a decoder system in order for the decoder system to be able to effectively decode, i.e., reconstruct, the input audio data.
Who is the assignee on this patent?: Deepmind Tech Ltd

Related patents

Systems and methods for speech separation and neural decoding of attentional selection in multi-speaker environments

Speech recognition system and method using an adaptive incremental learning approach

Extracting gradient features from neural networks
