Speech coding using content latent embedding vectors and speaker latent embedding vectors

US11257507B2 · US · B2

Patent metadata
Publication number: US-11257507-B2
Application number: US-202016746703-A
Country: US
Kind code: B2
Filing date: Jan 17, 2020
Priority date: Jan 17, 2019
Publication date: Feb 22, 2022
Grant date: Feb 22, 2022

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

Methods, systems, and apparatus, including computer programs encoded on computer storage media, for generating discrete latent representations of input audio data. Only the discrete latent representation needs to be transmitted from an encoder system to a decoder system in order for the decoder system to be able to effectively decode, i.e., reconstruct, the input audio data.
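
As a rough illustration of why transmitting only the discrete latent representation can suffice, the sketch below assumes that the encoder system and the decoder system hold the same set of latent embedding vectors, so that integer indices alone identify the vectors the decoder needs. The toy codebook values, the squared Euclidean distance, and the helper names (`nearest_index`, `quantize`, `lookup`) are illustrative assumptions and are not taken from the patent.

```python
from typing import List, Sequence

Vector = Sequence[float]


def squared_distance(a: Vector, b: Vector) -> float:
    """Squared Euclidean distance (an assumed notion of 'nearest')."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest_index(vector: Vector, codebook: Sequence[Vector]) -> int:
    """Index of the codebook entry closest to `vector`."""
    return min(range(len(codebook)),
               key=lambda i: squared_distance(vector, codebook[i]))


def quantize(encoded: Sequence[Vector], codebook: Sequence[Vector]) -> List[int]:
    """Encoder side: replace each encoded vector by a codebook index."""
    return [nearest_index(v, codebook) for v in encoded]


def lookup(indices: Sequence[int], codebook: Sequence[Vector]) -> List[List[float]]:
    """Decoder side: recover embedding vectors from the received indices."""
    return [list(codebook[i]) for i in indices]


# Hypothetical shared codebook and encoder output, for illustration only.
CONTENT_CODEBOOK = [[0.0, 0.0], [1.0, 1.0], [-1.0, 1.0], [2.0, -1.0]]
encoder_output = [[0.9, 1.2], [0.1, -0.2], [1.8, -0.7]]

indices = quantize(encoder_output, CONTENT_CODEBOOK)   # what gets transmitted
decoder_input = lookup(indices, CONTENT_CODEBOOK)      # rebuilt at the receiver

print(indices)        # [1, 0, 3]
print(decoder_input)  # [[1.0, 1.0], [0.0, 0.0], [2.0, -1.0]]
```

In this sketch three small integers stand in for three real-valued vectors, which is the sense in which only the discrete representation travels between the two systems.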

First claim

Opening claim text (preview).

What is claimed is:

1. A system comprising: a memory for storing: a set of content latent embedding vectors; and a set of speaker latent embedding vectors; one or more computers and one or more storage devices storing instructions that when executed by the one or more computers cause the one or more computers to implement: an encoder neural network configured to: receive input audio data; and process the input audio data to generate an encoder output that comprises a plurality of encoded vectors, each respective encoded vector of the plurality of encoded vectors corresponding to a respective different latent variable in a sequence of a plurality of latent variables; and a subsystem configured to: provide the input audio data as input to the encoder neural network to obtain the encoder output for the input audio data that comprises the plurality of encoded vectors that each correspond to a respective different latent variable in the sequence of the plurality of latent variables; and generate a discrete latent representation of the input audio data from the encoder output, comprising: for each of the latent variables in the sequence of latent variables, determining, from the set of content latent embedding vectors in the memory, a content latent embedding vector that is nearest to the encoded vector corresponding to the latent variable; generating a speaker vector by combining at least the plurality of encoded vectors in the encoder output generated by the encoder neural network into a single vector; and determining, from the set of speaker latent embedding vectors in the memory, a speaker latent embedding vector from the set of speaker latent embedding vectors stored in the memory that is nearest to the speaker vector that is generated by combining at least the plurality of encoded vectors in the encoder output generated by the encoder neural network into a single vector, wherein: the content latent embedding vectors in the set of content latent embedding vectors are learned during joint training of the encoder neural network and a decoder neural network; and the speaker latent embedding vectors in the set of speaker latent embedding vectors are learned during the joint training of the encoder neural network and the decoder neural network.

2. The system of claim 1, wherein the discrete latent representation of the input audio data includes (i) for each of the latent variables, an identifier of the nearest latent embedding vector to the encoded vector for the latent variable and (ii) an identifier of the speaker latent embedding vector that is nearest to the speaker vector.

3. The system of claim 1, wherein generating the speaker vector comprises: applying mean pooling over the encoder vectors to generate the single vector.

4. The system of claim 1, wherein the input audio data is a portion of an utterance, wherein the input audio data is preceded in the utterance by one or more other portions, and wherein generating the speaker vector comprises: applying mean pooling over the encoder vectors for the input audio data and encoder vectors generated for the one or more other portions of the utterance to generate the single vector.

5. The system of claim 1, wherein the encoder neural network is a convolutional neural network.

6. The system of claim 5, wherein the encoder neural network has a dilated convolutional architecture.

7. The system of claim 1, wherein the instructions further cause the one or more computers to implement: the decoder neural network, wherein the decoder neural network is configured to: receive a decoder input derived from the discrete latent representation of the input audio data, and process the decoder input to generate a reconstruction of the input audio data, and wherein the subsystem is further configured to: generate the decoder input, wherein the decoder input comprises, (i) for each of the latent variables, the content latent embedding vector that is nearest to the encoded vector for the latent variable in the encoder output and (ii) the speaker latent embedding vector that is nearest to the speaker vector, and provide the decoder input as input to the decoder neural network to obtain the reconstruction of the input audio data.

8. The system of claim 7, wherein the decoder neural network is an auto-regressive convolutional neural network that is configured to auto-regressively generate the reconstruction conditioned on the decoder input.

9. The system of claim 7, wherein the reconstruction of the audio input data is a predicted companded and quantized representation of the audio input data.

10. A method of training an encoder neural network having a plurality of encoder network parameters and a decoder neural network having a plurality of decoder network parameters and of updating a set of content latent embedding vectors and a set of speaker latent embedding vectors, the method comprising: receiving a training audio input; processing the training audio input through the encoder neural network in accordance with current values of the encoder network parameters of the encoder neural network to generate a training encoder output that comprises a plurality of training encoded vectors, each training encoded vector corresponding to a different latent variable in a sequence of a plurality of latent variables; selecting, for each latent variable and from a plurality of current content latent embedding vectors currently stored in the memory, a current latent embedding vector that is nearest to the training encoded vector for the latent variable; generating a training speaker vector by combining at least the plurality of training encoded vectors in the training encoder output into a single vector; and selecting, from a plurality of current speaker latent embedding vectors currently stored in the memory, a current speaker latent embedding vector that is nearest to the training speaker vector that is generated by combining at least the plurality of training encoded vectors in the training encoder output generated by the encoder neural network into a single vector; generating a training decoder input that includes the nearest current content latent embedding vectors and the nearest current speaker latent embedding vector; processing the training decoder input through the decoder neural network in accordance with current values of the decoder network parameters of the decoder neural network to generate a training reconstruction of the training audio input; determining a reconstruction update to the current values of the decoder network parameters and the encoder network parameters by determining a gradient with respect to the current values of the decoder network parameters and the encoder network parameters to optimize a reconstruction error between the training reconstruction and the training audio input; and updating the current content latent embedding vectors and the current speaker latent embedding vectors based on the training speaker vector and the plurality of training encoded vectors in the training encoder output.

11. The method of claim 10, wherein updating the current content latent embedding vectors and the current speaker latent embedding vectors comprises: for each latent variable, determining an update to the nearest current content latent embedding vector for the latent variable by determining a gradient with respect to the nearest current latent embedding vector to minimize an error between the training encoded vector for the latent variable and the nearest current content latent embedding vector to the training encoded vector for the latent variable.

12. The method of claim 10, wherein up
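
The quantization steps recited in claims 1 to 3, and the embedding update of claim 11, can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the encoder neural network is replaced by a hard-coded list of encoded vectors, "nearest" is taken to mean smallest squared Euclidean distance, the error in claim 11 is assumed to be a squared error, and the codebook values, the learning rate, and all function names are hypothetical.

```python
from typing import Dict, List, Sequence

Vector = Sequence[float]


def squared_distance(a: Vector, b: Vector) -> float:
    """Squared Euclidean distance (an assumed notion of 'nearest')."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest_index(vector: Vector, codebook: Sequence[Vector]) -> int:
    """Index of the codebook entry closest to `vector`."""
    return min(range(len(codebook)),
               key=lambda i: squared_distance(vector, codebook[i]))


def mean_pool(vectors: Sequence[Vector]) -> List[float]:
    """Combine the encoded vectors into a single vector (claim 3)."""
    count = len(vectors)
    return [sum(column) / count for column in zip(*vectors)]


def discrete_representation(encoded: Sequence[Vector],
                            content_codebook: Sequence[Vector],
                            speaker_codebook: Sequence[Vector]) -> Dict[str, object]:
    """One content identifier per latent variable plus one speaker identifier (claim 2)."""
    content_ids = [nearest_index(v, content_codebook) for v in encoded]
    speaker_id = nearest_index(mean_pool(encoded), speaker_codebook)
    return {"content_ids": content_ids, "speaker_id": speaker_id}


def codebook_gradient_step(embedding: Vector, encoded_vector: Vector,
                           learning_rate: float) -> List[float]:
    """One gradient step on ||e - z||^2 with respect to the embedding e.

    The gradient is 2 * (e - z), so the step moves e toward the encoded
    vector z. The squared error and plain gradient descent are assumptions.
    """
    return [e - learning_rate * 2.0 * (e - z)
            for e, z in zip(embedding, encoded_vector)]


# Hypothetical codebooks and encoder output, for illustration only.
content_codebook = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
speaker_codebook = [[-1.0, -1.0], [0.5, 0.4], [2.0, 2.0]]
encoded = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]

representation = discrete_representation(encoded, content_codebook, speaker_codebook)
print(representation)  # {'content_ids': [1, 2, 3, 0], 'speaker_id': 1}

updated = codebook_gradient_step([1.0, 1.0], [0.0, 2.0], learning_rate=0.25)
print(updated)  # [0.5, 1.5]
```

The speaker identifier is computed once for the whole input, while the content identifiers vary per latent variable, which mirrors the split between the two sets of embedding vectors in the claims.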

Assignees

Deepmind Tech Ltd

Inventors

Classifications

  • G10L25/30 (Primary)

    using neural networks · CPC title

  • Combinations of networks · CPC title

  • Supervised learning · CPC title

  • Convolutional networks [CNN, ConvNet] · CPC title

  • Auto-encoder networks; Encoder-decoder networks · CPC title

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US11257507B2 cover?
Methods, systems, and apparatus, including computer programs encoded on computer storage media, for generating discrete latent representations of input audio data. Only the discrete latent representation needs to be transmitted from an encoder system to a decoder system in order for the decoder system to be able to effectively decode, i.e., reconstruct, the input audio data.
Who is the assignee on this patent?
Deepmind Tech Ltd
What technology area does this patent fall under?
Primary CPC classification G10L25/30. Mapped technology areas include Physics.
When was this patent published?
Publication date Feb 22, 2022 (kind code B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 3 related publications on this page (citations in our corpus or others sharing the same primary CPC).