Systems and methods for speech separation and neural decoding of attentional selection in multi-speaker environments
US-2019066713-A1 · Feb 28, 2019 · US
US11257507B2 · US · B2
| Field | Value |
|---|---|
| Publication number | US-11257507-B2 |
| Application number | US-202016746703-A |
| Country | US |
| Kind code | B2 |
| Filing date | Jan 17, 2020 |
| Priority date | Jan 17, 2019 |
| Publication date | Feb 22, 2022 |
| Grant date | Feb 22, 2022 |
Official abstract text for this publication.
Methods, systems, and apparatus, including computer programs encoded on computer storage media, for generating discrete latent representations of input audio data. Only the discrete latent representation needs to be transmitted from an encoder system to a decoder system in order for the decoder system to be able to effectively decode, i.e., reconstruct, the input audio data.
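The idea in the abstract, that only a discrete representation has to cross from encoder to decoder, can be illustrated with a minimal sketch. It assumes squared Euclidean distance as the "nearest" metric and uses a tiny hand-written codebook. The names `nearest_ids` and `lookup` and all array values are hypothetical and do not come from the patent.

```python
import numpy as np

def nearest_ids(encoded, codebook):
    """For each encoded vector (row), return the index of the nearest codebook row.

    Assumption: "nearest" means smallest squared Euclidean distance.
    """
    diffs = encoded[:, None, :] - codebook[None, :, :]   # shape (T, K, D)
    dists = np.sum(diffs ** 2, axis=-1)                  # shape (T, K)
    return np.argmin(dists, axis=1)                      # shape (T,)

def lookup(ids, codebook):
    """Decoder side: recover embedding vectors from the transmitted identifiers."""
    return codebook[ids]

# Illustrative content codebook shared by encoder and decoder (K=4, D=2).
content_codebook = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])

# Stand-in for the encoder output: one encoded vector per latent variable.
encoded = np.array([[0.9, 0.1], [0.1, 0.8], [0.2, 0.1]])

content_ids = nearest_ids(encoded, content_codebook)     # integers to transmit
decoder_vectors = lookup(content_ids, content_codebook)  # rebuilt at the decoder
print(content_ids.tolist())  # [1, 2, 0]
```

In this sketch only `content_ids` (plus, per the claims, one speaker identifier) would need to be sent, since both sides hold the same codebook.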
Opening claim text (preview).
What is claimed is:
1. A system comprising: a memory for storing: a set of content latent embedding vectors; and a set of speaker latent embedding vectors; one or more computers and one or more storage devices storing instructions that when executed by the one or more computers cause the one or more computers to implement: an encoder neural network configured to: receive input audio data; and process the input audio data to generate an encoder output that comprises a plurality of encoded vectors, each respective encoded vector of the plurality of encoded vectors corresponding to a respective different latent variable in a sequence of a plurality of latent variables; and a subsystem configured to: provide the input audio data as input to the encoder neural network to obtain the encoder output for the input audio data that comprises the plurality of encoded vectors that each correspond to a respective different latent variable in the sequence of the plurality of latent variables; and generate a discrete latent representation of the input audio data from the encoder output, comprising: for each of the latent variables in the sequence of latent variables, determining, from the set of content latent embedding vectors in the memory, a content latent embedding vector that is nearest to the encoded vector corresponding to the latent variable; generating a speaker vector by combining at least the plurality of encoded vectors in the encoder output generated by the encoder neural network into a single vector; and determining, from the set of speaker latent embedding vectors in the memory, a speaker latent embedding vector from the set of speaker latent embedding vectors stored in the memory that is nearest to the speaker vector that is generated by combining at least the plurality of encoded vectors in the encoder output generated by the encoder neural network into a single vector, wherein: the content latent embedding vectors in the set of content latent embedding vectors are learned during joint training of the encoder neural network and a decoder neural network; and the speaker latent embedding vectors in the set of speaker latent embedding vectors are learned during the joint training of the encoder neural network and the decoder neural network.
2. The system of claim 1, wherein the discrete latent representation of the input audio data includes (i) for each of the latent variables, an identifier of the nearest latent embedding vector to the encoded vector for the latent variable and (ii) an identifier of the speaker latent embedding vector that is nearest to the speaker vector.
3. The system of claim 1, wherein generating the speaker vector comprises: applying mean pooling over the encoder vectors to generate the single vector.
4. The system of claim 1, wherein the input audio data is a portion of an utterance, wherein the input audio data is preceded in the utterance by one or more other portions, and wherein generating the speaker vector comprises: applying mean pooling over the encoder vectors for the input audio data and encoder vectors generated for the one or more other portions of the utterance to generate the single vector.
5. The system of claim 1, wherein the encoder neural network is a convolutional neural network.
6. The system of claim 5, wherein the encoder neural network has a dilated convolutional architecture.
7. The system of claim 1, wherein the instructions further cause the one or more computers to implement: the decoder neural network, wherein the decoder neural network is configured to: receive a decoder input derived from the discrete latent representation of the input audio data, and process the decoder input to generate a reconstruction of the input audio data, and wherein the subsystem is further configured to: generate the decoder input, wherein the decoder input comprises, (i) for each of the latent variables, the content latent embedding vector that is nearest to the encoded vector for the latent variable in the encoder output and (ii) the speaker latent embedding vector that is nearest to the speaker vector, and provide the decoder input as input to the decoder neural network to obtain the reconstruction of the input audio data.
8. The system of claim 7, wherein the decoder neural network is an auto-regressive convolutional neural network that is configured to auto-regressively generate the reconstruction conditioned on the decoder input.
9. The system of claim 7, wherein the reconstruction of the audio input data is a predicted companded and quantized representation of the audio input data.
10. A method of training an encoder neural network having a plurality of encoder network parameters and a decoder neural network having a plurality of decoder network parameters and of updating a set of content latent embedding vectors and a set of speaker latent embedding vectors, the method comprising: receiving a training audio input; processing the training audio input through the encoder neural network in accordance with current values of the encoder network parameters of the encoder neural network to generate a training encoder output that comprises a plurality of training encoded vectors, each training encoded vector corresponding to a different latent variable in a sequence of a plurality of for latent variables; selecting, for each latent variable and from a plurality of current content latent embedding vectors currently stored in the memory, a current latent embedding vector that is nearest to the training encoded vector for the latent variable; generating a training speaker vector by combining at least the plurality of training encoded vectors in the training encoder output into a single vector; and selecting, from a plurality of current speaker latent embedding vectors currently stored in the memory, a current speaker latent embedding vector that is nearest to the training speaker vector that is generated by combining at least the plurality of training encoded vectors in the training encoder output generated by the encoder neural network into a single vector; generating a training decoder input that includes the nearest current content latent embedding vectors and the nearest current speaker latent embedding vector; processing the training decoder input through the decoder neural network in accordance with current values of the decoder network parameters of the decoder neural network to generate a training reconstruction of the training audio input; determining a reconstruction update to the current values of the decoder network parameters and the encoder network parameters by determining a gradient with respect to the current values of the decoder network parameters and the encoder network parameters to optimize a reconstruction error between the training reconstruction and the training audio input; and updating the current content latent embedding vectors and the current speaker latent embedding vectors based on the training speaker vector and the plurality of training encoded vectors in the training encoder output.
11. The method of claim 10, wherein updating the current content latent embedding vectors and the current speaker latent embedding vectors comprises: for each latent variable, determining an update to the nearest current content latent embedding vector for the latent variable by determining a gradient with respect to the nearest current latent embedding vector to minimize an error between the training encoded vector for the latent variable and the nearest current content latent embedding vector to the training encoded vector for the latent variable.
12. The method of claim 10, wherein up
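Two steps recited in the claims can be sketched in a few lines: the mean-pooled speaker vector of claims 3 and 4, and the content-embedding update of claim 11. This is an illustration under stated assumptions, not the patented implementation: the error is taken to be squared Euclidean distance, the update is a plain gradient step with a hypothetical learning rate `lr`, and every function name and array value is invented for the example.

```python
import numpy as np

def speaker_vector(encoded, previous=None):
    """Mean-pool encoded vectors into a single vector (claim 3).

    If `previous` holds encoded vectors from earlier portions of the same
    utterance, they are pooled together with the current ones (claim 4).
    """
    if previous is not None:
        encoded = np.concatenate([previous, encoded], axis=0)
    return encoded.mean(axis=0)

def nearest_id(vector, codebook):
    """Index of the codebook row nearest to `vector` (assumed: squared Euclidean)."""
    return int(np.argmin(np.sum((codebook - vector) ** 2, axis=1)))

def embedding_step(codebook, idx, target, lr=0.25):
    """One gradient step on ||target - codebook[idx]||^2 with respect to the
    selected embedding only. `lr` is a hypothetical hyperparameter."""
    updated = codebook.copy()
    grad = -2.0 * (target - updated[idx])
    updated[idx] = updated[idx] - lr * grad
    return updated

# Speaker side: pool the encoder output, then pick the nearest speaker embedding.
speaker_codebook = np.array([[0.0, 0.0], [1.0, 1.0]])
encoded = np.array([[0.8, 1.0], [1.0, 0.6]])
spk = speaker_vector(encoded)                  # approximately [0.9, 0.8]
spk_id = nearest_id(spk, speaker_codebook)     # 1

# Content side: move the selected content embedding toward its encoded vector.
content_codebook = np.array([[0.0, 0.0], [1.0, 0.0]])
z = np.array([0.8, 0.2])
idx = nearest_id(z, content_codebook)          # 1
new_codebook = embedding_step(content_codebook, idx, z)
```

With `lr=0.25` the selected embedding moves halfway toward the encoded vector, so in this example the row `[1.0, 0.0]` becomes approximately `[0.9, 0.1]` while the other row is unchanged.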
using neural networks · CPC title
Combinations of networks · CPC title
Supervised learning · CPC title
Convolutional networks [CNN, ConvNet] · CPC title
Auto-encoder networks; Encoder-decoder networks · CPC title