Methods and systems for end-to-end speech separation with unfolded iterative phase reconstruction

US10529349B2 · US · B2

Patent metadata
Publication number: US-10529349-B2
Application number: US-201815983256-A
Country: US
Kind code: B2
Filing date: May 18, 2018
Priority date: Apr 16, 2018
Publication date: Jan 7, 2020
Grant date: Jan 7, 2020

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

Systems and methods for an audio signal processing system for transforming an input audio signal. A processor implements steps of a module by inputting an input audio signal into a spectrogram estimator to extract an audio feature sequence, and process the audio feature sequence to output a set of estimated spectrograms. Processing the set of estimated spectrograms and the audio feature sequence using a spectrogram refinement module, to output a set of refined spectrograms. Wherein the processing of the spectrogram refinement module is based on an iterative reconstruction algorithm. Processing the set of refined spectrograms for the one or more target audio signals using a signal refinement module, to obtain the target audio signal estimates. An output interface to output the optimized target audio signal estimates. Wherein the module is optimized by minimizing an error using an optimizer stored in the memory.
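The abstract does not say which iterative reconstruction algorithm the spectrogram refinement module uses; the claims name Griffin-Lim as one option. The following is a minimal NumPy sketch of a Griffin-Lim style loop, not the patented implementation. The helper names (`stft`, `istft`, `griffin_lim`, `magnitude_error`), the periodic Hann window, the frame sizes, and the zero-phase initialization are all assumptions made for illustration.

```python
import numpy as np

def stft(x, win, hop):
    """Cut x into windowed frames and take a full-length FFT of each frame."""
    n = len(win)
    n_frames = 1 + (len(x) - n) // hop
    frames = np.stack([x[i * hop:i * hop + n] * win for i in range(n_frames)])
    return np.fft.fft(frames, axis=1)

def istft(X, win, hop, length):
    """Least-squares inverse: windowed overlap-add divided by the summed win**2."""
    n = len(win)
    frames = np.fft.ifft(X, axis=1).real * win
    y = np.zeros(length)
    norm = np.zeros(length)
    for i, frame in enumerate(frames):
        y[i * hop:i * hop + n] += frame
        norm[i * hop:i * hop + n] += win ** 2
    return y / np.maximum(norm, 1e-8)

def griffin_lim(magnitude, win, hop, length, n_iter=50):
    """Alternate between imposing the given magnitude and re-estimating phase."""
    phase = np.zeros_like(magnitude)  # assumed starting point: zero phase
    x = istft(magnitude * np.exp(1j * phase), win, hop, length)
    for _ in range(n_iter):
        phase = np.angle(stft(x, win, hop))
        x = istft(magnitude * np.exp(1j * phase), win, hop, length)
    return x

def magnitude_error(x, magnitude, win, hop):
    """Distance between the STFT magnitude of x and the target magnitude."""
    return np.linalg.norm(np.abs(stft(x, win, hop)) - magnitude)

# Toy data (hypothetical): two sinusoids, 64-sample frames, hop of 16.
n_fft, hop = 64, 16
win = np.hanning(n_fft + 1)[:-1]          # periodic Hann window
t = np.arange(n_fft + hop * 20)
target = np.sin(2 * np.pi * 0.05 * t) + 0.5 * np.sin(2 * np.pi * 0.13 * t)
A = np.abs(stft(target, win, hop))        # magnitude only, phase discarded

x0 = griffin_lim(A, win, hop, len(target), n_iter=0)
x30 = griffin_lim(A, win, hop, len(target), n_iter=30)
print(magnitude_error(x0, A, win, hop), magnitude_error(x30, A, win, hop))
```

Each pass keeps the estimated magnitude and replaces only the phase, so the quantity returned by `magnitude_error` should not grow from one iteration to the next in this setup.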

First claim

Opening claim text (preview).

What is claimed is:

1. An audio signal processing system for transforming an input audio signal, wherein the input audio signal includes a mixture of one or more target audio signals, the audio signal processing system comprising: a memory including stored executable instructions and a stored module, such that the stored module transforms the input audio signal to obtain target audio signal estimates; an input interface to receive the input audio signal, a processor in communication with the memory and the input interface, wherein the processor implements steps of the stored module by a spectrogram estimator of the stored module to extract an audio feature sequence from the input audio signal, and process the audio feature sequence to output a set of estimated spectrograms, wherein the set of estimated spectrograms includes an estimated spectrogram for each target audio signal; a spectrogram refinement module of the stored module to process the set of estimated spectrograms and the audio feature sequence, to output a set of refined spectrograms, such that the set of refined spectrograms includes a refined spectrogram for each target audio signal, and wherein using the spectrogram refinement module is based on an iterative reconstruction algorithm; a signal refinement module of the stored module to process the set of refined spectrograms for the one or more target audio signals, to obtain target audio signal estimates, such that there is a target audio signal estimate for each target audio signal; and an output interface to output the target audio signal estimates, wherein parameters of the stored module are trained using training data by minimizing an error using an optimizer stored in the memory, wherein the error includes one or more of an error on the set of refined spectrograms, an error including a consistency measurement on the set of refined spectrograms, or an error on the target audio signal estimates.

2. The audio signal processing system of claim 1, wherein the spectrogram estimator uses a deep neural network.

3. The audio signal processing system of claim 1, wherein the spectrogram estimator includes a mask estimation module which outputs a mask estimate value for each target audio signal, and a spectrogram estimate output module which uses the mask estimate value for the one or more target audio signals and the input audio signal, to output the estimated spectrogram for each target audio signal.

4. The audio signal processing system of claim 3, wherein at least one mask estimate value is greater than 1.

5. The audio signal processing system of claim 1, wherein the spectrogram refinement module comprises: defining an iterative procedure acting on the set of estimated spectrograms and the input audio feature sequence; unfolding the iterative procedure into a set of layers, such that there is one layer for each iteration of the iterative procedure, and wherein each layer includes a set of fixed network parameters; forming a neural network using fixed network parameters from the sets of fixed network parameters of layers of previous iterations, as variables to be trained, and untying these variables across the layers of previous iterations, by using the variables as separate variables as each variable is separately applicable to their corresponding layer; training the neural network to obtain a trained neural network; and transforming the set of estimated spectrograms and the audio feature sequence using the trained neural network to obtain the set of refined spectrograms.

6. The audio signal processing system of claim 1, wherein the iterative reconstruction algorithm is an iterative phase reconstruction algorithm.

7. The audio signal processing system of claim 6, wherein the iterative phase reconstruction algorithm is the Multiple Input Spectrogram Inversion (MISI) algorithm.

8. The audio signal processing system of claim 6, wherein the iterative phase reconstruction algorithm is the Griffin-Lim algorithm.

9. The audio signal processing system of claim 1, wherein the error on the target audio signal estimates includes a distance between the target audio signal estimates and reference target audio signals.

10. The audio signal processing system of claim 1, wherein the error on the target audio signal estimates includes a distance between the estimated spectrograms of target audio signal and the refined spectrograms of the target audio signals.

11. The audio signal processing system of claim 1, wherein the spectrogram estimator includes a feature extraction module, such that the feature extraction module extracts the input audio signal from the input audio signal.

12. The audio signal processing system of claim 1, wherein a received audio signal includes one or more of one or more speakers, noise, music, environmental sounds, machine sound.

13. The audio signal processing system of claim 1, wherein the error further includes an error on the set of estimated spectrograms.

14. A method for transforming input audio signals, comprising the steps of: using a module for transforming an input audio signal of the input audio signals, such that the input audio signal includes a mixture of one or more target audio signals, wherein the module transforms the input audio signal, to obtain target audio signal estimates; using a spectrogram estimator of the model, to extract an audio feature sequence from the input audio signal, and process the audio feature sequence to output a set of estimated spectrograms, wherein the set of estimated spectrograms includes an estimated spectrogram for each target audio signal; using a spectrogram refinement module of the module to process the set of estimated spectrograms and the audio feature sequence, to output a set of refined spectrograms, such that the set of refined spectrograms includes a refined spectrogram for each target audio signal, and wherein using the spectrogram refinement module is based on an iterative reconstruction algorithm; using a signal refinement module of the module to process the set of refined spectrograms for the one or more target audio signals, to obtain target audio signal estimates, such that there is a target audio signal estimate for each target audio signal; and outputting the target audio signal estimates, wherein parameters of the stored module are trained using training data by minimizing an error using an optimizer stored in a memory, wherein the error includes one or more of an error on the set of refined spectrograms, an error including a consistency measurement on the set of refined spectrograms, or an error on the target audio signal estimates, and wherein the steps are performed by a processor in communication with an output device and the memory having stored executable instructions, such that the module is stored in the memory.

15. The method of claim 14, wherein the spectrogram estimator includes a mask estimation module which outputs a mask estimate value for each target audio signal, and a spectrogram estimate output module which uses the mask estimate value for the one or more target audio signals and the input audio signal, to output the estimated spectrogram for each target audio signal, wherein at least one mask estimate value is greater than 1.

16. The method of claim 14, wherein the processing of the spectrogram refinement module comprises: defining an iterative procedure acting on the set of estimated spectrograms and the input audio feature sequence; unfolding the iterative procedure into a set of layers, such that there is one layer for each iteration of the iterative procedure, and
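The claims name MISI as one iterative phase reconstruction algorithm and describe unfolding its iterations into layers whose parameters may be untied. The sketch below is one possible NumPy reading of that idea, not the patented implementation. It runs one layer per MISI iteration: each source is resynthesized, the mismatch with the mixture is shared among the sources, and the phases are re-estimated. All helper names are hypothetical, and `layer_weights` (the share of the mixture residual given to each source in each layer) is an assumed stand-in for a per-layer parameter; plain MISI uses 1/C for every source. The `waveform_l1` distance is likewise only one assumed choice for an error on the target audio signal estimates.

```python
import numpy as np

def stft(x, win, hop):
    """Cut x into windowed frames and take a full-length FFT of each frame."""
    n = len(win)
    n_frames = 1 + (len(x) - n) // hop
    frames = np.stack([x[i * hop:i * hop + n] * win for i in range(n_frames)])
    return np.fft.fft(frames, axis=1)

def istft(X, win, hop, length):
    """Windowed overlap-add, normalized by the summed squared window."""
    n = len(win)
    frames = np.fft.ifft(X, axis=1).real * win
    y = np.zeros(length)
    norm = np.zeros(length)
    for i, frame in enumerate(frames):
        y[i * hop:i * hop + n] += frame
        norm[i * hop:i * hop + n] += win ** 2
    return y / np.maximum(norm, 1e-8)

def misi_unfolded(mixture, magnitudes, win, hop, layer_weights):
    """One layer per MISI iteration; layer_weights[k][c] is a hypothetical
    per-layer share of the mixture residual assigned to source c."""
    length = len(mixture)
    mix_phase = np.angle(stft(mixture, win, hop))
    phases = [mix_phase for _ in magnitudes]   # start from the mixture phase
    for weights in layer_weights:
        signals = [istft(A * np.exp(1j * p), win, hop, length)
                   for A, p in zip(magnitudes, phases)]
        residual = mixture - sum(signals)      # what the estimates fail to explain
        phases = [np.angle(stft(s + w * residual, win, hop))
                  for s, w in zip(signals, weights)]
    return [istft(A * np.exp(1j * p), win, hop, length)
            for A, p in zip(magnitudes, phases)]

def waveform_l1(estimates, references):
    """Assumed example of an error on the signal estimates: mean absolute difference."""
    return float(np.mean([np.mean(np.abs(e - r))
                          for e, r in zip(estimates, references)]))

# Toy data (hypothetical): two sinusoidal "sources" and their mixture.
n_fft, hop = 64, 16
win = np.hanning(n_fft + 1)[:-1]               # periodic Hann window
t = np.arange(n_fft + hop * 20)
sources = [np.sin(2 * np.pi * 0.05 * t), 0.5 * np.sin(2 * np.pi * 0.21 * t)]
mixture = sources[0] + sources[1]

# Oracle magnitudes stand in for the spectrogram estimator's output.
magnitudes = [np.abs(stft(s, win, hop)) for s in sources]
layer_weights = np.full((3, len(sources)), 1.0 / len(sources))  # 3 layers, plain MISI

estimates = misi_unfolded(mixture, magnitudes, win, hop, layer_weights)
print(waveform_l1(estimates, sources))
```

Using the true source magnitudes corresponds to a mask |S_c| / |X| that is not capped at 1, which is the situation the claims on mask estimate values greater than 1 allow for. In a trained system the rows of `layer_weights` would be learned separately per layer rather than fixed as they are here.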

Assignees

Mitsubishi Electric Res Laboratories Inc

Inventors

Classifications

  • Backpropagation, e.g. using gradient descent · CPC title

  • Activation functions · CPC title

  • Recurrent networks, e.g. Hopfield networks · CPC title

  • the extracted parameters being spectral information of each sub-band · CPC title

  • using neural networks · CPC title

Patent family

Related publications grouped by family.

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US10529349B2 cover?
Systems and methods for an audio signal processing system for transforming an input audio signal. A processor implements steps of a module by inputting an input audio signal into a spectrogram estimator to extract an audio feature sequence, and process the audio feature sequence to output a set of estimated spectrograms. Processing the set of estimated spectrograms and the audio feature sequenc…
Who is the assignee on this patent?
Mitsubishi Electric Res Laboratories Inc
What technology area does this patent fall under?
Primary CPC classification G10L19/06. Mapped technology areas include Physics.
When was this patent published?
Publication date: Jan 7, 2020 (B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 1 related publication on this page (citations in our corpus or others sharing the same primary CPC).