Multichannel raw-waveform neural networks
US-2017092265-A1 · Mar 30, 2017 · US
US10249305B2 · US · B2
| Field | Value |
|---|---|
| Publication number | US-10249305-B2 |
| Application number | US-201615226527-A |
| Country | US |
| Kind code | B2 |
| Filing date | Aug 2, 2016 |
| Priority date | May 19, 2016 |
| Publication date | Apr 2, 2019 |
| Grant date | Apr 2, 2019 |
Official abstract text for this publication.
The techniques described herein improve methods to equip a computing device to conduct automatic speech recognition (“ASR”) in talker-independent multi-talker scenarios. In some examples, permutation invariant training of deep learning models can be used for talker-independent multi-talker scenarios. In some examples, the techniques can determine a permutation-considered assignment between a model's estimate of a source signal and the source signal. In some examples, the techniques can include training the model generating the estimate to minimize a deviation of the permutation-considered assignment. These techniques can be implemented into a neural network's structure itself, solving the label permutation problem that prevented making progress on deep learning based techniques for speech separation. The techniques discussed herein can also include source tracing to trace streams originating from a same source through the frames of a mixed signal.
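The permutation-considered assignment described in the abstract can be illustrated with a minimal sketch. It assumes each model output and each reference source is a plain list of numbers and uses mean squared error as the deviation, as claim 3 mentions. The function names `mse` and `best_assignment` and the toy data are hypothetical and do not come from the patent.

```python
from itertools import permutations


def mse(a, b):
    """Mean squared error between two equal-length sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def best_assignment(outputs, targets):
    """Find the output-to-target assignment with the lowest total deviation.

    Returns (order, score), where order[i] is the index of the target
    assigned to outputs[i] and score is the summed pairwise deviation.
    """
    n = len(outputs)
    # Pairwise deviations between every output and every target.
    pairwise = [[mse(o, t) for t in targets] for o in outputs]
    best_order, best_score = None, float("inf")
    # Try every possible assignment and keep the one with the minimum total.
    for order in permutations(range(n)):
        total = sum(pairwise[i][order[i]] for i in range(n))
        if total < best_score:
            best_order, best_score = order, total
    return best_order, best_score


# Toy data (assumed): two estimates that match the references in swapped order.
outputs = [[0.9, 0.1, 0.8], [0.1, 1.0, 0.2]]
targets = [[0.0, 1.0, 0.0], [1.0, 0.0, 1.0]]
order, score = best_assignment(outputs, targets)
```

In a training loop, the score of the selected assignment would serve as the loss to minimize. The exhaustive search grows factorially with the number of sources, so this sketch is only practical for a small number of talkers.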
Opening claim text (preview).
What is claimed is:

1. A method of separating two or more audio source signals from a first mixed signal having audio source signals and noise source signals, the method comprising: generating output layers from a second mixed signal, the output layers being estimates of audio source signals in the second mixed signal; generating a plurality of labels, wherein a total number of the plurality of labels is equal to a total number of the output layers; iteratively assigning the plurality of labels to the output layers for possible combinations of labels and output layers to create a set of possible assignments, each possible assignment in the set of possible assignments corresponding to a combination of labels and output layers; obtaining a plurality of spatially filtered signals, wherein a total number of spatially filtered signals is equal to the total number of the plurality of labels; determining assignment error scores for each of the set of possible assignments, the assignment error scores determined based at least in part on a difference between labels of the plurality of labels for respective output layers for a respective possible assignment and the plurality of spatially filtered signals; determining an assignment order of labels to be assigned to the output layers, individual labels being associated with individual audio source signals and the assignment order being based, at least in part, on a minimum total deviation score between individual output layers and the individual audio source signals, wherein the minimum total deviation score is a lowest assignment error score of the assignment error scores; generating a set of masks by iteratively optimizing model parameters of the model to minimize the minimum total deviation score of the determined assignment order; and generating the two or more audio source signals from the first mixed signal by using the set of masks, the source of the two or more audio source signals being different from a source of the audio source signals in the second mixed signal.

2. A method as claim 1 recites, wherein determining the assignment order of the labels includes: calculating a set of pairwise deviations between the individual output layers and the audio source signals; calculating total deviation scores for possible assignment orders, a total deviation score for a possible assignment order including a summation of the pairwise deviations between respective pairs of the individual output layers and the individual audio source signals to which the individual output layers correspond according to the possible assignment order; and selecting, from the possible assignment orders, the assignment order based at least in part on a total deviation score associated with the assignment order being a minimum total deviation score among the total deviation scores.

3. A method as claim 2 recites, the total deviation scores for an assignment order including a total mean squared error between individual output sources and the individual audio source signals with which the individual output audio sources are associated according to the assignment order.

4. A method as claim 1 recites, wherein assigning an individual label to an individual output layer attributes the individual output layer to a source of an individual audio source signal of the audio source signals.

5. A method as claim 1 recites, wherein the model obtains the output layers using two or more frames of the mixed signal or two or more frames of a feature signal of the second mixed signal.

6. A method as claim 1 recites, further comprising: shifting a current window of the second mixed signal by one or more frames to obtain an adjacent window, wherein the adjacent window and the current window have overlapping frames; and selecting an assignment order for the adjacent window based at least in part on the assignment order being associated with a minimum total deviation score.

7. A method as claim 1 recites, further comprising: selecting assignment orders for multiple windows of the second mixed signal, output layers, and audio source signals; recording the assignment orders for the multiple windows; and tracing, based at least in part on record of assignment orders for the multiple windows, a source signal attributable to a signal-creating audio source through multiple frames of the second mixed signal.

8. A method as claim 7 recites, wherein tracing the audio source signal attributable to a signal-creating audio source includes: identifying a subset of frames of the multiple frames of the second mixed signal that are included in windows having center frames associated with the audio source signal by respective assignment orders.

9. A method as claim 8 recites, further comprising: obtaining a first minimum total deviation associated with a first meta-frame of the output layers; obtaining a second minimum total deviation associated with a second meta-frame of the output layers; calculating a similarity score of an embedding of the output layers; and determining an assignment order for the first meta-frame or a center frame of the first meta-frame based at least in part on the first minimum total deviation or the second minimum total deviation and the similarity score.

10. A method as claim 1 recites, the output layers including: an estimate of a delta representation of a source signal, and; one or more of an estimate of a spectral magnitude of the source signal or an estimate of a spectrum of the source signal; and the method as claim 1 recites, further comprising: tracing, based at least in part on the estimate of the delta representation, a source attributable for the source signal through multiple frames of the second mixed signal.

11. A method as claim 1 recites, further comprising: estimating separated audio source signals based at least in part on assignment orders for multiple frames of the second mixed signal, output layers, and audio source signals, wherein estimating includes: for a signal source attributable to a first signal of the audio source signals, identifying a subset of frames of the multiple frames associated with the first signal, based on the respective permutation-considered assignment orders of the subset of frames; and associating the subset of frames with the signal audio source to obtain a separated signal audio source attributable to a source of the first audio signal.

12. A method as claim 1 recites, further comprising: spatially filtering, by a microphone array, the mixed signal to obtain the audio signal sources and to identify the signal-creating audio sources; and jointly optimizing the model based at least in part on the spatially filtered audio signal sources.

13. A system for separating two or more audio source signals from a first monaural signal having audio source signals and noise source signals, the system comprising: one or more processors; and a memory having stored thereon computer-executable instructions that, when executed by the one or more processors, configure the processors to: generate, from a window of frames of a second monaural signal, output layers comprising estimates of audio source signals attributable to disparate audio signal sources contributing to the second monaural signal; generate a plurality of labels, wherein a total number of the plurality of labels is equal to a total number of the disparate audio signal sources; iteratively assign the plurality of labels to the disparate audio signal sources for all possible combinations of labels and disparate audio signal sources to create a set of possible assignments, each possible assignment in the set
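Claims 6 through 8 describe recording an assignment order per window and tracing each source through the center frames of those windows. The sketch below is one possible reading of that idea, not the patented implementation. It assumes the per-window assignment orders are already recorded, and the function name `trace_sources`, the data layout, and the string placeholders standing in for frame estimates are all hypothetical.

```python
def trace_sources(center_frames, orders):
    """Group center-frame estimates into one stream per source.

    center_frames[w][i] is output layer i's estimate for the center frame
    of window w (assumed layout). orders[w][i] is the index of the source
    assigned to output layer i in window w.
    Returns streams, where streams[s] lists the estimates attributed to
    source s in window order.
    """
    n_sources = len(orders[0])
    streams = [[] for _ in range(n_sources)]
    for frames, order in zip(center_frames, orders):
        for layer, source in enumerate(order):
            # Attribute this layer's center frame to its assigned source.
            streams[source].append(frames[layer])
    return streams


# Toy data (assumed): the two output layers swap sources in the second window.
center_frames = [["a0", "b0"], ["b1", "a1"], ["a2", "b2"]]
orders = [(0, 1), (1, 0), (0, 1)]
streams = trace_sources(center_frames, orders)
```

With this toy input, each stream collects the frames of one source even though the output layers change roles between windows, which is the effect the recorded assignment orders are meant to achieve.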
Training, enrolment or model building · CPC title
Voice signal separating · CPC title
based on separation criteria, e.g. independent component analysis · CPC title
overcoming non-stationarity or permutations · CPC title
Artificial neural networks; Connectionist approaches · CPC title