Who is the assignee on this patent?

Microsoft Technology Licensing, LLC

What technology area does this patent fall under?

Primary CPC classification G10L17/04. Mapped technology areas include Physics.

When was this patent published?

Publication date Apr 2, 2019 (kind code B2). Legal status and post-grant events are not shown on this page.

What related patents are in patentsdb?

We list 3 related publications on this page (citations in our corpus or others sharing the same primary CPC).

Permutation invariant training for talker-independent multi-talker speech separation

US10249305B2 · US · B2

Patent metadata
Field	Value
Publication number	US-10249305-B2
Application number	US-201615226527-A
Country	US
Kind code	B2
Filing date	Aug 2, 2016
Priority date	May 19, 2016
Publication date	Apr 2, 2019
Grant date	Apr 2, 2019

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

Title: What the patent document calls the invention.
Abstract: A short plain-language summary of the technical disclosure.
Assignees and inventors: Who owns or filed the patent and who is credited as inventor.
Key dates: Filing, priority, publication, and grant dates set the timeline.
First independent claim: The legal scope of protection; read this for what is actually claimed.
CPC / IPC classifications: Technology tags used to group this patent with similar filings.
Citations and related patents: Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

The techniques described herein improve methods to equip a computing device to conduct automatic speech recognition (“ASR”) in talker-independent multi-talker scenarios. In some examples, permutation invariant training of deep learning models can be used for talker-independent multi-talker scenarios. In some examples, the techniques can determine a permutation-considered assignment between a model's estimate of a source signal and the source signal. In some examples, the techniques can include training the model generating the estimate to minimize a deviation of the permutation-considered assignment. These techniques can be implemented into a neural network's structure itself, solving the label permutation problem that prevented making progress on deep learning based techniques for speech separation. The techniques discussed herein can also include source tracing to trace streams originating from a same source through the frames of a mixed signal.
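As a rough illustration of the permutation-considered assignment the abstract describes, the sketch below scores every possible pairing of model outputs with reference source signals and keeps the pairing with the lowest total deviation. It is a minimal sketch, not the patented implementation: the function names `mse` and `best_assignment`, the use of mean squared error as the deviation, and the toy three-frame signals are all assumptions made for illustration. In training, the model parameters would then be updated to reduce the returned minimum score.

```python
from itertools import permutations


def mse(a, b):
    """Mean squared error between two equal-length sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def best_assignment(outputs, sources):
    """Pick the output-to-source pairing with the lowest total deviation.

    Returns (order, score), where order[i] is the index of the source
    assigned to outputs[i]. Names and structure are hypothetical.
    """
    n = len(outputs)
    # Pairwise deviations between every output and every source.
    pairwise = [[mse(o, s) for s in sources] for o in outputs]
    best_order, best_score = None, float("inf")
    # Exhaustively score all n! assignment orders.
    for order in permutations(range(n)):
        total = sum(pairwise[i][order[i]] for i in range(n))
        if total < best_score:
            best_order, best_score = order, total
    return best_order, best_score


# Toy data (assumed): the model emitted the two talkers in swapped order.
outputs = [[0.9, 0.1, 0.8], [0.1, 1.0, 0.2]]
sources = [[0.0, 1.0, 0.0], [1.0, 0.0, 1.0]]
order, score = best_assignment(outputs, sources)
print(order, score)
```

Because the score is taken over all assignment orders, the result does not depend on which output happens to carry which talker, which is the sense in which the training criterion is permutation invariant. The exhaustive search grows factorially with the number of sources, so this form is only practical for a small number of talkers.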

First claim

Opening claim text (preview).

What is claimed is:

1. A method of separating two or more audio source signals from a first mixed signal having audio source signals and noise source signals, the method comprising: generating output layers from a second mixed signal, the output layers being estimates of audio source signals in the second mixed signal; generating a plurality of labels, wherein a total number of the plurality of labels is equal to a total number of the output layers; iteratively assigning the plurality of labels to the output layers for possible combinations of labels and output layers to create a set of possible assignments, each possible assignment in the set of possible assignments corresponding to a combination of labels and output layers; obtaining a plurality of spatially filtered signals, wherein a total number of spatially filtered signals is equal to the total number of the plurality of labels; determining assignment error scores for each of the set of possible assignments, the assignment error scores determined based at least in part on a difference between labels of the plurality of labels for respective output layers for a respective possible assignment and the plurality of spatially filtered signals; determining an assignment order of labels to be assigned to the output layers, individual labels being associated with individual audio source signals and the assignment order being based, at least in part, on a minimum total deviation score between individual output layers and the individual audio source signals, wherein the minimum total deviation score is a lowest assignment error score of the assignment error scores; generating a set of masks by iteratively optimizing model parameters of the model to minimize the minimum total deviation score of the determined assignment order; and generating the two or more audio source signals from the first mixed signal by using the set of masks, the source of the two or more audio source signals being different from a source of the audio source signals in the second mixed signal.

2. A method as claim 1 recites, wherein determining the assignment order of the labels includes: calculating a set of pairwise deviations between the individual output layers and the audio source signals; calculating total deviation scores for possible assignment orders, a total deviation score for a possible assignment order including a summation of the pairwise deviations between respective pairs of the individual output layers and the individual audio source signals to which the individual output layers correspond according to the possible assignment order; and selecting, from the possible assignment orders, the assignment order based at least in part on a total deviation score associated with the assignment order being a minimum total deviation score among the total deviation scores.

3. A method as claim 2 recites, the total deviation scores for an assignment order including a total mean squared error between individual output sources and the individual audio source signals with which the individual output audio sources are associated according to the assignment order.

4. A method as claim 1 recites, wherein assigning an individual label to an individual output layer attributes the individual output layer to a source of an individual audio source signal of the audio source signals.

5. A method as claim 1 recites, wherein the model obtains the output layers using two or more frames of the mixed signal or two or more frames of a feature signal of the second mixed signal.

6. A method as claim 1 recites, further comprising: shifting a current window of the second mixed signal by one or more frames to obtain an adjacent window, wherein the adjacent window and the current window have overlapping frames; and selecting an assignment order for the adjacent window based at least in part on the assignment order being associated with a minimum total deviation score.

7. A method as claim 1 recites, further comprising: selecting assignment orders for multiple windows of the second mixed signal, output layers, and audio source signals; recording the assignment orders for the multiple windows; and tracing, based at least in part on record of assignment orders for the multiple windows, a source signal attributable to a signal-creating audio source through multiple frames of the second mixed signal.

8. A method as claim 7 recites, wherein tracing the audio source signal attributable to a signal-creating audio source includes: identifying a subset of frames of the multiple frames of the second mixed signal that are included in windows having center frames associated with the audio source signal by respective assignment orders.

9. A method as claim 8 recites, further comprising: obtaining a first minimum total deviation associated with a first meta-frame of the output layers; obtaining a second minimum total deviation associated with a second meta-frame of the output layers; calculating a similarity score of an embedding of the output layers; and determining an assignment order for the first meta-frame or a center frame of the first meta-frame based at least in part on the first minimum total deviation or the second minimum total deviation and the similarity score.

10. A method as claim 1 recites, the output layers including: an estimate of a delta representation of a source signal, and; one or more of an estimate of a spectral magnitude of the source signal or an estimate of a spectrum of the source signal; and the method as claim 1 recites, further comprising: tracing, based at least in part on the estimate of the delta representation, a source attributable for the source signal through multiple frames of the second mixed signal.

11. A method as claim 1 recites, further comprising: estimating separated audio source signals based at least in part on assignment orders for multiple frames of the second mixed signal, output layers, and audio source signals, wherein estimating includes: for a signal source attributable to a first signal of the audio source signals, identifying a subset of frames of the multiple frames associated with the first signal, based on the respective permutation-considered assignment orders of the subset of frames; and associating the subset of frames with the signal audio source to obtain a separated signal audio source attributable to a source of the first audio signal.

12. A method as claim 1 recites, further comprising: spatially filtering, by a microphone array, the mixed signal to obtain the audio signal sources and to identify the signal-creating audio sources; and jointly optimizing the model based at least in part on the spatially filtered audio signal sources.

13. A system for separating two or more audio source signals from a first monaural signal having audio source signals and noise source signals, the system comprising: one or more processors; and a memory having stored thereon computer-executable instructions that, when executed by the one or more processors, configure the processors to: generate, from a window of frames of a second monaural signal, output layers comprising estimates of audio source signals attributable to disparate audio signal sources contributing to the second monaural signal; generate a plurality of labels, wherein a total number of the plurality of labels is equal to a total number of the disparate audio signal sources; iteratively assign the plurality of labels to the disparate audio signal sources for all possible combinations of labels and disparate audio signal sources to create a set of possible assignments, each possible assignment in the set
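The windowed tracing idea in claims 6 to 8 can be sketched as follows: slide an overlapping window over the frames, select the lowest-deviation assignment order for each window, record it, and attribute each window's center frame to a source according to that order. This is only a sketch under stated assumptions: the helper names `window_order` and `trace_sources` are hypothetical, each frame is reduced to a single scalar, the deviation is a summed squared error, and reference source signals are assumed to be available (as they would be during training or evaluation).

```python
from itertools import permutations


def window_order(est_window, ref_window):
    """Lowest total squared-error assignment order for one window.

    est_window[i] and ref_window[j] are lists of per-frame scalars.
    Returns a tuple where order[i] is the source assigned to output i.
    """
    n = len(est_window)

    def dev(i, j):
        return sum((e - r) ** 2 for e, r in zip(est_window[i], ref_window[j]))

    return min(permutations(range(n)),
               key=lambda order: sum(dev(i, order[i]) for i in range(n)))


def trace_sources(estimates, references, window=3):
    """Trace each source through the frames using per-window assignment orders.

    estimates[i][t] is output stream i at frame t (hypothetical layout).
    Returns (orders, traced): the recorded order per window, and for each
    source j a list of (center_frame, value) pairs attributed to it.
    """
    n, frames = len(estimates), len(estimates[0])
    half = window // 2
    orders, traced = [], [[] for _ in range(n)]
    # Shift the window one frame at a time so adjacent windows overlap.
    for center in range(half, frames - half):
        lo, hi = center - half, center + half + 1
        order = window_order([e[lo:hi] for e in estimates],
                             [r[lo:hi] for r in references])
        orders.append(order)  # record the assignment order for this window
        for i, j in enumerate(order):
            traced[j].append((center, estimates[i][center]))
    return orders, traced


# Toy data (assumed): the two output streams swap talkers halfway through.
references = [[1.0] * 6, [0.0] * 6]
estimates = [[0.9, 0.9, 0.9, 0.1, 0.1, 0.1],
             [0.1, 0.1, 0.1, 0.9, 0.9, 0.9]]
orders, traced = trace_sources(estimates, references)
print(orders)
print(traced[0])
```

In this toy case the recorded orders flip once the window is dominated by the swapped frames, and the traced stream for the first source stays near its reference value throughout. Only center frames are attributed here, so the first and last `window // 2` frames are left untraced in this simplified version.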

Assignees

Microsoft Technology Licensing, LLC

Inventors

Yu Dong

Classifications

G10L17/04 (Primary): Training, enrolment or model building
G10L21/0272 (Primary): Voice signal separating
G06F18/2134: based on separation criteria, e.g. independent component analysis
G06F18/21348: overcoming non-stationarity or permutations
G10L17/18: Artificial neural networks; Connectionist approaches

Patent family

Related publications grouped by family.

View patent family 58800898

Related patents

Multichannel raw-waveform neural networks

Method for Enhancing Noisy Speech using Features from an Automatic Speech Recognition System

Mixed speech recognition
