Multi-channel speech separation

US10839822B2 · US · B2

Patent metadata
Field               Value
Publication number  US-10839822-B2
Application number  US-201715805106-A
Country             US
Kind code           B2
Filing date         Nov 6, 2017
Priority date       Nov 6, 2017
Publication date    Nov 17, 2020
Grant date          Nov 17, 2020

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

Representative embodiments disclose mechanisms to separate and recognize multiple audio sources (e.g., picking out individual speakers) in an environment where they overlap and interfere with each other. The architecture uses a microphone array to spatially separate out the audio signals. The spatially filtered signals are then input into a plurality of separators, so each signal is input into a corresponding signal. The separators use neural networks to separate out audio sources. The separators typically produce multiple output signals for the single input signals. A post selection processor then assesses the separator outputs to pick the signals with the highest quality output. These signals can be used in a variety of systems such as speech recognition, meeting transcription and enhancement, hearing aids, music information retrieval, speech enhancement and so forth.
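
As a rough illustration of the data flow the abstract describes, the Python sketch below applies E masks to each beamformed magnitude spectrogram, so that B beams yield B × E candidate outputs. The names `toy_mask_estimator`, `apply_mask`, and `separate_all` are hypothetical, and the threshold-based mask estimator is only a stand-in for the trained neural network, which this page does not specify.

```python
from typing import Callable, List

# A magnitude spectrogram as [time frame][frequency bin] values.
Spectrogram = List[List[float]]


def apply_mask(spec: Spectrogram, mask: Spectrogram) -> Spectrogram:
    """Multiply a spectrogram by a mask, element by element."""
    return [[s * m for s, m in zip(s_row, m_row)]
            for s_row, m_row in zip(spec, mask)]


def toy_mask_estimator(spec: Spectrogram, threshold: float = 0.5) -> List[Spectrogram]:
    """Hypothetical stand-in for the trained network.

    Returns E = 2 complementary binary masks. A real separator would
    estimate its masks with a neural network instead of a threshold.
    """
    high = [[1.0 if v >= threshold else 0.0 for v in row] for row in spec]
    low = [[1.0 - m for m in row] for row in high]
    return [high, low]


def separate_all(
    beams: List[Spectrogram],
    estimate_masks: Callable[[Spectrogram], List[Spectrogram]],
) -> List[Spectrogram]:
    """Run one separator per beam (one-to-one) and collect every output."""
    outputs = []
    for beam in beams:
        for mask in estimate_masks(beam):
            outputs.append(apply_mask(beam, mask))
    return outputs


# Three made-up beamformed spectrograms, each 2 frames x 2 bins.
beams = [
    [[0.9, 0.1], [0.8, 0.2]],
    [[0.3, 0.7], [0.4, 0.6]],
    [[0.5, 0.5], [0.1, 0.9]],
]
outputs = separate_all(beams, toy_mask_estimator)
print(len(outputs))  # 3 beams x 2 masks each
```

Because every beam produces E outputs, the candidate set grows with the number of beams, which is why the abstract adds a post selection step to pick the best-quality signals.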

First claim

Opening claim text (preview).

What is claimed is:

1. A computer implemented method, comprising: receiving audio input from a plurality of microphones; creating a plurality of beamformed signals from the received audio input using a beamformer coupled to the plurality of microphones, each beamformed signal corresponding to a different directional beam; coupling a separator to the output of each beamformed signal in a one-to-one relationship, each separator comprising a trained neural network producing E spectrally filtered outputs from its coupled beamformed signal, wherein the trained neural network produces one or more masks that separate out audio sources in the beamformed signal; and creating a plurality of spectrally filtered outputs from the plurality of beamformed signals using the separators.

2. The method of claim 1 further comprising: selecting a subset of the plurality of spectrally filtered outputs based on a quality metric.

3. The method of claim 2 wherein a number of outputs that comprise the subset equals a number of audio sources received via the plurality of microphones.

4. The method of claim 2 wherein selecting the subset comprises: calculating an affinity matrix between a magnitude spectrogram corresponding to each spectrally filtered output of the separators by their Pearson correlation; grouping columns of the affinity matrix into C+1 clusters using spectral clustering; and selecting C outputs from the clusters having a highest speech quality metric.

5. The method of claim 4 wherein the speech quality metric is computed using a mean to standard deviation criteria.

6. The method of claim 1 wherein E equals 2.

7. The method of claim 1 wherein the plurality of beamformed signals are created using differential beamforming.

8. The method of claim 1 wherein the trained neural network is an Anchored Deep Attractor Network.

9. The method of claim 1 wherein the trained neural network comprises: a Deep Attractor network; a Deep Clustering network; or a Permutation Invariant Training network.

10. A system comprising: a microphone array comprising a plurality of microphones arranged at spatially different locations; a beamformer coupled to the microphone array, the beamformer receiving a plurality of audio signals from the microphone array and producing a plurality spatially filtered audio channels, each channel corresponding to a different directional beam; and a plurality of separators, each separator comprising a trained neural network coupled to a corresponding spatially filtered audio channel of the beamformer in a one-to-one relationship, each separator producing E spectrally filtered outputs corresponding to E audio sources separated for that separator, wherein the trained neural network produces one or more masks that separate out audio sources in the corresponding spatially filtered audio channel.

11. The system of claim 10 wherein the beamformer is a differential beamformer.

12. The system of claim 10 wherein the beamformer is a fixed beamformer unit.

13. The system of claim 10 wherein the trained neural network is an Anchored Deep Attractor Network.

14. The system of claim 10 further wherein the trained neural network comprises: a Deep Attractor network; a Deep Clustering network; or a Permutation Invariant Training network.

15. The system of claim 10 further comprising: a post selection processor to select C audio sources from the audio sources produced by the combined separators based on a speech quality metric.

16. The system of claim 15 wherein the post selection processor comprises acts of: calculating an affinity matrix between a magnitude spectrogram corresponding to each output signal of the separators by their Pearson correlation; grouping columns of the affinity matrix into C+1 clusters using spectral clustering; and selecting C outputs from the clusters having a highest speech quality metric.

17. The system of claim 16 wherein the speech quality metric is computed using a mean to standard deviation criteria.

18. The system of claim 10 wherein E equals 2.

19. A non-transitory computer storage medium comprising executable instructions that, when executed by a processor of a machine, cause the machine to perform acts comprising: receive audio input from a plurality of microphones; create a plurality of beamformed signals from the received audio input using a beamformer so that each beamformed signal corresponds to a different direction; create a plurality of spectrally filtered outputs from the plurality of beamformed signals using a plurality of separators, each separator comprising a trained neural network, each separator coupled to a corresponding one of the plurality of beamformed signals in a one-to-one relationship, each separator producing E spectrally filtered outputs from a single beamformed signal, wherein the trained neural network produces one or more masks that separate out audio sources in the single beamformed signal; and select a subset of the plurality of spectrally filtered outputs based on a quality metric.

20. The medium of claim 19 wherein selecting the subset comprises: calculate an affinity matrix between a magnitude spectrogram corresponding to each spectrally filtered output of the separators by their Pearson correlation; group columns of the affinity matrix into C+1 clusters using spectral clustering; and select C outputs from the clusters having a highest speech quality metric.
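
The post selection recited in claims 4, 16, and 20 can be sketched roughly as follows, under several assumptions. The Pearson-correlation affinity matrix follows the claim wording, but the greedy threshold grouping below is a simplified stand-in for spectral clustering (it does not force exactly C+1 clusters), and the quality scores are passed in as plain numbers because this page does not give the formula behind the mean to standard deviation criterion. Keeping the best output per cluster and then the C best clusters is one reading of the claim language. All function names, the 0.8 threshold, and the example data are hypothetical.

```python
import math
from typing import List, Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation of two equal-length flattened magnitude spectrograms."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0  # a constant signal has no defined correlation
    return cov / math.sqrt(vx * vy)


def affinity_matrix(outputs: List[Sequence[float]]) -> List[List[float]]:
    """Pairwise Pearson correlation between all separator outputs."""
    return [[pearson(a, b) for b in outputs] for a in outputs]


def group_by_affinity(aff: List[List[float]], threshold: float = 0.8) -> List[int]:
    """Assumed stand-in for spectral clustering.

    Each unlabeled output starts a new group and pulls in every later
    unlabeled output whose correlation with it reaches the threshold.
    """
    labels = [-1] * len(aff)
    next_label = 0
    for i in range(len(aff)):
        if labels[i] != -1:
            continue
        labels[i] = next_label
        for j in range(i + 1, len(aff)):
            if labels[j] == -1 and aff[i][j] >= threshold:
                labels[j] = next_label
        next_label += 1
    return labels


def select_outputs(labels: List[int], quality: List[float], num_sources: int) -> List[int]:
    """Keep the best-scoring output of each cluster, then the C best clusters."""
    best = {}
    for idx, label in enumerate(labels):
        if label not in best or quality[idx] > best[label][0]:
            best[label] = (quality[idx], idx)
    ranked = sorted(best.values(), reverse=True)
    return sorted(idx for _, idx in ranked[:num_sources])


# Five made-up flattened spectrograms: two per speaker plus one noise-like output.
outputs = [
    [1.0, 0.0, 1.0, 0.0],
    [0.9, 0.1, 0.8, 0.1],
    [0.0, 1.0, 0.0, 1.0],
    [0.1, 0.8, 0.2, 0.9],
    [0.5, 0.6, 0.5, 0.4],
]
quality = [0.6, 0.9, 0.8, 0.7, 0.1]  # hypothetical, precomputed quality scores

aff = affinity_matrix(outputs)
labels = group_by_affinity(aff)
selected = select_outputs(labels, quality, num_sources=2)
print(labels, selected)
```

With C = 2 sources the grouping here happens to produce three clusters, matching the C+1 count in the claims, where the extra cluster plausibly collects outputs that match neither speaker. A spectral clustering implementation would be asked for exactly C+1 clusters instead of relying on a threshold.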

Assignees

Microsoft Technology Licensing LLC

Inventors

Classifications

  • Combinations of networks · CPC title

  • Recurrent networks, e.g. Hopfield networks · CPC title

  • G06N3/084 · Primary

    Backpropagation, e.g. using gradient descent · CPC title

  • characterised by memory or gating, e.g. long short-term memory [LSTM] or gated recurrent units [GRU] · CPC title

  • Supervised learning · CPC title

Patent family

Related publications grouped by family.

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US10839822B2 cover?
Representative embodiments disclose mechanisms to separate and recognize multiple audio sources (e.g., picking out individual speakers) in an environment where they overlap and interfere with each other. The architecture uses a microphone array to spatially separate out the audio signals. The spatially filtered signals are then input into a plurality of separators, so each signal is input into …
Who is the assignee on this patent?
Microsoft Technology Licensing LLC
What technology area does this patent fall under?
Primary CPC classification G06N3/084. Mapped technology areas include Physics.
When was this patent published?
Publication date Nov 17, 2020 (B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 8 related publications on this page (citations in our corpus or others sharing the same primary CPC).