Systems and methods for speech separation and neural decoding of attentional selection in multi-speaker environments

US2024013800A1 · US · A1

Patent metadata
Publication number: US-2024013800-A1
Application number: US-202318228239-A
Country: US
Kind code: A1
Filing date: Jul 31, 2023
Priority date: Jun 14, 2016
Publication date: Jan 11, 2024
Grant date: (none listed)

Abstract

Official abstract text for this publication.

Disclosed are devices, systems, apparatus, methods, products, and other implementations, including a method comprising obtaining, by a device, a combined sound signal for signals combined from multiple sound sources in an area in which a person is located, and applying, by the device, speech-separation processing (e.g., deep attractor network (DAN) processing, online DAN processing, LSTM-TasNet processing, Conv-TasNet processing), to the combined sound signal from the multiple sound sources to derive a plurality of separated signals that each contains signals corresponding to different groups of the multiple sound sources. The method further includes obtaining, by the device, neural signals for the person, the neural signals being indicative of one or more of the multiple sound sources the person is attentive to, and selecting one of the plurality of separated signals based on the obtained neural signals. The selected signal may then be processed (amplified, attenuated).
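
As a rough illustration of the selection and processing steps summarized above, the following sketch picks the separated signal that best matches an estimate derived from neural signals and then amplifies it relative to the others. It is not the patent's implementation: the Pearson-correlation comparison, the function names `select_attended` and `remix`, the 9 dB gain, and the random toy arrays standing in for spectrograms are all assumptions made for this example.

```python
import numpy as np

def select_attended(separated, attended_estimate):
    """Return the index of the separated representation that correlates best
    with the estimate reconstructed from neural signals (assumed metric:
    Pearson correlation over all time-frequency bins)."""
    scores = [
        np.corrcoef(spec.ravel(), attended_estimate.ravel())[0, 1]
        for spec in separated
    ]
    return int(np.argmax(scores)), scores

def remix(separated, selected, gain_db=9.0):
    """Amplify the selected source by gain_db (hypothetical value) relative
    to the non-selected sources, then sum everything back together."""
    gain = 10.0 ** (gain_db / 20.0)
    out = np.zeros_like(separated[0])
    for i, spec in enumerate(separated):
        out += spec * (gain if i == selected else 1.0)
    return out

# Toy data: two "separated" spectrograms (frequency bins x time frames).
rng = np.random.default_rng(0)
speaker_a = rng.random((16, 50))
speaker_b = rng.random((16, 50))

# Pretend the neural decoder produced a noisy copy of speaker B.
neural_estimate = speaker_b + 0.3 * rng.standard_normal((16, 50))

idx, scores = select_attended([speaker_a, speaker_b], neural_estimate)
mixed = remix([speaker_a, speaker_b], idx)
```

In this toy setup the noisy estimate is built from the second speaker, so the second separated signal is the one selected. A real device would derive the estimate from measured neural recordings and would convert the result back into an acoustic signal.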

First claim

Opening claim text (preview).

What is claimed is:

1. A method comprising: obtaining, by a device, a combined sound signal for sound combined from multiple sound sources in an area in which a person is located; applying, by the device, a learning-engine-based speech-separation processing to the combined sound signal to derive in real-time one or more separated sound signals that each contains sound signals from different groups of the multiple sound sources; obtaining, by the device, neural signals measured by invasive electrodes placed in a brain of the person to track, through closed-loop attention decoding, a modifiable direction of attention of the person indicative of one or more of the multiple sound sources the person is attentive to; and selecting, by the device, one of the one or more separated sound signals based on the neural signals obtained for the person.

2. The method of claim 1, wherein applying the learning-engine-based speech separation processing comprises one of: extracting from the combined sound signal one set of sound signals for one of the different groups of the multiple sound sources, or separating the combined sound signal into a plurality of sets of separated sounds signals corresponding to a plurality of the different groups of the multiple sound sources.

3. The method of claim 1, wherein obtaining the neural signals comprises: obtaining electrocorticography (ECoG) signals for the person via the invasive electrodes placed onto a cortical surface of the brain of the person.

4. The method of claim 1, further comprising: processing the selected one of the one or more separated sound signals to perform one or more of: amplifying the selected one of the one or more separated sound signals, or attenuating at least one non-selected sound signal from the one or more separated sound signals.

5. The method of claim 1, wherein obtaining the combined sound signal for the multiple sound sources comprises: receiving the combined sound signal for the multiple sound sources at a single microphone coupled to the device.

6. The method of claim 5, wherein applying the learning-engine-based speech-separation processing to the combined sound signal comprises: applying the learning-engine-based speech-separation processing to the combined sound signal received through the single microphone to derive the one or more separated sound signals regardless of spatial separation between the multiple sound sources.

7. The method of claim 1, wherein the device is a hearing device.

8. The method of claim 1, wherein applying the learning-engine-based speech-separation processing to the combined sound signal comprises: providing the combined sound signal from the multiple sound sources to a deep neural network (DNN) configured to separate the different groups of the multiple sound sources into individual signals associated with respective ones of the multiple sound sources.

9. The method of claim 8, wherein the DNN comprises one or more long short-term memory recurrent neural networks.

10. The method of claim 1, wherein applying the learning-engine-based speech-separation processing to the combined sound signal comprises: generating a combined sound spectrogram from the combined sound signal; and separating the combined sound spectrogram into multiple resultant speaker spectrograms that each corresponds to a different one of the multiple sound sources; and wherein selecting one of the one or more separated sound signals based on the obtained neural signals for the person comprises: generating an attended speaker spectrogram based on the neural signals obtained for the person; comparing the attended speaker spectrogram to the multiple resultant speaker spectrograms to select one of the multiple resultant speaker spectrograms; and transforming the selected one of the multiple resultant speaker spectrograms into an acoustic signal.

11. The method of claim 10, wherein separating the combined sound spectrogram into multiple resultant speaker spectrograms comprises: separating the combined sound spectrogram with multiple trained deep neural network (DNN) engines that are each adapted to output a respective spectrogram for one of the multiple sound sources.

12. The method of claim 1, wherein applying the learning-engine-based speech-separation processing to the combined sound signal comprises: transforming the combined signal into an embedding space; determining respective reference points in the embedding space for each of the different groups of the multiple sound sources, with the reference points representing locations of the different groups of the multiple sound sources in the embedding space; deriving masks for the determined reference points; and extracting at least one of the multiple sound sources using at least one of the derived masks.

13. The method of claim 12, wherein deriving the masks comprises: computing similarity between embedded points within the embedding space and the determined respective reference points.

14. The method of claim 1, wherein applying the learning-engine-based speech-separation processing to the combined sound signal comprises: dividing the combined sound signal into non-overlapping segments; transforming the non-overlapping segments into respective weighted sums of a learnable overcomplete basis of signals, wherein weight coefficients for the respective weighted sums are non-negative; performing learning-engine-based processing on the respective weighted sums of the learnable overcomplete basis of signals to derive a plurality of mask matrices corresponding to different groups of the multiple sound sources; and estimating a plurality of reconstructed sounds signals using the derived plurality of mask matrices corresponding to the different groups of the multiple sound sources.

15. The method of claim 14, wherein transforming the non-overlapping segments into the respective weighted sums comprises: estimating the respective weighted sums of the learnable overcomplete basis of signals using a gated 1-D convolution layer according to: $w_k = \mathrm{ReLU}(x_k * U) \odot \sigma(x_k * V),\ k = 1, 2, \ldots, K$, where $U \in \mathbb{R}^{N \times L}$ and $V \in \mathbb{R}^{N \times L}$ are $N$ vectors with length $L$, $w_k \in \mathbb{R}^{1 \times N}$ is a mixture weight vector for segment $k$, $\sigma$ denotes a Sigmoid activation function, and $*$ denotes a convolution operator.

16. The method of claim 1, wherein applying the learning-engine-based speech-separation processing to the combined sound signal comprises: representing the combined sound signal as a time-frequency mixture signal in a time-frequency space; projecting the time-frequency mixture signal into an embedding space comprising multiple embedded time-frequency bins; tracking respective reference points for each of the multiple sound sources, with the reference points representing locations of the multiple sound sources in the embedding space, based at least in part on previous locations of the respective reference points at one or more earlier time instances; deriving masks for the tracked respective reference points; and extracting at least one of the multiple sound sources using at least one of the derived masks.

17. The method of claim 1, wherein applying the learning-engine-based speech-separation processing to the combined sound signal comprises: dividing the combined sound signal into a plurality of segments; transforming the plurality of segments into a plurality of corresponding encoded segments represented in an intermediate feature space; estimating, for each of the plurality o
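
The gated encoder recited in claim 15 can be illustrated with a short NumPy sketch. It assumes that convolving a length-L segment with each length-L basis vector reduces to an inner product, so that the whole operation becomes a matrix product; that reading, the function names, the sizes K, L, and N, and the random (untrained) matrices U and V are assumptions for this example, not details confirmed by the claim text.

```python
import numpy as np

def sigmoid(z):
    """Element-wise logistic function."""
    return 1.0 / (1.0 + np.exp(-z))

def gated_encoder(x, U, V):
    """Compute w_k = ReLU(x_k * U) ⊙ sigmoid(x_k * V) for every segment.

    x: (K, L) array of non-overlapping segments.
    U, V: (N, L) arrays of basis vectors (learned in practice, random here).
    Returns a (K, N) array of mixture weights.
    Assumption: each convolution is evaluated as an inner product.
    """
    relu_branch = np.maximum(x @ U.T, 0.0)   # ReLU(x_k * U)
    gate_branch = sigmoid(x @ V.T)           # sigma(x_k * V)
    return relu_branch * gate_branch         # element-wise product

rng = np.random.default_rng(0)
K, L, N = 5, 40, 64                  # hypothetical sizes; N > L (overcomplete)
signal = rng.standard_normal(K * L)  # stand-in for a mixture waveform
segments = signal.reshape(K, L)      # K non-overlapping segments of length L

U = rng.standard_normal((N, L))
V = rng.standard_normal((N, L))
W = gated_encoder(segments, U, V)    # one weight vector per segment
```

Because the ReLU branch is non-negative and the Sigmoid gate lies strictly between 0 and 1, every entry of `W` is non-negative, which matches the non-negativity condition on the weight coefficients in claim 14.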

Assignees

Inventors

Classifications

  • G10L25/30 (Primary)

    using neural networks · CPC title

  • for extracting parameters related to health condition (detecting or measuring for diagnostic purposes A61B5/00) · CPC title

  • Recognition of special voice characteristics, e.g. for use in lie detectors; Recognition of animal voices · CPC title

  • Audiometering · CPC title

  • Voice signal separating · CPC title

Patent family

Related publications grouped by family.

External sources

Frequently asked questions

Answers are generated from the same data shown on this page.

Who is the assignee on this patent?
Univ Columbia
What technology area does this patent fall under?
Primary CPC classification G10L25/30. Mapped technology areas include Physics.
When was this patent published?
Publication date: Jan 11, 2024 (A1). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 1 related publication on this page (citations in our corpus or others sharing the same primary CPC).