Multi-speaker speech separation
US-9818431-B2 · Nov 14, 2017 · US
US2024013800A1 · US · A1
| Field | Value |
|---|---|
| Publication number | US-2024013800-A1 |
| Application number | US-202318228239-A |
| Country | US |
| Kind code | A1 |
| Filing date | Jul 31, 2023 |
| Priority date | Jun 14, 2016 |
| Publication date | Jan 11, 2024 |
| Grant date | — |
Official abstract text for this publication.
Disclosed are devices, systems, apparatus, methods, products, and other implementations, including a method comprising obtaining, by a device, a combined sound signal for signals combined from multiple sound sources in an area in which a person is located, and applying, by the device, speech-separation processing (e.g., deep attractor network (DAN) processing, online DAN processing, LSTM-TasNet processing, Conv-TasNet processing), to the combined sound signal from the multiple sound sources to derive a plurality of separated signals that each contains signals corresponding to different groups of the multiple sound sources. The method further includes obtaining, by the device, neural signals for the person, the neural signals being indicative of one or more of the multiple sound sources the person is attentive to, and selecting one of the plurality of separated signals based on the obtained neural signals. The selected signal may then be processed (amplified, attenuated).
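The selection and gain steps described in the abstract can be illustrated with a small sketch. It assumes, purely for illustration, that each separated signal and the neurally decoded target are available as equal-length 1-D feature sequences (for example, envelopes), and that Pearson correlation is the comparison measure. The publication text shown here does not fix either choice, and the names `pearson`, `select_attended`, and `remix` and the gain values are hypothetical.

```python
import math

def pearson(a, b):
    """Pearson correlation between two equal-length sequences."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return 0.0  # a constant sequence carries no correlation information
    return cov / math.sqrt(var_a * var_b)

def select_attended(decoded, separated):
    """Return the index of the separated signal most similar to the decoded one."""
    scores = [pearson(decoded, s) for s in separated]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores

def remix(separated, selected, boost=2.0, cut=0.25):
    """Amplify the selected signal, attenuate the others, and sum them.

    The gains `boost` and `cut` are illustrative assumptions.
    """
    out = [0.0] * len(separated[0])
    for i, sig in enumerate(separated):
        gain = boost if i == selected else cut
        for t, sample in enumerate(sig):
            out[t] += gain * sample
    return out

# Toy data: two separated "speakers" and a noisy decoded target.
speaker_a = [0.0, 1.0, 0.0, 1.0]
speaker_b = [1.0, 0.0, 1.0, 0.0]
decoded = [0.1, 0.9, 0.2, 0.8]

best, scores = select_attended(decoded, [speaker_a, speaker_b])
output = remix([speaker_a, speaker_b], best)
```

In this toy case the decoded sequence tracks `speaker_a`, so that signal is boosted and `speaker_b` is attenuated before the two are summed.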
Opening claim text (preview).
What is claimed is:
1. A method comprising: obtaining, by a device, a combined sound signal for sound combined from multiple sound sources in an area in which a person is located; applying, by the device, a learning-engine-based speech-separation processing to the combined sound signal to derive in real-time one or more separated sound signals that each contains sound signals from different groups of the multiple sound sources; obtaining, by the device, neural signals measured by invasive electrodes placed in a brain of the person to track, through closed-loop attention decoding, a modifiable direction of attention of the person indicative of one or more of the multiple sound sources the person is attentive to; and selecting, by the device, one of the one or more separated sound signals based on the neural signals obtained for the person.
2. The method of claim 1, wherein applying the learning-engine-based speech separation processing comprises one of: extracting from the combined sound signal one set of sound signals for one of the different groups of the multiple sound sources, or separating the combined sound signal into a plurality of sets of separated sounds signals corresponding to a plurality of the different groups of the multiple sound sources.
3. The method of claim 1, wherein obtaining the neural signals comprises: obtaining electrocorticography (ECoG) signals for the person via the invasive electrodes placed onto a cortical surface of the brain of the person.
4. The method of claim 1, further comprising: processing the selected one of the one or more separated sound signals to perform one or more of: amplifying the selected one of the one or more separated sound signals, or attenuating at least one non-selected sound signal from the one or more separated sound signals.
5. The method of claim 1, wherein obtaining the combined sound signal for the multiple sound sources comprises: receiving the combined sound signal for the multiple sound sources at a single microphone coupled to the device.
6. The method of claim 5, wherein applying the learning-engine-based speech-separation processing to the combined sound signal comprises: applying the learning-engine-based speech-separation processing to the combined sound signal received through the single microphone to derive the one or more separated sound signals regardless of spatial separation between the multiple sound sources.
7. The method of claim 1, wherein the device is a hearing device.
8. The method of claim 1, wherein applying the learning-engine-based speech-separation processing to the combined sound signal comprises: providing the combined sound signal from the multiple sound sources to a deep neural network (DNN) configured to separate the different groups of the multiple sound sources into individual signals associated with respective ones of the multiple sound sources.
9. The method of claim 8, wherein the DNN comprises one or more long short-term memory recurrent neural networks.
10. The method of claim 1, wherein applying the learning-engine-based speech-separation processing to the combined sound signal comprises: generating a combined sound spectrogram from the combined sound signal; and separating the combined sound spectrogram into multiple resultant speaker spectrograms that each corresponds to a different one of the multiple sound sources; and wherein selecting one of the one or more separated sound signals based on the obtained neural signals for the person comprises: generating an attended speaker spectrogram based on the neural signals obtained for the person; comparing the attended speaker spectrogram to the multiple resultant speaker spectrograms to select one of the multiple resultant speaker spectrograms; and transforming the selected one of the multiple resultant speaker spectrograms into an acoustic signal.
11. The method of claim 10, wherein separating the combined sound spectrogram into multiple resultant speaker spectrograms comprises: separating the combined sound spectrogram with multiple trained deep neural network (DNN) engines that are each adapted to output a respective spectrogram for one of the multiple sound sources.
12. The method of claim 1, wherein applying the learning-engine-based speech-separation processing to the combined sound signal comprises: transforming the combined signal into an embedding space; determining respective reference points in the embedding space for each of the different groups of the multiple sound sources, with the reference points representing locations of the different groups of the multiple sound sources in the embedding space; deriving masks for the determined reference points; and extracting at least one of the multiple sound sources using at least one of the derived masks.
13. The method of claim 12, wherein deriving the masks comprises: computing similarity between embedded points within the embedding space and the determined respective reference points.
14. The method of claim 1, wherein applying the learning-engine-based speech-separation processing to the combined sound signal comprises: dividing the combined sound signal into non-overlapping segments; transforming the non-overlapping segments into respective weighted sums of a learnable overcomplete basis of signals, wherein weight coefficients for the respective weighted sums are non-negative; performing learning-engine-based processing on the respective weighted sums of the learnable overcomplete basis of signals to derive a plurality of mask matrices corresponding to different groups of the multiple sound sources; and estimating a plurality of reconstructed sounds signals using the derived plurality of mask matrices corresponding to the different groups of the multiple sound sources.
15. The method of claim 14, wherein transforming the non-overlapping segments into the respective weighted sums comprises: estimating the respective weighted sums of the learnable overcomplete basis of signals using a gated 1-D convolution layer according to: $w_k = \mathrm{ReLU}(x_k * U) \odot \sigma(x_k * V),\quad k = 1, 2, \ldots, K$, where $U \in \mathbb{R}^{N \times L}$ and $V \in \mathbb{R}^{N \times L}$ are $N$ vectors with length $L$, $w_k \in \mathbb{R}^{1 \times N}$ is a mixture weight vector for segment $k$, $\sigma$ denotes a Sigmoid activation function, and $*$ denotes a convolution operator.
16. The method of claim 1, wherein applying the learning-engine-based speech-separation processing to the combined sound signal comprises: representing the combined sound signal as a time-frequency mixture signal in a time-frequency space; projecting the time-frequency mixture signal into an embedding space comprising multiple embedded time-frequency bins; tracking respective reference points for each of the multiple sound sources, with the reference points representing locations of the multiple sound sources in the embedding space, based at least in part on previous locations of the respective reference points at one or more earlier time instances; deriving masks for the tracked respective reference points; and extracting at least one of the multiple sound sources using at least one of the derived masks.
17. The method of claim 1, wherein applying the learning-engine-based speech-separation processing to the combined sound signal comprises: dividing the combined sound signal into a plurality of segments; transforming the plurality of segments into a plurality of corresponding encoded segments represented in an intermediate feature space; estimating, for each of the plurality o
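The mask derivation recited in claims 12 and 13 can be sketched as follows. This is a minimal illustration, not the claimed implementation: it assumes the time-frequency bins have already been embedded and flattened into a list of vectors, that the reference points are given rather than learned or tracked, and that similarity is a dot product normalized with a softmax across sources. The names `softmax`, `attractor_masks`, and `apply_mask` are hypothetical.

```python
import math

def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def attractor_masks(embeddings, reference_points):
    """Derive one soft mask per source.

    embeddings: one embedding vector per time-frequency bin (flattened).
    reference_points: one vector per source in the same embedding space.
    Returns masks[c][i], the share of bin i assigned to source c.
    """
    masks = [[] for _ in reference_points]
    for vec in embeddings:
        # Similarity of this bin to every reference point (dot product, assumed).
        sims = [sum(r * v for r, v in zip(ref, vec)) for ref in reference_points]
        for c, share in enumerate(softmax(sims)):
            masks[c].append(share)
    return masks

def apply_mask(mixture_bins, mask):
    """Extract one source by weighting each mixture bin with its mask value."""
    return [m * x for m, x in zip(mask, mixture_bins)]

# Toy data: two bins, each embedded close to a different reference point.
embeddings = [[5.0, 0.0], [0.0, 5.0]]
reference_points = [[1.0, 0.0], [0.0, 1.0]]
mixture_bins = [2.0, 3.0]

masks = attractor_masks(embeddings, reference_points)
source_0 = apply_mask(mixture_bins, masks[0])
```

Because the softmax is taken across sources, the masks for each bin sum to one, so the extracted sources add back up to the mixture.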
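The gated encoding formula in claim 15 can be worked through numerically. The sketch below is written under stated assumptions: each length-L segment is combined with each of the N length-L vectors by a plain dot product (the usual deep-learning reading of a 1-D convolution whose kernel spans the whole segment, with no kernel flip), and `U` and `V` hold arbitrary illustrative values rather than trained weights. The function names are hypothetical.

```python
import math

def sigmoid(z):
    """Logistic sigmoid."""
    return 1.0 / (1.0 + math.exp(-z))

def split_segments(signal, length):
    """Divide a signal into non-overlapping segments, dropping any short tail."""
    return [signal[i:i + length]
            for i in range(0, len(signal) - length + 1, length)]

def gated_encode(x_k, U, V):
    """Compute w_k = ReLU(x_k * U) ⊙ sigmoid(x_k * V) for one segment.

    U and V are lists of N vectors, each of length L == len(x_k).
    Every output weight is non-negative because ReLU >= 0 and sigmoid > 0.
    """
    w_k = []
    for u, v in zip(U, V):
        linear = sum(x * a for x, a in zip(x_k, u))  # x_k * U, one basis vector
        gate = sum(x * b for x, b in zip(x_k, v))    # x_k * V, one basis vector
        w_k.append(max(0.0, linear) * sigmoid(gate))
    return w_k

# Toy example with L = 2 samples per segment and N = 2 basis vectors.
U = [[1.0, 0.0], [0.0, 1.0]]
V = [[0.0, 0.0], [0.0, 0.0]]  # zero gate input, so every gate equals 0.5
segments = split_segments([1.0, -1.0, 3.0, 4.0, 9.0], 2)
weights = [gated_encode(seg, U, V) for seg in segments]
```

The ReLU branch supplies the non-negativity required by claim 14, while the sigmoid branch scales each weight by a factor between 0 and 1.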
using neural networks · CPC title
for extracting parameters related to health condition (detecting or measuring for diagnostic purposes A61B5/00) · CPC title
Recognition of special voice characteristics, e.g. for use in lie detectors; Recognition of animal voices · CPC title
Audiometering · CPC title
Voice signal separating · CPC title