Target speaker mode
US-12217761-B2 · Feb 4, 2025 · US
US12444429B2 · US · B2
| Field | Value |
|---|---|
| Publication number | US-12444429-B2 |
| Application number | US-202218085705-A |
| Country | US |
| Kind code | B2 |
| Filing date | Dec 21, 2022 |
| Priority date | Dec 21, 2021 |
| Publication date | Oct 14, 2025 |
| Grant date | Oct 14, 2025 |
Official abstract text for this publication.
Improved systems and methods are provided herein for extracting target speech from audio signals that can contain masking speech or other unwanted noise content. These systems and methods include detection of target speech in an input signal by detecting elevated frequency content in the signal above a threshold frequency. Portions of the signal determined to contain such elevated high frequency content are then used to generate audio filters to extract target speech from subsequently-obtained audio signals. This can include performing non-negative matrix factorization to determine a set of basis vectors to represent noise content in the spectral domain and then using the set of basis vectors to decompose subsequently-obtained audio signals into noise signals that can then be removed from the audio signals.
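The detection step described in the abstract can be illustrated with a minimal sketch. It is not the patent's implementation: it assumes non-overlapping frames, a Hann window, and a simple sum of spectral power above a cutoff. The function name `detect_high_frequency_frames`, the 20 ms frame length, and the `energy_threshold` value are all hypothetical choices made for illustration; only the 5.6 kHz cutoff comes from the claims.

```python
import numpy as np

def high_band_energy_per_frame(signal, sample_rate, cutoff_hz=5600.0, frame_ms=20.0):
    """Return the spectral energy above cutoff_hz for each non-overlapping frame."""
    frame_len = int(sample_rate * frame_ms / 1000.0)
    n_frames = len(signal) // frame_len
    # Drop any trailing partial frame, then stack frames as rows.
    frames = np.reshape(signal[: n_frames * frame_len], (n_frames, frame_len))
    # Window each frame (assumed Hann) and take the one-sided FFT.
    spectrum = np.fft.rfft(frames * np.hanning(frame_len), axis=1)
    freqs = np.fft.rfftfreq(frame_len, d=1.0 / sample_rate)
    power = np.abs(spectrum) ** 2
    # Sum power only in bins strictly above the cutoff frequency.
    return power[:, freqs > cutoff_hz].sum(axis=1)

def detect_high_frequency_frames(signal, sample_rate, energy_threshold, **kwargs):
    """Flag frames whose energy above the cutoff exceeds energy_threshold."""
    energy = high_band_energy_per_frame(signal, sample_rate, **kwargs)
    return energy > energy_threshold
```

The flagged frames would then be the portions passed on to filter generation. In practice the threshold would need tuning to the microphone and sample rate, and the sample rate must exceed 11.2 kHz for any bins to fall above the cutoff.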
Opening claim text (preview).
We claim:

1. A non-transitory computer readable medium comprising program instructions executable by at least one processor to cause the at least one processor to perform a method comprising: obtaining a first audio sample; determining that a first portion of the first audio sample contains frequency content at frequencies higher than 5.6 kilohertz that exceeds a threshold energy level; responsive to determining that the first portion contains frequency content at frequencies higher than 5.6 kilohertz that exceeds the threshold energy level, determining a first audio filter based on the first portion of the first audio sample by: determining a first spectrogram for the first portion; and performing non-negative matrix factorization to generate a first matrix and a second matrix whose product corresponds to a low-frequency portion of the first spectrogram that is below a threshold frequency, wherein the first matrix is composed of a set of column vectors that span along a frequency dimension of the first spectrogram, and wherein the second matrix is composed of a set of row vectors that span along a time dimension of the first spectrogram; subsequent to obtaining the first audio sample, obtaining a second audio sample; and applying the first audio filter to the second audio sample to generate a first audio output by: determining a second spectrogram for the second audio sample; applying the first matrix to a low-frequency portion of the second spectrogram that is below the threshold frequency to generate a third spectrogram that represents noise content of the second audio sample; and using the third spectrogram to remove the noise content from the second audio sample, thereby generating the first audio output.

2. The non-transitory computer readable medium of claim 1, wherein the method further comprises: determining a plurality of zero-crossing rates across time for the first audio sample; and determining a plurality of signal energy levels across time for the first audio sample, wherein determining that the first portion contains frequency content at frequencies higher than 5.6 kilohertz that exceeds the threshold energy level comprises determining (i) that a zero-crossing rate, of the plurality of zero-crossing rates, that corresponds to the first portion exceeds a threshold zero-crossing rate and (ii) that a signal energy level, of the plurality of signal energy levels, that corresponds to the first portion exceeds a threshold signal energy level.

3. The non-transitory computer readable medium of claim 1, wherein the first audio sample is divided into a plurality of non-overlapping frames, and wherein determining that the first portion contains frequency content at frequencies higher than 5.6 kilohertz that exceeds the threshold energy level comprises: determining that a contiguous subset of the plurality of non-overlapping frames of the first audio sample all contain frequency content at frequencies higher than 5.6 kilohertz that exceeds the threshold energy level, wherein the first portion consists of the contiguous subset of frames of the first audio sample.

4. The non-transitory computer readable medium of claim 3, wherein each frame of the plurality of non-overlapping frames of the first audio sample has a duration between 15 milliseconds and 50 milliseconds.

5. The non-transitory computer readable medium of claim 1, wherein the method further comprises: prior to obtaining the first audio sample, obtaining a third audio sample; determining that a second portion of the third audio sample contains frequency content at frequencies higher than 5.6 kilohertz that exceeds the threshold energy level; and responsive to determining that the second portion of the third audio sample contains frequency content at frequencies higher than 5.6 kilohertz that exceeds the threshold energy level, determining a second audio filter based on the second portion of the third audio sample by: determining a fourth spectrogram for the second portion; and performing non-negative matrix factorization to generate a third matrix and a fourth matrix whose product corresponds to a portion of the fourth spectrogram below the threshold frequency, wherein the third matrix is composed of a further set of column vectors that span along a frequency dimension of the fourth spectrogram, and wherein the fourth matrix is composed of a further set of row vectors that span along a time dimension of the fourth spectrogram, wherein performing non-negative matrix factorization to generate the first matrix and the second matrix comprises using, as an initial estimate of the first matrix, the third matrix.

6. The non-transitory computer readable medium of claim 1, wherein determining that the first portion contains frequency content at frequencies higher than 5.6 kilohertz that exceeds the threshold energy level comprises: determining a spectrogram for the first portion; and determining that a total energy in the spectrogram above 5.6 kilohertz exceeds the threshold energy level.

7. The non-transitory computer readable medium of claim 1, wherein using the third spectrogram to remove the noise content from the second audio sample comprises: performing an inverse transform on the third spectrogram to generate a time-domain noise signal; and subtracting the time-domain noise signal from the second audio sample to generate the first audio output.

8. The non-transitory computer readable medium of claim 1, wherein the method further comprises: prior to obtaining the first audio sample, obtaining a third audio sample; determining that a second portion of the third audio sample contains frequency content at frequencies higher than 5.6 kilohertz that exceeds the threshold energy level; and responsive to determining that the second portion of the third audio sample contains frequency content at frequencies higher than 5.6 kilohertz that exceeds the threshold energy level, determining a second audio filter based on the second portion of the third audio sample, wherein determining the first audio filter based on the first portion comprises: determining a third audio filter based on the first portion; and determining the first audio filter as a weighted combination of the second audio filter and the third audio filter.

9. A method comprising: obtaining a first audio sample; determining that a first portion of the first audio sample contains frequency content at frequencies higher than 5.6 kilohertz that exceeds a threshold energy level; responsive to determining that the first portion contains frequency content at frequencies higher than 5.6 kilohertz that exceeds the threshold energy level, determining a first audio filter based on the first portion of the first audio sample by: determining a first spectrogram for the first portion; and performing non-negative matrix factorization to generate a first matrix and a second matrix whose product corresponds to a low-frequency portion of the first spectrogram that is below a threshold frequency, wherein the first matrix is composed of a set of column vectors that span along a frequency dimension of the first spectrogram, and wherein the second matrix is composed of a set of row vectors that span along a time dimension of the first spectrogram; subsequent to obtaining the first audio sample, obtaining a second audio sample; and applying the first audio filter to the second audio sample to generate a first audio output by: determining a second spectrogram for the second audio sample; applying the first matrix to a low-frequency portion of the second spectrogram that is below the threshold frequency to generate a third spectrogram that represents noise content of the second audio sample; a
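The factorization and filtering steps recited in claims 1 and 5 can be sketched roughly as follows. This is an illustrative sketch under stated assumptions, not the patented implementation: it uses the standard multiplicative-update rules for the Frobenius-norm objective, it interprets "applying the first matrix" as solving for new activations with the basis held fixed, and it removes noise by clipped subtraction of magnitude spectrograms rather than by the time-domain subtraction of claim 7. The function names, the rank, and the iteration counts are hypothetical. The caller is assumed to pass only the rows of a non-negative magnitude spectrogram that lie below the threshold frequency.

```python
import numpy as np

EPS = 1e-12  # guards against division by zero in the updates

def nmf(V, rank, n_iter=200, W_init=None, seed=0):
    """Factor non-negative V (freq x time) as W @ H with multiplicative updates.

    W has one column per basis vector (frequency dimension); H has one row
    per basis vector (time dimension). W_init allows a warm start from a
    previously learned basis, as in claim 5.
    """
    rng = np.random.default_rng(seed)
    n_freq, n_time = V.shape
    if W_init is None:
        W = rng.random((n_freq, rank)) + EPS
    else:
        W = np.array(W_init, dtype=float)  # copy so the caller's basis is untouched
    H = rng.random((W.shape[1], n_time)) + EPS
    for _ in range(n_iter):
        H *= (W.T @ V) / (W.T @ W @ H + EPS)
        W *= (V @ H.T) / (W @ H @ H.T + EPS)
    return W, H

def estimate_noise(V2, W, n_iter=200, seed=0):
    """Hold the basis W fixed, fit activations to V2, and return W @ H."""
    rng = np.random.default_rng(seed)
    H = rng.random((W.shape[1], V2.shape[1])) + EPS
    for _ in range(n_iter):
        H *= (W.T @ V2) / (W.T @ W @ H + EPS)
    return W @ H

def remove_noise(V2, noise):
    """Subtract the noise estimate, clipping at zero to keep magnitudes valid."""
    return np.maximum(V2 - noise, 0.0)
```

A typical call sequence would be `W, _ = nmf(V1_low, rank=8)` on the low-frequency rows of the first spectrogram, then `remove_noise(V2_low, estimate_noise(V2_low, W))` on the second. Passing an earlier basis as `W_init` corresponds to the warm start in claim 5; a weighted average of two bases would be one possible reading of the combination in claim 8.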
characterised by the type of parameter measurement, e.g. correlation techniques, zero crossing techniques or predictive techniques · CPC title
for discriminating voice from noise · CPC title
Voice signal separating · CPC title
the extracted parameters being spectral information of each sub-band · CPC title
Processing in the time domain · CPC title