Multi-Channel AEC System Identification for Self-Calibration
US-2024121568-A1 · Apr 11, 2024 · US
US12475906B2 · US · B2
| Field | Value |
|---|---|
| Publication number | US-12475906-B2 |
| Application number | US-202318450784-A |
| Country | US |
| Kind code | B2 |
| Filing date | Aug 16, 2023 |
| Priority date | Aug 16, 2023 |
| Publication date | Nov 18, 2025 |
| Grant date | Nov 18, 2025 |
Official abstract text for this publication.
Aspects of the present disclosure provide a method for voice control that includes transforming, using a short-time Fourier transform (STFT) applied to data in each window aligned across each input channel of the multichannel audio stream, the multichannel audio stream into a complex valued frequency-domain representation. For a current window, the method further includes: updating a first complex-valued covariance matrix corresponding to a slowly-adapting beamformer and forming a single-channel denoised estimate for each frequency band in the STFT; calculating a voice activity detection (VAD) estimate for each frequency band in the STFT by comparing a magnitude of the single-channel denoised estimate to a magnitude of each input channel of the multichannel audio stream; and selectively updating or refraining from updating, responsive to the VAD estimate respectively indicating a presence or an absence of speech, a second complex-valued covariance matrix corresponding to a quickly-adapting beamformer.
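The per-window steps in the abstract can be illustrated for a single frequency band. The following is a minimal sketch, not the patented implementation. It assumes two microphones, an exponentially weighted covariance update, an MPDR-style weight vector with a known steering vector `d`, and a VAD rule that declares speech only when the denoised magnitude is at least `ratio` times the magnitude of every input channel. All function names, the smoothing factors, the diagonal loading, and the `ratio` threshold are hypothetical values chosen for illustration.

```python
# Illustrative sketch only: two channels, one STFT band, one window.
# Names and constants are assumptions, not taken from the patent text.

def outer(x):
    """Return the 2x2 outer product x x^H of a 2-element complex vector."""
    return [[x[i] * x[j].conjugate() for j in range(2)] for i in range(2)]

def smooth(R, x, alpha):
    """Exponentially weighted update: alpha * R + (1 - alpha) * x x^H."""
    X = outer(x)
    return [[alpha * R[i][j] + (1 - alpha) * X[i][j] for j in range(2)]
            for i in range(2)]

def mpdr_weights(R, d, loading=1e-6):
    """w = R^-1 d / (d^H R^-1 d), using the closed-form 2x2 inverse."""
    a, b = R[0][0] + loading, R[0][1]
    c, e = R[1][0], R[1][1] + loading
    det = a * e - b * c
    u = [(e * d[0] - b * d[1]) / det, (-c * d[0] + a * d[1]) / det]
    denom = d[0].conjugate() * u[0] + d[1].conjugate() * u[1]
    return [u[0] / denom, u[1] / denom]

def process_band(x, d, R_slow, R_fast,
                 alpha_slow=0.99, alpha_fast=0.8, ratio=0.5):
    """Process one band of one window; return (y, speech, R_slow, R_fast)."""
    # 1. The slowly-adapting covariance is updated on every window.
    R_slow = smooth(R_slow, x, alpha_slow)
    # 2. Single-channel denoised estimate y = w^H x from the slow beamformer.
    w = mpdr_weights(R_slow, d)
    y = w[0].conjugate() * x[0] + w[1].conjugate() * x[1]
    # 3. Per-band VAD: compare |y| with the magnitude of each input channel.
    speech = all(abs(y) >= ratio * abs(xm) for xm in x)
    # 4. The quickly-adapting covariance is updated only when speech is flagged.
    if speech:
        R_fast = smooth(R_fast, x, alpha_fast)
    return y, speech, R_slow, R_fast
```

In this sketch, a snapshot aligned with the steering vector passes through the slow beamformer essentially unattenuated and is flagged as speech, while a snapshot orthogonal to it is suppressed, flagged as noise, and leaves the fast covariance untouched.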
Opening claim text (preview).
What is claimed is:

1. A method for voice control, comprising: transforming, using a short-time Fourier transform (STFT) applied to data in each of a plurality of windows aligned across each input channel of a multichannel audio stream, the multichannel audio stream into a complex valued frequency-domain representation, wherein for a current one of the plurality of windows, the method comprises: updating a first complex-valued covariance matrix corresponding to a slowly-adapting beamformer and forming a single-channel denoised estimate for each frequency band in the STFT; calculating a voice activity detection (VAD) estimate for each frequency band in the STFT by comparing a magnitude of the single-channel denoised estimate to a magnitude of each input channel of the multichannel audio stream; and selectively updating or refraining from updating, responsive to the VAD estimate respectively indicating a presence or an absence of speech, a second complex-valued covariance matrix corresponding to a quickly-adapting beamformer; and controlling, by a hardware processor, a voice user interface based device to perform a user perceptible action, responsive to an output of at least the quickly-adapting beamformer.

2. The method in accordance with claim 1, further comprising configuring the slowly-adapting beamformer and the quickly-adapting beamformer to use different equations for an update rule and different adaptation speeds.

3. The method in accordance with claim 1, further comprising configuring the slowly-adapting beamformer and the quickly-adapting beamformer to both use a same equation for an update rule, but have different adaptation speeds.

4. The method in accordance with claim 3, further comprising: configuring the slowly-adapting beamformer to slowly adapt to an acoustic background including background noise in the multichannel audio stream; and configuring the quickly-adapting beamformer to quickly adapt to speech in the multichannel audio stream.

5. The method in accordance with claim 1, wherein the method further comprises, responsive to the magnitude of the single-channel denoised estimate being lower than the magnitude of a given input channel of the multichannel audio stream by a threshold amount, considering a particular frequency band in the STFT corresponding to the single-channel denoised estimate to be noise and selectively refraining from updating the second complex-valued covariance matrix corresponding to the quickly-adapting beamformer.

6. The method in accordance with claim 1, further comprising applying the method in a beamforming strategy configured to use a correlation matrix of any of the multichannel audio stream and noise.

7. The method in accordance with claim 1, wherein forming the single-channel denoised estimate comprises calculating a beamformer vector at each frequency band in the STFT.

8. The method in accordance with claim 7, wherein the beamformer vector is calculated at each frequency band in the STFT responsive to the frequency-domain representation of the current one of the plurality of windows.

9. The method in accordance with claim 7, wherein the single-channel denoised estimate is determined by multiplying the beamformer vector by a corresponding beamformer weight vector of the STFT.

10. The method in accordance with claim 1, wherein forming the single-channel denoised estimate comprises calculating a minimum variance distortionless response (MVDR) beamformer vector at each frequency band in the STFT.

11. The method in accordance with claim 1, wherein forming the single-channel denoised estimate comprises calculating a minimum power distortionless response (MPDR) beamformer vector at each frequency band in the STFT.

12. The method in accordance with claim 1, further comprising receiving the multichannel audio stream from a microphone array and a desired source direction-of-arrival (DOA).

13. The method in accordance with claim 1, wherein the transforming comprises applying frequency-domain processing to the complex valued frequency-domain representation to obtain processed audio, and the method further includes using an inverse STFT and an overlap-add procedure to transform the processed audio from a frequency domain to a time domain and provide a time-domain representation.

14. The method in accordance with claim 13, further comprising performing the method on-line using a first buffer configured to collect input samples and a second buffer configured to feed out the processed audio.

15. The method in accordance with claim 1, further comprising performing the method by an automatic speech recognition system operatively coupled to a voice user interface of the voice user interface based device and configured to convert the speech to commands to control the voice user interface based device to perform the user perceptible action.

16. The method in accordance with claim 1, wherein the plurality of windows are exponential windows.

17. The method in accordance with claim 1, wherein the plurality of windows are rectangular windows of different lengths, wherein the slowly-adapting beamformer uses a time window having a longer length than the quickly-adapting beamformer to capture a stationarity of time.

18. The method in accordance with claim 1, wherein the method is performed off-line.

19. The method in accordance with claim 1, further comprising writing a covariance update step and beamformer vector calculation step in a mathematically equivalent form to avoid computing matrix inversions and limit to algebraic operations.

20. The method in accordance with claim 1, further comprising weighting speech captured from a driver's location more heavily than speech captured from a passenger's location.

21. The method in accordance with claim 1, further comprising weighting speech captured from a front seat passenger's location more heavily than speech captured from a back seat passenger's location.

22. A computer program product configured to enable voice control, the computer program product comprising one or more non-transitory computer-readable media, having instructions stored thereon that when executed by one or more processors cause the one or more processors, individually or in combination, to perform a method comprising: transforming, using a short-time Fourier transform (STFT) applied to data in each of a plurality of windows aligned across each input channel of a multichannel audio stream, the multichannel audio stream into a complex valued frequency-domain representation, wherein for a current one of the plurality of windows, the method comprises: updating a first complex-valued covariance matrix corresponding to a slowly-adapting beamformer and forming a single-channel denoised estimate for each frequency band in the STFT; calculating a voice activity detection (VAD) estimate for each frequency band in the STFT by comparing a magnitude of the single-channel denoised estimate to a magnitude of each input channel of the multichannel audio stream; and selectively updating or refraining from updating, responsive to the VAD estimate respectively indicating a presence or an absence of speech, a second complex-valued covariance matrix corresponding to a quickly-adapting beamformer; and controlling, by a hardware processor, a voice user interface based device to perform a user perceptible action, responsive to an output of at least the quickly-adapting bea
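Claim 19 refers to rewriting the covariance update and the beamformer vector calculation so that no explicit matrix inversion is computed. The claim text shown here does not name the identity used. One common way to get such a form, shown below as an assumption rather than as the patent's method, is the Sherman-Morrison rank-one identity: for the exponentially weighted update $R' = \alpha R + (1-\alpha)\,x x^H$ with Hermitian $P = R^{-1}$,

$$P' = \frac{1}{\alpha}\left(P - \frac{(Px)(Px)^H}{\frac{\alpha}{1-\alpha} + x^H P x}\right).$$

The sketch tracks the inverse covariance directly and computes MPDR-style weights from it. The function names and the choice of exponential weighting are hypothetical.

```python
# Illustrative sketch only: inversion-free tracking of P = R^-1 for one band.
# The Sherman-Morrison identity is an assumed choice, not quoted from the patent.

def sherman_morrison_update(P, x, alpha):
    """Return the inverse of alpha * R + (1 - alpha) * x x^H, given P = R^-1.

    P must be Hermitian; only multiplies, adds, and one scalar divide are used.
    """
    n = len(x)
    Px = [sum(P[i][k] * x[k] for k in range(n)) for i in range(n)]
    # x^H P x is real for Hermitian P, so keep only the real part.
    quad = sum(x[i].conjugate() * Px[i] for i in range(n)).real
    denom = alpha / (1 - alpha) + quad
    return [[(P[i][j] - Px[i] * Px[j].conjugate() / denom) / alpha
             for j in range(n)] for i in range(n)]

def mpdr_from_inverse(P, d):
    """Return w = P d / (d^H P d) using the tracked inverse covariance P."""
    n = len(d)
    u = [sum(P[i][k] * d[k] for k in range(n)) for i in range(n)]
    denom = sum(d[i].conjugate() * u[i] for i in range(n))
    return [ui / denom for ui in u]
```

With this form, each window costs a few matrix-vector products per frequency band instead of a full inversion, which is the kind of saving the claim appears to target.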
Microphone arrays; Beamforming · CPC title
characterised by the method used for estimating noise · CPC title
Noise filtering · CPC title
Detection of presence or absence of voice signals (switching of direction of transmission by voice frequency in two-way loud-speaking telephone systems H04M9/10) · CPC title
Speech recognition (G10L17/00 takes precedence) · CPC title