Method and system for multiple time resolution audio processing

US12475906B2 · US · B2

Patent metadata
Publication number: US-12475906-B2
Application number: US-202318450784-A
Country: US
Kind code: B2
Filing date: Aug 16, 2023
Priority date: Aug 16, 2023
Publication date: Nov 18, 2025
Grant date: Nov 18, 2025

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

Aspects of the present disclosure provide a method for voice control that includes transforming, using a short-time Fourier transform (STFT) applied to data in each window aligned across each input channel of a multichannel audio stream, the multichannel audio stream into a complex-valued frequency-domain representation. For a current window, the method further includes: updating a first complex-valued covariance matrix corresponding to a slowly-adapting beamformer and forming a single-channel denoised estimate for each frequency band in the STFT; calculating a voice activity detection (VAD) estimate for each frequency band in the STFT by comparing a magnitude of the single-channel denoised estimate to a magnitude of each input channel of the multichannel audio stream; and selectively updating or refraining from updating, responsive to the VAD estimate respectively indicating a presence or an absence of speech, a second complex-valued covariance matrix corresponding to a quickly-adapting beamformer.
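
The per-window steps in the abstract can be sketched for a single frequency band as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the function names, the exponential-forgetting update, the MPDR-style weights, the forgetting factors `a_slow` and `a_fast`, and the ratio-based VAD threshold are all hypothetical choices that the abstract does not specify. The sketch stops at the gated update of the second covariance matrix and does not form the quickly-adapting beamformer's output.

```python
# Illustrative only: every name and constant here is an assumption.

def outer_update(R, x, alpha):
    """Exponentially weighted update: alpha*R + (1 - alpha) * x x^H."""
    n = len(x)
    return [[alpha * R[i][j] + (1 - alpha) * x[i] * x[j].conjugate()
             for j in range(n)] for i in range(n)]

def solve(A, b):
    """Solve A z = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]  # augmented copy
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= f * M[c][k]
    z = [0j] * n
    for i in reversed(range(n)):
        s = sum(M[i][k] * z[k] for k in range(i + 1, n))
        z[i] = (M[i][n] - s) / M[i][i]
    return z

def mpdr_weights(R, d):
    """w = R^{-1} d / (d^H R^{-1} d), distortionless toward steering vector d."""
    z = solve(R, d)
    denom = sum(di.conjugate() * zi for di, zi in zip(d, z))
    return [zi / denom for zi in z]

def process_band(x, d, R_slow, R_fast, a_slow=0.99, a_fast=0.8, ratio=0.5):
    """One window, one frequency band. x holds one complex STFT bin per channel."""
    # 1) Always update the slowly-adapting covariance, then beamform with it.
    R_slow = outer_update(R_slow, x, a_slow)
    w = mpdr_weights(R_slow, d)
    y = sum(wi.conjugate() * xi for wi, xi in zip(w, x))  # denoised estimate
    # 2) VAD: compare |y| with the input channel magnitudes (assumed ratio test).
    speech = abs(y) >= ratio * max(abs(xi) for xi in x)
    # 3) Update the quickly-adapting covariance only when speech is indicated.
    if speech:
        R_fast = outer_update(R_fast, x, a_fast)
    return y, speech, R_slow, R_fast
```

With a two-channel steering vector `d = [1, 1]`, a frame such as `[2, 2]` that arrives from the steered direction passes through undistorted and triggers the fast update, while a frame such as `[1, -1]` that the beamformer cancels is flagged as noise and leaves `R_fast` untouched.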

First claim

Opening claim text (preview).

What is claimed is:

1. A method for voice control, comprising: transforming, using a short-time Fourier transform (STFT) applied to data in each of a plurality of windows aligned across each input channel of a multichannel audio stream, the multichannel audio stream into a complex valued frequency-domain representation, wherein for a current one of the plurality of windows, the method comprises: updating a first complex-valued covariance matrix corresponding to a slowly-adapting beamformer and forming a single-channel denoised estimate for each frequency band in the STFT; calculating a voice activity detection (VAD) estimate for each frequency band in the STFT by comparing a magnitude of the single-channel denoised estimate to a magnitude of each input channel of the multichannel audio stream; and selectively updating or refraining from updating, responsive to the VAD estimate respectively indicating a presence or an absence of speech, a second complex-valued covariance matrix corresponding to a quickly-adapting beamformer; and controlling, by a hardware processor, a voice user interface based device to perform a user perceptible action, responsive to an output of at least the quickly-adapting beamformer.

2. The method in accordance with claim 1, further comprising configuring the slowly-adapting beamformer and the quickly-adapting beamformer to use different equations for an update rule and different adaptation speeds.

3. The method in accordance with claim 1, further comprising configuring the slowly-adapting beamformer and the quickly-adapting beamformer to both use a same equation for an update rule, but have different adaptation speeds.

4. The method in accordance with claim 3, further comprising: configuring the slowly-adapting beamformer to slowly adapt to an acoustic background including background noise in the multichannel audio stream; and configuring the quickly-adapting beamformer to quickly adapt to speech in the multichannel audio stream.

5. The method in accordance with claim 1, wherein the method further comprises, responsive to the magnitude of the single-channel denoised estimate being lower than the magnitude of a given input channel of the multichannel audio stream by a threshold amount, considering a particular frequency band in the STFT corresponding to the single-channel denoised estimate to be noise and selectively refraining from updating the second complex-valued covariance matrix corresponding to the quickly-adapting beamformer.

6. The method in accordance with claim 1, further comprising applying the method in a beamforming strategy configured to use a correlation matrix of any of the multichannel audio stream and noise.

7. The method in accordance with claim 1, wherein forming the single-channel denoised estimate comprises calculating a beamformer vector at each frequency band in the STFT.

8. The method in accordance with claim 7, wherein the beamformer vector is calculated at each frequency band in the STFT responsive to the frequency-domain representation of the current one of the plurality of windows.

9. The method in accordance with claim 7, wherein the single-channel denoised estimate is determined by multiplying the beamformer vector by a corresponding beamformer weight vector of the STFT.

10. The method in accordance with claim 1, wherein forming the single-channel denoised estimate comprises calculating a minimum variance distortionless response (MVDR) beamformer vector at each frequency band in the STFT.

11. The method in accordance with claim 1, wherein forming the single-channel denoised estimate comprises calculating a minimum power distortionless response (MPDR) beamformer vector at each frequency band in the STFT.

12. The method in accordance with claim 1, further comprising receiving the multichannel audio stream from a microphone array and a desired source direction-of-arrival (DOA).

13. The method in accordance with claim 1, wherein the transforming comprises applying frequency-domain processing to the complex valued frequency-domain representation to obtain processed audio, and the method further includes using an inverse STFT and an overlap-add procedure to transform the processed audio from a frequency domain to a time domain and provide a time-domain representation.

14. The method in accordance with claim 13, further comprising performing the method on-line using a first buffer configured to collect input samples and a second buffer configured to feed out the processed audio.

15. The method in accordance with claim 1, further comprising performing the method by an automatic speech recognition system operatively coupled to a voice user interface of the voice user interface based device and configured to convert the speech to commands to control the voice user interface based device to perform the user perceptible action.

16. The method in accordance with claim 1, wherein the plurality of windows are exponential windows.

17. The method in accordance with claim 1, wherein the plurality of windows are rectangular windows of different lengths, wherein the slowly-adapting beamformer uses a time window having a longer length than the quickly-adapting beamformer to capture a stationarity of time.

18. The method in accordance with claim 1, wherein the method is performed off-line.

19. The method in accordance with claim 1, further comprising writing a covariance update step and beamformer vector calculation step in a mathematically equivalent form to avoid computing matrix inversions and limit to algebraic operations.

20. The method in accordance with claim 1, further comprising weighting speech captured from a driver's location more heavily than speech captured from a passenger's location.

21. The method in accordance with claim 1, further comprising weighting speech captured from a front seat passenger's location more heavily than speech captured from a back seat passenger's location.

22. A computer program product configured to enable voice control, the computer program product comprising one or more non-transitory computer-readable media, having instructions stored thereon that when executed by one or more processors cause the one or more processors, individually or in combination, to perform a method comprising: transforming, using a short-time Fourier transform (STFT) applied to data in each of a plurality of windows aligned across each input channel of a multichannel audio stream, the multichannel audio stream into a complex valued frequency-domain representation, wherein for a current one of the plurality of windows, the method comprises: updating a first complex-valued covariance matrix corresponding to a slowly-adapting beamformer and forming a single-channel denoised estimate for each frequency band in the STFT; calculating a voice activity detection (VAD) estimate for each frequency band in the STFT by comparing a magnitude of the single-channel denoised estimate to a magnitude of each input channel of the multichannel audio stream; and selectively updating or refraining from updating, responsive to the VAD estimate respectively indicating a presence or an absence of speech, a second complex-valued covariance matrix corresponding to a quickly-adapting beamformer; and controlling, by a hardware processor, a voice user interface based device to perform a user perceptible action, responsive to an output of at least the quickly-adapting bea…
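
Claim 19 refers to rewriting the covariance update and the beamformer vector calculation so that no matrix inversion is computed. The claim text does not say which identity is used. One standard way to get that effect, shown here purely as an assumption, is to track the inverse covariance directly with the Sherman-Morrison rank-one identity. For an exponentially weighted update R' = αR + (1 − α) x xᴴ with P = R⁻¹, the identity gives P' = (1/α) [P − (P x xᴴ P) / (α/(1 − α) + xᴴ P x)]. The function names and the forgetting factor `alpha` below are hypothetical.

```python
# Illustrative only: Sherman-Morrison is an assumed choice, not quoted from the claims.

def inverse_update(P, x, alpha):
    """Given P = R^{-1}, return the inverse of alpha*R + (1 - alpha) * x x^H.

    Uses only multiplications, additions, and one scalar division per entry.
    """
    n = len(x)
    Px = [sum(P[i][k] * x[k] for k in range(n)) for i in range(n)]              # P x
    xP = [sum(x[k].conjugate() * P[k][j] for k in range(n)) for j in range(n)]  # x^H P
    q = sum(x[i].conjugate() * Px[i] for i in range(n))                         # x^H P x
    denom = alpha / (1 - alpha) + q
    return [[(P[i][j] - Px[i] * xP[j] / denom) / alpha for j in range(n)]
            for i in range(n)]

def weights_from_inverse(P, d):
    """Distortionless weights w = P d / (d^H P d), computed without a linear solve."""
    n = len(d)
    z = [sum(P[i][k] * d[k] for k in range(n)) for i in range(n)]
    denom = sum(d[i].conjugate() * z[i] for i in range(n))
    return [zi / denom for zi in z]
```

In this form each window costs a few matrix-vector products per frequency band instead of a full inversion, which is the kind of saving the claim appears to target.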

Assignees

Analog Devices Inc

Inventors

Classifications

  • Microphone arrays; Beamforming · CPC title

  • characterised by the method used for estimating noise · CPC title

  • Noise filtering · CPC title

  • Detection of presence or absence of voice signals (switching of direction of transmission by voice frequency in two-way loud-speaking telephone systems H04M9/10) · CPC title

  • Speech recognition (G10L17/00 takes precedence) · CPC title

Patent family

Related publications grouped by family.

External sources

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US12475906B2 cover?
Aspects of the present disclosure provide a method for voice control that includes transforming, using a short-time Fourier transform (STFT) applied to data in each window aligned across each input channel of a multichannel audio stream, the multichannel audio stream into a complex-valued frequency-domain representation. For a current window, the method further includes: updating a first com…
Who is the assignee on this patent?
Analog Devices Inc
What technology area does this patent fall under?
Primary CPC classification G10L21/0216. Mapped technology areas include Physics.
When was this patent published?
Publication date Nov 18, 2025 (B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 6 related publications on this page (citations in our corpus or others sharing the same primary CPC).