Multi channel voice activity detection

US11790888B2 · US · B2

Patent metadata
Field: Value
Publication number: US-11790888-B2
Application number: US-202217806198-A
Country: US
Kind code: B2
Filing date: Jun 9, 2022
Priority date: Oct 22, 2020
Publication date: Oct 17, 2023
Grant date: Oct 17, 2023

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

A method for multi-channel voice activity detection includes receiving a sequence of input frames characterizing streaming multi-channel audio captured by an array of microphones. Each channel of the streaming multi-channel audio includes respective audio features captured by a separate dedicated microphone. The method also includes determining, using a location fingerprint model, a location fingerprint indicating a location of a source of the multi-channel audio relative to the user device based on the respective audio features of each channel of the multi-channel audio. The method also includes generating an output from an application-specific classifier. The first score indicates a likelihood that the multi-channel audio corresponds to a particular audio type that the particular application is configured to process. The method also includes determining whether to accept or reject the multi-channel audio for processing by the particular application based on the first score generated as output from the application-specific classifier.
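The flow in the abstract (input frames, then a location fingerprint, then a classifier score, then an accept or reject decision) can be sketched as follows. This is a minimal illustration, not the patented models: `toy_location_fingerprint`, `toy_classifier`, `gate_audio`, and the 0.6 threshold are hypothetical placeholders, and the relative-energy fingerprint is an assumption chosen only to keep the example runnable.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence

# One input frame: a feature vector per microphone channel.
Frame = Sequence[Sequence[float]]


@dataclass
class Decision:
    score: float
    accepted: bool


def toy_location_fingerprint(frames: Sequence[Frame]) -> List[float]:
    """Hypothetical stand-in for the location fingerprint model.

    Returns each channel's share of the total energy across all frames.
    """
    energy = [0.0] * len(frames[0])
    for frame in frames:
        for channel, features in enumerate(frame):
            energy[channel] += sum(x * x for x in features)
    total = sum(energy) or 1.0  # avoid dividing by zero on silence
    return [e / total for e in energy]


def toy_classifier(fingerprint: Sequence[float]) -> float:
    """Hypothetical stand-in for the application-specific classifier.

    Scores how concentrated the energy is in a single channel.
    """
    return max(fingerprint)


def gate_audio(
    frames: Sequence[Frame],
    fingerprint_model: Callable[[Sequence[Frame]], List[float]] = toy_location_fingerprint,
    classifier: Callable[[Sequence[float]], float] = toy_classifier,
    threshold: float = 0.6,  # assumed value, not taken from the patent
) -> Decision:
    """Accept or reject the audio based on the classifier score."""
    fingerprint = fingerprint_model(frames)
    score = classifier(fingerprint)
    return Decision(score=score, accepted=score >= threshold)
```

In a real system the two callables would be trained models; passing them as parameters keeps the accept/reject logic independent of how the fingerprint and score are produced.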

First claim

Opening claim text (preview).

What is claimed is:

1. A computer-implemented method when executed on data processing hardware of a user device causes the data processing hardware to perform operations comprising: receiving streaming multi-channel audio captured by an array of microphones in communication with the data processing hardware, each channel of the multi-channel audio comprising respective audio features captured by a separate dedicated microphone in the array of microphones; processing the respective audio features of each channel of the multi-channel audio to determine an embedding associated with a source of the multi-channel audio; based on the embedding associated with the source of the multi-channel audio, determining a first score indicating that the multi-channel audio originates from one of a single source location or a multiple source location; and processing, by a particular application, the multi-channel audio based on the first score indicating that the multi-channel audio originates from the single source location.

2. The computer-implemented method of claim 1, wherein the embedding associated with the source of the multi-channel audio comprises a location embedding indicating a location of the source of the multi-channel audio relative to the user device.

3. The computer-implemented method of claim 1, wherein the embedding associated with the source of the multi-channel audio comprises a direction embedding indicating a direction of the source of the multi-channel audio relative to the user device.

4. The computer-implemented method of claim 1, wherein determining the first score indicating the likelihood that the multi-channel audio originates from one of the single source location or the multiple source location comprises executing a classifier model configured to: receive, as input, the embedding associated with the multi-channel audio; and generate, as output, the first score indicating a likelihood that the multi-channel audio originates from one of the single source location or the multiple source location.

5. The computer-implemented method of claim 1, wherein the operations further comprise determining that the particular application is configured to process single source audio.

6. The computer-implemented method of claim 5, wherein the operations further comprise: determining that the first score satisfies a first score threshold; and based on determining that the first score satisfies the first score threshold, accepting the multi-channel audio for processing by the particular application.

7. The computer-implemented method of claim 5, wherein the operations further comprise: determining, using a voice activity detector (VAD) model, a second score indicating a likelihood that the multi-channel audio corresponds to human-originated speech, wherein processing the multi-channel audio by the particular application is further based on the second score indicating the likelihood that the multi-channel audio corresponds to human-originated speech.

8. The computer-implemented method of claim 7, wherein the operations further comprise: combining the first score and the second score into a combined score; determining that the combined score satisfies an acceptance threshold; and based on determining that the combined score satisfies the acceptance threshold, accepting the multi-channel audio for processing by the particular application.

9. The computer-implemented method of claim 1, wherein processing the respective audio features of each channel of the multi-channel audio to determine the embedding associated with a source of the multi-channel audio comprises processing each channel of the multi-channel audio using a time difference of arrival and gain model.

10. The computer-implemented method of claim 1, wherein processing the respective audio features of each channel of the multi-channel audio to determine the embedding associated with a source of the multi-channel audio comprises processing each channel of the multi-channel audio using a spatial probability model.

11. A system comprising: data processing hardware of a user device; and memory hardware in communication with the data processing hardware, the memory hardware storing instructions that when executed on the data processing hardware cause the data processing hardware to perform operations comprising: receiving streaming multi-channel audio captured by an array of microphones in communication with the data processing hardware, each channel of the multi-channel audio comprising respective audio features captured by a separate dedicated microphone in the array of microphones; processing the respective audio features of each channel of the multi-channel audio to determine an embedding associated with a source of the multi-channel audio; based on the embedding associated with the source of the multi-channel audio, determining a first score indicating that the multi-channel audio originates from one of a single source location or a multiple source location; and processing, by a particular application, the multi-channel audio based on the first score indicating that the multi-channel audio originates from the single source location.

12. The system of claim 11, wherein the embedding associated with the source of the multi-channel audio comprises a location embedding indicating a location of the source of the multi-channel audio relative to the user device.

13. The system of claim 11, wherein the embedding associated with the source of the multi-channel audio comprises a direction embedding indicating a direction of the source of the multi-channel audio relative to the user device.

14. The system of claim 11, wherein determining the first score indicating the likelihood that the multi-channel audio originates from one of the single source location or the multiple source location comprises executing a classifier model configured to: receive, as input, the embedding associated with the multi-channel audio; and generate, as output, the first score indicating a likelihood that the multi-channel audio originates from one of the single source location or the multiple source location.

15. The system of claim 11, wherein the operations further comprise determining that the particular application is configured to process single source audio.

16. The system of claim 15, wherein the operations further comprise: determining that the first score satisfies a first score threshold; and based on determining that the first score satisfies the first score threshold, accepting the multi-channel audio for processing by the particular application.

17. The system of claim 15, wherein the operations further comprise: determining, using a voice activity detector (VAD) model, a second score indicating a likelihood that the multi-channel audio corresponds to human-originated speech, wherein processing the multi-channel audio by the particular application is further based on the second score indicating the likelihood that the multi-channel audio corresponds to human-originated speech.

18. The system of claim 17, wherein the operations further comprise: combining the first score and the second score into a combined score; determining that the combined score satisfies an acceptance threshold; and based on determining that the combined score satisfies the acceptance threshold, accepting the multi-channel audio for processing by the particular application.

19. The system of claim 11, wherein processing the respective audio features of each channel of the multi-chan
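Claims 9 and 8 name two building blocks: a time difference of arrival (TDOA) and gain model, and a combined score checked against an acceptance threshold. The claim preview does not say how either is computed, so the sketch below swaps in classical stand-ins: brute-force cross-correlation for the TDOA, an RMS ratio for the gain, and a weighted average for the combination. All function names, the weight, and the threshold are hypothetical.

```python
import math
from typing import Sequence


def estimate_tdoa(ref: Sequence[float], other: Sequence[float], max_lag: int) -> int:
    """Lag in samples that maximizes the cross-correlation of two channels.

    A positive result means `other` received the sound later than `ref`.
    """
    best_lag, best_corr = 0, -math.inf
    for lag in range(-max_lag, max_lag + 1):
        corr = 0.0
        for i, sample in enumerate(ref):
            j = i + lag
            if 0 <= j < len(other):
                corr += sample * other[j]
        if corr > best_corr:
            best_lag, best_corr = lag, corr
    return best_lag


def rms(signal: Sequence[float]) -> float:
    """Root-mean-square level of one channel."""
    return math.sqrt(sum(x * x for x in signal) / len(signal))


def relative_gain(ref: Sequence[float], other: Sequence[float]) -> float:
    """RMS level of `other` relative to `ref` (0.0 if `ref` is silent)."""
    ref_level = rms(ref)
    return rms(other) / ref_level if ref_level > 0 else 0.0


def combine_scores(first: float, second: float, weight: float = 0.5) -> float:
    """Weighted average of the source-location score and the VAD score.

    The weighting scheme is an assumption; the claims only say 'combining'.
    """
    return weight * first + (1.0 - weight) * second


def accept(first: float, second: float, threshold: float = 0.75) -> bool:
    """Accept the audio when the combined score satisfies the threshold."""
    return combine_scores(first, second) >= threshold
```

For example, if one channel holds an impulse at sample 1 and another holds the same impulse at half amplitude at sample 3, `estimate_tdoa` reports a lag of 2 samples and `relative_gain` reports 0.5. Lag and gain pairs like these, taken across microphone pairs, are the kind of spatial cue an embedding model could consume.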

Assignees

Google LLC

Inventors

Classifications

  • G10L15/02 · Primary

    Feature extraction for speech recognition; Selection of recognition unit · CPC title

  • for combining the signals of two or more microphones (specially adapted for hearing aids H04R25/407) · CPC title

  • G10L25/78 · Primary

    Detection of presence or absence of voice signals (switching of direction of transmission by voice frequency in two-way loud-speaking telephone systems H04M9/10) · CPC title

  • for comparison or discrimination · CPC title

  • Microphone arrays; Beamforming · CPC title


Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US11790888B2 cover?
A method for multi-channel voice activity detection includes receiving a sequence of input frames characterizing streaming multi-channel audio captured by an array of microphones. Each channel of the streaming multi-channel audio includes respective audio features captured by a separate dedicated microphone. The method also includes determining, using a location fingerprint model, a location fi…
Who is the assignee on this patent?
Google LLC
What technology area does this patent fall under?
Primary CPC classification G10L15/02. Mapped technology areas include Physics.
When was this patent published?
Publication date: Oct 17, 2023 (kind code B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 4 related publications on this page (citations in our corpus or others sharing the same primary CPC).