Voice activity detection using a soft decision mechanism

US11670325B2 · US · B2

Patent metadata
Field               Value
Publication number  US-11670325-B2
Application number  US-202016880560-A
Country             US
Kind code           B2
Filing date         May 21, 2020
Priority date       Aug 1, 2013
Publication date    Jun 6, 2023
Grant date          Jun 6, 2023

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

Voice activity detection (VAD) is an enabling technology for a variety of speech-based applications. Herein disclosed is a robust VAD algorithm that is also language independent. Rather than classifying short segments of the audio as either “speech” or “silence”, the VAD as disclosed herein employs a soft-decision mechanism. The VAD outputs a speech-presence probability, which is based on a variety of characteristics.
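
As a rough illustration of the soft decision, the first claim of this publication combines four per-feature speech probabilities as Q = sqrt(p_B · max(p_E, p_P, p_R)), where the E, P, and R terms are smoothed values. The Python sketch below assumes each input is already a probability in [0, 1]. The function name, parameter names, and sample values are hypothetical and are not taken from the patent.

```python
import math


def activity_probability(p_band, p_energy, p_peak, p_resid):
    """Combine per-feature speech probabilities into Q for one frame.

    Q = sqrt(p_band * max(p_energy, p_peak, p_resid)), following the
    equation quoted in claim 1. Input validation is an added assumption.
    """
    probs = (p_band, p_energy, p_peak, p_resid)
    if any(not 0.0 <= p <= 1.0 for p in probs):
        raise ValueError("each probability must lie in [0, 1]")
    return math.sqrt(p_band * max(p_energy, p_peak, p_resid))


# Hypothetical per-frame inputs, chosen only for illustration.
q = activity_probability(p_band=0.81, p_energy=0.2, p_peak=1.0, p_resid=0.5)
print(f"Q = {q:.2f}")
```

Because the band energy term multiplies the maximum of the other three, a frame with zero band energy probability yields Q = 0 no matter how high the other features score.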

First claim

Opening claim text (preview).

The invention claimed is:

1. A computing system, comprising: a processor having an input port for receiving audio data; and a storage system comprising a storage medium comprising executable instructions, wherein the processor is configured to execute the executable instructions, that, when executed by the at least one processor, cause the at least one processor to: calculate an activity probability Q for the audio data based on values calculated based on energy features of the audio data; and output the activity probability Q to an external device, wherein the activity probability Q is given by the equation: $Q = \sqrt{p_B \cdot \max\{\tilde{p}_E, \tilde{p}_P, \tilde{p}_R\}}$ where: $p_B$ is band energy speech probability; $p_E$ is overall energy speech probability; $p_P$ is spectral peakiness speech probability; and $p_R$ is residual energy speech probability; and whereby Q greater than the threshold indicates voice in the audio data.

2. The computing system of claim 1, wherein the residual energy speech probability ($p_R$) is obtained by: $p_R = \left(1 - \frac{\varepsilon}{\sum_{k=1}^{F} (x_k)^2}\right)^2$. $\tilde{p}_R = \alpha \cdot \tilde{p}_R + (1 - \alpha) \cdot p_R$.

3. The computing system of claim 1, wherein the executable instructions, when executed by the processor, further cause the processor to: segment the audio data into a sequence of frames, calculate the activity probability for each frame in the sequence, wherein the activity probability corresponds to a probability that the frame contains speech; determine, frame-by-frame, a state of each frame in the sequence as either speech or non-speech by comparing a moving average of activity probabilities for a group of frames, including the frame, to a selected threshold, wherein the selected threshold for a particular frame depends on the determined state of a frame proceeding the particular frame in the sequence, identify non-speech segments in the audio data based upon the determined states of the frames; and deactivate subsequent processing of the non-speech segments in the audio data.

4. The computing system of claim 3, wherein the selected threshold for a frame following a non-speech frame is a maximum activity probability, which the moving average must exceed for the state of the frame to be determined as speech.

5. The computing system of claim 3, wherein the selected threshold for a frame following a speech frame is a minimum activity probability, which the moving average must be below for the state of the frame to be determined as non-speech.

6. The computing system of claim 3, wherein the activity probability for a frame is a combination of a plurality of different speech probabilities computed using the audio data of the frame.

7. The computing system of claim 6, wherein the plurality of different speech probabilities comprises: an overall energy speech probability based on an overall the energy of the audio data; a band energy speech probability based on an energy of the audio data contained within one or more spectral bands; a spectral peakiness speech probability based on an energy of the audio data that is concentrated in one or more spectral peaks; and a residual energy speech probability based on a residual energy resulting from a linear prediction of the audio data.

8. The computing system of claim 7, wherein the overall energy speech probability, the band energy speech probability, the spectral peakiness probability and the residual energy speech probability each have a value between 0 and 1, wherein 0 corresponds to non-speech and 1 corresponds to speech.

9. The computing system of claim 8, wherein the activity probability is the square root of the band energy speech probability multiplied by the largest of the overall energy probability, the spectral peakiness probability, and the residual energy probability.

10. The computing system of claim 3, wherein each non-speech segment corresponds to audio data in one or more consecutive non-speech frames bordered in the sequence by speech frames.

11. The computing system of claim 10, wherein each speech segment corresponds to audio data in one or more consecutive speech frames bordered in the sequence by non-speech frames.

12. A method for identifying speech and non-speech segments in audio data, the method comprising: calculating an activity probability Q for the audio data based on values calculated based on energy features of the audio data; and outputting the activity probability Q to an external device, wherein the activity probability Q is given by the equation: $Q = \sqrt{p_B \cdot \max\{\tilde{p}_E, \tilde{p}_P, \tilde{p}_R\}}$ where: $p_B$ is band energy speech probability; $p_E$ is overall energy speech probability; $p_P$ is spectral peakiness speech probability; and $p_R$ is residual energy speech probability; identifying segments in the audio data containing non-speech data according to the activity probability Q; and detecting voice activity by comparing Q to a threshold, whereby Q greater than the threshold indicates voice in the audio data.

13. The method of claim 12, further comprising: segmenting the audio data into a sequence of
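
The residual energy term of claim 2 and the frame-by-frame decision of claims 3 to 5 can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the patented implementation: it reads the claim 2 expression as one minus the ratio of the linear-prediction residual energy to the frame energy, squared; it uses a trailing (causal) moving average; and it starts in the non-speech state. All function names, the smoothing factor, the window length, and the two thresholds are hypothetical values chosen for illustration.

```python
from collections import deque


def residual_probability(residual_energy, frame):
    """Instantaneous p_R = (1 - eps / sum(x_k^2))^2 (assumed reading of claim 2)."""
    frame_energy = sum(x * x for x in frame)
    if frame_energy <= 0.0:
        return 0.0  # assumption: an all-zero frame is treated as non-speech
    ratio = min(residual_energy / frame_energy, 1.0)  # clamp is an assumption
    return (1.0 - ratio) ** 2


def smooth(previous, current, alpha=0.9):
    """Recursive smoothing: p~ = alpha * p~ + (1 - alpha) * p."""
    return alpha * previous + (1.0 - alpha) * current


def classify_frames(probabilities, window=3, enter_threshold=0.6, exit_threshold=0.4):
    """Label each frame speech (True) or non-speech (False) with hysteresis.

    After a non-speech frame the moving average must exceed enter_threshold
    to switch to speech; after a speech frame it must fall below
    exit_threshold to switch back to non-speech.
    """
    recent = deque(maxlen=window)  # trailing window, an assumption
    speech = False                 # assumed initial state
    states = []
    for q in probabilities:
        recent.append(q)
        average = sum(recent) / len(recent)
        if speech and average < exit_threshold:
            speech = False
        elif not speech and average > enter_threshold:
            speech = True
        states.append(speech)
    return states


def non_speech_segments(states):
    """Return (start, end) frame index pairs, end exclusive, of non-speech runs."""
    segments, start = [], None
    for i, is_speech in enumerate(states):
        if not is_speech and start is None:
            start = i
        elif is_speech and start is not None:
            segments.append((start, i))
            start = None
    if start is not None:
        segments.append((start, len(states)))
    return segments
```

Unlike claim 10, which describes non-speech segments bordered by speech frames, this helper also reports leading and trailing non-speech runs. A caller that wants to skip downstream processing of silence could iterate over the returned index pairs and drop those frames.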

Assignees

Verint Systems Ltd

Classifications

  • G10L25/78 (Primary)

    CPC title: Detection of presence or absence of voice signals (switching of direction of transmission by voice frequency in two-way loud-speaking telephone systems H04M9/10)

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US11670325B2 cover?
Voice activity detection (VAD) is an enabling technology for a variety of speech-based applications. Herein disclosed is a robust VAD algorithm that is also language independent. Rather than classifying short segments of the audio as either “speech” or “silence”, the VAD as disclosed herein employs a soft-decision mechanism. The VAD outputs a speech-presence probability, which is based on a v…
Who is the assignee on this patent?
Verint Systems Ltd
What technology area does this patent fall under?
Primary CPC classification G10L25/78. Mapped technology areas include Physics.
When was this patent published?
Publication date: Jun 6, 2023 (kind code B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 8 related publications on this page (citations in our corpus or others sharing the same primary CPC).