Automatic speaker identification using speech recognition features
US-2017140761-A1 · May 18, 2017 · US
US11670325B2 · US · B2
| Field | Value |
|---|---|
| Publication number | US-11670325-B2 |
| Application number | US-202016880560-A |
| Country | US |
| Kind code | B2 |
| Filing date | May 21, 2020 |
| Priority date | Aug 1, 2013 |
| Publication date | Jun 6, 2023 |
| Grant date | Jun 6, 2023 |
Official abstract text for this publication.
Voice activity detection (VAD) is an enabling technology for a variety of speech-based applications. Herein disclosed is a robust VAD algorithm that is also language independent. Rather than classifying short segments of the audio as either “speech” or “silence”, the VAD as disclosed herein employs a soft-decision mechanism. The VAD outputs a speech-presence probability, which is based on a variety of characteristics.
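As a rough illustration of the soft-decision output, the sketch below combines four per-frame speech probabilities using the rule recited in claim 1 of this publication, Q = sqrt(p_B * max(p_E, p_P, p_R)), together with the exponential smoothing form recited in claim 2. The function names, the default `alpha = 0.9`, and the input validation are assumptions made for the example; the record shown here does not specify them, and it does not say how the individual probabilities are computed.

```python
import math


def smooth(prev_smoothed: float, current: float, alpha: float = 0.9) -> float:
    """Exponential smoothing of a probability: p~ = alpha * p~ + (1 - alpha) * p.

    alpha = 0.9 is an illustrative default, not a value taken from the patent.
    """
    return alpha * prev_smoothed + (1.0 - alpha) * current


def activity_probability(p_band: float, p_energy: float,
                         p_peak: float, p_resid: float) -> float:
    """Combine four speech probabilities into one activity probability Q.

    Q = sqrt(p_band * max(p_energy, p_peak, p_resid)), the form recited in claim 1.
    All inputs are expected to lie in [0, 1] (0 = non-speech, 1 = speech).
    """
    for p in (p_band, p_energy, p_peak, p_resid):
        if not 0.0 <= p <= 1.0:
            raise ValueError("probabilities must lie in [0, 1]")
    return math.sqrt(p_band * max(p_energy, p_peak, p_resid))


if __name__ == "__main__":
    # Hypothetical per-frame values, chosen only to exercise the formula.
    q = activity_probability(p_band=0.81, p_energy=0.2, p_peak=1.0, p_resid=0.5)
    print(f"Q = {q:.3f}")
```

Because the band energy term multiplies the maximum of the other three, a frame with little energy in the speech bands gets a low Q even if one of the other cues is high, while any single strong cue among the other three is enough once band energy is present.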
Opening claim text (preview).
The invention claimed is:
1. A computing system, comprising: a processor having an input port for receiving audio data; and a storage system comprising a storage medium comprising executable instructions, wherein the processor is configured to execute the executable instructions, that, when executed by the at least one processor, cause the at least one processor to: calculate an activity probability Q for the audio data based on values calculated based on energy features of the audio data; and output the activity probability Q to an external device, wherein the activity probability Q is given by the equation: $Q = \sqrt{p_B \cdot \max\{\tilde{p}_E, \tilde{p}_P, \tilde{p}_R\}}$ where: $p_B$ is band energy speech probability; $p_E$ is overall energy speech probability; $p_P$ is spectral peakiness speech probability; and $p_R$ is residual energy speech probability; and whereby Q greater than the threshold indicates voice in the audio data.
2. The computing system of claim 1, wherein the residual energy speech probability ($p_R$) is obtained by: $p_R = \left(1 - \frac{\varepsilon}{\sum_{k=1}^{F} (x_k)^2}\right)^2$, $\tilde{p}_R = \alpha \cdot \tilde{p}_R + (1 - \alpha) \cdot p_R$.
3. The computing system of claim 1, wherein the executable instructions, when executed by the processor, further cause the processor to: segment the audio data into a sequence of frames, calculate the activity probability for each frame in the sequence, wherein the activity probability corresponds to a probability that the frame contains speech; determine, frame-by-frame, a state of each frame in the sequence as either speech or non-speech by comparing a moving average of activity probabilities for a group of frames, including the frame, to a selected threshold, wherein the selected threshold for a particular frame depends on the determined state of a frame proceeding the particular frame in the sequence, identify non-speech segments in the audio data based upon the determined states of the frames; and deactivate subsequent processing of the non-speech segments in the audio data.
4. The computing system of claim 3, wherein the selected threshold for a frame following a non-speech frame is a maximum activity probability, which the moving average must exceed for the state of the frame to be determined as speech.
5. The computing system of claim 3, wherein the selected threshold for a frame following a speech frame is a minimum activity probability, which the moving average must be below for the state of the frame to be determined as non-speech.
6. The computing system of claim 3, wherein the activity probability for a frame is a combination of a plurality of different speech probabilities computed using the audio data of the frame.
7. The computing system of claim 6, wherein the plurality of different speech probabilities comprises: an overall energy speech probability based on an overall the energy of the audio data; a band energy speech probability based on an energy of the audio data contained within one or more spectral bands; a spectral peakiness speech probability based on an energy of the audio data that is concentrated in one or more spectral peaks; and a residual energy speech probability based on a residual energy resulting from a linear prediction of the audio data.
8. The computing system of claim 7, wherein the overall energy speech probability, the band energy speech probability, the spectral peakiness probability and the residual energy speech probability each have a value between 0 and 1, wherein 0 corresponds to non-speech and 1 corresponds to speech.
9. The computing system of claim 8, wherein the activity probability is the square root of the band energy speech probability multiplied by the largest of the overall energy probability, the spectral peakiness probability, and the residual energy probability.
10. The computing system of claim 3, wherein each non-speech segment corresponds to audio data in one or more consecutive non-speech frames bordered in the sequence by speech frames.
11. The computing system of claim 10, wherein each speech segment corresponds to audio data in one or more consecutive speech frames bordered in the sequence by non-speech frames.
12. A method for identifying speech and non-speech segments in audio data, the method comprising: calculating an activity probability Q for the audio data based on values calculated based on energy features of the audio data; and outputting the activity probability Q to an external device, wherein the activity probability Q is given by the equation: $Q = \sqrt{p_B \cdot \max\{\tilde{p}_E, \tilde{p}_P, \tilde{p}_R\}}$ where: $p_B$ is band energy speech probability; $p_E$ is overall energy speech probability; $p_P$ is spectral peakiness speech probability; and $p_R$ is residual energy speech probability; identifying segments in the audio data containing non-speech data according to the activity probability Q; and detecting voice activity by comparing Q to a threshold, whereby Q greater than the threshold indicates voice in the audio data.
13. The method of claim 12, further comprising: segmenting the audio data into a sequence of
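The frame-by-frame decision described in claims 3 to 5 behaves like a two-threshold (hysteresis) state machine driven by a moving average of the activity probability. The following is a minimal sketch under stated assumptions, not the patented implementation: the trailing window of 5 frames, the thresholds 0.6 and 0.4, the initial non-speech state, and the names `classify_frames` and `non_speech_segments` are all hypothetical choices made for illustration. The claims only require that the averaged group include the current frame and that the threshold depend on the previous frame's state.

```python
from typing import List, Sequence, Tuple


def classify_frames(probs: Sequence[float], window: int = 5,
                    enter_speech: float = 0.6,
                    exit_speech: float = 0.4) -> List[bool]:
    """Label each frame True (speech) or False (non-speech).

    The moving average over the last `window` frames (including the current
    one) is compared to a threshold chosen by the previous frame's state:
    after non-speech it must exceed `enter_speech` to switch to speech;
    after speech it must fall below `exit_speech` to switch to non-speech.
    All default values are illustrative assumptions.
    """
    states: List[bool] = []
    in_speech = False  # assumption: the stream starts in the non-speech state
    for i in range(len(probs)):
        group = probs[max(0, i - window + 1): i + 1]
        average = sum(group) / len(group)
        if in_speech:
            if average < exit_speech:
                in_speech = False
        elif average > enter_speech:
            in_speech = True
        states.append(in_speech)
    return states


def non_speech_segments(states: Sequence[bool]) -> List[Tuple[int, int]]:
    """Return inclusive (start, end) frame index pairs of non-speech runs.

    Simplification: leading and trailing runs are also reported, although
    claim 10 describes segments bordered by speech frames.
    """
    segments: List[Tuple[int, int]] = []
    start = None
    for i, is_speech in enumerate(states):
        if not is_speech and start is None:
            start = i
        elif is_speech and start is not None:
            segments.append((start, i - 1))
            start = None
    if start is not None:
        segments.append((start, len(states) - 1))
    return segments


if __name__ == "__main__":
    # Hypothetical activity probabilities for eight frames.
    q = [0.1, 0.1, 0.9, 0.9, 0.9, 0.1, 0.1, 0.1]
    labels = classify_frames(q, window=2)
    print(labels)
    print(non_speech_segments(labels))
```

Using a higher threshold to enter speech than to leave it keeps the state from flickering when the averaged probability hovers near a single cut-off; frames in the returned non-speech segments are the ones a downstream stage could skip.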
Detection of presence or absence of voice signals (switching of direction of transmission by voice frequency in two-way loud-speaking telephone systems H04M9/10) · CPC title