Unsupervised acoustic model training
US-9401140-B1 · Jul 26, 2016 · US
US2017148444A1 · US · A1
| Field | Value |
|---|---|
| Publication number | US-2017148444-A1 |
| Application number | US-201514950670-A |
| Country | US |
| Kind code | A1 |
| Filing date | Nov 24, 2015 |
| Priority date | Nov 24, 2015 |
| Publication date | May 25, 2017 |
| Grant date | — |
Official abstract text for this publication.
Techniques related to key phrase detection for applications such as wake on voice are discussed. Such techniques may include updating a start state based rejection model and a key phrase model based on scores of sub-phonetic units from an acoustic model to generate a rejection likelihood score and a key phrase likelihood score and determining whether received audio input is associated with a predetermined key phrase based on the rejection likelihood score and the key phrase likelihood score.
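The detection flow in the abstract can be sketched roughly as follows. This is a minimal illustration, not the patented implementation: it assumes the acoustic scores are log-domain values keyed by sub-phonetic unit name, that each model state is updated with a Viterbi-style max over its incoming transitions, and that the final decision compares the difference between the key phrase score and the rejection score to a threshold. The function name `detect_key_phrase` and all of its parameters are hypothetical.

```python
from typing import Dict, Sequence


def detect_key_phrase(
    frames: Sequence[Dict[str, float]],
    rejection_units: Sequence[str],
    key_phrase_units: Sequence[str],
    threshold: float,
) -> bool:
    """Return True once the key phrase score beats the rejection score by `threshold`.

    frames: one dict per time step mapping sub-phonetic unit -> log score (assumed format).
    rejection_units: units attached as self loops to the single rejection state.
    key_phrase_units: units along the key phrase model, in order (must be non-empty).
    """
    neg_inf = float("-inf")
    rejection = 0.0                             # single start state
    states = [neg_inf] * len(key_phrase_units)  # key phrase states, not yet reached

    for scores in frames:
        prev_rejection = rejection
        prev_states = list(states)

        # Rejection state: take the best-scoring self loop for this frame.
        rejection = prev_rejection + max(scores[u] for u in rejection_units)

        # Key phrase states: stay in place or advance from the preceding state.
        for i, unit in enumerate(key_phrase_units):
            entering = prev_rejection if i == 0 else prev_states[i - 1]
            states[i] = max(prev_states[i], entering) + scores[unit]

        # Log likelihood score: final key phrase state versus rejection state.
        if states[-1] - rejection > threshold:
            return True
    return False
```

Because the rejection state precedes the key phrase model, the first key phrase state can be entered from it at any frame, which is what lets the sketch run continuously over incoming audio without a separate start-of-phrase detector.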
Opening claim text (preview).
What is claimed is:

1. A computer-implemented method for key phrase detection comprising: generating, via acoustic scoring of an acoustic model, a time series of scores of sub-phonetic units based on a time series of feature vectors representative of received audio input; updating a start state based rejection model and a key phrase model associated with a predetermined key phrase based on at least some of the time series of scores of sub-phonetic units to generate a rejection likelihood score and a key phrase likelihood score; and determining whether the received audio input is associated with the predetermined key phrase based on the rejection likelihood score and the key phrase likelihood score.
2. The method of claim 1, wherein the start state based rejection model comprises self loops associated with at least some of the scores of sub-phonetic units of the acoustic model.
3. The method of claim 1, wherein the start state based rejection model consists of a single state preceding the key phrase model.
4. The method of claim 1, wherein the key phrase model comprises a multi-state lexicon look up key phrase model having transitions associated with the lexicon look up for the predetermined key phrase.
5. The method of claim 4, wherein the key phrase likelihood score is associated with a final state of the multi-state lexicon look up key phrase model.
6. The method of claim 1, wherein determining whether the received audio input is associated with the predetermined key phrase comprises: determining a log likelihood score based on the rejection likelihood score and the key phrase likelihood score; and comparing the log likelihood score to a threshold.
7. The method of claim 1, wherein the acoustic model comprises a deep neural network and the time series of feature vectors comprises a first feature vector comprising a stack of a time series of coefficients each associated with a sampling time.
8. The method of claim 1, further comprising: updating a second key phrase model associated with a second predetermined key phrase based on at least some of the time series of scores of sub-phonetic units to generate a second key phrase likelihood score; and determining whether the received audio input is associated with the second predetermined key phrase based on the rejection likelihood score and the second key phrase likelihood score.
9. The method of claim 8, wherein the received audio input is associated with the second predetermined key phrase, the method further comprising: providing a system command corresponding to the second predetermined key phrase.
10. A system for performing key phrase detection comprising: a memory configured to store an acoustic model, a start state based rejection model, and a key phrase model associated with a predetermined key phrase; and a digital signal processor coupled to the memory, the digital signal processor to generate, based on the acoustic model, a time series of scores of sub-phonetic units based on a time series of feature vectors representative of an audio input, to update the start state based rejection model and the key phrase model based on at least some of the time series of scores of sub-phonetic units to generate a rejection likelihood score and a key phrase likelihood score, and to determine whether the received audio input is associated with the predetermined key phrase based on the rejection likelihood score and the key phrase likelihood score.
11. The system of claim 10, wherein the start state based rejection model comprises self loops associated with at least some of the scores of sub-phonetic units of the acoustic model.
12. The system of claim 10, wherein the start state based rejection model consists of a single state preceding the key phrase model.
13. The system of claim 10, wherein the key phrase model comprises a multi-state lexicon look up key phrase model having transitions associated with the lexicon look up for the predetermined key phrase.
14. The system of claim 13, wherein the key phrase likelihood score is associated with a final state of the multi-state lexicon look up key phrase model.
15. The system of claim 10, wherein the digital signal processor to determine whether the received audio input is associated with the predetermined key phrase comprises the digital signal processor to determine a log likelihood score based on the rejection likelihood score and the key phrase likelihood score and compare the log likelihood score to a threshold.
16. The system of claim 10, wherein the acoustic model comprises a deep neural network and the time series of feature vectors comprises a first feature vector comprising a stack of a time series of coefficients each associated with a sampling time.
17. The system of claim 10, wherein the digital signal processor is further to update a second key phrase model associated with a second predetermined key phrase based on at least some of the time series of scores of sub-phonetic units to generate a second key phrase likelihood score and determine whether the received audio input is associated with the second predetermined key phrase based on the rejection likelihood score and the second key phrase likelihood score.
18. At least one machine readable medium comprising a plurality of instructions that, in response to being executed on a device, cause the device to generate a key phrase detection model comprising a start state based rejection model, a key phrase model, and a pruned acoustic model by: training an acoustic model having a plurality of output nodes, the output nodes comprising a plurality of sub-phonetic units in the form of tied context-dependent triphone HMM-states, wherein each of the tied context-dependent triphone HMM-states is associated with one of a plurality of monophones; and generating a selected subset of the output nodes by: determining a usage rate for each of the sub-phonetic units during the training; including, in the selected subset, at least one output node corresponding to a highest usage rate sub-phonetic unit for each of the plurality of monophones; and including, in the selected subset, output nodes corresponding to nodes of the key phrase model.
19. The machine readable medium of claim 18, further comprising instructions that, in response to being executed on a computing device, cause the device to generate the key phrase detection model by: generating a pruned acoustic model having outputs consisting of the selected subset of the output nodes.
20. The machine readable medium of claim 18, wherein the plurality of output nodes of the acoustic model further comprise a plurality of non-speech nodes, and wherein the selected subset of the output nodes comprises the plurality of non-speech nodes.
21. The machine readable medium of claim 18, wherein determining the usage rate for each of the sub-phonetic units comprises incrementing a first usage rate associated with a first sub-phonetic unit when the first sub-phonetic unit has a non-zero output during the training of the acoustic model.
22. The machine readable medium of claim 18, wherein the start state based rejection model comprises a single state and self loops corresponding to the output nodes of the highest usage rate sub-phonetic unit for each of the plurality of monophones of the selected subset of the output nodes.
23. The machine readable medium of claim 18, wherein the
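The output-node selection recited in claims 18, 20, and 21 can be illustrated with a short sketch. It assumes training outputs are available as one dictionary per training frame mapping each sub-phonetic unit to its output value, and that a mapping from each unit to its monophone is given. The names `count_usage`, `select_output_nodes`, and their parameters are hypothetical, and ties between equally used units are broken by first occurrence, which the claims do not specify.

```python
from collections import Counter
from typing import Dict, Iterable, Set


def count_usage(training_outputs: Iterable[Dict[str, float]]) -> Counter:
    """Increment a unit's usage count whenever it has a non-zero output."""
    usage: Counter = Counter()
    for outputs in training_outputs:
        for unit, value in outputs.items():
            if value != 0.0:
                usage[unit] += 1
    return usage


def select_output_nodes(
    usage: Counter,
    unit_to_monophone: Dict[str, str],
    key_phrase_units: Iterable[str],
    non_speech_units: Iterable[str] = (),
) -> Set[str]:
    """Keep the most-used unit per monophone, plus key phrase and non-speech nodes."""
    best: Dict[str, str] = {}
    for unit, monophone in unit_to_monophone.items():
        current = best.get(monophone)
        # Counter returns 0 for units that were never active during training.
        if current is None or usage[unit] > usage[current]:
            best[monophone] = unit
    return set(best.values()) | set(key_phrase_units) | set(non_speech_units)
```

A pruned acoustic model in the sense of claim 19 would then expose only the outputs in the returned set.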
Lexical analysis, e.g. tokenisation or collocates · CPC title
Hidden Markov Models [HMMs] · CPC title
Procedures used during a speech recognition process, e.g. man-machine dialogue · CPC title
Phonemes, fenemes or fenones being the recognition units · CPC title
Interactive procedures; Man-machine interfaces · CPC title