US2016189706A1 · US · A1
| Field | Value |
|---|---|
| Publication number | US-2016189706-A1 |
| Application number | US-201514606588-A |
| Country | US |
| Kind code | A1 |
| Filing date | Jan 27, 2015 |
| Priority date | Dec 30, 2014 |
| Publication date | Jun 30, 2016 |
| Grant date | — |
Official abstract text for this publication.
Methods, systems, and apparatuses are described for isolated word training and detection. Isolated word training devices and systems are provided in which a user may provide a wake-up phrase from 1 to 3 times to train the device or system. A concatenated phoneme model of the user-provided wake-up phrase may be generated based on the provided wake-up phrase and a pre-trained phoneme model database. A word model of the wake-up phrase may be subsequently generated from the concatenated phoneme model and the provided wake-up phrase. Once trained, the user-provided wake-up phrase may be used to unlock the device or system and/or to wake up the device or system from a standby mode of operation. The word model of the user-provided wake-up phrase may be further adapted based on additional provisioning of the wake-up phrase.
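The abstract's step from a phoneme transcription to a concatenated phoneme model can be sketched as follows. This is a minimal illustration, not the patent's implementation: the `PhonemeHMM` class, the `select_phonemes` rule (drop silence, merge consecutive repeats), the `"SIL"` label, and all probabilities are assumptions made for the example, and observation distributions are left out entirely.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class PhonemeHMM:
    # Hypothetical stand-in for one pre-trained phoneme model: a left-to-right
    # HMM described only by each state's self-loop probability.
    name: str
    self_loop: List[float]


def select_phonemes(transcription: List[str], silence: str = "SIL") -> List[str]:
    """Assumed selection rule: drop silence and merge consecutive repeats."""
    selected: List[str] = []
    for pid in transcription:
        if pid == silence:
            continue
        if not selected or selected[-1] != pid:
            selected.append(pid)
    return selected


def concatenate(models: List[PhonemeHMM]) -> List[List[float]]:
    """Chain the models into one left-to-right transition matrix.

    One extra absorbing exit state is appended so that every row sums to 1.
    """
    stays = [p for m in models for p in m.self_loop]
    n = len(stays) + 1
    matrix = [[0.0] * n for _ in range(n)]
    for i, stay in enumerate(stays):
        matrix[i][i] = stay              # remain in the current state
        matrix[i][i + 1] = 1.0 - stay    # advance to the next state
    matrix[n - 1][n - 1] = 1.0           # exit state absorbs
    return matrix


def build_concatenation_model(
    transcription: List[str], database: Dict[str, PhonemeHMM]
) -> List[List[float]]:
    """Look up each selected phoneme in the database and chain the models."""
    ids = select_phonemes(transcription)
    return concatenate([database[pid] for pid in ids])


# Illustrative database and transcription (made-up values).
database = {
    "HH": PhonemeHMM("HH", [0.6, 0.5, 0.4]),
    "AY": PhonemeHMM("AY", [0.7, 0.7, 0.6]),
}
model = build_concatenation_model(["SIL", "HH", "HH", "AY", "SIL"], database)
```

In this sketch the concatenated model is only a starting point; adapting it toward the user's own utterances, which the abstract describes as producing the word model, is not shown.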
Opening claim text (preview).
What is claimed is:

1. An isolated word training system that comprises: an input component configured to receive at least one audio input representation; a recognition component configured to generate a phoneme concatenation model of the at least one audio input representation based on a phoneme transcription; an adaptation component configured to generate a first word model of the at least one audio input based on the phoneme concatenation model.
2. The isolated word training system of claim 1, wherein the adaptation component is further configured to generate a second word model based on the first word model and at least one additional audio input representation.
3. The isolated word training system of claim 1, that further comprises: a phoneme model database that includes the one or more stored phoneme models, each of the one or more stored phoneme models being a pre-trained Hidden Markov Model.
4. The isolated word training system of claim 1, wherein the recognition component is configured to generate the phoneme concatenation model by: decoding feature vectors of the at least one audio input using the one or more stored phoneme models to generate the phoneme transcription that comprises one or more phoneme identifiers; selecting a subset of phoneme identifiers from the phoneme transcription; and selecting corresponding phoneme models from a phoneme model database for each phoneme identifier in the subset as the phoneme concatenation model.
5. The isolated word training system of claim 4, that further comprises: a feature extraction component configured to: derive speech features of speech of the at least one audio input; and generate the feature vectors based on the speech features.
6. The isolated word training system of claim 5, that further comprises: a voice activity detection component configured to: detect an onset of the speech; determine a termination of the speech; and provide the speech to the feature extraction component.
7. The isolated word training system of claim 1, wherein the adaptation component is configured to generate the first word model by at least one of: removing one or more phonemes from the phoneme concatenation model; combining one or more phonemes of the phoneme concatenation model; adapting one or more state transition probabilities of the phoneme concatenation model; adapting an observation symbol probability distribution of the phoneme concatenation model; or adapting the phoneme concatenation model as a whole.
8. An electronic device that comprises: a first processing component configured to: receive a user-specified wake-up phrase, generate a phoneme concatenation model of the user-specified wake-up phrase, and generate a word model of the user-specified wake-up phrase based on the phoneme concatenation model; and a second processing component configured to: detect audio activity of the user, and determine if the user-specified wake-up phrase is present within the audio activity based on the word model.
9. The electronic device of claim 8, wherein the first processing component is further configured to operate in a training mode of the electronic device, and wherein the second processing component is further configured to: operate in a stand-by mode of the electronic device, and provide, in the stand-by mode, an indication to the electronic device to operate in a normal operating mode subsequent to a determination that the user-specified wake-up phrase is present in the audio activity.
10. The electronic device of claim 9, that further comprises: a memory component configured to: buffer the audio activity and the user-specified wake-up phrase of the user; and provide the audio activity or the user-specified wake-up phrase to at least one of a voice activity detector or an automatic speech recognition component.
11. The electronic device of claim 9, wherein the first processing component, in the training mode, is further configured to: decode feature vectors of the at least one audio input using the one or more stored phoneme models to generate a phoneme transcription that comprises one or more phoneme identifiers; select a subset of phoneme identifiers from the phoneme transcription; and select corresponding phoneme models from a phoneme model database for each phoneme identifier in the subset as the phoneme concatenation model.
12. The electronic device of claim 11, wherein the first processing component, in the training mode, is further configured to: derive speech features of the user-specified wake-up phrase; and generate the feature vectors based on the speech features.
13. The electronic device of claim 9, wherein the second processing component, in the stand-by mode, is further configured to: derive speech features of the audio activity; generate feature vectors based on the speech features; and determine if the user-specified wake-up phrase is present by comparing the feature vectors to the word model.
14. The electronic device of claim 8, wherein the first processing component is further configured to: generate the first word model by at least one of: removing one or more phonemes from the phoneme concatenation model; combining one or more phonemes of the phoneme concatenation model; adapting one or more state transition probabilities of the phoneme concatenation model; adapting an observation symbol probability distribution of the phoneme concatenation model; or adapting the phoneme concatenation model as a whole.
15. The electronic device of claim 8, wherein the first processing component is further configured to update the word model based on the user-specified wake-up phrase in a subsequent audio input.
16. The electronic device of claim 8, that further comprises: a phoneme model database that includes one or more stored phoneme models, each of the one or more stored phoneme models being a pre-trained Hidden Markov Model.
17. A computer-readable storage medium having program instructions recorded thereon that, when executed by an electronic device, perform a method for utilizing a user-specified wake-up phrase in the electronic device, the method comprising: training the electronic device using the user-specified wake-up phrase, received from a user, to generate, at the electronic device, a user-dependent word model of the user-specified wake-up phrase; and transitioning the electronic device from a stand-by mode to a normal operating mode subsequent to a detection, that is based on the word model, of the user-specified wake-up phrase.
18. The computer-readable storage medium of claim 17, wherein training the electronic device using the user-specified wake-up phrase includes receiving the user-specified wake-up phrase from the user no more than three times.
19. The computer-readable storage medium of claim 18, wherein the electronic device generates the word model based on a phoneme concatenation model of the user-specified wake-up phrase.
20. The computer-readable storage medium of claim 19, wherein the training further comprises: generating a phoneme transcription of the user-specified wake-up phrase that includes one or more phoneme identifiers; and generating the phoneme concatenation model of the user-specified wake-up phrase based on stored phoneme models that correspond to the one or more phoneme identifiers.
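Claim 6 recites a voice activity detection component that detects the onset of speech, determines its termination, and passes the speech on for feature extraction. The claims do not say how this is done; the sketch below assumes a simple frame-energy threshold with a short hangover, which is one common approach and not necessarily the one the applicant uses. The function names, frame length, threshold, and hangover count are all hypothetical.

```python
from typing import List, Optional, Tuple


def frame_energies(samples: List[float], frame_len: int) -> List[float]:
    """Mean squared amplitude of each complete, non-overlapping frame."""
    return [
        sum(x * x for x in samples[i:i + frame_len]) / frame_len
        for i in range(0, len(samples) - frame_len + 1, frame_len)
    ]


def detect_speech(
    samples: List[float],
    frame_len: int = 160,      # assumed frame size (10 ms at 16 kHz)
    threshold: float = 0.01,   # assumed energy threshold
    hangover: int = 3,         # quiet frames tolerated before declaring the end
) -> Optional[Tuple[int, int]]:
    """Return (onset, termination) as sample indices, or None if no speech."""
    onset_frame: Optional[int] = None
    last_active = 0
    quiet = 0
    for idx, energy in enumerate(frame_energies(samples, frame_len)):
        if energy >= threshold:
            if onset_frame is None:
                onset_frame = idx          # first active frame marks the onset
            last_active = idx
            quiet = 0
        elif onset_frame is not None:
            quiet += 1
            if quiet >= hangover:          # enough silence: speech has ended
                break
    if onset_frame is None:
        return None
    return onset_frame * frame_len, (last_active + 1) * frame_len


# Synthetic input: silence, a constant-amplitude burst, then silence again.
samples = [0.0] * 320 + [0.5] * 480 + [0.0] * 800
span = detect_speech(samples)
speech = samples[span[0]:span[1]] if span else []
```

The `speech` slice is what a feature extraction stage would receive in this sketch. A fixed threshold is fragile in noisy conditions, so a real detector would likely adapt it to the background level.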
Training · CPC title
Feature extraction for speech recognition; Selection of recognition unit · CPC title
Phonemes, fenemes or fenones being the recognition units · CPC title
Word boundary detection · CPC title
Training of HMMs · CPC title