Keyword detection without decoding
US-9378733-B1 · Jun 28, 2016 · US
US9842585B2 · US · B2
| Field | Value |
|---|---|
| Publication number | US-9842585-B2 |
| Application number | US-201313792241-A |
| Country | US |
| Kind code | B2 |
| Filing date | Mar 11, 2013 |
| Priority date | Mar 11, 2013 |
| Publication date | Dec 12, 2017 |
| Grant date | Dec 12, 2017 |
Official abstract text for this publication.
Described herein are various technologies pertaining to a multilingual deep neural network (MDNN). The MDNN includes a plurality of hidden layers, wherein values for weight parameters of the plurality of hidden layers are learned during a training phase based upon training data in terms of acoustic raw features for multiple languages. The MDNN further includes softmax layers that are trained for each target language separately, making use of the hidden layer values trained jointly with multiple source languages. The MDNN is adaptable, such that a new softmax layer may be added on top of the existing hidden layers, where the new softmax layer corresponds to a new target language.
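The layout described in the abstract, with hidden layers shared across languages and one softmax layer per target language, can be sketched in a few lines. This is a minimal illustration only, not the patent's implementation: the class name `MultilingualDNN`, the sigmoid nonlinearity, the random initialization, and the layer sizes are all assumptions, and training is omitted entirely.

```python
import math
import random


def softmax(z):
    """Numerically stable softmax over a list of scores."""
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


def affine(W, b, x):
    """Compute W @ x + b for a list-of-rows matrix W."""
    return [sum(w * xi for w, xi in zip(row, x)) + bi for row, bi in zip(W, b)]


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def init_layer(rng, n_out, n_in):
    """Small random weights and zero biases (assumed initialization)."""
    W = [[rng.gauss(0.0, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    return W, [0.0] * n_out


class MultilingualDNN:
    """Hypothetical sketch: shared hidden layers, one softmax head per language."""

    def __init__(self, feature_dim, hidden_dims, seed=0):
        self.rng = random.Random(seed)
        self.hidden = []
        n_in = feature_dim
        for n_out in hidden_dims:
            self.hidden.append(init_layer(self.rng, n_out, n_in))
            n_in = n_out
        self.top_dim = n_in
        self.heads = {}  # language -> (W, b) of that language's softmax layer

    def add_language(self, language, num_senones):
        """Attach a new softmax layer; the shared hidden layers are not modified."""
        self.heads[language] = init_layer(self.rng, num_senones, self.top_dim)

    def hidden_output(self, features):
        """Output of the uppermost shared hidden layer for one feature vector."""
        h = features
        for W, b in self.hidden:
            h = [sigmoid(v) for v in affine(W, b, h)]
        return h

    def posteriors(self, features, language):
        """Probability distribution over the given language's modeling units."""
        W, b = self.heads[language]
        return softmax(affine(W, b, self.hidden_output(features)))


# Usage: two source languages share the hidden layers; a third head is added later.
mdnn = MultilingualDNN(feature_dim=4, hidden_dims=[8, 8])
mdnn.add_language("fr", num_senones=5)
mdnn.add_language("de", num_senones=6)
frame = [0.1, -0.3, 0.7, 0.2]  # made-up feature vector
p_fr = mdnn.posteriors(frame, "fr")
mdnn.add_language("it", num_senones=3)  # new target language, hidden layers untouched
```

Keeping the heads in a dictionary keyed by language mirrors the adaptability the abstract describes: supporting a new target language only adds an entry, while the jointly trained hidden layers stay as they are.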
Opening claim text (preview).
What is claimed is:
1. A method, comprising: at a computing device that comprises at least one computer processor: receiving an acoustic signal at an automatic speech recognition (ASR) system, the ASR system configured to identify words in multiple different languages, the ASR system comprises a deep neural network (DNN), the DNN comprises an output layer that includes at least one softmax layer, the at least one softmax layer comprises output nodes, the output nodes correspond to the multiple different languages, wherein the DNN is trained based at least in part upon training data, the training data comprising spoken utterances in a source language, the acoustic signal comprising a spoken utterance that includes a word in a target language, the target language being different from the source language; extracting a plurality of features from the acoustic signal to form a feature vector; providing the feature vector to an input layer of the DNN, the DNN producing an output at the output layer responsive to being provided with the feature vector; identifying the word in the target language in the spoken utterance based upon the output of the DNN at the output layer; and performing at least one computing operation based upon the word in the target language in the spoken utterance being identified.
2. The method of claim 1, wherein the training data comprise spoken utterances in the target language.
3. The method of claim 2, wherein the spoken utterance comprises a second word in a second target language, and further comprising identifying the second word in the second target language in the spoken utterance based upon the output of the DNN at the output layer.
4. The method of claim 1, wherein the DNN comprises: a plurality of hidden layers, wherein each hidden layer in the plurality of hidden layers comprises a respective plurality of nodes, each node configured to perform a linear or nonlinear transformation on its respective input, and wherein the at least one softmax layer comprises a first softmax layer that receives outputs of respective nodes in an uppermost layer of the plurality of hidden layers, the first softmax layer comprises a plurality of modeling units that are representative of respective senones used in the target language, wherein the first softmax layer is trained based solely upon training data in the target language.
5. The method of claim 4, wherein the at least one softmax layer further comprises a second softmax layer that receives outputs of respective nodes in the uppermost layer of the plurality of hidden layers, the second softmax layer comprising a plurality of modeling units that are representative of senones used in speech in a second target language, wherein the second softmax layer is trained based solely upon training data in the second target language.
6. The method of claim 1, wherein the DNN comprises: a plurality of hidden layers, wherein each hidden layer in the plurality of hidden layers comprises a respective plurality of nodes, each node configured to perform a linear or nonlinear transformation on its respective input, and wherein the at least one softmax layer is a single softmax layer, the single softmax layer receives outputs of respective nodes in the uppermost layer of the plurality of hidden layers, the single softmax layer comprising a plurality of modeling units that are representative of senones used in speech in the source language and the target language, the training data comprising spoken utterances in the target language, the method further comprising: at the computing device that comprises the at least one processor: identifying that the spoken utterance comprises the word in the target language; and selectively activating input synapses to the single softmax layer corresponding to senones used in the target language while failing to activate input synapses to the single softmax layer corresponding to senones not used in the target language.
7. The method of claim 1 executed in a mobile computing or a gaming device.
8. The method of claim 1, wherein the DNN comprises a plurality of hidden layers and the at least one softmax layer comprises a plurality of softmax layers, and further wherein the DNN is trained in a parallel fashion using training data for different source languages, with values of parameters of the plurality of hidden layers and the plurality of softmax layers for each source language being adjusted simultaneously, and wherein the DNN is updated to comprise a new softmax layer, where the new softmax layer corresponds to a new target language and is trained by acoustic signals comprising spoken utterances in the new target language.
9. The method of claim 1, wherein the DNN comprises a plurality of hidden layers and the at least one softmax layer includes a single softmax layer, wherein supervised learning is employed to train the DNN to learn values of parameters of the hidden layers and the single softmax layers based upon the training data.
10. The method of claim 1, wherein the DNN is trained utilizing a plurality of sets of training data, each set of training data in the plurality of sets of training data corresponding to a different respective language.
11. A computing device comprising: at least one processor; and memory that comprises: a recognition system that is configured to detect words in multiple languages, the recognition system comprising: a deep neural network (DNN) that comprises: an input layer; a plurality of hidden layers, each hidden layer comprising a respective plurality of nodes, each node in a hidden layer being configured to perform a linear or nonlinear transformation on output of at least one node from an adjacent layer in the DNN, the plurality of hidden layers having parameters corresponding thereto, wherein values of the parameters are based upon training data that comprises acoustic signals that include spoken utterances in a plurality of different source languages; and at least one softmax layer that comprises modeling units that are representative of phonetic elements used in the multiple languages, the multiple languages include a target language, the at least one softmax layer having parameters corresponding thereto, wherein values of the parameters of the at least one softmax layer are based upon training data that comprises acoustic signals that include spoken utterances in the target language, the at least one softmax layer receiving outputs of nodes from an uppermost hidden layer in the DNN, wherein output of the at least one softmax layer is a probability distribution over the modeling units; and instructions that, when executed by the at least one processor, cause the at least one processor to perform acts comprising: receiving an acoustic signal that comprises a word in the target language; extracting features from the acoustic signal to generate a feature vector; providing the feature vector to the DNN; and identifying the word in the target language based upon the probability distribution over at least a subset of the modeling units.
12. The computing device of claim 11 being a mobile telephone or a gaming device.
13. The computing device of claim 12 being a server that is accessible by way of a telephone.
14. The computing device of claim 11, wherein the at least one softmax layer comprises a plurality of softmax layers, each softmax layer in the plurality of softmax layers corresponding to a respective language in the multiple languages.
15. The computing device of claim 11, wherein the modeling units represent senones, wherein the at least one softmax layer compris
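Claim 6 recites a single softmax layer covering the senones of all languages, with only the inputs for the target language's senones activated. One possible reading, sketched below, is a softmax normalized over the active senones only, with the remaining outputs set to zero. The function name `masked_softmax`, the senone-to-language table, and the score values are hypothetical; the claim text does not specify how the selective activation is implemented.

```python
import math


def masked_softmax(logits, active):
    """Softmax restricted to the indices in `active`; all other outputs are 0.0."""
    if not active:
        raise ValueError("at least one senone must be active")
    m = max(logits[i] for i in active)
    exps = {i: math.exp(logits[i] - m) for i in active}
    total = sum(exps.values())
    return [exps[i] / total if i in exps else 0.0 for i in range(len(logits))]


# Hypothetical inventory: which languages use each senone in the shared layer.
senone_languages = {
    0: {"en", "fr"},
    1: {"fr"},
    2: {"en"},
    3: {"en"},
}

# Made-up pre-softmax scores for one frame, one score per senone.
logits = [2.0, 1.0, 0.5, 3.0]

# Activate only the senones used in the target language ("fr" here).
target = "fr"
active = {i for i, langs in senone_languages.items() if target in langs}
posterior = masked_softmax(logits, active)
```

Under this reading, senones that the target language does not use receive zero probability, so the distribution passed to the rest of the recognizer is over the target language's modeling units only.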
Combinations of networks · CPC title
Supervised learning · CPC title
Transfer learning · CPC title
Feedforward networks · CPC title
Backpropagation, e.g. using gradient descent · CPC title