Multilingual, acoustic deep neural networks
US-9460711-B1 · Oct 4, 2016 · US
US2017206894A1 · US · A1
| Field | Value |
|---|---|
| Publication number | US-2017206894-A1 |
| Application number | US-201615187581-A |
| Country | US |
| Kind code | A1 |
| Filing date | Jun 20, 2016 |
| Priority date | Jan 18, 2016 |
| Publication date | Jul 20, 2017 |
| Grant date | — |
Official abstract text for this publication.
A speech recognition apparatus based on a deep-neural-network (DNN) sound model includes a memory and a processor. As the processor executes a program stored in the memory, the processor generates sound-model state sets corresponding to a plurality of pieces of set training speech data included in multi-set training speech data, generates a multi-set state cluster from the sound-model state sets, and sets the multi-set training speech data as an input node and the multi-set state cluster as output nodes so as to learn a DNN structured parameter.
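As an illustration of the recognition step summarized above, the following minimal Python sketch shows one possible reading of "setting a sound-model state set as an output node": the network's output layer covers the whole multi-set state cluster, and at recognition time the posterior is computed only over the output nodes belonging to the state set that matches the speaker's characteristic information. The node layout, the set names (`native_en`, `native_ko`), the activation values, and the function `set_posteriors` are hypothetical and do not come from the publication.

```python
import math

def set_posteriors(logits, state_ids):
    """Softmax restricted to the cluster outputs that belong to one state set."""
    selected = [logits[i] for i in state_ids]
    peak = max(selected)                      # subtract the max for numerical stability
    exps = [math.exp(v - peak) for v in selected]
    total = sum(exps)
    return {i: e / total for i, e in zip(state_ids, exps)}

# Hypothetical multi-set state cluster with 5 output nodes.
# Node 2 stands for a merged state shared by both state sets (assumption).
SET_TO_NODES = {
    "native_en": [0, 1, 2],
    "native_ko": [2, 3, 4],
}

# Made-up output-layer activations for a single speech frame.
logits = [1.0, 0.5, 2.0, -1.0, 0.0]

# Characteristic information selects which state set acts as the output.
post = set_posteriors(logits, SET_TO_NODES["native_ko"])
```

In this sketch the hidden layers would be shared across all training sets, and only the choice of output nodes changes per speaker group; whether the disclosed apparatus masks outputs in exactly this way is an assumption.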
Opening claim text (preview).
What is claimed is:

1. A speech recognition apparatus based on a deep-neural-network (DNN) sound model, comprising: a memory; and a processor configured to execute a program stored in the memory, wherein, as the program is executed, the processor generates sound-model state sets corresponding to a plurality of pieces of set training speech data included in multi-set training speech data, generates a multi-set state cluster from the sound-model state sets, and sets the multi-set training speech data as an input node and the multi-set state cluster as an output node so as to learn a DNN structured parameter, and when a user's speech and characteristic information thereof are received via a user interface, the processor recognizes the user's speech on the basis of the learned DNN structured parameter by setting a sound-model state set corresponding to the characteristic information of the user's speech as an output node.
2. The speech recognition apparatus of claim 1, wherein the processor generates multi-state sets by collecting the sound-model state sets, and generates the multi-set state cluster by clustering the multi-state sets.
3. The speech recognition apparatus of claim 2, wherein the processor calculates state log likelihoods of each state of the sound-model state sets, and generates the multi-set state cluster by merging similar state clusters on the basis of the state log likelihoods and state tying information of the sound-model state sets.
4. The speech recognition apparatus of claim 3, wherein the processor calculates a state log likelihood corresponding to a result of merging states of two random sound-model state sets included in the multi-state sets, and merges the two random sound-model state sets when a difference between a sum of the state log likelihoods of the two random sound-model state sets and the state log likelihood corresponding to the result of merging the two random sound-model state sets is equal to or less than a predetermined threshold.
5. The speech recognition apparatus of claim 3, wherein the processor merges two random sound-model state sets included in the multi-state sets when logical tri-phone sets corresponding to the two random sound-model state sets are the same.
6. The speech recognition apparatus of claim 3, wherein the processor merges two random sound-model state sets included in the multi-state sets when logical tri-phone sets of the two random sound-model state sets are mutually inclusive and no logical tri-phone set has a relation including another sound-model state set.
7. The speech recognition apparatus of claim 4, wherein each of the sound-model state sets configures an independent state space on the multi-set state cluster, and the result of merging the two random sound-model state sets shares the independent state space.
8. The speech recognition apparatus of claim 5, wherein each of the sound-model state sets configures an independent state space on the multi-set state cluster, and the result of merging the two random sound-model state sets shares the independent state space.
9. The speech recognition apparatus of claim 1, wherein the processor generates state-level alignment information regarding each of the sound-model state sets, and sets the multi-set training speech data including the state-level alignment information as an input node.
10. The speech recognition apparatus of claim 1, wherein the processor sets the plurality of pieces of set training speech data included in the multi-set training speech data as input nodes, and sets the sound-model state sets included in the multi-set state cluster and corresponding to the plurality of pieces of set training speech data as output nodes.
11. The speech recognition apparatus of claim 1, wherein the plurality of pieces of set training speech data comprise different acoustic-statistical characteristics.
12. The speech recognition apparatus of claim 11, wherein the different acoustic-statistical characteristics comprise acoustic-statistical characteristics corresponding to speakers of different native languages.
13. A speech recognition method based on a deep-neural-network (DNN) sound model, comprising: generating sound-model state sets corresponding to a plurality of pieces of set training speech data included in multi-set training speech data; generating a multi-set state cluster from the sound-model state sets; learning a DNN structured parameter by setting the multi-set training speech data as an input node and the multi-set state cluster as an output node; receiving a user's speech and characteristic information thereof via a user interface; and recognizing the user's speech on the basis of the learned DNN structured parameter by setting a sound-model state set corresponding to the characteristic information of the user's speech as an output node.
14. The speech recognition method of claim 13, wherein the generating of the multi-set state cluster comprises: generating multi-state sets by collecting the sound-model state sets; and generating the multi-set state cluster by clustering the multi-state sets.
15. The speech recognition method of claim 14, wherein the generating of the multi-set state cluster by clustering the multi-state sets comprises: calculating state log likelihoods of each state of the sound-model state sets; and merging similar state clusters on the basis of the state log likelihoods and state tying information of the sound-model state sets.
16. The speech recognition method of claim 15, wherein the merging of the similar state clusters comprises: calculating a state log likelihood corresponding to a result of merging states of two random sound-model state sets included in the multi-state sets; and merging the two random sound-model state sets when a difference between a sum of the state log likelihoods of the two random sound-model state sets and the state log likelihood corresponding to the result of merging the two random sound-model state sets is equal to or less than a predetermined threshold.
17. The speech recognition method of claim 15, wherein the merging of the similar state clusters comprises merging two random sound-model state sets included in the multi-state sets when logical tri-phone sets corresponding to the two random sound-model state sets are the same.
18. The speech recognition method of claim 15, wherein the merging of the similar state clusters comprises merging two random sound-model state sets included in the multi-state sets when logical tri-phone sets of the two random sound-model state sets are mutually inclusive and no logical tri-phone set has a relation including another sound-model state set.
19. The speech recognition method of claim 13, further comprising generating state-level alignment information regarding each of the sound-model state sets, and wherein the learning of the DNN structured parameter comprises setting the multi-set training speech data including the state-level alignment information as an input node.
20. The speech recognition method of claim 13, wherein the learning of the DNN structured parameter comprises setting the plurality of pieces of set training speech data included in the multi-set training speech data as input nodes, and setting the sound-model state sets included in the multi-set state cluster and corresponding to the plurality of pieces of set training speech data as output nodes.
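The threshold test recited in claims 4 and 16 can be sketched in Python as follows. The claims do not say how a state log likelihood is computed; this sketch assumes each state is summarized by a one-dimensional Gaussian with a frame count, mean, and variance, which is a common simplification in state tying and not something the publication confirms. The names `StateStats`, `merge_loss`, and `should_merge`, and all numeric values, are hypothetical.

```python
import math
from dataclasses import dataclass

@dataclass
class StateStats:
    count: float   # number of frames aligned to the state
    mean: float    # sample mean of a 1-D feature (assumed model)
    var: float     # maximum-likelihood variance of that feature

def log_likelihood(s):
    """Log likelihood of a state's frames under its own ML Gaussian."""
    return -0.5 * s.count * (math.log(2.0 * math.pi * s.var) + 1.0)

def merge(a, b):
    """Pool the statistics of two states into one Gaussian."""
    n = a.count + b.count
    mean = (a.count * a.mean + b.count * b.mean) / n
    second_moment = (a.count * (a.var + a.mean ** 2)
                     + b.count * (b.var + b.mean ** 2)) / n
    return StateStats(n, mean, second_moment - mean ** 2)

def merge_loss(a, b):
    """Sum of the separate log likelihoods minus the merged log likelihood."""
    return log_likelihood(a) + log_likelihood(b) - log_likelihood(merge(a, b))

def should_merge(a, b, threshold):
    """Merge when the likelihood loss is equal to or less than the threshold."""
    return merge_loss(a, b) <= threshold

# Made-up statistics: two near-identical states and one distant state.
s1 = StateStats(100, 0.0, 1.0)
s2 = StateStats(100, 0.0, 1.0)
far = StateStats(100, 10.0, 1.0)

merge_similar = should_merge(s1, s2, threshold=1.0)
merge_distant = should_merge(s1, far, threshold=1.0)
```

Under these assumptions the loss is close to zero when the two states have matching statistics and grows as their means diverge, so a small threshold merges only acoustically similar states across the training sets.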
Training · CPC title
Demisyllables, biphones or triphones being the recognition units · CPC title
to the speaker · CPC title
Threshold criteria for the updating · CPC title
using artificial neural networks · CPC title