Who is the assignee on this patent?

Electronics & Telecommunications Res Inst

What technology area does this patent fall under?

Primary CPC classification G10L15/16. Mapped technology areas include Physics.

When was this patent published?

Publication date Jul 20, 2017 (kind code A1). Legal status and post-grant events are not shown on this page.

What related patents are in patentsdb?

We list 2 related publications on this page (citations in our corpus or others sharing the same primary CPC).

Apparatus and method for recognizing speech

US2017206894A1 · US · A1

Patent metadata
Field	Value
Publication number	US-2017206894-A1
Application number	US-201615187581-A
Country	US
Kind code	A1
Filing date	Jun 20, 2016
Priority date	Jan 18, 2016
Publication date	Jul 20, 2017
Grant date	—

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

Title
What the patent document calls the invention.
Abstract
A short plain-language summary of the technical disclosure.
Assignees and inventors
Who owns or filed the patent and who is credited as inventor.
Key dates
Filing, priority, publication, and grant dates set the timeline.
First independent claim
The legal scope of protection — read this for what is actually claimed.
CPC / IPC classifications
Technology tags used to group this patent with similar filings.
Citations and related patents
Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

A speech recognition apparatus based on a deep-neural-network (DNN) sound model includes a memory and a processor. As the processor executes a program stored in the memory, the processor generates sound-model state sets corresponding to a plurality of pieces of set training speech data included in multi-set training speech data, generates a multi-set state cluster from the sound-model state sets, and sets the multi-set training speech data as an input node and the multi-set state cluster as output nodes so as to learn a DNN structured parameter.
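The abstract describes one network whose output nodes cover a multi-set state cluster, with a set-specific state set chosen at recognition time. The following is a minimal Python sketch of one way to read that, not the implementation disclosed in the patent. Everything in it is an assumption made for illustration: the layer sizes, the random untrained weights, the set names `native` and `non_native`, the `SET_STATES` index mapping, and the use of a masked softmax to restrict the outputs to one state set.

```python
import math
import random


def affine(weights, bias, x):
    """Return W*x + b for a weight matrix stored as a list of rows."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def random_layer(rng, n_out, n_in):
    """Create an untrained layer (random weights, zero bias)."""
    weights = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)]
               for _ in range(n_out)]
    return weights, [0.0] * n_out


def masked_softmax(logits, active):
    """Softmax restricted to the output nodes listed in `active`."""
    peak = max(logits[i] for i in active)
    exps = {i: math.exp(logits[i] - peak) for i in active}
    total = sum(exps.values())
    return {i: e / total for i, e in exps.items()}


# Hypothetical multi-set state cluster with 6 output nodes.
# Nodes 2 and 3 stand for merged states shared by both training sets.
SET_STATES = {"native": [0, 1, 2, 3], "non_native": [2, 3, 4, 5]}

rng = random.Random(0)
hidden = random_layer(rng, 8, 4)   # shared hidden layer: 4 features -> 8 units
output = random_layer(rng, 6, 8)   # one output node per cluster state


def state_posteriors(frame, characteristic):
    """Score one feature frame against the state set chosen by `characteristic`."""
    h = [max(0.0, v) for v in affine(*hidden, frame)]   # ReLU activation
    logits = affine(*output, h)
    return masked_softmax(logits, SET_STATES[characteristic])


posteriors = state_posteriors([0.1, -0.2, 0.3, 0.05], "non_native")
```

Here `posteriors` maps only the output nodes of the selected state set to probabilities that sum to one. The shared nodes appear in both sets, which is how this sketch models merged states.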

First claim

Claim text as published (claims 1 to 20).

What is claimed is:
1. A speech recognition apparatus based on a deep-neural-network (DNN) sound model, comprising: a memory; and a processor configured to execute a program stored in the memory, wherein, as the program is executed, the processor generates sound-model state sets corresponding to a plurality of pieces of set training speech data included in multi-set training speech data, generates a multi-set state cluster from the sound-model state sets, and sets the multi-set training speech data as an input node and the multi-set state cluster as an output node so as to learn a DNN structured parameter, and when a user's speech and characteristic information thereof are received via a user interface, the processor recognizes the user's speech on the basis of the learned DNN structured parameter by setting a sound-model state set corresponding to the characteristic information of the user's speech as an output node.
2. The speech recognition apparatus of claim 1, wherein the processor generates multi-state sets by collecting the sound-model state sets, and generates the multi-set state cluster by clustering the multi-state sets.
3. The speech recognition apparatus of claim 2, wherein the processor calculates state log likelihoods of each state of the sound-model state sets, and generates the multi-set state cluster by merging similar state clusters on the basis of the state log likelihoods and state tying information of the sound-model state sets.
4. The speech recognition apparatus of claim 3, wherein the processor calculates a state log likelihood corresponding to a result of merging states of two random sound-model state sets included in the multi-state sets, and merges the two random sound-model state sets when a difference between a sum of the state log likelihoods of the two random sound-model state sets and the state log likelihood corresponding to the result of merging the two random sound-model state sets is equal to or less than a predetermined threshold.
5. The speech recognition apparatus of claim 3, wherein the processor merges two random sound-model state sets included in the multi-state sets when logical tri-phone sets corresponding to the two random sound-model state sets are the same.
6. The speech recognition apparatus of claim 3, wherein the processor merges two random sound-model state sets included in the multi-state sets when logical tri-phone sets of the two random sound-model state sets are mutually inclusive and no logical tri-phone set has a relation including another sound-model state set.
7. The speech recognition apparatus of claim 4, wherein each of the sound-model state sets configures an independent state space on the multi-set state cluster, and the result of merging the two random sound-model state sets shares the independent state space.
8. The speech recognition apparatus of claim 5, wherein each of the sound-model state sets configures an independent state space on the multi-set state cluster, and the result of merging the two random sound-model state sets shares the independent state space.
9. The speech recognition apparatus of claim 1, wherein the processor generates state-level alignment information regarding each of the sound-model state sets, and sets the multi-set training speech data including the state-level alignment information as an input node.
10. The speech recognition apparatus of claim 1, wherein the processor sets the plurality of pieces of set training speech data included in the multi-set training speech data as input nodes, and sets the sound-model state sets included in the multi-set state cluster and corresponding to the plurality of pieces of set training speech data as output nodes.
11. The speech recognition apparatus of claim 1, wherein the plurality of pieces of set training speech data comprise different acoustic-statistical characteristics.
12. The speech recognition apparatus of claim 11, wherein the different acoustic-statistical characteristics comprise acoustic-statistical characteristics corresponding to speakers of different native languages.
13. A speech recognition method based on a deep-neural-network (DNN) sound model, comprising: generating sound-model state sets corresponding to a plurality of pieces of set training speech data included in multi-set training speech data; generating a multi-set state cluster from the sound-model state sets; learning a DNN structured parameter by setting the multi-set training speech data as an input node and the multi-set state cluster as an output node; receiving a user's speech and characteristic information thereof via a user interface; and recognizing the user's speech on the basis of the learned DNN structured parameter by setting a sound-model state set corresponding to the characteristic information of the user's speech as an output node.
14. The speech recognition method of claim 13, wherein the generating of the multi-set state cluster comprises: generating multi-state sets by collecting the sound-model state sets; and generating the multi-set state cluster by clustering the multi-state sets.
15. The speech recognition method of claim 14, wherein the generating of the multi-set state cluster by clustering the multi-state sets comprises: calculating state log likelihoods of each state of the sound-model state sets; and merging similar state clusters on the basis of the state log likelihoods and state tying information of the sound-model state sets.
16. The speech recognition method of claim 15, wherein the merging of the similar state clusters comprises: calculating a state log likelihood corresponding to a result of merging states of two random sound-model state sets included in the multi-state sets; and merging the two random sound-model state sets when a difference between a sum of the state log likelihoods of the two random sound-model state sets and the state log likelihood corresponding to the result of merging the two random sound-model state sets is equal to or less than a predetermined threshold.
17. The speech recognition method of claim 15, wherein the merging of the similar state clusters comprises merging two random sound-model state sets included in the multi-state sets when logical tri-phone sets corresponding to the two random sound-model state sets are the same.
18. The speech recognition method of claim 15, wherein the merging of the similar state clusters comprises merging two random sound-model state sets included in the multi-state sets when logical tri-phone sets of the two random sound-model state sets are mutually inclusive and no logical tri-phone set has a relation including another sound-model state set.
19. The speech recognition method of claim 13, further comprising generating state-level alignment information regarding each of the sound-model state sets, and wherein the learning of the DNN structured parameter comprises setting the multi-set training speech data including the state-level alignment information as an input node.
20. The speech recognition method of claim 13, wherein the learning of the DNN structured parameter comprises setting the plurality of pieces of set training speech data included in the multi-set training speech data as input nodes, and setting the sound-model state sets included in the multi-set state cluster and corresponding to the plurality of pieces of set training speech data as output nodes.
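Claims 4 and 16 merge two states when the sum of their separate log likelihoods minus the log likelihood of the merged state is at most a predetermined threshold, and claims 5 and 17 merge when the logical tri-phone sets are the same. The Python sketch below illustrates both tests under stated assumptions. The claims do not say how a state log likelihood is computed, so the sketch assumes each state is a one-dimensional maximum-likelihood Gaussian summarized by sufficient statistics. The names `StateStats`, `merge_loss`, `should_merge`, and `same_triphone_sets`, the sample frames, and the threshold value are all hypothetical.

```python
import math
from dataclasses import dataclass


@dataclass
class StateStats:
    """Sufficient statistics of the scalar frames aligned to one state."""
    count: float
    total: float
    total_sq: float

    @classmethod
    def from_frames(cls, frames):
        return cls(len(frames), sum(frames), sum(x * x for x in frames))

    def merged_with(self, other):
        """Pool the statistics, as if both states shared one Gaussian."""
        return StateStats(self.count + other.count,
                          self.total + other.total,
                          self.total_sq + other.total_sq)

    def log_likelihood(self, var_floor=1e-6):
        """Log likelihood of the frames under their own ML Gaussian (assumed model)."""
        mean = self.total / self.count
        var = max(self.total_sq / self.count - mean * mean, var_floor)
        return -0.5 * self.count * (math.log(2.0 * math.pi * var) + 1.0)


def merge_loss(a, b):
    """Sum of the separate log likelihoods minus the merged log likelihood."""
    return (a.log_likelihood() + b.log_likelihood()
            - a.merged_with(b).log_likelihood())


def should_merge(a, b, threshold):
    """Likelihood test in the spirit of claims 4 and 16."""
    return merge_loss(a, b) <= threshold


def same_triphone_sets(triphones_a, triphones_b):
    """Set-equality test in the spirit of claims 5 and 17."""
    return frozenset(triphones_a) == frozenset(triphones_b)


state_a = StateStats.from_frames([0.9, 1.0, 1.1, 1.0])
state_b = StateStats.from_frames([1.0, 1.1, 0.9, 1.0])   # close to state_a
state_c = StateStats.from_frames([5.0, 5.1, 4.9, 5.0])   # far from state_a

merge_ab = should_merge(state_a, state_b, threshold=1.0)
merge_ac = should_merge(state_a, state_c, threshold=1.0)
```

With these sample frames, pooling `state_a` and `state_b` costs almost no likelihood, so they pass the test, while pooling `state_a` and `state_c` inflates the variance and fails it. Real systems would use multivariate features and a tuned threshold.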

Assignees

Electronics & Telecommunications Res Inst

Inventors

Classifications

CPC code	CPC title
G10L15/063	Training
G10L2015/022	Demisyllables, biphones or triphones being the recognition units
G10L15/07	to the speaker
G10L2015/0636	Threshold criteria for the updating
G10L15/16 (Primary)	using artificial neural networks

Patent family

Related publications grouped by family.

Patent family 59314613
