Pre-training apparatus and method for speech recognition

US9875737B2 · US · B2

Patent metadata
Publication number: US-9875737-B2
Application number: US-201615207673-A
Country: US
Kind code: B2
Filing date: Jul 12, 2016
Priority date: Mar 18, 2016
Publication date: Jan 23, 2018
Grant date: Jan 23, 2018

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

A pre-training apparatus and method for recognition speech, which initialize, by layers, a deep neural network to correct a node connection weight. The pre-training apparatus for speech recognition includes an input unit configured to receive speech data, a model generation unit configured to initialize a connection weight of a deep neural network, based on the speech data, and an output unit configured to output information about the connection weight. In order for a state of a phoneme unit corresponding to the speech data to be output, the model generation unit trains the connection weight by piling a plurality of hidden layers according to a determined structure of the deep neural network, applies an output layer to a certain layer between the plurality of hidden layers to correct the trained connection weight in each of the plurality of hidden layers, thereby initializing the connection weight.
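The layer-by-layer procedure described in the abstract can be illustrated with a minimal sketch in Python. It is not the patented implementation: the function names (`train_stage`, `pretrain`), the sigmoid hidden units, the softmax output layer, plain stochastic gradient descent, and the small random starting weights are all assumptions made for illustration. In particular, the random start only stands in for the "maximum entropy method" named in the claims, which is not reproduced here. Each stage piles one hidden layer on the already-trained layers below, attaches a temporary output layer over the phoneme-state labels, corrects the new connection weights, and then discards the output layer.

```python
import math
import random


def sigmoid(z):
    # Clamp the argument so math.exp cannot overflow.
    z = max(-60.0, min(60.0, z))
    return 1.0 / (1.0 + math.exp(-z))


def softmax(zs):
    m = max(zs)
    exps = [math.exp(z - m) for z in zs]
    total = sum(exps)
    return [e / total for e in exps]


def init_matrix(rows, cols, rng):
    # ASSUMPTION: small random values, a placeholder for the claimed
    # "maximum entropy method" of choosing the starting weights.
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


def affine(row, x):
    # row holds one node's incoming weights; its last entry is the bias.
    return sum(w * v for w, v in zip(row, x)) + row[-1]


def hidden_forward(weights, x):
    return [sigmoid(affine(row, x)) for row in weights]


def train_stage(inputs, labels, n_hidden, n_states, rng, epochs, lr):
    """Pile one hidden layer, apply a temporary output layer, correct the weights."""
    n_in = len(inputs[0])
    w_hid = init_matrix(n_hidden, n_in + 1, rng)
    w_out = init_matrix(n_states, n_hidden + 1, rng)  # temporary output layer
    for _ in range(epochs):
        for x, y in zip(inputs, labels):
            h = hidden_forward(w_hid, x)
            p = softmax([affine(row, h) for row in w_out])
            # Cross-entropy gradient at the output pre-activations.
            d_out = [p[k] - (1.0 if k == y else 0.0) for k in range(n_states)]
            # Back-propagate to the hidden pre-activations before updating w_out.
            d_hid = [
                h[j] * (1.0 - h[j]) * sum(d_out[k] * w_out[k][j] for k in range(n_states))
                for j in range(n_hidden)
            ]
            for k in range(n_states):
                for j in range(n_hidden):
                    w_out[k][j] -= lr * d_out[k] * h[j]
                w_out[k][-1] -= lr * d_out[k]
            for j in range(n_hidden):
                for i in range(n_in):
                    w_hid[j][i] -= lr * d_hid[j] * x[i]
                w_hid[j][-1] -= lr * d_hid[j]
    return w_hid  # the output layer is dropped here ("removed")


def pretrain(inputs, labels, hidden_sizes, n_states, seed=0, epochs=100, lr=0.5):
    """Return one corrected weight matrix per hidden layer, bottom to top."""
    rng = random.Random(seed)
    stack = []
    features = inputs
    for n_hidden in hidden_sizes:
        w_hid = train_stage(features, labels, n_hidden, n_states, rng, epochs, lr)
        stack.append(w_hid)
        # Node values of this layer become the input of the next piled layer.
        features = [hidden_forward(w_hid, x) for x in features]
    return stack
```

Calling `pretrain(inputs, labels, hidden_sizes=[3, 2], n_states=2)` on a toy data set returns two weight matrices, one per hidden layer. In a full system these would serve as the starting point for ordinary training of the complete network, which this sketch leaves out.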

First claim

Opening claim text (preview).

What is claimed is:

1. A pre-training apparatus for speech recognition, the pre-training apparatus comprising: a memory and at least one processor coupled to the memory; a set of units configured to pre-train a deep neural network, the set of units comprising: an input unit configured to receive speech data; a model generation unit configured to initialize a connection weight of a deep neural network, based on the speech data; and an output unit configured to output information about the connection weight; wherein in order for a state of a phoneme unit corresponding to the speech data to be output, the model generation unit trains the connection weight by piling a plurality of hidden layers according to a determined structure of the deep neural network, applies an output layer to a certain layer between the plurality of hidden layers to correct the trained connection weight in each of the plurality of hidden layers, thereby initializing the connection weight, wherein when generating a structure of deep neural network by piling a plurality of hidden layers, the model generation unit applies the output layer to one hidden layer to correct a connection weight of the one hidden layer, removes the output layer, piles another hidden layer, applies the output layer to the other hidden layer to correct a connection weight of the other hidden layer, sequentially performs the application and correction on a next hidden layer to a last hidden layer to correct a connection weight of each of the hidden layers subsequent to the other hidden layer, thereby initializing the connection weight.

2. The pre-training apparatus of claim 1, wherein the output layer is used for speech recognition directly matching a state of each of a plurality of phoneme units.

3. The pre-training apparatus of claim 1, wherein the model generation unit initializes a first connection weight between the input layer, to which the speech data is input, and a first hidden layer piled on the input layer by using the output layer, initializes a second connection weight between the first hidden layer and a second hidden layer piled on the first hidden layer by using the output layer, and sequentially performs the initialization on a next hidden layer to a last hidden layer to initialize a connection weight of each of hidden layers subsequent to the second hidden layer.

4. The pre-training apparatus of claim 3, wherein the model generation unit converts the speech data into a frame feature vector, inputs an input frame group, which is a set of frame feature vectors configuring two or more frames, to the input layer, determines the first connection weight between the input layer and the first hidden layer piled on the input layer by using a maximum entropy method, piles the output layer on the first hidden layer, and corrects the first connection weight in order for a state of a phoneme unit corresponding to the input frame group to be output, thereby initializing the first connection weight.

5. The pre-training apparatus of claim 4, wherein in order to initialize a second connection weight, the model generation unit corrects the first connection weight, removes the output layer, piles a second hidden layer on the first hidden layer, calculates a value in each of nodes of the first hidden layer by using the first connection weight, determines the second connection weight between the first hidden layer and the second hidden layer piled by using the maximum entropy method, piles the output layer on the second hidden layer, and corrects the second connection weight in order for the state of the phoneme unit corresponding to the input frame group to be output.

6. The pre-training apparatus of claim 1, wherein the input unit performs communication over a wired network or a wireless network to receive the speech data, receives the speech data from a storage medium, or directly receives a speech and digitalizes the speech to convert the speech into speech data.

7. A computer-implemented pre-training method for speech recognition, the computer implemented pre-training method comprising: receiving speech data in an input unit; and initializing a connection weight of a deep neural network using at least one processor, based on the speech data, wherein the initializing of the connection weight comprises, in order for a state of a phoneme unit corresponding to the speech data to be output, training the connection weight by piling a plurality of hidden layers according to a determined structure of the deep neural network, and applying an output layer to a certain layer between the plurality of hidden layers to correct the trained connection weight in each of the plurality of hidden layers, thereby initializing the connection weight, wherein the initializing of the connection weight further comprises: when generating a structure of deep neural network by piling a plurality of hidden layers, applying the output layer to one hidden layer to correct a connection weight of the one hidden layer; removing the output layer and piling another hidden layer; applying the output layer to the other hidden layer to correct a connection weight of the other hidden layer; and sequentially performing the applying and correcting on a next hidden layer to a last hidden layer to initialize a connection weight of each of the plurality of hidden layers.

8. The pre-training method of claim 7, wherein the initializing of the connection weight comprises: initializing a first connection weight between the input layer, to which the speech data is input, and a first hidden layer piled on the input layer by using the output layer; initializing a second connection weight between the first hidden layer and a second hidden layer piled on the first hidden layer by using the output layer; and sequentially performing the initialization on a next hidden layer to a last hidden layer to initialize a connection weight of each of hidden layers subsequent to the second hidden layer.

9. The pre-training method of claim 8, wherein the initializing of the first connection weight comprises: converting the speech data into a frame feature vector, inputs an input frame group, which is a set of frame feature vectors configuring two or more frames, to the input layer; determining the first connection weight between the input layer and the first hidden layer piled on the input layer by using the maximum entropy method; and piling the output layer on the first hidden layer and correcting the first connection weight in order for a state of a phoneme unit corresponding to the input frame group to be output.
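The "input frame group" in the claims, a set of frame feature vectors covering two or more frames, can be pictured with a short Python sketch. The function name `make_frame_groups`, the symmetric context window, and the choice to pad the edges by repeating the first and last frames are assumptions for illustration; the claims do not specify how many frames are grouped or how utterance boundaries are handled.

```python
def make_frame_groups(frames, left=1, right=1):
    """Concatenate each frame with its neighbours into one input vector.

    frames: list of per-frame feature vectors (lists of floats).
    left/right: hypothetical context sizes; the patent only requires
    that a group span two or more frames.
    """
    if not frames:
        return []
    last = len(frames) - 1
    groups = []
    for t in range(len(frames)):
        group = []
        for offset in range(-left, right + 1):
            # ASSUMPTION: repeat the edge frame where the window runs
            # past the start or end of the utterance.
            idx = min(max(t + offset, 0), last)
            group.extend(frames[idx])
        groups.append(group)
    return groups
```

With one frame of context on each side, every group is three times the length of a single frame feature vector, and there is one group per input frame. Each group would be paired with the phoneme-unit state of its centre frame as the training target.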

Assignees

Electronics & Telecommunications Res Inst

Inventors

Classifications

  • Supervised learning · CPC title

  • Feedforward networks · CPC title

  • Architecture, e.g. interconnection topology · CPC title

  • Feature extraction for speech recognition; Selection of recognition unit · CPC title

  • using artificial neural networks · CPC title

Patent family

Related publications grouped by family.

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US9875737B2 cover?
A pre-training apparatus and method for speech recognition that initialize a deep neural network layer by layer to correct node connection weights. The apparatus includes an input unit that receives speech data and a model generation unit that initializes the connection weights based on that data. See the Abstract section for the full text.
Who is the assignee on this patent?
Electronics & Telecommunications Res Inst
What technology area does this patent fall under?
Primary CPC classification G10L15/063. Mapped technology areas include Physics.
When was this patent published?
Publication date: Jan 23, 2018 (kind code B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 7 related publications on this page (citations in our corpus or others sharing the same primary CPC).