US11004443B2 · US · B2
| Field | Value |
|---|---|
| Publication number | US-11004443-B2 |
| Application number | US-201816117373-A |
| Country | US |
| Kind code | B2 |
| Filing date | Aug 30, 2018 |
| Priority date | Aug 30, 2018 |
| Publication date | May 11, 2021 |
| Grant date | May 11, 2021 |
Official abstract text for this publication.
Methods and apparatuses are provided for performing acoustic to word (A2W) speech recognition training performed by at least one processor. The method includes initializing, by the at least one processor, one or more first layers of a neural network with phone based Connectionist Temporal Classification (CTC), initializing, by the at least one processor, one or more second layers of the neural network with grapheme based CTC, acquiring, by the at least one processor, training data, and performing, by the at least one processor, A2W speech recognition training based on the initialized one or more first layers and one or more second layers of the neural network using the training data.
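The initialization order described in the abstract can be sketched as follows. This is only an illustrative sketch, not the patented implementation: it assumes each layer is a plain dictionary of parameters, and the names `init_a2w_encoder`, `phone_layers`, and `grapheme_layers` are hypothetical. It shows the lower layers being taken from a phone based CTC model first, with the upper layers from a grapheme based CTC model stacked on top.

```python
from copy import deepcopy


def init_a2w_encoder(phone_ctc_layers, grapheme_ctc_layers):
    """Stack phone-CTC layers (lower) under grapheme-CTC layers (upper).

    Both arguments are lists of parameter dictionaries; this representation
    is an assumption made for illustration only.
    """
    model = []
    # Lower layers (closer to the acoustic input) are initialized first,
    # from the phone based CTC model.
    for params in phone_ctc_layers:
        model.append({"source": "phone_ctc", "params": deepcopy(params)})
    # Upper layers are stacked afterwards, from the grapheme based CTC model.
    for params in grapheme_ctc_layers:
        model.append({"source": "grapheme_ctc", "params": deepcopy(params)})
    return model


# Hypothetical pretrained parameters.
phone_layers = [{"w": [0.1, 0.2]}, {"w": [0.3, 0.4]}]
grapheme_layers = [{"w": [0.5, 0.6]}]

a2w_layers = init_a2w_encoder(phone_layers, grapheme_layers)
print([layer["source"] for layer in a2w_layers])
```

The parameters are deep-copied so that subsequent A2W training would not modify the pretrained models they came from.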
Opening claim text (preview).
What is claimed is:
1. A method of performing acoustic to word (A2W) speech recognition training performed by at least one processor, the method comprising: initializing, by the at least one processor, one or more first layers of a neural network with phone based Connectionist Temporal Classification (CTC); initializing, by the at least one processor, one or more second layers of the neural network with grapheme based CTC; acquiring, by the at least one processor, training data; and performing, by the at least one processor, A2W speech recognition training based on the initialized one or more first layers and one or more second layers of the neural network using the training data, wherein the one or more first layers with the phone based CTC is a lower layer, which is initialized prior to the one or more second layers with the grapheme based CTC, which is an upper layer provided after the lower layer.
2. The method of claim 1, wherein the one or more first layers of the neural network are lower layers of the neural network that are closer to an input of the training data.
3. The method of claim 2, wherein the one or more second layers of the neural network are stacked above one or more first layers of the neural network.
4. The method of claim 1, wherein the one or more first layers of the neural network comprises at least one of a Convolutional Neural Network (CNN) layer and a Bi-directional Long Short-Term Memory (BLSTM) layer.
5. The method of claim 1, wherein the performing the A2W speech recognition training comprises: generating a first training model by performing a first training stage by predicting only a first set of target words; and generating a second training model by performing a second training stage by predicting a second set of target words based on the first training model.
6. The method of claim 5, wherein all utterances in the training data of words not belonging to the first set of target words are excluded from the first training stage.
7. The method of claim 1, wherein the one or more second layers comprises a first linear projection layer, and wherein the performing the A2W speech recognition training comprises: projecting an output of the first linear projection layer with a second linear projection layer and a third linear projection layer.
8. The method of claim 7, wherein output from the second linear layer is directly connected to a final output layer of a CE model to receive error signals from CE loss, and wherein both the output from the second linear layer and output from the third linear layer are concatenated to obtain a final output distribution for computing CTC loss.
9. An acoustic to word (A2W) speech recognition training apparatus comprising: at least one memory operable to store program code; and at least one processor operable to read said program code and operate as instructed by said program code, said program code comprising: a first initialization code configured to initialize one or more first layers of a neural network with phone based Connectionist Temporal Classification (CTC); a second initialization code configured to initialize one or more second layers of the neural network with grapheme based CTC; an acquiring code configured to acquire training data; and a training code configured to perform A2W speech recognition training based on the initialized one or more first layers and one or more second layers of the neural network using the training data, wherein the one or more first layers with the phone based CTC is a lower layer, which is initialized prior to the one or more second layers with the grapheme based CTC, which is an upper layer provided after the lower layer.
10. The A2W speech recognition training apparatus of claim 9, wherein the one or more first layers of the neural network are lower layers of the neural network that are closer to an input of the training data.
11. The A2W speech recognition training apparatus of claim 10, wherein the one or more second layers of the neural network are stacked above one or more first layers of the neural network.
12. The A2W speech recognition training apparatus of claim 9, wherein the one or more first layers of the neural network comprises at least one of a Convolutional Neural Network (CNN) layer and a Bi-directional Long Short-Term Memory (BLSTM) layer.
13. The A2W speech recognition training apparatus of claim 9, wherein the performing the A2W speech recognition training comprises: generating a first training model by performing a first training stage by predicting only a first set of target words; and generating a second training model by performing a second training stage by predicting a second set of target words based on the first training model.
14. The A2W speech recognition training apparatus of claim 13, wherein all utterances in the training data of words not belonging to the first set of target words are excluded from the first training stage.
15. The A2W speech recognition training apparatus of claim 9, wherein the one or more second layers comprises a first linear projection layer, and wherein the performing the A2W speech recognition training comprises: projecting an output of the first linear projection layer with a second linear projection layer and a third linear projection layer.
16. The A2W speech recognition training apparatus of claim 15, wherein output from the second linear layer is directly connected to a final output layer of a CE model to receive error signals from CE loss, and wherein both the output from the second linear layer and output from the third linear layer are concatenated to obtain a final output distribution for computing CTC loss.
17. A non-transitory computer readable medium having stored thereon program code for performing an acoustic to word (A2W) speech recognition training, said program code comprising: a first initialization code configured to initialize one or more first layers of a neural network with phone based Connectionist Temporal Classification (CTC); a second initialization code configured to initialize one or more second layers of the neural network with grapheme based CTC; an acquiring code configured to acquire training data; and a training code configured to perform A2W speech recognition training based on the initialized one or more first layers and one or more second layers of the neural network using the training data, wherein the one or more first layers with the phone based CTC is a lower layer, which is initialized prior to the one or more second layers with the grapheme based CTC, which is an upper layer provided after the lower layer.
18. The non-transitory computer readable medium according to claim 17, wherein the training code to perform A2W speech recognition training comprises: a first generating code configured to generate a first training model by performing a first training stage by predicting only a first set of target words; and a first generating code configured to generate a second training model by performing a second training stage by predicting a second set of target words based on the first training model.
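The two-stage training recited in claims 5 and 6 (and their counterparts in claims 13, 14, and 18) can be illustrated with a small data-selection sketch. This is one possible reading, not the patented implementation: it assumes utterances are whitespace-separated transcripts, that an utterance is dropped from a stage when it contains any word outside that stage's target set, and that the second target set extends the first. The names `select_stage_utterances`, `stage1_words`, and `stage2_words` are hypothetical.

```python
def select_stage_utterances(utterances, target_words):
    """Keep only utterances whose words all belong to the stage's target set.

    Assumption for illustration: transcripts are split on whitespace, and an
    utterance with any out-of-set word is excluded from the stage.
    """
    target = set(target_words)
    return [u for u in utterances if all(w in target for w in u.split())]


# Hypothetical training transcripts and target vocabularies.
utterances = ["turn on the light", "turn off the radio", "the light"]
stage1_words = {"turn", "on", "the", "light"}
# Assumed here: the second stage predicts a larger set that includes the first.
stage2_words = stage1_words | {"off", "radio"}

stage1_data = select_stage_utterances(utterances, stage1_words)
stage2_data = select_stage_utterances(utterances, stage2_words)

print(stage1_data)
print(stage2_data)
```

In this sketch the first stage would train on `stage1_data` only, and the second stage would start from the resulting model and train on `stage2_data`.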
Training · CPC title
Combinations of networks · CPC title
Recurrent networks, e.g. Hopfield networks · CPC title
characterised by memory or gating, e.g. long short-term memory [LSTM] or gated recurrent units [GRU] · CPC title
Supervised learning · CPC title