Multistage curriculum training framework for acoustic-to-word speech recognition

US11004443B2 · US · B2

Patent metadata
Field: Value
Publication number: US-11004443-B2
Application number: US-201816117373-A
Country: US
Kind code: B2
Filing date: Aug 30, 2018
Priority date: Aug 30, 2018
Publication date: May 11, 2021
Grant date: May 11, 2021

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

Methods and apparatuses are provided for performing acoustic to word (A2W) speech recognition training performed by at least one processor. The method includes initializing, by the at least one processor, one or more first layers of a neural network with phone based Connectionist Temporal Classification (CTC), initializing, by the at least one processor, one or more second layers of the neural network with grapheme based CTC, acquiring, by the at least one processor, training data and performing, by the at least one processor, A2W speech recognition training based the initialized one or more first layers and one or more second layers of the neural network using the training data.
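
The layer-wise initialization described in the abstract can be illustrated with a minimal sketch. It is not the patent's implementation: the layer names, the dictionary-of-weights representation, the placeholder weight values, and the `init_a2w_weights` helper are all hypothetical, and the sketch assumes that a phone based CTC model and a grapheme based CTC model with matching layer shapes have already been trained.

```python
import copy


def init_a2w_weights(layer_names, phone_ctc, grapheme_ctc, num_lower):
    """Assemble initial A2W weights from two pretrained CTC models.

    layer_names  : layers ordered from the input upward (hypothetical names)
    phone_ctc    : dict of layer name -> weights from a phone based CTC model
    grapheme_ctc : dict of layer name -> weights from a grapheme based CTC model
    num_lower    : how many layers, counted from the input, count as "lower"
    """
    a2w = {}
    # Lower layers (closer to the input) are initialized first, from phone CTC.
    for name in layer_names[:num_lower]:
        a2w[name] = copy.deepcopy(phone_ctc[name])
    # Upper layers, stacked above them, are initialized next, from grapheme CTC.
    for name in layer_names[num_lower:]:
        a2w[name] = copy.deepcopy(grapheme_ctc[name])
    return a2w


# Placeholder weights only; real models would hold tensors per layer.
layer_names = ["cnn", "blstm_1", "blstm_2", "blstm_3"]
phone_ctc = {name: [1.0, 1.0] for name in layer_names}
grapheme_ctc = {name: [2.0, 2.0] for name in layer_names}

a2w = init_a2w_weights(layer_names, phone_ctc, grapheme_ctc, num_lower=2)
for name in layer_names:
    print(name, a2w[name])
```

The weights are deep-copied so that later A2W training does not modify the pretrained CTC models. Where the boundary between lower and upper layers sits (`num_lower`) is an assumption of this sketch; the abstract only says "one or more" layers of each kind.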

First claim

Opening claim text (preview).

What is claimed is:

1. A method of performing acoustic to word (A2W) speech recognition training performed by at least one processor, the method comprising: initializing, by the at least one processor, one or more first layers of a neural network with phone based Connectionist Temporal Classification (CTC); initializing, by the at least one processor, one or more second layers of the neural network with grapheme based CTC; acquiring, by the at least one processor, training data; and performing, by the at least one processor, A2W speech recognition training based the initialized one or more first layers and one or more second layers of the neural network using the training data, wherein the one or more first layers with the phone based CTC is a lower layer, which is initialized prior to the one or more second layers with the graphene based CTC, which is an upper layer provided after the lower layer.

2. The method of claim 1, wherein the one or more first layers of the neural network are lower layers of the neural network that are closer to an input of the training data.

3. The method of claim 2, wherein the one or more second layers of the neural network are stacked above one or more first layers of the neural network.

4. The method of claim 1, wherein the one or more first layers of the neural network comprises at least one of a Convolutional Neural Network (CNN) layer and a Bi-directional Long Short-Term Memory (BLSTM) layer.

5. The method of claim 1, wherein the performing the A2W speech recognition training comprises: generating a first training model by performing a first training stage by predicting only a first set of target words; and generating a second training model by performing a second training stage by predicting a second set of target of words based on the first training model.

6. The method of claim 5, wherein all utterances in the training data of words not belonging to the first set of target words are excluded from the first training stage.

7. The method of claim 1, wherein the one or more second layers comprises a first linear projection layer, and wherein the performing the A2W speech recognition training comprises: projecting an output of the first linear projection layer with a second linear projection layer and a third linear projection layer.

8. The method of claim 7, wherein output from the second linear layer is directly connected to a final output layer of a CE model to receive error signals from CE loss, and wherein both the output from the second linear layer and output from the third linear layer are concatenated to obtain a final output distribution for computing CTC loss.

9. An acoustic to word (A2W) speech recognition training apparatus comprising: at least one memory operable to store program code; and at least one processor operable to read said program code and operate as instructed by said program code, said program code comprising: a first initialization code configured to initialize one or more first layers of a neural network with phone based Connectionist Temporal Classification (CTC); a second initialization code configured to initialize one or more second layers of the neural network with grapheme based CTC; an acquiring code configured to acquire training data; and a training code configured to perform A2W speech recognition training based the initialized one or more first layers and one or more second layers of the neural network using the training data, wherein the one or more first layers with the phone based CTC is a lower layer, which is initialized prior to the one or more second layers with the graphene based CTC, which is an upper layer provided after the lower layer.

10. The A2W speech recognition training apparatus of claim 9, wherein the one or more first layers of the neural network are lower layers of the neural network that are closer to an input of the training data.

11. The A2W speech recognition training apparatus of claim 10, wherein the one or more second layers of the neural network are stacked above one or more first layers of the neural network.

12. The A2W speech recognition training apparatus of claim 9, wherein the one or more first layers of the neural network comprises at least one of a Convolutional Neural Network (CNN) layer and a Bi-directional Long Short-Term Memory (BLSTM) layer.

13. The A2W speech recognition training apparatus of claim 9, wherein the performing the A2W speech recognition training comprises: generating a first training model by performing a first training stage by predicting only a first set of target words; and generating a second training model by performing a second training stage by predicting a second set of target words based on the first training model.

14. The A2W speech recognition training apparatus of claim 13, wherein all utterances in the training data of words not belonging to the first set of target words are excluded from the first training stage.

15. The A2W speech recognition training apparatus of claim 9, wherein the one or more second layers comprises a first linear projection layer, and wherein the performing the A2W speech recognition training comprises: projecting an output of the first linear projection layer with a second linear projection layer and a third linear projection layer.

16. The A2W speech recognition training apparatus of claim 15, wherein output from the second linear layer is directly connected to a final output layer of a CE model to receive error signals from CE loss, and wherein both the output from the second linear layer and output from the third linear layer are concatenated to obtain a final output distribution for computing CTC loss.

17. A non-transitory computer readable medium having stored thereon program code for performing an acoustic to word (A2W) speech recognition training, said program code comprising: a first initialization code configured to initialize one or more first layers of a neural network with phone based Connectionist Temporal Classification (CTC); a second initialization code configured to initialize one or more second layers of the neural network with grapheme based CTC; an acquiring code configured to acquire training data; and a training code configured to perform A2W speech recognition training based the initialized one or more first layers and one or more second layers of the neural network using the training data, wherein the one or more first layers with the phone based CTC is a lower layer, which is initialized prior to the one or more second layers with the graphene based CTC, which is an upper layer provided after the lower layer.

18. The non-transitory computer readable medium according to claim 17, wherein the training code to perform A2W speech recognition training comprises: a first generating code configured to generate a first training model by performing a first training stage by predicting only a first set of target words; and a first generating code configured to generate a second training model by performing a second training stage by predicting a second set of target words based on the first training model.
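
The two-stage curriculum recited in the dependent claims (a first stage that predicts only a first set of target words and excludes utterances containing other words, then a second stage that starts from the first model) can be sketched as follows. This is an illustration under stated assumptions, not the claimed implementation: the utterance format, the example vocabularies, and the `filter_utterances`, `run_curriculum`, and `dummy_train` names are hypothetical. The claims do not state how the word sets are chosen or which utterances the second stage uses; this sketch assumes the second stage trains on all utterances.

```python
def filter_utterances(utterances, target_words):
    """Keep only utterances whose words all belong to target_words."""
    return [u for u in utterances
            if all(word in target_words for word in u["words"])]


def run_curriculum(utterances, first_vocab, second_vocab, train_fn):
    """Run two training stages; train_fn(init_model, data, vocab) is hypothetical."""
    # Stage 1: predict only the first set of target words, so utterances
    # containing any other word are excluded from this stage.
    stage1_data = filter_utterances(utterances, first_vocab)
    first_model = train_fn(None, stage1_data, first_vocab)
    # Stage 2: predict the second set of target words, starting from the
    # stage-1 model. Using all utterances here is an assumption of the sketch.
    second_model = train_fn(first_model, utterances, second_vocab)
    return stage1_data, first_model, second_model


def dummy_train(init_model, data, vocab):
    """Stand-in for real training: records what it was given."""
    return {"init_from": init_model,
            "num_utterances": len(data),
            "vocab_size": len(vocab)}


utterances = [
    {"id": "u1", "words": ["play", "music"]},
    {"id": "u2", "words": ["play", "jazz"]},
    {"id": "u3", "words": ["stop", "music"]},
]
first_vocab = {"play", "music", "stop"}
second_vocab = first_vocab | {"jazz"}

stage1_data, first_model, second_model = run_curriculum(
    utterances, first_vocab, second_vocab, dummy_train)
print([u["id"] for u in stage1_data])  # u2 is excluded: "jazz" is not in first_vocab
```

Passing the training routine in as `train_fn` keeps the staging logic separate from any particular toolkit; a real system would replace `dummy_train` with its own CTC training loop.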

Assignees

  • Tencent America LLC

Inventors

Classifications

  • G10L15/063 · Primary

    Training · CPC title

  • Combinations of networks · CPC title

  • Recurrent networks, e.g. Hopfield networks · CPC title

  • characterised by memory or gating, e.g. long short-term memory [LSTM] or gated recurrent units [GRU] · CPC title

  • Supervised learning · CPC title

Patent family

Related publications grouped by family.

External sources

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US11004443B2 cover?
Methods and apparatuses are provided for performing acoustic to word (A2W) speech recognition training performed by at least one processor. The method includes initializing, by the at least one processor, one or more first layers of a neural network with phone based Connectionist Temporal Classification (CTC), initializing, by the at least one processor, one or more second layers of the neural …
Who is the assignee on this patent?
Tencent America LLC
What technology area does this patent fall under?
Primary CPC classification G10L15/063. Mapped technology areas include Physics.
When was this patent published?
Publication date: May 11, 2021 (kind code B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 12 related publications on this page (citations in our corpus or others sharing the same primary CPC).