Joint Many-Task Neural Network Model for Multiple Natural Language Processing (NLP) Tasks
US-2018121787-A1 · May 3, 2018 · US
US10573295B2 · US · B2
| Field | Value |
|---|---|
| Publication number | US-10573295-B2 |
| Application number | US-201815878113-A |
| Country | US |
| Kind code | B2 |
| Filing date | Jan 23, 2018 |
| Priority date | Oct 27, 2017 |
| Publication date | Feb 25, 2020 |
| Grant date | Feb 25, 2020 |
Official abstract text for this publication.
The disclosed technology teaches a deep end-to-end speech recognition model, including using multi-objective learning criteria to train a deep end-to-end speech recognition model on training data comprising speech samples temporally labeled with ground truth transcriptions. The multi-objective learning criteria updates model parameters of the model over one thousand to millions of backpropagation iterations by combining, at each iteration, a maximum likelihood objective function that modifies the model parameters to maximize a probability of outputting a correct transcription and a policy gradient function that modifies the model parameters to maximize a positive reward defined based on a non-differentiable performance metric which penalizes incorrect transcriptions in accordance with their conformity to corresponding ground truth transcriptions; and upon convergence after a final backpropagation iteration, persisting the modified model parameters learned by using the multi-objective learning criteria with the model to be applied to further end-to-end speech recognition.
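As a rough illustration of the combination the abstract describes, the sketch below adds a maximum likelihood loss to a REINFORCE-style policy gradient term computed from one sampled transcription. It is a minimal sketch under stated assumptions, not the patent's implementation: the function names, the `pg_weight` mixing parameter, and the scalar surrogate form are all hypothetical, and a real system would compute these terms inside an automatic-differentiation framework.

```python
import math
import random


def sample_path(probs, rng):
    """Independently sample one label index per timestep.

    probs: list of per-timestep softmax distributions (lists of floats).
    Returns the sampled label indices and the log-probability of the sample.
    """
    labels, log_prob = [], 0.0
    for dist in probs:
        idx = rng.choices(range(len(dist)), weights=dist, k=1)[0]
        labels.append(idx)
        log_prob += math.log(dist[idx])
    return labels, log_prob


def combined_loss(ml_loss, sample_log_prob, reward, pg_weight):
    """Scalar surrogate loss (hypothetical form).

    ml_loss:         maximum likelihood loss for the utterance (e.g. a CTC loss)
    sample_log_prob: log-probability of the sampled transcription
    reward:          reward of the sampled transcription (e.g. 1 - WER)
    pg_weight:       assumed mixing weight for the policy gradient term
    """
    policy_gradient_term = -reward * sample_log_prob
    return ml_loss + pg_weight * policy_gradient_term


if __name__ == "__main__":
    rng = random.Random(0)
    probs = [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]]
    labels, log_prob = sample_path(probs, rng)
    print(labels, combined_loss(2.0, log_prob, 0.8, 0.5))
```

Minimizing the second term raises the probability of sampled transcriptions in proportion to their reward, which is how a non-differentiable metric can still influence the parameters.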
Opening claim text (preview).
We claim as follows:
1. A computer-implemented method of training a deep end-to-end speech recognition model, the method including: using a multi-objective learning criteria to train a deep end-to-end speech recognition model on training data comprising speech samples temporally labeled with ground truth transcriptions, wherein the multi-objective learning criteria updates model parameters of the model over one thousand to millions of backpropagation iterations by combining, at each iteration, a maximum likelihood objective function that modifies the model parameters to maximize a probability of outputting a correct transcription and a policy gradient function that modified the model parameters to maximize a positive reward defined based on a non-differentiable performance metric which penalizes incorrect transcriptions in accordance with their conformity to corresponding ground truth transcriptions; and upon convergence after a final backpropagation iteration, persisting the modified model parameters learned by using the multi-objective learning criteria with the model to be applied to further end-to-end speech recognition.
2. The method of claim 1, wherein, for each timestep, the model produces a normalized distribution of softmax probabilities over a set of transcription labels, including a blank label.
3. The method of claim 2, wherein the maximum likelihood objective function is a connectionist temporal classification (abbreviated CTC) objective function that maximizes the probability of outputting the correct transcription by: combining individual probabilities of a plurality of candidate output transcriptions to produce an output transcription, wherein an individual probability of a candidate output transcription is determined by selecting a most probable label for each timestep and multiplying softmax probabilities of each of the selected labels; and measuring differences between the output transcription and a ground truth transcription.
4. The method of claim 2, wherein the policy gradient function determines the reward for an output transcription by: independently sampling a transcription label for each timestep and concatenating the transcription labels sampled across the timesteps to produce the output transcription; and measuring differences between the output transcription and a ground truth transcription based on the performance metric.
5. The method of claim 1, wherein the performance metric is word error rate (abbreviated WER).
6. The method of claim 5, wherein the reward is determined based on a reward function that is defined as 1−WER.
7. The method of claim 1, wherein the performance metric is character error rate (abbreviated CER).
8. The method of claim 1, wherein the policy gradient function minimizes a negative reward defined based on the performance metric.
9. The method of claim 1, wherein the policy gradient function is applied using self-critical sequence training (abbreviated SCST).
10. The method of claim 1, wherein relative reliance on the maximum likelihood objective function and the policy gradient function shifts during training, with greater emphasis on the maximum likelihood objective function early in training than late in training.
11. A deep end-to-end speech recognition system, comprising: an input port that receives digital audio samples of a signal comprising speech; and a deep end-to-end speech recognition processor comprising hardware and a stack of layers running on the hardware including convolution layers and recurrent layers, coupled to the input port and configurable to process the digital audio samples, recognize speech from the audio samples, and output transcriptions corresponding to recognized speech; wherein the deep end-to-end speech recognition processor includes parameters trained using a multi-objective learning criteria on training data comprising speech samples temporally labeled with ground truth transcriptions; and wherein the multi-objective learning criteria update the processor parameters over one thousand to millions of backpropagation iterations by combining, at each iteration, a maximum likelihood objective function that modified the processor parameters to maximize a probability of outputting a correct transcription, and a policy gradient function that modified the processor parameters to maximize a positive reward defined based on a non-differentiable performance metric which penalizes incorrect transcriptions in accordance with their conformity to corresponding ground truth transcriptions.
12. The deep end-to-end speech recognition system of claim 11, wherein, for each timestep, the processor produces a normalized distribution of softmax probabilities over a set of transcription labels, including a blank label.
13. The deep end-to-end speech recognition system of claim 12, wherein the maximum likelihood objective function is a connectionist temporal classification (abbreviated CTC) objective function that maximizes the probability of outputting a correct transcription by: combining individual probabilities of a plurality of candidate output transcriptions to produce an output transcription, wherein an individual probability of a candidate output transcription is determined by selecting a most probable label for each timestep and multiplying softmax probabilities of each of the selected labels; and measuring differences between the output transcription and a ground truth transcription.
14. The deep end-to-end speech recognition system of claim 11, wherein the policy gradient function determines the reward for an output transcription by: independently sampling a transcription label for each timestep and concatenating the transcription labels sampled across the timesteps to produce the output transcription; and measuring differences between the output transcription and a ground truth transcription based on the performance metric.
15. The deep end-to-end speech recognition system of claim 11, wherein the performance metric is word error rate (abbreviated WER).
16. The deep end-to-end speech recognition system of claim 15, wherein the reward is determined based on a reward function that is defined as 1−WER.
17. The deep end-to-end speech recognition system of claim 11, wherein the performance metric is character error rate (abbreviated CER).
18. The deep end-to-end speech recognition system of claim 11, wherein the policy gradient function minimizes a negative reward defined based on the performance metric.
19. The deep end-to-end speech recognition system of claim 11, wherein the policy gradient function is applied using self-critical sequence training (abbreviated SCST).
20. The deep end-to-end speech recognition system of claim 11, where relative reliance on the maximum likelihood objective function and the policy gradient function shifts during training, with greater emphasis on the maximum likelihood objective function early in training than late in training.
21. A tangible non-transitory computer readable storage medium impressed with computer program instructions executable by a processor, the instructions, when executed on a processor, implement a method including: using a multi-objective learning criteria to train a deep end-to-end speech recognition model on training data comprising speech samples temporally labeled with ground truth transcriptions, wherein the multi-objective learning criteria updates model parameters of the model over one thousand to millions of backpropagation iterations
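Several of the dependent claims name concrete quantities: word error rate, a reward defined as 1−WER, per-timestep labels that include a blank, and self-critical sequence training. The sketch below shows one common way to compute them, assuming WER is word-level Levenshtein distance divided by reference length and that label sequences are collapsed with the standard CTC rule (merge repeats, then drop blanks). The function names and the `"_"` blank symbol are hypothetical choices for illustration, and the SCST advantage follows the usual formulation (sampled reward minus greedy-decode reward), which the claims do not spell out.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate: word-level edit distance over reference length."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def reward(reference, hypothesis):
    """Reward defined as 1 - WER, as in claims 6 and 16."""
    return 1.0 - wer(reference, hypothesis)


def ctc_collapse(labels, blank="_"):
    """Standard CTC collapse: merge repeated labels, then remove blanks."""
    out, prev = [], None
    for label in labels:
        if label != prev and label != blank:
            out.append(label)
        prev = label
    return "".join(out)


def scst_advantage(reference, sampled, greedy):
    """Self-critical baseline: sampled reward minus greedy-decode reward."""
    return reward(reference, sampled) - reward(reference, greedy)


if __name__ == "__main__":
    print(ctc_collapse(list("hh_e_ll_lo")))
    print(reward("a b c d", "a b c x"))
```

Note that WER can exceed 1 when a hypothesis contains many inserted words, so 1−WER computed this way is not guaranteed to be positive for every sample.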
for comparison or discrimination · CPC title
using statistical models, e.g. Hidden Markov Models [HMMs] (G10L15/18 takes precedence) · CPC title
Training · CPC title
Backpropagation, e.g. using gradient descent · CPC title
using artificial neural networks · CPC title