What technology area does this patent fall under?

Primary CPC classification G10L15/063. Mapped technology areas include Physics.

When was this patent published?

Publication date Feb 25, 2020 (B2). Legal status and post-grant events are not shown on this page.

What related patents are in patentsdb?

We list 12 related publications on this page (citations in our corpus or others sharing the same primary CPC).

End-to-end speech recognition with policy learning

US10573295B2 · US · B2

Patent metadata
Field	Value
Publication number	US-10573295-B2
Application number	US-201815878113-A
Country	US
Kind code	B2
Filing date	Jan 23, 2018
Priority date	Oct 27, 2017
Publication date	Feb 25, 2020
Grant date	Feb 25, 2020

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

Title
What the patent document calls the invention.
Abstract
A short plain-language summary of the technical disclosure.
Assignees and inventors
Who owns or filed the patent and who is credited as inventor.
Key dates
Filing, priority, publication, and grant dates set the timeline.
First independent claim
The legal scope of protection — read this for what is actually claimed.
CPC / IPC classifications
Technology tags used to group this patent with similar filings.
Citations and related patents
Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

The disclosed technology teaches a deep end-to-end speech recognition model, including using multi-objective learning criteria to train a deep end-to-end speech recognition model on training data comprising speech samples temporally labeled with ground truth transcriptions. The multi-objective learning criteria updates model parameters of the model over one thousand to millions of backpropagation iterations by combining, at each iteration, a maximum likelihood objective function that modifies the model parameters to maximize a probability of outputting a correct transcription and a policy gradient function that modifies the model parameters to maximize a positive reward defined based on a non-differentiable performance metric which penalizes incorrect transcriptions in accordance with their conformity to corresponding ground truth transcriptions; and upon convergence after a final backpropagation iteration, persisting the modified model parameters learned by using the multi-objective learning criteria with the model to be applied to further end-to-end speech recognition.
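The per-iteration combination described in the abstract can be sketched as a weighted sum of two scalar losses. This is a minimal illustration, not the patented method: the convex-combination form, the linear decay schedule, and the names `ml_weight` and `combined_loss` are assumptions. The page only states that the two objectives are combined at each iteration and, in claim 10, that emphasis on the maximum likelihood objective is greater early in training than late.

```python
def ml_weight(step: int, total_steps: int,
              start: float = 1.0, end: float = 0.1) -> float:
    """Weight on the maximum likelihood loss at a given iteration.

    Assumed schedule: linear decay from `start` to `end`. The source
    does not say how the emphasis shifts, only that it does.
    """
    if total_steps <= 1:
        return end
    frac = min(max(step / (total_steps - 1), 0.0), 1.0)
    return start + (end - start) * frac


def combined_loss(ml_loss: float, pg_loss: float,
                  step: int, total_steps: int) -> float:
    """Assumed convex combination of the two objectives at one iteration."""
    lam = ml_weight(step, total_steps)
    return lam * ml_loss + (1.0 - lam) * pg_loss


if __name__ == "__main__":
    # Hypothetical loss values, only to show how the mix changes over time.
    for step in (0, 500, 999):
        value = combined_loss(ml_loss=2.0, pg_loss=0.5,
                              step=step, total_steps=1000)
        print(step, round(value, 3))
```

In a real training loop the two inputs would be differentiable tensors rather than floats, and the gradient of the combined value would drive each backpropagation step.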

First claim

Opening claim text (preview).

We claim as follows:

1. A computer-implemented method of training a deep end-to-end speech recognition model, the method including: using a multi-objective learning criteria to train a deep end-to-end speech recognition model on training data comprising speech samples temporally labeled with ground truth transcriptions, wherein the multi-objective learning criteria updates model parameters of the model over one thousand to millions of backpropagation iterations by combining, at each iteration, a maximum likelihood objective function that modifies the model parameters to maximize a probability of outputting a correct transcription and a policy gradient function that modified the model parameters to maximize a positive reward defined based on a non-differentiable performance metric which penalizes incorrect transcriptions in accordance with their conformity to corresponding ground truth transcriptions; and upon convergence after a final backpropagation iteration, persisting the modified model parameters learned by using the multi-objective learning criteria with the model to be applied to further end-to-end speech recognition.
2. The method of claim 1, wherein, for each timestep, the model produces a normalized distribution of softmax probabilities over a set of transcription labels, including a blank label.
3. The method of claim 2, wherein the maximum likelihood objective function is a connectionist temporal classification (abbreviated CTC) objective function that maximizes the probability of outputting the correct transcription by: combining individual probabilities of a plurality of candidate output transcriptions to produce an output transcription, wherein an individual probability of a candidate output transcription is determined by selecting a most probable label for each timestep and multiplying softmax probabilities of each of the selected labels; and measuring differences between the output transcription and a ground truth transcription.
4. The method of claim 2, wherein the policy gradient function determines the reward for an output transcription by: independently sampling a transcription label for each timestep and concatenating the transcription labels sampled across the timesteps to produce the output transcription; and measuring differences between the output transcription and a ground truth transcription based on the performance metric.
5. The method of claim 1, wherein the performance metric is word error rate (abbreviated WER).
6. The method of claim 5, wherein the reward is determined based on a reward function that is defined as 1−WER.
7. The method of claim 1, wherein the performance metric is character error rate (abbreviated CER).
8. The method of claim 1, wherein the policy gradient function minimizes a negative reward defined based on the performance metric.
9. The method of claim 1, wherein the policy gradient function is applied using self-critical sequence training (abbreviated SCST).
10. The method of claim 1, wherein relative reliance on the maximum likelihood objective function and the policy gradient function shifts during training, with greater emphasis on the maximum likelihood objective function early in training than late in training.
11. A deep end-to-end speech recognition system, comprising: an input port that receives digital audio samples of a signal comprising speech; and a deep end-to-end speech recognition processor comprising hardware and a stack of layers running on the hardware including convolution layers and recurrent layers, coupled to the input port and configurable to process the digital audio samples, recognize speech from the audio samples, and output transcriptions corresponding to recognized speech; wherein the deep end-to-end speech recognition processor includes parameters trained using a multi-objective learning criteria on training data comprising speech samples temporally labeled with ground truth transcriptions; and wherein the multi-objective learning criteria update the processor parameters over one thousand to millions of backpropagation iterations by combining, at each iteration, a maximum likelihood objective function that modified the processor parameters to maximize a probability of outputting a correct transcription, and a policy gradient function that modified the processor parameters to maximize a positive reward defined based on a non-differentiable performance metric which penalizes incorrect transcriptions in accordance with their conformity to corresponding ground truth transcriptions.
12. The deep end-to-end speech recognition system of claim 11, wherein, for each timestep, the processor produces a normalized distribution of softmax probabilities over a set of transcription labels, including a blank label.
13. The deep end-to-end speech recognition system of claim 12, wherein the maximum likelihood objective function is a connectionist temporal classification (abbreviated CTC) objective function that maximizes the probability of outputting a correct transcription by: combining individual probabilities of a plurality of candidate output transcriptions to produce an output transcription, wherein an individual probability of a candidate output transcription is determined by selecting a most probable label for each timestep and multiplying softmax probabilities of each of the selected labels; and measuring differences between the output transcription and a ground truth transcription.
14. The deep end-to-end speech recognition system of claim 11, wherein the policy gradient function determines the reward for an output transcription by: independently sampling a transcription label for each timestep and concatenating the transcription labels sampled across the timesteps to produce the output transcription; and measuring differences between the output transcription and a ground truth transcription based on the performance metric.
15. The deep end-to-end speech recognition system of claim 11, wherein the performance metric is word error rate (abbreviated WER).
16. The deep end-to-end speech recognition system of claim 15, wherein the reward is determined based on a reward function that is defined as 1−WER.
17. The deep end-to-end speech recognition system of claim 11, wherein the performance metric is character error rate (abbreviated CER).
18. The deep end-to-end speech recognition system of claim 11, wherein the policy gradient function minimizes a negative reward defined based on the performance metric.
19. The deep end-to-end speech recognition system of claim 11, wherein the policy gradient function is applied using self-critical sequence training (abbreviated SCST).
20. The deep end-to-end speech recognition system of claim 11, where relative reliance on the maximum likelihood objective function and the policy gradient function shifts during training, with greater emphasis on the maximum likelihood objective function early in training than late in training.
21. A tangible non-transitory computer readable storage medium impressed with computer program instructions executable by a processor, the instructions, when executed on a processor, implement a method including: using a multi-objective learning criteria to train a deep end-to-end speech recognition model on training data comprising speech samples temporally labeled with ground truth transcriptions, wherein the multi-objective learning criteria updates model parameters of the model over one thousand to millions of backpropagation iterations
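The reward described in claims 4 to 6 and the SCST variant named in claim 9 can be illustrated with a small, dependency-free sketch. It is an assumption-laden illustration rather than the claimed implementation: the `BLANK` symbol, the CTC-style `collapse` step (merge repeats, drop blanks), the toy alphabet, and all function names are hypothetical. The choice of the greedy transcription's reward as the baseline follows the usual definition of self-critical sequence training; the page itself only names SCST.

```python
import math
import random
from typing import List, Sequence

BLANK = "_"  # hypothetical symbol standing in for the blank label


def edit_distance(ref: Sequence[str], hyp: Sequence[str]) -> int:
    """Levenshtein distance between two token sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance over reference length."""
    ref_words = reference.split()
    hyp_words = hypothesis.split()
    if not ref_words:
        return 0.0 if not hyp_words else 1.0
    return edit_distance(ref_words, hyp_words) / len(ref_words)


def reward(reference: str, hypothesis: str) -> float:
    """Reward defined as 1 - WER (claim 6). No clipping is applied here,
    so a hypothesis much longer than the reference can score below zero."""
    return 1.0 - wer(reference, hypothesis)


def collapse(labels: Sequence[str]) -> str:
    """Assumed CTC-style decoding: merge repeated labels, then drop blanks."""
    out, prev = [], None
    for lab in labels:
        if lab != prev and lab != BLANK:
            out.append(lab)
        prev = lab
    return "".join(out)


def sample_labels(probs: List[List[float]], rng: random.Random) -> List[int]:
    """Independently sample one label index per timestep (claim 4)."""
    return [rng.choices(range(len(p)), weights=p)[0] for p in probs]


def greedy_labels(probs: List[List[float]]) -> List[int]:
    """Pick the most probable label index at each timestep."""
    return [max(range(len(p)), key=p.__getitem__) for p in probs]


def scst_loss(probs: List[List[float]], alphabet: Sequence[str],
              reference: str, rng: random.Random) -> float:
    """Policy gradient surrogate loss with a greedy-decoding baseline."""
    sampled = sample_labels(probs, rng)
    greedy = greedy_labels(probs)
    r_sample = reward(reference, collapse([alphabet[i] for i in sampled]))
    r_greedy = reward(reference, collapse([alphabet[i] for i in greedy]))
    log_prob = sum(math.log(p[i]) for p, i in zip(probs, sampled))
    return -(r_sample - r_greedy) * log_prob


if __name__ == "__main__":
    alphabet = [BLANK, "a", "b", " "]  # toy label set, for illustration only
    probs = [[0.1, 0.7, 0.1, 0.1], [0.6, 0.2, 0.1, 0.1],
             [0.1, 0.1, 0.1, 0.7], [0.1, 0.1, 0.7, 0.1]]
    print(scst_loss(probs, alphabet, "a b", random.Random(0)))
```

Swapping `wer` for a character-level edit distance would give a CER-based reward in the sense of claim 7. In an actual model, `probs` would be the per-timestep softmax output, and the loss would be differentiated with respect to the model parameters through `log_prob`.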

Assignees

Salesforce Com Inc

Classifications

G10L25/51
for comparison or discrimination · CPC title
G10L15/14
using statistical models, e.g. Hidden Markov Models [HMMs] (G10L15/18 takes precedence) · CPC title
G10L15/063 (Primary)
Training · CPC title
G06N3/084
Backpropagation, e.g. using gradient descent · CPC title
G10L15/16
using artificial neural networks · CPC title

Patent family

Related publications grouped by family.

View patent family 66245654

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US10573295B2 cover?

The disclosed technology teaches a deep end-to-end speech recognition model, including using multi-objective learning criteria to train a deep end-to-end speech recognition model on training data comprising speech samples temporally labeled with ground truth transcriptions. The multi-objective learning criteria updates model parameters of the model over one thousand to millions of backpropagati…

Who is the assignee on this patent?

Salesforce Com Inc