Minimum word error rate training for attention-based sequence-to-sequence models
US-2020043483-A1 · Feb 6, 2020 · US
US11636848B2 · US · B2
| Field | Value |
|---|---|
| Publication number | US-11636848-B2 |
| Application number | US-202117316856-A |
| Country | US |
| Kind code | B2 |
| Filing date | May 11, 2021 |
| Priority date | Feb 14, 2019 |
| Publication date | Apr 25, 2023 |
| Grant date | Apr 25, 2023 |
Official abstract text for this publication.
A method of attention-based end-to-end (A-E2E) automatic speech recognition (ASR) training, includes performing cross-entropy training of a model, based on one or more input features of a speech signal, determining a posterior probability vector at a time of a first wrong token among one or more output tokens of the model of which the cross-entropy training is performed, and determining a loss of the first wrong token at the time, based on the determined posterior probability vector. The method further includes determining a total loss of a training set of the model of which the cross-entropy training is performed, based on the determined loss of the first wrong token, and updating the model of which the cross-entropy training is performed, based on the determined total loss of the training set.
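The loss described in the abstract can be illustrated with a minimal Python sketch. It is not the patent's implementation: the function names, the use of integer token ids, and the representation of posteriors as plain lists of probabilities are all assumptions made for illustration. The sketch also assumes the hypothesis and reference are compared only over their overlapping positions, and that a pair with no mismatch contributes zero loss. It uses the negative log posterior of the reference token at the time of the first wrong token, and sums that loss over hypothesis-reference pairs.

```python
import math
from typing import List, Optional, Sequence, Tuple

def first_wrong_time(hyp: Sequence[int], ref: Sequence[int]) -> Optional[int]:
    """Return the first index where hypothesis and reference disagree, else None.

    Assumption: only the overlapping positions of the two sequences are compared.
    """
    for t, (y, r) in enumerate(zip(hyp, ref)):
        if y != r:
            return t
    return None

def first_wrong_token_loss(
    posteriors: Sequence[Sequence[float]],
    hyp: Sequence[int],
    ref: Sequence[int],
) -> float:
    """Negative log posterior of the reference token at the first wrong token.

    posteriors[t][k] is the (hypothetical) decoder probability of token k at step t.
    """
    t = first_wrong_time(hyp, ref)
    if t is None:
        return 0.0  # assumption: a fully correct hypothesis adds no loss
    return -math.log(posteriors[t][ref[t]])

Example = Tuple[List[List[float]], List[int], List[int]]

def total_loss(batch: Sequence[Example]) -> float:
    """Sum the first-wrong-token loss over all (posteriors, hyp, ref) triples."""
    return sum(first_wrong_token_loss(p, y, r) for p, y, r in batch)
```

In a real training loop the posteriors would come from the model's decoder and the loss would be differentiated with respect to the model parameters; this sketch only shows which time step and which probability enter the sum.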
Opening claim text (preview).
What is claimed is:

1. A method of attention-based end-to-end (A-E2E) automatic speech recognition (ASR) training, the method comprising: performing cross-entropy training of a model, based on one or more input features of a speech signal; determining a posterior probability vector at a time of a first wrong token among one or more output tokens of the model of which the cross-entropy training is performed; determining a loss of the first wrong token at the time, based on the determined posterior probability vector; determining a total loss of a training set of the model of which the cross-entropy training is performed, based on $L(\theta)_{\mathrm{TWT}} = \sum_{(y,r)\in(Y,R)} l_\theta(y_{t_\omega}, r_{t_\omega})$, where $L(\theta)$ denotes the total loss of the training set, $(Y,R)$ denotes hypothesis-reference pairs in the training set, $t_\omega$ denotes the time, $y_{t_\omega}$ denotes the first wrong token at the time, $r_{t_\omega}$ denotes a reference token at the time, and $l_\theta(y_{t_\omega}, r_{t_\omega})$ denotes the loss of the first wrong token; and updating the model of which the cross-entropy training is performed, based on the determined total loss of the training set.

2. The method of claim 1, wherein the posterior probability vector at the time is determined as follows: $p_t = \mathrm{Decoder}(s_{t-1} \in \{r_{t-1}, y_{t-1}\}, H^{\mathrm{enc}})$, where $t$ denotes the time, $p_t$ denotes the posterior probability vector at the time $t$, $H^{\mathrm{enc}}$ denotes the one or more features that are encoded, $y_{t-1}$ denotes an output token at a previous time $t-1$, $r_{t-1}$ denotes a reference token at the previous time $t-1$, and $s_{t-1}$ denotes a token randomly selected from $\{r_{t-1}, y_{t-1}\}$.

3. The method of claim 1, wherein the loss of the first wrong token is determined as follows: $l_\theta(y_{t_\omega}, r_{t_\omega}) = -\log p_{t_\omega, r_{t_\omega}}$, where $p_{t_\omega, r_{t_\omega}}$ denotes a posterior probability of the reference token at the time.

4. The method of claim 1, wherein the loss of the first wrong token is determined as follows: $l_\theta(y_{t_\omega}, r_{t_\omega}) = -\log p_{t_\omega, r_{t_\omega}} + \log p_{t_\omega, y_{t_\omega}}$, where $p_{t_\omega, r_{t_\omega}}$ denotes a posterior probability of the reference token at the time, and $p_{t_\omega, y_{t_\omega}}$ denotes a posterior probability of the first wrong token at the time.

5. The method of claim 1, further comprising selecting a hypothesis with a longest correct prefix, from a plurality of hypotheses of the model of which the cross-entropy training is performed, wherein the determining posterior probability vector at the time comprises determining the posterior probability vector at the time of the first wrong token included in the selected hypothesis.

6. The method of claim 5, wherein the total loss of the training set is determined as follows: $L(\theta)_{\mathrm{TWTiB}} = \sum_{(y,r)\in(Y,R)} l_\theta(y^{j_l}_{t_{j_l,\omega}}, r_{t_{j_l,\omega}})$, where $L(\theta)$ denotes the total loss of the training set, $(Y,R)$ denotes hypothesis-reference pairs in the training set, $t_{j_l,\omega}$ denotes the time, $y^{j_l}_{t_{j_l,\omega}}$ denotes the first wrong token at the time, $r_{t_{j_l,\omega}}$ denotes a reference token at the time, and $l_\theta(y^{j_l}_{t_{j_l,\omega}}, r_{t_{j_l,\omega}})$ denotes the loss of the first wrong token.

7. An apparatus for attention-based end-to-end (A-E2E) automatic speech recognition (ASR) training, the apparatus comprising: at least one memory configured to store program code; and at least one processor configured to read the program code and operate as instructed by the program cod
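Claims 4 and 5 can be read together as a two-step recipe: pick the hypothesis with the longest correct prefix, then score its first wrong token against the reference. The Python sketch below is one possible reading under stated assumptions, not the claimed implementation. All function names are hypothetical, tokens are integer ids, the posterior vector at the relevant time is supplied by the caller rather than produced by a decoder, and ties between hypotheses are broken in favor of the earliest one.

```python
import math
from typing import Sequence

def correct_prefix_len(hyp: Sequence[int], ref: Sequence[int]) -> int:
    """Number of leading tokens on which the hypothesis agrees with the reference."""
    n = 0
    for y, r in zip(hyp, ref):
        if y != r:
            break
        n += 1
    return n

def select_longest_prefix(hyps: Sequence[Sequence[int]], ref: Sequence[int]) -> int:
    """Index of the hypothesis with the longest correct prefix.

    Assumption: ties go to the earliest hypothesis (max returns the first maximum).
    """
    return max(range(len(hyps)), key=lambda j: correct_prefix_len(hyps[j], ref))

def margin_loss(posterior_t: Sequence[float], wrong_token: int, ref_token: int) -> float:
    """Claim 4 style loss: -log p(reference token) + log p(first wrong token)."""
    return -math.log(posterior_t[ref_token]) + math.log(posterior_t[wrong_token])
```

For example, with hypotheses `[[0, 2, 1], [0, 1, 2], [1, 1, 1]]` and reference `[0, 1, 1]`, the second hypothesis is selected because it matches the reference on two leading tokens, and its first wrong token sits at index 2. The loss at that step shrinks as the model moves probability from the wrong token to the reference token.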
characterised by memory or gating, e.g. long short-term memory [LSTM] or gated recurrent units [GRU] · CPC title
Auto-encoder networks; Encoder-decoder networks · CPC title
Supervised learning · CPC title
Recurrent networks, e.g. Hopfield networks · CPC title
updating or merging of old and new templates; Mean values; Weighting · CPC title