Internal language model for E2E models

US11527238B2 · US · B2

Patent metadata
Publication number: US-11527238-B2
Application number: US-202117154956-A
Country: US
Kind code: B2
Filing date: Jan 21, 2021
Priority date: Oct 30, 2020
Publication date: Dec 13, 2022
Grant date: Dec 13, 2022

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

A computer device is provided that includes one or more processors configured to receive an end-to-end (E2E) model that has been trained for automatic speech recognition with training data from a source-domain, and receive an external language model that has been trained with training data from a target-domain. The one or more processors are configured to perform an inference of the probability of an output token sequence given a sequence of input speech features. Performing the inference includes computing an E2E model score, computing an external language model score, and computing an estimated internal language model score for the E2E model. The estimated internal language model score is computed by removing a contribution of an intrinsic acoustic model. The processor is further configured to compute an integrated score based at least on E2E model score, the external language model score, and the estimated internal language model score.
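The score combination described in the abstract can be illustrated with a minimal sketch. It assumes the form given in claim 5 (the estimated internal language model score is subtracted from a log-linear combination of the E2E and external language model scores). The function name, the weights `ext_weight` and `ilm_weight`, and all probabilities below are hypothetical and chosen only for illustration; the patent text shown here does not specify them.

```python
import math


def integrated_score(e2e_logp, ext_lm_logp, ilm_logp, ext_weight=0.5, ilm_weight=0.3):
    """Combine three log-domain scores for one candidate token sequence.

    Assumed form: log P_E2E + ext_weight * log P_ext - ilm_weight * log P_ILM.
    The weights are hypothetical tuning parameters, not values from the patent.
    """
    return e2e_logp + ext_weight * ext_lm_logp - ilm_weight * ilm_logp


# Made-up scores for two candidate transcriptions of the same audio.
candidates = {
    "play the song": {"e2e": math.log(0.40), "ext": math.log(0.05), "ilm": math.log(0.30)},
    "pay the song": {"e2e": math.log(0.45), "ext": math.log(0.001), "ilm": math.log(0.35)},
}

scores = {
    text: integrated_score(s["e2e"], s["ext"], s["ilm"])
    for text, s in candidates.items()
}

# Pick the candidate with the highest integrated score.
best = max(scores, key=scores.get)
print(best)
```

In this toy example the E2E model alone slightly prefers "pay the song", but the target-domain external language model, combined with the subtraction of the source-domain internal language model score, shifts the ranking toward "play the song".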

First claim

Opening claim text (preview).

The invention claimed is:

1. A computer device comprising: one or more processors configured to: receive an end-to-end (E2E) model that has been trained for automatic speech recognition with training data from a source domain; receive an external language model that has been trained with training data from a target domain; perform an inference of the probability of an output token sequence of tokenized text represented by one or more embedding vectors, given a sequence of input speech features by: computing an E2E model score for one or more candidate output token sequences based on the sequence of input speech features using the E2E model; computing an external language model score for the one or more candidate output token sequences using the external language model; computing an estimated internal language model score for the one or more candidate output token sequences for the E2E model, wherein the E2E model encodes an intrinsic language model and an intrinsic acoustic model, and wherein the estimated internal language model score is computed by removing a contribution of the intrinsic acoustic model; and computing an integrated score for the one or more candidate output token sequences based at least on the E2E model score, the external language model score, and the estimated internal language model score.

2. The computer device of claim 1, wherein the E2E model has been trained to minimize a standard E2E model loss.

3. The computer device of claim 1, wherein the E2E model has been trained to minimize a weighted combination of an internal language model loss and a standard E2E model loss.

4. The computer device of claim 3, wherein the internal language model loss is determined based on summing negative log probabilities of the intrinsic language model over a training corpus.

5. The computer device of claim 1, wherein the integrated score for one or more candidate output token sequence is computed by subtracting the estimated internal language model score from a log-linear combination of the E2E model score and the external language model score.

6. The computer device of claim 1, wherein the one or more processors are further configured to: receive a speech input associated with the target domain via an input device; and evaluate a set of input data from the target domain using the trained E2E model implementing language model integration with the trained external language model for the target domain.

7. The computer device of claim 1, wherein the E2E model is a recurrent neural network transducer (RNN-T) model, wherein the RNN-T model includes an encoder, a prediction network, and a joint network that combines an output of the encoder and an output of the prediction network via a feed-forward network, and wherein the estimated internal language model score is computed by removing a contribution of the encoder of the RNN-T model to the feed-forward network.

8. The computer device of claim 1, wherein the E2E model is an attention-based encoder-decoder (AED) model, wherein the AED model includes an encoder that maps sequence of input speech features into a sequence of hidden representations, an attention network that generates attention weights for encoded features in the sequence of hidden representations, a context vector that is computed as a linear combination of the sequence of hidden representations weighted by the attention weights, and a decoder that takes the context vector and a token sequence as input, and wherein the estimated internal language model score is computed by removing a contribution of the encoder to the decoder of the AED model.

9. The computer device of claim 1, wherein the E2E model is trained with training data that includes audio-transcript pairs.

10. The computer device of claim 1, wherein the external language model is trained with training data that includes text data.

11. The computer device of claim 1, wherein the integrated score is estimated at each step of a beam search inference algorithm.

12. A method comprising: at one or more processors of a computer device: receiving an end-to-end (E2E) model that has been trained for automatic speech recognition with training data from a source-domain; receiving an external language model that has been trained with training data from a target-domain; performing an inference of the probability of an output token sequence of tokenized text represented by one or more embedding vectors, given a sequence of input speech features by: computing an E2E model score for one or more candidate output token sequences based on the sequence of input speech features using the E2E model; computing an external language model score for the one or more candidate output token sequences using the external language model; computing an estimated internal language model score for the one or more candidate output token sequences for the E2E model, wherein the E2E model encodes an intrinsic language model and an intrinsic acoustic model, and wherein the estimated internal language model score is computed by removing a contribution of the intrinsic acoustic model; and computing an integrated score for the one or more candidate output token sequences based at least on the E2E model score, the external language model score, and the estimated internal language model score.

13. The method of claim 12, wherein the E2E model has been trained to minimize a standard E2E model loss.

14. The method of claim 12, wherein the E2E model has been trained to minimize a weighted combination of an internal language model loss and a standard E2E model loss.

15. The method of claim 14, wherein the internal language model loss is determined based on summing negative log probabilities of the intrinsic language model over a training corpus.

16. The method of claim 12, wherein the integrated score for one or more candidate output token sequences is computed by subtracting the estimated internal language model score from a log-linear combination of the E2E model score and the external language model score.

17. The method of claim 12, wherein the E2E model is a recurrent neural network transducer (RNN-T) model that includes an encoder, a prediction network, and a joint network that combines an output of the encoder and an output of the prediction network via a feed-forward network, and wherein the method includes computing the estimated internal language model score by removing a contribution of the encoder of the RNN-T model to the feed-forward network.

18. The method of claim 12, wherein the E2E model is an attention-based encoder-decoder (AED) model that includes an encoder that maps sequence of input speech features into a sequence of hidden representations, an attention network that generates attention weights for encoded features in the sequence of hidden representations, a context vector that is computed as a linear combination of the sequence of hidden representations weighted by the attention weights, and a decoder that takes the context vector and a token sequence as input, and wherein the method includes computing the estimated internal language model score by removing a contribution of the encoder to the decoder of the AED model.

19. The method of claim 12, wherein the integrated score is estimated at each step of a beam search inference algorithm.

20. A server system comprising: one or more processors configured to: receive an end-to-end (E2E) model that has been trained for automatic speech recognition with training data from a s
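Claims 7 and 17 describe estimating the internal language model score of an RNN-T by removing the encoder's contribution to the joint network. The following is a rough sketch of that idea on a toy joint network, not the patented implementation. The additive-then-tanh joint, the matrix `W_out`, and all numbers are assumptions made for illustration, and special handling of tokens such as blank is left out.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]


def matvec(W, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]


def joint_logits(W_out, pred_out, enc_out=None):
    """Toy joint network: output layer applied to tanh(enc_out + pred_out).

    Passing enc_out=None drops the encoder term entirely, so the logits
    depend only on the prediction network output (the label history).
    """
    if enc_out is None:
        hidden = [math.tanh(p) for p in pred_out]
    else:
        hidden = [math.tanh(e + p) for e, p in zip(enc_out, pred_out)]
    return matvec(W_out, hidden)


# Hypothetical sizes: vocabulary of 3 tokens, hidden dimension of 2.
W_out = [[0.5, -0.2], [0.1, 0.3], [-0.4, 0.2]]
pred_out = [0.2, -0.1]  # prediction network output for the current history
enc_out = [1.0, 0.5]    # encoder output for the current acoustic frame

# Full E2E token log-probabilities (acoustics plus label history).
full_log_probs = log_softmax(joint_logits(W_out, pred_out, enc_out))

# Estimated internal LM token log-probabilities (label history only).
ilm_log_probs = log_softmax(joint_logits(W_out, pred_out))

print(full_log_probs)
print(ilm_log_probs)
```

Summing the internal LM log-probabilities of the chosen tokens over a candidate sequence would give a sequence-level internal language model score of the kind that the integrated score subtracts.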

Assignees

Inventors

Classifications

  • Combinations of networks · CPC title

  • Recurrent networks, e.g. Hopfield networks · CPC title

  • Temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs · CPC title

  • using context dependencies, e.g. language models · CPC title

  • Training · CPC title

Patent family

Related publications grouped by family.

External sources

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US11527238B2 cover?
A computer device is provided that includes one or more processors configured to receive an end-to-end (E2E) model that has been trained for automatic speech recognition with training data from a source-domain, and receive an external language model that has been trained with training data from a target-domain. The one or more processors are configured to perform an inference of the probability…
Who is the assignee on this patent?
Microsoft Technology Licensing LLC
What technology area does this patent fall under?
Primary CPC classification G10L15/16. Mapped technology areas include Physics.
When was this patent published?
Publication date Dec 13, 2022 (B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 5 related publications on this page (citations in our corpus or others sharing the same primary CPC).