Who is the assignee on this patent?

Microsoft Technology Licensing LLC

What technology area does this patent fall under?

Primary CPC classification G10L15/16. Mapped technology areas include Physics.

When was this patent published?

Publication date Dec 13, 2022 (kind code B2). Legal status and post-grant events are not shown on this page.

What related patents are in patentsdb?

We list 5 related publications on this page (citations in our corpus or others sharing the same primary CPC).

Internal language model for E2E models

US11527238B2 · US · B2

Patent metadata
Field	Value
Publication number	US-11527238-B2
Application number	US-202117154956-A
Country	US
Kind code	B2
Filing date	Jan 21, 2021
Priority date	Oct 30, 2020
Publication date	Dec 13, 2022
Grant date	Dec 13, 2022

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

Title
What the patent document calls the invention.
Abstract
A short plain-language summary of the technical disclosure.
Assignees and inventors
Who owns or filed the patent and who is credited as inventor.
Key dates
Filing, priority, publication, and grant dates set the timeline.
First independent claim
The legal scope of protection — read this for what is actually claimed.
CPC / IPC classifications
Technology tags used to group this patent with similar filings.
Citations and related patents
Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

A computer device is provided that includes one or more processors configured to receive an end-to-end (E2E) model that has been trained for automatic speech recognition with training data from a source-domain, and receive an external language model that has been trained with training data from a target-domain. The one or more processors are configured to perform an inference of the probability of an output token sequence given a sequence of input speech features. Performing the inference includes computing an E2E model score, computing an external language model score, and computing an estimated internal language model score for the E2E model. The estimated internal language model score is computed by removing a contribution of an intrinsic acoustic model. The processor is further configured to compute an integrated score based at least on E2E model score, the external language model score, and the estimated internal language model score.
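The abstract says only that the integrated score is "based at least on" the three scores. Dependent claims 5 and 16 on this page describe one form: subtract the estimated internal language model score from a log-linear combination of the E2E model score and the external language model score. A minimal sketch of that arithmetic follows. The interpolation weights, the candidate strings, and every log-probability value are hypothetical and chosen only for illustration; nothing here is taken from the patent's description.

```python
def integrated_score(e2e_logp, ext_lm_logp, ilm_logp, ext_weight=0.5, ilm_weight=0.3):
    """Combine three log-domain scores for one candidate token sequence.

    e2e_logp    : E2E model score (log-probability given the speech features)
    ext_lm_logp : external (target-domain) language model score
    ilm_logp    : estimated internal language model score of the E2E model
    ext_weight and ilm_weight are assumed tuning weights, not values from the patent.
    """
    log_linear = e2e_logp + ext_weight * ext_lm_logp
    return log_linear - ilm_weight * ilm_logp


def best_candidate(candidates, ext_weight=0.5, ilm_weight=0.3):
    """Return the text of the candidate with the highest integrated score."""
    best = max(
        candidates,
        key=lambda c: integrated_score(c[1], c[2], c[3], ext_weight, ilm_weight),
    )
    return best[0]


# Hypothetical candidates: (text, E2E score, external LM score, internal LM score)
CANDIDATES = [
    ("play the song", -4.0, -6.0, -3.0),
    ("pay the song", -3.8, -9.0, -3.5),
]

if __name__ == "__main__":
    for text, e2e, ext, ilm in CANDIDATES:
        print(text, round(integrated_score(e2e, ext, ilm), 3))
    print("selected:", best_candidate(CANDIDATES))
```

With these made-up numbers the E2E score alone would favor the second candidate, while the integrated score favors the first, which is the kind of re-ranking the combination is meant to allow.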

First claim

Opening claim text (preview).

The invention claimed is:
1. A computer device comprising: one or more processors configured to: receive an end-to-end (E2E) model that has been trained for automatic speech recognition with training data from a source domain; receive an external language model that has been trained with training data from a target domain; perform an inference of the probability of an output token sequence of tokenized text represented by one or more embedding vectors, given a sequence of input speech features by: computing an E2E model score for one or more candidate output token sequences based on the sequence of input speech features using the E2E model; computing an external language model score for the one or more candidate output token sequences using the external language model; computing an estimated internal language model score for the one or more candidate output token sequences for the E2E model, wherein the E2E model encodes an intrinsic language model and an intrinsic acoustic model, and wherein the estimated internal language model score is computed by removing a contribution of the intrinsic acoustic model; and computing an integrated score for the one or more candidate output token sequences based at least on the E2E model score, the external language model score, and the estimated internal language model score.
2. The computer device of claim 1, wherein the E2E model has been trained to minimize a standard E2E model loss.
3. The computer device of claim 1, wherein the E2E model has been trained to minimize a weighted combination of an internal language model loss and a standard E2E model loss.
4. The computer device of claim 3, wherein the internal language model loss is determined based on summing negative log probabilities of the intrinsic language model over a training corpus.
5. The computer device of claim 1, wherein the integrated score for one or more candidate output token sequence is computed by subtracting the estimated internal language model score from a log-linear combination of the E2E model score and the external language model score.
6. The computer device of claim 1, wherein the one or more processors are further configured to: receive a speech input associated with the target domain via an input device; and evaluate a set of input data from the target domain using the trained E2E model implementing language model integration with the trained external language model for the target domain.
7. The computer device of claim 1, wherein the E2E model is a recurrent neural network transducer (RNN-T) model, wherein the RNN-T model includes an encoder, a prediction network, and a joint network that combines an output of the encoder and an output of the prediction network via a feed-forward network, and wherein the estimated internal language model score is computed by removing a contribution of the encoder of the RNN-T model to the feed-forward network.
8. The computer device of claim 1, wherein the E2E model is an attention-based encoder-decoder (AED) model, wherein the AED model includes an encoder that maps sequence of input speech features into a sequence of hidden representations, an attention network that generates attention weights for encoded features in the sequence of hidden representations, a context vector that is computed as a linear combination of the sequence of hidden representations weighted by the attention weights, and a decoder that takes the context vector and a token sequence as input, and wherein the estimated internal language model score is computed by removing a contribution of the encoder to the decoder of the AED model.
9. The computer device of claim 1, wherein the E2E model is trained with training data that includes audio-transcript pairs.
10. The computer device of claim 1, wherein the external language model is trained with training data that includes text data.
11. The computer device of claim 1, wherein the integrated score is estimated at each step of a beam search inference algorithm.
12. A method comprising: at one or more processors of a computer device: receiving an end-to-end (E2E) model that has been trained for automatic speech recognition with training data from a source-domain; receiving an external language model that has been trained with training data from a target-domain; performing an inference of the probability of an output token sequence of tokenized text represented by one or more embedding vectors, given a sequence of input speech features by: computing an E2E model score for one or more candidate output token sequences based on the sequence of input speech features using the E2E model; computing an external language model score for the one or more candidate output token sequences using the external language model; computing an estimated internal language model score for the one or more candidate output token sequences for the E2E model, wherein the E2E model encodes an intrinsic language model and an intrinsic acoustic model, and wherein the estimated internal language model score is computed by removing a contribution of the intrinsic acoustic model; and computing an integrated score for the one or more candidate output token sequences based at least on the E2E model score, the external language model score, and the estimated internal language model score.
13. The method of claim 12, wherein the E2E model has been trained to minimize a standard E2E model loss.
14. The method of claim 12, wherein the E2E model has been trained to minimize a weighted combination of an internal language model loss and a standard E2E model loss.
15. The method of claim 14, wherein the internal language model loss is determined based on summing negative log probabilities of the intrinsic language model over a training corpus.
16. The method of claim 12, wherein the integrated score for one or more candidate output token sequences is computed by subtracting the estimated internal language model score from a log-linear combination of the E2E model score and the external language model score.
17. The method of claim 12, wherein the E2E model is a recurrent neural network transducer (RNN-T) model that includes an encoder, a prediction network, and a joint network that combines an output of the encoder and an output of the prediction network via a feed-forward network, and wherein the method includes computing the estimated internal language model score by removing a contribution of the encoder of the RNN-T model to the feed-forward network.
18. The method of claim 12, wherein the E2E model is an attention-based encoder-decoder (AED) model that includes an encoder that maps sequence of input speech features into a sequence of hidden representations, an attention network that generates attention weights for encoded features in the sequence of hidden representations, a context vector that is computed as a linear combination of the sequence of hidden representations weighted by the attention weights, and a decoder that takes the context vector and a token sequence as input, and wherein the method includes computing the estimated internal language model score by removing a contribution of the encoder to the decoder of the AED model.
19. The method of claim 12, wherein the integrated score is estimated at each step of a beam search inference algorithm.
20. A server system comprising: one or more processors configured to: receive an end-to-end (E2E) model that has been trained for automatic speech recognition with training data from a s
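Claims 7 and 17 describe estimating the internal language model score of an RNN-T model by removing the encoder's contribution to the feed-forward joint network. The toy sketch below shows one way that could look. It assumes an additive joint network with a tanh nonlinearity and a softmax output; that functional form, the function names, the weight matrices, and the input vectors are all hypothetical and are not taken from the claim text.

```python
import math


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def log_softmax(logits):
    """Numerically stable log-softmax over a list of logits."""
    peak = max(logits)
    log_norm = peak + math.log(sum(math.exp(z - peak) for z in logits))
    return [z - log_norm for z in logits]


def joint_log_probs(w_enc, w_pred, w_out, pred_out, enc_out=None):
    """Token log-probabilities from a toy feed-forward joint network.

    With enc_out given, encoder and prediction-network outputs are combined.
    With enc_out=None, the encoder term is dropped, so the output depends only
    on the token history; this plays the role of the internal LM estimate.
    """
    hidden = matvec(w_pred, pred_out)
    if enc_out is not None:
        hidden = [h + e for h, e in zip(hidden, matvec(w_enc, enc_out))]
    hidden = [math.tanh(h) for h in hidden]
    return log_softmax(matvec(w_out, hidden))


# Hypothetical toy parameters: 2-dim inputs, 2 hidden units, 3 output tokens.
W_ENC = [[0.5, -0.2], [0.1, 0.3]]
W_PRED = [[0.4, 0.0], [-0.3, 0.6]]
W_OUT = [[1.0, -1.0], [0.2, 0.5], [-0.7, 0.3]]
PRED_OUT = [1.0, 0.5]   # prediction-network output for the current token history
ENC_OUT = [0.8, -0.4]   # encoder output for the current speech frame

full_logp = joint_log_probs(W_ENC, W_PRED, W_OUT, PRED_OUT, ENC_OUT)
ilm_logp = joint_log_probs(W_ENC, W_PRED, W_OUT, PRED_OUT)

if __name__ == "__main__":
    print("E2E token log-probs:        ", [round(v, 3) for v in full_logp])
    print("internal LM token log-probs:", [round(v, 3) for v in ilm_logp])
```

In this sketch, dropping the encoder term gives the same result as feeding an all-zero encoder output, so the internal LM estimate reuses the trained joint network weights and needs no extra parameters.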

Assignees

Microsoft Technology Licensing LLC

Inventors

Classifications

G06N3/045
Combinations of networks · CPC title
G06N3/044
Recurrent networks, e.g. Hopfield networks · CPC title
G06N3/049
Temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs · CPC title
G10L15/183
using context dependencies, e.g. language models · CPC title
G10L15/063
Training · CPC title

Patent family

Related publications grouped by family.

View patent family 81380367

External sources

Related patents

System and method for a multi-primary wide gamut color system

Leveraging unpaired text data for training end-to-end spoken language understanding systems

Method and system for training neural sequence-to-sequence models by incorporating global features

End-to-end speech recognition with policy learning

Training sequence generation neural networks using quality scores
