What technology area does this patent fall under?

Primary CPC classification H04L67/306. Mapped technology areas include Electricity.

When was this patent published?

Publication date Jan 20, 2026 (B1). Legal status and post-grant events are not shown on this page.

What related patents are in patentsdb?

We list 8 related publications on this page (citations in our corpus or others sharing the same primary CPC).

Automatic speech recognition using language model-generated context

US12531056B1 · US · B1

Patent metadata
Field	Value
Publication number	US-12531056-B1
Application number	US-202318541315-A
Country	US
Kind code	B1
Filing date	Dec 15, 2023
Priority date	Dec 15, 2023
Publication date	Jan 20, 2026
Grant date	Jan 20, 2026

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

Title
What the patent document calls the invention.
Abstract
A short plain-language summary of the technical disclosure.
Assignees and inventors
Who owns or filed the patent and who is credited as inventor.
Key dates
Filing, priority, publication, and grant dates set the timeline.
First independent claim
The legal scope of protection — read this for what is actually claimed.
CPC / IPC classifications
Technology tags used to group this patent with similar filings.
Citations and related patents
Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

Techniques for ASR processing using language model (LM)-generated context are described. A LM is prompted to generate words that are relevant for/may be included in a future user input. The prompt to the LM can include words from user interaction history, dialog history, dialog topic, user preferences, etc. The information included in the prompt may focus on rare or unique words rather than words that the ASR model is already confident in recognizing. The techniques can be plugged into an existing/pretrained ASR model and can be used with any existing/pretrained LM, thus saving resources needed to implement and maintain the components.
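The prompt-construction step described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the function names `rare_words` and `build_prompt`, the `COMMON_WORDS` list (a stand-in for words the ASR model already recognizes confidently), and the prompt wording are all hypothetical.

```python
from collections import Counter

# Hypothetical stand-in for words an ASR model already recognizes reliably.
COMMON_WORDS = {"the", "a", "to", "play", "by", "me", "some",
                "and", "on", "what", "is", "in"}


def rare_words(utterances, max_words=10):
    """Return distinct non-common words, least frequent first.

    Ties keep first-seen order because Counter preserves insertion
    order and sorted() is stable.
    """
    counts = Counter()
    for text in utterances:
        for word in text.lower().split():
            word = word.strip(".,!?")
            if word and word not in COMMON_WORDS:
                counts[word] += 1
    ordered = sorted(counts, key=lambda w: counts[w])
    return ordered[:max_words]


def build_prompt(interaction_history, dialog_history, topic):
    """Assemble an LM prompt from the topic and rare words in past inputs."""
    words = rare_words(interaction_history + dialog_history)
    return (
        f"Dialog topic: {topic}\n"
        f"Words from earlier inputs: {', '.join(words)}\n"
        "List words the user is likely to say next."
    )


if __name__ == "__main__":
    print(build_prompt(["Play Kraftwerk"], ["What is on in Reykjavik?"], "music"))
```

The LM's reply (a list of words) would then be embedded and passed to the ASR model alongside the audio embeddings. A real system would likely replace the fixed stop list with a signal derived from the ASR model itself.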

First claim

Opening claim text (preview).

What is claimed is:

1. A computer-implemented method comprising: receiving first audio data representing a first spoken input associated with a dialog session; determining dialog data corresponding to the dialog session, the dialog data representing at least one previous spoken input associated with the dialog session; determining topic data corresponding to the dialog session; generating a first prompt including the dialog data and the topic data, the first prompt being a command to a large language model (LLM) to determine, based on the dialog data and the topic data, context data relevant for transcribing the first spoken input; processing, using the LLM, the first prompt to determine the context data to include a first plurality of words, the first plurality of words being relevant for transcribing the first spoken input; determining first embedding data corresponding to the context data; determining second embedding data corresponding to the first audio data; and processing, using an automatic speech recognition (ASR) model, the first embedding data and the second embedding data to determine first ASR data representing a first transcription of the first spoken input.

2. The computer-implemented method of claim 1, further comprising: determining a second plurality of words included in the dialog data; determining a user profile corresponding to the first audio data; receiving interaction history data representing past user inputs, the interaction history data being associated with the user profile; determining a third plurality of words included in the interaction history data; and determining a fourth plurality of words that are unique between the second plurality of words and the third plurality of words, wherein the first prompt includes the fourth plurality of words instead of the dialog data.

3. The computer-implemented method of claim 1, further comprising: determining user profile data associated with the first audio data, the user profile data representing personalized words; determining third embedding data corresponding to the user profile data; processing the first embedding data, the second embedding data, and the third embedding data using an attention component configured to apply attention to a first portion of the first audio data based on the context data or the user profile data; and determining the first ASR data based at least in part on the processing using the attention component.

4. The computer-implemented method of claim 1, further comprising: determining a user profile associated with the first audio data; receiving personalized data associated with the user profile; receiving a second plurality of words corresponding to a domain; and determining a third plurality of words unique among the dialog data, the personalized data and the first plurality of words, wherein the first prompt includes the second plurality of words, wherein the first plurality of words includes at least a first word of the third plurality of words.

5. A computer-implemented method comprising: receiving first audio data representing a first spoken input; determining first data including a first plurality of words; generating a first prompt including the first data, the first prompt being a command to a language model (LM) to determine context data, based on the first data, relevant for transcribing the first audio data; processing, using the LM, the first prompt to determine second data including a second plurality of words; determining first embedding data corresponding to the second data; determining second embedding data corresponding to the first audio data; and processing the first embedding data and the second embedding data to determine a transcript of the first spoken input.

6. The computer-implemented method of claim 5, further comprising: receiving interaction data representing a plurality of past user inputs; determining a third plurality of words included in the interaction data, the third plurality of words being nouns; and determining the first data to include the third plurality of words, wherein the second data includes at least a first word of the third plurality of words.

7. The computer-implemented method of claim 5, further comprising: determining the first audio data is associated with a dialog session; determining dialog data associated with the dialog session, the dialog data representing at least one previous user input; and determining, using the dialog data, the first plurality of words that are unique.

8. The computer-implemented method of claim 5, wherein the first audio data is associated with a dialog session, and the method further comprises: determining dialog data associated with dialog session, the dialog data representing at least one previous user input; determining, using the dialog data, a topic corresponding to the dialog session; and determining the first data to include the topic, wherein the second plurality of words includes words related to the topic.

9. The computer-implemented method of claim 5, further comprising: processing the second data to determine a third plurality of words that are unique words of the second plurality of words, wherein determining the first embedding data comprises processing, using an encoder, the third plurality of words to determine the first embedding data.

10. The computer-implemented method of claim 5, further comprising: processing the first embedding data and the second embedding data using an attention component configured to apply attention to a portion of the first audio data based on at least one of the second plurality of words; and determining the transcript based on the processing using the attention component.

11. The computer-implemented method of claim 5, further comprising: determining user profile data representing personalized words, the user profile data associated with the first audio data; determining third embedding data corresponding to the user profile data; and processing the first embedding data, the second embedding data, and the third embedding data to determine the transcript.

12. The computer-implemented method of claim 5, further comprising: receiving second audio data representing a second spoken input; determining third data including a third plurality of words; generating a second prompt including the third data, the second prompt being a command to the LM to determine context data, based on the third data, relevant for transcribing the second audio data; processing, using the LM, the second prompt to determine third embedding data corresponding to a third plurality of words, the third embedding data being generated by an intermediate layer of the LM; determining fourth embedding data corresponding to the second audio data; and processing the third embedding data and the fourth embedding data to determine a transcript of the second spoken input.

13. A system comprising: at least one processor; and at least one memory including instructions that, when executed by the at least one processor, cause the system to: receive first audio data representing a first spoken input; determine first data including a first plurality of words; generate a first prompt including the first data, the first prompt being a command to a language model (LM) to determine context data, based on the first data, relevant for transcribing the first audio data; process, using the LM, the first prompt to determine second data including a second plurality of words; determine first embedding data corresponding to the second data; deter
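The claim preview above is cut off, but the attention component recited in claims 3 and 10 can still be illustrated with a toy example. The sketch below uses scaled dot-product attention, a common technique swapped in here for illustration; the claims do not say which attention mechanism is used or which side acts as the query. Treating audio frames as queries and context-word embeddings as keys and values is an assumption, as are the function names and the made-up 2-dimensional vectors.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def attend(audio_frames, context_vectors):
    """Mix context-word vectors into each audio frame.

    Each audio frame is a query; the context vectors serve as both
    keys and values (an assumption made for this sketch).
    Returns (attended_vectors, attention_weights).
    """
    dim = len(context_vectors[0])
    scale = math.sqrt(dim)
    attended, weights = [], []
    for frame in audio_frames:
        w = softmax([dot(frame, c) / scale for c in context_vectors])
        mixed = [sum(wi * c[d] for wi, c in zip(w, context_vectors))
                 for d in range(dim)]
        weights.append(w)
        attended.append(mixed)
    return attended, weights


# Toy embeddings (made-up numbers, for illustration only).
context_vectors = [[1.0, 0.0], [0.0, 1.0]]            # two LM-suggested words
audio_frames = [[4.0, 0.0], [0.0, 4.0], [1.0, 1.0]]   # three audio frames
attended, weights = attend(audio_frames, context_vectors)
```

In this toy setup the first frame puts most of its weight on the first context word, the second frame on the second, and the third frame splits its weight evenly. A downstream decoder would consume the attended vectors together with the audio embeddings to produce the transcript.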

Assignees

Amazon Tech Inc

Inventors

Classifications

G10L15/063
Training · CPC title
G10L15/22
Procedures used during a speech recognition process, e.g. man-machine dialogue · CPC title
G10L2015/223
Execution procedure of a spoken command · CPC title
H04L67/306 (Primary)
User profiles · CPC title
G10L15/16 (Primary)
using artificial neural networks · CPC title

Patent family

Related publications grouped by family.

View patent family 98434112

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US12531056B1 cover?: Techniques for ASR processing using language model (LM)-generated context are described. A LM is prompted to generate words that are relevant for/may be included in a future user input. The prompt to the LM can include words from user interaction history, dialog history, dialog topic, user preferences, etc. The information included in the prompt may focus on rare or unique words rather than wor…
Who is the assignee on this patent?: Amazon Tech Inc