US12531056B1 · US · B1
| Field | Value |
|---|---|
| Publication number | US-12531056-B1 |
| Application number | US-202318541315-A |
| Country | US |
| Kind code | B1 |
| Filing date | Dec 15, 2023 |
| Priority date | Dec 15, 2023 |
| Publication date | Jan 20, 2026 |
| Grant date | Jan 20, 2026 |
Official abstract text for this publication.
Techniques for ASR processing using language model (LM)-generated context are described. A LM is prompted to generate words that are relevant for/may be included in a future user input. The prompt to the LM can include words from user interaction history, dialog history, dialog topic, user preferences, etc. The information included in the prompt may focus on rare or unique words rather than words that the ASR model is already confident in recognizing. The techniques can be plugged into an existing/pretrained ASR model and can be used with any existing/pretrained LM, thus saving resources needed to implement and maintain the components.
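The prompt-construction idea in the abstract can be illustrated with a minimal Python sketch. Everything here is an assumption made for illustration and is not taken from the patent: the `COMMON_WORDS` set stands in for vocabulary an ASR model already recognizes confidently, and the function names `rare_words` and `build_prompt` and the prompt wording are hypothetical.

```python
import re

# Hypothetical stand-in for words the ASR model already recognizes confidently.
COMMON_WORDS = frozenset({
    "the", "a", "to", "play", "me", "some", "by", "and",
    "i", "want", "hear", "please",
})

def rare_words(utterances, common=COMMON_WORDS):
    """Return unique words outside the common set, in first-seen order."""
    seen = []
    for utterance in utterances:
        for word in re.findall(r"[a-z']+", utterance.lower()):
            if word not in common and word not in seen:
                seen.append(word)
    return seen

def build_prompt(dialog_history, topic, max_words=20):
    """Assemble a text prompt asking an LM for likely upcoming words.

    The prompt wording is illustrative only; the patent does not give one.
    """
    words = rare_words(dialog_history)[:max_words]
    return (
        "List words likely to appear in the user's next spoken input.\n"
        f"Dialog topic: {topic}\n"
        f"Rare words from earlier turns: {', '.join(words)}\n"
        "Answer with a comma-separated list of words."
    )

history = ["Play some Coltrane", "I want to hear Thelonious Monk please"]
prompt = build_prompt(history, topic="jazz")
```

The words returned by the LM would then be embedded and passed to the ASR model alongside the audio embedding. That step depends on the specific models in use and is not shown here.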
Opening claim text (preview).
What is claimed is:
1. A computer-implemented method comprising: receiving first audio data representing a first spoken input associated with a dialog session; determining dialog data corresponding to the dialog session, the dialog data representing at least one previous spoken input associated with the dialog session; determining topic data corresponding to the dialog session; generating a first prompt including the dialog data and the topic data, the first prompt being a command to a large language model (LLM) to determine, based on the dialog data and the topic data, context data relevant for transcribing the first spoken input; processing, using the LLM, the first prompt to determine the context data to include a first plurality of words, the first plurality of words being relevant for transcribing the first spoken input; determining first embedding data corresponding to the context data; determining second embedding data corresponding to the first audio data; and processing, using an automatic speech recognition (ASR) model, the first embedding data and the second embedding data to determine first ASR data representing a first transcription of the first spoken input.
2. The computer-implemented method of claim 1, further comprising: determining a second plurality of words included in the dialog data; determining a user profile corresponding to the first audio data; receiving interaction history data representing past user inputs, the interaction history data being associated with the user profile; determining a third plurality of words included in the interaction history data; and determining a fourth plurality of words that are unique between the second plurality of words and the third plurality of words, wherein the first prompt includes the fourth plurality of words instead of the dialog data.
3. The computer-implemented method of claim 1, further comprising: determining user profile data associated with the first audio data, the user profile data representing personalized words; determining third embedding data corresponding to the user profile data; processing the first embedding data, the second embedding data, and the third embedding data using an attention component configured to apply attention to a first portion of the first audio data based on the context data or the user profile data; and determining the first ASR data based at least in part on the processing using the attention component.
4. The computer-implemented method of claim 1, further comprising: determining a user profile associated with the first audio data; receiving personalized data associated with the user profile; receiving a second plurality of words corresponding to a domain; and determining a third plurality of words unique among the dialog data, the personalized data and the first plurality of words, wherein the first prompt includes the second plurality of words, wherein the first plurality of words includes at least a first word of the third plurality of words.
5. A computer-implemented method comprising: receiving first audio data representing a first spoken input; determining first data including a first plurality of words; generating a first prompt including the first data, the first prompt being a command to a language model (LM) to determine context data, based on the first data, relevant for transcribing the first audio data; processing, using the LM, the first prompt to determine second data including a second plurality of words; determining first embedding data corresponding to the second data; determining second embedding data corresponding to the first audio data; and processing the first embedding data and the second embedding data to determine a transcript of the first spoken input.
6. The computer-implemented method of claim 5, further comprising: receiving interaction data representing a plurality of past user inputs; determining a third plurality of words included in the interaction data, the third plurality of words being nouns; and determining the first data to include the third plurality of words, wherein the second data includes at least a first word of the third plurality of words.
7. The computer-implemented method of claim 5, further comprising: determining the first audio data is associated with a dialog session; determining dialog data associated with the dialog session, the dialog data representing at least one previous user input; and determining, using the dialog data, the first plurality of words that are unique.
8. The computer-implemented method of claim 5, wherein the first audio data is associated with a dialog session, and the method further comprises: determining dialog data associated with dialog session, the dialog data representing at least one previous user input; determining, using the dialog data, a topic corresponding to the dialog session; and determining the first data to include the topic, wherein the second plurality of words includes words related to the topic.
9. The computer-implemented method of claim 5, further comprising: processing the second data to determine a third plurality of words that are unique words of the second plurality of words, wherein determining the first embedding data comprises processing, using an encoder, the third plurality of words to determine the first embedding data.
10. The computer-implemented method of claim 5, further comprising: processing the first embedding data and the second embedding data using an attention component configured to apply attention to a portion of the first audio data based on at least one of the second plurality of words; and determining the transcript based on the processing using the attention component.
11. The computer-implemented method of claim 5, further comprising: determining user profile data representing personalized words, the user profile data associated with the first audio data; determining third embedding data corresponding to the user profile data; and processing the first embedding data, the second embedding data, and the third embedding data to determine the transcript.
12. The computer-implemented method of claim 5, further comprising: receiving second audio data representing a second spoken input; determining third data including a third plurality of words; generating a second prompt including the third data, the second prompt being a command to the LM to determine context data, based on the third data, relevant for transcribing the second audio data; processing, using the LM, the second prompt to determine third embedding data corresponding to a third plurality of words, the third embedding data being generated by an intermediate layer of the LM; determining fourth embedding data corresponding to the second audio data; and processing the third embedding data and the fourth embedding data to determine a transcript of the second spoken input.
13. A system comprising: at least one processor; and at least one memory including instructions that, when executed by the at least one processor, cause the system to: receive first audio data representing a first spoken input; determine first data including a first plurality of words; generate a first prompt including the first data, the first prompt being a command to a language model (LM) to determine context data, based on the first data, relevant for transcribing the first audio data; process, using the LM, the first prompt to determine second data including a second plurality of words; determine first embedding data corresponding to the second data; deter
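Claims 3 and 10 recite an attention component that combines the context embeddings with the audio embeddings. The following is a minimal sketch of one common way to do this, scaled dot-product cross-attention, in which each audio frame attends over the context-word vectors. The patent text shown here does not specify the attention formula, so the scaling, the residual addition, and the function name `cross_attend` are all assumptions.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cross_attend(audio_frames, context_vectors):
    """For each audio frame, return (attention weights, biased frame).

    Assumes audio frames and context vectors share the same dimension.
    The biased frame is the original frame plus the attention-weighted
    average of the context vectors (a residual connection, assumed here).
    """
    scale = math.sqrt(len(context_vectors[0]))
    results = []
    for frame in audio_frames:
        weights = softmax([dot(frame, c) / scale for c in context_vectors])
        summary = [
            sum(w * c[i] for w, c in zip(weights, context_vectors))
            for i in range(len(frame))
        ]
        biased = [f + s for f, s in zip(frame, summary)]
        results.append((weights, biased))
    return results

# Toy example: one 2-dimensional audio frame, two context-word vectors.
frames = [[1.0, 0.0]]
context = [[4.0, 0.0], [0.0, 4.0]]
weights, biased = cross_attend(frames, context)[0]
```

In this toy example the frame points in the same direction as the first context vector, so the first context word receives the larger attention weight. A real system would use learned projections and much higher-dimensional embeddings.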
Training · CPC title
Procedures used during a speech recognition process, e.g. man-machine dialogue · CPC title
Execution procedure of a spoken command · CPC title
User profiles · CPC title
using artificial neural networks · CPC title