Method and system for optimizing use of retrieval augmented generation pipelines in generative artificial intelligence applications

US12405977B1 · US · B1

Patent metadata
Field: Value
Publication number: US-12405977-B1
Application number: US-202418812707-A
Country: US
Kind code: B1
Filing date: Aug 22, 2024
Priority date: Sep 20, 2023
Publication date: Sep 2, 2025
Grant date: Sep 2, 2025

Abstract

Official abstract text for this publication.

A system and method of improving performance of LLMs including receiving context files, generating refined context files from the context files, sending the refined context files to h-LLMs, receiving a user prompt, generating a plurality of derived prompts from the user prompt, transmitting the plurality of derived prompts to the h-LLMs, receiving a plurality of h-LLM results, processing the plurality of h-LLM results to generate a responsive result, and transmitting the responsive result to a user interface.
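The flow summarized in the abstract (refine context, derive several prompts, fan them out to the h-LLMs, then merge the results) can be illustrated with a minimal sketch. Everything named here is hypothetical: the h-LLMs are plain Python callables, `refine_context` only normalizes whitespace instead of calling refining LLMs, `derive_prompts` uses three fixed templates, and `aggregate` picks the most frequent result. The patent text shown on this page does not specify any of these choices.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List

# Hypothetical h-LLM interface: (prompt, refined context) -> result text.
HLLM = Callable[[str, List[str]], str]


def refine_context(files: List[str]) -> List[str]:
    """Stand-in for the refining LLMs: collapse whitespace, drop empty files."""
    refined = [" ".join(text.split()) for text in files]
    return [text for text in refined if text]


def derive_prompts(user_prompt: str) -> List[str]:
    """Input broker step: produce several variants of the user prompt (assumed templates)."""
    return [
        user_prompt,
        f"Answer concisely: {user_prompt}",
        f"Answer step by step: {user_prompt}",
    ]


def fan_out(prompts: List[str], hllms: List[HLLM], context: List[str]) -> List[str]:
    """Send every derived prompt to every h-LLM and collect the results in order."""
    jobs = [(model, prompt) for model in hllms for prompt in prompts]
    with ThreadPoolExecutor(max_workers=4) as pool:
        return list(pool.map(lambda job: job[0](job[1], context), jobs))


def aggregate(results: List[str]) -> str:
    """Output broker step (assumed policy): most frequent result, ties go to the first seen."""
    return Counter(results).most_common(1)[0][0]


def answer(user_prompt: str, files: List[str], hllms: List[HLLM]) -> str:
    """End-to-end sketch: refine, derive, fan out, aggregate."""
    context = refine_context(files)
    prompts = derive_prompts(user_prompt)
    return aggregate(fan_out(prompts, hllms, context))
```

Calling `answer("What is the capital of France?", files, [model_a, model_b])` with two stub callables returns whichever result the stubs produce most often. A real output broker could rank, merge, or summarize instead of voting.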

First claim

Opening claim text (preview).

What is claimed is:

1. A method of improving performance of large language models (LLMs) comprising: receiving one or more context files via an application programming interface (API) at an input broker from a user interface; generating one or more refined context files from the one or more context files using one or more refining LLMs; sending the one or more refined context files to one or more h-LLMs via a cloud service API, the one or more h-LLMs being hosted in a cloud container environment; receiving a user prompt via the API at the input broker from the user interface; generating a plurality of derived prompts from the user prompt at the input broker; transmitting the plurality of derived prompts to the one or more h-LLMs via the cloud service API; receiving a plurality of h-LLM results at an output broker, the plurality of h-LLM results being generated responsive to both the one or more refined context files and the plurality of derived prompts; processing the plurality of h-LLM results at the output broker to generate a responsive result; and transmitting the responsive result to the user interface via the API.

2. The method of claim 1 wherein generating the one or more refined context files comprises: splitting content of the one or more context files into a plurality of token blocks; reordering the plurality of token blocks; grouping token block subsets of the plurality of token blocks into a plurality of token block batches; and iteratively processing the plurality of token block batches to generate one or more refined context files.

3. The method of claim 2 wherein iteratively processing the plurality token block batches comprises: processing each token block batch of the plurality of token block batches through the one or more refining LLMs; ranking outputs of the one or more refining LLMs for each token block of the plurality of token blocks; determining whether an iteration completion criterion has been met; upon determining the iteration completion criterion has been met, generate the one or more refined context files from the plurality of token block batches; and upon determining the iteration completion criterion has not been met: selecting a subset of high-ranking token blocks from which a predetermined number of the highest ranked outputs were generated; performing a clustering operation on the subset of high-ranking token blocks; and generating a plurality of refined token block batches responsive to the clustering operation; wherein the plurality of refined token block batches is iteratively processed in the same manner as the plurality of token block batches until the iteration completion criterion is met.

4. The method of claim 1 wherein generating one or more refined context files comprises identifying one or more categories of information comprised by the one or more context files using topic modeling techniques.

5. The method of claim 4 wherein the plurality of derived prompts is generated responsive to the one or more categories of information identified in the one or more context files.

6. The method of claim 4, wherein at least one of the input broker and the output broker is configured to generate one or more advertisements to be provided as part of the responsive result responsive to the one or more identified categories of information.

7. The method of claim 1 wherein the one or more refined context files are generated offline and stored prior to receiving the user prompt.

8. The method of claim 1 wherein the one or more refined context files are retrieved from a cached memory storage.

9. The method of claim 1 further comprising performing pre-prompt processing on the one or more refined context files using one or more RAG enhancement engines or RAG enhancement modules, wherein generating the plurality of h-LLM results comprises: parsing the user prompt; identifying one or more relevant information items to be retrieved; checking an information cache for the one or more relevant information items; retrieving the one or more relevant information items from the cache responsive to locating the one or more relevant information items in the cache; conducting a hybrid search to locate the one or more relevant information items responsive to not locating the one or more relevant information items in the cache; generating an augmented user prompt by combining the user prompt with the one or more relevant information items; and providing the augmented user prompt to the one or more h-LLMs to generate the plurality of h-LLM results.

10. The method of claim 9 wherein performing pre-prompt processing on the one or more refined context files comprises: identifying one or more topics comprised by the one or more refined context files using a topic modeling engine; segmenting the one or more refined context files into one or more semantic chunks using a file chunking module; identifying cited sources within the one or more refined context files using a citation analyzer module; selecting one or more selected chunks responsive to the one or more topics and the cited sources using a chunk selection and ranking module; generating enriched chunks by adding metadata to the one or more selected chunks using a metadata enrichment module; and indexing the enriched chunks using an indexing engine.

11. The method of claim 10 wherein the enriched chunks are saved in a graph database.

12. The method of claim 1 wherein generating the plurality of h-LLM results comprises: performing a prefill process that simultaneously generates a key vector and a value vector from each derived prompt of the plurality of derived prompts in parallel; and performing a decode process that serially generates each h-LLM result of the plurality of h-LLM results from the key vector and the value vector generated for each derived prompt.

13. The method of claim 12 wherein the prefill process is performed by at least one h-LLM of the one or more h-LLMs and comprises: receiving the plurality of derived prompts; generating a plurality of tokens by tokenizing each derived prompt of the plurality of derived prompts; performing an embedding lookup for the plurality of tokens; processing the plurality of tokens through a series of neural network layers comprised by the at least one h-LLM, comprising: computing a query vector, a key vector, and a value vector for each token of the plurality of tokens at each neural network layer of the series of neural network layers, resulting in a plurality of query vectors, a plurality of key vectors, and a plurality of value vectors; performing a self-attention mechanism using the query vector, the key vector, and the value vector for each token; and processing each of the query vector, the key vector, and the value vector for each token through a feed-forward network comprised by the at least one h-LLM; generating a normalized key vector and a normalized value vector by performing a layer normalization process on the plurality of key vectors and the plurality of value vectors; and storing the normalized key vector and the normalized value vector in a key-value cache.

14. The method of claim 13 wherein the decode process is performed by at least one h-LLM of the one or more h-LLMs and comprises: receiving a first derived prompt of the plurality of derived prompts, the prefill process having been completed for the first derived prompt; generating a first decode token from the first derived prompt; performing an embedding lookup on the first decode token; processing the first decode token through the series of neural network layers, where
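The retrieval steps recited in claim 9 (check an information cache, fall back to a hybrid search on a miss, then combine the retrieved items with the user prompt) can be sketched as follows. This is an illustration under stated assumptions, not the patented implementation: the claim does not define "hybrid search", so the sketch assumes an equal-weight blend of keyword overlap and bag-of-words cosine similarity, and all function names, the cache key, and the prompt template are hypothetical.

```python
import math
from collections import Counter
from typing import Dict, List


def tokenize(text: str) -> List[str]:
    """Lowercase words with surrounding punctuation stripped."""
    words = (word.strip(".,?!").lower() for word in text.split())
    return [word for word in words if word]


def keyword_score(query: str, chunk: str) -> float:
    """Fraction of distinct query terms that also appear in the chunk."""
    query_terms, chunk_terms = set(tokenize(query)), set(tokenize(chunk))
    return len(query_terms & chunk_terms) / len(query_terms) if query_terms else 0.0


def cosine_score(query: str, chunk: str) -> float:
    """Cosine similarity of term-count vectors (a stand-in for embedding similarity)."""
    a, b = Counter(tokenize(query)), Counter(tokenize(chunk))
    dot = sum(a[term] * b[term] for term in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def hybrid_search(query: str, chunks: List[str], top_k: int = 2, alpha: float = 0.5) -> List[str]:
    """Rank chunks by a weighted blend of the two scores (assumed definition of 'hybrid')."""
    def score(chunk: str) -> float:
        return alpha * keyword_score(query, chunk) + (1 - alpha) * cosine_score(query, chunk)
    return sorted(chunks, key=score, reverse=True)[:top_k]


def augment_prompt(user_prompt: str, chunks: List[str], cache: Dict[str, List[str]]) -> str:
    """Cache lookup first; hybrid search only on a miss; then build the augmented prompt."""
    key = " ".join(sorted(set(tokenize(user_prompt))))  # hypothetical cache key
    items = cache.get(key)
    if items is None:
        items = hybrid_search(user_prompt, chunks)
        cache[key] = items
    context = "\n".join(f"- {item}" for item in items)
    return f"Context:\n{context}\n\nQuestion: {user_prompt}"
```

A second call with the same prompt is served from `cache` and never reaches `hybrid_search`, which is the cost saving the cache step is meant to provide.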

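The split between a prefill process and a decode process in claims 12 to 14 follows the general key-value caching pattern used in transformer inference. The toy below is a minimal sketch of that general pattern only: one attention head, one layer, fixed hypothetical weights (`W_Q`, `W_K`, `W_V`), and a made-up `embed` function. It omits the feed-forward network and the layer normalization of keys and values that claim 13 recites, and it runs prefill as a loop where a real system would batch the work in parallel.

```python
import math
from typing import List

Vector = List[float]
Matrix = List[Vector]

DIM = 4

# Hypothetical fixed projection weights; a real model learns these.
W_Q: Matrix = [[0.5, 0.1, 0.0, 0.2], [0.0, 0.4, 0.3, 0.1], [0.2, 0.0, 0.6, 0.1], [0.1, 0.3, 0.0, 0.5]]
W_K: Matrix = [[0.3, 0.0, 0.2, 0.1], [0.1, 0.5, 0.0, 0.2], [0.0, 0.2, 0.4, 0.0], [0.2, 0.1, 0.1, 0.6]]
W_V: Matrix = [[0.6, 0.0, 0.1, 0.0], [0.0, 0.7, 0.0, 0.1], [0.1, 0.0, 0.5, 0.2], [0.0, 0.2, 0.1, 0.4]]


def matvec(matrix: Matrix, vector: Vector) -> Vector:
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def embed(token_id: int) -> Vector:
    """Made-up deterministic embedding lookup."""
    return [math.sin(token_id * (i + 1)) for i in range(DIM)]


def attend(query: Vector, keys: List[Vector], values: List[Vector]) -> Vector:
    """Scaled dot-product attention of one query over all keys and values."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    peak = max(scores)
    exps = [math.exp(score - peak) for score in scores]
    total = sum(exps)
    return [sum((e / total) * value[i] for e, value in zip(exps, values)) for i in range(DIM)]


class KVCache:
    """Holds one key vector and one value vector per processed token."""

    def __init__(self) -> None:
        self.keys: List[Vector] = []
        self.values: List[Vector] = []

    def append(self, key: Vector, value: Vector) -> None:
        self.keys.append(key)
        self.values.append(value)


def prefill(token_ids: List[int], cache: KVCache) -> None:
    """Compute and cache keys and values for every prompt token."""
    for token_id in token_ids:
        x = embed(token_id)
        cache.append(matvec(W_K, x), matvec(W_V, x))


def decode_step(token_id: int, cache: KVCache) -> Vector:
    """Process one new token, reusing cached keys and values for all earlier tokens."""
    x = embed(token_id)
    cache.append(matvec(W_K, x), matvec(W_V, x))
    return attend(matvec(W_Q, x), cache.keys, cache.values)
```

Each `decode_step` projects only the new token and reads the rest from the cache, so the per-token cost grows with the cache length instead of requiring every earlier token to be re-projected.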
Assignees

Inventors

Classifications

  • Vector or matrix data

  • with dedicated cache, e.g. instruction or stack

  • in dialogue systems

  • Semantic analysis

  • Lexical analysis, e.g. tokenisation or collocates

Patent family

Related publications grouped by family.

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US12405977B1 cover?
A system and method of improving performance of LLMs including receiving context files, generating refined context files from the context files, sending the refined context files to h-LLMs, receiving a user prompt, generating a plurality of derived prompts from the user prompt, transmitting the plurality of derived prompts to the h-LLMs, receiving a plurality of h-LLM results, processing the pl…
Who is the assignee on this patent?
Madisetti Vijay
What technology area does this patent fall under?
Primary CPC classification G06F16/33295. Mapped technology areas include Physics.
When was this patent published?
Publication date Sep 2, 2025 (B1). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 12 related publications on this page (citations in our corpus or others sharing the same primary CPC).