Caching large language model (LLM) responses using hybrid retrieval and reciprocal rank fusion

US12259913B1 · US · B1

Patent metadata
Field: Value
Publication number: US-12259913-B1
Application number: US-202418441863-A
Country: US
Kind code: B1
Filing date: Feb 14, 2024
Priority date: Feb 14, 2024
Publication date: Mar 25, 2025
Grant date: Mar 25, 2025

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

A system and method for improving computer functionality by retrieving answers/responses to questions/input from a cache such as those used with chatbots and generative AI systems. Disclosed is a multi-layered caching strategy that focuses on the relevance of a cache hit by improving the quality of the answer. The approach demonstrates that response latency is significantly reduced when using caching and how a caching strategy could be applied in various layers of increasing relevance for a simple Question-and-Answer system with the possibility of extending to more complex generative AI interactions.
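The latency argument in the abstract rests on a cache-first flow: look the question up in the cache, and only call the generative model on a miss. The sketch below illustrates that flow under stated assumptions. The function name `answer_with_cache`, the callback parameters, and the dict-backed exact-match cache are hypothetical and do not come from the patent, which describes semantic and lexical lookups rather than exact matching.

```python
from typing import Callable, Dict, List, Optional, Tuple


def answer_with_cache(
    question: str,
    cache_lookup: Callable[[str], Optional[str]],
    generate: Callable[[str], str],
    cache_store: Callable[[str, str], None],
) -> Tuple[str, bool]:
    """Return (answer, served_from_cache) for a question.

    All four parameters are illustrative; the lookup could be any
    retrieval layer, and `generate` stands in for an LLM call.
    """
    cached = cache_lookup(question)
    if cached is not None:
        # Cache hit: skip the slow generation step entirely.
        return cached, True
    # Cache miss: generate an answer, then store it for later questions.
    answer = generate(question)
    cache_store(question, answer)
    return answer, False


# Usage with a plain dict as a stand-in exact-match cache.
store: Dict[str, str] = {}
llm_calls: List[str] = []


def fake_llm(q: str) -> str:
    """Stand-in for a generative model; records each call."""
    llm_calls.append(q)
    return f"answer to: {q}"


first = answer_with_cache("What is RRF?", store.get, fake_llm, store.__setitem__)
second = answer_with_cache("What is RRF?", store.get, fake_llm, store.__setitem__)
```

In this sketch the second call is served from the cache, so the stand-in model is invoked only once.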

First claim

Opening claim text (preview).

What is claimed is:

1. A method for improving computer functionality by retrieving answers to questions from a cache, the method comprising: using a hardware processor communicatively coupled to memory to perform accessing a question stored in primary storage communicatively coupled to a cache, in a text format; accessing metadata associated with the question in the text format; vectorizing the question in the text format into a high dimensional vector using a text embedding algorithm, wherein the high dimensional vector is greater than or equal to 1024 dimensions; using the high dimensional vector to search a question portion of the cache using a plurality of retriever types to create a hybrid search, in which the hybrid search combines one or more text format queries using the metadata with one or more high dimensional vector queries in a single search request; performing query filtering with the metadata associated with the question to provide a semantic layer set of semantic answers in a text format with metadata associated with an answer and semantic relevance values; using the question in the text format to search an answer portion of the cache and performing query filtering with the metadata associated with the question to provide a lexical layer set of lexical answers in the text format with the metadata associated with the answer and lexical relevance values; using the semantic layer set in order of the semantic relevance values from highest to lowest and the lexical layer set from highest to lowest; and applying a reciprocal rank fusion algorithm to compute a combined ranking set for the semantic answers in the text format and the lexical answers in the text format to provide an identified answer.

2. The method of claim 1, further comprising: in response to the semantic relevance values being above a settable value, returning the semantic answers in the text format with the highest semantic relevance values and, otherwise, sending the question in the text format to create a prompt.

3. The method of claim 1, further comprising: in response to the combined ranking set being above a settable value, returning the identified answer and, otherwise, sending the question in the text format to create a prompt.

4. The method of claim 1, wherein the performing query filtering with the metadata associated with the question provides an exact match result, wherein the high dimensional vector to search the question portion of the cache provides an approximate match ranked by the semantic relevance values.

5. The method of claim 1, wherein the performing query filtering with Q-metadata provides an exact match result, wherein the question in the text format to search a question portion of the cache provides an approximate match ranked by the lexical relevance values.

6. The method of claim 1, further comprising: in response to a subsequent question is received, the cache is first checked to see if a similar request has already been made and, in response, retrieving the answer from the cache.

7. The method of claim 1, wherein the accessing a question in a text format includes accessing a question that originated from a human user or from a computer process.

8. A method for improving relevancy of a cache hit, the method comprising: operating a cache communicatively coupled to primary storage in an information retrieval system; accessing a question in a text format; accessing metadata associated with the question in text format; vectorizing the question in the text format into a high dimensional vector using a text embedding algorithm, wherein the high dimensional vector is greater than or equal to 1024 dimensions; using the high dimensional vector to search a question portion of the cache using a plurality of retriever types to create a hybrid search, in which the hybrid search combines one or more text format queries using the metadata with one or more high dimensional vector queries in a single search request; performing query filtering with the metadata associated with the question to provide a semantic layer set of semantic answers in a text format with metadata associated with an answer and semantic relevance values; using question in the text format to search an answer portion of the cache and performing query filtering with the metadata associated with the question to provide a lexical layer set of lexical answers in the text format with the metadata associated with the answer and lexical relevance values; using the semantic layer set in order of the semantic relevance values from highest to lowest and the lexical layer set from highest to lowest; applying a reciprocal rank fusion algorithm to compute a combined ranking set for the semantic answers in the text format and the lexical answers in the text format to provide an identified answer; and in response to the combined ranking set being above a settable value, returning the identified answer and, otherwise, sending the question in the text format to create a prompt.

9. The method of claim 8, wherein the performing query filtering with the metadata associated with the question provides an exact match result, wherein the high dimensional vector to search the question portion of the cache provides an approximate match ranked by the semantic relevance values.

10. The method of claim 9, wherein the performing query filtering with the metadata associated with the question provides an exact match result, wherein the high dimensional vector to search a question portion of the cache provides an approximate match ranked by the lexical relevance values.

11. A system for improving computer functionality by retrieving answers to questions from a cache, the system comprising the cache communicatively coupled to primary storage in an information retrieval system; memory; at least one processor communicatively coupled to memory and the information retrieval system, programmed to perform; accessing metadata associated with the question in text format; vectorizing the question in the text format into a high dimensional vector using a text embedding algorithm, wherein the high dimensional vector is greater than or equal to 1024 dimensions; using the high dimensional vector to search a question portion of the cache using a plurality of retriever types to create a hybrid search, in which the hybrid search combines one or more text format queries using the metadata with one or more high dimensional vector queries in a single search request; performing query filtering with the metadata associated with the question to provide a semantic layer set of semantic answers in a text format with metadata associated with an answer and semantic relevance values; using the question in the text format to search an answer portion of the cache and performing query filtering with the metadata associated with the question to provide a lexical layer set of lexical answers in the text format with the metadata associated with the answer and lexical relevance values; using the semantic layer set in order of the semantic relevance values from highest to lowest and the lexical layer set from highest to lowest; and applying a reciprocal rank fusion algorithm to compute a combined ranking set for the semantic answers in the text format and the lexical answers in the text format to provide an identified answer.

12. The system of claim 11, further comprising: in response to the semantic relevance values being above a settable value, returning the semantic answers in the text format with the highest semantic relevance values and, otherwise, sending the question in the text format to create a prompt.

13. The sys
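The fusion step named in claims 1, 3, and 8 can be illustrated with a small sketch. It assumes the commonly used reciprocal rank fusion formula, where each answer scores the sum of 1 / (k + rank) over every ranked list it appears in. The constant k = 60, the answer identifiers, the function names, and the reading of the "settable value" as a threshold on the top fused score are all assumptions made for illustration; the claim text shown here does not specify them.

```python
from collections import defaultdict
from typing import Dict, List, Optional, Tuple


def reciprocal_rank_fusion(
    rankings: List[List[str]], k: int = 60
) -> List[Tuple[str, float]]:
    """Fuse ranked lists of answer ids into one ranking.

    Each list must already be sorted from most to least relevant.
    k = 60 is a conventional default, assumed here, not taken from the patent.
    """
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, answer_id in enumerate(ranking, start=1):
            scores[answer_id] += 1.0 / (k + rank)
    # Highest fused score first.
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


def pick_answer(
    semantic: List[str], lexical: List[str], threshold: float, k: int = 60
) -> Optional[str]:
    """Return the top fused answer id if its score is above the threshold.

    Returning None models the fallback path, where the question would
    instead be sent on to build a prompt.
    """
    fused = reciprocal_rank_fusion([semantic, lexical], k)
    if fused and fused[0][1] > threshold:
        return fused[0][0]
    return None


# Hypothetical answer ids, ordered by semantic and lexical relevance.
semantic_layer = ["a1", "a2", "a3"]
lexical_layer = ["a1", "a4", "a2"]

fused = reciprocal_rank_fusion([semantic_layer, lexical_layer])
hit = pick_answer(semantic_layer, lexical_layer, threshold=0.03)
miss = pick_answer(semantic_layer, lexical_layer, threshold=0.05)
```

Because the fusion uses only rank positions, the semantic and lexical relevance values never need to share a scale; they matter only for ordering each layer before fusion.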

Assignees

Inventus Holdings Llc

Inventors

Classifications

  • using relevance feedback from the user, e.g. relevance feedback on documents, documents sets, document terms or passages · CPC title

  • Filtering based on additional data, e.g. user or group profiles (filtering in web context G06F16/9535, G06F16/9536) · CPC title

  • using vector based model · CPC title

  • Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually · CPC title

  • Natural language query formulation · CPC title

Patent family

Related publications grouped by family.

External sources

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US12259913B1 cover?
A system and method for improving computer functionality by retrieving answers/responses to questions/input from a cache such as those used with chatbots and generative AI systems. Disclosed is a multi-layered caching strategy that focuses on the relevance of a cache hit by improving the quality of the answer. The approach demonstrates that response latency is significantly reduced when using c…
Who is the assignee on this patent?
Inventus Holdings Llc
What technology area does this patent fall under?
Primary CPC classification G06F16/3326. Mapped technology areas include Physics.
When was this patent published?
Publication date Mar 25, 2025 (B1). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 12 related publications on this page (citations in our corpus or others sharing the same primary CPC).