Efficient machine learning caching via attention output-based token eviction

US12579063B2 · US · B2

Patent metadata
Field               Value
Publication number  US-12579063-B2
Application number  US-202418825897-A
Country             US
Kind code           B2
Filing date         Sep 5, 2024
Priority date       Jul 9, 2024
Publication date    Mar 17, 2026
Grant date          Mar 17, 2026

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

Certain aspects of the present disclosure provide techniques and apparatus for machine learning. In an example method, an input prompt comprising a set of tokens is accessed as input to a generative machine learning model. A first key tensor and a first value tensor are generated for a first token of the set of tokens, and the first key tensor and the first value tensor are stored in a memory. A first retention score is generated, for the first token, based on the first key tensor, the first value tensor, and a second token of the set of tokens. The first key tensor and the first value tensor are evicted from the memory in response to determining that the first retention score is a lowest retention score of the memory.
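
The scoring and eviction steps summarized above can be sketched as follows. This is a minimal illustration, not the patented implementation: it assumes single-head scaled dot-product attention in which the second token supplies the query, and it reads the retention score recited in claim 1 as a_i / (1 - a_i) times the Euclidean norm of (V_i - O). The function names `attention_output`, `retention_scores`, and `evict_lowest` are hypothetical.

```python
import math


def attention_output(query, keys, values):
    """Return (attention weights, attention output) for one query over the cache."""
    scale = 1.0 / math.sqrt(len(query))
    logits = [scale * sum(q * k for q, k in zip(query, key)) for key in keys]
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    attn = [e / total for e in exps]
    dim = len(values[0])
    output = [sum(a * v[d] for a, v in zip(attn, values)) for d in range(dim)]
    return attn, output


def retention_scores(query, keys, values):
    """Score each cached token by how far the attention output moves if it is evicted.

    Assumed form: r_i = a_i / (1 - a_i) * ||V_i - O||_2.
    """
    attn, output = attention_output(query, keys, values)
    scores = []
    for a, v in zip(attn, values):
        dist = math.sqrt(sum((vd - od) ** 2 for vd, od in zip(v, output)))
        denom = max(1.0 - a, 1e-12)  # guard: a single cached token has a == 1
        scores.append(a / denom * dist)
    return scores


def evict_lowest(query, keys, values):
    """Remove the key/value pair with the lowest retention score; return its index."""
    scores = retention_scores(query, keys, values)
    idx = min(range(len(scores)), key=scores.__getitem__)
    del keys[idx]
    del values[idx]
    return idx
```

Calling `evict_lowest(query, keys, values)` mutates the two lists in place, which mirrors evicting both the key tensor and the value tensor of the same token from the cache.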

First claim

Opening claim text (preview).

What is claimed is:

1. A processing system for machine learning comprising: one or more memories comprising processor-executable instructions; and one or more processors coupled to the one or more memories and configured to execute the processor-executable instructions and cause the processing system to: access an input prompt comprising a set of tokens as input to a generative machine learning model; generate, for a first token of the set of tokens, a first key tensor and a first value tensor; store the first key tensor and the first value tensor in a memory; generate, for the first token, a first retention score based on the first key tensor, the first value tensor, and a second token of the set of tokens; and evict the first key tensor and the first value tensor from the memory in response to determining that the first retention score is a lowest retention score of the memory; wherein the first retention score corresponds to a change in attention output of the generative machine learning model if the first token is evicted from the memory; wherein the first retention score is defined as

    $r_i = \left| \frac{a_i}{1 - a_i} (V_i - O) \right|_2$,

wherein: $r_i$ is the first retention score, $a_i$ is an attention score between the first token and the second token, $V_i$ is the first value tensor, and $O$ is the attention output prior to evicting the first token from the memory.

2. The processing system of claim 1, wherein the one or more processors are configured to execute the processor-executable instructions and further cause the processing system to store a second key tensor and a second value tensor corresponding to the second token in the memory.

3. The processing system of claim 2, wherein the one or more processors are configured to execute the processor-executable instructions and further cause the processing system to: generate, for the second token, a second retention score based on the second key tensor, the second value tensor, and a third token of the set of tokens; determine not to evict the second key tensor and the second value tensor from the memory in response to determining that the second retention score is not the lowest retention score of the memory; and store a third key tensor and a third value tensor corresponding to the third token in the memory.

4. The processing system of claim 3, wherein the one or more processors are configured to execute the processor-executable instructions and further cause the processing system to: generate, for a fourth token, a third retention score based on a fourth key tensor, a fourth value tensor, and the third token of the set of tokens; and evict the fourth key tensor and the fourth value tensor from the memory in response to determining that the third retention score is the lowest retention score of the memory.

5. The processing system of claim 1, wherein the one or more processors are configured to execute the processor-executable instructions and further cause the processing system to evict the first key tensor and the first value tensor in further response to determining that a size of the memory satisfies a maximum memory size.

6. The processing system of claim 1, wherein the one or more processors are configured to execute the processor-executable instructions and further cause the processing system to, subsequent to generating a respective key tensor and a respective value tensor for each respective token of the set of tokens, generate a new token using the generative machine learning model and based on at least a subset of the respective key tensors and the respective value tensors.

7. The processing system of claim 6, wherein the one or more processors are configured to execute the processor-executable instructions and further cause the processing system to: generate, for the second token, a second retention score based on a second key tensor corresponding to the second token, a second value tensor corresponding to the second token, and the new token; and evict the second key tensor and the second value tensor from the memory in response to determining that the second retention score is a lowest retention score of the memory.

8. The processing system of claim 7, wherein the one or more processors are configured to execute the processor-executable instructions and further cause the processing system to store a new key tensor and a new value tensor corresponding to the new token in the memory.

9. The processing system of claim 6, wherein the one or more processors are configured to execute the processor-executable instructions and further cause the processing system to generate an output of the generative machine learning model including the new token.

10. A processor-implemented method for generative machine learning, comprising: accessing an input prompt comprising a set of tokens as input to a generative machine learning model; generating, for a first token of the set of tokens, a first key tensor and a first value tensor; storing the first key tensor and the first value tensor in a memory; generating, for the first token, a first retention score based on the first key tensor, the first value tensor, and a second token of the set of tokens; and evicting the first key tensor and the first value tensor from the memory in response to determining that the first retention score is a lowest retention score of the memory; wherein the first retention score corresponds to a change in attention output of the generative machine learning model if the first token is evicted from the memory; wherein the first retention score is defined as r_i = …
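
Claim 1 states that the retention score corresponds to the change in attention output if the token is evicted. A short derivation shows how a score of that kind can arise. It rests on assumptions the claim text does not spell out: the attention scores are softmax weights that sum to 1, the remaining weights are renormalized after eviction, and the flattened expression in the claim is read as the fraction a_i / (1 - a_i) with the trailing 2 as a Euclidean-norm subscript. The symbol O_{\setminus i} (the output without token i) is introduced here for illustration only.

```latex
\begin{align*}
O &= \sum_{j} a_j V_j, \qquad \sum_{j} a_j = 1, \quad 0 \le a_i < 1 \\
O_{\setminus i} &= \sum_{j \neq i} \frac{a_j}{1 - a_i}\, V_j
                 = \frac{O - a_i V_i}{1 - a_i} \\
O_{\setminus i} - O &= \frac{O - a_i V_i - (1 - a_i)\, O}{1 - a_i}
                     = \frac{a_i}{1 - a_i}\,(O - V_i) \\
r_i &= \left| O_{\setminus i} - O \right|_2
     = \left| \frac{a_i}{1 - a_i}\,(V_i - O) \right|_2
\end{align*}
```

Under these assumptions, a token scores low either when it receives little attention (small a_i) or when its value tensor is already close to the current output, so removing it barely moves O.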

Assignees

Qualcomm Inc

Inventors

Classifications

  • G06N3/0475 (Primary)

    Generative networks · CPC title

  • using electronic means · CPC title

  • Garbage collection, i.e. reclamation of unreferenced memory · CPC title

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US12579063B2 cover?
Certain aspects of the present disclosure provide techniques and apparatus for machine learning. In an example method, an input prompt comprising a set of tokens is accessed as input to a generative machine learning model. A first key tensor and a first value tensor are generated for a first token of the set of tokens, and the first key tensor and the first value tensor are stored in a memory. …
Who is the assignee on this patent?
Qualcomm Inc
What technology area does this patent fall under?
Primary CPC classification G06N3/0475. Mapped technology areas include Physics.
When was this patent published?
Publication date Mar 17, 2026 (B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 3 related publications on this page (citations in our corpus or others sharing the same primary CPC).