Artificial intelligence query processing by processing-near-memory storage
US-2025190394-A1 · Jun 12, 2025 · US
US12579063B2 · US · B2
| Field | Value |
|---|---|
| Publication number | US-12579063-B2 |
| Application number | US-202418825897-A |
| Country | US |
| Kind code | B2 |
| Filing date | Sep 5, 2024 |
| Priority date | Jul 9, 2024 |
| Publication date | Mar 17, 2026 |
| Grant date | Mar 17, 2026 |
Official abstract text for this publication.
Certain aspects of the present disclosure provide techniques and apparatus for machine learning. In an example method, an input prompt comprising a set of tokens is accessed as input to a generative machine learning model. A first key tensor and a first value tensor are generated for a first token of the set of tokens, and the first key tensor and the first value tensor are stored in a memory. A first retention score is generated, for the first token, based on the first key tensor, the first value tensor, and a second token of the set of tokens. The first key tensor and the first value tensor are evicted from the memory in response to determining that the first retention score is a lowest retention score of the memory.
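The flow in the abstract (store a key/value pair per token, score cached entries against a newer token, evict the lowest-scoring entry) can be sketched roughly as follows. This is a minimal illustration, not the patented implementation: `process_prompt`, `dot_score`, and `max_entries` are hypothetical names, the capacity check mirrors the maximum-memory-size condition of claim 5, and the scorer is a deliberately simple placeholder (key-query dot product) rather than the retention score defined in the claims.

```python
from typing import Callable, Dict, List, Tuple

Vector = List[float]
# Hypothetical scorer signature: (cached key, cached value, newer token's query) -> score.
Scorer = Callable[[Vector, Vector, Vector], float]


def dot_score(key: Vector, value: Vector, query: Vector) -> float:
    """Placeholder scorer (an assumption, NOT the claimed formula): key-query dot product."""
    return sum(k * q for k, q in zip(key, query))


def process_prompt(
    tokens: List[Tuple[Vector, Vector, Vector]],
    max_entries: int,
    scorer: Scorer = dot_score,
) -> Dict[int, Tuple[Vector, Vector]]:
    """Walk the prompt once, keeping at most `max_entries` key/value pairs.

    `tokens` holds one (key, value, query) triple per token, assumed to be
    already projected by the model. Returns position -> (key, value).
    """
    cache: Dict[int, Tuple[Vector, Vector]] = {}
    for pos, (key, value, query) in enumerate(tokens):
        if len(cache) >= max_entries:
            # Score every cached entry against the newer token's query.
            scores = {p: scorer(k, v, query) for p, (k, v) in cache.items()}
            # Evict the entry with the lowest score.
            victim = min(scores, key=scores.get)
            del cache[victim]
        # Store the newer token's key/value pair.
        cache[pos] = (key, value)
    return cache


if __name__ == "__main__":
    prompt = [
        ([1.0, 0.0], [1.0, 1.0], [1.0, 0.0]),
        ([0.0, 1.0], [2.0, 2.0], [0.0, 1.0]),
        ([1.0, 1.0], [3.0, 3.0], [1.0, 0.0]),
    ]
    print(sorted(process_prompt(prompt, max_entries=2)))
```

Passing the scorer as a parameter keeps the eviction loop independent of how the score is computed, so a different retention score can be substituted without touching the cache logic.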
Opening claim text (preview).
What is claimed is:
1. A processing system for machine learning comprising: one or more memories comprising processor-executable instructions; and one or more processors coupled to the one or more memories and configured to execute the processor-executable instructions and cause the processing system to: access an input prompt comprising a set of tokens as input to a generative machine learning model; generate, for a first token of the set of tokens, a first key tensor and a first value tensor; store the first key tensor and the first value tensor in a memory; generate, for the first token, a first retention score based on the first key tensor, the first value tensor, and a second token of the set of tokens; and evict the first key tensor and the first value tensor from the memory in response to determining that the first retention score is a lowest retention score of the memory; wherein the first retention score corresponds to a change in attention output of the generative machine learning model if the first token is evicted from the memory; wherein the first retention score is defined as $r_i = \left\lVert \frac{a_i}{(1 - a_i)} (V_i - O) \right\rVert_2$, wherein: $r_i$ is the first retention score, $a_i$ is an attention score between the first token and the second token, $V_i$ is the first value tensor, and $O$ is the attention output prior to evicting the first token from the memory.
2. The processing system of claim 1, wherein the one or more processors are configured to execute the processor-executable instructions and further cause the processing system to store a second key tensor and a second value tensor corresponding to the second token in the memory.
3. The processing system of claim 2, wherein the one or more processors are configured to execute the processor-executable instructions and further cause the processing system to: generate, for the second token, a second retention score based on the second key tensor, the second value tensor, and a third token of the set of tokens; determine not to evict the second key tensor and the second value tensor from the memory in response to determining that the second retention score is not the lowest retention score of the memory; and store a third key tensor and a third value tensor corresponding to the third token in the memory.
4. The processing system of claim 3, wherein the one or more processors are configured to execute the processor-executable instructions and further cause the processing system to: generate, for a fourth token, a third retention score based on a fourth key tensor, a fourth value tensor, and the third token of the set of tokens; and evict the fourth key tensor and the fourth value tensor from the memory in response to determining that the third retention score is the lowest retention score of the memory.
5. The processing system of claim 1, wherein the one or more processors are configured to execute the processor-executable instructions and further cause the processing system to evict the first key tensor and the first value tensor in further response to determining that a size of the memory satisfies a maximum memory size.
6. The processing system of claim 1, wherein the one or more processors are configured to execute the processor-executable instructions and further cause the processing system to, subsequent to generating a respective key tensor and a respective value tensor for each respective token of the set of tokens, generate a new token using the generative machine learning model and based on at least a subset of the respective key tensors and the respective value tensors.
7. The processing system of claim 6, wherein the one or more processors are configured to execute the processor-executable instructions and further cause the processing system to: generate, for the second token, a second retention score based on a second key tensor corresponding to the second token, a second value tensor corresponding to the second token, and the new token; and evict the second key tensor and the second value tensor from the memory in response to determining that the second retention score is a lowest retention score of the memory.
8. The processing system of claim 7, wherein the one or more processors are configured to execute the processor-executable instructions and further cause the processing system to store a new key tensor and a new value tensor corresponding to the new token in the memory.
9. The processing system of claim 6, wherein the one or more processors are configured to execute the processor-executable instructions and further cause the processing system to generate an output of the generative machine learning model including the new token.
10. A processor-implemented method for generative machine learning, comprising: accessing an input prompt comprising a set of tokens as input to a generative machine learning model; generating, for a first token of the set of tokens, a first key tensor and a first value tensor; storing the first key tensor and the first value tensor in a memory; generating, for the first token, a first retention score based on the first key tensor, the first value tensor, and a second token of the set of tokens; and evicting the first key tensor and the first value tensor from the memory in response to determining that the first retention score is a lowest retention score of the memory; wherein the first retention score corresponds to a change in attention output of the generative machine learning model if the first token is evicted from the memory; wherein the first retention score is defined as r_i = …
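Claim 1 ties the retention score to the change in attention output that evicting a token would cause. The sketch below illustrates that relationship under stated assumptions: the flattened formula is read as the fraction a_i / (1 - a_i) multiplied by (V_i - O) under an L2 norm, attention is single-head scaled dot-product attention with a softmax (the scaling and softmax are assumptions, not stated in the claims), and all function names are hypothetical. `leave_one_out_change` recomputes attention with one token removed and the remaining weights renormalized, so the two quantities can be compared directly.

```python
import math
from typing import List

Vector = List[float]


def softmax(xs: List[float]) -> List[float]:
    """Numerically stable softmax over a list of logits."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attention_logits(query: Vector, keys: List[Vector]) -> List[float]:
    """Scaled dot-product logits (the 1/sqrt(d) scaling is an assumption)."""
    scale = math.sqrt(len(query))
    return [sum(q * k for q, k in zip(query, key)) / scale for key in keys]


def attention_output(weights: List[float], values: List[Vector]) -> Vector:
    """Weighted sum of value vectors."""
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]


def l2_distance(u: Vector, v: Vector) -> float:
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def retention_scores(query: Vector, keys: List[Vector], values: List[Vector]) -> List[float]:
    """r_i = a_i / (1 - a_i) * ||V_i - O||_2 for every cached token.

    Needs at least two cached tokens so that every a_i is below 1.
    """
    a = softmax(attention_logits(query, keys))
    out = attention_output(a, values)
    return [a_i / (1.0 - a_i) * l2_distance(v_i, out) for a_i, v_i in zip(a, values)]


def leave_one_out_change(query: Vector, keys: List[Vector], values: List[Vector], i: int) -> float:
    """L2 change in attention output when token i is removed and weights renormalize."""
    logits = attention_logits(query, keys)
    full = attention_output(softmax(logits), values)
    kept = [j for j in range(len(keys)) if j != i]
    reduced = attention_output(softmax([logits[j] for j in kept]), [values[j] for j in kept])
    return l2_distance(reduced, full)


if __name__ == "__main__":
    q = [1.0, 0.5]
    ks = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    vs = [[1.0, 0.0], [0.0, 2.0], [3.0, 1.0]]
    scores = retention_scores(q, ks, vs)
    for i, s in enumerate(scores):
        print(i, round(s, 6), round(leave_one_out_change(q, ks, vs, i), 6))
```

Under these assumptions the closed form and the explicit leave-one-out recomputation agree, which is why a score of this shape can be evaluated for every cached token without recomputing attention once per candidate eviction.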
Generative networks · CPC title
using electronic means · CPC title
Garbage collection, i.e. reclamation of unreferenced memory · CPC title