US2025278270A1 · US · A1
| Field | Value |
|---|---|
| Publication number | US-2025278270-A1 |
| Application number | US-202418958633-A |
| Country | US |
| Kind code | A1 |
| Filing date | Nov 25, 2024 |
| Priority date | Mar 4, 2024 |
| Publication date | Sep 4, 2025 |
| Grant date | — |
Official abstract text for this publication.
A transformer acceleration device configured to execute a transformer comprising a first plurality of decoder layers and a second plurality of decoder layers is disclosed. The transformer acceleration device comprises a first memory bank array configured to store a first plurality of weight matrices corresponding to the first plurality of decoder layers and a first plurality of key-value vector pairs corresponding to the second plurality of decoder layers, and a second memory bank array configured to store a second plurality of weight matrices corresponding to the second plurality of decoder layers, and a second plurality of key-value vector pairs corresponding to the first plurality of decoder layers.
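The cross-placement described in the abstract can be pictured with a small sketch. It assumes the two layer groups alternate (as dependent claim 5 recites), so even-indexed layers form the first group and odd-indexed layers the second. `BankArray` and `place_layers` are hypothetical names used only for illustration and do not come from the publication.

```python
from dataclasses import dataclass, field


@dataclass
class BankArray:
    """Hypothetical stand-in for one memory bank array."""
    weights: dict = field(default_factory=dict)   # layer index -> weight matrices
    kv_cache: dict = field(default_factory=dict)  # layer index -> list of (key, value) pairs


def place_layers(num_layers):
    """Assign each decoder layer's weights to one bank array and its
    key-value pairs to the other (assumed alternating layer groups)."""
    banks = [BankArray(), BankArray()]
    for layer in range(num_layers):
        home = layer % 2      # bank array holding this layer's weight matrices
        other = 1 - home      # bank array holding this layer's key-value pairs
        banks[home].weights[layer] = f"W[{layer}]"  # placeholder for real matrices
        banks[other].kv_cache[layer] = []
    return banks


banks = place_layers(4)
print(sorted(banks[0].weights), sorted(banks[0].kv_cache))  # [0, 2] [1, 3]
print(sorted(banks[1].weights), sorted(banks[1].kv_cache))  # [1, 3] [0, 2]
```

Under this layout, a layer's weight matrices and its key-value pairs always sit in different bank arrays, and each bank array holds the same number of layers of each kind.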
Opening claim text (preview).
What is claimed is:

1. A transformer acceleration device configured to execute a transformer comprising a first plurality of decoder layers and a second plurality of decoder layers, the transformer acceleration device comprising: a first memory bank array configured to store a first plurality of weight matrices corresponding to the first plurality of decoder layers and a first plurality of key-value vector pairs corresponding to the second plurality of decoder layers; and a second memory bank array configured to store a second plurality of weight matrices corresponding to the second plurality of decoder layers, and a second plurality of key-value vector pairs corresponding to the first plurality of decoder layers.

2. The transformer acceleration device of claim 1, further comprising a first processing circuit configured to: calculate the first plurality of key-value vector pairs based on the first plurality of weight matrices; and calculate the second plurality of key-value vector pairs based on the second plurality of weight matrices.

3. The transformer acceleration device of claim 2, further comprising a second processing circuit configured to: perform attention calculation for the second plurality of decoder layers based on the first plurality of key-value vector pairs; and perform attention calculation for the first plurality of decoder layers based on the second plurality of key-value vector pairs.

4. The transformer acceleration device of claim 1, wherein: a number of the first plurality of decoder layers and a number of the second plurality of decoder layers correspond to each other.

5. The transformer acceleration device of claim 3, wherein: an order of the first plurality of decoder layers and the second plurality of decoder layers is alternating with each other.

6. The transformer acceleration device of claim 1, wherein: a first capacity of the first plurality of weight matrices and the first plurality of key-value vector pairs corresponds to a second capacity of the second plurality of weight matrices and the second plurality of key-value vector pairs.

7. The transformer acceleration device of claim 1, wherein: the first plurality of key-value vector pairs comprises: a first plurality of key vectors corresponding to a first token stream provided from an external device; and a first plurality of value vectors corresponding to the first token stream, and the second plurality of key-value vector pairs comprises: a second plurality of key vectors corresponding to the first token stream; and a second plurality of value vectors corresponding to the first token stream.

8. The transformer acceleration device of claim 7, wherein: the first plurality of key-value vector pairs further comprises: a third plurality of key vectors corresponding to a second token stream provided from the external device; and a third plurality of value vectors corresponding to the second token stream, and the second plurality of key-value vector pairs further comprises: a fourth plurality of key vectors corresponding to the second token stream; and a fourth plurality of value vectors corresponding to the second token stream.

9. The transformer acceleration device of claim 7, wherein: each of the first plurality of key vectors, the first plurality of value vectors, the second plurality of key vectors, and second plurality of value vectors has smaller capacity than that of one memory cell row of memory banks included in the first memory bank array and the second memory bank array.

10. A transformer acceleration device configured to operate based on a first activation vector corresponding to a first input token, the transformer acceleration device comprising: a first memory bank array configured to store a first plurality of weight matrices; a second memory bank array configured to store a first plurality of preceding key-value vector pairs corresponding to a first plurality of preceding tokens for the first input token; a first processing circuit configured to generate a first key-value vector pair based on the first activation vector and the first plurality of weight matrices; and a second processing circuit configured to perform first attention calculation based on the first key-value vector pair and the first plurality of preceding key-value vector pairs.

11. The transformer acceleration device of claim 10, wherein: the first key-value vector pair comprises a first key vector and a first value vector, the first plurality of preceding key-value vector pairs comprises a first plurality of preceding key vectors and a first plurality of preceding value vectors, and the second processing circuit is configured to: generate a first plurality of attention scores respectively corresponding to the first key vector and the first plurality of preceding key vectors; and generate a first attention vector by accumulating the first value vector and the first plurality of preceding value vectors based on the first plurality of attention scores.

12. The transformer acceleration device of claim 11, wherein the first processing circuit is further configured to store the first key vector and the first value vector in the second memory bank array.

13. The transformer acceleration device of claim 12, wherein: the transformer acceleration device is further configured to operate based on a second activation vector, the second activation vector being generated based on the first attention vector, the second memory bank array is further configured to store a second plurality of weight matrices, the first memory bank array is further configured to store a second plurality of preceding key-value vector pairs corresponding to the first plurality of preceding tokens, respectively, the first processing circuit is further configured to generate a second key-value vector pair based on the second activation vector and the second plurality of weight matrices, and the second processing circuit is further configured to perform a second attention calculation based on the second key-value vector pair and the second plurality of preceding key-value vector pairs.

14. The transformer acceleration device of claim 13, wherein the second processing circuit is configured to store the second key-value vector pair in the first memory bank array.

15. The transformer acceleration device of claim 11, wherein the transformer acceleration device is further configured to generate a first output token corresponding to the first input token and the first plurality of preceding tokens based on the first attention vector.

16. An operation method of a transformer acceleration device including a plurality of memory bank arrays, the method comprising: generating a first activation vector based on a first input token; reading a first plurality of weight matrices from a first memory bank array, the first memory bank array being a memory bank array of the plurality of memory bank arrays; generating a first key-value vector pair based on the first plurality of weight matrices and the first activation vector; storing the first key-value vector pair in a second memory bank array, the second memory bank array being a memory bank array, of the plurality of memory bank arrays, different from the first memory bank array; generating a second activation vector based on the first key-value vector pair; and generating a first output token corresponding to the first input token based on the second activation vector.

17. The operation method of cl
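For readers unfamiliar with key-value caching, the decode step recited in claims 10 to 12 can be approximated in software as below. This is a minimal sketch, not the claimed circuitry: it assumes standard scaled dot-product attention with a query projection and a softmax, neither of which the claim text spells out, and `matvec`, `decode_step`, `w_q`, `w_k`, `w_v`, and `kv_cache` are hypothetical names. Here `kv_cache` plays the role of the second memory bank array holding the preceding key-value vector pairs.

```python
import math


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def decode_step(activation, w_q, w_k, w_v, kv_cache):
    """One attention step for a single input token.

    kv_cache is a list of (key, value) pairs for the preceding tokens.
    The query projection and softmax are assumptions, not claim language.
    """
    query = matvec(w_q, activation)
    key = matvec(w_k, activation)      # "first key vector"
    value = matvec(w_v, activation)    # "first value vector"

    keys = [k for k, _ in kv_cache] + [key]
    values = [v for _, v in kv_cache] + [value]

    # One attention score per key (preceding keys plus the new key).
    scale = 1.0 / math.sqrt(len(query))
    logits = [scale * sum(q * k for q, k in zip(query, kv)) for kv in keys]
    peak = max(logits)                 # subtract the max for numerical stability
    exps = [math.exp(logit - peak) for logit in logits]
    total = sum(exps)
    scores = [e / total for e in exps]

    # Attention vector: value vectors accumulated with the scores as weights.
    attention = [
        sum(s * v[i] for s, v in zip(scores, values))
        for i in range(len(value))
    ]

    # Store the new pair so later tokens can attend to it (compare claim 12).
    kv_cache.append((key, value))
    return attention, scores


identity = [[1.0, 0.0], [0.0, 1.0]]
cache = []
attention, scores = decode_step([1.0, 2.0], identity, identity, identity, cache)
print(attention, scores, len(cache))
```

With an empty cache the only score belongs to the new token itself, so the attention vector equals its value vector. Each later call adds one more cached pair and one more score.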
Auto-encoder networks; Encoder-decoder networks · CPC title
using electronic means · CPC title
Generative networks · CPC title
Combinations of networks · CPC title
Details of memory controller · CPC title