Broadcast channel architectures for block-based processors
US-2017083335-A1 · Mar 23, 2017 · US
US11726912B2 · US · B2
| Field | Value |
|---|---|
| Publication number | US-11726912-B2 |
| Application number | US-202117216563-A |
| Country | US |
| Kind code | B2 |
| Filing date | Mar 29, 2021 |
| Priority date | Jan 30, 2018 |
| Publication date | Aug 15, 2023 |
| Grant date | Aug 15, 2023 |
Official abstract text for this publication.
Systems and methods are disclosed for performing wide memory operations for a wide data cache line. In some examples of the disclosed technology, a processor having two or more execution lanes includes a data cache coupled to memory, a wide memory load circuit that concurrently loads two or more words from a cache line of the data cache, and a writeback circuit situated to send a respective word of the concurrently-loaded words to a selected execution lane of the processor, either into an operand buffer or bypassing the operand buffer. In some examples, a sharding circuit is provided that allows bitwise, byte-wise, and/or word-wise manipulation of memory operation data. In some examples, wide cache loads allow for concurrent execution of plural execution lanes of the processor.
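The data flow described in the abstract can be pictured with a small software model. The following is a minimal sketch only, not the patented circuit: the names `Lane`, `wide_load`, and `rotate_words`, and the four-word line width in the usage example, are assumptions made for illustration. It shows one word of a cache line going to each lane, an optional word-wise sharding step, and the choice between writing to an operand buffer and bypassing it.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class Lane:
    """Hypothetical model of one execution lane."""
    operand_buffer: List[int] = field(default_factory=list)
    bypass_input: Optional[int] = None  # word handed straight to the lane


def rotate_words(words: List[int], n: int) -> List[int]:
    """Word-wise rotate left by n positions (one possible sharding step)."""
    n %= len(words)
    return words[n:] + words[:n]


def wide_load(
    cache_line: List[int],
    lanes: List[Lane],
    shard: Optional[Callable[[List[int]], List[int]]] = None,
    bypass: bool = False,
) -> List[int]:
    """Deliver word i of the cache line to lane i in a single step.

    If `shard` is given, it rearranges the words after they are loaded.
    If `bypass` is True, each word skips the operand buffer.
    """
    if len(cache_line) != len(lanes):
        raise ValueError("this sketch assumes one word per lane")
    words = list(cache_line)
    if shard is not None:
        words = shard(words)
    for lane, word in zip(lanes, words):
        if bypass:
            lane.bypass_input = word
        else:
            lane.operand_buffer.append(word)
    return words


# Usage: a buffered load followed by a rotated, bypassing load.
lanes = [Lane() for _ in range(4)]
wide_load([10, 20, 30, 40], lanes)
wide_load([1, 2, 3, 4], lanes, shard=lambda w: rotate_words(w, 1), bypass=True)
```

In hardware all lanes would receive their words concurrently; the Python loop only models which word ends up where.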
Opening claim text (preview).
What is claimed is:
1. A method of operating a processor, the method comprising: receiving object code for an instruction group; scheduling one or more operations specified in the instruction group to be executed by two or more execution lanes of a processor core; and executing the scheduled operations by the processor, the executing comprising: performing a memory operation for a cache line of a data cache, each of plural words of the cache line memory operation being associated with one of the execution lanes of the processor core; and performing sharding operations for the plural words (a) after loading the words when performing the memory operation, or (b) before storing the words in the data cache when performing the memory operation.
2. The method of claim 1, wherein the memory operation is a store operation, and wherein the performing the memory operation comprises sending each of respective plural words from its associated execution lane to the same cache line of the data cache for writing to memory coupled to the processor.
3. The method of claim 1, wherein the scheduling comprises assigning one of the execution lanes as a leader lane, and wherein the remaining execution lanes concurrently follow the leader lane when executing the scheduled operations.
4. The method of claim 1, wherein the operations are scheduled based on arrangement of instructions in the instruction group, instruction identifiers encoded in the instruction group, or dependencies encoded in instructions in the instruction group.
5. The method of claim 1, further comprising scheduling a multiply operation for calculating an inner product in the instruction group prior to scheduling an add operation for calculating the inner product in the instruction group.
6. The method of claim 1, wherein the sharding operations comprise at least one of: shift, rotate, reverse, move, swap, transpose, extract, or extend.
7. The method of claim 1, wherein the scheduling is performed responsive to identifying a vector instruction in the instruction group.
8. The method of claim 1, wherein each of the execution lanes comprises a distinct at least one of: an integer arithmetic and logic unit (ALU), an adder, a subtractor, a multiplier, a divider, a shifter, a rotator, or a floating point unit (FPU).
9. The method of claim 1, wherein each of the execution lanes is configurable to execute a respective context distinct from a context of any other execution lane.
10. A method of operating a processor, the method comprising: receiving object code for an instruction group; scheduling one or more operations specified in the instruction group to be executed by two or more execution lanes of a processor core; and executing the scheduled operations by the processor, the executing comprising: performing a first load operation for a cache line of a data cache, each of plural words of the cache line memory operation being associated with one of the execution lanes of the processor core and being stored in an operand buffer, and performing a second load operation, each of plural words for the second load operation not being stored in the operand buffer but being immediately combined with a result calculated based on the plural words stored in the operand buffer.
11. An apparatus, comprising: a data cache coupled to memory, the data cache having at least one cache line and providing plural output words from the cache line; an operand buffer; and a plurality of execution lanes of a processor core, each of the plurality of execution lanes being configured to receive a different word of the plural output words; the processor core being configured to: store plural words for a first load operation in the operand buffer coupled to the execution lanes, and immediately combine plural words for a second load operation with a result calculated based on the plural words stored in the operand buffer.
12. The apparatus of claim 11, wherein each of the execution lanes is configured to send a respective word to a same cache line of the data cache for writing to the memory.
13. The apparatus of claim 11, wherein one of the execution lanes is assigned to be a leader lane, and wherein at least one of the remaining execution lanes concurrently follows the leader lane when executing the scheduled operations.
14. The apparatus of claim 11, further comprising: an operand buffer; additional multiplexer logic configured to select either a high portion of words from the operand buffer or a low portion of words from the operand buffer; and wherein the processor core is configured to use only one half of the execution lanes during a first clock cycle and to use only one half of the execution lanes in a second clock cycle subsequent to the first clock cycle.
15. The apparatus of claim 11, further comprising a writeback path adapted to select and send an output word from at least one execution lane to an input word of at least one other execution lane.
16. An apparatus, comprising: a plurality of execution lanes; means for receiving object code for at least one instruction group; means for scheduling one or more operations specified in the at least one instruction group; means for executing the scheduled operations by performing a memory operation for a cache line of a data cache; and sharding means for performing word swap operations with output of the execution lanes.
17. The apparatus of claim 16, wherein the means for executing the scheduled operations associates each of plural words of the cache line memory operation with one of the execution lanes of the processor core.
18. The apparatus of claim 16, further comprising: bypass means for bypassing an operand buffer coupled to the execution lanes and sending a word directly to a selected one of the execution lanes.
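The two-load pattern recited in claims 10 and 11 can be illustrated with a short software model. This is a hedged sketch under stated assumptions, not the claimed hardware: the function names, the per-lane scale factor, and the choice of addition as the "combine" step are all hypothetical, since the claims do not name a specific arithmetic operation. The first load is written into an operand buffer, a per-lane result is computed from the buffered words, and the words of the second load are combined with that result without ever being buffered.

```python
from typing import List


def first_load(cache_line: List[int], operand_buffer: List[int]) -> None:
    """Buffered load: word i is stored in lane i's operand buffer slot."""
    operand_buffer[:] = list(cache_line)


def second_load_combine(cache_line: List[int], partial: List[int]) -> List[int]:
    """Bypassing load: word i is combined with lane i's result on arrival.

    The incoming words are never written to the operand buffer.
    Addition is an assumed combine operation for this sketch.
    """
    return [p + w for p, w in zip(partial, cache_line)]


def scaled_add(line_x: List[int], line_y: List[int], scale: int) -> List[int]:
    """Compute scale * x[i] + y[i] per lane using the two-load pattern."""
    operand_buffer = [0] * len(line_x)
    first_load(line_x, operand_buffer)
    # Result calculated from the words held in the operand buffer.
    partial = [scale * x for x in operand_buffer]
    # The second load skips the buffer and is combined immediately.
    return second_load_combine(line_y, partial)


result = scaled_add([1, 2, 3, 4], [10, 20, 30, 40], scale=2)
```

Each list position stands for one execution lane; in the claimed apparatus the lanes would operate concurrently rather than in a Python loop.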
using a mask · CPC title
Instruction completion, e.g. retiring, committing or graduating · CPC title
Result writeback, i.e. updating the architectural state or memory · CPC title
controlled by a single instruction for multiple data lanes [SIMD] · CPC title
Instructions to perform operations on packed data, e.g. vector, tile or matrix operations · CPC title