Coupling wide memory interface to wide write back paths

US11726912B2 · US · B2

Patent metadata
Publication number: US-11726912-B2
Application number: US-202117216563-A
Country: US
Kind code: B2
Filing date: Mar 29, 2021
Priority date: Jan 30, 2018
Publication date: Aug 15, 2023
Grant date: Aug 15, 2023

Abstract

Systems and methods are disclosed for performing wide memory operations for a wide data cache line. In some examples of the disclosed technology, a processor having two or more execution lanes includes a data cache coupled to memory, a wide memory load circuit that concurrently loads two or more words from a cache line of the data cache, and a writeback circuit situated to send a respective word of the concurrently-loaded words to a selected execution lane of the processor, either into an operand buffer or bypassing the operand buffer. In some examples, a sharding circuit is provided that allows bitwise, byte-wise, and/or word-wise manipulation of memory operation data. In some examples, wide cache loads allow for concurrent execution of plural execution lanes of the processor.
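
As a rough illustration of the abstract, the Python sketch below models a wide load: one cache line is split into words, and each word is routed to its own execution lane, optionally after a word-wise rotate standing in for the sharding circuit. The names `wide_load`, `rotate_words`, and `WORD_BYTES`, and the 4-byte little-endian word size, are assumptions made for illustration and are not taken from the patent.

```python
from __future__ import annotations

WORD_BYTES = 4  # assumed word size; not specified in the text above


def wide_load(cache_line: bytes, num_lanes: int) -> list[int]:
    """Split one cache line into num_lanes words, one per execution lane."""
    if len(cache_line) < num_lanes * WORD_BYTES:
        raise ValueError("cache line too short for the requested lanes")
    return [
        int.from_bytes(cache_line[i * WORD_BYTES:(i + 1) * WORD_BYTES], "little")
        for i in range(num_lanes)
    ]


def rotate_words(words: list[int], amount: int) -> list[int]:
    """Word-wise rotate: lane i receives the word originally bound for lane i + amount."""
    n = len(words)
    return [words[(i + amount) % n] for i in range(n)]


# Build a 16-byte line holding four words, load it, then rotate by one lane.
line = b"".join(w.to_bytes(WORD_BYTES, "little") for w in (10, 20, 30, 40))
lanes = wide_load(line, num_lanes=4)
sharded = rotate_words(lanes, 1)
```

In this toy model all four words arrive in a single call, which is meant to mirror the concurrent load described in the abstract; real hardware would do this with parallel datapaths rather than a loop.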

First claim

What is claimed is:

1. A method of operating a processor, the method comprising: receiving object code for an instruction group; scheduling one or more operations specified in the instruction group to be executed by two or more execution lanes of a processor core; and executing the scheduled operations by the processor, the executing comprising: performing a memory operation for a cache line of a data cache, each of plural words of the cache line memory operation being associated with one of the execution lanes of the processor core; and performing sharding operations for the plural words (a) after loading the words when performing the memory operation, or (b) before storing the words in the data cache when performing the memory operation.

2. The method of claim 1, wherein the memory operation is a store operation, and wherein the performing the memory operation comprises sending each of respective plural words from its associated execution lane to the same cache line of the data cache for writing to memory coupled to the processor.

3. The method of claim 1, wherein the scheduling comprises assigning one of the execution lanes as a leader lane, and wherein the remaining execution lanes concurrently follow the leader lane when executing the scheduled operations.

4. The method of claim 1, wherein the operations are scheduled based on arrangement of instructions in the instruction group, instruction identifiers encoded in the instruction group, or dependencies encoded in instructions in the instruction group.

5. The method of claim 1, further comprising scheduling a multiply operation for calculating an inner product in the instruction group prior to scheduling an add operation for calculating the inner product in the instruction group.

6. The method of claim 1, wherein the sharding operations comprise at least one of: shift, rotate, reverse, move, swap, transpose, extract, or extend.

7. The method of claim 1, wherein the scheduling is performed responsive to identifying a vector instruction in the instruction group.

8. The method of claim 1, wherein each of the execution lanes comprises a distinct at least one of: an integer arithmetic and logic unit (ALU), an adder, a subtractor, a multiplier, a divider, a shifter, a rotator, or a floating point unit (FPU).

9. The method of claim 1, wherein each of the execution lanes is configurable to execute a respective context distinct from a context of any other execution lane.

10. A method of operating a processor, the method comprising: receiving object code for an instruction group; scheduling one or more operations specified in the instruction group to be executed by two or more execution lanes of a processor core; and executing the scheduled operations by the processor, the executing comprising: performing a first load operation for a cache line of a data cache, each of plural words of the cache line memory operation being associated with one of the execution lanes of the processor core and being stored in an operand buffer, and performing a second load operation, each of plural words for the second load operation not being stored in the operand buffer but being immediately combined with a result calculated based on the plural words stored in the operand buffer.

11. An apparatus, comprising: a data cache coupled to memory, the data cache having at least one cache line and providing plural output words from the cache line; an operand buffer; and a plurality of execution lanes of a processor core, each of the plurality of execution lanes being configured to receive a different word of the plural output words; the processor core being configured to: store plural words for a first load operation in the operand buffer coupled to the execution lanes, and immediately combine plural words for a second load operation with a result calculated based on the plural words stored in the operand buffer.

12. The apparatus of claim 11, wherein each of the execution lanes is configured to send a respective word to a same cache line of the data cache for writing to the memory.

13. The apparatus of claim 11, wherein one of the execution lanes is assigned to be a leader lane, and wherein at least one of the remaining execution lanes concurrently follows the leader lane when executing the scheduled operations.

14. The apparatus of claim 11, further comprising: an operand buffer; additional multiplexer logic configured to select either a high portion of words from the operand buffer or a low portion of words from the operand buffer; and wherein the processor core is configured to use only one half of the execution lanes during a first clock cycle and to use only one half of the execution lanes in a second clock cycle subsequent to the first clock cycle.

15. The apparatus of claim 11, further comprising a writeback path adapted to select and send an output word from at least one execution lane to an input word of at least one other execution lane.

16. An apparatus, comprising: a plurality of execution lanes; means for receiving object code for at least one instruction group; means for scheduling one or more operations specified in the at least one instruction group; means for executing the scheduled operations by performing a memory operation for a cache line of a data cache; and sharding means for performing word swap operations with output of the execution lanes.

17. The apparatus of claim 16, wherein the means for executing the scheduled operations associates each of plural words of the cache line memory operation with one of the execution lanes of the processor core.

18. The apparatus of claim 16, further comprising: bypass means for bypassing an operand buffer coupled to the execution lanes and sending a word directly to a selected one of the execution lanes.
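
Claims 5, 10, and 11 together suggest an inner-product flow: words from a first wide load are held in the operand buffer, while words from a second load bypass the buffer and are combined immediately. The Python sketch below is one possible reading under stated assumptions. `LaneArray`, `load_to_buffer`, `load_and_combine`, and `inner_product` are hypothetical names, the "result calculated based on" the buffered words is simplified to the buffered word itself, and the per-lane multiply followed by an add reduction is an illustrative choice rather than the claimed circuit.

```python
from __future__ import annotations


class LaneArray:
    """Toy model of execution lanes sharing one operand buffer (hypothetical)."""

    def __init__(self, num_lanes: int) -> None:
        self.num_lanes = num_lanes
        self.operand_buffer: list[int] = [0] * num_lanes

    def load_to_buffer(self, words: list[int]) -> None:
        """First load: each lane's word is stored in the operand buffer."""
        self._check(words)
        self.operand_buffer = list(words)

    def load_and_combine(self, words: list[int]) -> list[int]:
        """Second load: words skip the buffer and are multiplied at once
        with the buffered operand of the same lane."""
        self._check(words)
        return [b * w for b, w in zip(self.operand_buffer, words)]

    def _check(self, words: list[int]) -> None:
        if len(words) != self.num_lanes:
            raise ValueError("expected one word per lane")


def inner_product(a: list[int], b: list[int]) -> int:
    """Multiply per lane first, then add the lane results (ordering as in claim 5)."""
    lanes = LaneArray(len(a))
    lanes.load_to_buffer(a)               # first load: buffered
    products = lanes.load_and_combine(b)  # second load: bypasses the buffer
    return sum(products)


result = inner_product([1, 2, 3, 4], [5, 6, 7, 8])
```

Note that the second load never touches `operand_buffer`, so the buffered operands remain available for later operations; that is the property the bypass path is modeled on here.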

Assignees

Microsoft Technology Licensing, LLC

Classifications

  • using a mask

  • Instruction completion, e.g. retiring, committing or graduating

  • Result writeback, i.e. updating the architectural state or memory

  • controlled by a single instruction for multiple data lanes [SIMD]

  • Instructions to perform operations on packed data, e.g. vector, tile or matrix operations

Frequently asked questions

What does patent US11726912B2 cover?
Systems and methods are disclosed for performing wide memory operations for a wide data cache line. In some examples of the disclosed technology, a processor having two or more execution lanes includes a data cache coupled to memory, a wide memory load circuit that concurrently loads two or more words from a cache line of the data cache, and a writeback circuit situated to send a respective wor…
Who is the assignee on this patent?
Microsoft Technology Licensing, LLC
What technology area does this patent fall under?
Primary CPC classification G06F9/30036. Mapped technology areas include Physics.
When was this patent published?
Publication date Aug 15, 2023 (kind code B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 12 related publications on this page (citations in our corpus or others sharing the same primary CPC).