Efficient performance of inner loops on a multi-lane processor

US10936320B1 · US · B1

Patent metadata
Publication number: US-10936320-B1
Application number: US-201916543540-A
Country: US
Kind code: B1
Filing date: Aug 17, 2019
Priority date: Aug 17, 2019
Publication date: Mar 2, 2021
Grant date: Mar 2, 2021

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

A processor core and methods for managing the processor core. The processor core comprises a plurality of lanes, each lane comprising a copy of a register file logically shared across the plurality of lanes and a plurality of functional units, at least two of the functional units sharing a common cache and a common control unit, where the common control unit concurrently dispatches multiple consecutive instances of an instruction corresponding to multiple successive instances of an inner loop to the plurality of functional units of at least a proper subset of the plurality of lanes; and one or more registers of each copy of the register file, each register being configurable to write a data result from at least one of the functional units to a register in a lane-local mode, a lane-forward mode, and a normal mode.
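
The three register write modes named in the abstract can be illustrated with a small software model. This is only a sketch under stated assumptions, not the patented hardware: the class and method names (`LaneRegisterFiles`, `write`, `WriteMode`) are hypothetical, all lanes are treated as the active subset, and the wrap-around from the last lane to lane 0 in lane-forward mode is an assumption the page does not confirm.

```python
from enum import Enum


class WriteMode(Enum):
    LANE_LOCAL = "lane-local"      # result stays in the producing lane
    LANE_FORWARD = "lane-forward"  # result goes to the next lane
    NORMAL = "normal"              # result goes to every active lane


class LaneRegisterFiles:
    """Toy model: one register-file copy per lane (sizes are hypothetical)."""

    def __init__(self, num_lanes, num_regs):
        self.num_lanes = num_lanes
        self.regs = [[0] * num_regs for _ in range(num_lanes)]

    def write(self, lane, reg, value, mode):
        if mode is WriteMode.LANE_LOCAL:
            targets = [lane]
        elif mode is WriteMode.LANE_FORWARD:
            # Assumption: the last lane forwards to lane 0.
            targets = [(lane + 1) % self.num_lanes]
        else:
            targets = range(self.num_lanes)
        for target in targets:
            self.regs[target][reg] = value


rf = LaneRegisterFiles(num_lanes=4, num_regs=3)
rf.write(lane=1, reg=0, value=10, mode=WriteMode.LANE_LOCAL)
rf.write(lane=1, reg=1, value=20, mode=WriteMode.LANE_FORWARD)
rf.write(lane=1, reg=2, value=30, mode=WriteMode.NORMAL)
```

In this model, register 0 changes only in lane 1, register 1 changes only in lane 2, and register 2 changes in all four lanes, which mirrors how claim 4 assigns iteration-private values, values consumed by the next iteration, and values shared by many iterations to the three modes.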

First claim

Opening claim text (preview).

What is claimed is:

1. A processor core comprising: a plurality of lanes, each lane comprising a copy of a register file logically shared across the plurality of lanes and a plurality of functional units, each copy of the register file having one or more registers, at least two of the functional units sharing a common cache and a common control unit, where: the common control unit concurrently dispatches multiple consecutive instances of an instruction corresponding to multiple successive instances of an inner loop to the plurality of functional units of at least a proper subset of the plurality of lanes; and each register is configurable to write a data result from at least one of the functional units to a register of a same lane in a lane-local mode, to write the data result to a register of a next lane in a lane-forward mode, and to write a data result to corresponding registers of all of the lanes of the proper subset of the plurality of lanes in a normal mode.

2. The processor core of claim 1, wherein the dispatched instructions are performed asynchronously by the corresponding functional units based on readiness of an input operand of the instruction.

3. The processor core of claim 2, the processor core further comprising one or more instruction mask registers for selecting a subset of the plurality of functional units to perform the dispatched instruction.

4. The processor core of claim 1, wherein a value defined and used only in one iteration of the inner loop that produced the value is stored in a lane-local register, the lane-local register being configured for the lane-local mode; wherein a value defined in a first given iteration of the inner loop and used only in a next iteration of the inner loop is stored in a lane-forward register, the lane-forward register being configured for the lane-forward mode; and wherein a value defined in a second given iteration of the inner loop and used in a plurality of other iterations of the inner loop is stored in a normal register, the normal register being configured for the normal mode.

5. The processor core of claim 1, wherein an instruction tag accompanies each instruction or IOP and indicates that the corresponding instruction is being broadcast to all of the plurality of lanes.

6. The processor core of claim 5, wherein the instruction tag acts as a mask indicating which of the plurality of lanes are to perform the broadcast instruction.

7. The processor core of claim 5, wherein an instruction completion table (ICT) indicates the instructions which are complete and an ICT tag indicates a count of lanes that received the corresponding instruction.

8. The processor core of claim 7, wherein the count in the ICT of lanes that received the instruction is decremented each time a performance of the corresponding broadcast instruction is completed by one of the plurality of lanes.

9. A method of performing an inner loop on a processor core, the method comprising: dispatching an instruction simultaneously to one or more functional units of at least a proper subset of lanes of the processor core, each lane having a copy of a register file logically shared across the plurality of lanes; configuring one or more registers of each copy of the register file to write a data result from at least one functional unit to one of the registers of a same lane in a lane-local mode, to write the data result to a register of a next lane in a lane-forward mode, or to write the data result to corresponding registers of all lanes of the proper subset of the lanes in a normal mode; and performing the instruction via the one or more functional units.

10. The method of claim 9, wherein the dispatched instruction is performed asynchronously based on readiness of an input operand of the instruction.

11. The method of claim 10, wherein one or more instruction mask registers are configured to select a subset of functional units to perform the dispatched instruction.

12. The method of claim 9, wherein a value defined and used only in an iteration of the inner loop that produced the value is stored in a lane-local register, the lane-local register being configured for the lane-local mode; wherein a value defined in a first given iteration of the inner loop and used only in a next iteration of the inner loop is stored in a lane-forward register, the lane-forward register being configured for the lane-forward mode; and wherein a value defined in a second given iteration of the inner loop and used in a plurality of other iterations of the inner loop is stored in a normal register, the normal register being configured for the normal mode.

13. The method of claim 9, wherein an instruction tag accompanies each instruction or IOP sent from a front-end to back-end functional units indicating that the instruction is being broadcast to all of the plurality of lanes.

14. The method of claim 13, wherein the instruction tag acts as a mask indicating which of the plurality of lanes are to perform the broadcast instruction.

15. The method of claim 9, wherein an instruction completion table (ICT) indicates the instructions which are complete and an ICT tag indicates a count of lanes that received the corresponding instruction.

16. The method of claim 15, wherein the count in the ICT of lanes that received the instruction is decremented each time a performance of the corresponding broadcast instruction is completed by one of the plurality of lanes.

17. A method for determining a count of loop instances present in an instruction buffer, the method comprising: locating a first instruction address of a first branch instruction in the instruction buffer; locating an address of a next matching instruction in the instruction buffer; establishing a loop length as a difference between the first instruction address and the address of the next matching instruction; counting instructions in one of the loop instances, the counting of instructions comprising: making a copy of the instruction buffer; shifting down one or more instructions of the copy by a loop length; comparing instruction addresses of instructions in the instruction buffer with the shifted instruction addresses in the shifted copy; labeling matches from the comparison as a one and labeling mismatches from the comparison as a zero; and counting contiguous strings of ones in the labels starting from a location of a first branch instruction in the instruction buffer; dividing the count of instructions in the loop instance by the loop length and adding one to the result to compute the count of loop instances; configuring a lane of a processor core for each loop instance based on the count of loop instances; and performing the first instruction via the processor core.

18. The method of claim 17, the method further comprising configuring one or more registers of each copy of a register file of each lane of the processor core to write a data result from at least one functional unit to one of the registers of a same lane in a lane-local mode, to write the data result to a register of a next lane in a lane-forward mode, or to write the data result to corresponding registers of all lanes of the proper subset of the lanes in a normal mode.

19. The method of claim 17, wherein the first instruction is performed asynchronously based on readiness of an input operand of the first instruction.

20. The method of claim 17, wherein one or more instruction mask registers are configured to select a subset of functional units to perform the first instruction.
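
The loop-instance counting steps recited in claim 17 can be sketched in software as follows. This is one possible reading rather than the patented circuit, and it rests on several assumptions: the buffer is modeled as a list of instruction addresses, the position of the first branch is supplied by the caller, the loop length is measured in buffer positions, "shifting down" is taken to mean comparing entry `i` with entry `i + loop_len`, the division is integer division, and a buffer with no repeated branch is treated as a single instance. The function name `count_loop_instances` is hypothetical.

```python
def count_loop_instances(buffer, first_branch_idx):
    """Count loop instances in a list of instruction addresses."""
    branch_addr = buffer[first_branch_idx]

    # Locate the next entry whose address matches the first branch.
    next_match = next(
        (j for j in range(first_branch_idx + 1, len(buffer))
         if buffer[j] == branch_addr),
        None,
    )
    if next_match is None:
        return 1  # assumption: no repetition means a single instance

    # Loop length: distance between the two matching entries.
    loop_len = next_match - first_branch_idx

    # Compare the buffer with a copy shifted by loop_len:
    # label 1 where the addresses match, 0 otherwise.
    labels = [
        1 if i + loop_len < len(buffer) and buffer[i] == buffer[i + loop_len]
        else 0
        for i in range(len(buffer))
    ]

    # Count the contiguous run of ones starting at the first branch.
    run = 0
    for label in labels[first_branch_idx:]:
        if label == 0:
            break
        run += 1

    # Divide by the loop length and add one, as the claim recites.
    return run // loop_len + 1


# Hypothetical buffer: one leading instruction, a three-instruction loop
# body fetched three times (branch at address 108), then an exit address.
buf = [96, 100, 104, 108, 100, 104, 108, 100, 104, 108, 200]
instances = count_loop_instances(buf, first_branch_idx=3)
```

For this buffer the run of ones starting at the first branch has length 4 and the loop length is 3, so the function returns 4 // 3 + 1 = 2. Under the claim, that count would then determine how many lanes are configured, one per loop instance.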

Assignees

Inventors

Classifications

  • controlled by a single instruction for multiple data lanes [SIMD] · CPC title

  • Iterative single instructions for multiple data lanes [SIMD] · CPC title

  • controlled by a single instruction for multiple threads [SIMT] in parallel · CPC title

  • G06F9/3851 (Primary)

    from multiple instruction streams, e.g. multistreaming · CPC title

  • G06F9/3891 (Primary)

    organised in groups of units sharing resources, e.g. clusters · CPC title

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US10936320B1 cover?
A processor core and methods for managing the processor core. The processor core comprises a plurality of lanes, each lane comprising a copy of a register file logically shared across the plurality of lanes and a plurality of functional units, at least two of the functional units sharing a common cache and a common control unit, where the common control unit concurrently dispatches multiple con…
Who is the assignee on this patent?
IBM
What technology area does this patent fall under?
Primary CPC classification G06F9/3851. Mapped technology areas include Physics.
When was this patent published?
Publication date Mar 2, 2021 (B1). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 4 related publications on this page (citations in our corpus or others sharing the same primary CPC).