Apparatus and method for low-latency invocation of accelerators
US-2016246597-A1 · Aug 25, 2016 · US
US10140129B2 · US · B2
| Field | Value |
|---|---|
| Publication number | US-10140129-B2 |
| Application number | US-201213730719-A |
| Country | US |
| Kind code | B2 |
| Filing date | Dec 28, 2012 |
| Priority date | Dec 28, 2012 |
| Publication date | Nov 27, 2018 |
| Grant date | Nov 27, 2018 |
Official abstract text for this publication.
A processor having one or more processing cores is described. Each of the one or more processing cores has front end logic circuitry and a plurality of processing units. The front end logic circuitry is to fetch respective instructions of threads and decode the instructions into respective micro-code and input operand and resultant addresses of the instructions. Each of the plurality of processing units is to be assigned at least one of the threads, is coupled to said front end unit, and has a respective buffer to receive and store microcode of its assigned at least one of the threads. Each of the plurality of processing units also comprises: i) at least one set of functional units corresponding to a complete instruction set offered by the processor, the at least one set of functional units to execute its respective processing unit's received microcode; ii) registers coupled to the at least one set of functional units to store operands and resultants of the received microcode; iii) data fetch circuitry to fetch input operands for the at least one functional units' execution of the received microcode.
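The arrangement in the abstract (one shared front end feeding several processing units, each with its own buffer, registers, and data fetch path) can be sketched as a toy software model. This is only an illustrative sketch, not the patented circuitry: the names `DecodedInstruction`, `ProcessingUnit`, `FrontEnd`, and `demo`, the `"op dst src src"` instruction text format, and the two micro-ops `add` and `mul` are all assumptions made for the example.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class DecodedInstruction:
    thread_id: int
    micro_op: str          # stand-in for decoded microcode (hypothetical)
    operand_addrs: tuple   # input operand addresses
    result_addr: int       # resultant address


@dataclass
class ProcessingUnit:
    unit_id: int
    buffer: deque = field(default_factory=deque)   # per-unit microcode buffer
    registers: dict = field(default_factory=dict)  # resultant address -> value

    def step(self, memory):
        """Execute the oldest buffered instruction, in order. Returns False when idle."""
        if not self.buffer:
            return False
        ins = self.buffer.popleft()
        # Data fetch: prefer a value already held in this unit's registers,
        # otherwise read it from memory (assumed policy for this sketch).
        operands = [self.registers.get(a, memory.get(a, 0)) for a in ins.operand_addrs]
        if ins.micro_op == "add":
            value = sum(operands)
        elif ins.micro_op == "mul":
            value = 1
            for x in operands:
                value *= x
        else:
            raise ValueError(f"unknown micro-op: {ins.micro_op}")
        self.registers[ins.result_addr] = value
        return True


class FrontEnd:
    """Shared front end: decodes instructions and routes them to the assigned unit."""

    def __init__(self, units, thread_to_unit):
        self.units = units
        self.thread_to_unit = thread_to_unit  # thread id -> index into units

    def fetch_and_decode(self, thread_id, text):
        op, dst, *srcs = text.split()
        decoded = DecodedInstruction(thread_id, op, tuple(int(s) for s in srcs), int(dst))
        self.units[self.thread_to_unit[thread_id]].buffer.append(decoded)
        return decoded


def demo():
    memory = {0: 2, 1: 3, 2: 4}
    units = [ProcessingUnit(0), ProcessingUnit(1)]
    front_end = FrontEnd(units, {10: 0, 11: 1})
    front_end.fetch_and_decode(10, "add 100 0 1")
    front_end.fetch_and_decode(11, "mul 200 1 2")
    front_end.fetch_and_decode(10, "mul 101 100 2")
    # Step every unit each round; the list forces all units to run before any().
    while any([unit.step(memory) for unit in units]):
        pass
    return units


if __name__ == "__main__":
    for unit in demo():
        print(unit.unit_id, unit.registers)
```

Each unit drains its own buffer strictly in order, so the model needs no out-of-order or speculative machinery; the only shared component is the front end.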
Opening claim text (preview).
What is claimed is:
1. A processor having one or more processing cores, each of said one or more processing cores comprising: a front end unit to fetch respective instructions of threads and decode said instructions into respective decoded instructions and input operand and resultant addresses of said instructions; and a plurality of processing units, each of said processing units to be assigned a plurality of said threads, each processing unit coupled to said front end unit and having a respective buffer to receive and store decoded instructions of its assigned plurality of said threads, each of said plurality of processing units comprising: i) a plurality of functional units comprising at least one integer functional unit and at least one floating point functional unit, said plurality of functional units to simultaneously execute its respective processing unit's received, decoded instructions for two or more of its assigned plurality of said threads, ii) registers coupled to said plurality of functional units to store operands and resultants of said received, decoded instructions of its assigned plurality of said threads, iii) data fetch circuitry to fetch input data operands for said plurality of functional units' execution of said received, decoded instructions of its assigned plurality of said threads, and iv) register allocation circuitry to allocate a respective register partition of the registers for each assigned thread of its assigned plurality of said threads.
2. The processor of claim 1 wherein said plurality of functional units are not coupled to any logic circuitry to perform out-of-order execution of said received, decoded instructions.
3. The processor of claim 1 wherein the register allocation circuitry of each of the plurality of processing units is to allocate the respective register partition of the registers for each assigned thread of its assigned plurality of said threads that are to be concurrently executed.
4. The processor of claim 1 wherein said plurality of functional units are not coupled to any logic circuitry to perform speculative execution of said received, decoded instructions.
5. The processor of claim 4 wherein the register allocation circuitry of each of the plurality of processing units is to allocate the respective register partition of the registers for each assigned thread of its assigned plurality of said threads that are to be concurrently executed.
6. The processor of claim 1 wherein said processor does not include circuitry for any of said threads to issue instructions in parallel for any one of said threads.
7. The processor of claim 1 wherein each of the plurality of processing units further comprise register allocation circuitry to allocate a register partition of less than all of the registers for each assigned thread.
8. A method performed by a processor comprising: fetching respective instructions of threads with a front end unit of the processor; decoding said instructions into respective decoded instructions and input operand and resultant addresses of said instructions with the front end unit of the processor; assigning a plurality of said threads to each of a plurality of processing units of a processing core of the processor, each processing unit coupled to said front end unit and having a respective buffer to receive and store decoded instructions of its assigned plurality of said threads; simultaneously executing each respective processing unit's received, decoded instructions for two or more of its assigned plurality of threads with a plurality of functional units of each respective processing unit, the plurality of functional units comprising at least one integer functional unit and at least one floating point functional unit; storing operands and resultants of said received, decoded instructions of its assigned plurality of said threads in registers coupled to said plurality of functional units; fetching input data operands with data fetch circuitry of each respective processing unit for said plurality of functional units' execution of said received, decoded instructions of its assigned plurality of said threads; and allocating a respective register partition of the registers, with register allocation circuitry of each processing unit, for each assigned thread of its assigned plurality of said threads.
9. The method of claim 8 further comprising, at each processing unit performing the following: allocating the respective register partition of the registers for each assigned thread of its assigned plurality of said threads that are to be concurrently executed.
10. The method of claim 8 wherein software assigns a first thread to a first of the plurality of processing units and a second thread to a second of the plurality of processing units.
11. The method of claim 10 wherein said first and second threads are not processed with any speculative execution logic circuitry.
12. The method of claim 10 wherein said first and second threads are not processed with any out-of-order execution logic circuitry.
13. The method of claim 10 wherein said first and second threads do not issue their respective instructions in parallel.
14. A processor comprising: at least two processing cores each having: a front end unit to fetch respective instructions of threads to be processed by its processing core and decode said instructions into respective decoded instructions and input operand and resultant addresses of said instructions; said front end unit coupled to a plurality of processing units of its processing core, each of said plurality of processing units to be assigned a plurality of said threads, each processing unit coupled to said front end unit and having a respective buffer to receive and store decoded instructions and each processing unit to receive input operand and resultant addresses of its assigned plurality of said threads from the front end unit, each of said plurality of processing units comprising: i) a plurality of functional units comprising at least one integer functional unit and at least one floating point functional unit, said plurality of functional units to simultaneously execute its respective processing unit's received, decoded instructions for two or more of its assigned plurality of said threads, ii) registers coupled to said plurality of functional units to store operands and resultants of said received, decoded instructions of its assigned plurality of said threads, iii) data fetch circuitry to fetch input operands for said plurality of functional units' execution of said received, decoded instructions of its assigned plurality of said threads, and iv) register allocation circuitry to allocate a respective register partition of the registers for each assigned thread of its assigned plurality of said threads; an interconnection network coupled to said plurality of processing units; and a cache coupled to said interconnection network.
15. The processor of claim 14 wherein said plurality of functional units are not coupled to any logic circuitry to perform out-of-order execution of said received, decoded instructions.
16. The processor of claim 15 wherein the register allocation circuitry of each of the plurality of processing units is to allocate the respective register partition of the registers for each assigned thread of its assigned plurality of said threads that are to be concurrently executed.
17. The processor of claim 14 wherein said plurality of functional units are not coupled to any logic circuitry to perform speculative execution of said received, decoded instr
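The per-thread register partitioning recited in element iv) of claims 1 and 14 can be illustrated with a small sketch. The claims do not say how partitions are sized or indexed, so the equal, contiguous split below, the class name `RegisterAllocator`, and the `physical_index` helper are all assumptions made purely for illustration.

```python
class RegisterAllocator:
    """Toy model: splits one register file into disjoint per-thread partitions."""

    def __init__(self, num_registers):
        self.num_registers = num_registers
        self.partitions = {}  # thread id -> range of physical register indices

    def allocate(self, thread_ids):
        # Assumed policy: equal-sized contiguous partitions, one per assigned thread.
        if not thread_ids:
            raise ValueError("at least one thread is required")
        size = self.num_registers // len(thread_ids)
        if size == 0:
            raise ValueError("more threads than registers")
        self.partitions = {
            tid: range(i * size, (i + 1) * size)
            for i, tid in enumerate(thread_ids)
        }
        return self.partitions

    def physical_index(self, thread_id, logical_reg):
        # Map a thread-local register number into that thread's partition.
        partition = self.partitions[thread_id]
        if not 0 <= logical_reg < len(partition):
            raise IndexError("register outside this thread's partition")
        return partition[logical_reg]


if __name__ == "__main__":
    allocator = RegisterAllocator(32)
    allocator.allocate([7, 9])
    print(allocator.physical_index(7, 3), allocator.physical_index(9, 3))
```

With two or more assigned threads, each partition covers less than all of the registers, which is the situation claim 7 describes; two threads using the same logical register number land on different physical registers.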
according to context, e.g. thread buffers · CPC title
from multiple instruction streams, e.g. multistreaming · CPC title
Decoding for concurrent execution · CPC title
organised in groups of units sharing resources, e.g. clusters · CPC title
Instruction prefetching · CPC title