Related publication: Super single instruction multiple data (super-SIMD) for graphics processing unit (GPU) computing · US-2018121386-A1 · May 3, 2018 · US
US10699366B1 · US · B1
| Field | Value |
|---|---|
| Publication number | US-10699366-B1 |
| Application number | US-201816057794-A |
| Country | US |
| Kind code | B1 |
| Filing date | Aug 7, 2018 |
| Priority date | Aug 7, 2018 |
| Publication date | Jun 30, 2020 |
| Grant date | Jun 30, 2020 |
Official abstract text for this publication.
Techniques are disclosed relating to sharing an arithmetic logic unit (ALU) between multiple threads. In some embodiments, the threads also have dedicated ALUs for other types of operations. In some embodiments, arbitration circuitry is configured to receive operations to be performed by the shared arithmetic logic unit from the set of threads and issue the received operations to the shared arithmetic logic unit. In some embodiments, the arbitration circuitry is configured to switch to a different one of the set of threads for each instruction issued to the shared arithmetic logic unit. In some embodiments, the shared ALU is configured to perform 32-bit operations and the dedicated ALUs are configured to perform the same operations using 16-bit precision. In some embodiments, the shared ALU is shared between two threads and is physically located adjacent to other datapath circuitry for the two threads.
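The arbitration behavior described in the abstract can be illustrated with a small software model. This is only a sketch under stated assumptions, not the patented circuit: the class name `SharedAluArbiter`, its methods, and the strict time-slot alternation policy (a thread with no pending operation simply leaves its slot idle) are all hypothetical choices made for illustration.

```python
from collections import deque


class SharedAluArbiter:
    """Toy model (hypothetical): one issue slot per cycle to a shared ALU,
    with the slot handed to a different thread every cycle."""

    def __init__(self, num_threads=2):
        # One queue of pending shared-ALU operations per thread.
        self.queues = [deque() for _ in range(num_threads)]
        self.turn = 0  # thread that owns the next issue slot

    def submit(self, thread_id, op):
        """A thread hands an operation to the arbiter."""
        self.queues[thread_id].append(op)

    def issue(self):
        """Advance one cycle. Return (thread_id, op) if the slot owner has
        work, otherwise None (assumption: the slot is left idle)."""
        thread_id = self.turn
        self.turn = (self.turn + 1) % len(self.queues)
        if self.queues[thread_id]:
            return thread_id, self.queues[thread_id].popleft()
        return None


arbiter = SharedAluArbiter(num_threads=2)
for op in ("a0", "a1"):
    arbiter.submit(0, op)
for op in ("b0", "b1"):
    arbiter.submit(1, op)

issued = [arbiter.issue() for _ in range(5)]
print(issued)
# [(0, 'a0'), (1, 'b0'), (0, 'a1'), (1, 'b1'), None]
```

Consecutive issued operations always come from different threads, which matches the two-thread case in the abstract. A real design might handle an empty slot differently; the abstract does not say.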
Opening claim text (preview).
What is claimed is:
1. An apparatus, comprising: a shared arithmetic logic unit that is shared for performing operations specified by a set of multiple different threads, wherein the shared arithmetic logic unit is configured to forward a result from a first operation from a given thread for use as an input for a dependent operation from the given thread, wherein the forwarded result is available to the dependent operation after a delay of one or more cycles subsequent to completion of the first operation; a set of arithmetic logic units each configured to perform only operations specified by a thread of the set of threads that is currently assigned to the arithmetic logic unit and not operations from other threads, wherein the arithmetic logic unit is configured to accept an operation to be performed from the assigned thread each clock cycle; and arbitration circuitry configured to: receive operations to be performed by the shared arithmetic logic unit from the set of threads; and issue the received operations to the shared arithmetic logic unit, including, in one or more modes of operation, switching to a different one of the set of threads for each instruction issued to the shared arithmetic logic unit.
2. The apparatus of claim 1, wherein the apparatus is configured to perform a smaller number of operations per thread per clock cycle for operations performed by the shared arithmetic logic unit than for operations performed by the set of arithmetic logic units.
3. The apparatus of claim 1, wherein the shared arithmetic logic unit is configured to forward a result from an operation that uses the shared arithmetic logic unit to an immediately subsequent dependent operation from the same thread without stalling.
4. The apparatus of claim 1, wherein datapath processing elements corresponding to the multiple different threads are physically located adjacent to the shared arithmetic logic unit.
5. The apparatus of claim 4, wherein the datapath processing elements include the set of arithmetic logic units and dedicated operand caches for ones of the multiple different threads.
6. The apparatus of claim 1, wherein two threads share the shared arithmetic logic unit and the arbitration circuitry is configured to switch between the two threads each time it issues an operation to the shared arithmetic logic unit.
7. The apparatus of claim 1, wherein the set of threads also shares another arithmetic logic unit that is configured to perform a different set of operations than the shared arithmetic logic unit.
8. The apparatus of claim 1, wherein the apparatus is a graphics processing unit (GPU) and wherein the set of multiple different threads execute at least a portion of the same instructions using different data and are controlled using one or more shared control signals.
9. The apparatus of claim 1, wherein the set of arithmetic logic units are configured to perform a set of operations with input operands of a first precision and the shared arithmetic logic unit is configured to perform one or more of the same set of operations with input operands of a second, greater precision.
10. A method, comprising: performing, by a computing device, operations for different ones of a set of multiple different threads using respective arithmetic logic units that are each configured to perform only operations for a thread that is currently assigned to the arithmetic logic unit and not operations from other threads; receiving, by arbitration circuitry of the computing device, operations from the set of threads to be performed by a shared arithmetic logic unit, wherein the shared arithmetic logic unit forwards a result from a first operation from a given thread for use as an input for a dependent operation from the given thread, wherein the forwarded result is available to the dependent operation after a delay of one or more cycles subsequent to completion of the first operation; issuing, by the arbitration circuitry, the received operations to the shared arithmetic logic unit, including switching to a different one of the set of threads for each instruction issued to the shared arithmetic logic unit; and performing, by the shared arithmetic logic unit, operations issued by the arbitration circuitry.
11. The method of claim 10, further comprising: forwarding, by the shared arithmetic logic unit, a result from an operation that uses the shared arithmetic logic unit to an immediately subsequent dependent operation from the same thread without stalling.
12. The method of claim 10, wherein the set of threads consists of two threads and the arbitration circuitry switches between the two threads each time it issues an operation to the shared arithmetic logic unit.
13. The method of claim 10, further comprising: receiving, by arbitration circuitry of the computing device, operations from the set of threads to be performed by second shared arithmetic logic unit that is configured to perform a different set of operations than the shared arithmetic logic unit; and issuing, by the arbitration circuitry, the received operations to the second shared arithmetic logic unit, including switching to a different one of the set of threads for each instruction issued to the second shared arithmetic logic unit.
14. The method of claim 10, wherein the set of multiple different threads execute at least a portion of the same instructions using different data and are controlled using one or more shared control signals.
15. A non-transitory computer readable storage medium having stored thereon design information that specifies a design of at least a portion of a hardware integrated circuit in a format recognized by a semiconductor fabrication system that is configured to use the design information to produce the circuit according to the design, including: a shared arithmetic logic unit that is shared for performing operations specified by a set of multiple different threads, wherein the shared arithmetic logic unit is configured to forward a result from a first operation from a given thread for use as an input for a dependent operation from the given thread, wherein the forwarded result is available to the dependent operation after a delay of one or more cycles subsequent to completion of the first operation; a set of arithmetic logic units each configured to perform only operations specified by a thread of the set of threads that is currently assigned to the arithmetic logic unit and not operations from other threads, wherein the arithmetic logic unit is configured to accept an operation to be performed from the assigned thread each clock cycle; and arbitration circuitry configured to: receive operations to be performed by the shared arithmetic logic unit from the set of threads; and issue the received operations to the shared arithmetic logic unit, including, in one or more modes of operation, switching to a different one of the set of threads for each instruction issued to the shared arithmetic logic unit.
16. The non-transitory computer readable storage medium of claim 15, wherein the circuit is configured to perform a smaller number of operations per thread per clock cycle for operations performed by the shared arithmetic logic unit than for operations performed by the set of arithmetic logic units.
17. The non-transitory computer readable storage medium of claim 15, wherein the shared arithmetic logic unit is configured to forward a result from an operation that uses the shared arithmetic logic unit to an immediately subsequent dependent operation from the same thread without stalling.
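Claims 1 and 3 together suggest that alternating between threads can hide the forwarding delay of the shared ALU: while one thread's result is in flight, the other thread uses the issue slot. The sketch below is one possible reading, not the claimed hardware. The function `simulate`, the one-cycle execution time, and the strict alternation of issue slots are assumptions made for illustration; each thread runs a chain of operations in which every operation depends on the previous one.

```python
def simulate(num_threads, forward_delay, ops_per_thread):
    """Toy timing model (hypothetical). Returns (total_cycles, stall_slots).

    Assumptions: an operation issued in cycle c completes at the end of
    cycle c, and its forwarded result is usable by a dependent operation
    of the same thread from cycle c + 1 + forward_delay onward.
    """
    remaining = [ops_per_thread] * num_threads
    ready_at = [0] * num_threads  # earliest cycle each thread may issue again
    stalls = 0
    cycle = 0
    while any(remaining):
        thread = cycle % num_threads  # issue slot rotates every cycle
        if remaining[thread]:
            if cycle >= ready_at[thread]:
                remaining[thread] -= 1
                ready_at[thread] = cycle + 1 + forward_delay
            else:
                stalls += 1  # operand not yet forwarded: slot is wasted
        cycle += 1
    return cycle, stalls


# Two threads sharing the ALU: the other thread fills the delay slot.
print(simulate(num_threads=2, forward_delay=1, ops_per_thread=3))  # (6, 0)

# One thread issuing dependent operations back to back must stall.
print(simulate(num_threads=1, forward_delay=1, ops_per_thread=3))  # (5, 2)
```

Under these assumptions each thread issues to the shared ALU only every other cycle, which is consistent with the lower per-thread throughput in claims 2 and 16, yet no issue slot is lost to a stall.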
controlled by a single instruction for multiple data lanes [SIMD] · CPC title
controlled by a single instruction for multiple threads [SIMT] in parallel · CPC title
from multiple instruction streams, e.g. multistreaming · CPC title
Bypassing or forwarding of data results, e.g. locally between pipeline stages or within a pipeline stage · CPC title
General purpose rendering architectures · CPC title