What technology area does this patent fall under?

Primary CPC classification G06T1/20. Mapped technology areas include Physics.

When was this patent published?

Publication date: Jun 30, 2020 (kind code B1). Legal status and post-grant events are not shown on this page.

What related patents are in patentsdb?

We list 2 related publications on this page (citations in our corpus or others sharing the same primary CPC).

Techniques for ALU sharing between threads

US10699366B1 · US · B1

Patent metadata
Field	Value
Publication number	US-10699366-B1
Application number	US-201816057794-A
Country	US
Kind code	B1
Filing date	Aug 7, 2018
Priority date	Aug 7, 2018
Publication date	Jun 30, 2020
Grant date	Jun 30, 2020

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

Title
What the patent document calls the invention.
Abstract
A short plain-language summary of the technical disclosure.
Assignees and inventors
Who owns or filed the patent and who is credited as inventor.
Key dates
Filing, priority, publication, and grant dates set the timeline.
First independent claim
The legal scope of protection — read this for what is actually claimed.
CPC / IPC classifications
Technology tags used to group this patent with similar filings.
Citations and related patents
Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

Techniques are disclosed relating to sharing an arithmetic logic unit (ALU) between multiple threads. In some embodiments, the threads also have dedicated ALUs for other types of operations. In some embodiments, arbitration circuitry is configured to receive operations to be performed by the shared arithmetic logic unit from the set of threads and issue the received operations to the shared arithmetic logic unit. In some embodiments, the arbitration circuitry is configured to switch to a different one of the set of threads for each instruction issued to the shared arithmetic logic unit. In some embodiments, the shared ALU is configured to perform 32-bit operations and the dedicated ALUs are configured to perform the same operations using 16-bit precision. In some embodiments, the shared ALU is shared between two threads and is physically located adjacent to other datapath circuitry for the two threads.
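The per-instruction thread switching described in the abstract can be illustrated with a small software model. This is a minimal sketch, not the patented circuit: the function name `issue_schedule`, the one-issue-slot-per-cycle timing, and the choice to leave a slot idle when its thread has nothing pending are all assumptions made for illustration. The abstract does not say what happens when a thread has no work.

```python
from collections import deque


def issue_schedule(per_thread_ops):
    """Model an arbiter that rotates the shared-ALU issue slot among threads.

    per_thread_ops: one list of pending operations per thread (hypothetical input).
    Returns a list of (cycle, thread_id, op) tuples in issue order.
    Assumption: the slot in a given cycle belongs to thread (cycle % num_threads);
    if that thread has nothing pending, the slot is left idle.
    """
    pending = [deque(ops) for ops in per_thread_ops]
    schedule = []
    cycle = 0
    while any(pending):  # an empty deque is falsy, so this stops when all are drained
        owner = cycle % len(pending)
        if pending[owner]:
            schedule.append((cycle, owner, pending[owner].popleft()))
        cycle += 1
    return schedule


two_thread_schedule = issue_schedule([["a0", "a1"], ["b0", "b1"]])
```

With two threads that each have two pending operations, the model alternates between thread 0 and thread 1 on successive cycles, which matches the two-thread case mentioned at the end of the abstract.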

First claim

Opening claim text (preview).

What is claimed is:

1. An apparatus, comprising: a shared arithmetic logic unit that is shared for performing operations specified by a set of multiple different threads, wherein the shared arithmetic logic unit is configured to forward a result from a first operation from a given thread for use as an input for a dependent operation from the given thread, wherein the forwarded result is available to the dependent operation after a delay of one or more cycles subsequent to completion of the first operation; a set of arithmetic logic units each configured to perform only operations specified by a thread of the set of threads that is currently assigned to the arithmetic logic unit and not operations from other threads, wherein the arithmetic logic unit is configured to accept an operation to be performed from the assigned thread each clock cycle; and arbitration circuitry configured to: receive operations to be performed by the shared arithmetic logic unit from the set of threads; and issue the received operations to the shared arithmetic logic unit, including, in one or more modes of operation, switching to a different one of the set of threads for each instruction issued to the shared arithmetic logic unit.

2. The apparatus of claim 1, wherein the apparatus is configured to perform a smaller number of operations per thread per clock cycle for operations performed by the shared arithmetic logic unit than for operations performed by the set of arithmetic logic units.

3. The apparatus of claim 1, wherein the shared arithmetic logic unit is configured to forward a result from an operation that uses the shared arithmetic logic unit to an immediately subsequent dependent operation from the same thread without stalling.

4. The apparatus of claim 1, wherein datapath processing elements corresponding to the multiple different threads are physically located adjacent to the shared arithmetic logic unit.

5. The apparatus of claim 4, wherein the datapath processing elements include the set of arithmetic logic units and dedicated operand caches for ones of the multiple different threads.

6. The apparatus of claim 1, wherein two threads share the shared arithmetic logic unit and the arbitration circuitry is configured to switch between the two threads each time it issues an operation to the shared arithmetic logic unit.

7. The apparatus of claim 1, wherein the set of threads also shares another arithmetic logic unit that is configured to perform a different set of operations than the shared arithmetic logic unit.

8. The apparatus of claim 1, wherein the apparatus is a graphics processing unit (GPU) and wherein the set of multiple different threads execute at least a portion of the same instructions using different data and are controlled using one or more shared control signals.

9. The apparatus of claim 1, wherein the set of arithmetic logic units are configured to perform a set of operations with input operands of a first precision and the shared arithmetic logic unit is configured to perform one or more of the same set of operations with input operands of a second, greater precision.

10. A method, comprising: performing, by a computing device, operations for different ones of a set of multiple different threads using respective arithmetic logic units that are each configured to perform only operations for a thread that is currently assigned to the arithmetic logic unit and not operations from other threads; receiving, by arbitration circuitry of the computing device, operations from the set of threads to be performed by a shared arithmetic logic unit, wherein the shared arithmetic logic unit forwards a result from a first operation from a given thread for use as an input for a dependent operation from the given thread, wherein the forwarded result is available to the dependent operation after a delay of one or more cycles subsequent to completion of the first operation; issuing, by the arbitration circuitry, the received operations to the shared arithmetic logic unit, including switching to a different one of the set of threads for each instruction issued to the shared arithmetic logic unit; and performing, by the shared arithmetic logic unit, operations issued by the arbitration circuitry.

11. The method of claim 10, further comprising: forwarding, by the shared arithmetic logic unit, a result from an operation that uses the shared arithmetic logic unit to an immediately subsequent dependent operation from the same thread without stalling.

12. The method of claim 10, wherein the set of threads consists of two threads and the arbitration circuitry switches between the two threads each time it issues an operation to the shared arithmetic logic unit.

13. The method of claim 10, further comprising: receiving, by arbitration circuitry of the computing device, operations from the set of threads to be performed by second shared arithmetic logic unit that is configured to perform a different set of operations than the shared arithmetic logic unit; and issuing, by the arbitration circuitry, the received operations to the second shared arithmetic logic unit, including switching to a different one of the set of threads for each instruction issued to the second shared arithmetic logic unit.

14. The method of claim 10, wherein the set of multiple different threads execute at least a portion of the same instructions using different data and are controlled using one or more shared control signals.

15. A non-transitory computer readable storage medium having stored thereon design information that specifies a design of at least a portion of a hardware integrated circuit in a format recognized by a semiconductor fabrication system that is configured to use the design information to produce the circuit according to the design, including: a shared arithmetic logic unit that is shared for performing operations specified by a set of multiple different threads, wherein the shared arithmetic logic unit is configured to forward a result from a first operation from a given thread for use as an input for a dependent operation from the given thread, wherein the forwarded result is available to the dependent operation after a delay of one or more cycles subsequent to completion of the first operation; a set of arithmetic logic units each configured to perform only operations specified by a thread of the set of threads that is currently assigned to the arithmetic logic unit and not operations from other threads, wherein the arithmetic logic unit is configured to accept an operation to be performed from the assigned thread each clock cycle; and arbitration circuitry configured to: receive operations to be performed by the shared arithmetic logic unit from the set of threads; and issue the received operations to the shared arithmetic logic unit, including, in one or more modes of operation, switching to a different one of the set of threads for each instruction issued to the shared arithmetic logic unit.

16. The non-transitory computer readable storage medium of claim 15, wherein the circuit is configured to perform a smaller number of operations per thread per clock cycle for operations performed by the shared arithmetic logic unit than for operations performed by the set of arithmetic logic units.

17. The non-transitory computer readable storage medium of claim 15, wherein the shared arithmetic logic unit is configured to forward a result from an operation that uses the shared arithmetic logic unit to an immediately subsequent dependent operation from the same thread without stalling.
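One way to read the forwarding delay together with the per-instruction thread switching is as a timing relationship: a thread's dependent operation cannot issue until the thread's next turn, which may give the forwarded result time to arrive. The sketch below is an interpretation and not something the claim text states. The function name `cycles_late`, the single-cycle execution default, and the strict one-slot-per-cycle rotation are assumptions chosen for illustration.

```python
def cycles_late(num_threads, forward_delay, exec_cycles=1):
    """Cycles by which a forwarded result misses the same thread's next issue slot.

    Assumptions (not from the claims): the arbiter rotates strictly, one issue
    slot per cycle, so a thread issues again num_threads cycles after its
    previous operation; an operation completes exec_cycles after issue and its
    result becomes usable forward_delay cycles after completion.
    A return value of 0 means the result is ready in time for that slot.
    """
    ready_after = exec_cycles + forward_delay  # cycles from issue until the result is usable
    next_slot_after = num_threads              # cycles from issue until the thread's next slot
    return max(0, ready_after - next_slot_after)


two_threads_one_cycle_delay = cycles_late(num_threads=2, forward_delay=1)
single_thread_one_cycle_delay = cycles_late(num_threads=1, forward_delay=1)
```

Under these assumptions, two alternating threads absorb a one-cycle forwarding delay (the result is 0 cycles late), while a single thread issuing back to back would find the result one cycle late.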

Assignees

Apple Inc

Inventors

Kenney Robert D

Classifications

G06F9/3887
controlled by a single instruction for multiple data lanes [SIMD] · CPC title
G06F9/3888
controlled by a single instruction for multiple threads [SIMT] in parallel · CPC title
G06F9/3851
from multiple instruction streams, e.g. multistreaming · CPC title
G06F9/3826
Bypassing or forwarding of data results, e.g. locally between pipeline stages or within a pipeline stage · CPC title
G06T15/005
General purpose rendering architectures · CPC title

Patent family

Related publications grouped by family.

View patent family 71125084

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US10699366B1 cover?: Techniques are disclosed relating to sharing an arithmetic logic unit (ALU) between multiple threads. In some embodiments, the threads also have dedicated ALUs for other types of operations. In some embodiments, arbitration circuitry is configured to receive operations to be performed by the shared arithmetic logic unit from the set of threads and issue the received operations to the shared ari…
Who is the assignee on this patent?: Apple Inc