Architecture and execution for efficient mixed precision computations in single instruction multiple data/thread (SIMD/T) devices
US-2015378741-A1 · Dec 31, 2015 · US
US10061591B2 · US · B2
| Field | Value |
|---|---|
| Publication number | US-10061591-B2 |
| Application number | US-201514632651-A |
| Country | US |
| Kind code | B2 |
| Filing date | Feb 26, 2015 |
| Priority date | Jun 27, 2014 |
| Publication date | Aug 28, 2018 |
| Grant date | Aug 28, 2018 |
Official abstract text for this publication.
A method for reducing execution of redundant threads in a processing environment. The method includes detecting threads that include redundant work among many different threads. Multiple threads from the detected threads are grouped into one or more thread clusters based on determining same thread computation results. Execution of all but a particular one thread in each of the one or more thread clusters is suppressed. The particular one thread in each of the one or more thread clusters is executed. Results determined from execution of the particular one thread in each of the one or more thread clusters are broadcasted to other threads in each of the one or more thread clusters.
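The flow in the abstract (detect, group, suppress, execute one thread, broadcast) can be sketched in software. This is only an illustrative simulation under stated assumptions, not the patented hardware mechanism: the function name `run_with_redundancy_elimination`, the `kernel` callable, and the dictionary of per-thread inputs are all hypothetical, and the kernel is assumed to be deterministic with no side effects, so that threads with the same inputs compute the same result.

```python
from collections import defaultdict
from typing import Callable, Dict, Hashable, List, Tuple


def run_with_redundancy_elimination(
    kernel: Callable[..., object],
    thread_inputs: Dict[int, Tuple[Hashable, ...]],
) -> Tuple[Dict[int, object], int]:
    """Run `kernel` once per cluster of threads that share identical inputs.

    Returns (result per thread ID, number of kernel executions).
    Assumption: every thread runs the same instruction sequence (`kernel`).
    """
    # Detect and group: threads with identical input tuples form one cluster.
    clusters: Dict[Tuple[Hashable, ...], List[int]] = defaultdict(list)
    for tid, inputs in thread_inputs.items():
        clusters[inputs].append(tid)

    results: Dict[int, object] = {}
    executions = 0
    for inputs, tids in clusters.items():
        leader = min(tids)  # the one thread in the cluster that executes
        results[leader] = kernel(*inputs)
        executions += 1
        # Broadcast: the suppressed threads receive the leader's result.
        for tid in tids:
            if tid != leader:
                results[tid] = results[leader]
    return results, executions


# Four threads, two distinct input sets, so the kernel runs only twice.
inputs = {0: (1, 2), 1: (3, 4), 2: (1, 2), 3: (1, 2)}
results, executions = run_with_redundancy_elimination(
    lambda a, b: a * b + 1, inputs
)
```

In this toy run, threads 0, 2, and 3 form one cluster and thread 1 forms another, so two kernel executions serve four threads. Real SIMD/SIMT hardware would make this decision per lane; the dictionary here merely stands in for that bookkeeping.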
Opening claim text (preview).
What is claimed is:
1. A method for reducing execution of redundant threads in a processing environment, the method comprising: detecting threads that include a non-preemptive trace code and comprise an identical sequence of instructions among a plurality of different threads; grouping multiple threads from the detected threads into one or more thread clusters based on the non-preemptive trace code; suppressing execution of all but a particular one thread in each of the one or more thread clusters; executing the particular one thread in each of the one or more thread clusters; and broadcasting results determined from execution of the non-preemptive trace code included in the particular one thread in each of the one or more thread clusters to other threads in each of the one or more thread clusters, wherein the grouping further comprises: identifying any cluster intersections across different identical sequences of instructions with a scheduler; and re-grouping intersected thread clusters by concatenating cluster identification (ID) and updating a bit-vector of threads in the one or more thread clusters.
2. The method of claim 1, wherein the detecting comprises analyzing dependencies of the plurality of different threads to determine if any identical sequences of instructions from different threads depend on values computed by other threads during compilation by a compiler.
3. The method of claim 1, further comprising analyzing potential reduction in power consumption in the processing environment based on size of identical sequences of thread instructions without dependencies and redundancy computations and based on number of inputs for the plurality of different threads.
4. The method of claim 3, wherein the non-preemptive trace code comprising the identical sequence of instructions has an identical set of input values.
5. The method of claim 4, wherein the identifying identical sequences of instructions comprises: reading from one of input registers for each of the plurality of different threads or from output registers of a texture interpolator; and implementing a data structure with following pair-wise compares of input registers of each of the plurality of different threads for avoiding false positive results.
6. The method of claim 5, wherein the detecting further comprises: mapping the input registers of each of the plurality of different threads that have identical values with the implemented data structure.
7. The method of claim 4, wherein the grouping further comprises: grouping the multiple threads with the identical sequences of instructions and the identical set of input values that will compute exactly same results into the one or more thread clusters.
8. The method of claim 7, wherein: the particular one thread in each of the one or more thread clusters is designated as a cluster leader thread, wherein each cluster leader thread has a minimum ID of the threads in a thread cluster; and the suppressing execution comprises clock-gating off the execution of all non-cluster leader threads in the one or more thread clusters.
9. The method of claim 8, wherein the broadcasting of the results comprises broadcasting the results computed by each cluster leader thread to other threads in its thread cluster using an output register map from the compiler.
10. The method of claim 1, wherein the processing environment comprises a single instruction multiple thread (SIMT) or single instruction multiple data (SIMD) processing architecture.
11. The method of claim 10, wherein the processing environment is included in a graphics processing unit (GPU) of a mobile electronic device.
12. A non-transitory computer-readable medium having instructions which when executed on a computer perform a method comprising: detecting threads that include a non-preemptive trace code and comprise an identical sequence of instructions among a plurality of different threads; grouping multiple threads from the detected threads into one or more thread clusters based on the non-preemptive trace code; suppressing execution of all but a particular one thread in each of the one or more thread clusters; executing the particular one thread in each of the one or more thread clusters; and broadcasting results determined from execution of the non-preemptive trace code included in the particular one thread in each of the one or more thread clusters to other threads in each of the one or more thread clusters, wherein the grouping further comprises: identifying any cluster intersections across different identical sequences of instructions with a scheduler, and re-grouping intersected thread clusters by concatenating cluster identification (ID) and updating a bit-vector of threads in the one or more thread clusters.
13. The medium of claim 12, wherein the detecting comprises analyzing dependencies of the plurality of different threads to determine if any identical sequences of instructions from different threads depend on values computed by other threads during compilation by a compiler.
14. The medium of claim 12, further comprising analyzing potential reduction in power consumption in the processing environment based on size of identical sequences of thread instructions without dependencies and redundancy computations and based on number of inputs for the plurality of different threads, wherein the detecting threads that include the non-preemptive trace code comprises identifying the identical sequence of instructions from the plurality of different threads that has an identical set of input values.
15. The medium of claim 14, wherein the identifying identical sequences of instructions comprises: reading from one of input registers for each of the plurality of different threads or from output registers of a texture interpolator; and implementing a data structure with following pair-wise compares of input registers of each of the plurality of different threads for avoiding false positive results.
16. The medium of claim 15, wherein the detecting further comprises: mapping the input registers of each of the plurality of different threads that have identical values with the implemented data structure.
17. The medium of claim 14, wherein the grouping further comprises: grouping the multiple threads with the identical sequences of instructions and the identical set of input values that will compute exactly same results into the one or more thread clusters.
18. The medium of claim 17, wherein the particular one thread in each of the one or more thread clusters is designated as a cluster leader thread, wherein each cluster leader thread has a minimum ID of the threads in a thread cluster; and the suppressing execution comprises clock-gating off the execution of all non-cluster leader threads in the one or more thread clusters.
19. The medium of claim 18, wherein the broadcasting of the results comprises broadcasting the results computed by each cluster leader thread to other threads in its thread cluster using an output register map from the compiler.
20. The medium of claim 12, wherein the processing environment comprises a single instruction multiple thread (SIMT) or single instruction multiple data (SIMD) processing architecture.
21. The medium of claim 20, wherein the processing environment is included in a graphics processing unit (GPU) of a mobile electronic device.
22. A graphics processor for an electronic device comprising: one or more processing element
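Several of the dependent claims describe bookkeeping that is easier to follow in code: a data structure followed by pair-wise compares of input registers to avoid false positives (claims 5 and 15), a cluster leader with the minimum thread ID (claims 8 and 18), and a bit-vector of threads per cluster (claims 1 and 12). The sketch below is a hedged software illustration only. The claims do not say which data structure is used, so the small hash table here is an assumption, and the names `cluster_by_inputs`, `active_mask`, and `num_buckets` are hypothetical.

```python
from typing import Dict, List, Sequence, Tuple


def cluster_by_inputs(
    input_regs: Sequence[Tuple[int, ...]], num_buckets: int = 4
) -> List[Tuple[int, int]]:
    """Return (leader_id, member_bit_vector) per cluster, sorted by leader ID.

    `input_regs[tid]` holds the input register values of thread `tid`.
    """
    # Assumed data structure: a tiny hash table. Each bucket stores clusters
    # as [leader_id, bit_vector]; few buckets means collisions do occur.
    buckets: Dict[int, List[List[int]]] = {}
    for tid, regs in enumerate(input_regs):
        candidates = buckets.setdefault(hash(regs) % num_buckets, [])
        for cluster in candidates:
            # Pair-wise compare against the leader's registers: landing in
            # the same bucket alone could be a false positive.
            if input_regs[cluster[0]] == regs:
                cluster[1] |= 1 << tid  # set this thread's bit
                break
        else:
            # Threads are visited in ascending ID order, so the first thread
            # seen with these values has the minimum ID in its cluster.
            candidates.append([tid, 1 << tid])
    clusters = [(c[0], c[1]) for bucket in buckets.values() for c in bucket]
    return sorted(clusters)


def active_mask(clusters: List[Tuple[int, int]]) -> int:
    """Bit-vector of leader threads; every other lane would be gated off."""
    mask = 0
    for leader, _members in clusters:
        mask |= 1 << leader
    return mask


# Five threads: 0 and 2 share inputs, 1 and 4 share inputs, 3 is unique.
regs = [(1, 2), (3, 4), (1, 2), (5, 6), (3, 4)]
clusters = cluster_by_inputs(regs)
mask = active_mask(clusters)
```

Here bit `i` of a cluster's bit-vector is set when thread `i` belongs to that cluster, and `mask` marks the only lanes that still need to execute. Clock gating itself is a hardware action with no direct software analogue; the mask merely records which lanes it would leave running.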
controlled by a single instruction for multiple data lanes [SIMD] · CPC title
with global bypass, e.g. between pipelines, between clusters · CPC title
from multiple instruction streams, e.g. multistreaming · CPC title
Multiprogramming arrangements · CPC title
controlled by a single instruction for multiple threads [SIMT] in parallel · CPC title