Redundancy elimination in single instruction multiple data/thread (SIMD/T) execution processing

US10061591B2 · US · B2

Patent metadata
FieldValue
Publication numberUS-10061591-B2
Application numberUS-201514632651-A
CountryUS
Kind codeB2
Filing dateFeb 26, 2015
Priority dateJun 27, 2014
Publication dateAug 28, 2018
Grant dateAug 28, 2018

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

A method for reducing execution of redundant threads in a processing environment. The method includes detecting threads that include redundant work among many different threads. Multiple threads from the detected threads are grouped into one or more thread clusters based on determining same thread computation results. Execution of all but a particular one thread in each of the one or more thread clusters is suppressed. The particular one thread in each of the one or more thread clusters is executed. Results determined from execution of the particular one thread in each of the one or more thread clusters are broadcasted to other threads in each of the one or more thread clusters.

First claim

Opening claim text (preview).

What is claimed is: 1. A method for reducing execution of redundant threads in a processing environment, the method comprising: detecting threads that include a non-preemptive trace code and comprise an identical sequence of instructions among a plurality of different threads; grouping multiple threads from the detected threads into one or more thread clusters based on the non-preemptive trace code; suppressing execution of all but a particular one thread in each of the one or more thread clusters; executing the particular one thread in each of the one or more thread clusters; and broadcasting results determined from execution of the non-preemptive trace code included in the particular one thread in each of the one or more thread clusters to other threads in each of the one or more thread clusters, wherein the grouping further comprises: identifying any cluster intersections across different identical sequences of instructions with a scheduler; and re-grouping intersected thread clusters by concatenating cluster identification (ID) and updating a bit-vector of threads in the one or more thread clusters. 2. The method of claim 1 , wherein the detecting comprises analyzing dependencies of the plurality of different threads to determine if any identical sequences of instructions from different threads depend on values computed by other threads during compilation by a compiler. 3. The method of claim 1 , further comprising analyzing potential reduction in power consumption in the processing environment based on size of identical sequences of thread instructions without dependencies and redundancy computations and based on number of inputs for the plurality of different threads. 4. The method of claim 3 , wherein the non-preemptive trace code comprising the identical sequence of instructions has an identical set of input values. 5. The method of claim 4 , wherein the identifying identical sequences of instructions comprises: reading from one of input registers for each of the plurality of different threads or from output registers of a texture interpolator; and implementing a data structure with following pair-wise compares of input registers of each of the plurality of different threads for avoiding false positive results. 6. The method of claim 5 , wherein the detecting further comprises: mapping the input registers of each of the plurality of different threads that have identical values with the implemented data structure. 7. The method of claim 4 , wherein the grouping further comprises: grouping the multiple threads with the identical sequences of instructions and the identical set of input values that will compute exactly same results into the one or more thread clusters. 8. The method of claim 7 , wherein: the particular one thread in each of the one or more thread clusters is designated as a cluster leader thread, wherein each cluster leader thread has a minimum ID of the threads in a thread cluster; and the suppressing execution comprises clock-gating off the execution of all non-cluster leader threads in the one or more thread clusters. 9. The method of claim 8 , wherein the broadcasting of the results comprises broadcasting the results computed by each cluster leader thread to other threads in its thread cluster using an output register map from the compiler. 10. The method of claim 1 , wherein the processing environment comprises a single instruction multiple thread (SIMT) or single instruction multiple data (SIMD) processing architecture. 11. The method of claim 10 , wherein the processing environment is included in a graphics processing unit (GPU) of a mobile electronic device. 12. A non-transitory computer-readable medium having instructions which when executed on a computer perform a method comprising: detecting threads that include a non-preemptive trace code and comprise an identical sequence of instructions among a plurality of different threads; grouping multiple threads from the detected threads into one or more thread clusters based on the non-preemptive trace code; suppressing execution of all but a particular one thread in each of the one or more thread clusters; executing the particular one thread in each of the one or more thread clusters; and broadcasting results determined from execution of the non-preemptive trace code included in the particular one thread in each of the one or more thread clusters to other threads in each of the one or more thread clusters, wherein the grouping further comprises: identifying any cluster intersections across different identical sequences of instructions with a scheduler, and re-grouping intersected thread clusters by concatenating cluster identification (ID) and updating a bit-vector of threads in the one or more thread clusters. 13. The medium of claim 12 , wherein the detecting comprises analyzing dependencies of the plurality of different threads to determine if any identical sequences of instructions from different threads depend on values computed by other threads during compilation by a compiler. 14. The medium of claim 12 , further comprising analyzing potential reduction in power consumption in the processing environment based on size of identical sequences of thread instructions without dependencies and redundancy computations and based on number of inputs for the plurality of different threads, wherein the detecting threads that include the non-preemptive trace code comprises identifying the identical sequence of instructions from the plurality of different threads that has an identical set of input values. 15. The medium of claim 14 , wherein the identifying identical sequences of instructions comprises: reading from one of input registers for each of the plurality of different threads or from output registers of a texture interpolator; and implementing a data structure with following pair-wise compares of input registers of each of the plurality of different threads for avoiding false positive results. 16. The medium of claim 15 , wherein the detecting further comprises: mapping the input registers of each of the plurality of different threads that have identical values with the implemented data structure. 17. The medium of claim 14 , wherein the grouping further comprises: grouping the multiple threads with the identical sequences of instructions and the identical set of input values that will compute exactly same results into the one or more thread clusters. 18. The medium of claim 17 , wherein the particular one thread in each of the one or more thread clusters is designated as a cluster leader thread, wherein each cluster leader thread has a minimum ID of the threads in a thread cluster; and the suppressing execution comprises clock-gating off the execution of all non-cluster leader threads in the one or more thread clusters. 19. The medium of claim 18 , wherein the broadcasting of the results comprises broadcasting the results computed by each cluster leader thread to other threads in its thread cluster using an output register map from the compiler. 20. The medium of claim 12 , wherein the processing environment comprises a single instruction multiple thread (SIMT) or single instruction multiple data (SIMD) processing architecture. 21. The medium of claim 20 , wherein the processing environment is included in a graphics processing unit (GPU) of a mobile electronic device. 22. A graphics processor for an electronic device comprising: one or more processing element

Assignees

Inventors

Classifications

  • G06F9/3887Primary

    controlled by a single instruction for multiple data lanes [SIMD] · CPC title

  • with global bypass, e.g. between pipelines, between clusters · CPC title

  • from multiple instruction streams, e.g. multistreaming · CPC title

  • Multiprogramming arrangements · CPC title

  • controlled by a single instruction for multiple threads [SIMT] in parallel · CPC title

Patent family

Related publications grouped by family.

External sources

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US10061591B2 cover?
A method for reducing execution of redundant threads in a processing environment. The method includes detecting threads that include redundant work among many different threads. Multiple threads from the detected threads are grouped into one or more thread clusters based on determining same thread computation results. Execution of all but a particular one thread in each of the one or more threa…
Who is the assignee on this patent?
Samsung Electronics Co Ltd
What technology area does this patent fall under?
Primary CPC classification G06F9/3887. Mapped technology areas include Physics.
When was this patent published?
Publication date Tue Aug 28 2018 00:00:00 GMT+0000 (Coordinated Universal Time) (B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 2 related publications on this page (citations in our corpus or others sharing the same primary CPC).