Uniform load processing for parallel thread sub-sets

US10007527B2 · US · B2

Patent metadata
Field: Value
Publication number: US-10007527-B2
Application number: US-201213412438-A
Country: US
Kind code: B2
Filing date: Mar 5, 2012
Priority date: Mar 5, 2012
Publication date: Jun 26, 2018
Grant date: Jun 26, 2018

Abstract

Official abstract text for this publication.

One embodiment of the present invention sets forth a technique for processing load instructions for parallel threads of a thread group when a sub-set of the parallel threads request the same memory address. The load/store unit determines if the memory addresses for each sub-set of parallel threads match based on one or more uniform patterns. When a match is achieved for at least one of the uniform patterns, the load/store unit transmits a read request to retrieve data for the sub-set of parallel threads. The number of read requests transmitted is reduced compared with performing a separate read request for each thread in the sub-set. A variety of uniform patterns may be defined based on common access patterns present in program instructions. A variety of uniform patterns may also be defined based on interconnect constraints between the load/store unit and the memory when a full crossbar interconnect is not available.
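The reduction in read requests described in the abstract can be illustrated with a minimal sketch. It is not the patented hardware design. It assumes one simple uniform pattern (every thread in a sub-set requests the same address) and a hypothetical partitioning of the thread group into consecutive, equally sized sub-sets. The names `plan_read_requests` and `subset_size` are invented for this illustration.

```python
from typing import List


def plan_read_requests(addresses: List[int], subset_size: int) -> List[int]:
    """Return the addresses to read for one load instruction.

    addresses[i] is the address requested by thread i of the thread group.
    Each consecutive run of subset_size threads is treated as one sub-set
    (an assumption made for this sketch).
    """
    requests: List[int] = []
    for start in range(0, len(addresses), subset_size):
        subset = addresses[start:start + subset_size]
        if all(a == subset[0] for a in subset):
            # Uniform pattern matched: one read serves the whole sub-set.
            requests.append(subset[0])
        else:
            # No match: fall back to one read per thread.
            requests.extend(subset)
    return requests


# Eight threads, sub-sets of four: the first sub-set is uniform, the second is not.
group = [0x100, 0x100, 0x100, 0x100, 0x200, 0x204, 0x208, 0x20C]
reads = plan_read_requests(group, subset_size=4)
print(len(reads))  # 5 requests instead of 8
```

In this example the first sub-set collapses to a single read, while the second sub-set still needs one read per thread.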

First claim

Opening claim text (preview).

The invention claimed is:

1. A method for retrieving from memory data associated with a load instruction, the method comprising: receiving a first load instruction for parallel execution by each thread in a thread group, wherein the first load instruction specifies an individual memory address for each respective thread in the thread group; identifying a parallel thread sub-set that includes only a portion of the threads in the thread group; for each thread included in the parallel thread sub-set, comparing an individual memory address specified in the first load instruction for the thread with at least one other individual memory address specified in the first load instruction for at least one other thread included in the parallel thread sub-set to generate a comparison result; determining that the comparison result indicates that the individual memory addresses of the parallel thread sub-set are distributed according to a uniform pattern; and upon determining that the comparison results indicate the uniform pattern, transmitting a read request to the memory to retrieve data stored at a first memory address, wherein the first memory address is specified in the first load instruction for at least the thread and the at least one other thread included in the parallel thread sub-set.

2. The method of claim 1, wherein the comparison result indicates that the first load instruction specifies the first memory address for at least two threads within the parallel thread sub-set.

3. The method of claim 1, wherein the comparison result indicates that the first load instruction specifies a second memory address for at least two threads within the parallel thread sub-set, and the read request specifies the first memory address and the second memory address.

4. The method of claim 1, further comprising, prior to the identifying the parallel thread sub-set, determining that the first load instruction specifies a hint that the first load instruction may be processed as a uniform load instruction for the parallel thread sub-set and additional parallel thread sub-sets of the threads in the thread group.

5. The method of claim 1, wherein the comparing comprises comparing individual memory addresses specified by the first load instruction for pairs of adjacent threads within the parallel thread sub-set with each other.

6. The method of claim 1, wherein the comparing comprises comparing individual memory addresses specified by the first load instruction for pairs of threads offset by a predetermined number of threads within the parallel thread sub-set with each other.

7. The method of claim 1, further comprising the steps of: receiving an active mask for the thread group that indicates threads in the thread group that should execute the first load instruction; and using the active mask to generate the comparison result.

8. The method of claim 1, further comprising the steps of: receiving a second load instruction for parallel execution by each thread in a second thread group, wherein the second load instruction specifies an additional individual memory address for each respective thread in the second thread group; identifying a second parallel thread sub-set that includes only a portion of the threads in the second thread group; for each thread included in the second parallel thread sub-set, comparing an additional individual memory address specified in the second load instruction for the thread with at least one other additional individual memory address specified in the second load instruction for at least one other thread included in the second parallel thread sub-set to generate a second comparison result; determining that the comparison result indicates that the additional individual memory addresses of the second parallel thread sub-set are not distributed according to the uniform pattern; and transmitting additional read requests to the memory to retrieve data stored at each one of the additional individual memory addresses.

9. The method of claim 1, further comprising the steps of: identifying a second parallel thread sub-set that includes the remaining threads in the thread group; for each thread included in the second parallel thread sub-set, comparing an individual memory address specified in the first load instruction for the thread with at least one other individual memory address specified in the first load instruction for at least one other thread included in the second parallel thread sub-set to generate a second comparison result; and determining that the second comparison result indicates that the individual memory addresses of the second parallel thread sub-set are distributed according to the uniform pattern, wherein the read request specifies the first memory address and a second memory address, wherein the second memory address is specified for at least one thread included in the second parallel thread sub-set in the first load instruction.

10. The method of claim 1, further comprising the steps of: identifying a second parallel thread sub-set that includes the remaining threads in the thread group; for each thread included in the second parallel thread sub-set, comparing an individual memory address specified in the first load instruction for the thread with at least one other individual memory address specified in the first load instruction for at least one other thread included in the second parallel thread sub-set to generate a second comparison result and a third comparison result; determining that the second comparison result indicates that the individual memory addresses of the second parallel thread sub-set are not distributed according to the uniform pattern; and determining that the third comparison result indicates that the individual memory addresses of the second parallel thread sub-set are distributed according to a second uniform pattern, wherein the read request specifies the first memory address and a second memory address, wherein the second memory address is specified for at least one thread included in the second parallel thread sub-set in the first load instruction.

11. The method of claim 1, further comprising the steps of: identifying a second parallel thread sub-set that includes the remaining threads in the thread group; for each thread included in the second parallel thread sub-set, comparing an individual memory address specified in the first load instruction for the thread with at least one other individual memory address specified in the first load instruction for at least one other thread included in the second parallel thread sub-set to generate a second comparison result; determining that the second comparison result indicates that the individual memory addresses of the second parallel thread sub-set not distributed according to the uniform pattern; and transmitting additional read requests to the memory to retrieve data stored at the individual memory addresses associated with the remaining threads.

12. A processing subsystem comprising: a uniform load unit that is configured to: receive a first load instruction for parallel execution by each thread in a thread group, wherein the first load instruction specifies an individual memory address for each respective thread in the thread group; identify a parallel thread sub-set that includes only a portion of the threads in the thread group; for each thread included in the parallel thread sub-set, compare an individual memory address specified in the first load instruction for the thread with at least one other individual memory address specified in the first load instruction for at least one other thread included in the parallel thread sub-set to generate a compa
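
The pair-based comparisons of claims 5 and 6, the active mask of claim 7, and the per-thread fallback of claim 11 can be sketched in software as follows. This is an illustrative model under stated assumptions, not the claimed circuit. The function names, the order in which patterns are tried (offset 1, then offset 2), and the rule that a pair with an inactive thread is treated as matching are all assumptions introduced here. The claims only state that the active mask is used to generate the comparison result.

```python
from typing import List, Optional, Tuple


def pair_indices(subset_size: int, offset: int) -> List[Tuple[int, int]]:
    """Pairs (i, i + offset) that tile the sub-set; offset=1 gives adjacent pairs.

    Assumes subset_size is a multiple of 2 * offset.
    """
    pairs = []
    for base in range(0, subset_size, 2 * offset):
        for i in range(base, base + offset):
            pairs.append((i, i + offset))
    return pairs


def match_pairs(addrs: List[int], active: List[bool],
                offset: int) -> Optional[List[int]]:
    """Return one address per pair if every active pair matches, else None."""
    reads: List[int] = []
    for i, j in pair_indices(len(addrs), offset):
        if active[i] and active[j]:
            if addrs[i] != addrs[j]:
                return None  # pattern not matched for this sub-set
            reads.append(addrs[i])
        elif active[i]:
            reads.append(addrs[i])  # assumption: inactive partner is ignored
        elif active[j]:
            reads.append(addrs[j])
        # both threads inactive: nothing to read for this pair
    return reads


def process_subset(addrs: List[int], active: List[bool]) -> List[int]:
    """Try each hypothetical pattern in turn, then fall back to per-thread reads."""
    for offset in (1, 2):
        reads = match_pairs(addrs, active, offset)
        if reads is not None:
            return reads
    return [a for a, on in zip(addrs, active) if on]


all_on = [True] * 4
print(process_subset([0x10, 0x10, 0x20, 0x20], all_on))  # adjacent pairs match
print(process_subset([0x10, 0x20, 0x10, 0x20], all_on))  # offset-2 pairs match
print(process_subset([0x10, 0x20, 0x30, 0x40], all_on))  # no pattern, 4 reads
```

In the first two calls a single request can carry two addresses for four threads, in the manner of claim 3, while the last call degrades to one read per active thread.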

Classifications

  • Operand prefetching (cache prefetching G06F12/0862)

  • G06F9/3887 (Primary): controlled by a single instruction for multiple data lanes [SIMD]

  • G06F9/3851 (Primary): from multiple instruction streams, e.g. multistreaming

  • controlled by a single instruction for multiple threads [SIMT] in parallel

  • Divergence aspects

Frequently asked questions

Answers are generated from the same data shown on this page.

Who is the assignee on this patent?
Fetterman Michael, Carlton Stewart Glenn, Hahn Douglas J, and 4 more
What technology area does this patent fall under?
Primary CPC classification G06F9/3887. Mapped technology areas include Physics.
When was this patent published?
Publication date: Jun 26, 2018 (kind code B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 8 related publications on this page (citations in our corpus or others sharing the same primary CPC).