Temporal SIMT execution optimization through elimination of redundant operations

US9830156B2 · US · B2

Patent metadata
Field: Value
Publication number: US-9830156-B2
Application number: US-201113209189-A
Country: US
Kind code: B2
Filing date: Aug 12, 2011
Priority date: Aug 12, 2011
Publication date: Nov 28, 2017
Grant date: Nov 28, 2017

Abstract

Official abstract text for this publication.

One embodiment of the present invention sets forth a technique for optimizing parallel thread execution in a temporal single-instruction multiple thread (SIMT) architecture. When the threads in a parallel thread group execute temporally on a common processing pipeline rather than spatially on parallel processing pipelines, execution cycles may be reduced when some threads in the parallel thread group are inactive due to divergence. Similarly, an instruction can be dispatched for execution by only one thread in the parallel thread group when the threads in the parallel thread group are executing a scalar instruction. Reducing the number of threads that execute an instruction removes unnecessary or redundant operations for execution by the processing pipelines. Information about scalar operands and operations and divergence of the threads is used in the instruction dispatch logic to eliminate unnecessary or redundant activity in the processing pipelines.
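
The cycle savings described in the abstract can be illustrated with a toy cost model. This is only a sketch under stated assumptions, not the patented dispatch logic: it assumes one thread occupies one issue slot on the shared pipeline, and the names `Instr`, `is_scalar`, `baseline_cycles`, and `dispatch_cycles` are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Instr:
    opcode: str
    is_scalar: bool  # hypothetical stand-in for the scalar flags in the instruction


def baseline_cycles(active_mask: List[bool]) -> int:
    """Unoptimized temporal execution: one issue slot per thread, active or not."""
    return len(active_mask)


def dispatch_cycles(instr: Instr, active_mask: List[bool]) -> int:
    """Issue slots used when inactive threads and redundant scalar work are skipped."""
    active = sum(1 for on in active_mask if on)
    if active == 0:
        return 0  # nothing to dispatch
    if instr.is_scalar:
        return 1  # one thread computes the value on behalf of the group
    return active  # divergence: only active threads take a slot


if __name__ == "__main__":
    mask = [True, False, True, True, False, False, True, False]
    print(baseline_cycles(mask))                      # all 8 threads issued
    print(dispatch_cycles(Instr("add", False), mask))  # only the active threads
    print(dispatch_cycles(Instr("mov", True), mask))   # a single thread
```

In this model the saving comes purely from issuing fewer threads on the common pipeline, which is why the same optimization would not reduce cycles when threads execute spatially on parallel pipelines.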

First claim

Opening claim text (preview).

The invention claimed is:

1. A method of executing an instruction for a thread group, the method comprising: receiving, by a single-instruction multiple-thread (SIMT) processor, the instruction for execution by the thread group comprising a plurality of threads, wherein the instruction includes one or more flags indicating that the instruction includes at least one of a scalar opcode and a scalar operand; evaluating the one or more flags included in the instruction to identify the instruction as a scalar instruction; and in response to identifying the instruction as a scalar instruction, dispatching, by the SIMT processor, the scalar instruction for execution by a portion of the threads in the thread group, wherein the portion of threads comprises at least one but not all threads in the thread group.

2. The method of claim 1, wherein the evaluating includes identification of a source operand as a scalar operand.

3. The method of claim 1, wherein the evaluating comprises identifying the instruction as a scalar instruction based on when an opcode included in the instruction is a scalar opcode.

4. The method of claim 1, wherein source operands included in the instruction are scalar operands.

5. The method of claim 1, wherein the evaluating comprises identifying the instruction as a scalar instruction based on operands included in the instruction.

6. The method of claim 1, wherein the evaluating comprises identifying a source operand included in the instruction as a scalar operand that is read from one source operand register for all of the threads in the thread group.

7. The method of claim 1, further comprising reading a source operand included in the instruction from a source operand register only for a first thread in the thread group that is active when a first flag included in the one or more flags indicates that the source operand is a scalar operand.

8. The method of claim 1, wherein the portion of the threads in the thread group includes only threads in the thread group that are active based on divergence information.

9. The method of claim 1, wherein the portion of threads comprises a single thread in the thread group.

10. The method of claim 1, wherein: due to divergence, at least one thread in the thread group is inactive and at least one thread in the thread group is active; and the portion of threads comprises a single active thread in the thread group.

11. The method of claim 1, wherein the evaluating comprises identifying the instruction as a scalar instruction based on at least one of a first determination that an operand included in the instruction is a scalar operand and a second determination that an identifier included in the instruction indicates that the instruction is scalar.

12. The method of claim 1, further comprising: storing in one or more registers a result of the execution by the portion of the threads in the thread group; and accessing the result stored in the one or more registers by a second portion of the threads in the thread group, wherein the second portion of the threads in the thread group is not included in the first portion of the threads in the thread group.

13. The method of claim 1, wherein: the divergence information for the thread group indicates at least one active thread and at least one inactive thread in the thread group; and the at least one active thread from the thread group is selected for executing the scalar instruction.

14. A non-transitory computer-readable storage medium storing instructions that, when executed by a processor, cause the processor to execute an instruction for a thread group, by performing the steps of: receiving the instruction for execution by the thread group comprising a plurality of threads, wherein the instruction includes one or more flags indicating that the instruction includes at least one of a scalar opcode and a scalar operand; evaluating the one or more flags included in the instruction to identify the instruction as a scalar instruction; and in response to identifying the instruction as a scalar instruction, dispatching the scalar instruction for execution by a portion of the threads in the thread group, wherein the portion of threads comprises at least one but not all threads in the thread group.

15. The non-transitory computer-readable storage medium of claim 14, wherein evaluating comprises identifying the instruction as a scalar instruction when a first flag included in the one or more flags indicates that an opcode included in the instruction is a scalar opcode.

16. The non-transitory computer-readable storage medium of claim 14, wherein evaluating comprises identifying the instruction as a scalar instruction when the one or more flags indicate that all destination operands included in the instruction are scalar operands.

17. A system for executing instructions, the system comprising: a memory that is configured to store instructions for execution by threads; and a single-instruction multiple-thread (SIMT) processor that is configured to: receive an instruction for execution by a thread group comprising a plurality of threads, wherein the instruction includes one or more flags indicating that the instruction includes at least one of a scalar opcode and a scalar operand; evaluate the one or more flags included in the instruction to identify the instruction as a scalar instruction; and in response to identifying the instruction as a scalar instruction, dispatch the scalar instruction for execution by a portion of the threads in the thread group, wherein the portion of threads comprises at least one but not all threads in the thread group.

18. The system of claim 17, wherein the SIMT processor is further configured to identify a source operand as a scalar operand.

19. The system of claim 17, wherein the SIMT processor is further configured to identify the instruction as a scalar instruction when an opcode included in the instruction is a scalar opcode.

20. The system of claim 17, wherein all source operands included in the instruction are scalar operands.

21. The system of claim 17, wherein the SIMT processor is further configured to identify the instruction as a scalar instruction based on operands included in the instruction.

22. The system of claim 17, wherein the SIMT processor is further configured to identify a source operand included in the instruction as a scalar operand that is read from one source operand register for all of the threads in the thread group.

23. The system of claim 17, wherein the SIMT processor is further configured to read a source operand included in the instruction from a source operand register only for a first thread in the thread group that is active when a first flag included in the one or more flags indicates that the source operand is a scalar operand.

24. The system of claim 17, wherein the SIMT processor is further configured to write a destination operand register only once when a first flag included in the one or more flags indicates that the destination operand is a scalar operand.

25. The system of claim 17, wherein the portion of the threads in the thread group includes only threads in the thread group that are active based on divergence information.

26. A method of executing an instruction across a thread group comprising a plurality of threads, the method comprising: receiving, by a processor, the instruction for execution across
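
The flag evaluation and dispatch steps recited in the independent claims can be sketched as follows. This is a minimal illustration under stated assumptions, not the claimed hardware: the bit layout (`SCALAR_OPCODE`, `SCALAR_OPERAND`) and the helpers `is_scalar`, `select_threads`, and `execute` are hypothetical, and picking the first active thread is just one possible selection policy.

```python
from typing import Callable, Dict, List

# Hypothetical flag bits carried in the instruction encoding.
SCALAR_OPCODE = 0b01
SCALAR_OPERAND = 0b10


def is_scalar(flags: int) -> bool:
    """Evaluate the flags: a scalar opcode or a scalar operand marks the instruction scalar."""
    return bool(flags & (SCALAR_OPCODE | SCALAR_OPERAND))


def select_threads(flags: int, active_mask: List[bool]) -> List[int]:
    """Choose which threads of the group the instruction is dispatched to."""
    active = [t for t, on in enumerate(active_mask) if on]
    if is_scalar(flags):
        return active[:1]  # a single active thread (assumed policy: the first one)
    return active          # non-scalar: every active thread


def execute(flags: int, active_mask: List[bool],
            op: Callable[[int], int], src) -> Dict[str, object]:
    """Run `op` on the selected threads and report where the result lands."""
    chosen = select_threads(flags, active_mask)
    if is_scalar(flags):
        # Source read once, destination written once; other threads can
        # later read the shared register instead of recomputing.
        result = op(src) if chosen else None
        return {"executed_by": chosen, "scalar_reg": result}
    return {"executed_by": chosen, "vector_reg": {t: op(src[t]) for t in chosen}}


if __name__ == "__main__":
    mask = [False, True, True, False]  # divergence: threads 1 and 2 are active
    print(execute(SCALAR_OPCODE, mask, lambda x: x + 1, 41))
    print(execute(0, mask, lambda x: x * 2, [1, 2, 3, 4]))
```

Keeping the scalar result in a single shared register mirrors the dependent claims in which a destination is written only once and later accessed by threads that did not execute the instruction.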

Assignees

Inventors

Classifications

  • Decoding the operand specifier, e.g. specifier format · CPC title

  • G06F9/3851 · Primary

    from multiple instruction streams, e.g. multistreaming · CPC title

  • G06F9/3836 · Primary

    Instruction issuing, e.g. dynamic instruction scheduling or out of order instruction execution · CPC title

  • controlled by a single instruction for multiple data lanes [SIMD] · CPC title

  • controlled by a single instruction for multiple threads [SIMT] in parallel · CPC title

Patent family

Related publications grouped by family.

External sources

Frequently asked questions

Answers are generated from the same data shown on this page.

Who is the assignee on this patent?
Krashinsky Ronny M, Nvidia Corp
What technology area does this patent fall under?
Primary CPC classification G06F9/3851. Mapped technology areas include Physics.
When was this patent published?
Publication date: Nov 28, 2017 (B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 2 related publications on this page (citations in our corpus or others sharing the same primary CPC).