Technique for computational nested parallelism

US9513975B2 · US · B2

Patent metadata
Publication number: US-9513975-B2
Application number: US-201213462649-A
Country: US
Kind code: B2
Filing date: May 2, 2012
Priority date: May 2, 2012
Publication date: Dec 6, 2016
Grant date: Dec 6, 2016

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

One embodiment of the present invention sets forth a technique for performing nested kernel execution within a parallel processing subsystem. The technique involves enabling a parent thread to launch a nested child grid on the parallel processing subsystem, and enabling the parent thread to perform a thread synchronization barrier on the child grid for proper execution semantics between the parent thread and the child grid. This technique advantageously enables the parallel processing subsystem to perform a richer set of programming constructs, such as conditionally executed and nested operations and externally defined library functions without the additional complexity of CPU involvement.
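The programming model the abstract describes can be illustrated with a small host-side sketch. This is only an analogy in Python using ordinary OS threads, not GPU code and not the patented implementation: `launch_grid`, `sync`, `parent_kernel`, and `child_kernel` are hypothetical names introduced here. Each parent thread conditionally launches its own child grid, waits on it at a barrier, and only then reads the child's results.

```python
import threading

def launch_grid(kernel, num_threads, *args):
    """Hypothetical helper: start one worker per thread index, return the grid handle."""
    workers = [threading.Thread(target=kernel, args=(i, *args))
               for i in range(num_threads)]
    for w in workers:
        w.start()
    return workers

def sync(grid):
    """Barrier: block the caller until every thread of the grid has finished."""
    for w in grid:
        w.join()

def child_kernel(tid, row, out, parent_tid):
    # Each child thread squares one element of the parent's row.
    out[parent_tid][tid] = row[tid] ** 2

def parent_kernel(tid, rows, out):
    row = rows[tid]
    if row:
        # Conditional nested launch: only non-empty rows spawn a child grid.
        out[tid] = [None] * len(row)
        child = launch_grid(child_kernel, len(row), row, out, tid)
        sync(child)                # parent waits on its own child grid
        out[tid] = sum(out[tid])   # child results are complete at this point
    else:
        out[tid] = 0

rows = [[1, 2, 3], [], [4, 5]]
out = [None] * len(rows)
sync(launch_grid(parent_kernel, len(rows), rows, out))
print(out)  # [14, 0, 41]
```

The outermost `launch_grid` call plays the role of the initial launch; every later launch and wait is issued by a parent thread itself, which is the point of the nested model.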

First claim

Opening claim text (preview).

The invention claimed is:

1. A computer-implemented method for executing a child thread grid that is associated with a parent thread within a parallel processor, the method comprising: receiving a first launch request from the parent thread for executing the child thread grid, wherein the parent thread executes within a first streaming multiprocessor within the parallel processor; launching the child thread grid within a second streaming multiprocessor within the parallel processor independently of a central processing unit coupled to the parallel processor by performing a memory barrier operation to flush all pending write data from the parent thread to memory in order to ensure memory consistency between the parent thread and the child thread grid; receiving a thread synchronization barrier request from the parent thread, wherein the parent thread is configured to block a first programming instruction of the parent thread corresponding to the thread synchronization barrier request from executing; suspending execution of the parent thread; receiving a notification that the child thread grid has completed executing; and causing the parent thread to resume executing.

2. The method of claim 1, wherein launching the child thread grid further comprises: transmitting the first launch request to a first task descriptor queue; and causing the child thread grid to begin executing within the second streaming multiprocessor based on the first launch request.

3. The method of claim 2, wherein causing the child thread grid to begin executing comprises: selecting the first launch request from the first task descriptor queue; loading the child thread grid into the second streaming multiprocessor; and initiating execution of the child thread grid at a predetermined instruction within the child thread grid.

4. The method of claim 2, wherein suspending execution of the parent thread comprises: saving execution state for the parent thread to a continuation buffer; and de-allocating computation resources in the first streaming multiprocessor associated with the parent thread.

5. The method of claim 4, wherein causing the parent thread to resume executing comprises: transmitting a second launch request for executing the parent thread to the first task descriptor queue; loading the parent thread within a third streaming multiprocessor based on the second launch request; invalidating one or more caches associated with the third streaming multiprocessor; restoring execution state for the parent thread from the continuation buffer into the third streaming multiprocessor; causing the parent thread to continue executing, at a second programming instruction following to the first programming instruction.

6. The method of claim 5, wherein transmitting the second launch request comprises transmitting a third launch request for executing a scheduler thread to a second task descriptor queue in response to the child thread grid completing; and causing the scheduler thread to execute, wherein the scheduler thread is configured to generate and transmit the second launch request.

7. The method of claim 5, wherein restoring execution state for the parent thread comprises executing a restoration program associated with the parent thread that reads the continuation buffer and restores execution state for the parent thread within the third streaming multiprocessor.

8. A parallel processing subsystem configured to execute a child thread grid that is associated with a parent thread within a parallel processor, the parallel processing subsystem comprising: a memory system configured to store a plurality of task descriptor queues and a plurality of continuation buffers; and an execution subsystem coupled to the memory system and configured to perform nested operations by: receiving a first launch request from the parent thread for executing the child thread grid, wherein the parent thread executes within a first streaming multiprocessor within the parallel processor; launching the child thread grid within a second streaming multiprocessor within the parallel processor independently of a central processing unit coupled to the parallel processor by performing a memory barrier operation to flush all pending write data from the parent thread to memory in order to ensure memory consistency between the parent thread and the child thread grid; receiving a thread synchronization barrier request from the parent thread, wherein the parent thread is configured to block a first programming instruction of the parent thread corresponding to the thread synchronization barrier request from executing; suspending execution of the parent thread; receiving a notification that the child thread grid has completed executing; and causing the parent thread to resume executing.

9. The parallel processing subsystem of claim 8, wherein to launch the child grid, the execution subsystem is further configured to: transmit the first launch request to a first task descriptor queue included in the plurality of task descriptor queues; and cause the child thread grid to begin executing within the second streaming multiprocessor based on the first launch request.

10. The parallel processing subsystem of claim 9, wherein to cause the child thread grid to begin executing, the execution subsystem is further configured to: select the first launch request from the first task descriptor queue; load the child thread grid into the second streaming multiprocessor; and initiate executing the child thread grid at a predetermined instruction within the child thread grid.

11. The parallel processing subsystem of claim 9, wherein to cause the child thread grid to begin executing, the execution subsystem is further configured to: save execution state for the parent thread to a continuation buffer residing within the plurality of continuation buffers; and de-allocate computation resources in the first streaming multiprocessor associated with the parent thread.

12. The parallel processing subsystem of claim 8, wherein to cause the parent thread to resume executing, the execution subsystem is further configured to: transmit a second launch request for executing the parent thread to the first task descriptor queue; load the parent thread within a third streaming multiprocessor within the parallel processor based on the second launch request; invalidate one or more caches associated with the third streaming multiprocessor; restore execution state for the parent thread from the continuation buffer into the third streaming multiprocessor; cause the parent thread to continue executing at a second programming instruction subsequent to the first programming instruction.

13. The parallel processing subsystem of claim 12, wherein to transmit the second launch request, the execution subsystem is further configured to: transmit a third launch request for executing a scheduler thread to a second task descriptor queue included in the plurality of task descriptor queues, in response to the child thread grid completing; and cause the scheduler thread to execute, wherein the scheduler thread is configured to generate and transmit the second launch request.

14. The parallel processing subsystem of claim 12, wherein to restore execution state for the parent thread, the execution subsystem is further configured to execute a restoration program associated with the parent thread that reads the continuation buffer and restores execution state for the parent thread within the third streaming multiprocessor.

15. The parallel processing subsystem of claim 8,
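The sequence in claims 1 through 5 (queue the launch, flush pending writes, suspend the parent into a continuation buffer, run the child, then re-queue and resume the parent) can be modeled as a toy single-threaded simulation. This is a sketch under loud assumptions, not the claimed hardware: a Python generator stands in for the parent's saved execution state, and `task_queue`, `continuations`, `memory`, `drive`, and `scheduler` are hypothetical names. Streaming multiprocessors, cache invalidation, and the separate scheduler thread of claim 6 are not modeled.

```python
from collections import deque

task_queue = deque()   # models a task descriptor queue
continuations = {}     # models continuation buffers, keyed by parent id
memory = {}            # models memory visible to both parent and child
log = []

def child_grid():
    # The child only sees what has been flushed to memory.
    memory["child_out"] = memory["input"] * 2

def parent_thread(pending_writes):
    pending_writes["input"] = 21                 # write not yet visible in memory
    yield ("launch", child_grid)                 # first launch request
    yield ("barrier",)                           # thread synchronization barrier request
    memory["result"] = memory["child_out"] + 1   # instruction after the barrier

def drive(pid, thread, pending_writes):
    """Run a parent until it blocks at a barrier or finishes."""
    for request in thread:
        if request[0] == "launch":
            memory.update(pending_writes)        # memory barrier: flush pending writes
            pending_writes.clear()
            task_queue.append(("child", pid, request[1]))
            log.append("child launch queued")
        elif request[0] == "barrier":
            continuations[pid] = (thread, pending_writes)  # save execution state
            log.append("parent suspended")
            return
    log.append("parent finished")

def scheduler():
    while task_queue:
        kind, pid, payload = task_queue.popleft()
        if kind == "child":
            payload()                                  # child grid executes
            log.append("child completed")
            task_queue.append(("resume", pid, None))   # second launch request
        else:
            thread, pending = continuations.pop(pid)   # restore execution state
            log.append("parent resumed")
            drive(pid, thread, pending)

pending = {}
drive(0, parent_thread(pending), pending)
scheduler()
print(memory["result"])  # 43
print(log)
```

Because the parent's state lives in `continuations` while it is suspended, nothing in the model ties the resumed parent to the resources it started on, which mirrors the claims' resumption on a possibly different streaming multiprocessor.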

Assignees

Inventors

Classifications

  • Multiproc · CPC title

  • Processor architectures; Processor configuration, e.g. pipelining · CPC title

  • G06F9/5027 · Primary

    the resource being a machine, e.g. CPUs, Servers, Terminals · CPC title

  • G06F9/522 · Primary

    Barrier synchronisation · CPC title

Patent family

Related publications grouped by family.

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US9513975B2 cover?
One embodiment of the present invention sets forth a technique for performing nested kernel execution within a parallel processing subsystem. The technique involves enabling a parent thread to launch a nested child grid on the parallel processing subsystem, and enabling the parent thread to perform a thread synchronization barrier on the child grid for proper execution semantics between the par…
Who is the assignee on this patent?
Jones Stephen, Cuadra Philip Alexander, Wexler Daniel Elliot, and 5 more
What technology area does this patent fall under?
Primary CPC classification G06F9/5027. Mapped technology areas include Physics.
When was this patent published?
Publication date Dec 6, 2016 (B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 8 related publications on this page (citations in our corpus or others sharing the same primary CPC).