What technology area does this patent fall under?

Primary CPC classification G06F8/45. Mapped technology areas include Physics.

When was this patent published?

Publication date Tue May 09 2017 00:00:00 GMT+0000 (Coordinated Universal Time) (B2). Legal status and post-grant events are not shown on this page.

What related patents are in patentsdb?

We list 8 related publications on this page (citations in our corpus or others sharing the same primary CPC).

Technique for grouping instructions into independent strands

US9645802B2 · US · B2

Patent metadata
Field	Value
Publication number	US-9645802-B2
Application number	US-201313961097-A
Country	US
Kind code	B2
Filing date	Aug 7, 2013
Priority date	Aug 7, 2013
Publication date	May 9, 2017
Grant date	May 9, 2017

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

Title
What the patent document calls the invention.
Abstract
A short plain-language summary of the technical disclosure.
Assignees and inventors
Who owns or filed the patent and who is credited as inventor.
Key dates
Filing, priority, publication, and grant dates set the timeline.
First independent claim
The legal scope of protection — read this for what is actually claimed.
CPC / IPC classifications
Technology tags used to group this patent with similar filings.
Citations and related patents
Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

A device compiler and linker is configured to group instructions into different strands for execution by different threads based on the dependence of those instructions on other, long-latency instructions. A thread may execute a strand that includes long-latency instructions, and then hardware resources previously allocated for the execution of that thread may be de-allocated from the thread and re-allocated to another thread. The other thread may then execute another strand while the long-latency instructions are in flight. With this approach, the other thread is not required to wait for the long-latency instructions to complete before acquiring hardware resources and initiating execution of the other strand, thereby eliminating at least a portion of the time that the other thread would otherwise spend waiting.

First claim

Opening claim text (preview).

The invention claimed is: 1. A computer-implemented method for compiling program code for execution on a processing unit, the method comprising: generating a weight value for each program instruction included in a basic block of the program code, wherein the weight value associated with a given program instruction reflects a number of long-latency program instructions upon which the given program instruction depends; grouping the program instructions included in the basic block into a first strand and a second strand based on the weight values generated for the program instructions, wherein the first strand includes a first set of program instructions, each program instruction included in the first set of program instructions has a first weight value, and the second strand includes a second set of program instructions, and each program instruction included in the second set of program instructions has a second weight value; causing a first thread to process the first strand; and causing a second thread to process the second strand. 2. The computer-implemented method of claim 1 , further comprising inserting a first yield instruction into the first strand, wherein the first thread is configured to process the first strand by: acquiring a hardware resource associated with a processing core included in the processing unit; executing the first set of program instructions on the processing core; and executing the first yield instruction on the processing core to release the hardware resource associated with the processing core. 3. The computer-implemented method of claim 2 , further comprising inserting a second yield instruction into the second strand, wherein the second thread is configured to process the second strand by: acquiring the hardware resource associated with the processing core in response to the first thread executing the first yield instruction; executing the second set of program instructions on the processing core; and executing the second yield instruction on the processing core to release the hardware resource associated with the processing core. 4. The computer-implemented method of claim 1 , wherein generating a weight value for each program instruction included in the basic block comprises: generating a graph of the program instructions included in the basic block that includes a different node for each program instruction in the basic block and a different edge for each dependency between at least two different program instructions; initializing to a value of zero a plurality of edge weights, wherein each edge weight corresponds to a different edge in the graph; setting to a value of one each edge weight corresponding to an edge that originates at a node associated with a long-latency program instruction; and for a given program instruction included in the basic block, accumulating one or more edge weights along a set of edges that couple the node associated with the given program instruction to a root of the graph. 5. The computer-implemented method of claim 4 , wherein grouping the program instructions in the basic block into the first strand comprises: determining that N or fewer long-latency program instructions should be included in a given strand based on an amount of available memory bandwidth associated with a processing core included in the processing unit, wherein N is a positive integer; removing from the graph a first set of nodes, wherein each node in the first set of nodes is associated with a long-latency program instruction having the first weight value; and inserting the long-latency program instructions associated with the first set of nodes into the first strand. 6. The computer-implemented method of claim 5 , wherein grouping the program instructions in the basic block into the second strand comprises: removing from the graph each edge coupled to a node included in the first set of nodes; generating updated weight values for program instructions associated with a set of remaining nodes in the graph; removing from the graph a second set of nodes, wherein each node included in the second set of nodes is associated with a long-latency program instruction having the second weight value; and inserting the long-latency program instructions associated with the second set of nodes into the second strand. 7. The computer-implemented method of claim 6 , wherein generating the updated weight value for a given program instruction comprises accumulating one or more edge weights along a set of remaining edges in the graph that couple a node included in the set of remaining nodes and associated with the given program instruction to the root of the graph. 8. The computer-implemented method of claim 5 , further comprising executing a portion of the program code on the processing core to determine the amount of available memory bandwidth associated with the processing core. 9. A non-transitory computer-readable medium storing program instructions that, when executed by a processing unit, cause the processing unit to compile program code for execution on a processing unit by performing the steps of: generating a weight value for each program instruction included in a basic block of the program code, wherein the weight value associated with a given program instruction reflects a number of long-latency program instructions upon which the given program instruction depends; grouping the program instructions included in the basic block into a first strand and a second strand based on the weight values generated for the program instructions, wherein the first strand includes a first set of program instructions, each program instruction included in the first set of program instructions has a first weight value, and the second strand includes a second set of program instructions, and each program instruction included in the second set of program instructions has a second weight value; causing a first thread to process the first strand; and causing a second thread to process the second strand. 10. The non-transitory computer-readable medium of claim 9 , further comprising the step of inserting a first yield instruction into the first strand, wherein the first thread is configured to process the first strand by: acquiring a hardware resource associated with a processing core included in the processing unit; executing the first set of program instructions on the processing core; and executing the first yield instruction on the processing core to release the hardware resource associated with the processing core. 11. The non-transitory computer-readable medium of claim 10 , further comprising the step of inserting a second yield instruction into the second strand, wherein the second thread is configured to process the second strand by: acquiring the hardware resource associated with the processing core in response to the first thread executing the first yield instruction; executing the second set of program instructions on the processing core; and executing the second yield instruction on the processing core to release the hardware resource associated with the processing core. 12. The non-transitory computer-readable medium of claim 9 , wherein the step of generating a weight value for each program instruction included in the basic block comprises: generating a graph of the program instructions included in the basic block that includes a different node for each program instruction in the basic block and a different edge for each dependency between at least two different program instructions; initializing to a value of zero a plurality of edge weights, wherein each edge weight corresponds to a different edge in the graph; settin

Assignees

Nvidia Corp

Inventors

Classifications

G06F8/45Primary
Exploiting coarse grain parallelism in compilation, i.e. parallelism between groups of instructions · CPC title
G06F9/3851
from multiple instruction streams, e.g. multistreaming · CPC title
G06F8/433
Dependency analysis; Data or control flow analysis · CPC title
G06F8/41
Compilation · CPC title
G06F9/3888
controlled by a single instruction for multiple threads [SIMT] in parallel · CPC title

Patent family

Related publications grouped by family.

View patent family 52449649

External sources

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US9645802B2 cover?: A device compiler and linker is configured to group instructions into different strands for execution by different threads based on the dependence of those instructions on other, long-latency instructions. A thread may execute a strand that includes long-latency instructions, and then hardware resources previously allocated for the execution of that thread may be de-allocated from the thread an…
Who is the assignee on this patent?: Nvidia Corp
What technology area does this patent fall under?: Primary CPC classification G06F8/45. Mapped technology areas include Physics.
When was this patent published?: Publication date Tue May 09 2017 00:00:00 GMT+0000 (Coordinated Universal Time) (B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?: We list 8 related publications on this page (citations in our corpus or others sharing the same primary CPC).