What technology area does this patent fall under?

Primary CPC classification G06F8/41. Mapped technology areas include Physics.

When was this patent published?

Publication date: Dec 1, 2020 (kind code B2). Legal status and post-grant events are not shown on this page.

What related patents are in patentsdb?

We list 7 related publications on this page (citations in our corpus or others sharing the same primary CPC).

Device profiling in GPU accelerators by using host-device coordination

US10853044B2 · US · B2

Patent metadata
Field	Value
Publication number	US-10853044-B2
Application number	US-201816154560-A
Country	US
Kind code	B2
Filing date	Oct 8, 2018
Priority date	Oct 6, 2017
Publication date	Dec 1, 2020
Grant date	Dec 1, 2020

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

Title
What the patent document calls the invention.
Abstract
A short plain-language summary of the technical disclosure.
Assignees and inventors
Who owns or filed the patent and who is credited as inventor.
Key dates
Filing, priority, publication, and grant dates set the timeline.
First independent claim
The legal scope of protection — read this for what is actually claimed.
CPC / IPC classifications
Technology tags used to group this patent with similar filings.
Citations and related patents
Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

System and method of compiling a program having a mixture of host code and device code to enable Profile Guided Optimization (PGO) for device code execution. An exemplary integrated compiler can compile source code programmed to be executed by a host processor (e.g., CPU) and a co-processor (e.g., a GPU) concurrently. The compilation can generate an instrumented executable code which includes: profile instrumentation counters for the device functions; and instructions for the host processor to allocate and initialize device memory for the counters and to retrieve collected profile information from the device memory to generate instrumentation output. The output is fed back to the compiler for compiling the source code a second time to generate optimized executable code for the device functions defined in the source code.
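The abstract's instrument, run, and recompile loop can be illustrated with a small simulation. This is a minimal sketch in plain Python, not CUDA and not the patented implementation: `FakeDevice`, `host_init_counters`, `instrumented_kernel`, `host_collect`, and `hottest_branch` are hypothetical names, and the "device memory" is just a dictionary. It assumes one counter per branch of a single device function.

```python
from dataclasses import dataclass, field


@dataclass
class FakeDevice:
    """Stand-in for co-processor memory (a real build would use GPU memory)."""
    memory: dict = field(default_factory=dict)


def host_init_counters(device, name, count):
    # Host: allocate device memory for the profile counters and zero it.
    device.memory[name] = [0] * count


def instrumented_kernel(device, name, data):
    # Device: instrumented code bumps one counter per branch taken.
    counters = device.memory[name]
    for x in data:  # each element stands in for one GPU thread
        if x % 2 == 0:
            counters[0] += 1  # "even" branch
        else:
            counters[1] += 1  # "odd" branch


def host_collect(device, name):
    # Host: copy the counters back from device memory after the kernel ends.
    even, odd = device.memory[name]
    return {"even_branch": even, "odd_branch": odd}


def profile_run(data):
    """First build: run the instrumented program and return its profile."""
    device = FakeDevice()
    host_init_counters(device, "kernel_counters", 2)
    instrumented_kernel(device, "kernel_counters", data)
    return host_collect(device, "kernel_counters")


def hottest_branch(profile):
    """Second build (stand-in): the compiler reads the profile to find hot paths."""
    return max(profile, key=profile.get)


profile = profile_run([1, 2, 3, 4, 6])
hot = hottest_branch(profile)
```

On real hardware many threads update the same counter at once, which is why the claims call for atomic increments; the single-threaded loop here sidesteps that concern.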

First claim

Opening claim text (preview).

What is claimed is:
1. A method comprising: compiling a program a first time, wherein the program is to be performed by a co-processor and a host processor, and the compiling the program the first time generates instrumented executable code, the instrumented executable code being operable to cause the host processor to initialize one or more profile counters for updates to be made during a performance of the program; causing the performance of the program by the co-processor and the host processor after compiling the program the first time and storing profile information associated with the program resulting from the performance, wherein at least a portion of the profile information is based, at least in part, on the one or more profile counters that reflect the updates; and compiling the program a second time after storing the profile information, wherein the compiling the program the second time results in the program being executable by the co-processor and the host processor according to the profile information.
2. The method of claim 1, wherein the host processor is a Central Processing Unit (CPU) and the co-processor is a Graphics Processing Unit (GPU).
3. The method of claim 1, wherein the compiling the program the first time and the compiling the program the second time each comprise generating a representation of a Control Flow Graph (CFG) for the program and constructing a Minimum Spanning Tree (MST) of the Control Flow Graph (CFG).
4. The method of claim 3, wherein the constructing the MST of the CFG is for a function of the co-processor; and the method further comprises instrumenting edges of the MST with profile counters of the one or more profile counters that are configured to increment in atomic operations when the co-processor executes the instrumented executable code.
5. The method of claim 1, wherein the instrumented executable code is further operable when executed by the host processor to: cause the host processor to allocate a co-processor memory for the one or more profile counters.
6. The method of claim 5, wherein the one or more profile counters are associated with functions of a kernel, wherein the instrumented executable code is further operable to cause, after the host processor initializes the one or more profile counters, the host processor to invoke the kernel for execution by the co-processor.
7. The method of claim 6, wherein the instrumented executable code is operable to cause the host processor to copy the one or more profile counters from the co-processor memory to a host processor memory after execution completion of the kernel.
8. The method of claim 1, wherein the instrumented executable code is operable to cause the host processor to call a library to write the profile information into a file.
9. The method of claim 1, wherein the compiling the program the first time comprises performing a set of separate compilations for multiple portions of source code of the program, wherein the performing a separate compilation comprises: inserting instrumentation code for a portion of the source code in a separate compilation; and generating an initialized constant variable for the separate compilation, wherein the initialized constant variable comprises a partial function call list associated with the separate compilation.
10. The method of claim 9, the compiling the program the first time further comprises linking the instrumented code resulting from the set of separate compilations to generate the instrumented executable code, and wherein the linking comprises: generating a combined call list from partial function call lists; and generating a representation of a combined Call Graph comprising partial call graphs associated with the multiple portions of the source code respectively.
11. The method of claim 9, wherein the performing the separate compilation further comprises: sending instrumentation information for the portion from a co-processor compiler to a host-processor compiler; and declaring mirrors for counters at the host-processor compiler.
12. The method of claim 4, wherein the compiling the program the second time comprises: setting values of profile counters for the edges in the MST; populating profile counters of edges and basic blocks of the function using instrumented counts; and during the compiling the program the second time, querying the profile information to obtain counts for the edges and the basic blocks of the function.
13. A system comprising: at least one processor; and at least one memory coupled to the at least one processor and storing instructions that, when executed by the at least one processor, cause the system to perform a method comprising: compiling a program a first time, wherein the program is to be performed by a co-processor and a host processor, and the compiling the program the first time generates instrumented executable code, the instrumented executable code being operable to cause the host processor to as part of a performance of the program: initialize one or more profile counters corresponding to a kernel for updates to be made during the performance of the program; and invoke the kernel for execution by the co-processor after initializing the one or more profile counters; causing the performance of the program by the co-processor after compiling the program the first time and storing profile information associated with the program resulting from the performance, wherein at least a portion of the profile information is based, at least in part, on the one or more profile counters that reflect the updates; and compiling the program a second time after storing the profile information, wherein the compiling the program the second time results in the program being executable by the co-processor and the host processor according to the profile information.
14. The system of claim 13, wherein the compiling the program the first time and the compiling the program the second time each comprise generating a representation of a Control Flow Graph (CFG) for the source code and generating a Minimum Spanning Tree (MST) of the CFG for a function of the co-processor, the generating the MST including instrumenting edges of the MST with profile counters that are configured to increment in atomic operations.
15. The system of claim 13, wherein the instrumented executable code is operable when executed by the host processor to cause the host processor to allocate a co-processor memory for the one or more profile counters, and wherein the instrumented executable code when executed by the co-processor is operable to cause the co-processor to update one or more the profile counters during execution of the kernel.
16. The system of claim 15, wherein the instrumented executable code is operable to cause the host processor to copy the profile counters from said the co-processor memory to a host processor memory after execution completion of the kernel.
17. The system of claim 13, wherein the compiling the program the first time comprises performing a set of separate compilations for multiple portions of source code of the program, wherein performing a separate compilation comprises: inserting instrumentation code for a portion of the source code in a separate compilation; and generating an initialized constant variable for the separate compilation, wherein the initialized constant variable comprises a partial function call list associated with the separate compilation.
18. The system of claim 17, wherein the compiling the program the first time further comprises linking
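Claims 3, 4, and 12 pair a spanning tree of the control flow graph with profile counters and a later step that populates counts for all edges. The sketch below shows one common way such a scheme is used in profile-guided compilers, and it is an assumption about the general technique, not the patent's stated method: counters are placed only on the edges left out of a maximum-weight spanning tree, and the remaining counts are derived from flow conservation (what flows into a block flows out). The CFG, the weights, and the function names are all hypothetical, and a virtual `exit` to `entry` edge is added so conservation holds at every node.

```python
def spanning_tree(nodes, edges, weight):
    """Kruskal on the undirected view, heaviest edges first."""
    parent = {n: n for n in nodes}

    def find(n):
        while parent[n] != n:
            parent[n] = parent[parent[n]]  # path halving
            n = parent[n]
        return n

    tree = set()
    for edge in sorted(edges, key=lambda e: -weight[e]):
        a, b = find(edge[0]), find(edge[1])
        if a != b:
            parent[a] = b
            tree.add(edge)
    return tree


def run_and_count(trace, instrumented):
    """Simulate one run; only instrumented edges have counters."""
    counters = {e: 0 for e in instrumented}
    # Consecutive blocks in the trace, plus the virtual exit -> entry edge.
    steps = list(zip(trace, trace[1:])) + [(trace[-1], trace[0])]
    for edge in steps:
        if edge in counters:
            counters[edge] += 1
    return counters


def reconstruct(nodes, edges, measured):
    """Derive the uninstrumented edge counts from flow conservation."""
    counts = dict(measured)
    changed = True
    while changed:
        changed = False
        for n in nodes:
            ins = [e for e in edges if e[1] == n]
            outs = [e for e in edges if e[0] == n]
            unknown = [e for e in ins + outs if e not in counts]
            if len(unknown) != 1:
                continue  # nothing to solve here yet
            in_sum = sum(counts[e] for e in ins if e in counts)
            out_sum = sum(counts[e] for e in outs if e in counts)
            edge = unknown[0]
            counts[edge] = out_sum - in_sum if edge in ins else in_sum - out_sum
            changed = True
    return counts


# Hypothetical CFG: a loop over A with a two-way branch (B or C) joining at D.
NODES = ["entry", "A", "B", "C", "D", "exit"]
EDGES = [("entry", "A"), ("A", "B"), ("A", "C"), ("B", "D"),
         ("C", "D"), ("D", "A"), ("D", "exit"), ("exit", "entry")]
WEIGHT = dict(zip(EDGES, [1, 10, 2, 10, 2, 9, 1, 1]))  # assumed estimates

tree = spanning_tree(NODES, EDGES, WEIGHT)
instrumented = set(EDGES) - tree
trace = ["entry", "A", "B", "D", "A", "C", "D", "A", "B", "D", "exit"]
measured = run_and_count(trace, instrumented)
counts = reconstruct(NODES, EDGES, measured)
```

In this example three of the eight edges carry counters, yet `counts` ends up holding a value for every edge. The sketch is single-threaded and does not handle self-loops; device code would need atomic increments, as the claims note.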

Assignees

Nvidia Corp

Inventors

Classifications

G06F8/41 · Primary
Compilation · CPC title
G06F8/443
Optimisation · CPC title
G06F16/9024
Graphs; Linked lists (G06F16/9027 takes precedence) · CPC title
G06F11/3624
by performing operations on the source code, e.g. via a compiler · CPC title
G06F8/458
Synchronisation, e.g. post-wait, barriers, locks (synchronisation among tasks G06F9/52) · CPC title

Patent family

Related publications grouped by family.

View patent family 65993196

External sources

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US10853044B2 cover?: System and method of compiling a program having a mixture of host code and device code to enable Profile Guided Optimization (PGO) for device code execution. An exemplary integrated compiler can compile source code programmed to be executed by a host processor (e.g., CPU) and a co-processor (e.g., a GPU) concurrently. The compilation can generate an instrumented executable code which includes: …
Who is the assignee on this patent?: Nvidia Corp

Related patents

Automated software program repair

Framework for user-directed profile-driven optimizations

Multiphased profile guided optimization

Profile guided optimization in the presence of stale profile data

Collecting profile data for modified global variables

Automated adaptive compiler optimization

Efficient implementation of RSA using GPU/CPU architecture
