US9535815B2 · US · B2
| Field | Value |
|---|---|
| Publication number | US-9535815-B2 |
| Application number | US-201414296311-A |
| Country | US |
| Kind code | B2 |
| Filing date | Jun 4, 2014 |
| Priority date | Jun 4, 2014 |
| Publication date | Jan 3, 2017 |
| Grant date | Jan 3, 2017 |
Official abstract text for this publication.
A system, method, and computer program product are provided for collecting trace information based on a computational workload. The method includes the steps of compiling source code to generate a program, launching a workload to be executed by the parallel processing unit, collecting one or more records of trace information associated with a plurality of threads configured to execute the program, and correlating the one or more records to one or more corresponding instructions included in the source code. Each record in the one or more records includes at least a value of a program counter and a scheduler state of the thread.
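As a rough illustration of the record described in the abstract and claims (a program counter value, a thread block identifier, and a stall vector with one bit per stall reason), the following Python sketch models one record and decodes its stall vector. The names `TraceRecord`, `STALL_BITS`, and `stall_reasons`, and the two stall-reason labels, are assumptions made for this example; the source only requires at least two bits, each representing a different stall reason, and does not name them here.

```python
from dataclasses import dataclass

# Hypothetical bit assignments: the source only states that each bit
# of the stall vector represents a different stall reason.
STALL_BITS = {
    0: "memory_dependency",  # assumed label for bit 0
    1: "sync_barrier",       # assumed label for bit 1
}


@dataclass(frozen=True)
class TraceRecord:
    program_counter: int  # value of the program counter when sampled
    thread_block_id: int  # identifier of the sampled thread block
    stall_vector: int     # scheduler state: one bit per stall reason


def stall_reasons(record):
    """Return the labels of every stall bit set in the record."""
    return [
        name
        for bit, name in sorted(STALL_BITS.items())
        if (record.stall_vector >> bit) & 1
    ]
```

For instance, a record whose stall vector is `0b10` decodes to the single assumed reason `sync_barrier`, while a vector of `0` means the thread block was not stalled for any tracked reason.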
Opening claim text (preview).
What is claimed is:
1. A method comprising: compiling source code to generate a program; launching a workload to be executed by a parallel processing unit, wherein the workload includes one or more tasks to be executed by the parallel processing unit, and at least one task of the one or more tasks executes a thread block configured to execute the program; collecting one or more records of trace information associated with a plurality of threads configured to execute the program; and correlating the one or more records to one or more corresponding instructions included in the source code, wherein each record in the one or more records includes a value of a program counter, a thread block identifier, and a scheduler state that comprises a stall vector having at least two bits, each bit in the at least two bits representing a different reason for a thread block to be stalled.
2. The method of claim 1, wherein each record is associated with a thread block comprising a plurality of related threads in a single-instruction, multiple-thread (SIMT) architecture.
3. The method of claim 1, further comprising allocating an event buffer in a memory to store the one or more records.
4. The method of claim 1, further comprising enabling a replay mechanism prior to launching the workload.
5. The method of claim 4, further comprising: determining that the workload should be executed one or more additional times to generate additional records; and replaying an Application Programming Interface (API) stream captured by the replay mechanism in order to re-launch the workload on the parallel processing unit.
6. The method of claim 1, further comprising generating a table that associates each instruction in the program with a corresponding instruction in the source code.
7. The method of claim 6, wherein correlating the one or more records to the one or more corresponding instructions included in the source code comprises: mapping the value in the record to a corresponding instruction in the program; and looking up the corresponding instruction in the table to determine an associated instruction in the source code.
8. The method of claim 7, wherein mapping the value in the record to the corresponding instruction in the program comprises determining an offset between the value and a base address of a location where the program is stored in a memory.
9. The method of claim 1, wherein the one or more records are generated by a trace cell coupled to a scheduler unit configured to maintain thread state information for a plurality of thread blocks.
10. The method of claim 9, wherein the trace cell comprises a buffer configured to temporarily store one or more records and logic for collecting trace information from the scheduler unit.
11. The method of claim 9, wherein the trace cell may be programmed to sample trace information at variable frequencies.
12. The method of claim 9, wherein the trace cell may be programmed to sample trace information based on one or more events.
13. The method of claim 12, wherein the one or more events comprise at least one of a cache miss, a function call, and execution of a branch instruction.
14. A non-transitory computer-readable storage medium storing instructions that, when executed by a processor, cause the processor to perform steps comprising: compiling source code to generate a program; launching a workload to be executed by the parallel processing unit, wherein the workload includes one or more tasks to be executed by the parallel processing unit, and at least one task of the one or more tasks executes a thread block configured to execute the program; collecting one or more records of trace information associated with a plurality of threads configured to execute the program; and correlating the one or more records to one or more corresponding instructions included in the source code, wherein each record in the one or more records includes a value of a program counter, a thread block identifier, and a scheduler state that comprises a stall vector having at least two bits, each bit in the at least two bits representing a different reason for a thread block to be stalled.
15. The non-transitory computer-readable storage medium of claim 14, wherein each record is associated with a thread block comprising a plurality of related threads in a single-instruction, multiple-thread (SIMT) architecture.
16. A system comprising: a hardware parallel processing unit; a scheduler unit configured to manage execution of a plurality of thread blocks; and a trace cell configured to generate one or more records of trace information associated with the plurality of thread blocks, wherein each record in the one or more records includes a value of a program counter, a thread block identifier, and a scheduler state that comprises a stall vector having at least two bits, each bit in the at least two bits representing a different reason for a thread block to be stalled.
17. The system of claim 16, the system further comprising a host processor configured to execute a development platform configured to: compile a source code to generate a program; transmit the program to the parallel processing unit; launch a workload to be executed by the parallel processing unit, wherein the workload includes one or more tasks to be executed by the parallel processing unit, and at least one task of the one or more tasks executes a thread block configured to execute the program; collect the one or more records; and correlate the one or more records to one or more corresponding instructions included in the source code.
18. The system of claim 17, wherein each record is associated with a thread block comprising a plurality of related threads in a single-instruction, multiple-thread (SIMT) architecture.
19. The system of claim 17, further comprising a driver configured to generate a table that associates each instruction in the program with a corresponding instruction in the source code.
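The correlation step in claims 6 to 8 (take the offset between a sampled program counter and the program's base address, map it to a program instruction, then look that instruction up in a table to find the source instruction) can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the patented implementation: the function name `correlate`, the fixed `instruction_size` of 8 bytes, the table keyed by instruction index, and the `kernel.cu` source labels in the usage note are all hypothetical.

```python
def correlate(program_counters, base_address, line_table, instruction_size=8):
    """Count sampled program counters per source location.

    program_counters: iterable of sampled program counter values.
    base_address: address where the program is stored in memory.
    line_table: dict mapping a program instruction index to a source
        location (hypothetical layout for the table in claim 6).
    instruction_size: assumed fixed instruction width in bytes.
    """
    counts = {}
    for pc in program_counters:
        # Claim 8: offset between the sampled value and the base address.
        offset = pc - base_address
        # Claim 7: map the offset to an instruction in the program.
        index = offset // instruction_size
        # Claim 7: look up the associated source instruction in the table.
        source_location = line_table.get(index)
        if source_location is None:
            continue  # sample falls outside the known program range
        counts[source_location] = counts.get(source_location, 0) + 1
    return counts
```

With a base address of `0x1000` and a table `{0: "kernel.cu:10", 1: "kernel.cu:11", 2: "kernel.cu:11"}`, the samples `0x1000, 0x1008, 0x1010, 0x1008` attribute one sample to `kernel.cu:10` and three to `kernel.cu:11`, which is the kind of per-source-line tally a profiler front end could display.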
by tracing the execution of the program · CPC title
Saving or restoring of program or task context · CPC title
the resource being the memory · CPC title