Local memory sharing between kernels
US-2020293367-A1 · Sep 17, 2020 · US
US11288765B2 · US · B2
| Field | Value |
|---|---|
| Publication number | US-11288765-B2 |
| Application number | US-202016861049-A |
| Country | US |
| Kind code | B2 |
| Filing date | Apr 28, 2020 |
| Priority date | Apr 28, 2020 |
| Publication date | Mar 29, 2022 |
| Grant date | Mar 29, 2022 |
Official abstract text for this publication.
Methods for graphics processing are provided. One example method includes executing a plurality of kernels using a plurality of graphics processing units (GPUs), wherein responsibility for executing a corresponding kernel is divided into one or more portions each of which being assigned to a corresponding GPU. The method includes generating a plurality of dependency data at a first kernel as each of a first plurality of portions of the first kernel completes processing. The method includes checking dependency data from one or more portions of the first kernel prior to execution of a portion of a second kernel. The method includes delaying execution of the portion of the second kernel as long as the corresponding dependency data of the first kernel has not been met.
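The dependency scheme described in the abstract can be illustrated with a small single-threaded simulation. This is only a sketch under stated assumptions, not the patented implementation: the names `DependencyTable`, `run_schedule`, `first_order`, and `second_needs` are hypothetical, GPUs are not modeled, and "dependency data" is reduced to one completion flag per portion of the first kernel.

```python
from dataclasses import dataclass, field


@dataclass
class DependencyTable:
    """Hypothetical store of completion flags, one per portion of the first kernel."""
    done: set = field(default_factory=set)

    def mark_done(self, portion: int) -> None:
        # Dependency data is generated as each portion completes processing.
        self.done.add(portion)

    def satisfied(self, needed: set) -> bool:
        # A second-kernel portion may run only if every portion it reads is done.
        return needed <= self.done


def run_schedule(first_order, second_needs):
    """Simulate interleaved execution of two kernels.

    first_order:  order in which portions of the first kernel complete.
    second_needs: maps each second-kernel portion to the set of
                  first-kernel portions it depends on (an assumption).
    Returns a log of ("k1" | "k2", portion) events in execution order.
    """
    deps = DependencyTable()
    pending = dict(second_needs)
    log = []
    for portion in first_order:
        deps.mark_done(portion)
        log.append(("k1", portion))
        # Check dependency data before launching each delayed k2 portion.
        for q in sorted(pending):
            if deps.satisfied(pending[q]):
                log.append(("k2", q))
                del pending[q]
    return log


if __name__ == "__main__":
    events = run_schedule([0, 1, 2, 3], {0: {0, 1}, 1: {1, 2}, 2: {2, 3}})
    for event in events:
        print(event)
```

In this sketch, portion 0 of the second kernel starts as soon as portions 0 and 1 of the first kernel are complete, before the first kernel as a whole has finished, which mirrors the early-start behavior the abstract describes.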
Opening claim text (preview).
What is claimed is:

1. A method for graphics processing, comprising: executing a plurality of kernels using a plurality of graphics processing units (GPUs), wherein responsibility for executing a corresponding kernel of the plurality of kernels is divided between one or more portions of the corresponding kernel each of which being assigned to a corresponding GPU of the plurality of GPUs; generating a plurality of dependency data at a first kernel as each of a first plurality of portions of the first kernel completes processing; checking first dependency data from one or more portions of the first kernel prior to execution of a portion of a second kernel; and delaying the execution of the portion of the second kernel as long as the first dependency data from the one or more portions of the first kernel has not been met, wherein the first dependency data from the one or more portions of the first kernel indicates whether the first kernel has finished executing the one or more portions of the first kernel, wherein the execution of the portion of the second kernel begins before the first plurality of portions of the first kernel has finished processing, wherein the one or more portions of the first kernel includes less portions than the first plurality of portions of the first kernel.
2. The method of claim 1, wherein dependency data generated by a portion of the first kernel indicates completion of one or more writes to one or more regions of a resource.
3. The method of claim 2, wherein a region corresponds to a subset of the resource, wherein the subset of the resource includes a tile of an image or a buffer range.
4. The method of claim 1, wherein the first dependency data from the one or more portions of the first kernel indicates completion of writing to a region of a resource.
5. The method of claim 4, wherein the first dependency data from the one or more portions of the first kernel is stored per portion, or wherein the first dependency data from the one or more portions of the first kernel is stored per region per portion.
6. The method of claim 1, wherein each portion of the first plurality of portions of the first kernel corresponds to index ranges of an index space defined by one or more dimensions, wherein index ranges of the each portion of the first plurality of portions of the first kernel may entirely span the index space or may span a subset of the index space in each of the one or more dimensions utilized by the first kernel.
7. The method of claim 6, wherein the first dependency data from the one or more portions of the first kernel is checked prior to the execution of the portion of the second kernel and is based on first index ranges for dimensions corresponding to the portion of the second kernel, the method including: checking second dependency data generated by a first portion of the first kernel defined by the first index ranges for the dimensions corresponding to the portion of the second kernel, or an offset thereof defining an offset index range, or checking third dependency data generated by multiple portions of the first kernel defined by second index ranges for dimensions that are, taken together, a superset of the first index ranges for the dimensions corresponding to the portion of the second kernel; or checking fourth dependency data generated by the one or more portions of the first kernel defined by third index ranges for dimensions derived from a function calculated using the first index ranges for the dimensions corresponding to the portion of the second kernel.
8. The method of claim 7, wherein if the offset index range, the superset of the first index ranges for the dimensions corresponding to the portion of the second kernel, or the third index ranges for the dimensions derived from the function calculated using the first index ranges for the dimensions corresponding to the portion of the second kernel is outside of the index space, then: the first dependency data that is checked prior to the execution of the portion of the second kernel is ignored, or the first dependency data that is checked prior to the execution of the portion of the second kernel is checked for a second portion of the first kernel corresponding to an index range that is clamped so that the second portion of the first kernel corresponding to the index range that is clamped is inside of the index space; or the first dependency data that is checked prior to the execution of the portion of the second kernel is checked for a third portion of the first kernel corresponding to an index range that is wrapped in the index space.
9. The method of claim 1, further comprising: executing a portion of the first kernel on a first GPU; and upon completion of execution of the portion of the first kernel by the first GPU, sending data generated by the portion of the first kernel to local memory of a second GPU.
10. The method of claim 1, further comprising: executing a portion of the first kernel on a first GPU; and prior to the execution of the portion of the second kernel by a second GPU, fetching into local memory of the second GPU data generated by the portion of the first kernel.
11. The method of claim 1, further comprising: fetching, via direct memory access (DMA), into local memory of a second GPU executing the portion of the second kernel, data generated by a portion of the first kernel executing on a first GPU and written to local memory of the first GPU.
12. The method of claim 11, further comprising: accessing, at the second GPU prior to the completion of the DMA, the data generated by the portion of the first kernel executing on the first GPU directly from the local memory of the first GPU by normal read operations; or accessing, at the second GPU after the completion of the DMA, the data generated by the portion of the first kernel executing on the first GPU from the local memory of the second GPU.
13. The method of claim 1, wherein the first dependency data from the one or more portions of the first kernel indicates completion of execution of a portion of the first kernel.
14. The method of claim 1, wherein responsibility for executing each portion of the first plurality of portions of the first kernel is assigned to one and only one GPU, wherein the first plurality of portions of the first kernel is statically assigned to the plurality of GPUs.
15. The method of claim 1, wherein responsibility for executing each portion of the first plurality of portions of the first kernel is assigned to one and only one GPU; and wherein the first plurality of portions of the first kernel is dynamically allocated to the plurality of GPUs as the first kernel is executed.
16. The method of claim 15, wherein allocation of the first plurality of portions of the first kernel to the plurality of GPUs references one or more predefined orders each of which is different for each GPU.
17. The method of claim 16, wherein a predefined order that is referenced is a space filling curve in dimensions of an index space of the first kernel.
18. The method of claim 15, further comprising: prefetching, based on a predefined order of the second kernel at a second GPU, into local memory of the second GPU data generated by the first kernel executing on a first GPU.
19. The method of claim 1, further comprising: wherein the plurality of GPUs share a common command buffer that may contain one or more kernel invocations, or one or more draw calls, or a combination of the one or more kernel in
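Claims 7 and 8 describe checking dependency data at an offset index range and three ways of handling a range that falls outside the index space: ignore, clamp, or wrap. The following one-dimensional sketch illustrates those three options. It is an assumption-laden illustration rather than the claimed method: the function name `resolve_dependency`, the `policy` strings, and the use of a single integer portion index in place of a multi-dimensional index range are all hypothetical.

```python
from typing import Optional


def resolve_dependency(index: int, offset: int, size: int, policy: str) -> Optional[int]:
    """Pick which first-kernel portion a second-kernel portion should check.

    index:  portion index of the second kernel (1-D simplification).
    offset: offset applied to find the producing portion.
    size:   number of portions in the index space, valid indices 0..size-1.
    policy: "ignore", "clamp", or "wrap" (hypothetical names).
    Returns the portion index to check, or None if the dependency is ignored.
    """
    target = index + offset
    if 0 <= target < size:
        return target  # inside the index space: check it directly
    if policy == "ignore":
        return None  # out of range: the dependency check is skipped
    if policy == "clamp":
        return max(0, min(size - 1, target))  # pull back to the nearest edge
    if policy == "wrap":
        return target % size  # wrap around within the index space
    raise ValueError(f"unknown policy: {policy}")


if __name__ == "__main__":
    for policy in ("ignore", "clamp", "wrap"):
        print(policy, resolve_dependency(0, -1, 4, policy))
```

With four portions, a dependency on portion 0 minus 1 is skipped under "ignore", resolves to portion 0 under "clamp", and resolves to portion 3 under "wrap".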
considering the load · CPC title
Program synchronisation; Mutual exclusion, e.g. by means of semaphores · CPC title
Scheduling strategies for dispatcher, e.g. round robin, multi-level priority queues · CPC title
General purpose rendering architectures · CPC title
Memory management · CPC title