Barrier synchronization with dynamic width calculation
US-2015052537-A1 · Feb 19, 2015 · US
US9916162B2 · US · B2
| Field | Value |
|---|---|
| Publication number | US-9916162-B2 |
| Application number | US-201414563601-A |
| Country | US |
| Kind code | B2 |
| Filing date | Dec 8, 2014 |
| Priority date | Dec 26, 2013 |
| Publication date | Mar 13, 2018 |
| Grant date | Mar 13, 2018 |
Official abstract text for this publication.
Methods and systems may synchronize workloads across local thread groups. The methods and systems may provide for receiving, at a graphics processor, a workload from a host processor and receiving, at a plurality of processing elements, a plurality of threads that form one or more local thread groups. Additionally, the processing of the workload may be synchronized across the one or more thread groups. In one example, the global barrier determines that all threads across the thread groups have been completed without polling.
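The "without polling" behavior described in the abstract can be illustrated in software. The following is a minimal sketch only, assuming Python threads as stand-ins for GPU processing elements; the patent describes a hardware mechanism, and the names `GlobalBarrier` and `run_workload` are hypothetical and do not come from the publication. Waiting threads sleep on a condition variable and are woken by the last arrival, rather than repeatedly checking a flag.

```python
import threading


class GlobalBarrier:
    """Hypothetical software stand-in for a global barrier.

    Uses an arrival counter plus a condition variable, so waiting
    threads block until notified instead of polling a flag.
    """

    def __init__(self, total_threads):
        self._total = total_threads
        self._arrived = 0
        self._generation = 0  # incremented each time the barrier opens
        self._cond = threading.Condition()

    def wait(self):
        with self._cond:
            generation = self._generation
            self._arrived += 1
            if self._arrived == self._total:
                # Last arrival: reset for reuse and wake every waiter.
                self._arrived = 0
                self._generation += 1
                self._cond.notify_all()
            else:
                # Sleep until the last arrival advances the generation.
                self._cond.wait_for(lambda: self._generation != generation)


def run_workload(groups=2, threads_per_group=4):
    """Run two phases separated by the barrier across all thread groups."""
    total = groups * threads_per_group
    barrier = GlobalBarrier(total)
    partial = [0] * total
    seen = [0] * total

    def worker(group_id, local_id):
        tid = group_id * threads_per_group + local_id
        partial[tid] = tid * tid   # phase 1: each thread writes its own slot
        barrier.wait()             # synchronize across every thread group
        seen[tid] = sum(partial)   # phase 2: safe to read all slots

    threads = [
        threading.Thread(target=worker, args=(g, l))
        for g in range(groups)
        for l in range(threads_per_group)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return seen
```

Calling `run_workload()` returns one value per thread. Every thread observes the same sum because no thread enters phase 2 until all threads in both groups have finished phase 1.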
Opening claim text (preview).
We claim:
1. A system comprising: one or more transceivers; a host processor in communication with the one or more transceivers; a system memory associated with the host processor; a processor, in communication with the system memory, to receive a workload from the host processor, wherein the workload is partitioned into a plurality of kernels each containing a thread group, the processor including: a first plurality of processing elements to receive and process a first thread group having a plurality of threads, wherein each of the plurality of processing elements processes at least one of the plurality of threads, a second plurality of processing elements to receive and process a second thread group having a plurality of threads, wherein each of the plurality of processing elements processes at least one of the plurality of threads, and a global barrier in communication with the first plurality of processing elements and the second plurality of processing elements, the global barrier to enable the workload to be partitioned into the plurality of kernels and to synchronize the processing of the workload across the first thread group and the second thread group, wherein the plurality of kernels are to be processed concurrently and in parallel.
2. The system of claim 1, wherein the first thread group and the second thread group form a global thread group.
3. The system of claim 1, wherein the global barrier determines that all threads across the first thread group and the second thread group have been completed.
4. The system of claim 3, wherein the determination is made without polling.
5. The system of claim 1, wherein the processor is a graphics processor.
6. An apparatus comprising: a graphics processor to receive a workload from a host processor, wherein the workload is partitioned into a plurality of kernels each containing a thread group, the graphics processor including: a first plurality of processing elements to receive and process a first thread group having a plurality of threads, wherein each of the plurality of processing elements processes at least one of the plurality of threads, a second plurality of processing elements to receive and process a second thread group having a plurality of threads, wherein each of the plurality of processing elements processes at least one of the plurality of threads, and a global barrier in communication with the first plurality of processing elements and the second plurality of processing elements, the global barrier to enable the workload to be partitioned into the plurality of kernels and to synchronize the processing of the workload across the first thread group and the second thread group, wherein the plurality of kernels are to be processed concurrently and in parallel.
7. The apparatus of claim 6, wherein the first thread group and the second thread group form a global thread group.
8. The apparatus of claim 6, wherein the global barrier determines that all threads across the first thread group and the second thread group have been completed.
9. A method comprising: receiving, at a graphics processor, a workload from a host processor, wherein the workload is partitioned into a plurality of kernels each containing a thread group; receiving, at a first plurality of processing elements, a first thread group having plurality of threads, wherein each of the plurality of processing elements processes at least one of the plurality of threads, receiving, at a second plurality of processing elements, a second thread group having plurality of threads, wherein each of the plurality of processing elements processes at least one of the plurality of threads, synchronizing, at a global barrier in communication with the first plurality of processing elements and the second plurality of processing elements, the processing of the workload across the first thread group and the second thread group, wherein the global barrier enables the workload to be partitioned into the plurality of kernels and the plurality of kernels are to be processed concurrently and in parallel.
10. The method of claim 9, wherein the first thread group and the second thread group form a global thread group.
11. The method of claim 9, wherein the global barrier determines that all threads across the first thread group and the second thread group have been completed.
12. At least one non-transitory computer readable storage medium comprising a set of instructions which, if executed by a graphics processor, cause a computer to: receive, at a processor, a workload from a host processor, wherein the workload is partitioned into a plurality of kernels each containing a thread group; receive, at a first plurality of processing elements, a first thread group having a plurality of threads, wherein each of the plurality of processing elements processes at least one of the plurality of threads, receive, at a second plurality of processing elements, a second thread group having a plurality of threads, wherein each of the plurality of processing elements processes at least one of the plurality of threads, synchronize, at a global barrier in communication with the first plurality of processing elements and the second plurality of processing elements, the processing of the workload across the first thread group and the second thread group, wherein the global barrier enables the workload to be partitioned into the plurality of kernels and the plurality of kernels are to be processed concurrently and in parallel.
13. The at least one non-transitory computer readable storage medium of claim 12, wherein the first thread group and the second thread group form a global thread group.
14. The at least one non-transitory computer readable storage medium of claim 12, wherein the global barrier determines that all threads across the first thread group and the second thread group have been completed.
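The arrangement recited in the independent claims (two thread groups, each on its own set of processing elements, synchronized by one global barrier) can be sketched as follows. This is an illustrative analogy under stated assumptions, not the claimed hardware: Python threads stand in for processing elements, the standard-library `threading.Barrier` stands in for both the group-local and the global barrier, and `run_two_groups` is a hypothetical name introduced here.

```python
import threading


def run_two_groups(data, group_size=4):
    """Square each element, reduce per thread group, then combine globally.

    Assumption: exactly two thread groups, mirroring the "first" and
    "second" thread groups of the claims.
    """
    n_groups = 2
    assert len(data) == n_groups * group_size

    # One local barrier per thread group, plus one barrier spanning both.
    local_barriers = [threading.Barrier(group_size) for _ in range(n_groups)]
    global_barrier = threading.Barrier(n_groups * group_size)

    squared = [0] * len(data)
    group_sums = [0] * n_groups
    totals = [None] * len(data)

    def thread_body(group_id, local_id):
        tid = group_id * group_size + local_id
        squared[tid] = data[tid] ** 2

        # Local synchronization: only this group's threads must arrive.
        local_barriers[group_id].wait()
        if local_id == 0:
            start = group_id * group_size
            group_sums[group_id] = sum(squared[start:start + group_size])

        # Global synchronization: both groups must arrive before any
        # thread reads the other group's partial result.
        global_barrier.wait()
        totals[tid] = sum(group_sums)

    threads = [
        threading.Thread(target=thread_body, args=(g, l))
        for g in range(n_groups)
        for l in range(group_size)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return group_sums, totals
```

For example, `run_two_groups(list(range(8)))` yields the per-group sums and, for each thread, the combined total. Without the global barrier, a thread in one group could read `group_sums` before the other group's leader had written its entry.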
Program initiating; Program switching, e.g. by interrupt · CPC title
from multiple instruction streams, e.g. multistreaming · CPC title
Barrier synchronisation · CPC title