Using a global barrier to synchronize across local thread groups in general purpose programming on GPU

US9916162B2 · US · B2

Patent metadata
FieldValue
Publication numberUS-9916162-B2
Application numberUS-201414563601-A
CountryUS
Kind codeB2
Filing dateDec 8, 2014
Priority dateDec 26, 2013
Publication dateMar 13, 2018
Grant dateMar 13, 2018

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

Methods and systems may synchronize workloads across local thread groups. The methods and systems may provide for receiving, at a graphics processor, a workload from a host processor and receiving, at a plurality of processing elements, a plurality of threads that from one or more local thread groups. Additionally, the processing of the workload may be synchronized across the one or more thread groups. In one example, the global barrier determines that all threads across the thread groups have been completed without polling.

First claim

Opening claim text (preview).

We claim: 1. A system comprising: one or more transceivers; a host processor in communication with the one or more transceivers; a system memory associated with the host processor; a processor, in communication with the system memory, to receive a workload from the host processor, wherein the workload is partitioned into a plurality of kernels each containing a thread group, the processor including: a first plurality of processing elements to receive and process a first thread group having a plurality of threads, wherein each of the plurality of processing elements processes at least one of the plurality of threads, a second plurality of processing elements to receive and process a second thread group having a plurality of threads, wherein each of the plurality of processing elements processes at least one of the plurality of threads, and a global barrier in communication with the first plurality of processing elements and the second plurality of processing elements, the global barrier to enable the workload to be partitioned into the plurality of kernels and to synchronize the processing of the workload across the first thread group and the second thread group, wherein the plurality of kernels are to be processed concurrently and in parallel. 2. The system of claim 1 , wherein the first thread group and the second thread group form a global thread group. 3. The system of claim 1 , wherein the global barrier determines that all threads across the first thread group and the second thread group have been completed. 4. The system of claim 3 , wherein the determination is made without polling. 5. The system of claim 1 , wherein the processor is a graphics processor. 6. An apparatus comprising: a graphics processor to receive a workload from a host processor, wherein the workload is partitioned into a plurality of kernels each containing a thread group, the graphics processor including: a first plurality of processing elements to receive and process a first thread group having a plurality of threads, wherein each of the plurality of processing elements processes at least one of the plurality of threads, a second plurality of processing elements to receive and process a second thread group having a plurality of threads, wherein each of the plurality of processing elements processes at least one of the plurality of threads, and a global barrier in communication with the first plurality of processing elements and the second plurality of processing elements, the global barrier to enable the workload to be partitioned into the plurality of kernels and to synchronize the processing of the workload across the first thread group and the second thread group, wherein the plurality of kernels are to be processed concurrently and in parallel. 7. The apparatus of claim 6 , wherein the first thread group and the second thread group form a global thread group. 8. The apparatus of claim 6 , wherein the global barrier determines that all threads across the first thread group and the second thread group have been completed. 9. A method comprising: receiving, at a graphics processor, a workload from a host processor, wherein the workload is partitioned into a plurality of kernels each containing a thread group; receiving, at a first plurality of processing elements, a first thread group having plurality of threads, wherein each of the plurality of processing elements processes at least one of the plurality of threads, receiving, at a second plurality of processing elements, a second thread group having plurality of threads, wherein each of the plurality of processing elements processes at least one of the plurality of threads, synchronizing, at a global barrier in communication with the first plurality of processing elements and the second plurality of processing elements, the processing of the workload across the first thread group and the second thread group, wherein the global barrier enables the workload to be partitioned into the plurality of kernels and the plurality of kernels are to be processed concurrently and in parallel. 10. The method of claim 9 , wherein the first thread group and the second thread group form a global thread group. 11. The method of claim 9 , wherein the global barrier determines that all threads across the first thread group and the second thread group have been completed. 12. At least one non-transitory computer readable storage medium comprising a set of instructions which, if executed by a graphics processor, cause a computer to: receive, at a processor, a workload from a host processor, wherein the workload is partitioned into a plurality of kernels each containing a thread group; receive, at a first plurality of processing elements, a first thread group having a plurality of threads, wherein each of the plurality of processing elements processes at least one of the plurality of threads, receive, at a second plurality of processing elements, a second thread group having a plurality of threads, wherein each of the plurality of processing elements processes at least one of the plurality of threads, synchronize, at a global barrier in communication with the first plurality of processing elements and the second plurality of processing elements, the processing of the workload across the first thread group and the second thread group, wherein the global barrier enables the workload to be partitioned into the plurality of kernels and the plurality of kernels are to be processed concurrently and in parallel. 13. The at least one non-transitory computer readable storage medium of claim 12 , wherein the first thread group and the second thread group form a global thread group. 14. The at least one non-transitory computer readable storage medium of claim 12 , wherein the global barrier determines that all threads across the first thread group and the second thread group have been completed.

Assignees

Inventors

Classifications

  • Program initiating; Program switching, e.g. by interrupt · CPC title

  • G06F9/3851Primary

    from multiple instruction streams, e.g. multistreaming · CPC title

  • Barrier synchronisation · CPC title

Patent family

Related publications grouped by family.

External sources

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US9916162B2 cover?
Methods and systems may synchronize workloads across local thread groups. The methods and systems may provide for receiving, at a graphics processor, a workload from a host processor and receiving, at a plurality of processing elements, a plurality of threads that from one or more local thread groups. Additionally, the processing of the workload may be synchronized across the one or more thread…
Who is the assignee on this patent?
Intel Corp
What technology area does this patent fall under?
Primary CPC classification G06F9/3851. Mapped technology areas include Physics.
When was this patent published?
Publication date Tue Mar 13 2018 00:00:00 GMT+0000 (Coordinated Universal Time) (B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 1 related publication on this page (citations in our corpus or others sharing the same primary CPC).