Apparatus and method for a high throughput parallel co-processor and interconnect with low offload latency

US10915328B2 · US · B2

Patent metadata
Publication number: US-10915328-B2
Application number: US-201816220528-A
Country: US
Kind code: B2
Filing date: Dec 14, 2018
Priority date: Dec 14, 2018
Publication date: Feb 9, 2021
Grant date: Feb 9, 2021

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

An apparatus and method for offloading iterative, parallel work to a data parallel cluster. For example, one embodiment of a processor comprises: a host processor to execute a primary thread; a data parallel cluster coupled to the host processor over a high speed interconnect, the data parallel cluster comprising a plurality of execution lanes to perform parallel execution of one or more secondary threads related to the primary thread; and a data parallel cluster controller integral to the host processor to offload processing of the one or more secondary threads to the data parallel cluster in response to one of the cores executing a parallel processing call instruction from the primary thread.
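
The offload flow summarized in the abstract can be pictured with a small software model. This is only an illustrative sketch, not the patented hardware: the names `OffloadRequest`, `Notification`, `DataParallelCluster`, and `parallel_call` are hypothetical, the strided split of iterations across lanes is an assumption made for the example, and a Python list stands in for the memory region shared with the host. The context identifier and iteration count mirror the initial execution values recited in claim 1.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class OffloadRequest:
    """Initial execution values sent to the cluster (hypothetical layout)."""
    context_id: int       # identifies the thread context
    loop_iterations: int  # number of loop iterations to perform


@dataclass
class Notification:
    """Message returned to the host when the secondary threads finish."""
    context_id: int
    result_offset: int    # stands in for a pointer into the shared region


class DataParallelCluster:
    """Toy model of a cluster with a fixed number of execution lanes."""

    def __init__(self, num_lanes: int, shared_memory: List[int]) -> None:
        self.num_lanes = num_lanes
        self.shared_memory = shared_memory  # region allocated by the host

    def run(self, req: OffloadRequest, body: Callable[[int], int],
            result_offset: int) -> Notification:
        # Assumed policy: lane k handles iterations k, k + num_lanes, ...
        # The lanes are simulated one after another here.
        for lane in range(self.num_lanes):
            for i in range(lane, req.loop_iterations, self.num_lanes):
                self.shared_memory[result_offset + i] = body(i)
        return Notification(req.context_id, result_offset)


def parallel_call(cluster: DataParallelCluster, context_id: int, n: int,
                  body: Callable[[int], int],
                  result_offset: int = 0) -> Notification:
    """Host-side stand-in for the parallel processing call instruction."""
    return cluster.run(OffloadRequest(context_id, n), body, result_offset)


# Primary thread: allocate the region, offload the loop, read the results.
memory = [0] * 8
cluster = DataParallelCluster(num_lanes=4, shared_memory=memory)
note = parallel_call(cluster, context_id=7, n=8, body=lambda i: i * i)
results = memory[note.result_offset:note.result_offset + 8]
print(results)
```

In this model the host only passes a context identifier and an iteration count, and it learns where the results live from the returned notification, which is the division of labor the abstract describes.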

First claim

Opening claim text (preview).

What is claimed is:

1. An apparatus comprising: a host processor comprising a plurality of cores, a first one or more of the plurality of cores to execute a primary thread comprising a sequence of instructions; a data parallel cluster coupled to the host processor over a high speed interconnect, the data parallel cluster comprising a plurality of execution lanes to perform parallel execution of one or more secondary threads related to the primary thread, the data parallel cluster including a scheduler to evaluate variables associated with the one or more secondary threads to schedule execution of the one or more secondary threads across the plurality of execution lanes; and a data parallel cluster controller integral to the host processor to offload processing of the one or more secondary threads to the data parallel cluster in response to one of the cores executing a parallel processing call instruction from the primary thread, wherein responsive to the parallel processing call instruction, the data parallel cluster controller is to transmit initial execution values to the data parallel cluster including a context identifier and a number of loop iterations to be performed during execution of the secondary threads on the execution lanes.

2. The apparatus of claim 1 wherein at least one of the execution lanes are to execute a parallel processing return instruction, wherein responsive to the parallel processing return instruction, the data parallel cluster is to transmit a notification to the host processor over the high speed interconnect.

3. The apparatus of claim 2 wherein the data parallel cluster further comprises a memory controller to access a designated region of system memory allocated by the host processor, the data parallel cluster to store results of the secondary thread in the designated region.

4. The apparatus of claim 3 wherein the notification to the host processor includes a pointer identifying a location of the results in the designated region.

5. The apparatus of claim 1 wherein the data parallel cluster controller is to transmit a pointer to the data parallel cluster over the high speed interconnect identifying a location in system memory from which the data parallel cluster is to fetch instructions of the secondary threads.

6. The apparatus of claim 1 wherein the secondary threads comprise microthreads comprising sequences of microoperations.

7. The apparatus of claim 1 wherein the secondary threads comprise sequences of macroinstructions, the data parallel cluster further comprising an instruction fetch unit to fetch the macroinstructions and a decoder to decode the macroinstructions into a plurality of microoperations.

8. The apparatus of claim 1 wherein parallel work resulting from the secondary threads is to be subdivided according to how the instructions of the secondary threads express parallel execution of a loop.

9. The apparatus of claim 1 wherein the secondary threads are ganged into fragments based on instruction pointer values to induce convergence.

10. The apparatus of claim 9 wherein a fragment comprises a collection of associated threads.

11. The apparatus of claim 9 wherein an order in which to execute the fragments is determined based on variables associated with each fragment.

12. A method comprising: executing a sequence of instructions of a primary thread on execution resources of a host processor; executing a parallel processing call executed on the execution resources and responsively offloading execution of one or more secondary threads to a data parallel cluster coupled to the host processor over a high speed interconnect; passing initialization values to the data parallel cluster including a thread context identifier and a number of loop iterations; scheduling execution of the secondary threads on a plurality of lanes of the data parallel cluster; executing the secondary threads on the data parallel cluster, implementing the loop iterations; and storing results of the execution of the secondary threads in a designated of region of memory configured by the host processor.

13. The method of claim 12 wherein at least one of the execution lanes are to execute a parallel processing return instruction, wherein responsive to the parallel processing return instruction, the data parallel cluster is to transmit a notification to the host processor over the high speed interconnect.

14. The method of claim 13 wherein the data parallel cluster further comprises a memory controller to access a designated region of system memory allocated by the host processor, the data parallel cluster to store results of the secondary thread in the designated region.

15. The method of claim 14 wherein the notification to the host processor includes a pointer identifying a location of the results in the designated region.

16. The method of claim 12 wherein a pointer to the data parallel cluster is to be transmitted over the high speed interconnect identifying a location in system memory from which the data parallel cluster is to fetch instructions of the secondary threads.

17. The method of claim 12 wherein the secondary threads comprise microthreads comprising sequences of microoperations.

18. The method of claim 12 wherein the secondary threads comprise sequences of macroinstructions, the data parallel cluster further comprising an instruction fetch unit to fetch the macroinstructions and a decoder to decode the macroinstructions into a plurality of microoperations.

19. The method of claim 12 wherein parallel work resulting from the secondary threads is subdivided according to how the instructions of the secondary threads express parallel execution of a loop.

20. The method of claim 12 wherein the secondary threads are ganged into fragments based on instruction pointer values to induce convergence.

21. The method of claim 20 wherein a fragment comprises a collection of associated threads.

22. The method of claim 20 wherein an order in which to execute the fragments is to be based on variables associated with each fragment.

23. A non-transitory machine-readable medium having program code stored thereon which, when executed by a machine, causes the machine to perform the operations of: executing a sequence of instructions of a primary thread on execution resources of a host processor; executing a parallel processing call executed on the execution resources and responsively offloading execution of one or more secondary threads to a data parallel cluster; passing initialization values to the data parallel cluster including a thread context identifier and a number of loop iterations; scheduling execution of the secondary threads on a plurality of lanes of the data parallel cluster coupled to the host processor over a high speed interconnect; executing the secondary threads on the data parallel cluster, implementing the loop iterations; and storing results of the execution of the secondary threads in a designated of region of memory configured by the host processor.

24. The non-transitory machine-readable medium of claim 23 wherein at least one of the execution lanes are to execute a parallel processing return instruction, wherein responsive to the parallel processing return instruction, the data parallel cluster is to transmit a notification to the host processor over the high speed interconnect.

25. The non-transitory machine-readable medium of claim 24 wherein the data parall
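
Claims 9 to 11 (and 20 to 22) describe ganging secondary threads into fragments by instruction pointer and then ordering the fragments. A minimal sketch of that idea follows. The `Thread` type, the function names, and the lowest-instruction-pointer-first ordering are all assumptions made for illustration; the claims say only that the order depends on variables associated with each fragment and do not name a policy.

```python
from collections import defaultdict
from typing import Dict, List, NamedTuple


class Thread(NamedTuple):
    thread_id: int
    ip: int  # current instruction pointer of this secondary thread


def gang_into_fragments(threads: List[Thread]) -> Dict[int, List[int]]:
    """Group thread ids by instruction pointer; each group is one fragment."""
    fragments: Dict[int, List[int]] = defaultdict(list)
    for t in threads:
        fragments[t.ip].append(t.thread_id)
    return dict(fragments)


def schedule_order(fragments: Dict[int, List[int]]) -> List[int]:
    """Return fragment keys in execution order.

    Assumed policy: run the fragment with the lowest instruction pointer
    first, so threads that are behind get a chance to catch up and merge
    with the fragments ahead of them.
    """
    return sorted(fragments)


threads = [
    Thread(0, 0x40), Thread(1, 0x48), Thread(2, 0x40),
    Thread(3, 0x48), Thread(4, 0x50),
]
fragments = gang_into_fragments(threads)
order = schedule_order(fragments)
for ip in order:
    print(hex(ip), fragments[ip])
```

Threads that share an instruction pointer can issue the same instruction across lanes together, which is why grouping on that value helps diverged threads converge again.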

Assignees

Intel Corp

Inventors

Classifications

  • controlled by a single instruction for multiple data lanes [SIMD]

  • controlled by a single instruction for multiple threads [SIMT] in parallel

  • Divergence aspects

  • from multiple instruction streams, e.g. multistreaming

  • using a secondary processor, e.g. coprocessor (peripheral processor G06F13/12)

Patent family

Related publications grouped by family.

External sources

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US10915328B2 cover?
An apparatus and method for offloading iterative, parallel work to a data parallel cluster. For example, one embodiment of a processor comprises: a host processor to execute a primary thread; a data parallel cluster coupled to the host processor over a high speed interconnect, the data parallel cluster comprising a plurality of execution lanes to perform parallel execution of one or more second…
Who is the assignee on this patent?
Intel Corp
What technology area does this patent fall under?
Primary CPC classification G06F13/4027. Mapped technology areas include Physics.
When was this patent published?
Publication date Feb 9, 2021 (B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 9 related publications on this page (citations in our corpus or others sharing the same primary CPC).