Architecture and method for data parallel single program multiple data (SPMD) execution

US10831505B2 · US · B2

Patent metadata
Publication number: US-10831505-B2
Application number: US-201816147692-A
Country: US
Kind code: B2
Filing date: Sep 29, 2018
Priority date: Sep 29, 2018
Publication date: Nov 10, 2020
Grant date: Nov 10, 2020

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

An apparatus and method for data parallel single program multiple data (SPMD) execution. For example, one embodiment of a processor comprises: instruction fetch circuitry to fetch instructions of one or more primary threads; a decoder to decode the instructions to generate uops; a data parallel cluster (DPC) to execute microthreads comprising a subset of the uops, the DPC further comprising: a plurality of execution lanes to perform parallel execution of the microthreads; an instruction decode queue (IDQ) to store the uops prior to execution; and a scheduler to evaluate the microthreads based on associated variables including instruction pointer (IP) values, the scheduler to gang microthreads into fragments for parallel execution on the execution lanes based on the evaluation.
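
The ganging step named in the abstract can be pictured with a small software sketch. This is only an illustration under stated assumptions: the patent describes hardware, and the names `Microthread` and `gang_into_fragments`, as well as the choice to cap a fragment at the number of lanes, are hypothetical and not taken from the patent text.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Microthread:
    tid: int  # hypothetical microthread identifier
    ip: int   # current instruction pointer (IP) value


def gang_into_fragments(microthreads, num_lanes):
    """Group microthreads that share an IP value into fragments.

    Returns a list of (ip, members) pairs ordered by ascending IP.
    """
    if num_lanes < 1:
        raise ValueError("num_lanes must be at least 1")

    by_ip = defaultdict(list)
    for mt in microthreads:
        by_ip[mt.ip].append(mt)

    fragments = []
    for ip in sorted(by_ip):
        group = by_ip[ip]
        # Assumption: a fragment holds at most one microthread per lane,
        # so larger groups are split into several fragments.
        for start in range(0, len(group), num_lanes):
            fragments.append((ip, group[start:start + num_lanes]))
    return fragments


if __name__ == "__main__":
    threads = [Microthread(0, 0x40), Microthread(1, 0x10),
               Microthread(2, 0x40), Microthread(3, 0x10)]
    for ip, members in gang_into_fragments(threads, num_lanes=4):
        print(hex(ip), [mt.tid for mt in members])
```

Microthreads 1 and 3 share IP 0x10 and microthreads 0 and 2 share IP 0x40, so the sketch produces two fragments, each of which could run in lockstep across the lanes.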

First claim

Opening claim text (preview).

What is claimed is:
1. A processor comprising: instruction fetch circuitry to fetch instructions of one or more primary threads; a decoder to decode the instructions to generate uops; and a data parallel cluster (DPC) to execute microthreads comprising a subset of the uops, the DPC further comprising: a plurality of execution lanes to perform parallel execution of the microthreads; an instruction decode queue (IDQ) to store the uops prior to execution; and a scheduler to evaluate the microthreads based on associated variables including instruction pointer (IP) values, the scheduler to gang microthreads into fragments for parallel execution on the execution lanes based on the evaluation, wherein execution of at least one fragment during the parallel execution is stalled for a number of cycles at one or more reconvergence points provided by a cache.
2. The processor of claim 1 wherein the scheduler is to gang the microthreads into the fragments based on IP values to induce microthread convergence.
3. The processor of claim 1 wherein a fragment comprises a collection of associated microthreads.
4. The processor of claim 2 further comprising: reconvergence circuitry to be used by the scheduler to determine an order in which to execute the fragments, the reconvergence circuitry comprising a data structure to store variables associated with each fragment.
5. The processor of claim 4 wherein the reconvergence circuitry is configured to generate a signal to identify a next fragment to be executed based on a comparison of the variables of all fragments.
6. The processor of claim 5 wherein the comparison comprises a comparison of the IP values of the fragments and wherein the fragment having a minimum IP value is to be selected for execution by execution lanes.
7. The processor of claim 1 wherein the DPC further comprises: mask storage to store an execution mask having at least one value associated with each parallel execution lane.
8. The processor of claim 7 wherein the DPC is to enable or disable execution lanes for executing each fragment or microthread based on the values associated with the lanes.
9. The processor of claim 8 wherein the execution mask is to be updated dynamically for each fragment or microthread, thereby enabling a number of lanes required to execute the fragment or microthread.
10. The processor of claim 1 wherein the DPC further comprises: a data cache to store data to be used to execute the fragments; a translation lookaside buffer (TLB) to store virtual-to-physical address translations for accessing system memory.
11. The processor of claim 1 wherein each lane of the DPC further comprises: a register file to store data associated with an executing fragment; a tensor arithmetic logic unit (TALU) to process tensor data associated with an executing fragment; and an address generation unit to generate addresses required to execute each fragment.
12. A method comprising: fetching instructions of one or more primary threads; decoding the instructions to generate uops; identifying microthreads comprising a subset of the uops; evaluating the microthreads based on associated variables including instruction pointer (IP) values; and ganging the microthreads into fragments for parallel execution on a plurality of parallel execution lanes based on the evaluation, wherein execution of at least one fragment during the parallel execution is stalled for a number of cycles at one or more reconvergence points provided by a cache.
13. The method of claim 12 wherein the microthreads are ganged into the fragments based on the IP values to induce microthread convergence.
14. The method of claim 12 wherein a fragment comprises a collection of associated microthreads.
15. The method of claim 13 further comprising: determining an order in which to execute the fragments using a data structure storing variables associated with each fragment.
16. The method of claim 15 further comprising: generating a signal to identify a next fragment to be executed based on a comparison of the variables of all fragments.
17. The method of claim 16 wherein the comparison comprises a comparison of the IP values of the fragments and wherein the fragment having a minimum IP value is to be selected for execution on the parallel execution lanes.
18. The method of claim 12 further comprising: storing an execution mask having at least one value associated with each of the parallel execution lanes.
19. The method of claim 18 further comprising: enabling or disabling execution lanes for executing each fragment or microthread based on the values associated with the lanes.
20. The method of claim 19 further comprising: dynamically updating the execution mask for each fragment or microthread, thereby enabling a specified number of lanes required to execute the fragment or microthread.
21. A non-transitory machine-readable medium having program code stored thereon which, when executed by a machine, causes the machine to perform the operations of: fetching instructions of one or more primary threads; decoding the instructions to generate uops; identifying microthreads comprising a subset of the uops; evaluating the microthreads based on associated variables including instruction pointer (IP) values; and ganging the microthreads into fragments for parallel execution on a plurality of parallel execution lanes based on the evaluation, wherein execution of at least one fragment during the parallel execution is stalled for a number of cycles at one or more reconvergence points provided by a cache.
22. The non-transitory machine-readable medium of claim 21 wherein the microthreads are ganged into the fragments based on the IP values to induce microthread convergence.
23. The non-transitory machine-readable medium of claim 21 wherein a fragment comprises a collection of associated microthreads.
24. The non-transitory machine-readable medium of claim 22 further comprising program code to cause the machine to perform the operation of: determining an order in which to execute the fragments using a data structure storing variables associated with each fragment.
25. The non-transitory machine-readable medium of claim 24 further comprising program code to cause the machine to perform the operation of: generating a signal to identify a next fragment to be executed based on a comparison of the variables of all fragments.
26. The non-transitory machine-readable medium of claim 25 wherein the comparison comprises a comparison of the IP values of the fragments and wherein the fragment having a minimum IP value is to be selected for execution on the parallel execution lanes.
27. The non-transitory machine-readable medium of claim 21 further comprising program code to cause the machine to perform the operation of: storing an execution mask having at least one value associated with each of the parallel execution lanes.
28. The non-transitory machine-readable medium of claim 27 further comprising program code to cause the machine to perform the operation of: enabling or disabling execution lanes for executing each fragment or microthread based on the values associated with the lanes.
29. The non-transitory machine-readable medium of claim 28 further comprising program code to cause the machine to perform the operation
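
Two mechanisms in the claims lend themselves to a short software sketch: selecting the fragment with the minimum IP value as the next to execute (claims 6, 17, 26) and keeping a per-lane execution mask (claims 7 to 9). The code below is a hedged illustration, not the claimed circuitry. The function names, the dictionary layout, the tie-break by fragment id, and the bit-per-lane mask encoding are all assumptions introduced here.

```python
def select_next_fragment(fragment_ips):
    """Pick the fragment whose IP value is smallest.

    fragment_ips maps a hypothetical fragment id to its current IP.
    Assumption: ties are broken by fragment id; the claims do not say
    how ties are resolved.
    """
    if not fragment_ips:
        raise ValueError("no fragments to schedule")
    return min(fragment_ips, key=lambda fid: (fragment_ips[fid], fid))


def execution_mask(active_lanes, num_lanes):
    """Build a per-lane enable mask: bit i is 1 when lane i is enabled.

    Assumption: one bit per lane; the claims only require at least one
    value associated with each lane.
    """
    mask = 0
    for lane in active_lanes:
        if not 0 <= lane < num_lanes:
            raise ValueError(f"lane {lane} out of range")
        mask |= 1 << lane
    return mask


if __name__ == "__main__":
    ips = {"A": 0x40, "B": 0x10, "C": 0x28}
    lanes = {"A": [0, 2], "B": [1, 3], "C": []}
    nxt = select_next_fragment(ips)
    # The mask is recomputed for whichever fragment runs next, mirroring
    # the dynamic update described in claims 9 and 20.
    print(nxt, format(execution_mask(lanes[nxt], 4), "04b"))
```

Running the fragment with the lowest IP first gives fragments that are further behind a chance to catch up to the others, which is one way to read the "induce microthread convergence" language in claims 2, 13, and 22.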

Assignees

Intel Corp

Inventors

Classifications

  • controlled by a single instruction for multiple threads [SIMT] in parallel · CPC title

  • from multiple instruction streams, e.g. multistreaming · CPC title

  • G06F9/3891 · Primary

    organised in groups of units sharing resources, e.g. clusters · CPC title

  • Instruction analysis, e.g. decoding, instruction word fields · CPC title

  • Register arrangements · CPC title

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US10831505B2 cover?
An apparatus and method for data parallel single program multiple data (SPMD) execution. For example, one embodiment of a processor comprises: instruction fetch circuitry to fetch instructions of one or more primary threads; a decoder to decode the instructions to generate uops; a data parallel cluster (DPC) to execute microthreads comprising a subset of the uops, the DPC further comprising: a …
Who is the assignee on this patent?
Intel Corp
What technology area does this patent fall under?
Primary CPC classification G06F9/3891. Mapped technology areas include Physics.
When was this patent published?
Publication date Nov 10, 2020 (B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 9 related publications on this page (citations in our corpus or others sharing the same primary CPC).