Data processing apparatus and method for performing vector processing

US9672035B2 · US · B2

Patent metadata
Field               Value
Publication number  US-9672035-B2
Application number  US-201414504947-A
Country             US
Kind code           B2
Filing date         Oct 2, 2014
Priority date       Nov 26, 2013
Publication date    Jun 6, 2017
Grant date          Jun 6, 2017

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

A data processing apparatus and method are provided for processing execution threads, where each execution thread specifies at least one instruction. The data processing apparatus has a vector processing unit providing a plurality M of lanes of parallel processing, within each lane the vector processing unit being configured to perform a processing operation on a data element input to that lane for each of one or more input operands. A vector instruction is received that is specified by a group of the execution threads, that vector instruction identifying an associated processing operation and also providing an indication of the data elements of each input operand that are to be subjected to that associated processing operation. Vector merge circuitry then determines, based on that information, a required number of lanes of parallel processing for performing the associated processing operation. If it is determined that the required number of lanes is less than or equal to half the available number of lanes within the vector processing unit, then the vector merge circuitry allocates a plurality of the execution threads of the group to the vector processing unit such that each execution thread in that plurality is allocated different lanes amongst the available lanes of parallel processing. As a result, the vector processing unit then performs the associated processing operation in parallel for each of the plurality of execution threads, significantly increasing performance.
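
The merge decision described in the abstract can be sketched in a few lines of Python. This is only an illustrative model, not the patented circuitry: the function name `merge_threads`, its parameters, and the choice of giving each thread a contiguous block of lanes are all assumptions made for the example. The abstract only requires that each merged thread receives different lanes.

```python
from typing import Dict, List, Tuple


def merge_threads(
    m_lanes: int, required_lanes: int, thread_ids: List[int]
) -> Dict[int, Tuple[int, ...]]:
    """Map thread id -> lanes for a single pass through the vector unit.

    Hypothetical model: `required_lanes` is assumed to have been derived
    already from the instruction's indication of its data elements.
    """
    if not 1 <= required_lanes <= m_lanes:
        raise ValueError("required_lanes must be between 1 and m_lanes")

    if 2 * required_lanes <= m_lanes:
        # Required lanes <= M/2: several threads fit side by side.
        capacity = m_lanes // required_lanes
    else:
        # Otherwise no merging is possible; one thread uses the unit.
        capacity = 1

    allocation: Dict[int, Tuple[int, ...]] = {}
    for slot, tid in enumerate(thread_ids[:capacity]):
        # Assumption: contiguous, non-overlapping lane blocks per thread.
        start = slot * required_lanes
        allocation[tid] = tuple(range(start, start + required_lanes))
    return allocation


if __name__ == "__main__":
    # 8 lanes, 2 lanes needed per thread: four threads share the unit.
    print(merge_threads(8, 2, [0, 1, 2, 3]))
    # 8 lanes, 5 lanes needed: above M/2, so only one thread is issued.
    print(merge_threads(8, 5, [0, 1, 2, 3]))
```

With 8 lanes and an operation that needs only 2 of them, the sketch places four threads in one pass instead of leaving 6 lanes idle, which is the source of the performance gain the abstract mentions.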

First claim

Opening claim text (preview).

I claim:
1. A data processing apparatus for processing execution threads, each execution thread specifying at least one instruction, the data processing apparatus comprising: a vector processing unit providing M lanes of parallel processing, where M is a plural integer, within each lane the vector processing unit being configured to perform a processing operation on a data element input to that lane for each of one or more input operands; an input interface configured to receive a vector instruction that is specified by a group of said execution threads, the vector instruction identifying an associated processing operation and providing an indication of the data elements of each input operand to be subjected to said associated processing operation; vector merge circuitry configured to determine, having regard to the indication of the data elements, a required number of lanes of parallel processing for performing the associated processing operation of the vector instruction, the vector merge circuitry being further configured, if the required number of lanes of parallel processing is less than or equal to M/2, to allocate a plurality of the execution threads of the group to the vector processing unit such that each execution thread in said plurality of the execution threads is allocated different lanes amongst said M lanes of parallel processing; and the vector processing unit configured, responsive to the vector merge circuitry allocating a plurality of the execution threads of the group to the vector processing unit, to perform the associated processing operation in parallel for each of said plurality of execution threads.
2. The data processing apparatus as claimed in claim 1, wherein said group of said execution threads comprises N execution threads, where N is a plural integer, and the vector merge circuitry is configured, if the required number of lanes of parallel processing is less than or equal to M/N, to allocate all of the execution threads in said group to the vector processing unit such that the vector processing unit then performs the associated processing operation in parallel for each of the execution threads in said group.
3. The data processing apparatus as claimed in claim 2, wherein if the required number of lanes of parallel processing is less than or equal to M/2, but is not less than or equal to M/N, the vector merge circuitry is configured to allocate a plurality of execution threads of the group to the vector processing unit during a first allocation cycle, and then to allocate a different plurality of the execution threads of the group to the vector processing unit in one or more subsequent allocation cycles, until all of the execution threads of the group have been allocated to the vector processing unit.
4. The data processing apparatus as claimed in claim 1, further comprising: a further vector processing unit also providing said M lanes of parallel processing; said input interface is configured to receive an instruction block containing said vector instruction whose associated processing operation is to be performed by the vector processing unit and a further vector instruction whose associated processing operation is to be performed by the further vector processing unit, both said vector instruction and said further vector instruction being specified by said group of said execution threads; said further vector instruction also providing an indication of the data elements of each input operand to be subjected to its associated processing operation; and said vector merge circuitry is configured to have regard to the indication of the data elements provided by said vector instruction and the indication of the data elements provided by said further vector instruction when determining said required number of lanes of parallel processing, such that said required number of lanes of parallel processing is dependent on which of said vector instruction and said further vector instruction requires the most lanes in order to perform its associated processing operation.
5. The data processing apparatus as claimed in claim 1, further comprising: a group of scalar processing units, each scalar processing unit in the group being configured to perform the same scalar processing operation; said input interface is configured to receive an instruction block containing said vector instruction whose associated processing operation is to be performed by the vector processing unit, and a scalar instruction specifying an associated processing operation to be performed by one of said scalar processing units; and the vector merge circuitry further comprises scalar allocation circuitry, and responsive to the vector merge circuitry allocating a plurality of the execution threads of the group to the vector processing unit, the scalar allocation circuitry is configured to allocate each of said plurality of execution threads to a different one of the scalar processing units within said group of scalar processing units, such that the associated scalar operation is performed in parallel for each of said plurality of execution threads.
6. The data processing apparatus as claimed in claim 5, wherein said group of said execution threads comprises N execution threads, where N is a plural integer, and the vector merge circuitry is configured, if the required number of lanes of parallel processing is less than or equal to M/N, to allocate all of the execution threads in said group to the vector processing unit such that the vector processing unit then performs the associated processing operation in parallel for each of the execution threads in said group, and wherein said group of scalar processing units comprises N scalar processing units.
7. The data processing apparatus as claimed in claim 5, wherein said instruction block includes a field which, when set, indicates that said scalar instruction is to be treated as an additional vector instruction, and said group of scalar processing units are to be treated collectively as forming an additional vector processing unit for performing the associated processing operation of said additional vector instruction.
8. The data processing apparatus as claimed in claim 1, further comprising: a set of registers for storage of the input operands associated with each execution thread in said group; the vector instruction being configured to provide an input operand identifier for each input operand to be subjected to the associated processing operation; and register determination circuitry configured, for each execution thread allocated to the vector processing circuitry, to determine from each input operand identifier a register in said set of registers containing the corresponding input operand associated with that execution thread, and to cause the set of registers to be accessed in order to output the corresponding input operand from the determined register.
9. The data processing apparatus as claimed in claim 8, wherein: the vector merging circuitry is configured, when allocating a plurality of the execution threads of the group to the vector processing unit such that each execution thread in said plurality is allocated different lanes amongst said M lanes of parallel processing, to maintain lane allocation data identifying which lanes have been allocated to each execution thread in said plurality; the data processing apparatus further comprising operand routing circuitry configured to receive the input operand output from each determined register and to route the required data elements of that input operand to the lanes indicated by the lane allocation data as having been allocated to the execution thread that that input operand is associated with.
10. The data processing…
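
The thresholds in claims 2 to 4 can be made concrete with a short Python sketch. It is a hedged illustration rather than the claimed hardware: the helper names `required_lanes` and `allocation_cycles` are hypothetical, and the policy of packing as many threads as fit into each allocation cycle is an assumption (claim 3 only says that a plurality of threads is allocated per cycle until the whole group has been issued).

```python
from typing import List, Sequence


def required_lanes(lanes_per_instruction: Sequence[int]) -> int:
    """Claim 4 idea: with several vector instructions in one block, the
    required lane count follows whichever instruction needs the most lanes."""
    return max(lanes_per_instruction)


def allocation_cycles(m_lanes: int, lanes_needed: int, n_threads: int) -> List[List[int]]:
    """Return, for each allocation cycle, the thread indices issued together.

    Assumption: each cycle takes as many threads as fit in M lanes.
    """
    if not 1 <= lanes_needed <= m_lanes:
        raise ValueError("lanes_needed must be between 1 and m_lanes")
    # Equals 1 when lanes_needed > M/2, i.e. no merging takes place.
    per_cycle = m_lanes // lanes_needed
    threads = list(range(n_threads))
    return [threads[i:i + per_cycle] for i in range(0, n_threads, per_cycle)]


if __name__ == "__main__":
    M, N = 8, 4
    # Claim 2 case: 2 <= M/N, so the whole group issues in one cycle.
    print(allocation_cycles(M, 2, N))
    # Claim 3 case: 4 <= M/2 but 4 > M/N, so two cycles of two threads.
    print(allocation_cycles(M, 4, N))
    # Claim 4 case: two instructions needing 2 and 4 lanes -> 4 lanes.
    print(allocation_cycles(M, required_lanes([2, 4]), N))
```

For M = 8 and N = 4, a 2-lane operation finishes the group in one cycle, a 4-lane operation takes two cycles, and anything wider than 4 lanes falls back to one thread per cycle.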

Assignees

Advanced Risc Mach Ltd

Inventors

Classifications

  • Instructions to perform operations on packed data, e.g. vector, tile or matrix operations

  • Vector processors

  • controlled by a single instruction for multiple threads [SIMT] in parallel

  • controlled by a single instruction for multiple data lanes [SIMD]

  • from multiple instruction streams, e.g. multistreaming

Patent family

Related publications grouped by family.

External sources

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US9672035B2 cover?
A data processing apparatus and method are provided for processing execution threads, where each execution thread specifies at least one instruction. The data processing apparatus has a vector processing unit providing a plurality M of lanes of parallel processing, within each lane the vector processing unit being configured to perform a processing operation on a data element input to that lane…
Who is the assignee on this patent?
Advanced Risc Mach Ltd
What technology area does this patent fall under?
Primary CPC classification G06F9/30036. Mapped technology areas include Physics.
When was this patent published?
Publication date: Jun 6, 2017 (B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 8 related publications on this page (citations in our corpus or others sharing the same primary CPC).