Multi-functional execution lane for image processor
US-2017161064-A1 · Jun 8, 2017 · US
US10915318B2 · US · B2
| Field | Value |
|---|---|
| Publication number | US-10915318-B2 |
| Application number | US-201916291176-A |
| Country | US |
| Kind code | B2 |
| Filing date | Mar 4, 2019 |
| Priority date | Mar 9, 2017 |
| Publication date | Feb 9, 2021 |
| Grant date | Feb 9, 2021 |
Official abstract text for this publication.
A vector processing unit is described, and includes processor units that each include multiple processing resources. The processor units are each configured to perform arithmetic operations associated with vectorized computations. The vector processing unit includes a vector memory in data communication with each of the processor units and their respective processing resources. The vector memory includes memory banks configured to store data used by each of the processor units to perform the arithmetic operations. The processor units and the vector memory are tightly coupled within an area of the vector processing unit such that data communications are exchanged at a high bandwidth based on the placement of respective processor units relative to one another, and based on the placement of the vector memory relative to each processor unit.
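The logical organization in the abstract (several processor units, each with multiple processing resources, all reading from a banked vector memory) can be pictured with a small software model. This is only an illustrative sketch: the class names `VectorMemory` and `ProcessorUnit`, the helper `run_units`, and the rule that unit `i` reads banks `2*i` and `2*i + 1` are all assumptions made for the example and do not come from the patent. A software model also cannot represent the physical placement, tight coupling, or bandwidth that the abstract emphasizes.

```python
import operator
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class VectorMemory:
    """Toy vector memory: a list of banks, each bank holding one vector."""
    banks: List[List[float]]

    def read(self, bank: int) -> List[float]:
        # Return a copy so callers cannot mutate the stored bank.
        return list(self.banks[bank])


@dataclass
class ProcessorUnit:
    """Toy processor unit with a fixed number of processing resources."""
    num_resources: int

    def apply(self, op: Callable[[float, float], float],
              a: List[float], b: List[float]) -> List[float]:
        if len(a) != len(b):
            raise ValueError("operand vectors must have equal length")
        out: List[float] = []
        # Each pass handles up to num_resources elements, standing in for
        # the resources of one unit working side by side.
        for start in range(0, len(a), self.num_resources):
            stop = start + self.num_resources
            out.extend(op(x, y) for x, y in zip(a[start:stop], b[start:stop]))
        return out


def run_units(memory: VectorMemory, units: List[ProcessorUnit],
              op: Callable[[float, float], float]) -> List[List[float]]:
    """Assumed mapping: unit i combines bank 2*i with bank 2*i + 1."""
    return [unit.apply(op, memory.read(2 * i), memory.read(2 * i + 1))
            for i, unit in enumerate(units)]


memory = VectorMemory(banks=[[1, 2, 3, 4], [10, 20, 30, 40],
                             [5, 6, 7, 8], [1, 1, 1, 1]])
units = [ProcessorUnit(num_resources=2), ProcessorUnit(num_resources=2)]
sums = run_units(memory, units, operator.add)
print(sums)
```

In this model each unit touches only its own pair of banks, which loosely mirrors the idea of memory banks serving particular processor units; the real hardware mapping is defined by the patent's description and claims, not by this sketch.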
Opening claim text (preview).
What is claimed is:
1. A circuit for performing a vector computation, the circuit comprising: a vector processor lane located in a vector processing unit of the circuit; and a sub-lane processor located in the vector processor lane, the sub-lane processor including a processor resource and a vector register, each of the processor resource and the vector register being used to perform the vector computation; wherein the processor resource and the vector register are tightly coupled within a threshold distance of each other in a sub-lane of the vector processor lane such that data communications between the processor resource and the vector register traverse the threshold distance in fewer than four clock cycles; and wherein the vector processor lane is configured to send multiple data structures of vector operands to a matrix unit of the circuit in one clock cycle as a result of the sub-lane processor being located within a threshold distance of a matrix data serializer in the vector processor lane.
2. The circuit of claim 1, wherein: the vector processor lane provides a two-dimensional array of data paths that are tightly coupled within a threshold area of the circuit such that the vector processing unit is configured to execute thousands of data operations in one clock cycle; and at least one dimension corresponds to a data path between multiple distinct sub-lane processors located in the vector processor lane.
3. The circuit of claim 2, further comprising: a vector memory located in the vector processor lane, the vector memory being configured to store data that is used to perform the vector computation; and a crossbar located intermediate the vector memory and the sub-lane processor, the crossbar being configured to provide a communication interface between the vector memory and the processor resource of the sub-lane processor.
4. The circuit of claim 3, wherein: the vector memory includes multiple sets of memory banks, the vector processor lane includes multiple sub-lane processors, and the crossbar is configured to provide a data connection between each sub-lane processor and a respective set of memory banks included in the vector memory.
5. The circuit of claim 4, wherein: the vector processor lane comprises multiple sub-lanes and is a section of an integrated hardware circuit die that corresponds to a portion of the vector processing unit; and each sub-lane of the multiple sub-lanes is one of multiple sub-sections of the vector processor lane.
6. The circuit of claim 5, wherein: each sub-lane processor of the multiple sub-lane processors corresponds to a discrete processor unit that has multiple processor resources; and each processor resource is configured to execute vector operations for performing the vector computation.
7. The circuit of claim 1, the matrix unit is coupled to the vector processor lane and configured to: receive vector operands from at least one sub-lane processor located in the vector processor lane; and execute matrix operations that cause the circuit to perform the vector computations using the vector operands.
8. The circuit of claim 7, wherein the matrix operations include: matrix multiplication to train a neural network, or matrix multiplication to compute a neural network inference using at least a partially trained neural network.
9. A method implemented using a circuit for performing vector computations, the method comprising: receiving, by a vector processor lane in a vector processing unit of the circuit, data that is used to perform the vector computations; providing the data to a sub-lane processor located in the vector processor lane, wherein the sub-lane processor includes a processor resource and a vector register that communicate to perform the vector computations; generating, using the received data and based on data communications between the processor resource and the vector register, vector operands for performing the vector computations, wherein the processor resource and the vector register are tightly coupled within a threshold distance of each other in a sub-lane of the vector processor lane, and wherein the data communications traverse the threshold distance in fewer than four clock cycles based on the processor resource and the vector register being tightly coupled in the vector processor lane; providing, by the vector processor lane and using a matrix data serializer in the vector processor lane, the vector operands to a matrix unit in one clock cycle as a result of the sub-lane processor being located within a threshold distance of the matrix data serializer; and performing, at the circuit, the vector computations based on the data communications between the processor resource and the vector register, and matrix multiplication performed at the matrix unit using the vector operands.
10. The method of claim 9, wherein: the vector processor lane provides a two-dimensional array of data paths that are tightly coupled within a threshold area of the circuit such that the vector processing unit is configured to execute thousands of data operations in one clock cycle; and at least one dimension corresponds to a data path between multiple distinct sub-lane processors located in the vector processor lane.
11. The method of claim 10, wherein the vector processor lane includes a vector memory configured to store vector elements that correspond to the received data and the operations further comprise: providing the vector elements from the vector memory using a crossbar located intermediate the vector memory and the sub-lane processor, the crossbar being configured to provide a communication interface between the vector memory and the processor resource of the sub-lane processor.
12. The method of claim 11, wherein: the vector memory includes multiple sets of memory banks, the vector processor lane includes multiple sub-lane processors, the crossbar is configured to provide a data connection between each sub-lane processor and a respective set of memory banks included in the vector memory, and providing the vector elements from the vector memory comprises providing the vector elements using the data connection between a particular sub-lane processor and a corresponding respective set of memory banks included in the vector memory.
13. The method of claim 12, wherein: the vector processor lane comprises multiple sub-lanes and is a section of an integrated hardware circuit die that corresponds to a portion of the vector processing unit; and each sub-lane of the multiple sub-lanes is one of multiple sub-sections of the vector processor lane.
14. The method of claim 13, wherein each sub-lane processor of the multiple sub-lane processors corresponds to a discrete processor unit that has multiple processor resources and the operations further comprise: executing, using each processor resource, vector operations for performing the vector computation.
15. The method of claim 9, wherein the matrix unit is coupled to the vector processor lane and the operations further comprise: receiving, by the matrix unit, the vector operands from at least one sub-lane processor located in the vector processor lane; and executing, by the matrix unit, matrix operations that cause the circuit to perform the vector computations using the vector operands.
16. The method of claim 15, wherein the matrix operations include: matrix multiplication to train a neural network, or matrix multiplication to compute a neural network inference using at least a partially trained neural network.
17. A no
Vector processors · CPC title
Instructions to perform operations on packed data, e.g. vector, tile or matrix operations · CPC title
controlled by a single instruction for multiple data lanes [SIMD] · CPC title
for complex operations, e.g. multidimensional or interleaved address generators, macros · CPC title
controlled in tandem, e.g. multiplier-accumulator · CPC title