Vector processing unit

US10915318B2 · US · B2

Patent metadata
Publication number: US-10915318-B2
Application number: US-201916291176-A
Country: US
Kind code: B2
Filing date: Mar 4, 2019
Priority date: Mar 9, 2017
Publication date: Feb 9, 2021
Grant date: Feb 9, 2021

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

A vector processing unit is described, and includes processor units that each include multiple processing resources. The processor units are each configured to perform arithmetic operations associated with vectorized computations. The vector processing unit includes a vector memory in data communication with each of the processor units and their respective processing resources. The vector memory includes memory banks configured to store data used by each of the processor units to perform the arithmetic operations. The processor units and the vector memory are tightly coupled within an area of the vector processing unit such that data communications are exchanged at a high bandwidth based on the placement of respective processor units relative to one another, and based on the placement of the vector memory relative to each processor unit.
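The arrangement in the abstract can be illustrated with a small software model. This is only a sketch under stated assumptions, not the patented hardware: the unit count, the number of banks per unit, the address interleaving, and the names `VectorMemory`, `scatter`, `gather`, and `vector_add` are all hypothetical and do not come from the patent. It shows processor units that each read operands from their own memory banks and perform the same arithmetic operation on different data.

```python
from dataclasses import dataclass, field

NUM_UNITS = 4        # hypothetical number of processor units
BANKS_PER_UNIT = 2   # hypothetical number of memory banks serving each unit


@dataclass
class VectorMemory:
    """Banked storage: banks[u][b] is bank b of the set that serves unit u."""
    banks: list = field(default_factory=lambda: [
        [dict() for _ in range(BANKS_PER_UNIT)] for _ in range(NUM_UNITS)
    ])

    def store(self, unit, addr, value):
        # Low-order address interleaving across banks (an assumption).
        self.banks[unit][addr % BANKS_PER_UNIT][addr] = value

    def load(self, unit, addr):
        return self.banks[unit][addr % BANKS_PER_UNIT][addr]


def scatter(memory, addr, values):
    """Place one vector element per processor unit at the given address."""
    for unit, value in enumerate(values):
        memory.store(unit, addr, value)


def gather(memory, addr):
    """Collect one element per processor unit from the given address."""
    return [memory.load(unit, addr) for unit in range(NUM_UNITS)]


def vector_add(memory, addr_a, addr_b, addr_out):
    """Every unit adds its own operand pair: one vectorized arithmetic step."""
    for unit in range(NUM_UNITS):
        a = memory.load(unit, addr_a)
        b = memory.load(unit, addr_b)
        memory.store(unit, addr_out, a + b)


memory = VectorMemory()
scatter(memory, 0, [1, 2, 3, 4])
scatter(memory, 1, [10, 20, 30, 40])
vector_add(memory, 0, 1, 2)
result = gather(memory, 2)
```

The loop over units runs sequentially here; in the hardware described, the units would operate in parallel. The model does not capture physical placement or bandwidth, which are the focus of the abstract's last sentence.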

First claim

Opening claim text (preview).

What is claimed is:

1. A circuit for performing a vector computation, the circuit comprising: a vector processor lane located in a vector processing unit of the circuit; and a sub-lane processor located in the vector processor lane, the sub-lane processor including a processor resource and a vector register, each of the processor resource and the vector register being used to perform the vector computation; wherein the processor resource and the vector register are tightly coupled within a threshold distance of each other in a sub-lane of the vector processor lane such that data communications between the processor resource and the vector register traverse the threshold distance in fewer than four clock cycles; and wherein the vector processor lane is configured to send multiple data structures of vector operands to a matrix unit of the circuit in one clock cycle as a result of the sub-lane processor being located within a threshold distance of a matrix data serializer in the vector processor lane.

2. The circuit of claim 1, wherein: the vector processor lane provides a two-dimensional array of data paths that are tightly coupled within a threshold area of the circuit such that the vector processing unit is configured to execute thousands of data operations in one clock cycle; and at least one dimension corresponds to a data path between multiple distinct sub-lane processors located in the vector processor lane.

3. The circuit of claim 2, further comprising: a vector memory located in the vector processor lane, the vector memory being configured to store data that is used to perform the vector computation; and a crossbar located intermediate the vector memory and the sub-lane processor, the crossbar being configured to provide a communication interface between the vector memory and the processor resource of the sub-lane processor.

4. The circuit of claim 3, wherein: the vector memory includes multiple sets of memory banks, the vector processor lane includes multiple sub-lane processors, and the crossbar is configured to provide a data connection between each sub-lane processor and a respective set of memory banks included in the vector memory.

5. The circuit of claim 4, wherein: the vector processor lane comprises multiple sub-lanes and is a section of an integrated hardware circuit die that corresponds to a portion of the vector processing unit; and each sub-lane of the multiple sub-lanes is one of multiple sub-sections of the vector processor lane.

6. The circuit of claim 5, wherein: each sub-lane processor of the multiple sub-lane processors corresponds to a discrete processor unit that has multiple processor resources; and each processor resource is configured to execute vector operations for performing the vector computation.

7. The circuit of claim 1, the matrix unit is coupled to the vector processor lane and configured to: receive vector operands from at least one sub-lane processor located in the vector processor lane; and execute matrix operations that cause the circuit to perform the vector computations using the vector operands.

8. The circuit of claim 7, wherein the matrix operations include: matrix multiplication to train a neural network, or matrix multiplication to compute a neural network inference using at least a partially trained neural network.

9. A method implemented using a circuit for performing vector computations, the method comprising: receiving, by a vector processor lane in a vector processing unit of the circuit, data that is used to perform the vector computations; providing the data to a sub-lane processor located in the vector processor lane, wherein the sub-lane processor includes a processor resource and a vector register that communicate to perform the vector computations; generating, using the received data and based on data communications between the processor resource and the vector register, vector operands for performing the vector computations, wherein the processor resource and the vector register are tightly coupled within a threshold distance of each other in a sub-lane of the vector processor lane, and wherein the data communications traverse the threshold distance in fewer than four clock cycles based on the processor resource and the vector register being tightly coupled in the vector processor lane; providing, by the vector processor lane and using a matrix data serializer in the vector processor lane, the vector operands to a matrix unit in one clock cycle as a result of the sub-lane processor being located within a threshold distance of the matrix data serializer; and performing, at the circuit, the vector computations based on the data communications between the processor resource and the vector register, and matrix multiplication performed at the matrix unit using the vector operands.

10. The method of claim 9, wherein: the vector processor lane provides a two-dimensional array of data paths that are tightly coupled within a threshold area of the circuit such that the vector processing unit is configured to execute thousands of data operations in one clock cycle; and at least one dimension corresponds to a data path between multiple distinct sub-lane processors located in the vector processor lane.

11. The method of claim 10, wherein the vector processor lane includes a vector memory configured to store vector elements that correspond to the received data and the operations further comprise: providing the vector elements from the vector memory using a crossbar located intermediate the vector memory and the sub-lane processor, the crossbar being configured to provide a communication interface between the vector memory and the processor resource of the sub-lane processor.

12. The method of claim 11, wherein: the vector memory includes multiple sets of memory banks, the vector processor lane includes multiple sub-lane processors, the crossbar is configured to provide a data connection between each sub-lane processor and a respective set of memory banks included in the vector memory, and providing the vector elements from the vector memory comprises providing the vector elements using the data connection between a particular sub-lane processor and a corresponding respective set of memory banks included in the vector memory.

13. The method of claim 12, wherein: the vector processor lane comprises multiple sub-lanes and is a section of an integrated hardware circuit die that corresponds to a portion of the vector processing unit; and each sub-lane of the multiple sub-lanes is one of multiple sub-sections of the vector processor lane.

14. The method of claim 13, wherein each sub-lane processor of the multiple sub-lane processors corresponds to a discrete processor unit that has multiple processor resources and the operations further comprise: executing, using each processor resource, vector operations for performing the vector computation.

15. The method of claim 9, wherein the matrix unit is coupled to the vector processor lane and the operations further comprise: receiving, by the matrix unit, the vector operands from at least one sub-lane processor located in the vector processor lane; and executing, by the matrix unit, matrix operations that cause the circuit to perform the vector computations using the vector operands.

16. The method of claim 15, wherein the matrix operations include: matrix multiplication to train a neural network, or matrix multiplication to compute a neural network inference using at least a partially trained neural network.

17. A no
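The data flow recited in the circuit and method claims (vector memory, crossbar, sub-lane processors, matrix data serializer, matrix unit) can be sketched as a toy software model. Everything below is an illustrative assumption: the class names, the list-based banks, and the example operation are hypothetical, and the model says nothing about clock cycles or threshold distances, which are physical properties of the claimed circuit.

```python
class SubLaneProcessor:
    """Pairs a vector register with a processor resource (modeled as a function call)."""

    def __init__(self):
        self.vector_register = []

    def load(self, elements):
        self.vector_register = list(elements)

    def execute(self, op):
        # The processor resource applies one operation to every register element.
        self.vector_register = [op(x) for x in self.vector_register]


class Crossbar:
    """Connects each sub-lane processor to its respective set of memory banks."""

    def __init__(self, bank_sets):
        self.bank_sets = bank_sets

    def read(self, sublane_index):
        # Flatten the banks of the set that belongs to this sub-lane.
        return [x for bank in self.bank_sets[sublane_index] for x in bank]


class VectorProcessorLane:
    def __init__(self, bank_sets):
        self.crossbar = Crossbar(bank_sets)
        self.sublanes = [SubLaneProcessor() for _ in bank_sets]

    def step(self, op):
        """Load each sub-lane through the crossbar, then run one vector operation."""
        for index, sublane in enumerate(self.sublanes):
            sublane.load(self.crossbar.read(index))
            sublane.execute(op)

    def serialize(self):
        """Stand-in for the matrix data serializer: one operand row per sub-lane."""
        return [list(sublane.vector_register) for sublane in self.sublanes]


def matrix_unit(operand_rows, weights):
    """Plain matrix multiplication of the operand rows by a weight matrix."""
    inner, cols = len(weights), len(weights[0])
    return [
        [sum(row[k] * weights[k][j] for k in range(inner)) for j in range(cols)]
        for row in operand_rows
    ]


# Two sub-lanes, each served by a set of two single-element banks.
lane = VectorProcessorLane(bank_sets=[[[1], [2]], [[3], [4]]])
lane.step(lambda x: x * 2)
operands = lane.serialize()
product = matrix_unit(operands, [[1, 1], [1, 1]])
```

In this model `serialize` simply returns all operand rows at once; the claims tie the corresponding hardware transfer to a single clock cycle, which a sequential Python model cannot represent.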

Assignees

  • Google LLC

Inventors

Classifications

  • Vector processors · CPC title

  • Instructions to perform operations on packed data, e.g. vector, tile or matrix operations · CPC title

  • controlled by a single instruction for multiple data lanes [SIMD] · CPC title

  • for complex operations, e.g. multidimensional or interleaved address generators, macros · CPC title

  • G06F9/3893 · Primary

    controlled in tandem, e.g. multiplier-accumulator · CPC title

Patent family

Related publications grouped by family.

External sources

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US10915318B2 cover?
A vector processing unit is described, and includes processor units that each include multiple processing resources. The processor units are each configured to perform arithmetic operations associated with vectorized computations. The vector processing unit includes a vector memory in data communication with each of the processor units and their respective processing resources. The vector memor…
Who is the assignee on this patent?
Google LLC
What technology area does this patent fall under?
Primary CPC classification G06F15/8053. Mapped technology areas include Physics.
When was this patent published?
Publication date: Feb 9, 2021 (B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 2 related publications on this page (citations in our corpus or others sharing the same primary CPC).