Vector reductions using shared scratchpad memory

US11182159B2 · US · B2

Patent metadata
Field: Value
Publication number: US-11182159-B2
Application number: US-202017007569-A
Country: US
Kind code: B2
Filing date: Aug 31, 2020
Priority date: Feb 26, 2020
Publication date: Nov 23, 2021
Grant date: Nov 23, 2021

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

Methods, systems, and apparatus, including computer-readable media, are described for performing vector reductions using a shared scratchpad memory of a hardware circuit having processor cores that communicate with the shared memory. For each of the processor cores, a respective vector of values is generated based on computations performed at the processor core. The shared memory receives the respective vectors of values from respective resources of the processor cores using a direct memory access (DMA) data path of the shared memory. The shared memory performs an accumulation operation on the respective vectors of values using an operator unit coupled to the shared memory. The operator unit is configured to accumulate values based on arithmetic operations encoded at the operator unit. A result vector is generated based on performing the accumulation operation using the respective vectors of values.
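
As an illustration only, the data flow described in the abstract can be modeled in software. The sketch below is a minimal Python simulation, not the patented hardware: the names `SharedScratchpad`, `dma_accumulate`, `reduce_across_cores`, and the address `0` are hypothetical, and the operator unit is modeled as a plain function applied element-wise as each vector arrives.

```python
from operator import add
from typing import Callable, Dict, List

Vector = List[float]


class SharedScratchpad:
    """Toy model of a shared memory with an attached operator unit (hypothetical)."""

    def __init__(self, op: Callable[[float, float], float] = add) -> None:
        self.op = op                        # arithmetic encoded at the operator unit
        self.cells: Dict[int, Vector] = {}  # address -> stored vector

    def dma_accumulate(self, addr: int, incoming: Vector) -> None:
        """Accumulate an incoming vector into the vector stored at addr."""
        stored = self.cells.get(addr)
        if stored is None:
            # First arrival: nothing to accumulate with yet, so just store it.
            self.cells[addr] = list(incoming)
            return
        if len(stored) != len(incoming):
            raise ValueError("vector lengths must match")
        # The accumulation happens on the way into memory, outside any core.
        self.cells[addr] = [self.op(s, x) for s, x in zip(stored, incoming)]


def reduce_across_cores(per_core_vectors: List[Vector]) -> Vector:
    """Each core sends its vector; the shared memory holds the running result."""
    scratchpad = SharedScratchpad(op=add)
    for vec in per_core_vectors:
        scratchpad.dma_accumulate(addr=0, incoming=vec)
    return scratchpad.cells[0]


# Two simulated cores, each contributing one vector of values.
result = reduce_across_cores([[1.0, 2.0, 3.0], [10.0, 20.0, 30.0]])
print(result)  # [11.0, 22.0, 33.0]
```

In this model the cores never see each other's vectors and never pre-accumulate; the only place where values are combined is the write path into the shared memory location.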

First claim

Opening claim text (preview).

What is claimed is:

1. A method performed using a hardware circuit having a shared memory and multiple processor cores that communicate with the shared memory, the method comprising: generating a first vector of values based on vector operands that are operated on by a vector processing unit of a first processor core; receiving, by the shared memory and using a direct memory access (DMA) data path of the shared memory, the first vector of values from the first processor core; performing an accumulation operation between the first vector of values and a vector stored in the shared memory, wherein the accumulation operation is performed using an operator unit that is: i) configured to accumulate respective values of one or more vectors, and ii) external to the vector processing unit and the first processor core such that the first vector of values is accumulated with the vector stored in the shared memory, outside of the first processor core, as the first vector of values is being routed to the shared memory; and generating a result vector based on the accumulation operation.

2. The method of claim 1, wherein the vector stored in the shared memory was received from a second processor core and the method comprises: performing an accumulate-to-memory operation to accumulate respective values of the first vector of values using a memory location of the shared memory; and performing an accumulate-to-memory operation to accumulate respective values of a second vector of values using the memory location of the shared memory.

3. The method of claim 2, wherein generating the result vector based on the accumulation operation comprises: generating the result vector without the first processor core performing a step of pre-accumulating products that result from computations performed at the first processor core; and generating the result vector without the second processor core performing a step of pre-accumulating products that result from computations performed at the second processor core.

4. The method of claim 1, wherein generating the result vector comprises: generating a vector of accumulated values as a result of performing the accumulation operation on the first vector of values; applying an activation function to each value in the vector of accumulated values; and generating the result vector as a result of applying the activation function to each value in the vector of accumulated values.

5. The method of claim 2, wherein a respective resource of the first processor core is a first matrix computation unit and the method further comprises: generating a first vector of accumulated values, corresponding to the first vector of values, based on matrix multiplies performed using the first matrix computation unit of the first processor core.

6. The method of claim 5, wherein a respective resource of the second processor core is a second matrix computation unit and the method further comprises: generating a second vector of accumulated values, corresponding to the second vector of values, based on matrix multiplies performed using the second matrix computation unit of the second processor core.

7. The method of claim 1, wherein: the hardware circuit is a hardware accelerator configured to implement a neural network comprising a plurality of neural network layers; and the method comprises generating an output for a layer of the neural network based on the result vector.

8. The method of claim 2, further comprising: generating the first vector of values based on computations performed at the first processor core; and generating the second vector of values based on computations performed at the second processor core; wherein the computations performed at the first processor core and the computations performed at the second processor core are part of a mathematical operation governed by a commutative property.

9. The method of claim 8, wherein the mathematical operation is: a floating-point multiplication operation; a floating-point addition operation; an integer addition operation; or a min-max operation.

10. The method of claim 8, wherein the mathematical operation comprises a floating-point addition operation and an integer addition operation.

11. The method of claim 2, wherein the first processor core and second processor core are the same processor core.

12. The method of claim 1, wherein the shared memory is configured to function as a shared-global memory space comprising memory banks and registers that are shared between two or more processor cores of the hardware circuit.

13. A system comprising: a processing device; a hardware circuit having a shared memory and multiple processor cores that communicate with the shared memory; and a non-transitory machine-readable storage device for storing instructions that are executable by the processing device to cause performance of operations comprising: generating a first vector of values based on vector operands that are operated on by a vector processing unit of a first processor core; receiving, by the shared memory and using a direct memory access (DMA) data path of the shared memory, the first vector of values from the first processor core; performing an accumulation operation between the first vector of values and a vector stored in the shared memory, wherein the accumulation operation is performed using an operator unit that is: i) configured to accumulate respective values of one or more vectors, and ii) external to the vector processing unit and the first processor core such that the first vector of values is accumulated with the vector stored in the shared memory, outside of the first processor core, as the first vector of values is being routed to the shared memory; and generating a result vector based on the accumulation operation.

14. The system of claim 13, wherein the vector stored in the shared memory was received from a second processor core and the operations comprise: performing an accumulate-to-memory operation to accumulate respective values of the first vector of values using a memory location of the shared memory; and performing an accumulate-to-memory operation to accumulate respective values of a second vector of values using the memory location of the shared memory.

15. The system of claim 14, wherein generating the result vector based on the accumulation operation comprises: generating the result vector without the first processor core performing a step of pre-accumulating products that result from computations performed at the first processor core; and generating the result vector without the second processor core performing a step of pre-accumulating products that result from computations performed at the second processor core.

16. The system of claim 13, wherein generating the result vector comprises: generating a vector of accumulated values as a result of performing the accumulation operation on the first vector of values; applying an activation function to each value in the vector of accumulated values; and generating the result vector as a result of applying the activation function to each value in the vector of accumulated values.

17. The system of claim 14, wherein a respective resource of the first processor core is a first matrix computation unit and the operations further comprise: generating a first vector of accumulated values, corresponding to the first vector of values, based on matrix multiplies performed using the first matrix computation unit of the first processor core.

18. The system of
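
To make two of the dependent claims more concrete, the sketch below models them in Python: a choice among commutative operators (claims 8 and 9) and an activation function applied to each accumulated value (claim 4). It is an illustration under stated assumptions, not the claimed hardware. The names `OPERATORS`, `accumulate`, `relu`, and `result_vector` are hypothetical and do not appear in the patent, and ReLU is used only as one example of an activation function.

```python
from functools import reduce
from operator import add, mul
from typing import Callable, Dict, List

Vector = List[float]

# Hypothetical operator table. Each entry is commutative, so the order in
# which per-core vectors arrive does not change the mathematical result
# (floating-point rounding can still differ with order, since floating-point
# addition and multiplication are not associative).
OPERATORS: Dict[str, Callable[[float, float], float]] = {
    "add": add,
    "mul": mul,
    "min": min,
    "max": max,
}


def accumulate(vectors: List[Vector], op_name: str) -> Vector:
    """Element-wise accumulation of all vectors with the selected operator."""
    op = OPERATORS[op_name]
    return reduce(lambda acc, vec: [op(a, v) for a, v in zip(acc, vec)], vectors)


def relu(x: float) -> float:
    """Example activation function (an assumption, not named in the claims)."""
    return x if x > 0.0 else 0.0


def result_vector(
    vectors: List[Vector],
    op_name: str = "add",
    activation: Callable[[float], float] = relu,
) -> Vector:
    """Accumulate first, then apply the activation to each accumulated value."""
    return [activation(v) for v in accumulate(vectors, op_name)]


core_a = [1.0, -5.0, 2.0]
core_b = [2.0, 1.0, -3.0]
print(result_vector([core_a, core_b]))       # [3.0, 0.0, 0.0]
print(accumulate([core_a, core_b], "max"))   # [2.0, 1.0, 2.0]
```

Swapping `core_a` and `core_b` gives the same output for every operator in the table, which is the property that lets vectors from different cores be accumulated in whatever order they reach the shared memory.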

Assignees

Inventors

Classifications

  • G06N3/063 · Primary

    using electronic means · CPC title

  • Neural networks · CPC title

  • using burst mode transfer, e.g. direct memory access {DMA}, cycle steal (G06F13/32 takes precedence) · CPC title

  • LOAD or STORE instructions; Clear instruction · CPC title

  • Sum of products (for applications thereof, see the relevant places, e.g. G06F17/10, H03H17/00) · CPC title

Patent family

Related publications grouped by family.

External sources

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US11182159B2 cover?
Methods, systems, and apparatus, including computer-readable media, are described for performing vector reductions using a shared scratchpad memory of a hardware circuit having processor cores that communicate with the shared memory. For each of the processor cores, a respective vector of values is generated based on computations performed at the processor core. The shared memory receives the r…
Who is the assignee on this patent?
Google LLC
What technology area does this patent fall under?
Primary CPC classification G06N3/063. Mapped technology areas include Physics.
When was this patent published?
Publication date Nov 23, 2021 (B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 6 related publications on this page (citations in our corpus or others sharing the same primary CPC).