Efficient neural network accelerator dataflows
US-2020293867-A1 · Sep 17, 2020 · US
US11934826B2 · US · B2
| Field | Value |
|---|---|
| Publication number | US-11934826-B2 |
| Application number | US-202117530869-A |
| Country | US |
| Kind code | B2 |
| Filing date | Nov 19, 2021 |
| Priority date | Feb 26, 2020 |
| Publication date | Mar 19, 2024 |
| Grant date | Mar 19, 2024 |
Official abstract text for this publication.
Methods, systems, and apparatus, including computer-readable media, are described for performing vector reductions using a shared scratchpad memory of a hardware circuit having processor cores that communicate with the shared memory. For each of the processor cores, a respective vector of values is generated based on computations performed at the processor core. The shared memory receives the respective vectors of values from respective resources of the processor cores using a direct memory access (DMA) data path of the shared memory. The shared memory performs an accumulation operation on the respective vectors of values using an operator unit coupled to the shared memory. The operator unit is configured to accumulate values based on arithmetic operations encoded at the operator unit. A result vector is generated based on performing the accumulation operation using the respective vectors of values.
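The data flow in the abstract can be sketched in software. The following is a minimal illustrative model, not the patented hardware: `OperatorUnit`, `SharedScratchpad`, `dma_accumulate`, `core_compute`, and `reduce_across_cores` are hypothetical names, the choice of `add`, `max`, and `min` as encoded operations is an assumption, and so is the rule that the first write initializes a cell.

```python
import operator
from typing import Callable, Dict, List

Vector = List[float]


class OperatorUnit:
    """Hypothetical operator unit: applies one encoded arithmetic op elementwise."""

    # Assumed set of encoded operations; the abstract only says "arithmetic operations".
    OPS: Dict[str, Callable[[float, float], float]] = {
        "add": operator.add,
        "max": max,
        "min": min,
    }

    def __init__(self, op_name: str) -> None:
        self.op = self.OPS[op_name]

    def accumulate(self, stored: Vector, incoming: Vector) -> Vector:
        # Combine the vector already in memory with the incoming vector.
        return [self.op(a, b) for a, b in zip(stored, incoming)]


class SharedScratchpad:
    """Hypothetical shared memory with an operator unit on its write (DMA) path."""

    def __init__(self, operator_unit: OperatorUnit) -> None:
        self.cells: Dict[int, Vector] = {}
        self.operator_unit = operator_unit

    def dma_accumulate(self, address: int, vector: Vector) -> None:
        # Assumption: the first write initializes the cell; later writes accumulate.
        if address not in self.cells:
            self.cells[address] = list(vector)
        else:
            self.cells[address] = self.operator_unit.accumulate(
                self.cells[address], vector
            )

    def read(self, address: int) -> Vector:
        return list(self.cells[address])


def core_compute(core_id: int, width: int = 4) -> Vector:
    # Stand-in for the computation each core performs to produce its vector.
    return [float(core_id * width + i) for i in range(width)]


def reduce_across_cores(num_cores: int, op_name: str, address: int = 0) -> Vector:
    memory = SharedScratchpad(OperatorUnit(op_name))
    for core_id in range(num_cores):
        # Each core sends its vector to the same address; the memory accumulates it.
        memory.dma_accumulate(address, core_compute(core_id))
    return memory.read(address)  # the result vector
```

With three cores producing `[0, 1, 2, 3]`, `[4, 5, 6, 7]`, and `[8, 9, 10, 11]`, `reduce_across_cores(3, "add")` returns `[12.0, 15.0, 18.0, 21.0]` and `reduce_across_cores(3, "max")` returns `[8.0, 9.0, 10.0, 11.0]`. The point of the sketch is that the accumulation happens at the memory, so no core has to read back and combine another core's vector.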
Opening claim text (preview).
What is claimed is:
1. A method performed using an integrated circuit for a hardware machine-learning accelerator that includes a plurality of cores and a shared memory that communicates with each of the plurality of cores, the method comprising: generating, by each of the plurality of cores, a respective vector of values; performing, across the plurality of cores and into a shared memory cell in the shared memory, a plurality of atomic vector reductions using each of the respective vectors and an operator unit of the shared memory without synchronization; and generating a result vector based on the plurality of atomic vector reductions.
2. The method of claim 1, wherein performing the plurality of atomic vector reductions comprises: accumulating a first vector stored in the shared memory cell with a respective second vector generated by one or more of the plurality of cores.
3. The method of claim 1, wherein: each of the plurality of cores comprises a respective vector-processing unit; and generating a respective vector of values comprises: generating, by each of the vector-processing units, a respective vector of values.
4. The method of claim 3, wherein each of the operator unit and the shared memory is external to the respective vector-processing unit in each of the plurality of cores.
5. An integrated circuit for a hardware machine-learning accelerator, the integrated circuit comprising: a plurality of cores; a shared memory that communicates with each of the plurality of cores; and a non-transitory machine-readable storage device for storing instructions that are executable by a processor to cause performance of operations comprising: generating, by each of the plurality of cores, a respective vector of values; performing, across the plurality of cores and into a shared memory cell in the shared memory, a plurality of atomic vector reductions using each of the respective vectors and an operator unit of the shared memory without synchronization; and generating a result vector based on the plurality of atomic vector reductions.
6. The integrated circuit of claim 5, wherein performing the plurality of atomic vector reductions comprises: accumulating a first vector stored in the shared memory cell with a respective second vector generated by one or more of the plurality of cores.
7. The integrated circuit of claim 5, wherein: each of the plurality of cores comprises a respective vector-processing unit; and generating a respective vector of values comprises: generating, by each of the vector-processing units, a respective vector of values.
8. The integrated circuit of claim 7, wherein each of the operator unit and the shared memory is external to the respective vector-processing unit in each of the plurality of cores.
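The "atomic vector reductions ... without synchronization" step of claims 1 and 5 can be illustrated with a small concurrent simulation. This is a sketch under stated assumptions, not the claimed circuit: it reads "without synchronization" as meaning the cores do not coordinate with one another (no barriers or handshakes), and it models the atomicity of the shared memory cell with a lock hidden inside the cell. `AtomicVectorCell`, `atomic_add`, `core_program`, and `run_reduction` are hypothetical names, and integer addition is an assumed choice of operation.

```python
import threading
from typing import List


class AtomicVectorCell:
    """Hypothetical shared memory cell whose read-modify-write is indivisible."""

    def __init__(self, width: int) -> None:
        self._values = [0] * width
        # The lock stands in for atomicity provided at the memory's operator
        # unit; the core programs below never acquire or see it.
        self._hw_atomic = threading.Lock()

    def atomic_add(self, vector: List[int]) -> None:
        with self._hw_atomic:
            self._values = [a + b for a, b in zip(self._values, vector)]

    def read(self) -> List[int]:
        with self._hw_atomic:
            return list(self._values)


def core_program(core_id: int, cell: AtomicVectorCell, width: int, rounds: int) -> None:
    for _ in range(rounds):
        # Stand-in for the vector a core generates in one step.
        partial = [core_id + 1] * width
        # Fire-and-forget accumulate: no barrier or handshake with other cores.
        cell.atomic_add(partial)


def run_reduction(num_cores: int = 4, width: int = 8, rounds: int = 100) -> List[int]:
    cell = AtomicVectorCell(width)
    cores = [
        threading.Thread(target=core_program, args=(i, cell, width, rounds))
        for i in range(num_cores)
    ]
    for t in cores:
        t.start()
    for t in cores:
        # The host waits only so it can read the final result vector.
        t.join()
    return cell.read()
```

Because integer addition is commutative and associative, the result does not depend on the order in which the cores' updates arrive: with 4 cores and 100 rounds, every element of the result is 100 × (1 + 2 + 3 + 4) = 1000. Integers are used here so that this order independence is exact; floating-point accumulation could differ in the last bits depending on arrival order.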
Instructions to perform operations on packed data, e.g. vector, tile or matrix operations · CPC title
Convolutional networks [CNN, ConvNet] · CPC title
using a mask · CPC title
Arithmetic instructions · CPC title
to perform operations on memory · CPC title