Shared scratchpad memory with parallel load-store

US12367383B2 · US · B2

Patent metadata
Publication number: US-12367383-B2
Application number: US-202418423203-A
Country: US
Kind code: B2
Filing date: Jan 25, 2024
Priority date: Jan 27, 2020
Publication date: Jul 22, 2025
Grant date: Jul 22, 2025

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

Methods, systems, and apparatus, including computer-readable media, are described for a hardware circuit configured to implement a neural network. The circuit includes a first memory, respective first and second processor cores, and a shared memory. The first memory provides data for performing computations to generate an output for a neural network layer. Each of the first and second cores include a vector memory for storing vector values derived from the data provided by the first memory. The shared memory is disposed generally intermediate the first memory and at least one core and includes: i) a direct memory access (DMA) data path configured to route data between the shared memory and the respective vector memories of the first and second cores and ii) a load-store data path configured to route data between the shared memory and respective vector registers of the first and second cores.
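The two data paths named in the abstract can be pictured with a small software model. The sketch below is illustrative only and is not taken from the patent: the class names (`Core`, `SharedMemory`), the method names, and the dictionary-based address scheme are all assumptions chosen for readability. It shows a DMA path that moves a vector from shared memory into a core's vector memory, and a load-store path that moves a vector between shared memory and a core's vector register.

```python
from dataclasses import dataclass, field


@dataclass
class Core:
    """Hypothetical processor core: a vector memory plus vector registers."""
    vector_memory: dict = field(default_factory=dict)     # address -> vector
    vector_registers: dict = field(default_factory=dict)  # register index -> vector


class SharedMemory:
    """Hypothetical shared scratchpad placed between the first memory and the cores."""

    def __init__(self, cores):
        self.cores = cores
        self.banks = {}  # address -> vector (assumed flat address space)

    def fill_from_first_memory(self, first_memory, addresses):
        # Copy the selected vectors from the first memory into shared memory.
        for addr in addresses:
            self.banks[addr] = list(first_memory[addr])

    def dma_to_vector_memory(self, core_id, src_addr, dst_addr):
        # DMA data path: shared memory -> a core's vector memory.
        self.cores[core_id].vector_memory[dst_addr] = list(self.banks[src_addr])

    def load_to_vector_register(self, core_id, src_addr, reg):
        # Load-store data path (load): shared memory -> a core's vector register.
        self.cores[core_id].vector_registers[reg] = list(self.banks[src_addr])

    def store_from_vector_register(self, core_id, reg, dst_addr):
        # Load-store data path (store): a core's vector register -> shared memory.
        self.banks[dst_addr] = list(self.cores[core_id].vector_registers[reg])


# Usage: two cores share one scratchpad fed from the first memory.
first_memory = {0: [1.0, 2.0], 1: [3.0, 4.0]}
cores = [Core(), Core()]
shared = SharedMemory(cores)
shared.fill_from_first_memory(first_memory, [0, 1])
shared.dma_to_vector_memory(core_id=0, src_addr=0, dst_addr=8)
shared.load_to_vector_register(core_id=1, src_addr=1, reg=0)
```

In this model the load-store path bypasses the core's vector memory and writes the register directly, which is the distinction the abstract draws between the two paths. Real hardware timing, bank arbitration, and bus widths are not modeled.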

First claim

Opening claim text (preview).

What is claimed is:

1. A circuit configured to implement a neural network comprising a plurality of neural network layers, the circuit comprising: a first memory configured to provide data for performing computations to generate an output for a layer of the neural network; a first processor core comprising a first vector register located within the first processor core and configured to at least load data from or store data to a first vector memory; a second processor core comprising a second vector register located within the second processor core and configured to at least load data from or store data to a second vector memory; and a shared memory disposed intermediate the first memory and at least one of the first processor core or the second processor core, wherein the shared memory and the first memory are communicatively coupled by direct memory access (DMA), wherein the shared memory comprises a software-controlled staging resource that is formed from a subset of memory resources of the shared memory, wherein the software-controlled staging resource is configured to load data from the shared memory in a first phase, and provide the loaded data to the first or second vector register in a second phase; and a matrix computation unit, within the first processor core or the second processor core, configured to perform a subset of the computations to generate accumulated values that are used to generate the output of the layer of the neural network, wherein the software-controlled staging resource is used to manage the flow of the data values corresponding to vector arrays between the first memory and the matrix computation unit, wherein the vector arrays are derived from the data values provided by the first memory.

2. The circuit of claim 1, wherein the first vector memory is located within the first processor core and configured to store first vector values derived from the data provided by the first memory, and the second vector memory is located within the second processor core and configured to store second vector values derived from the data provided by the first memory.

3. The circuit of claim 1, wherein the shared memory further comprises: a first direct memory access (DMA) data path configured to route data communications between the shared memory and the first vector memory included in the first processor core, a second direct memory access (DMA) data path configured to route data communications between the shared memory and the second vector memory included in the second processor core; and a first load-store data path configured to route data communications between the shared memory and the first vector register included in the first processor core, and a second load-store data path configured to route data communications between the shared memory and the second vector register included in the second processor core.

4. The circuit of claim 1, wherein: the circuit comprises a plurality of processor cores, the first processor core and the second processor core being among the plurality of processor cores; and the shared memory comprises a plurality of memory resources that are physically distributed about the circuit to exchange data communications with each of the plurality of processor cores at the circuit.

5. The circuit of claim 4, wherein the shared memory comprises a shared memory control unit configured to: execute software instructions that cause a first portion of the plurality of memory resources to function as a DMA memory unit operable to move data between the first memory and each of the first processor core and the second processor core.

6. The circuit of claim 5, wherein the plurality of memory resources comprises a second portion of resources that are configured to: receive data values that are routed along the first or second load-store data path; and temporarily store the data values for a threshold number of processor cycles.

7. The circuit of claim 6, wherein the second portion of resources are configured to: provide the data values to the first vector register of the first processor core or the second vector register of the second processor core in response to temporarily storing the data values for the threshold number of processor cycles.

8. The circuit of claim 1, wherein the software-controlled staging resource is used to manage the flow of data values from the first memory to the respective vector register of the first processor core or the second processor core.

9. The circuit of claim 1, wherein: the circuit comprises a vector processing unit that communicates with the first memory; the vector processing unit is configured to generate a vector of activation values from accumulated values generated at the circuit; and the vector of activation values corresponds to the output for the layer of the neural network.

10. The circuit of claim 8, wherein: the software-controlled staging resource is a FIFO (first-in-first-out) memory structure along a load section of the load-store data path; and the FIFO memory structure is configured to temporarily store a vector of values for a threshold number of processor cycles before routing the vector of values to the respective vector register of the first processor core or the second processor core.

11. The circuit of claim 1, wherein the shared memory is configured to function as a shared-global memory space comprising memory resources corresponding to memory banks that are shared between one or more processor cores of a plurality of processor cores.

12. The circuit of claim 1, wherein the data for performing computations to generate the output for a first layer of the neural network comprises: inputs to be processed through the first layer of the neural network; a respective set of weights for the first layer of the neural network; and instructions for processing one or more of the inputs through the first layer using the respective set of weights for the first layer to generate the output for the first layer.

13. The circuit of claim 1, wherein each data path of the first and second DMA data paths and the first and second load-store data paths is assigned to a respective block of the shared memory of multiple blocks of the shared memory that are separated from one another.

14. A method for performing computations to generate an output for a layer of a neural network comprising a plurality of neural network layers using a circuit configured to implement the neural network, the method comprising: providing, from a first memory, data used to generate an output for a neural network layer; storing vectors of values at a first processor core of the circuit using a first vector memory of the first processor core, wherein the first processor core comprises a first vector register located within the first processor core and configured to at least load data from or store data to the first vector memory; storing vectors of values at a second processor core of the circuit using a second vector memory of the second processor core, wherein the second processor core comprises a second vector register located within the second processor core and configured to at least load data from or store data to the second vector memory; routing data communications comprising fourth vector values between the shared memory and the second vector register included in the second processor core, wherein the shared memory further comprises a software-controlled staging resource that is formed from a subset of memory resources of the shared memory, wherein the software controlled staging resource is configured to load data from the shared memory in a
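The staging behavior recited in claims 1, 6, 7, and 10 (load from shared memory in a first phase, hold for a threshold number of processor cycles, then provide the data to a vector register in a second phase) can be sketched as a simple FIFO model. This is a rough illustration under stated assumptions, not the patented implementation: `StagingFifo`, its methods, the explicit `tick` cycle counter, and the threshold value of 2 are all hypothetical.

```python
from collections import deque


class StagingFifo:
    """Hypothetical model of a software-controlled staging FIFO on the load path."""

    def __init__(self, threshold_cycles):
        self.threshold_cycles = threshold_cycles  # assumed hold time, in cycles
        self.cycle = 0
        self.entries = deque()  # (ready_cycle, core_id, register, vector)

    def load(self, shared_banks, addr, core_id, reg):
        # First phase: copy a vector out of shared memory into the FIFO and
        # record the cycle at which it becomes eligible for delivery.
        vector = list(shared_banks[addr])
        ready_cycle = self.cycle + self.threshold_cycles
        self.entries.append((ready_cycle, core_id, reg, vector))

    def tick(self, vector_registers):
        # Advance one processor cycle. Second phase: in FIFO order, deliver
        # every entry whose hold time has elapsed to its target register.
        self.cycle += 1
        delivered = 0
        while self.entries and self.entries[0][0] <= self.cycle:
            _, core_id, reg, vector = self.entries.popleft()
            vector_registers[core_id][reg] = vector
            delivered += 1
        return delivered


# Usage: one vector is staged for core 1 and arrives after two cycles.
banks = {0: [1, 2, 3]}
registers = {0: {}, 1: {}}  # core id -> {register index -> vector}
fifo = StagingFifo(threshold_cycles=2)
fifo.load(banks, addr=0, core_id=1, reg=0)
first_tick = fifo.tick(registers)   # cycle 1: entry is still held
second_tick = fifo.tick(registers)  # cycle 2: entry is delivered
```

The model only checks the head of the queue, so a later entry never overtakes an earlier one, which matches first-in-first-out ordering. How the hardware actually counts cycles or selects the threshold is not described in this preview.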

Assignees

Google LLC

Classifications

  • using burst mode transfer, e.g. direct memory access {DMA}, cycle steal (G06F13/32 takes precedence)

  • Access to shared memory

  • Distributed shared memory [DSM], e.g. remote direct memory access [RDMA]

  • Convolutional networks [CNN, ConvNet]

  • DMA

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US12367383B2 cover?
Methods, systems, and apparatus, including computer-readable media, are described for a hardware circuit configured to implement a neural network. The circuit includes a first memory, respective first and second processor cores, and a shared memory. The first memory provides data for performing computations to generate an output for a neural network layer. Each of the first and second cores inc…
Who is the assignee on this patent?
Google LLC
What technology area does this patent fall under?
Primary CPC classification G06N3/063. Mapped technology areas include Physics.
When was this patent published?
Publication date: Jul 22, 2025 (kind code B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 12 related publications on this page (citations in our corpus or others sharing the same primary CPC).