Deep learning hardware
US-2019392297-A1 · Dec 26, 2019 · US
US11663043B2 · US · B2
| Field | Value |
|---|---|
| Publication number | US-11663043-B2 |
| Application number | US-201916701019-A |
| Country | US |
| Kind code | B2 |
| Filing date | Dec 2, 2019 |
| Priority date | Dec 2, 2019 |
| Publication date | May 30, 2023 |
| Grant date | May 30, 2023 |
Official abstract text for this publication.
A system comprises a processor coupled to a plurality of memory units. Each of the plurality of memory units includes a request processing unit and a plurality of memory banks. The processor includes a plurality of processing elements and a communication network communicatively connecting the plurality of processing elements to the plurality of memory units. At least a first processing element of the plurality of processing elements includes a control logic unit and a matrix compute engine. The control logic unit is configured to access data from the plurality of memory units using a dynamically programmable distribution scheme.
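To make the idea of a per-workload distribution scheme concrete, the sketch below maps memory addresses to memory units in two different ways. It is illustrative only: the abstract does not define the mapping function, so the `DistributionScheme` class, its `rotation` parameter, and the rotated round-robin rule are all assumptions. The four unit names come from claim 7 and the access unit size from claim 11.

```python
from dataclasses import dataclass

# Unit names taken from claim 7 (north, east, south, west).
MEMORY_UNITS = ("north", "east", "south", "west")


@dataclass(frozen=True)
class DistributionScheme:
    """Hypothetical per-workload parameters; not defined by the patent text shown here."""
    access_unit_size: int  # bytes per access unit (claim 11 mentions such a size)
    rotation: int          # assumed parameter: shifts which unit receives the first access unit

    def unit_for(self, address: int) -> str:
        # Assumed rule: round-robin over access units, rotated by `rotation`.
        index = address // self.access_unit_size
        return MEMORY_UNITS[(index + self.rotation) % len(MEMORY_UNITS)]


# Two workloads using different schemes over the same address range.
scheme_a = DistributionScheme(access_unit_size=128, rotation=0)
scheme_b = DistributionScheme(access_unit_size=128, rotation=2)
layout_a = [scheme_a.unit_for(a) for a in range(0, 512, 128)]
layout_b = [scheme_b.unit_for(a) for a in range(0, 512, 128)]
print(layout_a)
print(layout_b)
```

Under these assumptions the same four addresses land on different memory units for the two workloads, which is the effect the abstract attributes to a dynamically programmable scheme.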
Opening claim text (preview).
What is claimed is:

1. A system, comprising: a plurality of memory units, wherein each of the plurality of memory units includes a request processing unit and a plurality of memory banks, wherein the request processing unit of each of the plurality of memory units is configured to decompose a received broadcasted memory request into a corresponding plurality of partial requests and the request processing unit of each of the plurality of memory units is configured to provide a partial response associated with a different one of the corresponding plurality of partial requests; and a processor coupled to the plurality of memory units, wherein the processor includes a plurality of processing elements and a communication network communicatively connecting the plurality of processing elements to the plurality of memory units, and wherein at least a first processing element of the plurality of processing elements includes a control logic unit and a matrix compute engine, the control logic unit is configured to access data from the plurality of memory units using a dynamically programmable distribution scheme; wherein the dynamically programmable distribution scheme specifies a parameter of a configurable distribution pattern for dynamically changing mapping scheme of memory addresses specific to a corresponding one of the processing elements to ordering among memory locations of the plurality of memory units, and the dynamically programmable distribution scheme is included in a plurality of different dynamically programmable distribution schemes utilized by different processing elements of the plurality of processing elements that allow different workloads to distribute their corresponding workload data across the plurality of memory units using different corresponding distribution schemes of different corresponding mapping schemes included in the plurality of different dynamically programmable distribution schemes desired for corresponding workloads of the different workloads.

2. The system of claim 1, wherein the broadcasted memory request references data stored in each of the plurality of memory units.

3. The system of claim 1, wherein the request processing unit of each of the plurality of memory units is configured to determine whether each of the corresponding plurality of partial requests corresponds to data stored in a corresponding one of the plurality of memory banks associated with the corresponding request processing unit.

4. The system of claim 1, wherein the control logic unit of the first processing element is configured to receive the partial responses and combine the partial responses to generate a complete response to the broadcasted memory request.

5. The system of claim 4, wherein each of the partial responses includes a corresponding sequence identifier used to order the partial responses.

6. The system of claim 4, wherein the complete response is stored in a local memory of the first processing element.

7. The system of claim 1, wherein the plurality of memory units includes a north memory unit, an east memory unit, a south memory unit, and a west memory unit.

8. The system of claim 1, wherein the dynamically programmable distribution scheme utilizes an identifier associated with a workload of the first processing element.

9. The system of claim 8, wherein two or more processing elements of the plurality of processing elements share the identifier.

10. The system of claim 1, wherein a second processing element of the plurality of processing elements is configured with a different dynamically programmable distribution scheme for accessing memory units than the first processing element.

11. The system of claim 1, wherein the control logic unit of the first processing element is further configured with an access unit size for distributing data across the plurality of memory units.

12. The system of claim 1, wherein data elements of a machine learning weight matrix are distributed across the plurality of memory units using the dynamically programmable distribution scheme.

13. A method comprising: receiving a memory configuration setting associated with a workload, wherein the workload is associated with a dynamically programmable distribution scheme; creating a memory access request that includes a workload identifier; broadcasting the memory access request to a plurality of memory units, wherein the request processing unit of each of the plurality of memory units is configured to decompose a received broadcasted memory request into a corresponding plurality of partial requests and the request processing unit of each of the plurality of memory units is configured to provide a partial response associated with a different one of the corresponding plurality of partial requests; receiving a plurality of partial responses associated with the memory access request; and combining the plurality of partial responses to create a complete response to the memory access request; wherein dynamically programmable distribution scheme specifies a parameter of a configurable distribution pattern for dynamically changing mapping scheme of memory addresses specific to a corresponding one processing element among a plurality of processing elements to ordering among memory locations of the plurality of memory units, and the dynamically programmable distribution scheme is included in a plurality of different dynamically programmable distribution schemes utilized by different processing elements of the plurality of processing elements that allow different workloads to distribute their corresponding workload data across the plurality of memory units using different corresponding distribution schemes of different corresponding mapping schemes included in the plurality of different dynamically programmable distribution schemes desired for corresponding workloads of the different workloads.

14. The method of claim 13, further comprising receiving an access unit size configuration setting.

15. The method of claim 14, wherein the memory access request has a memory request size that is a multiple of the access unit size configuration setting.

16. A method comprising: receiving a broadcasted memory request associated with a processing element workload wherein the processing element workload is associated with a dynamically programmable distribution scheme; decomposing the broadcasted memory request into a plurality of partial requests; determining for each of the plurality of partial requests whether the partial request is to be served from an associated memory bank of a plurality of memory units; discarding a first group of partial requests that is not to be served from the associated memory bank; for each partial request of a second group of partial requests that is to be served from the associated memory bank, retrieving data of the partial request; preparing one or more partial responses using the retrieved data; and providing the prepared one or more partial responses; wherein dynamically programmable distribution scheme specifies a parameter of a configurable distribution pattern for dynamically changing mapping scheme of memory addresses specific to a corresponding one processing element among a plurality of processing elements to ordering among memory locations of the plurality of memory units, and the dynamically programmable distribution scheme is included in a plurality of different dynamically programmable distribution schemes utilized by different processing elements of the plurality of processing elements that allow different workloads to distribute their corresponding workload data acro
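The request flow recited in claims 13 and 16 (broadcast, decompose, discard what is not local, respond, combine by sequence identifier) can be sketched as a small simulation. This is a rough model under stated assumptions, not the patented implementation: the `owner` mapping, the constants, and the class and function names are hypothetical, and each simulated unit is given a view of the whole address space purely to keep the example short.

```python
from dataclasses import dataclass

NUM_UNITS = 4    # assumed number of memory units
ACCESS_UNIT = 4  # assumed access unit size in bytes, kept tiny for readability


def owner(index: int, workload_id: int) -> int:
    """Assumed mapping: round-robin over units, rotated by the workload identifier."""
    return (index + workload_id) % NUM_UNITS


@dataclass
class PartialResponse:
    sequence_id: int  # used to order partial responses (claim 5)
    data: bytes


class MemoryUnit:
    def __init__(self, unit_id: int, backing: bytes):
        self.unit_id = unit_id
        # Simplification: every unit can see all bytes but serves only what it owns.
        self.backing = backing

    def handle(self, start: int, size: int, workload_id: int) -> list[PartialResponse]:
        """Decompose a broadcast request and answer only the locally owned parts."""
        responses = []
        for seq, offset in enumerate(range(start, start + size, ACCESS_UNIT)):
            if owner(offset // ACCESS_UNIT, workload_id) != self.unit_id:
                continue  # discard partial requests served by another unit
            responses.append(PartialResponse(seq, self.backing[offset:offset + ACCESS_UNIT]))
        return responses


def read(units: list[MemoryUnit], start: int, size: int, workload_id: int) -> bytes:
    """Broadcast to every unit, then combine partial responses in sequence order."""
    partials = [r for unit in units for r in unit.handle(start, size, workload_id)]
    partials.sort(key=lambda r: r.sequence_id)
    return b"".join(r.data for r in partials)


memory = bytes(range(32))
units = [MemoryUnit(i, memory) for i in range(NUM_UNITS)]
# Request size is a multiple of the access unit size, as in claim 15.
result = read(units, start=8, size=16, workload_id=1)
per_unit = [len(u.handle(8, 16, 1)) for u in units]
print(result == memory[8:24], per_unit)
```

In this model every partial request has exactly one owning unit, so sorting the collected partial responses by sequence identifier reconstructs the requested range regardless of the order in which units reply.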
Convolutional networks [CNN, ConvNet] · CPC title
in block erasable memory, e.g. flash memory · CPC title
and has means for transferring I/O instructions and statuses between control unit and main processor · CPC title
Performance improvement · CPC title
the resource being the memory · CPC title