Masking for coarse grained reconfigurable architecture
US-2023058355-A1 · Feb 23, 2023 · US
US12099453B2 · US · B2
| Field | Value |
|---|---|
| Publication number | US-12099453-B2 |
| Application number | US-202217709031-A |
| Country | US |
| Kind code | B2 |
| Filing date | Mar 30, 2022 |
| Priority date | Mar 30, 2022 |
| Publication date | Sep 24, 2024 |
| Grant date | Sep 24, 2024 |
Official abstract text for this publication.
Embodiments of the present disclosure relate to application partitioning for locality in a stacked memory system. In an embodiment, one or more memory dies are stacked on the processor die. The processor die includes multiple processing tiles and each memory die includes multiple memory tiles. Vertically aligned memory tiles are directly coupled to and comprise the local memory block for a corresponding processing tile. An application program that operates on dense multi-dimensional arrays (matrices) may partition the dense arrays into sub-arrays associated with program tiles. Each program tile is executed by a processing tile using the processing tile's local memory block to process the associated sub-array. Data associated with each sub-array is stored in a local memory block and the processing tile corresponding to the local memory block executes the program tile to process the sub-array data.
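The partitioning described in the abstract can be illustrated with a minimal sketch. It is not the patent's implementation: the function names `partition_2d`, `program_tile`, and `run`, the 2×2 tile grid, and the sum-of-squares operation are all assumptions chosen for illustration. A dictionary keyed by tile coordinate stands in for each processing tile's local memory block.

```python
def partition_2d(matrix, tile_rows, tile_cols):
    """Split a dense 2D array into one sub-array per processing tile.

    Returns a dict mapping a (row, col) tile coordinate to the sub-array
    that would be stored in that tile's local memory block.
    """
    m, p = len(matrix), len(matrix[0])
    if m % tile_rows or p % tile_cols:
        raise ValueError("array dimensions must divide evenly across the tile grid")
    sub_m, sub_p = m // tile_rows, p // tile_cols
    local_memory = {}
    for r in range(tile_rows):
        for c in range(tile_cols):
            local_memory[(r, c)] = [
                row[c * sub_p:(c + 1) * sub_p]
                for row in matrix[r * sub_m:(r + 1) * sub_m]
            ]
    return local_memory


def program_tile(sub_array):
    """Hypothetical sub-loop of the operation: sum of squares over local data only."""
    return sum(x * x for row in sub_array for x in row)


def run(matrix, tile_rows, tile_cols):
    """Execute one program tile per processing tile, then combine partial results."""
    local_memory = partition_2d(matrix, tile_rows, tile_cols)
    partials = {tile: program_tile(data) for tile, data in local_memory.items()}
    return partials, sum(partials.values())


# A 4x4 array split across an assumed 2x2 grid of processing tiles.
matrix = [[r * 4 + c for c in range(4)] for r in range(4)]
partials, total = run(matrix, 2, 2)
```

Each program tile in this sketch touches only the sub-array held in its own local block, which is the locality property the abstract describes; the final reduction is the only step that combines data across tiles.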
Opening claim text (preview).
What is claimed is:
1. A computer-implemented method of executing an application program by a stacked memory system, comprising: partitioning an N-dimensional array processed by an operation specified by the application program into a first number of N-dimensional sub-arrays; instantiating portions of the application program that include the operation to produce a second number of program tiles, wherein each program tile comprises instructions for a sub-loop of the operation; storing a portion of data associated with each sub-array in a local memory block comprising a memory tile of memory tiles, wherein the memory tiles are fabricated within a memory die that is stacked with a processor die within which a two-dimensional (2D) array of processing tiles are fabricated and conductive paths couple each processing tile in the 2D array to a corresponding one of the memory tiles for communication between each processing tile and the corresponding memory tile; executing the second number of the program tiles by the processing tiles to compute results for the operation, wherein a tile communication network for transmitting memory access requests from each processing tile to memory tiles coupled to a different processing tile is fabricated in the processor die and connects each processing tile in the 2D array with adjacent processing tiles in a first dimension of the 2D array and with adjacent processing tiles in a second dimension of the 2D array.
2. The computer-implemented method of claim 1, wherein N=2, the array is M×P, the first number is X and equals a quantity of the processing tiles, and each 2-dimensional sub-array is M/X×P/X.
3. The computer-implemented method of claim 2, wherein the second number of the program tiles is X².
4. The computer-implemented method of claim 1, wherein at least one additional memory die is stacked on the memory die and the local memory block for each processing tile comprises the memory tile and additional memory tiles fabricated within the additional memory die that are coupled to the processing tile by the conductive paths.
5. The computer-implemented method of claim 1, wherein the tile communication network transmits data between the first processing tile of the processing tiles and the second memory tile of the memory tiles corresponding to a second processing tile of the processing tiles.
6. The computer-implemented method of claim 1, further comprising: determining that data is not stored in a first memory tile of the memory tiles that is coupled to a first processing tile of the processing tiles and is stored in a second memory tile coupled to a second processing tile of the processing tiles; and migrating a thread including at least one of the instructions executing on the first processing tile to the second processing tile for processing of the data.
7. The computer-implemented method of claim 6, wherein migrating the thread comprises: transmitting a message with thread state for the thread to the second processing tile; and activating a new thread by the second processing tile in response to receiving the message.
8. The computer-implemented method of claim 6, wherein a thread remote procedure call migrates one or more parameters comprising state for the thread from the first processing tile to the second processing tile.
9. The computer-implemented method of claim 1, wherein the communication network transmits data between a first processing tile of the processing tiles and a second memory die that is stacked on a second processor die.
10. The computer-implemented method of claim 9, further comprising migrating a thread including at least one of the instructions executing on the first processing tile to a second processing tile fabricated within the second processor die for processing of data stored in a second memory tile fabricated within the second memory die.
11. The computer-implemented method of claim 1, further comprising determining the portion of data associated with each program tile using a graph partitioner.
12. The computer-implemented method of claim 1, wherein each processing tile comprises a mapping circuit configured to translate an address generated by the processing tile to a location in the local memory block or the local memory block of a different processing tile within the processor die.
13. The computer-implemented method of claim 1, wherein each processing tile comprises a mapping circuit configured to translate an address generated by the processing tile to a location in the local memory block, the local memory block of a different processing tile within the processor die, an additional memory die that is stacked on an additional processor within a device that includes the processor die and the memory die, or an additional memory die that is stacked on an additional processor that is external to the device.
14. The computer-implemented method of claim 1, wherein the conductive paths comprise a through-die via structure that is fabricated within the memory die.
15. The computer-implemented method of claim 14, wherein the through-die via structure comprises at least one of through-silicon vias, solder bumps, or hybrid bonds.
16. The computer-implemented method of claim 1, wherein the processor die comprises a graphics processing unit.
17. The computer-implemented method of claim 1, wherein at least one of the steps of partitioning, instantiating, storing, and executing are performed on a server or in a data center to generate an image, and the image is streamed to a user device.
18. The computer-implemented method of claim 1, wherein at least one of the steps of partitioning, instantiating, storing, and executing are performed within a cloud computing environment.
19. The computer-implemented method of claim 1, wherein at least one of the steps of partitioning, instantiating, storing, and executing are performed for training, testing, or inferencing with a neural network employed in a machine, robot, or autonomous vehicle.
20. The computer-implemented method of claim 1, wherein at least one of the steps of partitioning, instantiating, storing, and executing is performed on a virtual machine comprising a portion of a graphics processing unit.
21. A non-transitory computer-readable media storing computer instructions that, when executed by a stacked memory system, cause the one or more processors to perform the steps of: partitioning an N-dimensional array processed by an operation specified by the application program into a first number of N-dimensional sub-arrays; instantiating portions of the application program that include the operation to produce a second number of program tiles, wherein each program tile comprises instructions for a sub-loop of the operation; storing a portion of data associated with each sub-array in a local memory block comprising a memory tile of memory tiles, wherein the memory tiles are fabricated within a memory die that is stacked with a processor die within which a two-dimensional (2D) array of processing tiles are fabricated and conductive paths couple each processing tile in the 2D array to a corresponding one of the memory tiles for communication between each processing tile and the corresponding memory tile; and executing the second number of the program tiles by the processing tiles to compute results for the operation, wherein a tile communication network for transmitting memory access requests from each processing tile to memory tiles coupled to a different processing tile is fa
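The address mapping recited in claim 12 and the thread migration recited in claims 6 and 7 can be sketched together in software. This is an illustration under stated assumptions, not the claimed circuit: the constants `TILE_BYTES` and `GRID_COLS`, the `Tile` class, the `inbox` message queue, and the function names are hypothetical, and the claims do not specify an address layout. The sketch assumes a flat address space laid out tile by tile in row-major order.

```python
from dataclasses import dataclass, field

TILE_BYTES = 1024  # hypothetical size of each tile's local memory block
GRID_COLS = 2      # hypothetical width of the 2D processing-tile array


def map_address(addr):
    """Translate a flat address to ((row, col), offset).

    Assumes tiles own consecutive TILE_BYTES ranges in row-major order.
    """
    tile_index, offset = divmod(addr, TILE_BYTES)
    return divmod(tile_index, GRID_COLS), offset


@dataclass
class Tile:
    coord: tuple
    memory: dict = field(default_factory=dict)  # offset -> value
    inbox: list = field(default_factory=list)   # pending migration messages


def access_or_migrate(tiles, current, addr, thread_state):
    """Read addr locally if possible; otherwise migrate the thread to the owner.

    Returns (tile the thread ends up on, value read).
    """
    owner, offset = map_address(addr)
    if owner == current:
        return current, tiles[current].memory.get(offset)
    # Data lives in another tile's local block: send a message carrying the
    # thread state, then let the receiving tile activate a new thread for it.
    tiles[owner].inbox.append({"state": thread_state, "offset": offset})
    message = tiles[owner].inbox.pop(0)
    return owner, tiles[owner].memory.get(message["offset"])


tiles = {(r, c): Tile((r, c)) for r in range(2) for c in range(2)}
tiles[(1, 0)].memory[8] = "payload"
new_tile, value = access_or_migrate(tiles, (0, 0), 2 * TILE_BYTES + 8, {"pc": 0})
```

In this sketch the thread that started on tile (0, 0) ends up on tile (1, 0), where the data is local. The alternative contemplated by the claims, sending the memory access request over the tile communication network and leaving the thread in place, is not modeled here.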
using buffers · CPC title
for memories · CPC title
Group selection circuits, e.g. for memory block selection, chip selection, array selection · CPC title
Synchronisation and timing concerns (synchronisation on a memory bus G06F13/4234) · CPC title
Convolutional networks [CNN, ConvNet] · CPC title