Application partitioning for locality in a stacked memory system

US12099453B2 · US · B2

Patent metadata
Field	Value
Publication number	US-12099453-B2
Application number	US-202217709031-A
Country	US
Kind code	B2
Filing date	Mar 30, 2022
Priority date	Mar 30, 2022
Publication date	Sep 24, 2024
Grant date	Sep 24, 2024

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

Title
What the patent document calls the invention.
Abstract
A short plain-language summary of the technical disclosure.
Assignees and inventors
Who owns or filed the patent and who is credited as inventor.
Key dates
Filing, priority, publication, and grant dates set the timeline.
First independent claim
The legal scope of protection — read this for what is actually claimed.
CPC / IPC classifications
Technology tags used to group this patent with similar filings.
Citations and related patents
Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

Embodiments of the present disclosure relate to application partitioning for locality in a stacked memory system. In an embodiment, one or more memory dies are stacked on the processor die. The processor die includes multiple processing tiles and each memory die includes multiple memory tiles. Vertically aligned memory tiles are directly coupled to and comprise the local memory block for a corresponding processing tile. An application program that operates on dense multi-dimensional arrays (matrices) may partition the dense arrays into sub-arrays associated with program tiles. Each program tile is executed by a processing tile using the processing tile's local memory block to process the associated sub-array. Data associated with each sub-array is stored in a local memory block and the processing tile corresponding to the local memory block executes the program tile to process the sub-array data.
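
The partitioning idea in the abstract can be illustrated with a small sketch. It is not taken from the patent: the function names (`partition_2d`, `run_program_tiles`), the even-division requirement, and the use of plain Python lists to stand in for a tile's local memory block are all assumptions made for illustration. Each processing tile in a hypothetical grid receives one sub-array, and the same operation (the "program tile") runs once per tile on that tile's own data.

```python
def partition_2d(rows, cols, grid_rows, grid_cols):
    """Map each processing tile (r, c) to the half-open index ranges of its sub-array."""
    if rows % grid_rows or cols % grid_cols:
        # Assumption of this sketch only: the array divides evenly across the tile grid.
        raise ValueError("array does not divide evenly across the tile grid")
    block_r = rows // grid_rows
    block_c = cols // grid_cols
    return {
        (r, c): ((r * block_r, (r + 1) * block_r), (c * block_c, (c + 1) * block_c))
        for r in range(grid_rows)
        for c in range(grid_cols)
    }


def run_program_tiles(matrix, grid_rows, grid_cols, op):
    """Copy each sub-array into its tile's local block, then apply `op` tile by tile."""
    layout = partition_2d(len(matrix), len(matrix[0]), grid_rows, grid_cols)
    results = {}
    for tile, ((r0, r1), (c0, c1)) in layout.items():
        # The list slice stands in for the tile's local memory block.
        local_block = [row[c0:c1] for row in matrix[r0:r1]]
        results[tile] = op(local_block)
    return results


# A 4x4 array split across a hypothetical 2x2 grid of processing tiles.
matrix = [[r * 4 + c for c in range(4)] for r in range(4)]
sums = run_program_tiles(matrix, 2, 2, lambda block: sum(map(sum, block)))
```

Here every tile touches only the slice it owns, which is the locality property the abstract describes; a real system would run the tiles concurrently rather than in a loop.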

First claim

Opening claim text (preview).

What is claimed is:
1. A computer-implemented method of executing an application program by a stacked memory system, comprising: partitioning an N-dimensional array processed by an operation specified by the application program into a first number of N-dimensional sub-arrays; instantiating portions of the application program that include the operation to produce a second number of program tiles, wherein each program tile comprises instructions for a sub-loop of the operation; storing a portion of data associated with each sub-array in a local memory block comprising a memory tile of memory tiles, wherein the memory tiles are fabricated within a memory die that is stacked with a processor die within which a two-dimensional (2D) array of processing tiles are fabricated and conductive paths couple each processing tile in the 2D array to a corresponding one of the memory tiles for communication between each processing tile and the corresponding memory tile; executing the second number of the program tiles by the processing tiles to compute results for the operation, wherein a tile communication network for transmitting memory access requests from each processing tile to memory tiles coupled to a different processing tile is fabricated in the processor die and connects each processing tile in the 2D array with adjacent processing tiles in a first dimension of the 2D array and with adjacent processing tiles in a second dimension of the 2D array.
2. The computer-implemented method of claim 1, wherein N=2, the array is M×P, the first number is X and equals a quantity of the processing tiles, and each 2-dimensional sub-array is M/X×P/X.
3. The computer-implemented method of claim 2, wherein the second number of the program tiles is X².
4. The computer-implemented method of claim 1, wherein at least one additional memory die is stacked on the memory die and the local memory block for each processing tile comprises the memory tile and additional memory tiles fabricated within the additional memory die that are coupled to the processing tile by the conductive paths.
5. The computer-implemented method of claim 1, wherein the tile communication network transmits data between the first processing tile of the processing tiles and the second memory tile of the memory tiles corresponding to a second processing tile of the processing tiles.
6. The computer-implemented method of claim 1, further comprising: determining that data is not stored in a first memory tile of the memory tiles that is coupled to a first processing tile of the processing tiles and is stored in a second memory tile coupled to a second processing tile of the processing tiles; and migrating a thread including at least one of the instructions executing on the first processing tile to the second processing tile for processing of the data.
7. The computer-implemented method of claim 6, wherein migrating the thread comprises: transmitting a message with thread state for the thread to the second processing tile; and activating a new thread by the second processing tile in response to receiving the message.
8. The computer-implemented method of claim 6, wherein a thread remote procedure call migrates one or more parameters comprising state for the thread from the first processing tile to the second processing tile.
9. The computer-implemented method of claim 1, wherein the communication network transmits data between a first processing tile of the processing tiles and a second memory die that is stacked on a second processor die.
10. The computer-implemented method of claim 9, further comprising migrating a thread including at least one of the instructions executing on the first processing tile to a second processing tile fabricated within the second processor die for processing of data stored in a second memory tile fabricated within the second memory die.
11. The computer-implemented method of claim 1, further comprising determining the portion of data associated with each program tile using a graph partitioner.
12. The computer-implemented method of claim 1, wherein each processing tile comprises a mapping circuit configured to translate an address generated by the processing tile to a location in the local memory block or the local memory block of a different processing tile within the processor die.
13. The computer-implemented method of claim 1, wherein each processing tile comprises a mapping circuit configured to translate an address generated by the processing tile to a location in the local memory block, the local memory block of a different processing tile within the processor die, an additional memory die that is stacked on an additional processor within a device that includes the processor die and the memory die, or an additional memory die that is stacked on an additional processor that is external to the device.
14. The computer-implemented method of claim 1, wherein the conductive paths comprise a through-die via structure that is fabricated within the memory die.
15. The computer-implemented method of claim 14, wherein the through-die via structure comprises at least one of through-silicon vias, solder bumps, or hybrid bonds.
16. The computer-implemented method of claim 1, wherein the processor die comprises a graphics processing unit.
17. The computer-implemented method of claim 1, wherein at least one of the steps of partitioning, instantiating, storing, and executing are performed on a server or in a data center to generate an image, and the image is streamed to a user device.
18. The computer-implemented method of claim 1, wherein at least one of the steps of partitioning, instantiating, storing, and executing are performed within a cloud computing environment.
19. The computer-implemented method of claim 1, wherein at least one of the steps of partitioning, instantiating, storing, and executing are performed for training, testing, or inferencing with a neural network employed in a machine, robot, or autonomous vehicle.
20. The computer-implemented method of claim 1, wherein at least one of the steps of partitioning, instantiating, storing, and executing is performed on a virtual machine comprising a portion of a graphics processing unit.
21. A non-transitory computer-readable media storing computer instructions that, when executed by a stacked memory system, cause the one or more processors to perform the steps of: partitioning an N-dimensional array processed by an operation specified by the application program into a first number of N-dimensional sub-arrays; instantiating portions of the application program that include the operation to produce a second number of program tiles, wherein each program tile comprises instructions for a sub-loop of the operation; storing a portion of data associated with each sub-array in a local memory block comprising a memory tile of memory tiles, wherein the memory tiles are fabricated within a memory die that is stacked with a processor die within which a two-dimensional (2D) array of processing tiles are fabricated and conductive paths couple each processing tile in the 2D array to a corresponding one of the memory tiles for communication between each processing tile and the corresponding memory tile; and executing the second number of the program tiles by the processing tiles to compute results for the operation, wherein a tile communication network for transmitting memory access requests from each processing tile to memory tiles coupled to a different processing tile is fa…
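
Claims 6 and 12 describe translating an address to the tile that holds it and, when that tile is not the one running the thread, migrating the thread. The sketch below is one way to picture that decision, not the patented mapping circuit. The grid size, the block size, the row-major address layout, and the names `translate`, `mesh_hops`, and `plan_access` are all hypothetical; the only detail drawn from claim 1 is that each tile is linked to adjacent tiles in two dimensions, so the hop count is modeled as a Manhattan distance.

```python
from dataclasses import dataclass

GRID_ROWS = 4          # hypothetical 4x4 array of processing tiles
GRID_COLS = 4
BLOCK_BYTES = 1 << 20  # hypothetical size of each tile's local memory block


@dataclass(frozen=True)
class Location:
    tile: tuple  # (row, col) of the tile whose local block holds the address
    offset: int  # byte offset inside that local block


def translate(address):
    """Hypothetical mapping: consecutive BLOCK_BYTES ranges belong to tiles in row-major order."""
    index, offset = divmod(address, BLOCK_BYTES)
    if not 0 <= index < GRID_ROWS * GRID_COLS:
        raise ValueError("address outside the stacked memory modeled here")
    return Location(tile=divmod(index, GRID_COLS), offset=offset)


def mesh_hops(src, dst):
    """Minimum hops on a 2D mesh that links each tile only to its row and column neighbours."""
    return abs(src[0] - dst[0]) + abs(src[1] - dst[1])


def plan_access(current_tile, address):
    """Return ('local', location, 0) or ('migrate', location, hops) for a thread on current_tile."""
    loc = translate(address)
    if loc.tile == current_tile:
        return ("local", loc, 0)
    # This sketch always migrates the thread; sending a remote memory request
    # over the tile network, as in claim 1, would be the alternative.
    return ("migrate", loc, mesh_hops(current_tile, loc.tile))
```

For example, `plan_access((0, 0), 5 * BLOCK_BYTES + 7)` resolves to the block owned by tile `(1, 1)` in this layout, two hops from tile `(0, 0)`.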

Assignees

Nvidia Corp

Inventors

Classifications

Code	CPC title
G06F13/1673	using buffers
H03K19/1776 (Primary)	for memories
G11C8/12	Group selection circuits, e.g. for memory block selection, chip selection, array selection
G06F13/1689	Synchronisation and timing concerns (synchronisation on a memory bus G06F13/4234)
G06N3/0464	Convolutional networks [CNN, ConvNet]

Patent family

Related publications grouped by family.

View patent family 88194443

External sources

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US12099453B2 cover?: Embodiments of the present disclosure relate to application partitioning for locality in a stacked memory system. In an embodiment, one or more memory dies are stacked on the processor die. The processor die includes multiple processing tiles and each memory die includes multiple memory tiles. Vertically aligned memory tiles are directly coupled to and comprise the local memory block for a corr…
Who is the assignee on this patent?: Nvidia Corp
What technology area does this patent fall under?: Primary CPC classification H03K19/1776. Mapped technology areas include Electricity.
When was this patent published?: Publication date Sep 24, 2024 (kind code B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?: We list 5 related publications on this page (citations in our corpus or others sharing the same primary CPC).

Related patents

Masking for coarse grained reconfigurable architecture

Compute accelerated stacked memory

3D stacked integrated circuits having functional blocks configured to accelerate artificial neural network (ANN) computation

Non-uniform bus (nub) interconnect protocol for tiled last level caches

Interconnect architecture for three-dimensional processing systems
