Memory system architecture for multi-threaded processors
US-2020104164-A1 · Apr 2, 2020 · US
| Field | Value |
|---|---|
| Publication number | US-11106494-B2 |
| Application number | US-201816147302-A |
| Country | US |
| Kind code | B2 |
| Filing date | Sep 28, 2018 |
| Priority date | Sep 28, 2018 |
| Publication date | Aug 31, 2021 |
| Grant date | Aug 31, 2021 |
Official abstract text for this publication.
Disclosed embodiments relate to an improved memory system architecture for multi-threaded processors. In one example, a system includes a system comprising a multi-threaded processor core (MTPC), the MTPC comprising: P pipelines, each to concurrently process T threads; a crossbar to communicatively couple the P pipelines; a memory for use by the P pipelines, a scheduler to optimize reduction operations by assigning multiple threads to generate results of commutative arithmetic operations, and then accumulate the generated results, and a memory controller (MC) to connect with external storage and other MTPCs, the MC further comprising at least one optimization selected from: an instruction set architecture including a dual-memory operation; a direct memory access (DMA) engine; a buffer to store multiple pending instruction cache requests; multiple channels across which to stripe memory requests; and a shadow-tag coherency management unit.
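The reduction optimization mentioned in the abstract (several threads each produce partial results of a commutative arithmetic operation, and the partial results are then accumulated) can be illustrated with a minimal software sketch. This is only an analogy: the patent describes a hardware scheduler, whereas the code below uses Python worker threads as stand-ins. The function name `threaded_reduce` and its parameters are hypothetical and do not come from the patent. Splitting the input into chunks also assumes the operation is associative, which holds for the addition and multiplication used here.

```python
from concurrent.futures import ThreadPoolExecutor
from functools import reduce
import operator


def threaded_reduce(values, op=operator.add, identity=0, num_threads=4):
    """Reduce `values` with `op` by splitting the work across threads.

    Hypothetical helper for illustration only; it is not the patent's scheduler.
    `op` is assumed to be commutative and associative, and `identity` must be
    its neutral element (0 for addition, 1 for multiplication).
    """
    if not values:
        return identity

    # Ceiling division, so that at most `num_threads` chunks are produced.
    chunk_size = -(-len(values) // num_threads)
    chunks = [values[i:i + chunk_size] for i in range(0, len(values), chunk_size)]

    # Step 1: each worker thread generates a partial result for its chunk.
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        partials = list(pool.map(lambda chunk: reduce(op, chunk, identity), chunks))

    # Step 2: accumulate the partial results into the final value.
    return reduce(op, partials, identity)


if __name__ == "__main__":
    print(threaded_reduce(list(range(1, 101))))                      # sum of 1..100
    print(threaded_reduce([1, 2, 3, 4], op=operator.mul, identity=1))  # product
```

Because the final accumulation only combines a handful of partial results, most of the work happens in the per-thread step, which is the part a scheduler can spread across threads.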
Opening claim text (preview).
What is claimed is:
1. A system comprising: a first multi-threaded processor core; and a second multi-threaded processor core coupled to the first multi-threaded processor core, wherein the first multi-threaded processor core and the second multi-threaded processor core each comprise: a plurality of pipelines, each pipeline to concurrently process a plurality of threads, a crossbar to communicatively couple the plurality of pipelines, a memory controller to connect with an external storage, and a direct memory access engine to, in response to a single instruction executed by a pipeline of the plurality of pipelines, cause a load of a pointer from a first location of the external storage by the memory controller, and perform an access at a second location of the external storage by the memory controller as indicated by the pointer, wherein one of: the direct memory access engine comprises multiple memory channels to the external storage, and a granularity of striping requests across the multiple memory channels is controlled by an N-bit field appended to each memory request address, wherein N is equal to at least two, or the first multi-threaded processor core comprises a shadow-tag coherency management unit, and requests follow a cache coherency protocol.
2. The system of claim 1, further comprising a plurality of sockets, with a plurality of dies per socket, and a plurality of cores per die, wherein the first multi-threaded processor core and the second multi-threaded processor core are in a single die.
3. The system of claim 1, wherein the first multi-threaded processor core further includes a plurality of single-threaded pipelines, and the pipeline is a single-threaded pipeline of the plurality of single-threaded pipelines.
4. The system of claim 1, wherein the single instruction is one of an indirect load instruction, an indirect store instruction, and an indirect-load-store instruction.
5. The system of claim 1, wherein the direct memory access engine is to perform either direct or indirect memory block transfers, and wherein the direct memory access engine is further to break each load or store block transfer into individual loads or stores, respectively.
6. The system of claim 1, wherein the first multi-threaded processor core comprises a buffer that supports out-of-order execution by enqueuing and dequeuing instruction cache requests in order, and servicing enqueued instruction cache requests out-of-order.
7. The system of claim 1, wherein the one is the direct memory access engine comprises the multiple memory channels to the external storage, and the granularity of striping requests across the multiple memory channels is controlled by the N-bit field appended to each memory request address, wherein N is equal to at least two.
8. The system of claim 1, wherein the one is the first multi-threaded processor core comprises the shadow-tag coherency management unit, and requests follow the cache coherency protocol.
9. The system of claim 1, wherein the access at the second location of the external storage is a load.
10. A method, performed by a system comprising a multi-threaded processor core coupled to an external storage, the multi-threaded processor core comprising a plurality of pipelines to process a plurality of threads, a crossbar coupling the plurality of pipelines, and a direct memory access engine, comprising: decoding a single instruction into a decoded single instruction; and executing the decoded single instruction with a pipeline of the plurality of pipelines to cause the direct memory access engine to load a pointer from a first location of the external storage, and perform an access at a second location of the external storage as indicated by the pointer, wherein one of: the direct memory access engine comprises multiple memory channels to the external storage, and a granularity of striping requests across the multiple memory channels is controlled by an N-bit field appended to each memory request address, wherein N is equal to at least two, or the multi-threaded processor core comprises a shadow-tag coherency management unit, and requests follow a cache coherency protocol.
11. The method of claim 10, wherein the system comprises a plurality of sockets, with a plurality of dies per socket, and a plurality of cores per die, wherein the multi-threaded processor core is one of the plurality of cores in a die.
12. The method of claim 10, wherein the multi-threaded processor core further includes a plurality of single-threaded pipelines, and the pipeline is a single-threaded pipeline of the plurality of single-threaded pipelines.
13. The method of claim 10, wherein the direct memory access engine is to perform either direct or indirect memory block transfers, and wherein the direct memory access engine is further to break each load or store block transfer into individual loads or stores, respectively.
14. The method of claim 10, wherein the multi-threaded processor core comprises a buffer that supports out-of-order execution by enqueuing and dequeuing instruction cache requests in order, and servicing enqueued instruction cache requests out-of-order.
15. The method of claim 10, wherein the multi-threaded processor core further comprises a scheduler to optimize reduction operations by assigning multiple threads to generate results of commutative arithmetic operations, and then to accumulate the generated results.
16. The method of claim 10, wherein the one is the direct memory access engine comprises the multiple memory channels to the external storage, and the granularity of striping requests across the multiple memory channels is controlled by the N-bit field appended to each memory request address, wherein N is equal to at least two.
17. The method of claim 10, wherein the one is the multi-threaded processor core comprises the shadow-tag coherency management unit, and requests follow the cache coherency protocol.
18. The method of claim 10, wherein the access at the second location of the external storage is a load.
19. A non-transitory computer-readable medium containing code, that when performed by a system comprising a multi-threaded processor core coupled to an external storage, the multi-threaded processor core comprising a plurality of pipelines to process a plurality of threads, a crossbar coupling the plurality of pipelines, and a direct memory access engine, causes a method comprising: decoding a single instruction into a decoded single instruction; and executing the decoded single instruction with a pipeline of the plurality of pipelines to cause the direct memory access engine to load a pointer from a first location of the external storage, and perform an access at a second location of the external storage as indicated by the pointer, wherein one of: the direct memory access engine comprises multiple memory channels to the external storage, and a granularity of striping requests across the multiple memory channels is controlled by an N-bit field appended to each memory request address, wherein N is equal to at least two, or the multi-threaded processor core comprises a shadow-tag coherency management unit, and requests follow a cache coherency protocol.
20. The non-transitory computer-readable medium of claim 19, wherein the system comprises a plurality of sockets, with a plurality of dies per socket, and a plurality of cores per die, and the multi-threaded processor core is one of the plurality of cores in a die.
21. The non-transitory com
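Two ideas recur across the independent claims: a single instruction that loads a pointer from one location of external storage and then accesses the location the pointer indicates, and striping of requests across multiple memory channels at a granularity selected by an N-bit field attached to each request address. The sketch below models both in software under stated assumptions. The class `ToyExternalStorage`, the functions `indirect_load` and `indirect_store`, and in particular the encoding of the granularity field (a field value g selects a stripe of `base_stripe_words * 2**g` words) are hypothetical choices made for illustration; the claims do not specify how the field is encoded.

```python
class ToyExternalStorage:
    """Word-addressed storage spread over several channels (software model only)."""

    def __init__(self, num_channels=4, base_stripe_words=1):
        self.num_channels = num_channels
        self.base_stripe_words = base_stripe_words
        self.words = {}                          # address -> stored value
        self.channel_hits = [0] * num_channels   # requests seen per channel

    def channel_for(self, addr, granularity_bits):
        # Assumed encoding: the field value g widens each stripe to
        # base_stripe_words * 2**g consecutive words per channel.
        stripe_words = self.base_stripe_words << granularity_bits
        return (addr // stripe_words) % self.num_channels

    def load(self, addr, granularity_bits=0):
        self.channel_hits[self.channel_for(addr, granularity_bits)] += 1
        return self.words.get(addr, 0)           # unwritten words read as 0

    def store(self, addr, value, granularity_bits=0):
        self.channel_hits[self.channel_for(addr, granularity_bits)] += 1
        self.words[addr] = value


def indirect_load(storage, pointer_addr, granularity_bits=0):
    """One logical operation: fetch a pointer, then load from where it points."""
    pointer = storage.load(pointer_addr, granularity_bits)
    return storage.load(pointer, granularity_bits)


def indirect_store(storage, pointer_addr, value, granularity_bits=0):
    """One logical operation: fetch a pointer, then store to where it points."""
    pointer = storage.load(pointer_addr, granularity_bits)
    storage.store(pointer, value, granularity_bits)


if __name__ == "__main__":
    mem = ToyExternalStorage(num_channels=4)
    mem.words[8] = 100     # location 8 holds a pointer to location 100
    mem.words[100] = 42    # location 100 holds the data
    print(indirect_load(mem, 8))
    print([mem.channel_for(a, 0) for a in range(8)])  # fine-grained striping
    print([mem.channel_for(a, 2) for a in range(8)])  # coarser, 4-word stripes
```

In this model the caller issues one call per indirect access, mirroring the way the claimed engine performs both the pointer fetch and the dependent access in response to a single instruction. Changing `granularity_bits` only changes which channel serves an address, not the value returned.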
from multiple instruction streams, e.g. multistreaming · CPC title
controlled by a single instruction for multiple threads [SIMT] in parallel · CPC title
Operand accessing · CPC title
using burst mode transfer, e.g. direct memory access {DMA}, cycle steal (G06F13/32 takes precedence) · CPC title
Cache consistency protocols · CPC title