What technology area does this patent fall under?

Primary CPC classification G06F13/28. Mapped technology areas include Physics.

When was this patent published?

Publication date Aug 31, 2021 (B2). Legal status and post-grant events are not shown on this page.

What related patents are in patentsdb?

We list 3 related publications on this page (citations in our corpus or others sharing the same primary CPC).

Memory system architecture for multi-threaded processors

US11106494B2 · US · B2

Patent metadata
Field	Value
Publication number	US-11106494-B2
Application number	US-201816147302-A
Country	US
Kind code	B2
Filing date	Sep 28, 2018
Priority date	Sep 28, 2018
Publication date	Aug 31, 2021
Grant date	Aug 31, 2021

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

Title: What the patent document calls the invention.
Abstract: A short plain-language summary of the technical disclosure.
Assignees and inventors: Who owns or filed the patent and who is credited as inventor.
Key dates: Filing, priority, publication, and grant dates set the timeline.
First independent claim: The legal scope of protection. Read this for what is actually claimed.
CPC / IPC classifications: Technology tags used to group this patent with similar filings.
Citations and related patents: Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

Disclosed embodiments relate to an improved memory system architecture for multi-threaded processors. In one example, a system includes a system comprising a multi-threaded processor core (MTPC), the MTPC comprising: P pipelines, each to concurrently process T threads; a crossbar to communicatively couple the P pipelines; a memory for use by the P pipelines, a scheduler to optimize reduction operations by assigning multiple threads to generate results of commutative arithmetic operations, and then accumulate the generated results, and a memory controller (MC) to connect with external storage and other MTPCs, the MC further comprising at least one optimization selected from: an instruction set architecture including a dual-memory operation; a direct memory access (DMA) engine; a buffer to store multiple pending instruction cache requests; multiple channels across which to stripe memory requests; and a shadow-tag coherency management unit.
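The reduction optimization mentioned in the abstract (several threads each produce a partial result of a commutative arithmetic operation, and the partial results are then accumulated) can be illustrated in software. The following is a minimal sketch only, not the patented hardware scheduler: the function name `parallel_reduce`, the chunking strategy, and the use of a Python thread pool are all assumptions made for illustration.

```python
from concurrent.futures import ThreadPoolExecutor
from functools import reduce
from operator import add


def parallel_reduce(values, op=add, num_threads=4):
    """Reduce `values` with a commutative, associative `op`.

    Hypothetical illustration: each worker thread reduces one chunk,
    then the partial results are accumulated in a final step.
    """
    if not values:
        raise ValueError("values must be non-empty")

    # Split the input into at most `num_threads` contiguous chunks.
    chunk_size = -(-len(values) // num_threads)  # ceiling division
    chunks = [values[i:i + chunk_size]
              for i in range(0, len(values), chunk_size)]

    # Step 1: each thread generates a partial result for its chunk.
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        partials = list(pool.map(lambda chunk: reduce(op, chunk), chunks))

    # Step 2: accumulate the partial results into the final value.
    return reduce(op, partials)


total = parallel_reduce(list(range(1, 101)))
largest = parallel_reduce([3, 1, 4, 1, 5], op=max, num_threads=2)
```

Splitting the work this way is only valid because the operation is commutative and associative, so the order in which partial results are combined does not change the answer.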

First claim

Opening claim text (preview).

What is claimed is:

1. A system comprising: a first multi-threaded processor core; and a second multi-threaded processor core coupled to the first multi-threaded processor core, wherein the first multi-threaded processor core and the second multi-threaded processor core each comprise: a plurality of pipelines, each pipeline to concurrently process a plurality of threads, a crossbar to communicatively couple the plurality of pipelines, a memory controller to connect with an external storage, and a direct memory access engine to, in response to a single instruction executed by a pipeline of the plurality of pipelines, cause a load of a pointer from a first location of the external storage by the memory controller, and perform an access at a second location of the external storage by the memory controller as indicated by the pointer, wherein one of: the direct memory access engine comprises multiple memory channels to the external storage, and a granularity of striping requests across the multiple memory channels is controlled by an N-bit field appended to each memory request address, wherein N is equal to at least two, or the first multi-threaded processor core comprises a shadow-tag coherency management unit, and requests follow a cache coherency protocol.

2. The system of claim 1, further comprising a plurality of sockets, with a plurality of dies per socket, and a plurality of cores per die, wherein the first multi-threaded processor core and the second multi-threaded processor core are in a single die.

3. The system of claim 1, wherein the first multi-threaded processor core further includes a plurality of single-threaded pipelines, and the pipeline is a single-threaded pipeline of the plurality of single-threaded pipelines.

4. The system of claim 1, wherein the single instruction is one of an indirect load instruction, an indirect store instruction, and an indirect-load-store instruction.

5. The system of claim 1, wherein the direct memory access engine is to perform either direct or indirect memory block transfers, and wherein the direct memory access engine is further to break each load or store block transfer into individual loads or stores, respectively.

6. The system of claim 1, wherein the first multi-threaded processor core comprises a buffer that supports out-of-order execution by enqueuing and dequeuing instruction cache requests in order, and servicing enqueued instruction cache requests out-of-order.

7. The system of claim 1, wherein the one is the direct memory access engine comprises the multiple memory channels to the external storage, and the granularity of striping requests across the multiple memory channels is controlled by the N-bit field appended to each memory request address, wherein N is equal to at least two.

8. The system of claim 1, wherein the one is the first multi-threaded processor core comprises the shadow-tag coherency management unit, and requests follow the cache coherency protocol.

9. The system of claim 1, wherein the access at the second location of the external storage is a load.

10. A method, performed by a system comprising a multi-threaded processor core coupled to an external storage, the multi-threaded processor core comprising a plurality of pipelines to process a plurality of threads, a crossbar coupling the plurality of pipelines, and a direct memory access engine, comprising: decoding a single instruction into a decoded single instruction; and executing the decoded single instruction with a pipeline of the plurality of pipelines to cause the direct memory access engine to load a pointer from a first location of the external storage, and perform an access at a second location of the external storage as indicated by the pointer, wherein one of: the direct memory access engine comprises multiple memory channels to the external storage, and a granularity of striping requests across the multiple memory channels is controlled by an N-bit field appended to each memory request address, wherein N is equal to at least two, or the multi-threaded processor core comprises a shadow-tag coherency management unit, and requests follow a cache coherency protocol.

11. The method of claim 10, wherein the system comprises a plurality of sockets, with a plurality of dies per socket, and a plurality of cores per die, wherein the multi-threaded processor core is one of the plurality of cores in a die.

12. The method of claim 10, wherein the multi-threaded processor core further includes a plurality of single-threaded pipelines, and the pipeline is a single-threaded pipeline of the plurality of single-threaded pipelines.

13. The method of claim 10, wherein the direct memory access engine is to perform either direct or indirect memory block transfers, and wherein the direct memory access engine is further to break each load or store block transfer into individual loads or stores, respectively.

14. The method of claim 10, wherein the multi-threaded processor core comprises a buffer that supports out-of-order execution by enqueuing and dequeuing instruction cache requests in order, and servicing enqueued instruction cache requests out-of-order.

15. The method of claim 10, wherein the multi-threaded processor core further comprises a scheduler to optimize reduction operations by assigning multiple threads to generate results of commutative arithmetic operations, and then to accumulate the generated results.

16. The method of claim 10, wherein the one is the direct memory access engine comprises the multiple memory channels to the external storage, and the granularity of striping requests across the multiple memory channels is controlled by the N-bit field appended to each memory request address, wherein N is equal to at least two.

17. The method of claim 10, wherein the one is the multi-threaded processor core comprises the shadow-tag coherency management unit, and requests follow the cache coherency protocol.

18. The method of claim 10, wherein the access at the second location of the external storage is a load.

19. A non-transitory computer-readable medium containing code, that when performed by a system comprising a multi-threaded processor core coupled to an external storage, the multi-threaded processor core comprising a plurality of pipelines to process a plurality of threads, a crossbar coupling the plurality of pipelines, and a direct memory access engine, causes a method comprising: decoding a single instruction into a decoded single instruction; and executing the decoded single instruction with a pipeline of the plurality of pipelines to cause the direct memory access engine to load a pointer from a first location of the external storage, and perform an access at a second location of the external storage as indicated by the pointer, wherein one of: the direct memory access engine comprises multiple memory channels to the external storage, and a granularity of striping requests across the multiple memory channels is controlled by an N-bit field appended to each memory request address, wherein N is equal to at least two, or the multi-threaded processor core comprises a shadow-tag coherency management unit, and requests follow a cache coherency protocol.

20. The non-transitory computer-readable medium of claim 19, wherein the system comprises a plurality of sockets, with a plurality of dies per socket, and a plurality of cores per die, and the multi-threaded processor core is one of the plurality of cores in a die.

21. The non-transitory com
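Two mechanisms recited in the independent claims lend themselves to a short software illustration: the indirect access (load a pointer from one location, then access the location it names) and the striping of requests across memory channels. The sketch below is an illustration under stated assumptions, not the claimed hardware. Memory is modeled as a Python dict, and all names (`indirect_load`, `indirect_store`, `channel_for`, `BASE_STRIPE_BYTES`) are hypothetical. In particular, the claim text does not say how the N-bit field encodes granularity; the power-of-two scaling used here is an assumption.

```python
BASE_STRIPE_BYTES = 64  # assumed smallest stripe size; not stated in the claims


def indirect_load(memory, pointer_addr):
    """Load a pointer from `pointer_addr`, then load the value it points to."""
    pointer = memory[pointer_addr]  # first access: fetch the pointer
    return memory[pointer]          # second access: location named by the pointer


def indirect_store(memory, pointer_addr, value):
    """Load a pointer from `pointer_addr`, then store `value` where it points."""
    pointer = memory[pointer_addr]
    memory[pointer] = value


def channel_for(address, granularity_field, num_channels=4):
    """Pick a memory channel for `address`.

    Assumption: the granularity field scales the stripe size by powers of
    two, so field value g gives stripes of BASE_STRIPE_BYTES << g bytes.
    """
    stripe_bytes = BASE_STRIPE_BYTES << granularity_field
    return (address // stripe_bytes) % num_channels


# Toy memory: address 0x10 holds a pointer to address 0x40.
memory = {0x10: 0x40, 0x40: 99}
loaded = indirect_load(memory, 0x10)
indirect_store(memory, 0x10, 7)
```

Under this assumed encoding, a larger field value keeps more consecutive addresses on the same channel, while a smaller value spreads nearby addresses across channels.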

Assignees

Intel Corp

Inventors

Classifications

G06F9/3851
from multiple instruction streams, e.g. multistreaming · CPC title
G06F9/3888
controlled by a single instruction for multiple threads [SIMT] in parallel · CPC title
G06F9/3824
Operand accessing · CPC title
G06F13/28 · Primary
using burst mode transfer, e.g. direct memory access {DMA}, cycle steal (G06F13/32 takes precedence) · CPC title
G06F12/0815
Cache consistency protocols · CPC title

Patent family

Related publications grouped by family.

View patent family 69947491

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US11106494B2 cover?: Disclosed embodiments relate to an improved memory system architecture for multi-threaded processors. In one example, a system includes a system comprising a multi-threaded processor core (MTPC), the MTPC comprising: P pipelines, each to concurrently process T threads; a crossbar to communicatively couple the P pipelines; a memory for use by the P pipelines, a scheduler to optimize reduction op…
Who is the assignee on this patent?: Intel Corp