System, method, and computer program product for simultaneous execution of compute and graphics workloads
US-10217183-B2 · Feb 26, 2019 · US
US11907717B2 · US · B2
| Field | Value |
|---|---|
| Publication number | US-11907717-B2 |
| Application number | US-202318107374-A |
| Country | US |
| Kind code | B2 |
| Filing date | Feb 8, 2023 |
| Priority date | Oct 29, 2019 |
| Publication date | Feb 20, 2024 |
| Grant date | Feb 20, 2024 |
Official abstract text for this publication.
A technique for block data transfer is disclosed that reduces data transfer and memory access overheads and significantly reduces multiprocessor activity and energy consumption. Threads executing on a multiprocessor needing data stored in global memory can request and store the needed data in on-chip shared memory, which can be accessed by the threads multiple times. The data can be loaded from global memory and stored in shared memory using an instruction which directs the data into the shared memory without storing the data in registers and/or cache memory of the multiprocessor during the data transfer.
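The difference between the two data paths in the abstract can be sketched with a toy model. This is only an illustration under stated assumptions: the class `Multiprocessor`, its methods, and the counters are hypothetical names invented here, and the model counts staging steps rather than simulating any real hardware or instruction set.

```python
from dataclasses import dataclass


@dataclass
class Multiprocessor:
    """Toy model of one multiprocessor's storage (all names are hypothetical)."""
    global_mem: list          # stands in for off-chip global memory
    shared_mem: list          # stands in for on-chip shared memory
    register_writes: int = 0  # values staged in per-thread registers
    cache_fills: int = 0      # values staged in the cache

    def load_via_registers(self, src: int, dst: int, n: int) -> None:
        """Conventional path: global -> cache -> register -> shared."""
        for i in range(n):
            self.cache_fills += 1            # value passes through the cache
            reg = self.global_mem[src + i]   # value lands in a register
            self.register_writes += 1
            self.shared_mem[dst + i] = reg   # separate store to shared memory

    def load_direct(self, src: int, dst: int, n: int) -> None:
        """Fused path: global -> shared, with no register or cache staging."""
        self.shared_mem[dst:dst + n] = self.global_mem[src:src + n]


sm = Multiprocessor(global_mem=list(range(64)), shared_mem=[0] * 32)
sm.load_via_registers(src=0, dst=0, n=16)   # first half, conventional path
sm.load_direct(src=16, dst=16, n=16)        # second half, direct path
print(sm.shared_mem == list(range(32)), sm.register_writes, sm.cache_fills)
# prints: True 16 16
```

Both halves of shared memory end up with the same data, but only the conventional path touches the register and cache counters, which is the saving the abstract describes.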
Opening claim text (preview).
The invention claimed is:
1. A processing system comprising: a multithreaded processor configured to concurrently execute a plurality of threads; a plurality of data registers, each of the data registers assigned to executing threads; and on-chip memory including a cache memory and a shared memory, wherein the system is configured to, in response to at least one of the threads, execute a first instruction to retrieve data stored in a memory external to the processing system through the data registers and the cache memory and store the data into the shared memory, and in response to at least one of the threads, execute a second instruction to retrieve data stored in the memory external to the processing system and store the data into the shared memory without first storing the data in the data registers.
2. The processing system of claim 1, wherein the cache memory is configured to allow the plurality of executing threads to access tagged data stored in the cache memory, and the shared memory is configured to allow the plurality of executing threads to access untagged data stored in the shared memory.
3. The processing system of claim 2, wherein the shared memory is a software managed cache.
4. The processing system of claim 3, wherein the cache memory and the shared memory are a unified physical random access memory.
5. The processing system of claim 4, wherein the unified physical random access memory includes a register file including the plurality of data registers dynamically assignable to the executing threads.
6. The processing system of claim 1, wherein the processing system includes hardware circuitry configured to: receive a plurality of instructions to load data stored in the external memory and store the data into the shared memory, and determine, based on the plurality of instructions, a shared memory sector fill pattern for storing the retrieved data in the shared memory.
7. The processing system of claim 1, wherein the shared memory includes a plurality of sectors arranged in banks, and the retrieved data is stored into at least one of the sectors with an intra-sector swizzle.
8. The processing system of claim 1, wherein the shared memory includes a plurality of sectors arranged in banks, and portions of the retrieved data are stored into the sectors in different banks.
9. The processing system of claim 1, wherein the second instruction loads data from one sector of the external memory and stores the loaded data into parts of plural sectors in the shared memory.
10. The processing system of claim 1, wherein the retrieved data is stored in one or more cache lines of the cache memory and the system further comprises an interconnect circuit configured to transfer data stored in the one or more cache lines into sectors of the shared memory without first storing the retrieved data in the plurality of data registers.
11. A processing system comprising: a multithreaded processor configured to concurrently execute a plurality of threads; a plurality of data registers, each of the data registers assigned to executing threads; and on-chip memory including a cache memory and a shared memory, wherein the system is configured to, in response to at least one of the threads, execute a first instruction to retrieve data stored in a memory external to the processing system through the data registers and the cache memory and store the data into the shared memory, and in response to at least one of the threads, execute a second instruction to retrieve data stored in the memory external to the processing system and store the data into the shared memory without first storing the data in the data registers and the cache memory.
12. The processing system of claim 11, wherein the cache memory is configured to allow the plurality of executing threads to access tagged data stored in the cache memory, and the shared memory is configured to allow the plurality of executing threads to access untagged data stored in the shared memory.
13. The processing system of claim 12, wherein the shared memory is a software managed cache.
14. The processing system of claim 12, wherein the cache memory and the shared memory are a unified physical random access memory.
15. The processing system of claim 14, wherein the unified physical random access memory includes a register file including the plurality of data registers dynamically assignable to the executing threads.
16. A method performed by a processing system comprising: a multithreaded processor configured to concurrently execute a plurality of threads; a plurality of data registers, each of the data registers assigned to executing threads; and on-chip memory including a cache memory and a shared memory, the method comprising: at least one of the threads executing a first instruction to retrieve data stored in a memory external to the processing system through the data registers and the cache memory and store the data into the shared memory; and at least one of the threads executing a second instruction to retrieve data stored in the memory external to the processing system and store the data into the shared memory without first storing the data in the plurality of data registers.
17. The method of claim 16, wherein the cache memory is configured to allow the plurality of executing threads to access tagged data stored in the cache memory, and the shared memory is configured to allow the plurality of executing threads to access untagged data stored in the shared memory.
18. The method of claim 17, wherein the shared memory is a software managed cache.
19. The method of claim 17, wherein the cache memory and the shared memory are a unified physical random access memory.
20. The method of claim 19, wherein the unified physical random access memory includes a register file including the plurality of data registers dynamically assignable to the executing threads.
21. The method of claim 16, wherein the second instruction loads data from one sector of the external memory and stores the loaded data into parts of plural sectors in the shared memory.
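The claims mention a swizzle and a fill pattern that spreads data across banks, but the text reproduced here does not specify the pattern. As a rough sketch of why such a permutation helps, the example below uses an XOR swizzle, a commonly used bank-conflict-avoidance scheme that is assumed here purely for illustration. The bank count and the helper names `linear_slot`, `swizzled_slot`, and `banks_hit_by_column` are hypothetical.

```python
NUM_BANKS = 8  # hypothetical bank count, chosen only for illustration


def linear_slot(row: int, col: int) -> int:
    """Row-major placement: every element of a column lands in the same bank."""
    return row * NUM_BANKS + col


def swizzled_slot(row: int, col: int) -> int:
    """XOR swizzle (assumed scheme): permute the column index by the row index."""
    return row * NUM_BANKS + (col ^ row)


def banks_hit_by_column(slot_fn, col: int) -> set:
    """Banks touched when NUM_BANKS threads each read one row of one column."""
    return {slot_fn(row, col) % NUM_BANKS for row in range(NUM_BANKS)}


linear_banks = banks_hit_by_column(linear_slot, 3)
swizzled_banks = banks_hit_by_column(swizzled_slot, 3)
print(len(linear_banks), len(swizzled_banks))
# prints: 1 8
```

With the linear layout a column read is serialized on a single bank, while the swizzled layout spreads the same read over all eight banks. The swizzle only permutes positions within a row, so every element still has a unique slot.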
from multiple instruction streams, e.g. multistreaming · CPC title
controlled by a single instruction for multiple threads [SIMT] in parallel · CPC title
LOAD or STORE instructions; Clear instruction · CPC title
Thread control instructions · CPC title
Barrier synchronisation · CPC title