Techniques for efficiently transferring data to a processor

US11907717B2 · US · B2

Patent metadata
Publication number: US-11907717-B2
Application number: US-202318107374-A
Country: US
Kind code: B2
Filing date: Feb 8, 2023
Priority date: Oct 29, 2019
Publication date: Feb 20, 2024
Grant date: Feb 20, 2024

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

A technique for block data transfer is disclosed that reduces data transfer and memory access overheads and significantly reduces multiprocessor activity and energy consumption. Threads executing on a multiprocessor needing data stored in global memory can request and store the needed data in on-chip shared memory, which can be accessed by the threads multiple times. The data can be loaded from global memory and stored in shared memory using an instruction which directs the data into the shared memory without storing the data in registers and/or cache memory of the multiprocessor during the data transfer.
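The two data paths contrasted in the abstract can be illustrated with a toy software model. This is only a sketch under stated assumptions, not the patented hardware and not a real GPU API: the class `ToyMultiprocessor`, its methods `load_via_registers` and `load_direct`, and the helper `compare_paths` are hypothetical names, and the counters merely tally how many words pass through a simulated register file and how many instructions the simulated threads issue.

```python
from dataclasses import dataclass, field


@dataclass
class ToyMultiprocessor:
    """Toy model of one multiprocessor; every name here is hypothetical."""
    global_memory: list
    shared_memory: list = field(default_factory=list)
    register_writes: int = 0  # words staged in the simulated register file
    instructions: int = 0     # instructions issued by the simulated threads

    def load_via_registers(self, start, count):
        """Conventional path: global -> register, then register -> shared."""
        for i in range(start, start + count):
            reg = self.global_memory[i]     # load writes a data register
            self.register_writes += 1
            self.instructions += 1
            self.shared_memory.append(reg)  # separate store into shared memory
            self.instructions += 1

    def load_direct(self, start, count):
        """Fused path: global -> shared in one step, with no register staging."""
        for i in range(start, start + count):
            self.shared_memory.append(self.global_memory[i])
            self.instructions += 1


def compare_paths(data):
    """Run the same transfer through both paths and return both models."""
    staged = ToyMultiprocessor(global_memory=list(data))
    staged.load_via_registers(0, len(data))
    direct = ToyMultiprocessor(global_memory=list(data))
    direct.load_direct(0, len(data))
    return staged, direct
```

In this model both paths leave identical contents in shared memory, but only the conventional path touches the register counter, which is the property the abstract attributes to the fused instruction. Real energy and latency effects depend on hardware details that the model does not capture.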

First claim

Opening claim text (preview).

The invention claimed is:

1. A processing system comprising: a multithreaded processor configured to concurrently execute a plurality of threads; a plurality of data registers, each of the data registers assigned to executing threads; and on-chip memory including a cache memory and a shared memory, wherein the system is configured to, in response to at least one of the threads, execute a first instruction to retrieve data stored in a memory external to the processing system through the data registers and the cache memory and store the data into the shared memory, and in response to at least one of the threads, execute a second instruction to retrieve data stored in the memory external to the processing system and store the data into the shared memory without first storing the data in the data registers.

2. The processing system of claim 1, wherein the cache memory is configured to allow the plurality of executing threads to access tagged data stored in the cache memory, and the shared memory is configured to allow the plurality of executing threads to access untagged data stored in the shared memory.

3. The processing system of claim 2, wherein the shared memory is a software managed cache.

4. The processing system of claim 3, wherein the cache memory and the shared memory are a unified physical random access memory.

5. The processing system of claim 4, wherein the unified physical random access memory includes a register file including the plurality of data registers dynamically assignable to the executing threads.

6. The processing system of claim 1, wherein the processing system includes hardware circuitry configured to: receive a plurality of instructions to load data stored in the external memory and store the data into the shared memory, and determine, based on the plurality of instructions, a shared memory sector fill pattern for storing the retrieved data in the shared memory.

7. The processing system of claim 1, wherein the shared memory includes a plurality of sectors arranged in banks, and the retrieved data is stored into at least one of the sectors with an intra-sector swizzle.

8. The processing system of claim 1, wherein the shared memory includes a plurality of sectors arranged in banks, and portions of the retrieved data are stored into the sectors in different banks.

9. The processing system of claim 1, wherein the second instruction loads data from one sector of the external memory and stores the loaded data into parts of plural sectors in the shared memory.

10. The processing system of claim 1, wherein the retrieved data is stored in one or more cache lines of the cache memory and the system further comprises an interconnect circuit configured to transfer data stored in the one or more cache lines into sectors of the shared memory without first storing the retrieved data in the plurality of data registers.

11. A processing system comprising: a multithreaded processor configured to concurrently execute a plurality of threads; a plurality of data registers, each of the data registers assigned to executing threads; and on-chip memory including a cache memory and a shared memory, wherein the system is configured to, in response to at least one of the threads, execute a first instruction to retrieve data stored in a memory external to the processing system through the data registers and the cache memory and store the data into the shared memory, and in response to at least one of the threads, execute a second instruction to retrieve data stored in the memory external to the processing system and store the data into the shared memory without first storing the data in the data registers and the cache memory.

12. The processing system of claim 11, wherein the cache memory is configured to allow the plurality of executing threads to access tagged data stored in the cache memory, and the shared memory is configured to allow the plurality of executing threads to access untagged data stored in the shared memory.

13. The processing system of claim 12, wherein the shared memory is a software managed cache.

14. The processing system of claim 12, wherein the cache memory and the shared memory are a unified physical random access memory.

15. The processing system of claim 14, wherein the unified physical random access memory includes a register file including the plurality of data registers dynamically assignable to the executing threads.

16. A method performed by a processing system comprising: a multithreaded processor configured to concurrently execute a plurality of threads; a plurality of data registers, each of the data registers assigned to executing threads; and on-chip memory including a cache memory and a shared memory, the method comprising: at least one of the threads executing a first instruction to retrieve data stored in a memory external to the processing system through the data registers and the cache memory and store the data into the shared memory; and at least one of the threads executing a second instruction to retrieve data stored in the memory external to the processing system and store the data into the shared memory without first storing the data in the plurality of data registers.

17. The method of claim 16, wherein the cache memory is configured to allow the plurality of executing threads to access tagged data stored in the cache memory, and the shared memory is configured to allow the plurality of executing threads to access untagged data stored in the shared memory.

18. The method of claim 17, wherein the shared memory is a software managed cache.

19. The method of claim 17, wherein the cache memory and the shared memory are a unified physical random access memory.

20. The method of claim 19, wherein the unified physical random access memory includes a register file including the plurality of data registers dynamically assignable to the executing threads.

21. The method of claim 16, wherein the second instruction loads data from one sector of the external memory and stores the loaded data into parts of plural sectors in the shared memory.
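The claims that mention an intra-sector swizzle and storing portions of the data in different banks can be made concrete with a small addressing sketch. The claims do not say which swizzle function is used; the XOR pattern below is a common bank-conflict-avoidance technique chosen here purely as an assumption, and `NUM_BANKS`, `linear_address`, `swizzled_address`, `banks_touched_by_column`, and `fill_tile` are hypothetical names. The model treats each row of a square tile as one sector, with one word per bank, and requires the bank count to be a power of two.

```python
NUM_BANKS = 8  # assumed toy value (power of two); one word per bank per row


def linear_address(row, col, num_banks=NUM_BANKS):
    """Unswizzled layout: element (row, col) lands in bank `col`."""
    return row * num_banks + col


def swizzled_address(row, col, num_banks=NUM_BANKS):
    """Assumed XOR swizzle: permute the words inside each row (sector)."""
    return row * num_banks + (col ^ row)


def banks_touched_by_column(col, address_fn, num_banks=NUM_BANKS):
    """Banks hit when num_banks threads each read one row of a column."""
    return {address_fn(r, col, num_banks) % num_banks for r in range(num_banks)}


def fill_tile(tile, address_fn, num_banks=NUM_BANKS):
    """Store a num_banks x num_banks tile into a flat shared-memory array."""
    shared = [None] * (num_banks * num_banks)
    for r, sector in enumerate(tile):
        for c, word in enumerate(sector):
            shared[address_fn(r, c, num_banks)] = word
    return shared
```

With the linear layout, a column read touches a single bank, so the accesses would serialize. With the XOR layout, the same read spreads across all banks, while every element remains retrievable through the same address function that stored it.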

Assignees

Nvidia Corp

Inventors

Classifications

  • G06F9/3851 (Primary)

    from multiple instruction streams, e.g. multistreaming · CPC title

  • controlled by a single instruction for multiple threads [SIMT] in parallel · CPC title

  • LOAD or STORE instructions; Clear instruction · CPC title

  • Thread control instructions · CPC title

  • Barrier synchronisation · CPC title

Patent family

Related publications grouped by family.

External sources

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US11907717B2 cover?
A technique for block data transfer is disclosed that reduces data transfer and memory access overheads and significantly reduces multiprocessor activity and energy consumption. Threads executing on a multiprocessor needing data stored in global memory can request and store the needed data in on-chip shared memory, which can be accessed by the threads multiple times. The data can be loaded from…
Who is the assignee on this patent?
Nvidia Corp
What technology area does this patent fall under?
Primary CPC classification G06F9/3851. Mapped technology areas include Physics.
When was this patent published?
Publication date: Feb 20, 2024 (kind code B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 12 related publications on this page (citations in our corpus or others sharing the same primary CPC).