Techniques for efficiently transferring data to a processor

US11080051B2 · US · B2

Patent metadata
Publication number: US-11080051-B2
Application number: US-201916712083-A
Country: US
Kind code: B2
Filing date: Dec 12, 2019
Priority date: Oct 29, 2019
Publication date: Aug 3, 2021
Grant date: Aug 3, 2021


Abstract

Official abstract text for this publication.

A technique for block data transfer is disclosed that reduces data transfer and memory access overheads and significantly reduces multiprocessor activity and energy consumption. Threads executing on a multiprocessor needing data stored in global memory can request and store the needed data in on-chip shared memory, which can be accessed by the threads multiple times. The data can be loaded from global memory and stored in shared memory using an instruction which directs the data into the shared memory without storing the data in registers and/or cache memory of the multiprocessor during the data transfer.
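The difference between a conventional staged copy and the fused transfer described in the abstract can be illustrated with a toy model. This is a sketch under stated assumptions, not the patented hardware or any real GPU API: `ToyMultiprocessor`, `staged_copy`, and `fused_copy` are hypothetical names, and the counters only tally how many values are written into each storage level.

```python
from dataclasses import dataclass, field


@dataclass
class ToyMultiprocessor:
    """Hypothetical model: tallies writes into each on-chip storage level."""
    global_mem: list
    shared_mem: dict = field(default_factory=dict)
    writes: dict = field(
        default_factory=lambda: {"cache": 0, "registers": 0, "shared": 0}
    )

    def staged_copy(self, src, dst, count):
        """Conventional path: load into a register, then a separate store."""
        for i in range(count):
            self.writes["cache"] += 1            # the load fills the cache
            register = self.global_mem[src + i]
            self.writes["registers"] += 1        # value is staged in a register
            self.shared_mem[dst + i] = register
            self.writes["shared"] += 1

    def fused_copy(self, src, dst, count):
        """Fused path: data reaches shared memory with no register or cache staging."""
        for i in range(count):
            self.shared_mem[dst + i] = self.global_mem[src + i]
            self.writes["shared"] += 1


data = list(range(100, 108))

staged = ToyMultiprocessor(data)
staged.staged_copy(src=0, dst=0, count=8)

fused = ToyMultiprocessor(data)
fused.fused_copy(src=0, dst=0, count=8)

print(staged.writes)  # {'cache': 8, 'registers': 8, 'shared': 8}
print(fused.writes)   # {'cache': 0, 'registers': 0, 'shared': 8}
```

Both copies leave the same contents in shared memory; only the number of intermediate writes differs, which is the overhead the abstract says the technique avoids.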

First claim

Opening claim text (preview).

The invention claimed is:

1. A method performed by a processing system including a multithreaded processor and on-chip memory including a cache memory and a shared memory managed by software, the method comprising: concurrently executing a plurality of threads including a thread executing a fused load and store instruction to load data stored in a memory external to the processing system and store the data into the shared memory, the fused load and store instruction configurable between bypassing (1) processor registers associated with the thread executing a fused load and store instruction or (2) the processor registers and the cache memory, wherein the instruction directs the system to: retrieve the data from the external memory; and store the retrieved data into the shared memory without first storing the retrieved data in the processor registers or without first storing the retrieved data in the processor registers and the cache memory.

2. The method of claim 1, wherein a plurality of threads execute the fused load and store instructions to load data stored in the external memory into the shared memory, and the instructions direct the system to determine shared memory destination addresses for data requested by the instructions, determine a pattern between fill sectors of the retrieved data and the shared memory destination addresses, and store the fill sectors of the retrieved data to shared memory sectors identified by the determined pattern.

3. The method of claim 1, wherein the shared memory includes a plurality of sectors arranged in banks, and data stored into at least one of the sectors is applied an intra-sector swizzle.

4. The method of claim 1, wherein the fused load and store instruction is a single instruction identifying a destination shared memory address and a source global address.

5. The method of claim 1, wherein when (1) the fused load and store instruction is configured to bypass the processor registers and the cache memory and (2) the data requested by the instruction is stored in the cache memory, the fused load and store instruction directs the system to: store the requested data in the cache memory into the shared memory and invalidate tag in the cache memory for the requested data.

6. The method of claim 1, wherein the fused load and store instruction is dynamically configurable during run-time.

7. A processing system comprising: a multithreaded processor configured to concurrently execute a plurality of threads; a plurality of data registers, each of the data registers assigned to executing threads; and on-chip memory including a cache memory and a shared memory, the cache memory configured to allow the plurality of executing threads to access tagged data stored in the cache memory, and the shared memory configured to allow the plurality of executing threads to access untagged data stored in the shared memory, wherein the system is configured to, in response to at least one of the threads executing an instruction to load data stored in a memory external to the processing system and selectively store the data into the shared memory without first storing the data in the plurality of data registers or without first storing the data in the plurality of data registers and the cache memory, retrieve the data from the external memory and selectively store the retrieved data in the shared memory (1) without first storing the retrieved data in the plurality of data registers or (2) without first storing the retrieved data in the plurality of data registers and the cache memory.

8. The processing system of claim 7, wherein the retrieved data is stored in the shared memory (1) without first storing the retrieved data in the plurality of data registers or (2) without first storing the retrieved data in the plurality of data registers and the cache memory, based on a dynamic selection during run-time.

9. The processing system of claim 7, wherein the cache memory and the shared memory are a unified physical random access memory.

10. The processing system of claim 9, wherein the unified physical random access memory includes a register file including the plurality of data registers dynamically assignable to the executing threads.

11. The processing system of claim 7, wherein the processing system includes hardware circuitry configured to: receive a plurality of instructions to load data stored in the external memory and store the data into the shared memory, and determine, based on the plurality of instructions, a shared memory sector fill pattern for storing the retrieved data in the shared memory.

12. The processing system of claim 7, wherein the shared memory includes a plurality of sectors arranged in banks, and the retrieved data is stored into at least one of the sectors with an intra-sector swizzle.

13. The processing system of claim 7, wherein the shared memory includes a plurality of sectors arranged in banks, and portions of the retrieved data are stored into the sectors in different banks.

14. The processing system of claim 7, wherein the instruction loads data from one sector of the external memory and stores the loaded data into parts of plural sectors in the shared memory.

15. The processing system of claim 7, wherein the shared memory is a software managed cache.

16. The processing system of claim 7, wherein the retrieved data is stored in one or more cache lines of the cache memory and the system further comprises an interconnect circuit configured to transfer data stored in the one or more cache lines into sectors of the shared memory without first storing the retrieved data in the plurality of data registers.

17. A method performed by a processing system including: a multithreaded processor configured to concurrently execute a plurality of threads; a plurality of registers assigned to the executing threads; hardware managed cache memory configured to allow the plurality of executing threads to access tagged data stored in the cache memory; and software managed shared memory configured to allow the plurality of executing threads to access untagged data stored in the shared memory, the method comprising: concurrently executing a plurality of threads, at least one of the threads executing an instruction to load data stored in a memory external to the processing system and store the data into the shared memory, the instruction configurable to (1) store the data into the shared memory without first storing the data into registers assigned to the executing threads, or (2) store the data into the shared memory without first storing the data into the registers and the cache memory; in response to the instruction configured to store the data into the shared memory without first storing the data into the registers, retrieving the data from the external memory and storing the retrieved data into the shared memory without first storing the retrieved data in the registers; and in response to the instruction configured to store the data into the shared memory without first storing the data into the registers and the cache memory, retrieving the data from the external memory and storing the retrieved data in the shared memory without first storing the retrieved data in the registers and the shared memory.

18. The method of claim 17, further comprising two or more of the executing threads executing instructions to load data stored in the shared memory into registers assigned to the two or more executing threads and perform a computation using the data stored in the registers.

19. The method of claim 18, wher
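The two bypass options in claim 1, together with the cache-hit behavior in claim 5, can be sketched as a small state model. Everything here is hypothetical and for illustration only: the `Bypass` enum, the `fused_load_store` function, and the use of Python dicts for memories are not from the patent. Two behaviors are assumptions of this sketch rather than claim language: that option (1) fills the cache on a miss, and that a hit under option (1) leaves the cached copy in place.

```python
from enum import Enum


class Bypass(Enum):
    REGISTERS = 1             # option (1): data may still pass through the cache
    REGISTERS_AND_CACHE = 2   # option (2): the cache is skipped as well


def fused_load_store(global_mem, cache, shared_mem, src, dst, mode):
    """Toy model: copy global_mem[src] to shared_mem[dst] with no register staging.

    `cache` is a dict keyed by global address, standing in for tagged cache lines.
    """
    if src in cache:
        shared_mem[dst] = cache[src]            # hit: serve the data from the cache
        if mode is Bypass.REGISTERS_AND_CACHE:
            del cache[src]                      # claim 5: invalidate the tag
    elif mode is Bypass.REGISTERS:
        cache[src] = global_mem[src]            # assumed: miss fills the cache
        shared_mem[dst] = cache[src]
    else:
        shared_mem[dst] = global_mem[src]       # miss: straight to shared memory


global_mem = {0x10: "a", 0x20: "b", 0x30: "c"}
cache = {0x30: "c"}                             # address 0x30 is already cached
shared_mem = {}

fused_load_store(global_mem, cache, shared_mem, 0x10, 0, Bypass.REGISTERS)
fused_load_store(global_mem, cache, shared_mem, 0x20, 1, Bypass.REGISTERS_AND_CACHE)
fused_load_store(global_mem, cache, shared_mem, 0x30, 2, Bypass.REGISTERS_AND_CACHE)

print(shared_mem)     # {0: 'a', 1: 'b', 2: 'c'}
print(sorted(cache))  # [16]
```

Passing `mode` as an argument on each call mirrors the run-time configurability recited in claims 6 and 8; after the three calls only address 0x10 remains cached, because the second call skipped the cache and the third invalidated its hit.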

Assignees

Nvidia Corp

Inventors

Classifications

  • G06F9/3004 · Primary

    to perform operations on memory · CPC title

  • Dependency mechanisms, e.g. register scoreboarding · CPC title

  • Thread control instructions · CPC title

  • Event management; Broadcasting; Multicasting; Notifications · CPC title

  • Coherency control relating to peripheral accessing, e.g. from DMA or I/O device · CPC title

Frequently asked questions

Answers are generated from the same data shown on this page.

Who is the assignee on this patent?
Nvidia Corp
What technology area does this patent fall under?
Primary CPC classification G06F9/3004. Mapped technology areas include Physics.
When was this patent published?
Publication date Aug 3, 2021 (kind code B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 12 related publications on this page (citations in our corpus or others sharing the same primary CPC).