Performing load and permute with a single instruction in a system on a chip

US12118353B2 · US · B2

Patent metadata
Publication number: US-12118353-B2
Application number: US-202117391491-A
Country: US
Kind code: B2
Filing date: Aug 2, 2021
Priority date: Aug 2, 2021
Publication date: Oct 15, 2024
Grant date: Oct 15, 2024

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

In various examples, a VPU and associated components may be optimized to improve VPU performance and throughput. For example, the VPU may include a min/max collector, automatic store predication functionality, a SIMD data path organization that allows for inter-lane sharing, a transposed load/store with stride parameter functionality, a load with permute and zero insertion functionality, hardware, logic, and memory layout functionality to allow for two point and two by two point lookups, and per memory bank load caching capabilities. In addition, decoupled accelerators may be used to offload VPU processing tasks to increase throughput and performance, and a hardware sequencer may be included in a DMA system to reduce programming complexity of the VPU and the DMA system. The DMA and VPU may execute a VPU configuration mode that allows the VPU and DMA to operate without a processing controller for performing dynamic region based data movement operations.
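
To make the "load with permute and zero insertion" item concrete, the following is a minimal behavioral sketch in Python, not the patented hardware. It assumes a hypothetical pattern encoding in which each destination lane names the source offset it reads, with `None` meaning "insert a zero"; the function names and the flat-list memory are likewise illustrative assumptions. It compares a two-step load followed by a separate permute against a fused operation that produces the same lanes in one step.

```python
from typing import List, Optional, Sequence

# Hypothetical encoding (an assumption, not taken from the patent):
# pattern[lane] is the source offset that lane reads, or None for a zero.
Pattern = Sequence[Optional[int]]


def load_then_permute(memory: Sequence[int], base: int, pattern: Pattern) -> List[int]:
    """Two-step baseline: a contiguous vector load, then a separate permute."""
    width = len(pattern)
    loaded = list(memory[base:base + width])            # step 1: plain load
    return [0 if src is None else loaded[src]           # step 2: permute, zero-fill
            for src in pattern]


def load_with_permute(memory: Sequence[int], base: int, pattern: Pattern) -> List[int]:
    """Fused model: each lane picks its source element (or zero) during the load."""
    return [0 if src is None else memory[base + src] for src in pattern]


memory = list(range(10, 26))                  # toy local memory, values 10..25
pattern = [0, 1, 2, None, 0, 1, 2, None]      # repeat three values, pad with zeros
fused = load_with_permute(memory, 3, pattern)
print(fused)                                  # [13, 14, 15, 0, 13, 14, 15, 0]
assert fused == load_then_permute(memory, 3, pattern)
```

The repeated offsets in `pattern` show one source value landing in more than one lane, and the `None` entries show zero insertion. The fused version never materializes the intermediate contiguous vector.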

First claim

Opening claim text (preview).

What is claimed is:

1. A processor comprising: a first set of multiplexers to select, based at least on a permute pattern transmitted with a memory address to local memory using a load instruction and an indication of a starting lane that corresponds to a starting address for a non-aligned access to the local memory, a routing of memory lanes that correspond to one or more memory address locations of the local memory to processing lanes of one or more single instruction multiple data (SIMD) units; a second set of multiplexers to, in response to the load instruction, load, using the routing, one or more values retrieved using the non-aligned access from inputs corresponding to the memory lanes to the processing lanes, wherein at least one multiplexer of the second set of multiplexers is operable to selectively load the one or more values to registers of each of one or more processing lanes of the processing lanes; and one or more processing elements to perform one or more operations within one or more of the processing lanes using the one or more values from the registers and at least one instruction.

2. The processor of claim 1, further comprising a third set of multiplexers to replace at least one value of the one or more values with one or more zero values based at least on the permute pattern indicating the at least one value corresponds to one or more unused entries.

3. The processor of claim 1, wherein at least two multiplexers of the second set of multiplexers are coupled to a same memory address location of the local memory.

4. The processor of claim 1, wherein the local memory comprises a plurality of memory banks to form a vector processing width of the one or more SIMD units.

5. The processor of claim 1, wherein the permute pattern includes, within the load instruction, a list of lane indices having positions in the list that correspond to respective source locations in the local memory, and values of the lane indices include lane identifiers of the processing lanes to receive data from the respective source locations.

6. The processor of claim 1, wherein the first set of multiplexers includes a first set of inputs that correspond to the permute pattern and a second set of inputs that correspond to the starting lane.

7. The processor of claim 1, wherein the routing is selected using routing logic that shifts and wraps around a mapping between the memory lanes and the processing lanes in a direction based at least on the starting lane.

8. The processor of claim 1, wherein the processor is comprised in at least one of: a control system for an autonomous or semi-autonomous machine; a perception system for an autonomous or semi-autonomous machine; a system for performing simulation operations; a system for performing deep learning operations; a system on chip (SoC); a system including a programmable vision accelerator (PVA); a system including a vision processing unit; a system implemented using an edge device; a system implemented using a robot; a system incorporating one or more virtual machines (VMs); a system implemented at least partially in a data center; or a system implemented at least partially using cloud computing resources.

9. A system comprising: a processor comprising: a first set of multiplexers to select, based at least on a permute pattern transmitted with a memory address to local memory using a load instruction and an indication of a starting lane that corresponds to a starting address for a non-aligned access to the local memory, a routing of memory lanes that correspond to one or more memory address locations in the local memory to processing lanes of one or more single instruction multiple data (SIMD) units; a second set of multiplexers to, in response to the load instruction, load, using the routing, one or more values retrieved using the non-aligned access from inputs corresponding to the memory lanes to the processing lanes, wherein at least one multiplexer of the second set of multiplexers is operable to selectively load the one or more values to registers of each of one or more processing lanes of the processing lanes; and one or more processing elements to perform one or more operations within the processing lanes using the one or more values from the registers and at least one instruction.

10. The system of claim 9, wherein the processor further comprises a third set of multiplexers to replace at least one value of the one or more values with one or more padded values based at least on the permute pattern, the one or more padded values configured to delineate a gap between chunks of data in a data structure.

11. The system of claim 9, wherein at least two multiplexers of the second set of multiplexers are coupled to a same memory address location of the local memory.

12. The system of claim 9, wherein the one or more values are loaded into one or more vector registers of a plurality of SIMD units, and the at least one instruction is executed using the one or more vector registers.

13. The system of claim 9, wherein the permute pattern includes a repeating pattern, and a same value from a same memory address location is included in two or more of the processing lanes based at least on the repeating pattern.

14. The system of claim 9, wherein the permute pattern is generated dynamically based at least on an output of one or more algorithms.

15. The system of claim 9, wherein the second set of multiplexers are included in a crossbar switch.

16. The system of claim 9, wherein the system is comprised in at least one of: a control system for an autonomous or semi-autonomous machine; a perception system for an autonomous or semi-autonomous machine; a system for performing simulation operations; a system for performing deep learning operations; a system on chip (SoC); a system including a programmable vision accelerator (PVA); a system including a vision processing unit; a system implemented using an edge device; a system implemented using a robot; a system incorporating one or more virtual machines (VMs); a system implemented at least partially in a data center; or a system implemented at least partially using cloud computing resources.

17. A method comprising: selecting, using a first set of multiplexers and based at least on a permute pattern transmitted with a memory address to local memory using a load instruction and an indication of a starting lane that corresponds to a starting address for a non-aligned access to the local memory, a routing of memory lanes that correspond to one or more memory address locations in the local memory to processing lanes of one or more single instruction multiple data (SIMD) units; in response to the load instruction loading, using a second set of multiplexers and the routing, one or more values retrieved using the non-aligned access from inputs corresponding to the memory lanes to the processing lanes, wherein at least one multiplexer of the second set of multiplexers is operable to selectively load the one or more values to registers of each of one or more processing lanes of the processing lanes; and performing, using one or more processing elements of a processor, one or more operations within one or more of the processing lanes using the one or more values from the registers and at least one instruction.

18. The method of claim 17, further comprising replacing, using a third set of multiplexers, one or more of the one or more values with a padded value based at least on the one or more of the one or more values corresponding to a ne
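
The routing described in claims 1, 2, 5, and 7 can be approximated in software. The sketch below is one possible reading, not the claimed circuit. It assumes a banked memory where address `a` lives in bank `a % n` at row `a // n`, a starting lane equal to `start % n`, and a pattern in the style of claim 5: the position in the list is the source element of the window, and the value is the processing lane that receives it (`None` marks a source that is dropped). All function names are hypothetical, and treating unnamed processing lanes as zero is an assumption about how "unused entries" behave.

```python
from typing import List, Optional, Sequence


def bank_read(banks: Sequence[Sequence[int]], start: int) -> List[int]:
    """One element per memory lane for the non-aligned window [start, start + n)."""
    n = len(banks)
    out = []
    for b in range(n):
        addr = start + ((b - start) % n)   # the window address that lives in bank b
        out.append(banks[b][addr // n])
    return out


def select_routing(pattern: Sequence[Optional[int]], start_lane: int,
                   n: int) -> List[Optional[int]]:
    """Model of the first mux set: for each processing lane, its memory lane."""
    routing: List[Optional[int]] = [None] * n
    for src_pos, dest_lane in enumerate(pattern):
        if dest_lane is not None:
            # Shift by the starting lane and wrap around the bank count.
            routing[dest_lane] = (start_lane + src_pos) % n
    return routing


def load_permute(banks: Sequence[Sequence[int]], start: int,
                 pattern: Sequence[Optional[int]]) -> List[int]:
    """Model of the second and third mux sets: apply routing, zero unused lanes."""
    n = len(banks)
    mem_lanes = bank_read(banks, start)
    routing = select_routing(pattern, start % n, n)
    return [0 if m is None else mem_lanes[m] for m in routing]


# Four banks holding addresses 0..11, where the value at address a is 100 + a.
banks = [[100 + row * 4 + b for row in range(3)] for b in range(4)]

print(load_permute(banks, 6, [0, 1, 2, 3]))        # [106, 107, 108, 109]
print(load_permute(banks, 6, [3, None, 0, None]))  # [108, 0, 0, 106]
```

With `start = 6` the starting lane is 2, so the identity pattern rotates the memory lanes by two positions with wrap-around, which is how an unaligned window ends up in lane order. The second call sends source 0 to lane 3 and source 2 to lane 0, leaving the other lanes zero.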

Assignees

Nvidia Corp

Inventors

Classifications

  • using a mask

  • Iterative single instructions for multiple data lanes [SIMD]

  • controlled by a single instruction for multiple threads [SIMT] in parallel

  • controlled by a single instruction for multiple data lanes [SIMD]

  • Special purpose registers

Frequently asked questions

Answers are generated from the same data shown on this page.

Who is the assignee on this patent?
Nvidia Corp
What technology area does this patent fall under?
Primary CPC classification G06F9/30036. Mapped technology areas include Physics.
When was this patent published?
Publication date: Oct 15, 2024 (kind code B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 5 related publications on this page (citations in our corpus or others sharing the same primary CPC).