Performing load and store operations of 2D arrays in a single cycle in a system on a chip

US12099439B2 · US · B2

Patent metadata
Field               Value
Publication number  US-12099439-B2
Application number  US-202117391468-A
Country             US
Kind code           B2
Filing date         Aug 2, 2021
Priority date       Aug 2, 2021
Publication date    Sep 24, 2024
Grant date          Sep 24, 2024

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

In various examples, a VPU and associated components may be optimized to improve VPU performance and throughput. For example, the VPU may include a min/max collector, automatic store predication functionality, a SIMD data path organization that allows for inter-lane sharing, a transposed load/store with stride parameter functionality, a load with permute and zero insertion functionality, hardware, logic, and memory layout functionality to allow for two point and two by two point lookups, and per memory bank load caching capabilities. In addition, decoupled accelerators may be used to offload VPU processing tasks to increase throughput and performance, and a hardware sequencer may be included in a DMA system to reduce programming complexity of the VPU and the DMA system. The DMA and VPU may execute a VPU configuration mode that allows the VPU and DMA to operate without a processing controller for performing dynamic region based data movement operations.

First claim

Opening claim text (preview).

What is claimed is:

1. A processor comprising processing circuitry to: receive data representative of a line pitch and a starting memory address in a first memory bank of a plurality of memory banks, the starting memory address corresponding to a first element of a plurality of elements, the plurality of memory banks including a width corresponding to a number of elements included in an individual memory bank of the plurality of memory banks, the line pitch being associated with the width; and read, based at least on the starting memory address, at least the first element of the plurality of elements from the first memory bank of the plurality of memory banks; determine, based at least on the line pitch, an offset between the first element and a second element of the plurality of elements that is in a second memory bank of the plurality of memory banks; and read, based at least on the offset and in a same read operation as the first element, the second element of the plurality of elements from the second memory bank of the plurality of memory banks.

2. The processor of claim 1, wherein the processor is a vector processing unit (VPU) and the plurality of memory banks are included in vector memory (VMEM).

3. The processor of claim 1, wherein the line pitch is computed using 16*K+1, K is a lane offset parameter value, and the data representative of the line pitch includes the lane offset parameter value.

4. The processor of claim 1, wherein the data is indicative of a load type, and the load type is a transposed load.

5. The processor of claim 1, wherein the first element is stored in line N of the first memory bank and the second element of the plurality of elements is stored in line N+1 of the second memory bank.

6. The processor of claim 1, wherein the processor is comprised in at least one of: a control system for an autonomous or semi-autonomous machine; a perception system for an autonomous or semi-autonomous machine; a system for performing simulation operations; a system for performing deep learning operations; a system on chip (SoC); a system including a programmable vision accelerator (PVA); a system including a vison processing unit; a system implemented using an edge device; a system implemented using a robot; a system incorporating one or more virtual machines (VMs); a system implemented at least partially in a data center; or a system implemented at least partially using cloud computing resources.

7. A system comprising: one or more processors to: receive data representative of a line pitch and a starting memory address in a first memory bank of a plurality of memory banks of a memory, the starting memory address corresponding to a first element of a plurality of elements and the line pitch being associated with a number of elements included in an individual memory bank of the plurality of memory banks; read, based at least on the starting memory address, the first element from the first memory bank; determine, based at least on the line pitch, an offset between the first element and a second element of the plurality of elements that is in a second memory bank of the plurality of memory banks; and read, based at least on the offset and in a same read operation as the first element, the second element of the plurality of elements from the second memory bank of the plurality of memory banks.

8. The system of claim 7, wherein the plurality of elements are arranged vertically in a logical view.

9. The system of claim 7, wherein the data is further representative of a stride parameter and the determination of the offset is further based at least on the stride parameter.

10. The system of claim 7, wherein the processor is a vector processing unit (VPU) and the memory includes a vector memory (VMEM).

11. The system of claim 7, wherein the line pitch is computed using 16*K+1, K is a lane offset parameter value, and the data representative of the line pitch includes the offset parameter value.

12. The system of claim 7, wherein the data is indicative of a load type, and the load type is a transposed load.

13. The system of claim 7, wherein the first element is stored in line N of the first memory bank and the second element is stored in line N+1 of the second memory bank.

14. The system of claim 7, wherein the system is comprised in at least one of: a control system for an autonomous or semi-autonomous machine; a perception system for an autonomous or semi-autonomous machine; a system for performing simulation operations; a system for performing deep learning operations; a system on chip (SoC); a system including a programmable vision accelerator (PVA); a system including a vison processing unit; a system implemented using an edge device; a system implemented using a robot; a system incorporating one or more virtual machines (VMs); a system implemented at least partially in a data center; or a system implemented at least partially using cloud computing resources.

15. A method comprising: storing a plurality of elements in a plurality of memory banks, the plurality of memory banks including a width that is associated with a first number of elements included in an individual memory bank of the plurality of memory banks; receiving data representative of a line pitch and a starting memory address in a first memory bank of a plurality of memory banks, the starting memory address corresponding to a first element of a plurality of elements and the line pitch indicating a second number of elements that is greater than the first number of elements; reading, based at least on the starting memory address, the first element from the first memory bank; determining, based at least on the line pitch, an offset between the first element and a second element from the plurality of elements that is in a second memory bank of the plurality of memory banks; and reading, based at least on the offset, the second element from the second memory bank of the plurality of memory banks.

16. The method of claim 15, wherein the data further represents a stride, the stride indicating: a first number of elements read from the first memory bank, the first number of elements including at least the first element; and a second number of elements read from the second memory bank, the second number of elements including at least the second element.

17. The processor of claim 1, wherein the data further indicates a stride, the stride indicating a number of elements associated with at least one of the first memory bank or the second memory bank.

18. The processor of claim 1, wherein: the data further indicates a stride, the stride indicating a number of consecutive elements to read from the individual memory bank of the plurality of memory banks; the reading the first element comprises reading, based at least on the starting memory address and from the first memory bank, a first number of elements that corresponds to the number of consecutive elements indicated by the stride, the first number of elements including the first element; and the reading the second element comprises reading, based at least on the offset and from the second memory bank, a second number of elements that corresponds to the number of consecutive elements indicated by the stride, the second number of elements including the second element.

19. The method of claim 15, further comprising: determining the width associated with the plurality of memory banks based at least on the line pitch; and determining a number of the plu
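
Illustrative sketch (not part of the patent text). The claims tie a line pitch of 16*K+1 to reading vertically adjacent elements from different memory banks in the same read operation. The Python below is a minimal model of why such a pitch could avoid bank conflicts. It assumes 16 banks, each one element wide, and a simple address-modulo-bank mapping; none of these details are stated in the claim text shown here, and all function names are hypothetical.

```python
NUM_BANKS = 16  # assumption: bank count inferred from the 16*K+1 formula


def line_pitch(k: int) -> int:
    """Line pitch in elements for lane offset parameter K (claims 3 and 11)."""
    return NUM_BANKS * k + 1


def locate(address: int) -> tuple[int, int]:
    """Map a flat element address to (bank, line); assumed modulo mapping."""
    return address % NUM_BANKS, address // NUM_BANKS


def column_locations(start: int, pitch: int, count: int = NUM_BANKS):
    """(bank, line) of `count` vertically adjacent elements, `pitch` apart."""
    return [locate(start + i * pitch) for i in range(count)]


def conflict_free(locations) -> bool:
    """True when every element sits in a different bank."""
    return len({bank for bank, _ in locations}) == len(locations)


padded = column_locations(start=0, pitch=line_pitch(1))  # pitch of 17 elements
unpadded = column_locations(start=0, pitch=NUM_BANKS)    # pitch of 16 elements

print(padded[:3])
print(conflict_free(padded), conflict_free(unpadded))
```

Under these assumptions, the padded pitch places the first element in line 0 of bank 0 and the second in line 1 of bank 1, which matches the "line N" and "line N+1" wording of claims 5 and 13. With a pitch of exactly 16, every element of the column falls into bank 0, so the column could not be fetched from distinct banks at once.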

Assignees

Nvidia Corp

Inventors

Classifications

CPC titles assigned to this publication.

  • from multiple instruction streams, e.g. multistreaming

  • using a mask

  • controlled by a single instruction for multiple threads [SIMT] in parallel

  • Instructions to perform operations on packed data, e.g. vector, tile or matrix operations

  • for peripheral access to main memory, e.g. direct memory access [DMA]

Patent family

Related publications grouped by family.

External sources

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US12099439B2 cover?
In various examples, a VPU and associated components may be optimized to improve VPU performance and throughput. For example, the VPU may include a min/max collector, automatic store predication functionality, a SIMD data path organization that allows for inter-lane sharing, a transposed load/store with stride parameter functionality, a load with permute and zero insertion functionality, hardwa…
Who is the assignee on this patent?
Nvidia Corp
What technology area does this patent fall under?
Primary CPC classification G06F13/28. Mapped technology areas include Physics.
When was this patent published?
Publication date Sep 24, 2024 (B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 12 related publications on this page (citations in our corpus or others sharing the same primary CPC).