Direct memory access architecture with multi-level multi-striding

US11314674B2 · US · B2

Patent metadata
Publication number: US-11314674-B2
Application number: US-202016838796-A
Country: US
Kind code: B2
Filing date: Apr 2, 2020
Priority date: Feb 14, 2020
Publication date: Apr 26, 2022
Grant date: Apr 26, 2022

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

DMA architectures capable of performing multi-level multi-striding and determining multiple memory addresses in parallel are described. In one aspect, a DMA system includes one or more hardware DMA threads. Each DMA thread includes a request generator configured to generate, during each parallel memory address computation cycle, m memory addresses for a multi-dimensional tensor in parallel and, for each memory address, a respective request for a memory system to perform a memory operation. The request generator includes m memory address units that each include a step tracker configured to generate, for each dimension of the tensor, a respective step index value for the dimension and, based on the respective step index value, a respective stride offset value for the dimension. Each memory address unit includes a memory address computation element configured to generate a memory address for a tensor element and transmit the request to perform the memory operation.
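The address generation described in the abstract can be illustrated in software. The following is a minimal Python sketch, not the patented hardware: it simulates m lanes, each holding per-dimension step indices, converting them to stride offsets, and summing them with a base address. The names `Dim`, `steps`, `stride`, `advance`, `generate_addresses`, and `base` are hypothetical, and the carry-chain update is only one assumed way to keep the step indices current.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Dim:
    steps: int   # loop bound for this dimension (hypothetical field name)
    stride: int  # address distance between consecutive steps (hypothetical)


def advance(indices: List[int], dims: List[Dim], amount: int) -> None:
    """Advance step indices in place; dims[0] is the innermost loop.

    The innermost index absorbs the advance amount, and each overflow is
    passed outward as a wrap amount, like a mixed-radix carry chain.
    """
    carry = amount
    for d, dim in enumerate(dims):
        total = indices[d] + carry
        indices[d] = total % dim.steps
        carry = total // dim.steps


def generate_addresses(base: int, dims: List[Dim], m: int) -> List[List[int]]:
    """Return one list of up to m addresses per simulated computation cycle."""
    total = 1
    for dim in dims:
        total *= dim.steps

    # Lane i starts at element i, so every lane tracks a different element.
    lanes = []
    for lane in range(m):
        idx = [0] * len(dims)
        advance(idx, dims, lane)
        lanes.append(idx)

    cycles = []
    issued = 0
    while issued < total:
        batch = []
        for idx in lanes:
            if issued < total:
                # Stride offset per dimension = step index * stride.
                offsets = [i * dim.stride for i, dim in zip(idx, dims)]
                batch.append(base + sum(offsets))
                issued += 1
            advance(idx, dims, m)  # every lane steps forward by m elements
        cycles.append(batch)
    return cycles
```

For example, `generate_addresses(0x1000, [Dim(4, 1), Dim(3, 16)], 2)` walks a 4-by-3 tensor two elements per cycle. In real hardware the lanes would compute concurrently; the inner `for` loop here only models that sequentially.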

First claim

Opening claim text (preview).

What is claimed is:

1. A direct memory access (DMA) system, comprising: one or more hardware DMA threads, wherein each DMA thread comprises: a request generator configured to generate, during each parallel memory address computation cycle, (i) m memory addresses for a multi-dimensional tensor in parallel and, for each memory address, (ii) a respective request for a memory system to perform a memory operation for the multi-dimensional tensor, wherein the request generator comprises m memory address units, and wherein each memory address unit comprises: a step tracker configured to generate, during each parallel memory address computation cycle, memory address offset values for a respective tensor element of the multi-dimensional tensor, the generating comprising determining, for each dimension of the multi-dimensional tensor, (i) a respective step index value for the dimension and, based on the respective step index value, (ii) a respective memory address offset value for the dimension, wherein the respective step index values for the dimensions of the multi-dimensional tensor correspond to a location of the respective tensor element within the multi-dimensional tensor; and a memory address computation element configured to: generate, during each parallel memory address computation cycle and based on the respective memory address offset value for each dimension, a memory address for the respective tensor element of the multi-dimensional tensor; and transmit, to the memory system, the request to perform the memory operation using the memory address; wherein each step tracker generates step index values and memory address offset values for a different tensor element than each other step tracker during each parallel memory address computation cycle; wherein m is greater than or equal to two.

2. The DMA system of claim 1, wherein the request generator is configured to generate the memory addresses in parallel during a single clock cycle and each parallel memory computation is performed during a single clock cycle.

3. The DMA system of claim 1, wherein the request generator is configured to receive, for the multi-dimensional tensor, a descriptor that defines, for each dimension, a respective steps for stride value for the dimension.

4. The DMA system of claim 1, wherein the request generator includes m lanes that each include a respective step tracker and a respective memory address computation element, wherein the respective step tracker and respective memory address computation element of each lane computes a corresponding memory address in parallel with each other lane.

5. The DMA system of claim 4, wherein: the step trackers are configured to generate the memory addresses for the multi-dimensional tensor based on a loop nest that includes, for each dimension of the multi-dimensional tensor, a respective loop for traversing the dimension of the multi-dimensional tensor; and the steps per stride value for each dimension represents a loop bound for the respective loop for the dimension and the step index value for each dimension represents a loop index for the respective loop for the dimension.

6. The DMA system of claim 5, wherein each step tracker is configured to update the step index value for each of the dimensions during each clock cycle.

7. The DMA system of claim 5, wherein a combination of the step index values for each step tracker is different from a combination of the step index values for each other step tracker.

8. The DMA system of claim 7, wherein: each step tracker comprises a step incrementer chain comprising plurality of step incrementers each configured to determine a dimension memory address offset value for a respective dimension: a first step incrementer of the step incrementer chain corresponding to an innermost loop of the loop nest is configured to receive an advance amount; and updating the step index value for one or more of the dimensions during each clock cycle comprises updating, by the first step incrementer, the step index value for the one or more dimensions based on the advance amount.

9. The DMA system of claim 8, wherein: each of one or more second step incrementers of the step incrementer chain corresponding to a loop in which the innermost loop is nested is configured to receive, from a previous step tracker in the step incrementer chain, a wrap amount; and updating the step index value for one or more of the dimensions during each clock cycle comprises updating, by the second step incrementer, the step index value for the one or more dimensions based on the wrap amount.

10. The DMA system of claim 1, further comprising a progress tracker comprising a response reorder unit and a synchronization unit.

11. The DMA system of claim 10, wherein the response reorder unit is configured to maintain, for each tensor, a status of whether a memory operation for the tensor element has been performed.

12. The DMA system of claim 10, wherein the synchronization unit is configured to provide, to a processor core, multiple partial updates that each specify an overall status of memory operations performed on the tensor elements of the multi-dimensional tensor.

13. The DMA system of claim 10, wherein: each request comprises a unique identifier; the response reorder unit is configured to: receive responses from the memory system in any order, each response comprising the unique identifier of the request for which the response is provided; and release a set of unique identifiers for re-use by the request generator when at least a threshold number of consecutive unique identifiers are received in the responses.

14. A system, comprising: one or more processor cores; a memory system; and a DMA engine comprising one or more DMA threads, wherein each DMA thread comprises: a request generator configured to generate, during each parallel memory address computation cycle, (i) m memory addresses for a multi-dimensional tensor in parallel and, for each memory address, (ii) a respective request for a memory system to perform a memory operation for the multi-dimensional tensor, wherein the request generator comprises m memory address units, wherein m is greater than or equal to two, and wherein each memory address unit comprises: a step tracker configured to generate, during each parallel memory address computation cycle, memory address offset values for a respective tensor element of the multi-dimensional tensor, the generating comprising determining, for each dimension of the multi-dimensional tensor, (i) a respective step index value for the dimension and, based on the respective step index value, (ii) a respective memory address offset value for the dimension, wherein the respective step index values for the dimensions of the multi-dimensional tensor correspond to a location of the respective tensor element within the multi-dimensional tensor; and a memory address computation element configured to: generate, during each parallel memory address computation cycle and based on the respective memory address offset value for each dimension, a memory address for the respective tensor element of the multi-dimensional tensor; and transmit, to the memory system, the request to perform the memory operation using the memory address, wherein each step tracker generates step index values and memory address offset values for a different tensor element than each other step tracker during each parallel memory address computation cycle; and a progress tracker comprising a response reorder unit and a synchronization update unit configured to provide, to the one or more processor core, partial synchroni
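The response reorder unit recited in the claims (responses arrive in any order, and identifiers are released for reuse once a threshold number of consecutive identifiers have been received) can be sketched in software. The following Python is one possible reading under stated assumptions, not the patent's implementation: identifiers are the integers 0 to `num_ids - 1`, they are released in fixed aligned blocks of `threshold` identifiers, and the class and method names are hypothetical.

```python
class ResponseReorderUnit:
    """Tracks out-of-order responses and frees request IDs in consecutive blocks."""

    def __init__(self, num_ids: int, threshold: int):
        # Assumption: the ID space divides evenly into blocks of `threshold`.
        if num_ids % threshold != 0:
            raise ValueError("num_ids must be a multiple of threshold")
        self.num_ids = num_ids
        self.threshold = threshold
        self.received = [False] * num_ids
        self.head = 0  # first ID of the oldest block not yet released

    def on_response(self, request_id: int) -> list:
        """Record one response; return the IDs released for reuse (may be empty)."""
        self.received[request_id] = True
        released = []
        # Release every fully received block, starting from the oldest one.
        while all(self.received[self.head + k] for k in range(self.threshold)):
            block = list(range(self.head, self.head + self.threshold))
            for i in block:
                self.received[i] = False  # the ID may now be issued again
            released.extend(block)
            self.head = (self.head + self.threshold) % self.num_ids
        return released
```

With `ResponseReorderUnit(8, 4)`, responses for IDs 2, 0, and 3 release nothing; the response for ID 1 then completes the first block and releases IDs 0 through 3 together. Releasing in blocks rather than one ID at a time keeps the bookkeeping to a single head pointer, which is the design choice this sketch assumes.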

Assignees

Google LLC

Inventors

Classifications

  • for peripheral access to main memory, e.g. direct memory access [DMA]

  • G06F12/08 (Primary): in hierarchically structured memory systems, e.g. virtual memory systems

  • Addressing or accessing the instruction operand or the result {; Formation of operand address; Addressing modes (address translation G06F12/00)}

  • G06F13/28 (Primary): using burst mode transfer, e.g. direct memory access {DMA}, cycle steal (G06F13/32 takes precedence)

  • using a plurality of independent parallel functional units

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US11314674B2 cover?
DMA architectures capable of performing multi-level multi-striding and determining multiple memory addresses in parallel are described. In one aspect, a DMA system includes one or more hardware DMA threads. Each DMA thread includes a request generator configured to generate, during each parallel memory address computation cycle, m memory addresses for a multi-dimensional tensor in parallel and,…
Who is the assignee on this patent?
Google LLC
What technology area does this patent fall under?
Primary CPC classification G06F12/08. Mapped technology areas include Physics.
When was this patent published?
Publication date: Apr 26, 2022 (kind code B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 10 related publications on this page (citations in our corpus or others sharing the same primary CPC).