Application programming interface to wait on matrix multiply-accumulate

US12204897B2 · US · B2

Patent metadata
Field               Value
Publication number  US-12204897-B2
Application number  US-202218072081-A
Country             US
Kind code           B2
Filing date         Nov 30, 2022
Priority date       Nov 21, 2022
Publication date    Jan 21, 2025
Grant date          Jan 21, 2025

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

Apparatuses, systems, and techniques to perform computational operations in response to one or more compute uniform device architecture (CUDA) programs. In at least one embodiment, one or more computational operations are to cause one or more other computational operations to wait until a portion of matrix multiply-accumulate (MMA) operations have been performed.

First claim

Opening claim text (preview).

What is claimed is:

1. A processor comprising: one or more circuits to perform an instruction to cause one or more instructions to wait to perform one or more matrix multiply-accumulate (MMA) operations in parallel based, at least in part, on a number of the MMA operations waiting to be performed.

2. The processor of claim 1, wherein the instruction is to cause one or more threads comprising the one or more instructions to wait until one or more one or more groups of the one or more MMA operations have been performed.

3. The processor of claim 1, wherein the instruction is to cause one or more threads comprising the one or more instructions to perform one or more other instructions and, in response to the instruction, wait until one or more one or more groups of the one or more MMA operations have been performed.

4. The processor of claim 1, wherein the instruction is a wait instruction and one or more portions of the one or more MMA operations to be performed in parallel are one or more groups of asynchronous MMA operations to be performed.

5. The processor of claim 1, wherein one or more portions of the one or more MMA operations to be performed in parallel are to be indicated, at least in part, as a parameter to the instruction.

6. The processor of claim 1, wherein the one or more MMA operations have been performed if one or more results of said one or more MMA operations is stored in memory.

7. The processor of claim 1, wherein a constant integer data value is to be indicated to the instruction to determine one or more portions of the one or more MMA operations to be performed in parallel.

8. The processor of claim 1, wherein the processor is a graphics processing unit (GPU).

9. A system comprising: one or more processors to perform an instruction to cause one or more instructions to wait to perform one or more matrix multiply-accumulate (MMA) operations in parallel based, at least in part, on a number of the MMA operations waiting to be performed.

10. The system of claim 9, wherein the instruction is to cause one or more threads comprising the one or more instructions to wait until a threshold quantity of groupings of the MMA operations are pending.

11. The system of claim 9, wherein the instruction is to cause one or more threads comprising the one or more instructions to wait until one or more one or more groups of the one or more MMA operations have been performed.

12. The system of claim 9, wherein one or more portions of the one or more MMA operations to be performed in parallel are to be indicated, at least in part, as a parameter to the instruction.

13. The system of claim 9, wherein the instruction is to cause one or more threads comprising the one or more instructions to perform one or more other instructions and, in response to the instruction, wait until one or more one or more groups of the one or more MMA operations have been performed.

14. The system of claim 9, wherein the one or more processors are graphics processing units (GPUs).

15. A method comprising: performing an instruction to cause one or more instructions to wait to perform one or more matrix multiply-accumulate (MMA) operations in parallel based, at least in part, on a number of the MMA operations waiting to be performed.

16. The method of claim 15, further comprising causing, in response to the instruction, one or more threads comprising the one or more instructions to wait until a threshold quantity of groupings of the one or more MMA operations are pending.

17. The method of claim 15, further comprising causing, in response to the instruction, one or more threads comprising the one or more instructions to wait until a threshold quantity of groupings of the one or more MMA operations have been performed.

18. The method of claim 15, further comprising causing, in response to the instruction, one or more threads comprising the one or more instructions to perform one or more other instructions and, in response to the instruction, to wait until one or more portions of the one or more MMA operations to be performed in parallel have been performed.

19. The method of claim 15, wherein the one or more MMA operations are to be asynchronously performed by one or more accelerators of one or more graphics processing units (GPUs).

20. The method of claim 15, further comprising receiving, as a parameter to the instruction, a threshold value usable to identify one or more portions of the one or more MMA operations to be performed in parallel.
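The waiting behavior described in the claims can be pictured with a small software model. The sketch below is illustrative only and is not taken from the patent: the class `MmaQueue` and its methods `issue`, `commit_group`, and `wait_group` are hypothetical names, and the reading that the wait returns once at most a threshold number of committed groups remain pending is an assumption based on the claims that mention a threshold quantity of groupings and an integer parameter. Real hardware would run the groups asynchronously; here "waiting" is modeled by performing the oldest groups in order and storing their results in a dictionary that stands in for memory.

```python
from collections import deque


def mma(a, b, c):
    """Return a @ b + c for small list-of-lists matrices."""
    rows, inner, cols = len(a), len(b), len(b[0])
    return [
        [c[i][j] + sum(a[i][k] * b[k][j] for k in range(inner)) for j in range(cols)]
        for i in range(rows)
    ]


class MmaQueue:
    """Toy model of an asynchronous MMA unit (hypothetical, not a real API)."""

    def __init__(self):
        self.open_group = []    # operations issued but not yet committed as a group
        self.pending = deque()  # committed groups still in flight, oldest first
        self.memory = {}        # results appear here once a group has been performed

    def issue(self, name, a, b, c):
        """Queue one MMA operation in the currently open group."""
        self.open_group.append((name, a, b, c))

    def commit_group(self):
        """Close the open group so that later waits can count it."""
        self.pending.append(self.open_group)
        self.open_group = []

    def wait_group(self, max_pending):
        """Return once at most `max_pending` committed groups remain pending.

        The model performs the oldest groups itself instead of blocking a thread.
        Returns how many groups were performed during this wait.
        """
        completed = 0
        while len(self.pending) > max_pending:
            for name, a, b, c in self.pending.popleft():
                self.memory[name] = mma(a, b, c)
            completed += 1
        return completed


# Usage: commit three single-operation groups, then wait with a threshold of 1.
q = MmaQueue()
identity = [[1, 0], [0, 1]]
zeros = [[0, 0], [0, 0]]
for step in range(3):
    q.issue(f"tile{step}", [[step, 1], [2, 3]], identity, zeros)
    q.commit_group()

drained = q.wait_group(1)  # the two oldest groups finish; the newest stays in flight
```

With a threshold of 1, the caller can safely read the results of the two oldest groups while the newest group is still outstanding, which is the kind of overlap between issuing work and consuming results that a partial wait permits. A threshold of 0 would wait for every committed group.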

Assignees

Nvidia Corp

Inventors

Classifications

  • G06F17/16 (Primary)

    Matrix or vector computation {, e.g. matrix-matrix or matrix-vector multiplication, matrix factorization (matrix transposition G06F7/78)} · CPC title

  • Maintaining memory consistency · CPC title

  • Synchronisation or serialisation instructions · CPC title

  • Thread control instructions · CPC title

  • G06F9/3001 (Primary)

    Arithmetic instructions · CPC title

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US12204897B2 cover?
Apparatuses, systems, and techniques to perform computational operations in response to one or more compute uniform device architecture (CUDA) programs. In at least one embodiment, one or more computational operations are to cause one or more other computational operations to wait until a portion of matrix multiply-accumulate (MMA) operations have been performed.
Who is the assignee on this patent?
Nvidia Corp
What technology area does this patent fall under?
Primary CPC classification G06F17/16. Mapped technology areas include Physics.
When was this patent published?
Publication date Jan 21, 2025 (kind code B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 12 related publications on this page (citations in our corpus or others sharing the same primary CPC).