Efficient execution of atomic instructions for single instruction, multiple thread (SIMT) architectures

US12547413B2 · US · B2

Patent metadata
Publication number: US-12547413-B2
Application number: US-202418604201-A
Country: US
Kind code: B2
Filing date: Mar 13, 2024
Priority date: Mar 13, 2024
Publication date: Feb 10, 2026
Grant date: Feb 10, 2026

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

A first set of threads having a same address corresponding to the shared memory is identified from a group of active threads associated with an instruction to update a shared memory. A first thread of the first set of threads is selected. The instruction is executed for the first thread using the same address to access the shared memory. Attempts to execute the instruction for remaining threads of the first set of threads are delayed until after the first thread is executed and until at least one of the remaining threads of the first set of threads is not guaranteed to fail execution of the instruction.
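
The identify, select, and delay steps in the abstract can be illustrated with a small host-side simulation. This is only a sketch in Python, not the hardware mechanism the patent describes: the function name `issue_pass`, the choice of the first listed thread as the selected thread, and the example addresses are all assumptions made for illustration.

```python
from collections import defaultdict

def issue_pass(active_threads, addr_of):
    """Model one issue pass over a group of active threads.

    active_threads: list of thread IDs taking part in the instruction.
    addr_of: dict mapping thread ID -> shared-memory address it targets.

    Returns (selected, delayed):
      selected: one thread per distinct address, executed in this pass.
      delayed:  the remaining threads of each same-address set, whose
                attempts are postponed to a later pass.
    """
    # Identify sets of threads that target the same address.
    same_address_sets = defaultdict(list)
    for tid in active_threads:
        same_address_sets[addr_of[tid]].append(tid)

    selected, delayed = [], []
    for tids in same_address_sets.values():
        # Assumed policy: pick the first thread in listing order.
        selected.append(tids[0])
        # Every other thread in the set waits for a later pass.
        delayed.extend(tids[1:])
    return selected, delayed
```

For example, with threads 0, 1, and 3 targeting one address and thread 2 targeting another, a single pass selects threads 0 and 2 and delays threads 1 and 3.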

First claim

Opening claim text (preview).

What is claimed is:

1. A system comprising: a shared memory; and one or more processing units coupled with the shared memory, wherein the one or more processing units are to: identify, from a group of active threads associated with an instruction to update the shared memory, a first set of threads having a same address corresponding to the shared memory; select a first thread of the first set of threads; execute the instruction for the first thread using the same address to access the shared memory; and store a Boolean value in one or more predicate registers corresponding to remaining threads of the first set of threads to prevent executing the instruction for the remaining threads until after the first thread is executed, wherein the Boolean value indicates that the remaining threads failed to execute the instruction.

2. The system of claim 1, wherein the one or more processing units are further to: responsive to the execution of the instruction for the first thread, store a Boolean value in a predicate register corresponding to the first thread, wherein the Boolean value indicates whether the first thread successfully executed the instruction.

3. The system of claim 1, wherein the instruction is a compare-and-store (CAST) instruction, and wherein to execute the compare and store instruction, the one or more processing units are to: compare a first value stored at the same memory address of the shared memory with an expected value; and responsive to a determination that the first value matches the expected value, write a second value to the shared memory at the same memory address.

4. The system of claim 3, wherein the one or more processing units are further to: write the second value to one or more private registers corresponding to the first set of threads.

5. The system of claim 4, wherein the one or more processing units are further to: subsequent to the execution of the instruction for the first thread, execute the CAST instruction for a second thread of the remaining threads of the first set of threads using the second value stored in a respective private register of the private registers.

6. The system of claim 1, wherein the shared memory comprises a plurality of logical units, and wherein the one or more processing units are further to: serially execute the instruction for threads from the group of active threads with different addresses corresponding to a same logical unit of the plurality of logical units of the shared memory.

7. The system of claim 1, wherein the system is comprised in at least one of: a control system for an autonomous or semi-autonomous machine; a perception system for an autonomous or semi-autonomous machine; a system for performing simulation operations; a system for performing digital twin operations; a system for performing light transport simulation; a system for performing collaborative content creation for 3D assets; a system for performing deep learning operations; a system for generating or presenting at least one of augmented reality content, virtual reality content, or mixed reality content; a system for hosting one or more real-time streaming applications; a system implemented using an edge device; a system implemented using a robot; a system for performing conversational AI operations; a system for generating synthetic data; a system incorporating one or more virtual machines (VMs); a system implemented at least partially in a data center; or a system implemented at least partially using cloud computing resources.

8. A method comprising: identifying, from a group of active threads associated with an instruction to update a shared memory, a first set of threads having a same address corresponding to the shared memory; selecting a first thread of the first set of threads; executing the instruction for the first thread using the same address to access the shared memory; and storing a Boolean value in one or more predicate registers corresponding to remaining threads of the first set of threads to prevent executing the instruction for the remaining threads until after the first thread is executed and until at least one of the remaining threads of the first set of threads is not guaranteed to fail execution of the instruction, wherein the Boolean value indicates that the remaining threads failed to execute the instruction.

9. The method of claim 8, further comprising: responsive to executing the instruction for the first thread, storing a Boolean value in a predicate register corresponding to the first thread, wherein the Boolean value indicates whether the first thread successfully executed the instruction.

10. The method of claim 8, wherein the instruction is a compare-and-store (CAST) instruction, and wherein executing the compare and store instruction comprises: comparing a first value stored at the same memory address of the shared memory with an expected value; and responsive to determining that the first value matches the expected value, writing a second value to the shared memory at the same memory address.

11. The method of claim 10, further comprising: writing the second value to one or more private registers corresponding to the first set of threads.

12. The method of claim 11, further comprising: subsequent to executing the instruction for the first thread, executing the CAST instruction for a second thread of the remaining threads of the first set of threads using the second value stored in a respective private register of the one or more private registers.

13. The method of claim 8, wherein the shared memory comprises a plurality of logical units, and wherein the method further comprises: serially executing the instruction for threads from the group of active threads with different addresses corresponding to a same logical unit of the plurality of logical units of the shared memory.

14. A parallel processing unit (PPU) comprising one or more execution units and a shared memory, wherein the PPU is to: identify, from a group of active threads associated with an instruction to update the shared memory, a first set of threads having a same address corresponding to the shared memory; select a first thread of the first set of threads; execute the instruction for the first thread on the one or more execution units using the same address to access the shared memory; and store a Boolean value in one or more predicate registers corresponding to remaining threads of the first set of threads to prevent executing the instruction for the remaining threads until after the first thread is executed, wherein the Boolean value indicates that the remaining threads failed to execute the instruction.

15. The PPU of claim 14, wherein the PPU is further to: responsive to the execution of the instruction for the first thread, store a Boolean value in a predicate register corresponding to the first thread, wherein the Boolean value indicates whether the first thread successfully executed the instruction.

16. The PPU of claim 14, wherein the instruction is a compare-and-store (CAST) instruction, and wherein to execute the compare and store instruction, the PPU is to: compare a first value stored at the same memory address of the shared memory with an expected value; and responsive to a determination that the first value matches the expected value, write a second value to the shared memory at the same memory address.

17. The PPU of claim 16, wherein the PPU is further to: write the second value to one or mor
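
One possible reading of the compare-and-store (CAST) claims, together with the predicate-register and private-register claims, is sketched below as a Python simulation. It is a hedged illustration rather than the claimed implementation: the names `cast`, `execute_group`, and `atomic_increment_all` are hypothetical, predicate and private registers are modeled as plain dictionaries, and treating the forwarded second value as the next expected value of the remaining threads is an assumption about how the forwarded value is used.

```python
def cast(memory, addr, expected, new):
    """Compare-and-store: write `new` only if memory[addr] == expected.

    Returns True on success, False otherwise.
    """
    if memory[addr] == expected:
        memory[addr] = new
        return True
    return False

def execute_group(memory, addr, group, expected, new):
    """Run CAST once for a set of threads that share `addr`.

    group:    thread IDs with the same address; group[0] is the selected thread.
    expected: dict thread ID -> expected value (stands in for a private register).
    new:      dict thread ID -> value the thread wants to store.

    Returns a dict thread ID -> Boolean, standing in for predicate registers.
    """
    first, remaining = group[0], group[1:]
    predicate = {first: cast(memory, addr, expected[first], new[first])}
    for tid in remaining:
        # Remaining threads are marked as failed without accessing memory.
        predicate[tid] = False
        if predicate[first]:
            # Assumption: the stored value is forwarded to each remaining
            # thread's private register so its next attempt can succeed.
            expected[tid] = new[first]
    return predicate

def atomic_increment_all(memory, addr, threads):
    """Every thread increments memory[addr] once using a CAST retry loop.

    Returns the number of CAST executions that touched shared memory.
    """
    # Each thread starts out expecting the current value at addr.
    expected = {t: memory[addr] for t in threads}
    pending = list(threads)
    memory_accesses = 0
    while pending:
        new = {t: expected[t] + 1 for t in pending}
        predicate = execute_group(memory, addr, pending, expected, new)
        memory_accesses += 1
        # Threads whose predicate is False retry in the next pass.
        pending = [t for t in pending if not predicate[t]]
    return memory_accesses
```

In this model, four threads incrementing the same location take four memory accesses in total, because each pass performs exactly one CAST against shared memory and the remaining threads are marked as failed without accessing memory.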

Assignees

Nvidia Corp

Inventors

Classifications

  • LOAD or STORE instructions; Clear instruction

  • Divergence aspects

  • Iterative single instructions for multiple data lanes [SIMD]

  • Compare instructions, e.g. Greater-Than, Equal-To, MINMAX

  • Maintaining memory consistency

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US12547413B2 cover?
A first set of threads having a same address corresponding to the shared memory is identified from a group of active threads associated with an instruction to update a shared memory. A first thread of the first set of threads is selected. The instruction is executed for the first thread using the same address to access the shared memory. Attempts to execute the instruction for remaining threads…
Who is the assignee on this patent?
Nvidia Corp
What technology area does this patent fall under?
Primary CPC classification G06F9/30043. Mapped technology areas include Physics.
When was this patent published?
Publication date Feb 10, 2026 (B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 10 related publications on this page (citations in our corpus or others sharing the same primary CPC).