What technology area does this patent fall under?

Primary CPC classification G06F17/16. Mapped technology areas include Physics.

When was this patent published?

Publication date Feb 18, 2025 (B2). Legal status and post-grant events are not shown on this page.

What related patents are in patentsdb?

We list 12 related publications on this page (citations in our corpus or others sharing the same primary CPC).

Performing matrix multiplication in a streaming processor

US12229215B2 · US · B2

Patent metadata
Field	Value
Publication number	US-12229215-B2
Application number	US-202318487918-A
Country	US
Kind code	B2
Filing date	Oct 16, 2023
Priority date	Dec 30, 2019
Publication date	Feb 18, 2025
Grant date	Feb 18, 2025

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

Title
What the patent document calls the invention.
Abstract
A short plain-language summary of the technical disclosure.
Assignees and inventors
Who owns or filed the patent and who is credited as inventor.
Key dates
Filing, priority, publication, and grant dates set the timeline.
First independent claim
The legal scope of protection — read this for what is actually claimed.
CPC / IPC classifications
Technology tags used to group this patent with similar filings.
Citations and related patents
Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

The present disclosure relates to methods and apparatus for compute processing. For example, disclosed techniques facilitate improving performance of matrix multiplication in streaming processor. Aspects of the present disclosure can execute, with a load control unit, a first load instruction to load a set of input data of an input matrix from a first memory to a second memory. Aspects of the present disclosure can also execute, with the load control unit, a second load instruction to load a set of weight data of a weight matrix from the first memory to the second memory. Additionally, aspects of the present disclosure can perform, with an ALU component, a matrix multiplication operation using the set of input data and the set of weight data to generate an output matrix. Further, aspects of the present disclosure can store the output matrix at a general purpose register accessible to the ALU component.
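The data flow in the abstract can be illustrated with a small sketch. This is a toy software model, not the patented hardware: the dictionaries standing in for the first memory, second memory, and general purpose register, and the function names `load`, `matmul`, and `run`, are all assumptions introduced for illustration.

```python
# Toy model of the abstract's data flow. All names here are hypothetical;
# the patent describes hardware units, not this Python API.

def load(first_memory, second_memory, key):
    """Model one load instruction: copy a named block between memories."""
    second_memory[key] = [row[:] for row in first_memory[key]]


def matmul(a, b):
    """Triple-loop matrix multiplication standing in for the ALU component."""
    inner, cols = len(b), len(b[0])
    if any(len(row) != inner for row in a):
        raise ValueError("inner dimensions do not match")
    return [
        [sum(row[k] * b[k][j] for k in range(inner)) for j in range(cols)]
        for row in a
    ]


def run(first_memory):
    second_memory = {}
    load(first_memory, second_memory, "input")   # first load instruction
    load(first_memory, second_memory, "weight")  # second load instruction
    gpr = {}                                     # register file modeled as a dict
    gpr["output"] = matmul(second_memory["input"], second_memory["weight"])
    return gpr


first_memory = {
    "input": [[1, 2], [3, 4]],
    "weight": [[5, 6], [7, 8]],
}
gpr = run(first_memory)
print(gpr["output"])  # [[19, 22], [43, 50]]
```

The sketch keeps the two loads separate because the abstract names them as distinct instructions; in real hardware the loads, the multiplication, and the register write would likely overlap rather than run one after another.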

First claim

Opening claim text (preview).

What is claimed is:

1. An apparatus for data processing, comprising: at least one memory; and at least one processor coupled to the at least one memory and configured to: fetch an element of an input matrix from graphics memory; determine whether the element of the input matrix is to be used across multiple threads; and store the element of the input matrix at a buffer until a workgroup corresponding to at least one of the multiple threads is executed in response to a first determination that the element of the input matrix is to be used across multiple threads, or store the element of the input matrix at a general purpose register (GPR) in response to a second determination that the element of the input matrix is not to be used across multiple threads.
2. The apparatus of claim 1, wherein the at least one processor is further configured to: broadcast the element of the input matrix when a thread in the multiple threads is executed.
3. The apparatus of claim 1, wherein to determine whether the element of the input matrix is to be used across the multiple threads, the at least one processor is configured to: determine that the element of the input matrix is not to be used across the multiple threads, and wherein to store the element of the input matrix at the buffer or at the GPR based on the determination, the at least one processor is configured to: store the element of the input matrix at the GPR based on the determination that the element of the input matrix is not to be used across the multiple threads.
4. The apparatus of claim 1, wherein the at least one processor is further configured to: fetch a first element of a weight matrix from the graphics memory; and store the first element of the weight matrix at the buffer.
5. The apparatus of claim 4, and wherein to store the element of the input matrix at the buffer, the at least one processor is configured to: store the element of the input matrix at the buffer in a pattern that is based on matrix multiplication, and wherein to store the first element of the weight matrix at the buffer, the at least one processor is configured to: store the first element of the weight matrix at the buffer in the pattern that is based on the matrix multiplication.
6. The apparatus of claim 4, wherein the at least one processor is further configured to: perform, via the multiple threads, a matrix multiplication operation with respect to the element of the input matrix and the first element of the weight matrix.
7. The apparatus of claim 1, wherein to fetch the element of the input matrix from the graphics memory, the at least one processor is configured to; fetch the element of the input matrix from the graphics memory via a single block load instruction.
8. The apparatus of claim 1, wherein the apparatus is a wireless communication device comprising a transceiver.
9. The apparatus of claim 1, wherein, to determine whether the element of the input matrix is to be used across multiple threads, the at least one processor is configured to: determine that the element of the input matrix is to be used across multiple threads, wherein, to store the element of the input matrix at the buffer until the workgroup corresponding to at least one of the multiple threads is executed in response to the first determination that the element of the input matrix is to be used across multiple threads, or store the element of the input matrix at the GPR in response to the second determination that the element of the input matrix is not to be used across multiple threads, the at least one processor is configured to: store the element of the input matrix at the buffer until the workgroup corresponding to at least one of the multiple threads is executed in response to the determination that the element of the input matrix is to be used across multiple threads.
10. The apparatus of claim 9, wherein the at least one processor is further configured to: fetch a second element of the input matrix from graphics memory; determine that the second element of the input matrix is not to be used across multiple threads; and store the second element of the input matrix at the GPR in response to the determination that the second element of the input matrix is not to be used across multiple threads.
11. A method of data processing, comprising: fetching an element of an input matrix from graphics memory; determining whether the element of the input matrix is to be used across multiple threads; and storing the element of the input matrix at a buffer until a workgroup corresponding to at least one of the multiple threads is executed in response to a first determination that the element of the input matrix is to be used across multiple threads, and or storing the element of the input matrix at a general purpose register (GPR) in response to a second determination that the element of the input matrix is not to be used across multiple threads.
12. The method of claim 11, further comprising: broadcasting the element of the input matrix when a thread in the multiple threads is executed.
13. The method of claim 11, wherein determining whether the element of the input matrix is to be used across the multiple threads comprises: determining that the element of the input matrix is not to be used across the multiple threads, and wherein storing the element of the input matrix at the buffer or at the GPR based on the determination comprises: storing the element of the input matrix at the GPR based on the determination that the element of the input matrix is not to be used across the multiple threads.
14. The method of claim 11, further comprising: fetching a first element of a weight matrix from the graphics memory; and storing the first element of the weight matrix at the buffer.
15. The method of claim 14, and wherein storing the element of the input matrix at the buffer comprises: storing the element of the input matrix at the buffer in a pattern that is based on matrix multiplication, and wherein storing the first element of the weight matrix at the buffer comprises: storing the first element of the weight matrix at the buffer in the pattern that is based on the matrix multiplication.
16. The method of claim 14, further comprising: performing, via the multiple threads, a matrix multiplication operation with respect to the element of the input matrix and the first element of the weight matrix.
17. The method of claim 11, wherein fetching the element of the input matrix from the graphics memory comprises: fetching the element of the input matrix from the graphics memory via a single block load instruction.
18. At least one non-transient computer-readable medium storing computer executable code for data processing, comprising code to: fetch an element of an input matrix from graphics memory; determine whether the element of the input matrix is to be used across multiple threads; and store the element of the input matrix at a buffer until a workgroup corresponding to at least one of the multiple threads is executed in response to a first determination that the element of the input matrix is to be used across multiple threads, and or store the element of the input matrix at a general purpose register (GPR) in response to a second determination that the element of the input matrix is not to be used across multiple threads.
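The placement rule in claims 1, 2, and 11 (shared elements go to a buffer, unshared elements go to a GPR, and shared elements are broadcast when a thread runs) can be sketched in software. This is only an illustration under stated assumptions: the names `place_elements` and `read_element`, the dictionary-based buffer and registers, and the `thread_uses` map are hypothetical, and the sketch does not model the workgroup lifetime of the buffer.

```python
# Illustrative model of the buffer-versus-GPR decision in claim 1.
# All names are hypothetical; the claims describe processor behavior,
# not this API.

def place_elements(fetched, thread_uses):
    """Decide where each fetched element is stored.

    fetched:     dict mapping element id -> value fetched from graphics memory
    thread_uses: dict mapping thread id -> list of element ids that thread needs
    Returns (buffer, gprs): a shared buffer and one private register dict per thread.
    """
    buffer = {}
    gprs = {thread: {} for thread in thread_uses}
    for elem, value in fetched.items():
        users = [t for t, ids in thread_uses.items() if elem in ids]
        if len(users) > 1:
            buffer[elem] = value          # used across multiple threads
        elif users:
            gprs[users[0]][elem] = value  # used by a single thread only
    return buffer, gprs


def read_element(thread, elem, buffer, gprs):
    """Model the broadcast in claim 2: shared values are served from the buffer."""
    if elem in buffer:
        return buffer[elem]
    return gprs[thread][elem]


fetched = {"x0": 2, "x1": 3, "x2": 5}
thread_uses = {0: ["x0", "x1"], 1: ["x0", "x2"]}

buffer, gprs = place_elements(fetched, thread_uses)
print(buffer)  # {'x0': 2}
print(gprs)    # {0: {'x1': 3}, 1: {'x2': 5}}

# Each thread sums the elements it needs, mixing shared and private reads.
sums = {
    t: sum(read_element(t, e, buffer, gprs) for e in ids)
    for t, ids in thread_uses.items()
}
print(sums)    # {0: 5, 1: 7}
```

Storing the shared element once in the buffer means it is held in one place rather than copied into each thread's registers, which is the trade-off the claim's two storage branches appear to target.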

Assignees

Qualcomm Inc

Inventors

Classifications

G06F9/3851
from multiple instruction streams, e.g. multistreaming · CPC title
G06F9/3887
controlled by a single instruction for multiple data lanes [SIMD] · CPC title
G06F9/30036
Instructions to perform operations on packed data, e.g. vector, tile or matrix operations · CPC title
G06F7/57
Arithmetic logic units [ALU], i.e. arrangements or devices for performing two or more of the operations covered by groups G06F7/483 – G06F7/556 or for performing logical operations {(G06F7/49, G06F7/491 take precedence)} · CPC title
G06F17/16 (Primary)
Matrix or vector computation {, e.g. matrix-matrix or matrix-vector multiplication, matrix factorization (matrix transposition G06F7/78)} · CPC title

Patent family

Related publications grouped by family.

View patent family 76547364

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US12229215B2 cover?

The present disclosure relates to methods and apparatus for compute processing. For example, disclosed techniques facilitate improving performance of matrix multiplication in streaming processor. Aspects of the present disclosure can execute, with a load control unit, a first load instruction to load a set of input data of an input matrix from a first memory to a second memory. Aspects of the p…

Who is the assignee on this patent?

Qualcomm Inc