What technology area does this patent fall under?

Primary CPC classification G06F17/16. Mapped technology areas include Physics.

When was this patent published?

Publication date: Nov 28, 2023 (kind code B2). Legal status and post-grant events are not shown on this page.

What related patents are in patentsdb?

We list 7 related publications on this page (citations in our corpus or others sharing the same primary CPC).

Methods and apparatus to perform matrix multiplication in a streaming processor

US11829439B2 · US · B2

Patent metadata
Field	Value
Publication number	US-11829439-B2
Application number	US-202017137226-A
Country	US
Kind code	B2
Filing date	Dec 29, 2020
Priority date	Dec 30, 2019
Publication date	Nov 28, 2023
Grant date	Nov 28, 2023

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

Title: What the patent document calls the invention.
Abstract: A short plain-language summary of the technical disclosure.
Assignees and inventors: Who owns or filed the patent and who is credited as inventor.
Key dates: Filing, priority, publication, and grant dates set the timeline.
First independent claim: The legal scope of protection. Read this for what is actually claimed.
CPC / IPC classifications: Technology tags used to group this patent with similar filings.
Citations and related patents: Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

The present disclosure relates to methods and apparatus for compute processing. For example, disclosed techniques facilitate improving performance of matrix multiplication in streaming processor. Aspects of the present disclosure can execute, with a load control unit, a first load instruction to load a set of input data of an input matrix from a first memory to a second memory. Aspects of the present disclosure can also execute, with the load control unit, a second load instruction to load a set of weight data of a weight matrix from the first memory to the second memory. Additionally, aspects of the present disclosure can perform, with an ALU component, a matrix multiplication operation using the set of input data and the set of weight data to generate an output matrix. Further, aspects of the present disclosure can store the output matrix at a general purpose register accessible to the ALU component.
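The data flow described in the abstract can be sketched as a toy software model. This is a minimal illustration, not the patented hardware: the class `StreamingProcessor`, its methods `load` and `matmul`, and the dictionary-based memories are all hypothetical names chosen for this example.

```python
# Illustrative model only. StreamingProcessor, load and matmul are
# hypothetical names and do not come from the patent.

class StreamingProcessor:
    def __init__(self, first_memory):
        self.first_memory = first_memory  # source memory holding both matrices
        self.second_memory = {}           # memory the load control unit fills
        self.gpr = {}                     # general purpose register file

    def load(self, name):
        """Load control unit: copy a block from the first to the second memory."""
        self.second_memory[name] = [row[:] for row in self.first_memory[name]]

    def matmul(self, input_name, weight_name, out_name):
        """ALU component: multiply the loaded blocks and store the result in the GPR."""
        a = self.second_memory[input_name]
        w = self.second_memory[weight_name]
        rows, inner, cols = len(a), len(w), len(w[0])
        if any(len(row) != inner for row in a):
            raise ValueError("input columns must equal weight rows")
        out = [[sum(a[i][k] * w[k][j] for k in range(inner)) for j in range(cols)]
               for i in range(rows)]
        self.gpr[out_name] = out
        return out


first_memory = {
    "input": [[1, 2, 3], [4, 5, 6]],     # 2 rows x 3 columns
    "weight": [[1, 0], [0, 1], [2, 2]],  # 3 rows x 2 columns
}
sp = StreamingProcessor(first_memory)
sp.load("input")    # stands in for the first load instruction
sp.load("weight")   # stands in for the second load instruction
result = sp.matmul("input", "weight", "out")
print(result)       # [[7, 8], [16, 17]]
```

The output has as many rows as the input block and as many columns as the weight block, which matches the output shape described in the first claim.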

First claim

Opening claim text (preview).

What is claimed is:

1. A method of performing data processing, comprising: executing, with a load control unit of a streaming processor of a plurality of streaming processors of a processor, a first load instruction to load a set of input data of an input matrix from a first memory to a second memory of the streaming processor, the input matrix comprising a first number of rows and a first number of columns; executing, with the load control unit, a second load instruction to load a set of weight data of a weight matrix from the first memory to the second memory, the weight matrix comprising a second number of rows and a second number of columns; executing, with the load control unit, a third load instruction to fetch an element of a second input matrix; determining, with a direct memory access (DMA) component of the load control unit, the element is not shared by multiple fibers; based on determining the element is not shared by multiple fibers, storing the element to a general purpose register; performing, with an arithmetic logic unit (ALU) component of the streaming processor, a matrix multiplication operation using the set of input data and the set of weight data to generate an output matrix having the first number of rows and the second number of columns, each element of the output matrix representing a dot product of a plurality of elements of a row of the set of input data and a column of the set of weight data, the dot product including a plurality of multiplication operations and accumulation operations resulting in respective intermediate results that are re-input to the ALU component for a subsequent operation of the dot product; and storing the output matrix at the general purpose register of the streaming processor, the general purpose register configured to be accessible to the ALU component.

2. The method of claim 1, wherein the respective intermediate results are output from the ALU component to an ALU controller of the streaming processor, and wherein the respective intermediate results are re-input from the ALU controller to the ALU component for executing the subsequent operation and without accessing the general purpose register for the executing of the subsequent operation.

3. The method of claim 1, wherein executing the first load instruction comprises loading, with the load control unit, a first block of elements at memory addresses of the first memory using a first pattern, the first block of elements corresponding to the set of input data of the input matrix, and wherein executing the second load instruction comprises loading, with the load control unit, a second block of elements at memory addresses of the first memory using a second pattern, the second block of elements corresponding to the set of weight data of the weight matrix.

4. The method of claim 3, wherein using at least one of the first pattern or the second pattern comprises accessing elements that are at contiguous memory addresses at the first memory.

5. The method of claim 1, wherein a size of the set of input data of the input matrix is based on a wave size, and wherein a size of the set of weight data of the weight matrix is based on the wave size.

6. The method of claim 5, wherein the wave size corresponds to a plurality of fibers, and wherein each output of the output matrix corresponds to execution of a respective fiber.

7. The method of claim 1, wherein at least one of the first number of rows and the first number of columns is greater than one, and wherein at least one of the second number of rows and the second number of columns is greater than one.

8. The method of claim 1, further comprising fetching, with the load control unit, the first load instruction from a local memory of the streaming processor or a shared memory that is accessible to the streaming processor.

9. An apparatus for performing data processing, comprising: a memory; and at least one processor coupled to the memory, wherein: the at least one processor comprises a plurality of streaming processors; a streaming processor of the plurality of streaming processors comprises a load control unit, a second memory, an arithmetic logic unit (ALU) component, and a general purpose register; and the streaming processor is configured to: execute, with the load control unit of the streaming processor, a first load instruction to load a set of input data of an input matrix from a first memory to the second memory the input matrix comprising a first number of rows and a first number of columns; execute, with the load control unit, a second load instruction to load a set of weight data of a weight matrix from the first memory to the second memory, the weight matrix comprising a second number of rows and a second number of columns; execute, with the load control unit, a third load instruction to fetch an element of a second input matrix; determine, with a direct memory access (DMA) component of the load control unit, the element is not shared by multiple fibers; based on the determination that the element is not shared by multiple fibers, store the element to the general purpose register; perform, with the ALU component, a matrix multiplication operation using the set of input data and the set of weight data to generate an output matrix having the first number of rows and the second number of columns, each element of the output matrix representing a dot product of a plurality of elements of a row of the set of input data and a column of the set of weight data, the dot product including a plurality of multiplication operations and accumulation operations resulting in respective intermediate results that are re-input to the ALU component for a subsequent operation of the dot product; and store the output matrix at the general purpose register, the general purpose register configured to be accessible to the ALU component.

10. The apparatus of claim 9, wherein: the streaming processor further comprises an ALU controller; the intermediate results are output from the ALU component to the ALU controller; and the respective intermediate results are re-input from the ALU controller to the ALU component to execute the subsequent operation without an access to the general purpose register.

11. The apparatus of claim 9, wherein the streaming processor is configured to: execute the first load instruction to load, with the load control unit, a first block of elements at memory addresses of the first memory using a first pattern, the first block of elements corresponding to the set of input data of the input matrix, and execute the second load instruction to load, with the load control unit, a second block of elements at memory addresses of the first memory using a second pattern, the second block of elements corresponding to the set of weight data of the weight matrix.

12. The apparatus of claim 11, wherein the streaming processor is configured to use at least one of the first pattern or the second pattern to access elements that are at contiguous memory addresses at the first memory.

13. The apparatus of claim 9, wherein a size of the set of input data of the input matrix is based on a wave size, and wherein a size of the set of weight data of the weight matrix is based on the wave size.

14. The apparatus of claim 13, wherein the wave size corresponds to a plurality of fibers, and wherein each output of the output matrix corresponds to execution of a respective fiber.

15. The apparatus of claim 9, wherein at least one of the first number of rows and the first number of columns is greater than one, and wherein at least one of the second number of rows and th
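The claims describe each output element as a dot product built from chained multiply and accumulate steps, with each intermediate result fed back into the ALU, and they tie each output element to one fiber. The sketch below is one way to picture that in software, under stated assumptions: `fiber_dot_product` and `wave_matmul` are hypothetical names, fibers are modeled as plain function calls, and nothing here reflects actual hardware scheduling.

```python
# Illustrative sketch only. fiber_dot_product and wave_matmul are
# hypothetical names, not terms taken from the claims.

def fiber_dot_product(row, column):
    """One fiber: chain multiply-accumulate steps for a single output element."""
    if len(row) != len(column):
        raise ValueError("row and column must have the same length")
    acc = 0  # intermediate result, kept in a local accumulator
    for a, w in zip(row, column):
        # The previous intermediate result is fed back in as the addend.
        acc = a * w + acc
    return acc


def wave_matmul(input_rows, weight_rows):
    """Compute one output element per fiber, modeled here as one call each."""
    columns = list(zip(*weight_rows))  # column view of the weight data
    return [[fiber_dot_product(row, col) for col in columns]
            for row in input_rows]


out = wave_matmul([[1, 2, 3], [4, 5, 6]], [[1, 0], [0, 1], [2, 2]])
print(out)  # [[7, 8], [16, 17]]
```

Keeping the running sum in a local accumulator loosely mirrors the dependent claims, in which intermediate results return to the ALU component through an ALU controller rather than through the general purpose register.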

Assignees

Qualcomm Inc

Inventors

Classifications

G06F9/3851: from multiple instruction streams, e.g. multistreaming
G06F9/3887: controlled by a single instruction for multiple data lanes [SIMD]
G06F9/30036: Instructions to perform operations on packed data, e.g. vector, tile or matrix operations
G06F17/16 (Primary): Matrix or vector computation {, e.g. matrix-matrix or matrix-vector multiplication, matrix factorization (matrix transposition G06F7/78)}
G06F7/57: Arithmetic logic units [ALU], i.e. arrangements or devices for performing two or more of the operations covered by groups G06F7/483 – G06F7/556 or for performing logical operations {(G06F7/49, G06F7/491 take precedence)}

Patent family

Related publications grouped by family.

Patent family ID: 76547364

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US11829439B2 cover?: The present disclosure relates to methods and apparatus for compute processing. For example, disclosed techniques facilitate improving performance of matrix multiplication in streaming processor. Aspects of the present disclosure can execute, with a load control unit, a first load instruction to load a set of input data of an input matrix from a first memory to a second memory. Aspects of the p…
Who is the assignee on this patent?: Qualcomm Inc

Related patents

Using machine learning to detect system changes

Method for forward progress and programmable timeouts of tree traversal mechanisms in hardware

Instruction and logic for systolic dot product with accumulate

Systems and methods for implementing chained tile operations

Vector computational unit

Specialized fixed function hardware for efficient convolution

Block floating point for neural network implementations
