What technology area does this patent fall under?

Primary CPC classification G06F7/5443. Mapped technology areas include Physics.

When was this patent published?

Publication date May 19, 2020 (kind code B2). Legal status and post-grant events are not shown on this page.

What related patents are in patentsdb?

We list 7 related publications on this page (cited publications found in our corpus, or other publications sharing the same primary CPC classification).

Deep learning accelerator architecture with chunking GEMM

US10657442B2 · US · B2

Patent metadata
Field	Value
Publication number	US-10657442-B2
Application number	US-201815957711-A
Country	US
Kind code	B2
Filing date	Apr 19, 2018
Priority date	Apr 19, 2018
Publication date	May 19, 2020
Grant date	May 19, 2020

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

Title: What the patent document calls the invention.
Abstract: A short plain-language summary of the technical disclosure.
Assignees and inventors: Who owns or filed the patent and who is credited as inventor.
Key dates: Filing, priority, publication, and grant dates set the timeline.
First independent claim: The legal scope of protection; read this for what is actually claimed.
CPC / IPC classifications: Technology tags used to group this patent with similar filings.
Citations and related patents: Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

A compute matrix is configured to include a set of compute units, each compute unit including a multiplier and an accumulator, each of the multiplier and the accumulator formed using at least one floating point unit (FPU). An accumulator array is configured to include a set of external accumulators. The compute matrix is operated to produce a chunk dot-product using a first chunk of a first input vector and a first chunk of a second input vector. The accumulator array is operated to output a dot-product of the first input vector and the second input vector using the chunk dot-product.
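The data flow in the abstract can be illustrated with a short Python sketch. It is a minimal, sequential model under our own assumptions, not the patented hardware: `chunk_dot` stands in for one pass of the compute matrix (each multiply-accumulate done by a compute unit's multiplier and accumulator), and the running sum in `chunked_dot` stands in for an external accumulator in the accumulator array. The function names and the fixed chunk size are hypothetical.

```python
from typing import List, Sequence


def split_into_chunks(values: Sequence[float], chunk_size: int) -> List[Sequence[float]]:
    """Split a vector into non-overlapping, contiguous chunks (the last may be shorter)."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    return [values[i:i + chunk_size] for i in range(0, len(values), chunk_size)]


def chunk_dot(a_chunk: Sequence[float], b_chunk: Sequence[float]) -> float:
    """Hypothetical stand-in for the compute matrix producing one chunk dot-product."""
    acc = 0.0  # plays the role of a compute unit's internal accumulator
    for x, y in zip(a_chunk, b_chunk):
        acc += x * y  # multiplier output fed into the accumulator
    return acc


def chunked_dot(a: Sequence[float], b: Sequence[float], chunk_size: int) -> float:
    """Full dot product assembled from chunk dot-products by an external accumulator."""
    if len(a) != len(b):
        raise ValueError("input vectors must have the same length")
    external_acc = 0.0  # hypothetical external accumulator
    for a_chunk, b_chunk in zip(split_into_chunks(a, chunk_size),
                                split_into_chunks(b, chunk_size)):
        external_acc += chunk_dot(a_chunk, b_chunk)
    return external_acc


a = [1.0, 2.0, 3.0, 4.0, 5.0]
b = [0.5, 0.5, 0.5, 0.5, 0.5]
print(chunked_dot(a, b, chunk_size=2))  # chunks give 1.5 + 3.5 + 2.5 = 7.5
```

In real hardware the products within a chunk would be computed in parallel across compute units; the loop here only shows how the partial results combine.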

First claim

What is claimed is:

1. A method comprising: configuring a compute matrix comprising a set of compute units wherein each compute unit comprises a multiplier and an accumulator, each of the multiplier and the accumulator formed using at least one floating point unit (FPU); configuring an accumulator array comprising a set of external accumulators; operating the compute matrix to produce a chunk dot-product using a first chunk of a first input vector and a first chunk of a second input vector; operating the accumulator array to output a dot-product of the first input vector and the second input vector using the chunk dot-product; chunking the first input vector into a first set of chunks, each chunk in the first set of chunks including a non-overlapping subset of values from a first set of values in the first input vector, wherein the first set of chunks includes the first chunk of the first input vector; and chunking the second input vector into a second set of chunks, each chunk in the second set of chunks including a non-overlapping subset of values from a second set of values in the second input vector, wherein the second set of chunks includes the first chunk of the second input vector.

2. The method of claim 1, further comprising: changing, responsive to a precision of the chunk dot-product, a bit-width of an external accumulator in the accumulator array from a first bit-width to a second bit-width.

3. The method of claim 1, further comprising: changing, responsive to a precision of values in the first chunk of the first input vector, a bit-width of the multiplier in the compute unit from a first bit-width to a second bit-width.

4. The method of claim 1, further comprising: changing, responsive to a precision of a product expected to be produced by the multiplier, a bit-width of the accumulator in the compute unit from a first bit-width to a second bit-width.

5. The method of claim 1, wherein each chunk in the first set of chunks is of a first size.

6. The method of claim 5, wherein each chunk in the second set of chunks is of the first size.

7. The method of claim 1, wherein two chunks in the first set of chunks are of different sizes relative to each other.

8. The method of claim 1, further comprising: configuring each external accumulator in the accumulator array to perform only an accumulation operation.

9. The method of claim 1, further comprising: configuring an external accumulator in the accumulator array using a third FPU of a third bit-width.

10. The method of claim 1, wherein the third bit-width exceeds a second bit-width of a second FPU used in the accumulator of the compute unit.

11. The method of claim 1, further comprising: configuring the multiplier using a first FPU of a first bit-width; and configuring the accumulator using a second FPU of a second bit-width.

12. A computer usable program product comprising a computer-readable storage medium, and program instructions stored on the storage medium, the stored program instructions comprising: program instructions to configure a compute matrix comprising a set of compute units wherein each compute unit comprises a multiplier and an accumulator, each of the multiplier and the accumulator formed using at least one floating point unit (FPU); program instructions to configure an accumulator array comprising a set of external accumulators; program instructions to operate the compute matrix to produce a chunk dot-product using a first chunk of a first input vector and a first chunk of a second input vector; program instructions to operate the accumulator array to output a dot-product of the first input vector and the second input vector using the chunk dot-product; program instructions to chunk the first input vector into a first set of chunks, each chunk in the first set of chunks including a non-overlapping subset of values from a first set of values in the first input vector, wherein the first set of chunks includes the first chunk of the first input vector; and program instructions to chunk the second input vector into a second set of chunks, each chunk in the second set of chunks including a non-overlapping subset of values from a second set of values in the second input vector, wherein the second set of chunks includes the first chunk of the second input vector.

13. The computer usable program product of claim 12, further comprising: program instructions to change, responsive to a precision of the chunk dot-product, a bit-width of an external accumulator in the accumulator array from a first bit-width to a second bit-width.

14. The computer usable program product of claim 12, further comprising: program instructions to change, responsive to a precision of values in the first chunk of the first input vector, a bit-width of the multiplier in the compute unit from a first bit-width to a second bit-width.

15. The computer usable program product of claim 12, further comprising: program instructions to change, responsive to a precision of a product expected to be produced by the multiplier, a bit-width of the accumulator in the compute unit from a first bit-width to a second bit-width.

16. The computer usable program product of claim 12, wherein the computer usable code is stored in a computer readable storage device in a data processing system, and wherein the computer usable code is transferred over a network from a remote data processing system.

17. The computer usable program product of claim 12, wherein the computer usable code is stored in a computer readable storage device in a server data processing system, and wherein the computer usable code is downloaded over a network to a remote data processing system for use in a computer readable storage device associated with the remote data processing system.

18. A computer system comprising a processor, a computer-readable memory, and a computer-readable storage device, and program instructions stored on the storage device for execution by the processor via the memory, the stored program instructions comprising: program instructions to configure a compute matrix comprising a set of compute units wherein each compute unit comprises a multiplier and an accumulator, each of the multiplier and the accumulator formed using at least one floating point unit (FPU); program instructions to configure an accumulator array comprising a set of external accumulators; program instructions to operate the compute matrix to produce a chunk dot-product using a first chunk of a first input vector and a first chunk of a second input vector; program instructions to operate the accumulator array to output a dot-product of the first input vector and the second input vector using the chunk dot-product; program instructions to chunk the first input vector into a first set of chunks, each chunk in the first set of chunks including a non-overlapping subset of values from a first set of values in the first input vector, wherein the first set of chunks includes the first chunk of the first input vector; and program instructions to chunk the second input vector into a second set of chunks, each chunk in the second set of chunks including a non-overlapping subset of values from a second set of values in the second input vector, wherein the second set of chunks includes the first chunk of the second input vector.
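Claims 9 to 11 let the compute unit's accumulator and an external accumulator use FPUs of different bit-widths (claim 10: the external one is wider), and claim 7 allows chunks of different sizes. One plausible illustration of why a wider external accumulator helps is sketched below in Python. The format choices are our assumptions, not the patent's: IEEE 754 binary16 (via the `struct` module's `'e'` format) stands in for the narrow in-unit FPU, and Python's binary64 `float` stands in for the wider external accumulator. All function names are hypothetical.

```python
import struct
from typing import List, Sequence


def to_fp16(x: float) -> float:
    """Round a binary64 float to IEEE 754 binary16 and back (round half to even).

    struct.pack raises OverflowError if x is too large for binary16.
    """
    return struct.unpack("<e", struct.pack("<e", x))[0]


def split_by_sizes(values: Sequence[float], sizes: Sequence[int]) -> List[Sequence[float]]:
    """Non-overlapping chunks whose sizes may differ from one another (compare claim 7)."""
    if sum(sizes) != len(values) or any(s <= 0 for s in sizes):
        raise ValueError("chunk sizes must be positive and cover the vector exactly")
    chunks, start = [], 0
    for size in sizes:
        chunks.append(values[start:start + size])
        start += size
    return chunks


def fp16_chunk_dot(a_chunk: Sequence[float], b_chunk: Sequence[float]) -> float:
    """Narrow in-unit path: each product and each running sum rounded to binary16."""
    acc = 0.0
    for x, y in zip(a_chunk, b_chunk):
        acc = to_fp16(acc + to_fp16(x * y))
    return acc


def mixed_precision_dot(a: Sequence[float], b: Sequence[float], sizes: Sequence[int]) -> float:
    """Binary16 chunk dot-products combined in a wider (binary64) external accumulator."""
    external_acc = 0.0  # assumed wider bit-width: Python's binary64 float
    for a_chunk, b_chunk in zip(split_by_sizes(a, sizes), split_by_sizes(b, sizes)):
        external_acc += fp16_chunk_dot(a_chunk, b_chunk)
    return external_acc


ones = [1.0] * 4096
naive = fp16_chunk_dot(ones, ones)                                # one long binary16 sum
chunked = mixed_precision_dot(ones, ones, [256] * 16)             # equal chunk sizes
uneven = mixed_precision_dot(ones, ones, [1000, 1000, 2000, 96])  # unequal chunk sizes
print(naive, chunked, uneven)  # 2048.0 4096.0 4096.0
```

With 4,096 products of 1.0, a single binary16 running sum stalls at 2048 because 2049 is not representable in binary16 and rounds back to 2048. Chunk sums of at most 2048 stay exact, so the wider accumulator recovers 4096. The chunk sizes here are picked by hand; the claims do not say how sizes or bit-widths are selected.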

Assignees

IBM

Inventors

Classifications

G06F7/5443 (primary): Sum of products (for applications thereof, see the relevant places, e.g. G06F17/10, H03H17/00)
G06F7/483: Computations with numbers represented by a non-linear combination of denominational numbers, e.g. rational numbers, logarithmic number system or floating-point numbers {(G06F7/4806, G06F7/4824, G06F7/49, G06F7/491, G06F7/544 take precedence)}
G06F17/16: Matrix or vector computation {, e.g. matrix-matrix or matrix-vector multiplication, matrix factorization (matrix transposition G06F7/78)}
G06N3/08 (primary): Learning methods
G06N3/063: using electronic means

Patent family

Family ID: 68238012

Frequently asked questions

Who is the assignee on this patent?: IBM

Related patents

Hardware node having a mixed-signal matrix vector unit

Smart performance of spill fill data transfers in computing environments

Hardware node having a matrix vector unit with block-floating point processing

Scaling half-precision floating point tensors for training deep neural networks

Tensor processing using low precision format

Accelerator for deep neural networks

Neural network compute tile
