Deep learning accelerator architecture with chunking GEMM

US10657442B2 · US · B2

Patent metadata
Field: Value
Publication number: US-10657442-B2
Application number: US-201815957711-A
Country: US
Kind code: B2
Filing date: Apr 19, 2018
Priority date: Apr 19, 2018
Publication date: May 19, 2020
Grant date: May 19, 2020

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

A compute matrix is configured to include a set of compute units, each compute unit including a multiplier and an accumulator, each of the multiplier and the accumulator formed using at least one floating point unit (FPU). An accumulator array is configured to include a set of external accumulators. The compute matrix is operated to produce a chunk dot-product using a first chunk of a first input vector and a first chunk of a second input vector. The accumulator array is operated to output a dot-product of the first input vector and the second input vector using the chunk dot-product.
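The flow the abstract describes can be sketched in software. The following is a minimal illustrative model, not the patented hardware: the function names `chunk`, `chunk_dot`, and `chunked_dot` are hypothetical, plain Python floats stand in for the FPUs, and a single running sum stands in for one external accumulator.

```python
from typing import List, Sequence


def chunk(values: Sequence[float], size: int) -> List[Sequence[float]]:
    """Split values into consecutive, non-overlapping chunks of at most `size`."""
    if size <= 0:
        raise ValueError("chunk size must be positive")
    return [values[i:i + size] for i in range(0, len(values), size)]


def chunk_dot(a_chunk: Sequence[float], b_chunk: Sequence[float]) -> float:
    """Model of one compute unit: multiply element pairs, accumulate locally."""
    acc = 0.0
    for x, y in zip(a_chunk, b_chunk):
        acc += x * y  # multiplier feeds the unit's own accumulator
    return acc


def chunked_dot(a: Sequence[float], b: Sequence[float], size: int) -> float:
    """Sum the chunk dot-products in an external accumulator (assumed model)."""
    if len(a) != len(b):
        raise ValueError("input vectors must have the same length")
    external_acc = 0.0
    for a_chunk, b_chunk in zip(chunk(a, size), chunk(b, size)):
        external_acc += chunk_dot(a_chunk, b_chunk)
    return external_acc
```

For example, `chunked_dot([1, 2, 3, 4, 5], [1, 1, 1, 1, 1], 2)` splits each vector into chunks of lengths 2, 2, and 1, and the three chunk dot-products 3, 7, and 5 are summed to 15.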

First claim

Opening claim text (preview).

What is claimed is:

1. A method comprising: configuring a compute matrix comprising a set of compute units wherein each compute unit comprises a multiplier and an accumulator, each of the multiplier and the accumulator formed using at least one floating point unit (FPU); configuring an accumulator array comprising a set of external accumulators; operating the compute matrix to produce a chunk dot-product using a first chunk of a first input vector and a first chunk of a second input vector; operating the accumulator array to output a dot-product of the first input vector and the second input vector using the chunk dot-product; chunking the first input vector into a first set of chunks, each chunk in the first set of chunks including a non-overlapping subset of values from a first set of values in the first input vector, wherein the first set of chunks includes the first chunk of the first input vector; and chunking the second input vector into a second set of chunks, each chunk in the second set of chunks including a non-overlapping subset of values from a second set of values in the second input vector, wherein the second set of chunks includes the first chunk of the second input vector.

2. The method of claim 1, further comprising: changing, responsive to a precision of the chunk dot-product, a bit-width of an external accumulator in the accumulator array from a first bit-width to a second bit-width.

3. The method of claim 1, further comprising: changing, responsive to a precision of values in the first chunk of the first input vector, a bit-width of the multiplier in the compute unit from a first bit-width to a second bit-width.

4. The method of claim 1, further comprising: changing, responsive to a precision of a product expected to be produced by the multiplier, a bit-width of the accumulator in the compute unit from a first bit-width to a second bit-width.

5. The method of claim 1, wherein each chunk in the first set of chunks is of a first size.

6. The method of claim 5, wherein each chunk in the second set of chunks is of the first size.

7. The method of claim 1, wherein two chunks in the first set of chunks are of different sizes relative to each other.

8. The method of claim 1, further comprising: configuring each external accumulator in the accumulator array to perform only an accumulation operation.

9. The method of claim 1, further comprising: configuring an external accumulator in the accumulator array using a third FPU of a third bit-width.

10. The method of claim 1, wherein the third bit-width exceeds a second bit-width of a second FPU used in the accumulator of the compute unit.

11. The method of claim 1, further comprising: configuring the multiplier using a first FPU of a first bit-width; and configuring the accumulator using a second FPU of a second bit-width.

12. A computer usable program product comprising a computer-readable storage medium, and program instructions stored on the storage medium, the stored program instructions comprising: program instructions to configure a compute matrix comprising a set of compute units wherein each compute unit comprises a multiplier and an accumulator, each of the multiplier and the accumulator formed using at least one floating point unit (FPU); program instructions to configure an accumulator array comprising a set of external accumulators; program instructions to operate the compute matrix to produce a chunk dot-product using a first chunk of a first input vector and a first chunk of a second input vector; program instructions to operate the accumulator array to output a dot-product of the first input vector and the second input vector using the chunk dot-product; program instructions to chunk the first input vector into a first set of chunks, each chunk in the first set of chunks including a non-overlapping subset of values from a first set of values in the first input vector, wherein the first set of chunks includes the first chunk of the first input vector; and program instructions to chunk the second input vector into a second set of chunks, each chunk in the second set of chunks including a non-overlapping subset of values from a second set of values in the second input vector, wherein the second set of chunks includes the first chunk of the second input vector.

13. The computer usable program product of claim 12, further comprising: program instructions to change, responsive to a precision of the chunk dot-product, a bit-width of an external accumulator in the accumulator array from a first bit-width to a second bit-width.

14. The computer usable program product of claim 12, further comprising: program instructions to change, responsive to a precision of values in the first chunk of the first input vector, a bit-width of the multiplier in the compute unit from a first bit-width to a second bit-width.

15. The computer usable program product of claim 12, further comprising: program instructions to change, responsive to a precision of a product expected to be produced by the multiplier, a bit-width of the accumulator in the compute unit from a first bit-width to a second bit-width.

16. The computer usable program product of claim 12, wherein the computer usable code is stored in a computer readable storage device in a data processing system, and wherein the computer usable code is transferred over a network from a remote data processing system.

17. The computer usable program product of claim 12, wherein the computer usable code is stored in a computer readable storage device in a server data processing system, and wherein the computer usable code is downloaded over a network to a remote data processing system for use in a computer readable storage device associated with the remote data processing system.

18. A computer system comprising a processor, a computer-readable memory, and a computer-readable storage device, and program instructions stored on the storage device for execution by the processor via the memory, the stored program instructions comprising: program instructions to configure a compute matrix comprising a set of compute units wherein each compute unit comprises a multiplier and an accumulator, each of the multiplier and the accumulator formed using at least one floating point unit (FPU); program instructions to configure an accumulator array comprising a set of external accumulators; program instructions to operate the compute matrix to produce a chunk dot-product using a first chunk of a first input vector and a first chunk of a second input vector; program instructions to operate the accumulator array to output a dot-product of the first input vector and the second input vector using the chunk dot-product; program instructions to chunk the first input vector into a first set of chunks, each chunk in the first set of chunks including a non-overlapping subset of values from a first set of values in the first input vector, wherein the first set of chunks includes the first chunk of the first input vector; and program instructions to chunk the second input vector into a second set of chunks, each chunk in the second set of chunks including a non-overlapping subset of values from a second set of values in the second input vector, wherein the second set of chunks includes the first chunk of the second input vector.
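The claims that recite an external accumulator with a wider bit-width than the accumulator inside the compute unit can be illustrated with a small numerical experiment. This is a sketch under stated assumptions, not the claimed hardware: IEEE 754 half precision (16-bit) is assumed for the multiplier and the in-unit accumulator, single precision (32-bit) for the external accumulator, and rounding is simulated with Python's `struct` module. The helper names are hypothetical. The claims do not say why a wider external accumulator is useful; the rounding effect shown here is one plausible reason.

```python
import struct
from typing import Sequence


def to_fp16(x: float) -> float:
    """Round x to the nearest IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def to_fp32(x: float) -> float:
    """Round x to the nearest IEEE 754 single-precision value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def dot_fp16_unchunked(a: Sequence[float], b: Sequence[float]) -> float:
    """Baseline: one 16-bit accumulator runs over the whole vector."""
    acc = 0.0
    for x, y in zip(a, b):
        acc = to_fp16(acc + to_fp16(x * y))
    return acc


def dot_chunked_mixed(a: Sequence[float], b: Sequence[float], size: int) -> float:
    """16-bit accumulation inside each chunk, 32-bit accumulation across chunks."""
    external = 0.0  # assumed wider external accumulator
    for i in range(0, len(a), size):
        inner = 0.0  # assumed narrow in-unit accumulator, reset per chunk
        for x, y in zip(a[i:i + size], b[i:i + size]):
            inner = to_fp16(inner + to_fp16(x * y))
        external = to_fp32(external + inner)
    return external


ones = [1.0] * 4096
unchunked = dot_fp16_unchunked(ones, ones)
chunked = dot_chunked_mixed(ones, ones, 64)
```

With 4096 ones, the exact dot-product is 4096. The unchunked 16-bit sum stalls at 2048.0, because half precision cannot represent 2049 and the sum rounds back to 2048 on every further addition. The chunked version keeps each 16-bit partial sum small (64.0) and returns 4096.0.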

Assignees

Inventors

Classifications

  • G06F7/5443 · Primary

    Sum of products (for applications thereof, see the relevant places, e.g. G06F17/10, H03H17/00) · CPC title

  • Computations with numbers represented by a non-linear combination of denominational numbers, e.g. rational numbers, logarithmic number system or floating-point numbers {(G06F7/4806, G06F7/4824, G06F7/49, G06F7/491, G06F7/544 take precedence)} · CPC title

  • Matrix or vector computation {, e.g. matrix-matrix or matrix-vector multiplication, matrix factorization (matrix transposition G06F7/78)} · CPC title

  • G06N3/08 · Primary

    Learning methods · CPC title

  • using electronic means · CPC title

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US10657442B2 cover?
A compute matrix is configured to include a set of compute units, each compute unit including a multiplier and an accumulator, each of the multiplier and the accumulator formed using at least one floating point unit (FPU). An accumulator array is configured to include a set of external accumulators. The compute matrix is operated to produce a chunk dot-product using a first chunk of a first inp…
Who is the assignee on this patent?
IBM
What technology area does this patent fall under?
Primary CPC classification G06F7/5443. Mapped technology areas include Physics.
When was this patent published?
Publication date May 19, 2020 (B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 7 related publications on this page (citations in our corpus or others sharing the same primary CPC).