Graphics processors and graphics processing units having dot product accumulate instruction for hybrid floating point format

US12007935B2 · US · B2

Patent metadata
Field               Value
Publication number  US-12007935-B2
Application number  US-202017428523-A
Country             US
Kind code           B2
Filing date         Mar 14, 2020
Priority date       Mar 15, 2019
Publication date    Jun 11, 2024
Grant date          Jun 11, 2024

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

Graphics processors and graphics processing units having dot product accumulate instructions for a hybrid floating point format are disclosed. In one embodiment, a graphics multiprocessor comprises an instruction unit to dispatch instructions and a processing resource coupled to the instruction unit. The processing resource is configured to receive a dot product accumulate instruction from the instruction unit and to process the dot product accumulate instruction using a bfloat16 number (BF16) format.
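For readers unfamiliar with the BF16 format named in the abstract, the following is a minimal sketch of how a value could be converted to and from it. BF16 keeps the sign bit, the 8 exponent bits, and the top 7 mantissa bits of an IEEE 754 single-precision value, so it is the upper 16 bits of the 32-bit pattern. The function names are hypothetical, and the round-to-nearest-even step is an assumption; this page does not state which rounding mode the patented hardware uses.

```python
import struct


def float_to_bf16_bits(x: float) -> int:
    """Return the 16-bit BF16 pattern for x (hypothetical helper).

    Assumes round-to-nearest-even; the patent page does not specify rounding.
    """
    # Reinterpret the value as its 32-bit single-precision bit pattern.
    bits32 = struct.unpack("<I", struct.pack("<f", x))[0]
    # NaN: keep it a NaN by forcing a mantissa bit instead of rounding.
    if (bits32 & 0x7F800000) == 0x7F800000 and (bits32 & 0x007FFFFF):
        return ((bits32 >> 16) | 0x0040) & 0xFFFF
    # Round to nearest, ties to even, then keep the upper 16 bits.
    rounding_bias = 0x7FFF + ((bits32 >> 16) & 1)
    return ((bits32 + rounding_bias) >> 16) & 0xFFFF


def bf16_bits_to_float(b: int) -> float:
    """Widen a 16-bit BF16 pattern back to a Python float (exact)."""
    return struct.unpack("<f", struct.pack("<I", (b & 0xFFFF) << 16))[0]


if __name__ == "__main__":
    for value in (1.0, 3.14159, -0.1):
        b = float_to_bf16_bits(value)
        print(f"{value!r} -> 0x{b:04X} -> {bf16_bits_to_float(b)!r}")
```

Because BF16 shares the single-precision exponent width, widening back to single precision is a plain 16-bit shift, which is one reason the format pairs naturally with a single-precision accumulator.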

First claim

Opening claim text (preview).

What is claimed is:
1. A graphics multiprocessor, comprising: an instruction unit to dispatch instructions; and a processing resource coupled to the instruction unit, the processing resource of the graphics multiprocessor is configured to receive a dot product accumulate instruction from the instruction unit and to process the dot product accumulate instruction using a bfloat16 number (BF16) format.
2. The graphics multiprocessor of claim 1, wherein the dot product accumulate instruction causes a second source operand to multiply a third source operand while an accumulator adds a first source operand with output from multiplying the second source operand and the third source operand.
3. The graphics multiprocessor of claim 2, wherein the accumulator generates an output for a destination.
4. The graphics multiprocessor of claim 2, wherein the first source operand comprises a single-precision floating point format while at least one of the second and third source operands comprise BF16 format.
5. The graphics multiprocessor of claim 2 wherein the first source operand and the destination are half-precision floating point format, single-precision floating point format, or BF16 formats.
6. The graphics multiprocessor of claim 1 wherein the processing resource comprises a floating point unit (FPU) to execute the dot product accumulate instruction using the BF16 format.
7. The graphics multiprocessor of claim 1, wherein the instruction unit to dispatch instructions comprising single instruction multiple data (SIMD) instructions, wherein the processing resource is configured to apply a rectified linear unit function to a result of the add.
8. A general-purpose graphics processing unit (GPGPU) core comprising: a single precision floating-point unit for single precision floating point operations; and a half-precision floating point unit for half-precision floating point operations, the half-precision floating point unit of the GPGPU core is configured to execute a dot product accumulate instruction using a bfloat16 (BF16) format.
9. The GPGPU core of claim 8, wherein the dot product accumulate instruction causes first and second multipliers to each multiply second and third source operands while an accumulator adds a first source operand with output from each of the first and second multipliers.
10. The GPGPU core of claim 9, wherein the accumulator generates an output for a destination.
11. The GPGPU core of claim 9, wherein the first source operand comprises a single-precision floating point format while at least one of the second and third source operands comprise BF16 format.
12. The GPGPU core of claim 9 wherein the first source operand and the destination are half-precision floating point format, single-precision floating point format, or BF16 formats.
13. The GPGPU core of claim 8 wherein the dot product accumulate instruction causes a first stage of first and second BF16 multipliers to each multiply second and third source operands while an accumulator adds a first source operand with output from each of the first and second multipliers to generate an output of the first stage.
14. The GPGPU core of claim 13, wherein the dot product accumulate instruction for a cascaded arrangement with N stages of multipliers and accumulators causes a second stage of first and second multipliers to each multiply second and third source operands while an accumulator adds the output from the first stage with output from each of the first and second BF16 multipliers of the second stage.
15. A parallel processing unit comprising: a first processing cluster to perform parallel processing operations; and a second processing cluster coupled to the first processing cluster, wherein the first processing cluster of the parallel processing unit includes a floating-point unit to perform floating point operations, the floating-point unit is configured to process a dot product accumulate instruction using a bfloat16 (BF16) format.
16. The parallel processing unit of claim 15, wherein the dot product accumulate instruction causes first and second multipliers to each multiply second and third source operands while an accumulator adds a first source operand with output from each of the first and second multipliers.
17. The parallel processing unit of claim 16, wherein the accumulator generates an output for a destination.
18. The parallel processing unit of claim 16, wherein the first source operand comprises a single-precision floating point format while at least one of the second and third source operands comprise BF16 format.
19. The parallel processing unit of claim 16, wherein the first source operand and the destination are half-precision floating point format, single-precision floating point format, or BF16 formats.
20. The parallel processing unit of claim 15, wherein the dot product accumulate instruction causes a first stage of first and second multipliers to each multiply second and third source operands while an accumulator adds a first source operand with output from each of the first and second multipliers, wherein the dot product accumulate instruction for a cascaded arrangement with N stages of multipliers and accumulators causes a second stage of first and second multipliers to each multiply second and third source operands while an accumulator adds the output from the first stage with output from each of the first and second multipliers of the second stage.
21. The parallel processing unit of claim 15, wherein the floating-point unit comprises a cascaded arrangement with N stages of multipliers and accumulators.
22. The parallel processing unit of claim 21, wherein the N stages comprise: a first stage of first and second multipliers to each multiply second and third source operands and an accumulator to add a first source operand with output from each of the first and second multipliers to generate output of the first stage; and a second stage of first and second multipliers to each multiply second and third source operands and an accumulator to add the output from the first stage with output from each of the first and second multipliers of the second stage.
23. A computing device, comprising: input/output (I/O) devices; a central processing unit (CPU) coupled to the I/O devices; a graphics processing unit (GPU) coupled to the CPU, the GPU having a core that is configured to receive a dot product accumulate instruction and to process the dot product accumulate instruction using a bfloat16 number (BF16) format.
24. The computing device of claim 23, wherein the dot product accumulate instruction causes first and second multipliers to each multiply second and third source operands while an accumulator adds a first source operand with output from each of the first and second multipliers.
25. The computing device of claim 24, wherein the accumulator is configured to generate an output for a destination.
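The arithmetic described in the claims (two multipliers feeding one accumulator per stage, stages cascaded so each accumulator takes the previous stage's output, and an optional rectified linear unit on the result) can be modelled in software roughly as follows. This is an illustrative sketch only, not the patented hardware or any real instruction set: the function names, the operand names `src0`/`src1`/`src2`, the single-precision accumulator, the round-to-nearest-even BF16 conversion, and the rounding of the sum once per stage are all assumptions made for the example.

```python
import struct


def to_f32(x: float) -> float:
    """Round a Python float (binary64) to the nearest single-precision value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def to_bf16(x: float) -> float:
    """Round x to BF16 and return it as a float.

    Assumes round-to-nearest-even; NaN inputs are not handled in this sketch.
    """
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    bits = (bits + 0x7FFF + ((bits >> 16) & 1)) & 0xFFFF0000
    return struct.unpack("<f", struct.pack("<I", bits))[0]


def dpa_stage(acc_in: float, a_pair, b_pair) -> float:
    """One stage: first and second multipliers feed a single accumulator."""
    p0 = to_bf16(a_pair[0]) * to_bf16(b_pair[0])  # first multiplier
    p1 = to_bf16(a_pair[1]) * to_bf16(b_pair[1])  # second multiplier
    # Assumption: the accumulator output is rounded to single precision.
    return to_f32(acc_in + p0 + p1)


def dot_product_accumulate(src0: float, src1, src2, relu: bool = False) -> float:
    """Cascade len(src1) // 2 stages; stage k adds to the output of stage k-1."""
    if len(src1) != len(src2) or len(src1) % 2 != 0:
        raise ValueError("src1 and src2 need equal, even lengths")
    acc = to_f32(src0)  # the first stage adds the first source operand
    for i in range(0, len(src1), 2):
        acc = dpa_stage(acc, src1[i:i + 2], src2[i:i + 2])
    # Optional rectified linear unit applied to the result of the add.
    return max(acc, 0.0) if relu else acc


if __name__ == "__main__":
    print(dot_product_accumulate(1.0, [1.0, 2.0, 3.0, 4.0], [0.5, 0.5, 0.5, 0.5]))
```

With four element pairs the loop models a two-stage cascade: the first stage adds `src0` to its two products, and the second stage adds the first stage's output to its own two products.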

Assignees

Inventors

Classifications

  • Page size control · CPC title

  • Details relating to cache mapping · CPC title

  • Prefetching based on hints or prefetch instructions · CPC title

  • Prefetching based on access pattern detection, e.g. stride based prefetch · CPC title

  • Reconfiguration of cache memory · CPC title

Patent family

Related publications grouped by family.

External sources

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US12007935B2 cover?
Graphics processors and graphics processing units having dot product accumulate instructions for a hybrid floating point format are disclosed. In one embodiment, a graphics multiprocessor comprises an instruction unit to dispatch instructions and a processing resource coupled to the instruction unit. The processing resource is configured to receive a dot product accumulate instruction fro…
Who is the assignee on this patent?
Intel Corp
What technology area does this patent fall under?
Primary CPC classification G06T1/20. Mapped technology areas include Physics.
When was this patent published?
Publication date Jun 11, 2024 (B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 12 related publications on this page (citations in our corpus or others sharing the same primary CPC).