Apparatus and method for determining a sector division ratio of a shared cache memory
US-2015339229-A1 · Nov 26, 2015 · US
US12007935B2 · US · B2
| Field | Value |
|---|---|
| Publication number | US-12007935-B2 |
| Application number | US-202017428523-A |
| Country | US |
| Kind code | B2 |
| Filing date | Mar 14, 2020 |
| Priority date | Mar 15, 2019 |
| Publication date | Jun 11, 2024 |
| Grant date | Jun 11, 2024 |
Official abstract text for this publication.
Graphics processors and graphics processing units having dot product accumulate instructions for a hybrid floating point format are disclosed. In one embodiment, a graphics multiprocessor comprises an instruction unit to dispatch instructions and a processing resource coupled to the instruction unit. The processing resource is configured to receive a dot product accumulate instruction from the instruction unit and to process the dot product accumulate instruction using a bfloat16 number (BF16) format.
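The operation described in the abstract can be illustrated with a small software model. The sketch below is not taken from the patent: the function names (`to_bf16_bits`, `bf16`, `f32`, `dot_product_accumulate`), the round-to-nearest-even conversion, and the choice to round the accumulator to single precision after every addition are all assumptions made for illustration. It relies only on the well-known fact that BF16 keeps the upper 16 bits of an IEEE 754 single-precision value (1 sign bit, 8 exponent bits, 7 mantissa bits).

```python
import struct


def to_bf16_bits(x: float) -> int:
    """Return the 16-bit BF16 pattern nearest to x (round-to-nearest-even, an assumption)."""
    if x != x:  # NaN: return a canonical quiet NaN pattern
        return 0x7FC0
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    # Add a bias so that dropping the low 16 bits rounds to nearest, ties to even.
    bias = 0x7FFF + ((bits >> 16) & 1)
    return ((bits + bias) >> 16) & 0xFFFF


def bf16(x: float) -> float:
    """Quantize x to BF16 and return the result as a Python float."""
    return struct.unpack("<f", struct.pack("<I", to_bf16_bits(x) << 16))[0]


def f32(x: float) -> float:
    """Round x to single precision."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def dot_product_accumulate(src0: float, src1: list, src2: list) -> float:
    """Hypothetical model of dst = src0 + sum(src1[i] * src2[i]).

    src1 and src2 are quantized to BF16; the accumulator is kept in
    single precision (one of the combinations listed in the claims).
    """
    if len(src1) != len(src2):
        raise ValueError("src1 and src2 must have the same length")
    acc = f32(src0)
    for a, b in zip(src1, src2):
        acc = f32(acc + bf16(a) * bf16(b))
    return acc


if __name__ == "__main__":
    print(hex(to_bf16_bits(1.0)))
    print(dot_product_accumulate(1.0, [1.0, 2.0], [3.0, 4.0]))
```

Actual hardware may round at different points or use a different rounding mode; the model only shows how BF16 multiplicands and a wider accumulator fit together.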
Opening claim text (preview).
What is claimed is:
1. A graphics multiprocessor, comprising: an instruction unit to dispatch instructions; and a processing resource coupled to the instruction unit, the processing resource of the graphics multiprocessor is configured to receive a dot product accumulate instruction from the instruction unit and to process the dot product accumulate instruction using a bfloat16 number (BF16) format.
2. The graphics multiprocessor of claim 1, wherein the dot product accumulate instruction causes a second source operand to multiply a third source operand while an accumulator adds a first source operand with output from multiplying the second source operand and the third source operand.
3. The graphics multiprocessor of claim 2, wherein the accumulator generates an output for a destination.
4. The graphics multiprocessor of claim 2, wherein the first source operand comprises a single-precision floating point format while at least one of the second and third source operands comprise BF16 format.
5. The graphics multiprocessor of claim 2 wherein the first source operand and the destination are half-precision floating point format, single-precision floating point format, or BF16 formats.
6. The graphics multiprocessor of claim 1 wherein the processing resource comprises a floating point unit (FPU) to execute the dot product accumulate instruction using the BF16 format.
7. The graphics multiprocessor of claim 1, wherein the instruction unit to dispatch instructions comprising single instruction multiple data (SIMD) instructions, wherein the processing resource is configured to apply a rectified linear unit function to a result of the add.
8. A general-purpose graphics processing unit (GPGPU) core comprising: a single precision floating-point unit for single precision floating point operations; and a half-precision floating point unit for half-precision floating point operations, the half-precision floating point unit of the GPGPU core is configured to execute a dot product accumulate instruction using a bfloat16 (BF16) format.
9. The GPGPU core of claim 8, wherein the dot product accumulate instruction causes first and second multipliers to each multiply second and third source operands while an accumulator adds a first source operand with output from each of the first and second multipliers.
10. The GPGPU core of claim 9, wherein the accumulator generates an output for a destination.
11. The GPGPU core of claim 9, wherein the first source operand comprises a single-precision floating point format while at least one of the second and third source operands comprise BF16 format.
12. The GPGPU core of claim 9 wherein the first source operand and the destination are half-precision floating point format, single-precision floating point format, or BF16 formats.
13. The GPGPU core of claim 8 wherein the dot product accumulate instruction causes a first stage of first and second BF16 multipliers to each multiply second and third source operands while an accumulator adds a first source operand with output from each of the first and second multipliers to generate an output of the first stage.
14. The GPGPU core of claim 13, wherein the dot product accumulate instruction for a cascaded arrangement with N stages of multipliers and accumulators causes a second stage of first and second multipliers to each multiply second and third source operands while an accumulator adds the output from the first stage with output from each of the first and second BF16 multipliers of the second stage.
15. A parallel processing unit comprising: a first processing cluster to perform parallel processing operations; and a second processing cluster coupled to the first processing cluster, wherein the first processing cluster of the parallel processing unit includes a floating-point unit to perform floating point operations, the floating-point unit is configured to process a dot product accumulate instruction using a bfloat16 (BF16) format.
16. The parallel processing unit of claim 15, wherein the dot product accumulate instruction causes first and second multipliers to each multiply second and third source operands while an accumulator adds a first source operand with output from each of the first and second multipliers.
17. The parallel processing unit of claim 16, wherein the accumulator generates an output for a destination.
18. The parallel processing unit of claim 16, wherein the first source operand comprises a single-precision floating point format while at least one of the second and third source operands comprise BF16 format.
19. The parallel processing unit of claim 16, wherein the first source operand and the destination are half-precision floating point format, single-precision floating point format, or BF16 formats.
20. The parallel processing unit of claim 15, wherein the dot product accumulate instruction causes a first stage of first and second multipliers to each multiply second and third source operands while an accumulator adds a first source operand with output from each of the first and second multipliers, wherein the dot product accumulate instruction for a cascaded arrangement with N stages of multipliers and accumulators causes a second stage of first and second multipliers to each multiply second and third source operands while an accumulator adds the output from the first stage with output from each of the first and second multipliers of the second stage.
21. The parallel processing unit of claim 15, wherein the floating-point unit comprises a cascaded arrangement with N stages of multipliers and accumulators.
22. The parallel processing unit of claim 21, wherein the N stages comprise: a first stage of first and second multipliers to each multiply second and third source operands and an accumulator to add a first source operand with output from each of the first and second multipliers to generate output of the first stage; and a second stage of first and second multipliers to each multiply second and third source operands and an accumulator to add the output from the first stage with output from each of the first and second multipliers of the second stage.
23. A computing device, comprising: input/output (I/O) devices; a central processing unit (CPU) coupled to the I/O devices; a graphics processing unit (GPU) coupled to the CPU, the GPU having a core that is configured to receive a dot product accumulate instruction and to process the dot product accumulate instruction using a bfloat16 number (BF16) format.
24. The computing device of claim 23, wherein the dot product accumulate instruction causes first and second multipliers to each multiply second and third source operands while an accumulator adds a first source operand with output from each of the first and second multipliers.
25. The computing device of claim 24, wherein the accumulator is configured to generate an output for a destination.
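The cascaded arrangement recited in claims 13, 14, and 20 to 22 can be modeled in software as a chain of stages, each with two multipliers feeding one accumulator. The sketch below is a hypothetical illustration, not the claimed hardware: the names `stage` and `cascaded_dpa`, the assignment of consecutive element pairs to stages, and the single rounding to single precision at each stage output are all assumptions.

```python
import struct


def f32(x: float) -> float:
    """Round x to single precision."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def bf16(x: float) -> float:
    """Quantize x to BF16 (round-to-nearest-even assumed; NaN inputs are not handled)."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    bits = (bits + 0x7FFF + ((bits >> 16) & 1)) & 0xFFFF0000
    return struct.unpack("<f", struct.pack("<I", bits))[0]


def stage(acc_in: float, pair0: tuple, pair1: tuple) -> float:
    """One stage: two multipliers and one accumulator."""
    p0 = bf16(pair0[0]) * bf16(pair0[1])  # first multiplier
    p1 = bf16(pair1[0]) * bf16(pair1[1])  # second multiplier
    # The accumulator adds the incoming value to both products.
    return f32(acc_in + p0 + p1)


def cascaded_dpa(src0: float, src1: list, src2: list) -> float:
    """Chain N = len(src1) // 2 stages; each stage consumes two element pairs."""
    if len(src1) != len(src2) or len(src1) % 2 != 0:
        raise ValueError("operands must have equal, even length")
    acc = f32(src0)  # the first stage adds the first source operand
    for i in range(0, len(src1), 2):
        # Each later stage adds the output of the previous stage.
        acc = stage(acc, (src1[i], src2[i]), (src1[i + 1], src2[i + 1]))
    return acc


if __name__ == "__main__":
    # Two stages: 0.5 + (1*1 + 2*1) + (3*1 + 4*1)
    print(cascaded_dpa(0.5, [1.0, 2.0, 3.0, 4.0], [1.0, 1.0, 1.0, 1.0]))
```

In this model the first stage receives the first source operand as its accumulator input, and every later stage receives the previous stage's output, which mirrors the wording of claim 22.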
Page size control · CPC title
Details relating to cache mapping · CPC title
Prefetching based on hints or prefetch instructions · CPC title
Prefetching based on access pattern detection, e.g. stride based prefetch · CPC title
Reconfiguration of cache memory · CPC title