US11093579B2 · US · B2
| Field | Value |
|---|---|
| Publication number | US-11093579-B2 |
| Application number | US-201816122030-A |
| Country | US |
| Kind code | B2 |
| Filing date | Sep 5, 2018 |
| Priority date | Sep 5, 2018 |
| Publication date | Aug 17, 2021 |
| Grant date | Aug 17, 2021 |
Official abstract text for this publication.
Disclosed embodiments relate to mixed-precision vector multiply-accumulate (MPVMAC). In one example, a processor includes fetch circuitry to fetch a compress instruction having fields to specify locations of a source vector having N single-precision formatted elements, and a compressed vector having N neural half-precision (NHP) formatted elements, decode circuitry to decode the fetched compress instruction, execution circuitry to respond to the decoded compress instruction by: converting each element of the source vector into the NHP format and writing each converted element to a corresponding compressed vector element, wherein the processor is further to fetch, decode, and execute a MPVMAC instruction to multiply corresponding NHP-formatted elements using a 16-bit multiplier, and accumulate each of the products with previous contents of a corresponding destination using a 32-bit accumulator.
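The compress, expand, and MPVMAC steps can be illustrated with a small software model. This is only a sketch under stated assumptions, not the patented hardware: it assumes the NHP layout is one sign bit followed by the eight exponent bits and seven significand bits named in the claims (that is, the upper 16 bits of a binary32 value), it uses round to nearest with ties to even as the rounding mode, and it assumes the "16-bit product" is the product rounded back to NHP before accumulation. The function names are hypothetical.

```python
import struct

def f32_bits(x: float) -> int:
    """Return the 32-bit pattern of x rounded to IEEE 754 binary32."""
    return struct.unpack("<I", struct.pack("<f", x))[0]

def bits_f32(bits: int) -> float:
    """Interpret a 32-bit pattern as a binary32 value."""
    return struct.unpack("<f", struct.pack("<I", bits & 0xFFFFFFFF))[0]

def compress_to_nhp(x: float) -> int:
    """Convert one single-precision value to a 16-bit NHP pattern.

    Assumed layout: 1 sign bit, 8 exponent bits, 7 significand bits.
    Rounding: round to nearest, ties to even.
    """
    bits = f32_bits(x)
    is_nan = (bits & 0x7F800000) == 0x7F800000 and (bits & 0x007FFFFF) != 0
    if is_nan:
        # Keep NaN a NaN after truncation by forcing a significand bit on.
        return ((bits >> 16) | 0x0040) & 0xFFFF
    # Add 0x7FFF plus the lowest kept bit so exact ties round to even.
    bias = 0x7FFF + ((bits >> 16) & 1)
    return ((bits + bias) >> 16) & 0xFFFF

def expand_from_nhp(h: int) -> float:
    """Convert a 16-bit NHP pattern back to single precision (exact)."""
    return bits_f32((h & 0xFFFF) << 16)

def mpvmac(dst, src1, src2):
    """Element-wise multiply-accumulate of two NHP vectors into dst.

    dst holds single-precision values; src1 and src2 hold NHP patterns.
    Assumption: the product is rounded to NHP, then the sum to binary32.
    """
    out = []
    for d, a, b in zip(dst, src1, src2):
        product = expand_from_nhp(a) * expand_from_nhp(b)
        product16 = expand_from_nhp(compress_to_nhp(product))
        out.append(bits_f32(f32_bits(d + product16)))
    return out

# Compress two source vectors, then accumulate their products.
a = [compress_to_nhp(v) for v in (2.0, 0.5, -1.5)]
b = [compress_to_nhp(v) for v in (3.0, 4.0, 2.0)]
acc = mpvmac([1.0, 1.0, 1.0], a, b)
```

Because the assumed NHP pattern is the top half of a binary32 pattern, expansion is a shift with no rounding, which is why only the compress direction needs a rounding mode in this model.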
Opening claim text (preview).
What is claimed is:

1. A processor comprising: fetch circuitry to fetch a compress instruction having fields to specify locations of a source vector having N single-precision formatted elements, and a compressed vector having N neural half-precision (NHP) formatted elements; decode circuitry to decode the fetched compress instruction; execution circuitry to respond to the decoded compress instruction by: converting each element of the source vector into the NHP format; rounding each converted element according to a rounding mode; and writing each rounded element to a corresponding compressed vector element; wherein the NHP format comprises seven significand bits, and eight exponent bits; wherein the source and compressed vectors are each either in memory or in registers; wherein the fetch, decode, and execution circuitry are further to fetch, decode, and execute a second compress instruction specifying locations of a second source vector having N elements formatted according to the single-precision format, and a second compressed vector having N elements formatted according to the NHP format; wherein the fetch and decode circuitry is further to fetch and decode a mixed-precision vector multiply-accumulate (MPVMAC) instruction having fields to specify first and second source vectors having N NHP-formatted elements, and a destination vector having N single-precision-formatted elements; wherein the specified source vectors are the compressed vector and the second compressed vector; and wherein the execution circuitry is further to respond to the decoded MPVMAC instruction, for each of the N elements, by generating a 16-bit product of the compressed vector element and the second compressed vector element and accumulating the generated 16-bit product with previous contents of a corresponding element of the destination vector.

2. The processor of claim 1, wherein the MPVMAC instruction further has a field to specify a writemask, the specified writemask comprising N bits, each bit to identify either when the corresponding element of the destination vector is unmasked and to be written with the generated 16-bit product, or when the corresponding element of the destination vector is mapped and is either to be zeroed or merged.

3. The processor of claim 1, wherein the fetch circuitry is further to fetch an expand instruction having fields to specify locations of a destination vector having N elements formatted according to the single-precision format and the compressed vector; wherein the processor further comprises: decode circuitry to decode the fetched expand instruction; and execution circuitry to respond to the decoded expand instruction by: converting each element of the compressed vector into the single-precision format; and writing each converted element to a corresponding destination vector element.

4. The processor of claim 1, wherein the single-precision format is a binary32 format standardized by the Institute of Electrical and Electronics Engineers (IEEE) as part of the IEEE 754-2008 standard.

5. The processor of claim 4, wherein the rounding mode is specified by the IEEE 754 standard and is one of round to nearest with ties to even, round to nearest with ties away from zero, round toward zero, round toward positive infinity, and round toward negative infinity, and wherein the rounding mode is specified either on a per-instruction basis by an immediate value specified by the instruction, or on an embedded basis by a software-programmable control and status register.

6. The processor of claim 1, wherein the specified source and compressed vectors each occupy one or rows of a matrix having M rows by N columns.

7. The processor of claim 1, wherein the execution circuitry is further to perform rounding when converting, accumulating, and multiplying, according to the rounding mode.

8. The processor of claim 1, wherein the rounding mode is one of round to nearest even, round toward negative infinity, round toward positive infinity and round toward zero, and wherein the rounding mode is specified either on a per-instruction basis by an immediate value specified by the instruction, or on an embedded basis by a software-programmable control and status register.

9. The processor of claim 1, wherein the execution circuitry is further to perform saturation, as necessary, when accumulating and multiplying.

10. A method comprising: fetching, using fetch circuitry, a compress instruction having fields to specify locations of a source vector having N single-precision formatted elements, and a compressed vector having N neural half-precision (NHP) formatted elements; decoding, using decode circuitry, the fetched compress instruction; responding, using execution circuitry, to the decoded compress instruction by: converting each element of the source vector into the NHP format; rounding each converted element according to a rounding mode; writing each rounded element to a corresponding compressed vector element; wherein the NHP format comprises seven significand bits, and eight exponent bits; wherein the source and compressed vectors are each either in memory or in registers; fetching, decoding, and executing, using the fetch, decode, and execution circuitry, a second compress instruction specifying locations of a second source vector having N elements formatted according to the single-precision format, and a second compressed vector having N elements formatted according to the NHP format; fetching and decoding, using the fetch and decode circuitry, a mixed-precision vector multiply-accumulate (MPVMAC) instruction having fields to specify first and second source vectors having N NHP-formatted elements, and a destination vector having N single-precision-formatted elements, wherein the specified source vectors are the compressed vector and the second compressed vector; and responding, using the execution circuitry, to the decoded MPVMAC instruction, for each of the N elements, by generating a 16-bit product of the compressed vector element and the second compressed vector element, and accumulating the generated 16-bit product with previous contents of a corresponding element of the destination vector.

11. The method of claim 10, wherein the MPVMAC instruction further has a field to specify a writemask, the specified writemask comprising N bits, each bit to identify either when the corresponding element of the destination vector is unmasked and to be written with the generated 16-bit product, or when the corresponding element of the destination vector is mapped and is either to be zeroed or merged.

12. The method of claim 10, further comprising: fetching, using the fetch circuitry, an expand instruction having fields to specify locations of a destination vector having N elements formatted according to the single-precision format and the compressed vector; decoding, using decode circuitry, the fetched expand instruction; responding, using execution circuitry, to the decoded expand instruction by: converting each element of the compressed vector into the single-precision format; and writing each converted element to a corresponding destination vector element.

13. The method of claim 10, wherein the single-precision format is a binary32 format standardized by the Institute of Electrical and Electronics Engineers (IEEE) as part of the IEEE 754-2008 standard.

14. The method of claim 13, wherein the rounding mode is specified by the IEEE 754 standard and is one of round to nearest with ties to even, round to nearest with ties away from zero, round toward zero, round toward positive infin
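The writemask behavior recited in claims 2 and 11 can be sketched in a few lines. This is a hedged illustration only: the function name `apply_writemask`, the `zeroing` flag, and the choice that bit i of the mask controls element i are assumptions made for the example, and the claim text does not fix a bit ordering or say how zeroing versus merging is selected.

```python
def apply_writemask(old_dst, new_values, mask, zeroing=False):
    """Select, per element, between a new result and the masked behavior.

    old_dst:    previous contents of the destination vector (N elements)
    new_values: freshly computed results for each element (N elements)
    mask:       integer whose bit i is 1 when element i is unmasked (assumed)
    zeroing:    if True, masked elements become 0.0; otherwise they keep
                their previous contents (merging)
    """
    out = []
    for i, (old, new) in enumerate(zip(old_dst, new_values)):
        if (mask >> i) & 1:
            out.append(new)      # unmasked: write the new result
        elif zeroing:
            out.append(0.0)      # masked, zeroing: clear the element
        else:
            out.append(old)      # masked, merging: keep the old value
    return out

old = [1.0, 2.0, 3.0, 4.0]
new = [10.0, 20.0, 30.0, 40.0]
merged = apply_writemask(old, new, 0b0101)
zeroed = apply_writemask(old, new, 0b0101, zeroing=True)
```

With mask `0b0101`, elements 0 and 2 take the new values in both calls; elements 1 and 3 keep their old contents under merging and become zero under zeroing.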
Learning methods · CPC title
Quantised networks; Sparse networks; Compressed networks · CPC title
Convolutional networks [CNN, ConvNet] · CPC title
using a mask · CPC title
Instructions to perform operations on packed data, e.g. vector, tile or matrix operations · CPC title