Compute optimizations for low precision machine learning operations

US11308574B2 · US · B2

Patent metadata
Publication number: US-11308574-B2
Application number: US-202016983080-A
Country: US
Kind code: B2
Filing date: Aug 3, 2020
Priority date: Apr 28, 2017
Publication date: Apr 19, 2022
Grant date: Apr 19, 2022

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

Embodiments described herein provide a graphics processor that can perform a variety of mixed and multiple precision instructions and operations. One embodiment provides a streaming multiprocessor that can concurrently execute multiple thread groups, wherein the streaming multiprocessor includes a single instruction, multiple thread (SIMT) architecture and the streaming multiprocessor is to execute multiple threads for each of multiple instructions. The streaming multiprocessor can perform concurrent integer and floating-point operations and includes a mixed precision core to perform operations at multiple precisions.
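
One reason to support several precisions in the same core is that the precision of a running sum matters more than the precision of the inputs. The sketch below is an illustration only, not taken from the patent: it emulates IEEE 754 half- and single-precision rounding with Python's `struct` module and sums 4096 ones with each accumulator width. The helper names (`to_fp16`, `to_fp32`, `accumulate`) are hypothetical.

```python
import struct

def to_fp16(x: float) -> float:
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

def to_fp32(x: float) -> float:
    """Round a Python float to the nearest IEEE 754 single-precision value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]

def accumulate(values, round_fn):
    """Sum values sequentially, rounding the running sum after every add."""
    acc = 0.0
    for v in values:
        acc = round_fn(acc + v)
    return acc

# 4096 inputs that are exactly representable in half precision.
ones = [to_fp16(1.0)] * 4096

fp16_sum = accumulate(ones, to_fp16)  # low-precision accumulator
fp32_sum = accumulate(ones, to_fp32)  # higher-precision accumulator

print(fp16_sum, fp32_sum)  # prints: 2048.0 4096.0
```

The half-precision accumulator stalls at 2048 because adjacent half-precision values are 2 apart from that point, so adding 1 rounds back to 2048. The single-precision accumulator reaches the exact total.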

First claim

Opening claim text (preview).

What is claimed is:

1. An apparatus comprising: a semiconductor substrate; a 3D memory stack comprising a plurality of stacked memory dies integrated on the semiconductor substrate; a parallel processor die mounted on the semiconductor substrate; a local memory interconnect to couple the parallel processor die to the 3D memory stack, the local memory interconnect comprising a plurality of memory interfaces, each memory interface associated with a memory die of the plurality of stacked memory dies; wherein the parallel processor die comprises: an interconnect fabric comprising one or more crossbar switches; a memory controller coupled to the local memory interconnect via the memory interfaces to provide access to the 3D memory stack, the memory controller coupled to the interconnect fabric; an input/output (IO) interface coupled to the interconnect fabric; and an array of graphics processor compute units coupled to the interconnect fabric to process mixed-precision dot-product instructions, at least one graphics processor compute unit comprising: a plurality of packed data registers to store a plurality of packed data elements at a first precision; and mixed-precision execution circuitry to execute the mixed-precision dot-product instructions, the mixed-precision execution circuitry to perform a plurality of multiplications of different pairs of the plurality of packed data elements to generate a corresponding plurality of products and to add the corresponding plurality of products to an accumulation value stored at a second precision greater than the first precision to generate a result at the second precision.

2. The apparatus of claim 1 wherein at least some of the plurality of packed data elements comprise data elements of matrices.

3. The apparatus of claim 2 wherein the mixed-precision dot-product instructions are primitives of a machine learning framework.

4. The apparatus of claim 3 wherein the matrices are associated with a convolutional layer of the machine-learning framework.

5. The apparatus of claim 4 wherein the matrices associated with the convolutional layer of the machine-learning framework comprise a first matrix and a second matrix, each of the plurality of multiplications comprises a multiplication of a packed data element from the first matrix and a packed data element from the second matrix, the mixed-precision execution circuitry is to generate an output matrix comprising data elements generated by a multiplication of each packed data element from the first matrix and each packed data element from the second matrix, the mixed-precision execution circuitry is to evaluate an activation function based on the output matrix, and the activation function is a primitive of the machine learning framework.

6. The apparatus of claim 3 wherein the machine learning framework comprises a neural network.

7. The apparatus of claim 6 wherein the neural network comprises a recurrent neural network (RNN).

8. The apparatus of claim 1 further comprising: virtualization circuitry to share the array of graphics processor compute units with a plurality of virtual machines.

9. The apparatus of claim 8 wherein the virtualization circuitry comprises multiple sets of control registers to be associated with multiple corresponding virtual machines, a group of control registers to store one or more address pointers to identify a region of memory associated with a corresponding virtual machine.

10. The apparatus of claim 1 wherein a memory interface comprises a physical memory channel, and wherein one or more virtual memory channels are to be associated with a physical memory channel.

11. The apparatus of claim 1 further comprising: a cache hierarchy to store data for the array of graphics processor compute units, the cache hierarchy including an L1 cache and an L2 cache to be shared between the array of graphics processor compute units.

12. The apparatus of claim 1 further comprising: memory management circuitry to map physical addresses of the 3D memory stack to a shared virtual memory space and to access the physical addresses using shared virtual memory (SVM) technology.

13. The apparatus of claim 1 further comprising: an input/output memory management unit (IOMMU) coupled to the interconnect fabric, the IOMMU comprising a translation buffer to store virtual-to-physical address translations to access the 3D memory stack.

14. The apparatus of claim 13 wherein a first one or more virtual-to-physical address translations are to identify regions in the 3D memory stack and wherein a second one or more virtual-to-physical address translations are to identify regions in a system memory device.

15. The apparatus of claim 14 wherein the array of graphics processor compute units are to connect to the system memory device via the IO interface.

16. The apparatus of claim 1, wherein the 3D memory stack comprises a High Bandwidth Memory (HBM) memory device.

17. A graphics processor comprising: a semiconductor substrate; a parallel processor die mounted on the semiconductor substrate, the parallel processor die comprising: an interconnect fabric comprising one or more crossbar switches; an input/output (IO) interface coupled to the interconnect fabric; and an array of graphics processor compute units coupled to the interconnect fabric to process mixed-precision dot-product instructions, at least one graphics processor compute unit comprising: a plurality of packed data registers to store a plurality of packed data elements at a first precision; and mixed-precision execution circuitry to execute the mixed-precision dot-product instructions, the mixed-precision execution circuitry to perform a plurality of multiplications of different pairs of the plurality of packed data elements to generate a corresponding plurality of products and to add the corresponding plurality of products to an accumulation value stored at a second precision greater than the first precision to generate a result at the second precision.

18. The graphics processor of claim 17 further comprising: a 3D memory stack comprising a plurality of stacked memory dies integrated on the semiconductor substrate; a local memory interconnect to couple the parallel processor die to the 3D memory stack, the local memory interconnect comprising a plurality of memory interfaces, each memory interface associated with a memory die of the plurality of stacked memory dies; and a memory controller coupled to the local memory interconnect via the memory interfaces to provide access to the 3D memory stack, the memory controller coupled to the interconnect fabric.

19. The graphics processor of claim 18 wherein a memory interface comprises a physical memory channel and one or more virtual memory channels are to be associated with a physical memory channel, and the graphics processor further comprises: a cache hierarchy to store data for the array of graphics processor compute units, the cache hierarchy including an L1 cache and an L2 cache to be shared between the array of graphics processor compute units; memory management circuitry to map physical addresses of the 3D memory stack to a shared virtual memory space and to access the physical addresses using shared virtual memory (SVM) technology; and an input/output memory management unit (IOMMU) coupled to the interconnect fabric, the IOMMU comprising a translation buffer to store virtual-to-physical address translations to access the 3D memory stack.

20. The graphics processor of cla
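
The operation recited in claims 1 and 17 (multiply pairs of packed low-precision elements, then add the products to a higher-precision accumulation value) can be sketched in software. This is a minimal model under stated assumptions, not the claimed circuitry. It assumes half precision as the "first precision" and single precision as the "second precision", models a packed data register as a byte string, and rounds the accumulator after every add. The claim text fixes none of these choices, and all function names are hypothetical.

```python
import struct

def pack_fp16(values):
    """Pack floats into little-endian fp16 elements (stand-in for a packed register)."""
    return struct.pack(f"<{len(values)}e", *values)

def unpack_fp16(register):
    """Unpack a byte string of fp16 elements into Python floats."""
    return list(struct.unpack(f"<{len(register) // 2}e", register))

def to_fp32(x):
    """Round a Python float to the nearest single-precision value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]

def dot_accumulate(reg_a, reg_b, acc):
    """Hypothetical model of one mixed-precision dot-product instruction.

    Multiplies element i of reg_a by element i of reg_b and adds each
    product to an accumulator that is kept at single precision.
    """
    a = unpack_fp16(reg_a)
    b = unpack_fp16(reg_b)
    if len(a) != len(b):
        raise ValueError("registers must hold the same number of elements")
    result = to_fp32(acc)
    for x, y in zip(a, b):
        # Assumption: the accumulator is rounded to fp32 after every add.
        result = to_fp32(result + x * y)
    return result

reg_a = pack_fp16([1.0, 2.0, 3.0, 4.0])
reg_b = pack_fp16([0.5, 0.25, 2.0, 1.0])
result = dot_accumulate(reg_a, reg_b, acc=10.0)
print(result)  # prints: 21.0
```

Here each register holds four 16-bit elements in 8 bytes, and the result is 10 + 0.5 + 0.5 + 6 + 4 = 21. Real hardware may fuse the multiplies and adds or round differently. The sketch only shows the data flow from low-precision inputs to a higher-precision result.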

Assignees

Intel Corp

Inventors

Classifications

  • Combinations of networks

  • Recurrent networks, e.g. Hopfield networks

  • Learning methods

  • Convolutional networks [CNN, ConvNet]

  • controlled by a single instruction for multiple data lanes [SIMD]

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US11308574B2 cover?
Embodiments described herein provide a graphics processor that can perform a variety of mixed and multiple precision instructions and operations. One embodiment provides a streaming multiprocessor that can concurrently execute multiple thread groups, wherein the streaming multiprocessor includes a single instruction, multiple thread (SIMT) architecture and the streaming multiprocessor is to exe…
Who is the assignee on this patent?
Intel Corp
What technology area does this patent fall under?
Primary CPC classification G06T1/20. Mapped technology areas include Physics.
When was this patent published?
Publication date Apr 19, 2022 (B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 12 related publications on this page (citations in our corpus or others sharing the same primary CPC).