Scaled compute fabric for accelerated deep learning

US11328207B2 · US · B2

Patent metadata
Field               Value
Publication number  US-11328207-B2
Application number  US-201917271801-A
Country             US
Kind code           B2
Filing date         Aug 11, 2019
Priority date       Aug 28, 2018
Publication date    May 10, 2022
Grant date          May 10, 2022

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

Techniques in advanced deep learning provide improvements in one or more of accuracy, performance, energy efficiency, and cost. In a first embodiment, a scaled array of processing elements is implementable with varying dimensions of the processing elements to enable varying price/performance systems. In a second embodiment, an array of clusters communicates via high-speed serial channels. The array and the channels are implemented on a Printed Circuit Board (PCB). Each cluster comprises respective processing and memory elements. Each cluster is implemented via a plurality of 3D-stacked dice, 2.5D-stacked dice, or both in a Ball Grid Array (BGA). A processing portion of the cluster is implemented via one or more Processing Element (PE) dice of the stacked dice. A memory portion of the cluster is implemented via one or more High Bandwidth Memory (HBM) dice of the stacked dice.
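
As a rough illustration of the second embodiment described in the abstract, the following sketch models a PCB carrying an array of clusters, each cluster being one BGA package with PE dice and HBM dice, joined by serial channels. The class names (`Cluster`, `Board`), the die counts, and the nearest-neighbour channel layout are assumptions made for illustration only; the abstract does not specify a channel topology or die counts.

```python
from dataclasses import dataclass, field


@dataclass
class Cluster:
    """One BGA package holding stacked PE dice and HBM dice (counts are illustrative)."""
    name: str
    pe_dice: int = 1
    hbm_dice: int = 1


@dataclass
class Board:
    """A PCB with a rows x cols array of clusters and the serial channels between them."""
    rows: int
    cols: int
    clusters: dict = field(default_factory=dict)
    channels: set = field(default_factory=set)

    def __post_init__(self):
        # Place one cluster at every grid position.
        for r in range(self.rows):
            for c in range(self.cols):
                self.clusters[(r, c)] = Cluster(name=f"cluster_{r}_{c}")
        # Assumption: one channel per pair of horizontally or vertically adjacent clusters.
        for (r, c) in self.clusters:
            for dr, dc in ((0, 1), (1, 0)):
                neighbour = (r + dr, c + dc)
                if neighbour in self.clusters:
                    self.channels.add(((r, c), neighbour))


board = Board(rows=2, cols=3)
print(len(board.clusters), "clusters,", len(board.channels), "channels")
# prints "6 clusters, 7 channels"
```

Changing `rows`, `cols`, or the per-cluster die counts gives differently sized configurations, which loosely mirrors the abstract's point about varying price/performance systems.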

First claim

Opening claim text (preview).

What is claimed is:

1. A system comprising: a plurality of processing clusters, each processing cluster comprising a respective plurality of processing elements, each processing element comprising a respective fabric router and a respective compute element collectively enabled to perform processing comprising dataflow-based processing and instruction-based processing; wherein each processing cluster comprises means for performing intra-cluster selective communication of fabric packets between all the processing elements of the respective processing cluster via an intra-cluster fabric communication technique; wherein each processing cluster comprises means for performing inter-cluster selective communication of fabric packets with all others of the processing clusters via an inter-cluster fabric communication technique; and wherein each compute element comprises means for selectively performing the processing in accordance with a virtual channel specifier and a task specifier of one or more of the selectively communicated fabric packets the respective compute element receives.

2. The system of claim 1, wherein the dataflow-based processing is in accordance with the virtual channel specifier, and the virtual channel specifier specifies in part one or more communication pathways between a plurality of the processing elements.

3. The system of claim 1, wherein the instruction-based processing is in accordance with the task specifier, and the task specifier specifies in part a starting address for fetching instructions executable by one or more of the compute elements.

4. The system of claim 1, wherein the inter-cluster fabric communication technique is compatible with inter-package communication between packages coupled via one or more printed circuit board substrates.

5. The system of claim 1, wherein the inter-cluster fabric communication technique is compatible with intra-package communication between dice coupled via one or more packages.

6. The system of claim 1, wherein each processing cluster is coupled to at least one of one or more memories and communicates with the coupled memories via a cluster-memory communication technique.

7. The system of claim 6, wherein the memories coupled to a particular one of the processing clusters are operable as a backing store for a software managed cache comprised in local memory of the processing elements of the particular processing cluster.

8. The system of claim 6, wherein at least one of the memories is enabled to store at least a portion of one or more of: a weight of a neural network, an activation of a neural network, a partial sum of activations of a neural network, an error of a neural network, a gradient estimate of a neural network, and a weight update of a neural network.

9. The system of claim 6, wherein the processing clusters and the memories are packaged so that each processing cluster and the memories coupled to the respective processing cluster are in a same respective package.

10. The system of claim 6, wherein the processing elements of a particular one of the processing clusters share access to the at least one of the memories the particular processing cluster is coupled to.

11. The system of claim 6, wherein the cluster-memory communication technique is compatible with intra-package communication.

12. The system of claim 6, wherein the intra-cluster fabric communication technique, the inter-cluster fabric communication technique, and the cluster-memory communication technique are distinct from each other.

13. The system of claim 6, wherein at least one of the memories is implemented in part via DRAM and each processing element comprises a respective one or more local memories implemented in part via SRAM.

14. The system of claim 6, wherein the memories are implemented in part via DRAM and the cluster-memory communication technique is implemented in accordance with high-bandwidth memory.

15. The system of claim 1, wherein at least some of the fabric packets selectively communicated between the processing elements of the respective processing cluster comprise at least one of the virtual channel specifier and the task specifier.

16. The system of claim 1, wherein at least some of the fabric packets selectively communicated between the processing clusters comprise at least one of the virtual channel specifier and the task specifier.

17. The system of claim 1, wherein: the fabric packets selectively communicated between the processing elements of the respective processing cluster are intra-cluster fabric packets; the fabric packets selectively communicated between the processing clusters are inter-cluster fabric packets; at least some of the inter-cluster fabric packets are in accordance with corresponding ones of the intra-cluster packets; and at least some of the inter-cluster fabric packets in accordance with corresponding ones of the intra-cluster packets comprise at least one of the virtual channel specifier and the task specifier.

18. A system comprising: a plurality of processing elements each comprising respective local memory, a respective fabric router, and a respective compute element enabled to perform dataflow-based processing and instruction-based processing, and wherein the processing elements are arranged into respective processor/memory clusters; a plurality of non-local memories, each processor/memory cluster comprising at least one of the non-local memories; and wherein each processor/memory cluster comprises means for performing intra-cluster selective communication of fabric packets between all the processing elements of the respective processor/memory cluster via an intra-cluster fabric communication technique at least in part via the fabric routers; wherein each processor/memory cluster further comprises means for performing inter-cluster selective communication of fabric packets with all others of the processor/memory clusters via an inter-cluster fabric communication technique; wherein each compute element comprises means for selectively performing the dataflow-based processing in accordance with virtual channel specifiers, and means for selectively performing the instruction-based processing in accordance with task specifiers; wherein at least some of the fabric packets communicated via the intra-cluster fabric communication technique comprise one or more of at least some of the virtual channel specifiers and at least some of the task specifiers; and wherein at least some of the fabric packets communicated via the inter-cluster fabric communication technique comprise one or more of at least some of the virtual channel specifiers and at least some of the task specifiers.

19. The system of claim 18, wherein a particular one of the virtual channel specifiers is comprised in a particular one of the fabric packets communicated via the intra-cluster fabric communication technique, and further comprising means for using the particular virtual channel specifier to determine, at least in part, which of a plurality of outputs of a particular one of the fabric routers to direct the particular fabric packet to, and wherein the means for using is comprised in the particular fabric router.

20. The system of claim 18, wherein a particular one of the fabric packets communicated via the intra-cluster fabric communication technique comprises a particular one of the task specifiers, and further comprising means for using the particular task specifier to determine, at least in part, which instructions to execute, and w
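
The claims describe fabric packets that carry a virtual channel specifier, which a fabric router uses to pick outputs, and a task specifier, which a compute element uses to pick which instructions to run. A minimal sketch of that division of labour might look like the following. All names here (`FabricPacket`, `ProcessingElement`, the port names `"east"` and `"local"`, the address `0x40`) are hypothetical and chosen for illustration; the claims do not define packet layouts, port names, or any API.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass(frozen=True)
class FabricPacket:
    virtual_channel: int        # virtual channel specifier
    task: Optional[int] = None  # task specifier, modelled here as a start address
    payload: float = 0.0


class ProcessingElement:
    """Toy processing element: a fabric router plus a compute element."""

    def __init__(self, name: str):
        self.name = name
        # Router state: virtual channel -> list of output port names (assumed encoding).
        self.routes: Dict[int, List[str]] = {}
        # Compute-element state: start address -> routine standing in for instructions.
        self.tasks: Dict[int, Callable[[float], float]] = {}

    def route(self, packet: FabricPacket) -> List[str]:
        """Router step: choose outputs from the virtual channel specifier."""
        return self.routes.get(packet.virtual_channel, [])

    def receive(self, packet: FabricPacket) -> Optional[float]:
        """Compute step: choose what to execute from the task specifier."""
        if packet.task is None or packet.task not in self.tasks:
            return None
        return self.tasks[packet.task](packet.payload)


pe = ProcessingElement("pe0")
pe.routes[3] = ["east", "local"]    # channel 3 forwards east and delivers locally
pe.tasks[0x40] = lambda x: 2.0 * x  # routine "located" at start address 0x40

pkt = FabricPacket(virtual_channel=3, task=0x40, payload=1.5)
outputs = pe.route(pkt)
result = pe.receive(pkt) if "local" in outputs else None
print(outputs, result)  # prints "['east', 'local'] 3.0"
```

Keeping the routing table and the task table separate reflects the split in the claims between dataflow-based processing (driven by the virtual channel specifier) and instruction-based processing (driven by the task specifier); how a real implementation encodes either is not stated in this preview.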

Assignees

Cerebras Systems Inc

Inventors

Classifications

  • Probabilistic or stochastic networks · CPC title

  • Activation functions · CPC title

  • Recurrent networks, e.g. Hopfield networks · CPC title

  • Combinations of networks · CPC title

  • Quantised networks; Sparse networks; Compressed networks · CPC title

Patent family

Related publications grouped by family.

External sources

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US11328207B2 cover?
Techniques in advanced deep learning provide improvements in one or more of accuracy, performance, energy efficiency, and cost. In a first embodiment, a scaled array of processing elements is implementable with varying dimensions of the processing elements to enable varying price/performance systems. In a second embodiment, an array of clusters communicates via high-speed serial channels. The a…
Who is the assignee on this patent?
Cerebras Systems Inc
What technology area does this patent fall under?
Primary CPC classification G06N3/105. Mapped technology areas include Physics.
When was this patent published?
Publication date May 10, 2022 (B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 12 related publications on this page (citations in our corpus or others sharing the same primary CPC).