Task synchronization for accelerated deep learning

US12314218B2 · US · B2

Patent metadata
Field               Value
Publication number  US-12314218-B2
Application number  US-202117367879-A
Country             US
Kind code           B2
Filing date         Jul 6, 2021
Priority date       Apr 17, 2017
Publication date    May 27, 2025
Grant date          May 27, 2025

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

Techniques in advanced deep learning provide improvements in one or more of accuracy, performance, and energy efficiency. An array of processing elements performs flow-based computations on wavelets of data. Each processing element has a compute element and a routing element. Each compute element has memory. Each router enables communication via wavelets with at least nearest neighbors in a 2D mesh. Routing is controlled by respective virtual channel specifiers in each wavelet and routing configuration information in each router. A compute element conditionally selects for task initiation a previously received wavelet specifying a particular one of the virtual channels. The conditional selecting excludes the previously received wavelet for selection until at least block/unblock state maintained for the particular virtual channel is in an unblock state. The compute element executes block/unblock instructions to modify the block/unblock state.
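The block/unblock mechanism in the abstract can be illustrated with a toy model. The sketch below is not the patented implementation; it assumes one input queue and one block bit per virtual channel, and a fixed lowest-channel-first scan when picking. The names `Wavelet`, `ComputeElement`, `receive`, `block`, `unblock`, and `pick` are hypothetical and chosen only for illustration.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Wavelet:
    """Hypothetical stand-in for a wavelet (fabric packet)."""
    channel: int   # virtual channel specifier
    payload: int


@dataclass
class ComputeElement:
    """Toy model: one input queue and one block bit per virtual channel."""
    num_channels: int
    queues: list = field(init=False)
    blocked: list = field(init=False)

    def __post_init__(self):
        self.queues = [deque() for _ in range(self.num_channels)]
        self.blocked = [False] * self.num_channels

    def receive(self, wavelet):
        # A received wavelet waits in the queue of its virtual channel.
        self.queues[wavelet.channel].append(wavelet)

    def block(self, *channels):
        # Models a block instruction naming one or more channels.
        for c in channels:
            self.blocked[c] = True

    def unblock(self, *channels):
        # Models an unblock instruction naming one or more channels.
        for c in channels:
            self.blocked[c] = False

    def pick(self):
        """Return the next wavelet on an unblocked channel, or None.

        Assumption: channels are scanned in ascending order. The source
        does not specify the selection order.
        """
        for c, q in enumerate(self.queues):
            if q and not self.blocked[c]:
                return q.popleft()
        return None


ce = ComputeElement(num_channels=4)
ce.block(2)
ce.receive(Wavelet(channel=2, payload=10))
ce.receive(Wavelet(channel=3, payload=20))
first = ce.pick()    # channel 2 is blocked, so the channel 3 wavelet is picked
second = ce.pick()   # only the blocked wavelet remains, so nothing is picked
ce.unblock(2)
third = ce.pick()    # the channel 2 wavelet is now eligible
```

In this model a blocked wavelet is never dropped. It stays queued until its channel is unblocked, which is how a task can be stalled and later resumed.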

First claim

Opening claim text (preview).

What is claimed is:

1. A method comprising: sending a first fabric packet by a sending processing element, the first fabric packet comprising a virtual channel specifier specifying one of a plurality of virtual channels; routing the first fabric packet from the sending processing element to a receiving processing element via one or more of routing elements in accordance with the virtual channel specifier; in the receiving processing element, picking for processing the first fabric packet, the picking in accordance with a respective block/unblock state comprised in the receiving processing element and maintained for each of the plurality of virtual channels, the picking excluding the first fabric packet from being picked until at least the respective block/unblock state maintained for the specified virtual channel is in the unblock state; wherein the sending processing element implements at least a portion of a first node of a plurality of nodes of a dataflow graph and the receiving processing element implements at least a portion of a second node of the dataflow graph, and the sending processing element and the receiving processing element are respective instances of a plurality of processing elements interconnected as a fabric; and wherein the first fabric packet is an instance of a plurality of fabric packets and the routing of the first fabric packet is an instance of routing between the processing elements of the fabric, and wherein the routing between the processing elements of the fabric comprises each processing element transmitting each of one or more of the plurality of fabric packets over a selected group of preconfigured groups of one or more physical couplings between connected neighbors of the processing element, and the virtual channel specifier of each transmitted fabric packet is used to select the group.

2. The method of claim 1, further comprising: receiving in the receiving processing element a second fabric packet, the second fabric packet enabling execution of a selected one of a block instruction and an unblock instruction, the selected instruction having been stored in the receiving processing element prior to the receiving and comprising an immediate source operand specifying one or more of the virtual channels; decoding the selected instruction in the receiving processing element; and in the receiving processing element, setting the respective block/unblock state maintained for the specified one or more of the virtual channels in accordance with the decoding.

3. The method of claim 1, further comprising: receiving in the receiving processing element a second fabric packet of the fabric packets, the second fabric packet enabling execution of a selected one of a block instruction and an unblock instruction, the selected instruction having been stored in the receiving processing element prior to the receiving and comprising other than an immediate source operand; decoding the selected instruction in the receiving processing element; and in the receiving processing element, setting the respective block/unblock state maintained for all the virtual channels in accordance with the decoding.

4. The method of claim 1, further comprising, in the receiving processing element, setting the respective block/unblock state maintained for a particular virtual channel of the virtual channels to a blocked state in response to a block instruction specifying the particular virtual channel of the virtual channels.

5. The method of claim 1, further comprising, in the receiving processing element, setting the respective block/unblock state maintained for a particular virtual channel of the virtual channels to an unblocked state in response to an unblock instruction specifying the particular virtual channel of the virtual channels.

6. The method of claim 1, wherein the sending processing element and the receiving processing element are fabricated via wafer-scale integration on separate die of a single wafer.

7. The method of claim 1, wherein the sending processing element implements at least a portion of a first neuron of a neural network and the receiving processing element implements at least a portion of a second neuron of the neural network.

8. The method of claim 1, wherein the sending processing element implements at least a portion of a first layer of a neural network and the receiving processing element implements at least a portion of a second layer of the neural network.

9. The method of claim 1, wherein the sending processing element and the receiving processing element implement respective portions of at least a partitioned neuron of a neural network.

10. The method of claim 1, further comprising, with respect to the receiving processing element, managing fabric packet input queues to have generally equal average rates of production and consumption by stalling/resuming task activities via manipulation of the respective block/unblock state.

11. The method of claim 1, further comprising, with respect to the receiving processing element, managing one or more priorities within and between tasks by stalling/resuming task activities via manipulation of the respective block/unblock state.

12. The method of claim 1, further comprising, with respect to the receiving processing element, managing one or more dependencies within and between tasks by stalling/resuming task activities via manipulation of the respective block/unblock state.

13. The method of claim 1, further comprising, with respect to the receiving processing element, synchronizing one or more of computations and communications of one or more tasks, via manipulation of the respective block/unblock state.

14. The method of claim 1, further comprising, with respect to the receiving processing element, implementing task software interlocks via manipulation of the respective block/unblock state.

15. The method of claim 1, further comprising synchronizing data sourced via unequal delay paths by manipulation of the respective block/unblock state of one or more of the processing elements along the delay paths.

16. The method of claim 1, further comprising shaping at least some dataflow in at least part of the fabric by manipulation of the respective block/unblock state of one or more of the processing elements.

17. The method of claim 1, further comprising: wherein the one of the plurality of virtual channels is used for communicating at least one of control and data associated with one or more of: computing an activation of a neural network, computing a partial sum of activations of the neural network, computing an error of the neural network, computing a gradient estimate of the neural network, and updating a weight of the neural network; and wherein the first fabric packet comprises the at least one of control and data associated with the one or more of: computing the activation of the neural network, computing the partial sum of activations of the neural network, computing the error of the neural network, computing the gradient estimate of the neural network, and updating the weight of the neural network.

18. The method of claim 1, wherein the sending processing element, the one or more of the routing elements, and the receiving processing element are fabricated via wafer-scale integration.

19. The method of claim 1, further comprising: initializing the fabric with all parameters and task software required for concurrent execution of communications and computations respectively corresponding to the dataflow graph; and concurrently executing all layers
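
The routing language at the end of claim 1, in which the virtual channel specifier selects one of several preconfigured groups of physical couplings, can be sketched as a per-router lookup table. This is an illustrative model only, not the claimed hardware. The `Link`, `Router`, and `deliver` names are hypothetical, and `Link.LOCAL` (hand-off to the local compute element) is an assumption added so the example has a destination.

```python
from enum import Enum


class Link(Enum):
    NORTH = "north"
    SOUTH = "south"
    EAST = "east"
    WEST = "west"
    LOCAL = "local"   # assumption: delivery to this element's compute element


# Coordinate change for each neighbor link in a 2D mesh (x grows east, y grows south).
STEP = {
    Link.NORTH: (0, -1),
    Link.SOUTH: (0, 1),
    Link.EAST: (1, 0),
    Link.WEST: (-1, 0),
}


class Router:
    """Holds the preconfigured groups: virtual channel -> set of output links."""

    def __init__(self, config):
        self.config = {ch: frozenset(links) for ch, links in config.items()}

    def output_links(self, channel):
        # The virtual channel specifier selects the group of links to use.
        return self.config.get(channel, frozenset())


def deliver(routers, start, channel):
    """Return the coordinates whose compute element receives the packet."""
    delivered, seen, frontier = set(), set(), [start]
    while frontier:
        pos = frontier.pop()
        if pos in seen:          # simplification: visit each router once
            continue
        seen.add(pos)
        for link in routers[pos].output_links(channel):
            if link is Link.LOCAL:
                delivered.add(pos)
            else:
                dx, dy = STEP[link]
                frontier.append((pos[0] + dx, pos[1] + dy))
    return delivered


# A 3x1 row of processing elements. Channel 5 fans out eastward;
# channel 6 is delivered locally at the sender.
routers = {
    (0, 0): Router({5: {Link.EAST}, 6: {Link.LOCAL}}),
    (1, 0): Router({5: {Link.EAST, Link.LOCAL}}),
    (2, 0): Router({5: {Link.LOCAL}}),
}
reached_ch5 = deliver(routers, (0, 0), channel=5)
reached_ch6 = deliver(routers, (0, 0), channel=6)
```

Because the group for a channel may contain more than one link, the same table expresses both point-to-point paths and fan-out, with no address carried in the packet itself.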

Assignees

Cerebras Systems Inc

Inventors

Classifications

  • Probabilistic or stochastic networks · CPC title

  • Learning methods · CPC title

  • Convolutional networks [CNN, ConvNet] · CPC title

  • Generative networks · CPC title

  • Quantised networks; Sparse networks; Compressed networks · CPC title

Patent family

Related publications grouped by family.

External sources

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US12314218B2 cover?
Techniques in advanced deep learning provide improvements in one or more of accuracy, performance, and energy efficiency. An array of processing elements performs flow-based computations on wavelets of data. Each processing element has a compute element and a routing element. Each compute element has memory. Each router enables communication via wavelets with at least nearest neighbors in a 2D …
Who is the assignee on this patent?
Cerebras Systems Inc
What technology area does this patent fall under?
Primary CPC classification G06N3/063. Mapped technology areas include Physics.
When was this patent published?
Publication date May 27, 2025 (B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 12 related publications on this page (citations in our corpus or others sharing the same primary CPC).