What technology area does this patent fall under?

Primary CPC classification G06N3/098. Mapped technology areas include Physics.

When was this patent published?

Publication date: Mar 19, 2024 (kind code B2). Legal status and post-grant events are not shown on this page.

What related patents are in patentsdb?

We list 12 related publications on this page (citations in our corpus or others sharing the same primary CPC).

Accelerated deep learning

US11934945B2 · US · B2

Patent metadata
Field	Value
Publication number	US-11934945-B2
Application number	US-201816481016-A
Country	US
Kind code	B2
Filing date	Feb 23, 2018
Priority date	Feb 23, 2017
Publication date	Mar 19, 2024
Grant date	Mar 19, 2024
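
The publication number in the table bundles three of the other fields: the country (US), the serial number, and the kind code (B2). The sketch below splits it apart. It assumes only the two spellings that appear on this page (`US-11934945-B2` and `US11934945B2`); the function and type names are hypothetical, and other patent offices may use formats this pattern does not cover.

```python
import re
from typing import NamedTuple


class PublicationNumber(NamedTuple):
    country: str  # two-letter country code, e.g. "US"
    number: str   # serial number, kept as a string
    kind: str     # kind code, e.g. "B2"


# Two letters, digits, then a kind code (a letter plus an optional digit).
# The hyphens are optional so both spellings on this page are accepted.
_PATTERN = re.compile(r"^([A-Z]{2})-?(\d+)-?([A-Z]\d?)$")


def parse_publication_number(text: str) -> PublicationNumber:
    match = _PATTERN.match(text.strip())
    if match is None:
        raise ValueError(f"unrecognised publication number: {text!r}")
    return PublicationNumber(*match.groups())


hyphenated = parse_publication_number("US-11934945-B2")
compact = parse_publication_number("US11934945B2")
print(hyphenated)
```

Both spellings parse to the same country, number, and kind code, which makes the parsed tuple a convenient key when matching records across sources.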

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

Title
What the patent document calls the invention.
Abstract
A short plain-language summary of the technical disclosure.
Assignees and inventors
Who owns or filed the patent and who is credited as inventor.
Key dates
Filing, priority, publication, and grant dates set the timeline.
First independent claim
The legal scope of protection — read this for what is actually claimed.
CPC / IPC classifications
Technology tags used to group this patent with similar filings.
Citations and related patents
Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

Techniques in advanced deep learning provide improvements in one or more of accuracy, performance, and energy efficiency, such as accuracy of learning, accuracy of prediction, speed of learning, performance of learning, and energy efficiency of learning. An array of processing elements performs flow-based computations on wavelets of data. Each processing element has a respective compute element and a respective routing element. Each compute element has processing resources and memory resources. Each router enables communication via wavelets with at least nearest neighbors in a 2D mesh. Stochastic gradient descent, mini-batch gradient descent, and continuous propagation gradient descent are techniques usable to train weights of a neural network modeled by the processing elements. Reverse checkpoint is usable to reduce memory usage during the training.
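
To make two of the training techniques named in the abstract concrete, here is a minimal sketch of stochastic gradient descent (one weight update per sample) next to mini-batch gradient descent (one update per batch, using the averaged gradient). The one-weight linear model, the squared-error loss, the learning rate, and all function names are assumptions chosen for illustration; nothing here models the patent's processing-element array, and continuous propagation and reverse checkpoint are not sketched.

```python
def grad(w, x, y):
    # Gradient of the squared error 0.5 * (w * x - y) ** 2 with respect to w.
    return (w * x - y) * x


def train_sgd(samples, w=0.0, lr=0.1, epochs=50):
    # Stochastic gradient descent: update the weight after every sample.
    for _ in range(epochs):
        for x, y in samples:
            w -= lr * grad(w, x, y)
    return w


def train_minibatch(samples, batch_size, w=0.0, lr=0.1, epochs=50):
    # Mini-batch gradient descent: average the gradient over a batch,
    # then apply a single update for that batch.
    for _ in range(epochs):
        for start in range(0, len(samples), batch_size):
            batch = samples[start:start + batch_size]
            mean_grad = sum(grad(w, x, y) for x, y in batch) / len(batch)
            w -= lr * mean_grad
    return w


# Toy data generated from y = 2 * x, so the ideal weight is 2.0.
samples = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]

w_sgd = train_sgd(samples)
w_mb = train_minibatch(samples, batch_size=2)
print(w_sgd, w_mb)
```

The only difference between the two loops is when the weight changes: after each sample, or once per batch.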

First claim

Opening claim text (preview).

What is claimed is:

1. A system comprising: a fabric of processor elements, each processor element comprising a fabric router and a compute engine enabled to perform dataflow-based and instruction-based processing; wherein each processor element selectively communicates fabric packets with others of the processor elements; and wherein each compute engine selectively performs the processing in accordance with a virtual channel specifier and a task specifier of each fabric packet the compute engine receives.

2. The system of claim 1, wherein: each compute engine is configured to perform a predefined set of basic operations in response to receiving a corresponding basic instruction selected from a predefined native instruction set of codes; and further comprising a training workload comprising a first set of machine codes selected from the native instruction set for performing a mapping of at least a part of a neuron onto the compute engine of the processor element, the mapping comprising managing at least one partial-neuron weight, a second set of machine codes selected from the native instruction set for performing a forward pass to propagate activations in a forward logical direction based at least in part on the at least one partial-neuron weight, the forward pass initiated responsive to an input sample, a third set of machine codes selected from the native instruction set for performing a delta pass in a backward logical direction to generate deltas, the delta pass initiated responsive to completion of the forward pass, a fourth set of machine codes selected from the native instruction set for performing a chain pass to calculate gradients based on the deltas, and a fifth set of machine codes selected from the native instruction set for performing a selective update of the at least one partial-neuron weight in accordance with a predetermined learning rule and based at least in part on the deltas; and wherein each compute engine comprises storage for the at least one partial-neuron weight.

3. The system of claim 2, wherein the mapping is in accordance with initializing the fabric to implement a partitioning of a neural network into a plurality of layers, the neuron is a first neuron of a plurality of neurons of the neural network, the first neuron is comprised in a first layer of the plurality of layers, and each of the plurality of neurons is mapped in a distributed manner across a plurality of the processor elements of the fabric.

4. The system of claim 3, wherein the plurality of layers operates as a logical fabric pipeline comprising logical fabric pipeline stages, each logical fabric pipeline stage comprising completion of all of the passes for each layer, the completion for each layer taking a time step comprising the same amount of time.

5. The system of claim 3, wherein as each input sample of a training set streams through at least a first plurality of the processor elements across the plurality of layers, the neuron weights are selectively updated in the first plurality of the processor elements across the plurality of layers.

6. The system of claim 2, wherein an iteration of the training workload is performed for each of a plurality of input samples collectively comprising a training set.

7. The system of claim 6, wherein the training set is partitioned into a plurality of so-called mini-batches and the predetermined learning rule specifies that the at least one partial-neuron weight is updated after the completion of all the passes for each input sample of each of the mini-batches.

8. The system of claim 7, wherein the forward pass incorporates weight updates within a first plurality of the processor elements while the mini-batch learning is ongoing within the first plurality of the processor elements.

9. The system of claim 6, wherein for each input sample, the system is enabled to selectively update the at least one partial-neuron weight in accordance with the predetermined learning rule responsive to completion of the forward pass, the delta pass, and the chain pass corresponding to the input sample.

10. The system of claim 9, wherein the system is enabled for each forward pass to use weight information provided by the most recent selective update of the at least one partial-neuron weight.

11. The system of claim 10, wherein the system is enabled to perform the delta pass and the chain pass for each input sample based at least in part on activations that are recomputed based at least in part on a first partial-neuron weight.

12. A method comprising: in each of a fabric of processor elements, selectively communicating fabric packets with others of the processor elements, each processor element comprising a fabric router and a compute engine enabled to perform dataflow-based and instruction-based processing; and in each compute engine, selectively performing the processing in accordance with a virtual channel specifier and a task specifier of each fabric packet the compute engine receives.

13. The method of claim 12, wherein: each compute engine is configured to perform a predefined set of basic operations in response to receiving a corresponding basic instruction selected from a predefined native instruction set of codes; and further comprising processing a training workload comprising a first set of machine codes selected from the native instruction set for performing a mapping of at least a part of a neuron onto the compute engine of the processor element, the mapping comprising managing at least one partial-neuron weight, a second set of machine codes selected from the native instruction set for performing a forward pass to propagate activations in a forward logical direction based at least in part on the at least one partial-neuron weight, the forward pass initiated responsive to an input sample, a third set of machine codes selected from the native instruction set for performing a delta pass in a backward logical direction to generate deltas, the delta pass initiated responsive to completion of the forward pass, a fourth set of machine codes selected from the native instruction set for performing a chain pass to calculate gradients based on the deltas, and a fifth set of machine codes selected from the native instruction set for performing a selective update of the at least one partial-neuron weight in accordance with a predetermined learning rule and based at least in part on the deltas; and wherein each compute engine comprises storage for the at least one partial-neuron weight.

14. The method of claim 13, wherein the mapping is in accordance with initializing the fabric to implement a partitioning of a neural network into a plurality of layers, the neuron is a first neuron of a plurality of neurons of the neural network, the first neuron is comprised in a first layer of the plurality of layers, and each of the plurality of neurons is mapped in a distributed manner across a plurality of the processor elements of the fabric.

15. The method of claim 14, wherein the plurality of layers operates as a logical fabric pipeline comprising logical fabric pipeline stages, each logical fabric pipeline stage comprising completion of all of the passes for each layer, the completion for each layer taking a time step comprising the same amount of time.

16. The method of claim 14, wherein as each input sample of a training set streams through at least a first plurality of the processor elements across the plurality of layers, the neuron weights are selectively updated in the first plurality of the processor elements across the plurality of layers.
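
Claims 2 and 13 name four steps per input sample: a forward pass, a delta pass in the backward direction, a chain pass that turns deltas into gradients, and a weight update. The sketch below walks through that sequence on a chain of single-weight neurons. It is an illustration only: the scalar neurons, the tanh activation, the squared-error loss, the learning rate, and every function name are assumptions, and nothing here reflects how the claimed fabric, machine codes, or partial-neuron mapping are implemented.

```python
import math


def forward_pass(weights, x):
    # Propagate activations layer by layer; activations[0] is the input sample.
    activations = [x]
    for w in weights:
        activations.append(math.tanh(w * activations[-1]))
    return activations


def delta_pass(weights, activations, target):
    # Walk backward from the output, producing one delta per layer.
    n = len(weights)
    deltas = [0.0] * n
    out = activations[-1]
    deltas[-1] = (out - target) * (1.0 - out * out)
    for layer in range(n - 2, -1, -1):
        a = activations[layer + 1]
        deltas[layer] = deltas[layer + 1] * weights[layer + 1] * (1.0 - a * a)
    return deltas


def chain_pass(activations, deltas):
    # Gradient for each weight: its layer's delta times that layer's input.
    return [d * a_in for d, a_in in zip(deltas, activations)]


def update(weights, gradients, lr):
    # Plain gradient step, standing in for the "predetermined learning rule".
    return [w - lr * g for w, g in zip(weights, gradients)]


def train_step(weights, x, target, lr=0.5):
    activations = forward_pass(weights, x)
    deltas = delta_pass(weights, activations, target)
    gradients = chain_pass(activations, deltas)
    return update(weights, gradients, lr)


def loss(weights, x, target):
    return 0.5 * (forward_pass(weights, x)[-1] - target) ** 2


weights = [0.5, 0.5]
loss_before = loss(weights, 1.0, 0.8)
for _ in range(200):
    weights = train_step(weights, 1.0, 0.8)
loss_after = loss(weights, 1.0, 0.8)
print(loss_before, loss_after)
```

Updating after every sample, as `train_step` does, corresponds to the per-sample behaviour described in claim 9; the mini-batch variant of claim 7 would instead accumulate gradients over a batch before calling `update`.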

Assignees

Cerebras Systems Inc

Inventors

Classifications

G06N3/098 · Primary
Distributed learning, e.g. federated learning · CPC title
G06N3/0442
characterised by memory or gating, e.g. long short-term memory [LSTM] or gated recurrent units [GRU] · CPC title
G06N3/09
Supervised learning · CPC title
G06N3/0464
Convolutional networks [CNN, ConvNet] · CPC title
G06N3/0495
Quantised networks; Sparse networks; Compressed networks · CPC title
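
Each CPC symbol listed above follows the same layout: a section letter, a two-digit class, a subclass letter, a main group, and a subgroup after the slash. The sketch below splits a symbol into those parts and looks up the section title, which is where the "Physics" technology area for G06N3/098 comes from. The function and type names are hypothetical, the pattern is a simplified reading of the CPC layout, and the section table covers only A to H.

```python
import re
from typing import NamedTuple

# Titles of CPC sections A-H; section Y (cross-sectional tagging) is omitted.
CPC_SECTIONS = {
    "A": "Human necessities",
    "B": "Performing operations; transporting",
    "C": "Chemistry; metallurgy",
    "D": "Textiles; paper",
    "E": "Fixed constructions",
    "F": "Mechanical engineering; lighting; heating; weapons; blasting",
    "G": "Physics",
    "H": "Electricity",
}


class CpcSymbol(NamedTuple):
    section: str     # e.g. "G"
    cpc_class: str   # e.g. "06"
    subclass: str    # e.g. "N"
    main_group: str  # e.g. "3"
    subgroup: str    # e.g. "098"


# Section letter, two-digit class, subclass letter, main group, "/", subgroup.
_CPC = re.compile(r"^([A-HY])(\d{2})([A-Z])(\d{1,4})/(\d{2,6})$")


def parse_cpc(symbol: str) -> CpcSymbol:
    match = _CPC.match(symbol.replace(" ", ""))
    if match is None:
        raise ValueError(f"not a CPC symbol: {symbol!r}")
    return CpcSymbol(*match.groups())


primary = parse_cpc("G06N3/098")
print(primary, CPC_SECTIONS.get(primary.section, "unknown"))
```

All five classifications on this page share the prefix G06N3, so they differ only in the `subgroup` field.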

Patent family

Related publications grouped by family.

View patent family 63253606

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US11934945B2 cover?: Techniques in advanced deep learning provide improvements in one or more of accuracy, performance, and energy efficiency, such as accuracy of learning, accuracy of prediction, speed of learning, performance of learning, and energy efficiency of learning. An array of processing elements performs flow-based computations on wavelets of data. Each processing element has a respective compute element…
Who is the assignee on this patent?: Cerebras Systems Inc