Task pooling and work affinity in data processing
US-2016098296-A1 · Apr 7, 2016 · US
US11934945B2 · US · B2
| Field | Value |
|---|---|
| Publication number | US-11934945-B2 |
| Application number | US-201816481016-A |
| Country | US |
| Kind code | B2 |
| Filing date | Feb 23, 2018 |
| Priority date | Feb 23, 2017 |
| Publication date | Mar 19, 2024 |
| Grant date | Mar 19, 2024 |
Official abstract text for this publication.
Techniques in advanced deep learning provide improvements in one or more of accuracy, performance, and energy efficiency, such as accuracy of learning, accuracy of prediction, speed of learning, performance of learning, and energy efficiency of learning. An array of processing elements performs flow-based computations on wavelets of data. Each processing element has a respective compute element and a respective routing element. Each compute element has processing resources and memory resources. Each router enables communication via wavelets with at least nearest neighbors in a 2D mesh. Stochastic gradient descent, mini-batch gradient descent, and continuous propagation gradient descent are techniques usable to train weights of a neural network modeled by the processing elements. Reverse checkpoint is usable to reduce memory usage during the training.
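The abstract's statement that each router communicates with at least its nearest neighbors in a 2D mesh can be illustrated with a minimal sketch. The function name `mesh_neighbors`, the (row, column) addressing, and the choice of a non-wrapping mesh (edge elements simply have fewer neighbors) are assumptions made for illustration and are not taken from the patent text.

```python
from typing import List, Tuple

Coord = Tuple[int, int]  # hypothetical (row, col) address of a processing element


def mesh_neighbors(row: int, col: int, rows: int, cols: int) -> List[Coord]:
    """Return the in-bounds north, south, west, and east neighbors of (row, col).

    Assumes a plain 2D mesh with no wraparound links, so corner and edge
    elements have fewer than four neighbors.
    """
    candidates = [
        (row - 1, col),  # north
        (row + 1, col),  # south
        (row, col - 1),  # west
        (row, col + 1),  # east
    ]
    return [(r, c) for r, c in candidates if 0 <= r < rows and 0 <= c < cols]


if __name__ == "__main__":
    # A corner element of a 3x3 mesh has two neighbors; the center has four.
    print(mesh_neighbors(0, 0, 3, 3))
    print(mesh_neighbors(1, 1, 3, 3))
```

The phrase "at least nearest neighbors" leaves room for longer-range links, which this sketch does not model.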
Opening claim text (preview).
What is claimed is:
1. A system comprising: a fabric of processor elements, each processor element comprising a fabric router and a compute engine enabled to perform dataflow-based and instruction-based processing; wherein each processor element selectively communicates fabric packets with others of the processor elements; and wherein each compute engine selectively performs the processing in accordance with a virtual channel specifier and a task specifier of each fabric packet the compute engine receives.
2. The system of claim 1, wherein: each compute engine is configured to perform a predefined set of basic operations in response to receiving a corresponding basic instruction selected from a predefined native instruction set of codes; and further comprising a training workload comprising a first set of machine codes selected from the native instruction set for performing a mapping of at least a part of a neuron onto the compute engine of the processor element, the mapping comprising managing at least one partial-neuron weight, a second set of machine codes selected from the native instruction set for performing a forward pass to propagate activations in a forward logical direction based at least in part on the at least one partial-neuron weight, the forward pass initiated responsive to an input sample, a third set of machine codes selected from the native instruction set for performing a delta pass in a backward logical direction to generate deltas, the delta pass initiated responsive to completion of the forward pass, a fourth set of machine codes selected from the native instruction set for performing a chain pass to calculate gradients based on the deltas, and a fifth set of machine codes selected from the native instruction set for performing a selective update of the at least one partial-neuron weight in accordance with a predetermined learning rule and based at least in part on the deltas; and wherein each compute engine comprises storage for the at least one partial-neuron weight.
3. The system of claim 2, wherein the mapping is in accordance with initializing the fabric to implement a partitioning of a neural network into a plurality of layers, the neuron is a first neuron of a plurality of neurons of the neural network, the first neuron is comprised in a first layer of the plurality of layers, and each of the plurality of neurons is mapped in a distributed manner across a plurality of the processor elements of the fabric.
4. The system of claim 3, wherein the plurality of layers operates as a logical fabric pipeline comprising logical fabric pipeline stages, each logical fabric pipeline stage comprising completion of all of the passes for each layer, the completion for each layer taking a time step comprising the same amount of time.
5. The system of claim 3, wherein as each input sample of a training set streams through at least a first plurality of the processor elements across the plurality of layers, the neuron weights are selectively updated in the first plurality of the processor elements across the plurality of layers.
6. The system of claim 2, wherein an iteration of the training workload is performed for each of a plurality of input samples collectively comprising a training set.
7. The system of claim 6, wherein the training set is partitioned into a plurality of so-called mini-batches and the predetermined learning rule specifies that the at least one partial-neuron weight is updated after the completion of all the passes for each input sample of each of the mini-batches.
8. The system of claim 7, wherein the forward pass incorporates weight updates within a first plurality of the processor elements while the mini-batch learning is ongoing within the first plurality of the processor elements.
9. The system of claim 6, wherein for each input sample, the system is enabled to selectively update the at least one partial-neuron weight in accordance with the predetermined learning rule responsive to completion of the forward pass, the delta pass, and the chain pass corresponding to the input sample.
10. The system of claim 9, wherein the system is enabled for each forward pass to use weight information provided by the most recent selective update of the at least one partial-neuron weight.
11. The system of claim 10, wherein the system is enabled to perform the delta pass and the chain pass for each input sample based at least in part on activations that are recomputed based at least in part on a first partial-neuron weight.
12. A method comprising: in each of a fabric of processor elements, selectively communicating fabric packets with others of the processor elements, each processor element comprising a fabric router and a compute engine enabled to perform dataflow-based and instruction-based processing; and in each compute engine, selectively performing the processing in accordance with a virtual channel specifier and a task specifier of each fabric packet the compute engine receives.
13. The method of claim 12, wherein: each compute engine is configured to perform a predefined set of basic operations in response to receiving a corresponding basic instruction selected from a predefined native instruction set of codes; and further comprising processing a training workload comprising a first set of machine codes selected from the native instruction set for performing a mapping of at least a part of a neuron onto the compute engine of the processor element, the mapping comprising managing at least one partial-neuron weight, a second set of machine codes selected from the native instruction set for performing a forward pass to propagate activations in a forward logical direction based at least in part on the at least one partial-neuron weight, the forward pass initiated responsive to an input sample, a third set of machine codes selected from the native instruction set for performing a delta pass in a backward logical direction to generate deltas, the delta pass initiated responsive to completion of the forward pass, a fourth set of machine codes selected from the native instruction set for performing a chain pass to calculate gradients based on the deltas, and a fifth set of machine codes selected from the native instruction set for performing a selective update of the at least one partial-neuron weight in accordance with a predetermined learning rule and based at least in part on the deltas; and wherein each compute engine comprises storage for the at least one partial-neuron weight.
14. The method of claim 13, wherein the mapping is in accordance with initializing the fabric to implement a partitioning of a neural network into a plurality of layers, the neuron is a first neuron of a plurality of neurons of the neural network, the first neuron is comprised in a first layer of the plurality of layers, and each of the plurality of neurons is mapped in a distributed manner across a plurality of the processor elements of the fabric.
15. The method of claim 14, wherein the plurality of layers operates as a logical fabric pipeline comprising logical fabric pipeline stages, each logical fabric pipeline stage comprising completion of all of the passes for each layer, the completion for each layer taking a time step comprising the same amount of time.
16. The method of claim 14, wherein as each input sample of a training set streams through at least a first plurality of the processor elements across the plurality of layers, the neuron weights are selectively updated in the first plurality of the processor elements across the plurality of layers.
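The sequence of passes recited in claims 2 and 13 (forward pass, delta pass, chain pass, selective weight update) and the two update schedules in claims 7 and 9 can be sketched in ordinary Python. This is only an illustration under stated assumptions: a single scalar weight, a linear activation, a squared-error loss, and averaging of gradients within a mini-batch are all choices made here, and the names `PartialNeuron`, `train_per_sample`, and `train_mini_batch` are hypothetical. The claims themselves describe native machine codes running on fabric compute engines, not this code.

```python
from dataclasses import dataclass
from typing import Sequence, Tuple

Sample = Tuple[float, float]  # hypothetical (input, target) pair


@dataclass
class PartialNeuron:
    weight: float  # stands in for the locally stored partial-neuron weight

    def forward(self, x: float) -> float:
        # Forward pass: propagate an activation (assumed linear, no bias).
        return self.weight * x

    def delta(self, activation: float, target: float) -> float:
        # Delta pass: derivative of the assumed loss 0.5 * (a - t)**2 w.r.t. a.
        return activation - target

    def chain(self, delta: float, x: float) -> float:
        # Chain pass: gradient of the loss with respect to the weight.
        return delta * x

    def update(self, gradient: float, lr: float) -> None:
        # Weight update under an assumed plain gradient-descent learning rule.
        self.weight -= lr * gradient


def train_per_sample(neuron: PartialNeuron, samples: Sequence[Sample], lr: float) -> None:
    # Update after the forward, delta, and chain passes of each sample (cf. claim 9).
    for x, target in samples:
        a = neuron.forward(x)
        d = neuron.delta(a, target)
        neuron.update(neuron.chain(d, x), lr)


def train_mini_batch(neuron: PartialNeuron, samples: Sequence[Sample],
                     lr: float, batch_size: int) -> None:
    # Update once per mini-batch, after all passes for its samples (cf. claim 7).
    for start in range(0, len(samples), batch_size):
        batch = samples[start:start + batch_size]
        grad_sum = 0.0
        for x, target in batch:
            a = neuron.forward(x)
            d = neuron.delta(a, target)
            grad_sum += neuron.chain(d, x)
        neuron.update(grad_sum / len(batch), lr)  # averaging is an assumption


if __name__ == "__main__":
    data = [(1.0, 2.0), (1.0, 4.0)]
    a, b = PartialNeuron(0.0), PartialNeuron(0.0)
    train_per_sample(a, data, lr=0.5)
    train_mini_batch(b, data, lr=0.5, batch_size=2)
    print(a.weight, b.weight)
```

In the per-sample schedule the second forward pass already sees the weight changed by the first sample, which is the behavior claim 10 describes; in the mini-batch schedule both forward passes use the same weight.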
Distributed learning, e.g. federated learning · CPC title
characterised by memory or gating, e.g. long short-term memory [LSTM] or gated recurrent units [GRU] · CPC title
Supervised learning · CPC title
Convolutional networks [CNN, ConvNet] · CPC title
Quantised networks; Sparse networks; Compressed networks · CPC title