Computing system for training neural networks
US-2017308789-A1 · Oct 26, 2017 · US
US10628740B2 · US · B2
| Field | Value |
|---|---|
| Publication number | US-10628740-B2 |
| Application number | US-201615146917-A |
| Country | US |
| Kind code | B2 |
| Filing date | May 5, 2016 |
| Priority date | Oct 2, 2015 |
| Publication date | Apr 21, 2020 |
| Grant date | Apr 21, 2020 |
A practical reading order for non-experts. Skip the full description unless you need deep technical detail.
What the patent document calls the invention.
A short plain-language summary of the technical disclosure.
Who owns or filed the patent and who is credited as inventor.
Filing, priority, publication, and grant dates set the timeline.
The legal scope of protection — read this for what is actually claimed.
Technology tags used to group this patent with similar filings.
Prior art links and similar publications in this corpus.
Official abstract text for this publication.
The example computer-implemented method may comprise computing, by a generator processor on each of a plurality of learners, a gradient for a mini-batch using a current weight at each of the plurality of learners. The method may also comprise generating, by the generator processor on each of the plurality of learners, a plurality of triples, wherein each of the triples comprises the gradient, the weight index of the current weights used to compute the gradient, and a mass of the gradient. The method may further comprise performing, by a reconciler processor on each of the plurality of learners, an allreduce operation on the plurality of triples to obtain an allreduced triple sequence. Additionally, the method may comprise updating, by the reconciler processor on each of the plurality of learners, the current weight at each of the plurality of learners to a new current weight using the allreduced triple sequence.
Opening claim text (preview).
What is claimed is: 1. A computer-implemented method for asynchronous stochastic gradient descent, the method comprising: computing, by a generator processor on each of a plurality of learners, a gradient for a mini-batch using a current weight at each of the plurality of learners, the current weight being uniquely identified by a weight index of each of the plurality of learners, wherein the plurality of learners are arranged in a peer-to-peer arrangement without a parameter server; generating, by the generator processor on each of the plurality of learners, a plurality of triples, wherein each of the triples comprises the gradient, the weight index of the current weights used to compute the gradient, and a mass of the gradient, the mass equaling a number of mini-batches used to generate the gradient times a number of observations in the mini-batch; performing, by a reconciler processor on each of the plurality of learners, an allreduce operation on the plurality of triples to obtain an allreduced triple sequence; and updating, by the reconciler processor on each of the plurality of learners, the current weight at each of the plurality of learners to a new current weight using the allreduced triple sequence, wherein the new current weight becomes the current weight for a next processing batch to be computed by the generator processor. 2. The computer-implemented method of claim 1 , wherein the generator processor and the reconciler processor execute simultaneously. 3. The computer-implemented method of claim 2 , further comprising: communicating the plurality of triples from the generator processor to the reconciler processor using a non-blocking to-Learner buffer; and communicating the current weight with index from the reconciler processor to the generator processor through the use of a non-blocking from-Learner buffer. 4. The computer-implemented method of claim 3 , further comprising: performing a summing-in of a new gradient into an existing gradient when it is determined that the non-blocking to-Learner buffer is full. 5. The computer-implemented method of claim 1 , wherein performing the allreduce operation comprises performing a summation function on the plurality of triples to sum the masses of the plurality of triples. 6. The computer-implemented method of claim 1 , wherein performing the allreduce operation comprises performing a minimum function on the plurality of triples to determine a minimum of the weight indices of the plurality of triples. 7. The computer-implemented method of claim 1 , wherein the allreduce operation is performed in stages across subsets of the plurality of learners. 8. The computer-implemented method of claim 1 , wherein the allreduced triple sequence comprises at least one triple, and wherein a mass associated with each of the at least one triples is equal to a given target mass. 9. The computer-implemented method of claim 1 , wherein at least one of the generator processor and the reconciler processor are multi-threaded processors. 10. A system for Asynchronous stochastic gradient descent, the system comprising: a processor in communication with one or more types of memory, the processor configured to: compute, by a generator processor on each of a plurality of learners, a gradient for a mini-batch using a current weight at each of the plurality of learners, the current weight being uniquely identified by a weight index of each of the plurality of learners, wherein the plurality of learners are arranged in a peer-to-peer arrangement without a parameter server; generate, by the generator processor on each of the plurality of learners, a plurality of triples, wherein each of the triples comprises the gradient, the weight index of the current weights used to compute the gradient, and a mass of the gradient, the mass equaling a number of mini-batches used to generate the gradient times a number of observations in the mini-batch; perform, by a reconciler processor on each of the plurality of learners, an allreduce operation on the plurality of triples to obtain an allreduced triple sequence; and update by the reconciler processor on each of the plurality of learners, the current weight at each of the plurality of learners to a new current weight using the allreduced triple sequence, wherein the new current weight becomes the current weight for a next processing batch to be computed by the generator processor. 11. The system of claim 10 , wherein the generator processor and the reconciler processor execute simultaneously. 12. The system of claim 11 , wherein the processor is further configured to: communicate the plurality of triples from the generator processor to the reconciler processor using a non-blocking to-Learner buffer; and communicate the current weight with index from the reconciler processor to the generator processor through the use of a non-blocking from-Learner buffer. 13. The system of claim 12 , wherein the processor is further configured to: perform a summing-in of a new gradient into an existing gradient when it is determined that the non-blocking to-Learner buffer is full. 14. The system of claim 10 , wherein performing the allreduce operation comprises performing a summation function on the plurality of triples to sum the masses of the plurality of triples. 15. The system of claim 10 , wherein performing the allreduce operation comprises performing a minimum function on the plurality of triples to determine a minimum of the weight indices of the plurality of triples. 16. The system of claim 10 , wherein the allreduce operation is performed in stages across subsets of the plurality of learners. 17. The system of claim 10 , wherein the allreduced triple sequence comprises at least one triple, and wherein a mass associated with each of the at least one triples is equal to a given target mass. 18. A computer program product for asynchronous stochastic gradient descent, the computer program product comprising: a non-transitory storage medium readable by a processing circuit and storing instructions for execution by the processing circuit for performing a method comprising: computing, by a generator processor on each of a plurality of learners, a gradient for a mini-batch using a current weight at each of the plurality of learners, the current weight being uniquely identified by a weight index of each of the plurality of learners, wherein the plurality of learners are arranged in a peer-to-peer arrangement without a parameter server; generating, by the generator processor on each of the plurality of learners, a plurality of triples, wherein each of the triples comprises the gradient, the weight index of the current weights used to compute the gradient, and a mass of the gradient, the mass equaling a number of mini-batches used to generate the gradient times a number of observations in the mini-batch; performing, by a reconciler processor on each of the plurality of learners, an allreduce operation on the plurality of triples to obtain an allreduced triple sequence; and updating, by the reconciler processor on each of the plurality of learners, the current weight at each of the plurality of learners to a new current weight using the allreduced triple sequence, wherein the new current weight becomes the current weight for a next processing batch to be computed by the generator processor.
Backpropagation, e.g. using gradient descent · CPC title
Related publications grouped by family.
Answers are generated from the same data shown on this page.