Asynchronous stochastic gradient descent

US10628740B2 · US · B2

Patent metadata
FieldValue
Publication numberUS-10628740-B2
Application numberUS-201615146917-A
CountryUS
Kind codeB2
Filing dateMay 5, 2016
Priority dateOct 2, 2015
Publication dateApr 21, 2020
Grant dateApr 21, 2020

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

The example computer-implemented method may comprise computing, by a generator processor on each of a plurality of learners, a gradient for a mini-batch using a current weight at each of the plurality of learners. The method may also comprise generating, by the generator processor on each of the plurality of learners, a plurality of triples, wherein each of the triples comprises the gradient, the weight index of the current weights used to compute the gradient, and a mass of the gradient. The method may further comprise performing, by a reconciler processor on each of the plurality of learners, an allreduce operation on the plurality of triples to obtain an allreduced triple sequence. Additionally, the method may comprise updating, by the reconciler processor on each of the plurality of learners, the current weight at each of the plurality of learners to a new current weight using the allreduced triple sequence.

First claim

Opening claim text (preview).

What is claimed is: 1. A computer-implemented method for asynchronous stochastic gradient descent, the method comprising: computing, by a generator processor on each of a plurality of learners, a gradient for a mini-batch using a current weight at each of the plurality of learners, the current weight being uniquely identified by a weight index of each of the plurality of learners, wherein the plurality of learners are arranged in a peer-to-peer arrangement without a parameter server; generating, by the generator processor on each of the plurality of learners, a plurality of triples, wherein each of the triples comprises the gradient, the weight index of the current weights used to compute the gradient, and a mass of the gradient, the mass equaling a number of mini-batches used to generate the gradient times a number of observations in the mini-batch; performing, by a reconciler processor on each of the plurality of learners, an allreduce operation on the plurality of triples to obtain an allreduced triple sequence; and updating, by the reconciler processor on each of the plurality of learners, the current weight at each of the plurality of learners to a new current weight using the allreduced triple sequence, wherein the new current weight becomes the current weight for a next processing batch to be computed by the generator processor. 2. The computer-implemented method of claim 1 , wherein the generator processor and the reconciler processor execute simultaneously. 3. The computer-implemented method of claim 2 , further comprising: communicating the plurality of triples from the generator processor to the reconciler processor using a non-blocking to-Learner buffer; and communicating the current weight with index from the reconciler processor to the generator processor through the use of a non-blocking from-Learner buffer. 4. The computer-implemented method of claim 3 , further comprising: performing a summing-in of a new gradient into an existing gradient when it is determined that the non-blocking to-Learner buffer is full. 5. The computer-implemented method of claim 1 , wherein performing the allreduce operation comprises performing a summation function on the plurality of triples to sum the masses of the plurality of triples. 6. The computer-implemented method of claim 1 , wherein performing the allreduce operation comprises performing a minimum function on the plurality of triples to determine a minimum of the weight indices of the plurality of triples. 7. The computer-implemented method of claim 1 , wherein the allreduce operation is performed in stages across subsets of the plurality of learners. 8. The computer-implemented method of claim 1 , wherein the allreduced triple sequence comprises at least one triple, and wherein a mass associated with each of the at least one triples is equal to a given target mass. 9. The computer-implemented method of claim 1 , wherein at least one of the generator processor and the reconciler processor are multi-threaded processors. 10. A system for Asynchronous stochastic gradient descent, the system comprising: a processor in communication with one or more types of memory, the processor configured to: compute, by a generator processor on each of a plurality of learners, a gradient for a mini-batch using a current weight at each of the plurality of learners, the current weight being uniquely identified by a weight index of each of the plurality of learners, wherein the plurality of learners are arranged in a peer-to-peer arrangement without a parameter server; generate, by the generator processor on each of the plurality of learners, a plurality of triples, wherein each of the triples comprises the gradient, the weight index of the current weights used to compute the gradient, and a mass of the gradient, the mass equaling a number of mini-batches used to generate the gradient times a number of observations in the mini-batch; perform, by a reconciler processor on each of the plurality of learners, an allreduce operation on the plurality of triples to obtain an allreduced triple sequence; and update by the reconciler processor on each of the plurality of learners, the current weight at each of the plurality of learners to a new current weight using the allreduced triple sequence, wherein the new current weight becomes the current weight for a next processing batch to be computed by the generator processor. 11. The system of claim 10 , wherein the generator processor and the reconciler processor execute simultaneously. 12. The system of claim 11 , wherein the processor is further configured to: communicate the plurality of triples from the generator processor to the reconciler processor using a non-blocking to-Learner buffer; and communicate the current weight with index from the reconciler processor to the generator processor through the use of a non-blocking from-Learner buffer. 13. The system of claim 12 , wherein the processor is further configured to: perform a summing-in of a new gradient into an existing gradient when it is determined that the non-blocking to-Learner buffer is full. 14. The system of claim 10 , wherein performing the allreduce operation comprises performing a summation function on the plurality of triples to sum the masses of the plurality of triples. 15. The system of claim 10 , wherein performing the allreduce operation comprises performing a minimum function on the plurality of triples to determine a minimum of the weight indices of the plurality of triples. 16. The system of claim 10 , wherein the allreduce operation is performed in stages across subsets of the plurality of learners. 17. The system of claim 10 , wherein the allreduced triple sequence comprises at least one triple, and wherein a mass associated with each of the at least one triples is equal to a given target mass. 18. A computer program product for asynchronous stochastic gradient descent, the computer program product comprising: a non-transitory storage medium readable by a processing circuit and storing instructions for execution by the processing circuit for performing a method comprising: computing, by a generator processor on each of a plurality of learners, a gradient for a mini-batch using a current weight at each of the plurality of learners, the current weight being uniquely identified by a weight index of each of the plurality of learners, wherein the plurality of learners are arranged in a peer-to-peer arrangement without a parameter server; generating, by the generator processor on each of the plurality of learners, a plurality of triples, wherein each of the triples comprises the gradient, the weight index of the current weights used to compute the gradient, and a mass of the gradient, the mass equaling a number of mini-batches used to generate the gradient times a number of observations in the mini-batch; performing, by a reconciler processor on each of the plurality of learners, an allreduce operation on the plurality of triples to obtain an allreduced triple sequence; and updating, by the reconciler processor on each of the plurality of learners, the current weight at each of the plurality of learners to a new current weight using the allreduced triple sequence, wherein the new current weight becomes the current weight for a next processing batch to be computed by the generator processor.

Assignees

Inventors

Classifications

  • G06N3/084Primary

    Backpropagation, e.g. using gradient descent · CPC title

Patent family

Related publications grouped by family.

External sources

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US10628740B2 cover?
The example computer-implemented method may comprise computing, by a generator processor on each of a plurality of learners, a gradient for a mini-batch using a current weight at each of the plurality of learners. The method may also comprise generating, by the generator processor on each of the plurality of learners, a plurality of triples, wherein each of the triples comprises the gradient, t…
Who is the assignee on this patent?
IBM
What technology area does this patent fall under?
Primary CPC classification G06N3/084. Mapped technology areas include Physics.
When was this patent published?
Publication date Tue Apr 21 2020 00:00:00 GMT+0000 (Coordinated Universal Time) (B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 3 related publications on this page (citations in our corpus or others sharing the same primary CPC).