Distributed training of models using stochastic gradient descent
US-10152676-B1 · Dec 11, 2018 · US
US11049006B2 · US · B2
| Field | Value |
|---|---|
| Publication number | US-11049006-B2 |
| Application number | US-201415510356-A |
| Country | US |
| Kind code | B2 |
| Filing date | Sep 12, 2014 |
| Priority date | Sep 12, 2014 |
| Publication date | Jun 29, 2021 |
| Grant date | Jun 29, 2021 |
A practical reading order for non-experts. Skip the full description unless you need deep technical detail.
What the patent document calls the invention.
A short plain-language summary of the technical disclosure.
Who owns or filed the patent and who is credited as inventor.
Filing, priority, publication, and grant dates set the timeline.
The legal scope of protection — read this for what is actually claimed.
Technology tags used to group this patent with similar filings.
Prior art links and similar publications in this corpus.
Official abstract text for this publication.
Techniques and constructs can reduce the time required to determine solutions to optimization problems such as training of neural networks. Modifications to a computational model can be determined by a plurality of nodes operating in parallel. Quantized modification values can be transmitted between the nodes to reduce the volume of data to be transferred. The quantized values can be as small as one bit each. Quantization-error values can be stored and used in quantizing subsequent modifications. The nodes can operate in parallel and overlap computation and data transfer to further reduce the time required to determine solutions. The quantized values can be partitioned and each node can aggregate values for a corresponding partition.
Opening claim text (preview).
What is claimed is: 1. A method comprising: for each of a plurality of nodes: determining, by a respective node of the plurality of nodes, a respective stripe of a plurality of stripes of gradient matrices of a computational model of an optimization problem for a first minibatch of training data; quantizing the respective stripe of the plurality of stripes of gradient matrices for the first minibatch using corresponding stored error matrices stored by the respective node; updating the stored error matrices stored by the respective node for the first minibatch using the corresponding quantized stripe of gradient matrices; exchanging the quantized respective stripe of the plurality of stripes of gradient matrices for the first minibatch synchronously with the plurality of nodes, wherein the exchanging comprises: partitioning the quantized respective stripe of gradient matrices into a plurality of data stripes; providing, to each of the plurality of nodes, a respective data stripe of the plurality of data stripes of the respective node during a first phase of the exchanging; receiving a respective data stripe from each of the other nodes during the first phase of the exchanging; aggregating the received data stripes into aggregated stripe data; transmitting, to each of the plurality of nodes, the aggregated stripe data of the respective node during a second phase of the exchanging; receiving other aggregated stripe data from each of the other nodes during the second phase of the exchanging; and recovering the full set of gradients for the computational model from the aggregated stripe data and the other aggregated stripe data; determining a respective stripe of a plurality of stripes of gradient matrices of the computational model for a second minibatch of the training data while exchanging the quantized gradient matrices for the first minibatch; and repeating the determining, quantizing, updating, and exchanging steps for each of a plurality of minibatches of the computational model, the plurality of minibatches including the first and the second minibatches. 2. The method of claim 1 , further including adjusting a parallelization factor as a function of batch size based at least in part on time measurements. 3. The method of claim 1 , the quantizing comprising determining an approximate single-bit representation for each element of the gradient matrices. 4. The method of claim 1 , further including reconstructing the quantized gradient matrices after the exchanging. 5. The method of claim 1 , wherein the computational model includes a neural network model, the method further including updating the neural network model using the gradient matrices. 6. A non-transitory computer-readable medium having stored thereon computer-executable instructions, the computer-executable instructions upon execution configuring a computer to carry out steps including: for each of a plurality of nodes: determining, by a respective node of the plurality of nodes, a respective stripe of a plurality of stripes of gradient matrices of a computational model of an optimization problem for a first minibatch of training data; quantizing the respective stripe of the plurality of stripes of the gradient matrices for the first minibatch using corresponding stored error matrices stored by the respective node; updating the stored error matrices stored by the respective node for the first minibatch using the corresponding quantized stripe of gradient matrices; exchanging the quantized respective stripe of the plurality of stripes of gradient matrices for the first minibatch synchronously with the plurality of node, wherein the exchanging comprises: partitioning the quantized respective stripe of gradient matrices into a plurality of data stripes; providing, to each of the plurality of nodes, a respective data stripe of the plurality of data stripes of the respective node during a first phase of the exchanging; receiving a respective data stripe from each of the other nodes during the first phase of the exchanging; aggregating the received data stripes into aggregated stripe data; transmitting, to each of the plurality of nodes, the aggregated stripe data of the respective node during a second phase of the exchanging; receiving other aggregated stripe data from each of the other nodes during the second phase of the exchanging; and recovering the full set of gradients for the computational model from the aggregated stripe data and the other aggregated stripe data; determining a respective stripe of a plurality of stripes of gradient matrices of the computational model for a second minibatch of the training data while exchanging the quantized gradient matrices for the first minibatch; and repeating the determining, quantizing, updating, and exchanging steps for each of a plurality of minibatches of the computational model, the plurality of minibatches including the first and the second minibatches. 7. A system comprising: one or more computer-readable media having thereon a plurality of modules and a computational model of an optimization problem; and a plurality of nodes, each including at least one processing unit, each processing unit operably coupled to at least one of the computer-readable media, the processing units adapted to intercommunicate and to execute modules of the plurality of modules comprising: an update-determining module configured to determine modification values of the computational model for a first minibatch of training data, the modification values comprising, for each node of the plurality of nodes, a respective stripe of a plurality stripes of gradient matrices of the computational model for the first minibatch of training data; a quantization module configured to quantize, for each node of the plurality of nodes, the respective stripe of the gradient matrices determined modification values for the first minibatch using stored error values stored by the respective node and to update the stored error values stored by the respective node for the first minibatch using the determined modification values and the quantized stripe of the gradient matrices determined modification values; a transferring module configured to transmit at least some of the quantized stripes of the gradient matrices determined modification values for the first minibatch synchronously to at least one other of the plurality of nodes, wherein the transferring module is configured to: partition each of the quantized stripes of the gradient matrices determined modification values for the first minibatch into a plurality of data stripes; provide, to each of the plurality of nodes, a respective data stripe of the plurality of data stripes of the respective node during a first phase of transferring; receive, by each of the plurality of nodes, respective data stripe from each of the other nodes during the first phase of the transferring; aggregate, for each of the plurality of nodes, the received data stripes into aggregated stripe data; transmit, to each of the plurality of nodes, the aggregated stripe data of the respective node during a second phase of the transferring; receive, by each of the plurality of nodes, other aggregated stripe data from each of the other nodes during the second phase of the transferring; and recover, by each of the plurality of nodes, the full set of gradients for the computational model from the aggregated stripe data and the other aggregated stripe data; and an updating module configured to modify the stored computational model according to the recovered full set of gradients; wherein the update-determining module is further configured to determine modification values of the computational model for a second minibatch of the training data while th
Probabilistic graphical models, e.g. probabilistic networks · CPC title
Combinations of networks · CPC title
Backpropagation, e.g. using gradient descent · CPC title
modifying the architecture, e.g. adding, deleting or silencing nodes or connections · CPC title
Distributed learning, e.g. federated learning · CPC title
Related publications grouped by family.
Answers are generated from the same data shown on this page.