Computing system for training neural networks

US11049006B2 · US · B2

Patent metadata
FieldValue
Publication numberUS-11049006-B2
Application numberUS-201415510356-A
CountryUS
Kind codeB2
Filing dateSep 12, 2014
Priority dateSep 12, 2014
Publication dateJun 29, 2021
Grant dateJun 29, 2021

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

Techniques and constructs can reduce the time required to determine solutions to optimization problems such as training of neural networks. Modifications to a computational model can be determined by a plurality of nodes operating in parallel. Quantized modification values can be transmitted between the nodes to reduce the volume of data to be transferred. The quantized values can be as small as one bit each. Quantization-error values can be stored and used in quantizing subsequent modifications. The nodes can operate in parallel and overlap computation and data transfer to further reduce the time required to determine solutions. The quantized values can be partitioned and each node can aggregate values for a corresponding partition.

First claim

Opening claim text (preview).

What is claimed is: 1. A method comprising: for each of a plurality of nodes: determining, by a respective node of the plurality of nodes, a respective stripe of a plurality of stripes of gradient matrices of a computational model of an optimization problem for a first minibatch of training data; quantizing the respective stripe of the plurality of stripes of gradient matrices for the first minibatch using corresponding stored error matrices stored by the respective node; updating the stored error matrices stored by the respective node for the first minibatch using the corresponding quantized stripe of gradient matrices; exchanging the quantized respective stripe of the plurality of stripes of gradient matrices for the first minibatch synchronously with the plurality of nodes, wherein the exchanging comprises: partitioning the quantized respective stripe of gradient matrices into a plurality of data stripes; providing, to each of the plurality of nodes, a respective data stripe of the plurality of data stripes of the respective node during a first phase of the exchanging; receiving a respective data stripe from each of the other nodes during the first phase of the exchanging; aggregating the received data stripes into aggregated stripe data; transmitting, to each of the plurality of nodes, the aggregated stripe data of the respective node during a second phase of the exchanging; receiving other aggregated stripe data from each of the other nodes during the second phase of the exchanging; and recovering the full set of gradients for the computational model from the aggregated stripe data and the other aggregated stripe data; determining a respective stripe of a plurality of stripes of gradient matrices of the computational model for a second minibatch of the training data while exchanging the quantized gradient matrices for the first minibatch; and repeating the determining, quantizing, updating, and exchanging steps for each of a plurality of minibatches of the computational model, the plurality of minibatches including the first and the second minibatches. 2. The method of claim 1 , further including adjusting a parallelization factor as a function of batch size based at least in part on time measurements. 3. The method of claim 1 , the quantizing comprising determining an approximate single-bit representation for each element of the gradient matrices. 4. The method of claim 1 , further including reconstructing the quantized gradient matrices after the exchanging. 5. The method of claim 1 , wherein the computational model includes a neural network model, the method further including updating the neural network model using the gradient matrices. 6. A non-transitory computer-readable medium having stored thereon computer-executable instructions, the computer-executable instructions upon execution configuring a computer to carry out steps including: for each of a plurality of nodes: determining, by a respective node of the plurality of nodes, a respective stripe of a plurality of stripes of gradient matrices of a computational model of an optimization problem for a first minibatch of training data; quantizing the respective stripe of the plurality of stripes of the gradient matrices for the first minibatch using corresponding stored error matrices stored by the respective node; updating the stored error matrices stored by the respective node for the first minibatch using the corresponding quantized stripe of gradient matrices; exchanging the quantized respective stripe of the plurality of stripes of gradient matrices for the first minibatch synchronously with the plurality of node, wherein the exchanging comprises: partitioning the quantized respective stripe of gradient matrices into a plurality of data stripes; providing, to each of the plurality of nodes, a respective data stripe of the plurality of data stripes of the respective node during a first phase of the exchanging; receiving a respective data stripe from each of the other nodes during the first phase of the exchanging; aggregating the received data stripes into aggregated stripe data; transmitting, to each of the plurality of nodes, the aggregated stripe data of the respective node during a second phase of the exchanging; receiving other aggregated stripe data from each of the other nodes during the second phase of the exchanging; and recovering the full set of gradients for the computational model from the aggregated stripe data and the other aggregated stripe data; determining a respective stripe of a plurality of stripes of gradient matrices of the computational model for a second minibatch of the training data while exchanging the quantized gradient matrices for the first minibatch; and repeating the determining, quantizing, updating, and exchanging steps for each of a plurality of minibatches of the computational model, the plurality of minibatches including the first and the second minibatches. 7. A system comprising: one or more computer-readable media having thereon a plurality of modules and a computational model of an optimization problem; and a plurality of nodes, each including at least one processing unit, each processing unit operably coupled to at least one of the computer-readable media, the processing units adapted to intercommunicate and to execute modules of the plurality of modules comprising: an update-determining module configured to determine modification values of the computational model for a first minibatch of training data, the modification values comprising, for each node of the plurality of nodes, a respective stripe of a plurality stripes of gradient matrices of the computational model for the first minibatch of training data; a quantization module configured to quantize, for each node of the plurality of nodes, the respective stripe of the gradient matrices determined modification values for the first minibatch using stored error values stored by the respective node and to update the stored error values stored by the respective node for the first minibatch using the determined modification values and the quantized stripe of the gradient matrices determined modification values; a transferring module configured to transmit at least some of the quantized stripes of the gradient matrices determined modification values for the first minibatch synchronously to at least one other of the plurality of nodes, wherein the transferring module is configured to: partition each of the quantized stripes of the gradient matrices determined modification values for the first minibatch into a plurality of data stripes; provide, to each of the plurality of nodes, a respective data stripe of the plurality of data stripes of the respective node during a first phase of transferring; receive, by each of the plurality of nodes, respective data stripe from each of the other nodes during the first phase of the transferring; aggregate, for each of the plurality of nodes, the received data stripes into aggregated stripe data; transmit, to each of the plurality of nodes, the aggregated stripe data of the respective node during a second phase of the transferring; receive, by each of the plurality of nodes, other aggregated stripe data from each of the other nodes during the second phase of the transferring; and recover, by each of the plurality of nodes, the full set of gradients for the computational model from the aggregated stripe data and the other aggregated stripe data; and an updating module configured to modify the stored computational model according to the recovered full set of gradients; wherein the update-determining module is further configured to determine modification values of the computational model for a second minibatch of the training data while th

Assignees

Inventors

Classifications

  • Probabilistic graphical models, e.g. probabilistic networks · CPC title

  • Combinations of networks · CPC title

  • G06N3/084Primary

    Backpropagation, e.g. using gradient descent · CPC title

  • G06N3/082Primary

    modifying the architecture, e.g. adding, deleting or silencing nodes or connections · CPC title

  • Distributed learning, e.g. federated learning · CPC title

Patent family

Related publications grouped by family.

External sources

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US11049006B2 cover?
Techniques and constructs can reduce the time required to determine solutions to optimization problems such as training of neural networks. Modifications to a computational model can be determined by a plurality of nodes operating in parallel. Quantized modification values can be transmitted between the nodes to reduce the volume of data to be transferred. The quantized values can be as small a…
Who is the assignee on this patent?
Microsoft Technology Licensing Llc
What technology area does this patent fall under?
Primary CPC classification G06N3/084. Mapped technology areas include Physics.
When was this patent published?
Publication date Tue Jun 29 2021 00:00:00 GMT+0000 (Coordinated Universal Time) (B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 2 related publications on this page (citations in our corpus or others sharing the same primary CPC).