Robust gradient weight compression schemes for deep learning applications

US11295208B2 · US · B2

Patent metadata
Field	Value
Publication number	US-11295208-B2
Application number	US-201715830170-A
Country	US
Kind code	B2
Filing date	Dec 4, 2017
Priority date	Dec 4, 2017
Publication date	Apr 5, 2022
Grant date	Apr 5, 2022

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

Title
What the patent document calls the invention.
Abstract
A short plain-language summary of the technical disclosure.
Assignees and inventors
Who owns or filed the patent and who is credited as inventor.
Key dates
Filing, priority, publication, and grant dates set the timeline.
First independent claim
The legal scope of protection — read this for what is actually claimed.
CPC / IPC classifications
Technology tags used to group this patent with similar filings.
Citations and related patents
Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

Embodiments of the present invention provide a computer-implemented method for adaptive residual gradient compression for training of a deep learning neural network (DNN). The method includes obtaining, by a first learner, a current gradient vector for a neural network layer of the DNN, in which the current gradient vector includes gradient weights of parameters of the neural network layer that are calculated from a mini-batch of training data. A current residue vector is generated that includes residual gradient weights for the mini-batch. A compressed current residue vector is generated based on dividing the residual gradient weights of the current residue vector into a plurality of bins of a uniform size and quantizing a subset of the residual gradient weights of one or more bins of the plurality of bins. The compressed current residue vector is then transmitted to a second learner of the plurality of learners or to a parameter server.
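
As a rough illustration of the flow the abstract describes (accumulate a residue, split it into uniform bins, quantize a subset, keep the rest for the next mini-batch), here is a minimal Python sketch. The function name `compress_residue`, the rule of picking the single largest-magnitude entry per bin, and the sign-times-mean-magnitude quantizer are all assumptions made for illustration; the abstract does not specify them, and this is not the patented implementation.

```python
from typing import Dict, List, Tuple


def compress_residue(
    prior_residue: List[float],
    gradient: List[float],
    bin_size: int,
) -> Tuple[Dict[int, float], List[float]]:
    """Return (sparse message to transmit, residue carried to the next mini-batch)."""
    # Current residue = prior residue + current gradient, as the first claim states.
    residue = [r + g for r, g in zip(prior_residue, gradient)]

    # ASSUMPTION: the quantized subset is the largest-magnitude entry of each bin.
    selected = []
    for start in range(0, len(residue), bin_size):
        stop = min(start + bin_size, len(residue))  # last bin may be shorter
        selected.append(max(range(start, stop), key=lambda i: abs(residue[i])))

    if not selected:  # empty layer: nothing to send
        return {}, residue

    # ASSUMPTION: quantize each selected entry to sign * mean selected magnitude.
    level = sum(abs(residue[i]) for i in selected) / len(selected)
    message = {i: (level if residue[i] >= 0 else -level) for i in selected}

    # Whatever was not transmitted stays in the residue and accumulates.
    next_residue = list(residue)
    for i, quantized in message.items():
        next_residue[i] -= quantized
    return message, next_residue


if __name__ == "__main__":
    msg, carry = compress_residue([0.0] * 4, [0.1, -0.4, 0.2, 0.6], bin_size=2)
    print(msg)    # indices and quantized values that would be transmitted
    print(carry)  # residue kept locally for the next mini-batch
```

Only the sparse `message` would be sent to another learner or a parameter server; the returned residue is what makes the scheme "residual", because untransmitted gradient mass is carried forward rather than dropped.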

First claim

Opening claim text (preview).

What is claimed is:

1. A computer-implemented method for adaptive residual gradient compression for training of a deep learning neural network (DNN), the computer implemented method comprising: obtaining, by a processor of a first learner of a plurality of learners, a current gradient vector for a neural network layer of the DNN, wherein the current gradient vector comprises gradient weights of parameters of the neural network layer that are calculated by training the neural network layer of the DNN using a mini-batch of training data, wherein training the neural network layer of the DNN using the mini-batch of training data comprises: receiving the training data comprising a plurality of input samples; determining a mini-batch from the training data; performing a forward pass and a backward pass through the DNN to calculate a current gradient vector; and updating one or more gradient weights for the DNN based on the current gradient vector; generating, by the processor, a current residue vector comprising residual gradient weights for the mini-batch, wherein generating the current residue vector comprises summing a prior residue vector and the current gradient vector; generating, by the processor, a compressed current residue vector based at least in part on dividing the residual gradient weights of the current residue vector into a plurality of bins of a uniform size and quantizing a subset of the residual gradient weights of one or more bins of the plurality of bins, wherein quantizing the subset of the residual gradient weights is based at least in part on calculating a scaling parameter for the mini-batch and calculating a local maximum of each bin, wherein the uniform size of the bins is a hyper-parameter of the DNN; transmitting, by the processor, the compressed current residue vector to a second learner of the plurality of learners; and updating, at each of the plurality of learners, the gradient weights of the parameters of the neural network layer.

2. The computer-implemented method of claim 1, wherein generating the compressed current residue vector comprises: generating, by the processor, a scaled current residue vector comprising scaled residual gradient weights for the mini batch, wherein generating the scaled current residue vector comprises multiplying the current gradient vector by the scaling parameter and summing the prior residue vector with the multiplied gradient vector; dividing the residual gradient weights of the current residue vector into the plurality of bins of the uniform size; identifying, for each bin of the plurality of bins, a local maximum of the absolute value of the residual gradient weights of the bin; determining, for each residual gradient weight of each bin, that a corresponding scaled residual gradient weight of the scaled residue vector exceeds the local maximum of the bin; and upon identifying, for each residual gradient weight of each bin, that the corresponding scaled residual gradient weight of the scaled residue vector exceeds the local maximum of the bin, generating a quantizing value for the give residual gradient weight and updating the current residue vector by substituting the residual gradient weight of the current residue vector with the quantized value.

3. The computer-implemented method of claim 2, wherein the scale parameter is calculated by minimizing quantization error according to L2 normalization.

4. The computer-implemented method of claim 2, wherein: the DNN includes one or more convolution network layers; and the size of the plurality of bins is set to 50 for the one or more convolution layers.

5. The computer-implemented method of claim 2, wherein: the DNN includes at least one of more fully connected layers; and the size of the bins is set to 500 for the one or more fully connected layers.

6. A system for adaptive residual gradient compression for training of a deep learning neural network (DNN), the system comprising a plurality of learners, wherein at least one leaner of the plurality of learners is configured to perform a method comprising: obtaining a current gradient vector for a neural network layer of the DNN, wherein the current gradient vector comprises gradient weights of parameters of the neural network layer that are calculated by training the neural network layer of the DNN using a mini-batch of training data, wherein training the neural network layer of the DNN using the mini-batch of training data comprises: receiving the training data comprising a plurality of input samples; determining a mini-batch from the training data; performing a forward pass and a backward pass through the DNN to calculate a current gradient vector; and updating one or more gradient weights for the DNN based on the current gradient vector; generating a current residue vector comprising residual gradient weights for the mini-batch, wherein generating the current residue vector comprises summing a prior residue vector and the current gradient vector; generating a compressed current residue vector based at least in part on dividing the residual gradient weights of the current residue vector into a plurality of bins of a uniform size and quantizing a subset of the residual gradient weights of one or more bins of the plurality of bins, wherein quantizing the subset of the residual gradient weights is based at least in part on calculating a scaling parameter for the mini-batch and calculating a local maximum of each bin, wherein the uniform size of the bins is a hyper-parameter of the DNN; transmitting the compressed current residue vector to a second learner of the plurality of learners; and updating, at each of the plurality of learners, the gradient weights of the parameters of the neural network layer.

7. The system of claim 6, wherein generating the compressed current residue vector comprises: generating, by the processor, a scaled current residue vector comprising scaled residual gradient weights for the mini batch, wherein generating the scaled current residue vector comprises multiplying the current gradient vector by the scaling parameter and summing the prior residue vector with the multiplied gradient vector; dividing the residual gradient weights of the current residue vector into the plurality of bins of the uniform size; identifying, for each bin of the plurality of bins, a local maximum of the absolute value of the residual gradient weights of the bin; determining, for each residual gradient weight of each bin, that a corresponding scaled residual gradient weight of the scaled residue vector exceeds the local maximum of the bin; and upon identifying, for each residual gradient weight of each bin, that the corresponding scaled residual gradient weight of the scaled residue vector exceeds the local maximum of the bin, generating a quantizing value for the give residual gradient weight and updating the current residue vector by substituting the residual gradient weight of the current residue vector with the quantized value.

8. The system of claim 7, wherein the scale parameter is calculated by minimizing quantization error according to L2 normalization.

9. The system of claim 7, wherein: the DNN includes one or more convolution network layers; and the size of the plurality of bins is set to 50 for the one or more convolution layers.

10. The system of claim 7, wherein: the DNN includes at least one of more fully connected layers; and the size of the bins is set to 500 for the one or more fully connected layers.

11. A computer program product for adaptive residual gradient compression for training of a deep learning neural network (DNN), the computer program product comprising a comp
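
The selection rule recited in claims 2 and 7 can be sketched in a few lines of Python. This is one possible reading of the claim language, not the patentee's code: the function name `select_for_quantization` and the strict `>` comparison (taken from the word "exceeds") are assumptions, and the scaling parameter is passed in as a plain argument rather than derived as in claims 3 and 8.

```python
from typing import List


def select_for_quantization(
    prior_residue: List[float],
    gradient: List[float],
    bin_size: int,
    scale: float,
) -> List[int]:
    """Return indices whose scaled residue exceeds the local maximum of their bin."""
    # Current residue: prior residue + current gradient.
    residue = [r + g for r, g in zip(prior_residue, gradient)]
    # Scaled residue: prior residue + scale * current gradient.
    scaled = [r + scale * g for r, g in zip(prior_residue, gradient)]

    selected: List[int] = []
    for start in range(0, len(residue), bin_size):
        stop = min(start + bin_size, len(residue))  # last bin may be shorter
        # Local maximum of the absolute residue within this bin.
        local_max = max(abs(residue[i]) for i in range(start, stop))
        # ASSUMPTION: "exceeds" is read as a strict comparison.
        selected.extend(i for i in range(start, stop) if abs(scaled[i]) > local_max)
    return selected


if __name__ == "__main__":
    prior = [0.5, 0.1, -0.2, 0.05]
    grad = [0.3, 0.0, -0.5, 0.1]
    print(select_for_quantization(prior, grad, bin_size=2, scale=2.0))
```

Under this reading, a scale of 1 makes the scaled residue identical to the current residue, so no entry can strictly exceed its bin's maximum and nothing is selected; a larger scale favors entries whose latest gradient reinforces the accumulated residue. The dependent claims set the bin size to 50 for convolution layers and 500 for fully connected layers.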

Assignees

IBM

Inventors

Classifications

G06N3/044
Recurrent networks, e.g. Hopfield networks · CPC title
G06N3/045
Combinations of networks · CPC title
G06N3/084 (Primary)
Backpropagation, e.g. using gradient descent · CPC title
G06N3/0495
Quantised networks; Sparse networks; Compressed networks · CPC title
G06N3/098
Distributed learning, e.g. federated learning · CPC title

Patent family

Related publications grouped by family.

Patent family ID: 66659264

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US11295208B2 cover?: Embodiments of the present invention provide a computer-implemented method for adaptive residual gradient compression for training of a deep learning neural network (DNN). The method includes obtaining, by a first learner, a current gradient vector for a neural network layer of the DNN, in which the current gradient vector includes gradient weights of parameters of the neural network layer that…
Who is the assignee on this patent?: IBM
What technology area does this patent fall under?: Primary CPC classification G06N3/084. Mapped technology areas include Physics.
When was this patent published?: Publication date Apr 5, 2022 (B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?: We list 7 related publications on this page (citations in our corpus or others sharing the same primary CPC).

Related patents

Predictive Modeling from Distributed Datasets

Mirror deep neural networks that regularize to linear networks

Method and apparatus for neural network quantization

Efficient training of neural networks

Systems and methods for combining stochastic average gradient and hessian-free optimization for sequence training of deep neural networks

Asynchronous stochastic gradient descent

Discriminative pretraining of deep neural networks
