Training a model using parameter server shards

US9218573B1 · US · B1

Patent metadata
Field               Value
Publication number  US-9218573-B1
Application number  US-201313826327-A
Country             US
Kind code           B1
Filing date         Mar 14, 2013
Priority date       May 22, 2012
Publication date    Dec 22, 2015
Grant date          Dec 22, 2015

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

Methods, systems, and apparatus, including computer programs encoded on computer storage media, for training a model using parameter server shards. One of the methods includes receiving, at a parameter server shard configured to maintain values of a disjoint partition of the parameters of the model, a succession of respective requests for parameter values from each of a plurality of replicas of the model; in response to each request, downloading a current value of each requested parameter to the replica from which the request was received; receiving a succession of uploads, each upload including respective delta values for each of the parameters in the partition maintained by the shard; and updating values of the parameters in the partition maintained by the parameter server shard repeatedly based on the uploads of delta values to generate current parameter values.
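
The shard-side behavior described in the abstract (serve current values on request, accept uploads of delta values, update the partition) can be illustrated with a minimal sketch. This is not the patent's implementation: the class name `ParameterServerShard`, its method names, the string parameter names, and the single fixed learning rate are all assumptions made for illustration. The update step uses the rule stated in claim 9 (updated value = current value − learning rate × received delta).

```python
from typing import Dict, Iterable


class ParameterServerShard:
    """Hypothetical shard holding one disjoint partition of the parameters."""

    def __init__(self, initial_values: Dict[str, float], learning_rate: float = 0.1):
        # Copy so the caller's dict is not mutated by later updates.
        self.values = dict(initial_values)
        self.learning_rate = learning_rate

    def download(self, names: Iterable[str]) -> Dict[str, float]:
        """Return the current value of each requested parameter."""
        return {name: self.values[name] for name in names}

    def upload(self, deltas: Dict[str, float]) -> None:
        """Apply uploaded delta values to parameters in this partition."""
        for name, delta in deltas.items():
            if name not in self.values:
                raise KeyError(f"parameter {name!r} is not in this shard's partition")
            # Claim 9 rule: p_u = p_c - alpha * delta_p_r
            self.values[name] -= self.learning_rate * delta


shard = ParameterServerShard({"w0": 1.0, "w1": 2.0}, learning_rate=0.5)
shard.upload({"w0": 2.0})
current = shard.download(["w0", "w1"])
```

In this sketch each call to `upload` is applied immediately, so a replica that downloads afterwards sees values that already reflect other replicas' uploads.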

First claim

Opening claim text (preview).

What is claimed is:

1. A system for training a model having parameters by determining a respective parameter value for each of the parameters of the model, the system comprising: a plurality of identical model replicas, wherein each of the plurality of replicas is an identical instance of the model with possibly different parameter values for the parameters of the model, wherein each model replica executes on a respective computing unit, wherein each model replica is configured to operate independently of each other model replica, and wherein each model replica is further configured to perform repeatedly the following operations: receiving, from at least one of a plurality of parameter server shards, current values of one or more of the parameters of the model, wherein each parameter server shard is configured to maintain values of a respective disjoint partition of the parameters of the model; computing respective delta values for each of a plurality of the parameters of the model by performing one or more iterations of a training process; and providing, for each of the plurality of parameters, the delta value for the parameter to the parameter server shard that is configured to maintain the respective partition that includes the parameter.

2. The system of claim 1, wherein the training process is a stochastic gradient descent process.

3. The system of claim 2, wherein: performing one or more iterations of the stochastic gradient descent process comprises obtaining a respective batch of training data; and computing the respective delta values for each of the plurality of parameters comprises computing a gradient of an objective function for the model based on the initial values and the batch of training data.

4. The system of claim 1, wherein: performing one or more iterations of the training process comprises obtaining a respective batch of training data; and computing the respective delta values for each of the plurality of parameters comprises computing a gradient of an objective function for the model based on the initial values and the batch of training data.

5. The system of claim 3, wherein each model replica obtains a different sequence of training data.

6. The system of claim 3, wherein each model replica obtains different training data.

7. The system of claim 3, wherein receiving current values of one or more of the plurality of parameters comprises: identifying one or more parameters for which current values are necessary to perform the one or more iterations of the training process; identifying one or more parameter server shards that are configured to maintain values of the one or more parameters; and requesting parameter values only from the one or more parameter server shards.

8. The system of claim 1, further comprising: the plurality of parameter server shards, wherein each shard is configured to perform repeatedly the following operations asynchronously with respect to every other shard: receive a succession of respective requests for parameter values from each of the plurality of replicas of the model; in response to each request, download a current value of each requested parameter to the replica from which the request was received; receive, from each of the plurality of replicas, a succession of uploads, each upload including respective delta values for each of the parameters in the partition maintained by the shard; and update values of the parameters in the partition maintained by the parameter server shard repeatedly based on the uploads of delta values to generate current parameter values.

9. The system of claim 8, wherein the updated value of a parameter (p_u) satisfies: p_u = p_c − α × Δp_r, wherein p_c is a current value of the parameter, α is a learning rate, and Δp_r is a received delta value for the parameter.

10. The system of claim 9, wherein the learning rate is an adaptive learning rate that varies between parameters.

11. The system of claim 9, wherein the learning rate is an adaptive learning rate that varies between iterations of the training process.

12. A method for training a model having parameters by determining a respective parameter value for each of the parameters of the model, the method comprising: receiving, from at least one of a plurality of parameter server shards and at a model replica of a plurality of model replicas, current values of one or more of the parameters of the model, wherein each parameter server shard is configured to maintain values of a respective disjoint partition of the parameters of the model, and wherein each of the plurality of replicas is an identical instance of the model with possibly different parameter values for the parameters of the model; computing, by the model replica, respective delta values for each of a plurality of the parameters of the model by performing one or more iterations of a training process; and providing, by the model replica and for each of the plurality of parameters, the delta value for the parameter to the parameter server shard that is configured to maintain the respective partition that includes the parameter.

13. The method of claim 12, wherein the training process is a stochastic gradient descent process.

14. The method of claim 13, wherein: performing one or more iterations of the stochastic gradient descent process comprises obtaining a respective batch of training data; and computing the respective delta values for each of the plurality of parameters comprises computing a gradient of an objective function for the model based on the initial values and the batch of training data.

15. The method of claim 12, wherein: performing one or more iterations of the training process comprises obtaining a respective batch of training data; and computing the respective delta values for each of the plurality of parameters comprises computing a gradient of an objective function for the model based on the initial values and the batch of training data.

16. The method of claim 14, wherein each model replica obtains a different sequence of training data.

17. The method of claim 14, wherein each model replica obtains different training data.

18. The method of claim 14, wherein receiving current values of one or more of the plurality of parameters comprises: identifying one or more parameters for which current values are necessary to perform the one or more iterations of the training process; identifying one or more parameter server shards that are configured to maintain values of the one or more parameters; and requesting parameter values only from the one or more parameter server shards.

19. The method of claim 14, further comprising: receiving, at a parameter server shard of the plurality of parameter server shards, a succession of respective requests for parameter values from each of the plurality of replicas of the model; in response to each request, downloading, by the parameter server shard, a current value of each requested parameter to the replica from which the request was received; receiving, at the parameter server shard and from each of the plurality of replicas, a succession of uploads, each upload including respective delta values for each of the parameters in the partition maintained by the shard; and updating, by the parameter server shard, values of the parameters in the partition maintained by the parameter server shard repeatedly based on the uploads of delta values to generate current parameter values.

20. The method of cla
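
The replica-side cycle recited in the claims above (identify the needed parameters and the shards that hold them, fetch current values, compute delta values as a gradient over a batch, and send each delta to the owning shard) can be sketched as follows. This is an illustrative sketch under stated assumptions, not the patented system: shards are modeled as plain dictionaries, the model is a hypothetical sparse linear model with a squared-error objective, and the names `owner_of` and `replica_step` are invented here. The shard-side update is applied inline using the rule from claim 9.

```python
from typing import Dict, List, Tuple

# Assumption: each shard is a dict mapping parameter name -> current value,
# and the dicts have disjoint key sets (one partition per shard).
Shard = Dict[str, float]
Example = Tuple[Dict[str, float], float]  # (sparse features, target)


def owner_of(name: str, shards: List[Shard]) -> int:
    """Return the index of the shard whose partition holds the parameter."""
    for index, shard in enumerate(shards):
        if name in shard:
            return index
    raise KeyError(name)


def replica_step(shards: List[Shard], batch: List[Example],
                 learning_rate: float) -> Dict[str, float]:
    """Run one fetch / compute / upload cycle and return the deltas sent."""
    # Identify only the parameters this batch needs.
    needed = sorted({name for features, _ in batch for name in features})
    # Request current values only from the shards that own those parameters.
    params = {name: shards[owner_of(name, shards)][name] for name in needed}
    # Delta values: gradient of the mean of 0.5 * (prediction - target)^2.
    deltas = {name: 0.0 for name in needed}
    for features, target in batch:
        prediction = sum(params[n] * x for n, x in features.items())
        error = prediction - target
        for n, x in features.items():
            deltas[n] += error * x / len(batch)
    # Upload each delta to its owning shard, which applies
    # p_u = p_c - alpha * delta (done inline in this sketch).
    for name, delta in deltas.items():
        shard = shards[owner_of(name, shards)]
        shard[name] = shard[name] - learning_rate * delta
    return deltas


shards = [{"w0": 0.0}, {"w1": 0.0}]
batch = [({"w0": 1.0}, 2.0)]
sent = replica_step(shards, batch, learning_rate=0.5)
```

Because the batch only touches `w0`, the second shard is never contacted, which mirrors the "requesting parameter values only from the one or more parameter server shards" language.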

Assignees

Google Inc

Inventors

Classifications

  • Generating training patterns; Bootstrap methods, e.g. bagging or boosting · CPC title

  • Probabilistic graphical models, e.g. probabilistic networks · CPC title

  • based on the proximity to a decision surface, e.g. support vector machines · CPC title

  • Quantised networks; Sparse networks; Compressed networks · CPC title

  • Supervised learning · CPC title

Patent family

Related publications grouped by family.

External sources

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US9218573B1 cover?
Methods, systems, and apparatus, including computer programs encoded on computer storage media, for training a model using parameter server shards. One of the methods includes receiving, at a parameter server shard configured to maintain values of a disjoint partition of the parameters of the model, a succession of respective requests for parameter values from each of a plurality of replicas of…
Who is the assignee on this patent?
Google Inc
What technology area does this patent fall under?
Primary CPC classification G06N99/005. Mapped technology areas include Physics.
When was this patent published?
Publication date Dec 22, 2015 (B1). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 8 related publications on this page (citations in our corpus or others sharing the same primary CPC).