Distributed training and prediction using elastic resources

US11003992B2 · US · B2

Patent metadata
Field: Value
Publication number: US-11003992-B2
Application number: US-201715785074-A
Country: US
Kind code: B2
Filing date: Oct 16, 2017
Priority date: Oct 16, 2017
Publication date: May 11, 2021
Grant date: May 11, 2021

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

In one embodiment, a method includes establishing access to first and second different computing systems. A machine learning model is assigned for training to the first computing system, and the first computing system creates a check-point during training in response to a first predefined triggering event. The check-point may be a record of an execution state in the training of the machine learning model by the first computing system. In response to a second predefined triggering event, the training of the machine learning model on the first computing system is halted, and in response to a third predefined triggering event, the training of the machine learning model is transferred to the second computing system, which continues training the machine learning model starting from the execution state recorded by the check-point.
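
The halt-and-resume flow in the abstract can be illustrated with a toy example. This is only a minimal sketch under stated assumptions, not the patented implementation: the `Checkpoint` fields, the `Worker` class, the `hand_off` function, and the toy update rule are all hypothetical names introduced here for illustration.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Checkpoint:
    """Record of the execution state of a training run (hypothetical fields)."""
    step: int
    weights: tuple
    learning_rate: float


class Worker:
    """Stand-in for a computing system that can train from a check-point."""

    def __init__(self, name):
        self.name = name

    def train(self, ckpt, steps):
        # Toy "training": each step shrinks every weight by learning_rate * weight.
        weights = ckpt.weights
        for _ in range(steps):
            weights = tuple(w - ckpt.learning_rate * w for w in weights)
        # The returned check-point records the new execution state.
        return replace(ckpt, step=ckpt.step + steps, weights=weights)


def hand_off(first, second, start, steps_before_halt, steps_after):
    """Train on `first`, halt, then continue on `second` from the check-point."""
    ckpt = first.train(start, steps_before_halt)  # state recorded when halted
    return second.train(ckpt, steps_after)        # resumes at the recorded state
```

In this sketch, training that is halted after two steps on one worker and continued for one step on another ends in the same state as three uninterrupted steps, which is the property the check-point is meant to preserve.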

First claim

Opening claim text (preview).

What is claimed is:

1. A method comprising: by a first computing system, establishing access to a second computing system and to a third computing system different than the second computing system; by the first computing system, assigning a machine learning model for training to the second computing system, wherein the second computing system is configured to create a check-point in response to a first predefined triggering event, the check-point being a record of an execution state in the training of the machine learning model by the second computing system; by the first computing system, in response to a second predefined triggering event, halting the training of the machine learning model on the second computing system; by the first computing system, adjusting the check-point based on a comparison between configurations of the second and third computing systems; and by the first computing system, in response to a third predefined triggering event, assigning to the third computing system, the machine learning model for training continuing at the execution state recorded by the check-point based on the adjusted check-point.

2. The method of claim 1, wherein: the assigning of the machine learning model for training to the second computing system includes specifying a first value for a training parameter related to the training of the machine learning model, the first value being based on select performance characteristics as determined for the second computing system; the assigning of the machine learning model for training to the third computing system includes specifying a second value for the training parameter based on said select performance characteristics as determined for the third computing system, the second value being different than the first value; and the second value is determined to cause the third computing system to produce training results similar to, within a predefined percentage range, training results achievable by the second computing system with its training parameter set to the first value.

3. The method of claim 1, wherein: the second computing system and third computing system train the machine learning model using gradient descent and configurable hyper-parameters, including a learning rate and a batch size of training samples; the assigning of the machine learning model for training to the second computing system includes specifying a first learning rate and a first batch size corresponding to the learning rate and the batch size of training samples, respectively, of the configurable hyper-parameters used by the second computing system; and the assigning of the machine learning model for training to the third computing system includes specifying a second learning rate based on a second batch size supported by the third computing system, the second learning rate being directly proportional to a batch-ratio of the second batch size to the first batch size.

4. The method of claim 3, wherein the second batch size and the first batch size differ by a range of from one to three orders of magnitude.

5. The method of claim 3, wherein the gradient descent is stochastic gradient descent, and the second learning rate is based on a product of the batch-ratio and the first learning rate.

6. The method of claim 3, wherein: the second computing system and third computing system each trains the machine learning model in a series of iterative cycles, with each iterative cycle being a full training propagation sequence through the machine learning model; and the assigning of the machine learning model for training to the third computing system includes, incrementally ramping, in discrete steps at consecutive iterative cycles, the learning rate of the third computing system starting from the first learning rate to the second learning rate.

7. The method of claim 1, wherein the first predefined triggering event is one of a regular time interval, a specified number of training iteration cycles, or an instruction from the first computing system to halt execution of the training of the machine learning model.

8. The method of claim 1, wherein: the machine learning model is defined by an operational nodal graph, where graph nodes of the operational graph model correspond to operations of the machine learning model and interconnections between graph nodes correspond to operational relationships between operations of the machine learning model; and the second computing system creates the check-point based on the operational nodal graph.

9. The method of claim 8, wherein: in response to the first predefined triggering event, the second computing system continues training the machine learning model according to the operational nodal graph, and creates the check-point when the execution state reaches a predefined execution point in the operational nodal graph.

10. The method of claim 9, wherein the predefined execution point includes at least one of: an end of a current iteration of the operational nodal graph, finishing processing of a predefined layer of nodal operations within the operational nodal graph, or finishing processing of any nodes being executed when the first predefined triggering event occurred.

11. The method of claim 1, wherein: the second computing system is characterized by a peak-usage period during which it is not to train the machine learning model; and the second predefined triggering event is based on the peak-usage period.

12. The method of claim 1, wherein the third predefined triggering event includes the third computing system becoming available.

13. The method of claim 1, wherein the third predefined triggering event is based on a determination that the third computing system is capable of training the machine learning model and is available.

14. The method of claim 13, further comprising: by the first computing system, establishing access to a plurality of said third computing systems, each having different computing characteristics; wherein the second and third predefined triggering events include, by the first computing system, determining that one of the third computing systems has computing resources more closely matching computing requirements of the machine learning model than the second computing system and is available.

15. The method of claim 1, wherein the third predefined triggering event is triggered based on a determination that the third computing system is able to meet a service-level agreement associated with the machine learning model and that the second computing system is unable to meet the service-level agreement.

16. One or more computer-readable non-transitory storage media embodying software that is operable when executed to: establish access to a first computing system and to a second computing system different than the first computing system; (ii) assign a machine learning model for training to the first computing system, wherein the first computing system is configured to create a check-point in response to a first predefined triggering event, the check-point being a record of an execution state in the training of the machine learning model by the first computing system; (iii) in response to a second predefined triggering event, halt the training of the machine learning model on the first computing system; (iv) adjust the check-point based on a comparison between configurations of the first and second computing systems; and (v) in response to a third predefined triggering event, assign to the second computing system, the machine learning model for training continuing at the execution state recorded b
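
The learning-rate adjustment described in claims 3, 5, and 6 can be sketched in a few lines. This is an illustrative reading of the claim language, not code from the patent: the function names, the `ramp_cycles` parameter, and the choice of equally spaced steps are assumptions (claim 6 requires discrete steps at consecutive cycles but does not say how they are spaced).

```python
def scaled_learning_rate(first_lr, first_batch, second_batch):
    """Claim 5 reading: second rate = batch-ratio * first rate,
    where batch-ratio = second_batch / first_batch."""
    if first_batch <= 0 or second_batch <= 0:
        raise ValueError("batch sizes must be positive")
    return first_lr * (second_batch / first_batch)


def ramp_schedule(first_lr, second_lr, ramp_cycles):
    """Claim 6 reading: per-cycle learning rates that start at first_lr and
    reach second_lr after ramp_cycles discrete steps.

    Equal step sizes are an assumption made for this sketch.
    """
    if ramp_cycles < 1:
        raise ValueError("ramp_cycles must be at least 1")
    step = (second_lr - first_lr) / ramp_cycles
    return [first_lr + step * i for i in range(ramp_cycles + 1)]
```

For example, moving from a batch size of 256 to 1024 gives a batch-ratio of 4, so a first learning rate of 0.125 scales to 0.5, and a three-step ramp passes through 0.125, 0.25, 0.375, and 0.5 on consecutive cycles.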

Assignees

Facebook Inc

Inventors

Classifications

  • G06N3/084 (Primary)

    Backpropagation, e.g. using gradient descent

  • Supervised learning

  • Distributed learning, e.g. federated learning

  • Hyperparameter optimisation; Meta-learning; Learning-to-learn

  • Quantised networks; Sparse networks; Compressed networks

Patent family

Related publications grouped by family.

External sources

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US11003992B2 cover?
In one embodiment, a method includes establishing access to first and second different computing systems. A machine learning model is assigned for training to the first computing system, and the first computing system creates a check-point during training in response to a first predefined triggering event. The check-point may be a record of an execution state in the training of the machine lear…
Who is the assignee on this patent?
Facebook Inc
What technology area does this patent fall under?
Primary CPC classification G06N3/084. Mapped technology areas include Physics.
When was this patent published?
Publication date May 11, 2021 (B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 12 related publications on this page (citations in our corpus or others sharing the same primary CPC).