Distributed Machine Learning On Heterogeneous Data Platforms
US-2019019104-A1 · Jan 17, 2019 · US
US11003992B2 · US · B2
| Field | Value |
|---|---|
| Publication number | US-11003992-B2 |
| Application number | US-201715785074-A |
| Country | US |
| Kind code | B2 |
| Filing date | Oct 16, 2017 |
| Priority date | Oct 16, 2017 |
| Publication date | May 11, 2021 |
| Grant date | May 11, 2021 |
Official abstract text for this publication.
In one embodiment, a method includes establishing access to first and second different computing systems. A machine learning model is assigned for training to the first computing system, and the first computing system creates a check-point during training in response to a first predefined triggering event. The check-point may be a record of an execution state in the training of the machine learning model by the first computing system. In response to a second predefined triggering event, the training of the machine learning model on the first computing system is halted, and in response to a third predefined triggering event, the training of the machine learning model is transferred to the second computing system, which continues training the machine learning model starting from the execution state recorded by the check-point.
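The check-point, halt, and transfer sequence in the abstract can be illustrated with a minimal Python sketch. Everything here is hypothetical and not taken from the patent: the `Trainer` and `Checkpoint` classes, the toy weight update, and the choice of triggers (a check-point every fixed number of steps as the first trigger, a fixed halt step as the second, and the second system being immediately available as the third).

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Checkpoint:
    """Record of the execution state at the moment it was taken."""
    step: int
    weights: List[float]


@dataclass
class Trainer:
    """Hypothetical stand-in for one computing system."""
    name: str
    step: int = 0
    weights: List[float] = field(default_factory=lambda: [0.0, 0.0])

    def train_step(self) -> None:
        # Toy update so the state visibly changes between steps.
        self.weights = [w + 0.1 for w in self.weights]
        self.step += 1

    def checkpoint(self) -> Checkpoint:
        return Checkpoint(self.step, list(self.weights))

    def restore(self, ckpt: Checkpoint) -> None:
        self.step = ckpt.step
        self.weights = list(ckpt.weights)


def train_with_handoff(first: Trainer, second: Trainer, total_steps: int,
                       checkpoint_every: int, halt_at_step: int) -> Checkpoint:
    """Train on `first`, halt, then continue on `second` from the last check-point."""
    ckpt = first.checkpoint()
    while first.step < halt_at_step:
        first.train_step()
        # Assumed first trigger: a fixed number of training steps has elapsed.
        if first.step % checkpoint_every == 0:
            ckpt = first.checkpoint()
    # Assumed second trigger: halt_at_step reached, so training on `first` stops.
    # Assumed third trigger: `second` is available right away.
    second.restore(ckpt)
    while second.step < total_steps:
        second.train_step()
    return ckpt


first = Trainer("system-a")
second = Trainer("system-b")
last_ckpt = train_with_handoff(first, second, total_steps=10,
                               checkpoint_every=3, halt_at_step=7)
print(first.step, last_ckpt.step, second.step)
```

In this sketch the second system resumes from the last recorded check-point rather than from the exact step at which the first system was halted, so any steps taken after that check-point are repeated on the second system.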
Opening claim text (preview).
What is claimed is:
1. A method comprising: by a first computing system, establishing access to a second computing system and to a third computing system different than the second computing system; by the first computing system, assigning a machine learning model for training to the second computing system, wherein the second computing system is configured to create a check-point in response to a first predefined triggering event, the check-point being a record of an execution state in the training of the machine learning model by the second computing system; by the first computing system, in response to a second predefined triggering event, halting the training of the machine learning model on the second computing system; by the first computing system, adjusting the check-point based on a comparison between configurations of the second and third computing systems; and by the first computing system, in response to a third predefined triggering event, assigning to the third computing system, the machine learning model for training continuing at the execution state recorded by the check-point based on the adjusted check-point.
2. The method of claim 1, wherein: the assigning of the machine learning model for training to the second computing system includes specifying a first value for a training parameter related to the training of the machine learning model, the first value being based on select performance characteristics as determined for the second computing system; the assigning of the machine learning model for training to the third computing system includes specifying a second value for the training parameter based on said select performance characteristics as determined for the third computing system, the second value being different than the first value; and the second value is determined to cause the third computing system to produce training results similar to, within a predefined percentage range, training results achievable by the second computing system with its training parameter set to the first value.
3. The method of claim 1, wherein: the second computing system and third computing system train the machine learning model using gradient descent and configurable hyper-parameters, including a learning rate and a batch size of training samples; the assigning of the machine learning model for training to the second computing system includes specifying a first learning rate and a first batch size corresponding to the learning rate and the batch size of training samples, respectively, of the configurable hyper-parameters used by the second computing system; and the assigning of the machine learning model for training to the third computing system includes specifying a second learning rate based on a second batch size supported by the third computing system, the second learning rate being directly proportional to a batch-ratio of the second batch size to the first batch size.
4. The method of claim 3, wherein the second batch size and the first batch size differ by a range of from one to three orders of magnitude.
5. The method of claim 3, wherein the gradient descent is stochastic gradient descent, and the second learning rate is based on a product of the batch-ratio and the first learning rate.
6. The method of claim 3, wherein: the second computing system and third computing system each trains the machine learning model in a series of iterative cycles, with each iterative cycle being a full training propagation sequence through the machine learning model; and the assigning of the machine learning model for training to the third computing system includes, incrementally ramping, in discrete steps at consecutive iterative cycles, the learning rate of the third computing system starting from the first learning rate to the second learning rate.
7. The method of claim 1, wherein the first predefined triggering event is one of a regular time interval, a specified number of training iteration cycles, or an instruction from the first computing system to halt execution of the training of the machine learning model.
8. The method of claim 1, wherein: the machine learning model is defined by an operational nodal graph, where graph nodes of the operational graph model correspond to operations of the machine learning model and interconnections between graph nodes correspond to operational relationships between operations of the machine learning model; and the second computing system creates the check-point based on the operational nodal graph.
9. The method of claim 8, wherein: in response to the first predefined triggering event, the second computing system continues training the machine learning model according to the operational nodal graph, and creates the check-point when the execution state reaches a predefined execution point in the operational nodal graph.
10. The method of claim 9, wherein the predefined execution point includes at least one of: an end of a current iteration of the operational nodal graph, finishing processing of a predefined layer of nodal operations within the operational nodal graph, or finishing processing of any nodes being executed when the first predefined triggering event occurred.
11. The method of claim 1, wherein: the second computing system is characterized by a peak-usage period during which it is not to train the machine learning model; and the second predefined triggering event is based on the peak-usage period.
12. The method of claim 1, wherein the third predefined triggering event includes the third computing system becoming available.
13. The method of claim 1, wherein the third predefined triggering event is based on a determination that the third computing system is capable of training the machine learning model and is available.
14. The method of claim 13, further comprising: by the first computing system, establishing access to a plurality of said third computing systems, each having different computing characteristics; wherein the second and third predefined triggering events include, by the first computing system, determining that one of the third computing systems has computing resources more closely matching computing requirements of the machine learning model than the second computing system and is available.
15. The method of claim 1, wherein the third predefined triggering event is triggered based on a determination that the third computing system is able to meet a service-level agreement associated with the machine learning model and that the second computing system is unable to meet the service-level agreement.
16. One or more computer-readable non-transitory storage media embodying software that is operable when executed to: establish access to a first computing system and to a second computing system different than the first computing system; (ii) assign a machine learning model for training to the first computing system, wherein the first computing system is configured to create a check-point in response to a first predefined triggering event, the check-point being a record of an execution state in the training of the machine learning model by the first computing system; (iii) in response to a second predefined triggering event, halt the training of the machine learning model on the first computing system; (iv) adjust the check-point based on a comparison between configurations of the first and second computing systems; and (v) in response to a third predefined triggering event, assign to the second computing system, the machine learning model for training continuing at the execution state recorded b
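The learning-rate relationship in claims 3, 5, and 6 can be worked through with a short Python sketch. The function names, the example numbers, and the evenly spaced (linear) ramp are assumptions for illustration; the claims only require that the second learning rate be proportional to the batch-ratio (claim 5: based on the product of the batch-ratio and the first learning rate) and that the ramp proceed in discrete steps at consecutive iterative cycles.

```python
from typing import List


def scaled_learning_rate(first_lr: float, first_batch: int, second_batch: int) -> float:
    """Second learning rate as the product of the batch-ratio and the first rate."""
    if first_batch <= 0 or second_batch <= 0:
        raise ValueError("batch sizes must be positive")
    batch_ratio = second_batch / first_batch
    return first_lr * batch_ratio


def ramp_schedule(first_lr: float, second_lr: float, cycles: int) -> List[float]:
    """One learning rate per consecutive cycle, from first_lr up to second_lr.

    Assumption: the steps are evenly spaced; the claims do not fix the step size.
    """
    if cycles < 2:
        raise ValueError("need at least two cycles to ramp")
    delta = (second_lr - first_lr) / (cycles - 1)
    return [first_lr + delta * i for i in range(cycles)]


# Hypothetical values: the second system used batches of 256, the third supports 1024.
first_lr = 0.1
second_lr = scaled_learning_rate(first_lr, first_batch=256, second_batch=1024)
schedule = ramp_schedule(first_lr, second_lr, cycles=4)
print(second_lr, schedule)
```

With these example values the batch-ratio is 4, so the second learning rate is roughly four times the first, and the schedule steps from the first rate to the second over four cycles.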
Backpropagation, e.g. using gradient descent · CPC title
Supervised learning · CPC title
Distributed learning, e.g. federated learning · CPC title
Hyperparameter optimisation; Meta-learning; Learning-to-learn · CPC title
Quantised networks; Sparse networks; Compressed networks · CPC title