Determining a number of nodes required in a networked virtualization system based on increasing node density
US-2020026576-A1 · Jan 23, 2020 · US
US11537439B1 · US · B1
| Field | Value |
|---|---|
| Publication number | US-11537439-B1 |
| Application number | US-201815934046-A |
| Country | US |
| Kind code | B1 |
| Filing date | Mar 23, 2018 |
| Priority date | Nov 22, 2017 |
| Publication date | Dec 27, 2022 |
| Grant date | Dec 27, 2022 |
Official abstract text for this publication.
Techniques for intelligent compute resource selection and utilization for machine learning training jobs are described. At least a portion of a machine learning (ML) training job is executed a plurality of times using a plurality of different resource configurations, where each of the plurality of resource configurations includes at least a different type or amount of compute instances. A performance metric is measured for each of the plurality of the executions, and can be used along with a desired performance characteristic to generate a recommended resource configuration for the ML training job. The ML training job is executed using the recommended resource configuration.
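The flow in the abstract (run part of the job under several resource configurations, measure a metric for each run, then recommend a configuration against a desired characteristic) can be illustrated with a minimal sketch. Everything named here is an assumption made for illustration and does not come from the patent: the `ResourceConfig` and `Measurement` types, the `profile` and `recommend` functions, the instance names, the prices, and the fixed timings that stand in for real partial executions. The selection rule is a plain minimum over measured values, a deliberately simpler stand-in for whatever recommendation logic an actual system would use.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class ResourceConfig:
    """Hypothetical description of one resource configuration."""
    instance_type: str
    instance_count: int
    hourly_price: float  # assumed price per instance-hour, illustrative only


@dataclass(frozen=True)
class Measurement:
    config: ResourceConfig
    seconds: float  # measured wall-clock time of the partial run
    cost: float     # that time converted to dollars at the assumed price


def profile(configs: List[ResourceConfig],
            run_partial_job: Callable[[ResourceConfig], float]) -> List[Measurement]:
    """Execute a portion of the job once per configuration and record metrics."""
    results = []
    for cfg in configs:
        seconds = run_partial_job(cfg)
        cost = seconds / 3600.0 * cfg.hourly_price * cfg.instance_count
        results.append(Measurement(cfg, seconds, cost))
    return results


def recommend(measurements: List[Measurement], desired: str) -> ResourceConfig:
    """Return the configuration with the lowest value of the desired metric."""
    metric = {"time": lambda m: m.seconds, "cost": lambda m: m.cost}
    if desired not in metric:
        raise ValueError(f"unsupported characteristic: {desired}")
    return min(measurements, key=metric[desired]).config


# Stand-in for real partial executions: fixed, made-up timings in seconds.
FAKE_TIMINGS = {("small", 1): 120.0, ("small", 4): 40.0, ("large", 1): 30.0}


def fake_run(cfg: ResourceConfig) -> float:
    return FAKE_TIMINGS[(cfg.instance_type, cfg.instance_count)]


configs = [
    ResourceConfig("small", 1, 0.10),
    ResourceConfig("small", 4, 0.10),
    ResourceConfig("large", 1, 0.90),
]
measurements = profile(configs, fake_run)
fastest = recommend(measurements, "time")
cheapest = recommend(measurements, "cost")
print(fastest, cheapest)
```

With these made-up numbers the fastest and the cheapest configurations differ, which is why the desired performance characteristic has to be an input to the recommendation rather than an afterthought.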
Opening claim text (preview).
What is claimed is:
1. A computer-implemented method comprising: receiving a request to analyze a machine learning (ML) training job, wherein the request identifies a type of ML model to be trained and further indicates training data to be used for the ML training job; executing at least a portion of the ML training job a plurality of times using the training data and a plurality of different resource configurations, the plurality of different resource configurations including a different type, a different amount, or both a different type and a different amount, of virtual machines; measuring, for each of the plurality of times the at least a portion of the ML training job is executed, a measured performance metric for each of the plurality of different resource configurations; generating, using a ML model trained on measured performance metrics from a plurality of other ML training jobs, and using the measured performance metrics from the ML training job as input to the ML model, and based at least in part on a desired performance characteristic, one or more recommended resource configurations for the ML training job; causing data describing the one or more recommended resource configurations for the ML training job to be transmitted to an electronic device of a user; receiving a request to execute the ML training job from the electronic device of the user, wherein the request to execute the ML training job identifies a selected resource configuration from among the one or more recommended resource configurations; and executing the ML training job using the selected resource configuration.
2. The computer-implemented method of claim 1, wherein the selected resource configuration includes a plurality of virtual machines acting as a cluster.
3. The computer-implemented method of claim 1, wherein the desired performance characteristic is one of: an amount of training time for the ML training job; an amount of training cost for the ML training job; or an accuracy of a model being trained by the ML training job.
4. A computer-implemented method comprising: executing at least a portion of a desired machine learning (ML) training job a plurality of times using training data and a plurality of different resource configurations, the plurality of different resource configurations including a different type, a different amount, or both a different type and a different amount, of virtual machines; measuring, for each of the plurality of times the at least a portion of the desired ML training job is executed, a measured performance metric for each of the plurality of different resource configurations; generating, using a ML model trained on measured performance metrics from a plurality of other ML training jobs, and using the measured performance metrics from the desired ML training job as input to the ML model, and based at least in part on a desired performance characteristic, a recommended resource configuration for the desired ML training job; sending data indicating one or more recommended resource configurations for the desired ML training job to an electronic device of a user, the one or more recommended resource configurations including the recommended resource configuration; receiving a request to execute the desired ML training job from the electronic device of the user, wherein the request to execute the desired ML training job identifies the recommended resource configuration; and executing the desired ML training job using the recommended resource configuration.
5. The computer-implemented method of claim 4, wherein the desired performance characteristic is one of: an amount of training time for the desired ML training job; an amount of training cost for the desired ML training job; or an accuracy of a model being trained by the desired ML training job.
6. The computer-implemented method of claim 5, wherein the request to execute the desired ML training job identifies the desired performance characteristic.
7. The computer-implemented method of claim 6, wherein the request to execute the desired ML training job does not identify a type or amount of compute instances to be used for the desired ML training job.
8. The computer-implemented method of claim 4, further comprising receiving a request from the electronic device of the user regarding the desired ML training job, the request regarding the desired ML training job identifying a type of ML model to be trained.
9. The computer-implemented method of claim 8, wherein the request regarding the desired ML training job further indicates the training data to be used for the desired ML training job.
10. The computer-implemented method of claim 4, wherein the generating of the recommended resource configuration for the desired ML training job is performed by a resource analysis engine of a training configuration system in a provider network.
11. The computer-implemented method of claim 4, further comprising: training, based at least in part on the measured performance metrics from the plurality of other ML training jobs, the ML model; and receiving a request from the electronic device of the user regarding the desired ML training job, the request regarding the desired ML training job indicating the training data to be used for the desired ML training job.
12. The computer-implemented method of claim 4, wherein the recommended resource configuration includes a plurality of compute instances acting as a cluster.
13. A system comprising: a model training system implemented by a first one or more electronic devices; and a training configuration system implemented by a second one or more electronic devices, the training configuration system including second instructions that upon execution cause the training configuration system to: execute, via the model training system, at least a portion of a desired machine learning (ML) training job a plurality of times using training data and a plurality of different resource configurations, the plurality of different resource configurations including a different type, a different amount, or both a different type and a different amount, of virtual machines; measure, for each of the plurality of times the at least a portion of the desired ML training job is executed, a measured performance metric for each of the plurality of different resource configurations; generate, using a ML model trained on measured performance metrics from a plurality of other ML training jobs, and using the measured performance metrics from the desired ML training job as input to the ML model, and based at least in part on a desired performance characteristic, a recommended resource configuration for the desired ML training job; send data indicating one or more recommended resource configurations for the desired ML training job to an electronic device of a user, the one or more recommended resource configurations including the recommended resource configuration; receive a request to execute the desired ML training job from the electronic device of the user, wherein the request to execute the desired ML training job identifies the recommended resource configuration; and execute, via the model training system, the desired ML training job using the recommended resource configuration.
14. The system of claim 13, wherein the desired performance characteristic is one of: an amount of training time for the desired ML training job; an amount of training cost for the desired ML training job; or an accuracy of a model being trained by the desired ML training job.
15. The system of claim 14, wherein the second instructi
the resource being a machine, e.g. CPUs, Servers, Terminals · CPC title
Scheduling strategies for dispatcher, e.g. round robin, multi-level priority queues · CPC title
Machine learning · CPC title
Monitor · CPC title
Performance criteria · CPC title