Intelligent compute resource selection for machine learning training jobs

US11537439B1 · US · B1

Patent metadata
Field: Value
Publication number: US-11537439-B1
Application number: US-201815934046-A
Country: US
Kind code: B1
Filing date: Mar 23, 2018
Priority date: Nov 22, 2017
Publication date: Dec 27, 2022
Grant date: Dec 27, 2022

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

Techniques for intelligent compute resource selection and utilization for machine learning training jobs are described. At least a portion of a machine learning (ML) training job is executed a plurality of times using a plurality of different resource configurations, where each of the plurality of resource configurations includes at least a different type or amount of compute instances. A performance metric is measured for each of the plurality of the executions, and can be used along with a desired performance characteristic to generate a recommended resource configuration for the ML training job. The ML training job is executed using the recommended resource configuration.
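The profile-then-recommend loop described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the patented implementation: the names `ResourceConfig`, `Measurement`, `profile`, and `recommend`, the instance labels, the prices, and the simulated timings are all hypothetical. The claims describe generating the recommendation with an ML model trained on metrics from other training jobs; this sketch substitutes a plain minimum over the measured metric to keep the example short.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class ResourceConfig:
    """A candidate configuration: a type and an amount of compute instances."""
    instance_type: str    # hypothetical label, not a real instance name
    instance_count: int
    hourly_price: float   # assumed price per instance-hour


@dataclass(frozen=True)
class Measurement:
    """Performance metrics measured for one partial execution."""
    config: ResourceConfig
    seconds: float
    cost: float


def profile(run_partial_job: Callable[[ResourceConfig], float],
            configs: List[ResourceConfig]) -> List[Measurement]:
    """Execute a portion of the training job once per configuration."""
    measurements = []
    for config in configs:
        seconds = run_partial_job(config)  # returns elapsed seconds
        cost = seconds / 3600.0 * config.hourly_price * config.instance_count
        measurements.append(Measurement(config, seconds, cost))
    return measurements


def recommend(measurements: List[Measurement], desired: str) -> ResourceConfig:
    """Pick the configuration that best meets the desired characteristic.

    Assumption: a simple minimum stands in for the trained ML model
    that the claims describe.
    """
    metric = {"time": lambda m: m.seconds, "cost": lambda m: m.cost}[desired]
    return min(measurements, key=metric).config


# Simulated timings stand in for real partial training runs.
SIMULATED_SECONDS = {("small", 1): 3600.0, ("large", 4): 1200.0}


def fake_partial_job(config: ResourceConfig) -> float:
    return SIMULATED_SECONDS[(config.instance_type, config.instance_count)]


configs = [ResourceConfig("small", 1, 1.0), ResourceConfig("large", 4, 3.0)]
measurements = profile(fake_partial_job, configs)
fastest = recommend(measurements, "time")
cheapest = recommend(measurements, "cost")
```

With these made-up numbers, the four-instance cluster finishes sooner but costs more, so the recommendation changes depending on whether training time or training cost is the desired performance characteristic.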

First claim

Opening claim text (preview).

What is claimed is:

1. A computer-implemented method comprising: receiving a request to analyze a machine learning (ML) training job, wherein the request identifies a type of ML model to be trained and further indicates training data to be used for the ML training job; executing at least a portion of the ML training job a plurality of times using the training data and a plurality of different resource configurations, the plurality of different resource configurations including a different type, a different amount, or both a different type and a different amount, of virtual machines; measuring, for each of the plurality of times the at least a portion of the ML training job is executed, a measured performance metric for each of the plurality of different resource configurations; generating, using a ML model trained on measured performance metrics from a plurality of other ML training jobs, and using the measured performance metrics from the ML training job as input to the ML model, and based at least in part on a desired performance characteristic, one or more recommended resource configurations for the ML training job; causing data describing the one or more recommended resource configurations for the ML training job to be transmitted to an electronic device of a user; receiving a request to execute the ML training job from the electronic device of the user, wherein the request to execute the ML training job identifies a selected resource configuration from among the one or more recommended resource configurations; and executing the ML training job using the selected resource configuration.

2. The computer-implemented method of claim 1, wherein the selected resource configuration includes a plurality of virtual machines acting as a cluster.

3. The computer-implemented method of claim 1, wherein the desired performance characteristic is one of: an amount of training time for the ML training job; an amount of training cost for the ML training job; or an accuracy of a model being trained by the ML training job.

4. A computer-implemented method comprising: executing at least a portion of a desired machine learning (ML) training job a plurality of times using training data and a plurality of different resource configurations, the plurality of different resource configurations including a different type, a different amount, or both a different type and a different amount, of virtual machines; measuring, for each of the plurality of times the at least a portion of the desired ML training job is executed, a measured performance metric for each of the plurality of different resource configurations; generating, using a ML model trained on measured performance metrics from a plurality of other ML training jobs, and using the measured performance metrics from the desired ML training job as input to the ML model, and based at least in part on a desired performance characteristic, a recommended resource configuration for the desired ML training job; sending data indicating one or more recommended resource configurations for the desired ML training job to an electronic device of a user, the one or more recommended resource configurations including the recommended resource configuration; receiving a request to execute the desired ML training job from the electronic device of the user, wherein the request to execute the desired ML training job identifies the recommended resource configuration; and executing the desired ML training job using the recommended resource configuration.

5. The computer-implemented method of claim 4, wherein the desired performance characteristic is one of: an amount of training time for the desired ML training job; an amount of training cost for the desired ML training job; or an accuracy of a model being trained by the desired ML training job.

6. The computer-implemented method of claim 5, wherein the request to execute the desired ML training job identifies the desired performance characteristic.

7. The computer-implemented method of claim 6, wherein the request to execute the desired ML training job does not identify a type or amount of compute instances to be used for the desired ML training job.

8. The computer-implemented method of claim 4, further comprising receiving a request from the electronic device of the user regarding the desired ML training job, the request regarding the desired ML training job identifying a type of ML model to be trained.

9. The computer-implemented method of claim 8, wherein the request regarding the desired ML training job further indicates the training data to be used for the desired ML training job.

10. The computer-implemented method of claim 4, wherein the generating of the recommended resource configuration for the desired ML training job is performed by a resource analysis engine of a training configuration system in a provider network.

11. The computer-implemented method of claim 4, further comprising: training, based at least in part on the measured performance metrics from the plurality of other ML training jobs, the ML model; and receiving a request from the electronic device of the user regarding the desired ML training job, the request regarding the desired ML training job indicating the training data to be used for the desired ML training job.

12. The computer-implemented method of claim 4, wherein the recommended resource configuration includes a plurality of compute instances acting as a cluster.

13. A system comprising: a model training system implemented by a first one or more electronic devices; and a training configuration system implemented by a second one or more electronic devices, the training configuration system including second instructions that upon execution cause the training configuration system to: execute, via the model training system, at least a portion of a desired machine learning (ML) training job a plurality of times using training data and a plurality of different resource configurations, the plurality of different resource configurations including a different type, a different amount, or both a different type and a different amount, of virtual machines; measure, for each of the plurality of times the at least a portion of the desired ML training job is executed, a measured performance metric for each of the plurality of different resource configurations; generate, using a ML model trained on measured performance metrics from a plurality of other ML training jobs, and using the measured performance metrics from the desired ML training job as input to the ML model, and based at least in part on a desired performance characteristic, a recommended resource configuration for the desired ML training job; send data indicating one or more recommended resource configurations for the desired ML training job to an electronic device of a user, the one or more recommended resource configurations including the recommended resource configuration; receive a request to execute the desired ML training job from the electronic device of the user, wherein the request to execute the desired ML training job identifies the recommended resource configuration; and execute, via the model training system, the desired ML training job using the recommended resource configuration.

14. The system of claim 13, wherein the desired performance characteristic is one of: an amount of training time for the desired ML training job; an amount of training cost for the desired ML training job; or an accuracy of a model being trained by the desired ML training job.

15. The system of claim 14, wherein the second instructi

Assignees

Amazon Tech Inc

Inventors

Classifications

  • G06F9/5027 (Primary): the resource being a machine, e.g. CPUs, Servers, Terminals

  • Scheduling strategies for dispatcher, e.g. round robin, multi-level priority queues

  • Machine learning

  • Monitor

  • Performance criteria

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US11537439B1 cover?
Techniques for intelligent compute resource selection and utilization for machine learning (ML) training jobs. At least a portion of an ML training job is executed a plurality of times using different resource configurations, each with a different type or amount of compute instances. A performance metric is measured for each execution and used, along with a desired performance characteristic, to generate a recommended resource configuration, which is then used to execute the ML training job.
Who is the assignee on this patent?
Amazon Tech Inc
What technology area does this patent fall under?
Primary CPC classification G06F9/5027. Mapped technology areas include Physics.
When was this patent published?
Publication date: Dec 27, 2022 (kind code B1). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 4 related publications on this page (citations in our corpus or others sharing the same primary CPC).