Systems and Methods for Efficient Data Preprocessing of Machine Learning Workloads
US-2024403138-A1 · Dec 5, 2024 · US
US10474501B2 · US · B2
| Field | Value |
|---|---|
| Publication number | US-10474501-B2 |
| Application number | US-201715581987-A |
| Country | US |
| Kind code | B2 |
| Filing date | Apr 28, 2017 |
| Priority date | Apr 28, 2017 |
| Publication date | Nov 12, 2019 |
| Grant date | Nov 12, 2019 |
Official abstract text for this publication.
A system for cluster resource allocation includes an interface and a processor. The interface is configured to receive a process and input data. The processor is configured to determine an estimate for resources required for the process to process the input data; determine existing available resources in a cluster for running the process; determine whether the existing available resources are sufficient for running the process; in the event it is determined that the existing available resources are not sufficient for running the process, indicate to add new resources; determine an allocated share of resources in the cluster for running the process; and cause execution of the process using the share of resources.
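The decision flow in the abstract can be illustrated with a minimal sketch. It is not the patented implementation: the `Cluster` class, the `plan_allocation` function, and the use of worker counts as the only resource are hypothetical simplifications, and setting the allocated share equal to the estimate is an assumption made for brevity.

```python
from dataclasses import dataclass


@dataclass
class Cluster:
    """Hypothetical cluster model: resources are counted in workers only."""
    total_workers: int      # workers currently in the cluster
    allocated_workers: int  # workers already allocated to other processes

    def available(self) -> int:
        # Existing available resources: workers not allocated elsewhere.
        return self.total_workers - self.allocated_workers


def plan_allocation(cluster: Cluster, estimated_workers: int) -> dict:
    """Compare the estimate with available resources and plan the share."""
    shortfall = max(0, estimated_workers - cluster.available())
    return {
        # Sufficient only when no extra workers are needed.
        "sufficient": shortfall == 0,
        # "Indicate to add new resources" when the cluster falls short.
        "workers_to_add": shortfall,
        # Assumption: the allocated share equals the estimate.
        "allocated_share": estimated_workers,
    }


plan = plan_allocation(Cluster(total_workers=10, allocated_workers=7),
                       estimated_workers=5)
print(plan)
```

In this sketch a cluster with 3 free workers and an estimate of 5 yields a request for 2 additional workers before the process is handed to the cluster driver.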
Opening claim text (preview).
What is claimed is:
1. A system for cluster resource allocation, comprising: an interface configured to: receive a process and input data; and a processor configured to: determine an estimate for resources required for the process to process the input data, comprising to: determine a parallelism level for one or more steps of the process, the parallelism level including a quantity of data per cluster worker, a number of cluster workers, or a quantity of data per processor; determine an implementation type for one or more steps of the process, the implementation type including a sort algorithm, a join algorithm, or a filter algorithm; and determine, based on historical usage data, the estimate for resources required for the process using the parallelism level and the implementation type; determine existing available resources in a cluster for running the process; determine whether the existing available resources are sufficient for running the process based at least in part on the estimate for resources required; in response to determining that the existing available resources are not sufficient for running the process, indicate to add new resources to a cluster; determine an allocated share of resources in the cluster for running the process; cause execution of the process using the allocated share of resources, comprising to: provide an indication of the share of resources, the process, and the input data to a cluster driver; and provide an indication to the cluster driver to execute the process of the input data using the share of resources.
2. The system of claim 1, wherein determining existing available resources in a cluster for running the process comprises determining resources that are not allocated to other processes.
3. The system of claim 1, wherein determining existing available resources in a cluster for running the process comprises determining resources that are not currently being used by other processes.
4. The system of claim 1, wherein determining existing available resources in a cluster for running the process comprises determining the estimated required resources for other processes running in the cluster.
5. The system of claim 1, wherein the estimate for resources required for the process to process the input data is based at least in part on one or more of the following: the process, the input data, or the historical usage data.
6. The system of claim 1, wherein the processor is further configured to: in response to determining that the existing available resources are not sufficient for running the process, indicate to add new resources if possible.
7. The system of claim 1, wherein the processor is further configured to: in response to determining that the existing available resources are more than sufficient for running the process, indicate to remove resources.
8. The system of claim 1, wherein the process comprises an associated scheduled running time.
9. The system of claim 8, wherein the processor is further configured to: in response to determining that the existing available resources are not sufficient for running the process, indicate to add new resources in advance of the scheduled running time.
10. The system of claim 1, wherein the allocated share of resources in the cluster comprises one or more of the following: a share determined by dividing the cluster resources evenly among users, a share determined by dividing the cluster resources according to a predetermined user allocation, an equal share for each running process, or a maximum amount of cluster resources.
11. The system of claim 1, wherein the allocated share of resources comprises one or more of the following: a number of worker machines, a number of processors, a number of processes, an amount of memory, a number of disks, a number of virtual machines, or a number of containers.
12. The system of claim 1, wherein the allocated share of resources comprises one of the following: an amount of resources, a fraction of resources, or a priority.
13. The system of claim 1, wherein the processor is further configured to: determine that the share of resources being used is greater than the allocated share of resources.
14. The system of claim 13, wherein the processor is further configured to: reduce a second process share of resources associated with a second process.
15. The system of claim 14, wherein the second process share of resources associated with the second process is not reduced below a threshold share.
16. The system of claim 13, wherein the processor is further configured to: indicate to add resources.
17. The system of claim 16, wherein resources are limited to a customer maximum allocation.
18. The system of claim 1 wherein the driver is shared between users of a customer.
19. The system of claim 1 wherein the driver is shared between users of different customers.
20. The system of claim 1, wherein the threshold comprises one or more of a percentage of the allocated resources or a number of resources.
21. A method for cluster resource allocation, comprising: receiving a process and input data; determining, using a processor, an estimate for resources required for the process to process the input data, comprising: determining a parallelism level for one or more steps of the process, the parallelism level including a quantity of data per cluster worker, a number of cluster workers, or a quantity of data per processor; determining an implementation type for one or more steps of the process, the implementation type including a sort algorithm a join algorithm or a filter algorithm; and determining, based on historical usage data, the estimate for resources required for the process using the parallelism level and the implementation type; determining existing available resources in a cluster for running the process; determining whether the existing available resources are sufficient for running the process based at least in part on the estimate for resources required; in response to determining that the existing available resources are not sufficient for running the process, indicating to add new resources to a cluster; determining an allocated share of resources in the cluster for running the process; causing execution of the process using the allocated share of resources, comprising: providing an indication of the share of resources, the process, and the input data to a cluster driver; and providing an indication to the cluster driver to execute the process of the input data using the share of resources.
22. A computer program product for cluster resource allocation, the computer program product being embodied in a non-transitory computer readable storage medium and comprising computer instructions for: receiving a process and input data; determining an estimate for resources required for the process to process the input data, comprising: determining a parallelism level for one or more steps of the process, the parallelism level including a quantity of data per cluster worker, a number of cluster workers, or a quantity of data per processor; determining an implementation type for one or more steps of the process, the implementation type including a sort algorithm a join algorithm or a filter algorithm; and determining, based on historical usage data, the estimate for resources required for the process using the parallelism level and the implementation type; determining existing available resource
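The estimation step recited in claim 1 (a parallelism level plus an implementation type, combined with historical usage data) might look like the following sketch. Everything here is an assumption for illustration: the `HISTORY` table, its figures, the implementation names, and the choice to average past memory use per worker are hypothetical and are not taken from the patent.

```python
import math
from statistics import mean

# Hypothetical historical usage data:
# (implementation type, GB of input per worker) -> observed memory per worker, in GB.
HISTORY = {
    ("merge_sort", 4): [6.0, 7.0, 6.5],
    ("hash_join", 4): [9.0, 10.0],
}


def estimate_resources(implementation: str, gb_per_worker: int,
                       input_gb: float) -> tuple[int, float]:
    """Return (number of workers, total memory in GB) for one process step."""
    # Parallelism level: a fixed quantity of data per cluster worker.
    workers = math.ceil(input_gb / gb_per_worker)
    # Historical usage for this implementation type at this parallelism level.
    memory_per_worker = mean(HISTORY[(implementation, gb_per_worker)])
    return workers, workers * memory_per_worker


print(estimate_resources("merge_sort", 4, 20.0))
print(estimate_resources("hash_join", 4, 20.0))
```

Averaging is only one possible way to use the history; a percentile or a fitted model would serve the same role of turning past runs into an estimate.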
Pool · CPC title
Partitioning or combining of resources · CPC title
Clust · CPC title
to service a request · CPC title