Service level agreement-based multi-hardware accelerated inference
US-10805179-B2 · Oct 13, 2020 · US
US12159167B2 · US · B2
| Field | Value |
|---|---|
| Publication number | US-12159167-B2 |
| Application number | US-202318215921-A |
| Country | US |
| Kind code | B2 |
| Filing date | Jun 29, 2023 |
| Priority date | Jan 23, 2020 |
| Publication date | Dec 3, 2024 |
| Grant date | Dec 3, 2024 |
Official abstract text for this publication.
A method for dynamically assigning an inference request is disclosed. A method for dynamically assigning an inference request may include determining at least one model to process an inference request on a plurality of computing platforms, the plurality of computing platforms including at least one Central Processing Unit (CPU) and at least one Graphics Processing Unit (GPU), obtaining, with at least one processor, profile information of the at least one model, the profile information including measured characteristics of the at least one model, dynamically determining a selected computing platform from between the at least one CPU and the at least one GPU for responding to the inference request based on an optimized objective associated with a status of the computing platform and the profile information, and routing, with at least one processor, the inference request to the selected computing platform. A system and computer program product are also disclosed.
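The selection step described in the abstract can be illustrated with a minimal Python sketch. It is not taken from the patent: the class names `PlatformStatus` and `ModelProfile`, the fields they carry, and the weighted scoring formula are all assumptions chosen for illustration. The sketch only shows the general shape of combining a platform's current status with a model's measured profile to pick between a CPU and a GPU, then routing the request there.

```python
from dataclasses import dataclass


@dataclass
class PlatformStatus:
    """Hypothetical snapshot of one computing platform's current state."""
    name: str            # "cpu" or "gpu"
    utilization: float   # fraction of capacity in use, 0.0 to 1.0
    power_watts: float   # current power draw


@dataclass
class ModelProfile:
    """Hypothetical measured characteristics of one model."""
    latency_ms: dict     # platform name -> measured latency in milliseconds


def score(status, profile, latency_weight=1.0, power_weight=0.01):
    """Illustrative objective (lower is better); the weights are assumptions.

    The measured latency is inflated by the platform's current load,
    then a penalty proportional to power draw is added.
    """
    headroom = max(1e-6, 1.0 - status.utilization)
    expected_latency = profile.latency_ms[status.name] / headroom
    return latency_weight * expected_latency + power_weight * status.power_watts


def select_platform(statuses, profile):
    """Pick the platform with the best (lowest) score for this model."""
    return min(statuses, key=lambda s: score(s, profile))


def route(request, statuses, profile):
    """Return a routing decision for the request; a real system would dispatch it."""
    target = select_platform(statuses, profile)
    return {"request": request, "platform": target.name}
```

With a lightly loaded GPU, the lower measured GPU latency dominates and the request goes to the GPU. If the GPU is nearly saturated, the load-inflated latency pushes the same request to the CPU, which is the kind of status-dependent decision the abstract describes.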
Opening claim text (preview).
What is claimed is:
1. A computer-implemented method, comprising: determining, with at least one processor, an objective function configured to reduce power consumption while providing a level of throughput within a power budget; statically assigning an inference request for processing based on a predefined latency threshold; determining, with at least one processor based on a platform status, at least one capability of a current state of a plurality of computing platforms; determining, with at least one processor based on profile information, at least one measured characteristic of at least one inference model; dynamically shifting, with at least one processor, the inference request to a selected graphics processing unit (GPU) bound inference model for processing by at least one of the plurality of computing platforms that maximizes the objective function based on the at least one capability of the current state of the plurality of computing platforms and a measured latency in the at least one inference model; routing, with at least one processor, the inference request to the selected GPU bound inference model; and processing the inference request by the selected GPU bound inference model.
2. The method of claim 1, further comprising: determining, with at least one processor, profile information including the at least one measured characteristic including at least one characteristic for predicting specific requirements and/or capabilities of the at least one inference model; and determining, with at least one processor based on the inference request, the at least one inference model to calculate the inference request on the plurality of computing platforms.
3. The method of claim 2, wherein the platform status includes real-time feedback including information to determine an availability, workload, and/or efficiency for processing the at least one inference model on at least one computing platform of the plurality of computing platforms, wherein the real-time feedback includes at least one capability of the plurality of computing platforms, including at least one of: a processor utilization, a memory utilization, an inference throughput, a response time, a processor speed, a power consumption, a model central processing unit (CPU) utilization, a model GPU utilization, a model RAM utilization, a model inference throughput, a model latency, a current state or any combination thereof; and wherein the profile information includes the at least one characteristic including at least one of: a size, a power efficiency, an accuracy level, a throughput of the at least one inference model, a current capacity, a bottleneck, a suitability for the inference request, a number of nodes, historical model latency data, historical completion data, or any combination thereof.
4. The method of claim 3, wherein maximizing the objective function, further comprises: determining real-time feedback and profile information for selecting the at least one inference model based on the current state and capabilities of the plurality of computing platforms and the at least one measured characteristic of the at least one inference model.
5. The method of claim 1, further comprising: determining the objective function based on a real-time latency in the at least one inference model; and generating a dynamic assignment of the inference request to a central processing unit (CPU)-bound inference model when a throughput is less than a threshold number of transactions over a predetermined period of time.
6. The method of claim 1, further comprising: determining an availability, a workload, and/or an efficiency of the plurality of computing platforms at a specified time based on real-time feedback; and determining a current capacity, bottleneck, and/or suitability for processing the inference request by the at least one inference model.
7. The method of claim 1, further comprising: obtaining, with at least one processor, profile information of the at least one inference model, the profile information including the at least one measured characteristic of the at least one inference model; and determining, with at least one processor, the platform status, wherein the platform status includes platform information associated with a current state of the plurality of computing platforms.
8. A system, comprising: at least one processor configured to: determine an objective function configured to reduce power consumption while providing a level of throughput within a power budget; statically assign an inference request for processing based on a predefined latency threshold; determine, based on a platform status, at least one capability of a current state of a plurality of computing platforms; determine, based on profile information, at least one measured characteristic of at least one inference model; dynamically shift the inference request to a selected graphics processing unit (GPU) bound inference model for processing by at least one of the plurality of computing platforms that maximizes the objective function based on the at least one capability of the current state of the plurality of computing platforms and a measured latency in the at least one inference model; route the inference request to the selected GPU bound inference model; and processing the inference request by the selected GPU bound inference model.
9. The system of claim 8, wherein the at least one processor is further configured to: determine profile information including the at least one measured characteristic including at least one characteristic for predicting specific requirements and/or capabilities of the at least one inference model; and determine, based on the inference request, the at least one inference model to calculate the inference request on the plurality of computing platforms.
10. The system of claim 9, wherein the platform status includes real-time feedback including information to determine an availability, workload, and/or efficiency for processing the at least one inference model on at least one computing platform of the plurality of computing platforms, wherein the real-time feedback includes at least one capability of the plurality of computing platforms, including at least one of: a processor utilization, a memory utilization, an inference throughput, a response time, a processor speed, a power consumption, a model central processing unit (CPU) utilization, a model GPU utilization, a model RAM utilization, a model inference throughput, a model latency, a current state or any combination thereof; and wherein the profile information includes the at least one characteristic including at least one of: a size, a power efficiency, an accuracy level, a throughput of the at least one inference model, a current capacity, a bottleneck, a suitability for the inference request, a number of nodes, historical model latency data, historical completion data, or any combination thereof.
11. The system of claim 10, wherein the at least one processor is further configured to maximize the objective function by: determining real-time feedback and profile information for selecting the at least one inference model based on the current state and capabilities of the plurality of computing platforms and the at least one measured characteristic of the at least one inference model.
12. The system of claim 8, wherein the at least one processor is further configured to: determine the objective function based on a real-time latency in the at least one inference model; and generate a dynamic assignment of the inference request to a central processing unit (CPU)-bound inference model when a throughput is less than
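The flow recited in claim 1 (a static assignment against a predefined latency threshold, followed by a dynamic shift to a GPU-bound inference model that maximizes a power-aware objective) might be sketched in Python as follows. The claims do not give a formula for the objective function, so the throughput-per-watt measure used here is an assumption, as are the `Candidate` class, its field names, and the helper function names. The CPU-bound assignment of claims 5 and 12 is not covered.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    """Hypothetical pairing of an inference model with the platform it is bound to."""
    model: str
    bound_to: str               # "cpu" or "gpu"
    throughput_rps: float       # requests per second currently sustainable
    power_watts: float          # power draw while serving at that rate
    measured_latency_ms: float  # latency measured for this model


def objective(c, power_budget_watts, min_throughput_rps):
    """Illustrative objective to maximize: throughput per watt (an assumption).

    Candidates that exceed the power budget or miss the throughput floor
    are excluded by scoring them as negative infinity.
    """
    if c.power_watts > power_budget_watts or c.throughput_rps < min_throughput_rps:
        return float("-inf")
    return c.throughput_rps / c.power_watts


def static_assignment(candidates, latency_threshold_ms):
    """Initial assignment: the first candidate meeting the predefined latency threshold."""
    for c in candidates:
        if c.measured_latency_ms <= latency_threshold_ms:
            return c
    return candidates[0]  # fall back to the first candidate if none qualifies


def dynamic_shift(current, candidates, power_budget_watts, min_throughput_rps):
    """Shift to the best GPU-bound candidate if it beats the current assignment."""
    def value(c):
        return objective(c, power_budget_watts, min_throughput_rps)

    gpu_bound = [c for c in candidates if c.bound_to == "gpu"]
    best = max(gpu_bound, key=value, default=None)
    if best is not None and value(best) > value(current):
        return best
    return current
```

Scoring over-budget candidates as negative infinity keeps the power budget a hard constraint rather than a soft penalty, so a faster GPU-bound model is never selected if it would exceed the budget. A weighted penalty would be an equally plausible reading of the claim language.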
Supervised learning · CPC title
considering software capabilities, i.e. software resources associated or available to the machine · CPC title
involving task migration · CPC title
considering hardware capabilities · CPC title
using a secondary processor, e.g. coprocessor (peripheral processor G06F13/12) · CPC title