Method, system, and computer program product for dynamically assigning an inference request to a CPU or GPU

US12159167B2 · US · B2

Patent metadata
Field	Value
Publication number	US-12159167-B2
Application number	US-202318215921-A
Country	US
Kind code	B2
Filing date	Jun 29, 2023
Priority date	Jan 23, 2020
Publication date	Dec 3, 2024
Grant date	Dec 3, 2024

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

Title: What the patent document calls the invention.
Abstract: A short plain-language summary of the technical disclosure.
Assignees and inventors: Who owns or filed the patent and who is credited as inventor.
Key dates: Filing, priority, publication, and grant dates set the timeline.
First independent claim: The legal scope of protection. Read this for what is actually claimed.
CPC / IPC classifications: Technology tags used to group this patent with similar filings.
Citations and related patents: Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

A method for dynamically assigning an inference request is disclosed. A method for dynamically assigning an inference request may include determining at least one model to process an inference request on a plurality of computing platforms, the plurality of computing platforms including at least one Central Processing Unit (CPU) and at least one Graphics Processing Unit (GPU), obtaining, with at least one processor, profile information of the at least one model, the profile information including measured characteristics of the at least one model, dynamically determining a selected computing platform from between the at least one CPU and the at least one GPU for responding to the inference request based on an optimized objective associated with a status of the computing platform and the profile information, and routing, with at least one processor, the inference request to the selected computing platform. A system and computer program product are also disclosed.
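The selection step in the abstract can be illustrated with a minimal sketch. It assumes a simple scoring function over platform status and model profile. The names `PlatformStatus`, `ModelProfile`, `score`, `select_platform`, and `route`, as well as the weights, are hypothetical and do not come from the patent, which does not disclose a specific formula on this page.

```python
from dataclasses import dataclass


@dataclass
class PlatformStatus:
    """Hypothetical snapshot of a computing platform's current state."""
    name: str            # "cpu" or "gpu"
    utilization: float   # fraction of capacity in use, 0.0 to 1.0
    power_watts: float   # current power draw


@dataclass
class ModelProfile:
    """Hypothetical measured characteristics of one model, per platform."""
    latency_ms: dict     # platform name -> measured latency in milliseconds


def score(status: PlatformStatus, profile: ModelProfile) -> float:
    """Illustrative objective: higher is better.

    Penalizes measured latency (inflated by current load) and power draw.
    The 0.1 weight is an arbitrary placeholder, not a value from the patent.
    """
    headroom = max(1.0 - status.utilization, 1e-6)
    effective_latency = profile.latency_ms[status.name] / headroom
    return -(effective_latency + 0.1 * status.power_watts)


def select_platform(statuses, profile):
    """Pick the platform whose status and profile maximize the objective."""
    return max(statuses, key=lambda s: score(s, profile)).name


def route(request, statuses, profile, handlers):
    """Send the request to the handler registered for the selected platform."""
    target = select_platform(statuses, profile)
    return target, handlers[target](request)
```

With a model measured at 40 ms on the CPU and 5 ms on the GPU, this sketch picks the GPU while it has headroom and falls back to the CPU once GPU utilization approaches saturation, because the load-adjusted latency term dominates the score.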

First claim

Opening claim text (preview).

What is claimed is:

1. A computer-implemented method, comprising: determining, with at least one processor, an objective function configured to reduce power consumption while providing a level of throughput within a power budget; statically assigning an inference request for processing based on a predefined latency threshold; determining, with at least one processor based on a platform status, at least one capability of a current state of a plurality of computing platforms; determining, with at least one processor based on profile information, at least one measured characteristic of at least one inference model; dynamically shifting, with at least one processor, the inference request to a selected graphics processing unit (GPU) bound inference model for processing by at least one of the plurality of computing platforms that maximizes the objective function based on the at least one capability of the current state of the plurality of computing platforms and a measured latency in the at least one inference model; routing, with at least one processor, the inference request to the selected GPU bound inference model; and processing the inference request by the selected GPU bound inference model.

2. The method of claim 1, further comprising: determining, with at least one processor, profile information including the at least one measured characteristic including at least one characteristic for predicting specific requirements and/or capabilities of the at least one inference model; and determining, with at least one processor based on the inference request, the at least one inference model to calculate the inference request on the plurality of computing platforms.

3. The method of claim 2, wherein the platform status includes real-time feedback including information to determine an availability, workload, and/or efficiency for processing the at least one inference model on at least one computing platform of the plurality of computing platforms, wherein the real-time feedback includes at least one capability of the plurality of computing platforms, including at least one of: a processor utilization, a memory utilization, an inference throughput, a response time, a processor speed, a power consumption, a model central processing unit (CPU) utilization, a model GPU utilization, a model RAM utilization, a model inference throughput, a model latency, a current state or any combination thereof; and wherein the profile information includes the at least one characteristic including at least one of: a size, a power efficiency, an accuracy level, a throughput of the at least one inference model, a current capacity, a bottleneck, a suitability for the inference request, a number of nodes, historical model latency data, historical completion data, or any combination thereof.

4. The method of claim 3, wherein maximizing the objective function, further comprises: determining real-time feedback and profile information for selecting the at least one inference model based on the current state and capabilities of the plurality of computing platforms and the at least one measured characteristic of the at least one inference model.

5. The method of claim 1, further comprising: determining the objective function based on a real-time latency in the at least one inference model; and generating a dynamic assignment of the inference request to a central processing unit (CPU)-bound inference model when a throughput is less than a threshold number of transactions over a predetermined period of time.

6. The method of claim 1, further comprising: determining an availability, a workload, and/or an efficiency of the plurality of computing platforms at a specified time based on real-time feedback; and determining a current capacity, bottleneck, and/or suitability for processing the inference request by the at least one inference model.

7. The method of claim 1, further comprising: obtaining, with at least one processor, profile information of the at least one inference model, the profile information including the at least one measured characteristic of the at least one inference model; and determining, with at least one processor, the platform status, wherein the platform status includes platform information associated with a current state of the plurality of computing platforms.

8. A system, comprising: at least one processor configured to: determine an objective function configured to reduce power consumption while providing a level of throughput within a power budget; statically assign an inference request for processing based on a predefined latency threshold; determine, based on a platform status, at least one capability of a current state of a plurality of computing platforms; determine, based on profile information, at least one measured characteristic of at least one inference model; dynamically shift the inference request to a selected graphics processing unit (GPU) bound inference model for processing by at least one of the plurality of computing platforms that maximizes the objective function based on the at least one capability of the current state of the plurality of computing platforms and a measured latency in the at least one inference model; route the inference request to the selected GPU bound inference model; and processing the inference request by the selected GPU bound inference model.

9. The system of claim 8, wherein the at least one processor is further configured to: determine profile information including the at least one measured characteristic including at least one characteristic for predicting specific requirements and/or capabilities of the at least one inference model; and determine, based on the inference request, the at least one inference model to calculate the inference request on the plurality of computing platforms.

10. The system of claim 9, wherein the platform status includes real-time feedback including information to determine an availability, workload, and/or efficiency for processing the at least one inference model on at least one computing platform of the plurality of computing platforms, wherein the real-time feedback includes at least one capability of the plurality of computing platforms, including at least one of: a processor utilization, a memory utilization, an inference throughput, a response time, a processor speed, a power consumption, a model central processing unit (CPU) utilization, a model GPU utilization, a model RAM utilization, a model inference throughput, a model latency, a current state or any combination thereof; and wherein the profile information includes the at least one characteristic including at least one of: a size, a power efficiency, an accuracy level, a throughput of the at least one inference model, a current capacity, a bottleneck, a suitability for the inference request, a number of nodes, historical model latency data, historical completion data, or any combination thereof.

11. The system of claim 10, wherein the at least one processor is further configured to maximize the objective function by: determining real-time feedback and profile information for selecting the at least one inference model based on the current state and capabilities of the plurality of computing platforms and the at least one measured characteristic of the at least one inference model.

12. The system of claim 8, wherein the at least one processor is further configured to: determine the objective function based on a real-time latency in the at least one inference model; and generate a dynamic assignment of the inference request to a central processing unit (CPU)-bound inference model when a throughput is less than…
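Two ideas from the claims lend themselves to a short illustration: the throughput test of claim 5 (fewer than a threshold number of transactions over a predetermined period) and the objective of claim 1 (reduce power consumption while providing a level of throughput within a power budget). The following is a rough sketch under stated assumptions, not the patented implementation. `ThroughputWindow`, `assign_model`, and `within_budget_min_power` are hypothetical names, and returning `"gpu_bound"` whenever throughput is at or above the threshold is an assumption made here for illustration; the claims do not state that rule in this form.

```python
from collections import deque


class ThroughputWindow:
    """Counts transactions seen within the last `period_s` seconds."""

    def __init__(self, period_s: float):
        self.period_s = period_s
        self._stamps = deque()

    def record(self, now: float) -> None:
        """Register one transaction observed at time `now` (seconds)."""
        self._stamps.append(now)

    def count(self, now: float) -> int:
        """Drop timestamps older than the window, then return the count."""
        while self._stamps and now - self._stamps[0] > self.period_s:
            self._stamps.popleft()
        return len(self._stamps)


def assign_model(window: ThroughputWindow, now: float, threshold: int) -> str:
    """Return "cpu_bound" when throughput is below the threshold.

    Falling back to "gpu_bound" otherwise is an assumption of this sketch.
    """
    return "cpu_bound" if window.count(now) < threshold else "gpu_bound"


def within_budget_min_power(candidates, power_budget_w, min_throughput):
    """Hypothetical objective: among candidates that meet the throughput
    level and stay within the power budget, pick the lowest power draw.

    `candidates` is a list of (name, power_watts, throughput) tuples.
    Returns None when no candidate is feasible.
    """
    feasible = [c for c in candidates
                if c[1] <= power_budget_w and c[2] >= min_throughput]
    return min(feasible, key=lambda c: c[1])[0] if feasible else None
```

The sliding window keeps only timestamps inside the period, so the throughput check reflects recent traffic. Treating the power budget and throughput level as hard constraints and minimizing power among feasible candidates is one possible reading of the claimed objective; a weighted score would be another.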

Assignees

Visa International Service Association

Inventors

Classifications

G06N3/09: Supervised learning
G06F9/5055: considering software capabilities, i.e. software resources associated or available to the machine
G06F9/5088: involving task migration
G06F9/5044 (primary): considering hardware capabilities
G06F9/3877: using a secondary processor, e.g. coprocessor (peripheral processor G06F13/12)

Patent family

Related publications grouped by family.

Patent family ID: 76969457

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US12159167B2 cover?
A method for dynamically assigning an inference request is disclosed. A method for dynamically assigning an inference request may include determining at least one model to process an inference request on a plurality of computing platforms, the plurality of computing platforms including at least one Central Processing Unit (CPU) and at least one Graphics Processing Unit (GPU), obtaining, with at…

Who is the assignee on this patent?
Visa International Service Association

What technology area does this patent fall under?
Primary CPC classification: G06F9/5044. Mapped technology areas include Physics.

When was this patent published?
Publication date: Dec 3, 2024 (kind code B2). Legal status and post-grant events are not shown on this page.

What related patents are in patentsdb?
We list 4 related publications on this page (citations in our corpus or others sharing the same primary CPC).

Related patents

Service level agreement-based multi-hardware accelerated inference

Artificial intelligence system for automated adaptation of text-based classification models for multiple languages

Dynamic accuracy-based deployment and monitoring of machine learning models in provider networks

Scalable and Flexible Job Distribution Architecture for a Hybrid Processor System to Serve High Bandwidth Real Time Computational Systems Used in Semiconductor Inspection and Metrology Systems
