System and method for managing data processing systems hosting distributed inference models
US-2024177025-A1 · May 30, 2024 · US
US12526230B2 · US · B2
| Field | Value |
|---|---|
| Publication number | US-12526230-B2 |
| Application number | US-202318518902-A |
| Country | US |
| Kind code | B2 |
| Filing date | Nov 24, 2023 |
| Priority date | Nov 24, 2023 |
| Publication date | Jan 13, 2026 |
| Grant date | Jan 13, 2026 |
Official abstract text for this publication.
Load aware routing is performed for requests to managed network endpoints for heterogeneous machine learning models. A request to generate an inference is received via a managed network endpoint that invokes a specified machine learning model. Workloads of the different hosts for respective replicas of the machine learning model are evaluated to select one of the hosts to perform the request.
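As a rough illustration of the routing step the abstract describes, the sketch below picks the least-loaded host among those holding a replica of the requested model. It is a minimal sketch under stated assumptions, not the patented implementation: the `HostLoad` record, the `select_host` function, the host and model names, and the use of in-flight request counts as the workload metric (the metric named in claims 3 and 10) are all hypothetical choices made for the example.

```python
from dataclasses import dataclass


@dataclass
class HostLoad:
    """Hypothetical record of one host's workload for a model replica."""
    host: str
    inflight: int  # assumed workload metric: in-flight inference requests


def select_host(replicas: dict[str, list[HostLoad]], model: str) -> str:
    """Return the host with the fewest in-flight requests for `model`."""
    candidates = replicas.get(model)
    if not candidates:
        raise LookupError(f"no replica of {model!r} behind this endpoint")
    # min() returns the first minimum, so ties go to the earliest-listed host.
    return min(candidates, key=lambda h: h.inflight).host


# One endpoint fronting two different models, each replicated on two hosts.
replicas = {
    "model-a": [HostLoad("host-1", 4), HostLoad("host-2", 1)],
    "model-b": [HostLoad("host-2", 0), HostLoad("host-3", 7)],
}
chosen = select_host(replicas, "model-a")
```

Here the router would forward the `model-a` request to `host-2`, the candidate with the lowest in-flight count. A production router would also need fresh load data and a tie-breaking policy, neither of which this sketch addresses.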
Opening claim text (preview).
The invention claimed is:

1. A system, comprising: a plurality of computing devices, respectively comprising at least one processor and a memory, that implement a machine learning service, wherein the machine learning service is configured to: host a managed network endpoint, wherein the managed network endpoint provides access to a plurality of different machine learning models hosted at one or more of a plurality of computing resources associated with the managed network endpoint, including the machine learning model, via requests to invoke specified ones of the plurality of different machine learning models received from one or more clients of the machine learning service; receive, via the managed network endpoint of the machine learning service, a request to generate an inference using a specified machine learning model of the plurality of different machine learning models associated with the managed network endpoint; evaluate, at a router for the managed network endpoint, respective workloads of different hosts for respective replicas of the specified machine learning model, the different hosts being associated with the managed network endpoint; based on the evaluation, select, by the router, one of the different hosts to perform the request; and forward, by the router, the request to generate the inference using the respective replica of the specified machine learning model to the selected one host.

2. The system of claim 1, wherein to select the one of the different hosts to perform the request, the machine learning service is configured to apply a selection strategy specified via an interface of the machine learning service.

3. The system of claim 1, wherein the respective workloads of different hosts for respective replicas of the specified machine learning model comprises respective numbers of inflight inference requests obtained from the different hosts.

4. The system of claim 1, wherein the managed network endpoint is created in response to one or more requests to create the managed network endpoint and add the plurality of different machine learning models to the managed network endpoint, received via an interface of the machine learning service.

5. A method, comprising: receiving, via a managed network endpoint of a machine learning service, a request to generate an inference using a specified machine learning model of a plurality of machine learning models associated with the managed network endpoint; evaluating, by the machine learning service, respective workloads of different hosts for respective replicas of the specified machine learning model, the different hosts being associated with the managed network endpoint; based on the evaluating, selecting, by the machine learning service, one of the different hosts to perform the request; and performing, by the selected one of the different hosts, the request to generate the inference using the respective replica of the specified machine learning model.

6. The method of claim 5, further comprising obtaining, at least part of the respective workloads, from a model registry for the machine learning service to update a model deployment cache for a router of the machine learning service.

7. The method of claim 6, wherein the evaluating the respective model workloads comprises accessing the model deployment cache maintained at the router that includes the respective workloads of the different hosts for the respective replicas of the specified machine learning model.

8. The method of claim 5, wherein the selecting the one of the different hosts to perform the request comprises applying a selection strategy specified via an interface of the machine learning service.

9. The method of claim 5, wherein the selecting the one of the different hosts to perform the request comprises applying a weighted random selection to account for more than one replica of the specified machine learning model being hosted at the different hosts.

10. The method of claim 5, wherein the respective workloads of different hosts for respective replicas of the specified machine learning model comprises respective numbers of inflight inference requests obtained from the different hosts.

11. The method of claim 5, wherein selection of the one host to perform the request is further based on a determination that the request is associated with a sticky session.

12. The method of claim 5, selecting the one host after a prior attempt to send the request to another one of the different hosts failed.

13. The method of claim 5, wherein the managed network endpoint is created in response to one or more requests to create the managed network endpoint and add the plurality of different machine learning models to the managed network endpoint, received via an interface of the machine learning service.

14. One or more non-transitory, computer-readable storage media, storing program instructions that when executed on or across one or more computing devices cause the one or more computing devices to implement a machine learning service that implements: receiving, via a managed network endpoint of the machine learning service, a request to generate an inference using a specified machine learning model of a plurality of machine learning models associated with the managed network endpoint; evaluating respective workloads of different hosts for respective replicas of the specified machine learning model, the different hosts being associated with the managed network endpoint; based on the evaluating, selecting one of the different hosts to perform the request; and causing performance of the request to generate the inference using the respective replica of the specified machine learning model.

15. The one or more non-transitory, computer-readable storage media of claim 14, storing further program instructions that when executed on or across the one or more computing devices, cause the one or more computing devices to further implement obtaining, at least part of the respective workloads, from a model registry for the machine learning service to update a model deployment cache for a router of the machine learning service.

16. The one or more non-transitory, computer-readable storage media of claim 14, wherein the selecting the one of the different hosts to perform the request comprises applying a selection strategy specified via an interface of the machine learning service.

17. The one or more non-transitory, computer-readable storage media of claim 14, wherein the selecting the one of the different hosts to perform the request comprises applying a weighted random selection to account for more than one replica of the specified machine learning model being hosted at the different hosts.

18. The one or more non-transitory, computer-readable storage media of claim 14, wherein the respective workloads of different hosts for respective replicas of the specified machine learning model comprises respective numbers of inflight inference requests obtained from the different hosts.

19. The one or more non-transitory, computer-readable storage media of claim 14, wherein selection of the one host to perform the request is further based on a determination that the request is associated with a sticky session.

20. The one or more non-transitory, computer-readable storage media of claim 14, wherein the managed network endpoint is created in response to one or more requests to create the managed network endpoint and add the plurality of different mac
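Several dependent claims name selection behaviors that are easier to follow in code: weighted random selection (claims 9 and 17), sticky sessions (claims 11 and 19), and choosing another host after a failed attempt (claim 12). The sketch below combines the three under assumptions the claims do not spell out: each host's weight is taken to be the number of replicas it holds, a sticky session is represented by a previously used host name, and a failed send is signalled by a callback returning `False`. The names `pick_weighted`, `route`, `send`, and `sticky_host` are hypothetical.

```python
import random
from typing import Callable, Optional


def pick_weighted(replica_counts: dict[str, int], rng: random.Random,
                  exclude: frozenset[str] = frozenset()) -> str:
    """Pick a host at random, weighted by its replica count (assumed weight)."""
    hosts = [h for h, n in replica_counts.items() if n > 0 and h not in exclude]
    if not hosts:
        raise LookupError("no eligible host left")
    weights = [replica_counts[h] for h in hosts]
    return rng.choices(hosts, weights=weights, k=1)[0]


def route(replica_counts: dict[str, int], send: Callable[[str], bool],
          rng: random.Random, sticky_host: Optional[str] = None) -> str:
    """Try the sticky host first, then fall back to weighted random picks."""
    tried: set[str] = set()
    # Sticky session: prefer the host that served this session before,
    # provided it still holds at least one replica.
    has_sticky = replica_counts.get(sticky_host, 0) > 0
    candidate = sticky_host if has_sticky else None
    while True:
        if candidate is None:
            # Raises LookupError once every eligible host has been tried.
            candidate = pick_weighted(replica_counts, rng, frozenset(tried))
        if send(candidate):
            return candidate
        # The attempt failed: exclude this host and select another one.
        tried.add(candidate)
        candidate = None


counts = {"host-1": 2, "host-2": 1}  # replicas of one model per host
rng = random.Random(0)
sticky_result = route(counts, lambda h: True, rng, sticky_host="host-2")
retry_result = route(counts, lambda h: h != "host-1", rng)
```

In the usage lines, the first call returns the sticky host because it accepts the request, and the second call ends on `host-2` whichever host is drawn first, since `host-1` always refuses. How the actual service weights hosts or detects session affinity is not stated in the claim text shown here.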
based on parameters of servers, e.g. available memory or workload (monitoring of computer activity G06F11/30) · CPC title
Techniques for rebalancing the load in a distributed system · CPC title
Flow based routing · CPC title
involving task migration · CPC title
considering the load · CPC title