Load aware routing for heterogeneous machine learning models access via a common network endpoint

US12526230B2 · US · B2

Patent metadata
Field	Value
Publication number	US-12526230-B2
Application number	US-202318518902-A
Country	US
Kind code	B2
Filing date	Nov 24, 2023
Priority date	Nov 24, 2023
Publication date	Jan 13, 2026
Grant date	Jan 13, 2026

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

Title
What the patent document calls the invention.
Abstract
A short plain-language summary of the technical disclosure.
Assignees and inventors
Who owns or filed the patent and who is credited as inventor.
Key dates
Filing, priority, publication, and grant dates set the timeline.
First independent claim
The legal scope of protection — read this for what is actually claimed.
CPC / IPC classifications
Technology tags used to group this patent with similar filings.
Citations and related patents
Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

Load aware routing is performed for requests to managed network endpoints for heterogeneous machine learning models. A request to generate an inference is received via a managed network endpoint that invokes a specified machine learning model. Workloads of the different hosts for respective replicas of the machine learning model are evaluated to select one of the hosts to perform the request.
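As a rough illustration of the idea in the abstract, the sketch below picks a host for one request by comparing in-flight request counts, the workload measure named in claims 3, 10, and 18. It is not the patented implementation: the `Host` record, the `select_host` function, and the host and model names are hypothetical, and "fewest in-flight requests wins" is only one possible selection strategy.

```python
from dataclasses import dataclass


@dataclass
class Host:
    """Hypothetical record for a host serving one replica of a model."""
    name: str
    inflight: int  # in-flight inference requests reported by the host


def select_host(replicas: dict[str, list[Host]], model: str) -> Host:
    """Pick the least-loaded host that holds a replica of `model`."""
    hosts = replicas.get(model)
    if not hosts:
        raise LookupError(f"no replica of {model!r} behind this endpoint")
    # min() returns the first minimum, so ties go to the earliest-listed host.
    return min(hosts, key=lambda h: h.inflight)


# One endpoint fronting two different models, each with its own replicas.
replicas = {
    "model-a": [Host("host-1", 4), Host("host-2", 1)],
    "model-b": [Host("host-2", 1), Host("host-3", 0)],
}
chosen = select_host(replicas, "model-a")
print(chosen.name)  # host-2
```

The lookup is keyed by model name because a single endpoint serves several different models, so the router first narrows the candidates to hosts with a replica of the requested model and only then compares load.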

First claim

Opening claim text (preview).

The invention claimed is:

1. A system, comprising: a plurality of computing devices, respectively comprising at least one processor and a memory, that implement a machine learning service, wherein the machine learning service is configured to: host a managed network endpoint, wherein the managed network endpoint provides access to a plurality of different machine learning models hosted at one or more of a plurality of computing resources associated with the managed network endpoint, including the machine learning model, via requests to invoke specified ones of the plurality of different machine learning models received from one or more clients of the machine learning service; receive, via the managed network endpoint of the machine learning service, a request to generate an inference using a specified machine learning model of the plurality of different machine learning models associated with the managed network endpoint; evaluate, at a router for the managed network endpoint, respective workloads of different hosts for respective replicas of the specified machine learning model, the different hosts being associated with the managed network endpoint; based on the evaluation, select, by the router, one of the different hosts to perform the request; and forward, by the router, the request to generate the inference using the respective replica of the specified machine learning model to the selected one host.

2. The system of claim 1, wherein to select the one of the different hosts to perform the request, the machine learning service is configured to apply a selection strategy specified via an interface of the machine learning service.

3. The system of claim 1, wherein the respective workloads of different hosts for respective replicas of the specified machine learning model comprises respective numbers of inflight inference requests obtained from the different hosts.

4. The system of claim 1, wherein the managed network endpoint is created in response to one or more requests to create the managed network endpoint and add the plurality of different machine learning models to the managed network endpoint, received via an interface of the machine learning service.

5. A method, comprising: receiving, via a managed network endpoint of a machine learning service, a request to generate an inference using a specified machine learning model of a plurality of machine learning models associated with the managed network endpoint; evaluating, by the machine learning service, respective workloads of different hosts for respective replicas of the specified machine learning model, the different hosts being associated with the managed network endpoint; based on the evaluating, selecting, by the machine learning service, one of the different hosts to perform the request; and performing, by the selected one of the different hosts, the request to generate the inference using the respective replica of the specified machine learning model.

6. The method of claim 5, further comprising obtaining, at least part of the respective workloads, from a model registry for the machine learning service to update a model deployment cache for a router of the machine learning service.

7. The method of claim 6, wherein the evaluating the respective model workloads comprises accessing the model deployment cache maintained at the router that includes the respective workloads of the different hosts for the respective replicas of the specified machine learning model.

8. The method of claim 5, wherein the selecting the one of the different hosts to perform the request comprises applying a selection strategy specified via an interface of the machine learning service.

9. The method of claim 5, wherein the selecting the one of the different hosts to perform the request comprises applying a weighted random selection to account for more than one replica of the specified machine learning model being hosted at the different hosts.

10. The method of claim 5, wherein the respective workloads of different hosts for respective replicas of the specified machine learning model comprises respective numbers of inflight inference requests obtained from the different hosts.

11. The method of claim 5, wherein selection of the one host to perform the request is further based on a determination that the request is associated with a sticky session.

12. The method of claim 5, selecting the one host after a prior attempt to send the request to another one of the different hosts failed.

13. The method of claim 5, wherein the managed network endpoint is created in response to one or more requests to create the managed network endpoint and add the plurality of different machine learning models to the managed network endpoint, received via an interface of the machine learning service.

14. One or more non-transitory, computer-readable storage media, storing program instructions that when executed on or across one or more computing devices cause the one or more computing devices to implement a machine learning service that implements: receiving, via a managed network endpoint of the machine learning service, a request to generate an inference using a specified machine learning model of a plurality of machine learning models associated with the managed network endpoint; evaluating respective workloads of different hosts for respective replicas of the specified machine learning model, the different hosts being associated with the managed network endpoint; based on the evaluating, selecting one of the different hosts to perform the request; and causing performance of the request to generate the inference using the respective replica of the specified machine learning model.

15. The one or more non-transitory, computer-readable storage media of claim 14, storing further program instructions that when executed on or across the one or more computing devices, cause the one or more computing devices to further implement obtaining, at least part of the respective workloads, from a model registry for the machine learning service to update a model deployment cache for a router of the machine learning service.

16. The one or more non-transitory, computer-readable storage media of claim 14, wherein the selecting the one of the different hosts to perform the request comprises applying a selection strategy specified via an interface of the machine learning service.

17. The one or more non-transitory, computer-readable storage media of claim 14, wherein the selecting the one of the different hosts to perform the request comprises applying a weighted random selection to account for more than one replica of the specified machine learning model being hosted at the different hosts.

18. The one or more non-transitory, computer-readable storage media of claim 14, wherein the respective workloads of different hosts for respective replicas of the specified machine learning model comprises respective numbers of inflight inference requests obtained from the different hosts.

19. The one or more non-transitory, computer-readable storage media of claim 14, wherein selection of the one host to perform the request is further based on a determination that the request is associated with a sticky session.

20. The one or more non-transitory, computer-readable storage media of claim 14, wherein the managed network endpoint is created in response to one or more requests to create the managed network endpoint and add the plurality of different mac
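The dependent claims name three refinements without saying how they are built: weighted random selection (claims 9 and 17), sticky sessions (claims 11 and 19), and selecting another host after a failed attempt (claim 12). The sketch below shows one way these could fit together. Everything in it is an assumption made for illustration: the function names, the `send` callback, the in-memory `sticky` map, and in particular the weighting formula (replica count divided by one plus in-flight requests), which does not come from the patent text.

```python
import random


def pick_weighted(inflight: dict[str, int], replica_count: dict[str, int],
                  rng: random.Random) -> str:
    """Weighted random choice over hosts.

    Assumed weighting: more replicas on a host and fewer in-flight
    requests both raise its chance of being picked.
    """
    hosts = list(inflight)
    weights = [replica_count[h] / (1 + inflight[h]) for h in hosts]
    return rng.choices(hosts, weights=weights, k=1)[0]


def route(inflight, replica_count, send, sticky=None, session=None, rng=None):
    """Route one request, honouring a sticky session and retrying on failure.

    `send(host)` is a hypothetical callback that returns True on success.
    """
    rng = rng or random.Random()
    sticky = sticky if sticky is not None else {}
    remaining = dict(inflight)
    # Try the session's previous host first, if it is still a candidate.
    preferred = sticky.get(session)
    while remaining:
        if preferred in remaining:
            host = preferred
        else:
            host = pick_weighted(remaining, replica_count, rng)
        preferred = None
        if send(host):
            if session is not None:
                sticky[session] = host  # remember the host for this session
            return host
        del remaining[host]  # failed attempt: exclude it and try another host
    raise RuntimeError("all hosts failed")


inflight = {"host-1": 3, "host-2": 0, "host-3": 1}
replica_count = {"host-1": 1, "host-2": 2, "host-3": 1}
sticky = {"session-42": "host-1"}
# host-1 is treated as unreachable here, so the sticky preference falls through.
host = route(inflight, replica_count, send=lambda h: h != "host-1",
             sticky=sticky, session="session-42", rng=random.Random(0))
print(host)
```

In this run the request lands on `host-2` or `host-3`, and the session map is updated to whichever host succeeded. Random selection, rather than always taking the minimum, is a common way to avoid sending every new request to the same momentarily idle host when the load figures are slightly stale.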

Assignees

Amazon Tech Inc

Inventors

Classifications

H04L67/1008
based on parameters of servers, e.g. available memory or workload (monitoring of computer activity G06F11/30) · CPC title
G06F9/5083
Techniques for rebalancing the load in a distributed system · CPC title
H04L45/38 · Primary
Flow based routing · CPC title
G06F9/5088
involving task migration · CPC title
G06F9/505
considering the load · CPC title

Patent family

Related publications grouped by family.

View patent family 95823362

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US12526230B2 cover?
Load aware routing is performed for requests to managed network endpoints for heterogeneous machine learning models. A request to generate an inference is received via a managed network endpoint that invokes a specified machine learning model. Workloads of the different hosts for respective replicas of the machine learning model are evaluated to select one of the hosts to perform the request.
Who is the assignee on this patent?
Amazon Tech Inc
What technology area does this patent fall under?
Primary CPC classification H04L45/38. Mapped technology areas include Electricity.
When was this patent published?
Publication date Jan 13, 2026 (B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 12 related publications on this page (citations in our corpus or others sharing the same primary CPC).