Tree-based network architecture for accelerating machine learning collective operations

US12481608B2 · US · B2

Patent metadata
Field	Value
Publication number	US-12481608-B2
Application number	US-202418411299-A
Country	US
Kind code	B2
Filing date	Jan 12, 2024
Priority date	Jan 12, 2024
Publication date	Nov 25, 2025
Grant date	Nov 25, 2025

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

Title
What the patent document calls the invention.
Abstract
A short plain-language summary of the technical disclosure.
Assignees and inventors
Who owns or filed the patent and who is credited as inventor.
Key dates
Filing, priority, publication, and grant dates set the timeline.
First independent claim
The legal scope of protection — read this for what is actually claimed.
CPC / IPC classifications
Technology tags used to group this patent with similar filings.
Citations and related patents
Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

Aspects of the disclosure are directed to a tree-based network architecture for serving and/or training machine learning models. The architecture includes one or more multi-chip packages having a plurality of compute-memory stacks connected via an input/output (I/O) die. The I/O die includes an aggregator to aggregate computations from the compute-memory stacks. The architecture can further include a plurality of the multi-chip packages connected on a server via a server level aggregator and a plurality of the servers connected on a rack via a rack level aggregator for further aggregation of the computations from the compute-memory stacks. The tree-based network architecture allows for fewer hops, resulting in lower latency and savings in bandwidth when serving and/or training machine learning models.
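The hierarchy described in the abstract can be illustrated with a minimal Python sketch. This is not the patented implementation: the class names, the `build_rack` helper, the choice of element-wise summation as the "aggregate" step, and the sample values are all assumptions made for illustration. Only the three levels (chip, server, rack) come from the abstract.

```python
from dataclasses import dataclass
from typing import List

Vector = List[float]


def elementwise_sum(vectors: List[Vector]) -> Vector:
    """Sum equal-length vectors element by element (an assumed stand-in for 'aggregate')."""
    return [sum(values) for values in zip(*vectors)]


@dataclass
class MultiChipPackage:
    # One partial result per compute-memory stack in the package.
    stack_outputs: List[Vector]

    def aggregate(self) -> Vector:
        # Chip level: the I/O die combines the outputs of its stacks.
        return elementwise_sum(self.stack_outputs)


@dataclass
class Server:
    packages: List[MultiChipPackage]

    def aggregate(self) -> Vector:
        # Server level: combine one chip level result per package.
        return elementwise_sum([p.aggregate() for p in self.packages])


@dataclass
class Rack:
    servers: List[Server]

    def aggregate(self) -> Vector:
        # Rack level: combine one server level result per server.
        return elementwise_sum([s.aggregate() for s in self.servers])


def build_rack(num_servers: int, packages_per_server: int, stacks_per_package: int) -> Rack:
    """Hypothetical workload: every stack contributes the vector [1.0, 2.0]."""
    def package() -> MultiChipPackage:
        return MultiChipPackage([[1.0, 2.0] for _ in range(stacks_per_package)])

    def server() -> Server:
        return Server([package() for _ in range(packages_per_server)])

    return Rack([server() for _ in range(num_servers)])


# 2 servers x 2 packages x 4 stacks = 16 stacks in total.
rack = build_rack(num_servers=2, packages_per_server=2, stacks_per_package=4)
result = rack.aggregate()
print(result)  # [16.0, 32.0]
```

In this sketch each level forwards a single combined vector upward instead of every stack's output, which is one way to picture the bandwidth saving the abstract mentions.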

First claim

Opening claim text (preview).

The invention claimed is:
1. A tree-based network architecture comprising: a server comprising a plurality of multi-chip packages connected to a server level aggregator; the multi-chip packages each comprising a plurality of compute-memory stacks connected to a chip level input/output (I/O) die, each chip level I/O die configured to aggregate computations performed by the compute-memory stacks to generate a chip level aggregated computation and output the chip level aggregated computation to the server level aggregator; and the server level aggregator comprising a server level I/O die configured to aggregate the chip level aggregated computation from each of the plurality of multi-chip packages to generate a server level aggregated computation and output the server level aggregated computation.
2. The architecture of claim 1, further comprising a rack comprising the server and a plurality of additional servers connected to a rack level aggregator, the rack level aggregator comprising a rack level I/O die configured to aggregate the server level aggregated computation from the server and each of the plurality of additional servers to generate a rack level aggregated computation and output the rack level aggregated computation.
3. The architecture of claim 1, wherein the server level aggregated computation is output to at least one of the plurality of compute-memory stacks.
4. The architecture of claim 1, wherein the server level aggregated computation is output to an additional server.
5. The architecture of claim 1, wherein a compute-memory stack of the plurality of compute-memory stacks comprises a plurality of memory die stacked on top of a compute die.
6. The architecture of claim 1, wherein a multi-chip package of the plurality of multi-chip packages comprises a spare compute-memory stack.
7. The architecture of claim 1, wherein the server level aggregator further comprises a plurality of downstream ports and an upstream port.
8. The architecture of claim 7, wherein the number of the plurality of downstream ports corresponds to the number of the multi-chip packages.
9. The architecture of claim 7, wherein the server level aggregator further comprises a spare downstream port and a spare upstream port.
10. The architecture of claim 7, wherein the server level aggregator is configured to perform at least one of an all-reduce, an all-gather, or an all-broadcast operation to aggregate the chip level aggregated computation.
11. The architecture of claim 1, wherein the chip level I/O dies are packaged with respective multi-chip packages and the server level I/O die is packaged with the server level aggregator.
12. The architecture of claim 1, wherein the computations performed by the compute-memory stacks are for at least one of serving or training a machine learning model.
13. The architecture of claim 12, wherein the machine learning model is a large model processing unit.
14. A method for processing computations in a tree-based network architecture, the method comprising: computing, by each of a plurality of compute-memory stacks in a multi-chip package, a respective computation; aggregating, by a chip level input/output (I/O) die connected to the plurality of compute-memory stacks in the multi-chip package, the respective computations to generate a chip level aggregated computation; aggregating, by a server level I/O die of a server level aggregator in a server, the chip level aggregated computation with additional chip level aggregated computations to generate a server level aggregated computation; and outputting, by the server level aggregator, the server level aggregated computation.
15. The method of claim 14, further comprising: aggregating, by a rack level I/O die of a rack level aggregator in a rack, the server level aggregated computation with additional server level aggregated computations to generate a rack level aggregated computation; and outputting, by the rack level aggregator, the rack level aggregated computation.
16. The method of claim 14, further comprising outputting the server level aggregated computation to at least one of the plurality of compute-memory stacks.
17. The method of claim 14, further comprising outputting the server level aggregated computation to an additional server.
18. The method of claim 14, wherein aggregating the respective computations to generate a chip level aggregated computation comprises performing at least one of an all-reduce, an all-gather, or an all-broadcast operation.
19. A large model processing unit comprising a plurality of tree-based network architectures, each of the tree-based network architectures comprising: a server comprising a plurality of multi-chip packages connected to a server level aggregator; the multi-chip packages each comprising a plurality of compute-memory stacks connected to a chip level input/output (I/O) die, each chip level I/O die configured to aggregate computations performed by the compute-memory stacks to generate a chip level aggregated computation and output the chip level aggregated computation to the server level aggregator; and the server level aggregator comprising a server level I/O die configured to aggregate the chip level aggregated computation from each of the plurality of multi-chip packages to generate a server level aggregated computation and output the server level aggregated computation.
20. The large model processing unit of claim 19, wherein each of the tree-based network architectures further comprises a rack comprising the server and a plurality of additional servers connected to a rack level aggregator, the rack level aggregator comprising a rack level I/O die configured to aggregate the server level aggregated computation from the server and each of the plurality of additional servers to generate a rack level aggregated computation and output the rack level aggregated computation.
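As a rough illustration of the port arrangement in claims 7 to 10, the following Python sketch models an aggregator with downstream ports, spare ports, and a single upstream output. The class `PortedAggregator`, its method names, the fail-over behavior, and the interpretation of all-reduce as an element-wise sum and all-gather as concatenation are assumptions for illustration; the claims do not specify how spare ports are activated or how the operations are implemented.

```python
from typing import Dict, List, Optional


class PortedAggregator:
    """Hypothetical model of an aggregator with downstream ports and one upstream output."""

    def __init__(self, num_downstream: int, num_spare: int = 0) -> None:
        # One active downstream port per child (claim 8), plus optional spares (claim 9).
        self.active: List[int] = list(range(num_downstream))
        self.spares: List[int] = list(range(num_downstream, num_downstream + num_spare))
        self.inbox: Dict[int, List[float]] = {}
        self.upstream: Optional[List[float]] = None

    def receive(self, port: int, data: List[float]) -> None:
        """Store data arriving on an active downstream port."""
        if port not in self.active:
            raise ValueError(f"port {port} is not active")
        self.inbox[port] = data

    def fail_over(self, bad_port: int) -> int:
        """Replace a failed active port with a spare (assumed policy: first spare wins)."""
        if not self.spares:
            raise RuntimeError("no spare downstream port available")
        spare = self.spares.pop(0)
        self.active[self.active.index(bad_port)] = spare
        self.inbox.pop(bad_port, None)
        return spare

    def _require_all_inputs(self) -> None:
        missing = [p for p in self.active if p not in self.inbox]
        if missing:
            raise RuntimeError(f"waiting for ports {missing}")

    def all_reduce(self) -> List[float]:
        """Element-wise sum of every active port's data, placed on the upstream output."""
        self._require_all_inputs()
        columns = zip(*(self.inbox[p] for p in self.active))
        self.upstream = [sum(column) for column in columns]
        return self.upstream

    def all_gather(self) -> List[float]:
        """Concatenate every active port's data in port order."""
        self._require_all_inputs()
        self.upstream = [x for p in self.active for x in self.inbox[p]]
        return self.upstream


# Three multi-chip packages, one spare port; port 2 fails and the spare takes over.
agg = PortedAggregator(num_downstream=3, num_spare=1)
agg.receive(0, [1.0, 2.0])
agg.receive(1, [3.0, 4.0])
new_port = agg.fail_over(2)
agg.receive(new_port, [5.0, 6.0])
reduced = agg.all_reduce()
gathered = agg.all_gather()
print(new_port, reduced, gathered)  # 3 [9.0, 12.0] [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
```

The same class could stand in for a chip level or rack level aggregator in this sketch, since the claims describe each level as aggregating inputs from the level below and outputting one result.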

Assignees

Google LLC

Inventors

Classifications

Code	CPC title
G06F2213/40	Bus coupling
G06N5/01	Dynamic search techniques; Heuristics; Dynamic trees; Branch-and-bound
G06N3/063	using electronic means
G06F13/20 (Primary)	for access to input/output bus
G06F15/785 (Primary)	with decentralized control, e.g. smart memories

Patent family

Related publications grouped by family.

View patent family 93117414

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US12481608B2 cover?: Aspects of the disclosure are directed to a tree-based network architecture for serving and/or training machine learning models. The architecture includes one or more multi-chip packages having a plurality of compute-memory stacks connected via an input/output (I/O) die. The I/O die includes an aggregator to aggregate computations from the compute-memory stacks. The architecture can further inc…
Who is the assignee on this patent?: Google LLC
What technology area does this patent fall under?: Primary CPC classification G06F13/20. Mapped technology areas include Physics.
When was this patent published?: Publication date Nov 25, 2025 (B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?: We list 12 related publications on this page (citations in our corpus or others sharing the same primary CPC).