Tree-based network architecture for accelerating machine learning collective operations

US12481608B2 · US · B2

Patent metadata
Field               Value
Publication number  US-12481608-B2
Application number  US-202418411299-A
Country             US
Kind code           B2
Filing date         Jan 12, 2024
Priority date       Jan 12, 2024
Publication date    Nov 25, 2025
Grant date          Nov 25, 2025

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

Aspects of the disclosure are directed to a tree-based network architecture for serving and/or training machine learning models. The architecture includes one or more multi-chip packages having a plurality of compute-memory stacks connected via an input/output (I/O) die. The I/O die includes an aggregator to aggregate computations from the compute-memory stacks. The architecture can further include a plurality of the multi-chip packages connected on a server via a server level aggregator and a plurality of the servers connected on a rack via a rack level aggregator for further aggregation of the computations from the compute-memory stacks. The tree-based network architecture allows for fewer hops, resulting in lower latency and savings in bandwidth when serving and/or training machine learning models.
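The aggregation hierarchy described in the abstract can be illustrated with a minimal Python sketch. It assumes the aggregation is an element-wise sum, which the abstract does not specify, and every name in it (`aggregate`, `rack_level_aggregate`, the `rack` layout and its values) is hypothetical rather than taken from the patent.

```python
from typing import List

Vector = List[float]


def aggregate(partials: List[Vector]) -> Vector:
    """Element-wise sum of equal-length vectors (assumed aggregation operation)."""
    return [sum(values) for values in zip(*partials)]


# Hypothetical topology: rack -> servers -> multi-chip packages -> stacks.
# Each innermost list is the partial result from one compute-memory stack.
rack = [
    [  # server 0
        [[1.0, 2.0], [3.0, 4.0]],  # package 0: two stacks
        [[5.0, 6.0], [7.0, 8.0]],  # package 1: two stacks
    ],
    [  # server 1
        [[1.0, 1.0], [1.0, 1.0]],  # package 0: two stacks
        [[2.0, 2.0], [2.0, 2.0]],  # package 1: two stacks
    ],
]


def rack_level_aggregate(rack: List[List[List[Vector]]]) -> Vector:
    """Aggregate stack results level by level, one tree level per step."""
    server_results = []
    for server in rack:
        # Chip level: each package's I/O die combines its own stacks.
        chip_results = [aggregate(package) for package in server]
        # Server level: the server level aggregator combines the packages.
        server_results.append(aggregate(chip_results))
    # Rack level: the rack level aggregator combines the servers.
    return aggregate(server_results)


result = rack_level_aggregate(rack)
print(result)
```

In this sketch each partial result travels through three aggregation points (chip, server, rack) regardless of how many stacks sit underneath, which is the structural property the abstract associates with fewer hops.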

First claim

Opening claim text (preview).

The invention claimed is:
1. A tree-based network architecture comprising: a server comprising a plurality of multi-chip packages connected to a server level aggregator; the multi-chip packages each comprising a plurality of compute-memory stacks connected to a chip level input/output (I/O) die, each chip level I/O die configured to aggregate computations performed by the compute-memory stacks to generate a chip level aggregated computation and output the chip level aggregated computation to the server level aggregator; and the server level aggregator comprising a server level I/O die configured to aggregate the chip level aggregated computation from each of the plurality of multi-chip packages to generate a server level aggregated computation and output the server level aggregated computation.
2. The architecture of claim 1, further comprising a rack comprising the server and a plurality of additional servers connected to a rack level aggregator, the rack level aggregator comprising a rack level I/O die configured to aggregate the server level aggregated computation from the server and each of the plurality of additional servers to generate a rack level aggregated computation and output the rack level aggregated computation.
3. The architecture of claim 1, wherein the server level aggregated computation is output to at least one of the plurality of compute-memory stacks.
4. The architecture of claim 1, wherein the server level aggregated computation is output to an additional server.
5. The architecture of claim 1, wherein a compute-memory stack of the plurality of compute-memory stacks comprises a plurality of memory die stacked on top of a compute die.
6. The architecture of claim 1, wherein a multi-chip package of the plurality of multi-chip packages comprises a spare compute-memory stack.
7. The architecture of claim 1, wherein the server level aggregator further comprises a plurality of downstream ports and an upstream port.
8. The architecture of claim 7, wherein the number of the plurality of downstream ports corresponds to the number of the multi-chip packages.
9. The architecture of claim 7, wherein the server level aggregator further comprises a spare downstream port and a spare upstream port.
10. The architecture of claim 7, wherein the server level aggregator is configured to perform at least one of an all-reduce, an all-gather, or an all-broadcast operation to aggregate the chip level aggregated computation.
11. The architecture of claim 1, wherein the chip level I/O dies are packaged with respective multi-chip packages and the server level I/O die is packaged with the server level aggregator.
12. The architecture of claim 1, wherein the computations performed by the compute-memory stacks are for at least one of serving or training a machine learning model.
13. The architecture of claim 12, wherein the machine learning model is a large model processing unit.
14. A method for processing computations in a tree-based network architecture, the method comprising: computing, by each of a plurality of compute-memory stacks in a multi-chip package, a respective computation; aggregating, by a chip level input/output (I/O) die connected to the plurality of compute-memory stacks in the multi-chip package, the respective computations to generate a chip level aggregated computation; aggregating, by a server level I/O die of a server level aggregator in a server, the chip level aggregated computation with additional chip level aggregated computations to generate a server level aggregated computation; and outputting, by the server level aggregator, the server level aggregated computation.
15. The method of claim 14, further comprising: aggregating, by a rack level I/O die of a rack level aggregator in a rack, the server level aggregated computation with additional server level aggregated computations to generate a rack level aggregated computation; and outputting, by the rack level aggregator, the rack level aggregated computation.
16. The method of claim 14, further comprising outputting the server level aggregated computation to at least one of the plurality of compute-memory stacks.
17. The method of claim 14, further comprising outputting the server level aggregated computation to an additional server.
18. The method of claim 14, wherein aggregating the respective computations to generate a chip level aggregated computation comprises performing at least one of an all-reduce, an all-gather, or an all-broadcast operation.
19. A large model processing unit comprising a plurality of tree-based network architectures, each of the tree-based network architectures comprising: a server comprising a plurality of multi-chip packages connected to a server level aggregator; the multi-chip packages each comprising a plurality of compute-memory stacks connected to a chip level input/output (I/O) die, each chip level I/O die configured to aggregate computations performed by the compute-memory stacks to generate a chip level aggregated computation and output the chip level aggregated computation to the server level aggregator; and the server level aggregator comprising a server level I/O die configured to aggregate the chip level aggregated computation from each of the plurality of multi-chip packages to generate a server level aggregated computation and output the server level aggregated computation.
20. The large model processing unit of claim 19, wherein each of the tree-based network architectures further comprises a rack comprising the server and a plurality of additional servers connected to a rack level aggregator, the rack level aggregator comprising a rack level I/O die configured to aggregate the server level aggregated computation from the server and each of the plurality of additional servers to generate a rack level aggregated computation and output the rack level aggregated computation.
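Claims 7, 8, and 10 describe a server level aggregator with one downstream port per multi-chip package and a choice of collective operations. The toy Python model below is a sketch under stated assumptions, not the patented implementation: the class name, the method names, and the sample values are hypothetical. All-reduce and all-gather follow their conventional meanings in collective communication (combine, or concatenate, then return the result to every participant), and the reduction is assumed to be a sum. The claims' "all-broadcast" operation is not modeled because the claim text does not define it.

```python
from dataclasses import dataclass
from typing import List

Vector = List[float]


@dataclass
class ServerLevelAggregator:
    """Toy aggregator: one downstream port per multi-chip package (hypothetical)."""

    num_downstream_ports: int

    def _check(self, inputs: List[Vector]) -> None:
        # Expect exactly one chip level result per downstream port.
        if len(inputs) != self.num_downstream_ports:
            raise ValueError("expected one input per downstream port")

    def all_reduce(self, inputs: List[Vector]) -> List[Vector]:
        """Sum the inputs element-wise and return a copy for every port."""
        self._check(inputs)
        reduced = [sum(values) for values in zip(*inputs)]
        return [list(reduced) for _ in inputs]

    def all_gather(self, inputs: List[Vector]) -> List[Vector]:
        """Concatenate the inputs in port order and return a copy for every port."""
        self._check(inputs)
        gathered = [x for vector in inputs for x in vector]
        return [list(gathered) for _ in inputs]


# Three hypothetical packages, each contributing a chip level aggregated result.
aggregator = ServerLevelAggregator(num_downstream_ports=3)
chip_level = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]

reduced = aggregator.all_reduce(chip_level)
gathered = aggregator.all_gather(chip_level)
print(reduced[0])
print(gathered[0])
```

Returning one copy per downstream port mirrors claims 3 and 16, where the server level aggregated computation is output back to the compute-memory stacks. Sending the same result through the upstream port would correspond to feeding a rack level aggregator as in claim 2.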

Assignees

Google LLC

Inventors

Classifications

  • Bus coupling · CPC title

  • Dynamic search techniques; Heuristics; Dynamic trees; Branch-and-bound · CPC title

  • using electronic means · CPC title

  • G06F13/20 · Primary

    for access to input/output bus · CPC title

  • G06F15/785 · Primary

    with decentralized control, e.g. smart memories · CPC title

Patent family

Related publications grouped by family.

External sources

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US12481608B2 cover?
Aspects of the disclosure are directed to a tree-based network architecture for serving and/or training machine learning models. The architecture includes one or more multi-chip packages having a plurality of compute-memory stacks connected via an input/output (I/O) die. The I/O die includes an aggregator to aggregate computations from the compute-memory stacks. The architecture can further inc…
Who is the assignee on this patent?
Google LLC
What technology area does this patent fall under?
Primary CPC classification G06F13/20. Mapped technology areas include Physics.
When was this patent published?
Publication date: Nov 25, 2025 (kind code B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 12 related publications on this page (citations in our corpus or others sharing the same primary CPC).