US12481608B2 · US · B2
| Field | Value |
|---|---|
| Publication number | US-12481608-B2 |
| Application number | US-202418411299-A |
| Country | US |
| Kind code | B2 |
| Filing date | Jan 12, 2024 |
| Priority date | Jan 12, 2024 |
| Publication date | Nov 25, 2025 |
| Grant date | Nov 25, 2025 |
Official abstract text for this publication.
Aspects of the disclosure are directed to a tree-based network architecture for serving and/or training machine learning models. The architecture includes one or more multi-chip packages having a plurality of compute-memory stacks connected via an input/output (I/O) die. The I/O die includes an aggregator to aggregate computations from the compute-memory stacks. The architecture can further include a plurality of the multi-chip packages connected on a server via a server level aggregator and a plurality of the servers connected on a rack via a rack level aggregator for further aggregation of the computations from the compute-memory stacks. The tree-based network architecture allows for fewer hops, resulting in lower latency and savings in bandwidth when serving and/or training machine learning models.
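The aggregation hierarchy in the abstract can be sketched in a few lines. This is a minimal illustration, not the patented implementation: it assumes that "aggregate" means an element-wise sum of partial results (the abstract does not fix the operation), and the names `aggregate`, `tree_aggregate`, and the nested-list layout `rack[server][package][stack]` are hypothetical.

```python
from typing import List

Vector = List[float]


def aggregate(partials: List[Vector]) -> Vector:
    """Element-wise sum of equally sized partial results (assumed operation)."""
    return [sum(values) for values in zip(*partials)]


def tree_aggregate(rack: List[List[List[Vector]]]) -> Vector:
    """rack[server][package][stack] holds one partial result per compute-memory stack."""
    server_results = []
    for server in rack:
        # Chip level: each package's I/O die aggregates its own stacks.
        chip_results = [aggregate(package) for package in server]
        # Server level: the server level aggregator combines its packages.
        server_results.append(aggregate(chip_results))
    # Rack level: the rack level aggregator combines the servers.
    return aggregate(server_results)


# Hypothetical rack: 2 servers x 2 packages x 2 stacks, each stack holding [1.0, 2.0].
rack = [[[[1.0, 2.0] for _ in range(2)] for _ in range(2)] for _ in range(2)]
result = tree_aggregate(rack)
```

In this layout every partial result passes through three aggregation stages (chip, server, rack), however many stacks the rack holds.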
Opening claim text (preview).
The invention claimed is:

1. A tree-based network architecture comprising: a server comprising a plurality of multi-chip packages connected to a server level aggregator; the multi-chip packages each comprising a plurality of compute-memory stacks connected to a chip level input/output (I/O) die, each chip level I/O die configured to aggregate computations performed by the compute-memory stacks to generate a chip level aggregated computation and output the chip level aggregated computation to the server level aggregator; and the server level aggregator comprising a server level I/O die configured to aggregate the chip level aggregated computation from each of the plurality of multi-chip packages to generate a server level aggregated computation and output the server level aggregated computation.

2. The architecture of claim 1, further comprising a rack comprising the server and a plurality of additional servers connected to a rack level aggregator, the rack level aggregator comprising a rack level I/O die configured to aggregate the server level aggregated computation from the server and each of the plurality of additional servers to generate a rack level aggregated computation and output the rack level aggregated computation.

3. The architecture of claim 1, wherein the server level aggregated computation is output to at least one of the plurality of compute-memory stacks.

4. The architecture of claim 1, wherein the server level aggregated computation is output to an additional server.

5. The architecture of claim 1, wherein a compute-memory stack of the plurality of compute-memory stacks comprises a plurality of memory die stacked on top of a compute die.

6. The architecture of claim 1, wherein a multi-chip package of the plurality of multi-chip packages comprises a spare compute-memory stack.

7. The architecture of claim 1, wherein the server level aggregator further comprises a plurality of downstream ports and an upstream port.

8. The architecture of claim 7, wherein the number of the plurality of downstream ports corresponds to the number of the multi-chip packages.

9. The architecture of claim 7, wherein the server level aggregator further comprises a spare downstream port and a spare upstream port.

10. The architecture of claim 7, wherein the server level aggregator is configured to perform at least one of an all-reduce, an all-gather, or an all-broadcast operation to aggregate the chip level aggregated computation.

11. The architecture of claim 1, wherein the chip level I/O dies are packaged with respective multi-chip packages and the server level I/O die is packaged with the server level aggregator.

12. The architecture of claim 1, wherein the computations performed by the compute-memory stacks are for at least one of serving or training a machine learning model.

13. The architecture of claim 12, wherein the machine learning model is a large model processing unit.

14. A method for processing computations in a tree-based network architecture, the method comprising: computing, by each of a plurality of compute-memory stacks in a multi-chip package, a respective computation; aggregating, by a chip level input/output (I/O) die connected to the plurality of compute-memory stacks in the multi-chip package, the respective computations to generate a chip level aggregated computation; aggregating, by a server level I/O die of a server level aggregator in a server, the chip level aggregated computation with additional chip level aggregated computations to generate a server level aggregated computation; and outputting, by the server level aggregator, the server level aggregated computation.

15. The method of claim 14, further comprising: aggregating, by a rack level I/O die of a rack level aggregator in a rack, the server level aggregated computation with additional server level aggregated computations to generate a rack level aggregated computation; and outputting, by the rack level aggregator, the rack level aggregated computation.

16. The method of claim 14, further comprising outputting the server level aggregated computation to at least one of the plurality of compute-memory stacks.

17. The method of claim 14, further comprising outputting the server level aggregated computation to an additional server.

18. The method of claim 14, wherein aggregating the respective computations to generate a chip level aggregated computation comprises performing at least one of an all-reduce, an all-gather, or an all-broadcast operation.

19. A large model processing unit comprising a plurality of tree-based network architectures, each of the tree-based network architectures comprising: a server comprising a plurality of multi-chip packages connected to a server level aggregator; the multi-chip packages each comprising a plurality of compute-memory stacks connected to a chip level input/output (I/O) die, each chip level I/O die configured to aggregate computations performed by the compute-memory stacks to generate a chip level aggregated computation and output the chip level aggregated computation to the server level aggregator; and the server level aggregator comprising a server level I/O die configured to aggregate the chip level aggregated computation from each of the plurality of multi-chip packages to generate a server level aggregated computation and output the server level aggregated computation.

20. The large model processing unit of claim 19, wherein each of the tree-based network architectures further comprises a rack comprising the server and a plurality of additional servers connected to a rack level aggregator, the rack level aggregator comprising a rack level I/O die configured to aggregate the server level aggregated computation from the server and each of the plurality of additional servers to generate a rack level aggregated computation and output the rack level aggregated computation.
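The port and operation language in claims 7 to 10 can be modeled with a toy aggregator. This is a rough sketch under stated assumptions, not the claimed hardware: the `Aggregator` class, its fields, and the `OPS` table are hypothetical names; all-reduce is assumed to be an element-wise sum; all-gather is simplified to a single concatenation in port order rather than a copy delivered to every participant; all-broadcast is left out.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Vector = List[float]


def all_reduce(inputs: List[Vector]) -> Vector:
    """Combine inputs into one vector by element-wise sum (assumed reduction)."""
    return [sum(values) for values in zip(*inputs)]


def all_gather(inputs: List[Vector]) -> Vector:
    """Concatenate inputs in port order (simplified: one copy of the result)."""
    return [x for vec in inputs for x in vec]


OPS: Dict[str, Callable[[List[Vector]], Vector]] = {
    "all-reduce": all_reduce,
    "all-gather": all_gather,
}


@dataclass
class Aggregator:
    downstream_ports: int            # one per attached multi-chip package (claim 8)
    spare_downstream_ports: int = 0  # optional spare ports (claim 9)

    def run(self, op: str, inputs: List[Vector]) -> Vector:
        """Aggregate one input per occupied downstream port."""
        if len(inputs) > self.downstream_ports + self.spare_downstream_ports:
            raise ValueError("more inputs than downstream ports")
        return OPS[op](inputs)


# Hypothetical server level aggregator with two attached packages.
agg = Aggregator(downstream_ports=2)
reduced = agg.run("all-reduce", [[1.0, 2.0], [3.0, 4.0]])
gathered = agg.run("all-gather", [[1.0, 2.0], [3.0, 4.0]])
```

The same class could stand in for a chip level or rack level aggregator, since the claims describe each level as aggregating the outputs of the level below it.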
Bus coupling · CPC title
Dynamic search techniques; Heuristics; Dynamic trees; Branch-and-bound · CPC title
using electronic means · CPC title
for access to input/output bus · CPC title
with decentralized control, e.g. smart memories · CPC title