US11625356B2 · US · B2
| Field | Value |
|---|---|
| Publication number | US-11625356-B2 |
| Application number | US-202117211232-A |
| Country | US |
| Kind code | B2 |
| Filing date | Mar 24, 2021 |
| Priority date | Mar 26, 2020 |
| Publication date | Apr 11, 2023 |
| Grant date | Apr 11, 2023 |
Official abstract text for this publication.
A computer comprising a plurality of interconnected processing nodes arranged in a configuration in which multiple layers of interconnected nodes are arranged along an axis, each layer comprising at least four processing nodes connected in a non-axial ring by at least one respective intralayer link between each pair of neighbouring processing nodes, wherein each of the at least four processing nodes in each layer is connected to a respective corresponding node in one or more adjacent layers by a respective interlayer link, the computer being programmed to provide in the configuration two embedded one-dimensional paths and to transmit data around each of the two embedded one-dimensional paths, each embedded one-dimensional path using all processing nodes of the computer in such a manner that the two embedded one-dimensional paths operate simultaneously without sharing links.
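The property stated in the abstract (two paths that each visit every node and share no links) can be checked mechanically on a small case. The following is a minimal sketch, assuming a closed toroid of 4 layers with 4 nodes each and nodes labelled `(layer, position)`. All function names are hypothetical, and the brute-force search is only an illustration; it is not the construction the patent describes.

```python
from itertools import product

def toroid_links(layers, ring):
    """Undirected links of a layers x ring toroid; nodes are (layer, position)."""
    links = set()
    for l, p in product(range(layers), range(ring)):
        links.add(frozenset({(l, p), (l, (p + 1) % ring)}))    # intralayer link
        links.add(frozenset({(l, p), ((l + 1) % layers, p)}))  # interlayer link
    return links

def neighbours(node, layers, ring):
    l, p = node
    return [(l, (p + 1) % ring), (l, (p - 1) % ring),
            ((l + 1) % layers, p), ((l - 1) % layers, p)]

def cycle_links(cycle):
    """Links used when the node sequence is traversed as a closed ring."""
    n = len(cycle)
    return {frozenset({cycle[i], cycle[(i + 1) % n]}) for i in range(n)}

def is_single_cycle(links, n_nodes):
    """True if the links form exactly one ring through all n_nodes nodes."""
    adj = {}
    for link in links:
        a, b = tuple(link)
        adj.setdefault(a, []).append(b)
        adj.setdefault(b, []).append(a)
    if len(adj) != n_nodes or any(len(v) != 2 for v in adj.values()):
        return False
    start = next(iter(adj))
    prev, cur, seen = None, start, 1
    while True:
        nxt = adj[cur][0] if adj[cur][0] != prev else adj[cur][1]
        prev, cur = cur, nxt
        if cur == start:
            break
        seen += 1
    return seen == n_nodes

def find_two_paths(layers, ring):
    """Search for a ring through all nodes whose unused links also form such a ring."""
    all_links = toroid_links(layers, ring)
    n = layers * ring
    start = (0, 0)
    path, visited = [start], {start}

    def dfs():
        if len(path) == n:
            closes = start in neighbours(path[-1], layers, ring)
            return closes and is_single_cycle(all_links - cycle_links(path), n)
        for nb in neighbours(path[-1], layers, ring):
            if nb not in visited:
                visited.add(nb)
                path.append(nb)
                if dfs():
                    return True
                path.pop()
                visited.remove(nb)
        return False

    if not dfs():
        return None
    first = list(path)
    return first, all_links - cycle_links(first)

if __name__ == "__main__":
    first, second_links = find_two_paths(4, 4)
    print("first path:", first)
    print("links shared:", len(cycle_links(first) & second_links))
```

In this sketch each node has four links, and the two paths between them use all of the links exactly once, which is one way to read the requirement that the paths operate simultaneously without sharing links.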
Opening claim text (preview).
The invention claimed is:
1. A computer comprising: a plurality of interconnected processing nodes arranged in a configuration in which multiple layers of interconnected nodes are arranged along an axis, each layer comprising at least four processing nodes connected in a non-axial ring by at least one respective intralayer links between each pair of neighbouring processing nodes, wherein each of the at least four processing nodes in each layer is connected to a respective corresponding node in one or more adjacent layer by a respective interlayer link, the computer being programmed to provide in the configuration two embedded one-dimensional paths and to transmit data around each of the two embedded one-dimensional paths, each embedded one-dimensional path using all processing nodes of the computer in such a manner that the two embedded one-dimensional paths operate simultaneously without sharing links, wherein the multiple layers comprise first and second endmost layers and at least one intermediate layer between the first and second endmost layers, wherein each processing node in the first endmost layer is connected to a non-neighbouring node in the first endmost layer in addition to its neighbouring node, and each processing node in the second endmost layer is connected to a non-neighbouring node in the second endmost layer in addition to its neighbouring node, and wherein at least one of the interlayer and intralayer links of processing nodes in the first endmost layer comprise switching circuitry operable to disconnect the processing node from its corresponding node in the second endmost layer and connect it to a non-neighbouring node in the first endmost layer.
2. The computer of claim 1, wherein the configuration is a toroid configuration in which respective connected corresponding nodes of the multiple layers form at least four axial rings.
3. The computer of claim 1 wherein at least one of the interlayer and intralayer links comprise switching circuitry operable to connect one of the processing nodes selectively to one of multiple other processing nodes.
4. The computer of claim 1, wherein each processing node is configured to output data on its respective intralayer and interlayer links with the same bandwidth utilisation on each of the intralayer and interlayer links of the processing node.
5. The computer of claim 1, wherein each layer of the multiple layers has exactly four nodes.
6. The computer of claim 1 which comprises a number of layers arranged along the axis which is greater than the number of processing nodes in each layer.
7. The computer of claim 1 which comprises a number of layers arranged along the axis which is the same as the number of nodes in each layer.
8. The computer of claim 1 wherein the intralayer and interlayer links comprise fixed connections between the processing nodes.
9. The computer of claim 1 wherein at least one of the interlayer links of processing nodes in the first endmost layer comprise switching circuitry operable to disconnect the processing node from its neighbouring node in the first endmost layer and connect it to a corresponding node in the second endmost layer.
10. The computer of claim 1 wherein each embedded one-dimensional path comprises alternating sequences of one of the interlayer links and one of the intralayer links.
11. The computer of claim 1 in which each one-dimensional embedded path comprises a sequence of processing nodes which are visited in a direction in each layer which is the same in all layers within each one-dimensional path.
12. The computer of claim 1 in which each one-dimensional embedded path comprises a sequence of processing nodes which are visited in a direction in each layer which is different in successive layers within each one-dimensional path.
13. The computer of claim 1 comprising six layers, each having four processing nodes connected in a non-axial ring.
14. The computer of claim 1 which comprises eight layers, each having eight processing nodes connected in a non-axial ring.
15. The computer of claim 1 which comprises eight layers each having four processing nodes connected in a ring.
16. The computer of claim 1 which comprises four layers, each having four processing nodes connected in a ring.
17. A computer comprising: a plurality of interconnected processing nodes arranged in a configuration in which multiple layers of interconnected nodes are arranged along an axis, each layer comprising at least four processing nodes connected in a non-axial ring by at least one respective intralayer links between each pair of neighbouring processing nodes, wherein each of the at least four processing nodes in each layer is connected to a respective corresponding node in one or more adjacent layer by a respective interlayer link, the computer being programmed to provide in the configuration two embedded one-dimensional paths and to transmit data around each of the two embedded one-dimensional paths, each embedded one-dimensional path using all processing nodes of the computer in such a manner that the two embedded one-dimensional paths operate simultaneously without sharing links, wherein each processing node is programmed to divide a respective partial vector of that processing node into fragments and to transmit the data in the form of successive fragments around each embedded one-dimensional path.
18. The computer of claim 17 which is programmed to operate each path as a set of logical rings, wherein the successive fragments are transmitted around each logical ring in simultaneous transmission steps.
19. The computer of claim 17, wherein each processing node is configured to output a respective fragment on each of two links simultaneously, wherein the fragment output on each of the links has approximately the same size.
20. The computer of claim 17, wherein each processing node is configured to reduce multiple incoming fragments with multiple respective corresponding locally stored fragments.
21. The computer of claim 20, wherein each processing node is configured to transmit fully reduced fragments on each of its intralayer and interlayer links simultaneously in an Allgather phase of an Allreduce collective.
22. The computer of claim 1, programmed to transmit the data in data transmission steps such that each link of a processing node is utilised with the same bandwidth as other links of that processing node in each data transmission step.
23. A method of generating a set of programs to be executed in parallel on a computer comprising a plurality of processing nodes connected in a configuration with multiple layers arranged along an axis, each layer comprising at least four processing nodes connected in a non-axial ring by a respective intralayer link between each pair of neighbouring processing nodes, wherein processing nodes in each layer are connected to respective corresponding nodes in each adjacent layer by an interlayer link, the method comprising: generating a first data transmission instruction for a first program to define a first data transmission stage in which data is transmitted from a first node executing the first program, wherein the first data transmission instruction comprises a first link identifier which defines a first outgoing link on which data is to be transmitted from the first node in the first data transmission stage; generating a second data transmission instruction for a second program to define a second data transmission stage in whi
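Claims 17 to 21 refer to dividing each node's partial vector into fragments, reducing incoming fragments with locally stored ones, and an Allgather phase of an Allreduce collective. The following is a minimal sketch of the textbook ring Allreduce (reduce-scatter followed by allgather) on a single logical ring, assuming a sum reduction and one fragment per node. The function name `ring_allreduce` and the data layout are hypothetical, and the sketch does not model the two simultaneous embedded paths of the claims.

```python
def ring_allreduce(partials):
    """Simulate a sum Allreduce over one logical ring of n nodes.

    partials[i][f] is fragment f (a list of numbers) of node i's partial
    vector; each node's vector is assumed to be pre-split into n fragments.
    """
    n = len(partials)
    # Copy so the caller's data is left untouched.
    data = [[list(frag) for frag in node] for node in partials]

    # Reduce-scatter: in step s, node i sends fragment (i - s) % n to node
    # (i + 1) % n, which adds it into its local copy of that fragment.
    for s in range(n - 1):
        sends = [(i, (i - s) % n, list(data[i][(i - s) % n])) for i in range(n)]
        for src, f, payload in sends:  # all sends in a step happen "at once"
            dst = (src + 1) % n
            data[dst][f] = [a + b for a, b in zip(data[dst][f], payload)]
    # Node i now holds the fully reduced fragment (i + 1) % n.

    # Allgather: in step s, node i forwards the fully reduced fragment
    # (i + 1 - s) % n to node (i + 1) % n, which overwrites its local copy.
    for s in range(n - 1):
        sends = [(i, (i + 1 - s) % n, list(data[i][(i + 1 - s) % n])) for i in range(n)]
        for src, f, payload in sends:
            data[(src + 1) % n][f] = payload
    return data

if __name__ == "__main__":
    # Four nodes, each with a partial vector split into four 2-element fragments.
    partials = [[[node + 10 * f, 2 * node + f] for f in range(4)] for node in range(4)]
    for node, vector in enumerate(ring_allreduce(partials)):
        print(node, vector)
```

Each of the `2 * (n - 1)` steps moves one fragment per node over one outgoing link, so every link of the ring carries the same amount of data in each step.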
One dimensional, e.g. linear array, ring · CPC title
Two dimensional, e.g. mesh, torus · CPC title
Parallel communications techniques, e.g. gather, scatter, reduce, broadcast, multicast, all to all · CPC title
Three dimensional, e.g. hypercubes · CPC title
Electrical coupling · CPC title