Collective communication operation
US-2019045003-A1 · Feb 7, 2019 · US
US11614946B2 · US · B2
| Field | Value |
|---|---|
| Publication number | US-11614946-B2 |
| Application number | US-202016831564-A |
| Country | US |
| Kind code | B2 |
| Filing date | Mar 26, 2020 |
| Priority date | Mar 27, 2019 |
| Publication date | Mar 28, 2023 |
| Grant date | Mar 28, 2023 |
Official abstract text for this publication.
A computer comprising a plurality of processing nodes is provided. Each processing node has at least one processor configured to process input data to generate an array of data items. The processing nodes are arranged in cliques in which each processing node of a clique is connected to each other processing node in the clique by first and second clique links. The cliques are inter-connected in rings such that each processing node is a member of a single clique and a single ring. The processing nodes of all cliques are configured to exchange in each exchange step of a machine learning collective via the respective first and second clique links at least two data items with the other processing node(s) in its clique, and all processing nodes are configured to reduce each received data item with the data item in the corresponding position in the array on that processing node.
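The layout described in the abstract can be pictured with a small sketch. This is only an illustration under stated assumptions, not the patent's implementation: the node naming `(clique index, ring index)`, the helper names `build_topology` and `reduce_received`, the clique size of 3, and the use of addition as the reduction are all hypothetical choices made for the example.

```python
from itertools import combinations


def build_topology(num_cliques, clique_size):
    """Build a cliques-in-rings layout.

    Hypothetical representation: a node is the pair (clique index, ring index),
    so clique k holds nodes (k, 0..clique_size-1) and ring r holds nodes
    (0..num_cliques-1, r). Every node is thus in one clique and one ring.
    """
    nodes = [(k, r) for k in range(num_cliques) for r in range(clique_size)]

    clique_links = []
    for k in range(num_cliques):
        for a, b in combinations(range(clique_size), 2):
            # Two parallel links ("first" and "second") join every node pair in a clique.
            clique_links.append(((k, a), (k, b), "first"))
            clique_links.append(((k, a), (k, b), "second"))

    # Each node links forwards to the node with the same ring index in the next clique.
    ring_links = [((k, r), ((k + 1) % num_cliques, r)) for k, r in nodes]
    return nodes, clique_links, ring_links


def reduce_received(local_array, position, received_item):
    """Reduce a received item into the same position of the local array.

    Addition is an assumed reduction operator for this sketch.
    """
    local_array[position] += received_item
    return local_array


nodes, clique_links, ring_links = build_topology(num_cliques=4, clique_size=3)
print(len(nodes), len(clique_links), len(ring_links))  # 12 24 12
```

With four cliques of three nodes, the sketch yields three rings of four nodes each, and every pair of nodes inside a clique is joined by two links, which is what lets a node exchange at least two data items with a clique neighbour in a single step.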
Opening claim text (preview).
The invention claimed is:
1. A computer comprising a plurality of processing nodes, each processing nodes having at least one processor configured to process input data to generate output data in the form of an array of data items; the plurality of processing nodes arranged in cliques in which each processing node of a clique is connected to each other processing node in the clique by first and second clique links, the cliques being inter-connected in rings such that each processing node is a member of a single clique and a single ring, the processing nodes being configured to exchange data items in respective exchange steps of a machine learning collective, wherein the processing nodes of all cliques are configured to exchange in each exchange step via the respective first and second clique links at least two data items with the other processing node(s) in its clique, and all processing nodes are configured to reduce each received data item with the data item in the corresponding position in the array on that processing node, wherein the machine learning collective is an Allreduce collective and each processing node is configured to exchange data items in exchange steps of an Allgather phase, following a reduce-scatter phase of the Allreduce collective, wherein in each step of the Allgather phase reduced data items are exchanged between processing nodes in a clique, and wherein the processing nodes are each configured to transmit data items in a forwards direction to its adjacent processing node in the ring in at least some of the exchange steps in the reduce-scatter phase.
2. The computer according to claim 1, wherein each processing node comprises memory configured to store an array of data items ready to be exchanged in the reduce-scatter phase, wherein each data item is respectively positioned in the array with corresponding data items being respectively positioned at corresponding locations in the arrays of other processing nodes.
3. The computer according to claim 1, wherein the processing nodes are configured to transmit data items to their forwards adjacent processing node in the ring for all exchange steps of the reduce-scatter phase apart from a first step, in which no data items are transmitted between processing nodes connected in a ring.
4. The computer according to claim 1, wherein the array at each processing node comprises two sub arrays and processing nodes are inter-connected by bi-directional links, wherein in each exchange step of the reduce-scatter phase, all processing nodes are configured to exchange with the other processing node(s) of their clique, two data items from one sub array and two further data items from the other sub array wherein the two data items and the further two data items are exchanged over the same bi-directional link in opposite directions.
5. The computer according to claim 4, wherein the processing nodes are each configured to transmit data items in a forwards direction to its adjacent processing node in the ring in at least some of the exchange steps in the reduce-scatter phase and wherein in at least some exchange steps of the reduce-scatter phase each processing node is configured to transmit data items to its adjacent backwards processing node in the ring, wherein the transmission in each of the forwards and backwards direction from each processing node is carried out on the same bi-directional link.
6. The computer according to claim 2, wherein each array represents at least part of a vector of partial deltas, each partial delta representing an adjustment to a value stored at each processing node.
7. The computer according to claim 6, wherein each processing node is configured to generate the vector of partial deltas in a compute step.
8. The computer according to claim 7, wherein each processing node is configured to divide the vector into two sub arrays for separate exchange and reduction in the reduce-scatter phase.
9. The computer according to claim 7, wherein each processing node is configured to generate the vector of partial deltas by carrying out a compute function on a set of values and a batch of incoming deltas, the partial deltas being the output of the compute function.
10. The computer according to claim 9, which is configured to implement a machine learning model wherein the incoming batch data is training data, and the values are weights of the machine learning model.
11. A method of operating a computer comprising a plurality of processing nodes, each processing node having at least one processor configured to process input data to generate output data in the form of an array of data items, the plurality of processing nodes arranged in cliques in which each processing node of a clique is connected to each other processing node in the clique by first and second clique links, the cliques being interconnected in rings such that each processing node is a member of a single clique and a single ring, the method comprising exchanging data item in respect of exchange steps of a first phase of a machine learning collective, wherein in each exchange step the processing nodes of all cliques exchange via the respective first and second clique links at least two data items with the other processing nodes in its clique, and all processing nodes reduce each received data item with the data item in the corresponding position in the array on that processing node, wherein the machine learning collective is an Allreduce collective and each processing node exchanges data items in exchange steps of an Allgather phase, following a reduce-scatter phase of the Allreduce collective, wherein in each step of the Allgather phase reduced data items are exchanged between processing nodes in a clique, and wherein each processing node transmits data items in a forwards direction to its adjacent processing node in the ring in at least some of the exchange steps in the reduce scatter phase.
12. The method according to claim 11, wherein each processing node comprises memory configured to store an array of data items ready to be exchanged in the reduce-scatter phase, wherein each data item is respectively positioned in the array with corresponding data items being respectively positioned at corresponding locations in the arrays of other processing nodes.
13. The method according to claim 11, wherein each processing node transmits data items to their forwards adjacent processing node in the ring for all exchange steps of the reduce-scatter phase apart from a first step, in which no data items are transmitted between processing nodes connected in a ring.
14. The method according to claim 11, wherein the array at each processing node comprises two sub arrays and processing nodes are inter-connected by bi-directional links, wherein in each exchange step of the reduce-scatter phase, all processing nodes exchange with the other processing node(s) of their clique, two data items from one sub array and two further data items from the other sub array wherein the two data items and the further two data items are exchanged over the same bi-directional link in opposite directions.
15. The method according to claim 14, wherein each processing node transmits data items in a forwards direction to its adjacent processing node in the ring in at least some of the exchange steps in the reduce-scatter phase and wherein in at least some exchange steps of the reduce-scatter phase each processing node transmits data items to its adjacent backwards processing node in the ring, wherein the transmission in each of the forwards and backwards direction from each processing node is carried out on the same bi-directional
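
The claims describe an Allreduce collective as a reduce-scatter phase followed by an Allgather phase, with data items sent forwards around a ring. As a rough illustration of that two-phase structure only, the sketch below simulates the textbook single-ring Allreduce. It does not reproduce the clique-based exchange schedule of the claims; the function name `ring_allreduce`, the one-data-item-per-node array layout, and addition as the reduction are assumptions made for the example.

```python
def ring_allreduce(arrays):
    """Simulate a textbook single-ring Allreduce (sum) over n nodes.

    Assumption: each node holds an array of exactly n data items, one per node.
    Returns the arrays held by every node after the collective completes.
    """
    n = len(arrays)
    data = [list(a) for a in arrays]
    assert all(len(a) == n for a in data), "sketch assumes n items per node"

    # Reduce-scatter: in each step every node sends one item forwards, and the
    # receiver reduces it into the item at the same position in its own array.
    for step in range(n - 1):
        # Snapshot all sends first so the exchange is simultaneous.
        sends = [((i - step) % n, data[i][(i - step) % n]) for i in range(n)]
        for i, (pos, value) in enumerate(sends):
            data[(i + 1) % n][pos] += value
    # Node i now holds the fully reduced item at position (i + 1) % n.

    # Allgather: the reduced items circulate forwards and overwrite stale values.
    for step in range(n - 1):
        sends = [((i + 1 - step) % n, data[i][(i + 1 - step) % n]) for i in range(n)]
        for i, (pos, value) in enumerate(sends):
            data[(i + 1) % n][pos] = value
    return data


result = ring_allreduce([[1, 2, 3], [10, 20, 30], [100, 200, 300]])
print(result)  # [[111, 222, 333], [111, 222, 333], [111, 222, 333]]
```

In this single-ring version each phase takes n - 1 steps with one item per link per step. The claimed arrangement instead has each node exchange at least two data items per step over the first and second clique links.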
Distributed learning, e.g. federated learning · CPC title
Supervised learning · CPC title
Convolutional networks [CNN, ConvNet] · CPC title
Transfer mode dependent, e.g. ATM · CPC title
Learning methods · CPC title