Efficient and more advanced implementation of ring-AllReduce algorithm for distributed parallel deep learning

US11520640B2 · US · B2

Patent metadata
FieldValue
Publication numberUS-11520640-B2
Application numberUS-202016777711-A
CountryUS
Kind codeB2
Filing dateJan 30, 2020
Priority dateJan 30, 2020
Publication dateDec 6, 2022
Grant dateDec 6, 2022

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

The present disclosure provides a method for syncing data of a computing task across a plurality of groups of computing nodes, each group comprising a set of computing nodes A-D, a set of intra-group interconnects that communicatively couple computing node A with computing nodes B and C and computing node D with computing nodes B and C, and a set of inter-group interconnects that communicatively couple a computing node A of a first group of the plurality of groups with a computing node A of a second group neighboring the first group, a computing node B of the first group with a computing node B of the second group, a computing node C of the first group with the computing node C of the second group, and a computing node D of the first group with a computing node D of the second group, the method comprising: syncing across a first dimension of computing nodes using a first set of ring connections, wherein the first set of ring connections are formed using inter-group and intra-group interconnects that communicatively couple the computing nodes along the first dimension; and broadcasting synced data across a second dimension of computing nodes using a second ring connection.

First claim

Opening claim text (preview).

What is claimed is: 1. A method for syncing data of a computing task across a plurality of groups of computing nodes, each group comprising a set of computing nodes A-D, a set of intra-group interconnects that communicatively couple computing node A with computing nodes B and C and computing node D with computing nodes B and C, and a set of inter-group interconnects that communicatively couple a computing node A of a first group of the plurality of groups with a computing node A of a second group neighboring the first group and a computing node A of a third group neighboring the first group, a computing node B of the first group with a computing node B of the second group and a computing node B of the third group, a computing node C of the first group with a computing node C of the second group and a computing node C of the third group, and a computing node D of the first group with a computing node D of the second group and a computing node D of the third group, wherein the second group and the third group are aligned with the first group in different dimensions, the method comprising: syncing data across a first dimension of computing nodes of the first group and the second group using a first set of ring connections, wherein the first set of ring connections are formed using inter-group and intra-group interconnects that communicatively couple the computing nodes of the first group and the second group along the first dimension, and the syncing data across the first dimension comprises transferring, in a unit time, sub-data from a computing node along the first dimension to another computing node in a unit time via a connection on the first set of ring connections; syncing data across a second dimension of computing nodes of the first group and the third group using a second set of ring connections, wherein the second set of ring connections are formed using inter-group and intra-group interconnects that communicatively couple the computing nodes of the first group and the third group along the second dimension, and syncing data across the second dimension comprises transferring, in a unit time, sub-data from a computing node along the second dimension to another computing node in a unit time via a connection on the second set of ring connections; and broadcasting synced data across the first dimension or the second dimension of computing nodes using the first set of ring connections or the second set of ring connections. 2. The method of claim 1 , wherein the data to be synced comprises a plurality of sub-data, and each computing node comprises a different version of each sub-data. 3. The method of claim 2 , wherein: syncing data across a first dimension of computing nodes using a first set of ring connections further comprises: in a clock cycle, receiving a version of sub-data by each computing node in a row, wherein the version of the sub-data is transferred from another computing node in the row via the connection on the first set of ring connections; and continuing data transferring until each computing node along the first dimension receives all versions of a sub-data from all computing nodes in the row. 4. The method of claim 3 , wherein: syncing data across a second dimension of computing nodes using a second set of ring connections further comprises: in a clock cycle, receiving a version of sub-data by each computing node in a column, wherein the version of the sub-data is transferred from another computing node in the column via the connection on the second set of ring connections; and continuing data transferring until each computing node along the second dimension receives all versions of a sub-data from all computing nodes. 5. The method claim 3 , wherein broadcasting synced data across the first dimension or the second dimension of computing nodes using the first set of ring connections or the second set of ring connections further comprises: in a clock cycle, receiving sub-data by each computing node across the first dimension or the second dimension, wherein the sub-data transferred from another computing node across the first dimension or the second dimension via the connection on the first set of ring connections or the connection on the second set of ring connections; and continuing data transferring until all computing nodes along the first dimension or the second dimension receive sub-data from all computing nodes. 6. The method of claim 1 , wherein the set of intra-group interconnects and the set of inter-group interconnects comprise inter-chip interconnects. 7. The method of claim 1 , wherein the computing nodes are artificial intelligence (“AI”) training processors, AI training chips, neural processing units (“NPU”), or graphic processing units (“GPU”). 8. The method of claim 7 , wherein the computing task is an AI computing task involving an allreduce algorithm. 9. A system for syncing data of a computing task across a plurality of groups of computing nodes, each group comprising a set of computing nodes A-D, a set of intra-group interconnects that communicatively couple computing node A with computing nodes B and C and computing node D with computing nodes B and C, and a set of inter-group interconnects that communicatively couple a computing node A of a first group of the plurality of groups with a computing node A of a second group neighboring the first group, and a computing node A of a third group neighboring the first group a computing node B of the first group with a computing node B of the second group and a computing node B of the third group, a computing node C of the first group with a computing node C of the second group and a computing node C of the third group, and a computing node D of the first group with a computing node D of the second group and a computing node D of the third group, wherein the second group and the third group are aligned with the first group in different dimensions, the system comprising: a memory storing a set of instructions; and one or more processors configured to execute the set of instructions to cause the system to: sync data across a first dimension of computing nodes of the first group and the second group using a first set of ring connections, wherein the first set of ring connections are formed using inter-group and intra-group interconnects that communicatively couple the computing nodes of the first group and the second group along the first dimension, and the syncing data across the first dimension comprises transferring, in a unit time, sub-data from a computing node along the first dimension to another computing node in a unit time via a connection on the first set of ring connections; sync data across a second dimension of computing nodes of the first group and the third group using a second set of ring connections, wherein the second set of ring connections are formed using inter-group and intra-group interconnects that communicatively couple the computing nodes of the first group and the third group along the second dimension, and syncing data across the second dimension comprises transferring, in a unit time, sub-data from a computing node along the second dimension to another computing node in a unit time via a connection on the second set of ring connections; and broadcast synced data across the first dimension or the second dimension of computing nodes using the first set of ring connections or the second set of ring connections. 10. The system of claim 9 , wherein the data to be synced comprises a plurality of sub-data, and each computing node comprises a different version of each sub-data. 11. The system of claim 10 , wherein the one or more processors are further configured to execute the set of instr

Assignees

Inventors

Classifications

  • G06F9/52Primary

    Program synchronisation; Mutual exclusion, e.g. by means of semaphores · CPC title

  • G06F9/5038Primary

    considering the execution order of a plurality of tasks, e.g. taking priority or time dependency constraints into consideration (scheduling strategies G06F9/4881 and subgroups) · CPC title

  • using electronic means · CPC title

  • Event management; Broadcasting; Multicasting; Notifications · CPC title

  • Scheduling strategies for dispatcher, e.g. round robin, multi-level priority queues · CPC title

Patent family

Related publications grouped by family.

External sources

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US11520640B2 cover?
The present disclosure provides a method for syncing data of a computing task across a plurality of groups of computing nodes, each group comprising a set of computing nodes A-D, a set of intra-group interconnects that communicatively couple computing node A with computing nodes B and C and computing node D with computing nodes B and C, and a set of inter-group interconnects that communicativel…
Who is the assignee on this patent?
Alibaba Group Holding Ltd
What technology area does this patent fall under?
Primary CPC classification G06F9/52. Mapped technology areas include Physics.
When was this patent published?
Publication date Tue Dec 06 2022 00:00:00 GMT+0000 (Coordinated Universal Time) (B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 4 related publications on this page (citations in our corpus or others sharing the same primary CPC).