Technologies for managing disaggregated resources in a data center
US-2020257566-A1 · Aug 13, 2020 · US
US11315013B2 · US · B2
| Field | Value |
|---|---|
| Publication number | US-11315013-B2 |
| Application number | US-201815960472-A |
| Country | US |
| Kind code | B2 |
| Filing date | Apr 23, 2018 |
| Priority date | Apr 23, 2018 |
| Publication date | Apr 26, 2022 |
| Grant date | Apr 26, 2022 |
Official abstract text for this publication.
Techniques are provided for implementing a parameter server within a networking infrastructure of a computing system to reduce the communication bandwidth and latency for performing communication synchronization operations of the parameter server. For example, a method includes executing a distributed deep learning (DL) model training process to train model parameters of a DL model using a plurality of worker nodes executing on one or more server nodes of a computing system, and executing a parameter server within a networking infrastructure of the computing system to aggregate local model parameters computed by the plurality of worker nodes and to distribute aggregated model parameters to the plurality of worker nodes using the networking infrastructure of the computing system.
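The aggregate-then-distribute role that the abstract assigns to the parameter server can be illustrated with a minimal sketch. It is not the patented implementation: the abstract does not say how parameters are combined, so element-wise averaging is an assumption here, and the names `aggregate`, `distribute`, and `Params` are hypothetical. The sketch also runs in ordinary host memory rather than inside any networking infrastructure.

```python
from typing import Dict, List

# Hypothetical representation: parameter name -> flat list of values.
Params = Dict[str, List[float]]


def aggregate(local_params: List[Params]) -> Params:
    """Combine the workers' local parameters.

    Assumption: element-wise mean. The abstract only says "aggregate".
    """
    if not local_params:
        raise ValueError("need at least one worker result")
    n = len(local_params)
    return {
        name: [sum(vals) / n for vals in zip(*(p[name] for p in local_params))]
        for name in local_params[0]
    }


def distribute(global_params: Params, num_workers: int) -> List[Params]:
    """Give every worker its own copy of the aggregated parameters."""
    return [
        {name: list(values) for name, values in global_params.items()}
        for _ in range(num_workers)
    ]


# Two workers report local parameters after one training iteration.
workers = [
    {"w": [1.0, 2.0], "b": [0.0]},
    {"w": [3.0, 4.0], "b": [2.0]},
]
global_params = aggregate(workers)
copies = distribute(global_params, len(workers))
```

Each worker would then start its next iteration from its copy of the aggregated parameters.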
Opening claim text (preview).
What is claimed is:

1. A method, comprising:

executing a distributed deep learning (DL) model training process to train a DL model using a plurality of server nodes comprising at least a first server node and a second server node, wherein the first server node comprises a first processor, a first set of accelerator devices, and a first network interface component, wherein the second server node comprises a second processor, a second set of accelerator devices, and a second network interface component, wherein executing the DL model training process comprises performing an iterative process, wherein at least one iteration of the DL model training process comprises:

distributing, by the first and second processors, a batch of training data to the respective first and second set of accelerator devices, wherein the accelerator devices of the first and second set of accelerator devices each receive a respective portion of the batch of training data;

executing a first set of worker processes on the first set of accelerator devices, and a second set of worker processes on the second set of accelerator devices, wherein the worker processes of the first and second set of worker processes compute respective local parameters using the respective portions of the batch of training data;

performing, by the worker processes of the first set of worker processes, respective direct memory copy operations to copy the respective local parameters to a first memory associated with the first network interface component;

performing, by the worker processes of the second set of worker processes, respective direct memory copy operations to copy the respective local parameters to a second memory associated with the second network interface component;

aggregating, by a first parameter server process executing on the first network interface component, the local parameters provided by the first set of worker processes to thereby generate a first set of local aggregated parameters, wherein the first parameter server process comprises a master parameter server process;

aggregating, by a second parameter server process executing on the second network interface component, the local parameters provided by the second set of worker processes to thereby generate a second set of local aggregated parameters;

performing, by the second parameter server process, a direct memory copy operation to copy the second set of local aggregated parameters to the first memory associated with the first network interface component;

aggregating, by the first parameter server process, at least the first and second set of local aggregated parameters to thereby generate a global set of parameters; and

performing, by the first parameter server process, a direct memory copy operation to copy the global set of parameters to the first memory associated with the first network interface component.

2. The method of claim 1, wherein the first and second set of worker processes are managed by respective virtual worker nodes.

3. The method of claim 1, wherein the first and second set of accelerator devices comprise graphics processing unit devices.

4. The method of claim 1, wherein the first and second network interface components comprise respective first and second network interface cards of the respective first and second server nodes.

5. The method of claim 4, wherein the first and second network interface cards comprise virtual network interface cards.

6. The method of claim 4, wherein the first and second network interface cards comprise respective first and second physical network interface cards.

7. The method of claim 1, wherein the direct memory copy operations, which are performed by the worker processes of the first and second set of worker processes to copy the respective local parameters to the respective first and second memories associated with the respective first and second network interface components, are implemented using a direct memory access (DMA) protocol.

8. The method of claim 1, wherein the direct memory copy operations, which are performed by the first and second parameter server processes, are implemented using a remote direct memory access (RDMA) protocol.

9. An article of manufacture comprising a processor-readable storage medium having stored program code of one or more software programs, wherein the program code is executable by one or more processors to implement method steps comprising:

executing a distributed deep learning (DL) model training process to train a DL model using a plurality of server nodes comprising at least a first server node and a second server node, wherein the first server node comprises a first processor, a first set of accelerator devices, and a first network interface component, wherein the second server node comprises a second processor, a second set of accelerator devices, and a second network interface component, wherein executing the DL model training process comprises performing an iterative process, wherein at least one iteration of the DL model training process comprises:

distributing, by the first and second processors, a batch of training data to the respective first and second set of accelerator devices, wherein the accelerator devices of the first and second set of accelerator devices each receive a respective portion of the batch of training data;

executing a first set of worker processes on the first set of accelerator devices, and a second set of worker processes on the second set of accelerator devices, wherein the worker processes of the first and second set of worker processes compute respective local parameters using the respective portions of the batch of training data;

performing, by the worker processes of the first set of worker processes, respective direct memory copy operations to copy the respective local parameters to a first memory associated with the first network interface component;

performing, by the worker processes of the second set of worker processes, respective direct memory copy operations to copy the respective local parameters to a second memory associated with the second network interface component;

aggregating, by a first parameter server process executing on the first network interface component, the local parameters provided by the first set of worker processes to thereby generate a first set of local aggregated parameters, wherein the first parameter server process comprises a master parameter server process;

aggregating, by a second parameter server process executing on the second network interface component, the local parameters provided by the second set of worker processes to thereby generate a second set of local aggregated parameters;

performing, by the second parameter server process, a direct memory copy operation to copy the second set of local aggregated parameters to the first memory associated with the first network interface component;

aggregating, by the first parameter server process, at least the first and second set of local aggregated parameters to thereby generate a global set of parameters; and

performing, by the first parameter server process, a direct memory copy operation to copy the global set of parameters to the first memory associated with the first network interface component.

10. The article of manufacture of claim 9, wherein the first and second set of worker processes are managed by respective virtual worker nodes.

11. The article of manufacture of claim 9, wherein the first and second set of accelerator devices comprise graphics processing unit devices.

12. The article of manufacture of claim 9, wherein the first and second network interface components comprise respective first and second ne
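
The two-level aggregation recited in claim 1 can be walked through in a small simulation. This is a sketch under stated assumptions, not the claimed implementation: the class `NicParameterServer` and its methods are hypothetical, the memory associated with each network interface component is modelled as a Python list, the DMA and RDMA copies are modelled as plain appends, and the aggregation operator (a sum with a contributor count, averaged at the end) is an assumption because the claim only says "aggregating".

```python
from dataclasses import dataclass, field
from typing import List, Tuple

Vector = List[float]


def vector_sum(vectors: List[Vector]) -> Vector:
    """Element-wise sum of equally sized vectors."""
    return [sum(column) for column in zip(*vectors)]


@dataclass
class NicParameterServer:
    """Hypothetical stand-in for a parameter server process on a NIC."""
    name: str
    # Stand-in for the memory associated with the network interface component.
    # Each entry is (parameter sum, number of workers it represents).
    memory: List[Tuple[Vector, int]] = field(default_factory=list)

    def receive(self, params: Vector, count: int = 1) -> None:
        """Models a direct memory copy into this NIC's memory."""
        self.memory.append((list(params), count))

    def aggregate(self) -> Tuple[Vector, int]:
        """Sum everything currently in memory, then clear the buffer."""
        total = vector_sum([params for params, _ in self.memory])
        count = sum(n for _, n in self.memory)
        self.memory.clear()
        return total, count


master = NicParameterServer("first-nic")   # master parameter server process
second = NicParameterServer("second-nic")

# Worker processes on each node copy their local parameters to their own NIC.
for local in ([1.0, 2.0], [3.0, 4.0]):
    master.receive(local)
for local in ([5.0, 6.0], [7.0, 8.0]):
    second.receive(local)

# Each parameter server process aggregates locally.
first_local = master.aggregate()
second_local = second.aggregate()

# The second process copies its local aggregate to the first NIC's memory;
# the master's own local aggregate is placed alongside it.
master.receive(*first_local)
master.receive(*second_local)

# The master aggregates to a global set and writes it to the first memory.
total, count = master.aggregate()
global_params = [value / count for value in total]
master.receive(global_params, count)
```

Carrying a contributor count next to each partial sum keeps the final average correct even when the two nodes host different numbers of worker processes.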