Data parallel processing method and apparatus based on multiple graphic processing units

US10282809B2 · US · B2

Patent metadata
Field · Value
Publication number · US-10282809-B2
Application number · US-201615210278-A
Country · US
Kind code · B2
Filing date · Jul 14, 2016
Priority date · Jun 20, 2014
Publication date · May 7, 2019
Grant date · May 7, 2019

Abstract

Official abstract text for this publication.

A parallel data processing method based on multiple graphic processing units (GPUs) is provided, including: creating, in a central processing unit (CPU), a plurality of worker threads for controlling a plurality of worker groups respectively, the worker groups including one or more GPUs; binding each worker thread to a corresponding GPU; loading a plurality of batches of training data from a nonvolatile memory to GPU video memories in the plurality of worker groups; and controlling the plurality of GPUs to perform data processing in parallel through the worker threads. The method can enhance efficiency of multi-GPU parallel data processing. In addition, a parallel data processing apparatus is further provided.
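
The thread-per-worker-group arrangement described in the abstract can be illustrated with a minimal Python sketch. It is not the patented implementation: the function name `run_worker_groups`, the round-robin assignment of batches, and the use of a plain integer as a stand-in for a GPU are all assumptions made for illustration, and the "training" step is replaced by a simple sum so the example runs without any GPU library.

```python
import threading


def run_worker_groups(batches, num_groups):
    """Run one CPU worker thread per worker group (hypothetical sketch).

    Each thread is associated with one device id and processes the batches
    assigned to its group. A real system would bind the thread to a physical
    GPU (for example with the CUDA runtime call cudaSetDevice) and launch
    training kernels; here the work is simulated on the CPU.
    """
    results = [None] * num_groups

    def worker(group_id):
        # Assumption: one GPU per worker group, device id equals group id.
        device_id = group_id
        # Assumption: batches are distributed round-robin across groups.
        local_batches = [b for i, b in enumerate(batches)
                         if i % num_groups == group_id]
        # Placeholder for batch training: reduce each batch to its sum.
        results[group_id] = (device_id, [sum(b) for b in local_batches])

    threads = [threading.Thread(target=worker, args=(g,))
               for g in range(num_groups)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results


if __name__ == "__main__":
    print(run_worker_groups([[1, 2], [3, 4], [5, 6], [7, 8]], num_groups=2))
```

Each thread writes only to its own slot in `results`, so the sketch needs no lock; a shared accumulator would.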

First claim

Opening claim text (preview).

What is claimed is:

1. A parallel data processing method based on multiple graphic processing units (GPUs), comprising:
    creating, in a central processing unit (CPU), a plurality of worker threads for controlling a plurality of worker groups respectively, the worker groups comprising one or more GPUs;
    binding each worker thread to a corresponding GPU;
    loading a plurality of batches of training data from a nonvolatile memory to GPU video memories in the plurality of worker groups; and
    controlling the plurality of GPUs to perform data processing in parallel through the worker threads, including:
        performing, by the plurality of GPUs, batch trainings on the plurality of batches of training data in parallel;
        obtaining, by the plurality of GPUs, first parameters included in training results of the batch trainings, wherein each GPU obtains one of the first parameters after training one batch of training data;
        exchanging, among the plurality of GPUs, the first parameters, wherein each GPU receives first parameters obtained by remaining GPUs of the plurality of GPUs; and
        updating, by the plurality of GPUs, model parameters based on the first parameters, wherein the model parameters are used by a GPU in at least one of: training a next batch of training data or updating a second parameter;
    wherein exchanging, among the plurality of GPUs, the first parameters comprises:
        dividing a first matrix and a second matrix into partitions spatially, a number of the partitions depending on a number of the worker groups, wherein the first matrix stores the first parameters and the second matrix stores the model parameters;
        in a cycle of a parameter exchanging process, data of the partitions is pushed from an upstream worker group to a downstream worker group by replicating the data of the partitions from the upstream worker group and combining the replicated data locally; and
        performing the cycle of the parameter exchanging process by a preset number of times to complete exchanging the first parameters, wherein the preset number equals the number of the worker groups minus one.

2. The method according to claim 1, further comprising: creating one I/O thread, and loading the plurality of batches of training data into a random access memory (RAM) through the I/O thread; and pre-processing the training data on the CPU through a thread pool, wherein the I/O thread, threads in the thread pool, the worker threads and data processing in the CPU are performed in parallel.

3. The method according to claim 1, further comprising: dividing a storage region in each GPU where the model parameters and gradients are stored into N partitions according to the number of the GPUs 2N, wherein the gradients are the first parameters, wherein N is an integer; presetting sequence numbers of the 2N GPUs to be 0, 1, 2, ..., 2N−1 respectively; within a cycle where the sequence number is k (k is an integer and 1 ≤ k ≤ 2N−1), replicating a preset partition in the N partitions from a GPU whose sequence number is i to a GPU whose sequence number is j, and merging the gradients, wherein i is an integer and i = (2m+k+1) % N, j is an integer and j = (2m+k+2) % N, m is an integer and 0 ≤ m ≤ N−1; and for partition owners in the 2N GPUs, updating the model parameters according to gradient merging results in the corresponding partitions, wherein the partition owners are GPUs having gradient merging results in all other GPUs for a preset partition.

4. The method according to claim 3, further comprising: within a cycle where the sequence number is k, replicating a preset partition in the N partitions from a GPU whose sequence number is a to a GPU whose sequence number is b, wherein a is an integer and a = (2m+k) % N, and b is an integer and b = (2m+k+1) % N.

5. The method according to claim 3, further comprising: for the partition owners, computing an adaptive learning rate learning_rate_i of a parameter in the position i according to the following adaptive learning rate updating formula:

$$\mathrm{helper\_sum}'_i = \sum_{j=0}^{\mathrm{GROUP\_NUM}} \mathrm{helper\_sum\_part}'_{i_j}$$

$$\mathrm{learning\_rate}_i = \mathrm{aggregating\_rate} \cdot \frac{\mathrm{adagrad\_rho}}{\mathrm{adagrad\_rho} + \mathrm{helper\_sum}'_i}$$

wherein GROUP_NUM denotes the number of worker groups, aggregating_rate denotes an aggregating learning rate, and adagrad_rho denotes auxiliary quantity for computing an adaptive learning rate; and for non-partition owners, updating the adaptive learning rate learning_rate_i according to the following formula:

$$\mathrm{helper\_sum\_part}'_{i_j} = 0$$

wherein the adaptive learning rate is the second parameter.

6. The method according to claim 1, further comprising: binding a plurality of GPUs to the same worker group; and controlling the plurality of GPUs bound to the same worker group to respectively train different parts of the same model through the worker threads.

7. The method according to claim 1, further comprising: loading a hierarchical model according to a model configuration file of a convolutional neural network (CNN); and if it is identified that two adjacent layers in the hierarchical model are completed by different GPUs, adding a data transport layer between the two adjacent layers, the data transport layer being configured to transmit data between two GPUs through peer to peer.

8. The method according to claim 1, further comprising: opening up write cache and read cache in a RAM, sizes of the write cache and the read cache being the size of a storage structure configured to store one batch of training data * the total number of worker groups; making processing of all the worker threads in a barrier state before the write cache is full; and exchanging preset indexes pointing to the write cache and the read cache after the write cache is full.

9. A data parallel processing apparatus based on multiple graphic processing units (GPUs), comprising: one or more processors; memory; and a plurality of program modules stored in the memory and to be executed by the one or more processors, the plurality of program modules further comprising: a thread creation module, configured to create, in a central processing unit (CPU), a plurality of worker threads for controlling a plurality of worker groups respectively, the worker groups comprising one or more GPUs; a thread binding module, configured to bind eac…
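
The parameter exchange recited in claim 1 (partitioned data pushed from an upstream worker group to a downstream one and combined locally, repeated for the number of worker groups minus one cycles) has the same shape as a generic ring reduce-scatter. The Python sketch below simulates that generic technique on the CPU with plain lists. It is an illustration under stated assumptions, not the patent's exact schedule: the partition indexing `(w - cycle) % n`, the function name `ring_reduce_scatter`, and the choice of one partition per worker group are assumptions, and the index formulas of claims 3 and 4 are not reproduced.

```python
def ring_reduce_scatter(grads):
    """Simulate a generic ring reduce-scatter over n worker groups.

    grads[w][p] is worker w's local gradient for partition p (a list of
    floats). Returns (buf, owners), where owners[p] is the worker that
    holds the fully merged partition p after n - 1 cycles.
    """
    n = len(grads)
    # Work on copies so the caller's gradients are left untouched.
    buf = [[list(part) for part in worker] for worker in grads]

    for cycle in range(n - 1):
        # Snapshot outgoing data first, so all pushes in one cycle behave
        # as if they happened simultaneously.
        sends = []
        for w in range(n):
            p = (w - cycle) % n  # assumed partition schedule
            sends.append((p, list(buf[w][p])))
        # Each worker pushes to its downstream neighbour, which merges
        # (adds) the replicated partition into its local copy.
        for w in range(n):
            downstream = (w + 1) % n
            p, data = sends[w]
            buf[downstream][p] = [a + b for a, b in zip(buf[downstream][p], data)]

    # With this schedule, worker w ends up owning partition (w + 1) % n.
    owners = {(w + 1) % n: w for w in range(n)}
    return buf, owners


if __name__ == "__main__":
    n = 4
    # Hypothetical data: worker w contributes the value 10*w + p to partition p.
    grads = [[[10.0 * w + p] for p in range(n)] for w in range(n)]
    merged, owners = ring_reduce_scatter(grads)
    for p in range(n):
        print("partition", p, "owner", owners[p], "sum", merged[owners[p]][p])
```

After the final cycle, each partition is complete on exactly one worker, which matches the claim's notion of a partition owner that then updates the model parameters for its partition. Distributing the updated parameters back to the other workers would need a second, similar pass that is not shown here.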

Assignees

  • Tencent Tech Shenzhen Co Ltd

Inventors

Classifications

  • Combinations of networks · CPC title

  • Memory management · CPC title

  • the resource being the memory · CPC title

  • G06T1/20 · Primary

    Processor architectures; Processor configuration, e.g. pipelining · CPC title

  • Backpropagation, e.g. using gradient descent · CPC title

Patent family

Related publications grouped by family.

External sources

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US10282809B2 cover?
A parallel data processing method based on multiple graphic processing units (GPUs) is provided, including: creating, in a central processing unit (CPU), a plurality of worker threads for controlling a plurality of worker groups respectively, the worker groups including one or more GPUs; binding each worker thread to a corresponding GPU; loading a plurality of batches of training data from a no…
Who is the assignee on this patent?
Tencent Tech Shenzhen Co Ltd
What technology area does this patent fall under?
Primary CPC classification G06T1/20. Mapped technology areas include Physics.
When was this patent published?
Publication date May 7, 2019 (B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 8 related publications on this page (citations in our corpus or others sharing the same primary CPC).