Model parallel processing method and apparatus based on multiple graphic processing units

US9607355B2 · US · B2

Patent metadata
Field               Value
Publication number  US-9607355-B2
Application number  US-201515107475-A
Country             US
Kind code           B2
Filing date         Jun 19, 2015
Priority date       Jun 20, 2014
Publication date    Mar 28, 2017
Grant date          Mar 28, 2017

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

A parallel data processing method based on multiple graphic processing units (GPUs) is provided, including: creating, in a central processing unit (CPU), a plurality of worker threads for controlling a plurality of worker groups respectively, the worker groups including a plurality of GPUs; binding each worker thread to a corresponding GPU; loading one batch of training data from a nonvolatile memory to a GPU video memory corresponding to one worker group; transmitting, between a plurality of GPUs corresponding to one worker group, data required by data processing performed by the GPUs through peer to peer; and controlling the plurality of GPUs to perform data processing in parallel through the worker threads.
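
As a rough illustration of the thread layout described in the abstract, the Python sketch below creates one CPU worker thread per GPU and arranges the GPUs into worker groups. It is a minimal sketch under stated assumptions: the GPUs are simulated (no device is actually selected and no peer-to-peer transfer takes place), and the names `plan_worker_threads`, `run_workers`, and the numbering `gpu_id = group * gpus_per_group + slot` are hypothetical, not taken from the patent.

```python
import threading


def plan_worker_threads(num_groups, gpus_per_group):
    """Return one (group, slot, gpu_id) binding per GPU.

    Assumption: GPUs are numbered consecutively within each worker group.
    """
    return [
        (group, slot, group * gpus_per_group + slot)
        for group in range(num_groups)
        for slot in range(gpus_per_group)
    ]


def run_workers(bindings, work):
    """Start one worker thread per binding and collect each thread's result."""
    results = {}
    lock = threading.Lock()

    def worker(group, slot, gpu_id):
        # A real system would bind this thread to its GPU here (for example
        # through a device-selection call); this sketch only records the
        # binding and runs the supplied function.
        out = work(group, slot, gpu_id)
        with lock:
            results[gpu_id] = out

    threads = [threading.Thread(target=worker, args=b) for b in bindings]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results


# Two worker groups with two simulated GPUs each.
bindings = plan_worker_threads(num_groups=2, gpus_per_group=2)
results = run_workers(
    bindings, lambda group, slot, gpu_id: f"group {group}, part {slot}"
)
```

In this layout each worker group would receive its own batch of training data, while the GPUs inside a group cooperate on that batch.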

First claim

Opening claim text (preview).

What is claimed is:

1. A parallel data processing method based on multiple graphic processing units (GPUs), comprising: creating, in a central processing unit (CPU), a plurality of worker threads for controlling a plurality of worker groups respectively, the worker groups comprising a plurality of GPUs; binding each worker thread to a corresponding GPU; loading one batch of training data from a nonvolatile memory to a GPU video memory corresponding to one worker group; transmitting, between a plurality of GPUs corresponding to one worker group, data required by data processing performed by the GPUs through peer to peer; and controlling the plurality of GPUs to perform data processing in parallel through the worker threads; the method further comprising: opening up write cache and read cache in a random access memory (RAM), sizes of the write cache and the read cache being the size of a storage structure configured to store one batch of training data * the total number of worker groups; processing all the worker threads in a barrier state before the write cache is full; and exchanging preset indexes pointing to the write cache and the read cache after the write cache is full.

2. The method according to claim 1, comprising: creating one I/O thread, and loading the plurality of batches of training data into a random memory through the I/O thread; and pre-processing the training data on the CPU through a thread pool, wherein the I/O thread, threads in the thread pool, the worker threads and data processing in the CPU are performed in parallel.

3. The method according to claim 1, comprising: dividing a storage region in each GPU where model parameters and gradients are stored into N partitions according to the number of the GPUs 2N; presetting sequence numbers of the 2N GPUs to be 0, 1, 2 . . . 2N−1 respectively; within a cycle where the sequence number is k (k is an integer and 1≦k≦2N−1), replicating a preset partition in the N partitions from a GPU whose sequence number is i to a GPU whose sequence number is j, and merging the gradients, wherein i = (2m+k+1) % N, j = (2m+k+2) % N, m is an integer and 0≦m≦N−1; and for partition owners in the 2N GPUs, updating the model parameters according to gradient merging results in the corresponding partitions, wherein the partition owners are GPUs having gradient merging results in all other GPUs for a preset partition.

4. The method according to claim 3, comprising: within a cycle where the sequence number is k, replicating a preset partition in the N partitions from a GPU whose sequence number is a to a GPU whose sequence number is b, wherein a = (2m+k) % N, and b = (2m+k+1) % N.

5. The method according to claim 3, comprising: for the partition owners, computing a learning rate learning_rate_i of a parameter in the position i according to the following adaptive learning rate updating formula:

\[
\text{helper\_sum}_i' = \sum_{j=0}^{\text{GROUP\_NUM}} \text{helper\_sum\_part}_{i_j}'
\]

\[
\text{learning\_rate}_i = \text{aggregating\_rate} \cdot \frac{\text{adagrad\_rho}}{\text{adagrad\_rho} + \text{helper\_sum}_i'}
\]

wherein GROUP_NUM denotes the number of worker groups, aggregating_rate denotes an aggregating learning rate, and adagrad_rho denotes auxiliary quantity for computing an adaptive learning rate; and for non-partition owners, computing a learning rate learning_rate_i of a parameter in the position i according to the following adaptive learning rate updating formula:

\[
\text{helper\_sum\_part}_{i_j}' = 0
\]

6. The method according to claim 1, comprising: loading a hierarchical model according to a model configuration file of a convolutional neural network; and if it is identified that two adjacent layers in the hierarchical model are completed by different GPUs, adding a data transport layer between the two adjacent layers, the data transport layer being configured to perform the step of transmitting, between a plurality of GPUs corresponding to one worker group, data required by data processing performed by the GPUs through peer to peer.

7. The method according to claim 1, wherein the controlling the plurality of GPUs to perform data processing in parallel through the worker threads comprises: controlling a plurality of GPUs in the same worker group to respectively train different parts of the same model through the worker threads.

8. A data parallel processing apparatus based on multiple graphic processing units (GPUs), comprising: a thread creation module, configured to create, in a central processing unit (CPU), a plurality of worker threads for controlling a plurality of worker groups respectively, the worker groups comprising a plurality of GPUs; a thread binding module, configured to bind each worker thread to a corresponding GPU; a data distribution module, configured to load one batch of training data from a nonvolatile memory to a GPU video memory corresponding to one worker group; a transmission module, configured to transmit, between a plurality of GPUs corresponding to one worker group, data required by data processing performed by the GPUs through peer to peer; and a data processing control module, configured to control the plurality of GPUs to perform data processing in parallel through the worker threads; the apparatus further comprising: a cache creation module, configured to open up write cache and read cache in a random access memory (RAM), sizes of the write cache and the read cache being the size of a storage structure configured to store one batch of training data * the total number of worker groups; a thread barrier module, configured to process all the worker threads in a barrier state before the write cache is full; and a cache exchange module, configured to exchange preset indexes pointing to the write cache and the read cache after the write cache is full.

9. The apparatus according to claim 8, wherein the thread creation module is further configured to create one I/O thread, and load the plurality of batches of training data into a rando
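
The write-cache and read-cache exchange recited at the end of claim 1 can be pictured as a double buffer. The Python sketch below is one possible reading, not the patented implementation: each cache holds one batch slot per worker group, the worker threads wait at a barrier until the loader has filled the write cache, and the two indexes are exchanged while every thread is parked at the barrier. The class `DoubleBuffer`, the function `run`, the batch labels, and the use of `threading.Barrier` with an `action` callback are all assumptions made for illustration.

```python
import threading


class DoubleBuffer:
    """Two caches, each sized for one batch per worker group."""

    def __init__(self, num_groups):
        self.num_groups = num_groups
        self.caches = [[None] * num_groups, [None] * num_groups]
        self.write_idx, self.read_idx = 0, 1  # indexes pointing at the caches
        self.filled = 0

    def write(self, batch):
        """Store one batch in the next free slot of the write cache."""
        self.caches[self.write_idx][self.filled] = batch
        self.filled += 1

    def swap(self):
        """Exchange the write and read indexes once the write cache is full."""
        assert self.filled == self.num_groups, "write cache not full yet"
        self.write_idx, self.read_idx = self.read_idx, self.write_idx
        self.filled = 0

    def read(self, group):
        """Return the batch for one worker group from the read cache."""
        return self.caches[self.read_idx][group]


def run(num_groups=2, rounds=2):
    buf = DoubleBuffer(num_groups)
    # The barrier action runs exactly once per round, after the loader and
    # all workers have arrived, so no thread reads or writes during the swap.
    barrier = threading.Barrier(num_groups + 1, action=buf.swap)
    seen = [[] for _ in range(num_groups)]

    def loader():
        for r in range(rounds):
            for g in range(num_groups):
                buf.write(f"round{r}-batch{g}")
            barrier.wait()  # write cache is full: let the workers proceed

    def worker(group):
        for _ in range(rounds):
            barrier.wait()  # barrier state until the write cache is full
            seen[group].append(buf.read(group))

    threads = [threading.Thread(target=loader)]
    threads += [threading.Thread(target=worker, args=(g,)) for g in range(num_groups)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return seen


seen = run()
```

Because the loader only ever writes to the cache that the workers are not reading, loading the next round of batches can overlap with processing the current one; only the index exchange needs every thread to pause.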

Assignees

Tencent Tech Shenzhen Co Ltd

Inventors

Classifications

  • G06T1/20 · Primary

    Processor architectures; Processor configuration, e.g. pipelining · CPC title

  • involving image processing hardware · CPC title

  • Barrier synchronisation · CPC title

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US9607355B2 cover?
A parallel data processing method based on multiple graphic processing units (GPUs) is provided, including: creating, in a central processing unit (CPU), a plurality of worker threads for controlling a plurality of worker groups respectively, the worker groups including a plurality of GPUs; binding each worker thread to a corresponding GPU; loading one batch of training data from a nonvolatile …
Who is the assignee on this patent?
Tencent Tech Shenzhen Co Ltd
What technology area does this patent fall under?
Primary CPC classification G06T1/20. Mapped technology areas include Physics.
When was this patent published?
Publication date Mar 28, 2017 (B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 8 related publications on this page (citations in our corpus or others sharing the same primary CPC).