Automated inspection system
US9607355B2 · US · B2
| Field | Value |
|---|---|
| Publication number | US-9607355-B2 |
| Application number | US-201515107475-A |
| Country | US |
| Kind code | B2 |
| Filing date | Jun 19, 2015 |
| Priority date | Jun 20, 2014 |
| Publication date | Mar 28, 2017 |
| Grant date | Mar 28, 2017 |
Official abstract text for this publication.
A parallel data processing method based on multiple graphic processing units (GPUs) is provided, including: creating, in a central processing unit (CPU), a plurality of worker threads for controlling a plurality of worker groups respectively, the worker groups including a plurality of GPUs; binding each worker thread to a corresponding GPU; loading one batch of training data from a nonvolatile memory to a GPU video memory corresponding to one worker group; transmitting, between a plurality of GPUs corresponding to one worker group, data required by data processing performed by the GPUs through peer to peer; and controlling the plurality of GPUs to perform data processing in parallel through the worker threads.
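The thread-to-GPU arrangement in the abstract can be illustrated with a minimal sketch. It assumes plain Python threads and simulated devices: the function name `run_workers`, the integer GPU ids, and the partial-sum computation are hypothetical stand-ins, not anything specified by the patent. A real system would bind each thread to a physical device and move data over peer-to-peer links.

```python
import threading


def run_workers(num_groups, gpus_per_group, batches):
    """Start one CPU worker thread per simulated GPU.

    batches[g] is the single batch loaded for worker group g.
    Returns {gpu_id: (group_id, partial_result)}.
    """
    results = {}
    lock = threading.Lock()

    def worker(group_id, local_id, gpu_id, batch):
        # Assumption: a real worker would bind itself to device `gpu_id` here.
        # Stand-in computation: each GPU in the group handles its own
        # interleaved slice of the group's batch.
        part = batch[local_id::gpus_per_group]
        with lock:
            results[gpu_id] = (group_id, sum(part))

    threads = []
    for group_id in range(num_groups):
        for local_id in range(gpus_per_group):
            gpu_id = group_id * gpus_per_group + local_id
            t = threading.Thread(
                target=worker,
                args=(group_id, local_id, gpu_id, batches[group_id]),
            )
            threads.append(t)
            t.start()

    # All workers run in parallel; wait for every one to finish.
    for t in threads:
        t.join()
    return results
```

Calling `run_workers(2, 2, [[1, 2, 3, 4], [10, 20, 30, 40]])` starts four threads, two per worker group, each working on its own share of its group's batch.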
Opening claim text (preview).
What is claimed is:

1. A parallel data processing method based on multiple graphic processing units (GPUs), comprising: creating, in a central processing unit (CPU), a plurality of worker threads for controlling a plurality of worker groups respectively, the worker groups comprising a plurality of GPUs; binding each worker thread to a corresponding GPU; loading one batch of training data from a nonvolatile memory to a GPU video memory corresponding to one worker group; transmitting, between a plurality of GPUs corresponding to one worker group, data required by data processing performed by the GPUs through peer to peer; and controlling the plurality of GPUs to perform data processing in parallel through the worker threads; the method further comprising: opening up write cache and read cache in a random access memory (RAM), sizes of the write cache and the read cache being the size of a storage structure configured to store one batch of training data * the total number of worker groups; processing all the worker threads in a barrier state before the write cache is full; and exchanging preset indexes pointing to the write cache and the read cache after the write cache is full.

2. The method according to claim 1, comprising: creating one I/O thread, and loading the plurality of batches of training data into a random memory through the I/O thread; and pre-processing the training data on the CPU through a thread pool, wherein the I/O thread, threads in the thread pool, the worker threads and data processing in the CPU are performed in parallel.

3. The method according to claim 1, comprising: dividing a storage region in each GPU where model parameters and gradients are stored into N partitions according to the number of the GPUs 2N; presetting sequence numbers of the 2N GPUs to be 0, 1, 2 ... 2N−1 respectively; within a cycle where the sequence number is k (k is an integer and 1 ≤ k ≤ 2N−1), replicating a preset partition in the N partitions from a GPU whose sequence number is i to a GPU whose sequence number is j, and merging the gradients, wherein i = (2m+k+1) % N, j = (2m+k+2) % N, m is an integer and 0 ≤ m ≤ N−1; and for partition owners in the 2N GPUs, updating the model parameters according to gradient merging results in the corresponding partitions, wherein the partition owners are GPUs having gradient merging results in all other GPUs for a preset partition.

4. The method according to claim 3, comprising: within a cycle where the sequence number is k, replicating a preset partition in the N partitions from a GPU whose sequence number is a to a GPU whose sequence number is b, wherein a = (2m+k) % N, and b = (2m+k+1) % N.

5. The method according to claim 3, comprising: for the partition owners, computing a learning rate learning_rate_i of a parameter in the position i according to the following adaptive learning rate updating formula:

$$\text{helper\_sum}'_i = \sum_{j=0}^{\text{GROUP\_NUM}} \text{helper\_sum\_part}'^{\,j}_i$$

$$\text{learning\_rate}_i = \text{aggregating\_rate} \cdot \frac{\text{adagrad\_rho}}{\text{adagrad\_rho} + \text{helper\_sum}'_i}$$

wherein GROUP_NUM denotes the number of worker groups, aggregating_rate denotes an aggregating learning rate, and adagrad_rho denotes auxiliary quantity for computing an adaptive learning rate; and for non-partition owners, computing a learning rate learning_rate_i of a parameter in the position i according to the following adaptive learning rate updating formula:

$$\text{helper\_sum\_part}'^{\,j}_i = 0$$

6. The method according to claim 1, comprising: loading a hierarchical model according to a model configuration file of a convolutional neural network; and if it is identified that two adjacent layers in the hierarchical model are completed by different GPUs, adding a data transport layer between the two adjacent layers, the data transport layer being configured to perform the step of transmitting, between a plurality of GPUs corresponding to one worker group, data required by data processing performed by the GPUs through peer to peer.

7. The method according to claim 1, wherein the controlling the plurality of GPUs to perform data processing in parallel through the worker threads comprises: controlling a plurality of GPUs in the same worker group to respectively train different parts of the same model through the worker threads.

8. A data parallel processing apparatus based on multiple graphic processing units (GPUs), comprising: a thread creation module, configured to create, in a central processing unit (CPU), a plurality of worker threads for controlling a plurality of worker groups respectively, the worker groups comprising a plurality of GPUs; a thread binding module, configured to bind each worker thread to a corresponding GPU; a data distribution module, configured to load one batch of training data from a nonvolatile memory to a GPU video memory corresponding to one worker group; a transmission module, configured to transmit, between a plurality of GPUs corresponding to one worker group, data required by data processing performed by the GPUs through peer to peer; and a data processing control module, configured to control the plurality of GPUs to perform data processing in parallel through the worker threads; the apparatus further comprising: a cache creation module, configured to open up write cache and read cache in a random access memory (RAM), sizes of the write cache and the read cache being the size of a storage structure configured to store one batch of training data * the total number of worker groups; a thread barrier module, configured to process all the worker threads in a barrier state before the write cache is full; and a cache exchange module, configured to exchange preset indexes pointing to the write cache and the read cache after the write cache is full.

9. The apparatus according to claim 8, wherein the thread creation module is further configured to create one I/O thread, and load the plurality of batches of training data into a rando
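The write/read cache exchange recited in claims 1 and 8 can be pictured as a double buffer guarded by a barrier. The sketch below is one possible reading, not the patented implementation: the class `DoubleBufferedCache`, its methods `fill` and `take`, and the helper `run_demo` are hypothetical names. It assumes each cache holds one batch slot per worker group and that a single loader thread fills the write cache.

```python
import threading


class DoubleBufferedCache:
    """Two caches with one batch slot per worker group; indexes swap when full."""

    def __init__(self, num_groups):
        self.buffers = [[None] * num_groups, [None] * num_groups]
        self.write_idx, self.read_idx = 0, 1
        # Parties: every worker group plus the loader. The action runs once,
        # after all parties have arrived and before any of them is released.
        self.barrier = threading.Barrier(num_groups + 1, action=self._swap)

    def _swap(self):
        # Exchange the indexes that point at the write and read caches.
        self.write_idx, self.read_idx = self.read_idx, self.write_idx

    def fill(self, batches):
        """Loader side: write one batch per group, then signal 'cache full'."""
        for group_id, batch in enumerate(batches):
            self.buffers[self.write_idx][group_id] = batch
        self.barrier.wait()

    def take(self, group_id):
        """Worker side: stay in the barrier until the write cache is full."""
        self.barrier.wait()
        return self.buffers[self.read_idx][group_id]


def run_demo(rounds, num_groups=2):
    """Feed `rounds` (a list of per-group batch lists) through the cache."""
    cache = DoubleBufferedCache(num_groups)
    seen = [[] for _ in range(num_groups)]

    def worker(group_id):
        for _ in rounds:
            seen[group_id].append(cache.take(group_id))

    threads = [threading.Thread(target=worker, args=(g,)) for g in range(num_groups)]
    for t in threads:
        t.start()
    for batches in rounds:
        cache.fill(batches)
    for t in threads:
        t.join()
    return seen
```

With this arrangement the loader can start filling the next round into the other cache while workers are still consuming the current one, which is the usual motivation for double buffering.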
Processor architectures; Processor configuration, e.g. pipelining · CPC title
involving image processing hardware · CPC title
Barrier synchronisation · CPC title