Task parallel processing method, apparatus and system, storage medium and computer device
US-2020012521-A1 · Jan 9, 2020 · US
US11561826B1 · US · B1
| Field | Value |
|---|---|
| Publication number | US-11561826-B1 |
| Application number | US-202017096136-A |
| Country | US |
| Kind code | B1 |
| Filing date | Nov 12, 2020 |
| Priority date | Nov 12, 2020 |
| Publication date | Jan 24, 2023 |
| Grant date | Jan 24, 2023 |
A practical reading order for non-experts. Skip the full description unless you need deep technical detail.
What the patent document calls the invention.
A short plain-language summary of the technical disclosure.
Who owns or filed the patent and who is credited as inventor.
Filing, priority, publication, and grant dates set the timeline.
The legal scope of protection — read this for what is actually claimed.
Technology tags used to group this patent with similar filings.
Prior art links and similar publications in this corpus.
Official abstract text for this publication.
Scheduling work of a machine learning application includes instantiating kernel objects by a computer processor in response to input of kernel definitions. Each kernel object is of a kernel type indicating a compute circuit. The computer processor generates a graph in a memory. Each node represents a task and specifies an assignment of the task to one or more of the kernel objects, and each edge represents a data dependency. Task queues are created in the memory and assigned to queue tasks represented by the nodes. Kernel objects are assigned to the task queues, and the tasks are enqueued by threads executing the kernel objects, based on assignments of the kernel objects to the task queues and assignments of the tasks to the kernel objects. Tasks are dequeued by the threads, and the compute circuits are activated to initiate processing of the dequeued tasks.
Opening claim text (preview).
What is claimed is: 1. A method comprising: instantiating a plurality of kernel objects by a computer processor in response to input of a plurality of kernel definitions, respectively, wherein each kernel object is of a kernel type of a plurality of kernel types, and each kernel type indicates a compute circuit of a heterogeneous plurality of compute circuits; generating a graph in a memory by the computer processor, wherein the graph has nodes and edges, each node represents a task and specifies an assignment of the task to one or more of the kernel objects, and each edge represents a data dependency between the nodes; creating a plurality of task queues in the memory, and assigning each task queue to queue tasks represented by one or more of the nodes; assigning each of the kernel objects to one of the task queues; enqueuing the tasks represented by the nodes in the plurality of task queues by threads executing the kernel objects on the computer processor, based on assignments of the kernel objects to the task queues and assignments of the tasks to the kernel objects; and dequeuing the tasks from the plurality of task queues by the threads executing the kernel objects based on the assignments of the kernel objects to the task queues and the assignments of the tasks to the kernel objects, and activating ones of the compute circuits by the threads executing the kernel objects to initiate processing of the dequeued tasks. 2. The method of claim 1 , further comprising: waiting, in response to definition of a kernel object of the plurality of kernel objects being blocking, to dequeue another task from a task queue of the plurality of task queues by the thread executing the kernel object, until processing of a previous task dequeued by the thread executing the kernel object is complete; and continuing to the dequeuing, in response to definition of a kernel object of the plurality of kernel objects being non-blocking, of another task from a task queue of the plurality of task queues by the thread executing the kernel object without waiting for processing of the previous task dequeued by the thread executing the kernel object to complete. 3. The method of claim 1 , further comprising: generating a plurality of graphs in the memory, wherein the tasks represented by each graph are independent of data processed by the tasks represented by each other graph, and each graph has nodes and edges, each node represents a task and specifies an assignment of the task to one or more of the kernel objects, and each edge represents a data dependency between the nodes; and wherein the enqueuing includes enqueuing in at least one task queue of the plurality of task queues, a first task represented by a node in a first graph of the plurality of graphs and a second task represented by a node in a second graph of the plurality of graphs. 4. The method of claim 1 , wherein: the respective kernel types include a first kernel type, and the plurality of kernel objects includes a first kernel object and a second kernel object of the first kernel type; the dequeuing includes: dequeuing a first task from a first task queue of the plurality of task queues by a thread executing the first kernel object, and activating a first compute circuit of the plurality of compute circuits indicated by the first kernel type to initiate processing of the first task; and dequeuing a second task from the first task queue by a thread executing the second kernel object, and activating the first compute circuit to initiate processing of the second task. 5. The method of claim 1 , wherein: the respective kernel types include a first kernel type and a second kernel type, and the plurality of kernel objects includes a first kernel object of the first kernel type and a second kernel object of the second kernel type; the dequeuing includes: dequeuing a first task from a first task queue of the plurality of task queues by a thread executing the first kernel object, and activating a first compute circuit of the plurality of compute circuits indicated by the first kernel type to initiate processing of the first task; and dequeuing a second task from the first task queue by a thread executing the second kernel object, and activating a second compute circuit indicated by the second kernel type to initiate processing of the second task. 6. The method of claim 1 , wherein: the instantiating the plurality of kernel objects includes executing at least one of the kernel objects by two or more threads; and the dequeuing includes dequeuing, based assignment of at least one of the kernel objects to one of the task queues, tasks from the one of the task queues by the two or more threads, and activating the compute circuit indicated by the kernel type of the at least one of the kernel objects by the two or more threads to initiate processing of the dequeued tasks. 7. The method of claim 6 , wherein the instantiating the plurality of kernel objects includes selecting a number of threads to execute by the at least one kernel object in response to a configurable parameter associated with the at least one kernel object. 8. The method of claim 7 , further comprising: measuring performance of the plurality of compute circuits by the kernel objects; and adjusting the configurable parameter to adjust the number of threads to execute in response to the performance measured for the compute circuit associated with the at least one kernel object. 9. The method of claim 1 , further comprising: generating a plurality of graphs in the memory, wherein the tasks represented by each graph are independent of data processed by the tasks represented by each other graph, each graph includes respective storage areas for output data generated by the nodes of the graph, each graph has nodes and edges, each node represents a task and specifies an assignment of the task to one or more of the kernel objects, and each edge represents a data dependency between the nodes; wherein the activating of a compute circuit of the ones of the compute circuits to initiate processing of a dequeued task includes: communicating one or more memory addresses of one or more of the respective storage areas to the compute circuit, wherein each of the one or more of the respective storage areas has output data produced in processing a task represented by a parent node of the node of the dequeued task, and the dequeued task is dependent on the output data; and communicating to the compute circuit, a memory address of the respective storage area for output data generated in processing of the dequeued task. 10. The method of claim 9 , wherein: each graph includes storage for respective in-degree counts of the nodes of the graph, and each in-degree count indicates a number of parent nodes of a node of the graph on which the task represented by the node is dependent on output data; decrementing an in-degree count of a node of a graph by thread executing a kernel object in response to completion of a task represented by a parent node of the node; and wherein the enqueuing includes enqueuing the task represented by the node of the graph in response to the in-degree count indicating completion of each task of each parent node of the node. 11. The method of claim 1 , wherein: the instantiating the plurality of kernel objects includes executing each of the kernel objects by one or more threads; determining respective utilization levels of the kernel objects; and automatically increasing a number of threads executing a kernel object in response to the respective utilization level of the kernel object being greater than a threshold. 12. The method of claim 1 , furth
Graphs; Linked lists (G06F16/9027 takes precedence) · CPC title
Learning methods · CPC title
Scheduling strategies for dispatcher, e.g. round robin, multi-level priority queues · CPC title
considering hardware capabilities · CPC title
considering the execution order of a plurality of tasks, e.g. taking priority or time dependency constraints into consideration (scheduling strategies G06F9/4881 and subgroups) · CPC title
Related publications grouped by family.
Answers are generated from the same data shown on this page.