Scheduling processing of machine learning tasks on heterogeneous compute circuits

US11561826B1 · US · B1

Patent metadata
FieldValue
Publication numberUS-11561826-B1
Application numberUS-202017096136-A
CountryUS
Kind codeB1
Filing dateNov 12, 2020
Priority dateNov 12, 2020
Publication dateJan 24, 2023
Grant dateJan 24, 2023

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

Scheduling work of a machine learning application includes instantiating kernel objects by a computer processor in response to input of kernel definitions. Each kernel object is of a kernel type indicating a compute circuit. The computer processor generates a graph in a memory. Each node represents a task and specifies an assignment of the task to one or more of the kernel objects, and each edge represents a data dependency. Task queues are created in the memory and assigned to queue tasks represented by the nodes. Kernel objects are assigned to the task queues, and the tasks are enqueued by threads executing the kernel objects, based on assignments of the kernel objects to the task queues and assignments of the tasks to the kernel objects. Tasks are dequeued by the threads, and the compute circuits are activated to initiate processing of the dequeued tasks.

First claim

Opening claim text (preview).

What is claimed is: 1. A method comprising: instantiating a plurality of kernel objects by a computer processor in response to input of a plurality of kernel definitions, respectively, wherein each kernel object is of a kernel type of a plurality of kernel types, and each kernel type indicates a compute circuit of a heterogeneous plurality of compute circuits; generating a graph in a memory by the computer processor, wherein the graph has nodes and edges, each node represents a task and specifies an assignment of the task to one or more of the kernel objects, and each edge represents a data dependency between the nodes; creating a plurality of task queues in the memory, and assigning each task queue to queue tasks represented by one or more of the nodes; assigning each of the kernel objects to one of the task queues; enqueuing the tasks represented by the nodes in the plurality of task queues by threads executing the kernel objects on the computer processor, based on assignments of the kernel objects to the task queues and assignments of the tasks to the kernel objects; and dequeuing the tasks from the plurality of task queues by the threads executing the kernel objects based on the assignments of the kernel objects to the task queues and the assignments of the tasks to the kernel objects, and activating ones of the compute circuits by the threads executing the kernel objects to initiate processing of the dequeued tasks. 2. The method of claim 1 , further comprising: waiting, in response to definition of a kernel object of the plurality of kernel objects being blocking, to dequeue another task from a task queue of the plurality of task queues by the thread executing the kernel object, until processing of a previous task dequeued by the thread executing the kernel object is complete; and continuing to the dequeuing, in response to definition of a kernel object of the plurality of kernel objects being non-blocking, of another task from a task queue of the plurality of task queues by the thread executing the kernel object without waiting for processing of the previous task dequeued by the thread executing the kernel object to complete. 3. The method of claim 1 , further comprising: generating a plurality of graphs in the memory, wherein the tasks represented by each graph are independent of data processed by the tasks represented by each other graph, and each graph has nodes and edges, each node represents a task and specifies an assignment of the task to one or more of the kernel objects, and each edge represents a data dependency between the nodes; and wherein the enqueuing includes enqueuing in at least one task queue of the plurality of task queues, a first task represented by a node in a first graph of the plurality of graphs and a second task represented by a node in a second graph of the plurality of graphs. 4. The method of claim 1 , wherein: the respective kernel types include a first kernel type, and the plurality of kernel objects includes a first kernel object and a second kernel object of the first kernel type; the dequeuing includes: dequeuing a first task from a first task queue of the plurality of task queues by a thread executing the first kernel object, and activating a first compute circuit of the plurality of compute circuits indicated by the first kernel type to initiate processing of the first task; and dequeuing a second task from the first task queue by a thread executing the second kernel object, and activating the first compute circuit to initiate processing of the second task. 5. The method of claim 1 , wherein: the respective kernel types include a first kernel type and a second kernel type, and the plurality of kernel objects includes a first kernel object of the first kernel type and a second kernel object of the second kernel type; the dequeuing includes: dequeuing a first task from a first task queue of the plurality of task queues by a thread executing the first kernel object, and activating a first compute circuit of the plurality of compute circuits indicated by the first kernel type to initiate processing of the first task; and dequeuing a second task from the first task queue by a thread executing the second kernel object, and activating a second compute circuit indicated by the second kernel type to initiate processing of the second task. 6. The method of claim 1 , wherein: the instantiating the plurality of kernel objects includes executing at least one of the kernel objects by two or more threads; and the dequeuing includes dequeuing, based assignment of at least one of the kernel objects to one of the task queues, tasks from the one of the task queues by the two or more threads, and activating the compute circuit indicated by the kernel type of the at least one of the kernel objects by the two or more threads to initiate processing of the dequeued tasks. 7. The method of claim 6 , wherein the instantiating the plurality of kernel objects includes selecting a number of threads to execute by the at least one kernel object in response to a configurable parameter associated with the at least one kernel object. 8. The method of claim 7 , further comprising: measuring performance of the plurality of compute circuits by the kernel objects; and adjusting the configurable parameter to adjust the number of threads to execute in response to the performance measured for the compute circuit associated with the at least one kernel object. 9. The method of claim 1 , further comprising: generating a plurality of graphs in the memory, wherein the tasks represented by each graph are independent of data processed by the tasks represented by each other graph, each graph includes respective storage areas for output data generated by the nodes of the graph, each graph has nodes and edges, each node represents a task and specifies an assignment of the task to one or more of the kernel objects, and each edge represents a data dependency between the nodes; wherein the activating of a compute circuit of the ones of the compute circuits to initiate processing of a dequeued task includes: communicating one or more memory addresses of one or more of the respective storage areas to the compute circuit, wherein each of the one or more of the respective storage areas has output data produced in processing a task represented by a parent node of the node of the dequeued task, and the dequeued task is dependent on the output data; and communicating to the compute circuit, a memory address of the respective storage area for output data generated in processing of the dequeued task. 10. The method of claim 9 , wherein: each graph includes storage for respective in-degree counts of the nodes of the graph, and each in-degree count indicates a number of parent nodes of a node of the graph on which the task represented by the node is dependent on output data; decrementing an in-degree count of a node of a graph by thread executing a kernel object in response to completion of a task represented by a parent node of the node; and wherein the enqueuing includes enqueuing the task represented by the node of the graph in response to the in-degree count indicating completion of each task of each parent node of the node. 11. The method of claim 1 , wherein: the instantiating the plurality of kernel objects includes executing each of the kernel objects by one or more threads; determining respective utilization levels of the kernel objects; and automatically increasing a number of threads executing a kernel object in response to the respective utilization level of the kernel object being greater than a threshold. 12. The method of claim 1 , furth

Assignees

Inventors

Classifications

  • Graphs; Linked lists (G06F16/9027 takes precedence) · CPC title

  • Learning methods · CPC title

  • G06F9/4881Primary

    Scheduling strategies for dispatcher, e.g. round robin, multi-level priority queues · CPC title

  • G06F9/5044Primary

    considering hardware capabilities · CPC title

  • considering the execution order of a plurality of tasks, e.g. taking priority or time dependency constraints into consideration (scheduling strategies G06F9/4881 and subgroups) · CPC title

Patent family

Related publications grouped by family.

External sources

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US11561826B1 cover?
Scheduling work of a machine learning application includes instantiating kernel objects by a computer processor in response to input of kernel definitions. Each kernel object is of a kernel type indicating a compute circuit. The computer processor generates a graph in a memory. Each node represents a task and specifies an assignment of the task to one or more of the kernel objects, and each edg…
Who is the assignee on this patent?
Xilinx Inc
What technology area does this patent fall under?
Primary CPC classification G06F9/4881. Mapped technology areas include Physics.
When was this patent published?
Publication date Tue Jan 24 2023 00:00:00 GMT+0000 (Coordinated Universal Time) (B1). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 2 related publications on this page (citations in our corpus or others sharing the same primary CPC).