Systems and methods for deep learning processor
US-2017316312-A1 · Nov 2, 2017 · US
US12112174B2 · US · B2
| Field | Value |
|---|---|
| Publication number | US-12112174-B2 |
| Application number | US-201816226534-A |
| Country | US |
| Kind code | B2 |
| Filing date | Dec 19, 2018 |
| Priority date | Feb 8, 2018 |
| Publication date | Oct 8, 2024 |
| Grant date | Oct 8, 2024 |
Official abstract text for this publication.
A programmable hardware system for machine learning (ML) includes a core and a streaming engine. The core receives a plurality of commands and a plurality of data from a host to be analyzed and inferred via machine learning. The core transmits a first subset of commands of the plurality of commands that is performance-critical operations and associated data thereof of the plurality of data for efficient processing thereof. The first subset of commands and the associated data are passed through via a function call. The streaming engine is coupled to the core and receives the first subset of commands and the associated data from the core. The streaming engine streams a second subset of commands of the first subset of commands and its associated data to an inference engine by executing a single instruction.
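As a rough illustration of the flow the abstract describes, the Python sketch below has a core forward only performance-critical commands through a function call and keep the rest. The operation labels loosely follow the categories listed in claim 1 of this publication. The names `Command`, `CRITICAL_OPS`, `partition`, and `core_dispatch` are hypothetical and do not come from the patent, which describes hardware rather than software of this kind.

```python
from dataclasses import dataclass
from typing import Any, Callable, List, Tuple

# Hypothetical labels standing in for the performance-critical categories
# named in claim 1 (matrix, tanh, sigmoid, transpose, addition operations).
CRITICAL_OPS = {"matmul", "tanh", "sigmoid", "transpose", "add"}


@dataclass
class Command:
    op: str
    data: Any = None


def partition(commands: List[Command]) -> Tuple[List[Command], List[Command]]:
    """Split commands into (performance-critical, performance-noncritical)."""
    critical = [c for c in commands if c.op in CRITICAL_OPS]
    noncritical = [c for c in commands if c.op not in CRITICAL_OPS]
    return critical, noncritical


def core_dispatch(commands: List[Command],
                  submit: Callable[[str, Any], None]) -> List[Command]:
    """Pass each critical command and its data as parameters of a function
    call; return the noncritical commands, which are not forwarded."""
    critical, noncritical = partition(commands)
    for c in critical:
        submit(c.op, c.data)
    return noncritical


# Usage: a plain list stands in for the streaming engine's input.
forwarded = []
kept = core_dispatch(
    [Command("matmul", [1, 2]), Command("data_collection"), Command("tanh", 0.5)],
    lambda op, data: forwarded.append((op, data)),
)
```

In this sketch the `submit` callback plays the role of the function call that carries a command and its data toward the streaming engine. How the real system encodes that call is not stated in the text shown here.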
Opening claim text (preview).
What is claimed is:
1. A programmable hardware system for machine learning (ML), comprising: a core configured to receive a plurality of commands and data from a host to be analyzed and inferred via machine learning; divide the plurality of commands into a first subset of commands associated with performance-critical operations and a second subset of commands associated with performance-noncritical operations, wherein the performance-critical operations include at least one or more of a matrix operation, tanh operation, sigmoid operation, memory transpose operation, addition operations, and operations on one or more of any of a tree, a graph, and a priority queue, wherein the performance-noncritical operations include at least one or more of data collection and data mapping, wherein the performance-critical operations exclude any of data collection and data mapping, wherein the performance-noncritical operations exclude any of a matrix operation, tanh operation, sigmoid operation, memory transpose operation, addition operations, and operations on one or more of a tree, a graph, and a priority queue; transmit each command of the first subset of commands of the plurality of commands for performance-critical operations and associated data thereof to an inference engine for processing via a function call, wherein the each command of the first subset of commands and/or the associated data are encapsulated as parameters in the function call, wherein the second subset of commands associated with performance-noncritical operations is not transmitted to the inference engine; an instruction streaming engine coupled to the core and further coupled to the inference engine, wherein the streaming engine is configured to retrieve and maintain the each command of the first subset of commands and/or the associated data from the function call at a specific location in a buffer; stream the each command of the first subset of commands and/or its associated data to the inference engine from the buffer; and said inference engine configured to retrieve the each command of the first subset of commands and/or its associated data streamed from the buffer; perform the performance-critical operations according to the each command of the first subset of commands; analyze the data; and infer a subject from the data.
2. The programmable hardware system of claim 1, wherein the streaming engine is further configured to receive a streamed inferred data from the inference engine.
3. The programmable hardware system of claim 2, wherein the streaming engine is further configured to stream the received inferred data to the core.
4. The programmable hardware system of claim 1, wherein the streaming engine comprises: an instruction streaming engine coupled to the core, wherein the instruction streaming engine is configured to stream the first subset of commands to the inference engine; and a data streaming engine coupled to the inference engine and configured to generate one or more streams of data associated with the first subset of commands, and wherein the data streaming engine is configured to stream the one or more streams of data to the inference engine to be analyzed and inferred.
5. The programmable hardware system of claim 1, wherein the streaming engine is configured to stream instructions to the inference engine in an instruction set architecture that is different from an instruction set architecture format received from the core.
6. The programmable hardware system of claim 1, wherein the buffer is coupled to the core and further coupled to the streaming engine, wherein the core continuously writes to the buffer until a certain condition is met, and wherein the streaming engine continuously reads from the buffer until another certain condition is met.
7. The programmable hardware system of claim 6, wherein the certain condition is when available buffer associated with the buffer is below a threshold value.
8. The programmable hardware system of claim 7, wherein the available buffer is tracked using a head pointer maintained by the core locally, and wherein the head pointer is incremented each time the core writes to the buffer and the available buffer associated with the buffer is decremented each time the core writes to the buffer.
9. The programmable hardware system of claim 7, wherein the core reads a value stored in a memory mapped input/output (MMIO) responsive to the certain condition being met, wherein the MMIO stores a value of the head pointer and a tail pointer associated with a location the streaming engine reads from the buffer, and wherein the core is configured to set the available buffer size.
10. The programmable hardware system of claim 9, wherein the core is configured to set the available buffer size to the tail pointer minus the head pointer and result thereof modulo actual size of the buffer.
11. The programmable hardware system of claim 6, wherein the another certain condition is when buffer size to read from is greater than zero.
12. The programmable hardware system of claim 11, wherein the buffer size to read from is tracked using a tail pointer maintained by the streaming engine locally, and wherein the tail pointer is incremented each time the streaming engine reads from the buffer and wherein the buffer size to read from is the tail pointer minus a head pointer and result thereof modulo actual size of the buffer, wherein the head pointer is maintained by the core locally and incremented each time the core writes to the buffer.
13. The programmable hardware system of claim 1, wherein the core maintains a head pointer where the core writes to and wherein the streaming engine maintains a tail pointer where the streaming engine reads from, and wherein the head pointer and the tail pointer are stored in a memory mapped input/output (MMIO) space that is mapped into registers in the streaming engine.
14. The programmable hardware system of claim 1, wherein the buffer is a circular buffer allocated in a DDR memory, and wherein a size of the buffer is fixed a-priori at compile time.
15. A programmable hardware system for machine learning (ML), comprising: a core configured to receive a plurality of commands and a plurality of data from a host to be analyzed and inferred via machine learning, wherein the core is configured to divide the plurality of commands into a first subset of commands associated with performance-critical operations and a second subset of commands associated with performance-noncritical operations, wherein the performance-critical operations include at least one or more of a matrix operation, tanh operation, sigmoid operation, memory transpose operation, addition operations, and operations on one or more of any of a tree, a graph, and a priority queue, wherein the performance-noncritical operations include at least one or more of data collection and data mapping, wherein the performance-critical operations exclude any of data collection and data mapping, wherein the performance-noncritical operations exclude any of a matrix operation, tanh operation, sigmoid operation, memory transpose operation, addition operations, and operations on one or more of a tree, a graph, and a priority queue, and wherein the core is further configured to transmit the first subset of commands of the plurality of commands that is performance-critical operations and associated data thereof of the plurality of data for processing, wherein the first subset of commands and the associated data are passed through via a function call, wherein the second subset of commands associated with performance-noncritical operations is
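The head and tail pointer bookkeeping recited in the buffer claims (claims 6 through 14) can be pictured with a small software ring buffer. The following Python sketch is only an illustration under stated assumptions, not the patented hardware. The class name `CommandRing` and its methods are hypothetical, and the shared `head` and `tail` attributes stand in for the MMIO-visible pointers. The claims do not say how a full buffer is told apart from an empty one, so the sketch assumes one slot is always kept free (a common ring-buffer convention), which adds a `- 1` to the free-space formula. It also counts readable entries as head minus tail modulo the size, the conventional orientation, which is an assumption of the sketch rather than a reading of claim 12.

```python
class CommandRing:
    """Single-writer, single-reader circular buffer (hypothetical sketch)."""

    def __init__(self, size: int, threshold: int = 1):
        self.size = size                # fixed up front, as in claim 14
        self.slots = [None] * size
        self.head = 0                   # next slot the core writes to
        self.tail = 0                   # next slot the streaming engine reads
        self.threshold = threshold
        # Core's local estimate of free slots; one slot is reserved
        # (assumption) so that head == tail always means "empty".
        self.available = size - 1

    def refresh_available(self) -> None:
        # Core re-reads the shared tail pointer and recomputes free space
        # with modulo arithmetic, minus the reserved slot.
        self.available = (self.tail - self.head - 1) % self.size

    def write(self, command) -> bool:
        """Core side: returns False when the buffer is full."""
        if self.available < self.threshold:
            self.refresh_available()
            if self.available == 0:
                return False
        self.slots[self.head] = command
        self.head = (self.head + 1) % self.size   # advance on every write
        self.available -= 1                       # local estimate shrinks
        return True

    def readable(self) -> int:
        # Entries written but not yet read (conventional orientation).
        return (self.head - self.tail) % self.size

    def read(self):
        """Streaming-engine side: returns None when nothing is pending."""
        if self.readable() == 0:
            return None
        command = self.slots[self.tail]
        self.tail = (self.tail + 1) % self.size   # advance on every read
        return command


# Usage: with size 4, three commands fit before the writer must wait.
ring = CommandRing(size=4)
results = [ring.write(c) for c in ("matmul", "tanh", "sigmoid", "add")]
```

The design point this illustrates is that the writer consults the shared tail pointer only when its local free-space estimate runs low, so most writes touch nothing but local state.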
System on chip, i.e. computer system on a single chip; System in package, i.e. computer system on one or more chips in a single package · CPC title
Matrix or vector computation {, e.g. matrix-matrix or matrix-vector multiplication, matrix factorization (matrix transposition G06F7/78)} · CPC title
from multiple instruction streams, e.g. multistreaming · CPC title
for non-native instruction set, e.g. Javabyte, legacy code · CPC title
Inference or reasoning models · CPC title