Array-based inference engine for machine learning
US-2019243800-A1 · Aug 8, 2019 · US
US11029963B2 · US · B2
| Field | Value |
|---|---|
| Publication number | US-11029963-B2 |
| Application number | US-201816226559-A |
| Country | US |
| Kind code | B2 |
| Filing date | Dec 19, 2018 |
| Priority date | Feb 8, 2018 |
| Publication date | Jun 8, 2021 |
| Grant date | Jun 8, 2021 |
Official abstract text for this publication.
A processing unit of an inference engine for machine learning (ML) includes a first data load streamer, a second data load streamer, an operator component, and a store streamer. The first data load streamer streams a first data stream from an on-chip memory (OCM) to the operator component. The second data load streamer streams a second data stream from the OCM to the operator component. The operator component performs a matrix operation on the first data stream and the second data stream. The store streamer receives a data output stream from the operator component and stores the data output stream in a buffer.
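The dataflow in the abstract can be pictured with a small software model. The sketch below is illustrative only and is not the patented hardware: the OCM is modeled as a Python list, each streamer as a generator, and the names `load_streamer`, `operator_component`, `store_streamer`, and `run_processing_unit` are hypothetical. Element-wise addition and multiplication are used because the claims list them among the supported data operations.

```python
from typing import Callable, Iterable, Iterator, List


def load_streamer(ocm: List[float], start: int, count: int) -> Iterator[float]:
    """Hypothetical load streamer: yield `count` values from the OCM model."""
    for i in range(start, start + count):
        yield ocm[i]


def operator_component(
    first: Iterable[float],
    second: Iterable[float],
    op: Callable[[float, float], float],
) -> Iterator[float]:
    """Hypothetical operator: combine the two input streams element by element."""
    for x, y in zip(first, second):
        yield op(x, y)


def store_streamer(results: Iterable[float]) -> List[float]:
    """Hypothetical store streamer: drain the output stream into a buffer."""
    return list(results)


def run_processing_unit(
    ocm: List[float],
    first_start: int,
    second_start: int,
    count: int,
    op: Callable[[float, float], float],
) -> List[float]:
    """Wire two load streamers into the operator and store the result."""
    first = load_streamer(ocm, first_start, count)
    second = load_streamer(ocm, second_start, count)
    return store_streamer(operator_component(first, second, op))


# Assumed example data: two 4-element operands stored back to back.
ocm = [1, 2, 3, 4, 10, 20, 30, 40]
sums = run_processing_unit(ocm, 0, 4, 4, lambda x, y: x + y)
products = run_processing_unit(ocm, 0, 4, 4, lambda x, y: x * y)
```

Because every stage is a generator, each operand is read from the modeled OCM once and passed straight through to the buffer, which loosely mirrors the streaming behavior the abstract describes.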
Opening claim text (preview).
What is claimed is:

1. A processing unit of an inference engine for machine learning (ML), comprising: a first data load streamer configured to stream a first data stream comprising a first plurality of data sections from an on-chip memory (OCM), using a single instruction, to an operator component by using an address of the OCM and a pattern of data to be loaded for the first data stream to be read and streamed; a second data load streamer configured to stream a second data stream comprising a second plurality of data sections from the OCM, using a single instruction, to the operator component by using an address of the OCM and a pattern of data to be loaded for the second data stream to be read and streamed; the operator component configured to perform a data operation on the first data stream and the second data stream; and a store streamer configured to receive a data output stream from the operator component and to store the data output stream in a buffer, wherein the pattern of data to be loaded for the first data stream includes a stride to a next block and a stride between lines, wherein the first data stream pattern is specified by one or more of a starting address, number of lines to read for each operation, number of bytes per line, and a number of blocks to read.

2. The processing unit of claim 1, wherein the data operation is a matrix multiplication operation and is selected from a group consisting of determining a maximum value, calculating an average value for a stream of data, calculating an addition of the first data stream to the second data stream, calculating a multiplication of the first data stream to the second data stream, rewriting the first data stream in a different pattern for matrix transformation, Tanh operation, Sigmoid operation, spatial batch normalization operation, and local response normalization.

3. The processing unit of claim 1 further comprising an instruction controller configured to store instructions received from a core engine.

4. The processing unit of claim 1, wherein the buffer is configured to stream the data output stream to the OCM for storage thereof.

5. The processing unit of claim 1, wherein the data output stream is specified by one or more of a starting address, a number of lines to write, line stride between lines, a number of bytes per line, and stride to a next block.

6. The processing unit of claim 1, wherein the first data load streamer, the second data load streamer, the operator component, and the store streamer are configured to iteratively execute and process data until a termination condition is met.

7. A processing unit of an inference engine for machine learning (ML), comprising: a first data load streamer configured to stream a first data stream comprising a first plurality of data sections from an on-chip memory (OCM), using a single instruction, to an operator component by using an address of the OCM and a pattern of data to be loaded for the first data stream to be read and streamed; a second data load streamer configured to stream a second data stream comprising a second plurality of data sections from the OCM, using a single instruction, to the operator component by using an address of the OCM and a pattern of data to be loaded for the second data stream to be read and streamed; the operator component configured to perform a matrix operation on the first data stream and the second data stream, wherein the matrix operation is performed by another processing unit that reads data within each matrix only once and wherein the another processing unit is configured to receive data within the each matrix as a data stream using a single instruction and further configured to operate on the each matrix as the data stream using a single instruction to generate an output matrix; and a store streamer configured to receive a data output stream from the operator component and to store the data output stream in a buffer, wherein the pattern of data to be loaded for the first data stream includes a stride to a next block and a stride between lines, wherein the data output stream is specified by a starting address, a number of lines to write, line stride between lines, a number of bytes per line, and stride to a next block.

8. The processing unit of claim 7, wherein the matrix operation is a matrix multiplication operation and is selected from a group consisting of determining a maximum value, calculating an average value for a stream of data, calculating an addition of the first data stream to the second data stream, calculating a multiplication of the first data stream to the second data stream, rewriting the first data stream in a different pattern for matrix transformation, Tanh operation, Sigmoid operation, spatial batch normalization operation, and local response normalization.

9. The processing unit of claim 7 further comprising an instruction controller configured to store instructions received from a core engine.

10. The processing unit of claim 7, wherein the first data stream pattern is specified by a starting address, number of lines to read for each operation, number of bytes per line, and a number of blocks to read.

11. The processing unit of claim 7, wherein the buffer is configured to stream the data output stream to the OCM for storage thereof.

12. The processing unit of claim 7, wherein the first data load streamer, the second data load streamer, the operator component, and the store streamer are configured to iteratively execute and process data until a termination condition is met.

13. A method comprising: streaming a first data stream comprising a first plurality of data sections from an on-chip memory (OCM), using a single instruction, to an operator component by using an address of the OCM and a pattern of data to be loaded for the first data stream to be read and streamed; streaming a second data stream comprising a second plurality of data sections from the OCM to the operator component by using an address of the OCM and a pattern of data to be loaded for the second data stream to be read and streamed; performing a data operation on the first data stream and the second data stream; streaming a data output stream resulting from the performing; and storing the data output stream, wherein the pattern of data to be loaded for the first data stream includes a stride to a next block and a stride between lines, wherein the first data stream pattern is specified by a starting address, number of lines to read for each operation, number of bytes per line, and a number of blocks to read.

14. The method of claim 13, wherein the data operation is a matrix multiplication and is selected from a group consisting of determining a maximum value, calculating an average value for a stream of data, calculating an addition of the first data stream to the second data stream, calculating a multiplication of the first data stream to the second data stream, rewriting the first data stream in a different pattern for matrix transformation, Tanh operation, Sigmoid operation, spatial batch normalization operation, and local response normalization.

15. The method of claim 13 further comprising storing instructions received from a core engine.

16. The method of claim 13, wherein the data output stream is specified by a starting address, a number of lines to write, line stride between lines, a number of bytes per line, and stride to a next block.

17. The method of claim 13 further comprising iteratively repeating the streaming the first data stream, the streaming the second data stream, the performing the po
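The claims above specify a stream pattern by a starting address, a number of lines, a number of bytes per line, a stride between lines, a number of blocks, and a stride to the next block. A minimal sketch of how such a pattern could expand into OCM addresses follows. It rests on assumptions the claims do not state: strides are byte distances between the starts of consecutive lines or blocks, "number of lines to read for each operation" means lines per block, and the names `StreamPattern`, `line_addresses`, and `stream` are hypothetical.

```python
from dataclasses import dataclass
from typing import Iterator


@dataclass
class StreamPattern:
    start: int           # starting address in the OCM, in bytes
    lines: int           # lines read per block (assumed interpretation)
    bytes_per_line: int  # bytes read from each line
    line_stride: int     # assumed: bytes between starts of consecutive lines
    blocks: int          # number of blocks to read
    block_stride: int    # assumed: bytes between starts of consecutive blocks


def line_addresses(p: StreamPattern) -> Iterator[int]:
    """Yield the starting address of every line, block by block."""
    for b in range(p.blocks):
        base = p.start + b * p.block_stride
        for line in range(p.lines):
            yield base + line * p.line_stride


def stream(ocm: bytes, p: StreamPattern) -> bytes:
    """Concatenate the selected lines into one contiguous data stream."""
    out = bytearray()
    for addr in line_addresses(p):
        out += ocm[addr:addr + p.bytes_per_line]
    return bytes(out)


# Assumed example: a 64-byte OCM whose byte at address i holds the value i.
ocm = bytes(range(64))
pattern = StreamPattern(start=4, lines=2, bytes_per_line=3,
                        line_stride=8, blocks=2, block_stride=32)
addresses = list(line_addresses(pattern))
data = stream(ocm, pattern)
```

With this example pattern the generator visits line addresses 4, 12, 36, and 44, so a single pattern descriptor gathers four non-contiguous 3-byte lines into one stream. That is the kind of work a "single instruction" load would describe under these assumptions.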
from multiple instruction streams, e.g. multistreaming · CPC title
System on chip, i.e. computer system on a single chip; System in package, i.e. computer system on one or more chips in a single package · CPC title
on more than one IC chip · CPC title
using a secondary processor, e.g. coprocessor (peripheral processor G06F13/12) · CPC title
Ensemble learning · CPC title