Architecture for irregular operations in machine learning inference engine

US11029963B2 · US · B2

Patent metadata
Field: Value
Publication number: US-11029963-B2
Application number: US-201816226559-A
Country: US
Kind code: B2
Filing date: Dec 19, 2018
Priority date: Feb 8, 2018
Publication date: Jun 8, 2021
Grant date: Jun 8, 2021

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

A processing unit of an inference engine for machine learning (ML) includes a first data load streamer, a second data load streamer, an operator component, and a store streamer. The first data load streamer streams a first data stream from an on-chip memory (OCM) to the operator component. The second data load streamer streams a second data stream from the OCM to the operator component. The operator component performs a matrix operation on the first data stream and the second data stream. The store streamer receives a data output stream from the operator component and stores the data output stream in a buffer.
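
The data flow in the abstract can be pictured with a small software model. This is only an illustrative sketch, not the patented hardware: the OCM is modeled as a Python list, and the names `load_streamer`, `operator_component`, and `store_streamer` as well as the element-wise addition are assumptions chosen for the example.

```python
from typing import Callable, Iterable, Iterator, List


def load_streamer(ocm: List[int], start: int, count: int) -> Iterator[int]:
    """Hypothetical load streamer: yield `count` values from the OCM model."""
    for offset in range(count):
        yield ocm[start + offset]


def operator_component(
    first: Iterable[int],
    second: Iterable[int],
    op: Callable[[int, int], int],
) -> Iterator[int]:
    """Hypothetical operator: combine the two input streams element by element."""
    for x, y in zip(first, second):
        yield op(x, y)


def store_streamer(output: Iterable[int]) -> List[int]:
    """Hypothetical store streamer: collect the output stream into a buffer."""
    return list(output)


# Toy OCM contents: the first four values feed stream one, the last four stream two.
ocm = [1, 2, 3, 4, 10, 20, 30, 40]
first = load_streamer(ocm, start=0, count=4)
second = load_streamer(ocm, start=4, count=4)

# Element-wise addition stands in for the operation performed on the streams.
buffer = store_streamer(operator_component(first, second, lambda x, y: x + y))
print(buffer)
```

Because each stage is a generator, values move from the OCM model to the buffer one at a time, which loosely mirrors the streaming behavior the abstract describes.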

First claim

Opening claim text (preview).

What is claimed is:

1. A processing unit of an inference engine for machine learning (ML), comprising: a first data load streamer configured to stream a first data stream comprising a first plurality of data sections from an on-chip memory (OCM), using a single instruction, to an operator component by using an address of the OCM and a pattern of data to be loaded for the first data stream to be read and streamed; a second data load streamer configured to stream a second data stream comprising a second plurality of data sections from the OCM, using a single instruction, to the operator component by using an address of the OCM and a pattern of data to be loaded for the second data stream to be read and streamed; the operator component configured to perform a data operation on the first data stream and the second data stream; and a store streamer configured to receive a data output stream from the operator component and to store the data output stream in a buffer, wherein the pattern of data to be loaded for the first data stream includes a stride to a next block and a stride between lines, wherein the first data stream pattern is specified by one or more of a starting address, number of lines to read for each operation, number of bytes per line, and a number of blocks to read.

2. The processing unit of claim 1, wherein the data operation is a matrix multiplication operation and is selected from a group consisting of determining a maximum value, calculating an average value for a stream of data, calculating an addition of the first data stream to the second data stream, calculating a multiplication of the first data stream to the second data stream, rewriting the first data stream in a different pattern for matrix transformation, Tanh operation, Sigmoid operation, spatial batch normalization operation, and local response normalization.

3. The processing unit of claim 1 further comprising an instruction controller configured to store instructions received from a core engine.

4. The processing unit of claim 1, wherein the buffer is configured to stream the data output stream to the OCM for storage thereof.

5. The processing unit of claim 1, wherein the data output stream is specified by one or more of a starting address, a number of lines to write, line stride between lines, a number of bytes per line, and stride to a next block.

6. The processing unit of claim 1, wherein the first data load streamer, the second data load streamer, the operator component, and the store streamer are configured to iteratively execute and process data until a termination condition is met.

7. A processing unit of an inference engine for machine learning (ML), comprising: a first data load streamer configured to stream a first data stream comprising a first plurality of data sections from an on-chip memory (OCM), using a single instruction, to an operator component by using an address of the OCM and a pattern of data to be loaded for the first data stream to be read and streamed; a second data load streamer configured to stream a second data stream comprising a second plurality of data sections from the OCM, using a single instruction, to the operator component by using an address of the OCM and a pattern of data to be loaded for the second data stream to be read and streamed; the operator component configured to perform a matrix operation on the first data stream and the second data stream, wherein the matrix operation is performed by another processing unit that reads data within each matrix only once and wherein the another processing unit is configured to receive data within the each matrix as a data stream using a single instruction and further configured to operate on the each matrix as the data stream using a single instruction to generate an output matrix; and a store streamer configured to receive a data output stream from the operator component and to store the data output stream in a buffer, wherein the pattern of data to be loaded for the first data stream includes a stride to a next block and a stride between lines, wherein the data output stream is specified by a starting address, a number of lines to write, line stride between lines, a number of bytes per line, and stride to a next block.

8. The processing unit of claim 7, wherein the matrix operation is a matrix multiplication operation and is selected from a group consisting of determining a maximum value, calculating an average value for a stream of data, calculating an addition of the first data stream to the second data stream, calculating a multiplication of the first data stream to the second data stream, rewriting the first data stream in a different pattern for matrix transformation, Tanh operation, Sigmoid operation, spatial batch normalization operation, and local response normalization.

9. The processing unit of claim 7 further comprising an instruction controller configured to store instructions received from a core engine.

10. The processing unit of claim 7, wherein the first data stream pattern is specified by a starting address, number of lines to read for each operation, number of bytes per line, and a number of blocks to read.

11. The processing unit of claim 7, wherein the buffer is configured to stream the data output stream to the OCM for storage thereof.

12. The processing unit of claim 7, wherein the first data load streamer, the second data load streamer, the operator component, and the store streamer are configured to iteratively execute and process data until a termination condition is met.

13. A method comprising: streaming a first data stream comprising a first plurality of data sections from an on-chip memory (OCM), using a single instruction, to an operator component by using an address of the OCM and a pattern of data to be loaded for the first data stream to be read and streamed; streaming a second data stream comprising a second plurality of data sections from the OCM to the operator component by using an address of the OCM and a pattern of data to be loaded for the second data stream to be read and streamed; performing a data operation on the first data stream and the second data stream; streaming a data output stream resulting from the performing; and storing the data output stream, wherein the pattern of data to be loaded for the first data stream includes a stride to a next block and a stride between lines, wherein the first data stream pattern is specified by a starting address, number of lines to read for each operation, number of bytes per line, and a number of blocks to read.

14. The method of claim 13, wherein the data operation is a matrix multiplication and is selected from a group consisting of determining a maximum value, calculating an average value for a stream of data, calculating an addition of the first data stream to the second data stream, calculating a multiplication of the first data stream to the second data stream, rewriting the first data stream in a different pattern for matrix transformation, Tanh operation, Sigmoid operation, spatial batch normalization operation, and local response normalization.

15. The method of claim 13 further comprising storing instructions received from a core engine.

16. The method of claim 13, wherein the data output stream is specified by a starting address, a number of lines to write, line stride between lines, a number of bytes per line, and stride to a next block.

17. The method of claim 13 further comprising iteratively repeating the streaming the first data stream, the streaming the second data stream, the performing the po
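
The load pattern named in the claims (starting address, lines per operation, bytes per line, stride between lines, number of blocks, stride to a next block) can be illustrated with a short address-generation sketch. The claims list these fields but do not say how they combine, so the interpretation here is an assumption: strides are treated as byte offsets, line strides are measured from the start of each block, and block strides from the starting address. The names `StreamPattern`, `line_addresses`, and `stream` are hypothetical.

```python
from dataclasses import dataclass
from typing import Iterator


@dataclass(frozen=True)
class StreamPattern:
    """Hypothetical descriptor holding the pattern fields named in the claims."""
    start_address: int    # starting address in the OCM model
    lines_per_block: int  # number of lines to read for each operation
    bytes_per_line: int   # number of bytes per line
    line_stride: int      # stride between lines (assumed to be in bytes)
    num_blocks: int       # number of blocks to read
    block_stride: int     # stride to a next block (assumed to be in bytes)


def line_addresses(p: StreamPattern) -> Iterator[int]:
    """Yield the start address of every line, block by block."""
    for block in range(p.num_blocks):
        block_base = p.start_address + block * p.block_stride
        for line in range(p.lines_per_block):
            yield block_base + line * p.line_stride


def stream(memory: bytes, p: StreamPattern) -> bytes:
    """Concatenate the bytes selected by the pattern into one data stream."""
    out = bytearray()
    for addr in line_addresses(p):
        out += memory[addr:addr + p.bytes_per_line]
    return bytes(out)


# Toy 64-byte memory whose contents equal their own addresses.
memory = bytes(range(64))
pattern = StreamPattern(
    start_address=0,
    lines_per_block=2,
    bytes_per_line=4,
    line_stride=8,
    num_blocks=2,
    block_stride=32,
)

addrs = list(line_addresses(pattern))
data = stream(memory, pattern)
print(addrs)
print(list(data))
```

Under these assumptions, one descriptor is enough to gather non-contiguous lines into a single stream, which is one way to read the "single instruction" language in the claims.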

Assignees

Inventors

Classifications

  • from multiple instruction streams, e.g. multistreaming · CPC title

  • System on chip, i.e. computer system on a single chip; System in package, i.e. computer system on one or more chips in a single package · CPC title

  • on more than one IC chip · CPC title

  • G06F9/3877 (Primary)

    using a secondary processor, e.g. coprocessor (peripheral processor G06F13/12) · CPC title

  • Ensemble learning · CPC title

Patent family

Related publications grouped by family.

External sources

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US11029963B2 cover?
A processing unit of an inference engine for machine learning (ML) includes a first data load streamer, a second data load streamer, an operator component, and a store streamer. The first data load streamer streams a first data stream from an on-chip memory (OCM) to the operator component. The second data load streamer streams a second data stream from the OCM to the operator component. The oper…
Who is the assignee on this patent?
Cavium Llc, Marvell Asia Pte Ltd
What technology area does this patent fall under?
Primary CPC classification G06F9/3877. Mapped technology areas include Physics.
When was this patent published?
Publication date: Jun 8, 2021 (B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 12 related publications on this page (citations in our corpus or others sharing the same primary CPC).