Pipelined convolutional operations for processing clusters

US9886377B2 · US · B2

Patent metadata
Field                 Value
Publication number    US-9886377-B2
Application number    US-201514874784-A
Country               US
Kind code             B2
Filing date           Oct 5, 2015
Priority date         Oct 5, 2015
Publication date      Feb 6, 2018
Grant date            Feb 6, 2018

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

Described herein are one or more integrated circuits (ICs) comprising controller circuitry to receive a command to execute an operation for data inputs stored in an external memory or a local memory, and convert the operation into a set of matrix operations to operate on sub-portions of the data inputs. The IC(s) further comprise at least one processing circuitry to execute the set of matrix operations, the processing circuitry to include ALUs, a local memory external to the ALUs and accessible by the ALUs, and processing control circuitry to create at least one matrix operand in the local memory (from the data inputs of the operation) comprising at least one of a scalar, a vector, or a 2D matrix, and provide memory handles corresponding to each of the matrix operands to one of the ALUs to access the respective matrix operands when executing a matrix operation.
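The handle-based operand access in the abstract can be illustrated with a small software model. This is only a sketch under stated assumptions, not the patented circuitry: `LocalMemory`, `Handle`, `alu_matmul`, `controller_tiles`, and `run` are hypothetical names, and the row-tiling policy is one arbitrary way to split an operation into matrix operations on sub-portions of the input.

```python
from dataclasses import dataclass
from typing import Dict, List

Matrix = List[List[float]]


@dataclass(frozen=True)
class Handle:
    """Hypothetical opaque reference to an operand held in local memory."""
    ident: int
    rows: int
    cols: int


class LocalMemory:
    """Toy stand-in for the local memory that sits outside the ALUs."""

    def __init__(self) -> None:
        self._store: Dict[int, Matrix] = {}
        self._next_id = 0

    def create_operand(self, data: Matrix) -> Handle:
        # Copy the data in, so the operand lives in local memory only.
        rows = [list(r) for r in data]
        handle = Handle(self._next_id, len(rows), len(rows[0]))
        self._store[handle.ident] = rows
        self._next_id += 1
        return handle

    def read(self, handle: Handle) -> Matrix:
        return self._store[handle.ident]


def alu_matmul(mem: LocalMemory, a: Handle, b: Handle) -> Matrix:
    """Hypothetical ALU: receives handles, not data, and reads via them."""
    if a.cols != b.rows:
        raise ValueError("inner dimensions do not match")
    lhs, rhs = mem.read(a), mem.read(b)
    return [[sum(lhs[i][k] * rhs[k][j] for k in range(a.cols))
             for j in range(b.cols)]
            for i in range(a.rows)]


def controller_tiles(inputs: Matrix, tile_rows: int) -> List[Matrix]:
    """Assumed controller policy: split the input into row tiles."""
    return [inputs[i:i + tile_rows] for i in range(0, len(inputs), tile_rows)]


def run(inputs: Matrix, weights: Matrix, tile_rows: int = 2) -> Matrix:
    mem = LocalMemory()
    w = mem.create_operand(weights)        # shared operand, created once
    out: Matrix = []
    for tile in controller_tiles(inputs, tile_rows):
        h = mem.create_operand(tile)       # one operand per sub-portion
        out.extend(alu_matmul(mem, h, w))  # ALU sees only the two handles
    return out
```

In this model a scalar is a 1x1 operand and a vector is a 1xN operand, so the same handle type covers all three operand kinds named in the abstract.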

First claim

Opening claim text (preview).

The invention claimed is:

1. One or more integrated circuits (ICs) comprising: controller circuitry to: receive a command to execute an operation for a plurality of data inputs stored in an external memory or a local memory; and convert the operation into a set of matrix operations, wherein the set of matrix operations are to each operate on respective sub-portions of the plurality of data inputs; and at least one processing circuitry to execute the set of matrix operations, the processing circuitry to include: a plurality of arithmetic logic units (ALUs); a local memory external to the ALUs and accessible by the ALUs; and processing control circuitry to: create a plurality of matrix operands in the local memory from the plurality of data inputs of the operation, wherein each of the plurality of matrix operands respectively comprises one of a scalar, a vector, or a two-dimensional (2D) matrix; and provide a plurality of memory handles to the plurality of ALUs, wherein each of the memory handles corresponds to a respective one of the matrix operands, and the plurality of ALUs are to access the respective matrix operands using the memory handles in association with executing the matrix operations.

2. The one or more ICs of claim 1, wherein the processing control circuitry of the processing circuitry is to further store the output of one of the ALUs in the local memory of the processing circuitry.

3. The one or more ICs of claim 2, wherein the processing control circuitry comprises a plurality of pipeline stages configured to execute operations to create matrix operands, provide memory handles, and store the output of the ALUs substantially in parallel.

4. The one or more ICs of claim 1, wherein the processing control circuitry is to further: create matrix operands by loading data from the data inputs stored in the external memory into memory rows of the local memory; and overwrite a memory row in response to completion of a matrix operation.

5. The one or more ICs of claim 1, wherein the processing control circuitry is to further: identify matrix operations corresponding to an operation that can be executed in parallel by the ALUs of the processing circuitry; and fetch non-contiguous data from the plurality of data inputs of the operation stored in the external memory to be stored contiguously in the local memory for the processing control circuitry to create matrix operands for parallel execution of matrix operations.

6. The one or more ICs of claim 5, wherein the processing control circuitry is to further: ensure the local memory of the processing circuitry includes only data accessed by the processing control circuitry or the ALUs during parallel execution of matrix operations.

7. The one or more ICs of claim 1, wherein the operation comprises a convolution operation, the plurality of inputs comprises image data, one or more filters, or index data, and the at least one matrix operand comprises a first matrix operand comprising data from the image data and a second matrix operand comprising data from the one or more filters or the index data.

8. The one or more ICs of claim 7, wherein the convolution operation comprises a strided convolution operation, and the processing control circuitry is to further: create a first matrix operand from the image data according to a stride value of the strided convolution operation.

9. The one or more ICs of claim 1, wherein the operation comprises at least one of a linear contrast operation, a local response normalization operation, or a max pooling operation.

10. The one or more ICs of claim 1, wherein the processing control circuitry is to further: provide an output of the ALUs to another processing circuitry.

11. The one or more ICs of claim 10, wherein the processing control circuitry is to further: identify an output of an ALU as a partial product of a matrix multiplication operation; and provide the partial product output to another ALU for adding to partial products generated by one or more other ALUs or store the partial product in the external memory for subsequent addition with other partial products.

12. The one or more ICs of claim 1, wherein the processing control circuitry is to further: write-out an output of the ALUs to a data output object stored in the external memory.

13. The one or more ICs of claim 12, wherein the operation comprises a backpropagation operation, the data inputs of the backpropagation operation include a set of generated output values and a set of expected output values, and the processing control circuitry is to further: write-out an output of the ALUs to a sequence of weight values stored in the external memory.

14. The one or more ICs of claim 13, wherein the processing control circuitry is to further: execute matrix operations comprising operands with sub-patterns of zeros by executing them as matrix operations with smaller operands that do not contain the sub-patterns of zeros.

15. The one or more ICs of claim 1, wherein the controller circuitry is to further: convert the operation into a set of matrix operations that operate on at least some non-contiguous or overlapping sub-portions of the plurality of data inputs.

16. The one or more ICs of claim 1, wherein the processing circuitry is to further: bypass the ALUs and execute some of the operations.

17. A system comprising: a host processor; a host memory; an input/output (I/O) interface; a memory separate from the host memory; and one or more integrated circuits (ICs) comprising: controller circuitry to: receive a command to execute an operation for a plurality of data inputs stored in an external memory or a local memory; and convert the operation into a set of matrix operations, wherein the set of matrix operations are to each operate on respective sub-portions of the plurality of data inputs; and at least one processing circuitry to execute the set of matrix operations, the processing circuitry to include: a plurality of arithmetic logic units (ALUs); a local memory external to the ALUs and accessible by the ALUs; and processing control circuitry to: create plurality of matrix operands in the local memory from the plurality of data inputs of the operation, wherein each of the plurality of matrix operands respectively comprises one of a scalar, a vector, or a two-dimensional (2D) matrix; and provide a plurality of memory handles to the plurality of ALUs, wherein each of the memory handles corresponds to a respective one of the matrix operands, and the plurality of ALUs are to access the respective matrix operands using the memory handles in association with executing the matrix operations.

18. The system of claim 17, wherein the host processor, the memory separate from the host memory, and the one or more ICs are included in a self-hosting device.

19. The system of claim 17, wherein the host processor is to further execute a neural network machine learning module.

20. The system of claim 17, wherein the one or more ICs are included in one of a plurality of peripheral apparatuses included in the system, and further comprise: one or more inter-chip interfaces for coupling to one or more other peripheral apparatuses included in the system; wherein the peripheral apparatuses included in the system are interconnected in a multi-dimensional array.
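Claims 5, 7, 8, and 11 together describe gathering non-contiguous image data into a contiguous operand, building that operand according to a stride, and summing partial products. A minimal software sketch of that idea follows, assuming the common im2col lowering (a technique the claims do not name) and cross-correlation without a kernel flip, as is usual in neural networks. The function names `im2col`, `matvec_partial`, and `strided_conv`, and the `chunk` parameter, are hypothetical.

```python
from typing import List

Matrix = List[List[float]]


def im2col(image: Matrix, k: int, stride: int) -> Matrix:
    """Gather each k x k patch (non-contiguous in the image) into one row."""
    h, w = len(image), len(image[0])
    rows: Matrix = []
    for top in range(0, h - k + 1, stride):
        for left in range(0, w - k + 1, stride):
            rows.append([image[top + i][left + j]
                         for i in range(k) for j in range(k)])
    return rows


def matvec_partial(patches: Matrix, weights: List[float],
                   start: int, stop: int) -> List[float]:
    """One hypothetical ALU: partial dot products over columns [start, stop)."""
    return [sum(row[c] * weights[c] for c in range(start, stop))
            for row in patches]


def strided_conv(image: Matrix, kernel: Matrix,
                 stride: int, chunk: int = 2) -> Matrix:
    k = len(kernel)
    flat = [v for row in kernel for v in row]   # filter as a vector operand
    patches = im2col(image, k, stride)          # image as a 2D matrix operand
    total = [0.0] * len(patches)
    # Each chunk of columns yields a partial product; sum them afterwards.
    for start in range(0, len(flat), chunk):
        stop = min(start + chunk, len(flat))
        part = matvec_partial(patches, flat, start, stop)
        total = [t + p for t, p in zip(total, part)]
    out_w = (len(image[0]) - k) // stride + 1
    return [total[i:i + out_w] for i in range(0, len(total), out_w)]
```

For a 4x4 image with values 1 through 16 in row-major order, the kernel `[[1, 0], [0, 1]]`, and stride 2, each output is the sum of a patch's diagonal, giving `[[7, 11], [23, 27]]`.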

Assignees

Intel Corp

Inventors

Classifications

  • G06T1/20 (Primary)

    Processor architectures; Processor configuration, e.g. pipelining

  • Architectures of general purpose stored program computers (with program plugboard G06F15/08; multicomputers G06F15/16)

  • G06F12/023 (Primary)

    Free address space management

  • Local memory within processor subsystem

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US9886377B2 cover?
Described herein are one or more integrated circuits (ICs) comprising controller circuitry to receive a command to execute an operation for data inputs stored in an external memory or a local memory, and convert the operation into a set of matrix operations to operate on sub-portions of the data inputs. The IC(s) further comprise at least one processing circuitry to execute the set of matrix op…
Who is the assignee on this patent?
Intel Corp
What technology area does this patent fall under?
Primary CPC classification G06T1/20. Mapped technology areas include Physics.
When was this patent published?
Publication date: Feb 6, 2018 (kind code B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 1 related publication on this page (citations in our corpus or others sharing the same primary CPC).