Vector processing in an active memory device

US9575755B2 · US · B2

Patent metadata
Publication number: US-9575755-B2
Application number: US-201213566135-A
Country: US
Kind code: B2
Filing date: Aug 3, 2012
Priority date: Aug 3, 2012
Publication date: Feb 21, 2017
Grant date: Feb 21, 2017

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection. Read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

Embodiments relate to vector processing in an active memory device. An aspect includes a method for vector processing in an active memory device that includes memory and a processing element. The method includes decoding, in the processing element, an instruction including a plurality of sub-instructions to execute in parallel. An iteration count to repeat execution of the sub-instructions in parallel is determined. Based on the iteration count, execution of the sub-instructions in parallel is repeated for multiple iterations by the processing element. Multiple locations in the memory are accessed in parallel based on the execution of the sub-instructions.
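The repeat-in-parallel behavior described in the abstract can be illustrated with a small software model. This is only a sketch under stated assumptions, not the patented hardware design: the `Instruction` class, the `run` function, and the two example sub-instructions are hypothetical names introduced here, and parallel issue is approximated by running sub-instructions that touch disjoint memory locations one after another.

```python
from dataclasses import dataclass
from typing import Callable, List

# Hypothetical model: a sub-instruction is a function of (memory, iteration index).
SubInstruction = Callable[[List[int], int], None]


@dataclass
class Instruction:
    sub_instructions: List[SubInstruction]
    iteration_count: int


def run(instr: Instruction, memory: List[int]) -> None:
    """Repeat every sub-instruction once per iteration."""
    for i in range(instr.iteration_count):
        # Hardware would issue these side by side. The examples below touch
        # disjoint locations, so running them in order gives the same result.
        for sub in instr.sub_instructions:
            sub(memory, i)


def double_low(memory: List[int], i: int) -> None:
    # Illustrative sub-instruction: operates on the first half of memory.
    memory[i] *= 2


def increment_high(memory: List[int], i: int) -> None:
    # Illustrative sub-instruction: operates on the second half of memory.
    memory[4 + i] += 1


memory = [1, 2, 3, 4, 10, 20, 30, 40]
run(Instruction([double_low, increment_high], iteration_count=4), memory)
print(memory)
```

One instruction with an iteration count of 4 updates eight memory locations, which is the effect the abstract attributes to repeating the sub-instructions for multiple iterations.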

First claim

Opening claim text (preview).

What is claimed is:

1. A method for vector processing in an active memory device that includes memory and a processing element, the method comprising: decoding, in the processing element, an instruction comprising a plurality of sub-instructions to execute in parallel; determining an iteration count to repeat execution of the sub-instructions in parallel based on decoding an iteration count source field of the instruction that defines whether to set the iteration count based on an iteration count field of the instruction or based on an iteration count register; repeating execution of the sub-instructions in parallel for multiple iterations, by the processing element, based on the iteration count; accessing multiple locations in the memory in parallel based on the execution of the sub-instructions; identifying a lane control sub-instruction in the instruction based on the decoding of the instruction, the lane control sub-instruction controlling a sequence of instruction execution and positioned in parallel with the sub-instructions to execute in parallel; and executing the lane control sub-instruction, by the processing element, only once after execution of the sub-instructions is performed in parallel for multiple iterations.

2. The method of claim 1, wherein the sub-instructions comprise at least a pair of a memory access sub-instruction in parallel with an arithmetic-logical sub-instruction, and further comprising: flowing the memory access sub-instruction to a load-store unit in the processing element; and flowing the arithmetic-logical sub-instruction to an arithmetic logic unit in the processing element to execute the memory access sub-instruction in parallel with the arithmetic-logical sub-instruction.

3. The method of claim 2, further comprising: accessing one or more of: a vector computation register file and a scalar computation register file in the processing element for operands to execute the memory access sub-instruction in the load-store unit; and accessing one or more of: the vector computation register file and the scalar computation register file in the processing element for operands to execute the arithmetic-logical sub-instruction in the arithmetic logic unit.

4. The method of claim 3, further comprising: partitioning at least one of the operands as a plurality of sub-elements based on a data type of the arithmetic-logical sub-instruction; performing, by the arithmetic logic unit, an operation of the arithmetic-logical sub-instruction in parallel execution slots on each of the sub-elements; and computing, by the load-store unit, an address per sub-element.

5. The method of claim 3, further comprising: flowing an output of the load-store unit to one or more of: the load-store unit, an effective-to-real address translation unit, a load-store queue, the vector computation register file, and the scalar computation register file; and flowing an output of the arithmetic logic unit to one or more of: the arithmetic logic unit, the load-store unit, the vector computation register file, and the scalar computation register file.

6. The method of claim 3, wherein the processing element is partitioned into multiple processing slices operable in parallel, each processing slice comprising a pair of the load-store unit and the arithmetic logic unit, and an associated pair of the vector computation register file and the scalar computation register file, the method further comprising: flowing an output of the arithmetic logic unit of one processing slice to an input of one or more of: the load-store unit and the arithmetic logic unit.

7. The method of claim 3, further comprising: performing an error check on the operands prior to executing the memory access sub-instruction and the arithmetic-logical sub-instruction.

8. The method of claim 1, wherein the lane control sub-instruction is a branch sub-instruction executed by the processing element during execution of a last iteration of the instruction based on conditions evaluated during execution of a first element of the instruction.

9. A method for vector processing in an active memory device that includes memory and a processing element, the method comprising: receiving, in the processing element, a command from a requestor; fetching, in the processing element, an instruction based on the command, the instruction being fetched from an instruction buffer in the processing element; decoding, in the processing element, the instruction comprising a plurality of sub-instructions to execute in parallel; determining an iteration count to repeat execution of the sub-instructions in parallel based on decoding an iteration count source field of the instruction that defines whether to set the iteration count based on an iteration count field of the instruction or based on an iteration count register; repeating execution of the sub-instructions in parallel for multiple iterations, by the processing element, based on the iteration count; accessing multiple locations in the memory in parallel based on the execution of the sub-instructions; identifying a lane control sub-instruction in the instruction based on the decoding of the instruction, the lane control sub-instruction controlling a sequence of instruction execution and positioned in parallel with the sub-instructions to execute in parallel; and executing the lane control sub-instruction, by the processing element, only once after execution of the sub-instructions is performed in parallel for multiple iterations.

10. The method of claim 9, wherein the requestor comprises one of: a main processor, a network interface, an I/O device, and an additional active memory device, configured to communicate with the active memory device.

11. The method of claim 9, further comprising: fetching a special instruction from the instruction buffer to load a new instruction from the memory; and replacing an entry in the instruction buffer with the new instruction based on executing the special instruction.

12. The method of claim 9, wherein the active memory device is a three-dimensional memory cube, the memory is divided into three-dimensional blocked regions as memory vaults, and accessing multiple locations in the memory is performed through one or more memory controllers in the active memory device.

13. The method of claim 9, wherein the sub-instructions comprise at least a pair of a memory access sub-instruction in parallel with an arithmetic-logical sub-instruction, and the method further comprises: flowing the memory access sub-instruction to a load-store unit in the processing element; and flowing the arithmetic-logical sub-instruction to an arithmetic logic unit in the processing element to execute the memory access sub-instruction in parallel with the arithmetic-logical sub-instruction.

14. The method of claim 13, further comprising: accessing one or more of: a vector computation register file and a scalar computation register file in the processing element for operands to execute the memory access sub-instruction in the load-store unit; and accessing one or more of: the vector computation register file and the scalar computation register file in the processing element for operands to execute the arithmetic-logical sub-instruction in the arithmetic logic unit.

15. The method of claim 14, further comprising: partitioning at least one of the operands as a plurality of sub-elements based on a data type of the arithmetic-logical sub-instruction; performing, by the arithmetic logic unit, an operation of the arithmetic-logical sub-instruction in parallel execution slots on ea
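Two steps recited in claim 1 lend themselves to a short illustration: choosing the iteration count from either an instruction field or a register, and executing the lane control sub-instruction only once after all iterations finish. The following is a minimal sketch, assuming a simplified software model. `VectorInstruction`, `ProcessingElement`, the boolean `count_from_register` (a stand-in for the iteration count source field), and the `trace` list are all hypothetical names, not terms or encodings taken from the patent.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class VectorInstruction:
    sub_instructions: List[Callable[[int], None]]
    count_from_register: bool  # assumed stand-in for the iteration count source field
    count_field: int = 1       # iteration count carried in the instruction itself
    lane_control: Optional[Callable[[], None]] = None  # e.g. a branch


@dataclass
class ProcessingElement:
    iteration_count_register: int = 1
    trace: List[str] = field(default_factory=list)

    def resolve_iteration_count(self, instr: VectorInstruction) -> int:
        # The source field selects between the register and the instruction field.
        if instr.count_from_register:
            return self.iteration_count_register
        return instr.count_field

    def execute(self, instr: VectorInstruction) -> int:
        n = self.resolve_iteration_count(instr)
        for i in range(n):
            # Modeled sequentially; the claim describes parallel execution.
            for sub in instr.sub_instructions:
                sub(i)
        # The lane control sub-instruction runs once, after all iterations.
        if instr.lane_control is not None:
            instr.lane_control()
        return n


pe = ProcessingElement(iteration_count_register=3)
instr = VectorInstruction(
    sub_instructions=[
        lambda i: pe.trace.append(f"lsu[{i}]"),  # memory access sub-instruction
        lambda i: pe.trace.append(f"alu[{i}]"),  # arithmetic-logical sub-instruction
    ],
    count_from_register=True,
    count_field=5,
    lane_control=lambda: pe.trace.append("branch"),
)
iterations = pe.execute(instr)
print(iterations, pe.trace)
```

Because `count_from_register` is set, the register value 3 overrides the instruction field value 5, and the trace ends with a single `branch` entry after three load-store and arithmetic pairs.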

Assignees

Inventors

Classifications

  • controlled by a single instruction for multiple data lanes [SIMD] · CPC title

  • Instructions to perform operations on packed data, e.g. vector, tile or matrix operations · CPC title

  • for non-native instruction execution, e.g. executing a command; for Java instruction set · CPC title

  • Vector processors · CPC title

  • Special arrangements thereof, e.g. mask or switch · CPC title

Patent family

Related publications grouped by family.

External sources

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US9575755B2 cover?
Embodiments relate to vector processing in an active memory device. An aspect includes a method for vector processing in an active memory device that includes memory and a processing element. The method includes decoding, in the processing element, an instruction including a plurality of sub-instructions to execute in parallel. An iteration count to repeat execution of the sub-instructions in p…
Who is the assignee on this patent?
Fleischer Bruce M, Fox Thomas W, Jacobson Hans M, and 3 more
What technology area does this patent fall under?
Primary CPC classification G06F9/30036. Mapped technology areas include Physics.
When was this patent published?
Publication date Feb 21, 2017 (B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 8 related publications on this page (citations in our corpus or others sharing the same primary CPC).