Vector instructions to enable efficient synchronization and parallel reduction operations

US9513905B2 · US · B2

Patent metadata
Publication number: US-9513905-B2
Application number: US-7977408-A
Country: US
Kind code: B2
Filing date: Mar 28, 2008
Priority date: Mar 28, 2008
Publication date: Dec 6, 2016
Grant date: Dec 6, 2016

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

In one embodiment, a processor may include a vector unit to perform operations on multiple data elements responsive to a single instruction, and a control unit coupled to the vector unit to provide the data elements to the vector unit, where the control unit is to enable an atomic vector operation to be performed on at least some of the data elements responsive to a first vector instruction to be executed under a first mask and a second vector instruction to be executed under a second mask. Other embodiments are described and claimed.
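The two-instruction pattern in the abstract can be illustrated with a small software model. This is a minimal sketch, not the patented hardware: the class `ReservingMemory`, the method names `gather_linked` and `scatter_conditional`, and the rule that only the first lane wins when two lanes of one vector target the same address are all assumptions made for illustration. The first instruction loads elements under a mask and reserves their locations; the second stores only to locations still reserved by the same thread and returns a mask of the lanes that succeeded.

```python
class ReservingMemory:
    """Toy memory with per-address reservations (hypothetical model)."""

    def __init__(self, size):
        self.data = [0] * size
        # address -> thread id; stands in for a valid indicator plus a
        # hardware thread identifier attached to each cache line
        self.reserved_by = {}

    def gather_linked(self, thread_id, addrs, mask):
        """First instruction: load active lanes and reserve their addresses."""
        values = [0] * len(addrs)
        out_mask = [False] * len(addrs)
        claimed = set()
        for lane, (addr, active) in enumerate(zip(addrs, mask)):
            if not active or addr in claimed:
                continue  # assumption: a duplicate address in one vector fails
            self.reserved_by[addr] = thread_id
            claimed.add(addr)
            values[lane] = self.data[addr]
            out_mask[lane] = True
        return values, out_mask

    def scatter_conditional(self, thread_id, addrs, values, mask):
        """Second instruction: store only where this thread's reservation holds."""
        out_mask = [False] * len(addrs)
        for lane, (addr, active) in enumerate(zip(addrs, mask)):
            if active and self.reserved_by.get(addr) == thread_id:
                self.data[addr] = values[lane]
                del self.reserved_by[addr]  # clear the reservation
                out_mask[lane] = True
        return out_mask


def atomic_vector_add(mem, thread_id, addrs, increments):
    """Retry loop: repeat the instruction pair until every lane has succeeded."""
    todo = [True] * len(addrs)
    while any(todo):
        values, linked = mem.gather_linked(thread_id, addrs, todo)
        new_values = [v + inc for v, inc in zip(values, increments)]
        done = mem.scatter_conditional(thread_id, addrs, new_values, linked)
        todo = [t and not d for t, d in zip(todo, done)]


mem = ReservingMemory(8)
atomic_vector_add(mem, thread_id=0, addrs=[1, 3, 1, 5], increments=[1, 1, 1, 1])
print(mem.data)
```

In this model the two lanes that both target address 1 are served on successive passes of the loop, so both increments land. The output mask of each instruction is what drives the retry.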

First claim

Opening claim text (preview).

What is claimed is:

1. A processor comprising: a core including: a single instruction multiple data (SIMD) unit to perform operations on a plurality of data elements responsive to a single instruction; a control unit coupled to the SIMD unit to provide the plurality of data elements to the SIMD unit, wherein the control unit is to enable an atomic SIMD operation comprising a read-write-modify operation to be performed on at least some of the plurality of data elements responsive to a first SIMD instruction to be executed under a first mask to identify the at least some of the plurality of data elements for which the first SIMD instruction was successfully performed and a second SIMD instruction to be executed under a second mask to identify the at least some of the plurality of data elements for which the second SIMD instruction was successfully performed; and a cache memory including a plurality of cache lines and a plurality of control store entries each associated with one of the plurality of cache lines, the plurality of control store entries to store a valid indicator and a hardware thread identifier, wherein the cache memory is to update the valid indicator and the hardware thread identifier for at least some of the plurality of control store entries each associated with one of the at least some of the plurality of data elements, responsive to execution of the first SIMD instruction, wherein the control unit is to enable the second SIMD instruction to be executed under the second mask for the at least some of the plurality of data entries for which the cache memory updated the valid indicator and the hardware thread identifier of the corresponding control store entry, responsive to execution of the first SIMD instruction and a value of the hardware thread identifier of the corresponding plurality of control store entries matching a hardware thread identifier of a requester of the second SIMD instruction.

2. The processor of claim 1, wherein the cache memory is to update the valid indicator and the hardware thread identifier for at least some of the plurality of control store entries each associated with one of the at least some of the plurality of data elements, responsive to execution of the second SIMD instruction.

3. The processor of claim 1, wherein responsive to execution of the second SIMD instruction for a first data element, the cache memory is to update the valid indicator of the control store entry of the cache memory associated with the first data element.

4. The processor of claim 3, wherein responsive to execution of the second SIMD instruction for the first data element, the cache memory is to modify data stored in the cache line of the cache memory associated with the first data element.

5. The processor of claim 1, wherein the control unit is to enable the SIMD unit to perform a third SIMD instruction to compare a second vector having a second plurality of data elements and to output a shuffle control to indicate groups of the data elements having the same value, and to set indicators of a third mask to indicate the non-unique data elements.

6. The processor of claim 5, wherein the control unit is to enable the SIMD unit to perform a fourth SIMD instruction to identify a third vector, a destination storage, and a fourth mask to generate a count of identical elements of the third vector having a third plurality of data elements, the count of identical elements corresponding to a population count of unique integer values in the third vector, and to store the population count of unique integer values in a data element of the destination storage corresponding to one of the third plurality of data elements having the unique integer value.

7. The processor of claim 6, wherein the control unit is to further write an indicator of a fourth mask to indicate each unique integer element, to compute a histogram.

8. The processor of claim 7, wherein the control unit is to compute the histogram in parallel.

9. The processor of claim 6, wherein other data elements of the destination storage corresponding to the unique integer value are set at a don't care state.

10. The processor of claim 1, wherein the first SIMD instruction is to obtain the plurality of data elements from first memory locations and reserve the first memory locations, pursuant to an input mask corresponding to the first mask.

11. The processor of claim 10, wherein the second SIMD instruction is to store a second plurality of data elements from a source location to the first memory locations that are reserved, pursuant to an input mask corresponding to the second mask, and wherein the first SIMD instruction is to cause generation of the second mask.

12. A system comprising: a processor including logic to execute a histogram single instruction multiple data (SIMD) instruction to perform a histogram calculation, the histogram SIMD instruction including an opcode and to identify a source vector, a destination vector, and a mask, wherein the logic is, responsive to the histogram SIMD instruction, to perform the histogram calculation on data elements of the source vector to determine unique data elements of the source vector, generate a count of occurrences of each unique data element of the source vector corresponding to a population count for each unique data element, store the population count for each unique data element in an element of the destination vector corresponding to one of the data elements of the source vector being the unique data element, and update an element of the mask corresponding to one of the data elements of the source vector being the unique data element; and a dynamic random access memory (DRAM) coupled to the processor.

13. The system of claim 12, wherein the logic is to store the population count in a least significant element of the destination vector for the unique data element.

14. The system of claim 13, wherein the logic is to set a first value in a least significant element the destination vector for the unique data element.

15. The system of claim 12, wherein the logic is to execute a first stage to read the source vector and the mask.

16. The system of claim 15, wherein the logic is to execute a second stage to perform an all-to-all comparison on the source vector to identify the unique data elements of the source vector.

17. The system of claim 16, wherein the logic is to execute a third stage to count the occurrence of each unique data element of the source vector.

18. The system of claim 17, wherein the logic is to execute a fourth stage to store each occurrence count into a corresponding element of the destination vector.

19. The system of claim 18, wherein the logic is to store first entries in the mask to a first state to indicate a first occurrence of each unique data element and store second entries in the mask to a second state to indicate other occurrences of each unique data element.

20. The system of claim 12, wherein the logic is to associate a tag with each data element of the source vector such that multiple data elements of the source vector having the same value have the same tag.

21. A method comprising: reading a source vector and a mask responsive to a single instruction multiple data (SIMD) histogram instruction including an opcode and to identify the source vector, a destination vector, and the mask, the histogram SIMD instruction to cause computation of a population count of unique integer values in the source vector; performing an all-to-all comparison on the s
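The staged histogram calculation described in claims 12 to 19 can be sketched in software as follows. This is an illustrative model under stated assumptions, not the claimed logic: the function name `vector_histogram` is hypothetical, the input mask is assumed to select which lanes take part, "least significant element" is taken to mean the lowest-index lane holding a given value, and `None` stands in for a don't-care destination element.

```python
def vector_histogram(source, mask):
    """Software model of a SIMD histogram instruction (hypothetical name).

    Returns (dest, out_mask): dest[i] holds the population count of
    source[i] in the first active lane where that value occurs, and
    out_mask[i] is True only for those first-occurrence lanes.
    """
    n = len(source)  # stage 1: the source vector and mask have been read

    # Stage 2: all-to-all comparison between active lanes.
    equal = [
        [mask[i] and mask[j] and source[i] == source[j] for j in range(n)]
        for i in range(n)
    ]

    dest = [None] * n        # None models a don't-care element
    out_mask = [False] * n   # False models the "other occurrence" state

    for i in range(n):
        if not mask[i]:
            continue
        # A lane is a first occurrence if no lower lane holds the same value.
        is_first = not any(equal[i][j] for j in range(i))
        if is_first:
            dest[i] = sum(equal[i])  # stage 3: count the occurrences
            out_mask[i] = True       # stage 4: store count and mark the lane
    return dest, out_mask


source = [3, 7, 3, 3, 7, 1, 0, 0]
dest, out_mask = vector_histogram(source, [True] * len(source))
print(dest)
print(out_mask)
```

With the output mask marking one lane per distinct value, a caller could add each stored count to its histogram bin without two lanes of the same vector colliding on one bin.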

Assignees

Inventors

Classifications

  • Bit or string instructions · CPC title

  • to perform operations on memory · CPC title

  • using histogram techniques · CPC title

  • Movement instructions, e.g. MOVE, SHIFT, ROTATE, SHUFFLE · CPC title

  • using a plurality of independent parallel functional units · CPC title

Patent family

Related publications grouped by family.

External sources

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US9513905B2 cover?
In one embodiment, a processor may include a vector unit to perform operations on multiple data elements responsive to a single instruction, and a control unit coupled to the vector unit to provide the data elements to the vector unit, where the control unit is to enable an atomic vector operation to be performed on at least some of the data elements responsive to a first vector instruction to …
Who is the assignee on this patent?
Smelyanskiy Mikhail, Kumar Sanjeev, Kim Daehyun, and 7 more
What technology area does this patent fall under?
Primary CPC classification G06F9/30036. Mapped technology areas include Physics.
When was this patent published?
Publication date Dec 6, 2016 (B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 8 related publications on this page (citations in our corpus or others sharing the same primary CPC).