Paralleizing loops in the presence of possible memory aliases

US10241793B2 · US · B2

Patent metadata
FieldValue
Publication numberUS-10241793-B2
Application numberUS-201414200788-A
CountryUS
Kind codeB2
Filing dateMar 7, 2014
Priority dateMar 15, 2013
Publication dateMar 26, 2019
Grant dateMar 26, 2019

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

In one particular example, this disclosure provides an efficient mechanism to determine the degree of parallelization possible for a loop in the presence of possible memory aliases that cannot be resolved at compile-time. Hardware instructions are provided that test memory addresses at run-time and set a mode or register that enables a single instance of a loop to run the maximum number of SIMD (Single Instruction, Multiple Data) lanes to run in parallel that obey the semantics of the original scalar loop. Other hardware features that extend applicability or performance of such instructions are enumerated.

First claim

Opening claim text (preview).

What is claimed is: 1. A method to determine a degree of parallelization possible for a loop in a presence of possible memory aliases, comprising: providing hardware instructions that test memory addresses for pointer aliasing at run-time, at least in part, by determining a distance between the memory addresses, wherein the distance is measured in multiples of a data type; providing a scalar code that, when executed, performs a function; generating, from the scalar code, a vectorized code containing a same number of loops as the scalar code; and setting a mode or register that specifies a maximum number of Single Instruction, Multiple Data (SIMD) lanes on which to execute the vectorized code in parallel to perform the function of the scalar code, wherein the maximum number of SIMD lanes is the lesser of the distance and a number of SIMD lanes that are enabled. 2. The method of claim 1 , further comprising enabling the maximum number of SIMD lanes to obtain a maximum speed-up subject to obeying loop-carried dependencies of the scalar code. 3. The method of claim 1 , further comprising, determining, by a processor, that the maximum number of SIMD lanes is to perform the function of the scalar code. 4. The method of claim 1 , further comprising providing a hardware mechanism to specify the number of SIMD lanes that are enabled, wherein the hardware mechanism is one selected from the group consisting of: a mode register that contains the number of SIMD lanes that are enabled and a mask register that controls which SIMD lanes are enabled. 5. The method of claim 4 , further comprising setting the hardware mechanism to enable the maximum number of SIMD lanes dependent on a test of memory addresses at run-time. 6. The method of claim 4 , further comprising altering a number of iterations of the vectorized code dependent on the number of SIMD lanes that are enabled. 7. The method of claim 1 , wherein a SIMD Staged loop allows the vectorized code to run over vector lengths that are not multiples of a vectorization factor. 8. The method of claim 1 , wherein increments of induction variables are scaled by a number of vector lanes that are enabled in the vectorized code. 9. The method of claim 1 , wherein the method includes performing reductions, which further include summation, bitwise operations, minimum or maximum, across vector lanes currently enabled according to a mode register. 10. The method of claim 1 , wherein the method includes copying a value from a last vector lane used in a last iteration of the vectorized code to memory or to a known register location. 11. An apparatus for determining a degree of parallelization possible for a loop in a presence of possible memory aliases, the apparatus comprising: at least one processing element for executing hardware instructions that test memory addresses for pointer aliasing at run-time, at least in part, by determining a distance between the memory addresses, wherein the distance is measured in multiples of a data type; a scalar code that, when executed, performs a function; a vectorized code containing a same number of loops as the scalar code; and a mode or register that is set to specify a maximum number of Single Instruction, Multiple Data (SIMD) lanes on which to execute the vectorized code in parallel to perform the function of the scalar code, wherein the maximum number of SIMD lanes is the lesser of the distance and a number of SIMD lanes that are enabled. 12. The apparatus of claim 11 , wherein the mode or register specifies how many SIMD lanes are enabled, and the mode or register is selected from the group consisting of a mode register that contains the number of SIMD lanes that are enabled and a mask register that controls which SIMD lanes are enabled. 13. The apparatus of claim 12 , wherein the at least one processing element is operable to set the mode or register to enable the maximum number of SIMD lanes dependent on a test of memory addresses at run-time. 14. The apparatus of claim 12 , wherein the at least one processing element is operable to alter a number of iterations of the vectorized code dependent on the number of SIMD lanes that are enabled. 15. The apparatus of claim 11 , wherein the at least one processing element performs reductions, which further include summation, bitwise operations, minimum or maximum, across vector lanes currently enabled according to a mode register. 16. The apparatus of claim 11 , wherein the at least one processing element copies a value from a last vector lane used in a last iteration of the vectorized code to memory or to a known register location. 17. At least one machine readable non-transitory storage medium having instructions stored thereon for determining a degree of parallelization possible for a loop in a presence of possible memory aliases, wherein the instructions, when executed by at least one processor, causes the at least one processor to perform the following operations: execute hardware instructions that test memory addresses for pointer aliasing at run-time, at least in part, by determining a distance between the memory addresses, wherein the distance is measured in multiples of a data type; access a scalar code that, when executed, performs a function; generate, from the scalar code, a vectorized code containing a same number of loops as the scalar code; and set a mode or register that specifies a maximum number of Single Instruction, Multiple Data (SIMD) lanes on which to execute the vectorized code in parallel to perform the function of the scalar code, wherein the maximum number of SIMD lanes is the lesser of the distance and a number of SIMD lanes that are enabled. 18. The method of claim 1 , wherein the hardware instructions that test the memory addresses for pointer aliasing at run-time comprise instructors for: determining, at run-time, a difference between the memory addresses; and dividing the difference by a memory size of the data type to determine the distance, wherein each of the memory addresses holds a value of the data type. 19. The apparatus of claim 11 , wherein the hardware instructions that test the memory addresses for pointer aliasing at run-time comprise instructors for: determining, at run-time, a difference between the memory addresses; and dividing the difference by a memory size of the data type to determine the distance, wherein each of the memory addresses holds a value of the data type. 20. The at least one machine readable non-transitory storage medium of claim 17 , wherein the hardware instructions that test the memory addresses for pointer aliasing at run-time comprise instructors for: determining, at run-time, a difference between the memory addresses; and dividing the difference by a memory size of the data type to determine the distance, wherein each of the memory addresses holds a value of the data type.

Assignees

Inventors

Classifications

  • Instruction analysis, e.g. decoding, instruction word fields · CPC title

  • Instruction issuing, e.g. dynamic instruction scheduling or out of order instruction execution · CPC title

  • controlled by a single instruction for multiple data lanes [SIMD] · CPC title

  • Software pipelining · CPC title

  • Iterative single instructions for multiple data lanes [SIMD] · CPC title

Patent family

Related publications grouped by family.

External sources

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US10241793B2 cover?
In one particular example, this disclosure provides an efficient mechanism to determine the degree of parallelization possible for a loop in the presence of possible memory aliases that cannot be resolved at compile-time. Hardware instructions are provided that test memory addresses at run-time and set a mode or register that enables a single instance of a loop to run the maximum number of SIMD…
Who is the assignee on this patent?
Analog Devices Tech, Analog Devices Global
What technology area does this patent fall under?
Primary CPC classification G06F9/30145. Mapped technology areas include Physics.
When was this patent published?
Publication date Tue Mar 26 2019 00:00:00 GMT+0000 (Coordinated Universal Time) (B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 8 related publications on this page (citations in our corpus or others sharing the same primary CPC).