Apparatus and method for efficiently accessing memory when performing a horizontal data reduction

US10409571B1 · US · B1

Patent metadata
Field: Value
Publication number: US-10409571-B1
Application number: US-201815922833-A
Country: US
Kind code: B1
Filing date: Mar 15, 2018
Priority date: Mar 15, 2018
Publication date: Sep 10, 2019
Grant date: Sep 10, 2019

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

Apparatus and method for optimizing shader execution. For example, one embodiment of a graphics processing apparatus comprises: a plurality of execution units to execute shader programs; optimization detection circuitry and/or logic to identify one or more portions of shader program code to be optimized including one or more reduction operations which require read/write memory operations and associated barrier operations; and optimization circuitry and/or logic to optimize the shader program code by converting a plurality of the read/write memory operations to read/write register operations and removing one or more barrier operations to generate optimized shader program code; the execution units to execute the optimized shader program code.
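To make the pattern named in the abstract concrete, the following Python sketch simulates a reduction that goes through memory with a barrier after every step, which is the kind of code the abstract says is identified for optimization. It is only a model, not shader code and not the patent's implementation: the function name `reduce_via_memory`, the `stats` counters, and the power-of-two input length are all assumptions made for illustration.

```python
def reduce_via_memory(values):
    """Sum `values` with a tree reduction through a simulated shared array.

    Hypothetical model: every partial result is stored to memory, and a
    barrier separates each iteration from the next.
    """
    n = len(values)
    assert n > 0 and n & (n - 1) == 0, "sketch assumes a power-of-two length"

    shared = list(values)  # each work-item stores its value to memory
    stats = {"reads": 0, "writes": n, "barriers": 1}  # barrier after the stores

    stride = n // 2
    while stride >= 1:
        for i in range(stride):  # work-items 0..stride-1 are active
            shared[i] = shared[i] + shared[i + stride]
            stats["reads"] += 2   # both operands come from memory
            stats["writes"] += 1  # the partial sum goes back to memory
        stats["barriers"] += 1    # all stores must land before the next step
        stride //= 2

    return shared[0], stats


total, stats = reduce_via_memory([1, 2, 3, 4, 5, 6, 7, 8])
print(total, stats)
```

In this model the number of barriers grows with the number of iterations, and every operand is a memory access. Those are the two costs the abstract says the optimization targets by switching to register operations and dropping barriers.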

First claim

Opening claim text (preview).

What is claimed is:

1. A graphics processing apparatus comprising: a plurality of execution units to execute shader programs; a detection circuitry to analyze a shader program to identify one or more portions of shader program code to be optimized including one or more reduction operations which require a plurality of read/write memory operations and a plurality of associated barrier operations; an optimization circuitry to optimize the one or more portions of shader program code identified by the detection circuitry, wherein the optimization circuitry converts the plurality of read/write memory operations to a plurality of read/write register operations and removes one or more barrier operations of the plurality of associated barrier operations to generate optimized shader program code; the plurality of execution units to concurrently execute the optimized shader program code; and a shader compiler to implement the detection circuitry and the optimization circuitry responsive to receiving a new shader program.

2. The graphics processing apparatus of claim 1 further comprising: a graphics driver to interface the graphics processing apparatus to a graphics application, wherein the graphics driver includes the shader compiler.

3. The graphics processing apparatus of claim 2 further comprising: a user mode driver integral to the graphics driver, wherein the user mode driver is to interface with the plurality of execution units and to schedule a plurality of threads for execution on the plurality of execution units.

4. The graphics processing apparatus of claim 1 further comprising: thread dispatch circuitry to dispatch a plurality of threads resulting from the execution of the optimized shader program code to the plurality of execution units.

5. The graphics processing apparatus of claim 4 wherein the detection circuitry is to predict a number of threads required to execute data in the one or more portions of the shader program code.

6. The graphics processing apparatus of claim 1 wherein the one or more reduction operations comprise one or more accumulation operations.

7. The graphics processing apparatus of claim 6 wherein the one or more accumulation operations comprise a series of iterations in which, in each iteration, N data elements at a start of an iteration are combined to generate N/2 data elements.

8. The graphics processing apparatus of claim 7 wherein the optimization circuitry is to convert the one or more accumulation operations so that the plurality of read/write memory operations are converted to the plurality of read/write register operations.

9. The graphics processing apparatus of claim 8 wherein the optimization circuitry is to remove a barrier operation for a particular iteration if no memory read operations are required for the particular iteration.

10. A method comprising: analyzing, by a detection circuitry, a shader program to identify one or more portions of shader program code to be optimized including one or more reduction operations which require a plurality of read/write memory operations and a plurality of associated barrier operations; optimizing, by an optimization circuitry, the one or more portions of shader program code identified by the detection circuitry including converting the plurality of read/write memory operations to a plurality of read/write register operations and removing one or more barrier operations of the plurality of associated barrier operations to generate optimized shader program code; and executing the optimized shader program code concurrently on a plurality of execution units, wherein the operations of analyzing by the detection circuitry and optimizing by the optimization circuitry are implemented by a shader compiler responsive to receiving a new shader program.

11. The method of claim 10 wherein the shader compiler is integral to a graphics driver, and wherein the graphics driver is to interface the optimized shader program code to the plurality of execution units.

12. The method of claim 10 further comprising: dispatching a plurality of threads resulting from the execution of the optimized shader program code to the plurality of execution units.

13. The method of claim 12 wherein the analyzing further comprises: predicting a number of threads required to execute data in the one or more portions of the shader program code.

14. The method of claim 13 further comprising: scheduling the number of threads for execution on the plurality of execution units.

15. The method of claim 10 wherein the one or more reduction operations comprise one or more accumulation operations.

16. The method of claim 15 wherein the one or more accumulation operations comprise a series of iterations in which, in each iteration, N data elements at a start of an iteration are combined to generate N/2 data elements.

17. The method of claim 16 wherein the optimizing further comprises: converting the one or more accumulation operations so that the plurality of read/write memory operations are converted to the plurality of read/write register operations.

18. The method of claim 17 wherein the optimizing further comprises: removing a barrier operation for a particular iteration if no memory read operations are required for the particular iteration.

19. A non-transitory machine-readable medium having program code stored thereon which, when executed by a machine, causes the machine to perform the operations of: analyzing, by a detection circuitry, a shader program to identify one or more portions of shader program code to be optimized including one or more reduction operations which require a plurality of read/write memory operations and a plurality of associated barrier operations; optimizing, by an optimization circuitry, the one or more portions of shader program code identified by the detection circuitry including converting the plurality of read/write memory operations to a plurality of read/write register operations and removing one or more barrier operations of the plurality of associated barrier operations to generate optimized shader program code; and executing the optimized shader program code concurrently on a plurality of execution units, wherein the operations of analyzing by the detection circuitry and optimizing by the optimization circuitry are implemented by a shader compiler responsive to receiving a new shader program.

20. The non-transitory machine-readable medium of claim 19 wherein the shader compiler is integral to a graphics driver, and wherein the graphics driver is to interface the optimized shader program code to the plurality of execution units.

21. The non-transitory machine-readable medium of claim 19 further comprising program code to cause the machine to perform the operation of: dispatching a plurality of threads resulting from the execution of the optimized shader program code to the plurality of execution units.

22. The non-transitory machine-readable medium of claim 21 wherein the analyzing further comprises: predicting a number of threads required to execute data in the one or more portions of the shader program code.

23. The non-transitory machine-readable medium of claim 22 further comprising program code to cause the machine to perform the operation of: scheduling the number of threads for execution on the plurality of execution units.

24. The non-transitory machine-readable medium of claim 19 wherein the one or more reduct
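Claims 7 to 9 describe an accumulation that halves N elements per iteration, uses register operations in place of memory operations, and drops the barrier for any iteration that needs no memory reads. The Python sketch below is one hedged way to picture that behavior. Everything specific in it is an assumption for illustration and does not come from the patent text: the name `reduce_with_registers`, the `simd_width` parameter, the mapping of element `i` to hardware thread `i // simd_width`, and the rule that an iteration stays in registers when both operands belong to the same thread.

```python
def reduce_with_registers(values, simd_width=8):
    """Sum `values` with a tree reduction that prefers register operations.

    Hypothetical model: regs[i] is the register lane holding element i, and
    lane i belongs to hardware thread i // simd_width. Memory and a barrier
    are used only when an operand lives in a different hardware thread.
    """
    n = len(values)
    assert n > 0 and n & (n - 1) == 0, "sketch assumes a power-of-two length"
    assert simd_width > 0 and simd_width & (simd_width - 1) == 0

    regs = list(values)
    shared = [None] * n
    stats = {"reads": 0, "writes": 0, "barriers": 0}

    stride = n // 2  # each iteration combines 2*stride elements into stride
    while stride >= 1:
        if stride >= simd_width:
            # Operands sit in different hardware threads: exchange via memory.
            for i in range(stride, 2 * stride):
                shared[i] = regs[i]
                stats["writes"] += 1
            stats["barriers"] += 1  # stores must be visible before the loads
            for i in range(stride):
                regs[i] += shared[i + stride]
                stats["reads"] += 1
        else:
            # Both operands are in the same thread's registers:
            # no memory read, so no barrier for this iteration.
            for i in range(stride):
                regs[i] += regs[i + stride]
        stride //= 2

    return regs[0], stats


total, stats = reduce_with_registers(list(range(32)), simd_width=8)
print(total, stats)
```

Under these assumptions, a 32-element reduction with a width of 8 needs barriers only for the two iterations whose operands cross hardware threads, and the last three iterations run entirely on registers. A real compiler pass would decide this from the shader's actual access pattern rather than from a fixed width parameter.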

Assignees

  • Intel Corp

Inventors

Classifications

  • Runtime code conversion or optimisation · CPC title

  • G06F8/443 · Primary

    Optimisation · CPC title

  • Exploiting fine grain parallelism, i.e. parallelism at instruction level (run-time instruction scheduling G06F9/3836) · CPC title

  • G06F8/4434 · Primary

    Reducing the memory space required by the program code · CPC title

  • Barrier synchronisation · CPC title

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US10409571B1 cover?
Apparatus and method for optimizing shader execution. For example, one embodiment of a graphics processing apparatus comprises: a plurality of execution units to execute shader programs; optimization detection circuitry and/or logic to identify one or more portions of shader program code to be optimized including one or more reduction operations which require read/write memory operations and as…
Who is the assignee on this patent?
Intel Corp
What technology area does this patent fall under?
Primary CPC classification G06F8/443. Mapped technology areas include Physics.
When was this patent published?
Publication date Sep 10, 2019 (B1). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 5 related publications on this page (citations in our corpus or others sharing the same primary CPC).