US10409571B1 · US · B1
| Field | Value |
|---|---|
| Publication number | US-10409571-B1 |
| Application number | US-201815922833-A |
| Country | US |
| Kind code | B1 |
| Filing date | Mar 15, 2018 |
| Priority date | Mar 15, 2018 |
| Publication date | Sep 10, 2019 |
| Grant date | Sep 10, 2019 |
Official abstract text for this publication.
Apparatus and method for optimizing shader execution. For example, one embodiment of a graphics processing apparatus comprises: a plurality of execution units to execute shader programs; optimization detection circuitry and/or logic to identify one or more portions of shader program code to be optimized including one or more reduction operations which require read/write memory operations and associated barrier operations; and optimization circuitry and/or logic to optimize the shader program code by converting a plurality of the read/write memory operations to read/write register operations and removing one or more barrier operations to generate optimized shader program code; the execution units to execute the optimized shader program code.
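As an illustration of the kind of code the abstract targets, the following minimal sketch models an unoptimized pairwise reduction in plain Python. It is a simulation, not shader code and not the patent's implementation: the function name `tree_reduce_shared`, the use of a Python list to stand in for shared memory, and the counting scheme (two reads and one write per combine, one barrier per pass) are all assumptions made for this example.

```python
def tree_reduce_shared(values):
    """Pairwise sum reduction over a simulated shared-memory buffer.

    Hypothetical model: each pass combines N elements into N/2 and is
    followed by one barrier. Returns (total, memory_ops, barriers).
    """
    n = len(values)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")

    shared = list(values)   # stands in for shared memory (assumption)
    memory_ops = 0
    barriers = 0

    while n > 1:
        half = n // 2
        for tid in range(half):          # one simulated thread per output
            a = shared[tid]              # memory read
            b = shared[tid + half]       # memory read
            shared[tid] = a + b          # memory write
            memory_ops += 3
        barriers += 1                    # every thread syncs before the next pass
        n = half

    return shared[0], memory_ops, barriers


if __name__ == "__main__":
    print(tree_reduce_shared(list(range(1, 9))))  # (36, 21, 3)
```

For eight inputs the model performs three passes (8 to 4, 4 to 2, 2 to 1), so it reports 21 memory operations and 3 barriers. These are the costs that the abstract describes reducing by moving memory operations into registers and dropping barriers.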
Opening claim text (preview).
What is claimed is:
1. A graphics processing apparatus comprising: a plurality of execution units to execute shader programs; a detection circuitry to analyze a shader program to identify one or more portions of shader program code to be optimized including one or more reduction operations which require a plurality of read/write memory operations and a plurality of associated barrier operations; an optimization circuitry to optimize the one or more portions of shader program code identified by the detection circuitry, wherein the optimization circuitry converts the plurality of read/write memory operations to a plurality of read/write register operations and removes one or more barrier operations of the plurality of associated barrier operations to generate optimized shader program code; the plurality of execution units to concurrently execute the optimized shader program code; and a shader compiler to implement the detection circuitry and the optimization circuitry responsive to receiving a new shader program.
2. The graphics processing apparatus of claim 1 further comprising: a graphics driver to interface the graphics processing apparatus to a graphics application, wherein the graphics driver includes the shader compiler.
3. The graphics processing apparatus of claim 2 further comprising: a user mode driver integral to the graphics driver, wherein the user mode driver is to interface with the plurality of execution units and to schedule a plurality of threads for execution on the plurality of execution units.
4. The graphics processing apparatus of claim 1 further comprising: thread dispatch circuitry to dispatch a plurality of threads resulting from the execution of the optimized shader program code to the plurality of execution units.
5. The graphics processing apparatus of claim 4 wherein the detection circuitry is to predict a number of threads required to execute data in the one or more portions of the shader program code.
6. The graphics processing apparatus of claim 1 wherein the one or more reduction operations comprise one or more accumulation operations.
7. The graphics processing apparatus of claim 6 wherein the one or more accumulation operations comprise a series of iterations in which, in each iteration, N data elements at a start of an iteration are combined to generate N/2 data elements.
8. The graphics processing apparatus of claim 7 wherein the optimization circuitry is to convert the one or more accumulation operations so that the plurality of read/write memory operations are converted to the plurality of read/write register operations.
9. The graphics processing apparatus of claim 8 wherein the optimization circuitry is to remove a barrier operation for a particular iteration if no memory read operations are required for the particular iteration.
10. A method comprising: analyzing, by a detection circuitry, a shader program to identify one or more portions of shader program code to be optimized including one or more reduction operations which require a plurality of read/write memory operations and a plurality of associated barrier operations; optimizing, by an optimization circuitry, the one or more portions of shader program code identified by the detection circuitry including converting the plurality of read/write memory operations to a plurality of read/write register operations and removing one or more barrier operations of the plurality of associated barrier operations to generate optimized shader program code; and executing the optimized shader program code concurrently on a plurality of execution units, wherein the operations of analyzing by the detection circuitry and optimizing by the optimization circuitry are implemented by a shader compiler responsive to receiving a new shader program.
11. The method of claim 10 wherein the shader compiler is integral to a graphics driver, and wherein the graphics driver is to interface the optimized shader program code to the plurality of execution units.
12. The method of claim 10 further comprising: dispatching a plurality of threads resulting from the execution of the optimized shader program code to the plurality of execution units.
13. The method of claim 12 wherein the analyzing further comprises: predicting a number of threads required to execute data in the one or more portions of the shader program code.
14. The method of claim 13 further comprising: scheduling the number of threads for execution on the plurality of execution units.
15. The method of claim 10 wherein the one or more reduction operations comprise one or more accumulation operations.
16. The method of claim 15 wherein the one or more accumulation operations comprise a series of iterations in which, in each iteration, N data elements at a start of an iteration are combined to generate N/2 data elements.
17. The method of claim 16 wherein the optimizing further comprises: converting the one or more accumulation operations so that the plurality of read/write memory operations are converted to the plurality of read/write register operations.
18. The method of claim 17 wherein the optimizing further comprises: removing a barrier operation for a particular iteration if no memory read operations are required for the particular iteration.
19. A non-transitory machine-readable medium having program code stored thereon which, when executed by a machine, causes the machine to perform the operations of: analyzing, by a detection circuitry, a shader program to identify one or more portions of shader program code to be optimized including one or more reduction operations which require a plurality of read/write memory operations and a plurality of associated barrier operations; optimizing, by an optimization circuitry, the one or more portions of shader program code identified by the detection circuitry including converting the plurality of read/write memory operations to a plurality of read/write register operations and removing one or more barrier operations of the plurality of associated barrier operations to generate optimized shader program code; and executing the optimized shader program code concurrently on a plurality of execution units, wherein the operations of analyzing by the detection circuitry and optimizing by the optimization circuitry are implemented by a shader compiler responsive to receiving a new shader program.
20. The non-transitory machine-readable medium of claim 19 wherein the shader compiler is integral to a graphics driver, and wherein the graphics driver is to interface the optimized shader program code to the plurality of execution units.
21. The non-transitory machine-readable medium of claim 19 further comprising program code to cause the machine to perform the operation of: dispatching a plurality of threads resulting from the execution of the optimized shader program code to the plurality of execution units.
22. The non-transitory machine-readable medium of claim 21 wherein the analyzing further comprises: predicting a number of threads required to execute data in the one or more portions of the shader program code.
23. The non-transitory machine-readable medium of claim 22 further comprising program code to cause the machine to perform the operation of: scheduling the number of threads for execution on the plurality of execution units.
24. The non-transitory machine-readable medium of claim 19 wherein the one or more reduct
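The claims describe converting memory operations to register operations and removing the barrier for any iteration that needs no memory reads. The claim preview does not say how the conversion is carried out, so the sketch below is only one possible reading, modeled in plain Python. The function `tree_reduce_registers`, the `per_thread` parameter, and the operation counts are hypothetical. Each simulated thread keeps `per_thread` elements in local variables that stand in for registers and reduces them with no barrier. Only the cross-thread passes touch simulated memory and synchronize.

```python
def _is_pow2(x):
    return x > 0 and (x & (x - 1)) == 0


def tree_reduce_registers(values, per_thread):
    """Model of a reduction whose early passes run in thread-private registers.

    Hypothetical sketch. Returns (total, register_ops, memory_ops, barriers).
    Loading the inputs into registers is not counted.
    """
    n = len(values)
    if not (_is_pow2(n) and _is_pow2(per_thread) and per_thread <= n):
        raise ValueError("sizes must be powers of two with per_thread <= len(values)")

    threads = n // per_thread
    register_ops = memory_ops = barriers = 0
    partials = []

    for tid in range(threads):
        # Thread-private values: no other thread reads or writes them.
        regs = list(values[tid * per_thread:(tid + 1) * per_thread])
        width = per_thread
        while width > 1:
            half = width // 2
            for i in range(half):
                regs[i] = regs[i] + regs[i + half]   # register read/read/write
                register_ops += 3
            width = half
            # No barrier here: this pass performs no memory reads.
        partials.append(regs[0])

    shared = list(partials)          # each thread publishes one partial sum
    memory_ops += threads            # one memory write per thread
    if threads > 1:
        barriers += 1                # partials must be visible before reading

    m = threads
    while m > 1:                     # remaining cross-thread passes
        half = m // 2
        for tid in range(half):
            shared[tid] = shared[tid] + shared[tid + half]
            memory_ops += 3          # two reads, one write
        barriers += 1
        m = half

    return shared[0], register_ops, memory_ops, barriers


if __name__ == "__main__":
    print(tree_reduce_registers(list(range(1, 9)), 4))  # (36, 18, 5, 2)
```

With eight inputs and four elements per thread, this model reports 18 register operations, 5 memory operations, and 2 barriers. Under the same counting rules, a version that runs every pass through memory would need 21 memory operations and 3 barriers. The totals depend entirely on the assumed counting scheme and are illustrative only.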
Runtime code conversion or optimisation · CPC title
Optimisation · CPC title
Exploiting fine grain parallelism, i.e. parallelism at instruction level (run-time instruction scheduling G06F9/3836) · CPC title
Reducing the memory space required by the program code · CPC title
Barrier synchronisation · CPC title