Apparatus and method for efficiently accessing memory when performing a horizontal data reduction

US10409571B1 · US · B1

Patent metadata
Field	Value
Publication number	US-10409571-B1
Application number	US-201815922833-A
Country	US
Kind code	B1
Filing date	Mar 15, 2018
Priority date	Mar 15, 2018
Publication date	Sep 10, 2019
Grant date	Sep 10, 2019

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

Title: What the patent document calls the invention.
Abstract: A short plain-language summary of the technical disclosure.
Assignees and inventors: Who owns or filed the patent and who is credited as inventor.
Key dates: Filing, priority, publication, and grant dates set the timeline.
First independent claim: The legal scope of protection. Read this for what is actually claimed.
CPC / IPC classifications: Technology tags used to group this patent with similar filings.
Citations and related patents: Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

Apparatus and method for optimizing shader execution. For example, one embodiment of a graphics processing apparatus comprises: a plurality of execution units to execute shader programs; optimization detection circuitry and/or logic to identify one or more portions of shader program code to be optimized including one or more reduction operations which require read/write memory operations and associated barrier operations; and optimization circuitry and/or logic to optimize the shader program code by converting a plurality of the read/write memory operations to read/write register operations and removing one or more barrier operations to generate optimized shader program code; the execution units to execute the optimized shader program code.
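
To make the abstract more concrete, here is a minimal Python sketch of the kind of reduction it refers to: a workgroup sums its values through a shared array, with a barrier after every step. This is an illustration only and is not taken from the patent. The names `reduce_via_shared_memory`, `shared`, and `stats` are hypothetical, and because the simulation runs sequentially, barriers are counted rather than executed.

```python
def reduce_via_shared_memory(values):
    """Simulate an unoptimized workgroup tree reduction (illustrative only).

    Assumes len(values) is a power of two. Returns (total, stats), where
    stats counts simulated shared-memory reads, writes, and barriers.
    """
    n = len(values)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a non-zero power of two")

    shared = list(values)  # stands in for shared (workgroup-local) memory
    stats = {"reads": 0, "writes": 0, "barriers": 0}

    stride = n // 2
    while stride >= 1:
        # Each simulated thread combines one pair of elements.
        for tid in range(stride):
            a = shared[tid]
            b = shared[tid + stride]
            stats["reads"] += 2
            shared[tid] = a + b
            stats["writes"] += 1
        # On real hardware all threads would synchronize here so that the
        # next step sees the values just written.
        stats["barriers"] += 1
        stride //= 2

    return shared[0], stats


if __name__ == "__main__":
    total, stats = reduce_via_shared_memory([1, 2, 3, 4, 5, 6, 7, 8])
    print(total, stats)
```

Every step in this baseline touches shared memory and pays for a barrier, which is the cost the abstract says the optimization targets by moving work into registers.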

First claim

Opening claim text (preview).

What is claimed is:

1. A graphics processing apparatus comprising: a plurality of execution units to execute shader programs; a detection circuitry to analyze a shader program to identify one or more portions of shader program code to be optimized including one or more reduction operations which require a plurality of read/write memory operations and a plurality of associated barrier operations; an optimization circuitry to optimize the one or more portions of shader program code identified by the detection circuitry, wherein the optimization circuitry converts the plurality of read/write memory operations to a plurality of read/write register operations and removes one or more barrier operations of the plurality of associated barrier operations to generate optimized shader program code; the plurality of execution units to concurrently execute the optimized shader program code; and a shader compiler to implement the detection circuitry and the optimization circuitry responsive to receiving a new shader program.

2. The graphics processing apparatus of claim 1 further comprising: a graphics driver to interface the graphics processing apparatus to a graphics application, wherein the graphics driver includes the shader compiler.

3. The graphics processing apparatus of claim 2 further comprising: a user mode driver integral to the graphics driver, wherein the user mode driver is to interface with the plurality of execution units and to schedule a plurality of threads for execution on the plurality of execution units.

4. The graphics processing apparatus of claim 1 further comprising: thread dispatch circuitry to dispatch a plurality of threads resulting from the execution of the optimized shader program code to the plurality of execution units.

5. The graphics processing apparatus of claim 4 wherein the detection circuitry is to predict a number of threads required to execute data in the one or more portions of the shader program code.

6. The graphics processing apparatus of claim 1 wherein the one or more reduction operations comprise one or more accumulation operations.

7. The graphics processing apparatus of claim 6 wherein the one or more accumulation operations comprise a series of iterations in which, in each iteration, N data elements at a start of an iteration are combined to generate N/2 data elements.

8. The graphics processing apparatus of claim 7 wherein the optimization circuitry is to convert the one or more accumulation operations so that the plurality of read/write memory operations are converted to the plurality of read/write register operations.

9. The graphics processing apparatus of claim 8 wherein the optimization circuitry is to remove a barrier operation for a particular iteration if no memory read operations are required for the particular iteration.

10. A method comprising: analyzing, by a detection circuitry, a shader program to identify one or more portions of shader program code to be optimized including one or more reduction operations which require a plurality of read/write memory operations and a plurality of associated barrier operations; optimizing, by an optimization circuitry, the one or more portions of shader program code identified by the detection circuitry including converting the plurality of read/write memory operations to a plurality of read/write register operations and removing one or more barrier operations of the plurality of associated barrier operations to generate optimized shader program code; and executing the optimized shader program code concurrently on a plurality of execution units, wherein the operations of analyzing by the detection circuitry and optimizing by the optimization circuitry are implemented by a shader compiler responsive to receiving a new shader program.

11. The method of claim 10 wherein the shader compiler is integral to a graphics driver, and wherein the graphics driver is to interface the optimized shader program code to the plurality of execution units.

12. The method of claim 10 further comprising: dispatching a plurality of threads resulting from the execution of the optimized shader program code to the plurality of execution units.

13. The method of claim 12 wherein the analyzing further comprises: predicting a number of threads required to execute data in the one or more portions of the shader program code.

14. The method of claim 13 further comprising: scheduling the number of threads for execution on the plurality of execution units.

15. The method of claim 10 wherein the one or more reduction operations comprise one or more accumulation operations.

16. The method of claim 15 wherein the one or more accumulation operations comprise a series of iterations in which, in each iteration, N data elements at a start of an iteration are combined to generate N/2 data elements.

17. The method of claim 16 wherein the optimizing further comprises: converting the one or more accumulation operations so that the plurality of read/write memory operations are converted to the plurality of read/write register operations.

18. The method of claim 17 wherein the optimizing further comprises: removing a barrier operation for a particular iteration if no memory read operations are required for the particular iteration.

19. A non-transitory machine-readable medium having program code stored thereon which, when executed by a machine, causes the machine to perform the operations of: analyzing, by a detection circuitry, a shader program to identify one or more portions of shader program code to be optimized including one or more reduction operations which require a plurality of read/write memory operations and a plurality of associated barrier operations; optimizing, by an optimization circuitry, the one or more portions of shader program code identified by the detection circuitry including converting the plurality of read/write memory operations to a plurality of read/write register operations and removing one or more barrier operations of the plurality of associated barrier operations to generate optimized shader program code; and executing the optimized shader program code concurrently on a plurality of execution units, wherein the operations of analyzing by the detection circuitry and optimizing by the optimization circuitry are implemented by a shader compiler responsive to receiving a new shader program.

20. The non-transitory machine-readable medium of claim 19 wherein the shader compiler is integral to a graphics driver, and wherein the graphics driver is to interface the optimized shader program code to the plurality of execution units.

21. The non-transitory machine-readable medium of claim 19 further comprising program code to cause the machine to perform the operation of: dispatching a plurality of threads resulting from the execution of the optimized shader program code to the plurality of execution units.

22. The non-transitory machine-readable medium of claim 21 wherein the analyzing further comprises: predicting a number of threads required to execute data in the one or more portions of the shader program code.

23. The non-transitory machine-readable medium of claim 22 further comprising program code to cause the machine to perform the operation of: scheduling the number of threads for execution on the plurality of execution units.

24. The non-transitory machine-readable medium of claim 19 wherein the one or more reduct
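
The claims that describe halving iterations (N elements combined into N/2) and dropping a barrier when an iteration needs no memory reads can be pictured with the following Python sketch. It is an assumption about how such a transformation might look, not the patent's implementation: each simulated thread first halves its own elements in local variables (standing in for registers), and only the final cross-thread steps go through shared memory with a barrier. The function `reduce_with_registers`, the `num_threads` parameter, and the `stats` keys are all hypothetical.

```python
def _is_pow2(x):
    """Return True if x is a positive power of two."""
    return x > 0 and (x & (x - 1)) == 0


def reduce_with_registers(values, num_threads):
    """Sum `values` using per-thread register steps, then a shared-memory tail.

    Illustrative simulation only. Assumes len(values) and num_threads are
    powers of two and num_threads <= len(values). Returns (total, stats).
    """
    n = len(values)
    if not _is_pow2(n) or not _is_pow2(num_threads) or n % num_threads:
        raise ValueError("sizes must be powers of two, num_threads <= len(values)")

    per_thread = n // num_threads
    stats = {"register_iterations": 0, "memory_iterations": 0,
             "barriers_kept": 0, "barriers_removed": 0}

    # Phase 1: every thread halves its own elements in local variables.
    # No other thread's data is read, so no barrier is needed.
    partials = []
    for tid in range(num_threads):
        regs = values[tid * per_thread:(tid + 1) * per_thread]
        while len(regs) > 1:
            half = len(regs) // 2
            regs = [regs[i] + regs[i + half] for i in range(half)]
        partials.append(regs[0])
    steps = per_thread.bit_length() - 1  # log2(per_thread) halving steps
    stats["register_iterations"] = steps
    stats["barriers_removed"] = steps

    # Phase 2: combine per-thread partial sums through shared memory.
    # These steps read other threads' values, so each keeps its barrier.
    shared = list(partials)
    stride = num_threads // 2
    while stride >= 1:
        stats["barriers_kept"] += 1
        for tid in range(stride):
            shared[tid] = shared[tid] + shared[tid + stride]
        stats["memory_iterations"] += 1
        stride //= 2

    return shared[0], stats


if __name__ == "__main__":
    print(reduce_with_registers([1, 2, 3, 4, 5, 6, 7, 8], 2))
```

In this sketch, eight elements on two threads still take three halving iterations in total, but only the last one reads another thread's data, so only that iteration keeps a barrier.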

Assignees

Intel Corp

Inventors

Targowski Marek

Classifications

G06F9/45516 · Runtime code conversion or optimisation
G06F8/443 (Primary) · Optimisation
G06F8/445 · Exploiting fine grain parallelism, i.e. parallelism at instruction level (run-time instruction scheduling G06F9/3836)
G06F8/4434 (Primary) · Reducing the memory space required by the program code
G06F9/522 · Barrier synchronisation

Patent family

Related publications grouped by family.

View patent family 67845224

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US10409571B1 cover?: Apparatus and method for optimizing shader execution. For example, one embodiment of a graphics processing apparatus comprises: a plurality of execution units to execute shader programs; optimization detection circuitry and/or logic to identify one or more portions of shader program code to be optimized including one or more reduction operations which require read/write memory operations and as…
Who is the assignee on this patent?: Intel Corp
What technology area does this patent fall under?: Primary CPC classification G06F8/443. Mapped technology areas include Physics.
When was this patent published?: Publication date Sep 10, 2019 (kind code B1). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?: We list 5 related publications on this page (citations in our corpus or others sharing the same primary CPC).

Related patents

Barriers and synchronization for machine learning at autonomous machines

Memory-based software barriers

Read/Write Modes for Reducing Power Consumption in Graphics Processing Units

Compute cluster preemption within a general-purpose graphics processing unit

Facilitating dynamic parallel scheduling of command packets at graphics processing units on computing devices
