US12001944B2 · US · B2
| Field | Value |
|---|---|
| Publication number | US-12001944-B2 |
| Application number | US-202217874876-A |
| Country | US |
| Kind code | B2 |
| Filing date | Jul 27, 2022 |
| Priority date | Apr 28, 2017 |
| Publication date | Jun 4, 2024 |
| Grant date | Jun 4, 2024 |
Official abstract text for this publication.
A mechanism is described for facilitating smart distribution of resources for deep learning autonomous machines. A method of embodiments, as described herein, includes detecting one or more sets of data from one or more sources over one or more networks, and introducing a library to a neural network application to determine an optimal point at which to apply frequency scaling without degrading performance of the neural network application at a computing device.
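The abstract's idea of applying frequency scaling "without degrading performance" can be illustrated with a minimal sketch. It assumes a simple model that the patent text shown here does not state: each worker's compute time scales inversely with its core frequency, and the slowest worker sets the duration of a synchronized training step. The function name `pick_core_frequencies` and its parameters are hypothetical and are not taken from the patent.

```python
from typing import List, Sequence


def pick_core_frequencies(
    compute_times_s: Sequence[float],
    available_mhz: Sequence[int],
) -> List[int]:
    """Per worker, pick the lowest frequency that does not lengthen the step.

    compute_times_s: per-worker compute time measured at the highest frequency.
    available_mhz:   discrete core frequencies the hardware is assumed to offer.
    """
    f_max = max(available_mhz)
    # The slowest worker determines when gradient synchronization can finish,
    # so faster workers have slack (the skew) that can absorb a lower clock.
    deadline = max(compute_times_s)
    choices = []
    for t in compute_times_s:
        # Assumed model: time at frequency f equals t * f_max / f.
        feasible = [f for f in available_mhz if t * f_max / f <= deadline + 1e-12]
        choices.append(min(feasible))
    return choices


if __name__ == "__main__":
    print(pick_core_frequencies([1.0, 0.5, 0.8], [600, 900, 1200, 1500]))
```

In this toy model the straggler keeps the top frequency while workers that would otherwise idle at the synchronization barrier are slowed down. A real system would also need to account for memory-bound phases and communication time, which the sketch ignores.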
Opening claim text (preview).
What is claimed is:

1. An apparatus comprising: a graphics processor to: cause a neural network application to implement a library comprising machine learning primitives, wherein the machine learning primitives are usable to analyze a skew pattern observed in a distributed gradient synchronization implemented by the neural network application; determine, using the machine learning primitives of the library, a point to apply frequency scaling in the graphics processor without degrading performance of the neural network application, the point determined based on analysis of the skew pattern generated by the distributed gradient synchronization; and determine, using the library as implemented by the neural network application, a core frequency of the frequency scaling applied at the point, wherein the library is to account for skew characteristics associated with the distributed gradient synchronization to decide the core frequency.

2. The apparatus of claim 1, wherein the point is determined through the distributed gradient synchronization using a tree-like structure such that local weight vectors start at one or more nodes represented as leaves of the tree-like structure and communicate up to a root of the tree-like structure.

3. The apparatus of claim 1, wherein the graphics processor is further to introduce sparse matrix representation for weights to overlap communication and computation across multiple nodes associated with the neural network application to reduce communication costs.

4. The apparatus of claim 1, wherein the graphics processor is further to automatically analyze failed execution of programs including or relevant to the neural network application to obtain insights on one or more faults of hardware performance counters.

5. The apparatus of claim 4, wherein the graphics processor is further to provide one or more of successful execution information obtained from successful execution of the programs and failed execution information obtained from the failed execution of the programs to a trained network model to seek out one or more of the hardware performance counters that are regarded as faulty or outside a range of approval.

6. The apparatus of claim 1, wherein the graphics processor is further to perform local error propagation by computing high precision and low precision for local weights and compute local errors at each of multiple nodes associated with the neural network application, wherein performing the local error propagation includes facilitating weight synchronization across the multiple nodes to track the local errors for accuracy and reduced communication.

7. The apparatus of claim 1, wherein the apparatus comprises an autonomous machine including one or more of a vehicle, a device, and an equipment, wherein the autonomous machine comprises one or more processors including the graphics processor, wherein the graphics processor is co-located with an application processor on a common semiconductor package.

8. A method comprising: causing a neural network application to implement a library comprising machine learning primitives, wherein the machine learning primitives are usable to analyze a skew pattern observed in a distributed gradient synchronization implemented by the neural network application; determining, using the machine learning primitives of the library, a point to apply frequency scaling in a computing device hosting the neural network application without degrading performance of the neural network application at the computing device, the point determined based on analysis of the skew pattern generated by the distributed gradient synchronization; and determining, using the library as implemented by the neural network application, a core frequency of the frequency scaling applied at the point, wherein the library is to account for skew characteristics associated with the distributed gradient synchronization to decide the core frequency.

9. The method of claim 8, wherein the point is determined through the distributed gradient synchronization using a tree-like structure such that local weight vectors start at one or more nodes represented as leaves of the tree-like structure and communicate up to a root of the tree-like structure.

10. The method of claim 8, further comprising introducing sparse matrix representation for weights to overlap communication and computation across multiple nodes associated with the neural network application to reduce communication costs.

11. The method of claim 8, further comprising automatically analyzing failed execution of programs including or relevant to the neural network application to obtain insights on one or more faults of hardware performance counters.

12. The method of claim 11, further comprising providing one or more of successful execution information obtained from successful execution of the programs and failed execution information obtained from the failed execution of the programs to a trained network model to seek out one or more of the hardware performance counters that are regarded as faulty or outside a range of approval.

13. The method of claim 8, further comprising performing local error propagation by computing high precision and low precision for local weights and compute local errors at each of multiple nodes associated with the neural network application, wherein performing the local error propagation includes facilitating weight synchronization across the multiple nodes to track the local errors for accuracy and reduced communication.

14. The method of claim 8, wherein the computing device comprises an autonomous machine including one or more of a vehicle, a device, and an equipment, wherein the autonomous machine comprises one or more processors including a graphics processor, wherein the graphics processor is co-located with an application processor on a common semiconductor package.

15. A non-transitory machine-readable medium comprising instructions that when executed by a computing device, cause the computing device to perform operations comprising: causing a neural network application to implement a library comprising machine learning primitives, wherein the machine learning primitives are usable to analyze a skew pattern observed in a distributed gradient synchronization implemented by the neural network application; determining, using the machine learning primitives of the library, a point to apply frequency scaling in the computing device without degrading performance of the neural network application, the point determined based on analysis of the skew pattern generated by the distributed gradient synchronization; and determining, using the library as implemented by the neural network application, a core frequency of the frequency scaling applied at the point, wherein the library is to account for skew characteristics associated with the distributed gradient synchronization to decide the core frequency.

16. The non-transitory machine-readable medium of claim 15, wherein the point is determined through the distributed gradient synchronization using a tree-like structure such that local weight vectors start at one or more nodes represented as leaves of the tree-like structure and communicate up to a root of the tree-like structure.

17. The non-transitory machine-readable medium of claim 15, wherein the operations further comprise introducing sparse matrix representation for weights to overlap communication and computation across multiple nodes associated with the neural network application to reduce communication costs.

18. The non-transitory machi
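Claims 2, 9, and 16 describe gradient synchronization over a tree-like structure in which local weight vectors start at the leaves and are communicated up to a root. The following is a minimal single-process sketch of that communication pattern, not the patented implementation. The combining rule (pairwise sum, then an average at the root) is an assumption, since the claims do not specify one, and the name `tree_reduce` is hypothetical.

```python
from typing import List


def tree_reduce(local_vectors: List[List[float]]) -> List[float]:
    """Combine per-node weight vectors pairwise, level by level, up to one root.

    Each inner list stands in for the local weight vector held by one leaf node.
    """
    if not local_vectors:
        raise ValueError("need at least one node")
    level = [list(v) for v in local_vectors]  # the leaves of the tree
    while len(level) > 1:
        parents = []
        for i in range(0, len(level), 2):
            if i + 1 < len(level):
                # A parent node receives two children and sums them elementwise.
                parents.append([a + b for a, b in zip(level[i], level[i + 1])])
            else:
                # An unpaired node is forwarded unchanged to the next level.
                parents.append(level[i])
        level = parents
    # Assumed combining rule: the root averages the summed vectors.
    n = len(local_vectors)
    return [x / n for x in level[0]]


if __name__ == "__main__":
    print(tree_reduce([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]))
```

With N leaves the reduction takes about log2(N) levels, which is why tree layouts are a common choice when many nodes must agree on one set of weights. In a distributed setting each level would be a round of network messages, not a loop iteration.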
Convolutional networks [CNN, ConvNet] · CPC title
Distributed learning, e.g. federated learning · CPC title
Supervised learning · CPC title
Quantised networks; Sparse networks; Compressed networks · CPC title
using electronic means · CPC title