US12039435B2 · US · B2
| Field | Value |
|---|---|
| Publication number | US-12039435-B2 |
| Application number | US-202217845794-A |
| Country | US |
| Kind code | B2 |
| Filing date | Jun 21, 2022 |
| Priority date | Dec 30, 2017 |
| Publication date | Jul 16, 2024 |
| Grant date | Jul 16, 2024 |
Official abstract text for this publication.
An apparatus to facilitate acceleration of machine learning operations is disclosed. The apparatus comprises at least one processor to perform operations to implement a neural network, and accelerator logic communicatively coupled to the processor to perform compute operations for the neural network.
Opening claim text (preview).
What is claimed is:

1. An apparatus comprising: at least one processor to perform operations to implement a neural network and to perform forward propagation compute and backward propagation compute for the neural network; and accelerator circuitry communicatively coupled to the at least one processor, the accelerator circuitry to: receive, from the at least one processor, input matrix data for each layer of the neural network used during performance of the forward propagation compute, the input matrix data received during performance of the forward propagation compute by the at least one processor; store the input matrix data in a memory of the accelerator circuitry; compute a transpose for the input matrix data; receive a weight matrix from the at least one processor; compute, in parallel with the backward propagation compute performed by the at least one processor that is separate from the accelerator circuitry, weight gradients by multiplying the transpose of the input matrix data with the weight matrix; and compute, in parallel with normalization operations of the neural network performed by the at least one processor, mean and variance calculations for the normalization operations.

2. The apparatus of claim 1, wherein the at least one processor comprises a graphics processing unit (GPU) communicably coupled to a central processing unit (CPU), and wherein the accelerator circuitry comprises a Differentiable Neural Computer (DNC).

3. The apparatus of claim 2, wherein the DNC comprising an external memory coupled to the CPU and the GPU to store knowledge data for the neural network.

4. The apparatus of claim 3, wherein the CPU performs transformation compute operations for the neural network.

5. The apparatus of claim 4, wherein the CPU performs a zero copy operation to facilitate a transfer of data between the CPU and the GPU.

6. The apparatus of claim 2, wherein the accelerator circuitry comprises a scheduler to analyze CPU and GPU resources and a compute graph of the neural network, assign nodes of the compute graph to the CPU and the GPU resources and schedule the compute graph for processing at the CPU and the GPU resources.

7. The apparatus of claim 6, wherein analyzing the CPU and GPU resources and the compute graph comprises determining a computation cost of operators to be performed at the CPU and the GPU.

8. The apparatus of claim 7, wherein assigning the nodes of the compute graph to the CPU and the GPU resources comprises determining a shortest path for each of the operators based on the computation cost.

9. The apparatus of claim 7, wherein the computation cost is determined based on individually processing the operators at the CPU and the GPU.

10. The apparatus of claim 7, wherein the computation cost is determined based on simultaneously processing the operators at the CPU and the GPU.

11. A method comprising: receiving, by accelerator circuitry from at least one processor that implements a neural network, input matrix data for each layer of the neural network used during performance of a forward propagation compute for the neural network performed at the at least one processor, the input matrix data received during performance of the forward propagation compute by the at least one processor; storing the input matrix data in a memory of the accelerator circuitry; computing, by the accelerator circuitry, a transpose for the input matrix data; receiving, by the accelerator circuitry, a weight matrix from the at least one processor; computing, by the accelerator circuitry in parallel with backward propagation compute performed by the at least one processor that is separate from the accelerator circuitry, weight gradients by multiplying the transpose of the input matrix data with the weight matrix; and computing, by the accelerator circuitry in parallel with normalization operations of the neural network performed the at least one processor, mean and variance calculations for the normalization operations.

12. The method of claim 11, wherein the at least one processor comprises a graphics processing unit (GPU) communicably coupled to a central processing unit (CPU), and wherein the accelerator circuitry comprises Differentiable Neural Computer (DNC), wherein the accelerator circuitry comprises a scheduler to analyze CPU and GPU resources and a compute graph of the neural network, assign nodes of the compute graph to the CPU and the GPU resources and schedule the compute graph for processing at the CPU and the GPU resources, and wherein analyzing the CPU and the GPU resources and the compute graph comprises determining a computation cost of operators to be performed at the CPU and the GPU.

13. The method of claim 12, wherein the computation cost is determined based on individually processing the operators at the CPU and the GPU.

14. The method of claim 12, wherein the computation cost is determined based on simultaneously processing the operators at the CPU and the GPU.

15. The method of claim 12, wherein assigning the nodes of the compute graph to the CPU and the GPU resources comprises determining a shortest path for each of the operators based on the computation cost.

16. A system comprising: a memory; at least one processor communicably coupled to the memory, the at least one processor to perform operations to implement a neural network and to perform forward propagation compute and backward propagation compute for the neural network; and accelerator circuitry communicatively coupled to the memory and the at least one processor, the accelerator circuitry to: receive, from the at least one processor, input matrix data for each layer of the neural network used during performance of the forward propagation compute, the input matrix data received during the performance of the forward propagation compute by the at least one processor; store the input matrix data in accelerator circuitry memory; compute a transpose for the input matrix data; receive a weight matrix from the at least one processor; compute, in parallel with the backward propagation compute performed by the at least one processor that is separate from the accelerator circuitry, weight gradients by multiplying the transpose of the input matrix data with the weight matrix; and compute, in parallel with normalization operations of the neural network performed by the at least one processor, mean and variance calculations for the normalization operations.

17. The system of claim 16, wherein the at least one processor comprises a graphics processing unit (GPU) communicably coupled to a central processing unit (CPU), and wherein the accelerator circuitry comprises a Differentiable Neural Computer (DNC).

18. The system of claim 17, wherein the DNC comprising an external memory coupled to the CPU and the GPU to store knowledge data for the neural network.

19. The system of claim 18, wherein the CPU performs transformation compute operations for the neural network.

20. The system of claim 19, wherein the CPU performs a zero copy operation to facilitate a transfer of data between the CPU and the GPU.

21. A non-transitory computer-readable medium having stored thereon executable computer program instructions that, when executed by one or more processors, cause the one or more processors to perform operations comprising: receiving, by accelerator circuitry from at least one processor of the one or more processors that implements a neural network, input matrix data for each layer of the neural netwo
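To make the accelerator-side steps recited in the independent claims concrete (store each layer's input matrix, transpose it, multiply the transpose with a second matrix, and compute mean and variance for normalization), here is a minimal Python sketch. It is an illustration under stated assumptions, not the patented implementation. The class `AcceleratorSketch` and its method names are hypothetical. The claims name the weight matrix as the second operand of the product, so the sketch takes that operand as a parameter and only requires compatible shapes. Per-feature statistics over the batch dimension are also an assumption, since the claims do not say along which axis the mean and variance are taken. The sketch runs sequentially and does not model the parallelism with the processor that the claims recite.

```python
from statistics import fmean, pvariance


def transpose(matrix):
    """Return the transpose of a row-major list-of-lists matrix."""
    return [list(col) for col in zip(*matrix)]


def matmul(a, b):
    """Multiply two row-major matrices after checking the inner dimensions."""
    if len(a[0]) != len(b):
        raise ValueError("inner dimensions do not match")
    b_cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in b_cols] for row in a]


class AcceleratorSketch:
    """Hypothetical stand-in for the claimed accelerator circuitry."""

    def __init__(self):
        # Accelerator-local memory: layer index -> stored input matrix.
        self.stored_inputs = {}

    def receive_input(self, layer, input_matrix):
        # Store a copy of the input matrix captured during forward propagation.
        self.stored_inputs[layer] = [list(row) for row in input_matrix]

    def weight_gradient(self, layer, second_operand):
        # Product of the transposed stored input with the supplied matrix.
        return matmul(transpose(self.stored_inputs[layer]), second_operand)

    def mean_and_variance(self, layer):
        # Assumption: one mean and one population variance per feature column.
        columns = transpose(self.stored_inputs[layer])
        return [fmean(c) for c in columns], [pvariance(c) for c in columns]


accel = AcceleratorSketch()
accel.receive_input(0, [[1.0, 2.0], [3.0, 4.0]])
grad = accel.weight_gradient(0, [[1.0, 0.0], [0.0, 1.0]])
means, variances = accel.mean_and_variance(0)
```

With the identity matrix as the second operand, `grad` is simply the transposed input, which makes the transpose step easy to check by eye.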
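The scheduler claims (claims 6 to 10 and 12 to 15) describe determining a computation cost for each operator on the CPU and the GPU and assigning compute-graph nodes by a shortest path based on that cost. One possible reading, sketched below in Python, treats each (operator, device) pair as a node in a layered graph and finds the cheapest path through it. The function name `assign_devices`, the linear operator chain, the `transfer_cost` penalty for switching devices, and all cost values are hypothetical and do not come from the patent.

```python
def assign_devices(operator_costs, transfer_cost=0.0):
    """Pick a device per operator by shortest path over a layered graph.

    operator_costs: list of (name, {"cpu": cost, "gpu": cost}) tuples in
    execution order. Returns (total_cost, [(name, device), ...]).
    """
    devices = ("cpu", "gpu")
    # best[d] holds the cheapest (total, path) whose last operator runs on d.
    best = {d: (0.0, []) for d in devices}
    for name, cost in operator_costs:
        new_best = {}
        for device in devices:
            candidates = []
            for prev in devices:
                prev_total, prev_path = best[prev]
                # Charge a transfer only when the device actually changes.
                hop = transfer_cost if prev_path and prev != device else 0.0
                candidates.append(
                    (prev_total + hop + cost[device], prev_path + [(name, device)])
                )
            new_best[device] = min(candidates, key=lambda c: c[0])
        best = new_best
    return min(best.values(), key=lambda c: c[0])


# Hypothetical per-operator costs measured individually on each device.
ops = [
    ("conv", {"cpu": 8.0, "gpu": 2.0}),
    ("reshape", {"cpu": 1.0, "gpu": 1.5}),
    ("matmul", {"cpu": 9.0, "gpu": 3.0}),
]
total_with_transfer, plan_with_transfer = assign_devices(ops, transfer_cost=1.0)
total_free, plan_free = assign_devices(ops, transfer_cost=0.0)
```

When switching devices is free, the cheap `reshape` moves to the CPU. With a transfer penalty of 1.0, keeping every operator on the GPU is cheaper overall, so the path search keeps them together.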
characterised by memory or gating, e.g. long short-term memory [LSTM] or gated recurrent units [GRU] · CPC title
Neural networks · CPC title
Arrangements for program control, e.g. control units (program control for peripheral devices G06F13/10) · CPC title
Machine learning · CPC title
Backpropagation, e.g. using gradient descent · CPC title