Efficient data layouts for convolutional neural networks

US10489680B2 · US · B2

Patent metadata
Field               Value
Publication number  US-10489680-B2
Application number  US-201715724142-A
Country             US
Kind code           B2
Filing date         Oct 3, 2017
Priority date       Oct 4, 2016
Publication date    Nov 26, 2019
Grant date          Nov 26, 2019

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

Systems and methods for efficient implementation of a convolutional layer of a convolutional neural network are disclosed. In one aspect, weight values of kernels in a kernel stack of a convolutional layer can be reordered into a tile layout with tiles of runnels. Pixel values of input activation maps of the convolutional layer can be reordered into an interleaved layout comprising a plurality of clusters of input activation map pixels. The output activation maps can be determined using the clusters of the input activation map pixels and kernels tile by tile.
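The interleaved input layout described in the abstract can be illustrated with a minimal sketch. It assumes that a "cluster" gathers the pixel values at one (row, column) position from every input activation map, and that the basic layout stores each map separately; the function names `interleave` and `deinterleave` are hypothetical and do not come from the patent.

```python
def interleave(maps):
    """Reorder maps[map][row][col] into a flat list of pixel clusters.

    Assumption: each cluster holds the pixels at one (row, col) position
    taken from every map, so values that are consumed together sit next
    to each other in memory.
    """
    num_maps = len(maps)
    height = len(maps[0])
    width = len(maps[0][0])
    flat = []
    for row in range(height):            # outermost: map height
        for col in range(width):         # then: map width
            for m in range(num_maps):    # innermost: across the maps
                flat.append(maps[m][row][col])
    return flat


def deinterleave(flat, num_maps, height, width):
    """Inverse reordering: flat clusters back to maps[map][row][col]."""
    maps = [[[None] * width for _ in range(height)] for _ in range(num_maps)]
    values = iter(flat)
    for row in range(height):
        for col in range(width):
            for m in range(num_maps):
                maps[m][row][col] = next(values)
    return maps


# Two 2x2 maps: the pixels at the same position end up adjacent.
example = [[[1, 2], [3, 4]], [[10, 20], [30, 40]]]
print(interleave(example))  # [1, 10, 2, 20, 3, 30, 4, 40]
```

The inverse function corresponds to converting an interleaved result back to a basic layout when a later stage expects one map at a time.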

First claim

Opening claim text (preview).

What is claimed is:

1. A system for executing a convolutional neural network (CNN), the system comprising: non-transitory memory configured to store: a convolutional layer of a convolutional neural network, wherein the convolutional layer comprises kernels in a kernel stack, wherein the kernels of the kernel stack are in a basic kernel layout, wherein weight values of the kernels of the kernel stack are reordered from the basic kernel layout into a tile kernel layout comprising a plurality of kernel tiles, wherein a kernel tile comprises a plurality of kernel runnels, and wherein a kernel runnel comprises a number of the weight values of the kernels of the kernel stack; and a hardware processor in communication with the non-transitory memory, the hardware processor programmed by executable instructions to: receive input activation maps of the convolutional layer, wherein the input activation maps are in a basic input activation map layout; reorder pixel values of the input activation maps from the basic input activation map layout into an interleaved input activation map layout comprising a plurality of clusters of input activation map pixels; and determine output activation maps of the convolutional layer from the plurality of kernel tiles and the plurality of clusters of input activation map pixels, wherein the output activation maps are in an interleaved output activation map layout comprising a plurality of clusters output activation map pixels.

2. The system of claim 1, wherein the weight values of the kernels of the kernel stack are reordered from the basic kernel layout into the tile kernel layout by, iteratively: traversing along a width dimension of the kernel stack; traversing along a height dimension of the kernel stack; traversing along a width dimension of a kernel of the kernel stack; and traversing along a height dimension of the kernel of the kernel stack.

3. The system of claim 1, wherein a first kernel runnel of the kernel tile corresponds a first kernel stack width boundary, and wherein a last kernel runnel of the kernel tile corresponds to a second kernel stack width boundary subsequent of the first kernel stack width boundary.

4. The system of claim 1, wherein to reorder the pixel values of the input activation maps from the basic input activation map layout into the interleaved input activation map layout, the hardware processor is programmed to, iteratively: traverse along a dimension of a number of input activation maps; traverse along a width dimension of an input activation map; and traverse along a height dimension of input activation map.

5. The system of claim 1, wherein the hardware processor is programmed to: reorder pixel values of the output activation maps from the interleaved output activation map layout into a basic output activation map layout.

6. The system of claim 5, wherein to reorder the pixel values of the output activation maps from the interleaved output activation map into the basic output activation map layout, the hardware processor is programmed to, iteratively: traversing along a width dimension of the interleaved output activation map; and traversing along a height dimension of the interleaved output activation map.

7. The system of claim 1, wherein to determine the output activation maps of the convolutional layer from the plurality of kernel tiles and the plurality of clusters of input activation map pixels, the hardware processor is programmed to: perform fused-multiply-add operations tile by tile on the plurality of kernel tiles and the plurality of clusters of input activation map pixels.

8. The system of claim 7, wherein to perform the fused-multiply-add operations tile by tile on the plurality of kernel tiles and the plurality of clusters of input activation map pixels comprises, iteratively: for each output activation map pixel: set a value of the output activation map pixel to a value of zero; and for each kernel runnel of each kernel tile of the plurality of the kernel tiles, perform a fused-multiply-add operation on the each kernel runnel, an input activation map pixel corresponding to the kernel runnel and the output activation map pixel, and the output activation map pixel.

9. The system of claim 7, wherein to perform the fused-multiply-add operations tile by tile on the plurality of kernel tiles and the plurality of clusters of input activation map pixels, the hardware processor is programmed to, iteratively: for each output activation map pixel: set a value of the output activation map pixel to a value of zero; and for each kernel runnel of each kernel tile of the plurality of the kernel tiles, perform a fused-multiply-add operation on the each kernel runnel, at least one input activation map pixel corresponding to the kernel runnel and the output activation map pixel, and the output activation map pixel.

10. The system of claim 9, wherein the at least one input activation map pixel comprises two input activation map pixels.

11. The system of claim 1, wherein a size of the kernel runnel in bits and a size of the input activation map runnel in bits are the same.

12. The system of any claim 11, wherein the size of the kernel runnel in bits and a size of the output activation map runnel in bits are the same.

13. The system of claim 11, wherein the size of the kernel runnel in bits and a size of a register of the hardware processor in bits are the same.

14. The system of claim 13, wherein the size of the register is 128 bits.

15. The system of claim 1, wherein the hardware processor comprises a single instruction, multiple data processor.

16. The system of claim 15, wherein the single instruction, multiple data processor comprises a vector processor.

17. The system of claim 1, wherein the kernels of the kernel stack in the basic kernel layout are arranged in a plurality of kernel stack channels, wherein a number of the plurality of kernel stack channels and a number of the input activation maps are the same, and wherein a number of kernels of a kernel stack channel and a number of the output activation maps are the same.

18. The system of claim 1, wherein a kernel stack width of the kernel stack and a number of the output activation maps are the same.

19. The system of claim 1, wherein the kernels of the kernel stack in the basic kernel layout are arranged in a plurality of kernel stack filter banks, wherein a number of the plurality of kernel stack filter banks and a number of the output activation maps are the same, and wherein a number of kernels of a kernel stack filter bank and a number of the input activation maps are the same.

20. The system of claim 1, wherein a kernel stack height of the kernel stack and a number of the input activation maps are the same.
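As an illustration only, the kernel reordering and the tile-by-tile fused-multiply-add loop recited in the claims might be sketched as follows. Several points are assumptions rather than statements from the patent: a runnel holds four weights (a 128-bit register of 32-bit values), one tile is formed per kernel position, the number of output maps is a multiple of the runnel size, and the convolution uses stride 1 with no padding. The names `to_tile_layout`, `conv_tiled`, and `conv_direct` are hypothetical, and the inner Python loop merely stands in for a single SIMD instruction.

```python
RUNNEL = 4  # assumed: 128-bit register / 32-bit weight values


def to_tile_layout(weights):
    """Reorder weights[in_ch][out_ch][ky][kx] into tiles of runnels.

    Innermost traversal runs along the kernel stack width (output maps),
    then the stack height (input maps), then kernel width, then height.
    """
    n_in, n_out = len(weights), len(weights[0])
    k_h, k_w = len(weights[0][0]), len(weights[0][0][0])
    assert n_out % RUNNEL == 0, "sketch assumes no runnel padding is needed"
    tiles = []
    for ky in range(k_h):                          # kernel height
        for kx in range(k_w):                      # kernel width
            tile = []
            for i in range(n_in):                  # kernel stack height
                for o in range(0, n_out, RUNNEL):  # kernel stack width
                    runnel = [weights[i][o + j][ky][kx] for j in range(RUNNEL)]
                    tile.append((i, o, runnel))
            tiles.append((ky, kx, tile))
    return tiles


def conv_tiled(clusters, tiles, n_out, k_h, k_w):
    """clusters[row][col][in_ch] -> out[row][col][out_ch], tile by tile."""
    out_h = len(clusters) - k_h + 1
    out_w = len(clusters[0]) - k_w + 1
    out = []
    for r in range(out_h):
        out_row = []
        for c in range(out_w):
            acc = [0] * n_out                      # output pixel starts at zero
            for ky, kx, tile in tiles:
                for i, o, runnel in tile:
                    pixel = clusters[r + ky][c + kx][i]
                    for j in range(RUNNEL):        # stands in for one SIMD FMA
                        acc[o + j] += runnel[j] * pixel
            out_row.append(acc)
        out.append(out_row)
    return out


def conv_direct(clusters, weights):
    """Reference result computed directly from the basic kernel layout."""
    n_in, n_out = len(weights), len(weights[0])
    k_h, k_w = len(weights[0][0]), len(weights[0][0][0])
    out_h = len(clusters) - k_h + 1
    out_w = len(clusters[0]) - k_w + 1
    return [[[sum(weights[i][o][ky][kx] * clusters[r + ky][c + kx][i]
                  for i in range(n_in)
                  for ky in range(k_h)
                  for kx in range(k_w))
              for o in range(n_out)]
             for c in range(out_w)]
            for r in range(out_h)]
```

In this sketch each runnel is multiplied by one broadcast input pixel and accumulated into four neighbouring output values, which is why both the input and the output are kept in interleaved (cluster) form.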

Assignees

Magic Leap Inc

Inventors

Classifications

  • Interfaces, programming languages or software development kits, e.g. for simulating neural networks

  • using neural networks

  • Biologically inspired filters, e.g. difference of Gaussians [DoG] or Gabor filters

  • Hierarchical techniques, i.e. dividing or merging pattern sets so as to obtain a dendrogram

  • G06N3/084 (Primary): Backpropagation, e.g. using gradient descent

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US10489680B2 cover?
Systems and methods for efficient implementation of a convolutional layer of a convolutional neural network are disclosed. In one aspect, weight values of kernels in a kernel stack of a convolutional layer can be reordered into a tile layout with tiles of runnels. Pixel values of input activation maps of the convolutional layer can be reordered into an interleaved layout comprising a plurality …
Who is the assignee on this patent?
Magic Leap Inc
What technology area does this patent fall under?
Primary CPC classification G06N3/084. Mapped technology areas include Physics.
When was this patent published?
Publication date: Nov 26, 2019 (kind code B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 12 related publications on this page (citations in our corpus or others sharing the same primary CPC).