Implementing fine grain data coherency of a shared memory region
US-2020327048-A1 · Oct 15, 2020 · US
US11625587B2 · US · B2
| Field | Value |
|---|---|
| Publication number | US-11625587-B2 |
| Application number | US-202016745675-A |
| Country | US |
| Kind code | B2 |
| Filing date | Jan 17, 2020 |
| Priority date | Jan 3, 2020 |
| Publication date | Apr 11, 2023 |
| Grant date | Apr 11, 2023 |
Official abstract text for this publication.
An artificial intelligence integrated circuit is provided. The artificial intelligence integrated circuit includes a flash memory, a dynamic random access memory (DRAM), and a memory controller. The flash memory is configured to store a logical-to-physical mapping (L2P) table that is divided into a plurality of group-mapping (G2P) tables. The memory controller includes a first processing core and a second processing core. The first processing core receives a host access command from a host. When a specific G2P table corresponding to a specific logical address in the host access command is not stored in the DRAM, the first processing core determines whether the second processing core has loaded the specific G2P table from the flash memory to the DRAM according to the values in a first column in a first bit map and in a second column of a second bit map.
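The lookup described in the abstract can be sketched as follows. This is only an illustration under stated assumptions, not the patented method: the group size `ENTRIES_PER_GROUP`, the function names, and the reading that a G2P table counts as loaded when both bit maps hold a 1 in the column for that group are all hypothetical, since the abstract does not give these details.

```python
from typing import Sequence

# Hypothetical group size: the abstract does not say how many L2P entries
# one G2P table covers.
ENTRIES_PER_GROUP = 1024


def group_index(logical_address: int) -> int:
    """Return the index of the G2P table that covers a logical address."""
    return logical_address // ENTRIES_PER_GROUP


def loaded_by_other_core(
    group: int,
    first_bitmap: Sequence[int],
    second_bitmap: Sequence[int],
) -> bool:
    """Check both bit maps for the column belonging to one G2P table.

    Assumption (not stated in the abstract): each bit map has one column
    per group, and the table is treated as loaded into DRAM only when
    both columns hold 1.
    """
    return first_bitmap[group] == 1 and second_bitmap[group] == 1


if __name__ == "__main__":
    first = [0, 1, 1, 0]
    second = [0, 0, 1, 0]
    group = group_index(2050)  # 2050 // 1024 selects group 2
    print(group, loaded_by_other_core(group, first, second))
```

With these assumed values, logical address 2050 falls in group 2, which both bit maps mark, so the first core would treat that table as already loaded.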
Opening claim text (preview).
What is claimed is:

1. An artificial intelligence integrated circuit, comprising:
a command processor, configured to analyze a command queue to generate one or more tasks;
a plurality of processing elements, each processing element being disposed in parallel;
a task constructor, configured to receive the task from the command processor to generate a plurality of threads to control the processing elements;
a level-1 (L1) cache; and
a level-2 (L2) cache;
wherein each processing element comprises: a plurality of arithmetic logic units (ALUs), configured to perform arithmetic and logic operations; a plurality of deep-learning accelerators, configured to perform hardware multiplication-addition operations, activation functions, and pooling; a common register file, configured to store data and intermediate results of operations performed by the ALUs and deep-learning accelerators; and an access controller, configured to control data access to the L1 cache and the L2 cache;
wherein the access controller is configured to control the L1 cache and L2 cache to dynamically prefetch data stored in a memory unit external to the artificial intelligence integrated circuit, and the prefetched data is for use by matrix multiplication-addition operations performed by the deep-learning accelerators;
wherein the L1 cache comprises a first preload circuit and the L2 cache comprises a second preload circuit, and the first preload circuit and the second preload circuit prefetch data from the L2 cache and the memory unit, respectively;
wherein when the access controller is tasked to write first data to the L1 cache, the first preload circuit sends the first data to a first data compressor for a first data compression process to generate second data, and the first data compressor writes the second data to the L2 cache;
wherein the second preload circuit sends the second data to a second data compressor for a second data compression process to generate third data, and the second data compressor writes the third data to the memory unit.

2. The artificial intelligence integrated circuit as claimed in claim 1, wherein the memory unit is a dynamic random access memory.

3. The artificial intelligence integrated circuit as claimed in claim 1, wherein the memory unit is a host buffer memory of a host that is electrically connected to the artificial intelligence integrated circuit.

4. The artificial intelligence integrated circuit as claimed in claim 1, wherein the first data compression process is tasked to compress the first data using a compression algorithm for expanded matrix data to generate the second data, and the second data compression process is tasked to compress the second data using a residue-based image-compression algorithm and a sparse-matrix-compression algorithm to generate the third data.

5. The artificial intelligence integrated circuit as claimed in claim 1, wherein when the access controller is tasked to read the third data stored in the memory unit, the second preload circuit sends the third data to a second decompression circuit to perform a second data decompression process on the third data to obtain the second data, wherein the first preload circuit directly transmits the second data to a first decompression circuit in each processing element to perform a first data decompression process on the second data to obtain the first data, and stores the first data in the common register file of each processing element.

6. The artificial intelligence integrated circuit as claimed in claim 1, wherein the artificial intelligence integrated circuit supports application programming interfaces (API) of OpenCL, CUDA, and DirectCompute.

7. The artificial intelligence integrated circuit as claimed in claim 1, wherein the artificial intelligence integrated circuit does not comprise a three-dimensional (3D) graphics rendering module.

8. The artificial intelligence integrated circuit as claimed in claim 4, wherein the deep-learning accelerator in each processing element comprises:
a matrix multiplication-addition calculator, configured to perform a matrix multiplication-addition calculation on the first data to obtain a first matrix calculation result;
an activation-function circuit, configured to perform activation on the first matrix calculation result to generate a second matrix calculation result; and
a pooling circuit, configured to perform pooling on the second matrix calculation result to generate a final result, and to store the final result in the common register file.

9. The artificial intelligence integrated circuit as claimed in claim 8, wherein in response to the first data for matrix convolution calculation stored in the common register file being ready, the deep-learning accelerator loads the first data to a register file in the deep-learning accelerator, and loads the first data from the register file to the matrix multiplication-addition calculator to perform matrix multiplication-addition operations.

10. The artificial intelligence integrated circuit as claimed in claim 8, wherein the first preload circuit and the second preload circuit can be set to a hardware mode or a software mode, wherein in response to the first preload circuit and the second preload circuit being set to the hardware mode, the first preload circuit and the second preload circuit perform address prediction using the previously fetched data, and respectively prefetch data from the L2 cache and the memory unit according to the predicted address, wherein in response to the first preload circuit and the second preload circuit being set to the software mode, the first preload circuit and the second preload circuit respectively fetch data from the L2 cache and the memory unit according to hint information from software.

11. The artificial intelligence integrated circuit as claimed in claim 8, wherein the matrix multiplication-addition calculator supports matrix multiplication in any matrix size and accelerated multiplication of sparse matrices, and determines calculations of loops according to size and sparsity of matrices.

12. The artificial intelligence integrated circuit as claimed in claim 8, wherein the activation-function circuit supports rectified linear unit (ReLU), sigmoid, and tanh functions.

13. The artificial intelligence integrated circuit as claimed in claim 8, wherein the pooling circuit performs mean pooling or max pooling on the second matrix calculation result to generate the final result.
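The three-stage data path recited in claims 8, 12, and 13 (matrix multiplication-addition, then activation, then pooling) can be modeled in software as a rough sketch. The claims describe hardware circuits, so everything here is an assumption made for illustration: the function names, the form `A x B + C` for the multiplication-addition step, and the non-overlapping square pooling window are hypothetical and do not come from the patent.

```python
import math
from typing import Callable, Dict, List

Matrix = List[List[float]]


def matmul_add(a: Matrix, b: Matrix, c: Matrix) -> Matrix:
    """Compute A x B + C (assumed form of the multiplication-addition step)."""
    inner, cols = len(b), len(b[0])
    return [
        [sum(a[i][k] * b[k][j] for k in range(inner)) + c[i][j]
         for j in range(cols)]
        for i in range(len(a))
    ]


# The three activation functions named in claim 12.
ACTIVATIONS: Dict[str, Callable[[float], float]] = {
    "relu": lambda x: max(0.0, x),
    "sigmoid": lambda x: 1.0 / (1.0 + math.exp(-x)),
    "tanh": math.tanh,
}


def activate(m: Matrix, name: str) -> Matrix:
    """Apply the named activation to every element."""
    f = ACTIVATIONS[name]
    return [[f(v) for v in row] for row in m]


def pool(m: Matrix, size: int, mode: str) -> Matrix:
    """Max or mean pooling over non-overlapping size x size windows.

    Assumes both matrix dimensions are multiples of `size`.
    """
    if mode not in ("max", "mean"):
        raise ValueError("mode must be 'max' or 'mean'")
    out: Matrix = []
    for r in range(0, len(m), size):
        out_row = []
        for c in range(0, len(m[0]), size):
            window = [m[r + i][c + j] for i in range(size) for j in range(size)]
            out_row.append(max(window) if mode == "max" else sum(window) / len(window))
        out.append(out_row)
    return out


def accelerator_pass(a: Matrix, b: Matrix, c: Matrix,
                     activation: str = "relu",
                     pooling: str = "max", window: int = 2) -> Matrix:
    """Chain the three stages in the order given in claim 8."""
    first_result = matmul_add(a, b, c)                  # first matrix calculation result
    second_result = activate(first_result, activation)  # second matrix calculation result
    return pool(second_result, window, pooling)         # final result


if __name__ == "__main__":
    a = [[1.0, -2.0], [3.0, 4.0]]
    identity = [[1.0, 0.0], [0.0, 1.0]]
    zeros = [[0.0, 0.0], [0.0, 0.0]]
    print(accelerator_pass(a, identity, zeros))                  # [[4.0]]
    print(accelerator_pass(a, identity, zeros, pooling="mean"))  # [[2.0]]
```

In this toy run the ReLU stage zeroes the negative entry, so max pooling over the single 2 x 2 window yields 4.0 and mean pooling yields 2.0.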
Quantised networks; Sparse networks; Compressed networks · CPC title
Convolutional networks [CNN, ConvNet] · CPC title
using a plurality of independent parallel functional units · CPC title
using electronic means · CPC title
to perform operations on data operands · CPC title