Storage organization for transposing a matrix using a streaming engine
US-10942741-B2 · Mar 9, 2021 · US
US12045617B2 · US · B2
| Field | Value |
|---|---|
| Publication number | US-12045617-B2 |
| Application number | US-202217670611-A |
| Country | US |
| Kind code | B2 |
| Filing date | Feb 14, 2022 |
| Priority date | Jul 15, 2013 |
| Publication date | Jul 23, 2024 |
| Grant date | Jul 23, 2024 |
Official abstract text for this publication.
Software instructions are executed on a processor within a computer system to configure a streaming engine with stream parameters to define a multidimensional array. The stream parameters define a size for each dimension of the multidimensional array and a specified width for two selected dimensions of the array. Data is fetched from a memory coupled to the streaming engine responsive to the stream parameters. A stream of vectors is formed for the multidimensional array responsive to the stream parameters from the data fetched from memory. When either selected dimension in the stream of vectors exceeds a respective specified width, the streaming engine inserts null elements into each portion of a respective vector for the selected dimension that exceeds the specified width in the stream of vectors. Stream vectors that are completely null are formed by the streaming engine without accessing the system memory for respective data.
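The padding behavior described in the abstract can be modeled in software. The following is a minimal sketch, not the hardware design: the function name `form_padded_stream`, its parameters, and the flat-list memory model are all assumptions made for illustration. It pads each fetched row with null elements up to the full dimension size and emits completely null vectors without touching memory, counting row fetches so the saving is visible.

```python
from typing import List, Tuple


def form_padded_stream(
    memory: List[int],
    base: int,
    row_pitch: int,
    data_cols: int,    # specified width of the first selected dimension
    data_rows: int,    # specified width of the second selected dimension
    padded_cols: int,  # full size of the first dimension in the stream
    padded_rows: int,  # full size of the second dimension in the stream
    null: int = 0,
) -> Tuple[List[List[int]], int]:
    """Return (vectors, memory_reads) for a 2-D array padded with nulls.

    Hypothetical software model: one "memory read" is one row fetch.
    """
    if padded_cols < data_cols or padded_rows < data_rows:
        raise ValueError("padded sizes must be at least the data sizes")

    vectors: List[List[int]] = []
    memory_reads = 0
    for r in range(padded_rows):
        if r < data_rows:
            # Real data: fetch the row, then append nulls past the width.
            start = base + r * row_pitch
            row = memory[start:start + data_cols]
            memory_reads += 1
            row = row + [null] * (padded_cols - data_cols)
        else:
            # Completely null vector: formed without any memory access.
            row = [null] * padded_cols
        vectors.append(row)
    return vectors, memory_reads
```

For example, with `memory = list(range(1, 13))`, a row pitch of 4, a 3-by-2 data region, and a 4-by-3 padded array, the model performs two row fetches and emits a third vector that is entirely null.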
Opening claim text (preview).
The invention claimed is:
1. A method comprising: receiving a set of parameters that defines: a first size of an array in a first dimension; a second size of the array in the first dimension; a third size of the array in a second dimension; and a fourth size of the array in the second dimension; until the third size is met: retrieving, by a circuit, from a memory, a respective first set of data along the first dimension, wherein the respective first set of data has the first size, wherein the circuit includes a set of multiplexers; determining whether to align the respective first set of data using the set of multiplexers; and appending, using the set of multiplexers, a respective first set of null elements to the respective first set of data without accessing the memory to produce a respective second set of data having the second size; and after the third size is met and until the fourth size is met: producing, using the set of multiplexers, a respective second set of null elements having the second size without accessing the memory; and providing, by the circuit, the array having the second size in the first dimension and the fourth size in the second dimension including the respective second sets of data and the respective second sets of null elements.
2. The method of claim 1, wherein the set of parameters specifies a null value for the respective first sets of null elements and the respective second sets of null elements.
3. The method of claim 1, wherein each null value of the respective first sets of null elements and the respective second sets of null elements has a value of zero.
4. The method of claim 1, wherein: the retrieving of the respective first set of data includes: generating a first set of addresses; translating the first set of addresses using a table look-aside buffer to generate a second set of addresses; and retrieving the respective first set of data from the memory using the second set of addresses; and the appending of the respective first set of null elements to the respective first set of data does not include an access of the table look-aside buffer.
5. The method of claim 4, wherein the producing of the respective second set of null elements does not include an access of the table look-aside buffer.
6. The method of claim 1, wherein the providing of the array includes providing a stream of vectors corresponding to the array.
7. The method of claim 6, wherein the set of parameters specifies an element size of the array, a number of elements in each vector of the stream of vectors, and a number of vectors in the stream of vectors.
8. The method of claim 1, wherein the receiving of the set of parameters, the retrieving of the respective first sets of data, the appending, the producing of the respective second sets of null data, and the providing of the array are performed in response to a stream open command that specifies a register storing the set of parameters from among a set of registers.
9. The method of claim 1 comprising generating the respective first set of null elements by: maintaining a count based on the first size of the array in the first dimension; and generating a null element based on the count reaching zero.
10. The method of claim 1 comprising generating the respective second set of null elements by: maintaining a count based on the third size of the array in the second dimension; and generating a null element based on the count reaching zero.
11. The method of claim 1, wherein: the memory is an L2 cache memory; and the retrieving of the respective first sets of data retrieves data from the L2 cache memory by bypassing an L1 cache memory coupled to the L2 cache memory.
12. A device comprising: a register configured to store a set of parameters that defines: a first size of an array in a first dimension; a second size of the array in the first dimension; a third size of the array in a second dimension; and a fourth size of the array in the second dimension; a memory; and a circuit coupled to the register and the memory that includes an alignment network, wherein the circuit is configured to: until the third size is met: retrieve, from the memory, a respective first set of data along the first dimension, wherein the respective first set of data has the first size; determine whether to utilize the alignment network to reorder the respective first set of data; and append a respective first set of null elements to the respective first set of data without accessing the memory to produce a respective second set of data having the second size; and after the third size is met and until the fourth size is met: utilize the alignment network to produce a respective second set of null elements having the second size without accessing the memory; and provide the array, such that the array has the second size in the first dimension and the fourth size in the second dimension and includes the respective second sets of data and the respective second sets of null elements.
13. The device of claim 12, wherein the set of parameters specifies a null value for the respective first sets of null elements and the respective second sets of null elements.
14. The device of claim 12, wherein each null value of the respective first sets of null elements and the respective second sets of null elements has a value of zero.
15. The device of claim 12, wherein: the circuit includes a table look-aside buffer; the circuit is configured to retrieve the respective first set of data by: generating a first set of addresses; translating the first set of addresses using the table look-aside buffer to generate a second set of addresses; and retrieving the respective first set of data from the memory using the second set of addresses; and the circuit is configured to append the respective first set of null elements to the respective first set of data without an access of the table look-aside buffer.
16. The device of claim 15, wherein the circuit is configured to produce the respective second set of null elements without an access of the table look-aside buffer.
17. The device of claim 12, wherein the circuit is configured to provide the array as a stream of vectors.
18. The device of claim 17, wherein the set of parameters specifies an element size of the array, a number of elements in each vector of the stream of vectors, and a number of vectors in the stream of vectors.
19. The device of claim 12, wherein the circuit is configured to generate the respective first set of null elements by: maintaining a count based on the first size of the array in the first dimension; and generating a null element based on the count reaching zero.
20. A device comprising: a processor core comprising a register configured to store a set of parameters that defines: a first size of an array in a first dimension; a second size of the array in the first dimension; a third size of the array in a second dimension; and a fourth size of the array in the second dimension; a level one (L1) cache memory coupled to the processor core; a level two (L2) cache memory coupled to the L1 cache memory; and a circuit coupled between the processor core and the L2 cache memory in parallel with the L1 cache memory, wherein the circuit is configured to: receive the set of parameters from the processor core; until the third size is met: retrieve, from L2 cache memory, a respective first set of data along the first dimension, wherein the res
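Claims 4, 5, 9, and 10 describe count-based null generation and address translation that is skipped for null elements. The sketch below is one possible software reading of that behavior, not the claimed circuit: `ToyTranslator`, `PAGE`, `stream_with_counters`, and the page-map dictionary are hypothetical names invented for illustration. Two down-counters, seeded from the first and third sizes, decide when to emit nulls, and the translator counts lookups to show that nulls never reach it.

```python
from typing import Dict, List

PAGE = 4  # hypothetical page size, in elements, chosen only for the demo


class ToyTranslator:
    """Hypothetical stand-in for the look-aside buffer; counts lookups."""

    def __init__(self, page_map: Dict[int, int]) -> None:
        self.page_map = page_map
        self.lookups = 0

    def translate(self, vaddr: int) -> int:
        self.lookups += 1
        return self.page_map[vaddr // PAGE] * PAGE + vaddr % PAGE


def stream_with_counters(
    memory: List[int],
    translator: ToyTranslator,
    base: int,
    pitch: int,
    first: int,   # data elements per row
    second: int,  # padded row length
    third: int,   # data rows
    fourth: int,  # padded row count
    null: int = 0,
) -> List[List[int]]:
    if first > second or third > fourth:
        raise ValueError("padded sizes must be at least the data sizes")
    rows: List[List[int]] = []
    row_count = third  # count based on the third size (claim 10)
    for r in range(fourth):
        if row_count == 0:
            # Whole row is null: no translation and no memory access.
            rows.append([null] * second)
            continue
        col_count = first  # count based on the first size (claim 9)
        row: List[int] = []
        for c in range(second):
            if col_count == 0:
                row.append(null)  # appended without a translator lookup
            else:
                paddr = translator.translate(base + r * pitch + c)
                row.append(memory[paddr])
                col_count -= 1
        rows.append(row)
        row_count -= 1
    return rows
```

In this model the number of translator lookups equals the first size times the third size, regardless of how much padding the second and fourth sizes add.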
controlled by a single instruction for multiple data lanes [SIMD] · CPC title
using a mask · CPC title
Instructions to perform operations on packed data, e.g. vector, tile or matrix operations · CPC title
Matrix or vector computation {, e.g. matrix-matrix or matrix-vector multiplication, matrix factorization (matrix transposition G06F7/78)} · CPC title
of multiple operands or results {(addressing multiple banks G06F12/06)} · CPC title