Multistage collector for outputs in multiprocessor systems

US9595074B2 · US · B2

Patent metadata
Publication number: US-9595074-B2
Application number: US-201213611325-A
Country: US
Kind code: B2
Filing date: Sep 12, 2012
Priority date: Sep 16, 2011
Publication date: Mar 14, 2017
Grant date: Mar 14, 2017

Abstract

Aspects include a multistage collector to receive outputs from plural processing elements. Processing elements may comprise (each or collectively) a plurality of clusters, with one or more ALUs that may perform SIMD operations on a data vector and produce outputs according to the instruction stream being used to configure the ALU(s). The multistage collector includes substituent components each with at least one input queue, a memory, a packing unit, and an output queue; these components can be sized to process groups of input elements of a given size, and can have multiple input queues and a single output queue. Some components couple to receive outputs from the ALUs and others receive outputs from other components. Ultimately, the multistage collector can output groupings of input elements. Each grouping of elements (e.g., at input queues, or stored in the memories of components) can be formed based on matching of index elements.
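As a rough illustration of one collector component from the abstract, the Python sketch below models it in software: several input queues, a memory that buffers elements by index, a packing unit, and a single output queue. It is a hypothetical model, not the patented hardware; the class name `Collector`, the `packet_size` and `num_inputs` parameters, and the `shape_A`/`shape_B` labels are assumptions made for illustration. Following claim 6, the index can be read as a shape and the elements as ray identifiers.

```python
from collections import defaultdict, deque


class Collector:
    """Hypothetical software model of one collector component (not the patented design).

    Outputs arrive as (index, elements) pairs on one of several input queues.
    Elements that share an index are buffered together in memory; whenever a
    buffered collection reaches packet_size, the packing unit moves exactly
    packet_size elements to the single output queue as one packet.
    """

    def __init__(self, packet_size, num_inputs=1):
        self.packet_size = packet_size
        self.input_queues = [deque() for _ in range(num_inputs)]
        self.memory = defaultdict(list)  # index -> buffered elements
        self.output_queue = deque()

    def push(self, port, index, elements):
        """Enqueue one (index, elements) output on input queue `port`."""
        self.input_queues[port].append((index, list(elements)))

    def step(self):
        """Drain all input queues, grouping elements whose indexes match."""
        for queue in self.input_queues:
            while queue:
                index, elements = queue.popleft()
                buffered = self.memory[index]
                buffered.extend(elements)
                # Packing unit: emit full packets of exactly packet_size elements.
                while len(buffered) >= self.packet_size:
                    self.output_queue.append((index, buffered[:self.packet_size]))
                    del buffered[:self.packet_size]
                if not buffered:
                    del self.memory[index]


# Usage: index = hypothetical shape identifier, elements = ray identifiers.
c = Collector(packet_size=4, num_inputs=2)
c.push(0, "shape_A", [1, 2])
c.push(1, "shape_B", [10])
c.push(1, "shape_A", [3, 4, 5])
c.step()
print(list(c.output_queue))  # [('shape_A', [1, 2, 3, 4])]
print(dict(c.memory))        # {'shape_A': [5], 'shape_B': [10]}
```

Elements left over after packing stay in memory until a later output with a matching index tops them up, which loosely mirrors the buffering described in claim 2.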

Claims

We claim:

1. A method of increasing processing throughput in a multiprocessor system having a plurality of computation units each processing different computation tasks asynchronously, comprising: asynchronously receiving outputs from a plurality of said computation units of the multiprocessor system, each of the outputs comprising an index element and one or more constituent elements associated with that index element, wherein an index element describes a computation task to be performed for the one or more constituent elements; grouping at least some constituent elements of said asynchronously received outputs into packets by comparing their respective index elements and grouping into individual packets those constituent elements associated with matching index elements, said individual packets being associated with respective matching index elements and having a first predetermined size; grouping at least some individual packets of constituent elements into larger individual packets until packets of a predetermined maximum size have been assembled by comparing respective index elements of packets of similar size and grouping packets associated with matching index elements into said larger individual packets; and outputting individual packets of said maximum size as output packets, whereby processing throughput of output packets by said multiprocessor system is increased.

2. The method of increasing processing throughput of claim 1, further comprising buffering constituent elements of a packet received at a first time and combining the buffered constituent elements with constituent elements from a packet received at a second time having an index element matching the index element of the packet received at the first time.

3. The method of increasing processing throughput of claim 1, wherein the grouping is performed by collector units arranged in an interconnected series, each of the collector units having a memory and being operable to identify collections to evict from its memory, responsive to a collection eviction process.

4. The method of increasing processing throughput of claim 3, wherein the interconnected series of collector units is arranged in an inverted hierarchy, beginning with a layer of collector units receiving smaller packets and terminating with one or more collectors outputting one or more larger packets, each containing constituent data elements of a plurality of the smaller packets.

5. The method of increasing processing throughput of claim 4, further comprising applying backpressure between collector units in different layers of the inverted hierarchy to regulate progress of packets through the plurality of collectors.

6. The method of increasing processing throughput of claim 1, wherein one or more of the descriptions of computation task to be performed comprises computer graphics ray tracing information identifying a shape, and the constituent elements comprise identifiers for rays to be tested for intersection with the shape.

7. The method of increasing processing throughput of claim 6, further comprising selecting each description of a computation task from a plurality of pre-defined types of computation tasks comprising testing a ray for intersection with one or more shapes identified by the constituent elements.

8. The method of increasing processing throughput of claim 6, wherein each description of computation to be performed comprises a reference to a memory location, in a region of a memory reserved for storing a defined kind of shape data used during graphical rendering of a scene defined using the shape data.

9. The method of increasing processing throughput of claim 8, wherein the defined kind of shape data is selected from bounding volume shape data and primitive shape data.

10. The method of increasing processing throughput of claim 1, wherein each description of computation to be performed comprises a reference to a memory location.

11. A computing system, comprising: a plurality of computation clusters, each for outputting discretized results of performing computation tasks asynchronously, each discretized result comprising a collection index describing a respective computation task to be performed and a data element for use during performance of the computation task described by the collection index; and a plurality of collectors, some of the collectors coupled to asynchronously receive the discretized output outputted from respective computation clusters of the plurality, the collectors interoperating to gather the data elements from multiple discretized outputs into progressively larger collections, each collector comprising an index matcher that matches two or more collection indexes to identify common collection indexes, and a grouper configured to group data elements related to the same collection index for output as a group with a predetermined maximum size in conjunction with that collection index.

12. The computing system of claim 11, wherein collectors of the plurality are operable to activate a stall line that prohibits one or more collectors from outputting a collection of discretized outputs.

13. The computing system of claim 12, wherein collectors of the plurality are operable to compact information from the discretized outputs by collecting non-redundant information from multiple discretized outputs, and to output a compacted collection of information on an output that is conditioned based on monitoring the stall line.

14. The computing system of claim 11, wherein each computation cluster comprises a SIMD ALU, a port for reading to and writing from a memory subsystem, and an output port.

15. The computing system of claim 11, wherein each collector ingests discretized results of up to a first size and produces outputs fewer in number and larger than the first size.

16. The computing system of claim 11, wherein the plurality of collectors are arranged in an inverted hierarchy, comprising a first layer of collectors, each collector coupled to a respective output port from a computation cluster of the plurality, and comprising a memory and a packing unit operable to receive discretized outputs from the coupled output port and to collect each discrete output into a collection according to an index associated with the discretized output, and one or more subsequent layers of collector, each coupled to receive increasingly larger collections of the discrete outputs, wherein each of the collectors is operable to identify collections to evict from its memory, responsive to a collection eviction process.

17. The system of claim 16, further comprising a distributor coupled to a final collector of the inverted hierarchy, and operable to distribute data elements from received groups of data elements among the plurality of computation clusters according to which of the computation clusters is to execute further computation involving each data element.

18. The computing system of claim 16, wherein the collection eviction process comprises each collector unit independently evicting collections in its memory that are full.

19. The computing system of claim 11, wherein each discrete output comprises a results vector, with a number of results up to a SIMD vector width of the computation cluster outputting that discrete output.
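Claims 1, 4, 5, 12, 16 and 18 together describe a tree of collectors: a first layer packs outputs into small packets by matching index, later layers merge packets with matching indexes until a maximum size is reached, full collections are evicted independently, and a stall line applies backpressure between layers. The Python sketch below is a minimal software model under stated assumptions: only two layers, a leaf packet size of 4 and a maximum size of 8 (the claims say only "predetermined"), a bounded output queue whose full state stands in for the stall line, and a downstream consumer that drains one packet per cycle. All names (`StagedCollector`, `run_hierarchy`, `leaf_size`, `max_size`, `drain_per_cycle`) are hypothetical.

```python
from collections import defaultdict, deque


class StagedCollector:
    """Hypothetical collector: buffers elements by index, emits packets of `size`."""

    def __init__(self, size, out_capacity=2):
        self.size = size
        self.out_capacity = out_capacity  # bounded output queue (assumed depth)
        self.memory = defaultdict(list)   # index -> buffered constituent elements
        self.out = deque()

    def stall_line(self):
        """Asserted while the output queue is full; upstream must hold its packets."""
        return len(self.out) >= self.out_capacity

    def accept(self, index, elements):
        self.memory[index].extend(elements)

    def evict_full(self):
        """Independently evict every full collection while not stalled."""
        for index in list(self.memory):
            buf = self.memory[index]
            while len(buf) >= self.size and not self.stall_line():
                self.out.append((index, buf[:self.size]))
                del buf[:self.size]
            if not buf:
                del self.memory[index]

    def pending(self):
        """True if packets are queued or a full collection is still buffered."""
        return bool(self.out) or any(len(b) >= self.size for b in self.memory.values())


def run_hierarchy(outputs, leaf_size=4, max_size=8, drain_per_cycle=1):
    """Feed (cluster_id, index, elements) outputs through a two-layer hierarchy.

    Each cluster gets its own leaf collector; all leaves feed one root collector.
    Returns the max_size packets emitted by the root, in emission order.
    """
    if drain_per_cycle < 1:
        raise ValueError("drain_per_cycle must be at least 1 or the root never empties")
    leaves = defaultdict(lambda: StagedCollector(leaf_size))
    root = StagedCollector(max_size)
    emitted = []

    def cycle():
        for leaf in leaves.values():
            leaf.evict_full()
            # Backpressure: a leaf forwards packets only while the root's stall line is low.
            while leaf.out and not root.stall_line():
                root.accept(*leaf.out.popleft())
        root.evict_full()
        # Downstream consumer (e.g. a distributor) takes a limited number per cycle.
        for _ in range(min(drain_per_cycle, len(root.out))):
            emitted.append(root.out.popleft())

    for cluster, index, elements in outputs:
        leaves[cluster].accept(index, elements)
        cycle()
    # Keep cycling until no queued packets or full collections remain.
    while root.pending() or any(leaf.pending() for leaf in leaves.values()):
        cycle()
    return emitted


outputs = [
    (0, "A", [1, 2]), (1, "A", [3, 4]), (0, "A", [5, 6]),
    (1, "B", [7]), (1, "A", [8, 9]), (0, "A", [10, 11]),
]
print(run_hierarchy(outputs))  # [('A', [1, 2, 5, 6, 3, 4, 8, 9])]
```

In the usage lines, the partial collections (`[10, 11]` for index `A` on cluster 0 and `[7]` for index `B` on cluster 1) stay buffered because the sketch has no flush or timeout policy; how a real collector handles partly filled collections is outside what this sketch models.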

Classifications

  • Parallel processing (CPC title)
  • G06T1/60 (primary): Memory management
  • G06T1/20 (primary): Processor architectures; Processor configuration, e.g. pipelining
  • G06T15/06 (primary): Ray-tracing

Frequently asked questions

Who is the assignee on this patent?
Mccombe James Alexander, Clohset Steven John, Redgrave Jason Rupert, and 2 more
What technology area does this patent fall under?
Primary CPC classification G06T1/60. Mapped technology areas include Physics.
When was this patent published?
Published March 14, 2017 (kind code B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 8 related publications on this page (citations in our corpus or others sharing the same primary CPC).