US2017220719A1 · US · A1
| Field | Value |
|---|---|
| Publication number | US-2017220719-A1 |
| Application number | US-201615011724-A |
| Country | US |
| Kind code | A1 |
| Filing date | Feb 1, 2016 |
| Priority date | Feb 1, 2016 |
| Publication date | Aug 3, 2017 |
| Grant date | — |
Official abstract text for this publication.
Described herein are a processor and a method of operating the processor to simulate a many-core target machine. The processor includes a plurality of processing cores arranged in a predetermined manner, a global target clock counter (GTCC) configured to count a number of simulated clock cycles in the target machine, and a global stall controller (GSC) configured to halt execution of all the processing cores based on a determination of at least one processing core being in a fault condition. The processor acquires a base clock per instruction (CPI) of the target machine, the CPI corresponding to an average number of clock cycles required by the target machine to execute a single instruction; translates an application of the target machine to a compact executable trace to be executed by the processor; and adjusts a speed of simulation by adjusting an update rate of the global target clock counter.
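The relationship between the base CPI, the GTCC, and the GSC described in the abstract can be illustrated with a minimal sketch. Everything named here (`base_cpi`, `GlobalTargetClockCounter`, `global_stall`, and the choice to express the update rate as simulated target cycles per host cycle) is a hypothetical illustration and not taken from the patent. The sketch also assumes that a global stall freezes the counter, which the abstract does not state.

```python
from dataclasses import dataclass


def base_cpi(total_target_cycles: int, instructions: int) -> float:
    """Average number of target clock cycles per executed instruction."""
    if instructions <= 0:
        raise ValueError("instructions must be positive")
    return total_target_cycles / instructions


@dataclass
class GlobalTargetClockCounter:
    """Hypothetical GTCC model: counts simulated target clock cycles."""
    update_rate: float      # assumed unit: target cycles added per host cycle
    cycles: float = 0.0
    stalled: bool = False   # assumption: a global stall also freezes the count

    def tick(self, host_cycles: int = 1) -> float:
        """Advance the counter by update_rate for each host cycle."""
        if not self.stalled:
            self.cycles += self.update_rate * host_cycles
        return self.cycles


def global_stall(core_faults: list[bool]) -> bool:
    """GSC decision: halt all cores if at least one core reports a fault."""
    return any(core_faults)


cpi = base_cpi(total_target_cycles=1500, instructions=1000)
gtcc = GlobalTargetClockCounter(update_rate=cpi)
gtcc.tick(4)                                   # counter advances normally
gtcc.stalled = global_stall([False, True, False])
gtcc.tick(4)                                   # no progress while stalled
gtcc.stalled = False
gtcc.update_rate = cpi / 2                     # slower simulation: lower update rate
gtcc.tick(4)
```

Lowering `update_rate` makes simulated time advance more slowly per host cycle, which is the sense in which changing the update rate changes the simulation speed in this sketch.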
Opening claim text (preview).
What is claimed is:
1. A device for simulating a many-core target machine, the device comprising: a processor including: a plurality of processing cores arranged in a predetermined manner; a global target clock counter (GTCC) configured to count a number of simulated clock cycles in the target machine; a global stall controller (GSC) configured to halt execution of all the processing cores based on a determination of at least one processing core being in a fault condition; and wherein the processor is configured to: acquire a base clock per instruction (CPI) of a target machine, the CPI corresponding to an average number of clock cycles required by the target machine to execute a single instruction, translate an application of the target machine to a compact executable trace to be executed by the processor, determine whether to query an off-chip memory based on detecting a cache miss event, determine whether to adjust a simulation speed based on receiving a control signal from a router, and adjust dynamically, a speed of simulation of the processor by adjusting an update rate of the global target clock counter.
2. The device of claim 1, wherein the CPI of the target machine is acquired by simulating on a timing simulator, a benchmark of the target machine, the simulation being performed by ignoring the cache miss event.
3. The device of claim 1, wherein the processor is further configured to profile each instruction of the target application to generate a profiled image of the application, the profiled image including an object for each unique instruction of the target application, and wherein each instruction of the target application is mapped to a unique address in the profiled image via a hash function.
4. The device of claim 3, wherein the processor is further configured to refine the profiled image to generate instructions for the processor to execute.
5. The device of claim 1, wherein a first core of the plurality of cores is a master core configured to execute a master thread of the application of the target machine.
6. The device of claim 5, wherein the other cores of the plurality of cores are worker cores configured to execute parallel portions of the application of the target machine.
7. The device of claim 6, wherein the plurality of cores are arranged in a ring-network.
8. The device of claim 1, wherein each processing core of the plurality of cores is configured to evaluate an amount of time required by the target machine to execute an instruction.
9. The device of claim 1, wherein the processor is further configured to set the update rate of the GTCC to an initial value that is based on a number of target clock cycles required to execute a predetermined number of instructions, and a number of host cycles required to execute a single instruction.
10. The device of claim 9, wherein the processor is further configured to: reduce the simulation speed by half the initial value, based on the GSC receiving the request, and increase the simulation speed two-folds the initial value based on the GSC not receiving the request in a predetermined amount of time.
11. A method for simulating a many-core target machine, the method being performed by a processor, the method comprising: acquiring a base clock per instruction (CPI) of a target machine, the CPI corresponding to an average number of clock cycles required by the target machine to execute a single instruction, translating an application of the target machine to a compact executable trace to be executed by the processor, determining whether to query an off-chip memory based on detecting a cache miss event, determining, by the processor whether to adjust a simulation speed based on receiving a control signal from a router, and adjusting dynamically, by the processor, a speed of simulation by adjusting an update rate of a global target clock counter (GTCC).
12. The method of claim 11, further comprising: profiling each instruction of the target application to generate a profiled image of the application, the profiled image including an object for each unique instruction of the target application, and wherein each instruction of the target application is mapped to a unique address in the profiled image via a hash function.
13. The method of claim 12, further comprising: refining the profiled image to generate instructions for the processor to execute.
14. The method of claim 11, further comprising: setting by the processor, the update rate of the GTCC to an initial value that is based on a number of target clock cycles required to execute a predetermined number of instructions, and a number of host cycles required to execute a single instruction.
15. The method of claim 14, further comprising: reducing the simulation speed by half the initial value, based on a global stall controller (GSC) included in the processor, receiving the control signal, and increasing the simulation speed two-folds the initial value based on the GSC not receiving the control signal in a predetermined amount of time.
16. A non-transitory computer readable medium having stored thereon a program that when executed by a computer, causes the computer to execute a method of simulating a many-core target machine, the method comprising: acquiring a base clock per instruction (CPI) of a target machine, the CPI corresponding to an average number of clock cycles required by the target machine to execute a single instruction, translating an application of the target machine to a compact executable trace to be executed by the processor, determining whether to query an off-chip memory based on detecting a cache miss event, determining, whether to adjust a simulation speed based on receiving a control signal from a router, and adjusting dynamically, a speed of simulation by adjusting an update rate of a global target clock counter (GTCC).
17. The non-transitory computer readable medium of claim 16, the method further comprising: profiling each instruction of the target application to generate a profiled image of the application, the profiled image including an object for each unique instruction of the target application, and wherein each instruction of the target application is mapped to a unique address in the profiled image via a hash function.
18. The non-transitory computer readable medium of claim 16, the method further comprising: refining the profiled image to generate instructions for the processor to execute.
19. The non-transitory computer readable medium of claim 16, the method further comprising: setting the update rate of the GTCC to an initial value that is based on a number of target clock cycles required to execute a predetermined number of instructions, and a number of host cycles required to execute a single instruction.
20. The non-transitory computer readable medium of claim 16, the method further comprising: reducing the simulation speed by half the initial value, based on a global stall controller (GSC) included in the processor, receiving the control signal, and increasing the simulation speed two-folds the initial value based on the GSC not receiving the control signal in a predetermined amount of time.
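The speed-adjustment rule recited in claims 9-10, 14-15, and 19-20 can be sketched as follows. This is one possible reading, not the patented implementation: the claims say only that the initial update rate is "based on" the target cycles for a predetermined number of instructions and the host cycles per instruction, so the ratio used below is an assumption. The class name `SpeedController`, its parameters, and the step-counted timeout are likewise hypothetical. The sketch reads "half the initial value" and "two-folds the initial value" as rates relative to the initial value rather than to the current rate.

```python
class SpeedController:
    """Hypothetical model of the GTCC update-rate policy in claims 9 and 10."""

    def __init__(self, target_cycles: int, instructions: int,
                 host_cycles_per_instruction: int, timeout_steps: int):
        # Assumed formula: simulated target cycles per host cycle.
        self.initial_rate = target_cycles / (instructions * host_cycles_per_instruction)
        self.rate = self.initial_rate
        self.timeout_steps = timeout_steps   # the "predetermined amount of time"
        self.quiet_steps = 0                 # steps since the last control signal

    def step(self, control_signal: bool) -> float:
        """Update the rate for one step and return the new value."""
        if control_signal:
            # Signal received: drop to half the initial value.
            self.rate = self.initial_rate / 2
            self.quiet_steps = 0
        else:
            self.quiet_steps += 1
            if self.quiet_steps >= self.timeout_steps:
                # No signal within the timeout: rise to twice the initial value.
                self.rate = self.initial_rate * 2
        return self.rate


ctl = SpeedController(target_cycles=2000, instructions=1000,
                      host_cycles_per_instruction=4, timeout_steps=3)
rates = [ctl.step(s) for s in (True, False, False, False)]
```

With these illustrative inputs the initial rate is 0.5, the signal on the first step lowers it to 0.25, and three quiet steps raise it to 1.0.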
Processors · CPC title
Timing analysis or timing optimisation · CPC title
Design verification, e.g. functional simulation or model checking · CPC title
where hardware is a sequential transfer control unit, e.g. microprocessor, peripheral processor or state-machine · CPC title
using additional hardware · CPC title