PCIe error reporting and throttling
US-2017091013-A1 · Mar 30, 2017 · US
US10078543B2 · US · B2
| Field | Value |
|---|---|
| Publication number | US-10078543-B2 |
| Application number | US-201615167601-A |
| Country | US |
| Kind code | B2 |
| Filing date | May 27, 2016 |
| Priority date | May 27, 2016 |
| Publication date | Sep 18, 2018 |
| Grant date | Sep 18, 2018 |
Official abstract text for this publication.
A switched fabric hierarchy (e.g., a PCIe hierarchy) may utilize hardware, firmware, and/or software for filtering duplicative or otherwise undesirable correctable error messages from reaching a root complex. An operating system of the root complex may detect a persistent stream or storm of correctable errors from a particular endpoint and activate filtering of correctable errors from that endpoint. A filtering device may receive filtering commands and parameters from the operating system, implement the filtering, and monitor further correctable errors from the offending device. While an offending device is being filtered, correctable error messages from the offending device may be masked from the operating system, while correctable error messages from other devices in the switched fabric hierarchy may be transmitted. At such time as the filtering device may detect that conditions for ending filtering of a device are met, the filtering device may cease filtering of the offending device and return monitoring responsibilities to the operating system.
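The flow described in the abstract can be sketched as follows. This is a minimal illustration, not the patented implementation: the `CEFilter` class, the `detect_storm` helper, the count-based storm test, and the example routing IDs are all hypothetical names and values chosen for the sketch.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class CEFilter:
    """Hypothetical filtering device that masks CE messages by routing ID (RID)."""
    filtered_rids: set = field(default_factory=set)
    caught: Counter = field(default_factory=Counter)

    def forward(self, rid: str) -> bool:
        """Return True if a CE message from `rid` should reach the root complex."""
        if rid in self.filtered_rids:
            # Masked from the operating system, but still counted so the
            # offending device can be monitored while it is filtered.
            self.caught[rid] += 1
            return False
        return True


def detect_storm(rids, threshold: int) -> set:
    """Assumed storm test: any RID seen at least `threshold` times."""
    counts = Counter(rids)
    return {rid for rid, n in counts.items() if n >= threshold}


# Made-up traffic: one noisy endpoint and one quiet endpoint.
messages = ["01:00.0"] * 5 + ["02:00.0"]
ce_filter = CEFilter()
ce_filter.filtered_rids |= detect_storm(messages, threshold=5)
delivered = [rid for rid in messages if ce_filter.forward(rid)]
```

With these inputs only the message from the quiet endpoint is delivered, while the filter keeps a count of what it masked from the noisy one.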
Opening claim text (preview).
What is claimed is:

1. A system, comprising: a computing node configured as a root complex in a switched fabric hierarchy; one or more endpoint nodes configured as endpoints in the switched fabric hierarchy; a correctable error (“CE”) management module configured, at least in part, to: receive a plurality of error messages associated with the one or more endpoint nodes; detect, among the plurality of error messages, a CE storm associated with an offending device, the offending device associated with a first endpoint node of the one or more endpoint nodes; and identify the offending device as a target for CE filtering; and a CE filtering module configured, at least in part, to: at least in part in response to the error management module's identification of the offending device as a target for CE filtering, prevent transmission to the root complex of at least a portion of the plurality of error messages that are associated with the offending device.

2. The system of claim 1, wherein detecting a CE storm comprises detecting an error event threshold.

3. The system of claim 1, further comprising a CE containment module configured, at least in part, to: receive a filtering activation command from the CE management module; receive at least one CE containment instruction from the CE management module; and transmit at least one CE filtering command to the CE filtering module.

4. The system of claim 3, wherein the at least one CE containment instruction comprises at least one of: a routing identifier (“RID”) of the offending device; an indication of how often the CE containment module should query CE events associated with the offending device; and at least one CE management threshold.

5. The system of claim 3, wherein the at least one CE filtering command comprises at least one of: an instruction to begin filtering CE messages associated with the offending device; an instruction to cease filtering CE messages associated with the offending device.

6. The system of claim 3, wherein the CE containment module is further configured to: monitor a prevalence of CE events associated with the offending device; and when the prevalence of CE events associated with the offending device is lower than a CE event threshold provided by the CE management module, instruct the CE filtering module to cease filtering CE messages associated with the offending device.

7. The system of claim 3, wherein the CE management module is implemented within the root complex.

8. The system of claim 3, wherein the CE containment module is implemented as firmware within the root complex.

9. The system of claim 3, wherein the CE containment module is implemented as firmware within a filtering device separate from the root complex.

10. The system of claim 3, wherein the CE filtering module is implemented as hardware within a filtering device separate from the root complex.

11. The system of claim 3, wherein the CE filtering module is implemented within the offending device.

12. The system of claim 1, wherein detecting a CE storm comprises detecting repetitive error messages from a particular one of the one or more endpoint nodes.

13. A method, comprising: receiving, by a CE management module, a plurality of error messages associated with the one or more endpoint nodes, wherein the CE management module is associated with a root complex in a switched fabric hierarchy; detecting, by the CE management module, a CE storm associated with an offending device, wherein the offending device is associated with a first endpoint node of one or more endpoint nodes in a switched fabric hierarchy; identifying, by the CE management module, the offending device as a target for CE filtering; filtering, by a CE filtering module, correctable error messages associated with the offending device, the filtering at least in part in response to the error management module's identification of the offending device as a target for CE filtering, wherein the filtering comprises preventing transmission to the root complex of at least a portion of the plurality of error messages that are associated with the offending device.

14. The method of claim 13, further comprising: by the CE containment module: receiving a filtering activation command from the CE management module; receiving at least one CE containment instruction from the CE management module, the CE containment instruction comprising: an RID of the offending device; a CE event query frequency; and at least one CE management threshold; and transmitting a begin filtering command to the CE filtering module, wherein the begin filtering command instructs the CE filtering module to begin the filtering of CE messages associated with the offending device.

15. The method of claim 14, further comprising: by the CE containment module: determining a number N, where N represents a number of desired iterations for querying a filter catch signal; querying the filter catch signal associated with the offending device according to the CE event query frequency, the querying occurring at least N times, wherein a CE event counter is incremented in response to receiving an indication, during an iteration of querying of the filter catch signal, that a CE error event has occurred since an immediately previous iteration of querying the filter catch signal; comparing a value of the CE event counter with the value of a CE event threshold; when the value of the CE event counter is lower than the value of the CE event threshold, transmitting a stop filtering command to the CE filtering module, wherein the stop filtering command instructs the CE filtering module to cease the filtering CE messages associated with the offending device.

16. The method of claim 15, further comprising: transmitting, by the CE containment module to the CE management module, a summary of CE events associated with the offending device.

17. The method of claim 15, wherein the CE event threshold is one of the at least one CE management thresholds.

18. An apparatus, comprising: one or more endpoint devices configured as endpoints in a switched fabric hierarchy; a computing device configured as a root complex in the switched fabric hierarchy, the computing node comprising: a root complex processor; a root complex memory, the root complex memory comprising program instructions that when executed by the root complex processor cause the processor to: implement a CE management module configured, at least in part, to: receive a plurality of error messages associated with the one or more endpoint nodes; detect, among the plurality of error messages, a CE storm associated with an offending device, the offending device being one of the one or more endpoint devices; identify the offending device as a target for CE filtering; and a filtering device comprising: a filtering processor; a filtering memory, the filtering memory comprising: a plurality of filtering registers associated with the one or more endpoint devices; firmware instructions that when executed by the filtering processor cause the filtering processor to: receive a filtering activation command from the CE management module; begin filtering CE messages from the offending device by manipulating a filtering value of an offending device register to begin filtering CE messages from the offending device, wherein the offending device register is one of the plurality of filtering registers that is associated with the offending device.

19. The apparatus of claim 18, w
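The monitoring loop recited in claim 15 can be sketched as follows. This is a simplified illustration under stated assumptions: `monitor_offender`, its parameters, and the injectable `sleep` hook are hypothetical names, and the filter catch signal is modeled as a callable that returns True when a CE event occurred since the previous query.

```python
import time
from typing import Callable, Tuple


def monitor_offender(
    query_filter_catch: Callable[[], bool],
    n_iterations: int,
    query_interval_s: float,
    ce_event_threshold: int,
    sleep: Callable[[float], None] = time.sleep,
) -> Tuple[bool, int]:
    """Poll the filter catch signal N times.

    Returns (stop_filtering, ce_event_counter). `stop_filtering` is True when
    the counter is lower than the threshold, which in the claim corresponds to
    sending a stop filtering command to the CE filtering module.
    """
    ce_event_counter = 0
    for _ in range(n_iterations):
        sleep(query_interval_s)  # wait one query period (assumed fixed)
        if query_filter_catch():
            # A CE event occurred since the previous query.
            ce_event_counter += 1
    return ce_event_counter < ce_event_threshold, ce_event_counter


# Made-up signal readings for four polling iterations.
samples = iter([True, False, False, True])
stop, count = monitor_offender(
    lambda: next(samples), 4, 0.0, 3, sleep=lambda _s: None
)
```

Here two of four queries report a CE event, which is below the assumed threshold of three, so the sketch would stop filtering. The returned counter could also serve as the basis for the summary of CE events mentioned in claim 16.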
Error filtering or prioritizing based on a policy defined by the user or on a policy defined by a hardware/software module, e.g. according to a severity level · CPC title
by exceeding a count or rate limit, e.g. word- or bit count limit · CPC title