System and method to reduce host interrupts for non-critical errors
US-2021263868-A1 · Aug 26, 2021 · US
US11726873B2 · US · B2
| Field | Value |
|---|---|
| Publication number | US-11726873-B2 |
| Application number | US-202117556550-A |
| Country | US |
| Kind code | B2 |
| Filing date | Dec 20, 2021 |
| Priority date | Dec 20, 2021 |
| Publication date | Aug 15, 2023 |
| Grant date | Aug 15, 2023 |
Official abstract text for this publication.
A system, method and apparatus to optimize repair in a memory module based on hardware errors identified by microprocessors and a configurable error handling policy. For example, the error handling policy can have a configuration file identifying an amount of repair resources available in the memory module as manufactured. Repair status data can be stored in the memory module to determine repair resources currently available for repair. Further, the error handling policy can be configured with a list of high risk memory addresses prioritized for repair. The list can be used to schedule proactive repair in response to memory errors that would otherwise not be repaired during a typical restarting of the computer system having the memory module.
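The decision described in the abstract can be illustrated with a minimal sketch. It is not the patented implementation. It assumes that each past repair consumes exactly one repair resource, that the repair history is a simple list of repaired addresses, and that the policy keeps a `reserve` of spares for later restarts. All names (`ErrorPolicy`, `ErrorRecord`, `resources_available`, `should_repair`, `reserve`) are hypothetical and do not come from the patent.

```python
from dataclasses import dataclass, field
from enum import Enum


class Severity(Enum):
    CORRECTABLE = 1
    UNCORRECTABLE = 2
    NON_RECOVERABLE = 3


@dataclass
class ErrorPolicy:
    """Hypothetical stand-in for the policy's configuration file."""
    resources_as_manufactured: int            # first amount of repair resources
    high_risk_addresses: set = field(default_factory=set)
    reserve: int = 0                          # assumption: spares held back for later restarts


@dataclass(frozen=True)
class ErrorRecord:
    """Hypothetical decoded error data: where the error is and how severe."""
    address: int
    severity: Severity


def resources_available(policy, repair_history):
    """Second amount of resources: as manufactured minus repairs already done.

    Assumes one resource per recorded repair.
    """
    return max(policy.resources_as_manufactured - len(repair_history), 0)


def should_repair(policy, repair_history, error):
    """Decide whether to schedule a post production repair for this error."""
    remaining = resources_available(policy, repair_history)
    if remaining == 0:
        return False  # nothing left to repair with
    if error.severity is Severity.NON_RECOVERABLE:
        return True   # repair regardless of the high-risk list
    if error.address in policy.high_risk_addresses:
        # Proactive repair, but only while more than the reserve remains.
        return remaining > policy.reserve
    return False


policy = ErrorPolicy(resources_as_manufactured=4,
                     high_risk_addresses={0x1000},
                     reserve=1)
history = [0x2000, 0x3000]  # two repairs already recorded

print(resources_available(policy, history))
print(should_repair(policy, history, ErrorRecord(0x1000, Severity.CORRECTABLE)))
print(should_repair(policy, history, ErrorRecord(0x5000, Severity.CORRECTABLE)))
```

In this sketch a listed high-risk address is repaired before its error becomes non-recoverable, while an unlisted address with a correctable error is left alone, which mirrors the proactive scheduling the abstract describes.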
Opening claim text (preview).
What is claimed is:

1. A method, comprising: storing, in a computing system, data representative of a configurable error handling policy; detecting, by a processor of the computing system, a memory error in a memory module, the memory module manufactured to have a first amount of repair resources; generating, by the processor, first data about the memory error; writing, into registers in the processor, the first data; identifying a second amount of repair resources currently available in the memory module to implement repairs; and determining, by the computing system based on the second amount of repair resources, the first data, and the error handling policy, whether to perform a post production repair at a memory address having the memory error.

2. The method of claim 1, further comprising: receiving, in the computing system, an input identifying the first amount of repair resources available in the memory module as manufactured; writing the input to a configuration file of the error handling policy; and storing, in the computing system, historic data identifying repairs performed in the memory module; wherein the identifying of the second amount of repair resources currently available in the memory module to implement repairs is based on the first amount of repair resources and the historic data of repairs performed in the memory module.

3. The method of claim 2, wherein the historic data of repairs performed in the memory module is stored in a non-volatile memory of the memory module.

4. The method of claim 3, wherein the non-volatile memory is configured to support Serial Presence Detect (SPD).

5. The method of claim 4, further comprising: receiving, in the computing system, a list identifying a plurality of memory addresses; and writing the list to the configuration file of the error handling policy; wherein the determining of whether to perform the post production repair is further based at least in part on whether the memory address is in the plurality of memory addresses.

6. The method of claim 5, further comprising: predicting the plurality of memory addresses based on risk assessment.

7. The method of claim 6, wherein the plurality of memory addresses are predicted to have memory errors based on a pattern of operations in the computing system.

8. The method of claim 5, wherein the post production repair is selected for the memory address in response to the memory address being in the plurality of memory addresses and before the memory error is determined to be non-recoverable.

9. The method of claim 5, wherein the post production repair is not selected for the memory address in response to a determination that there are insufficient repair resources available to repair a plurality of errors in the memory module for a subsequent restart of the computing system.

10. The method of claim 5, wherein the determining of whether to perform the post production repair is performed in a Baseboard Management Controller (BMC) connected to the processor; and the method further comprises: generating second data about the memory error based at least in part on the first data in the registers; and storing the second data in a storage device of the Baseboard Management Controller (BMC), wherein the determining of whether to perform the post production repair is based on the second data.

11. An apparatus, comprising: a Baseboard Management Controller (BMC) having a storage device configured to store data representative of a first error handling policy having a configuration file; a memory module having a non-volatile memory and a volatile memory, the memory module manufactured to have a first amount of repair resources; and a microprocessor coupled to the memory module and the Baseboard Management Controller (BMC), the microprocessor configured via instructions to, in response to an error in the memory module and prior to restarting of the apparatus: store, in registers of the microprocessor and in response to the error in the memory module, first data about the error; decode the first data about the error to generate second data about the error; and communicate with the Baseboard Management Controller (BMC) to store the second data into the storage device of the Baseboard Management Controller (BMC); wherein the Baseboard Management Controller (BMC) is configured to determine, based on a second amount of repair resources currently available in the memory module to implement repairs, the second data, and the first error handling policy having the configuration file, whether to perform a post production repair at a memory address having the error.

12. The apparatus of claim 11, wherein the microprocessor is further configured to determine whether to perform the post production repair at the memory address having the error based on a list of memory addresses specified for a second error handling policy processed using an operating system executed by the microprocessor.

13. The apparatus of claim 12, wherein the microprocessor is further configured to predict the list of memory addresses based on risk assessment and an operation pattern of the microprocessor.

14. The apparatus of claim 12, wherein the microprocessor is further configured via instructions in a Basic Input/Output System (BIOS) of the apparatus to store, in the non-volatile memory of the memory module, historic data of post production repairs performed in the memory module; the configuration file identifies first repair resources as manufactured in the memory module; and the apparatus is configured to identify, based on the configuration file and the historic data, second repair resources in the memory module available to perform the post production repair at the memory address.

15. The apparatus of claim 14, wherein the non-volatile memory is configured to implement Serial Presence Detect (SPD); and the registers are configured to implement Machine Check Architecture (MCA).

16. The apparatus of claim 15, wherein the second error handling policy processed using the operating system executed by the microprocessor is configured to select from memory addresses having uncorrectable errors for repair; and the first error handling policy implemented in the Baseboard Management Controller (BMC) is configured to select from memory addresses having non-recoverable uncorrectable errors for repair.

17. A non-transitory computer readable storage medium storing instructions which, when executed by a microprocessor in a computing device, causes the computing device to perform a method, comprising: generating, based on decoding first data in registers in the microprocessor about a memory error in a memory module in the computing device, second data about the memory error, the second data containing a memory address of the memory error, the memory module manufactured to have a first amount of repair resources; storing, in a non-volatile memory, the second data; and determining, based on a second amount of repair resources currently available in the memory module to implement repairs, the second data, and a configurable error handling policy, whether to perform a post production repair at the memory address of the memory error.

18. The non-transitory computer readable storage medium of claim 17, wherein the method further comprises: configuring the error handling policy to identify a list of memory addresses, wherein the determining of whether to perform the post production repair is based at least in part on whether the memory address is in the list.

19
in a storage system, e.g. in a DASD or network based storage system (drivers for digital recording or reproducing units G06F3/06; circuits for error detection or correction within digital recording or reproducing units G11B20/18; for distributed storage of data in networks, e.g. transport arrangements for network file system [NFS], storage area networks [SAN] or network attached storage [NAS], H04L67/1097) · CPC title
Error or fault detection not based on redundancy (power supply failures G06F1/30; network fault management H04L41/06) · CPC title
Dumping, i.e. gathering error/state information after a fault for later diagnosis · CPC title
Remedial or corrective actions (recovery from an exception in an instruction pipeline G06F9/3861; by retry G06F11/1402; for recovering from a failure of a protocol instance or entity H04L69/40) · CPC title
in sector programmable memories, e.g. flash disk (G06F11/1072 takes precedence) · CPC title