Cloud scale server reliability management

US2021286667A1 · US · A1

Patent metadata
Field: Value
Publication number: US-2021286667-A1
Application number: US-202117332302-A
Country: US
Kind code: A1
Filing date: May 27, 2021
Priority date: May 27, 2021
Publication date: Sep 16, 2021
Grant date: not listed

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

An embodiment of an electronic apparatus may comprise one or more substrates, and a controller coupled to the one or more substrates, the controller including circuitry to provide management of a connected hardware subsystem with respect to one or more of reliability, availability and serviceability, and coordinate the management of the connected hardware subsystem with respect to one or more of reliability, availability and serviceability between the connected hardware subsystem and a host. Other embodiments are disclosed and claimed.

First claim

Opening claim text (preview).

What is claimed is:

1. An electronic apparatus, comprising: one or more substrates; and a controller coupled to the one or more substrates, the controller including circuitry to: provide management of a connected hardware subsystem with respect to one or more of reliability, availability and serviceability, and coordinate the management of the connected hardware subsystem with respect to one or more of reliability, availability and serviceability between the connected hardware subsystem and a host.

2. The apparatus of claim 1, wherein the circuitry is further to: proactively notify the host of a temporary failure event for the connected hardware subsystem.

3. The apparatus of claim 2, wherein the circuitry is further to: release a hardware resource of the connected hardware subsystem related to the temporary failure event in response to a communication from the host that indicates that the host has undertaken a failure mitigation action for the notified temporary failure event; and initiate a self-repair action for the released hardware resource.

4. The apparatus of claim 3, wherein the circuitry is further to: notify the host that the released hardware resource may be reclaimed if the self-repair action is successful.

5. The apparatus of claim 3, wherein the circuitry is further to: notify the host that the temporary failure event is a permanent failure event if the self-repair action is unsuccessful.

6. The apparatus of claim 2, wherein the connected hardware subsystem corresponds to a memory subsystem, and wherein the circuitry is further to: release a portion of memory of the memory subsystem related to the temporary failure event in response to a communication from the host that indicates that the host has temporarily mapped out the portion of memory; and initiate a post-package repair for the released portion of memory.

7. The apparatus of claim 6, wherein the circuitry is further to: notify the host that the released portion of memory may be reused if the post-package repair is successful; and notify the host that the portion of memory of the memory subsystem is to remain mapped out if the post-package repair is unsuccessful.

8. An electronic system, comprising: a controller; and memory communicatively coupled to the controller, wherein the memory stores firmware instructions that when executed by the controller cause the controller to: provide management of a connected hardware subsystem with respect to one or more of reliability, availability and serviceability, and coordinate the management of the connected hardware subsystem with respect to one or more of reliability, availability and serviceability between the connected hardware subsystem and a host.

9. The system of claim 8, wherein the memory stores further firmware instructions that when executed by the controller cause the controller to: proactively notify the host of a temporary failure event for the connected hardware subsystem.

10. The system of claim 9, wherein the memory stores further firmware instructions that when executed by the controller cause the controller to: release a hardware resource of the connected hardware subsystem related to the temporary failure event in response to a communication from the host that indicates that the host has undertaken a failure mitigation action for the notified temporary failure event; and initiate a self-repair action for the released hardware resource.

11. The system of claim 10, wherein the memory stores further firmware instructions that when executed by the controller cause the controller to: notify the host that the released hardware resource may be reclaimed if the self-repair action is successful.

12. The system of claim 10, wherein the memory stores further firmware instructions that when executed by the controller cause the controller to: notify the host that the temporary failure event is a permanent failure event if the self-repair action is unsuccessful.

13. The system of claim 9, wherein the connected hardware subsystem corresponds to a memory subsystem, and wherein the memory stores further firmware instructions that when executed by the controller cause the controller to: release a portion of memory of the memory subsystem related to the temporary failure event in response to a communication from the host that indicates that the host has temporarily mapped out the portion of memory; and initiate a post-package repair for the released portion of memory.

14. The system of claim 13, wherein the memory stores further firmware instructions that when executed by the controller cause the controller to: notify the host that the released portion of memory may be reused if the post-package repair is successful; and notify the host that the portion of memory of the memory subsystem is to remain mapped out if the post-package repair is unsuccessful.

15. A method of managing a subsystem, comprising: providing management of a connected hardware subsystem with respect to one or more of reliability, availability and serviceability; and coordinating the management of the connected hardware subsystem with respect to one or more of reliability, availability and serviceability between the connected hardware subsystem and a host.

16. The method of claim 15, further comprising: proactively notifying the host of a temporary failure event for the connected hardware subsystem.

17. The method of claim 16, further comprising: releasing a hardware resource of the connected hardware subsystem related to the temporary failure event in response to a communication from the host that indicates that the host has undertaken a failure mitigation action for the notified temporary failure event; and initiating a self-repair action for the released hardware resource.

18. The method of claim 17, further comprising: notifying the host that the released hardware resource may be reclaimed if the self-repair action is successful; and notifying the host that the temporary failure event is a permanent failure event if the self-repair action is unsuccessful.

19. The method of claim 16, wherein the connected hardware subsystem corresponds to a memory subsystem, further comprising: releasing a portion of memory of the memory subsystem related to the temporary failure event in response to a communication from the host that indicates that the host has temporarily mapped out the portion of memory; and initiating a post-package repair for the released portion of memory.

20. The method of claim 19, further comprising: notifying the host that the released portion of memory may be reused if the post-package repair is successful; and notifying the host that the portion of memory of the memory subsystem is to remain mapped out if the post-package repair is unsuccessful.
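The notify, release, repair, and report sequence recited in the dependent claims can be read as a small state machine. The Python sketch below is an illustration only, not the patent's implementation: the class names (`Host`, `RasController`), the method names, the `ResourceState` values, and the `repair` callback standing in for a self-repair or post-package repair action are all hypothetical and do not appear in the publication.

```python
from enum import Enum, auto


class ResourceState(Enum):
    HEALTHY = auto()
    TEMP_FAILED = auto()       # controller has notified the host
    RELEASED = auto()          # host has mitigated; controller may repair
    PERMANENT_FAILED = auto()  # self-repair did not succeed


class Host:
    """Hypothetical host side: records notifications, maps resources out and in."""

    def __init__(self):
        self.mapped_out = set()
        self.log = []

    def on_temporary_failure(self, resource):
        self.log.append(("temporary_failure", resource))
        self.mapped_out.add(resource)  # assumed failure mitigation action
        return True                    # tells the controller mitigation is done

    def on_reclaimable(self, resource):
        self.log.append(("reclaimable", resource))
        self.mapped_out.discard(resource)  # resource may be used again

    def on_permanent_failure(self, resource):
        self.log.append(("permanent_failure", resource))
        # the resource simply stays mapped out


class RasController:
    """Hypothetical controller side coordinating repair with the host."""

    def __init__(self, host, repair):
        self.host = host
        self.repair = repair  # callable(resource) -> bool, stand-in for self-repair
        self.state = {}

    def report_temporary_failure(self, resource):
        # Proactively notify the host of the temporary failure.
        self.state[resource] = ResourceState.TEMP_FAILED
        if not self.host.on_temporary_failure(resource):
            return self.state[resource]  # host has not mitigated yet
        # Host has mitigated: release the resource and attempt self-repair.
        self.state[resource] = ResourceState.RELEASED
        if self.repair(resource):
            self.state[resource] = ResourceState.HEALTHY
            self.host.on_reclaimable(resource)
        else:
            self.state[resource] = ResourceState.PERMANENT_FAILED
            self.host.on_permanent_failure(resource)
        return self.state[resource]
```

As a usage example, constructing `RasController(Host(), repair=lambda r: r != "row-7")` and reporting failures for the made-up resources `"row-3"` and `"row-7"` leaves `"row-3"` healthy and reclaimable while `"row-7"` is marked permanently failed and remains mapped out on the host. A synchronous call is used here for brevity; the claims do not specify how the host and controller actually exchange these messages.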

Assignees

Intel Corp

Inventors

Classifications

  • Means for error signaling, e.g. using interrupts, exception flags, dedicated error registers · CPC title

  • Remedial or corrective actions (recovery from an exception in an instruction pipeline G06F9/3861; by retry G06F11/1402; for recovering from a failure of a protocol instance or entity H04L69/40) · CPC title

  • Error or fault detection not based on redundancy (power supply failures G06F1/30; network fault management H04L41/06) · CPC title

Patent family

Related publications grouped by family.

External sources

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US2021286667A1 cover?
It covers an electronic apparatus with one or more substrates and a controller whose circuitry manages a connected hardware subsystem with respect to one or more of reliability, availability and serviceability, and coordinates that management between the connected hardware subsystem and a host.
Who is the assignee on this patent?
Intel Corp
What technology area does this patent fall under?
Primary CPC classification G06F11/0793. Mapped technology areas include Physics.
When was this patent published?
Publication date Sep 16, 2021 (A1). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 1 related publication on this page (citations in our corpus or others sharing the same primary CPC).