Managing power in a high performance computing system for resiliency and cooling

US10429909B2 · US · B2

Patent metadata
Publication number: US-10429909-B2
Application number: US-201615148242-A
Country: US
Kind code: B2
Filing date: May 6, 2016
Priority date: Jun 1, 2015
Publication date: Oct 1, 2019
Grant date: Oct 1, 2019

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

An apparatus and method thermally manage a high performance computing system having a plurality of nodes with microprocessors. To that end, the apparatus and method monitor the temperature of at least one of a) the environment of the high performance computing system and b) at least a portion of the high performance computing system. In response, the apparatus and method control the processing speed of at least one of the microprocessors on at least one of the plurality of nodes as a function of at least one of the monitored temperatures.
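
As an illustration only, the relationship the abstract describes (processing speed controlled as a function of the monitored temperatures) might be sketched in Python as follows. The function names, the temperature ranges, the frequency range, and the linear scaling rule are all assumptions made for this sketch; the abstract does not specify any of them.

```python
def headroom(temp_c, start_c, limit_c):
    """Return 1.0 at or below start_c, 0.0 at or above limit_c, linear in between."""
    if temp_c <= start_c:
        return 1.0
    if temp_c >= limit_c:
        return 0.0
    return (limit_c - temp_c) / (limit_c - start_c)


def target_frequency_mhz(env_temp_c, node_temp_c, max_mhz=3000.0, min_mhz=1200.0):
    """Hypothetical rule: pick a clock frequency from two monitored temperatures.

    env_temp_c  -- temperature of the environment (for example, room air)
    node_temp_c -- temperature of a portion of the system (for example, a node)
    """
    # Assumed ranges: room air throttles between 25 and 35 C,
    # node temperature between 60 and 90 C. Neither comes from the patent.
    worst = min(headroom(env_temp_c, 25.0, 35.0),
                headroom(node_temp_c, 60.0, 90.0))
    # The sensor with the least headroom determines the speed.
    return min_mhz + worst * (max_mhz - min_mhz)


if __name__ == "__main__":
    print(target_frequency_mhz(20.0, 50.0))  # both sensors cool
    print(target_frequency_mhz(30.0, 50.0))  # warm room, cool node
```

Taking the minimum headroom is one design choice among many; a weighted combination of the two readings would also be "a function of" the monitored temperatures.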

First claim

Opening claim text (preview).

What is claimed is:

1. A method of managing errors of a high performance computing system, the method comprising: detecting an error condition of the high performance computing system, the high performance computing system having a plurality of nodes with microprocessors and a cooling system operable in a first cooling mode and a second cooling mode different than the first cooling mode; thermally conducting heat away from at least one of the plurality of nodes under the first cooling mode in response to the detected error condition exceeding a first threshold; thermally conducting heat away from at least one of the plurality of nodes under the second cooling mode in response to the detected error condition exceeding a second threshold; and reducing a processing speed of at least one of the microprocessors on at least one of the plurality of nodes in response to the detected error condition exceeding the first threshold and such that the detected error condition is maintained at or below the second threshold so as to prolong cooling under the first cooling mode, wherein the detected error condition comprises a temperature reading of at least one of the nodes, wherein the first threshold comprises a first temperature threshold, wherein the second threshold comprises a second temperature threshold greater than the first temperature threshold and wherein the processing speed of at least one of the microprocessors on at least one of the plurality of nodes is reduced in response to the temperature reading exceeding the first temperature threshold to maintain the temperature reading at or below the second temperature threshold so as to prolong cooling under the first cooling mode.

2. The method as defined by claim 1 wherein the error condition includes at least one of a correctable error and a temperature reading of at least one of the nodes.

3. The method as defined by claim 2 wherein the correctable error includes at least one of a memory correctable error and a network correctable error.

4. The method as defined by claim 1 wherein the at least one microprocessor's processing speed normally is at a current level, further wherein reducing comprises: permitting the processing speed to maintain current levels; and reducing the processing speed from current levels after detecting a plurality of error conditions.

5. The method as defined by claim 1 wherein detecting comprises detecting a plurality of error conditions of the high performance computing system, and wherein the reducing comprises reducing the processing speed as a function of the plurality of error conditions.

6. The method as defined by claim 1 further comprising, after detecting the error condition, hot swapping at least a portion of the high performance computing system, or stopping execution of at least a portion of the high performance computing system.

7. The method as defined by claim 1 further comprising executing a task on a given node of the plurality of nodes, wherein the detecting comprises detecting an error condition on the given node, and wherein the reducing comprises postponing reduction of the processing speed of at least one of the microprocessors on the given node until after the task is completed.

8. The method of claim 1, wherein the thermally conducting of the heat away from at least one of the plurality of nodes under the first mode in response to the temperature reading exceeding the first temperature threshold is by directing a liquid coolant through coils.

9. The method of claim 8, wherein the thermally conducting of the heat away from at least one of the plurality of nodes under the second mode in response to the temperature reading exceeding the second temperature threshold is by directing a liquid coolant through particular coils and spraying water onto exterior surfaces of the particular coils.

10. The method of claim 8, wherein the thermally conducting of the heat away from at least one of the plurality of nodes under the second mode in response to the temperature reading exceeding the second temperature threshold is by directing a liquid coolant through particular coils and passing the particular coils through a refrigerant.

11. The method of claim 1, wherein the thermally conducting of the heat away from at least one of the plurality of nodes under the first mode in response to the temperature reading exceeding the first temperature threshold is by directing a liquid coolant through coils and spraying water onto exterior surfaces of the coils.

12. The method of claim 11, wherein the thermally conducting of the heat away from at least one of the plurality of nodes under the second mode in response to the temperature reading exceeding the second temperature threshold is by directing a liquid coolant through particular coils and passing the particular coils through a refrigerant.

13. The method of claim 1 further comprising taking multiple temperature readings over time of the at least one of the nodes and automatically controlling the processing speed of the at least one of the microprocessors on at least one of the plurality of nodes based upon the multiple temperature readings.

14. The method of claim 1 wherein the high performance computing system is within a room having an air temperature, an environment comprising a region of the room, the temperature reading being the air temperature at the region of the room.

15. The method of claim 1 further comprising acquiring the temperature reading by monitoring a respective temperature of both (a) an environment of the high performance computing system and (b) at least a portion of the high performance computing system.

16. The method of claim 15, wherein reducing the processing speed of at least one of the microprocessors on at least one of the plurality of nodes is a function of both the monitored temperatures (a) and (b).

17. The method of claim 1 wherein the reducing of the processing speed of the at least one microprocessor is for a prescribed period of time, the method further comprising increasing the speed of the at least one microprocessor after the prescribed period of time has elapsed.

18. The method of claim 1 wherein the reducing of the processing speed of the at least one microprocessor occurs at least until the temperature reading decreases to a prescribed temperature, the method further comprising increasing the processing speed of the at least one microprocessor after the temperature reading has decreased to the prescribed temperature.
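
The two-threshold logic of claim 1, combined with the release condition of claim 18, can be pictured with a small Python sketch. This is a reading aid under stated assumptions, not the patented implementation: the class name, the mode labels "first" and "second", the example threshold values, and the choice to report no cooling mode below the first threshold are all hypothetical.

```python
class ThrottleController:
    """Hypothetical controller following the structure of claims 1 and 18."""

    def __init__(self, first_c, second_c, release_c):
        # Claim 1 requires the second threshold to exceed the first.
        # Placing release_c at or below first_c is an assumption of this sketch.
        if not release_c <= first_c < second_c:
            raise ValueError("expected release_c <= first_c < second_c")
        self.first_c = first_c
        self.second_c = second_c
        self.release_c = release_c
        self.throttled = False

    def step(self, temp_c):
        """Return (cooling_mode, throttled) for one temperature reading."""
        if temp_c > self.first_c:
            # Claim 1: reduce processing speed once the first threshold is exceeded.
            self.throttled = True
        elif temp_c <= self.release_c:
            # Claim 18: restore speed after the reading falls to a prescribed temperature.
            self.throttled = False

        if temp_c > self.second_c:
            mode = "second"
        elif temp_c > self.first_c:
            mode = "first"
        else:
            mode = None  # claim 1 does not say which mode applies here
        return mode, self.throttled


if __name__ == "__main__":
    controller = ThrottleController(first_c=70.0, second_c=85.0, release_c=60.0)
    for reading in (65.0, 75.0, 65.0, 60.0, 90.0):
        print(reading, controller.step(reading))
```

In this sketch, throttling begins as soon as the first threshold is crossed, which is what gives the system a chance to stay at or below the second threshold and so remain in the first cooling mode for longer.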

Assignees

Hewlett Packard Entpr Dev Lp

Inventors

Classifications

  • Monitoring arrangements for monitoring environmental properties or parameters of the computing system or of the computing system component, e.g. monitoring of power, currents, temperature, humidity, position, vibrations (thermal management in cooling arrangements of a computing system G06F1/206) · CPC title

  • where the computing system is implementing multitasking (multiprogramming arrangements G06F9/46; allocation of resources G06F9/50) · CPC title

  • G06F1/206 (Primary)

    comprising thermal management · CPC title

  • by lowering clock frequency · CPC title

  • Error detection; Error correction; Monitoring (error detection, correction or monitoring in information storage based on relative movement between record carrier and transducer G11B20/18; monitoring, i.e. supervising the progress of recording or reproducing G11B27/36; in static stores G11C29/00) · CPC title

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US10429909B2 cover?
An apparatus and method thermally manage a high performance computing system having a plurality of nodes with microprocessors. To that end, the apparatus and method monitor the temperature of at least one of a) the environment of the high performance computing system and b) at least a portion of the high performance computing system. In response, the apparatus and method control the processing …
Who is the assignee on this patent?
Hewlett Packard Entpr Dev Lp
What technology area does this patent fall under?
Primary CPC classification G06F1/206. Mapped technology areas include Physics.
When was this patent published?
Publication date: Oct 1, 2019 (B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 3 related publications on this page (citations in our corpus or others sharing the same primary CPC).