Preventing unnecessary data recovery

US9898360B1 · US · B1

Patent metadata
Field               Value
Publication number  US-9898360-B1
Application number  US-201514980633-A
Country             US
Kind code           B1
Filing date         Dec 28, 2015
Priority date       Feb 25, 2014
Publication date    Feb 20, 2018
Grant date          Feb 20, 2018

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

A method that prevents unnecessary data recovery includes receiving, at a data processing device, a status of a resource of a distributed system. When the status of the resource indicates a resource failure, the method includes executing instructions on the data processing device to determine whether the resource failure is correlated to any other resource failures within the distributed system. When the resource failure is correlated to other resource failures within the distributed system, the method includes delaying execution on the data processing device of a remedial action associated with the resource. However, when the resource failure is uncorrelated to other resource failures within the distributed system, the method includes initiating execution on the data processing device of the remedial action associated with the resource.
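The decision flow in the abstract can be illustrated with a minimal Python sketch. This is not the patented implementation: the names `Decision`, `handle_status`, and `same_rack` are hypothetical, and the `"rack/host"` identifier scheme and the "another failure in the same rack" test are assumptions chosen only to give the correlation check something concrete to evaluate.

```python
from enum import Enum
from typing import Callable, Set


class Decision(Enum):
    NO_ACTION = "no_action"  # status does not indicate a failure
    DELAY = "delay"          # failure looks correlated: hold off on the remedial action
    INITIATE = "initiate"    # failure looks isolated: start the remedial action


def handle_status(
    resource_id: str,
    failed: bool,
    failed_resources: Set[str],
    is_correlated: Callable[[str, Set[str]], bool],
) -> Decision:
    """Map a received resource status to one of the three outcomes in the abstract."""
    if not failed:
        return Decision.NO_ACTION
    if is_correlated(resource_id, failed_resources):
        return Decision.DELAY
    return Decision.INITIATE


def same_rack(resource_id: str, failed_resources: Set[str]) -> bool:
    """Hypothetical correlation test: another failed resource shares the rack.

    Assumes identifiers of the form "rack/host"; the patent does not define this.
    """
    rack = resource_id.split("/")[0]
    return any(
        other != resource_id and other.split("/")[0] == rack
        for other in failed_resources
    )
```

Passing the correlation test in as a function keeps the delay-or-initiate decision separate from whatever policy decides that failures are related.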

First claim

Opening claim text (preview).

What is claimed is:

1. A method comprising: receiving, at a data processing device, a status of a resource of a distributed system; when the status of the resource indicates a resource failure, executing instructions on the data processing device to determine whether the resource failure is correlated to any other resource failures within the distributed system based on a system hierarchy of the distributed system, the system hierarchy comprising system domains, each system domain having an active state or an inactive state, the resource belonging to at least one system domain, wherein the resource failure is correlated to other resource failures when a statistically significant number of resources having failures reside in a same system domain; when the resource failure is correlated to other resource failures within the distributed system, delaying execution on the data processing device of a remedial action associated with the resource; and when the resource failure is uncorrelated to other resource failures within the distributed system, initiating execution on the data processing device of the remedial action associated with the resource.

2. The method of claim 1, further comprising, when the resource comprises non-transitory memory, initiating data reconstruction as the remedial action for any data stored on the non-transitory memory.

3. The method of claim 2, wherein the data comprises chunks of a file, the file divided into stripes comprising data chunks and non-data chunks.

4. The method of claim 1, further comprising, when the resource comprises a computer processor, migrating or restarting a job previously executing on a failed computer processor to an operational computer processor.

5. The method of claim 1, further comprising determining the resource failure as correlated to other resource failures when the resource resides in an inactive system domain.

6. The method of claim 1, wherein the system hierarchy comprises system levels comprising: a first system level corresponding to host machines of data processing devices, non-transitory memory devices, or network interface controllers, each host machine having a system domain; a second system level corresponding to power deliverers, communication deliverers, or cooling deliverers of racks housing the host machines, each power deliverer, communication deliverer, or cooling deliverer of the rack having a system domain; a third system level corresponding to power deliverers, communication deliverers, or cooling deliverers of cells having associated racks, each power deliverer, communication deliverer, or cooling deliverer of the cell having a system domain; and a fourth system level corresponding to a distribution center module of the cells, each distribution center module having a system domain.

7. A method comprising: receiving, at a data processing device, a status of a resource of a distributed system; when the status of the resource indicates a resource failure, executing instructions on the data processing device to determine whether the resource failure is correlated to any other resource failures within the distributed system based on a system hierarchy of the distributed system, the system hierarchy comprising system domains, each system domain having an active state or an inactive state, the resource belonging to at least one system domain; when the resource failure is correlated to other resource failures within the distributed system, delaying execution on the data processing device of a remedial action associated with the resource until after a first threshold period of time; and when the resource failure is uncorrelated to the other resource failures within the distributed system, initiating execution on the data processing device of the remedial action associated with the resource after a second threshold period of time, wherein the first threshold period of time is greater than the second threshold period of time.

8. The method of claim 7, wherein the second threshold period of time is between about 15 minutes and about 30 minutes.

9. A recovery system for a distributed system, the recovery system comprising: a data processing device in communication with resources of the distributed system, the data processing device receiving a status of a resource of the distributed system; when the status of the resource indicates a resource failure, the data processing device executing instructions to determine whether the resource failure is correlated to any other resource failures within the distributed system based on a system hierarchy of the distributed system, the system hierarchy comprising system domains, each system domain having an active state or an inactive state, the resource belonging to at least one system domain, wherein the resource failure is correlated to other resource failures when a statistically significant number of resources having failures reside in a same system domain; when the resource failure is correlated to other resource failures within the distributed system, the data processing device delaying execution of a remedial action associated with the resource; and when the resource failure is uncorrelated to other resource failures within the distributed system, the data processing device initiating execution of the remedial action associated with the resource.

10. The recovery system of claim 9, wherein, when the resource comprises non-transitory memory, the data processing device initiates data reconstruction as the remedial action for any data stored on the non-transitory memory.

11. The recovery system of claim 10, wherein the data comprises chunks of a file, the file divided into stripes comprising data chunks and non-data chunks.

12. The recovery system of claim 9, wherein, when the resource comprises a computer processor, the data processing device migrates or restarts a job previously executing on a failed computer processor to an operational computer processor.

13. The recovery system of claim 9, wherein the data processing device determines the resource failure as correlated to other resource failures, when the resource resides in an inactive system domain.

14. The recovery system of claim 9, wherein the system hierarchy comprises system levels comprising: a first system level corresponding to host machines of data processing devices, non-transitory memory devices, or network interface controllers, each host machine having a system domain; a second system level corresponding to power deliverers, communication deliverers, or cooling deliverers of racks housing the host machines, each power deliverer, communication deliverer, or cooling deliverer of the rack having a system domain; a third system level corresponding to power deliverers, communication deliverers, or cooling deliverers of cells having associated racks, each power deliverer, communication deliverer, or cooling deliverer of the cell having a system domain; and a fourth system level corresponding to a distribution center module of the cells, each distribution center module having a system domain.

15. A recovery system for a distributed system, the recovery system comprising: a data processing device in communication with resources of the distributed system, the data processing device receiving a status of a resource of the distributed system; when the status of the resource indicates a resource failure, the data processing device executing instructions to determine whether the resource failure is correlated to any other resource failures within the distributed system based on a system hierarchy of the distributed system, the sys
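The claims combine a domain hierarchy (claims 1, 5, 6) with two waiting periods (claims 7, 8). The Python sketch below is one hedged reading of how those pieces might fit together, not the patented implementation. The `Domain` class, the helper names, and the `min_failed` / `min_fraction` cutoffs are assumptions: the claims say only "a statistically significant number" and do not give a formula. The 60-minute correlated wait is likewise an assumed value; the claims require only that it exceed the uncorrelated wait, which claim 8 places between about 15 and about 30 minutes.

```python
from dataclasses import dataclass, field
from datetime import timedelta
from typing import Optional, Set

# Claim 8: the second (uncorrelated) threshold is about 15 to 30 minutes.
UNCORRELATED_WAIT = timedelta(minutes=15)
# Assumed value: claim 7 only requires the first threshold to be greater.
CORRELATED_WAIT = timedelta(minutes=60)


@dataclass
class Domain:
    """One system domain: level 1 = host, 2 = rack, 3 = cell, 4 = distribution center module."""
    name: str
    level: int
    active: bool = True
    parent: Optional["Domain"] = None
    resources: Set[str] = field(default_factory=set)


def add_resource(leaf: Domain, resource_id: str) -> None:
    """Register a resource in its leaf domain and in every enclosing domain."""
    node: Optional[Domain] = leaf
    while node is not None:
        node.resources.add(resource_id)
        node = node.parent


def is_correlated(leaf: Domain, failed: Set[str],
                  min_failed: int = 3, min_fraction: float = 0.5) -> bool:
    """Walk up the hierarchy looking for evidence of a shared cause.

    A failure counts as correlated if an enclosing domain is inactive (claim 5)
    or if enough failed resources share a domain. The two cutoffs are a
    stand-in for "statistically significant" and are assumptions.
    """
    node: Optional[Domain] = leaf
    while node is not None:
        if not node.active:
            return True
        failed_here = node.resources & failed
        if (len(failed_here) >= min_failed
                and len(failed_here) / len(node.resources) >= min_fraction):
            return True
        node = node.parent
    return False


def recovery_wait(leaf: Domain, failed: Set[str]) -> timedelta:
    """Return how long to wait before starting the remedial action (claim 7)."""
    return CORRELATED_WAIT if is_correlated(leaf, failed) else UNCORRELATED_WAIT
```

For example, with four hosts under one rack and one disk registered per host, a single failed disk yields the shorter wait, while three failed disks in the same rack (or a rack marked inactive) yield the longer one.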

Assignees

Google Inc, Google LLC

Inventors

Classifications

  • in a distributed system consisting of a plurality of standalone computer nodes, e.g. clusters, client-server systems · CPC title

  • by exceeding a time limit, i.e. time-out, e.g. watchdogs · CPC title

  • Root cause analysis, i.e. error or fault diagnosis (in a hardware test environment G06F11/22; in a software test environment G06F11/36) · CPC title

  • Remedial or corrective actions (recovery from an exception in an instruction pipeline G06F9/3861; by retry G06F11/1402; for recovering from a failure of a protocol instance or entity H04L69/40) · CPC title

  • Reconstruction on already foreseen single or plurality of spare disks · CPC title

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US9898360B1 cover?
A method that prevents unnecessary data recovery includes receiving, at a data processing device, a status of a resource of a distributed system. When the status of the resource indicates a resource failure, the method includes executing instructions on the data processing device to determine whether the resource failure is correlated to any other resource failures within the distributed system…
Who is the assignee on this patent?
Google Inc, Google LLC
What technology area does this patent fall under?
Primary CPC classification G06F11/0793. Mapped technology areas include Physics.
When was this patent published?
Publication date: Feb 20, 2018 (kind code B1). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 1 related publication on this page (citations in our corpus or others sharing the same primary CPC).