System and method for monitoring and detecting faulty storage devices

US9766965B2 · US · B2

Patent metadata
Field: Value
Publication number: US-9766965-B2
Application number: US-201514952190-A
Country: US
Kind code: B2
Filing date: Nov 25, 2015
Priority date: Nov 25, 2015
Publication date: Sep 19, 2017
Grant date: Sep 19, 2017

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

In an enterprise environment that includes multiple data centers each having a number of first servers, computer-implemented methods and systems are provided for detecting faulty storage device(s) that are implemented as redundant array of independent disks (RAID) in conjunction with each of the first servers. Each first server monitors lower-level health metrics (LHMs) for each of the storage devices that characterize read and write activity of each storage device over a period of time. The LHMs are used to generate high-level health metrics (HLMs) for each of the storage devices that are indicative of activity of each storage device over the period of time. Second server(s) of a monitoring system can use the HLMs to determine whether each of the storage devices have been inactive or active, and can generate a fault indication for any storage devices that were determined to be inactive while storage device(s) at the same first server were determined to be active.
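
The detection rule described in the abstract can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the patented implementation: the function names, the `(server, device)` keying, the use of IOPS as the deciding high-level metric, and the zero-IOPS inactivity threshold are all hypothetical choices made for the example.

```python
from collections import defaultdict


def is_inactive(iops_samples, threshold=0.0):
    """Hypothetical inactivity test: every sampled IOPS value over the
    extended period is at or below `threshold` (an assumed cutoff)."""
    return all(sample <= threshold for sample in iops_samples)


def find_faulty_devices(iops_by_device):
    """Return sorted (server, device) pairs flagged as faulty.

    `iops_by_device` maps (server, device) -> list of IOPS samples.
    A device is flagged only if it was inactive while at least one other
    device on the same server was active over the same period.
    """
    inactive_by_server = defaultdict(list)
    active_servers = set()
    for (server, device), samples in iops_by_device.items():
        if is_inactive(samples):
            inactive_by_server[server].append(device)
        else:
            active_servers.add(server)
    return sorted(
        (server, device)
        for server, devices in inactive_by_server.items()
        if server in active_servers
        for device in devices
    )


if __name__ == "__main__":
    samples = {
        ("srv1", "sda"): [12.0, 8.5],  # active
        ("srv1", "sdb"): [0.0, 0.0],   # idle while its peer is active
        ("srv2", "sda"): [0.0],        # whole server idle: not flagged
        ("srv2", "sdb"): [0.0],
    }
    print(find_faulty_devices(samples))
```

Comparing a device only against peers on the same server is what keeps a genuinely idle server from raising alerts: when every device is quiet, the sketch reports nothing.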

First claim

Opening claim text (preview).

What is claimed:

1. A system, comprising: a plurality of data centers, wherein each data center comprises: a plurality of first servers, wherein each first server is associated with redundant array of independent disks (RAID) that are implemented in conjunction with that first server, and wherein each first server is configured to: monitor lower-level health metrics that characterize read and write activity for each storage device at that server over a period of time of an observation interval; and process the lower-level health metrics to generate high-level health metrics for each storage device that are indicative of activity of each storage device over the period of time of the observation interval; a local metric collection database configured to receive the lower-level health metrics from each of the servers for that data center; a database that is configured to receive and store the lower-level health metrics from each of the local metric collection databases for each of the data centers; a monitoring system comprising at least one second server being configured to: determine, for each of the storage devices based on one or more of the high-level health metrics for that storage device, whether each particular storage device has been inactive over an extended period of time; determine, for each of the storage devices that are determined to have been inactive over the extended period of time, if any of the other storage devices at the same first server have been determined to have been active over the same extended period of time; and generate a fault indication for each storage device that was determined to have be inactive over the extended period of time while another storage device at the same first server was determined to have been active during the same extended period of time.

2. The system of claim 1, wherein each first server comprises: main memory comprising: an operating system comprising a kernel that exposes the lower-level health metrics; and a proc file system that serves as an interface to the kernel, wherein the lower-level health metrics from the kernel are exposed to user space through the proc file system that is maintained in the main memory; and wherein each first server is configured to: monitor lower-level health metrics for each of the storage devices by sampling the lower-level health metrics at regular intervals from the proc file system via a collection daemon that runs at the first server.

3. The system of claim 1, wherein the lower-level health metrics comprise: a cumulative number of reads made by each storage device during the period of time of the observation interval; a cumulative number of writes completed by each storage device during the period of time of the observation interval; a cumulative volume of reads made by each storage device during the period of time of the observation interval; a cumulative volume of writes completed by each storage device during the period of time of the observation interval; a total time spent reading by each storage device during the period of time of the observation interval; and a total time spent writing by each storage device during the period of time of the observation interval.

4. The system of claim 1, wherein the high-level health metrics comprise: a number of read and write operations per second at each storage device during the period of time of the observation interval; read and write volumes per second at each storage device during the period of time of the observation interval; read and write queue sizes at each storage device during the period of time of the observation interval; read and write request service time at each storage device during the period of time of the observation interval; and a percent utilization of each storage device during the period of time of the observation interval.

5. The system of claim 4, wherein each first server is configured to: derive the number of read operations per second at each storage device by dividing the number of read operations by the period of time of the observation interval; derive the number of write operations per second at each storage device by dividing the number of write operations by the period of time of the observation interval; derive a utilization of each storage device by computing a ratio of the time the particular storage device is busy performing I/O operations to the observation interval; derive a percent utilization of each storage device by multiplying the utilization of each storage device by one-hundred; derive an average service time for a read request at each storage device by dividing the utilization of each storage device by the number of read operations per second; derive the average service time for a write request at each storage device by dividing the utilization of each storage device by the number of write operations per second; derive a read volume per second at each storage device by dividing a read volume by the period of time of the observation interval; derive a write volume per second at each storage device by dividing a write volume by the period of time of the observation interval; derive an average read queue size at each storage device by multiplying a read request arrival rate by an average wait time for a read request, wherein the read request arrival rate is equal to an inverse of a time between consecutive read requests; and derive an average write queue size at each storage device by multiplying a write request arrival rate by an average wait time for a write request, wherein the write request arrival rate is equal to an inverse of a time between consecutive write requests.

6. The system of claim 1, wherein the second server is configured to determine whether each particular storage device has been inactive during the extended period of time based on evaluation of a particular one of the high-level health metrics for that storage device that indicates that this particular storage device has been inactive over the extended period of time.

7. The system of claim 1, wherein the second server of the monitoring system is configured to determine whether each particular storage device has been inactive during the extended period of time based on a combination of the high-level health metrics for that particular storage device.

8. The system of claim 1, wherein the second server is configured to: determine, for each of the storage devices that are determined to have been inactive over the extended period of time, if a majority of the other storage devices at the same first server have been determined to have been active over the same extended period of time; and when the second server determines that the majority of the other storage devices at the same first server have been active over the extended period of time: generate the fault indication for each storage device that was determined to have be inactive over the extended period of time while the majority of the other storage devices at the same first server were determined to have been active during the same extended period of time, wherein each fault indication indicates that a particular storage device has failed via a device identifier that identifies that particular storage device.

9. The system of claim 1, wherein the second server is further configured to communicate an alert message for each of the particular storage devices for which a fault indication was generated, wherein each alert message provides an alert to relevant service owners about the failure a particular storage device.

10. A computer-implemented method for detecting one or more faulty storage devices in redundant array of independent disks (RAID) that is implem
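
The arithmetic recited in claim 5 can be illustrated with a short Python sketch. It assumes the lower-level counters have already been reduced to per-interval totals; the `IntervalCounters` type, its field names, the byte and second units, and the choice to return `None` when a rate is zero are all hypothetical and not taken from the patent.

```python
from dataclasses import dataclass


@dataclass
class IntervalCounters:
    """Assumed per-interval totals for one storage device."""
    reads: int            # read operations in the interval
    writes: int           # write operations in the interval
    read_bytes: int       # read volume in the interval
    write_bytes: int      # write volume in the interval
    busy_seconds: float   # time the device spent performing I/O


def derive_hlms(c, interval_s):
    """Derive high-level metrics following the ratios listed in claim 5."""
    if interval_s <= 0:
        raise ValueError("observation interval must be positive")
    read_iops = c.reads / interval_s
    write_iops = c.writes / interval_s
    utilization = c.busy_seconds / interval_s
    return {
        "read_iops": read_iops,
        "write_iops": write_iops,
        "read_bytes_per_s": c.read_bytes / interval_s,
        "write_bytes_per_s": c.write_bytes / interval_s,
        "utilization": utilization,
        "percent_utilization": utilization * 100,
        # Service time = utilization / operations per second.
        # Undefined for an idle device, so None is returned (an assumption).
        "read_service_time_s": utilization / read_iops if read_iops else None,
        "write_service_time_s": utilization / write_iops if write_iops else None,
    }


def average_queue_size(arrival_rate_per_s, avg_wait_s):
    """Average queue size = request arrival rate x average wait time."""
    return arrival_rate_per_s * avg_wait_s


if __name__ == "__main__":
    counters = IntervalCounters(
        reads=600, writes=300,
        read_bytes=6_000_000, write_bytes=3_000_000,
        busy_seconds=30.0,
    )
    for name, value in derive_hlms(counters, interval_s=60.0).items():
        print(f"{name}: {value}")
```

The queue-size helper takes the arrival rate and average wait time as inputs because the claim defines queue size in terms of those two quantities rather than in terms of the raw counters.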

Assignees

Inventors

Classifications

  • Reconfiguring to eliminate the error (group management mechanisms in a peer-to-peer network H04L67/1044) · CPC title

  • Reliability or availability analysis · CPC title

  • Means for error signaling, e.g. using interrupts, exception flags, dedicated error registers · CPC title

  • switching over of hardware resources · CPC title

  • in a storage system, e.g. in a DASD or network based storage system (drivers for digital recording or reproducing units G06F3/06; circuits for error detection or correction within digital recording or reproducing units G11B20/18; for distributed storage of data in networks, e.g. transport arrangements for network file system [NFS], storage area networks [SAN] or network attached storage [NAS], H04L67/1097) · CPC title

Patent family

Related publications grouped by family.

External sources

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US9766965B2 cover?
In an enterprise environment that includes multiple data centers each having a number of first servers, computer-implemented methods and systems are provided for detecting faulty storage device(s) that are implemented as redundant array of independent disks (RAID) in conjunction with each of the first servers. Each first server monitors lower-level health metrics (LHMs) for each of the storage …
Who is the assignee on this patent?
Salesforce Com Inc
What technology area does this patent fall under?
Primary CPC classification G06F11/0772. Mapped technology areas include Physics.
When was this patent published?
Publication date: Sep 19, 2017 (B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 5 related publications on this page (citations in our corpus or others sharing the same primary CPC).