System and method for monitoring and detecting faulty storage devices

US9766965B2 · US · B2

Patent metadata
Field	Value
Publication number	US-9766965-B2
Application number	US-201514952190-A
Country	US
Kind code	B2
Filing date	Nov 25, 2015
Priority date	Nov 25, 2015
Publication date	Sep 19, 2017
Grant date	Sep 19, 2017

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

Title
What the patent document calls the invention.
Abstract
A short plain-language summary of the technical disclosure.
Assignees and inventors
Who owns or filed the patent and who is credited as inventor.
Key dates
Filing, priority, publication, and grant dates set the timeline.
First independent claim
The legal scope of protection — read this for what is actually claimed.
CPC / IPC classifications
Technology tags used to group this patent with similar filings.
Citations and related patents
Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

In an enterprise environment that includes multiple data centers each having a number of first servers, computer-implemented methods and systems are provided for detecting faulty storage device(s) that are implemented as redundant array of independent disks (RAID) in conjunction with each of the first servers. Each first server monitors lower-level health metrics (LHMs) for each of the storage devices that characterize read and write activity of each storage device over a period of time. The LHMs are used to generate high-level health metrics (HLMs) for each of the storage devices that are indicative of activity of each storage device over the period of time. Second server(s) of a monitoring system can use the HLMs to determine whether each of the storage devices have been inactive or active, and can generate a fault indication for any storage devices that were determined to be inactive while storage device(s) at the same first server were determined to be active.
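The detection rule in the abstract can be sketched as follows. This is a minimal illustration, not the patented implementation: the function name, the input layout (a mapping from server and device identifiers to percent utilization over the extended period), and the activity threshold are all assumptions made for the example. The optional `require_majority` flag mirrors the variant in which a majority of peer devices, rather than any one peer, must have been active.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# Hypothetical threshold: a device counts as "active" when its percent
# utilization over the extended period is above this value.
ACTIVE_UTILIZATION_PCT = 0.0

DeviceKey = Tuple[str, str]  # (server_id, device_id), both hypothetical names


def find_faulty_devices(
    utilization_pct: Dict[DeviceKey, float],
    require_majority: bool = False,
) -> List[DeviceKey]:
    """Flag devices that were inactive while peers on the same server were active."""
    # Group the active/inactive decision for every device by its server.
    active_by_server: Dict[str, Dict[str, bool]] = defaultdict(dict)
    for (server, device), pct in utilization_pct.items():
        active_by_server[server][device] = pct > ACTIVE_UTILIZATION_PCT

    faults: List[DeviceKey] = []
    for server, devices in active_by_server.items():
        for device, is_active in devices.items():
            if is_active:
                continue
            # Activity flags of the other devices at the same server.
            peers = [flag for name, flag in devices.items() if name != device]
            active_peers = sum(peers)
            if require_majority:
                flagged = active_peers > len(peers) / 2
            else:
                flagged = active_peers >= 1
            if flagged:
                faults.append((server, device))
    return sorted(faults)
```

A server whose devices are all idle produces no fault indication, since idleness there is more plausibly explained by low load than by a failed disk.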

First claim

Opening claim text (preview).

What is claimed:

1. A system, comprising: a plurality of data centers, wherein each data center comprises: a plurality of first servers, wherein each first server is associated with redundant array of independent disks (RAID) that are implemented in conjunction with that first server, and wherein each first server is configured to: monitor lower-level health metrics that characterize read and write activity for each storage device at that server over a period of time of an observation interval; and process the lower-level health metrics to generate high-level health metrics for each storage device that are indicative of activity of each storage device over the period of time of the observation interval; a local metric collection database configured to receive the lower-level health metrics from each of the servers for that data center; a database that is configured to receive and store the lower-level health metrics from each of the local metric collection databases for each of the data centers; a monitoring system comprising at least one second server being configured to: determine, for each of the storage devices based on one or more of the high-level health metrics for that storage device, whether each particular storage device has been inactive over an extended period of time; determine, for each of the storage devices that are determined to have been inactive over the extended period of time, if any of the other storage devices at the same first server have been determined to have been active over the same extended period of time; and generate a fault indication for each storage device that was determined to have be inactive over the extended period of time while another storage device at the same first server was determined to have been active during the same extended period of time.

2. The system of claim 1, wherein each first server comprises: main memory comprising: an operating system comprising a kernel that exposes the lower-level health metrics; and a proc file system that serves as an interface to the kernel, wherein the lower-level health metrics from the kernel are exposed to user space through the proc file system that is maintained in the main memory; and wherein each first server is configured to: monitor lower-level health metrics for each of the storage devices by sampling the lower-level health metrics at regular intervals from the proc file system via a collection daemon that runs at the first server.

3. The system of claim 1, wherein the lower-level health metrics comprise: a cumulative number of reads made by each storage device during the period of time of the observation interval; a cumulative number of writes completed by each storage device during the period of time of the observation interval; a cumulative volume of reads made by each storage device during the period of time of the observation interval; a cumulative volume of writes completed by each storage device during the period of time of the observation interval; a total time spent reading by each storage device during the period of time of the observation interval; and a total time spent writing by each storage device during the period of time of the observation interval.

4. The system of claim 1, wherein the high-level health metrics comprise: a number of read and write operations per second at each storage device during the period of time of the observation interval; read and write volumes per second at each storage device during the period of time of the observation interval; read and write queue sizes at each storage device during the period of time of the observation interval; read and write request service time at each storage device during the period of time of the observation interval; and a percent utilization of each storage device during the period of time of the observation interval.

5. The system of claim 4, wherein each first server is configured to: derive the number of read operations per second at each storage device by dividing the number of read operations by the period of time of the observation interval; derive the number of write operations per second at each storage device by dividing the number of write operations by the period of time of the observation interval; derive a utilization of each storage device by computing a ratio of the time the particular storage device is busy performing I/O operations to the observation interval; derive a percent utilization of each storage device by multiplying the utilization of each storage device by one-hundred; derive an average service time for a read request at each storage device by dividing the utilization of each storage device by the number of read operations per second; derive the average service time for a write request at each storage device by dividing the utilization of each storage device by the number of write operations per second; derive a read volume per second at each storage device by dividing a read volume by the period of time of the observation interval; derive a write volume per second at each storage device by dividing a write volume by the period of time of the observation interval; derive an average read queue size at each storage device by multiplying a read request arrival rate by an average wait time for a read request, wherein the read request arrival rate is equal to an inverse of a time between consecutive read requests; and derive an average write queue size at each storage device by multiplying a write request arrival rate by an average wait time for a write request, wherein the write request arrival rate is equal to an inverse of a time between consecutive write requests.

6. The system of claim 1, wherein the second server is configured to determine whether each particular storage device has been inactive during the extended period of time based on evaluation of a particular one of the high-level health metrics for that storage device that indicates that this particular storage device has been inactive over the extended period of time.

7. The system of claim 1, wherein the second server of the monitoring system is configured to determine whether each particular storage device has been inactive during the extended period of time based on a combination of the high-level health metrics for that particular storage device.

8. The system of claim 1, wherein the second server is configured to: determine, for each of the storage devices that are determined to have been inactive over the extended period of time, if a majority of the other storage devices at the same first server have been determined to have been active over the same extended period of time; and when the second server determines that the majority of the other storage devices at the same first server have been active over the extended period of time: generate the fault indication for each storage device that was determined to have be inactive over the extended period of time while the majority of the other storage devices at the same first server were determined to have been active during the same extended period of time, wherein each fault indication indicates that a particular storage device has failed via a device identifier that identifies that particular storage device.

9. The system of claim 1, wherein the second server is further configured to communicate an alert message for each of the particular storage devices for which a fault indication was generated, wherein each alert message provides an alert to relevant service owners about the failure a particular storage device.

10. A computer-implemented method for detecting one or more faulty storage devices in redundant array of independent disks (RAID) that is implem
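The derivations recited in claim 5 are simple ratios, and a short sketch may make them easier to follow. The code below is illustrative only: the class, field, and function names are hypothetical, the inputs are assumed to be per-interval totals in seconds and bytes, and returning zero for a division by zero is a choice made for this example rather than something the claims specify.

```python
from dataclasses import dataclass


@dataclass
class IntervalCounters:
    """Lower-level counters for one device over one observation interval.

    Field names are hypothetical; the claims only name the quantities.
    """
    reads: int            # number of read operations in the interval
    writes: int           # number of write operations in the interval
    read_bytes: int       # read volume in the interval
    write_bytes: int      # write volume in the interval
    busy_seconds: float   # time the device was busy performing I/O


def _ratio(numerator: float, denominator: float) -> float:
    """Divide, returning 0.0 for a zero denominator (an assumption of this sketch)."""
    return numerator / denominator if denominator else 0.0


def derive_high_level_metrics(c: IntervalCounters, interval_s: float) -> dict:
    """Apply the claim 5 ratios to one interval's counters."""
    if interval_s <= 0:
        raise ValueError("observation interval must be positive")
    read_ops_per_s = c.reads / interval_s
    write_ops_per_s = c.writes / interval_s
    utilization = c.busy_seconds / interval_s
    return {
        "read_ops_per_s": read_ops_per_s,
        "write_ops_per_s": write_ops_per_s,
        "read_bytes_per_s": c.read_bytes / interval_s,
        "write_bytes_per_s": c.write_bytes / interval_s,
        "utilization": utilization,
        "percent_utilization": utilization * 100,
        # Service time = utilization divided by operations per second.
        "read_service_time_s": _ratio(utilization, read_ops_per_s),
        "write_service_time_s": _ratio(utilization, write_ops_per_s),
    }


def average_queue_size(time_between_requests_s: float, average_wait_s: float) -> float:
    """Arrival rate (inverse of the inter-request time) multiplied by average wait."""
    arrival_rate = _ratio(1.0, time_between_requests_s)
    return arrival_rate * average_wait_s
```

The queue-size formula is the familiar Little's law relation: the average number of requests waiting equals the arrival rate multiplied by the average wait time.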

Assignees

Salesforce Com Inc

Inventors

Waheed Abdul

Classifications

G06F11/142
Reconfiguring to eliminate the error (group management mechanisms in a peer-to-peer network H04L67/1044) · CPC title
G06F11/008
Reliability or availability analysis · CPC title
G06F11/0772 · Primary
Means for error signaling, e.g. using interrupts, exception flags, dedicated error registers · CPC title
G06F11/2033
switching over of hardware resources · CPC title
G06F11/0727
in a storage system, e.g. in a DASD or network based storage system (drivers for digital recording or reproducing units G06F3/06; circuits for error detection or correction within digital recording or reproducing units G11B20/18; for distributed storage of data in networks, e.g. transport arrangements for network file system [NFS], storage area networks [SAN] or network attached storage [NAS], H04L67/1097) · CPC title

Patent family

Related publications grouped by family.

View patent family 58720702

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US9766965B2 cover? In an enterprise environment that includes multiple data centers each having a number of first servers, computer-implemented methods and systems are provided for detecting faulty storage device(s) that are implemented as redundant array of independent disks (RAID) in conjunction with each of the first servers. Each first server monitors lower-level health metrics (LHMs) for each of the storage …
Who is the assignee on this patent? Salesforce Com Inc
What technology area does this patent fall under? Primary CPC classification G06F11/0772. Mapped technology areas include Physics.
When was this patent published? Publication date Sep 19, 2017 (B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb? We list 5 related publications on this page (citations in our corpus or others sharing the same primary CPC).

Related patents

Techniques for visualizing storage cluster system configurations and events

Accelerated data recovery in a storage system

Techniques for storing and distributing metadata among nodes in a storage cluster system

Techniques for error handling in parallel splitting of storage commands

Storage control apparatus, method of controlling storage system, and computer-readable storage medium storing storage control program
