Data storage system with metadata check-pointing

US11941278B2 · US · B2

Patent metadata
Field: Value
Publication number: US-11941278-B2
Application number: US-202117520537-A
Country: US
Kind code: B2
Filing date: Nov 5, 2021
Priority date: Jun 28, 2019
Publication date: Mar 26, 2024
Grant date: Mar 26, 2024

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

A data storage system includes multiple head nodes and data storage sleds. Volume data is replicated between a primary and one or more secondary head nodes for a volume partition and is further flushed to a set of mass storage devices of the data storage sleds. Volume metadata is maintained in a primary and one or more secondary head nodes for a volume partition and is updated in response to volume data being flushed to the data storage sleds. Also, the primary and secondary head nodes store check-points of volume metadata to the data storage sleds, wherein in response to a failure of a primary or secondary head node for a volume partition, a replacement secondary head node for the volume partition recreates a secondary replica for the volume partition based, at least in part, on a stored volume metadata checkpoint.
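The flow described in the abstract can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the patented implementation: the names `Sled`, `HeadNode`, `flush`, `checkpoint`, and `recreate_secondary` are hypothetical, the sled is modelled as two in-memory dictionaries, and the sketch ignores volume data that has not yet been flushed when a head node fails.

```python
from dataclasses import dataclass, field


@dataclass
class Sled:
    """Hypothetical stand-in for the mass storage devices of a data storage sled."""
    blocks: dict = field(default_factory=dict)       # location -> flushed volume data
    checkpoints: dict = field(default_factory=dict)  # partition id -> metadata copy


@dataclass
class HeadNode:
    """Holds one replica of a volume partition in a log-structured store."""
    name: str
    partition: str
    data_log: dict = field(default_factory=dict)  # volume data portion: offset -> bytes
    metadata: dict = field(default_factory=dict)  # metadata portion: offset -> location

    def write(self, offset, payload):
        # New writes land on the head node first; metadata points at the local log.
        self.data_log[offset] = payload
        self.metadata[offset] = ("head", self.name)

    def flush(self, sled):
        # Move volume data to the sled, then record the new location in metadata.
        for offset, payload in list(self.data_log.items()):
            location = f"{self.partition}/{offset}"
            sled.blocks[location] = payload
            self.metadata[offset] = ("sled", location)
            del self.data_log[offset]

    def checkpoint(self, sled):
        # Store a copy of the metadata portion on the sled; no flush is required.
        sled.checkpoints[self.partition] = dict(self.metadata)


def recreate_secondary(name, partition, sled):
    """Build a replacement replica from the last stored metadata checkpoint."""
    replica = HeadNode(name, partition)
    replica.metadata = dict(sled.checkpoints[partition])
    return replica


sled = Sled()
primary = HeadNode("hn-1", "vol-a/p0")
primary.write(0, b"hello")
primary.flush(sled)
primary.checkpoint(sled)
replacement = recreate_secondary("hn-3", "vol-a/p0", sled)
```

After these calls, the replacement replica's metadata points at the sled location of the flushed data, so it can serve reads for that offset without contacting the failed head node. Keeping `checkpoint` separate from `flush` mirrors the abstract's distinction between flushing volume data and storing metadata check-points.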

First claim

Opening claim text (preview).

What is claimed is:

1. A data storage system, comprising: a plurality of head nodes; a plurality of mass storage devices, wherein for a volume partition stored in the data storage system, a first and second head node of the plurality of head nodes are configured to: store data for a replica of the volume partition in a log-structured storage of the respective first or second head node, wherein the log-structured storage comprises a volume data portion and a metadata portion; and wherein the first head node is configured to store, to one or more of the plurality of mass storage devices, a copy of the metadata portion of the volume partition; a failure detection agent configured to: detect a failed one of the plurality of head nodes based on a failure of the failed head node to respond to a ping from the failure detection agent; and indicate to a plurality of remaining ones of the plurality of head nodes that the failed head node has failed, wherein the plurality of remaining ones of the plurality of head nodes are each configured to: identify volume partitions for which replicas are stored on the failed head node; and initiate, for the identified volume partitions, the designation of a replacement replica for the identified volume partitions on respective ones of the remaining head nodes.

2. The data storage system of claim 1, wherein the plurality of remaining ones of the plurality of head nodes are further configured to: generate a log-structured storage for the replacement replica based on one or more copies of the metadata portions stored on the one or more mass storage devices.

3. The data storage system of claim 1, wherein the first head node is configured to perform a metadata checkpoint operation, wherein the storing of the copy of the metadata portion of the log-structured storage to the one or more mass storage devices is part of the metadata checkpoint operation performed by the first head node, and wherein the first head node is configured to independently perform the metadata checkpoint operation, independent from performing a flush operation.

4. The data storage system of claim 1, wherein the ping comprises: a verification that an active network connection exists to a respective head node being pinged.

5. The data storage system of claim 1, wherein the ping comprises: a query to an operating system of a respective head node being pinged.

6. The data storage system of claim 1, wherein the ping comprises: a set of queries directed to individual replicas stored on a respective head node being pinged.

7. The data storage system of claim 1, wherein the ping comprises: a request for performance information directed to a respective head node being pinged, wherein a failure to provide the requested performance information is interpreted as an indication of a failure at the respective head node being pinged.

8. The data storage system of claim 1, wherein the first head node is configured to: perform said store a copy of the metadata portion of the log-structured storage for the primary replica based on an amount of metadata stored in the first head node, but not yet copied to the mass storage devices, exceeding a threshold amount of stored but not yet copied metadata, and perform a flush operation based on an amount of volume data stored in the log-structured storage for the primary replica exceeding a threshold amount of stored volume data.

9. The data storage system of claim 8, wherein the first head node is further configured to perform a flush operation, wherein to perform the flush operation, the first head node is configured to: read data stored for the volume partition from the volume data portion of the log-structured storage of the first head node; cause the data read from the volume data portion of the log-structured storage of the first head node to be written to a set of the mass storage devices; and update the metadata portion of the log-structured storage of the first head node to indicate one or more locations at which the data read from the volume data portion is stored on the set of mass storage devices.

10. The data storage system of claim 9, wherein the first head node is configured to perform said storing the copy of the metadata portion, for the replica of the volume partition, to the mass storage devices independently of performing the flush operation for the replica of the volume partition.

11. A method, comprising: storing data for respective replicas of respective volume partitions in log-structured storages of respective head nodes of a data storage system, wherein the log-structured storages of the head nodes comprise a volume data portion and a metadata portion; storing, to one or more mass storage devices of the data storage system, respective copies of the metadata portions of the volume partitions; detecting a failed one of the plurality of head nodes based on a failure of the failed head node to respond to a ping; and initiating, for one or more identified volume partitions having a replica stored on the failed head node, one or more replacement replicas for the one or more identified volume partitions, wherein the one or more replacement replicas are implemented on one or more respective remaining head nodes of the data storage system that were not detected to be failed, and wherein volume metadata for the one or more replacement replicas is re-mirrored from one or more of the respective copies of the metadata portions stored on the one or more mass storage devices of the data storage system.

12. The method of claim 11, further comprising: generating a log-structured storage for the one or more replacement replicas based on the one or more copies of the metadata portions stored on the one or more mass storage devices of the data storage system.

13. The method of claim 11, wherein the ping comprises: a verification that an active network connection exists to a respective head node being pinged.

14. The method of claim 11, wherein the ping comprises: a query to an operating system of a respective head node being pinged.

15. The method of claim 11, wherein the ping comprises: a set of queries directed to individual replicas stored on a respective head node being pinged.

16. The method of claim 11, wherein the ping comprises: a request for performance information directed to a respective head node being pinged, wherein a failure to provide the requested performance information is interpreted as an indication of a failure at the respective head node being pinged.

17. A non-transitory, computer-readable medium storing program instructions that, when executed on or across one or more processors, cause the one or more processors to: detect a failed one of a plurality of head nodes of a data storage system based on a failure of the failed head node to respond to a ping; and initiate, for one or more identified volume partitions having a replica stored on the failed head node, generation of one or more replacement replicas for the one or more identified volume partitions, wherein the one or more replacement replicas are implemented on one or more respective remaining head nodes of the data storage system that were not detected to be failed, and wherein volume metadata for the replacement replicas is re-mirrored from one or more of respective copies of metadata portions stored on one or more mass storage devices of the data storage system.

18. The non-transitory computer-readable media of claim 17, wherein the program instructions, when exec
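The failure handling recited in claims 1, 11, and 17 (ping each head node, tell the remaining nodes which one failed, and have them designate replacement replicas) could look roughly like the Python sketch below. Everything here is an assumption made for illustration: the class names, the `ping` callable that returns `False` for an unresponsive node, the per-node `placement` map, and the rule of picking the alphabetically first eligible survivor are not taken from the patent.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Set


@dataclass
class HeadNode:
    name: str
    # This node's view of the cluster: partition id -> names of nodes holding a replica.
    placement: Dict[str, Set[str]] = field(default_factory=dict)

    def on_peer_failed(self, failed: str, survivors: List[str]) -> Dict[str, str]:
        """Identify partitions with a replica on the failed node and pick replacements."""
        designated = {}
        for partition, holders in sorted(self.placement.items()):
            if failed not in holders:
                continue
            # Eligible hosts: surviving nodes that do not already hold a replica.
            candidates = sorted(n for n in survivors if n not in holders)
            if candidates:
                holders.discard(failed)
                holders.add(candidates[0])
                designated[partition] = candidates[0]
        return designated


class FailureDetectionAgent:
    def __init__(self, nodes: Dict[str, HeadNode], ping: Callable[[str], bool]):
        self.nodes = nodes
        self.ping = ping  # hypothetical probe; False means the node did not respond

    def sweep(self) -> Dict[str, Dict[str, str]]:
        """Ping every node, then notify the remaining nodes of each failure."""
        failed = [name for name in sorted(self.nodes) if not self.ping(name)]
        survivors = [name for name in sorted(self.nodes) if name not in failed]
        results: Dict[str, Dict[str, str]] = {}
        for dead in failed:
            for name in survivors:
                results.setdefault(dead, {}).update(
                    self.nodes[name].on_peer_failed(dead, survivors))
        return results


placement = {"p0": {"hn-1", "hn-2"}, "p1": {"hn-2", "hn-3"}}
nodes = {
    n: HeadNode(n, {p: set(h) for p, h in placement.items()})
    for n in ["hn-1", "hn-2", "hn-3", "hn-4"]
}
agent = FailureDetectionAgent(nodes, ping=lambda name: name != "hn-2")
replacements = agent.sweep()
```

Because every surviving node applies the same deterministic rule to its own copy of the placement map, all of them arrive at the same replacement hosts in this sketch. A real system would need an agreed-upon coordination mechanism, which the claim text shown on this page does not detail. The `ping` callable could stand in for any of the variants in claims 4 to 7, such as a connection check or a performance query.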

Assignees

  • Amazon Tech Inc

Inventors

Classifications

  • G06F3/0644 · Primary

    Management of space entities, e.g. partitions, extents, pools · CPC title

  • in relation to availability · CPC title

  • by allocating resources to storage systems · CPC title

  • Replication mechanisms · CPC title

  • Distributed or networked storage systems, e.g. storage area networks [SAN], network attached storage [NAS] · CPC title

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US11941278B2 cover?
A data storage system includes multiple head nodes and data storage sleds. Volume data is replicated between a primary and one or more secondary head nodes for a volume partition and is further flushed to a set of mass storage devices of the data storage sleds. Volume metadata is maintained in a primary and one or more secondary head nodes for a volume partition and is updated in response to vo…
Who is the assignee on this patent?
Amazon Tech Inc
What technology area does this patent fall under?
Primary CPC classification G06F3/0644. Mapped technology areas include Physics.
When was this patent published?
Publication date Mar 26, 2024 (kind code B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 12 related publications on this page (citations in our corpus or others sharing the same primary CPC).