Remote one-sided persistent writes
US-2019102087-A1 · Apr 4, 2019 · US
US11042501B2 · US · B2
| Field | Value |
|---|---|
| Publication number | US-11042501-B2 |
| Application number | US-202016838752-A |
| Country | US |
| Kind code | B2 |
| Filing date | Apr 2, 2020 |
| Priority date | Mar 26, 2018 |
| Publication date | Jun 22, 2021 |
| Grant date | Jun 22, 2021 |
Official abstract text for this publication.
Distributed storage systems, devices, and associated methods of data replication are disclosed herein. In one embodiment, a server in a distributed storage system is configured to write, with an RDMA enabled NIC, a block of data from a memory of the server to a memory at another server via an RDMA network. Upon completion of writing the block of data to the another server, the server can also send metadata representing a memory location and a data size of the written block of data in the memory of the another server via the RDMA network. The sent metadata is to be written into a memory location containing data representing a memory descriptor that is a part of a data structure representing a pre-posted work request configured to write a copy of the block of data from the another server to an additional server via the RDMA network.
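The flow in the abstract can be illustrated with a small in-process simulation. This is only a sketch under stated assumptions, not the patented implementation and not a real RDMA API: `Node`, `WorkRequest`, `rdma_write`, `rdma_write_metadata`, and `replicate` are hypothetical names, memory is a plain `bytearray`, and every node is assumed to store the block at the same offset. It shows the key idea that the metadata write fills in the memory descriptor of a pre-posted work request, and that completing this write is what triggers forwarding to the next server.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class WorkRequest:
    """Hypothetical pre-posted write; a remote peer fills in its descriptor later."""
    target: "Node"
    addr: int = 0    # memory descriptor: offset of the block in local memory
    length: int = 0  # memory descriptor: size of the block in bytes


@dataclass
class Node:
    name: str
    memory: bytearray = field(default_factory=lambda: bytearray(64))
    forward_wr: Optional[WorkRequest] = None  # None on the last node in the chain

    def rdma_write(self, addr: int, data: bytes) -> None:
        """Model an incoming one-sided write landing in this node's memory."""
        self.memory[addr:addr + len(data)] = data

    def rdma_write_metadata(self, addr: int, length: int) -> None:
        """Model the metadata write that patches the pre-posted descriptor.

        Completion of this write triggers the forwarding work request.
        """
        if self.forward_wr is None:
            return  # last node: nothing left to forward
        self.forward_wr.addr = addr
        self.forward_wr.length = length
        self._execute_forward()

    def _execute_forward(self) -> None:
        """Run the pre-posted work request using the patched descriptor."""
        wr = self.forward_wr
        block = bytes(self.memory[wr.addr:wr.addr + wr.length])
        wr.target.rdma_write(wr.addr, block)
        wr.target.rdma_write_metadata(wr.addr, wr.length)


def replicate(first: Node, second: Node, addr: int, data: bytes) -> None:
    """First server: write the block, then send its location and size."""
    first.memory[addr:addr + len(data)] = data
    second.rdma_write(addr, data)
    second.rdma_write_metadata(addr, len(data))


third = Node("third")
second = Node("second", forward_wr=WorkRequest(target=third))
first = Node("first")
replicate(first, second, addr=8, data=b"block-A")
```

After `replicate` returns, the block is present in the memory of all three simulated nodes, and the only code that ran on behalf of `second` and `third` was the modeled NIC path, standing in for the claim that their processors are not involved.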
Opening claim text (preview).
We claim:
1. A method for data replication in a distributed storage system having a plurality of storage nodes interconnected by an remote direct memory access (“RDMA”) network, the storage nodes individually having a processor, a memory, and an RDMA enabled network interface card (“RNIC”) operatively coupled to one another, the method comprising: receiving, from a first RNIC at a first storage node, a block of data from a first memory at the first storage node to a second memory at a second storage node via a second RNIC interconnected to the first RNIC in the RDMA network, the second memory having a data structure representing a pre-posted work request for writing a copy of the block of data to a third storage node; and metadata representing a memory location and a data size of the block of data in the second memory received from the first memory via the second RNIC, wherein the metadata is received into a memory region of the second memory holding data of a memory descriptor of the data structure representing the pre-posted work request in the second memory; and upon completion of receiving the metadata, writing, from the second RNIC, a copy of the block of data to a third memory at the third storage node via a third RNIC interconnected to the second RNIC in the RDMA network, thereby achieving replication of the block of data in the distributed storage system without using the processors at the second and third storage nodes.
2. The method of claim 1, further comprising: receiving, from a client device, data representing an update to a data object stored in the distributed storage system; storing, in the first memory, a copy of the received data as the block of data at the first storage node; and executing, at the first RNIC, another data structure representing another work request for writing the block of data to the second storage node, the another work request having parameters representing the memory address and data size of the block of data to be written in the second memory of the second storage node.
3. The method of claim 1 wherein modifying the file descriptor includes: receiving, at the second storage node, the metadata from the first RNIC; identifying, at the second RNIC, the memory location containing data representing the memory descriptor; and updating, at the second memory, the memory descriptor at the identified memory location with the received metadata from the first RNIC.
4. The method of claim 1 wherein writing the copy of the block of data to the third memory includes: determining whether the modification of the memory descriptor is completed; and in response to determining that the modification of the memory descriptor is completed, at the second RNIC, automatically triggering writing, from the second storage node, the copy of the block of data to the third memory without using the processor at the second storage node.
5. The method of claim 1 wherein writing the copy of the block of data to the third memory includes: determining whether the modification of the memory descriptor is completed; and in response to determining that the modification of the memory descriptor is completed, at the second RNIC, automatically triggering a conditional execution work request at the second storage node; and upon execution of the conditional execution work request at the second RNIC, writing, from the second storage node, the copy of the block of data to the third memory without using the processor at the second storage node.
6. The method of claim 1 wherein writing the copy of the block of data to the third memory includes: determining whether the modification of the memory descriptor is completed; and in response to determining that the modification of the memory descriptor is completed, at the second RNIC, automatically triggering a conditional execution work request at the second storage node, the conditional execution work request being pre-posted at the second storage node as a trigger for executing the pre-posted work request for writing the copy of the block of data to the third storage node; and upon execution of the conditional execution work request at the second RNIC, writing, from the second storage node, the copy of the block of data to the third memory without using the processor at the second storage node.
7. The method of claim 1, further comprising: upon completion of modifying the memory descriptor, sending, from the second RNIC, another metadata representing another memory address and data size of the written copy of the block of data in the third memory, to the third storage node via the third RNIC.
8. The method of claim 1, further comprising: at the third storage node, determining whether the third storage node is a last storage node in a replication group that includes the first, second, and third storage nodes; and in response to determining that the third storage node is a last storage node in a replication group, transmitting a notification representing an acknowledgement of receiving a copy of the block of data at the second and third storage node to the first storage node upon receiving the sent metadata from the second storage node.
9. The method of claim 8, further comprising: at the first storage node, upon receiving the notification representing the acknowledgement from the third storage node, sending additional metadata to the second storage node for committing the block of data to a persistent storage in the second storage node, the additional metadata including data representing a replication group ID, a memory offset of a memory region and a destination memory region, and a size of the block of data being copied; and at the second storage node, upon receiving the additional metadata, updating another memory descriptor of another data structure representing another pre-posted work request for writing the copy of the block of data to the persistent storage at the second storage node; and upon completion of updating the another memory descriptor, automatically triggering, by the second RNIC, the another pre-posted work request to write the copy of the block of data to the persistent storage at the second storage node.
10. A method for data replication in a distributed storage system having a plurality of storage nodes interconnected by an remote direct memory access (“RDMA”) network, the storage nodes individually having a processor, a memory, and an RDMA enabled network interface card (“RNIC”) operatively coupled to one another, the method comprising: receiving, from a first RNIC at a first storage node, metadata to a memory at a second storage node via a second RNIC interconnected to the first RNIC in the RDMA network, the metadata representing memory offsets of a source region and a destination region and a data size of data to be moved, wherein the received metadata modifies data of a memory descriptor in the memory, the memory descriptor being a part of a data structure representing a pre-posted work request for writing a block of data from a first memory region to a second memory region in the memory of the second storage node; and upon receiving the metadata, automatically triggering writing, by the second RNIC, a block of data having the data size from the source region of the memory to the destination region of the memory at the second storage node according to the corresponding memory offsets included in the metadata without using the processor for the writing operation at the second storage node.
11. The method of claim 10 wherein modifying the memory descriptor includes: identifying, at the second RNIC, the memory location containing data representing the memory descriptor; and updating, at th
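The local move described in claim 10 can be sketched in the same simulated style. This is an illustrative model under assumptions, not the claimed implementation: `CopyDescriptor`, `StorageNode`, and `receive_commit_metadata` are hypothetical names, and a `bytearray` slice copy stands in for the write the second RNIC would perform. The received metadata (source offset, destination offset, size) overwrites the descriptor of a pre-posted copy request, and receipt of that metadata triggers the copy.

```python
from dataclasses import dataclass


@dataclass
class CopyDescriptor:
    """Hypothetical memory descriptor of a pre-posted local copy request."""
    src_offset: int = 0
    dst_offset: int = 0
    size: int = 0


class StorageNode:
    def __init__(self, mem_size: int = 64) -> None:
        self.memory = bytearray(mem_size)
        # Pre-posted before any data arrives; its fields are still empty.
        self.copy_wr = CopyDescriptor()

    def receive_commit_metadata(self, src_offset: int, dst_offset: int, size: int) -> None:
        """Model the metadata landing in the descriptor, then trigger the copy."""
        self.copy_wr = CopyDescriptor(src_offset, dst_offset, size)
        self._run_copy()

    def _run_copy(self) -> None:
        """Move `size` bytes from the source region to the destination region."""
        d = self.copy_wr
        block = bytes(self.memory[d.src_offset:d.src_offset + d.size])
        self.memory[d.dst_offset:d.dst_offset + d.size] = block


node = StorageNode()
node.memory[0:5] = b"hello"  # block already replicated into a staging region
node.receive_commit_metadata(src_offset=0, dst_offset=32, size=5)
```

In this model the destination region at offset 32 stands in for the region being committed to; how that region maps to persistent storage is outside the sketch.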
Improving I/O performance · CPC title
Replication, distribution or synchronisation of data between databases or within a distributed database system; Distributed database system architectures therefor · CPC title
Distributed or networked storage systems, e.g. storage area networks [SAN], network attached storage [NAS] · CPC title
in relation to data integrity, e.g. data losses, bit errors · CPC title
Replication or mirroring of data, e.g. scheduling or transport for data synchronisation between network nodes · CPC title