Scale-out non-uniform memory access

US9734063B2 · US · B2

Patent metadata
Field: Value
Publication number: US-9734063-B2
Application number: US-201514634391-A
Country: US
Kind code: B2
Filing date: Feb 27, 2015
Priority date: Feb 27, 2014
Publication date: Aug 15, 2017
Grant date: Aug 15, 2017

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

A computing system that uses a Scale-Out NUMA (“soNUMA”) architecture, programming model, and/or communication protocol provides for low-latency, distributed in-memory processing. Using soNUMA, a programming model is layered directly on top of a NUMA memory fabric via a stateless messaging protocol. To facilitate interactions between the application, OS, and the fabric, soNUMA uses a remote memory controller—an architecturally-exposed hardware block integrated into the node's local coherence hierarchy.
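The stateless messaging idea in the abstract can be illustrated with a minimal sketch. It is not the patent's implementation: the field names (`ctx_id`, `offset`, `tid`), the `RemoteNode` class, and the use of Python byte arrays as memory are all assumptions made for illustration. The point it shows is that the destination can service each request using only what the request carries plus its own local configuration, here a table of memory segments, with no per-connection state.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Request:
    op: str               # "read" or "write" (hypothetical encoding)
    ctx_id: int           # assumed: identifies the global address space
    offset: int           # assumed: offset into the destination's segment
    length: int           # bytes to read; ignored for writes
    tid: int              # assumed: id echoed back so the sender can match the reply
    payload: bytes = b""  # data to write; empty for reads


@dataclass(frozen=True)
class Reply:
    tid: int
    ok: bool
    payload: bytes = b""


class RemoteNode:
    """Destination node. Its only state is local configuration:
    one memory segment per context id. Nothing is remembered between requests."""

    def __init__(self, segments):
        self.segments = segments  # dict: ctx_id -> bytearray

    def handle(self, req):
        seg = self.segments.get(req.ctx_id)
        size = len(req.payload) if req.op == "write" else req.length
        # Reject unknown contexts and out-of-range accesses.
        if seg is None or size < 0 or req.offset < 0 or req.offset + size > len(seg):
            return Reply(req.tid, ok=False)
        if req.op == "read":
            return Reply(req.tid, True, bytes(seg[req.offset:req.offset + size]))
        if req.op == "write":
            seg[req.offset:req.offset + size] = req.payload
            return Reply(req.tid, True)
        return Reply(req.tid, ok=False)


node = RemoteNode({7: bytearray(16)})
node.handle(Request("write", 7, 4, 0, tid=1, payload=b"abcd"))
print(node.handle(Request("read", 7, 4, 4, tid=2)))
```

Because every request is self-describing, requests in this sketch can be handled in any order and by any handler instance that has access to the same segment table.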

First claim

Opening claim text (preview).

What is claimed is:

1. A computer system forming a node of a multi-node distributed system that is not constrained to be globally cache-coherent, the computer system comprising: a processor that executes memory operations; a local cache that stores data against which the processor executes local memory operations of the memory operations; a remote memory controller, coupled to at least a part of the local cache and to the processor, against which the processor executes remote memory operations of the memory operations, wherein the remote memory controller interacts with the processor using locally cache-coherent interactions; and an interface between the remote memory controller and a network interface wherein the remote memory controller issues stateless requests from the remote memory controller to a remote node of the multi-node distributed system via the network interface and receives stateless replies from the remote node via the network interface, the stateless requests and the stateless replies being stateless in that the remote node can process the stateless requests from data provided with the stateless requests and local configuration state.

2. The computer system of claim 1, wherein the remote memory controller is hardwired.

3. The computer system of claim 1, wherein the remote memory controller is configured with logic for converting memory operations into sets of exchange operations, wherein each exchange operation comprises a stateless request sent from the remote memory controller to the remote node and a stateless reply received from the remote node in response to the stateless request.

4. The computer system of claim 1, wherein the stateless requests are one-sided memory operations that access a partitioned global address space that spans multiple nodes of the multi-node distributed system.

5. A computer-implemented method for low-latency distributed memory, comprising: under control of one or more computer systems configured with executable instructions, enabling remote memory requests through locally cache-coherent interactions being transmitted via a remote memory controller, wherein the remote memory controller is configured to interface directly with an on-die network interface; converting, at the remote memory controller, application commands into remote requests, wherein the remote requests are transmitted to the on-die network interface and wherein the remote requests are stateless requests to a remote node, the stateless requests being stateless in that the remote node can process the stateless requests from data provided with the stateless requests and local configuration state; and initiating, by the remote memory controller, remote memory access transactions in response to an application remote memory request.

6. The computer-implemented method of claim 5, further comprising: providing at least three hardwired data processing elements; and interconnecting the least three hardwired data processing elements with a work queue, a completion queue, and the on-die network interface.

7. The computer-implemented method of claim 5, further comprising: enabling, via a queue pair, an application to schedule remote memory operations; and receiving completion notifications of a completion of the remote memory operations.

8. The computer-implemented method of claim 7, wherein the queue pair comprises a work queue and a completion queue.

9. The computer-implemented method of claim 5, further comprising servicing, at the remote memory controller, remote memory access originating at a local node and requests originating from the remote node.

10. The computer-implemented method of claim 8, wherein the queue pair comprises a work queue and a completion queue, the computer-implemented method further comprising polling, by the remote memory controller, the work queue to detect the application remote memory requests.

11. A computer system forming a node of a multi-node distributed system that is not constrained to be globally cache-coherent, the computer system comprising: a processor that executes memory operations; a local cache that stores data against which the processor executes local memory operations of the memory operations; a remote memory controller, coupled to at least a part of the local cache and to the processor, against which the processor executes remote memory operations of the memory operations, wherein the remote memory controller interacts with the processor using locally cache-coherent interactions and wherein the remote memory controller comprises: (a) a context identifier to create a global address space that spans multiple nodes of the multi-node distributed system; (b) a context segment, the context segment being a range of the node's local address space that is globally accessible to other nodes of the multi-node distributed system; (c) a queue pair being usable by an application to schedule remote memory operations and to receive notification of completion of the memory operations; and (d) a local buffer being usable as a source for, or a destination of, the remote memory operations; and an interface between the remote memory controller and a network interface wherein the remote memory controller issues stateless requests from the remote memory controller to a remote node of the multi-node distributed system via the network interface and receives stateless replies from the remote node via the network interface, the stateless requests and the stateless replies being stateless in that the remote node can process the stateless requests from data provided with the stateless requests and local configuration state.

12. The computer system of claim 11, wherein the remote memory controller further comprises: a first interface, wherein the first interface is a coherent memory interface to a private L1 cache; and a second interface, wherein the second interface is the network interface and is an interface to an on-die router.

13. The computer system of claim 11, wherein the remote memory controller further comprises at least three hardwired data processing elements, wherein the at least three hardwired data processing elements comprise: a first data processing element configured to control request generation; a second data processing element configured to control remote request processing; and a third data processing element configured to control request completion.

14. The computer system of claim 13, wherein each of the at least three hardwired data processing elements is operably interconnected to distinct queues of the network interface.

15. The computer system of claim 13, wherein memory requests of the at least three data processing elements are configured to access a cache via a memory management unit.

16. The computer system of claim 13, wherein the remote memory controller is further configured to: unroll multi-line requests in hardware; and generate a sequence of line-sized read or write transactions.

17. The computer system of claim 13, wherein the remote memory controller and a corresponding private L1 cache are fully integrated into a coherence domain of the node.
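The queue-pair and unrolling language in claims 7 to 10 and 16 can be pictured with a small software sketch. This is only an illustration under stated assumptions: the claims describe hardware, the 64-byte line size is assumed (no value is given in the claims), and the names `QueuePair`, `Controller`, `unroll`, and the work-queue entry layout are hypothetical. The sketch shows an application posting a remote read to a work queue, a controller polling that queue, splitting the multi-line request into pieces that never cross a line boundary, and posting a completion entry.

```python
from collections import deque

LINE = 64  # assumed cache-line size in bytes; the claims do not fix a value


def unroll(offset, length, line=LINE):
    """Split [offset, offset + length) into pieces that never cross a line boundary."""
    pieces = []
    end = offset + length
    while offset < end:
        step = min(line - offset % line, end - offset)
        pieces.append((offset, step))
        offset += step
    return pieces


class QueuePair:
    """Work queue (application -> controller) and completion queue (controller -> application)."""

    def __init__(self):
        self.wq = deque()
        self.cq = deque()


class Controller:
    def __init__(self, qp, remote_memory):
        self.qp = qp
        self.remote = remote_memory  # bytearray standing in for a remote context segment
        self.issued = []             # log of the line-sized transactions generated

    def poll(self):
        """Drain the work queue; return the number of entries completed.
        Assumes each request lies within the bounds of the remote segment."""
        done = 0
        while self.qp.wq:
            wq_id, offset, length, buf = self.qp.wq.popleft()
            pos = 0
            for off, n in unroll(offset, length):
                buf[pos:pos + n] = self.remote[off:off + n]  # one line-sized read
                self.issued.append((off, n))
                pos += n
            self.qp.cq.append(wq_id)  # completion notification for the application
            done += 1
        return done


remote = bytearray(range(256))
qp = QueuePair()
local_buf = bytearray(100)            # local buffer used as the read destination
qp.wq.append((1, 40, 100, local_buf))  # hypothetical entry: (id, offset, length, buffer)
ctrl = Controller(qp, remote)
ctrl.poll()
print(ctrl.issued, list(qp.cq))
```

In this sketch a 100-byte read starting at offset 40 becomes three transactions, because the first and last pieces are trimmed to the surrounding line boundaries.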

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US9734063B2 cover?
A computing system that uses a Scale-Out NUMA (“soNUMA”) architecture, programming model, and/or communication protocol provides for low-latency, distributed in-memory processing. Using soNUMA, a programming model is layered directly on top of a NUMA memory fabric via a stateless messaging protocol. To facilitate interactions between the application, OS, and the fabric, soNUMA uses a remote mem…
Who is the assignee on this patent?
École Polytechnique Fédérale de Lausanne (EPFL)
What technology area does this patent fall under?
Primary CPC classification G06F12/0813. Mapped technology areas include Physics.
When was this patent published?
Publication date Aug 15, 2017 (B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 1 related publication on this page (citations in our corpus or others sharing the same primary CPC).